Word_2_vector.. (aka word embeddings)

Word2vec:

  • word2vec is a way to take a large body of text and convert it into a matrix with one row per
    word (each row is that word's vector).
  • It is a shallow neural network (2 layers).
  • Two options/training methods: CBOW and skip-gram.

CBOW (continuous bag-of-words assumption)

  • a text is represented as the bag (multiset) of its words
  • disregards grammar
  • disregards word order but keeps multiplicity
  • the bag-of-words model is also used in computer vision
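
A minimal sketch of the bag-of-words idea above, assuming plain whitespace tokenization (the function name and the example sentences are made up for illustration). Two sentences with the same words in a different order end up with the same bag:

```python
from collections import Counter

def bag_of_words(text):
    """Return the multiset (word -> count) of lowercase, whitespace-separated tokens."""
    return Counter(text.lower().split())

a = bag_of_words("the dog bit the man")
b = bag_of_words("the man bit the dog")

print(a)       # counts per word; "the" appears twice
print(a == b)  # word order is discarded, multiplicity is kept
```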

skip-gram: a generalization of n-grams

An n-gram model is basically a Markov chain model of order n-1:
* Used in protein sequencing, DNA sequencing and computational
linguistics (character and word level)
* Models sequences using the statistical properties of n-grams
* Predicts x_i based on x_(i-(n-1)), ..., x_(i-1).
* In language modeling, independence assumptions are made so that each
word depends only on the previous n-1 words (or characters, in the case of
character-level modeling)
* The probability of a word conditional on the previous n-1 words follows a
categorical distribution
* In practice, the probability distributions are smoothed by assigning non-zero probabilities to unseen words or n-grams.
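
Here is a rough sketch of the simplest case (n = 2, a bigram model) with add-k smoothing, so that unseen bigrams still get a non-zero probability. The function names, the toy sentence and the choice of add-k (rather than any other smoothing scheme) are assumptions for illustration only:

```python
from collections import Counter

def train_bigram(tokens):
    """Count how often each word is used as a context, and each (prev, word) bigram."""
    contexts = Counter(tokens[:-1])              # every token that has a successor
    bigrams = Counter(zip(tokens, tokens[1:]))
    vocab = set(tokens)
    return contexts, bigrams, vocab

def bigram_prob(prev, word, contexts, bigrams, vocab, k=1.0):
    """P(word | prev) with add-k smoothing (pseudocount k for every possible next word)."""
    return (bigrams[(prev, word)] + k) / (contexts[prev] + k * len(vocab))

tokens = "the cat sat on the mat".split()
contexts, bigrams, vocab = train_bigram(tokens)

p_seen = bigram_prob("the", "cat", contexts, bigrams, vocab)    # bigram seen once
p_unseen = bigram_prob("the", "sat", contexts, bigrams, vocab)  # never seen, still > 0
total = sum(bigram_prob("the", w, contexts, bigrams, vocab) for w in vocab)
print(p_seen, p_unseen, total)
```

With k = 0 this falls back to the plain maximum-likelihood estimate, where the unseen bigram gets probability 0.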

Bias-vs-Variance Tradeoff:

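  • Choosing the right ‘n’ for a model depends on the bias vs. variance tradeoff we’re willing to make: a larger n fits the training text more closely (lower bias) but needs far more data to estimate reliably (higher variance).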

Smoothing Techniques:

  • There are problems of balancing weight between infrequent and frequent n-grams.
  • Without smoothing, unseen n-grams get probability 0.0 by default.
  • Use pseudocounts for unseen n-grams (generally motivated by
    Bayesian reasoning on the sub-n-grams, i.e. those of order less than n).

  • Skip-grams also allow the possibility of skipping words. So the 1-skip bigrams (2-grams) of a text are the ordinary bigrams plus the pairs formed by skipping one word, e.g. the first and third words of a three-word sequence.

  • Could be useful for languages with a less strict subject-verb-object order than English.
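
A small sketch of k-skip bigrams as described above: every ordered pair of words with at most k words skipped between them. The function name and example sentence are assumptions, and only the bigram (n = 2) case is covered:

```python
def skip_bigrams(tokens, k):
    """All pairs (tokens[i], tokens[j]) with i < j and at most k tokens skipped between them."""
    pairs = []
    for i in range(len(tokens)):
        # j may be up to k + 1 positions ahead of i
        for j in range(i + 1, min(i + k + 2, len(tokens))):
            pairs.append((tokens[i], tokens[j]))
    return pairs

tokens = "the rain in Spain".split()
plain = skip_bigrams(tokens, 0)     # k = 0 gives the ordinary bigrams
one_skip = skip_bigrams(tokens, 1)  # adds ("the", "in") and ("rain", "Spain")
print(plain)
print(one_skip)
```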

Alternative link

  • Depends on the distributional hypothesis (words that appear in similar contexts tend to have similar meanings)
  • Vector representations of words are called “word embeddings”
  • The basic motivation is that, unlike the audio and visual domains, text processing traditionally treats
    words as discrete symbols, which leads to sparse data. A vector-based representation
    works around these issues.
  • Also called vector space models
  • Two ways of training: (a) the CBOW (Continuous Bag-of-Words) model predicts a target word given
    a group of context words; (b) skip-gram is the reverse, i.e. it predicts the group of context words from a given word.
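
To make the CBOW vs skip-gram difference concrete, here is a sketch that builds the training examples each method would see from the same text. The window size, the function name and the sentence are assumptions; real implementations add subsampling and other details not shown here:

```python
def training_pairs(tokens, window=1):
    """Return (cbow_pairs, skipgram_pairs) for a symmetric context window."""
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        cbow.append((context, target))                 # context words -> target word
        skipgram.extend((target, c) for c in context)  # target word -> each context word
    return cbow, skipgram

cbow, skipgram = training_pairs("the quick brown fox".split(), window=1)
print(cbow)
print(skipgram)
```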

  • Trained using the maximum likelihood principle

  • Ideally, it maximizes the probability of the next word given the previous ‘h’ words, expressed as a softmax function
  • However, calculating the softmax requires computing and normalizing each probability using the scores of all the other words in the vocabulary at every training step.
  • Therefore a logistic regression (i.e. binary classification) objective function is used instead.
  • The way this is achieved is called negative sampling
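
The two objectives in the bullets above can be written out roughly as follows. The symbols are assumptions for illustration and are not from these notes: w_t is the target word, h the context (history), V the vocabulary, score(w, h) the model's compatibility score, Q_theta(D = 1 | w, h) the logistic-regression probability that w is a real (not noise) word for context h, and k the number of noise words drawn from a noise distribution P_noise.

```latex
% Full softmax: the denominator sums over every word in the vocabulary V
P(w_t \mid h) = \frac{\exp(\mathrm{score}(w_t, h))}{\sum_{w' \in V} \exp(\mathrm{score}(w', h))}

% Negative sampling: binary classification of one real word against k sampled noise words
J_{\mathrm{NEG}} = \log Q_\theta(D = 1 \mid w_t, h)
  + \sum_{i=1}^{k} \log Q_\theta(D = 0 \mid \tilde{w}_i, h),
  \qquad \tilde{w}_i \sim P_{\mathrm{noise}}
```

The point of the second form is that each training step touches only k + 1 words instead of all of V.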

Squeeze theorem

To quote from “The Girl Next Door”:
The first lesson of politics is “Always know if the juice is worth the squeeze”. Now, I was trying to finally make a genuine effort at understanding the Central Limit Theorem. Throughout my life (30 years), I have always been suspicious whenever statistics goes beyond the mean, median, mode, SD and variance (i.e. whenever any stat goes beyond the first and second moments). Part of it is because I never really learnt, or rather never paid enough attention to convince myself of, the theorems involved in reasoning with distributions. Anyway, I figured the Central Limit Theorem would be a good place to start and, in the spirit of learning by teaching, am summarizing what I’ve learnt so far.

It started off when I came across this post on HN and, going through the comments and critique, realized the demo is more of a special case. While I did get that specific example (and am sure of what the CLT says), I am still unsure of why the Central Limit Theorem is true or how one formulates it in math terms. It is important for me to understand those, if I am ever to be able to question someone claiming some implication of the CLT. Anyway, I came across the squeeze theorem in one of the HN comments, and since it seems to be part of the proof of the CLT, I ended up reading about it, and here’s the result of that.

Anyway, enough story. Let’s go onwards. So here goes, straight from the Wikipedia page:

There are three functions f, g, h defined over an interval I.
a is a limit point of I.
f, g, h need not be defined at a itself, since it is only the limit point.

For every x in I other than a:
$g(x) \leq f(x) \leq h(x)$

and
$\lim_{x \to a} g(x) = \lim_{x \to a} h(x) = L$

To be proved:
$\lim_{x \to a} f(x) = L$
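
A quick numeric illustration of the statement above (not a proof). The functions are my own pick of a standard example and are not from the Wikipedia quote: f(x) = x^2 sin(1/x) is squeezed between g(x) = -x^2 and h(x) = x^2 near a = 0, where f itself is undefined, and L = 0.

```python
import math

def g(x):
    return -x * x

def f(x):
    return x * x * math.sin(1.0 / x)  # not defined at x = 0

def h(x):
    return x * x

xs = [10.0 ** -p for p in range(1, 7)]  # 0.1, 0.01, ..., 1e-06, approaching a = 0
rows = [(x, g(x), f(x), h(x)) for x in xs]
for x, lower, middle, upper in rows:
    print(f"x={x:g}  g={lower:.3e}  f={middle:.3e}  h={upper:.3e}")
```

As x gets closer to 0 both bounds shrink to 0, and f has nowhere else to go.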



I’ll try and clarify what a limit is as mathematically defined, hopefully in words only, without equations.
Well, according to the Wikipedia page, the limit of a function f(x) being L means that f(x) can be made as close to L as desired
by making x sufficiently close to c.

Or, to write out the equation:
$\lim_{x \to c} f(x) = L$

The “when is cheryl’s birthday?” problem — ipython solving steps

Cool, step-by-step solving of the recently viral “When is Cheryl’s birthday?” problem.
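
For the curious, here is a minimal sketch of how such a step-by-step solution can look. The ten candidate dates and the three statements are from the puzzle as it was widely circulated, not from this post, and the function names are my own; this is not necessarily how the linked notebook does it.

```python
DATES = ["May 15", "May 16", "May 19", "June 17", "June 18",
         "July 14", "July 16", "August 14", "August 15", "August 17"]

def month(date):
    return date.split()[0]

def day(date):
    return date.split()[1]

def told(part):
    """Dates still possible for someone who was told only `part` (a month or a day)."""
    return [d for d in DATES if part in (month(d), day(d))]

def knows(possible):
    return len(possible) == 1

def statement3(date):
    # Albert (told the month): I don't know, and I know Bernard doesn't know either.
    possible = told(month(date))
    return not knows(possible) and all(not knows(told(day(d))) for d in possible)

def statement4(date):
    # Bernard (told the day): at first I didn't know, but now I do.
    at_first = told(day(date))
    return not knows(at_first) and knows([d for d in at_first if statement3(d)])

def statement5(date):
    # Albert: then I also know.
    return knows([d for d in told(month(date)) if statement4(d)])

solutions = [d for d in DATES if statement3(d) and statement4(d) and statement5(d)]
print(solutions)
```

Each statement is just a filter over the candidate dates, applied from the point of view of the person speaking.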

from The “when is cheryl’s birthday?” problem — ipython solving steps

from Tumblr http://ift.tt/1TTBDwS


Stack Ranking

Why stack-ranking is always a case of ‘the house always wins’:


1. It’s partly a defendant’s argument, and I am biased towards my client (i.e. the employee).

2. I’ve little experience managing a group of people and don’t claim to know all the challenges involved.

3. My research/reading has been restricted to supporting ideas/theories/assumptions only, not the thorough, unbiased, covering-all-other-bases kind of literature survey. (**wink** stack ranking vs performance/vitality curve distinction)

At first look it looks like a wonderful meritocratic setup. It uses relative comparison with peers (not unlike the PageRank algorithm / eigenmorality). On the face of it, it is a brilliant idea, or at least a good idea that works well when measuring quantities that haven’t been quantified enough before. In fact, if I were trying to do science on measuring performance, it’s a reasonably sane and tried approach. However, there are problems with using it.
I understand why it makes it easier to make decisions (especially in big organizations): you get a single number that’s guaranteed to fall within some expected values (in the probability theory sense), and forcing a curve simply makes it easier to fit a fixed amount for bonuses and incentives. However, here’s the challenge: how do you know your employees’/managers’/directors’ performance falls into a bell curve*? Maybe your company’s hiring practices always get bad-performing employees, or average-performing employees, or high-performing employees (all three in comparison to the general population). In which case, aren’t you alienating a high-performing employee because his peers did better (perhaps in revenue)? The catch is that revenue has more factors influencing it than the performance of your employees.

Here’s a quote from here. :

“You have to have an objective when you do stuff like this. At GE there was only one objective, and that was to force honesty. That’s all it ever was—to force an honest discussion between your manager and you. And there’s nothing that quite forces that more than employees knowing that they expect to know how that manager ranks them, and then asking that manager, ‘Tell me where I rank and tell me why.’”

See anything wrong in that argument? Try replacing ‘honesty’ with ‘dishonesty’ and the argument is still logically consistent and sounds right. Guess why: because there’s an underlying assumption that stack ranking raises honesty (or honest communication). While I agree it’s a good way to force managers to give feedback (especially negative) to their employees, I am not convinced it’s good or that it encourages honesty. I get that people (and managers) are likely to avoid giving negative feedback, and that they are also subject to confirmation bias, all of which can create bloated, inefficient departments/teams. Here’s the catch: when you force something like this, you’re eventually pushing the lowest ranks onto people who are bad negotiators (with their managers) and therefore don’t push back when given negative feedback. Over half a decade or so, you get a whole company of employees who are all very good negotiators (with no correlation, positive or negative, to performance).
In the end, that defense sounds way too much like someone who’s a reformer and is stuck in the values/virtues node, aka the holy priest (I know, I’ve been guilty of it so often, and probably am right now). Enough of debate-level arguments; here’s an attempt at discussing why it becomes something bad.
In theory, it can encourage managers to be honest and give negative feedback to their hires/employees, but in practice it comes down to compromises/favours/future promises traded between the employee and the manager. You’re forcing the manager to make a compromise/favour/future promise to one employee to pay the other. Even then, if it is still one number and some subjective reasoning between manager and employee, it has some hope of being a measure**. Now, that post doesn’t make it clear why it’s a bad idea to use a normal curve for measuring performance, but that is a basic necessity before we can talk about using/finalizing measures of a hitherto unquantified phenomenon. For that, we need to understand where this vitality curve concept comes from.
Here goes the Google Scholar search result, showing up nothing. I’ve been trying to find what research went into the whole stack ranking idea. A Google search shows up the vitality curve. OK, where could Jack Welch have picked up this insane idea of a vitality curve? The closest I can find is the

Central Limit Theorem in Statistics.
The basic premise of this theorem is that if we take enough independent samples of a random variable of unknown distribution (with finite variance), the averages of those samples will approximately follow a normal distribution.

This is not the strongest form of the theorem, but it is the basic one the rest are based on.

Now let’s look at what this means. When you’re examining a measurable quantity whose distribution is unknown, you can take samples (enough times, and of enough size) and average each one; the averages will then be approximately normally distributed.
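
Here is a small simulation sketch of that claim, under assumptions of my own choosing: the "unknown" distribution is a skewed Exponential(1) (mean 1, standard deviation 1), each sample has n = 50 draws, and we repeat 2000 times with a fixed seed. It only demonstrates the behaviour; it is not a proof.

```python
import random
import statistics

def sample_means(n, trials, rng):
    """Means of `trials` independent samples, each made of `n` Exponential(1) draws."""
    return [statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
            for _ in range(trials)]

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 50
means = sample_means(n, 2000, rng)

center = statistics.fmean(means)  # should be near the true mean, 1.0
spread = statistics.stdev(means)  # should be near 1 / sqrt(n), about 0.141
within_one_sd = sum(abs(m - center) <= spread for m in means) / len(means)

print(center, spread, within_one_sd)  # a normal curve has about 68% within one SD
```

The individual draws are heavily skewed, but the averages cluster symmetrically around 1 with a spread that shrinks as n grows.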
Why/How is this useful?

Well, it becomes useful when you want to compare two random variables and see if they are correlated or share common causal factors.

Especially when you have figured out ways to manipulate/control one of the variables, we can simply design experiments that measure both of these variables, plot the difference of their (sample) averages and see how much it varies from the standard normal curve. This can tell us whether they are positively or negatively correlated or simply unrelated. This is how experimental sciences work. Of course, it’s not perfect, but it’s the best we have.**

On top of all this, it breaks down at a critical assumption: that the samples are IID (independent and identically distributed)***

Now, let’s get back to the original topic. If your organization/manager is implementing stack ranking and they refer to the central limit theorem (you’re in luck; I haven’t heard any manager relate the two, or even name either of them), you can question where their idea of normality comes from. There’ll be cases where your manager will tell you your performance was average/below-average/above-average with respect to the rest of the team/organization. You get to question how they arrived at the normal curve’s values (the most likely answer would be the past year’s performance).

But here’s the catch: if they understand the experimentation process, the challenge then is to prove/question whether the current curve has seen enough samples. I don’t think it’s possible in most organizations/most roles. Of course, in very well established industries, with very specifically defined roles, it makes sense and is possible, but I’m not sure it applies well in the modern business environment.

Now, the bigger your organization, the more likely your performance is rated along different aspects/vectors/areas, which essentially multiplies the number of variables and actually complicates the problem (requiring more samples to normalize).

What are the basic premises of the “Central Limit Theorem”?
Well, for one, that the samples you average are independent draws from the same distribution of a random variable (the IID assumption above).

* - A quick read based on the blog here suggests that not all companies use the standard normal distribution, but rather normal/Gaussian distributions with different spreads. Facebook seems to have a narrower spread than Amazon (which makes me think of the differences in corporate culture and what this model entails for it, but that’s more thinking and perhaps another blog post, about Nash equilibrium, competition vs co-operation. Hunch/guess: more competition than co-operation at Facebook and vice versa at Amazon). It’s not clear what Google uses.

** - Scientists, don’t get angry with this. I know there are more nuances that go into statistical inference, but I think this is the core value/process, and it can be explained simply. Besides, I am not a real scientist, just a guy who left academia.

*** - Making this assumption about a lot of the variables I’ve seen used in a performance review is rather comical (like this).

P.S: To put in a cynical quip (paraphrasing, I think, Douglas Adams): the universe is either mildly malevolent or neutral (i.e. definitely not benevolent); the modern workplace is definitely malevolent (either mildly or fatally).

from Stack Ranking

from Tumblr http://ift.tt/1TTC9Lj


HTTP protocol.. RFC study notes

Alright, I should have done this at least 2 years ago and was too much of an idiot not to; better late than never.

Study Notes — http protocol (RFC 7230 – 7235)*

RFC 7230 — Message syntax and Routing

Key parties:
1. HTTP server: the system that responds to HTTP requests with HTTP responses
2. User agent/HTTP client: the system that sends the HTTP requests

There are some intermediate parties in the communication between 1 and 2. (Because of how tcp/ip works).
Note: these are relevant because some of the keywords are related to them. (aka, this is where the http vs tcp/ip abstraction leaks)
1. proxy:
a message-forwarding agent selected by the client (via configurable rules),
commonly used to group an organization’s requests
2. gateway:
an intermediary that acts as the origin (HTTP) server for an outbound connection, but translates the requests and forwards them inbound to other servers.
3. tunnel:
A tunnel is a blind relay between 2 connections that passes on messages. It differs from a gateway by not translating the requests, just blindly passing them on. Generally used in situations like TLS-secured (https) communication through a firewall proxy

Cache (details in RFC 7234):
1. A local store of previous response messages
2. A response may or may not be cached based on:
a, whether the cacheable flag is set.
b, a set of constraints defined in RFC 7234

A message has at least these fields:
Version is of the form major.minor:

HTTP-version = HTTP-name "/" DIGIT "." DIGIT
HTTP-name = %x48.54.54.50 ; "HTTP", case-sensitive

The major version denotes the HTTP messaging syntax, while the minor version indicates the sender’s communication capabilities.
Hmm.. these two don’t seem well-defined so far in the RFC.
My guess is that the major version tells the server which protocol-specific syntax (i.e. http/https/ftp/etc.) is used for the request.
While the minor version is the version the client understands, so the response can be formatted in a compatible manner.
My guess about the major number is wrong; the RFC says:

The intention of HTTP’s versioning design is that the major number
will only be incremented if an incompatible message syntax is
introduced, and that the minor number will only be incremented when
changes made to the protocol have the effect of adding to the message
semantics or implying additional capabilities of the sender.
However, the minor version was not incremented for the changes
introduced between [RFC2068] and [RFC2616], and this revision has
specifically avoided any such changes to the protocol.
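
As a sanity check of the HTTP-version grammar quoted earlier, here is a small sketch of a parser for it. The function and constant names are my own; the only things taken from the RFC are the grammar itself (exactly one digit on each side of the dot) and the case-sensitivity of "HTTP".

```python
import re

# HTTP-version = HTTP-name "/" DIGIT "." DIGIT, where HTTP-name is the case-sensitive "HTTP"
HTTP_VERSION = re.compile(r"HTTP/([0-9])\.([0-9])")

def parse_http_version(text):
    """Return (major, minor) as ints, or None if text is not a valid HTTP-version."""
    match = HTTP_VERSION.fullmatch(text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

print(parse_http_version("HTTP/1.1"))   # major 1, minor 1
print(parse_http_version("http/1.1"))   # rejected: HTTP-name is case-sensitive
print(parse_http_version("HTTP/1.10"))  # rejected: the grammar allows a single DIGIT
```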

Uniform Resource Identifiers:
1. identifies resources
For the URI syntax, I’ll just quote from the links on the rfc.

URI-reference =
absolute-URI =
relative-part =
scheme =
authority =
uri-host =
port =
path-abempty =
segment =
query =
fragment =

absolute-path = 1*( "/" segment )
partial-URI = relative-part [ "?" query ]
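
To see how those grammar names map onto a concrete URI, here is a sketch using the standard library's urllib.parse.urlsplit. The example URL is made up, and urlsplit's attribute names (netloc, hostname) are Python's, not the RFC's: netloc corresponds to authority and hostname to uri-host.

```python
from urllib.parse import urlsplit

parts = urlsplit("http://www.example.com:8080/hello.txt?lang=en#top")

print(parts.scheme)    # scheme
print(parts.netloc)    # authority (uri-host plus port)
print(parts.hostname)  # uri-host
print(parts.port)      # port, as an int
print(parts.path)      # path (here an absolute-path with one segment)
print(parts.query)     # query, without the leading "?"
print(parts.fragment)  # fragment, without the leading "#"
```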

http URI Scheme:

* — Original RFC was 2616 http://ift.tt/1qWngNQ, but it was superseded by these.

from HTTP protocol.. RFC study notes

from Tumblr http://ift.tt/1TJ1z20


Pure math.. — Definition for Explain like I’m 5

Why Do We Pay Pure Mathematicians?

Brilliant writing, as Math with Bad Drawings always comes up with.

from Pure math.. — Definition for Explain like I’m 5

from Tumblr http://ift.tt/1TTC6PB


What I would change about python?

1. The semantics of the ‘or’ keyword. I know it’s supposed to make code readable as it currently exists (i.e. evaluate the left-side expression; if it is truthy, return it, otherwise evaluate and return the right-side expression, whatever its value). I’d rather have it return True or False instead. I think that’s more logical for a programmer, and perhaps that’s part of Python not being a purely functional language.

2. The distinction between expressions and statements.

3. Side effects: while it’s possible to write code that provides a functional interface, the interpreter does not guarantee the absence of side effects/assignments.
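
To illustrate point 1, here is what ‘or’ actually returns today, next to the behaviour I would prefer. strict_or is a hypothetical helper written only for this example; it is not something Python provides.

```python
# Current behaviour: 'or' returns one of its operands, not a bool.
first = 0 or "fallback"    # left side is falsy, so the right operand is returned
second = "left" or "right" # left side is truthy, so it is returned as-is
third = 0 or []            # both falsy: the right operand ([]) comes back, not False

def strict_or(a, b):
    """Hypothetical alternative: always return True or False."""
    return bool(a) or bool(b)

print(first, second, third)
print(strict_or(0, "fallback"), strict_or(0, []))
```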

from What I would change about python?

from Tumblr http://ift.tt/1TJ0M11


Why read fiction?

Why do I read fiction? Or what do I get out of reading fiction?
Vivek Haldar here talks about how he doesn’t read fiction because it does nothing for him, or rather means nothing to him.
It set me thinking, like a knot or a thorn in my brain. I read it a long time ago, and my first thought was that I am the opposite.
I prefer reading fiction. In the time since, I have held the question in my mind for some time and come up with the following possibilities:

0. Theory of mind: there’s some (scanty, debatable) evidence that reading fiction helps in understanding how other minds work.
Here’s the study
And I do have a tendency to retreat into reading fiction when I am upset/confused or trying to figure out the right decision to make (usually regarding people in my life).

1. I find it kinda clears my head and goads me into logical thinking*, i.e. once I am done reading the fiction through to completion.

2. It definitely affords a comfortable thing to do without being (nay, feeling) guilty of procrastination, since reading is supposedly always considered a good thing (socially).

3. It could also simply be my way of dealing with the modern world’s craziness. Much like VGR refers to here.

4. It helps as good practice for thought experiments and therefore makes it easier to consider alternative explanations**.

5. It definitely helps to clear out the emotional components from my decision-making/thinking. More specifically, on the alertness/arousal scale, it helps lower the arousal level, and therefore raises the alertness/arousal ratio. (One of my hypotheses is that rational thinking is directly proportional to the ratio of alertness to arousal levels.)

* - Might simply be wishful thinking on my part.
P.S: The above is a rather descriptive attempt. Some of the points may, and probably do, overlap with other points. The bullet point format is simply organized for communication, not for empirical hypothesis testing.

from Why read fiction?

from Tumblr http://ift.tt/1TTC6iz

