Word2vec (aka word embeddings)

Word2vec:

  • word2vec is a way to take a big set of text and convert it into a matrix with one word per
    row (each row is that word's vector).
  • It is a shallow neural network (2 layers).
  • There are two training methods:

CBOW (Continuous-Bag-Of-Words assumption)

  • — a text is represented as the bag (multiset) of its words
  • — disregards grammar
  • — disregards word order but keeps multiplicity (see the sketch below)
  • — also used in computer vision
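
To make the bag-of-words idea concrete, here is a minimal sketch (the toy sentences are my own, not from any source): two sentences with different word order map to the same bag.

<code>
from collections import Counter

# Toy sentences: different word order, same multiset of words
sentence_a = "the dog bit the man".split()
sentence_b = "the man bit the dog".split()

bag_a = Counter(sentence_a)
bag_b = Counter(sentence_b)

print(bag_a)           # Counter({'the': 2, 'dog': 1, 'bit': 1, 'man': 1})
print(bag_a == bag_b)  # True: order is lost, multiplicity is kept
</code>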

Skip-gram — a generalization of n-grams (which are basically (n-1)-order Markov chain models)
  • — It is an (n-1)-order Markov model
  • — Used in protein sequencing, DNA sequencing, and computational
    linguistics (character- and word-level)
  • — Models sequences using the statistical properties of n-grams
  • — Predicts x_i based on x_{i-(n-1)}, ..., x_{i-1} (see the counting sketch after this list)
  • — In language modeling, independence assumptions are made so that each
    word depends only on the previous n-1 words (or characters, in the case of
    character-level modeling)
  • — The probability of a word conditional on the previous n-1 words follows a
    categorical distribution
  • — In practice, the probability distributions are smoothed by assigning non-zero probabilities to unseen words or n-grams.
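
As a minimal counting sketch (the toy corpus and add-one pseudocounts are assumptions, not from the source), a bigram (n=2) model estimates the categorical distribution of x_i given x_{i-1} from counts:

<code>
from collections import defaultdict, Counter

# Toy corpus; a bigram model conditions each word on the single previous word
corpus = "the cat sat on the mat the cat ate".split()
n = 2

counts = defaultdict(Counter)
for i in range(n - 1, len(corpus)):
    context = tuple(corpus[i - (n - 1):i])
    counts[context][corpus[i]] += 1

vocab = set(corpus)

def probability(word, context, pseudocount=1.0):
    # Add-one (Laplace) smoothing: unseen n-grams get a non-zero probability
    total = sum(counts[context].values()) + pseudocount * len(vocab)
    return (counts[context][word] + pseudocount) / total

print(probability("sat", ("cat",)))   # seen bigram
print(probability("flew", ("cat",)))  # unseen bigram, still non-zero after smoothing
</code>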

Bias-vs-Variance Tradeoff:

  • — Finding the right 'n' for a model is based on the bias vs. variance tradeoff we're willing to make

Smoothing Techniques:

  • — Problem of balancing weight between infrequent n-grams.
  • — Unseen n-grams by default get probability 0.0 without smoothing.
  • — Use pseudocounts for unseen n-grams (generally motivated by
    Bayesian reasoning on the sub-n-grams, for n smaller than the original n).

  • — Skip-grams also allow the possibility of skipping. So a 1-skip bigram would create bigrams while also skipping the middle word of a three-word window (see the sketch after this list).

  • — This could be useful for languages with a less strict subject-verb-object order than English.
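
A minimal sketch (the toy sentence and helper function are my own) of how 1-skip bigrams add pairs that ordinary bigrams miss:

<code>
# k-skip bigrams: pair each word with the words up to k positions beyond its neighbour
def skip_bigrams(tokens, k):
    pairs = []
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + 2 + k, len(tokens))):
            pairs.append((word, tokens[j]))
    return pairs

tokens = "insurgents killed in ongoing fighting".split()
print(skip_bigrams(tokens, 0))  # plain bigrams: adjacent pairs only
print(skip_bigrams(tokens, 1))  # 1-skip bigrams also skip one intervening word
</code>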

Alternative link

  • Depends on the distributional hypothesis
  • Vector representations of words are called "word embeddings"
  • The basic motivation is that, compared to the audio and visual domains, the text domain treats
    words as discrete symbols, encoding them as sparse data. Vector-based representations
    work around these issues.
  • Also called vector space models
  • Two ways of training: (a) the CBOW (Continuous-Bag-Of-Words) model predicts a target word given
    a group of context words; (b) skip-gram is the reverse, i.e., it predicts the group of context words from a given word.

  • Trained using maximum likelihood

  • Ideally, maximizes the probability of the next word given the previous 'h' words, in terms of a softmax function
  • However, calculating the softmax values requires computing and normalizing each probability using the scores for all the other words in the vocabulary at every training step.
  • Therefore a logistic regression (binary classification) objective function is used.
  • The way this is achieved is called negative sampling (see the usage sketch below).
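
As a usage sketch (the toy corpus and parameter values are assumptions; parameter names follow gensim 4.x, where sg=1 selects skip-gram and negative sets the number of negative samples), training word vectors could look like this:

<code>
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
]

# sg=1 -> skip-gram (sg=0 would be CBOW); negative=5 -> negative sampling with 5 noise words
model = Word2Vec(sentences, vector_size=20, window=2, min_count=1,
                 sg=1, negative=5, epochs=50)

print(model.wv["cat"][:5])           # first few dimensions of the embedding for "cat"
print(model.wv.most_similar("cat"))  # nearest words by cosine similarity
</code>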

The Cauvery Water debate — opinions

This was inspired by the controversy around the Cauvery water dispute in mid-September 2016.

Before we begin, I'll set down my biases, priors and assumptions:

  1. I'm from TN, and have been living in Bangalore for about 10 years.
  2. I'm unaware of the actual rain levels, agricultural needs, ecological needs and other details.
  3. I'm not going to propose a verdict so much as a process/method to deal with the conflicts that doesn't depend on politicians or the Supreme Court.
  4. I've travelled to most parts of TN in my youth, and trekked to most parts of Karnataka (speaking broken Kannada) in the last 10 years (obviously not as much as TN), and have a grasp of the cultural/mental attitudes in general.

 

One reason I'm ruling out a political solution is that we live in a representational democracy. The incentives in that setup are for politicians to do what gets them the most votes from the biggest part of their demographics. Expecting them to talk to politicians from the other state and come to a compromise is hard because, on top of representational democracy, we have a multi-party system. This means there's scope for local parties to not care about the interests of the other state's parties and people. I've seen a few national parties taking contradictory stances based on which state's division they are making statements from. In addition to these incentives, this is a situation with prisoner's-dilemma-type dynamics (i.e., if one agent stops co-operating and defects, then the rest are better off doing the same). The only rewards for the politicians in this are media time and vote-bank support.

 

So what I do advocate is a mix of open data and predictive models, plus persuasion and a media (attention) frenzy that'll overtake any top-down media stuff the politicians can stir up. It won't work without both of them, but I have no idea what will and won't be successful on the latter front, so I will focus the majority of the post on the first.

I'm advocating open data access (water levels, population, catchment area, drought area, cultivation area, predicted loss of agricultural area, etc.) managed/maintained by a panel of experts, but open to all for debating and opining.

Major points (on the open data front):

  1. Make the data open and easily available. Here the data will be catchment areas, agricultural need estimates, actual rainfall, water table levels, water distribution wastage/efficiency, sand mining and its effects on water flow, and the economic impacts of the water shortage (bankruptcies, loss of revenue, loss of investment, etc.). (There are some platforms like this and this already in India.)*
  2. Create/use open data science platforms to let bloggers and volunteers modify the models (for the estimates) and make posts/predictions based on the given data, but with different models and parameters. (Some tools can be found here and here.)
  3. Try to present the models in a way they can be interacted with even by people without programming experience. (The notebook links I provided above need Python knowledge to edit, but anything built with this won't.)
  4. Add volunteers to cross-check some of the data, like sand mining, rainfall levels, etc.
  5. Publish/collaborate with reporters to inform/write stories around the issue, with the help of the models (something with at least the level of science journalism).

 

Some thoughts (on the media-based persuasion front):

  1. Recruit enough people interested in the exercise of figuring out details about the impact of the issue.
  2. Make sure you can reach the ones that are currently most likely to indulge in violence (I can only guess at the details, but a better-targeted marketing strategy is what we need).

 

O.k., enough of the idealistic stuff.

  1. Will this work? Well, the question is too broad. Will it work to bring out the truth? It can bring us closer to the truth than what we have, and more importantly it can define/establish a method/platform for us to get closer to data-driven debates and/or arguments.
  2. Will it cut the violence/bandhs/property damage, etc.? Well, that boils down to the activism or work done on the media and marketing front. Leaving that persuasion part to politicians, with incentives skewed towards gaining votes (from the steady voting population), is the problem now. So can we have alternative parties (say, business owners) trying to use persuasion tactics only to discourage violence? I don't know, but it seems likely that violence and martyrdom are preferred mostly by politicians and dons, but not by the rest (say media, local business owners, sheep-zens, etc.). So this move has a lower expected probability of violence.
  3. Who will pay for all this effort? Ah, a very pertinent question. The answer is: it's going to be hard to pay the costs of even maintaining an information system, not to mention the cost of collecting the data. That said, I think the big challenge is in the cost of collecting the data, and finding volunteers (something like this in the US) to collect it for free. As for hosting, building and maintaining an information system, I think a cheap way can be found.
  4. Is this likely to happen? Haha... no, not in the next half century or so.
  5. Is there a cheaper way? Not at the global/community/country level. At the level of individuals (media/politicians/public (aka you and me)), yes, but it's not really a cheaper way in the cost it inflicts. Maybe I'm just not creative enough; feel free to propose one, just be careful to include the costs to others around you now and others to come in the future (aka your children).
  6. Why will this work? Well, apart from the mythical way of saying "sunlight is the best disinfectant", I think this approach is basically an ambiguity-reduction approach, which translates to a breaking down of status illegibility. (One reason no politician is likely to support this idea.) Status illegibility is the foundation of socio-political machinations, and it applies to modern-day state politics. So this will raise the probability of something close to a non-violent solution.
  • — I haven't checked whether these datasets are already openly available, but I doubt they are, and even if they are, some of the data are estimates, and we would need the models that made the estimates to be public too.

 

UPDATE: A few weeks after this I looked up, on Google Maps, the path followed by the Cauvery from its origin to its end at the sea, and realized I've actually visited more of the places it flows through in Karnataka and a lot fewer in Tamil Nadu. But that doesn't change my stance/bias on the misuse/abuse of sand mining and of lake resources as housing projects in TN, as that's a broader, pervasive and pertinent issue.

 

UPDATE-1: A few months after writing this there was a public announcement, which, if you read it closely enough, is a typical persuasion-negotiation move, with a specific action (and a strong concession, right now) demanded from the opponent, in exchange for a vague, under-specified promise in the future. The fact that this whole thing was on the news is more support for my thesis that the incentives for politicians are skewed too much towards PR.

 

UPDATE-2: Some platforms for hosting data, models and code do exist, as listed below (although with a different focus):

  1. Kaggle
  2. Drivendata
  3. Crowdai

So the question of collecting, cleaning, verifying and updating data is left. Also, here's a Quora answer on the challenges of bootstrapping a data science team, which will be needed for this.

Dirichlet Distribution

Dirichlet Distribution (DD):
* — A symmetric DD can be considered a distribution over distributions
* — Each sample from a symmetric DD is a categorical distribution over K categories
* — Generates samples that are similar discrete distributions
* — It is parameterized by G_0, a base distribution over K categories, and \alpha, a scale
factor
<code>
import numpy as np
from scipy.stats import dirichlet

np.set_printoptions(precision=2)

def stats(scale_factor, G0=[.2, .2, .6], N=10000):
    # Draw N categorical distributions from a Dirichlet with parameters scale_factor * G0
    samples = dirichlet(alpha=scale_factor * np.array(G0)).rvs(N)
    print("alpha:", scale_factor)
    print("element-wise mean:", samples.mean(axis=0))
    print("element-wise standard deviation:", samples.std(axis=0))
    print()

for scale in [0.1, 1, 10, 100, 1000]:
    stats(scale)
</code>

Dirichlet Process(DP)

  • — A way to generalize the Dirichlet distribution
  • — Generates samples that are distributions similar to the parameter H_0
  • — Also has a parameter \alpha that determines how much the samples
    vary from H_0

  • — A sample H from DP(\alpha, H_0) is constructed by drawing a countably
    infinite number of samples \theta_k and then setting
    H = \sum_{k=1}^{\infty} \pi_k \delta(x - \theta_k)

where \pi_k are carefully chosen weights that sum to 1, and
\delta is the Dirac delta function.
  • — Since the samples from a DP are similar to the parameter H_0, one way to
    test whether a DP is generating your dataset is to check if the distributions (of
    different attributes/dimensions) you get are similar to each other. Something like a
    permutation test, but understand the assumptions and caveats.

  • — The code for Dirichlet process sampling (a truncated stick-breaking approximation) can be written as:

<code>
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta, norm

def dirichlet_sample_approximation(base_measure, alpha, tol=0.01):
    # Stick-breaking: keep breaking off Beta(1, alpha) fractions of the remaining stick
    # until the collected weights sum to within tol of 1.
    betas = []
    pis = []
    betas.append(beta(1, alpha).rvs())
    pis.append(betas[0])
    while sum(pis) < (1. - tol):
        s = np.sum([np.log(1 - b) for b in betas])
        new_beta = beta(1, alpha).rvs()
        betas.append(new_beta)
        pis.append(new_beta * np.exp(s))
    pis = np.array(pis)
    thetas = np.array([base_measure() for _ in pis])
    return pis, thetas

def plot_normal_dp_approximation(alpha):
    plt.figure()
    plt.title("Dirichlet Process Sample with N(0,1) Base Measure")
    plt.suptitle("alpha: %s" % alpha)
    pis, thetas = dirichlet_sample_approximation(lambda: norm().rvs(), alpha)
    pis = pis * (norm.pdf(0) / pis.max())  # rescale weights so they are visible against the base density
    plt.vlines(thetas, 0, pis)
    X = np.linspace(-4, 4, 100)
    plt.plot(X, norm.pdf(X))

plot_normal_dp_approximation(.1)
plot_normal_dp_approximation(1)
plot_normal_dp_approximation(10)
plot_normal_dp_approximation(1000)
</code>

  • — The code for a Dirichlet process sampler (lazy stick-breaking) can be written as:

<code>
from numpy.random import choice
from scipy.stats import beta

class DirichletProcessSample():
    def __init__(self, base_measure, alpha):
        self.base_measure = base_measure
        self.alpha = alpha

        self.cache = []
        self.weights = []
        self.total_stick_used = 0.

    def __call__(self):
        remaining = 1.0 - self.total_stick_used
        i = DirichletProcessSample.roll_die(self.weights + [remaining])
        if i is not None and i < len(self.weights):
            # Reuse an existing atom with probability proportional to its weight
            return self.cache[i]
        else:
            # Break a new piece off the remaining stick and draw a new atom from the base measure
            stick_piece = beta(1, self.alpha).rvs() * remaining
            self.total_stick_used += stick_piece
            self.weights.append(stick_piece)
            new_value = self.base_measure()
            self.cache.append(new_value)
            return new_value

    @staticmethod
    def roll_die(weights):
        if weights:
            return choice(range(len(weights)), p=weights)
        else:
            return None
</code>
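
A short usage sketch (the standard normal base measure and alpha=10 are arbitrary choices here): each call returns an atom, and atoms repeat because a DP sample is a discrete distribution.

<code>
from collections import Counter
from scipy.stats import norm

norm_dp = DirichletProcessSample(base_measure=lambda: norm().rvs(), alpha=10)

draws = [float(norm_dp()) for _ in range(1000)]
print("distinct atoms among 1000 draws:", len(set(draws)))
print("most common atoms:", Counter(draws).most_common(3))
</code>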

Statistics — Tests of independence

Tests of independence:

The basic principle is the same as the ${\chi}^2$ goodness-of-fit test.
* Between categorical variables

${\chi}^2$ tests:

The standard approach is to compute expected counts and find the
distribution of the sum of squared differences between the expected counts and the observed
counts (normalized).
* Between numerical variables

${\chi}^2$ test:

  • Between a categorical and a numerical variable?

Null Hypothesis:

  • The two variables are independent.
  • It is always a right-tail test.
  • The test statistic has a ${\chi}^2$ distribution, if the assumptions are met:
    • the data are obtained from a random sample
    • the expected frequency of each category is at least 5
Properties of the test:
  • The data are the observed frequencies.
  • The data are arranged in a contingency table.
  • The degrees of freedom are the degrees of freedom for the row variable times the degrees of freedom for the column variable. It is not one less than the sample size; it is the product of the two degrees of freedom.
  • It is always a right-tail test.
  • It has a chi-square distribution.
  • The expected value is computed by taking the row total times the column total and dividing by the grand total (see the sketch after this list).
  • The value of the test statistic doesn't change if the order of the rows or columns is switched.
  • The value of the test statistic doesn't change if the rows and columns are interchanged (transpose of the table).
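
A minimal sketch (the contingency table below is hypothetical) using scipy's chi2_contingency, which computes the expected counts, the statistic and the degrees of freedom described above:

<code>
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x3 contingency table of observed frequencies
observed = np.array([[20, 30, 25],
                     [30, 20, 35]])

chi2, p_value, dof, expected = chi2_contingency(observed)
print("chi-square statistic:", chi2)
print("p-value (right tail):", p_value)
print("degrees of freedom:", dof)      # (rows - 1) * (columns - 1) = 2
print("expected counts:\n", expected)  # row total * column total / grand total
</code>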

The mystery of short term past performance versus future equity fund returns

The Eighty Twenty Investor

In our earlier posts, here and here, we found to our dismay that our natural inclination to choose the top mutual fund performers of the past 1 & 3 years hasn't worked too well.

That leaves us with the obvious question..

What actually goes wrong when we pick the top funds of the past few years?

 The rotating sector winners..

Below is a representation of the best performing sectors year over year. What do you notice?

[Figure: sector-wise calendar-year performance]

The sector performance over each and every year varies significantly and the top and bottom sectors keep changing dramatically almost every year.

Sample this:

  • 2007 – Metals was the top performer with a whopping 121% annual return
  • 2008 – Metals was the bottom performer with negative 74% returns & FMCG was the top performer (-21%)
  • 2009 – The tables turned! FMCG was the bottom performer (47%) while Metals was the…


Share: Harry Potter and the Methods of Rationality

“Is there some amazing rational thing you do when your mind’s running in all different directions?” she managed.
“My own approach is usually to identify the different desires, give them names, conceive of them as separate individuals, and let them argue it out inside my head. So far the main persistent ones are my Hufflepuff, Ravenclaw, Gryffindor, and Slytherin sides, my Inner Critic, and my simulated copies of you, Neville, Draco, Professor McGonagall, Professor Flitwick, Professor Quirrell, Dad, Mum, Richard Feynman, and Douglas Hofstadter.”
Hermione considered trying this before her Common Sense warned that it might be a dangerous sort of thing to pretend. “There’s a copy of me inside your head?”
“Of course there is!” Harry said. The boy suddenly looked a bit more vulnerable. “You mean there isn’t a copy of me living in your head?”
There was, she realized; and not only that, it talked in Harry’s exact voice.
“It’s rather unnerving now that I think about it,” said Hermione. “I do have a copy of you living in my head. It’s talking to me right now using your voice, arguing how this is perfectly normal.”
“Good,” Harry said seriously. “I mean, I don’t see how people could be friends without that.”
She continued reading her book, then, Harry seeming content to watch the pages over her shoulder.
She’d gotten all the way to number seventy, Katherine Scott, who’d apparently invented a way to turn small animals into lemon tarts, when she finally worked up the courage to speak.

Regularization

Based on a small post found here.

One of the standard problems in ML with meta-modelling algorithms (algorithms that run multiple statistical models over the given data and identify the best-fitting model, e.g. random forest or the rarely practical genetic algorithm) is that they might favour overly complex models that overfit the given training data but perform poorly on live/test data.

The way these meta-modelling algorithms work is that they have an objective function (usually the RMS error of the statistical sub-model on the data) that they pick the model by (i.e., whichever model yields the lowest value of the objective function). So we can just add a complexity penalty (one obvious idea is the degree of the polynomial the model uses to fit, but how does that work for comparison with exponential functions?) and the objective function becomes RMS(error) + complexity_penalty(model), as sketched below.
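
Here is a minimal sketch of that idea (the synthetic data, the choice of polynomial degree as the complexity measure, and the penalty weight are all assumptions of mine): the selected model minimizes RMS error plus a complexity penalty rather than RMS error alone.

<code>
import numpy as np

# Synthetic data: a noisy sine wave
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

def rms_error(degree):
    # Fit a polynomial of the given degree and return its RMS error on the training data
    coefficients = np.polyfit(x, y, degree)
    predictions = np.polyval(coefficients, x)
    return np.sqrt(np.mean((y - predictions) ** 2))

penalty_weight = 0.02  # assumed value; controls how strongly complexity is punished
scores = {d: rms_error(d) + penalty_weight * d for d in range(1, 10)}
best_degree = min(scores, key=scores.get)
print("chosen degree:", best_degree)  # training RMS alone would always favour the highest degree
</code>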

 

Now, with the right choice of error function and complexity penalty, this can find models that may perform worse than more complex models on the training data, but perform better in the live scenario.

The idea of a complexity penalty itself is not new. I don't dare say ML borrowed it from scientific experimentation methods or some such, but the idea that a more complex theory or model should be penalized relative to a simpler theory or model is very old. Here's a better-written post on it.

Related Post: https://softwaremechanic.wordpress.com/2016/08/12/bayesians-vs-frequentistsaka-sampling-theorists/

 

Sleeper Theorems

This inspired me to compile a list:
Since I'm not a mathematician (pure/applied), I just compiled things from the blog post, combining them
with the comments:
* Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)

  • Jensen’s Inequality \psi(E(X)) <= E(\psi(X)) if \psi is a convex function and X is
    a random variable. Extends convexity from sums to integrals(aka discrete to continuous)
  • Itô's lemma (used in deriving the Merton and Black-Scholes option pricing formula)
  • Complex analysis... should I disqualify this as not a theorem?
  • Standard error of the mean; details link
  • Jordan Curve Theorem: A closed curve has an inside and an outside. (sounds obvious in 2D
    and 3D, perhaps with time as 4D, keeping options open is staying outside closed curves??)
  • Kullback-Leibler positivity (no clue; need to look up Wolfram Alpha or Wikipedia)
  • Hahn-Banach Theorem (again needs searching)
  • Pigeon-Hole principle link here
  • Taylor's theorem (once again, a continuous function approximated by a sum of discrete
    components/expressions). Used in:
  • Approximating any function with nth degree precision
  • Bounding the error term of an approximation
  • Decomposing functions into linear combinations of other functions
  • Kolmogorov's inequality, for the maximum absolute value of the partial sums of a sequence of IID random variables (the basis of martingale theory)
  • Karush-Kuhn-Tucker optimality conditions for nonlinear programming, link
    here
  • Envelope Theorem — from economics
  • Zorn’s lemma , also Axiom of Choice
  • Fourier Transform and Fast Fourier Transform