- word2vec is a way to take a big set of text and convert it into a matrix with a vector for each word
- It is a shallow neural network (2 layers)
- Two options/training methods: bag-of-words and skip-gram
- bag-of-words:
- — a text is represented as the bag (multiset) of its words
- — disregards grammar
- — disregards word order but keeps multiplicity
- — Also used in computer vision
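For illustration, a minimal bag-of-words sketch in Python (the helper name is mine, not from any library):

```python
from collections import Counter

def bag_of_words(text):
    # Lowercase and split on whitespace; a real tokenizer would also handle punctuation
    return Counter(text.lower().split())

bow = bag_of_words("the cat sat on the mat")
# Word order is discarded but multiplicity is kept: bow["the"] == 2
```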
skip-gram — it is a generalization of n-grams (which is basically a Markov chain model of order n-1)
* — It is an (n-1)-order Markov model
* — Used in protein sequencing, DNA sequencing, and computational linguistics (character- and word-level)
* — Models sequences using the statistical properties of n-grams
* — predicts the next item based on the previous n-1 items
* — in language modeling, independence assumptions are made so that each word depends only on the n-1 previous words (or characters, in the case of character-level modeling)
* — The probability of a word conditional on the previous n-1 words follows a categorical distribution
* — In practice, the probability distributions are smoothed by assigning non-zero probabilities to unseen words or n-grams.
- — Finding the right ‘n’ for a model is based on the bias-vs-variance tradeoff we’re willing to make
- — Problems of balancing weight between infrequent n-grams.
- — Unseen n-grams by default get probability 0.0 without smoothing.
— Use pseudocounts for unseen n-grams (generally motivated by Bayesian reasoning on the sub-n-grams, for n < original n)
— Skip-grams also allow the possibility of skipping. So a 1-skip bigram would create bigrams while also skipping the middle word in a three-word sequence.
- — Could be useful for languages with less strict subject-verb-object order than English.
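To make the n-gram, skip-gram and pseudocount ideas concrete, here is a minimal sketch in Python (helper names are mine, not from any library):

```python
from collections import Counter

def ngrams(tokens, n):
    # Every contiguous run of n tokens
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def skip_bigrams(tokens, k):
    # k-skip bigrams: the second word may be up to k positions further ahead
    return [(tokens[i], tokens[j])
            for i in range(len(tokens))
            for j in range(i + 1, min(i + 2 + k, len(tokens)))]

def smoothed_count(counts, gram, alpha=1):
    # Pseudocount (add-alpha) smoothing: unseen n-grams get alpha instead of 0
    return counts.get(gram, 0) + alpha

tokens = "the rain in Spain".split()
bigrams = ngrams(tokens, 2)      # [('the', 'rain'), ('rain', 'in'), ('in', 'Spain')]
skips = skip_bigrams(tokens, 1)  # the bigrams plus ('the', 'in') and ('rain', 'Spain')
counts = Counter(bigrams)
```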
- Depends on the Distributional Hypothesis
- Vector representations of words called “word embeddings”
- Basic motivation is that, compared to the audio and visual domains, the word/text domain treats words as discrete symbols, encoding them as sparse data. Vector-based representations work around these issues.
- Also called vector space models
Two ways of training: (a) the CBOW (Continuous-Bag-Of-Words) model predicts the target word given a group of context words; (b) skip-gram is the inverse, i.e., it predicts the context words from a given word.
Trained using the Maximum Likelihood method
- Ideally, maximizes the probability of the next word given the previous ‘h’ words in terms of a softmax function
- However, calculating the softmax values requires computing and normalizing each probability using the scores for all the other words in context at every step.
- Therefore a logistic regression, aka binary classification, objective function is used.
- The way this is achieved is called negative sampling
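A minimal sketch of how training pairs and negatives might be produced (hypothetical helpers; the real word2vec draws negatives from a unigram distribution raised to the 3/4 power):

```python
import random

def skipgram_pairs(tokens, window=2):
    # Skip-gram: each (target, context) pair within the window is a training example
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def negative_samples(vocab, context, k=2, rng=random):
    # Negative sampling: k words that are NOT the true context get label 0,
    # turning the expensive softmax into a cheap binary-classification objective
    negatives = []
    while len(negatives) < k:
        w = rng.choice(vocab)
        if w != context:
            negatives.append(w)
    return negatives

tokens = "the quick brown fox jumps".split()
pairs = skipgram_pairs(tokens, window=1)
# CBOW is the inverse: predict the target from its surrounding context words
```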
This was inspired by the controversy around the Cauvery water debate in mid-September 2016.
Before we begin, I’ll set down my biases and priors and assumptions:
- I’m from TN, and have been living in Bangalore for about 10 years.
- I’m unaware of the actual rainfall levels, agricultural needs, ecological needs and other factors.
- I’m not going to propose a verdict so much as a process/method to deal with such conflicts that doesn’t depend on politicians or the Supreme Court.
- I’ve travelled to most parts of TN in my youth, and trekked to most parts of Karnataka (speaking broken Kannada) in the last 10 years (obviously not as much as TN), and have a grasp of the cultural/mental attitudes in general.
One reason I’m ruling out a political solution is that we live in a representational democracy. The incentives in that setup push politicians to do whatever gets them the most votes from the biggest part of their demographics. Expecting them to talk to politicians from the other state and come to a compromise is hard because, on top of representational democracy, we have a multi-party system. This means there’s scope for local parties to not care about the interests of the other state’s parties and people. I’ve seen a few national parties take contradictory stances based on which state’s division they are making statements from. In addition to these incentives, this is a situation with prisoner’s-dilemma-type dynamics (i.e., if one agent stops co-operating and defects, then the rest are better off doing the same). The only rewards for the politicians in this are media time and vote-bank support.
So what I do advocate is a mix of open data and predictive models, plus persuasion and a media (attention) frenzy that will overtake anything like the top-down media campaigns the politicians can stir up. It won’t work without both, but I have no idea what will or won’t succeed on the persuasion front, so I will focus the majority of the post on the first.
I’m advocating open data access (water level, population, catchment area, drought area, cultivation area, predicted loss of agricultural area, etc.) managed/maintained by a panel of experts, but open to all for debating and opining.
Major points (on the open data front):
- Make the data open and easily available. Here the data will be catchment areas, agricultural need estimates, actual rainfall, water table levels, water distribution wastage/efficiency, sand mining and its effects on water flow, and the economic impacts of the water shortage (bankruptcies, loss of revenue, loss of investment, etc.). (There are some platforms like this and this already in India)*
- Create/use open data science platforms that let bloggers and volunteers modify the models (for estimates) and make blogs/predictions based on the given data, but with different models and parameters. (Some tools can be found here and here)
- Try to present the models in a way they can be interacted with even by people without programming experience. (The notebook links I provided above need Python knowledge to edit, but anything built with this won’t)
- Add volunteers to cross-check some of the data, like sand mining, rainfall levels, etc.
- Publish/collaborate with reporters to inform/write stories around the issue, with the help of models (something with at least the level of science journalism).
Some thoughts (on the media-based persuasion front):
- Recruit enough people interested in the exercise of figuring out details about the impact of the issue.
- Make sure you can reach the people currently most likely to indulge in violence (I can only guess at the details, but a better-targeted marketing strategy is what we need).
O.k.: Enough of the idealistic stuff.
- Will this work? Well, the question is too broad. Will it work to bring out the truth? It can bring us closer to the truth than what we have now. More importantly, it can define/establish a method/platform for us to get closer to data-driven debates and/or arguments.
- Will it cut the violence/bandhs/property damage, etc.? That boils down to the activism done on the media and marketing front. Leaving the persuasion part to politicians, whose incentives are skewed towards gaining votes (from the steady voting population), is the problem now. So can we have alternative parties (say, business owners) trying to use persuasion tactics only to discourage violence? I don’t know, but it seems likely that violence and martyrdom are preferred mostly by politicians and dons, not by the rest (say media, local business owners, sheep-zens, etc.). So this move has a lower expected probability of violence.
- Who will pay for all this effort? A very pertinent question. It’s going to be hard to pay the costs of even maintaining an information system, not to mention the cost of collecting the data. That said, I think the big challenge is the cost of collecting the data, and finding volunteers (something like this in the US) to collect it for free. As for hosting, building and maintaining an information system, I think a cheap way can be found.
- Is this likely to happen? Haha… no, not in the next half century or so.
- Is there a cheaper way? Not at the global/community/country level. At the level of individuals (media/politicians/public (aka you and me)), yes, but it’s not really cheaper in the cost it inflicts. Maybe I’m just not creative enough; feel free to propose one, just be careful to include the costs to others around you now and to others to come in the future (aka your children).
- Why will this work? Apart from the mythical saying “Sunlight is the best disinfectant”, I think this approach is basically an ambiguity-reduction approach, which translates to a breaking down of status illegibility. (One reason no politician is likely to support this idea.) Status illegibility is the foundation of socio-political machinations, and it applies to modern-day state politics. So this will raise the probability of something close to a non-violent solution.
- — I haven’t checked whether these datasets are already openly available, but I doubt they are, and even if they are, some of the data are estimates, and we would need the models that made those estimates to be public too.
UPDATE: A few weeks after this, I looked up on Google Maps the path followed by the Cauvery from its origin to its end at the sea, and realized I’ve actually visited more of the places it flows through in Karnataka and far fewer in Tamil Nadu. But that doesn’t change my stance/bias on the misuse/abuse of sand mining and of lake resources as housing projects in TN, as that’s a broader, pervasive and pertinent issue.
UPDATE-1: A few months after writing this, there was a public announcement which, if you read it closely enough, is a typical persuasion-negotiation move: a specific action (and strong concession, right now) demanded from the opponent, in exchange for a vague, under-specified promise in the future. The fact that this whole thing played out on the news is more support for my thesis that the incentives for politicians are skewed too much towards PR.
UPDATE-2: Some platforms for hosting data, models and code do exist (although with a different focus), so the question of collecting, cleaning, verifying and updating data remains. Also, here’s a Quora answer on the challenges of bootstrapping a data science team, which would be needed for this.
* — A symmetric DD can be considered a distribution over distributions
* — Each sample from a symmetric DD is a categorical distribution over K categories.
* — Generates samples that are similar discrete distributions
* — It is parameterized by G0, a distribution over K categories, and a scale factor
import numpy as np
from scipy.stats import dirichlet

def stats(scale_factor, G0=[.2, .2, .6], N=10000):
    samples = dirichlet(alpha=scale_factor * np.array(G0)).rvs(N)
    print("alpha:", scale_factor)
    print("element-wise mean:", samples.mean(axis=0))
    print("element-wise standard deviation:", samples.std(axis=0))

for scale in [0.1, 1, 10, 100, 1000]:
    stats(scale)
- — A way to generalize the Dirichlet Distribution
- — Generates samples that are distributions similar to the parameter H0 (the base measure)
- — Also has a scale parameter alpha that determines how much the samples vary from H0
- — a sample H of DP(alpha, H0) is constructed by drawing a countably infinite number of samples theta_k from H0 and then setting H = Σ_k pi_k · δ(theta_k)
where — pi_k are carefully chosen weights that sum to 1
— δ is the Dirac delta function
* — Since the samples from a DP are similar to the parameter, one way to test whether a DP is generating your dataset is to check if the distributions (of different attributes/dimensions) you get are similar to each other. Something like a permutation test, but understand the assumptions and caveats.
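That similarity check could be sketched roughly as below (total variation distance is my choice of similarity measure here, and the data is made up; a real test needs the caveats mentioned above):

```python
import numpy as np

def total_variation(p, q):
    # Distance between two discrete distributions: 0 means identical, 1 means disjoint
    return 0.5 * np.abs(np.asarray(p) - np.asarray(q)).sum()

# Hypothetical per-attribute empirical distributions over the same K = 3 categories
dists = [[.2, .3, .5], [.25, .25, .5], [.1, .4, .5]]
max_tv = max(total_variation(p, q) for p in dists for q in dists)
# If max_tv is small, the attributes look like draws from one shared Dirichlet
```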
- — The code for Dirichlet sampling (via stick-breaking) can be written as:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta, norm

def dirichlet_sample_approximation(base_measure, alpha, tol=0.01):
    betas = []
    pis = []
    while sum(pis) < (1. - tol):
        s = np.sum([np.log(1 - b) for b in betas])
        new_beta = beta(1, alpha).rvs()
        betas.append(new_beta)
        pis.append(new_beta * np.exp(s))
    pis = np.array(pis)
    thetas = np.array([base_measure() for _ in pis])
    return pis, thetas

def plot_normal_dp_approximation(alpha):
    plt.figure()
    plt.title("Dirichlet Process Sample with N(0,1) Base Measure")
    plt.suptitle("alpha: %s" % alpha)
    pis, thetas = dirichlet_sample_approximation(lambda: norm().rvs(), alpha)
    pis = pis * (norm.pdf(0) / pis.max())  # scale stick weights for display
    plt.vlines(thetas, 0, pis)
    X = np.linspace(-4, 4, 100)
    plt.plot(X, norm.pdf(X))
- — The code for a Dirichlet process sample can be written as:
from numpy.random import choice
from scipy.stats import beta

class DirichletProcessSample():
    def __init__(self, base_measure, alpha):
        self.base_measure = base_measure
        self.alpha = alpha
        self.cache = []
        self.weights = []
        self.total_stick_used = 0.

    def __call__(self):
        remaining = 1.0 - self.total_stick_used
        i = DirichletProcessSample.roll_die(self.weights + [remaining])
        if i is not None and i < len(self.weights):
            return self.cache[i]  # re-use an already-drawn atom
        else:
            # break a new piece off the remaining stick and draw a new atom
            stick_piece = beta(1, self.alpha).rvs() * remaining
            self.total_stick_used += stick_piece
            self.weights.append(stick_piece)
            new_value = self.base_measure()
            self.cache.append(new_value)
            return new_value

    @staticmethod
    def roll_die(weights):
        if weights:
            return choice(range(len(weights)), p=weights)
        else:
            return None