via Communicating — When Your Mind Travels At Warp Speed
Word2vec (aka word embeddings):
 Word2vec is a way to take a big corpus of text and convert it into a matrix with a word vector at each row. It is a shallow neural network (2 layers).
 Two options/training methods:
CBOW (Continuous Bag of Words) assumption:
 — a text is represented as the bag (multiset) of its words
 — disregards grammar
 — disregards word order but keeps multiplicity
 — also used in computer vision
Skip-gram — a generalization of n-grams (which is basically a Markov chain model of order n−1)
* — It is an (n−1)-order Markov model
* — Used in protein sequencing, DNA sequencing, and computational linguistics (character- and word-level)
* — Models sequences using the statistical properties of n-grams
* — Predicts the next item based on the previous items in the sequence
* — In language modeling, independence assumptions are made so that each word depends only on the n−1 previous words (or characters, in the case of character-level modeling)
* — The probability of a word conditional on the previous n−1 words follows a categorical distribution
* — In practice, the probability distributions are smoothed by assigning non-zero probabilities to unseen words or n-grams.
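These ideas can be sketched in a few lines of Python. The toy corpus, vocabulary, and add-one pseudocount below are illustrative assumptions, not a real setup:

```python
from collections import Counter

def bigram_probs(tokens, vocab, pseudocount=1.0):
    """Categorical distribution over the next word given the previous one,
    with pseudocount smoothing so unseen bigrams get non-zero probability."""
    counts = Counter(zip(tokens, tokens[1:]))   # observed bigram counts
    context = Counter(tokens[:-1])              # counts of the conditioning word
    V = len(vocab)
    def p(word, prev):
        return (counts[(prev, word)] + pseudocount) / (context[prev] + pseudocount * V)
    return p

p = bigram_probs(["a", "b", "a", "b", "c"], vocab={"a", "b", "c"})
print(p("b", "a"))  # seen bigram: 0.6
print(p("c", "a"))  # unseen bigram, still non-zero: 0.2
```

Without the pseudocount, the second query would return 0.0, which is exactly the problem smoothing addresses.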
Bias vs. Variance Tradeoff:
 — Finding the right ‘n’ for a model is based on the bias-vs-variance tradeoff we’re willing to make
Smoothing Techniques:
 — Problem of balancing weight between infrequent n-grams.
 — Unseen n-grams by default get probability 0.0 without smoothing.
 — Use pseudocounts for unseen n-grams (generally motivated by Bayesian reasoning on the sub-n-grams, for n < original n).
 — Skip-grams also allow the possibility of skipping words. So a 1-skip bigram would create bigrams while also skipping the second word in a three-word sequence.
 — Could be useful for languages with a less strict subject-verb-object order than English.
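The skipping idea can be made concrete with a small helper (the toy sentence is made up, and `k_skip_bigrams` is a hypothetical name, not a standard library function):

```python
def k_skip_bigrams(tokens, k):
    """Ordered pairs (tokens[i], tokens[j]) with at most k tokens skipped
    between them; k = 0 gives ordinary bigrams."""
    pairs = []
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + 2 + k, len(tokens))):
            pairs.append((tokens[i], tokens[j]))
    return pairs

pairs = k_skip_bigrams(["the", "rain", "in", "spain"], k=1)
print(pairs)
# ('the', 'in') and ('rain', 'spain') are the extra pairs gained by skipping
```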
Alternative link:
 Depends on the Distributional Hypothesis
 Vector representations of words are called “word embeddings”
 Basic motivation: compared to the audio and visual domains, the word/text domain treats words as discrete symbols, encoded as sparse data. Vector-based representations work around these issues. Also called vector space models.

Two ways of training: (a) the CBOW (Continuous Bag of Words) model predicts the target word given a group of context words; (b) skip-gram is the reverse, i.e. it predicts the group of context words from a given word.
 Trained using the maximum likelihood method
 Ideally, maximizes the probability of the next word given the previous ‘h’ words in terms of a softmax function
 However, calculating the softmax values requires computing and normalizing a score for every other word in the vocabulary at every training step.
 Therefore a logistic regression (binary classification) objective function is used instead.
 The way this is achieved is called negative sampling.
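A minimal NumPy sketch of the negative-sampling objective (the tiny vocabulary, random embeddings, and hand-picked word indices are illustrative assumptions, not a real training setup):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 4                                 # toy vocabulary and embedding sizes
W_in = rng.normal(scale=0.1, size=(V, d))    # input (target-word) embeddings
W_out = rng.normal(scale=0.1, size=(V, d))   # output (context-word) embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(target, context, negatives):
    """Binary-classification objective: push the true (target, context)
    pair towards label 1 and k sampled negative words towards label 0,
    avoiding the full-vocabulary softmax normalization."""
    v = W_in[target]
    pos = np.log(sigmoid(W_out[context] @ v))
    neg = np.sum(np.log(sigmoid(-W_out[negatives] @ v)))
    return -(pos + neg)

loss = negative_sampling_loss(target=3, context=7, negatives=[1, 4, 8])
print(loss)  # a finite positive number
```

Note the cost here scales with the number of negatives (3), not with the vocabulary size, which is the whole point of replacing the softmax.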
#108: SEEDS goes public
A Data Driven Guide to Becoming a Consistent Billionaire
Did You Really Think All Billionaires Were the Same?
Recently, I became a bit obsessed with the one percent of the one percent – Billionaires. I was intrigued when I stumbled on articles telling us who and what billionaires really are. The articles said stuff like: Most entrepreneurs do not have a degree and the average billionaire was in their 30s before starting their business. I felt like this was a bit of a generalization and I’ll explain. Let’s take a look at Bill Gates and Hajime Satomi, the CEO of Sega. Both are billionaires but are they really the same? In the past decade, Bill Gates has been a billionaire every single year while Hajime has dropped off the Forbes’ list three times. Is it fair to put these two individuals in the same box, post nice articles and give nice stats when no one wants to be a…
When Humans Keep Letting You Down
Dirichlet Distribution
Dirichlet Distribution (DD):
* — A symmetric DD can be considered a distribution of distributions
* — Each sample from a symmetric DD is a categorical distribution over K categories
* — Generates samples that are similar discrete distributions
* — It is parameterized by G0, a distribution over K categories, and a scale factor
<code>
import numpy as np
from scipy.stats import dirichlet

np.set_printoptions(precision=2)

def stats(scale_factor, G0=[.2, .2, .6], N=10000):
    samples = dirichlet(alpha=scale_factor * np.array(G0)).rvs(N)
    print("alpha:", scale_factor)
    print("element-wise mean:", samples.mean(axis=0))
    print("element-wise standard deviation:", samples.std(axis=0))
    print()

for scale in [0.1, 1, 10, 100, 1000]:
    stats(scale)
</code>
Dirichlet Process (DP):
 — A way to generalize the Dirichlet Distribution
 — Generates samples that are distributions similar to the parameter $H_0$
 — Also has a concentration parameter $\alpha$ that determines how much the samples vary from $H_0$
 — A sample $H$ of a DP is constructed by drawing a countably infinite number of samples $\theta_k \sim H_0$ and then setting $H = \sum_{k=1}^{\infty} \pi_k \delta_{\theta_k}$, where $\pi_k$ are carefully chosen weights that sum to 1 and $\delta$ is the Dirac delta function.
* — Since the samples from a DP are similar to the parameter, one way to test whether a DP is generating your dataset is to check if the distributions (of different attributes/dimensions) you get are similar to each other. Something like a permutation test, but understand the assumptions and caveats.
 — The code for Dirichlet sampling (a stick-breaking approximation) can be written as:
<code>
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta, norm

def dirichlet_sample_approximation(base_measure, alpha, tol=0.01):
    # Stick-breaking: keep breaking off beta(1, alpha) fractions of the
    # remaining stick until the weights sum to at least (1 - tol).
    betas = []
    pis = []
    betas.append(beta(1, alpha).rvs())
    pis.append(betas[0])
    while sum(pis) < (1. - tol):
        s = np.sum([np.log(1 - b) for b in betas])
        new_beta = beta(1, alpha).rvs()
        betas.append(new_beta)
        pis.append(new_beta * np.exp(s))
    pis = np.array(pis)
    thetas = np.array([base_measure() for _ in pis])
    return pis, thetas

def plot_normal_dp_approximation(alpha):
    plt.figure()
    plt.title("Dirichlet Process Sample with N(0,1) Base Measure")
    plt.suptitle("alpha: %s" % alpha)
    pis, thetas = dirichlet_sample_approximation(lambda: norm().rvs(), alpha)
    pis = pis * (norm.pdf(0) / pis.max())
    plt.vlines(thetas, 0, pis)
    X = np.linspace(-4, 4, 100)
    plt.plot(X, norm.pdf(X))

plot_normal_dp_approximation(.1)
plot_normal_dp_approximation(1)
plot_normal_dp_approximation(10)
plot_normal_dp_approximation(1000)
</code>
 — The code for the Dirichlet process can be written as:
<code>
from numpy.random import choice
from scipy.stats import beta

class DirichletProcessSample():
    def __init__(self, base_measure, alpha):
        self.base_measure = base_measure
        self.alpha = alpha
        self.cache = []
        self.weights = []
        self.total_stick_used = 0.

    def __call__(self):
        # Either reuse a previously drawn value (with probability equal to
        # its stick weight) or break a new piece off the remaining stick.
        remaining = 1.0 - self.total_stick_used
        i = DirichletProcessSample.roll_die(self.weights + [remaining])
        if i is not None and i < len(self.weights):
            return self.cache[i]
        else:
            stick_piece = beta(1, self.alpha).rvs() * remaining
            self.total_stick_used += stick_piece
            self.weights.append(stick_piece)
            new_value = self.base_measure()
            self.cache.append(new_value)
            return new_value

    @staticmethod
    def roll_die(weights):
        if weights:
            return choice(range(len(weights)), p=weights)
        else:
            return None
</code>
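A self-contained usage sketch of the same stick-breaking/memoization idea (re-implemented compactly here so it runs on its own; `CompactDP`, alpha = 1.0, and the N(0,1) base measure are illustrative choices) shows the hallmark of DP samples: repeated draws produce ties, because a DP sample is a discrete distribution.

```python
import numpy as np
from scipy.stats import beta, norm

np.random.seed(0)

class CompactDP:
    """Stick-breaking sampler with memoized atoms."""
    def __init__(self, alpha, base_measure):
        self.alpha = alpha
        self.base_measure = base_measure
        self.weights, self.values, self.used = [], [], 0.0

    def draw(self):
        remaining = 1.0 - self.used
        # Reuse an old atom with probability equal to its weight,
        # otherwise break a new piece off the remaining stick.
        i = np.random.choice(len(self.weights) + 1, p=self.weights + [remaining])
        if i < len(self.weights):
            return self.values[i]
        piece = beta(1, self.alpha).rvs() * remaining
        self.used += piece
        self.weights.append(piece)
        self.values.append(self.base_measure())
        return self.values[-1]

dp = CompactDP(alpha=1.0, base_measure=lambda: norm().rvs())
draws = [dp.draw() for _ in range(500)]
# Far fewer distinct values than draws: the sample is discrete.
print("distinct values among 500 draws:", len(set(draws)))
```

Larger alpha yields more distinct atoms (samples closer to the base measure); smaller alpha concentrates the draws on a handful of values.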
Harry Plotter: Celebrating the 20 year anniversary with tidytext and the tidyverse in R
Statistics — Tests of independence
Tests of independence:
 Basic principle is the same as the ${\chi}^2$ goodness-of-fit test
* Between categorical variables
${\chi}^2$ tests:
 The standard approach is to compute expected counts, and find the distribution of the sum of squared differences between expected and observed counts (normalized).
* Between numerical variables
${\chi}^2$ test:
 Between a categorical and a numerical variable?
Null Hypothesis:
 The two variables are independent.
 Always a right-tail test
 Test statistic/measure has a ${\chi}^2$ distribution, if the assumptions are met:
 Data are obtained from a random sample
 Expected frequency of each category must be at least 5
Properties of the test:
 The data are the observed frequencies.
 The data are arranged into a contingency table.
 The degrees of freedom are the degrees of freedom for the row variable times the degrees of freedom for the column variable. It is not one less than the sample size; it is the product of the two degrees of freedom.
 It is always a right-tail test.
 It has a ${\chi}^2$ distribution.
 The expected value is computed by taking the row total times the column total and dividing by the grand total.
 The value of the test statistic doesn’t change if the order of the rows or columns is switched.
 The value of the test statistic doesn’t change if the rows and columns are interchanged (transpose of the matrix).
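A quick sketch with `scipy.stats.chi2_contingency` (the observed counts below are made up for illustration) confirms several of these properties:

```python
import numpy as np
from scipy.stats import chi2_contingency

# hypothetical contingency table: 2 row categories x 3 column categories
observed = np.array([[20, 30, 50],
                     [30, 30, 40]])

chi2, p, dof, expected = chi2_contingency(observed)
print("chi2:", chi2, "p-value:", p, "dof:", dof)  # dof = (2-1)*(3-1) = 2

# expected counts = row total * column total / grand total
print(expected)

# the statistic is unchanged when rows and columns are interchanged
chi2_t, _, dof_t, _ = chi2_contingency(observed.T)
print(np.isclose(chi2, chi2_t))  # True
```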
The mystery of short term past performance versus future equity fund returns
In our earlier posts, here and here, we found, to our dismay, that our natural inclination to choose the top mutual fund performers of the past 1 & 3 years hasn’t worked too well.
That leaves us with the obvious question..
What actually goes wrong when we pick the top funds of the past few years?
The rotating sector winners..
Below is a representation of the best performing sectors year over year. What do you notice?
The sector performance over each and every year varies significantly and the top and bottom sectors keep changing dramatically almost every year.
Sample this:
 2007 – Metals was the top performer with a whopping 121% annual return
 2008 – Metals was the bottom performer with a negative 74% return & FMCG was the top performer (21%)
 2009 – The tables turned! FMCG was the bottom performer (47%) while Metals was the…