Word2vec:
 — word2vec is a way to take a big set of text and convert it into a matrix with a word at each row (each row is that word's vector). It is a shallow neural network (2 layers).
Two options/training methods:
CBOW (Continuous Bag of Words assumption)
 — a text is represented as the bag (multiset) of its words
 — disregards grammar
 — disregards word order but keeps multiplicity
 — Also used in computer vision
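A minimal sketch of the bag-of-words idea above, using Python's `Counter` (the tooling choice is an assumption, not from the notes):

```python
from collections import Counter

# A text as a bag (multiset) of its words: grammar and word order
# are discarded, but multiplicity is kept.
text = "the cat sat on the mat"
bag = Counter(text.split())

print(bag["the"])  # 2 -- multiplicity is preserved
print(bag["cat"])  # 1
```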
skip-gram — it is a generalization of n-grams (which is basically a Markov chain model of order n−1)
 — It is an (n−1)-order Markov model
 — Used in protein sequencing, DNA sequencing, and computational linguistics (character- and word-level)
 — Models sequences using the statistical properties of n-grams
 — Predicts each item based on the items that precede it
 — In language modeling, independence assumptions are made so that each word depends only on the n−1 previous words (or characters, in the case of character-level modeling)
 — The probability of a word conditional on the previous n−1 words follows a categorical distribution
 — In practice, the probability distributions are smoothed by assigning nonzero probabilities to unseen words or n-grams.
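A tiny sketch of the bullets above for n = 2: a bigram model is a 1st-order Markov model, and the distribution over next words given the previous word is categorical. The function names here are hypothetical:

```python
from collections import Counter, defaultdict

def train_bigram_model(tokens):
    """Count bigrams: each word is assumed to depend only on the
    previous word (an (n-1) = 1st-order Markov model)."""
    counts = defaultdict(Counter)
    for prev, cur in zip(tokens, tokens[1:]):
        counts[prev][cur] += 1
    return counts

def prob(counts, prev, cur):
    """Categorical distribution over next words, given the previous word."""
    total = sum(counts[prev].values())
    return counts[prev][cur] / total if total else 0.0

model = train_bigram_model("the cat sat on the mat".split())
print(prob(model, "the", "cat"))  # 0.5 -- "the" is followed by "cat" or "mat"
```

Note that without smoothing, an unseen bigram like ("the", "dog") gets probability 0.0, which is exactly the problem the smoothing section below addresses.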
Bias vs Variance Tradeoff:
 — Finding the right ‘n’ for a model is based on the bias vs variance tradeoff we’re willing to make
Smoothing Techniques:
 — Problem of balancing weight between infrequent n-grams.
 — Unseen n-grams by default get probability 0.0 without smoothing.

— Use pseudocounts for unseen n-grams (generally motivated by Bayesian reasoning on the sub-n-grams, for smaller n than the original n).
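A sketch of the pseudocount idea (add-k smoothing); the function name and the toy counts are hypothetical:

```python
from collections import Counter, defaultdict

# Hypothetical toy counts: "the" was followed by "cat" once and "mat" once.
counts = defaultdict(Counter)
counts["the"].update(["cat", "mat"])

def smoothed_prob(counts, vocab_size, prev, cur, k=1.0):
    """Add-k (pseudocount) smoothing: every n-gram, seen or unseen,
    gets a nonzero probability."""
    total = sum(counts[prev].values())
    return (counts[prev][cur] + k) / (total + k * vocab_size)

# The unseen bigram ("the", "dog") no longer gets 0.0:
print(smoothed_prob(counts, vocab_size=5, prev="the", cur="dog"))  # 1/7
```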
— Skip-grams also allow the possibility of skipping words. So a 1-skip bigram would create bigrams while skipping the second word in a three-word sequence.
 — Could be useful for languages with a less strict subject-verb-object order than English.
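A sketch of k-skip bigram extraction (the function name is hypothetical): for a three-word sequence, the 1-skip bigrams include the pair that skips the middle word:

```python
def skip_bigrams(tokens, k=1):
    """k-skip bigrams: word pairs with up to k intervening words skipped."""
    pairs = []
    for i, word in enumerate(tokens):
        # a gap of 1 is an ordinary bigram; gaps up to k+1 skip words
        for gap in range(1, k + 2):
            if i + gap < len(tokens):
                pairs.append((word, tokens[i + gap]))
    return pairs

print(skip_bigrams(["she", "runs", "fast"], k=1))
# [('she', 'runs'), ('she', 'fast'), ('runs', 'fast')]
```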
Alternative link
 — Depends on the Distributional Hypothesis
 — Vector representations of words are called “word embeddings”
 — Basic motivation: compared to the audio and visual domains, the word/text domain treats words as discrete symbols, encoding them as sparse data. Vector-based representation works around these issues. Also called vector space models.
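To make the sparse-vs-dense contrast concrete, here is a toy sketch (the vocabulary and the embedding values are made up, not trained):

```python
vocab = ["cat", "dog", "mat"]

# Discrete-symbol encoding: one-hot vectors are as wide as the
# vocabulary and almost entirely zeros (sparse).
one_hot = {w: [1.0 if i == j else 0.0 for j in range(len(vocab))]
           for i, w in enumerate(vocab)}

# A hypothetical dense embedding: a few real-valued dimensions
# shared by all words, regardless of vocabulary size.
embedding = {"cat": [0.2, -0.4], "dog": [0.3, -0.5], "mat": [-0.7, 0.1]}

print(one_hot["cat"])  # [1.0, 0.0, 0.0]
```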

Two ways of training: (a) the CBOW (Continuous Bag of Words) model predicts a target word given a group of context words; (b) skip-gram is the reverse, i.e. it predicts the group of context words from a given word.
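The two directions can be sketched with a toy pair generator (hypothetical names, not the actual word2vec data pipeline):

```python
def training_pairs(tokens, window=1, mode="skipgram"):
    """Generate (input, target) training pairs.
    CBOW:      context words -> center word
    skip-gram: center word   -> each context word (the reverse)."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        if mode == "skipgram":
            pairs.extend((center, c) for c in context)
        else:  # CBOW
            pairs.append((tuple(context), center))
    return pairs

print(training_pairs(["a", "b", "c"]))
# [('a', 'b'), ('b', 'a'), ('b', 'c'), ('c', 'b')]
```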
Trained using maximum likelihood estimation.
 Ideally, maximizes the probability of the next word given the previous ‘h’ words, in terms of a softmax function.
 However, calculating the softmax requires computing and normalizing a score for every word in the vocabulary at every step.
 Therefore a logistic regression (binary classification) objective function is used instead.
 The way this is achieved is called negative sampling.
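A sketch of the negative-sampling objective for one training pair, using plain Python lists as vectors (all names here are hypothetical): the real (center, context) pair is treated as label 1 and each sampled "noise" word as label 0, replacing the full-vocabulary softmax with a few binary logistic classifications:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def neg_sampling_loss(v_center, u_context, u_negatives):
    """Negative-sampling loss for one (center, context) pair:
    -log sigma(u_ctx . v) - sum_neg log sigma(-u_neg . v)."""
    loss = -math.log(sigmoid(dot(v_center, u_context)))
    for u_neg in u_negatives:
        loss -= math.log(sigmoid(-dot(v_center, u_neg)))
    return loss

# Toy 2-d embeddings: the loss is low when the true pair's vectors
# align and the negative's vector does not.
print(neg_sampling_loss([1.0, 0.0], [1.0, 0.0], [[-1.0, 0.0]]))
```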