Word_2_vector.. (aka word embeddings)

Word 2 vector:

Word 2 vector (word2vec) is a way to take a large body of text and convert it into a matrix
with one word per row; each row is that word's vector.
It is a shallow neural network (2 layers).
There are two training methods: CBOW and skip-gram.

CBOW(Continuous-bag-of-words assumption)

— a text is represented as the bag (multiset) of its words
— disregards grammar
— disregards word order but keeps multiplicity
— Also used in computer vision
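
As a minimal sketch of the bag-of-words idea above, Python's `collections.Counter` can stand in for the multiset. The helper name `bag_of_words` and the whitespace tokenization are assumptions made for illustration only; real tokenizers are more careful.

```python
from collections import Counter

def bag_of_words(text):
    """Return the multiset of lowercase, whitespace-separated tokens in `text`."""
    return Counter(text.lower().split())

a = bag_of_words("the dog chased the cat")
b = bag_of_words("the cat chased the dog")

print(a["the"])  # 2: multiplicity is kept
print(a == b)    # True: word order is lost, so both sentences give the same bag
```

The two sentences mean different things, yet their bags are identical. That is the information the bag-of-words assumption throws away.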

skip-gram() — it is a generalization

of n-grams (an n-gram model is basically a Markov chain model of order n-1)
* — It is an (n-1)-order Markov model
* — Used in protein sequencing, DNA sequencing, and computational
linguistics (at character and word level)
* — Models sequences, using the statistical properties of n-grams
* — Predicts $x_i$ based on $x_{i-(n-1)}, \dots, x_{i-1}$.
* — In language modeling, independence assumptions are made so that each
word depends only on the previous n-1 words (or characters, in
character-level modeling)
* — The probability of a word conditional on the previous n-1 words follows a
categorical distribution
* — In practice, the probability distributions are smoothed by assigning non-zero probabilities to unseen words or n-grams.
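
The points above can be sketched for the simplest case, n = 2 (a bigram model). This is an illustrative sketch, not a reference implementation: the function names `train_bigram` and `prob` are made up here, and add-k smoothing is just one simple choice of pseudocount scheme.

```python
from collections import Counter

def train_bigram(tokens):
    """Count bigrams and the contexts (left-hand words) they start from."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    contexts = Counter(tokens[:-1])
    return bigrams, contexts

def prob(word, prev, bigrams, contexts, vocab_size, k=0.0):
    """P(word | prev) with add-k smoothing; k=0 gives the raw relative frequency."""
    denom = contexts[prev] + k * vocab_size
    if denom == 0:
        return 0.0  # unseen context and no smoothing: nothing to estimate from
    return (bigrams[(prev, word)] + k) / denom

tokens = "the cat sat on the mat".split()
bigrams, contexts = train_bigram(tokens)
V = len(set(tokens))  # vocabulary size, 5 here

print(prob("cat", "the", bigrams, contexts, V))         # 0.5: "the" is followed by "cat" once out of twice
print(prob("sat", "the", bigrams, contexts, V))         # 0.0: "the sat" was never seen
print(prob("sat", "the", bigrams, contexts, V, k=1.0))  # 1/7: the pseudocount makes it non-zero
```

With k = 1 the smoothed probabilities for a given context still sum to 1 over the vocabulary, so the result remains a valid categorical distribution.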

Bias-vs-Variance Tradeoff:

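— Finding the right ‘n’ for a model is based on the bias-vs-variance tradeoff we’re willing to make: a small n ignores useful context (high bias), while a large n needs far more data to estimate reliably (high variance).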

Smoothing Techniques:

— There are problems of balancing weight between infrequent and frequent n-grams.
— Unseen n-grams get a probability of 0.0 by default without smoothing.
— Use pseudocounts for unseen n-grams (generally motivated by
Bayesian reasoning on the sub-n-grams, for n < the original n).
— Skip-grams also allow the possibility of skipping. So a 1-skip bigram (2-gram) would create bigrams while skipping the second word in a three-word sequence, i.e. pairing the first word with the third.
— Could be useful for languages with less strict subject-verb-object order than English.
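
To make the skipping concrete, here is a small sketch that generates k-skip bigrams. The function name `skip_bigrams` is hypothetical, and the sketch assumes the common convention that a k-skip bigram set also contains the bigrams with fewer than k skips (including ordinary adjacent bigrams).

```python
def skip_bigrams(tokens, k):
    """Return all ordered word pairs separated by at most k skipped tokens."""
    pairs = []
    for i, left in enumerate(tokens):
        for gap in range(k + 1):  # gap = number of tokens skipped
            j = i + 1 + gap
            if j < len(tokens):
                pairs.append((left, tokens[j]))
    return pairs

tokens = ["the", "rain", "falls"]
print(skip_bigrams(tokens, 0))  # [('the', 'rain'), ('rain', 'falls')]
print(skip_bigrams(tokens, 1))  # [('the', 'rain'), ('the', 'falls'), ('rain', 'falls')]
```

With k = 1 the pair ('the', 'falls') appears, which is the three-word case described above: the second word is skipped.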

Alternative link

Depends on the distributional hypothesis: words that appear in similar contexts tend to have similar meanings.
Vector representations of words are called “word embeddings”.
The basic motivation is that, unlike the audio and visual domains, the word/text domain
traditionally treats words as discrete symbols, which leads to sparse data. Vector-based
representations work around these issues.
Also called vector space models.
Two ways of training: (a) the CBOW (Continuous-Bag-Of-Words) model predicts a target word given
a group of context words; (b) skip-gram is the reverse, i.e. it predicts the group of context words from a given word.
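
The difference between the two setups is easiest to see in the training examples each one produces. The sketch below is only illustrative: the helper names (`context_window`, `cbow_pairs`, `skipgram_pairs`) and the `window` parameter are assumptions, not part of any library.

```python
def context_window(tokens, i, window):
    """Words within `window` positions of index i, excluding tokens[i] itself."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right

def cbow_pairs(tokens, window=1):
    """CBOW: input is the group of context words, output is the target word."""
    return [(context_window(tokens, i, window), target)
            for i, target in enumerate(tokens)]

def skipgram_pairs(tokens, window=1):
    """Skip-gram: input is one word, output is each of its context words in turn."""
    return [(target, ctx)
            for i, target in enumerate(tokens)
            for ctx in context_window(tokens, i, window)]

tokens = "the quick brown fox".split()
print(cbow_pairs(tokens)[1])      # (['the', 'brown'], 'quick')
print(skipgram_pairs(tokens)[1:3])  # [('quick', 'the'), ('quick', 'brown')]
```

The same window around "quick" yields one CBOW example but two skip-gram examples, since skip-gram treats every (word, context word) pair separately.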
Trained using the maximum likelihood principle.
Ideally, it maximizes the probability of the next word given the previous ‘h’ words, expressed in terms of a softmax function.
However, calculating the softmax requires computing and normalizing each probability using the scores of all the other words in the vocabulary, at every training step.
Therefore a logistic regression (i.e. binary classification) objective function is used instead: it discriminates the real target word from a few sampled noise words.
The way this is achieved is called negative sampling.
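
A rough sketch of why this helps, in plain Python. Everything here is an assumption made for illustration: the function names, the random vectors, and especially the uniform choice of negative words (word2vec itself samples noise words from a smoothed unigram distribution).

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax_loss(h, target, out_vecs):
    """Full softmax negative log-likelihood: scores every word in the vocabulary."""
    scores = [dot(h, v) for v in out_vecs]
    m = max(scores)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[target]

def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x)))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def negative_sampling_loss(h, target, negatives, out_vecs):
    """Binary logistic loss: the true word is labelled 1, each sampled word 0."""
    loss = -log_sigmoid(dot(h, out_vecs[target]))
    for n in negatives:
        loss -= log_sigmoid(-dot(h, out_vecs[n]))
    return loss

random.seed(0)
vocab_size, dim, num_negatives = 1000, 8, 5
out_vecs = [[random.gauss(0, 0.1) for _ in range(dim)] for _ in range(vocab_size)]
h = [random.gauss(0, 0.1) for _ in range(dim)]  # stand-in for the hidden/context vector
target = 42
candidates = [w for w in range(vocab_size) if w != target]
negatives = random.sample(candidates, num_negatives)  # uniform sampling: a simplification

print(softmax_loss(h, target, out_vecs))                       # needs 1000 dot products
print(negative_sampling_loss(h, target, negatives, out_vecs))  # needs only 1 + 5
```

The cost of the second loss depends on the number of sampled words rather than on the vocabulary size, which is the practical point of negative sampling.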


softwaremechanic
