Word2vec (word to vector):
- word2vec is a way to take a large body of text and convert it into a matrix with one word (vector) per row.
- It is a shallow neural network (2 layers)
- Two options/training methods:
CBOW (Continuous Bag-of-Words), which relies on the bag-of-words assumption:
- — a text is represented as the bag(multiset) of its words
- — disregards grammar
- — disregards word order but keeps multiplicity
- — Also used in computer vision
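A minimal sketch of the bag-of-words assumption described above, assuming simple lowercase whitespace tokenization (the helper name `bag_of_words` is made up for illustration):

```python
from collections import Counter

def bag_of_words(text):
    """Return the multiset of lowercase, whitespace-separated tokens."""
    return Counter(text.lower().split())

a = bag_of_words("the dog bit the man")
b = bag_of_words("the man bit the dog")

print(a["the"])  # 2: multiplicity is kept
print(a == b)    # True: word order (and so grammar) is discarded
```

The two sentences mean different things but produce the same bag, which is exactly the information the assumption throws away.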
skip-gram: a generalization of n-grams (an n-gram model is basically a Markov chain model of order n-1)
* — An n-gram model is an (n-1)-order Markov model
* — Used in protein sequencing, DNA sequencing, and computational linguistics (at the character and word level)
* — Models sequences, using the statistical properties of n-grams
* — predicts the next item x_i based on the previous n-1 items, x_{i-(n-1)}, ..., x_{i-1}.
* — In language modeling, independence assumptions are made so that each word depends only on the previous n-1 words (or characters, in character-level modeling)
* — The probability of a word conditional on previous n-1 words follows a
Categorical Distribution
* — In practice, the probability distributions are smoothed by assigning non-zero probabilities to unseen words or n-grams.
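The n-gram points above can be sketched as a bigram (n = 2) model, where each word depends only on the one before it. This is an illustrative sketch, not a reference implementation: the function names are made up, and add-alpha (Laplace) smoothing is just one simple way of giving unseen bigrams a non-zero probability.

```python
from collections import Counter

def train_bigram(tokens):
    """Count how often each word appears as a context and each bigram appears."""
    context_counts = Counter(tokens[:-1])
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    return context_counts, bigram_counts

def prob(word, prev, context_counts, bigram_counts, vocab_size, alpha=1.0):
    """P(word | prev) with add-alpha smoothing; alpha=0 gives the raw estimate."""
    num = bigram_counts[(prev, word)] + alpha
    den = context_counts[prev] + alpha * vocab_size
    return num / den

tokens = "the cat sat on the mat".split()
vocab = sorted(set(tokens))
ctx, bi = train_bigram(tokens)

# Raw estimate: the unseen bigram ("the", "sat") gets probability 0.
print(prob("cat", "the", ctx, bi, len(vocab), alpha=0))  # 0.5
print(prob("sat", "the", ctx, bi, len(vocab), alpha=0))  # 0.0

# Smoothed estimate: unseen bigrams get a small non-zero probability,
# and the conditional distribution over the vocabulary still sums to 1.
print(prob("sat", "the", ctx, bi, len(vocab)))
total = sum(prob(w, "the", ctx, bi, len(vocab)) for w in vocab)
```

With `alpha=0` the function divides by zero for a context that was never seen, which is another reason smoothing is used in practice.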
Bias-vs-Variance Tradeoff:
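- — Finding the right ‘n’ for a model depends on the bias-variance tradeoff we’re willing to make: a small n gives a simpler model with higher bias, while a large n gives sparser counts and higher variance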
Smoothing Techniques:
- — Smoothing addresses the problem of balancing weight between infrequent and frequent n-grams.
- — Without smoothing, unseen n-grams get probability 0.0 by default.
- — Use pseudocounts for unseen n-grams (generally motivated by Bayesian reasoning on the lower-order n-grams, i.e. n smaller than the original n).
- — Skip-grams also allow words to be skipped. For example, over a three-word sequence, the 1-skip bigrams (2-grams) include the pair formed by the first and third words, skipping the second, in addition to the ordinary bigrams.
- — Could be useful for languages with less strict subject-verb-object order than English.
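A small sketch of skip-bigram generation, restricted to pairs (n = 2) so that "at most k skipped words" is unambiguous. The function name `skip_bigrams` is hypothetical, and the output is assumed to include the ordinary bigrams (the zero-skip case) as well:

```python
def skip_bigrams(tokens, k):
    """All ordered pairs (tokens[i], tokens[j]), i < j, with at most k tokens skipped between them."""
    pairs = []
    for i in range(len(tokens)):
        # j = i + 1 is the ordinary bigram; j = i + 1 + k skips k tokens.
        for j in range(i + 1, min(i + k + 2, len(tokens))):
            pairs.append((tokens[i], tokens[j]))
    return pairs

tokens = ["a", "b", "c"]
print(skip_bigrams(tokens, 0))  # [('a', 'b'), ('b', 'c')]
print(skip_bigrams(tokens, 1))  # [('a', 'b'), ('a', 'c'), ('b', 'c')]
```

With `k=1` the extra pair `('a', 'c')` is the one that skips the second word of the three-word sequence.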
Alternative link
- Depends on the Distributional Hypothesis: words that appear in similar contexts tend to have similar meanings
- Vector representations of words called “word embeddings”
- Basic motivation: unlike the audio and visual domains, the word/text domain traditionally treats
words as discrete symbols, which encodes them as sparse data. Vector-based representations
work around these issues.
- Also called vector space models
- Two ways of training: (a) the CBOW (Continuous Bag-of-Words) model predicts a target word given
a group of context words; (b) skip-gram does the reverse, i.e. it predicts the group of context words from a given word.
- Trained using the maximum likelihood principle
- Ideally, maximizes the probability of the next word given the previous words ‘h’ (the history), expressed as a softmax function
- However, calculating the softmax requires computing and normalizing each probability using the scores of all other words in the vocabulary for the current context, at every training step.
- Therefore a binary classification (logistic regression) objective function is used instead.
- The way this is achieved is called negative sampling
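To make the cost difference concrete, here is a rough sketch comparing the full softmax with a negative-sampling loss for a single training example. Everything here is an assumption for illustration: the vectors are random, the names (`out_vecs`, `h`, `negative_sampling_loss`) are made up, and the noise words are drawn uniformly for simplicity, whereas real implementations typically sample from a frequency-based noise distribution.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax_prob(target, scores):
    """Full softmax: normalizes over the score of every vocabulary word."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    return exps[target] / sum(exps)

def negative_sampling_loss(h, out_vecs, target, negatives):
    """Binary logistic loss: the real target is labeled 1, sampled noise words 0."""
    loss = -math.log(sigmoid(dot(h, out_vecs[target])))
    for n in negatives:
        loss -= math.log(sigmoid(-dot(h, out_vecs[n])))
    return loss

random.seed(0)
vocab_size, dim, k = 1000, 8, 5
out_vecs = [[random.uniform(-0.5, 0.5) for _ in range(dim)]
            for _ in range(vocab_size)]
h = [random.uniform(-0.5, 0.5) for _ in range(dim)]  # context representation
target = 42

# Full softmax: needs all vocab_size scores for one probability.
scores = [dot(h, v) for v in out_vecs]
p = softmax_prob(target, scores)

# Negative sampling: needs only 1 + k output vectors.
candidates = [i for i in range(vocab_size) if i != target]
negatives = random.sample(candidates, k)
loss = negative_sampling_loss(h, out_vecs, target, negatives)
```

The softmax path computes `vocab_size` dot products per example, while the negative-sampling path computes `1 + k`, which is why it scales to large vocabularies.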