Word2vec (aka word embeddings)

Word2vec:

  • Word2vec is a way to take a large corpus of text and convert it into a matrix with one word
    vector per row.
  • It is a shallow neural network (2 layers).
  • There are two training methods:

CBOW (Continuous Bag-of-Words)

  • — A text is represented as the bag (multiset) of its words (see the count sketch after this list)
  • — Disregards grammar
  • — Disregards word order but keeps multiplicity
  • — Also used in computer vision
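
As a minimal illustration of the bag-of-words representation described above (not of word2vec itself), a text reduces to word counts: order and grammar are discarded, but multiplicity is kept. The sentence here is made up for the example.

```python
from collections import Counter

text = "the quick brown fox jumps over the lazy dog and the fox"
bag = Counter(text.split())  # multiset of words: order and grammar are gone

print(bag["the"])  # 3 -- multiplicity is preserved
print(bag["fox"])  # 2
```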

Skip-gram: a generalization of n-grams (which are essentially (n-1)-order Markov chain models)
* — An n-gram model is an (n-1)-order Markov model.
* — Used in protein sequencing, DNA sequencing, and computational linguistics (both character- and word-level).
* — Models sequences using the statistical properties of n-grams.
* — Predicts x_i based on x_{i-(n-1)}, ..., x_{i-1}.
* — In language modeling, independence assumptions are made so that each word depends only on the previous n-1 words (or characters, in the case of character-level modeling).
* — The probability of a word conditional on the previous n-1 words follows a categorical distribution (see the bigram sketch below).
* — In practice, the probability distributions are smoothed by assigning non-zero probabilities to unseen words or n-grams.
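
A minimal sketch of the n = 2 case (a bigram model): the conditional probability of the next word is a categorical distribution estimated from counts. The toy corpus and function names are made up for illustration.

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate".split()

# Count bigrams: how often each word follows a given previous word
following = defaultdict(Counter)
for prev, cur in zip(corpus, corpus[1:]):
    following[prev][cur] += 1

def p_next(word, prev):
    """Maximum-likelihood estimate of P(x_i = word | x_{i-1} = prev)."""
    counts = following[prev]
    total = sum(counts.values())
    return counts[word] / total if total else 0.0

print(p_next("cat", "the"))  # 2/3: "the" is followed by "cat" twice and "mat" once
print(p_next("dog", "the"))  # 0.0 for an unseen bigram -> this motivates smoothing
```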

Bias-vs-Variance Tradeoff:

  • — Finding the right ‘n’ for an n-gram model comes down to the bias-vs-variance tradeoff we’re willing to make.

Smoothing Techniques:

  • — The problem is how to balance the weight given to infrequent n-grams.
  • — Without smoothing, unseen n-grams get probability 0.0 by default.
  • — Use pseudocounts for unseen n-grams (generally motivated by Bayesian
    reasoning on the sub-n-grams, i.e. for n smaller than the original n).

  • — Skip-grams also allow the possibility of skipping words. So a 1-skip bigram would form bigrams while skipping the middle word of a three-word sequence (see the sketch after these bullets).

  • — Could be useful for languages with less strict subject-verb-object order than English.
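
A small sketch tying the bullets above together: generating 1-skip bigrams and applying add-one pseudocounts so unseen pairs get non-zero probability. The corpus, the helper name k_skip_bigrams, and the choice of add-one (Laplace) smoothing are illustrative assumptions, not something prescribed by the post.

```python
from collections import Counter

def k_skip_bigrams(tokens, k=1):
    """All ordered pairs (w_i, w_j) with at most k words skipped between them."""
    pairs = []
    for i, left in enumerate(tokens):
        for j in range(i + 1, min(i + 2 + k, len(tokens))):
            pairs.append((left, tokens[j]))
    return pairs

tokens = "the quick brown fox".split()
counts = Counter(k_skip_bigrams(tokens, k=1))
# 1-skip bigrams include ("the", "brown") and ("quick", "fox"),
# pairs that plain bigrams would miss.

vocab_pairs = len(set(tokens)) ** 2  # number of possible ordered pairs

def smoothed_p(pair, alpha=1.0):
    """Add-alpha (pseudocount) estimate: unseen pairs get a small non-zero probability."""
    total = sum(counts.values()) + alpha * vocab_pairs
    return (counts[pair] + alpha) / total

print(smoothed_p(("the", "brown")))  # seen 1-skip bigram
print(smoothed_p(("fox", "the")))    # unseen pair, but > 0 thanks to the pseudocount
```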

Alternative link

  • Depends on the Distributional Hypothesis
  • Vector representations of words are called “word embeddings”
  • The basic motivation is that, compared to the audio and visual domains, the word/text domain
    treats words as discrete symbols, encoding them as sparse data. Vector-based representations
    work around these issues.
  • Also called Vector Space Models
  • Two ways of training: (a) the CBOW (Continuous Bag-of-Words) model predicts a target word
    given a group of context words; (b) skip-gram is the inverse, i.e. it predicts the group of
    context words from a given word (see the sketch below).
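
As a quick illustration of selecting between the two training modes, here is how it might look with the gensim library (gensim and the toy sentences are my own choice for the example, not something named in the post):

```python
from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "ate", "the", "bone"]]

# sg=0 -> CBOW: predict the target word from its surrounding context window
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

# sg=1 -> skip-gram: predict the context words from the target word
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, negative=5)

print(skipgram.wv["cat"].shape)  # one 50-dimensional vector per vocabulary word
```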

  • Trained using maximum likelihood estimation

  • Ideally, it maximizes the probability of the next word given the previous ‘h’ words, expressed as a softmax function
  • However, calculating the softmax requires computing and normalizing a score against every other word in the vocabulary at every training step.
  • Therefore a logistic regression, i.e. binary classification, objective function is used instead.
  • The way this is achieved is called negative sampling (a sketch follows below).
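
A minimal numpy sketch of what that negative-sampling objective looks like for a single (center, context) pair, assuming embeddings are plain numpy vectors (the name sgns_loss and the random example values are mine, not from the post):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(v_center, v_context, v_negatives):
    """Skip-gram negative-sampling loss for one training pair.

    Instead of a softmax over the whole vocabulary, the model solves k+1 binary
    classification problems: the true (center, context) pair should score 1,
    and k randomly sampled "noise" words should score 0.
    """
    pos = np.log(sigmoid(v_context @ v_center))               # observed pair -> label 1
    neg = np.sum(np.log(sigmoid(-(v_negatives @ v_center))))  # noise pairs -> label 0
    return -(pos + neg)  # minimize the negative log-likelihood

rng = np.random.default_rng(0)
d, k = 50, 5  # embedding dimension and number of negative samples
print(sgns_loss(rng.normal(size=d), rng.normal(size=d), rng.normal(size=(k, d))))
```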
