Paper Presentation
Word Representations
in Vector Space
Abdullah Khan Zehady
Department of Computer Science,
Purdue University.
E-mail: [email protected]
Neural Word Embedding
● Continuous vector space representation
o Words are represented as dense real-valued vectors in R^d
● Distributed word representation ↔ word embedding
o Embeds an entire vocabulary into a relatively low-dimensional linear space whose dimensions are latent continuous features.
● The classical n-gram model works in terms of discrete units
o There is no inherent relationship between n-gram units.
● In contrast, word embeddings capture regularities and relationships between words.
Syntactic & Semantic Relationship
Regularities are observed as a constant offset vector between pairs of words sharing the same relationship.
Gender Relation
KING - QUEEN ≈ MAN - WOMAN
Singular/Plural Relation
KING - KINGS ≈ QUEEN - QUEENS
Other Relations:
● Language: France - French ≈ Spain - Spanish
● Past Tense: Go - Went ≈ Capture - Captured
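Written out as a worked equation, the gender offset above rearranges into the form used later for analogy queries (this is just a restatement of the same relation, not an additional result):

vec(KING) - vec(MAN) + vec(WOMAN) ≈ vec(QUEEN)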
Language Model (LM)
● Different models for estimating continuous representations of words:
○ Latent Semantic Analysis (LSA)
○ Latent Dirichlet Allocation (LDA)
○ Neural Network Language Model (NNLM)
Feed Forward NNLM
● Consists of input, projection, hidden and output layers.
● The N previous words are encoded using 1-of-V coding, where V is the size of the vocabulary. Ex: A = (1,0,...,0), B = (0,1,...,0), ..., Z = (0,0,...,1) in R^26
● The NNLM becomes computationally complex between the projection (P) and hidden (H) layers
○ For N = 10, size of P = 500-2000, size of H = 500-1000
○ The hidden layer is used to compute a probability distribution over all the words in the vocabulary V
● Hierarchical softmax comes to the rescue (see the complexity breakdown below).
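For reference, the per-training-example complexity of this architecture as given in Mikolov et al. (2013), writing D for the projection dimensionality of each word (so the projection layer has size N × D):

Q = N \times D + N \times D \times H + H \times V

The dominating term is H × V; hierarchical softmax reduces it to roughly H × log_2(V), after which most of the cost comes from the N × D × H term.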
Recurrent NNLM
● No projection layer; consists of input, hidden and output layers only.
● No need to specify the context length as in the feed-forward NNLM
● What is special in the RNN model?
○ A recurrent matrix that connects the hidden layer to itself.
○ Allows the network to form a short-term memory
■ Information from the past is represented by the hidden-layer state
● RNN-trained word vectors achieved state-of-the-art results on a relational similarity identification task.
RNN Model
Recurrent NNLM
w(t): Input word at time t
y(t): Output layer, produces a probability distribution over words
s(t): Hidden layer
U: Each column represents a word
● Four-gram neural net language model architecture (Bengio 2001)
● The RNN is trained with SGD and backpropagation to maximize the log likelihood.
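A sketch of the recurrence in the standard RNNLM formulation; the output weight matrix is written W_out here only to avoid a clash with the vocabulary size V:

s(t) = \sigma( U \, w(t) + W \, s(t-1) )
y(t) = \mathrm{softmax}( W_{out} \, s(t) )

where w(t) is the 1-of-V input vector, W is the recurrent matrix that carries the short-term memory, and σ is an element-wise sigmoid. The per-example complexity is Q = H × H + H × V, with the H × V term again reducible via hierarchical softmax.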
Bringing efficiency...
● The computational complexity of NNLMs is high.
● We can remove the hidden layer and speed up training by roughly 1000x:
○ Continuous bag-of-words model
○ Continuous skip-gram model
● The full softmax can be replaced by:
○ Hierarchical softmax (Morin and Bengio)
○ Hinge loss (Collobert and Weston)
○ Noise contrastive estimation (Mnih et al.)
Continuous Bag-of-Words Model (CBOW)
● The non-linear hidden layer is removed
● The projection layer is shared for all words (not just the projection matrix).
● All words get projected into the same position (their vectors are averaged).
● Naming reason: the order of words in the history does not influence the projection.
● Best performance is obtained with a log-linear classifier that takes four future and four history words at the input
○ Training criterion: correctly classify the current (middle) word
Predicts the current word based on the context (a minimal forward-pass sketch follows).
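A minimal forward-pass sketch of the CBOW idea: average the context word vectors, then score every vocabulary word with a log-linear classifier. The sizes, the random matrices W_in/W_out and the function name cbow_probs are illustrative placeholders, not the paper's implementation; the full softmax computed here is exactly what hierarchical softmax or negative sampling later replace.

import numpy as np

# Toy sizes and random weights; placeholders, not trained parameters.
V, D = 10000, 300                      # vocabulary size, embedding dimensionality
W_in = np.random.randn(V, D) * 0.01    # shared projection matrix (one row per word)
W_out = np.random.randn(D, V) * 0.01   # log-linear output weights

def cbow_probs(context_ids):
    """Average the context word vectors, then compute a softmax over the vocabulary."""
    h = W_in[context_ids].mean(axis=0)   # all context words project to one averaged position
    scores = h @ W_out                   # log-linear scores for every word
    e = np.exp(scores - scores.max())    # numerically stable full softmax
    return e / e.sum()

# Example: four history and four future word indices predict the middle word.
probs = cbow_probs([12, 47, 3, 981, 55, 7, 4021, 66])
print(int(probs.argmax()))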
Continuous Skip-gram Model
● Objective: maximize the classification of a word based on another word in the same sentence, i.e. maximize the average log probability over the training words
● Define p(w_{t+j} | w_t) using the softmax function (both expressions are written out below)
Predicts the surrounding words given the current word.
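The two expressions referenced above, as given in the paper: the objective averages the log probability of the words within a window of size c around each position t, and the softmax is taken over the whole vocabulary of W words:

\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\ j \ne 0} \log p(w_{t+j} \mid w_t)

p(w_O \mid w_I) = \frac{\exp({v'_{w_O}}^{\top} v_{w_I})}{\sum_{w=1}^{W} \exp({v'_{w}}^{\top} v_{w_I})}

where v_w and v'_w are the input and output vector representations of word w.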
Hierarchical Softmax for efficient computation
● This formulation is impractical because the cost of computing ∇ log p(w_O | w_I) is proportional to W, which is often large (10^5 - 10^7 terms).
● With hierarchical softmax, the cost is reduced to being proportional to log_2(W).
Hierarchical Softmax
● Uses a binary tree (a Huffman code) representation of the output layer, with the W words as its leaves.
o The inner nodes define a random walk that assigns probabilities to words.
● Instead of evaluating W output nodes, only about log_2(W) nodes are evaluated to obtain the probability distribution.
● Each word w can be reached by an appropriate path from the root of the tree
● n(w, j): the j-th node on the path from the root to w
● L(w): the length of this path
● n(w, 1) = root and n(w, L(w)) = w
● ch(n): an arbitrary fixed child of an inner node n
● [x] = 1 if x is true and [x] = -1 otherwise
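With this notation, hierarchical softmax defines p(w | w_I) as a product of sigmoids along the path from the root to w (the formula from the paper):

p(w \mid w_I) = \prod_{j=1}^{L(w)-1} \sigma\left( [\, n(w, j+1) = ch(n(w, j)) \,] \cdot {v'_{n(w,j)}}^{\top} v_{w_I} \right), \qquad \sigma(x) = \frac{1}{1 + e^{-x}}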
Negative Sampling
● Noise Contrastive Estimation (NCE)
o A good model should be able to differentiate data from noise by means of logistic regression.
o An alternative to the hierarchical softmax.
o Introduced by Gutmann and Hyvärinen and applied to language modeling by Mnih and Teh.
● NCE approximates the log probability of the softmax.
● Negative Sampling is defined by an objective that replaces log p(w_O | w_I) in the skip-gram objective (written out below).
● Task: distinguish the target word w_O from draws from the noise distribution P_n(w) using logistic regression, where there are k negative samples for each data sample.
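The Negative Sampling objective from the paper, which replaces every log p(w_O | w_I) term in the skip-gram objective:

\log \sigma({v'_{w_O}}^{\top} v_{w_I}) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma(-{v'_{w_i}}^{\top} v_{w_I}) \right]

In the paper, the unigram distribution raised to the 3/4 power worked best as the noise distribution P_n(w).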
Subsampling of Frequent Words
● The most frequent words provide less information than rare words.
o Co-occurrences of "France" and "Paris" are informative
o Co-occurrences of "France" and "the" are much less informative
● A simple subsampling approach counters this imbalance
o Each word w_i in the training set is discarded with a probability (given below), where f(w_i) is the frequency of word w_i and t is a chosen threshold, typically around 10^-5
● This aggressively subsamples words whose frequency is greater than t while preserving the ranking of the frequencies.
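The discard probability referenced above (from the paper):

P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}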
Automatic learning by the skip-gram model
● No supervised information about what a capital city means is provided.
● But the model is still capable of
o Automatically organizing concepts
o Learning implicit relationships
PCA projection of 100-dimensional skip-gram vectors
Analogical Reasoning Performance
● The analogical reasoning task was introduced by Mikolov et al. and is answered with vector arithmetic (see the sketch below)
o Syntactic analogies: "quick" : "quickly" :: "slow" : ? (answer: "slowly")
o Semantic analogies: "Germany" : "Berlin" :: "France" : ? (answer: "Paris")
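A minimal sketch of answering an analogy question with vector offsets. The emb dictionary of random 300-dimensional vectors is a placeholder for real trained skip-gram vectors; with random vectors the answer is meaningless, with real vectors this query tends to return "Paris".

import numpy as np

# Placeholder embeddings; real skip-gram vectors would be loaded here instead.
emb = {w: np.random.randn(300)
       for w in ["king", "queen", "man", "woman",
                 "Germany", "Berlin", "France", "Paris"]}

def analogy(a, b, c, emb):
    """Return d such that a : b :: c : d, i.e. the nearest neighbour of vec(b) - vec(a) + vec(c)."""
    target = emb[b] - emb[a] + emb[c]
    best, best_sim = None, -np.inf
    for w, v in emb.items():
        if w in (a, b, c):
            continue  # exclude the three query words themselves
        sim = v @ target / (np.linalg.norm(v) * np.linalg.norm(target))  # cosine similarity
        if sim > best_sim:
            best, best_sim = w, sim
    return best

print(analogy("Germany", "Berlin", "France", emb))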
Learning Phrases
● To learn phrase vectors:
o First find words that appear frequently together, and infrequently in other contexts.
o Replace them with unique tokens. Ex: "New York Times" -> New_York_Times
● Phrases are formed based on the unigram and bigram counts, using the score below; the discounting coefficient δ prevents too many phrases consisting of very infrequent words from being formed.
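The bigram score referenced above (from the paper); bigrams whose score is above a chosen threshold are joined into phrase tokens:

score(w_i, w_j) = \frac{count(w_i w_j) - \delta}{count(w_i) \times count(w_j)}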
Phrase Skip-gram Results
● Accuracies of the Skip-gram models on the phrase analogy dataset
o Using different hyperparameters
o Models trained on approximately one billion words from the news dataset
● Size of the training data matters.
o HS-Huffman (dimensionality = 1000) trained on 33 billion words reaches an accuracy of 72%
Additive compositionality
● It is possible to meaningfully combine words by element-wise addition of their vector representations. Ex: vec("Vietnam") + vec("capital") is close to vec("Hanoi").
○ A word vector represents the distribution of the contexts in which the word appears.
● The vector values are related logarithmically to the probabilities computed by the output layer.
○ The sum of two word vectors is therefore related to the product of the two context distributions.
Closest Entities
Closest entity search using two methods: negative sampling and hierarchical softmax.
Comments
● The reduction of computational complexity is impressive.
● Works with unsupervised/unlabelled data
● The vector representation can be extended to larger pieces of text
o Paragraph Vector (Le & Mikolov, 2014)
● Applicable to a lot of NLP tasks
o Tagging
o Named Entity Recognition
o Translation
o Paraphrasing
o Sentiment analysis