A Neural Probabilistic Language Model
2014-12-16 Keren Ye
CONTENTS
• N-gram Models
• Fighting the Curse of Dimensionality
• A Neural Probabilistic Language Model
• Continuous Bag of Words (Word2vec)
n-gram models
• Construct tables of conditional probabilities for the next word
• Combinations of the last n-1 words
$\hat{P}(w_t \mid w_1^{t-1}) \approx \hat{P}(w_t \mid w_{t-n+1}^{t-1})$
n-gram models
• e.g. “I like playing basketball”
– Unigram (1-gram): $\hat{P}(\text{basketball} \mid \text{I, like, playing}) \approx \hat{P}(\text{basketball})$
– Bigram (2-gram): $\hat{P}(\text{basketball} \mid \text{I, like, playing}) \approx \hat{P}(\text{basketball} \mid \text{playing})$
– Trigram (3-gram): $\hat{P}(\text{basketball} \mid \text{I, like, playing}) \approx \hat{P}(\text{basketball} \mid \text{like, playing})$
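As a concrete illustration (not from the original slides), here is a minimal Python sketch of how such conditional probability tables can be built by counting; the toy corpus and the unsmoothed maximum-likelihood estimates are assumptions made only for this example.

```python
from collections import Counter

# Toy corpus; in practice the counts come from a large training corpus.
corpus = "I like playing basketball".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))

def bigram_prob(prev, w):
    """Maximum-likelihood estimate of P(w | prev) from bigram counts."""
    return bigrams[(prev, w)] / unigrams[prev]

def trigram_prob(prev2, prev1, w):
    """Maximum-likelihood estimate of P(w | prev2, prev1) from trigram counts."""
    return trigrams[(prev2, prev1, w)] / bigrams[(prev2, prev1)]

# Bigram approximation:  P(basketball | I, like, playing) ~ P(basketball | playing)
print(bigram_prob("playing", "basketball"))
# Trigram approximation: P(basketball | I, like, playing) ~ P(basketball | like, playing)
print(trigram_prob("like", "playing", "basketball"))
```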
n-gram models
• Disadvantages
– It does not take into account contexts farther than 1 or 2 words
– It does not take into account the similarity between words
• e.g. “The cat is walking in the bedroom” (training corpus)
• “A dog was running in a room” (?)
n-gram models
• Disadvantages
– Curse of Dimensionality: the number of possible combinations of n words grows exponentially with n, so most word sequences are never observed in the training data
CONTENTS
• N-gram Models
• Fighting the Curse of Dimensionality
• A Neural Probabilistic Language Model
• Continuous Bag of Words (Word2vec)
Fighting the Curse of Dimensionality
• Associate with each word in the vocabulary a distributed word feature vector (a real-valued vector in $\mathbb{R}^m$)
• Express the joint probability function of word sequences in terms of the feature vectors of these words in the sequence
• Learn simultaneously the word feature vectors and the parameters of that probability function
Fighting the Curse of Dimensionality
• Word feature vectors
– Each word is associated with a point in a vector space
– The number of features (e.g. m = 30, 60 or 100 in the experiments) is much smaller than the size of the vocabulary (e.g. 200,000)
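A minimal sketch of what such a word feature matrix looks like; the toy vocabulary, m = 30 and the random initialization are illustrative assumptions (in the model, C is learned rather than fixed).

```python
import numpy as np

vocab = ["the", "cat", "is", "walking", "in", "bedroom", "a", "dog", "was", "running", "room"]
V = len(vocab)   # vocabulary size (around 200,000 in the paper's experiments)
m = 30           # number of features per word, much smaller than |V|

# C maps every word index to a point in R^m; it is learned jointly with the model.
C = np.random.randn(V, m) * 0.01
word2idx = {w: i for i, w in enumerate(vocab)}

cat_vector = C[word2idx["cat"]]   # the distributed feature vector of "cat"
print(cat_vector.shape)           # (30,)
```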
Fighting the Curse of Dimensionality
• Probability function
– In the experiments, a multi-layer neural network is used to predict the next word given the previous ones
– This function has parameters that can be iteratively tuned in order to maximize the log-likelihood of the training data
Fighting the Curse of Dimensionality
• Why does it work?
– If we knew that “dog” and “cat” played similar roles (semantically and syntactically), and similarly for (the, a), (bedroom, room), (is, was), (running, walking), we could naturally generalize from
• The cat is walking in the bedroom
– to
• A dog was running in a room
– and likewise to
• The cat is running in a room
• A dog is walking in a bedroom
• …
Fighting the Curse of Dimensionality
• NNLM
– Neural Network Language Model
CONTENTS
• N-gram Models
• Fighting the Curse of Dimensionality
• A Neural Probabilistic Language Model
• Continuous Bag of Words (Word2vec)
A Neural Probabilistic Language Model
• Denotations
– The training set is a sequence $w_1, \dots, w_T$ of words $w_t \in V$, where the vocabulary $V$ is a large but finite set
– The objective is to learn a good model $f(w_t, \dots, w_{t-n+1}) = \hat{P}(w_t \mid w_1^{t-1})$, in the sense that it gives high out-of-sample likelihood
– The only constraint on the model is that, for any choice of $w_1^{t-1}$, $\sum_{i=1}^{|V|} f(i, w_{t-1}, \dots, w_{t-n+1}) = 1$ (with $f > 0$)
A Neural Probabilistic Language Model
• Objective function
– Training is achieved by looking for $\theta$ that maximizes the penalized log-likelihood of the training corpus, where $R(\theta)$ is a regularization term:
$L = \frac{1}{T} \sum_t \log f(w_t, w_{t-1}, \dots, w_{t-n+1}; \theta) + R(\theta)$
A Neural Probabilistic Language Model
• Model
– We decompose the function $f(w_t, \dots, w_{t-n+1}) = \hat{P}(w_t \mid w_1^{t-1})$ in two parts
• A mapping $C$ from any element $i$ of $V$ to a real vector $C(i) \in \mathbb{R}^m$. It represents the distributed feature vector associated with each word in the vocabulary
• The probability function over words, expressed with $C$: a function $g$ maps an input sequence of feature vectors for words in context, $(C(w_{t-n+1}), \dots, C(w_{t-1}))$, to a conditional probability distribution over words in $V$ for the next word. The output of $g$ is a vector whose $i$-th element estimates the probability $\hat{P}(w_t = i \mid w_1^{t-1})$:
$f(i, w_{t-1}, \dots, w_{t-n+1}) = g(i, C(w_{t-1}), \dots, C(w_{t-n+1}))$
A Neural Probabilistic Language Model
• Model details (two hidden layers)
– The shared word features layer C, which has no non-linearity (it would not add anything useful)
– The ordinary hyperbolic tangent hidden layer
A Neural Probabilistic Language Model
• Model details (formal description)
– The neural network computes the following function, with a softmax output layer, which guarantees positive probabilities summing to 1:
$\hat{P}(w_t \mid w_{t-1}, \dots, w_{t-n+1}) = \dfrac{e^{y_{w_t}}}{\sum_i e^{y_i}}$
A Neural Probabilistic Language Model
• Model details (formal description)
– The $y_i$ are the unnormalized log-probabilities for each output word $i$, computed as follows, with parameters $b$, $W$, $U$, $d$ and $H$:
$y = b + Wx + U \tanh(d + Hx)$
• where the hyperbolic tangent tanh is applied element by element, and $W$ is optionally zero (no direct connections)
• and $x$ is the word features layer activation vector, which is the concatenation of the input word feature vectors from the matrix $C$:
$x = (C(w_{t-1}), \dots, C(w_{t-n+1}))$
A Neural Probabilistic Language Model
Parameters  Brief  Dimensions
b  Output biases  |V|
d  Hidden layer biases  h
W  Word features to output weights (optionally 0: no direct connections)  |V| x (n-1)m matrix
U  Hidden-to-output weights  |V| x h matrix
H  Word features to hidden weights  h x (n-1)m matrix
C  Word features  |V| x m matrix
$y = b + Wx + U \tanh(d + Hx)$
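Putting the table and the formulas together, below is a minimal numpy sketch of the forward pass; the direct connections are dropped (W = 0), and the concrete dimensions and random initialization are illustrative assumptions, not values from the paper.

```python
import numpy as np

V, m, h, n = 1000, 30, 50, 4      # vocab size, feature size, hidden units, n-gram order

C = np.random.randn(V, m) * 0.01            # word features,            |V| x m
H = np.random.randn(h, (n - 1) * m) * 0.01  # word features to hidden,  h x (n-1)m
d = np.zeros(h)                             # hidden layer biases,      h
U = np.random.randn(V, h) * 0.01            # hidden-to-output weights, |V| x h
b = np.zeros(V)                             # output biases,            |V|

def next_word_distribution(context):
    """context: indices of the n-1 previous words, most recent first."""
    x = np.concatenate([C[w] for w in context])  # x = (C(w_{t-1}), ..., C(w_{t-n+1}))
    y = b + U @ np.tanh(d + H @ x)               # unnormalized log-probabilities (W = 0)
    e = np.exp(y - y.max())                      # softmax, numerically stabilized
    return e / e.sum()                           # P(w_t = i | context) for every i in V

p = next_word_distribution([12, 7, 42])          # n-1 = 3 context word indices
print(p.shape, p.sum())                          # (1000,) 1.0
```

Training then maximizes the log of the probability this forward pass assigns to the actual next word, as described on the stochastic gradient ascent slide below.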
A Neural Probabilistic Language Model
• Stochastic gradient ascent
– After presenting the t-th word of the training corpus, perform a gradient step ($\varepsilon$ is the learning rate): $\theta \leftarrow \theta + \varepsilon \, \dfrac{\partial \log \hat{P}(w_t \mid w_{t-1}, \dots, w_{t-n+1})}{\partial \theta}$, where $\theta = (b, d, W, U, H, C)$
– Note that a large fraction of the parameters need not be updated or visited after each example: the word features C(j) of all words j that do not occur in the input window
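A small sketch of that sparse update on C (the dense parameters b, d, U, H get ordinary gradient steps and are omitted here); the learning rate value, the helper name sgd_step_on_C and the grad_x argument are hypothetical, introduced only for illustration.

```python
import numpy as np

def sgd_step_on_C(C, context, grad_x, eps=0.01):
    """Gradient-ascent step on the word feature matrix C for one example.

    grad_x is the gradient of log P(w_t | context) with respect to the
    concatenated input x, shape ((n-1)*m,). Only the rows of C belonging
    to words in the context window are visited; the other |V|-(n-1) rows
    are untouched.
    """
    m = C.shape[1]
    for pos, w in enumerate(context):
        C[w] += eps * grad_x[pos * m:(pos + 1) * m]
    return C

# Toy usage with made-up shapes: only the 3 context rows change.
C = np.zeros((1000, 30))
sgd_step_on_C(C, context=[12, 7, 42], grad_x=np.ones(3 * 30))
print(np.count_nonzero(C.sum(axis=1)))   # 3
```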
A Neural Probabilistic Language Model
• Parallel Implementation
– Data-Parallel Processing
• Relied on synchronization commands – slow
• No locks – noise seems to be very small and did not apparently slow down training
– Parameter-parallel Processing
• Parallelize across the parameters
Continuous Bag of Words (Word2vec)
• Bag of words
– A traditional way of fighting the curse of dimensionality: the context words are treated as independent given the predicted word
$P(\text{basketball} \mid \text{I, like, playing}) = \dfrac{P(\text{basketball}) \, P(\text{I} \mid \text{basketball}) \, P(\text{like} \mid \text{basketball}) \, P(\text{playing} \mid \text{basketball})}{P(\text{I, like, playing})}$
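A minimal sketch of evaluating that factorization; every probability value below is a made-up illustrative assumption, not an estimate from a real corpus.

```python
# Bag-of-words factorization: context words are treated as
# conditionally independent given the predicted word.
p_word = {"basketball": 0.01}                  # P(basketball)
p_ctx_given_word = {                           # P(context word | basketball)
    ("I", "basketball"): 0.2,
    ("like", "basketball"): 0.1,
    ("playing", "basketball"): 0.3,
}
p_context = 0.001                              # P(I, like, playing), the normalizer

context = ["I", "like", "playing"]
p = p_word["basketball"]
for c in context:
    p *= p_ctx_given_word[(c, "basketball")]
p /= p_context
print(p)   # estimate of P(basketball | I, like, playing) = 0.06
```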
CONTENTS
• N-gram Models
• Fighting the Curse of Dimensionality
• A Neural Probabilistic Language Model
• Continuous Bag of Words (Word2vec)
Continuous Bag of Words (Word2vec)
• Continuous Bag of Words
Continuous Bag of Words (Word2vec)
• Differences from the NNLM (a sketch follows below)
– Projection layer
• Sum vs concatenate
• Order of words is lost
– Hidden layer
• tanh vs none (no non-linear hidden layer)
– Hierarchical softmax instead of full softmax
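Below is a minimal sketch of the CBOW forward pass reflecting those differences, with a plain softmax standing in for word2vec's hierarchical softmax; the dimensions and the random initialization are illustrative assumptions.

```python
import numpy as np

V, m = 1000, 100                        # vocabulary size, embedding size

C_in = np.random.randn(V, m) * 0.01     # input (projection) word vectors
W_out = np.random.randn(V, m) * 0.01    # output word vectors

def cbow_distribution(context):
    """Predict the center word from the indices of the surrounding words."""
    # Projection layer: context vectors are summed (or averaged), not
    # concatenated, so word order is lost; there is no tanh hidden layer.
    h = C_in[context].sum(axis=0)                # shape (m,)
    y = W_out @ h                                # a score for every word in V
    e = np.exp(y - y.max())
    return e / e.sum()        # word2vec itself would use hierarchical softmax here

p = cbow_distribution([3, 17, 256, 4])           # four context word indices
print(p.shape, p.argmax())
```

Dropping the non-linear hidden layer and replacing the full softmax with hierarchical softmax are what make CBOW much cheaper to train than the NNLM.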
Thanks
Q&A