A Neural Probabilistic Language Model
2014-12-16 Keren Ye
CONTENTS
• N-gram Models
• Fighting the Curse of Dimensionality
• A Neural Probabilistic Language Model
• Continuous Bag of Words (Word2vec)
n-gram models
• Construct tables of conditional probabilities for the next word
• Combinations of the last n-1 words
$\hat{P}(w_t \mid w_1^{t-1}) \approx \hat{P}(w_t \mid w_{t-n+1}^{t-1})$
n-gram models
• e.g. “I like playing basketball”
– Unigram (1-gram): $\hat{P}(\text{basketball} \mid \text{I, like, playing}) \approx \hat{P}(\text{basketball})$
– Bigram (2-gram): $\hat{P}(\text{basketball} \mid \text{I, like, playing}) \approx \hat{P}(\text{basketball} \mid \text{playing})$
– Trigram (3-gram): $\hat{P}(\text{basketball} \mid \text{I, like, playing}) \approx \hat{P}(\text{basketball} \mid \text{like, playing})$
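As a concrete illustration (not from the original slides), here is a minimal Python sketch of how such conditional probability tables can be built by counting; the toy corpus and the unsmoothed maximum-likelihood estimates are assumptions made only for this example.

```python
from collections import Counter

# Toy corpus; in practice the counts come from a large training corpus.
corpus = "I like playing basketball".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))

def bigram_prob(prev, w):
    """Maximum-likelihood estimate of P(w | prev) from bigram counts."""
    return bigrams[(prev, w)] / unigrams[prev]

def trigram_prob(prev2, prev1, w):
    """Maximum-likelihood estimate of P(w | prev2, prev1) from trigram counts."""
    return trigrams[(prev2, prev1, w)] / bigrams[(prev2, prev1)]

# Bigram approximation:  P(basketball | I, like, playing) ~ P(basketball | playing)
print(bigram_prob("playing", "basketball"))
# Trigram approximation: P(basketball | I, like, playing) ~ P(basketball | like, playing)
print(trigram_prob("like", "playing", "basketball"))
```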
n-gram models
• Disadvantages
– It does not take into account contexts farther than 1 or 2 words
– It does not take into account the similarity between words
• e.g. “The cat is walking in the bedroom” (training corpus)
• “A dog was running in a room” (?)
n-gram models
• Disadvantages
– Curse of Dimensionality: the number of possible combinations of n words grows exponentially with n, so most word sequences are never observed in the training data
CONTENTS
• N-gram Models
• Fighting the Curse of Dimensionality
• A Neural Probabilistic Language Model
• Continuous Bag of Words (Word2vec)
Fighting the Curse of Dimensionality
• Associate with each word in the vocabulary a distributed word feature vector (a real-valued vector in $\mathbb{R}^m$)
• Express the joint probability function of word sequences in terms of the feature vectors of these words in the sequence
• Learn simultaneously the word feature vectors and the parameters of that probability function
Fighting the Curse of Dimensionality
• Word feature vectors
– Each word is associated with a point in a vector space
– The number of features (e.g. m = 30, 60 or 100 in the experiments) is much smaller than the size of the vocabulary (e.g. 200,000)
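A minimal sketch of what such a word feature matrix looks like; the toy vocabulary, m = 30 and the random initialization are illustrative assumptions (in the model, C is learned rather than fixed).

```python
import numpy as np

vocab = ["the", "cat", "is", "walking", "in", "bedroom", "a", "dog", "was", "running", "room"]
V = len(vocab)   # vocabulary size (around 200,000 in the paper's experiments)
m = 30           # number of features per word, much smaller than |V|

# C maps every word index to a point in R^m; it is learned jointly with the model.
C = np.random.randn(V, m) * 0.01
word2idx = {w: i for i, w in enumerate(vocab)}

cat_vector = C[word2idx["cat"]]   # the distributed feature vector of "cat"
print(cat_vector.shape)           # (30,)
```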
Fighting the Curse of Dimensionality
• Probability function
– In the experiments, a multi-layer neural network is used to predict the next word given the previous ones
– This function has parameters that can be iteratively tuned in order to maximize the log-likelihood of the training data
Fighting the Curse of Dimensionality
• Why does it work?
– If we knew that “dog” and “cat” played similar roles (semantically and syntactically), and similarly for (the, a), (bedroom, room), (is, was), (running, walking), we could naturally generalize from
• The cat is walking in the bedroom
– to
• A dog was running in a room
– and likewise to
• The cat is running in a room
• A dog is walking in a bedroom
• …
Fighting the Curse of Dimensionality
• NNLM
– Neural Network Language Model
CONTENTS
• N-gram Models
• Fighting the Curse of Dimensionality
• A Neural Probabilistic Language Model
• Continuous Bag of Words (Word2vec)
A Neural Probabilistic Language Model
• Denotations
– The training set is a sequence $w_1, \dots, w_T$ of words $w_t \in V$, where the vocabulary $V$ is a large but finite set
– The objective is to learn a good model $f(w_t, \dots, w_{t-n+1}) = \hat{P}(w_t \mid w_1^{t-1})$, in the sense that it gives high out-of-sample likelihood
– The only constraint on the model is that, for any choice of $w_1^{t-1}$, $\sum_{i=1}^{|V|} f(i, w_{t-1}, \dots, w_{t-n+1}) = 1$ (with $f > 0$)
A Neural Probabilistic Language Model
• Objective function
– Training is achieved by looking for $\theta$ that maximizes the penalized log-likelihood of the training corpus, where $R(\theta)$ is a regularization term:
$L = \frac{1}{T} \sum_t \log f(w_t, w_{t-1}, \dots, w_{t-n+1}; \theta) + R(\theta)$
A Neural Probabilistic Language Model
• Model
– We decompose the function $f(w_t, \dots, w_{t-n+1}) = \hat{P}(w_t \mid w_1^{t-1})$ in two parts
• A mapping $C$ from any element $i$ of $V$ to a real vector $C(i) \in \mathbb{R}^m$. It represents the distributed feature vector associated with each word in the vocabulary
• The probability function over words, expressed with $C$: a function $g$ maps an input sequence of feature vectors for words in context, $(C(w_{t-n+1}), \dots, C(w_{t-1}))$, to a conditional probability distribution over words in $V$ for the next word. The output of $g$ is a vector whose $i$-th element estimates the probability $\hat{P}(w_t = i \mid w_1^{t-1})$:
$f(i, w_{t-1}, \dots, w_{t-n+1}) = g(i, C(w_{t-1}), \dots, C(w_{t-n+1}))$
A Neural Probabilistic Language Model
• Model details (two hidden layers)
– The shared word features layer C, which has no non-linearity (it would not add anything useful)
– The ordinary hyperbolic tangent hidden layer
A Neural Probabilistic Language Model
• Model details (formal description)
– The neural network computes the following function, with a softmax output layer, which guarantees positive probabilities summing to 1:
$\hat{P}(w_t \mid w_{t-1}, \dots, w_{t-n+1}) = \dfrac{e^{y_{w_t}}}{\sum_i e^{y_i}}$
A Neural Probabilistic Language Model
• Model details (formal description)
– The $y_i$ are the unnormalized log-probabilities for each output word $i$, computed as follows, with parameters $b$, $W$, $U$, $d$ and $H$:
$y = b + Wx + U \tanh(d + Hx)$
• where the hyperbolic tangent tanh is applied element by element, and $W$ is optionally zero (no direct connections)
• and $x$ is the word features layer activation vector, which is the concatenation of the input word feature vectors from the matrix $C$:
$x = (C(w_{t-1}), \dots, C(w_{t-n+1}))$
A Neural Probabilistic Language Model
Parameters  Brief  Dimensions
b  Output biases  |V|
d  Hidden layer biases  h
W  Word features to output weights (optionally 0: no direct connections)  |V| x (n-1)m matrix
U  Hidden-to-output weights  |V| x h matrix
H  Word features to hidden weights  h x (n-1)m matrix
C  Word features  |V| x m matrix
$y = b + Wx + U \tanh(d + Hx)$
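Putting the table and the formulas together, below is a minimal numpy sketch of the forward pass; the direct connections are dropped (W = 0), and the concrete dimensions and random initialization are illustrative assumptions, not values from the paper.

```python
import numpy as np

V, m, h, n = 1000, 30, 50, 4      # vocab size, feature size, hidden units, n-gram order

C = np.random.randn(V, m) * 0.01            # word features,            |V| x m
H = np.random.randn(h, (n - 1) * m) * 0.01  # word features to hidden,  h x (n-1)m
d = np.zeros(h)                             # hidden layer biases,      h
U = np.random.randn(V, h) * 0.01            # hidden-to-output weights, |V| x h
b = np.zeros(V)                             # output biases,            |V|

def next_word_distribution(context):
    """context: indices of the n-1 previous words, most recent first."""
    x = np.concatenate([C[w] for w in context])  # x = (C(w_{t-1}), ..., C(w_{t-n+1}))
    y = b + U @ np.tanh(d + H @ x)               # unnormalized log-probabilities (W = 0)
    e = np.exp(y - y.max())                      # softmax, numerically stabilized
    return e / e.sum()                           # P(w_t = i | context) for every i in V

p = next_word_distribution([12, 7, 42])          # n-1 = 3 context word indices
print(p.shape, p.sum())                          # (1000,) 1.0
```

Training then maximizes the log of the probability this forward pass assigns to the actual next word, as described on the stochastic gradient ascent slide below.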
A Neural Probabilistic Language Model
• Stochastic gradient ascent
– After presenting the t-th word of the training corpus, perform a gradient step ($\varepsilon$ is the learning rate): $\theta \leftarrow \theta + \varepsilon \, \dfrac{\partial \log \hat{P}(w_t \mid w_{t-1}, \dots, w_{t-n+1})}{\partial \theta}$, where $\theta = (b, d, W, U, H, C)$
– Note that a large fraction of the parameters need not be updated or visited after each example: the word features C(j) of all words j that do not occur in the input window
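A small sketch of that sparse update on C (the dense parameters b, d, U, H get ordinary gradient steps and are omitted here); the learning rate value, the helper name sgd_step_on_C and the grad_x argument are hypothetical, introduced only for illustration.

```python
import numpy as np

def sgd_step_on_C(C, context, grad_x, eps=0.01):
    """Gradient-ascent step on the word feature matrix C for one example.

    grad_x is the gradient of log P(w_t | context) with respect to the
    concatenated input x, shape ((n-1)*m,). Only the rows of C belonging
    to words in the context window are visited; the other |V|-(n-1) rows
    are untouched.
    """
    m = C.shape[1]
    for pos, w in enumerate(context):
        C[w] += eps * grad_x[pos * m:(pos + 1) * m]
    return C

# Toy usage with made-up shapes: only the 3 context rows change.
C = np.zeros((1000, 30))
sgd_step_on_C(C, context=[12, 7, 42], grad_x=np.ones(3 * 30))
print(np.count_nonzero(C.sum(axis=1)))   # 3
```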
A Neural Probabilistic Language Model
• Parallel Implementation
– Data-Parallel Processing
• Relied on synchronization commands – slow
• No locks – noise seems to be very small and did not apparently slow down training
– Parameter-parallel Processing
• Parallelize across the parameters
Continuous Bag of Words (Word2vec)
• Bag of words
– A traditional way of fighting the curse of dimensionality: the context words are treated as independent given the predicted word
$P(\text{basketball} \mid \text{I, like, playing}) = \dfrac{P(\text{basketball}) \, P(\text{I} \mid \text{basketball}) \, P(\text{like} \mid \text{basketball}) \, P(\text{playing} \mid \text{basketball})}{P(\text{I, like, playing})}$
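A minimal sketch of evaluating that factorization; every probability value below is a made-up illustrative assumption, not an estimate from a real corpus.

```python
# Bag-of-words factorization: context words are treated as
# conditionally independent given the predicted word.
p_word = {"basketball": 0.01}                  # P(basketball)
p_ctx_given_word = {                           # P(context word | basketball)
    ("I", "basketball"): 0.2,
    ("like", "basketball"): 0.1,
    ("playing", "basketball"): 0.3,
}
p_context = 0.001                              # P(I, like, playing), the normalizer

context = ["I", "like", "playing"]
p = p_word["basketball"]
for c in context:
    p *= p_ctx_given_word[(c, "basketball")]
p /= p_context
print(p)   # estimate of P(basketball | I, like, playing) = 0.06
```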
CONTENTS
• N-gram Models
• Fighting the Curse of Dimensionality
• A Neural Probabilistic Language Model
• Continuous Bag of Words (Word2vec)
Continuous Bag of Words (Word2vec)
• Continuous Bag of Words
Continuous Bag of Words (Word2vec)
• Differences from the NNLM (a sketch follows below)
– Projection layer
• Sum vs concatenate
• Order of words is lost
– Hidden layer
• tanh vs none (no non-linear hidden layer)
– Hierarchical softmax instead of full softmax
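Below is a minimal sketch of the CBOW forward pass reflecting those differences, with a plain softmax standing in for word2vec's hierarchical softmax; the dimensions and the random initialization are illustrative assumptions.

```python
import numpy as np

V, m = 1000, 100                        # vocabulary size, embedding size

C_in = np.random.randn(V, m) * 0.01     # input (projection) word vectors
W_out = np.random.randn(V, m) * 0.01    # output word vectors

def cbow_distribution(context):
    """Predict the center word from the indices of the surrounding words."""
    # Projection layer: context vectors are summed (or averaged), not
    # concatenated, so word order is lost; there is no tanh hidden layer.
    h = C_in[context].sum(axis=0)                # shape (m,)
    y = W_out @ h                                # a score for every word in V
    e = np.exp(y - y.max())
    return e / e.sum()        # word2vec itself would use hierarchical softmax here

p = cbow_distribution([3, 17, 256, 4])           # four context word indices
print(p.shape, p.argmax())
```

Dropping the non-linear hidden layer and replacing the full softmax with hierarchical softmax are what make CBOW much cheaper to train than the NNLM.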
Thanks
Q&A