Word2vec slide(lab seminar)

Page 1: Word2vec slide(lab seminar)

Word2vec from scratch

11/10/2015, Jinpyo Lee

KAIST

Page 2: Word2vec slide(lab seminar)

Contents

• Introduction
• Previous Methods for Representing Words
• Word2Vec
  • Extensions of the skip-gram model / Learning Phrases / Additive Compositionality & Evaluation
• Conclusion
• Demo
• Discussions
• References

Page 3: Word2vec slide(lab seminar)

Introduction

• Examples of NLP tasks
• EASY
  • Spell Checking
  • Keyword Search (Ctrl+F)
  • Finding Synonyms

• MEDIUM
  • Parsing information from documents, the web, etc.

• HARD
  • Machine Translation (e.g. translate Korean to English)
  • Semantic Analysis (e.g. What is the meaning of this query?)
  • Co-reference (e.g. What does "it" refer to in this sentence?)
  • Question Answering (e.g. IBM Watson)

Page 4: Word2vec slide(lab seminar)

Introduction

• BUT, the most important thing is

how we represent words as input for all the NLP tasks.

Page 5: Word2vec slide(lab seminar)

Introduction

• BUT, the most important thing is

how we represent the meaning of words as input for all the NLP tasks.

Page 6: Word2vec slide(lab seminar)

• At first, most NLP treated words as ATOMIC symbols

• They needed a notion of similarity & difference

• So: WordNet, a taxonomy with hypernym (is-a) relationships and synonym sets

Simple example of WordNet showing synonyms and antonyms

Prev. Methods for Representing Words - Discrete Representation

Page 7: Word2vec slide(lab seminar)

• COOL! (see also: Semantic Web)
• Great resource, but missing nuances

Expert == Good? Usually? Probably NO!

* Synonym set of "good" using the nltk library (CS224d lecture note)

How about new words?: wicked, ace, wizard, genius, ninja

Prev. Methods for Representing Words - Discrete Representation

Page 8: Word2vec slide(lab seminar)

• COOL! (see also: Semantic Web)
• Great resource, but missing nuances

* Synonym set of "good" using the nltk library (CS224d lecture note)

Disadvantages
• Hard to keep up to date
• Requires human labor
• Subjective
• Hard to compute accurate word similarity

Prev. Methods for Representing Words - Discrete Representation

Page 9: Word2vec slide(lab seminar)

• Another problem of the discrete representation
• Cannot give similarity
• Too sparse

e.g. Horse = [ 0 0 0 0 1 0 0 0 0 0 0 0 0 0 ]

Zebra = [ 0 0 0 0 0 0 0 0 0 0 0 1 0 0 ]

"one-hot" representation: the typical, simple representation. All 0s with a single 1.

Horse ∩ Zebra

= [ 0 0 0 0 1 0 0 0 0 0 0 0 0 0 ] ∩ [ 0 0 0 0 0 0 0 0 0 0 0 1 0 0 ]

= 0 (nothing) (But we know both are mammals)

Prev. Methods for Representing Words - Discrete Representation
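A minimal sketch of the point above, using a hypothetical toy vocabulary with arbitrary indices: the overlap (dot product) of any two distinct one-hot vectors is always zero, so the representation carries no notion of similarity.

```python
import numpy as np

# Hypothetical toy vocabulary; the indices are arbitrary, for illustration only.
vocab = ["horse", "zebra", "the", "is", "time"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return the one-hot vector for a word: all 0s with a single 1."""
    v = np.zeros(len(vocab))
    v[word_to_idx[word]] = 1.0
    return v

horse, zebra = one_hot("horse"), one_hot("zebra")
print(horse @ zebra)   # 0.0 -> "horse" and "zebra" look completely unrelated
print(horse @ horse)   # 1.0 -> a word is only "similar" to itself
```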

Page 10: Word2vec slide(lab seminar)

• Use neighbors to represent words! (Co-occurrence)
• Conjecture: words that are related will often appear in the same documents.

A window allows capturing both syntactic and semantic info.

e.g. corpus:
"I like deep learning."
"I like NLP."
"I enjoy baseball."

* Co-occurrence matrix with window size = 1 (CS224d lecture note)

("like" co-occurs next to "I" 2 times)

Prev. Methods for Representing Words
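A short sketch (assuming simple whitespace tokenization) of building the window-1 co-occurrence matrix for the three-sentence corpus above:

```python
import numpy as np

# The toy corpus from the CS224d example above.
corpus = ["I like deep learning .", "I like NLP .", "I enjoy baseball ."]
window = 1

vocab = sorted({tok for sent in corpus for tok in sent.split()})
idx = {w: i for i, w in enumerate(vocab)}

# Count how often each pair of words appears within the window of each other.
X = np.zeros((len(vocab), len(vocab)), dtype=int)
for sent in corpus:
    toks = sent.split()
    for i, center in enumerate(toks):
        for j in range(max(0, i - window), min(len(toks), i + window + 1)):
            if j != i:
                X[idx[center], idx[toks[j]]] += 1

print(vocab)
print(X)
print(X[idx["I"], idx["like"]])   # 2: "I" and "like" co-occur twice
```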

Page 11: Word2vec slide(lab seminar)

• Use this matrix for word embedding (feat. SVD)
• Apply Singular Value Decomposition

For simplicity, SVD: X (co-occurrence matrix) = U·S·Vᵀ

(details can be found in any linear algebra textbook)

• Select the first k columns of U as k-dimensional word vectors

Prev. Methods for Representing Words
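A sketch of the SVD step; a tiny co-occurrence matrix matching the corpus above is rebuilt inline so the snippet runs standalone, and the first k columns of U serve as the k-dimensional word vectors:

```python
import numpy as np

# Window-1 co-occurrence counts for the toy corpus above (symmetric).
vocab = [".", "I", "NLP", "baseball", "deep", "enjoy", "learning", "like"]
idx = {w: i for i, w in enumerate(vocab)}
pairs = [("I", "like"), ("like", "deep"), ("deep", "learning"), ("learning", "."),
         ("I", "like"), ("like", "NLP"), ("NLP", "."),
         ("I", "enjoy"), ("enjoy", "baseball"), ("baseball", ".")]
X = np.zeros((len(vocab), len(vocab)))
for a, b in pairs:
    X[idx[a], idx[b]] += 1
    X[idx[b], idx[a]] += 1

# X = U * S * V^T; keep the first k columns of U as k-dimensional word vectors.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
word_vectors = U[:, :k]
for w in vocab:
    print(f"{w:>10}: {word_vectors[idx[w]]}")
```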

Page 12: Word2vec slide(lab seminar)

• Results of the SVD-based model

(word-vector scatter plots for K = 2 and K = 3)

Prev. Methods for Representing Words

Page 13: Word2vec slide(lab seminar)

• Disadvantages
• The co-occurrence matrix is extremely sparse

• Very high dimensional

• Quadratic cost to train (i.e. to perform the SVD)

• Needs hacks for the imbalance in word frequency (e.g. "it", "the", "has", etc.)

• Some solutions exist for these problems, but they are not intrinsic

Prev. Methods for Representing Words

Page 14: Word2vec slide(lab seminar)

Contents

• Introduction
• Previous Methods for Representing Words
• Word2Vec
  • Extensions of the skip-gram model / Learning Phrases / Additive Compositionality & Evaluation
• Conclusion
• Demo
• Discussions
• References

Page 15: Word2vec slide(lab seminar)

Word2vec (related papers)

• Then how? Directly learn (iteratively) low-dimensional word vectors!

Go back to 1986:
• Learning representations by back-propagating errors (Rumelhart et al., 1986)

• A neural probabilistic language model (Bengio et al., 2003)

• NLP from Scratch (Collobert & Weston, 2008)
• Word2Vec (Mikolov et al., 2013)

  • Efficient Estimation of Word Representations in Vector Space

  • Distributed Representations of Words and Phrases and their Compositionality

Page 16: Word2vec slide(lab seminar)

Efficient Estimation of Word Representations in Vector Space

• Introduced the initial architecture of word2vec (2013)

• Two new models: Continuous Bag-of-Words (CBOW) and Skip-gram

• Empirically shows that these word models give better syntactic and semantic representations than other models
• Comparison of the two models:

  • The Skip-gram model works well on semantic tasks, but training is slower.
  • The CBOW model works well on syntactic tasks, and training is faster.

(P)Review

Page 17: Word2vec slide(lab seminar)

Word2vec (profile)

• Distributed Representations of Words and Phrases and their Compositionality

• NIPS 2013 (submitted on 16 Oct 2013)

• Tomas Mikolov (Facebook, 2014~), et al.

• Includes additional work on top of "Efficient Estimation of Word Representations in Vector Space".

Page 18: Word2vec slide(lab seminar)

Word2vec (Contents)

• This paper includes:
• Extensions of the skip-gram model (fast & accurate)

  • Methods

    • Hierarchical softmax

    • NEG (negative sampling)

    • Subsampling

• The ability to learn phrases
• Finding additive compositionality
• Conclusion

Page 19: Word2vec slide(lab seminar)

• Skip-gram model
• The objective of the skip-gram model is to "find word representations useful for predicting context words in a sentence."

• Softmax function

• …

Extension of Skip-Gram

BUT, without understanding the original model, we are going to... fall... asleep...

Page 20: Word2vec slide(lab seminar)

Example


Page 21: Word2vec slide(lab seminar)

CBOW (Original)

• Continuous Bag-of-Words model

• Idea: using the context words, we can predict the center word
  i.e. Probability( "It is ( ? ) to finish" → "time" )

• Represent each word as a distributed, low-dimensional vector

• Goal: train weight matrices (W) that satisfy the objective below

• Loss function (using the cross-entropy method):

E = −log p(w_t | w_{t−C}, …, w_{t+C})    (w_{t−C} … w_{t+C}: context words, window_size = 2)

* Softmax(): maps a K-dimensional vector x ∈ ℝᴷ to a K-dimensional vector with entries in (0, 1)

Page 22: Word2vec slide(lab seminar)

CBOW (Original)

• Continuous Bag-of-Words model

• Input: "one-hot" word vectors

• Remove the nonlinear hidden layer

• Back-propagate the error from the output layer to the weight matrices (adjust the Ws)

[Figure: CBOW architecture. The one-hot context vectors for "It", "is", "to", "finish" are each multiplied by W_in ([N×V]·[V×1] → [N×1]) and averaged into the hidden vector h; W_out ([V×N]·[N×1] → [V×1]) maps h to the predicted distribution, which is compared against the true center word "time" and the error is back-propagated to update W_in and W_out. W_in, W_out: input and output weight matrices; N: embedding dimension; V: vocabulary size; h: hidden vector, the average of W_in·x over the context words.]
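A minimal numpy sketch of one CBOW training step matching the figure, with toy sizes and random initialization; this is an illustration under those assumptions, not the paper's optimized C implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
V, N, lr = 5, 3, 0.1          # toy vocabulary size, embedding dim, learning rate

W_in = rng.normal(scale=0.1, size=(N, V))    # input weights  [N x V]
W_out = rng.normal(scale=0.1, size=(V, N))   # output weights [V x N]

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# One example: context word indices predict the center word index.
context_ids = [0, 1, 3, 4]    # e.g. "It", "is", "to", "finish"
center_id = 2                 # e.g. "time"

# Forward: h is the average of the context columns of W_in ([N x 1]),
# and W_out @ h gives a score for every vocabulary word ([V x 1]).
h = W_in[:, context_ids].mean(axis=1)
y = softmax(W_out @ h)
loss = -np.log(y[center_id])  # E = -log p(w_t | context)

# Backward: cross-entropy gradient, propagated to both weight matrices.
e = y.copy()
e[center_id] -= 1.0           # dE/d(score), one entry per vocabulary word
grad_h = W_out.T @ e          # error reaching the hidden layer
W_out -= lr * np.outer(e, h)
for c in context_ids:         # h was an average, so the gradient is split evenly
    W_in[:, c] -= lr * grad_h / len(context_ids)

print(f"loss = {loss:.3f}")
```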

Page 23: Word2vec slide(lab seminar)

• Skip-gram model
• Idea: with the center word, we can predict the context words

• The mirror of CBOW (vice versa)
  i.e. Probability( "time" → "It is ( ? ) to finish" )

• Loss function:

Skip-Gram (Original)

E = −log p(w_{t−C}, …, w_{t+C} | w_t)

[Figure: Skip-gram architecture, the mirror image of CBOW. The one-hot center word "time" is mapped through W_in ([N×V]·[V×1] → [N×1]) to the hidden vector h, and W_out ([V×N]·[N×1] → [V×1]) produces a prediction y_i for each context position ("It", "is", "to", "finish"); the errors are back-propagated to update W_in and W_out.]
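The mirrored sketch for skip-gram under the same toy assumptions as the CBOW snippet: the single center word produces the hidden vector, and each context position adds a term to the loss.

```python
import numpy as np

rng = np.random.default_rng(0)
V, N, lr = 5, 3, 0.1          # toy vocabulary size, embedding dim, learning rate

W_in = rng.normal(scale=0.1, size=(N, V))
W_out = rng.normal(scale=0.1, size=(V, N))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

center_id = 2                 # e.g. "time"
context_ids = [0, 1, 3, 4]    # e.g. "It", "is", "to", "finish"

# Forward: the center word alone defines the hidden vector.
h = W_in[:, center_id]
y = softmax(W_out @ h)

# E = -sum_c log p(w_c | w_t): every context word contributes to the loss.
loss = -sum(np.log(y[c]) for c in context_ids)

# Backward: accumulate the output error over all context positions.
e = len(context_ids) * y
for c in context_ids:
    e[c] -= 1.0
grad_h = W_out.T @ e
W_out -= lr * np.outer(e, h)
W_in[:, center_id] -= lr * grad_h

print(f"loss = {loss:.3f}")
```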

Page 24: Word2vec slide(lab seminar)

• Hierarchical softmax
• To train the weight matrices at every step, we need to pass the computed output vector into the loss function

• Softmax function

• Before calculating the loss function, the computed vector should be normalized to real numbers in (0, 1)

Extension of Skip-Gram (1)

( E = −log p(w_{t−C}, …, w_{t+C} | w_t) )

Page 25: Word2vec slide(lab seminar)

• Hierarchical softmax (cont.)
• Softmax function

(we have already calculated this... it's boring...)

Extension of Skip-Gram (1)

The original softmax of the skip-gram model:
p(w_O | w_I) = exp(v′_{w_O}ᵀ v_{w_I}) / Σ_{w=1..V} exp(v′_wᵀ v_{w_I})
(v_w: input vector and v′_w: output vector of word w; the sum runs over the whole vocabulary V)
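A small numpy sketch of this plain softmax with hypothetical random vectors, mainly to make the cost visible: every single probability requires a sum over the whole vocabulary.

```python
import numpy as np

def softmax_prob(center_vec, out_vectors, target_id):
    """p(w_target | w_center) under the full softmax. The denominator sums a
    score for every word in the vocabulary, which is the O(V) cost per example
    that hierarchical softmax and negative sampling are designed to avoid."""
    scores = out_vectors @ center_vec          # one dot product per vocabulary word
    scores -= scores.max()                     # for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores[target_id] / exp_scores.sum()

rng = np.random.default_rng(0)
V, N = 10_000, 300                             # even a modest vocabulary is costly
center_vec = rng.normal(size=N)
out_vectors = rng.normal(size=(V, N))
print(softmax_prob(center_vec, out_vectors, target_id=42))
```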

Page 26: Word2vec slide(lab seminar)

• Hierarchical softmax (cont.)
• Since V is quite large, this computation costs too much

• Idea: construct a binary Huffman tree over the words

  Cost: V → log₂(V)

• Can train faster!
• Assignment: each word is a leaf of the tree, defined by its unique path from the root

(* details in "Hierarchical Probabilistic Neural Network Language Model")

Extension of Skip-Gram (1)

Page 27: Word2vec slide(lab seminar)

• Negative Sampling (similar to NCE: Noise Contrastive Estimation)
• The vocabulary size makes this computation huge! Slow to train

• Idea: just sample several negative examples!

• Do not loop over the full vocabulary, only use the negative samples → fast

• Replace the target word with sampled negative examples and learn to tell them apart → more accuracy

• Objective function (for the center word w_I and true context word w_O):
  log σ(v′_{w_O}ᵀ v_{w_I}) + Σ_{i=1..k} E_{w_i ∼ P_n(w)} [ log σ(−v′_{w_i}ᵀ v_{w_I}) ]

Extension of Skip-Gram (2)

i.e. "Stock boil fish is toy" → a nonsense phrase serving as a negative sample

E = −log p(w_{t−C}, …, w_{t+C} | w_t)
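A sketch of one negative-sampling update with toy sizes; the noise distribution proportional to unigram counts raised to the 3/4 power follows the paper, while the sizes, names, and initialization here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_step(W_in, W_out, center_id, target_id, noise_probs, k=5, lr=0.025):
    """One SGD step of the negative-sampling objective:
    maximize log sigma(v'_target . v) + sum over k samples of log sigma(-v'_neg . v)."""
    v = W_in[:, center_id]
    neg_ids = rng.choice(len(noise_probs), size=k, p=noise_probs)

    grad_v = np.zeros_like(v)
    for wid, label in [(target_id, 1.0)] + [(n, 0.0) for n in neg_ids]:
        score = sigmoid(W_out[wid] @ v)
        g = score - label                 # gradient of the logistic loss
        grad_v += g * W_out[wid]
        W_out[wid] -= lr * g * v          # update only k+1 output rows, not all V
    W_in[:, center_id] -= lr * grad_v

# Toy setup: noise distribution ~ unigram counts^(3/4), as in the paper.
V, N = 1000, 50
counts = rng.integers(1, 100, size=V).astype(float)
noise_probs = counts ** 0.75
noise_probs /= noise_probs.sum()
W_in = rng.normal(scale=0.1, size=(N, V))
W_out = rng.normal(scale=0.1, size=(V, N))
neg_sampling_step(W_in, W_out, center_id=3, target_id=17, noise_probs=noise_probs)
```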

Page 28: Word2vec slide(lab seminar)

• Subsampling
• ("Korea", "Seoul") is helpful, but ("Korea", "the") isn't helpful

• Idea: frequent word vectors (e.g. "the") should not change significantly after training on several million examples.

• Each word w in the training set is discarded with the probability below:
  P(w) = 1 − sqrt( t / f(w) )   (f(w): frequency of w, t: threshold, around 10⁻⁵)

• It aggressively subsamples frequent words while preserving the ranking of the frequencies

• But this formula was chosen heuristically…

Extension of Skip-Gram (3)
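A tiny sketch of the discard probability above; the formula is the one from the paper, while the example frequencies are made up.

```python
import math

def discard_probability(word_freq, t=1e-5):
    """P(w) = 1 - sqrt(t / f(w)): probability of discarding an occurrence of w,
    where f(w) is its relative frequency and t is the threshold."""
    if word_freq <= t:
        return 0.0                 # rare words are always kept
    return 1.0 - math.sqrt(t / word_freq)

# Hypothetical frequencies: "the" is extremely frequent, "Seoul" is rare.
for word, f in [("the", 0.05), ("Korea", 1e-4), ("Seoul", 2e-5)]:
    print(f"{word:>6}: discarded with probability {discard_probability(f):.3f}")
```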

Page 29: Word2vec slide(lab seminar)

• Evaluation
• Task: analogical reasoning

• Accuracy test: cosine similarity determines whether the model answers correctly.

i.e. vec(X) = vec("Berlin") – vec("Germany") + vec("France")

Accuracy = cosine_similarity( vec(X), vec("Paris") )

• Model: skip-gram model (word-embedding dimension = 300)

• Data set: news articles (Google dataset with 1 billion words)

• Compared methods (with or without 10⁻⁵ subsampling)
• NEG (Negative Sampling)-5, -15

• Hierarchical Softmax-Huffman

• NCE-5 (Noise Contrastive Estimation)

Extension of Skip-Gram
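A sketch of the analogical-reasoning check; hypothetical random vectors stand in for the trained 300-dimensional skip-gram embeddings, so the printed answer is only meaningful with real vectors.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(vectors, a, b, c):
    """Return the vocabulary word closest (by cosine similarity) to
    vec(b) - vec(a) + vec(c), excluding the three query words."""
    x = vectors[b] - vectors[a] + vectors[c]
    return max((w for w in vectors if w not in {a, b, c}),
               key=lambda w: cosine(x, vectors[w]))

# Hypothetical tiny embedding table standing in for the trained vectors.
rng = np.random.default_rng(0)
vectors = {w: rng.normal(size=300)
           for w in ["Germany", "Berlin", "France", "Paris", "banana"]}
print(analogy(vectors, "Germany", "Berlin", "France"))  # ideally "Paris" with real vectors
```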

Page 30: Word2vec slide(lab seminar)

• Empirical Results

• Model w/ NEG outperforms the HS on the analogical reasoning task (even slightly better than NCE)

• The subsampling improves the training speed several times and makes the word representations more accurate

Extension of Skip-Gram


Page 31: Word2vec slide(lab seminar)

• Word-based models cannot represent idiomatic phrases
• e.g. "New York Times", "Larry Page"

• Simple data-driven approach
• Phrases are formed based on unigram and bigram counts:
  score(w_i, w_j) = ( count(w_i w_j) − δ ) / ( count(w_i) × count(w_j) )
• Word pairs whose score is above a threshold are treated as meaningful phrases

Learning Phrases
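A short sketch of this bigram scoring; δ is the discounting coefficient, and the corpus and threshold here are hypothetical.

```python
from collections import Counter

def phrase_scores(tokens, delta):
    """score(wi, wj) = (count(wi wj) - delta) / (count(wi) * count(wj));
    delta discounts very infrequent bigrams so they do not become phrases."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {bg: (c - delta) / (unigrams[bg[0]] * unigrams[bg[1]])
            for bg, c in bigrams.items()}

# Hypothetical toy corpus; in practice, pairs scoring above a chosen threshold
# are merged into single tokens such as "new_york" before training.
tokens = "new york times reported that new york is a large city".split()
for bg, s in sorted(phrase_scores(tokens, delta=1.0).items(), key=lambda kv: -kv[1])[:3]:
    print(bg, round(s, 3))
```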

Page 32: Word2vec slide(lab seminar)

• Evaluation
• Task: analogical reasoning with phrases

• Accuracy test: cosine similarity determines whether the model answers correctly with phrases
• i.e. vec(X) = vec("Steve Ballmer") – vec("Microsoft") + vec("Larry Page")

Accuracy = cosine_similarity( vec(X), vec("Google") )

• Model: skip-gram model (word-embedding dimension = 300)

• Data set: news articles (Google dataset with 1 billion words)

• Compared methods (with or without 10⁻⁵ subsampling)
• NEG-5

• NEG-15

• HS-Huffman

Learning Phrases

Page 33: Word2vec slide(lab seminar)

• Empirical Results

• NEG-15 achieves better performance than NEG-5
• HS becomes the best performing method when subsampling is used

• This shows that subsampling can result in faster training and can also improve accuracy, at least in some cases.

• With a training set of 33 billion words and d = 1000, accuracy reaches 72% (66% with 6B words)

• The amount of training data is crucial!

Learning Phrases

Page 34: Word2vec slide(lab seminar)

• Simple vector addition (on the skip-gram model)
• Previous experiments showed analogical reasoning (A + B − C)

• Vector values are related logarithmically to the probabilities, so the sum of two vectors is related to the product of the two context distributions

• Interesting!

Additive Compositionality
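A minimal sketch of the element-wise addition check, with random stand-in vectors; the paper reports compositions along the lines of vec("Russian") + vec("river") landing near "Volga River" when trained vectors are used.

```python
import numpy as np

def nearest(vectors, query, exclude=()):
    """Vocabulary word whose vector is closest (cosine similarity) to the query."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in vectors if w not in exclude),
               key=lambda w: cos(vectors[w], query))

# Hypothetical embeddings; only illustrative without real trained vectors.
rng = np.random.default_rng(0)
vectors = {w: rng.normal(size=300)
           for w in ["Russian", "river", "Volga_River", "Germany", "capital"]}
print(nearest(vectors, vectors["Russian"] + vectors["river"], exclude={"Russian", "river"}))
```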

Page 35: Word2vec slide(lab seminar)

• Contributions
• Showed the detailed process of training distributed representations of words and phrases

• A more accurate and faster model than the previous word2vec model, thanks to subsampling
• Negative Sampling: extremely simple and accurate for frequent words (for infrequent words, such as phrases, HS was better)

• Word vectors can be meaningfully combined by simple vector addition

• Released the code and dataset as an open-source project

Conclusion

Page 36: Word2vec slide(lab seminar)

• Comparison to other neural network models: <finding the most similar word>

• The skip-gram model trained on a large corpus outperforms all the other papers' models.

Conclusion

Page 37: Word2vec slide(lab seminar)

• Very interesting model
• Simple, short paper

  • Easy to read

  • Hard to understand the details

    • In HS, the way the tree is constructed
    • Several heuristic methods
    • Pre-processing such as eliminating stop words

Speaker's Opinion

Page 38: Word2vec slide(lab seminar)

• Papers
• Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013).
• Morin, Frederic, and Yoshua Bengio. "Hierarchical probabilistic neural network language model." Proceedings of the International Workshop on Artificial Intelligence and Statistics. 2005.
• Guthrie, David, et al. "A closer look at skip-gram modelling." Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC-2006). 2006.
• Rong, Xin. "word2vec Parameter Learning Explained." arXiv preprint arXiv:1411.2731 (2014).
• Goldberg, Yoav, and Omer Levy. "word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method." arXiv preprint arXiv:1402.3722 (2014).
• Collobert, Ronan, et al. "Natural language processing (almost) from scratch." The Journal of Machine Learning Research 12 (2011): 2493-2537.
• Bengio, Yoshua, et al. "A neural probabilistic language model." The Journal of Machine Learning Research 3 (2003): 1137-1155.
• Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams. "Learning representations by back-propagating errors." Cognitive Modeling 5 (1988): 3.

• Websites & Courses
• Richard Socher, CS224d: Deep Learning for Natural Language Processing (http://cs224d.stanford.edu/)
• http://alexminnaar.com/word2vec-tutorial-part-i-the-skip-gram-model.html
• http://nohhj.blogspot.kr/2015/08/word-embedding.html
• https://yinwenpeng.wordpress.com/category/deep-learning-in-nlp/
• http://rare-technologies.com/word2vec-tutorial/
• https://code.google.com/p/word2vec/source/browse/trunk/word2vec.c?spec=svn42&r=42#482
• https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-2-word-vectors

References

Page 39: Word2vec slide(lab seminar)

? = Word2vec(“Slide” + “End”)

End


