
Paper Presentation

Word Representations in Vector Space

Abdullah Khan Zehady
Department of Computer Science, Purdue University
E-mail: [email protected]

Word Representation

Neural Word Embedding

● Continuous vector space representation
  o Words are represented as dense real-valued vectors in R^d.
● Distributed word representation ↔ word embedding
  o An entire vocabulary is embedded into a relatively low-dimensional linear space whose dimensions are latent continuous features.
● The classical n-gram model works in terms of discrete units, with no inherent relationship between them.
● In contrast, word embeddings capture regularities and relationships between words.

Syntactic & Semantic Relationships

Regularities are observed as a constant offset vector between pairs of words sharing some relationship.

Gender relation:
  KING - QUEEN ~ MAN - WOMAN

Singular/plural relation:
  KING - KINGS ~ QUEEN - QUEENS

Other relations:
● Language: France - French ~ Spain - Spanish
● Past tense: Go - Went ~ Capture - Captured
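As a concrete illustration of these offset regularities (not part of the slides), here is a minimal sketch using gensim's KeyedVectors to query king - man + woman ≈ queen; pre-trained word2vec-format vectors are assumed, and the file path is a placeholder.

# Minimal sketch: querying analogy offsets with gensim.
# Assumes pre-trained vectors in word2vec format; "vectors.bin" is a placeholder path.
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

# king - man + woman ~ queen: add/subtract vectors and rank by cosine similarity.
print(kv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# The same offset idea works for syntactic relations, e.g. singular/plural.
print(kv.most_similar(positive=["kings", "queen"], negative=["king"], topn=3))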

Vector Space Model

[Figure: word vector spaces for two languages. Language 1: English; Language 2: Estonian.]

Neural Net

[Figure: a neural network with an input layer, a hidden layer, and an output layer.]

Language Model (LM)

● Different models for estimating continuous representations of words:
  o Latent Semantic Analysis (LSA)
  o Latent Dirichlet Allocation (LDA)
  o Neural Network Language Model (NNLM)

Feed-Forward NNLM

● Consists of input, projection, hidden, and output layers.
● The N previous words are encoded using 1-of-V coding, where V is the size of the vocabulary. Ex: A = (1,0,...,0), B = (0,1,...,0), ..., Z = (0,0,...,1) in R^26 (see the sketch after this list).
● The NNLM becomes computationally complex between the projection (P) and hidden (H) layers.
  o For N = 10, size of P = 500-2000, size of H = 500-1000.
  o The hidden layer is used to compute a probability distribution over all the words in the vocabulary V.
● Hierarchical softmax comes to the rescue.
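A minimal numpy sketch of the 1-of-V encoding and the projection step, assuming a toy vocabulary and a randomly initialized projection matrix (all names and sizes here are illustrative, not from the paper):

import numpy as np

# Toy vocabulary; in practice V is 10^5-10^7 words.
vocab = ["the", "king", "queen", "man", "woman"]
V, d = len(vocab), 8          # d: embedding dimensionality (illustrative)
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """1-of-V coding: a length-V vector with a single 1 at the word's index."""
    v = np.zeros(V)
    v[word_to_idx[word]] = 1.0
    return v

# Projection matrix: row i is the d-dimensional embedding of word i.
P = np.random.randn(V, d) * 0.01

# Multiplying a one-hot vector by P is just a row lookup.
x = one_hot("king")
assert np.allclose(x @ P, P[word_to_idx["king"]])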

Recurrent NNLM

● No projection layer; consists of input, hidden, and output layers only.
● No need to specify the context length, unlike the feed-forward NNLM.
● What is special in the RNN model?
  o A recurrent matrix that connects the hidden layer to itself.
  o Allows the network to form a short-term memory: information from the past is represented by the hidden layer state.
● RNN-based word vectors achieved state-of-the-art results on a relational similarity identification task.

RNN Model

Recurrent NNLM

● w(t): input word at time t
● y(t): output layer; produces a probability distribution over words
● s(t): hidden layer
● U: each column represents a word
● For comparison: the four-gram neural net language model architecture (Bengio, 2001).
● The RNN is trained with SGD and backpropagation to maximize the log likelihood.
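Given these definitions, the hidden and output layers of the recurrent model can be written as below. This is a reconstruction of the standard RNNLM recurrence (Mikolov et al.), so the exact notation (W for the recurrent matrix, V for the output matrix, f the sigmoid, g the softmax) is an assumption rather than copied from the slides:

s(t) = f\big( U\,w(t) + W\,s(t-1) \big), \qquad y(t) = g\big( V\,s(t) \big)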

Bringing efficiency...

● The computational complexity of NNLMs is high.
● We can remove the hidden layer and speed up training by roughly 1000x:
  o Continuous bag-of-words model
  o Continuous skip-gram model
● The full softmax can be replaced by:
  o Hierarchical softmax (Morin and Bengio)
  o Hinge loss (Collobert and Weston)
  o Noise contrastive estimation (Mnih et al.)
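These design choices map directly onto the options of common word2vec implementations. As a hedged illustration (not part of the slides), gensim's Word2Vec exposes them roughly like this, assuming gensim 4.x and a toy corpus:

# Illustrative only: how CBOW/skip-gram and hierarchical softmax/negative sampling
# appear as training options in gensim 4.x (parameter names assumed from that API).
from gensim.models import Word2Vec

sentences = [["the", "king", "rules", "the", "kingdom"],
             ["the", "queen", "rules", "the", "kingdom"]]

# sg=0 -> CBOW, sg=1 -> skip-gram; hs=1 -> hierarchical softmax, negative=k -> negative sampling.
cbow_hs = Word2Vec(sentences, vector_size=50, window=4, sg=0, hs=1, negative=0, min_count=1)
sg_neg = Word2Vec(sentences, vector_size=50, window=4, sg=1, hs=0, negative=5, min_count=1)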

Continuous Bag-of-Words Model (CBOW)

● The non-linear hidden layer is removed.
● The projection layer is shared for all words (not just the projection matrix).
● All words get projected into the same position (their vectors are averaged).
● Naming reason: the order of words in the history does not influence the projection.
● Best performance is obtained by a log-linear classifier with four future and four history words at the input.
  o Training criterion: correctly classify the current (middle) word.
● Predicts the current word based on the context (a minimal sketch of the forward pass follows).
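A minimal numpy sketch of the CBOW forward pass under the assumptions above (average the context vectors, then a softmax over the vocabulary); the matrix names and sizes are illustrative, not from the paper:

import numpy as np

V, d = 5000, 100                      # vocabulary size and embedding dimensionality (illustrative)
P = np.random.randn(V, d) * 0.01      # shared projection (input embedding) matrix
W_out = np.random.randn(d, V) * 0.01  # output weights

def cbow_probs(context_ids):
    """P(current word | context): average the context embeddings, then softmax."""
    h = P[context_ids].mean(axis=0)   # order of context words does not matter
    scores = h @ W_out
    scores -= scores.max()            # numerical stability
    e = np.exp(scores)
    return e / e.sum()

# Example: 4 history + 4 future word indices around the (unknown) middle word.
context = [12, 7, 431, 9, 88, 3, 1500, 42]
probs = cbow_probs(context)
print(probs.shape, probs.sum())       # (5000,) ~1.0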

Continuous Skip-gram Model

● Objective: maximize the classification of a word based on another word in the same sentence, i.e. maximize the average log probability over the training corpus.
● Define p(w_{t+j} | w_t) using the softmax function (see the reconstruction below).
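The objective and softmax, reconstructed here from the formulation in the word2vec papers (so treat the exact notation as an assumption of this write-up): given training words w_1, ..., w_T and context size c,

\frac{1}{T}\sum_{t=1}^{T}\;\sum_{-c \le j \le c,\; j \ne 0} \log p(w_{t+j} \mid w_t)

p(w_O \mid w_I) = \frac{\exp\!\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{W} \exp\!\left({v'_{w}}^{\top} v_{w_I}\right)}

where v_w and v'_w are the input and output vector representations of w, and W is the number of words in the vocabulary.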

Predicts the surrounding words given the current word.


Hierarchical Softmax for Efficient Computation

● The full-softmax formulation is impractical because the cost of computing ∇ log p(w_O | w_I) is proportional to W, which is often large (10^5-10^7 terms).
● With hierarchical softmax, the cost per prediction is reduced to about log2(W).

Hierarchical Softmax

● Uses a binary tree (Huffman coding) representation of the output layer, with the W words as its leaves.
  o A random walk down the tree assigns probabilities to words.
● Instead of evaluating W output nodes, only about log2(W) nodes are evaluated to compute the probability distribution.
● Each word w can be reached by an appropriate path from the root of the tree.
● n(w, j): the j-th node on the path from the root to w
● L(w): the length of this path
● n(w, 1) = root and n(w, L(w)) = w
● ch(n): an arbitrary fixed child of an inner node n
● [[x]] = 1 if x is true and -1 otherwise
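With this notation, the hierarchical softmax defines the word probability as below; this is a reconstruction of the formula from Mikolov et al. (2013) rather than a transcription of the slide:

p(w \mid w_I) = \prod_{j=1}^{L(w)-1} \sigma\!\Big( [[\, n(w, j{+}1) = \mathrm{ch}(n(w, j)) \,]] \cdot {v'_{n(w,j)}}^{\top} v_{w_I} \Big)

where σ is the logistic function.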

Negative Sampling

● Noise Contrastive Estimation (NCE)
  o A good model should be able to differentiate data from noise by means of logistic regression.
  o An alternative to the hierarchical softmax.
  o Introduced by Gutmann and Hyvarinen and applied to language modeling by Mnih and Teh.
● NCE approximates the log probability of the softmax.
● Negative sampling is defined by an objective that replaces log p(w_O | w_I) in the skip-gram objective (see the reconstruction below).
● Task: distinguish the target word w_O from draws from the noise distribution P_n(w) using logistic regression, where there are k negative samples for each data sample.
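The negative-sampling objective, reconstructed from the formulation in Mikolov et al. (2013) (treat the notation as an assumption of this write-up):

\log \sigma\!\left({v'_{w_O}}^{\top} v_{w_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma\!\left(-{v'_{w_i}}^{\top} v_{w_I}\right) \right]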

Subsampling of Frequent Words

● The most frequent words provide less information than rare words.
  o Co-occurrences of "France" and "Paris" are informative.
  o Co-occurrences of "France" and "the" are much less informative.
● A simple subsampling approach counters this imbalance:
  o Each word w_i in the training set is discarded with probability P(w_i) (see below), where f(w_i) is the frequency of word w_i and t is a chosen threshold, typically around 10^-5.
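The discard probability, reconstructed from the formula in the paper (the exact form is taken from Mikolov et al. (2013), not from the slide image):

P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}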

● This aggressively subsamples words whose frequency is greater than t, while preserving the ranking of the frequencies.

Empirical Results

Automatic learning by the skip-gram model:

● No supervised information about what a capital city means is provided.
● But the model is still capable of
  o automatically organizing concepts, and
  o learning implicit relationships between them.

[Figure: PCA projection of 100-dimensional skip-gram vectors.]

Analogical Reasoning Performance

● Analogical reasoning task introduced by Mikolov et al.
  o Syntactic analogies: "quick" : "quickly" :: "slow" : ? ("slowly")
  o Semantic analogies: "Germany" : "Berlin" :: "France" : ? ("Paris")

Learning Phrases

● To learn phrase vectors:
  o First find words that appear frequently together, and infrequently in other contexts.
  o Replace them with unique tokens. Ex: "New York Times" -> New_York_Times
● Phrases are formed based on the unigram and bigram counts, using a score (see below); a discounting coefficient δ prevents forming too many phrases out of very infrequent words.
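The phrase score, reconstructed from the formula in the paper (bigrams whose score exceeds a threshold become phrases; the exact form is quoted from Mikolov et al. (2013), not from the slide image):

\mathrm{score}(w_i, w_j) = \frac{\mathrm{count}(w_i w_j) - \delta}{\mathrm{count}(w_i) \times \mathrm{count}(w_j)}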

Learning Phrases

Goal: compute the fourth phrase of each analogy using the first three.
(Best model accuracy: 72%.)

Phrase Skip-gram Results

● Accuracies of the skip-gram models on the phrase analogy dataset
  o using different hyperparameters,
  o with models trained on approximately one billion words from the news dataset.
● The size of the training data matters:
  o HS-Huffman (dimensionality = 1000) trained on 33 billion words reaches an accuracy of 72%.

Additive Compositionality

● It is possible to meaningfully combine words by an element-wise addition of their vector representations; for example, the paper reports that vec("Russia") + vec("river") is close to vec("Volga River").
  o A word vector represents the distribution of the contexts in which the word appears.
● Vector values are related logarithmically to the probabilities computed by the output layer.
  o The sum of two word vectors is therefore related to the product of the two context distributions.

Closest Entities

Closest-entity search using two methods: negative sampling and hierarchical softmax.

Comparison with published word representations.

Comments

● The reduction in computational complexity is impressive.
● Works with unsupervised/unlabelled data.
● The vector representation can be extended to larger pieces of text, e.g. the Paragraph Vector (Le and Mikolov, 2014).
● Applicable to many NLP tasks:
  o Tagging
  o Named entity recognition
  o Translation
  o Paraphrasing
  o Sentiment analysis

Thank you.

