Efficient Parallel Learning of Word2Vec

Jeroen B. P. Vuurens¹, Carsten Eickhoff², and Arjen P. de Vries³

¹ The Hague University of Applied Sciences

² ETH Zurich

³ Radboud University Nijmegen

June 24, 2016

Word2Vec

A simple method for learning low-dimensional feature representations of words

Beneficial properties:
- Unsupervised
- Semantics-preserving (up to a point...)

Recently very popular

Figure courtesy of T. Mikolov et al.
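
For orientation, here is a minimal sketch of training word2vec with Gensim (mentioned later in this talk); the toy corpus and parameter values are illustrative only, and the dimensionality argument is `vector_size` in Gensim 4.x (older releases call it `size`):

```python
# Toy word2vec training run with Gensim; hs=1 selects hierarchical softmax,
# workers controls the number of parallel training threads.
from gensim.models import Word2Vec

sentences = [
    ["the", "king", "rules", "the", "country"],
    ["the", "queen", "rules", "the", "country"],
]

model = Word2Vec(sentences, vector_size=100, window=5, min_count=1,
                 sg=1, hs=1, workers=4)

vec = model.wv["king"]                        # a dense 100-dimensional vector
print(model.wv.most_similar("king", topn=3))  # semantically related words
```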

More is more...

Figure courtesy of http://deepdist.com/

Parallel Training

Shared model θ

Parallel SGD threads:
- Draw a random training example xᵢ
- Acquire a lock on θ
- Read θ
- Update θ ← θ − α ∇L(fθ(xᵢ), yᵢ)
- Release the lock

Lots of waiting...
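
A minimal sketch of this locked update loop in Python with a shared NumPy parameter vector; `theta`, `grad`, and the toy data are placeholders, not the authors' implementation:

```python
import threading
import numpy as np

theta = np.zeros(100)            # shared model parameters
lock = threading.Lock()          # single global lock protecting theta
alpha = 0.025                    # learning rate

def grad(theta, x, y):
    # placeholder for the gradient of the loss L(f_theta(x), y)
    return np.zeros_like(theta)

def worker(examples):
    for x, y in examples:
        with lock:                       # acquire the lock on theta
            g = grad(theta, x, y)        # read theta
            theta[:] -= alpha * g        # update theta
        # lock released here; with many threads, most time is spent waiting for it

data = [[(np.random.randn(100), 1.0)] * 100 for _ in range(8)]  # toy shards
threads = [threading.Thread(target=worker, args=(shard,)) for shard in data]
for t in threads:
    t.start()
for t in threads:
    t.join()
```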

Hogwild!

Simply skip the locking:

- Draw a random training example xᵢ
- Read the current state of θ
- Update θ ← θ − α ∇L(fθ(xᵢ), yᵢ)
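
The same worker without any synchronization, in the spirit of Hogwild! (again a sketch with a placeholder gradient; conflicting writes are simply tolerated):

```python
import numpy as np

theta = np.zeros(100)                 # shared model, never locked
alpha = 0.025

def grad(theta, x, y):
    return np.zeros_like(theta)       # placeholder gradient

def hogwild_worker(examples):
    for x, y in examples:
        g = grad(theta, x, y)         # read whatever state theta currently has
        theta[:] -= alpha * g         # write back with no lock; occasional
                                      # overwritten updates are accepted
```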

Parallel Word2Vec

Intel Xeon CPU E5-2698 v3, 32 cores

Original C implementation + Gensim

Hierarchical Softmax

Binary Huffman tree

V − 1 internal nodes (V = vocabulary size)

Each word w is represented by a sequence of binary decisions along its path from the root to w's leaf

The tree’s top nodes are part of most paths

Figure courtesy of X. Rong
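
As a concrete illustration (a hypothetical helper, not the authors' code), the sketch below builds a Huffman tree over word counts and records, for each word, the internal nodes on its path and the binary decision taken at each of them; frequent words get short paths, and the nodes created last sit near the root and appear in most paths:

```python
import heapq
import itertools

def huffman_paths(counts):
    """counts: dict word -> frequency. Returns word -> (internal node ids, binary code)."""
    tiebreak = itertools.count()
    heap = [(freq, next(tiebreak), {"word": w}) for w, freq in counts.items()]
    heapq.heapify(heap)
    internal = 0
    while len(heap) > 1:                       # merge the two rarest subtrees
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        node = {"id": internal, "left": left, "right": right}
        internal += 1                          # V - 1 internal nodes in total
        heapq.heappush(heap, (f1 + f2, next(tiebreak), node))

    paths = {}
    def walk(node, path, code):
        if "word" in node:                     # leaf: store the decisions made so far
            paths[node["word"]] = (path, code)
        else:
            walk(node["left"], path + [node["id"]], code + [0])
            walk(node["right"], path + [node["id"]], code + [1])
    walk(heap[0][2], [], [])
    return paths

print(huffman_paths({"the": 100, "cat": 10, "sat": 8, "mat": 3}))
```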

Zipf's Law

Word frequencies are heavily skewed: a small number of very frequent words accounts for most token occurrences, so the top nodes of the Huffman tree are touched by almost every update.

Figure courtesy of http://wugology.com/
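
Combining this with the hypothetical huffman_paths helper from the earlier sketch shows the effect directly: under a Zipf-like frequency distribution, a small set of top internal nodes receives the bulk of all visits.

```python
from collections import Counter

# Zipf-like toy vocabulary: the word of rank r has frequency proportional to 1/r.
counts = {f"w{r}": 10000 // r for r in range(1, 1001)}
paths = huffman_paths(counts)                 # helper from the earlier sketch

visits = Counter()
for word, (path, _code) in paths.items():
    for node in path:
        visits[node] += counts[word]          # each occurrence of word touches node

total = sum(visits.values())
top31 = sum(v for _, v in visits.most_common(31))
print(f"top 31 of {len(visits)} internal nodes receive {100 * top31 / total:.1f}% of visits")
```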

Cached Huffman Trees

Cache the top c nodes in the tree

Every thread works on its own, possibly stale, copy of these top nodes

The cache is synchronized with the shared model every u terms
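
A minimal sketch of the idea (hypothetical names, and NumPy in place of the authors' Cython implementation at cythnn.github.io): each thread trains against a private copy of the c hottest internal nodes, accumulates the deltas, and folds them back into the shared model every u terms.

```python
import numpy as np

V, dim = 10000, 100              # vocabulary size, embedding dimension
c, u = 31, 10                    # number of cached top nodes, cache update interval

shared = np.zeros((V - 1, dim))  # shared vectors for all V - 1 internal nodes

class TopNodeCache:
    """Per-thread cache of the c most frequently visited internal nodes."""

    def __init__(self):
        self.local = shared[:c].copy()       # stale private copy of the top nodes
        self.delta = np.zeros((c, dim))      # updates not yet pushed to the shared model
        self.terms = 0

    def read_node(self, node):
        return self.local[node] if node < c else shared[node]

    def train_node(self, node, grad_vec):
        if node < c:                         # hot node: update the private copy only
            self.local[node] -= grad_vec
            self.delta[node] -= grad_vec
        else:                                # cold node: Hogwild-style shared update
            shared[node] -= grad_vec

    def end_of_term(self):
        self.terms += 1
        if self.terms % u == 0:              # every u terms: push deltas, refresh copy
            shared[:c] += self.delta         # atomicity ignored, as in Hogwild!
            self.delta[:] = 0
            self.local[:] = shared[:c]
```

Keeping the hottest vectors thread-local between synchronizations avoids the contention on shared memory that the uncached version suffers from.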

Efficiency

Python/Cython implementation of cached Huffman trees

With c = 0 (no caching) the same scaling problem appears

Significantly better performance at c = 31

Cache Size

Consistent improvements for all c ≤ 31

Best results for 1 ≤ u ≤ 10

Choosing u too large degrades model quality

Effectiveness

Stable model quality

Slight quality edge for the Gensim implementation

Conclusion

Hierarchical Softmax scales badly beyond 4-8 parallel threads:
- Frequent memory accesses to the top tree nodes
- Zipf's Law

Caching a few top nodes:
- 4x speed-up
- Constant model quality

Try it yourself: http://cythnn.github.io

Thank You!

j.b.p.vuurens@tudelft.nl
