Efficient Parallel Learning of Word2Vec
Transcript
Page 1: Efficient Parallel Learning of Word2Vec

Efficient Parallel Learning of Word2Vec

Jeroen B. P. Vuurens1, Carsten Eickhoff2, and Arjen P. de Vries3

1The Hague University of Applied Science

2ETH Zurich

3Radboud University Nijmegen

June 24, 2016


Page 2: Efficient Parallel Learning of Word2Vec

Word2Vec

Simple method for low-dimensional feature representation of words

Beneficial properties:
- Unsupervised
- Semantics-preserving (up to a point...)

Recently very popular

Figure courtesy of T. Mikolov et al.
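To illustrate the semantics-preserving property, here is a small hedged example (my own, not from the slides) using pretrained vectors through Gensim's downloader; it assumes internet access and the "word2vec-google-news-300" model name from gensim-data. Simple vector arithmetic approximately recovers analogies such as king − man + woman ≈ queen.

import gensim.downloader as api

# downloads the pretrained Google News word2vec vectors (large, ~1.6 GB)
wv = api.load("word2vec-google-news-300")

# king - man + woman should land close to "queen"
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))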


Page 6: Efficient Parallel Learning of Word2Vec

More is more...

Figure courtesy of http://deepdist.com/

Page 7: Efficient Parallel Learning of Word2Vec

Parallel Training

Shared model θ

Parallel SGD threads

- Draw a random training example x_i
- Acquire a lock on θ
- Read θ
- Update θ ← θ − α∇L(f_θ(x_i), y_i)
- Release lock

Lots of waiting...

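As a concrete picture of the loop above, here is a minimal lock-based parallel SGD sketch in Python (a toy linear-regression stand-in of my own, not the word2vec code): every thread must take the single global lock before reading or updating θ, which is exactly where the waiting comes from.

import threading
import numpy as np

theta = np.zeros(10)                  # shared model parameters θ
lock = threading.Lock()               # one global lock guarding θ
alpha = 0.01                          # learning rate α

rng = np.random.default_rng(0)        # toy data standing in for word2vec training pairs
X = rng.normal(size=(1000, 10))
y = X @ rng.normal(size=10)

def worker(n_updates):
    local_rng = np.random.default_rng()
    for _ in range(n_updates):
        i = local_rng.integers(len(X))        # draw a random training example x_i
        with lock:                            # acquire a lock on θ
            pred = X[i] @ theta               # read θ
            grad = (pred - y[i]) * X[i]       # ∇L for squared loss
            theta[:] = theta - alpha * grad   # update θ, then release the lock

threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

Because of CPython's GIL this sketch will not actually run the gradient math in parallel; the real C and Cython word2vec implementations execute outside the GIL, but the locking pattern and its serialization cost are the same.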


Page 15: Efficient Parallel Learning of Word2Vec

Hogwild!

Simply skip the locking:

- Draw a random training example x_i
- Read the current state of θ
- Update θ ← θ − α∇L(f_θ(x_i), y_i)

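A hedged sketch of the Hogwild!-style variant, reusing the toy setup (theta, alpha, X, y, threading, numpy) from the locked sketch above: the update loop is the same, except the lock is simply dropped and all threads write into the shared parameter vector concurrently, tolerating the occasional lost update.

def hogwild_worker(n_updates):
    # identical to the locked worker, but no lock is taken anywhere
    local_rng = np.random.default_rng()
    for _ in range(n_updates):
        i = local_rng.integers(len(X))        # draw a random training example x_i
        pred = X[i] @ theta                   # read the current state of θ
        grad = (pred - y[i]) * X[i]
        theta[:] = theta - alpha * grad       # racy update: no lock is held

threads = [threading.Thread(target=hogwild_worker, args=(1000,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()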


Page 21: Efficient Parallel Learning of Word2Vec

Parallel Word2Vec

Intel Xeon CPU E5-2698 v3, 32 cores

Original C implementation + Gensim

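For reference, the Gensim baseline can be driven roughly as sketched below (assuming the gensim 4.x API; older releases use `size` instead of `vector_size`). `workers` sets the number of parallel training threads and `hs=1` selects hierarchical softmax; the toy corpus is only there to make the snippet runnable.

from gensim.models import Word2Vec

sentences = [["the", "quick", "brown", "fox"],
             ["jumps", "over", "the", "lazy", "dog"]]   # toy corpus

model = Word2Vec(sentences,
                 vector_size=100,    # dimensionality of the word vectors
                 window=5,
                 min_count=1,
                 sg=1,               # skip-gram
                 hs=1, negative=0,   # hierarchical softmax, no negative sampling
                 workers=8)          # number of parallel SGD threads

print(model.wv.most_similar("fox", topn=3))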


Page 25: Efficient Parallel Learning of Word2Vec

Hierarchical Softmax

Binary Huffman tree

V − 1 internal nodes (V = vocabulary size)

Each word w is represented by a sequence of binary decisions along its root-to-leaf path

The tree’s top nodes are part of most paths

Figure courtesy of X. Rong
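A minimal sketch (my own illustration, not the paper's code) of how hierarchical softmax scores a word: the probability is a product of sigmoid branch decisions at the internal nodes on the word's Huffman path, so frequent words near the root are cheap to score but their nodes are touched constantly. Sign conventions for the binary codes vary between implementations; this sketch treats code 1 as the "right" branch.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hs_probability(h, path_nodes, path_codes, node_vectors):
    # h           : context/hidden vector for the current training example
    # path_nodes  : indices of the internal nodes on the word's root-to-leaf path
    # path_codes  : branch taken at each of those nodes (0 or 1)
    # node_vectors: (V - 1, dim) matrix of internal-node parameters
    p = 1.0
    for node, code in zip(path_nodes, path_codes):
        s = sigmoid(node_vectors[node] @ h)
        p *= s if code == 1 else (1.0 - s)   # probability of the branch taken
    return p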


Page 30: Efficient Parallel Learning of Word2Vec

Zipf’s Law

Word frequencies are Zipfian: a handful of very frequent words accounts for most tokens, so the Huffman-tree nodes near the root are hit on almost every update.

Figure courtesy of http://wugology.com/

Page 31: Efficient Parallel Learning of Word2Vec

Cached Huffman Trees

Cache the top c nodes in the tree

Every thread works on its own (possibly stale) copy of these top nodes

Synchronize the cache with the shared model every u processed terms

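A rough sketch of how such a cache could work (my reconstruction from the bullet points above, not the released cythnn code; the class name, the delta-merging on sync, and the treatment of non-cached nodes are all my assumptions): each thread keeps a private copy of the top c internal-node vectors, trains against that possibly stale copy, and only reconciles with the shared model every u processed terms.

import numpy as np

class CachedTopNodes:
    # hypothetical helper, one instance per training thread
    def __init__(self, shared_nodes, c, u):
        self.shared = shared_nodes              # shared (V - 1, dim) node matrix
        self.c = c                              # number of cached top nodes
        self.u = u                              # sync interval, in processed terms
        self.local = shared_nodes[:c].copy()    # thread-private (stale) copy
        self.base = self.local.copy()           # snapshot taken at the last sync
        self.seen = 0

    def vector(self, node):
        # top nodes come from the private cache, rare nodes from shared memory
        return self.local[node] if node < self.c else self.shared[node]

    def update(self, node, delta):
        if node < self.c:
            self.local[node] += delta           # contention-free local update
        else:
            self.shared[node] += delta          # Hogwild-style update for rare nodes

    def tick(self):
        # call once per processed term; every u terms, push the accumulated
        # local delta back to the shared model and refresh the cache
        self.seen += 1
        if self.seen % self.u == 0:
            self.shared[:self.c] += self.local - self.base
            self.local = self.shared[:self.c].copy()
            self.base = self.local.copy()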


Page 35: Efficient Parallel Learning of Word2Vec

Efficiency

Python/Cython implementation of cached Huffman trees

At c = 0 (no caching) the same scaling problem remains

Significantly better performance at c = 31



Page 39: Efficient Parallel Learning of Word2Vec

Cache Size

Consistent improvements for all c ≤ 31

Best results for 1 ≤ u ≤ 10

Overly large choices of u degrade model quality



Page 43: Efficient Parallel Learning of Word2Vec

Effectiveness

Stable model quality

Slight quality edge for the Gensim implementation



Page 46: Efficient Parallel Learning of Word2Vec

Conclusion

Hierarchical Softmax scales badly beyond 4-8 nodes
- Frequent memory accesses to top nodes
- Zipf's Law

Caching a few top nodes
- 4x speed-up
- Constant model quality

Try it yourself: http://cythnn.github.io



Page 53: Efficient Parallel Learning of Word2Vec

Thank You!



