Michael Alcorn, Sr. Software Engineer, Red Hat Inc. at MLconf SF 2017

Post on 22-Jan-2018

REPRESENTATION LEARNING @ RED HAT

Michael A. Alcorn (malcorn@redhat.com)

Machine Learning Engineer - Information Retrieval

https://sites.google.com/view/michaelaalcorn/

Outline

Background
word2vec/url2vec
doc2vec/account2vec
Duplicate Detection
(batter|pitcher)2vec

MLconf Blog

Background

Why?

Small amount (zero?) of labeled data for task
Lots of unlabeled data (labeled data for a different task?)

Can we use large amounts of unlabeled data to make better predictions?

Not the same as traditional unsupervised learning!

Representation learning

Excellent chapter in Goodfellow et al.'s Deep Learning textbook
Article by Bengio et al.

Transfer learning

word2vec

Mikolov et al. (2013)

Deeplearning4j - "Word2vec"

word2vec

Analogies

"x is to y as ? is to z"
x - y + z = ?
bash - shellshock + heartbleed = openssl
firefox - linux + windows = internet_explorer
openshift - cloud + storage = gluster
rhn_register - rhn + rhsm = subscription-manager
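
The analogy arithmetic above can be sketched in a few lines of Python. The 2-D vectors below are invented purely for illustration (they are not the embeddings behind the results on this slide), and the helper names `cosine` and `analogy` are hypothetical. The sketch computes x - y + z and returns the remaining word whose vector is most similar to the result by cosine similarity.

```python
import math

# Toy 2-D vectors, made up for illustration only (NOT real embeddings).
# Axis 0 loosely encodes "which project", axis 1 "is a vulnerability".
vectors = {
    "bash": (1.0, 0.0),
    "shellshock": (1.0, 1.0),
    "openssl": (-1.0, 0.0),
    "heartbleed": (-1.0, 1.0),
    "gluster": (0.0, -1.0),
}


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(x, y, z, vectors):
    """Return the word whose vector is closest to x - y + z."""
    target = tuple(
        a - b + c for a, b, c in zip(vectors[x], vectors[y], vectors[z])
    )
    # Skip the three query words so the answer is a different word.
    candidates = [w for w in vectors if w not in {x, y, z}]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


print(analogy("bash", "shellshock", "heartbleed", vectors))
```

With real embeddings the lookup would run over the full vocabulary; excluding the query words is a common convention, assumed here.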

Naming Colors

Blog post by Janelle Shane mapping RGB values to color names
Results are pretty underwhelming for those in the know
Can word embeddings improve the results (GitHub)?

url2vec

Tasks concerning URLs

Search - returning relevant content
Troubleshooting - recommending related articles

Obvious method - look at text
Alternative/enhanced method - use customer browsing behavior as additional contextual clues

url2vec

How?

Treat each day of browsing activity as a "sentence"
Treat each URL as a "word"
Run word2vec!
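
A minimal sketch of the data preparation these three steps imply, using only the standard library. The log format, the customer IDs, the URL paths, and the function names (`build_sentences`, `skipgram_pairs`) are all assumptions made for illustration; the slides do not describe the actual pipeline. The second function shows the (center, context) pairs a skip-gram style word2vec model would train on.

```python
from collections import defaultdict

# Hypothetical page-view log: (customer_id, date, url), already in time order.
page_views = [
    ("acct-1", "2017-11-01", "/solutions/1"),
    ("acct-1", "2017-11-01", "/solutions/2"),
    ("acct-2", "2017-11-01", "/articles/7"),
    ("acct-1", "2017-11-02", "/solutions/2"),
    ("acct-1", "2017-11-02", "/articles/7"),
]


def build_sentences(page_views):
    """Group URLs into one "sentence" per customer per day."""
    sessions = defaultdict(list)
    for customer_id, date, url in page_views:
        sessions[(customer_id, date)].append(url)
    return list(sessions.values())


def skipgram_pairs(sentences, window=2):
    """List (center, context) URL pairs within a symmetric window."""
    pairs = []
    for sentence in sentences:
        for i, center in enumerate(sentence):
            start = max(0, i - window)
            stop = min(len(sentence), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    pairs.append((center, sentence[j]))
    return pairs


sentences = build_sentences(page_views)
pairs = skipgram_pairs(sentences)
print(sentences)
print(pairs)
```

A list of token lists like `sentences` is the input shape that common word2vec implementations (for example gensim's `Word2Vec`) accept, so the "Run word2vec!" step can consume it directly.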

doc2vec

Le and Mikolov (2014)

"NLP 05: From Word2vec to Doc2vec: a simple example with Gensim"

customer2vec

Why?

Data-driven segmentation

Same idea as url2vec except now we treat each account as a "document" of many "sentences" (different browsing days)
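
The account-as-document idea can be sketched as a grouping step. As before, the log format, IDs, URL paths, and function names are hypothetical. Each account becomes a "document" holding one URL "sentence" per browsing day, and every sentence is paired with its account ID, which plays the role of the document tag in a doc2vec-style model.

```python
from collections import defaultdict

# Hypothetical page-view log: (account_id, date, url), in time order per day.
page_views = [
    ("acct-1", "2017-11-01", "/solutions/1"),
    ("acct-1", "2017-11-01", "/solutions/2"),
    ("acct-2", "2017-11-01", "/articles/7"),
    ("acct-1", "2017-11-02", "/solutions/2"),
    ("acct-1", "2017-11-02", "/articles/7"),
]


def build_account_documents(page_views):
    """Map each account to its list of daily URL "sentences"."""
    by_account = defaultdict(lambda: defaultdict(list))
    for account_id, date, url in page_views:
        by_account[account_id][date].append(url)
    # ISO dates sort chronologically as plain strings.
    return {
        account_id: [urls for _, urls in sorted(by_date.items())]
        for account_id, by_date in by_account.items()
    }


def tagged_sentences(documents):
    """Pair every daily sentence with the account it belongs to."""
    return [
        (account_id, sentence)
        for account_id, sentences in documents.items()
        for sentence in sentences
    ]


documents = build_account_documents(page_views)
tagged = tagged_sentences(documents)
print(documents)
print(tagged)
```

Sharing one tag across all of an account's browsing days is what lets a single vector per account be learned; how the tags are fed to a particular library is left out here.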

Duplicate Detection

There are a number of "duplicate" KCS solutions on the Customer Portal

Muddy search results

How can we identify candidate duplicate documents?

Obvious approach - compare text (e.g., tf-idf)

Bag-of-words loses any structural meaning behind text
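
To make the tf-idf baseline and its limitation concrete, here is a small standard-library sketch. The three example texts are invented, the function names are hypothetical, and the smoothed idf formula is one common variant, assumed here rather than taken from the talk. The first two texts contain the same words in a different order, so a bag-of-words comparison cannot tell them apart.

```python
import math
from collections import Counter

# Invented example texts; the first two differ only in word order.
docs = [
    "kernel panic after yum update",
    "yum update after kernel panic",
    "how to register a system",
]


def tfidf_vectors(docs):
    """Return one sparse {term: weight} tf-idf vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Smoothed idf: ln((1 + N) / (1 + df)) + 1 (one common variant).
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


vecs = tfidf_vectors(docs)
same_words = cosine(vecs[0], vecs[1])   # identical bags of words
no_overlap = cosine(vecs[0], vecs[2])   # no shared terms
print(same_words, no_overlap)
```

The reordered pair scores as a perfect match even though the two texts describe different sequences of events, which is the structural loss the slide refers to.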

Can we learn better representations?

Title is essentially a summary of the solution content
Learn representations of body that are similar to title representations (like the DSSM; my code)
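
One way to express "body representations should be similar to title representations" is a DSSM-style objective: a softmax over scaled cosine similarities between a body vector and several candidate title vectors, trained to give the true title high probability. The sketch below is an assumption-laden illustration, not the linked code: the vectors are toy values, `gamma` is a hypothetical smoothing factor, and treating the body as the "query" side is a choice made here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def dssm_style_loss(body_vec, title_vecs, positive_index, gamma=10.0):
    """Negative log-probability of the true title.

    Probabilities come from a softmax over gamma-scaled cosine
    similarities between the body and every candidate title.
    """
    scores = [gamma * cosine(body_vec, title) for title in title_vecs]
    # Log-sum-exp with the max subtracted for numerical stability.
    max_score = max(scores)
    log_norm = max_score + math.log(
        sum(math.exp(s - max_score) for s in scores)
    )
    return log_norm - scores[positive_index]


# Toy vectors, invented for illustration only.
body = (1.0, 0.0)
titles = [(1.0, 0.1), (0.0, 1.0), (-1.0, 0.0)]

loss_match = dssm_style_loss(body, titles, positive_index=0)
loss_orthogonal = dssm_style_loss(body, titles, positive_index=1)
loss_opposite = dssm_style_loss(body, titles, positive_index=2)
print(loss_match, loss_orthogonal, loss_opposite)
```

In a trained model the vectors would be produced by encoders for the body and the title, and the loss would be minimized over many solutions with the other titles acting as negatives.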

Deep Semantic Similarity Model

Jianfeng Gao - "Deep Learning for Web Search and Natural Language Processing"

(batter|pitcher)2vec (GitHub)

Can we learn meaningful representations of MLB players?

Accurate representations could be used to simulate games and inform trades
Find undervalued/overvalued players

THANK YOU!
