
NLP in Practice … and at Scale

Date post: 21-Jun-2020
NLP in Practice … and at Scale Ivan Bilan, Data Engineer
Transcript

NLP in Practice… and at Scale

Ivan Bilan, Data Engineer


Ivan Bilan (LinkedIn)
● B. Sc. and currently M. Sc. at CIS
● Honors Degree in Technology Management at CDTM
● Data Engineer at TrustYou
● Previously Data Engineer at MobileX AG and IT at LMU
● External Consultant at Adidas, Cloudeo AG and Paterva/Maltego

Research interests:

● Author Profiling and Identification
● Automated Summarization
● Relation Extraction


What does TrustYou do?

For every hotel on the planet, provide a summary of traveler reviews.


nhow Berlin (Full summary)

✓ Excellent hotel!*
✓ Nice building: “Clean, hip & modern, excellent facilities”
✓ Great view: « Vue superbe »
✓ Great for partying: “Nice weekend getaway or for partying”
✗ Solo travelers complain about TVs


TrustYou Architecture

[Architecture diagram. Labels: Hadoop Cluster, Local Grammars, Text Generation, Machine Learning, Aggregation, Crawling API]

3M new reviews per week!


Crawling


Scrapy

● Build your own web crawlers
  ○ Extract data via CSS selectors, XPath, regexes …
  ○ Handles queuing, request parallelism, cookies, throttling …
● Scrapy Example Code
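The extraction step can be illustrated without Scrapy. The sketch below is a standard-library stand-in, not Scrapy code: it applies an XPath-style query and a regex to a made-up, well-formed HTML snippet. The page markup, class names and the "N/5" rating format are all assumptions for illustration. In an actual Scrapy spider, comparable queries would typically go through `response.xpath(...)` or `response.css(...)`, with Scrapy handling the queuing and throttling.

```python
import re
import xml.etree.ElementTree as ET

# Made-up, well-formed page; real pages are rarely valid XML, which is
# one reason to use a proper crawler library instead of this stand-in.
page = """<html><body>
<div class="review"><span class="author">Ann</span><p>Nice room, 5/5</p></div>
<div class="review"><span class="author">Bob</span><p>Room wasn't so great, 2/5</p></div>
</body></html>"""

root = ET.fromstring(page)
reviews = []
# XPath-style query: every <div class="review"> anywhere in the page.
for node in root.findall(".//div[@class='review']"):
    author = node.find("span[@class='author']").text
    text = node.find("p").text
    # Regex: pull the single-digit rating out of the hypothetical "N/5" suffix.
    rating = int(re.search(r"(\d)/5", text).group(1))
    reviews.append({"author": author, "text": text, "rating": rating})

for review in reviews:
    print(review)
```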


Semantic Analysis


Semantic Analysis at TrustYou

● “Nice room”
● “Room wasn’t so great”
● “The air-conditioning was so powerful that we were cold in the room even when it was off.”
● “อาหารรสชาติดี” (Thai: “The food tastes good”)
● “خدمة جیدة” (Arabic: “Good service”)

● At core: rule-based linguistic system (context-free grammars, CFGs)
● 20 languages
● Classify opinions in 140 categories
● ML mostly in summarization
● Hadoop: Scale out CPU
  ○ ~1B opinions in DB


ML @ TrustYou

● gensim doc2vec model to create hotel embeddings

● TF-IDF vectors
● Combined with geographical data


Creating Training Sets

● For each hotel type, we need to create reliable training sets
● Based on review content, amenities and (when possible) geographical information
● OpenStreetMap project: contains geo information of interest for several categories

Examples:

- Coordinates of coastlines
- Highways
- Ski lifts
- Tourist attractions
...
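One way geographical information could feed a training set is a simple distance check against coastline coordinates. The sketch below is an assumption about how such a rule might look, not TrustYou's actual method: the coordinates, the 2 km threshold and the names `coastline`, `hotels` and `near_any` are all made up for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def near_any(position, points, max_km):
    """True if `position` lies within `max_km` of at least one point."""
    lat, lon = position
    return any(haversine_km(lat, lon, p_lat, p_lon) <= max_km
               for p_lat, p_lon in points)

# Hypothetical data: two coastline sample points and two hotels.
coastline = [(43.70, 7.27), (43.55, 7.02)]
hotels = {"seaside": (43.69, 7.26), "inland": (45.76, 4.84)}

# Hotels close to the coast become candidate examples for a "beach" type.
beach_candidates = [name for name, pos in hotels.items()
                    if near_any(pos, coastline, max_km=2.0)]
print(beach_candidates)
```

A rule like this only proposes candidates; the slide's other signals (review content, amenities) would still be needed to make the labels reliable.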


Word2Vec

● Map words to vectors
● “Step up” from bag-of-words model
● ‘Cats’ and ‘dogs’ should be similar, because they occur in similar contexts

>>> m["python"]
array([-0.1351, -0.1040, -0.0823, -0.0287,  0.3709,
       -0.0200, -0.0325,  0.0166,  0.3312, -0.0928,
       -0.0967, -0.0199, -0.2498, -0.4445, -0.0445,
       # ...
       -1.0090, -0.2553,  0.2686, -0.4121,  0.3116,
       -0.0639, -0.3688, -0.0273, -0.1266, -0.2606,
       -0.1549,  0.0023,  0.0084,  0.2169,  0.0060],
      dtype=float32)
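"Similar" is usually measured as the cosine of the angle between two word vectors. The sketch below uses invented 3-dimensional toy vectors, not output from a trained model, to show how ‘cats’ ends up closer to ‘dogs’ than to an unrelated word. The vectors and the helper names are assumptions for illustration.

```python
import math

# Toy vectors invented for illustration; real word2vec vectors are
# learned from text and usually have a hundred or more dimensions.
vectors = {
    "cats": [0.9, 0.8, 0.1],
    "dogs": [0.8, 0.9, 0.2],
    "invoice": [0.1, 0.0, 0.9],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(word, vectors):
    """Return the other word whose vector points in the closest direction."""
    others = (w for w in vectors if w != word)
    return max(others,
               key=lambda w: cosine_similarity(vectors[word], vectors[w]))

print(most_similar("cats", vectors))
```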


Classification Pipeline

- Have a separate TF-IDF vectorizer (or word2vec model) for each language
- Combine the TF-IDF vectors before the classification step
- Give each language its own weight to control its contribution to the final classification result
- Keep only the top 80% of word features, ranked by the chi-squared statistic
- Classify using gradient boosting, an ensemble of decision trees (https://github.com/dmlc/xgboost, regarded as state of the art for many machine learning tasks, judging by Kaggle competitions)
- Gradient boosting for the curious (lots of math): https://en.wikipedia.org/wiki/Gradient_boosting
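The first three steps of the pipeline can be sketched in plain Python: one TF-IDF model per language, a weight per language, and concatenation into a single feature vector per hotel. This is a minimal sketch, not the production pipeline. The toy corpus, the weights and the smoothed IDF formula are assumptions, and the chi-squared selection and gradient boosting steps are left out.

```python
import math
from collections import Counter

def fit_idf(docs):
    """Smoothed inverse document frequency per term, from tokenized docs."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1.0
            for term, df in doc_freq.items()}

def tfidf_vector(doc, idf, vocab):
    """Dense TF-IDF vector for one tokenized doc, in `vocab` order."""
    counts = Counter(doc)  # missing terms count as 0
    return [counts[term] * idf[term] for term in vocab]

# Hypothetical toy corpus: one token list per hotel and language.
hotels = {
    "hotel_a": {"en": ["beach", "pool", "beach"], "de": ["strand", "pool"]},
    "hotel_b": {"en": ["ski", "lift", "pool"], "de": ["ski", "piste"]},
}
# Hypothetical weights controlling each language's contribution.
weights = {"en": 1.0, "de": 0.5}

features = {name: [] for name in hotels}
for lang in sorted(weights):  # fixed language order: "de", then "en"
    docs = [hotels[name][lang] for name in hotels]
    idf = fit_idf(docs)
    vocab = sorted(idf)
    for name in hotels:
        vec = tfidf_vector(hotels[name][lang], idf, vocab)
        # Scale by the language weight, then append to the combined vector.
        features[name].extend(weights[lang] * value for value in vec)

print(len(features["hotel_a"]), features["hotel_a"])
```

The combined vectors would then go through feature selection and into the classifier.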


Workflow Management & Scaling Up


Spark

● Framework for distributed data processing
  a. Load data into a special collection called an “RDD”
  b. Apply operations to it, e.g. .map, .reduceByKey …
  c. Spark distributes the work across a cluster automatically
● Native to Scala, but has a great Python API
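The map / reduceByKey pattern can be emulated in plain Python to show what each step does to the data. This is not Spark code and nothing here is distributed: `reduce_by_key` is a hypothetical helper and the reviews are made up.

```python
from collections import defaultdict
from functools import reduce

# Made-up input; in Spark this would be loaded into an RDD.
reviews = ["nice room", "great view", "nice view"]

# "map" step: emit a (word, 1) pair for every word in every review.
# (One review yields several pairs, so in Spark this is a flatMap.)
pairs = [(word, 1) for review in reviews for word in review.split()]

def reduce_by_key(pairs, func):
    """Group values by key, then fold each group with `func`."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}

counts = reduce_by_key(pairs, lambda a, b: a + b)
print(counts)
```

With PySpark the same count would be written roughly as `sc.parallelize(reviews).flatMap(str.split).map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)`, and Spark would decide how to split the work across the cluster.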


Luigi

● Build complex pipelines of batch jobs
● Example:
  a. First, crawl new hotel reviews of the day
  b. Then, analyze text
  c. Store results in DB
● Helps you with parallelism and error recovery
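The example pipeline can be sketched as tasks that declare what they depend on. The code below is a plain-Python stand-in that only imitates the idea and is not Luigi's API: the `Task` base class, the `build` function and the in-memory `done` set are assumptions for illustration. Real Luigi tasks subclass `luigi.Task` and define `requires()`, `output()` and `run()`.

```python
class Task:
    """Minimal stand-in for a batch job with dependencies (not Luigi)."""
    done = set()  # names of finished tasks, shared by all tasks

    def requires(self):
        return []  # tasks that must finish before this one

    def run(self):
        raise NotImplementedError

    def complete(self):
        return type(self).__name__ in Task.done

def build(task, log):
    """Run `task` after its dependencies, skipping completed work."""
    if task.complete():
        return
    for dependency in task.requires():
        build(dependency, log)
    task.run()
    Task.done.add(type(task).__name__)
    log.append(type(task).__name__)

class CrawlReviews(Task):
    def run(self):
        pass  # placeholder: crawl the day's new hotel reviews

class AnalyzeText(Task):
    def requires(self):
        return [CrawlReviews()]

    def run(self):
        pass  # placeholder: analyze the crawled text

class StoreResults(Task):
    def requires(self):
        return [AnalyzeText()]

    def run(self):
        pass  # placeholder: write results to the DB

log = []
build(StoreResults(), log)
print(log)
build(StoreResults(), log)  # nothing re-runs: every task is complete
print(log)
```

Skipping completed tasks is what makes recovery cheap: after a failure, a re-run picks up at the first task that has not finished.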


Workflows at TrustYou


https://www.trustyou.com/job-department/engineering

If you have any questions: [email protected]

