
Post on 21-Jun-2020


NLP in Practice… and at Scale

Ivan Bilan, Data Engineer

Ivan Bilan (LinkedIn)
● B. Sc. and currently M. Sc. at CIS
● Honors Degree in Technology Management at CDTM
● Data Engineer at TrustYou
● Previously Data Engineer at MobileX AG and IT at LMU
● External Consultant at Adidas, Cloudeo AG and Paterva/Maltego

Research interests:

● Author Profiling and Identification
● Automated Summarization
● Relation Extraction

What does TrustYou do?

For every hotel on the planet, provide a summary of traveler reviews.

✓ Excellent hotel!*

✓ Nice building: “Clean, hip & modern, excellent facilities”
✓ Great view: « Vue superbe »
✓ Great for partying: “Nice weekend getaway or for partying”
✗ Solo travelers complain about TVs

nhow Berlin (Full summary)

TrustYou Architecture

(Diagram components: Hadoop Cluster, Crawling, Local Grammars, Machine Learning, Aggregation, Text Generation, API)

3M new reviews per week!

Crawling

Scrapy

● Build your own web crawlers
  ○ Extract data via CSS selectors, XPath, regexes …
  ○ Handles queuing, request parallelism, cookies, throttling …
● Scrapy Example Code
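To make the selector idea concrete, here is a minimal sketch that pulls review text out of a page by CSS class. It deliberately uses only Python's standard-library `HTMLParser` rather than Scrapy, so it shows the extraction step alone, without queuing, parallelism or throttling; in Scrapy the same selection would typically be a single `response.css(...)` call inside a spider. The HTML snippet, the `review` class name, `ReviewExtractor` and `extract_reviews` are all hypothetical.

```python
from html.parser import HTMLParser


class ReviewExtractor(HTMLParser):
    """Collect the text of every <p> element whose class list contains 'review'."""

    def __init__(self):
        super().__init__()
        self.reviews = []
        self._in_review = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "p" and "review" in classes:
            self._in_review = True
            self._buffer = []

    def handle_data(self, data):
        # Text inside nested tags (e.g. <b>) is collected too.
        if self._in_review:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._in_review:
            self.reviews.append("".join(self._buffer).strip())
            self._in_review = False


def extract_reviews(html):
    """Return the review texts found in an HTML string."""
    extractor = ReviewExtractor()
    extractor.feed(html)
    return extractor.reviews


# Hypothetical page fragment, for illustration only.
SAMPLE_HTML = (
    '<div><p class="review">Great view</p>'
    "<p>Advertisement</p>"
    '<p class="review long">Nice <b>room</b></p></div>'
)
reviews = extract_reviews(SAMPLE_HTML)
```

A real crawler would add what the slide lists on top of this step: request scheduling, cookies and throttling, which is the part Scrapy handles for you.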

Semantic Analysis

● “Nice room”
● “Room wasn’t so great”
● “The air-conditioning was so powerful that we were cold in the room even when it was off.”
● “อาหารรสชาติดี” (Thai: “The food tastes good”)
● “خدمة جیدة” (Arabic: “Good service”)

Semantic Analysis at TrustYou

● At core: rule-based linguistic system built on context-free grammars (CFGs)
● 20 languages
● Classify opinions in 140 categories
● ML mostly in summarization
● Hadoop: scale out CPU
  ○ ~1B opinions in DB

ML @ TrustYou

● gensim doc2vec model to create hotel embeddings

● TF-IDF vectors
● Combined with geographical data

Creating Training Sets

● For each hotel type, we need to create reliable training sets
● Based on review content, amenities and (when possible) geographical information
● OpenStreetMap project: contains geo information of interest for several categories

Examples:

- Coordinates of coastlines
- Highways
- Ski lifts
- Tourist attractions
...
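One way geographical data like this could feed a training set is a distance rule: a hotel very close to a coastline point becomes a candidate "beach hotel" example. The sketch below assumes this approach; it is not taken from the slides. The haversine great-circle formula is standard, but the function names, the 0.5 km threshold and the coordinates are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = math.radians(lat2 - lat1)
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def distance_to_nearest_km(hotel, points):
    """Distance from a hotel (lat, lon) to the closest point of interest."""
    return min(haversine_km(hotel[0], hotel[1], p[0], p[1]) for p in points)


def is_beach_candidate(hotel, coastline_points, max_km=0.5):
    """Hypothetical labelling rule: close enough to the coast to be a beach hotel."""
    return distance_to_nearest_km(hotel, coastline_points) <= max_km


# Made-up coordinates, for illustration only.
coastline = [(0.0, 0.001), (0.0, 0.002)]
near_hotel = (0.0, 0.0)
far_hotel = (0.0, 1.0)
```

In practice such a geographic rule would be combined with review content and amenities, as the slide notes, since distance alone cannot tell a beach resort from a harbour warehouse district.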

Word2Vec

● Map words to vectors
● “Step up” from bag-of-words model

● ‘Cats’ and ‘dogs’ should be similar – because they occur in similar contexts

>>> m["python"]

array([-0.1351, -0.1040, -0.0823, -0.0287,  0.3709,
       -0.0200, -0.0325,  0.0166,  0.3312, -0.0928,
       -0.0967, -0.0199, -0.2498, -0.4445, -0.0445,
       # ...
       -1.0090, -0.2553,  0.2686, -0.4121,  0.3116,
       -0.0639, -0.3688, -0.0273, -0.1266, -0.2606,
       -0.1549,  0.0023,  0.0084,  0.2169,  0.0060],
      dtype=float32)
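"Similar" for word vectors is usually measured with cosine similarity: words that occur in similar contexts end up pointing in similar directions. The sketch below illustrates the measure on hand-made 4-dimensional vectors. These toy vectors are hypothetical and were not produced by a trained model; real word2vec vectors have a few hundred dimensions, like the one shown above.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors, chosen by hand for illustration.
toy_vectors = {
    "cats": [0.9, 0.8, 0.1, 0.0],
    "dogs": [0.8, 0.9, 0.2, 0.1],
    "python": [0.1, 0.0, 0.9, 0.8],
}

sim_cats_dogs = cosine_similarity(toy_vectors["cats"], toy_vectors["dogs"])
sim_cats_python = cosine_similarity(toy_vectors["cats"], toy_vectors["python"])
```

With these vectors `sim_cats_dogs` is close to 1 while `sim_cats_python` is much lower, which is the behaviour one hopes a trained model shows for words sharing contexts.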

Classification Pipeline

- Have a separate TF-IDF vectorizer or word2vec model for each language
- Combine the TF-IDF vectors before the classification step
- Give different weights to each language to control their contribution to the final classification results
- Select only the top 80% of word features using the chi-squared function
- Classify using Gradient Boosting, an ensemble of decision trees (https://github.com/dmlc/xgboost, state of the art on many machine learning tasks, judging by Kaggle competitions)
- Gradient Boosting for the curious (lots of math): https://en.wikipedia.org/wiki/Gradient_boosting
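The first three steps of the pipeline (one TF-IDF model per language, weighted, then combined into a single feature vector) can be sketched in plain Python as follows. This is an illustration under stated assumptions, not TrustYou's implementation: the toy corpus, the smoothed IDF formula, the language weights and all function names are hypothetical, and the chi-squared feature selection and gradient boosting classifier are left out.

```python
import math
from collections import Counter


def fit_idf(tokenized_docs):
    """Learn a smoothed inverse document frequency for every term in one language."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1.0
            for term, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    """L2-normalised sparse TF-IDF vector (dict) for one tokenized text."""
    counts = Counter(t for t in tokens if t in idf)
    raw = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()} if norm else {}


def combine_languages(vectors_by_lang, weights):
    """Concatenate per-language vectors, scaling each language by its weight."""
    combined = {}
    for lang, vector in vectors_by_lang.items():
        weight = weights.get(lang, 1.0)
        for term, value in vector.items():
            # Key by (language, term) so identical spellings do not collide.
            combined[(lang, term)] = weight * value
    return combined


# Hypothetical toy corpus: tokenized reviews per language.
corpus = {
    "en": [["clean", "room"], ["great", "view"]],
    "de": [["sauberes", "zimmer"], ["toller", "ausblick"]],
}
idf_by_lang = {lang: fit_idf(docs) for lang, docs in corpus.items()}

hotel_reviews = {"en": ["clean", "room"], "de": ["toller", "ausblick"]}
vectors = {lang: tfidf_vector(tokens, idf_by_lang[lang])
           for lang, tokens in hotel_reviews.items()}
features = combine_languages(vectors, weights={"en": 1.0, "de": 0.5})
```

Down-weighting a language shrinks its features relative to the others, which is one way to keep a language with few or noisy reviews from dominating the classifier's input.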

Workflow Management & Scaling Up

Spark

● Framework for distributed data processing
  a. Load data into a special collection called an “RDD”
  b. Apply transformations to it, e.g. .map, .reduceByKey …
  c. Spark distributes the work across a cluster automatically
● Written in Scala, but has a great Python API
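To show what `.map` followed by `.reduceByKey` computes, the sketch below emulates both on a plain Python list in a single process. It is not Spark code and distributes nothing; the two helper functions and the sample opinions are hypothetical stand-ins. With PySpark, the same logic would be expressed as a `map` and a `reduceByKey` call on an RDD, and the cluster would do the grouping.

```python
def map_records(records, fn):
    """Apply fn to every record, mirroring what RDD.map computes."""
    return [fn(record) for record in records]


def reduce_by_key(pairs, fn):
    """Merge all values that share a key using fn, mirroring RDD.reduceByKey."""
    merged = {}
    for key, value in pairs:
        merged[key] = fn(merged[key], value) if key in merged else value
    return merged


# Hypothetical analyzed opinions, for illustration only.
opinions = [
    {"category": "room", "text": "Nice room"},
    {"category": "view", "text": "Great view"},
    {"category": "room", "text": "Room wasn't so great"},
    {"category": "tv", "text": "TV did not work"},
    {"category": "room", "text": "Clean room"},
]

# Step 1: map each opinion to a (category, 1) pair.
pairs = map_records(opinions, lambda opinion: (opinion["category"], 1))
# Step 2: sum the ones per category.
counts = reduce_by_key(pairs, lambda a, b: a + b)
```

Because the reducing function only ever combines two values for the same key, the work can be split across machines and merged afterwards, which is what makes this pattern scale.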

Luigi

● Build complex pipelines of batch jobs
● Example:
  a. First, crawl new hotel reviews of the day
  b. Then, analyze text
  c. Store results in DB
● Helps you with parallelism and error recovery
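The core idea behind such a pipeline is dependency resolution: run each job only after the jobs it depends on, and skip whatever has already completed, so a failed run can resume where it stopped. The sketch below illustrates that idea with a tiny stand-in runner in plain Python. It does not use Luigi's API (where tasks are classes that declare their requirements, outputs and run logic); `run_pipeline`, the task names and the `log` list are hypothetical.

```python
def run_pipeline(target, dependencies, actions, completed):
    """Run `target` after its upstream tasks, skipping tasks already completed.

    Returns the list of tasks executed in this call, in execution order.
    """
    executed = []

    def visit(task):
        if task in completed:
            return  # already done, e.g. in an earlier partially failed run
        for upstream in dependencies.get(task, ()):
            visit(upstream)
        actions[task]()
        completed.add(task)
        executed.append(task)

    visit(target)
    return executed


# The three-step example from the slide, as a dependency graph.
dependencies = {
    "store_results": ["analyze_text"],
    "analyze_text": ["crawl_reviews"],
}
log = []
actions = {
    "crawl_reviews": lambda: log.append("crawled"),
    "analyze_text": lambda: log.append("analyzed"),
    "store_results": lambda: log.append("stored"),
}

# Fresh run: everything executes in dependency order.
first_run = run_pipeline("store_results", dependencies, actions, completed=set())
# Recovery run: crawling already succeeded, so only the rest executes.
recovery_run = run_pipeline("store_results", dependencies, actions,
                            completed={"crawl_reviews"})
```

Asking for the final task is enough; the runner works out the order itself, which is the property that keeps large batch pipelines manageable.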

Workflows at TrustYou

https://www.trustyou.com/job-department/engineering

If you have any questions: ivan.bilan@trustyou.com