NLP in Practice … and at Scale...Word2Vec Map words to vectors “Step up” from bag-of-words...


NLP in Practice… and at Scale

Ivan Bilan, Data Engineer

Ivan Bilan (LinkedIn)

● B.Sc. and currently M.Sc. at CIS
● Honors Degree in Technology Management at CDTM
● Data Engineer at TrustYou
● Previously Data Engineer at MobileX AG and IT at LMU
● External Consultant at Adidas, Cloudeo AG and Paterva/Maltego

Research interests:

● Author Profiling and Identification
● Automated Summarization
● Relation Extraction

What does TrustYou do?

For every hotel on the planet, provide a summary of traveler reviews.

nhow Berlin (Full summary)

✓ Excellent hotel!*
✓ Nice building “Clean, hip & modern, excellent facilities”
✓ Great view « Vue superbe »
✓ Great for partying “Nice weekend getaway or for partying”
✗ Solo travelers complain about TVs

TrustYou Architecture

Hadoop Cluster

Local Grammars

Text Generation

Machine Learning

Aggregation

Crawling API

3M new reviews per week!

Crawling

Scrapy

● Build your own web crawlers
○ Extract data via CSS selectors, XPath, regexes …
○ Handles queuing, request parallelism, cookies, throttling …
● Scrapy Example Code
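To make the "extract data via XPath, regexes" bullet concrete, here is a minimal sketch of the extraction step only. It is not Scrapy code and not the example linked from the slide: it uses the standard library's `xml.etree.ElementTree` (which supports a small XPath subset) and `re`, so it runs without extra installs. The page markup, the class names `review`, `author` and `text`, and the "rated N/5" pattern are all hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed review markup. Real pages are rarely valid XML,
# which is one reason to use a crawler framework with an HTML-tolerant parser.
PAGE = """
<html><body>
  <div class="review">
    <span class="author">Anna</span>
    <p class="text">Nice room, rated 4/5.</p>
  </div>
  <div class="review">
    <span class="author">Ben</span>
    <p class="text">Room wasn't so great, rated 2/5.</p>
  </div>
</body></html>
"""

# Assumed rating format, invented for this illustration.
RATING_RE = re.compile(r"rated (\d)/5")


def extract_reviews(page):
    """Return one dict per review node found in the page."""
    root = ET.fromstring(page.strip())
    reviews = []
    # XPath-style query: every <div class="review"> anywhere in the tree.
    for node in root.findall(".//div[@class='review']"):
        author = node.findtext("span[@class='author']")
        text = node.findtext("p[@class='text']")
        # Regex step: pull the numeric rating out of the free text.
        match = RATING_RE.search(text)
        reviews.append({
            "author": author,
            "text": text,
            "rating": int(match.group(1)) if match else None,
        })
    return reviews


reviews = extract_reviews(PAGE)
```

In a real crawler the same two steps (select nodes, then clean fields with regexes) would sit inside a spider's parse callback, with the framework taking care of queuing, parallel requests, cookies and throttling.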

Semantic Analysis

● “Nice room”
● “Room wasn’t so great”
● “The air-conditioning was so powerful that we were cold in the room even when it was off.”
● “อาหารรสชาติดี”
● “خدمة جیدة”

Semantic Analysis at TrustYou

● At core: Rule-based linguistic system (CFGs)
● 20 languages
● Classify opinions in 140 categories
● ML mostly in summarization
● Hadoop: Scale out CPU
○ ~1B opinions in DB
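As an illustration of what a rule-based, grammar-driven opinion classifier can look like, here is a toy sketch. It is not TrustYou's system: the grammar, the lexicon, and the two polarity labels are invented for this example, and the real system is described as covering 20 languages and 140 categories. The recognizer is a plain top-down matcher over a tiny context-free grammar without left recursion.

```python
import re

# Hypothetical toy grammar: each non-terminal maps to alternative productions.
GRAMMAR = {
    "POSITIVE": [["ADJ", "TOPIC"], ["TOPIC", "COPULA", "ADJ"]],
    "NEGATIVE": [
        ["TOPIC", "NEG_COPULA", "INTENS", "ADJ"],
        ["TOPIC", "NEG_COPULA", "ADJ"],
    ],
}

# Hypothetical lexicon: each pre-terminal maps to the words it accepts.
LEXICON = {
    "ADJ": {"nice", "great", "clean"},
    "TOPIC": {"room", "view", "building"},
    "COPULA": {"was", "is"},
    "NEG_COPULA": {"wasn't", "isn't"},
    "INTENS": {"so", "very"},
}


def ends(symbol, tokens, start):
    """Yield every position where `symbol` can end if it starts at `start`."""
    if symbol in LEXICON:
        if start < len(tokens) and tokens[start] in LEXICON[symbol]:
            yield start + 1
        return
    for production in GRAMMAR[symbol]:
        positions = {start}
        # Thread the set of reachable positions through each part in turn.
        for part in production:
            positions = {e for p in positions for e in ends(part, tokens, p)}
        yield from positions


def classify(sentence):
    """Return 'positive', 'negative', or None if no rule spans the sentence."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    for label in ("POSITIVE", "NEGATIVE"):
        # A rule matches only if it consumes the whole token sequence.
        if len(tokens) in set(ends(label, tokens, 0)):
            return label.lower()
    return None


print(classify("Nice room"))             # positive
print(classify("Room wasn't so great"))  # negative
```

A sentence like the air-conditioning example from the previous slide matches neither rule and returns `None`, which hints at why broad coverage takes many hand-written rules per language.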

ML @ TrustYou

● gensim doc2vec model to create hotel embeddings
● TF-IDF vectors
● Combined with geographical
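For readers unfamiliar with TF-IDF vectors, the following is a minimal standard-library sketch of one common weighting: term frequency normalized by document length, multiplied by `log(N / df)`. The slides do not say which variant or library is used for this step, so the formula choice, the whitespace tokenizer, and the sample review texts are assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one sparse {term: weight} dict per input document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # tf = relative frequency in this document, idf = log(N / df).
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical review snippets, one "document" per hotel.
docs = ["great view great pool", "great beach", "quiet room"]
vectors = tfidf_vectors(docs)
```

With this weighting, a term that occurs in every document gets weight 0, while terms concentrated in a single hotel's reviews dominate that hotel's vector.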

Creating Training Sets

● For each hotel type, we need to create reliable training sets
● Based on review content, amenities and (when possible) geographical information
● OpenStreetMap project: contains geo information of interest for several

