Date post: | 19-Feb-2017 |
Upload: | steffen-wenz |
Powered by Python
Summarizing hotel reviews for 100 million travelers
Steffen Wenz, CTO
10,000 hotels use TrustYou Analytics to analyze their guest reviews.
100 million travelers see our data on Google, Hotels.com, Kayak … actually it’s probably more.
Architecture ;-)
Hadoop Cluster (Hortonworks Distribution)
Big Data Python
Machine Learning
NLP
Scraping API
Magic
Love
Hadoop:
… slow & massive
Python on Hadoop:
… possible, but not natural
Let’s try Spark!
$ # how old is the C code in CPython?
$ git clone https://github.com/python/cpython && cd cpython
$ find . -name "*.c" -exec git blame {} \; > blame
$ head blame
dc5dbf61 (Guido van Rossum 1991-02-19 12:39:46 +0000 1)
daadddf7 (Guido van Rossum 1990-10-14 12:07:46 +0000 2) /* List a node on a file */
daadddf7 (Guido van Rossum 1990-10-14 12:07:46 +0000 3)
badc12f6 (Guido van Rossum 1990-12-20 15:06:42 +0000 4) #include "pgenheaders.h"
daadddf7 (Guido van Rossum 1990-10-14 12:07:46 +0000 5) #include "token.h"
daadddf7 (Guido van Rossum 1990-10-14 12:07:46 +0000 6) #include "node.h"
daadddf7 (Guido van Rossum 1990-10-14 12:07:46 +0000 7)
badc12f6 (Guido van Rossum 1990-12-20 15:06:42 +0000 8) /* Forward */
20f6f686 (Tim Peters 2000-07-09 03:09:57 +0000 9) static void list1node(FILE *, node *);
20f6f686 (Tim Peters 2000-07-09 03:09:57 +0000 10) static void listnode(FILE *, node *);
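Let’s try Spark!
import operator as op, re
# sc: SparkContext, connection to cluster
year_re = r"(\d{4})-\d{2}-\d{2}"
# count blamed lines per year: extract years, emit (year, 1), sum per key
years_hist = sc.textFile("blame") \
    .flatMap(lambda line: re.findall(year_re, line)) \
    .map(lambda year: (year, 1)) \
    .reduceByKey(op.add)
# collect() triggers the computation and returns the (year, count) pairs
output = years_hist.collect()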
What happened here?
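One way to read the pipeline: flatMap, map and reduceByKey only describe the computation, and nothing runs until collect() asks for the result. The sketch below mimics the three steps in plain Python on a single machine, with no Spark involved. The three sample lines are copied from the blame output above and stand in for the full "blame" file; the real job would see many more lines and years.

```python
import re
from collections import Counter
from itertools import chain

year_re = r"(\d{4})-\d{2}-\d{2}"

# Small sample standing in for the full "blame" file (assumption: illustration only)
lines = [
    'daadddf7 (Guido van Rossum 1990-10-14 12:07:46 +0000 2) /* List a node on a file */',
    'badc12f6 (Guido van Rossum 1990-12-20 15:06:42 +0000 4) #include "pgenheaders.h"',
    '20f6f686 (Tim Peters 2000-07-09 03:09:57 +0000 9) static void list1node(FILE *, node *);',
]

# flatMap: every line yields zero or more year strings, flattened into one stream
years = chain.from_iterable(re.findall(year_re, line) for line in lines)

# map: pair every year with a count of 1
pairs = ((year, 1) for year in years)

# reduceByKey(op.add): add up the counts that share the same key
years_hist = Counter()
for year, count in pairs:
    years_hist[year] += count

# collect: materialize the result as a list of (year, count) pairs
output = sorted(years_hist.items())
print(output)  # [('1990', 2), ('2000', 1)]
```

In Spark the same steps run in parallel over partitions of the file, and reduceByKey merges the per-partition counts across the cluster.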
Grammars & Parsing
Or: Why you should have paid attention in compilers class
Grammars and Parsing
$ less Grammar/Grammar
...
compound_stmt: if_stmt | while_stmt | for_stmt | try_stmt | with_stmt | funcdef | classdef | decorated | async_stmt
async_stmt: ASYNC (funcdef | with_stmt | for_stmt)
if_stmt: 'if' test ':' suite ('elif' test ':' suite)* ['else' ':' suite]
while_stmt: 'while' test ':' suite ['else' ':' suite]
for_stmt: 'for' exprlist 'in' testlist ':' suite ['else' ':' suite]
...
Parsing: Given an input string, determine/guess the grammar production rules that generate it
>>> import nltk
>>> grammar = nltk.CFG.fromstring("""
... OPINION -> NOUN COP ADJ
... OPINION -> ADJ NOUN
... NOUN -> 'hotel' | 'rooms'
... COP -> 'is' | 'are'
... ADJ -> 'great' | 'terrible'
... """)
>>> parser = nltk.ChartParser(grammar)
>>> sent = nltk.word_tokenize("great rooms")
>>> for tree in parser.parse(sent):
...     print(tree)
(OPINION (ADJ great) (NOUN rooms))
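To make "determine the production rules" concrete, here is a minimal sketch of a top-down parser for the same toy grammar, using only the standard library. It is not how nltk.ChartParser works internally; the names GRAMMAR, parse and parse_sentence are made up for this illustration, and the approach would loop forever on a left-recursive grammar (this one has none).

```python
# Toy grammar from the slide: nonterminal -> list of alternative productions
GRAMMAR = {
    "OPINION": [["NOUN", "COP", "ADJ"], ["ADJ", "NOUN"]],
    "NOUN": [["hotel"], ["rooms"]],
    "COP": [["is"], ["are"]],
    "ADJ": [["great"], ["terrible"]],
}


def parse(symbol, tokens, pos=0):
    """Yield (tree, next_pos) for every way `symbol` derives tokens[pos:next_pos]."""
    if symbol not in GRAMMAR:
        # Terminal: it must match the current token exactly
        if pos < len(tokens) and tokens[pos] == symbol:
            yield symbol, pos + 1
        return
    for production in GRAMMAR[symbol]:
        # Each partial parse is (children so far, position reached)
        partials = [([], pos)]
        for part in production:
            partials = [
                (children + [subtree], nxt)
                for children, cur in partials
                for subtree, nxt in parse(part, tokens, cur)
            ]
        for children, end in partials:
            yield (symbol, *children), end


def parse_sentence(text):
    """Return all OPINION trees that cover the whole whitespace-tokenized input."""
    tokens = text.split()
    return [tree for tree, end in parse("OPINION", tokens) if end == len(tokens)]


trees = parse_sentence("great rooms")
print(trees)  # [('OPINION', ('ADJ', 'great'), ('NOUN', 'rooms'))]
```

An input the grammar cannot generate, such as "rooms great", returns an empty list.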
Word2Vec
● Map words to vectors
● “Step up” from bag-of-words model
● ‘Cats’ and ‘dogs’ should be similar - because they occur in similar contexts
>>> m["python"]
array([-0.1351, -0.1040,
-0.0823, -0.0287, 0.3709,
-0.0200, -0.0325, 0.0166,
0.3312, -0.0928, -0.0967,
-0.0199, -0.2498, -0.4445,
-0.0445,
# ...
Fun with Word2Vec
>>> import gensim
>>> # trained from 100k meetup descriptions!
>>> m = gensim.models.Word2Vec.load("data/word2vec")
>>> m.most_similar(positive=["python"])[:3]
[(u'javascript', 0.83), (u'php', 0.82), (u'django', 0.81)]
>>> m.doesnt_match(["python", "c++", "javascript"])
'c++'
>>> m.most_similar(positive=["ladies"])[:3]
[(u'girls', 0.81), (u'mamas', 0.74), (u'gals', 0.73)]
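The scores returned by most_similar are cosine similarities between word vectors. The sketch below illustrates the "similar contexts" idea with plain co-occurrence counts instead of a trained model. It is not the word2vec training algorithm, which learns dense vectors with a shallow neural network. The corpus and the helper names context_vectors and similarity are invented for this example.

```python
import math
from collections import Counter, defaultdict

# Made-up toy corpus (assumption: illustration only, not the meetup data)
corpus = [
    "cats chase mice",
    "dogs chase cats",
    "cats eat food",
    "dogs eat food",
    "python runs code",
    "javascript runs code",
]


def context_vectors(sentences):
    """Map each word to a sparse vector counting the other words in its sentences."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            for j, other in enumerate(words):
                if i != j:
                    vectors[word][other] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


vectors = context_vectors(corpus)


def similarity(w1, w2):
    return cosine(vectors[w1], vectors[w2])


print(round(similarity("cats", "dogs"), 2))    # 0.71
print(round(similarity("cats", "python"), 2))  # 0.0
```

"cats" and "dogs" share the context words "chase", "eat" and "food", so they score high. "cats" and "python" share none, so they score zero.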
ML @ TrustYou
● gensim doc2vec model to create hotel embeddings
● Used - together with other features - for various classifiers
Luigi
● Build complex pipelines of batch jobs
  ○ Dependency resolution
  ○ Parallelism
  ○ Resume failed jobs
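import luigi

# INeedThisTask and AndAlsoThisTask are placeholder tasks defined elsewhere
class MyTask(luigi.Task):
    def output(self):
        # LocalTarget is a concrete Target; luigi.Target itself is abstract
        return luigi.LocalTarget("/to/make/this/file")

    def requires(self):
        return [
            INeedThisTask(),
            AndAlsoThisTask("with_some arg")
        ]

    def run(self):
        # ... then ...
        # I do this to make it!
        with self.output().open("w") as f:
            f.write("done\n")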
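The requires/output/run contract is enough to get dependency resolution and resumable pipelines. The following is a toy, single-process sketch of that idea using only the standard library. It is not Luigi's scheduler, and the names Task, build, Extract and Summarize are hypothetical stand-ins, not part of the Luigi API.

```python
import os
import tempfile


class Task:
    """Toy stand-in for a batch job (not luigi.Task)."""

    def __init__(self, workdir):
        self.workdir = workdir

    def requires(self):
        return []

    def output(self):
        # One output file per task, named after the class
        return os.path.join(self.workdir, type(self).__name__ + ".txt")

    def run(self):
        raise NotImplementedError


def build(task, log):
    """Run `task` after its dependencies, skipping any task whose output exists."""
    if os.path.exists(task.output()):
        return  # already complete: this is what makes a pipeline resumable
    for dependency in task.requires():
        build(dependency, log)
    task.run()
    log.append(type(task).__name__)


class Extract(Task):
    def run(self):
        with open(self.output(), "w") as f:
            f.write("great rooms\n")


class Summarize(Task):
    def requires(self):
        return [Extract(self.workdir)]

    def run(self):
        source = self.requires()[0].output()
        with open(source) as src, open(self.output(), "w") as dst:
            dst.write(src.read().upper())


workdir = tempfile.mkdtemp()
first_run, second_run = [], []
build(Summarize(workdir), first_run)
build(Summarize(workdir), second_run)
print(first_run)   # ['Extract', 'Summarize']
print(second_run)  # []
```

The second build call does nothing because the output file already exists. After a failure, the same rule means only the tasks with missing outputs run again.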