Copyright © 2014 KNIME.com AG
Text Mining TripAdvisor Data in Search of Ethnic Restaurants in San Francisco
May, 2016
Rosaria Silipo
KNIME.com AG, Zurich, Switzerland
www.knime.com
@KNIME
Demo: TripAdvisor Restaurant Data Set (SF)
Demo: TripAdvisor Data (SF Restaurants)
Reviews about Italian and Chinese restaurants in San Francisco:
• Chinese: 272
• Italian: 268
Demo: Goal of this Tutorial
Goal:
• Build a classifier to distinguish between Chinese and Italian restaurants, based on the reviews.
Italian or Chinese Restaurant?
Demo: Final Workflow
Goal:
1. Import Data
2. Enrichment (Tagging)
3. Pre-processing (Filtering, Stemming, …)
4. Transformation (BoW, Frequencies, Document Vector)
5. Classification, Clustering
Demo Workflows
0-TripAdvisorCrawling: importing data from web
1-Reading: Importing data from text, Word, PDF, Twitter, XML, …
2-Enrichment POS: String to Document and Word Tagging in Document
3-Preprocessing: Filtering and Stemming
4-Classification-Cuisine: BoW, Frequencies, Document to Document Vector
Other workflows for multi-words, clustering, topic extraction, and reporting.
1.) Reading
Read/Parse textual data
0.) Web Crawler Workflow
Palladian Extension from:
KNIME Community Contributions – Other
Html Parser, XPath
2.) Enrichment
Enrich documents with semantic information
This assigns a tag to each word:
• Grammar tags (POS)
• Context dependent tags
• Sentiment tags
• Named Entity tags
• Custom tags
3.) Preprocessing
Preprocess documents and filter words
Filter Numbers, Punctuation, Stop Words
to lowercase
Stemming
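The preprocessing steps above can be sketched in a few lines of plain Python. This is only an illustration, not what the KNIME nodes do internally: the stop-word list and the `naive_stem` helper are hypothetical stand-ins, and lowercasing is applied first here for simplicity.

```python
import re

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "and", "was", "is", "in", "of", "to", "we", "for"}

def naive_stem(word):
    """Strip a few common English suffixes (a crude stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, drop numbers, punctuation and stop words, then stem."""
    # Keeping only runs of letters removes numbers and punctuation in one step.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("We ordered 2 pizzas and the pasta was amazing!"))
```

Stemming maps inflected forms to a shared stem, so "pizzas" and "pizza" end up as the same term in later steps.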
4.) Transformation
Creation of numerical representation of documents
BoW creates the list of words for each document
TF calculates word frequencies (absolute or relative) in each document
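As a rough illustration of the bag-of-words idea, the sketch below turns already tokenized documents into one row per (document, term) pair with its absolute count. The function name, the row layout, and the sample reviews are assumptions made for this example, not the output format of the KNIME node.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (document id, term, absolute count) rows, one per distinct term."""
    rows = []
    for doc_id, tokens in docs.items():
        for term, count in sorted(Counter(tokens).items()):
            rows.append((doc_id, term, count))
    return rows

# Hypothetical tokenized reviews.
docs = {
    "review_1": ["pizza", "pasta", "pizza", "wine"],
    "review_2": ["soup", "noodle", "soup"],
}
for row in bag_of_words(docs):
    print(row)
```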
Frequencies
Frequency Calculation
• Compute TF value for terms
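TFrel(word) = n(word) / N
IDF(word) = log(1 + n(docs) / n(word, docs))
TFrel(word) * IDF(word) is often used
ICF(word) = log(1 + n(cat) / n(word, cat))
Here n(word) is the count of the word in the document, N the number of words in the document, n(docs) the number of documents, and n(word, docs) the number of documents containing the word; n(cat) and n(word, cat) are the same counts over categories.
• Sort output data by frequency; the top words should be the most important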
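The frequency formulas above can be tried out on a toy corpus. The sketch below assumes the natural logarithm (the slide does not state the base) and uses made-up documents; it computes relative TF, IDF, and their product, then sorts the terms of the first document by score.

```python
import math
from collections import Counter

def tf_rel(tokens):
    """Relative term frequency: n(word) / N for one document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}

def idf(docs):
    """IDF(word) = log(1 + n(docs) / n(word, docs)); natural log is an assumption."""
    n_docs = len(docs)
    doc_freq = Counter(word for tokens in docs for word in set(tokens))
    return {word: math.log(1 + n_docs / df) for word, df in doc_freq.items()}

# Hypothetical tokenized documents.
docs = [
    ["pizza", "pasta", "wine", "pizza"],
    ["soup", "noodle", "wine"],
    ["pasta", "wine"],
]
idf_scores = idf(docs)
tf_scores = tf_rel(docs[0])
tfidf = {word: tf_scores[word] * idf_scores[word] for word in tf_scores}
ranked = sorted(tfidf, key=tfidf.get, reverse=True)
print(ranked)
```

In this toy corpus "wine" appears in every document, so its IDF is the lowest and it ranks last for the first document.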
Tag Cloud for Italian Restaurants
Tag Cloud for Chinese Restaurants
4.) Transformation
Creation of numerical representation of documents
Transform to Numerical Data Tables
Transformation
• Transform to document vectors
• Extract category (class) value
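A minimal sketch of this transformation, assuming each document is a (category, tokens) pair: every distinct term becomes a column, every document a row, and the category is pulled out as the class label. The function name and the sample data are hypothetical.

```python
def to_document_vectors(docs, binary=True):
    """docs: list of (category, tokens). Returns (vocabulary, rows, labels)."""
    vocabulary = sorted({term for _, tokens in docs for term in tokens})
    rows, labels = [], []
    for category, tokens in docs:
        if binary:
            # 1 if the term occurs in the document, 0 otherwise.
            row = [1 if term in tokens else 0 for term in vocabulary]
        else:
            # Absolute term counts instead of presence flags.
            row = [tokens.count(term) for term in vocabulary]
        rows.append(row)
        labels.append(category)
    return vocabulary, rows, labels

# Hypothetical tokenized reviews with their cuisine category.
docs = [
    ("Italian", ["pizza", "wine", "pizza"]),
    ("Chinese", ["soup", "noodle"]),
]
vocabulary, rows, labels = to_document_vectors(docs)
print(vocabulary)
print(rows, labels)
```

The resulting table is purely numerical plus one class column, which is the shape a standard classifier expects.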
5.) Classification
Back to classical Data Analytics:
Training of a model (decision tree) and scoring
Decision Tree Insights
Discriminative Words:
• Italian vs. Chinese
• Pizza, Pasta, Wine vs. Sum, Soup
• Chinatown, China vs. Beach
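One way to see why certain words end up near the top of a decision tree is to score each word by how well its presence separates the two classes. The sketch below uses information gain, one common split criterion; the slides do not say which quality measure the tree in the demo used, and the sample reviews are made up.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(docs, word):
    """docs: list of (label, set_of_words). Gain from splitting on presence of word."""
    labels = [label for label, _ in docs]
    with_word = [label for label, words in docs if word in words]
    without_word = [label for label, words in docs if word not in words]
    remainder = sum(
        len(part) / len(docs) * entropy(part)
        for part in (with_word, without_word)
        if part
    )
    return entropy(labels) - remainder

# Hypothetical reviews reduced to their sets of terms.
docs = [
    ("Italian", {"pizza", "wine"}),
    ("Italian", {"pasta", "wine"}),
    ("Chinese", {"soup", "wine"}),
    ("Chinese", {"soup", "noodle"}),
]
for word in ("soup", "wine"):
    print(word, round(information_gain(docs, word), 3))
```

Here "soup" separates the classes perfectly, while "wine" appears in both cuisines and yields a much smaller gain.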
5.) X-Validation Loop
Applying the Cross-Validation Loop (10 folds)
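The partitioning behind a 10-fold cross-validation loop can be sketched as follows. This assumes simple random partitioning with a fixed seed (the slides do not state the sampling strategy) and uses 540 rows, the sum of the 272 Chinese and 268 Italian reviews; `k_fold_indices` is a hypothetical helper name.

```python
import random

def k_fold_indices(n_rows, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Every k-th shuffled index goes into the same fold.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_indices(540, k=10))
for train, test in splits[:2]:
    print(len(train), len(test))
```

Each row is used for testing exactly once, so the per-fold error rates can be averaged into a single estimate and their spread shows how stable the model is.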
Error Stats from X-Validation Loop
Error Stats from the Statistics Node:
Topic Extraction (Topic Extractor node)
Using “Topic Extractor (Parallel LDA)” node with 10 words to describe each topic
Simple parallel threaded implementation of LDA, from:
• Newman, Asuncion, Smyth and Welling, Distributed Algorithms for Topic Models, JMLR (2009)
• with SparseLDA sampling scheme and data structure from Yao, Mimno and McCallum, Efficient Methods for Topic Model Inference on Streaming Document Collections, KDD (2009)
The node uses the "MALLET: A Machine Learning for Language Toolkit." topic modeling library.
Introducing NGrams (NGram Creator node)
• Multi Word Tagging
– Detection of frequent Ngrams (Ngram Creator)
– Creation of dictionary from Ngrams
– Applying Dictionary Tagger
• Classification with Multi Words (X-Validation Loop):
– Higher average error
– Lower variance
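The multi-word tagging steps listed above can be sketched in plain Python: count n-grams, keep the frequent ones as a dictionary, then merge dictionary matches into single terms. The function names, the `min_count` threshold, and the sample reviews are assumptions for illustration only, not the behavior of the NGram Creator or Dictionary Tagger nodes.

```python
from collections import Counter

def frequent_ngrams(docs, n=2, min_count=2):
    """Return the set of n-grams (joined by spaces) seen at least min_count times."""
    counts = Counter()
    for tokens in docs:
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return {" ".join(gram) for gram, count in counts.items() if count >= min_count}

def tag_multiwords(tokens, dictionary, n=2):
    """Merge dictionary n-grams into single terms, scanning left to right."""
    out, i = [], 0
    while i < len(tokens):
        candidate = " ".join(tokens[i:i + n])
        if len(tokens) - i >= n and candidate in dictionary:
            out.append(candidate)
            i += n
        else:
            out.append(tokens[i])
            i += 1
    return out

# Hypothetical tokenized reviews.
docs = [
    ["great", "dim", "sum", "place"],
    ["dim", "sum", "was", "fresh"],
    ["north", "beach", "pizza"],
]
dictionary = frequent_ngrams(docs)
print(dictionary)
print(tag_multiwords(docs[0], dictionary))
```

After tagging, a multi-word term is treated as a single term in the bag of words and in the document vectors.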
Thank You
Questions:
• http://tech.knime.org/forum
Follow us:
• Twitter: @KNIME
• LinkedIn: https://www.linkedin.com/groups?gid=2212172
• KNIME Blog: http://www.knime.org/blog