Copyright © 2014 KNIME.com AG

Text Mining TripAdvisor Data in search of ethnic Restaurants in San Francisco

May, 2016

Rosaria Silipo

KNIME.com AG, Zurich, Switzerland

www.knime.com

@KNIME

[email protected]

Demo: TripAdvisor Restaurant Data Set (SF)

Demo: TripAdvisor Data (SF Restaurants)

Reviews about Italian and Chinese restaurants in San Francisco:

• Chinese: 272

• Italian: 268

Demo: Goal of this Tutorial

Goal:

• Build a classifier to distinguish between Chinese and Italian restaurants, based on the reviews.

Italian or Chinese Restaurant?

Demo: Final Workflow

Goal:

1. Import Data

2. Enrichment (Tagging)

3. Pre-processing (Filtering, Stemming, …)

4. Transformation: BoW, Frequencies, Document Vector

5. Classification, Clustering

Demo Workflows

0-TripAdvisorCrawling: importing data from web

1-Reading: Importing data from text, Word, PDF, Twitter, XML, …

2-Enrichment POS: String to Document and Word Tagging in Document

3-Preprocessing: Filtering and Stemming

4-Classification-Cuisine: BoW, Frequencies, Document to Document Vector

Other workflows for multi-words, clustering, topic extraction, and reporting.

1.) Reading

Read/Parse textual data

0.) Web Crawler Workflow

Palladian Extension from:

KNIME Community Contributions – Other

Nodes: Html Parser, XPath
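As a rough illustration of the parse-then-XPath idea outside KNIME, the sketch below uses Python's standard library on a hypothetical, well-formed snippet. The markup, the class names, and the `extract_reviews` helper are assumptions made for illustration; they do not reflect TripAdvisor's actual page structure or the behaviour of the Palladian nodes.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for a crawled page.
PAGE = """
<html><body>
  <div class="review"><p>Great pizza and wine.</p></div>
  <div class="review"><p>Best dim sum in Chinatown.</p></div>
  <div class="ad"><p>Book a table now.</p></div>
</body></html>
"""

def extract_reviews(markup):
    """Return the text of every <p> inside a <div class="review">."""
    root = ET.fromstring(markup)
    # ElementTree supports a limited XPath subset, including attribute predicates.
    return [p.text for p in root.findall(".//div[@class='review']/p")]

reviews = extract_reviews(PAGE)
```

Real web pages are rarely well-formed XML, which is why a dedicated HTML parsing step usually comes before the XPath query.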

2.) Enrichment

Enrich documents with semantic information

This assigns a tag to each word:

- Grammar tags (POS)

- Context dependent tags

- Sentiment tags

- Named Entity tags

- Custom tags

3.) Preprocessing

Preprocess documents and filter words

Filter Numbers, Punctuation, Stop Words

to lowercase

Stemming
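The filtering, lowercasing, and stemming steps can be approximated in a few lines of plain Python. This is only a sketch: the stop-word list is a tiny illustrative assumption, and `naive_stem` is a crude suffix stripper standing in for a real stemmer, not the algorithm used by the KNIME nodes.

```python
import re

# Tiny illustrative stop-word list (an assumption, not KNIME's built-in list).
STOP_WORDS = {"the", "a", "an", "and", "was", "were", "is", "in", "of", "we", "at"}

def naive_stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer such as Porter."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, drop numbers, punctuation and stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())  # keeps runs of letters only
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

tokens = preprocess("We ordered 2 pizzas and the noodles were amazing!")
# tokens == ["order", "pizza", "noodl", "amaz"]
```

Stems such as "noodl" are not dictionary words; they only need to be consistent across documents.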

4.) Transformation

Creation of numerical representation of documents

BoW creates the list of words for each document

TF calculates word frequencies (absolute or relative) in each document
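A minimal sketch of both ideas for a single, already preprocessed document follows. The function name `term_frequencies` and the sample tokens are hypothetical; the KNIME nodes operate on Document cells rather than Python lists.

```python
from collections import Counter

def term_frequencies(tokens, relative=False):
    """Absolute counts, or counts divided by the document length."""
    counts = Counter(tokens)
    if not relative:
        return dict(counts)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

doc = ["pizza", "pasta", "pizza", "wine"]
bag_of_words = sorted(set(doc))                 # the distinct words of the document
tf_abs = term_frequencies(doc)                  # {"pizza": 2, "pasta": 1, "wine": 1}
tf_rel = term_frequencies(doc, relative=True)   # "pizza" -> 0.5
```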

Frequencies

Frequency Calculation

• Compute TF value for terms

TFrel(word) = n(word) / N

IDF(word) = log(1 + n(docs) / n(word, docs))

TFrel(word) * IDF(word) is used often

ICF(word) = log(1 + n(cat) / n(word, cat))

• Sort output data by frequency, top words should be most important
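The formulas above can be tried out on a toy corpus. The sketch assumes the natural logarithm (the slide does not state a base) and that every queried word occurs in at least one document; the three sample documents are invented.

```python
import math
from collections import Counter

def idf(word, docs):
    """IDF(word) = log(1 + n(docs) / n(word, docs)); natural log assumed."""
    n_with_word = sum(1 for d in docs if word in d)  # must be > 0
    return math.log(1 + len(docs) / n_with_word)

def tf_idf(doc, docs):
    """TFrel(word) * IDF(word) for every word of one document."""
    counts = Counter(doc)
    n = len(doc)
    return {w: (c / n) * idf(w, docs) for w, c in counts.items()}

docs = [
    ["pizza", "pasta", "wine"],
    ["soup", "noodle", "wine"],
    ["pizza", "pizza", "wine"],
]
scores = tf_idf(docs[2], docs)
# Sort by score, highest first: the top words should be the most important.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Here "wine" appears in every document, so it receives the lowest IDF and ranks below "pizza". ICF follows the same pattern with categories in place of documents.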

Tag Cloud for Italian Restaurants

Tag Cloud for Chinese Restaurants

4.) Transformation

Creation of numerical representation of documents

Transform to Numerical Data Tables

Transformation

• Transform to document vectors

• Extract category (class) value
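As a sketch of what this transformation produces, the snippet below turns labelled token lists into one numeric row per document plus a separate class column. The `document_vectors` helper, the `binary` flag, and the sample data are assumptions for illustration only.

```python
def document_vectors(docs, binary=True):
    """One row per document, one column per vocabulary word."""
    vocabulary = sorted({w for d in docs for w in d})
    vectors = []
    for d in docs:
        if binary:
            present = set(d)
            vectors.append([1 if w in present else 0 for w in vocabulary])
        else:
            vectors.append([d.count(w) for w in vocabulary])
    return vocabulary, vectors

# Hypothetical labelled documents: (category, preprocessed tokens).
labelled = [
    ("Italian", ["pizza", "wine", "pizza"]),
    ("Chinese", ["soup", "noodle"]),
]
labels = [category for category, _ in labelled]          # class column
vocabulary, X = document_vectors([t for _, t in labelled], binary=False)
```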

5.) Classification

Back to classical Data Analytics:

Training of a model (decision tree) and scoring
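To make the training and scoring step concrete without KNIME, here is a deliberately simplified stand-in: a one-level decision tree (a "stump") that picks the single word whose presence best separates the two classes by Gini impurity. A full decision tree learner would recurse on each branch. All names and the toy data are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def train_stump(X, y, vocabulary):
    """Return (impurity, column, word, class_if_present, class_if_absent)."""
    best = None
    for j, word in enumerate(vocabulary):
        with_word = [lab for row, lab in zip(X, y) if row[j] > 0]
        without = [lab for row, lab in zip(X, y) if row[j] == 0]
        if not with_word or not without:
            continue  # this word does not split the data
        score = (len(with_word) * gini(with_word)
                 + len(without) * gini(without)) / len(y)
        if best is None or score < best[0]:
            best = (score, j, word, majority(with_word), majority(without))
    return best  # None if no word splits the data

def predict(stump, row):
    _, j, _, if_present, if_absent = stump
    return if_present if row[j] > 0 else if_absent

vocabulary = ["pizza", "soup", "wine"]
X = [[1, 0, 1], [1, 0, 0], [0, 1, 0], [0, 1, 1]]
y = ["Italian", "Italian", "Chinese", "Chinese"]
stump = train_stump(X, y, vocabulary)
prediction = predict(stump, [1, 0, 0])
```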

Decision Tree Insights

Discriminative Words:

• Italian vs. Chinese

• Pizza, Pasta, Wine vs. Sum, Soup

• Beach vs. Chinatown, China

5.) X-Validation Loop

Applying the Cross-Validation Loop (10 folds)
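The idea behind a 10-fold loop can be sketched as follows: shuffle the row indices, cut them into 10 disjoint folds, train on nine folds, and measure the error on the held-out one. The helper names, the fixed seed, and the majority-class placeholder model are assumptions; the sketch also assumes there are at least as many rows as folds.

```python
import random
import statistics

def k_fold_indices(n, k=10, seed=0):
    """Shuffle 0..n-1 and deal the indices into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(X, y, train, predict, k=10, seed=0):
    """Return the error rate measured on each held-out fold."""
    errors = []
    for test_idx in k_fold_indices(len(y), k, seed):
        held_out = set(test_idx)
        train_X = [X[i] for i in range(len(y)) if i not in held_out]
        train_y = [y[i] for i in range(len(y)) if i not in held_out]
        model = train(train_X, train_y)
        wrong = sum(predict(model, X[i]) != y[i] for i in test_idx)
        errors.append(wrong / len(test_idx))
    return errors

# Toy usage with a majority-class baseline (a placeholder, not a decision tree).
y = ["Chinese"] * 272 + ["Italian"] * 268   # class sizes from the data slide
X = [[0]] * len(y)                          # dummy features
errors = cross_validate(
    X, y,
    train=lambda tx, ty: max(set(ty), key=ty.count),
    predict=lambda model, row: model,
)
mean_error = statistics.mean(errors)
error_variance = statistics.variance(errors)
```

The mean and variance of the per-fold errors are the kind of summary statistics one would inspect after the loop.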

Error Stats from X-Validation Loop

Error Stats from the Statistics Node:

Topic Extraction (Topic Extractor node)

Using “Topic Extractor (Parallel LDA)” node with 10 words to describe each topic

Simple parallel threaded implementation of LDA, following:

• Newman, Asuncion, Smyth and Welling, Distributed Algorithms for Topic Models, JMLR (2009)

with the SparseLDA sampling scheme and data structure from:

• Yao, Mimno and McCallum, Efficient Methods for Topic Model Inference on Streaming Document Collections, KDD (2009)

The node uses the topic modeling library from "MALLET: A Machine Learning for Language Toolkit".

Introducing NGrams (NGram Creator node)

• Multi Word Tagging

– Detection of frequent Ngrams (Ngram Creator)

– Creation of dictionary from Ngrams

– Applying Dictionary Tagger

• Classification with Multi Words (X-Validation Loop):

– Higher average error

– Lower variance
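The multi-word tagging steps listed above can be sketched in plain Python: count adjacent word pairs, keep the frequent ones as a dictionary, then merge matching pairs into single terms. The function names, the `min_count` threshold, and the sample documents are assumptions; the KNIME NGram Creator and Dictionary Tagger nodes may behave differently.

```python
from collections import Counter

def frequent_bigrams(docs, min_count=2):
    """Collect adjacent word pairs that occur at least min_count times."""
    counts = Counter()
    for tokens in docs:
        counts.update(zip(tokens, tokens[1:]))
    return {bigram for bigram, c in counts.items() if c >= min_count}

def tag_multiwords(tokens, dictionary):
    """Greedily merge adjacent token pairs found in the dictionary."""
    out, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in dictionary:
            out.append(" ".join(pair))
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

docs = [
    ["dim", "sum", "was", "great"],
    ["love", "dim", "sum"],
    ["great", "soup"],
]
dictionary = frequent_bigrams(docs)
tagged = tag_multiwords(docs[1], dictionary)
```

After tagging, "dim sum" is treated as one term in the bag of words instead of two unrelated words.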

Thank You

Questions:

• http://tech.knime.org/forum

[email protected]

Follow us:

• Twitter: @KNIME

• LinkedIn: https://www.linkedin.com/groups?gid=2212172

• KNIME Blog: http://www.knime.org/blog

