Text mining and analytics v6 - p1

Tutorial: Text Data Mining and Analytics
HICSS 44 – January 2011

Dave King

Copyright 2011 JDA Software Group, Inc.

2

Welcome to one of the HICSS SWoTs


3

Difference between a Symposium & a Tutorial at HICSS

Symposium

Tutorial

Audience

M:M

1:M


4

Difference between a Symposium & a Tutorial at HICSS

Wv(t + 1) = Wv(t) + Θ (v, t) α(t)(D(t) - Wv(t))


5

Agenda

• Part 1:
– Growing Interest in Analytics
– Overview of Text Mining and Analysis
– General Text Mining and Analysis Processes

• Part 2:
– Classification and Categorization
– Clustering
– Information Extraction
– Overview of Tools & Packages


6

This is the only note you’ll need to take

Presentation can be found at:

www.slideshare.net


7

Biography: Dave King

• Currently, EVP of Product Development and Management at JDA Software

• 28 years in enterprise package software business

• 15 years as university professor

• 12 years as Co-Chair of the Internet & Digital Economy Track (HICSS)

• Long time interest in various aspects of E-Commerce & Business Intelligence

• Tutorial topic primarily reflects a personal interest and tangentially a job(s) related interest.


8

Personal Experiences with Analytics

• Taught applied statistics and math modeling
• In software R&D
– Optimization in the 80s
– Natural Language Frontends
  • NLI Query & CMU Robotics Lab
– EIS Competitive Analysis
  • Dow Jones and Reuters
  • Verity Topics
  • NewsAlert
– InXight’s Hyperbolic Tree
• Often the audience has been small, sometimes bewildered, and often fleeting


9

If I have seen further it is only by …

plagiarizing the works of others.

http://upload.wikimedia.org/wikipedia/commons/3/39/GodfreyKneller-IsaacNewton-1689.jpg


10

Text Mining & Analysis Resources: Books


11

Text Mining & Analysis Resources: Books


12

Text Mining & Analysis Resources: Web Sites & Sources

• TM/Blog -- blogs.sas.com/text-mining
• TM/Blog -- texttechnologies.com
• TM/Blog -- lingpipe-blog.com
• TM & Analytics/Blog -- intelligent-enterprise.informationweek.com/movabletype/blog/sgrimes.html
• TM/Wiki -- textanalytics.wikidot.com
• TA/General -- social.textanalyticsnews.com
• TA/General -- textanalysis.info
• TA/General -- klariti.com/text-mining/index.shtml
• TM & DM/Online Book -- statsoft.com/textbook/text-mining/
• TM & DM/Tutorial -- alias-i.com/lingpipe/demos/tutorial/db/read-me.html
• TA Tutorial -- slideshare.net/SethGrimes/text-analytics-for-dummies-2010
• TM Tutorial -- www.esi.uem.es/~jmgomez/tutorials/ecmlpkdd02
• TM Tutorial -- scienceforseo.com/tutorials/text-mining-tutorial


13

Text Mining & Analysis Resources: Associated Web Sites & Sources

• DM/Blog -- datamining.typepad.com
• DM/Blog -- abbottanalytics.blogspot.com
• DM/Blog -- bx.businessweek.com/data-mining/blogs
• DM/Blog -- blog.data-miners.com
• DM/Blog -- datawrangling.com
• DM/Blog -- bytemining.com
• DM/Blog -- marktab.net/datamining
• DM/Blog -- dataminingblog.com
• DM/Blog -- timmanns.blogspot.com
• DM/General -- kdnuggets.com
• DM/General -- mydatamine.com
• DM/General -- the-data-mine.com
• DM/Online Book -- chem-eng.utoronto.ca/~datamining/dmc/data_mining_map.htm
• DM/Tutorial -- autonlab.org/tutorials/


14

Initial Question: What search terms are graphed?

http://www.google.com/trends


15

Interest in Analytics: Growing Awareness

Analytics – “Extensive use of data, statistical and quantitative analysis, exploratory and predictive models, and fact-based management to drive decisions and actions…a subset of what has come to be called BI.” (Davenport and Harris, Competing on Analytics, HBS, 2007)

Source: Google Trends


16

Interest in Analytics: Theory and Practice

[Figure: timeline, 1920–2000, of techniques in two fields]

Data Mining: Classification, Decision Trees, Regression Trees, Association Rules, Genetic Algorithms, Self-Organizing Maps, Support Vector Machines, OneR, Clustering, Artificial Neural Nets, Hopfield Net, ZeroR, FP-Growth, K-Means, Expectation Maximization, Multi-layered Perceptron, Eclat, Graph-based Matching

Optimization: Minimax theorem, Linear Program, Simplex Algorithm, Dynamic Program, Maximum principle, Convex Optimization, Polynomial-time algorithms, Convex Multi-criteria, Karush–Kuhn–Tucker conds., Gradient methods, Combinatorics, Non-linear programming, Decision Analysis, Constraint programming, Global optimization, Interior-point methods (LO), Ellipsoid Algorithm

In theory, there is no difference between theory and practice. But, in practice, there is.


17

Interest in Analytics: Popular Titles


18

Interest in Analytics: Potential Reasons for the Interest

• Next generation DSS:
– Progression of DSS->EIS->BI->PM->Analytics

• Increasing volumes of data requiring new approaches or modifications in existing approaches

• Focus on CRM and Supply Chains

• …

• General belief that more sophisticated analysis is required to compete in today’s environments …


19

Interest in Text Mining & Analytics: An old adage

"Why did you want to climb Mount Everest?" (1923 interview)

His reply: "Because it's there."

George Mallory

http://en.wikipedia.org/wiki/File:GeorgeMallory.jpg


20

Interest in Text Mining & Analytics: The 80% Rule

“It's a truism that 80 percent of business-relevant information originates in unstructured form, primarily text… The 80 percent unstructured figure comes from, well, everywhere.”

Source: Seth Grimes, Unstructured Data and the 80 Percent Rule

Unstructured (Textual): 80%

Structured (Databases): 20%


21

Text Mining and Analytics: Definitions

• General: All types of text processing that deal with finding, organizing and analyzing textual (unstructured) information.

• Formal: Utilizing data mining techniques to create new information that is not obvious in a collection of documents (implies that Text Analytics ~ Text Mining ~ Text Data Mining)


22

Text Mining and Analytics: Types of Processing and Techniques

• Clustering. Grouping similar documents without having a predefined set of categories.

• Categorization. Identifying the main themes of a document and then placing the document into a predefined set of categories based on those themes.

• Information extraction. Identification of key phrases and relationships within text by looking for predefined sequences in text via pattern matching.
– Named-Entity Recognition. Seeks to locate and classify atomic elements in text into predefined categories (e.g. names of persons).
– Concept linking and Topic Tracking. Connects related documents by identifying their shared concepts and, by doing so, helps users find information that they perhaps would not have found using traditional search methods.
– Summarization. Summarizing a document to save time on the part of the reader.


23

Text Mining and Analytics: Sample Application Areas

Problem: Medical Discovery
Sources: Scientific and clinical literature; treatment records and reports
Patterns: Extracting associations and temporal relationships (e.g. among symptoms, treatments, …)

Problem: Health Care Case Management
Sources: Patient and treatment records, insurance and regulatory filings
Patterns: Enhancing diagnosis and reducing misdiagnosis, ensuring adequate treatment, promoting quality of service, increasing utilization, reducing fraud, and controlling costs

Problem: Intelligence and Counter-Terrorism
Sources: News and investigative reports, communications intercepts, case files … all in a variety of languages
Patterns: Organizational associations and networks, attack patterns, threat assessment, event prediction…

Problem: Law Enforcement
Sources: Case files, crime and court reports, legal documents, communications...
Patterns: Detection of crime patterns (temporal, geospatial, persons and organizations), support of criminal investigations and prosecutions

Problem: Securities Fraud
Sources: Financial and news reports, corporate filings and documents, trading and other transaction records
Patterns: Detecting insider trading, reporting irregularities, money laundering and illegal transactions, and pricing anomalies

Problem: Customer Relationship Management
Sources: Customer e-mail and letters, call center notes and transcripts
Patterns: Identifying product and service quality issues, assisting in product design and management, and routing contacts

Problem: Sentiment Analysis & Reputation Management
Sources: News reports, Web pages, market analyses, correspondence, ...
Patterns: Extracting concepts including “sentiments” and scoring criteria and weights; and running analyses

Source: Seth Grimes Papers


24

Text Mining: A Common Issue

A great dowry is a bed full of brambles. Outlandish Proverbs, 1640

George Herbert, Welsh Poet & Priest

Structured data mining is a bed of roses compared to unstructured, textual mining, which is a bed of brambles.

http://en.wikipedia.org/wiki/File:George_Herbert.jpg


25

Data Mining: Simple Example (Affinity Analysis)

• Study of attributes or characteristics that “go together.”
– Seek to uncover “association rules” that quantify the relationship between two or more attributes.
– Rules take the form of “If antecedent, then consequent”

• Examples:
– Market basket analysis to determine which items are purchased together (in single transaction)
– Web analysis to determine which sequences of pages users visit

• Major issue is number of potential combinations as the number of attributes increases


26


1. Market Basket Analysis: Items for Sale:

Apples Bananas Cherries Durians

2. Possible Transactions: With one item or a collection of items selected as the Driver or Independent Variable

3. Objective is to empirically determine those groups of items that occur frequently together in a set of transactions, producing a set of rules of the form X -> Y.


27


Standard Market Basket Measures:

Support = N(X & Y) / N(T)
Example: N(A & B) / N(T) = 2/7 = 29%

Confidence = N(X & Y) / N(X)
Example: N(A & B) / N(A) = 2/4 = 50%

Where N(T) = number of transactions, N(X) = number of transactions containing X, and N(X & Y) = number of transactions containing both X and Y
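The two measures above can be sketched in a few lines of Python. The seven baskets below are hypothetical: they are chosen only so that N(T) = 7, N(A) = 4 and N(A & B) = 2, matching the worked example, and the function names are illustrative.

```python
# Hypothetical baskets (A=Apples, B=Bananas, C=Cherries, D=Durians), chosen
# so that N(T)=7, N(A)=4 and N(A & B)=2 as in the worked example.
TRANSACTIONS = [
    {"A", "B", "C", "D"},
    {"A", "C", "D"},
    {"B", "C", "D"},
    {"A", "B"},
    {"A", "D"},
    {"B", "D"},
    {"B", "D"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    needed = set(itemset)
    return sum(1 for basket in transactions if needed <= basket)


def support(x, y, transactions):
    """Support of the rule X -> Y: N(X & Y) / N(T)."""
    return count(set(x) | set(y), transactions) / len(transactions)


def confidence(x, y, transactions):
    """Confidence of the rule X -> Y: N(X & Y) / N(X)."""
    return count(set(x) | set(y), transactions) / count(x, transactions)


for x, y in [(("A",), ("B",)), (("B",), ("D",)), (("C",), ("D",))]:
    s = support(x, y, TRANSACTIONS)
    c = confidence(x, y, TRANSACTIONS)
    print(f"{x} -> {y}: support={s:.0%} confidence={c:.0%}")
```

Counting by set inclusion keeps the sketch short; a real market basket tool would prune candidate itemsets (for example with Apriori) rather than scan every transaction for every rule.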


28

No  X      Y      n(X U Y)  N  % Support  n(X)  Confidence  Is in Rules?
 1  A      B      2         7  29%        4     50%
 2  A      C      2         7  29%        4     50%
 3  A      D      3         7  43%        4     75%
 4  A      B C    1         7  14%        4     25%
 5  A      C D    2         7  29%        4     50%
 6  A      B D    1         7  14%        4     25%
 7  A      B C D  1         7  14%        4     25%
 8  B      A      2         7  29%        5     40%
 9  B      C      2         7  29%        5     40%
10  B      D      4         7  57%        5     80%
11  B      A C    1         7  14%        5     20%
12  B      C D    2         7  29%        5     40%
13  B      A D    1         7  14%        5     20%
14  B      A C D  1         7  14%        5     20%
15  C      A      2         7  29%        3     67%
16  C      B      2         7  29%        3     67%
17  C      D      3         7  43%        3     100%
18  C      A B    1         7  14%        3     33%
19  C      B D    2         7  29%        3     67%
20  C      A D    2         7  29%        3     67%
21  C      A B D  1         7  14%        3     33%
22  D      A      3         7  43%        6     50%
23  D      B      4         7  57%        6     67%
24  D      C      3         7  43%        6     50%
25  D      A B    1         7  14%        6     17%
26  D      B C    2         7  29%        6     33%
27  D      A C    2         7  29%        6     33%
28  D      A B C  1         7  14%        6     17%
29  A B    C      1         7  14%        9     11%
30  A B    D      1         7  14%        9     11%
31  A B    C D    1         7  14%        9     11%
32  A C    B      1         7  14%        7     14%
33  A C    D      2         7  29%        7     29%
34  A C    B D    1         7  14%        7     14%
35  A D    B      1         7  14%        10    10%
36  A D    C      2         7  29%        10    20%
37  A D    B C    1         7  14%        10    10%
38  B C    A      1         7  14%        8     13%
39  B C    D      2         7  29%        8     25%
40  B C    A D    1         7  14%        8     13%
41  B D    A      1         7  14%        11    9%
42  B D    C      2         7  29%        11    18%
43  B D    A C    1         7  14%        11    9%
44  C D    A      2         7  29%        9     22%
45  C D    B      2         7  29%        9     22%
46  C D    A B    1         7  14%        9     11%
47  A B C  D      1         7  14%        12    8%
48  A B D  C      1         7  14%        15    7%
49  A C D  B      1         7  14%        13    8%
50  B C D  A      1         7  14%        14    7%



29

Data Mining: General Data Assumptions

• Requires structured data (numbers and categories well-defined)

• Transformed by data preparation or collected with a prior design in mind

• Typically housed and organized in a relational database, data mart or data warehouse


30

Data Mining: Simple Example

But, what if the baskets were described in the following manner:

– Jane bought a handful of maraschinos and a couple of granny smiths.

– Harold purchased a bag of appls and 2 bananas.

– Bill paid for a pound of cherries but decided not to buy the three durians because of their odor.

How could we automate the analysis?


31

Data Mining: CRISP-DM

Cross-Industry Standard Process for Data Mining

Phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment

Data Preparation (Real-World Data to Well-Formed Data): Data Consolidation, Data Transformation, Data Cleaning, Data Reduction


32

Text Mining: CRISP-Like Processes

Phases: Business Understanding, Document Understanding, Document Preparation, Modeling, Evaluation, Deployment

Document Preparation (Real-World Text Data / Documents to Term-Doc-Matrix*): Document Consolidation, Establish the Corpus, Corpus Refinement (Token, Stem, Stop…), Feature Selection & Weighting

* - Entity-Relationships


33

Text Mining Process: Establish the Corpus

• First step in textual data preparation is to systematically collect samples of text, i.e. the documents related to the context being studied

• Range of possibilities: Word documents, PDFs, emails, IM chat, Web pages, RSS feeds, blogs, tweets, open-ended surveys, transcripts of helpline calls …

• Convert into organized set of texts – called a corpus – standardized and prepared for the purpose of knowledge discovery.


34

Text Mining Process: Establish the Corpus

• Brown Corpus – first million word corpus compiled in 60s at Brown U., 500 samples across 15 genres, each ~2000 words with POS tags

• Linguistic Data Consortium Treebanks – collections of manually tagged and parsed sentences (tree structures) from a variety of sources (includes well-known Penn Treebank collection)

• Reuters-21578, RCV1 & RCV2 -- collections of (1000s of) Reuters English & multi-lingual news stories classified into topics and grouped into training & test sets

• Pang & Lee’s Sentiment Analysis – 1000 positive and 1000 negative movie reviews

• MEDLINE – An extensive collection of articles and abstracts (18M+) used in a variety of biomedical and linguistic text mining applications

• WordNet® -- large lexical database of English grouped into sets of cognitive synonyms (synsets) and interlinked by means of conceptual-semantic and lexical relations.

• Google Ngram -- 500 billion words from 5.2 million books published between 1500 and 2008 in English, French, Spanish, German, Russian, and Chinese.


35

Text Mining Process: Establishing the Corpus (Brown)

ID   File  Genre            Description
Chi  ca16  news             Chicago Tribune: Society Reportage
Chr  cb02  editorial        Christian Science Monitor: Editorials
Tim  cc17  reviews          Time Magazine: Reviews
Und  cd12  religion         Underwood: Probing the Ethics of Realtors
Nor  ce36  hobbies          Norling: Renting a Car in Europe
Bor  cf25  lore             Boroff: Jewish Teenage Culture
Rei  cg22  belles_lettres   Reiner: Coping with Runaway Technology
US   ch15  government       US Office of Civil and Defence Mobilization: The Family Fallout Shelter
Mos  cj19  learned          Mosteller: Probability with Statistical Applications
W.E  ck04  fiction          W.E.B. Du Bois: Worlds of Color
Hit  cl13  mystery          Hitchens: Footsteps in the Night
Hei  cm01  science_fiction  Heinlein: Stranger in a Strange Land
Fie  cn15  adventure        Field: Rattlesnake Ridge
Cal  cp12  romance          Callaghan: A Passion in Rome
Thu  cr06  humor            Thurber: The Future, If Any, of Comedy

Sample Tagged Entry

The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj primary/nn election/nn produced/vbd ``/`` no/at evidence/nn ''/'' that/cs any/dti irregularities/nns took/vbd place/nn ./.
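A word/tag entry in this format can be split into (word, tag) pairs with plain string handling. The sketch below uses only the Python standard library; the function name is illustrative, and reading every tag that starts with "nn" as a common noun is a simplifying assumption.

```python
# The tagged sentence from the Brown Corpus sample on this slide.
TAGGED = (
    "The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd "
    "Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj "
    "primary/nn election/nn produced/vbd ``/`` no/at evidence/nn ''/'' "
    "that/cs any/dti irregularities/nns took/vbd place/nn ./."
)


def parse_tagged(text):
    """Split each 'word/tag' item on its LAST slash and return (word, tag) pairs."""
    pairs = []
    for item in text.split():
        word, _, tag = item.rpartition("/")
        pairs.append((word, tag))
    return pairs


pairs = parse_tagged(TAGGED)
# Simplifying assumption: any tag beginning with "nn" marks a common noun.
nouns = [word for word, tag in pairs if tag.startswith("nn")]
print(len(pairs), "tagged tokens")
print("common nouns:", nouns)
```

Splitting on the last slash rather than the first keeps tokens that themselves contain a slash intact.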


36

Text Mining Process: Establishing the Corpus (Penn Treebank)

Raw

.START
Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.

Tagged

[ Pierre/NNP Vinken/NNP ] ,/, [ 61/CD years/NNS ] old/JJ ,/, will/MD join/VB [ the/DT board/NN ] as/IN [ a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ] ./.

Parsed

( (S (NP-SBJ (NP Pierre Vinken) , (ADJP (NP 61 years) old) ,)
     (VP will (VP join (NP the board)
                       (PP-CLR as (NP a nonexecutive director))
                       (NP-TMP Nov. 29)))
     .))


37

Text Mining Process: Establishing the Corpus (Reuters)

ASIAN EXPORTERS FEAR DAMAGE FROM U.S.-JAPAN RIFT Mounting trade friction between the U.S. And Japan has raised fears among many of Asia's exporting nations that the row could inflict far-reaching economic damage, businessmen and officials said. They told Reuter correspondents in Asian capitals a U.S. Move against Japan might boost protectionist sentiment in the U.S. And lead to curbs on American imports of their products. But some exporters said that while the conflict would hurt them in the long-run, in the short-term Tokyo's loss might be their gain. The U.S. Has said it will impose 300 mln dlrs of tariffs on imports of Japanese electronics goods on April 17, in retaliation for Japan's alleged failure to stick to a pact not to sell semiconductors on world markets at below cost. Unofficial Japanese estimates put the impact of the tariffs at 10 billion dlrs and spokesmen for major electronics firms said they would virtually halt exports of products hit by the new taxes.

14826

File            Category
test/14826      trade
test/14828      grain
test/14829      nat-gas crude
test/14832      rubber tin sugar corn rice grain trade
test/14833      palm-oil veg-oil
test/14839      ship
…
training/14805  copper
training/14818  ship


38

Text Mining Process: Establish the Corpus (Google NGrams)

http://ngrams.googlelabs.com



39


Source: Google NGram


40


• 8,500 new words a year, 70% growth from 1950-2000, 50%+ of English lexicon is "dark matter."

• We’re forgetting our past faster with each passing year (tracking the references to the numerical years)

• Innovations spread faster than ever

• Modern celebrities are younger and more famous than predecessors, but their fame is shorter-lived.

• Culturomics is a powerful tool for automatically identifying censorship and propaganda (e.g. Jewish artist Marc Chagall was mentioned just once in the entire German corpus from 1936 to 1944, even as his prominence in English-language books grew roughly fivefold).

• "Freud" is more deeply engrained in our collective subconscious than "Galileo," "Darwin," or "Einstein."

“Quantitative Analysis of Culture Using Millions of Digitized Books,” Science Magazine, Dec. 18, 2010


41

Text Mining Process: Corpus Refinement

• Tokenization: Parse the text to generate terms. Sophisticated analyzers can also extract phrases from the text.

• Normalize: Convert the terms to lowercase.

• Eliminate stop words: Eliminate terms that appear very often (e.g. the, and, …).

• Stemming: Convert the terms into their stemmed form by removing plurals and different word forms (e.g. achieve, achieves, achieved -> achiev) [note: word about synonyms – WordNet Synset]

Tokenization -> Normalize -> Eliminate Stop Words -> Stemming

Common representation of tokens within and between documents
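The four refinement steps can be chained as a small pipeline. This is a minimal sketch using only the standard library: the stop list is a tiny hypothetical one, and `toy_stem` is a crude suffix stripper written for illustration, not the Porter stemmer or any library's algorithm.

```python
import re

# Tiny hypothetical stop list; real stop lists are much longer.
STOP_WORDS = {"the", "and", "a", "an", "of", "to", "in", "is", "it"}


def tokenize(text):
    """Split text into word tokens made of letters and apostrophes."""
    return re.findall(r"[A-Za-z']+", text)


def normalize(tokens):
    """Convert every token to lowercase."""
    return [token.lower() for token in tokens]


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop list."""
    return [token for token in tokens if token not in STOP_WORDS]


def toy_stem(token):
    """Strip one common suffix, keeping at least three characters of stem."""
    for suffix in ("ing", "ed", "es", "s", "e"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def refine(text):
    """Tokenize -> normalize -> eliminate stop words -> stem."""
    tokens = remove_stop_words(normalize(tokenize(text)))
    return [toy_stem(token) for token in tokens]


print(refine("The team achieves, achieved and wants to achieve the goals"))
# prints ['team', 'achiev', 'achiev', 'want', 'achiev', 'goal']
```

The order matters: normalizing before stop-word removal lets a lowercase stop list catch "The", and stemming last means the stop list can hold ordinary word forms.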


42

Text Mining: Feature Extraction & Weighting

Feature Extraction

Vector Representation: Word, Term or Token/Doc Matrix

“Bag of Words, Terms or Tokens”

Words or Tokens are attributes and documents are examples

          Doc1  Doc2  Doc3  Doc4  …
Token1    1     2     2     4
Token2    4     2     3     0
Token3    1     1     1     0
Token4    1     1     1     2
…
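A token-by-document count matrix of this shape can be built with `collections.Counter`. The three documents below are hypothetical, and whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter


def term_document_matrix(docs):
    """Return (terms, matrix) where matrix[i][j] counts terms[i] in docs[j]."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    terms = sorted(set().union(*counts))
    matrix = [[doc_counts[term] for doc_counts in counts] for term in terms]
    return terms, matrix


# Hypothetical mini-corpus.
docs = ["i feel fine", "i feel i feel good", "good good day"]
terms, matrix = term_document_matrix(docs)

for term, row in zip(terms, matrix):
    print(f"{term:<6}", row)
```

Word order is discarded, which is exactly the "bag of words" simplification: each document becomes one column of counts.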


43

Text Mining: Transforming Frequencies

• Binary Frequencies: 1 for tf > 0; otherwise 0

• Term Frequencies: tf(i,j) / sum of tf(i,j) over all terms i in document j

• Log Frequencies: 1 + log(tf) for tf > 0; otherwise 0

• Normalized Frequencies: Divide each frequency by the square root of the sum of squares of the frequencies within the vector (column)

• Term Frequency–Inverse Document Frequency
– TF * IDF
– Inverse Document Frequency: log(N/(1+D)) where N is the total number of docs and D is the number of docs containing the term
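The weighting schemes can be tried on the small token-by-document counts from the bag-of-words slide. This is a sketch with illustrative function names; it uses the natural logarithm, which is an assumption since the slide does not state a base. With the smoothed formula log(N/(1+D)), a term that occurs in all but one document gets an IDF of zero, and a term that occurs in every document gets a negative IDF.

```python
import math

# Counts from the bag-of-words example (rows = tokens, columns = Doc1..Doc4).
TF = {
    "Token1": [1, 2, 2, 4],
    "Token2": [4, 2, 3, 0],
    "Token3": [1, 1, 1, 0],
    "Token4": [1, 1, 1, 2],
}


def column(matrix, j):
    """All token counts for document j."""
    return [row[j] for row in matrix.values()]


def binary(tf):
    return 1 if tf > 0 else 0


def log_freq(tf):
    return 1 + math.log(tf) if tf > 0 else 0.0


def relative_freq(matrix, token, j):
    """Count of the token divided by the total count in document j."""
    return matrix[token][j] / sum(column(matrix, j))


def normalized(matrix, token, j):
    """Count divided by the Euclidean length of document j's column."""
    length = math.sqrt(sum(v * v for v in column(matrix, j)))
    return matrix[token][j] / length


def idf(matrix, token):
    """log(N / (1 + D)), with D = number of documents containing the token."""
    n_docs = len(matrix[token])
    d = sum(1 for v in matrix[token] if v > 0)
    return math.log(n_docs / (1 + d))


def tf_idf(matrix, token, j):
    return matrix[token][j] * idf(matrix, token)


for token in TF:
    print(token, "idf =", round(idf(TF, token), 3))
```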


44

Text Mining Processes: Simple Overview Example

• Scours the Internet every ten minutes, harvesting human feelings from a large number of blogs (generally identifying and saving between 15,000 and 20,000 feelings per day).

• Scans blog posts for sentences with the phrases "I feel" and "I am feeling", extracts the sentence, and looks to see if it includes one of about 5,000 pre-identified "feelings". If a valid feeling is found, the sentence is said to represent one person who feels that way.

• URL format of many blog posts can be used to extract the username of the post's author which is used to extract the age, gender, country, state, and city of the blog's owner.

• Given the country, state, and city, we can then retrieve the local weather conditions for that city at the time the post was written. We extract and save as much of this information as we can, along with the post.

http://wefeelfine.org/
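The sentence-scanning step described on this slide can be approximated with two regular expressions. This is a sketch under stated assumptions, not the site's actual code: the `FEELINGS` set is a small hypothetical stand-in for the roughly 5,000 pre-identified feelings, and the sentence splitter is deliberately naive.

```python
import re

# Hypothetical stand-in for the ~5,000 pre-identified feelings.
FEELINGS = {"happy", "sad", "asleep", "young", "bad", "super"}

# Trigger phrases named on the slide: "I feel" and "I am feeling".
TRIGGER = re.compile(r"\bi (?:feel|am feeling)\b", re.IGNORECASE)


def split_sentences(post):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", post) if s.strip()]


def extract_feelings(post):
    """Return (sentence, feeling) pairs for sentences that contain a
    trigger phrase and at least one known feeling word."""
    results = []
    for sentence in split_sentences(post):
        if not TRIGGER.search(sentence):
            continue
        for word in re.findall(r"[a-z']+", sentence.lower()):
            if word in FEELINGS:
                results.append((sentence, word))
                break  # keep one feeling per sentence
    return results


post = ("Long day at work. I feel too young to be this tired! "
        "I am feeling sad today. I feel about right.")
for sentence, feeling in extract_feelings(post):
    print(feeling, "<-", sentence)
```

Sentences with a trigger phrase but no listed feeling (such as "I feel about right.") are dropped, matching the "if a valid feeling is found" condition.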


45

Text Mining Processes: Simple Overview Example

• API Query from wefeelfine.org:

http://api.wefeelfine.org:8080/ShowFeelings?display=xml&returnfields=imageid,feeling,sentence,posttime,postdate,posturl,gender,born,country,state,city,lat,lon,conditions&limit=500

• Result from Query:

<?xml version="1.0" ?>
<feelings>
<feeling feeling="super" sentence="i've been feeling super depressed missing my ex" posttime="1292298985" postdate="2010-12-13" posturl="http://screamingnspace.blogspot.com/2010/12/guilty-as-charged.html" gender="0" country="united states" state="south carolina" />

Source: www.wefeelfine.org/api.html
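A response of this form can be read with the standard library's `xml.etree.ElementTree`. The sketch parses a local string rather than calling the API; the record is the one shown on the slide, with a closing `</feelings>` tag added here (an assumption about the full response) so that the fragment is well-formed.

```python
import xml.etree.ElementTree as ET

# Record copied from the slide; the closing </feelings> tag is added so the
# fragment parses, since the slide shows only the start of the response.
SAMPLE = """<?xml version="1.0" ?>
<feelings>
  <feeling feeling="super"
           sentence="i've been feeling super depressed missing my ex"
           posttime="1292298985" postdate="2010-12-13"
           posturl="http://screamingnspace.blogspot.com/2010/12/guilty-as-charged.html"
           gender="0" country="united states" state="south carolina" />
</feelings>"""


def parse_feelings(xml_text):
    """Return one attribute dictionary per <feeling> element."""
    root = ET.fromstring(xml_text)
    return [dict(node.attrib) for node in root.iter("feeling")]


records = parse_feelings(SAMPLE)
for record in records:
    print(record["postdate"], record["state"], "-", record["sentence"])
```

Each record becomes a plain dictionary, so fields missing from a given post (the sample has no city or weather conditions) are simply absent keys.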


46

Text Mining Processes: Simple Overview Example

• i'm blinded to other santas because this was my first but i can't help feeling that there can't be a better one
• i went to mcd with an idiot which is having the same feeling as me now
• i feel asleep
• i feel about little red shoes and mittens
• i feel the sands of time moving so quickly in my life it seems
• i feel too young to have her this beauty across from me
• i feel like im waiting for something profound or inspirational to hit me
• …


47

Text Mining Processes: Simple Overview Example

• Input String (43743 chars; 8245 spaces)
– "i'm blinded to other santas because this was my first but i can't help feeling that there can't be a better one\ni went to mcd with an idiot which is having the same feeling as me now\ni'll feel bad bout it and so\ni feel asleep\n…"

• Tokenize (9019 tokens)
– ['i', "'m", 'blinded', 'to', 'other', 'santas', 'because', 'this', 'was', 'my', 'first', 'but', 'i', 'ca', "n't", 'help', 'feeling', 'that', 'there', 'ca', "n't", 'be', 'a', 'better', 'one', 'i', 'went', 'to', 'mcd', 'with', 'an', 'idiot', 'which', 'is', 'having', 'the', 'same', 'feeling', 'as', 'me', 'now', 'i', "'ll", 'feel', 'bad', 'bout', 'it', 'and', 'so', 'i', 'feel', 'asleep', …]

• Set of Tokens (1816 distinct tokens)
– ["'", "'bout", "'cleaner", "'d", "'http", "'i", "'ll", "'m", "'re", "'s", "'ve", '000', '039', '097', '1', '100', '101', '102', '104', '105', '108', '111', '114', '115', '116', '118', '11am', '12', '121', '15', '16', '180', '1998', '1st', '2', '2013', '23', '2nd', '3', '30', '78', '9', ':', 'a', 'ab', 'abit', 'able', 'about', 'above', 'abs', 'absolute', 'absolutely', 'absorb', 'abuse', 'accomplished', 'accomplishment', 'achieve', 'achieved', 'across', 'acted', 'action', 'activities', 'activity', 'actually', 'acura', …]


48


Corpus                    Word Length   Sentence Length  Lexical Diversity
We Feel Fine              4             18               5
  Counts                  =35318/9019   =9019/500        =9019/1816
Gutenberg Corpus
  Austen-persuasion.txt   4             23               16
  Bible-kjv.txt           4             33               79
  Blake-poems.txt         4             18               5
  Carroll-alice.txt       4             16               12
  Melville-moby.txt       4             24               15
  Milton-paradise.txt     4             52               15
  Shakespeare-caesar.txt  4             12               8
  Shakespeare-hamlet.txt  4             13               7
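The three columns are simple ratios: characters per token, tokens per sentence, and tokens per distinct token. A minimal sketch, assuming the corpus is already split into sentences of tokens (the two-sentence sample is hypothetical):

```python
def corpus_stats(sentences):
    """Compute the three ratios for a corpus given as a list of token lists."""
    tokens = [token for sentence in sentences for token in sentence]
    n_chars = sum(len(token) for token in tokens)
    return {
        "word_length": n_chars / len(tokens),              # chars per token
        "sentence_length": len(tokens) / len(sentences),   # tokens per sentence
        "lexical_diversity": len(tokens) / len(set(tokens)),  # tokens per distinct token
    }


# Hypothetical two-sentence corpus.
sample = [["i", "feel", "fine"], ["i", "feel", "asleep"]]
print(corpus_stats(sample))

# The "Counts" row of the table, rounded to whole numbers.
print(round(35318 / 9019), round(9019 / 500), round(9019 / 1816))
```

Note that "lexical diversity" as defined here rises when words are repeated more often, so a larger value means a less varied vocabulary relative to corpus size.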


49

Text Mining Process: Simple Overview Example

• Eliminate Stopwords (175 words - 'a', 'about', 'above', 'after', …)
– Content (4390 or 49% of tokens not stopwords; 4053 with tokens starting with apostrophes and #s eliminated)
– Set of tokens (1651) with stopwords eliminated: ['ab', 'abit', 'able', 'abs', 'absolute', 'absolutely', 'absorb', 'abuse', 'accomplished', 'accomplishment', 'achieve', 'achieved', 'across', 'acted', 'action', 'activities', 'activity', 'actually', 'acura', 'add', …]

• Stemming
– Stemmed tokens (4053): ['ab', 'abit', 'abl', 'ab', 'absolut', 'absolut', 'absorb', 'abus', 'accomplish', 'accomplish', 'achiev', 'achiev', 'across', 'act', 'action', 'activ', 'activ', 'actual', 'acura', 'add', …]
– Set of tokens in stemmed content (1388): ['ab', 'abit', 'abl', 'absolut', 'absorb', 'abus', 'accomplish', 'achiev', 'across', 'act', 'action', 'activ', 'actual', 'acura', 'ad', 'add', …]


50



51

Text Mining Process: Simple Overview Example

Document-Term Matrix

Sum Feelings     like   know   time realli   want   make better   life   love     go   good   need    way    get  think someth   back     ca   much    one
    Sum           146     42     41     34     30     27     25     23     23     23     22     22     21     21     20     20     19     19     17     17
  4 content1        0      0      0      0      0      0      1      0      0      0      0      0      0      0      0      0      0      2      0      1
  0 content2        0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0
  0 content3        0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0
  2 content4        0      0      0      0      0      0      0      1      0      0      1      0      0      0      0      0      0      0      0      0
  0 content5        0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0
  2 content6        0      1      1      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0
  0 content7        0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0
  0 content8        0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0
  0 content9        0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0
  2 content10       2      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0
…
  0 content490      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0
  1 content491      1      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0
  0 content492      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0
  2 content493      0      0      0      1      0      0      0      0      0      0      0      0      0      0      0      0      1      0      0      0
  0 content494      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0
  3 content495      1      1      0      0      0      0      0      0      1      0      0      0      0      0      0      0      0      0      0      0
  0 content496      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0
  2 content497      0      0      0      1      0      0      0      0      0      0      0      0      0      0      0      0      1      0      0      0
  2 content498      0      0      0      0      0      0      1      0      0      0      0      0      0      0      0      0      0      0      1      0
  0 content499      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0
  0 content500      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0


52

Text Mining Process: Establish the Corpus (Simple Example)

Madness, Murmurs, Montage, Mounds, Metrics, Mobs

http://wefeelfine.org/movements.html


53

Text Mining Processes: Overview Example 2

Question: What is this?

Answer: This is Twitter on steroids.


54

Text Mining Process: Overview Example 2

Twitter Statistics:
• ~106M registered users
• 300K new users per day
• 180 million unique visitors per month
• 75% of traffic from 3rd-party apps
• Average of 55 million tweets a day
• 600 million search queries per day
• 37% use their phone to tweet
• 60% of tweets from 3rd-party apps

Based on 1+B tweets generated by over 20 million Twitter users in 2010 (bio, web site, loc info).

Source: huffingtonpost.com/2010/04/14/twitter-user-statistics-r_n_537992.html


55

Text Mining Process: Overview Example 2

• Each tweet <= 140 characters (avg. 10-15 words/message)

• Heavy presence of non-alpha symbols, abbrevs, misspellings and slang

• Tweets often include retweets (original tweet repeated)

• In spite of this, tweets have proven to be an interesting text mining resource (e.g. see lifeanalytics.blogspot.com & mashable.com/author/dan-zarrella/)


56


• Twitter gets a total of 3 billion requests a day via its API

• API Calls for Public Tweets
– http://search.twitter.com/search.json?q=%3A)+feel+ feeling&rpp=100&page=1
– http://api.twitter.com/1/trends/current.json?exclude=hashtags


57

Text Mining Process: Overview Example 2

u'iso_language_code': u'en',
u'to_user_id_str': None,
u'text': u"RT @EverSoSassy56 <--- I'm sportin' my glasses... I feel all sophisticated and stuff. :-) -- And the operative word is feeling...LOL",
u'from_user_id_str': u'168852471',
u'profile_image_url': u'http://a0.twimg.com/profile_images/1166685224/Jonise_normal.jpg',
u'id': 16300313380130816L,
u'source': u'<a href="http://twidroid.com" rel="nofollow">twidroid</a>',
u'id_str': u'16300313380130816',
u'from_user': u'XXXXXXXXXX',
u'from_user_id': 168852471,
u'to_user_id': None,
u'geo': None,
u'created_at': u'Sun, 19 Dec 2010 01:14:32 +0000',
u'metadata': {u'result_type': u'recent'}


58

Text Mining Process: Establish the Corpus (2nd Example)

Happy Face   Sad Face

Tokens = 14670          Set of Tokens = 2289
avg./Sent = 24          lex. div. = 6.4
Non-Stop words = 10406  Set Non-Stop = 2117
Stems = 5003            Set of Stems = 1052
w/o Feel = 3921         Set w/o Feel = 1051


59

Text Mining Process: Overview Example 2

• “Twitter Sentiment Classification using Distant Supervision”

• Utilizes presence of emoticons ":)" & ":(" to serve as surrogates for classification as positive and negative sentiment statements

• To construct the term-document matrix, relies on a list of positive and negative key words from Twittratr, counting the number of key words that appear in each tweet.

• 180K tweets collected for training purposes between April and June 2009

• 80%+ accuracy in classification
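The labeling idea can be sketched as follows: emoticons supply noisy class labels, and keyword counts supply the features. The two word lists are small hypothetical stand-ins for the Twittratr lists, and the majority-count `classify` function is only an illustrative baseline, not the classifier from the paper.

```python
# Hypothetical stand-ins for the Twittratr positive and negative keyword lists.
POSITIVE_WORDS = {"love", "great", "happy", "awesome"}
NEGATIVE_WORDS = {"hate", "awful", "sad", "terrible"}


def distant_label(tweet):
    """Emoticons as surrogate labels: ':)' -> positive, ':(' -> negative.
    Returns None when neither or both emoticons are present."""
    has_pos, has_neg = ":)" in tweet, ":(" in tweet
    if has_pos == has_neg:
        return None
    return "positive" if has_pos else "negative"


def keyword_features(tweet):
    """Count positive and negative key words in the tweet."""
    words = [w.strip(".,!?:;()").lower() for w in tweet.split()]
    n_pos = sum(w in POSITIVE_WORDS for w in words)
    n_neg = sum(w in NEGATIVE_WORDS for w in words)
    return n_pos, n_neg


def classify(tweet):
    """Baseline: pick the class with more key words; None on a tie."""
    n_pos, n_neg = keyword_features(tweet)
    if n_pos == n_neg:
        return None
    return "positive" if n_pos > n_neg else "negative"


tweets = ["I love this :)", "what an awful, terrible day :(", "just landed"]
for tweet in tweets:
    print(repr(tweet), distant_label(tweet), keyword_features(tweet), classify(tweet))
```

Because the emoticon is used only for labeling, accuracy can be estimated by comparing `classify` against `distant_label` on tweets where both return a class.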


60

Text Mining Processes: Overview Example 2

An area cartogram is a map in which some thematic mapping variable – such as travel time or GNP -- is substituted for land area. The geometry or space of the map is distorted in order to convey the information of this alternate variable.

What is this?


61


• Pulse of the Nation: U.S. Mood throughout the Day Inferred from Twitter

• Analyzed 300M public tweets produced in the US from 9/2006-8/2009 and containing words from a psychological word-rating system (“Affective Norms for English Words”)

• Through a natural language processing algorithm called Sentiment Analysis, each tweet was assigned a mood score based on the number of positive or negative words it contained.

• Calculated the average mood score of all the users living in a state hour by hour which formed the basis of a series of time-varying mood maps.

http://www.ccs.neu.edu/home/amislove/twittermood/

http://en.wikipedia.org/wiki/Sentiment_analysis
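The scoring and aggregation steps can be sketched as below. Several simplifications are assumptions: the word lists are tiny hypothetical stand-ins for the word-rating system, the score is the count of positive words minus the count of negative words, and averages are taken per tweet rather than per user.

```python
from collections import defaultdict

# Hypothetical stand-ins for rated positive and negative words.
POSITIVE = {"happy", "love", "good"}
NEGATIVE = {"sad", "hate", "bad"}


def mood_score(text):
    """Positive-word count minus negative-word count for one tweet."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def hourly_mood(tweets):
    """tweets: iterable of (state, hour, text).
    Returns {(state, hour): average mood score}."""
    totals = defaultdict(lambda: [0, 0])  # (state, hour) -> [score sum, tweet count]
    for state, hour, text in tweets:
        totals[(state, hour)][0] += mood_score(text)
        totals[(state, hour)][1] += 1
    return {key: score / n for key, (score, n) in totals.items()}


# Hypothetical tweets tagged with state and hour of day.
tweets = [
    ("HI", 9, "love this good day"),
    ("HI", 9, "bad traffic"),
    ("CA", 22, "so sad"),
]
print(hourly_mood(tweets))
```

One averaged value per (state, hour) cell is the quantity a time-varying mood map would color, with the cartogram distorting each state's area by a second variable.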


62



63

Text Mining Process: Establish the Corpus (2nd Example)
