Page 1

Text Mining Lab
Adrian and Shawndra

December 4, 2012 (version 1)

Page 2

Outline

1. Download and Install Python

2. Download and Install NLTK

3. Download and Unzip Project Files

4. Simple Naïve Bayes Classifier

5. Demo: collecting tweets --> evaluation

6. Other things you can do …

Page 3

Download and Install Python

http://www.python.org/getit/ (latest version 2.7.3)

http://pypi.python.org/pypi/setuptools (install the setup tools for 2.7)

Page 4

Download and Install NLTK

Install PyYAML: http://pyyaml.org/wiki/PyYAML

Install NUMPY: http://numpy.scipy.org/

Install NLTK: http://pypi.python.org/pypi/nltk

Install MatPlotLib: http://matplotlib.org/

Page 5

Test Installation

Run python

At the prompt type

>> import nltk

>> import matplotlib

Page 6

Downloading Models

>> nltk.download()

Open GUI downloader

Select “Models” tab and download:

maxent_ne_chunker

maxent_treebank_pos_tagger

hmm_treebank_pos_tagger

Select “Corpora” tab and download:

stopwords

Alternatively, select the “Collections” tab, click “all”, and click the button to download everything

Page 7

Getting Started

Unzip project directory (lab1.zip)

Change to the lab1 directory

Open command window in the “lab1” directory

Windows 7 and later – Hold SHIFT; right-click in directory, select “Open command window here”

Unix/Mac – Open terminal; cd PATH/TO/lab1

Type “python” and then <enter> in terminal

>> import text_processing as tp

>> import nltk

Note: text_processing comes from your lab1 folder

Note: You must work from your lab1 directory

Page 8

Downloading Models

>> nltk.download()

Open GUI downloader

Select “Models” tab and download:

maxent_ne_chunker

maxent_treebank_pos_tagger

hmm_treebank_pos_tagger

Select “Corpora” tab and download:

stopwords

Alternatively, select the “Collections” tab, click “all”, and click the button to download everything

Page 9

Simple NB Sentiment Classifier

Page 10

Read in tweets

CALL

>> paths = ['neg_examples.txt', 'pos_examples.txt']

>> documentClasses = ['neg', 'pos']

>> tweetSet = [tp.loadTweetText(p) for p in paths]

SAMPLE OUTPUT

>> len(tweetSet[0]), len(tweetSet[1])

(20000, 40000)

>> tweetSet[1][50]

"@davidarchie hey david !me and my bestfriend are forming a band .could you give us any advice please? it's means a lot for us :)"

Page 11

Read in tweets (Code)

Reads in a file and treats each line as a tweet, lower-casing the text
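
The code screenshot did not survive the transcript; a minimal sketch of what tp.loadTweetText might look like, based on the description above (the function body is an assumption, not the lab's actual code):

def loadTweetText(path):
    # Each line of the file is one tweet; lower-case the text as we read it.
    with open(path) as f:
        return [line.strip().lower() for line in f]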

Page 12

Tokenize

CALL

>> tokenSet = [tp.tokenizeTweets(tweets) for tweets in tweetSet]

SAMPLE OUTPUT

>> len(tokenSet[1][50])

31

>> tokenSet[1][50]

['@', 'davidarchie', 'hey', 'david', '!', 'me', 'and', 'my', 'bestfriend', 'are', 'forming', 'a', 'band', '.', 'could', 'you', 'give', 'us', 'any', 'advice', 'please', '?', 'it', "'", 's', 'means', 'a', 'lot', 'for', 'us', ':)']

Page 13

Tokenize (Code)

For each tweet, splits the text on whitespace, splitting punctuation off into separate tokens.

(nltk.WordPunctTokenizer)
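
A sketch of how tp.tokenizeTweets could wrap the tokenizer named on the slide (the wrapper itself is an assumption):

import nltk

def tokenizeTweets(tweets):
    # WordPunctTokenizer splits on word characters and breaks runs of
    # punctuation into their own tokens, e.g. "it's" -> 'it', "'", 's'.
    tokenizer = nltk.WordPunctTokenizer()
    return [tokenizer.tokenize(tweet) for tweet in tweets]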

Page 14

Filter out Non-English

CALL

>> englishSet = [tp.filterOnlyEnglish(tokens) for tokens in tokenSet]

SAMPLE OUTPUT

>> len(englishSet[1][50])

22

>> englishSet[1][50]

['hey', 'david', 'me', 'and', 'my', 'are', 'forming', 'a', 'band', 'could', 'you', 'give', 'us', 'any', 'advice', 'please', 'it', 'means', 'a', 'lot', 'for', 'us']

Page 15

Filter out Non-English (Code)

Reads in a dictionary file of English words – “wordsEn.txt” – and only keeps tokens in that dictionary
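
A sketch of tp.filterOnlyEnglish; the dictionary file name comes from the slide, while the function body is guessed from the description:

def filterOnlyEnglish(tokenizedTweets, dictPath='wordsEn.txt'):
    # Keep only tokens that appear in the English word list.
    with open(dictPath) as f:
        english = set(word.strip() for word in f)
    return [[t for t in tokens if t in english]
            for tokens in tokenizedTweets]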

Page 16

Filter out Stopwords

CALL

>> noStopSet = [tp.removeStopwords(tokens, [':)', ':(']) for tokens in englishSet]

SAMPLE OUTPUT

>> len(noStopSet[1][50])

12

>> noStopSet[1][50]

['hey', 'david', 'forming', 'band', 'could', 'give', 'us', 'advice', 'please', 'means', 'lot', 'us']

Page 17

Filter out Stopwords (Code)

Loads the stop-word list and removes any token that appears in it. Additional words can be passed as stop words via the “addtlStopwords” argument
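
A sketch of tp.removeStopwords, assuming it builds on NLTK's English stop-word corpus (downloaded earlier under “Corpora”); the smileys come from the call above:

from nltk.corpus import stopwords

def removeStopwords(tokenizedTweets, addtlStopwords=None):
    # NLTK's stop-word list plus any caller-supplied extras,
    # here the smileys ':)' and ':('.
    stops = set(stopwords.words('english')) | set(addtlStopwords or [])
    return [[t for t in tokens if t not in stops]
            for tokens in tokenizedTweets]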

Page 18

Stem

CALL

>> stemmedSet = [tp.stemTokens(tokens) for tokens in noStopSet]

SAMPLE OUTPUT

>> len(stemmedSet[1][50])

12

>> stemmedSet[1][50]

['hey', 'david', 'form', 'band', 'could', 'give', 'us', 'advic', 'pleas', 'mean', 'lot', 'us']

Page 19

Stem (Code)

Loads a Porter stemmer implementation to remove suffixes from tokens. See http://nltk.org/api/nltk.stem.html for more information on NLTK's stemmers.
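
A sketch of tp.stemTokens using NLTK's Porter stemmer (the wrapper is an assumption; the stemmer is the one the slide names):

import nltk

def stemTokens(tokenizedTweets):
    # The Porter stemmer strips common suffixes:
    # 'forming' -> 'form', 'advice' -> 'advic', 'please' -> 'pleas'.
    stemmer = nltk.PorterStemmer()
    return [[stemmer.stem(t) for t in tokens]
            for tokens in tokenizedTweets]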

Page 20

Make Bags of Words

CALL

>> bagsOfWords = [tp.makeBagOfWords(tokens, documentClass=docClass) for docClass, tokens in zip(documentClasses, stemmedSet)]

SAMPLE OUTPUT

>> bagsOfWords[1][50][0].items()

[('us', 2), ('advic', 1), ('band', 1), ('could', 1), ('david', 1), ('form', 1), ('give', 1), ('hey', 1), ('lot', 1), ('mean', 1), ('pleas', 1)]

Page 21

Make Bags of Words (Code)

For each tweet, constructs a bag of words (FreqDist) that counts the number of times each token occurs. Setting the bigrams argument to True will also include bigrams in the bags.
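
A sketch of tp.makeBagOfWords. The (bag, label) pair format matches the sample output above (indexing [0] of each pair gives the FreqDist) and is what NLTK classifiers train on; the bigram handling is an assumption from the description:

import nltk

def makeBagOfWords(tokenizedTweets, documentClass, bigrams=False):
    # Build one (FreqDist, label) pair per tweet.
    bags = []
    for tokens in tokenizedTweets:
        items = list(tokens)
        if bigrams:
            items += list(nltk.bigrams(tokens))
        bags.append((nltk.FreqDist(items), documentClass))
    return bags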

Page 22

Make Train and Test

CALL

>> trainSet, testSet = tp.makeTrainAndTest(reduce(lambda x, y: x + y, bagsOfWords), cutoff=0.9)

SAMPLE OUTPUT

>> len(trainSet), len(testSet)

(50697, 5633)

Page 23

Make Train and Test (Code)

Given all of your examples, randomly selects a proportion cutoff of them for training and the remaining 1 - cutoff for testing.
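
A minimal sketch of tp.makeTrainAndTest; the shuffle-and-split logic is assumed from the description:

import random

def makeTrainAndTest(examples, cutoff=0.9):
    # Shuffle, then put the first `cutoff` proportion in the training
    # set and the remainder in the test set.
    examples = list(examples)
    random.shuffle(examples)
    split = int(len(examples) * cutoff)
    return examples[:split], examples[split:]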

Page 24

Train Classifier

CALL

>> import nltk.classify.util

>> nbClassifier = tp.trainNBClassifier(trainSet, testSet)

SAMPLE OUTPUT

>> nbClassifier.show_most_informative_features(n=20)

…..

Page 25

Train Classifier (Code)

Trains a Naive Bayes classifier over the input training set. Prints the accuracy over the test set and the most discriminating tokens.
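
A sketch of tp.trainNBClassifier built on NLTK's Naive Bayes implementation; the exact reporting is an assumption, while the classifier and accuracy helper are standard NLTK:

import nltk
import nltk.classify.util

def trainNBClassifier(trainSet, testSet):
    # Train Naive Bayes, report accuracy on the held-out test set,
    # and show the most discriminating features.
    classifier = nltk.NaiveBayesClassifier.train(trainSet)
    print 'Accuracy: %f' % nltk.classify.util.accuracy(classifier, testSet)
    classifier.show_most_informative_features()
    return classifier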

Page 26

Twitter Collection Demo

Page 27

Directions for Collection Demo

Now, try out twitter_kw_stream.py to collect more tweets over a couple of different “classes”.

Some possible tokens (with high volume)

apple google cat dog pizza “ice cream”

Open a new terminal window in the same directory

For each keyword KW you search for, type:

python twitter_kw_stream.py --keywords=KW

Wait a minute or so (until you retrieve about 100 tweets)

Page 28

Accuracy given training data size

Assuming keywords searched for were:

apple google cat dog pizza “ice cream”

In a terminal already running the Python interpreter:

>> paths = ['apple.txt', 'google.txt', 'cat.txt', 'dog.txt', 'pizza.txt', 'ice cream.txt']

>> addtlStopwords = ['apple', 'google', 'cat', 'dog', 'pizza', 'ice', 'cream']

>> cutoffs, uniAccs, biAccs = tp.plotAccuracy(paths, addtlStopwords=addtlStopwords)

If matplotlib is installed correctly, this should display the accuracy of the NB classifier while varying the amount of training data, with and without bigrams. The plot is saved to “nbPlot.png”
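
A sketch of what tp.plotAccuracy might do, reusing the helper sketches from the earlier slides; the cutoff grid, labels, and plotting details are all assumptions:

import nltk, nltk.classify.util
import matplotlib.pyplot as plt

def plotAccuracy(paths, addtlStopwords=None,
                 cutoffs=(0.1, 0.25, 0.5, 0.75, 0.9)):
    # Preprocess each file into (bag, label) examples, with and without bigrams.
    uni, bi = [], []
    for path in paths:
        cls = path.replace('.txt', '')
        tokens = stemTokens(removeStopwords(
            filterOnlyEnglish(tokenizeTweets(loadTweetText(path))),
            addtlStopwords))
        uni += makeBagOfWords(tokens, documentClass=cls)
        bi += makeBagOfWords(tokens, documentClass=cls, bigrams=True)
    # Retrain at each training proportion and record test accuracy.
    uniAccs, biAccs = [], []
    for cutoff in cutoffs:
        for examples, accs in ((uni, uniAccs), (bi, biAccs)):
            train, test = makeTrainAndTest(examples, cutoff=cutoff)
            classifier = nltk.NaiveBayesClassifier.train(train)
            accs.append(nltk.classify.util.accuracy(classifier, test))
    plt.plot(cutoffs, uniAccs, label='unigrams')
    plt.plot(cutoffs, biAccs, label='unigrams + bigrams')
    plt.xlabel('training proportion')
    plt.ylabel('accuracy')
    plt.legend()
    plt.savefig('nbPlot.png')
    return list(cutoffs), uniAccs, biAccs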

Page 29

Other things you can do

Page 30

Get Document Similarity

CALL

>> docs = tp.loadDocuments(paths)

>> sims = tp.getDocumentSimilarities(paths, [p.replace('.txt', '') for p in paths])

SAMPLE OUTPUT

>> sims[('apple', 'dog')]

0.30735795122824466

>> sims[('apple', 'google')]

0.44204540065105324

Page 31

Get Document Similarity (Code)

Calculates the cosine similarity for each pair of bags of words: the dot product of the two frequency vectors (after normalizing each to a unit vector).
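
A sketch of the underlying cosine-similarity computation, assuming each loaded document is represented as a single bag of words (a dict or FreqDist of token counts); argument handling in the real tp function may differ:

import math

def getDocumentSimilarities(docs, names):
    # Cosine similarity: dot product of the two count vectors divided by
    # the product of their Euclidean norms.
    def cosine(a, b):
        dot = sum(a[t] * b[t] for t in a if t in b)
        norms = (math.sqrt(sum(c * c for c in a.values())) *
                 math.sqrt(sum(c * c for c in b.values())))
        return dot / norms
    sims = {}
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            sims[(names[i], names[j])] = cosine(docs[i], docs[j])
    return sims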

Page 32

Calculate TF-IDF

CALL

>> tfIdfs = tp.getTfIdfs(docs)

SAMPLE OUTPUT

>> for path, tfIdf in zip(paths, tfIdfs):

...     print 'Top 10 TF-IDF for %s: %s' % (path, '\n'.join([str(t) for t in tfIdf]))
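
The deck has no (Code) slide for this step; a sketch of a standard TF-IDF computation, again assuming each document is a bag of token counts (the scoring formula and top-10 truncation are assumptions):

import math

def getTfIdfs(docs, topN=10):
    # tf-idf(t, d) = count of t in d * log(N / number of docs containing t)
    N = len(docs)
    df = {}
    for doc in docs:
        for t in doc:
            df[t] = df.get(t, 0) + 1
    tfIdfs = []
    for doc in docs:
        scored = [(t, doc[t] * math.log(float(N) / df[t])) for t in doc]
        scored.sort(key=lambda pair: -pair[1])
        tfIdfs.append(scored[:topN])
    return tfIdfs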

Page 33

Part-of-Speech Tag

CALL

>> posSet = [[tp.partOfSpeechTag(ts) for ts in classTokens[:100]] for classTokens in tokenSet]

SAMPLE OUTPUT

>> posSet[1][50]

[('@', 'IN'), ('davidarchie', 'NN'), ('hey', 'NN'), ('david', 'VBD'), ('!', '.'), ('me', 'PRP'), ('and', 'CC'), ('my', 'PRP$'), ('bestfriend', 'NN'), ('are', 'VBP'), ('forming', 'VBG'), ('a', 'DT'), ('band', 'NN'), ('.', '.'), ('could', 'MD'), ('you', 'PRP'), ('give', 'VB'), ('us', 'PRP'), ('any', 'DT'), ('advice', 'NN'), ('please', 'NN'), ('?', '.'), ('it', 'PRP'), ("'", "''"), ('s', 'VBZ'), ('means', 'NNS'), ('a', 'DT'), ('lot', 'NN'), ('for', 'IN'), ('us', 'PRP'), (':)', ':')]

Page 34

Part-of-Speech Tag (Code)

Very simple: as long as you have a list of tokens, you can just call nltk.pos_tag(tokens) to tag them with parts of speech.
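
So tp.partOfSpeechTag is likely just a thin wrapper (an assumption, but consistent with the slide):

import nltk

def partOfSpeechTag(tokens):
    # Returns a list of (token, Penn Treebank tag) pairs.
    return nltk.pos_tag(tokens)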

Page 35

Find Named-Entities

CALL

>> neSet = [[tp.getNamedEntityTree(ts) for ts in classTokens[:100]] for classTokens in tokenSet]

SAMPLE OUTPUT

>> neSet[1][50]

Tree('S', [('@', 'IN'), ('davidarchie', 'NN'), ('hey', 'NN'), ('david', 'VBD'), ('!', '.'), ('me', 'PRP'), ('and', 'CC'), ('my', 'PRP$'), ('bestfriend', 'NN'), ('are', 'VBP'), ('forming', 'VBG'), ('a', 'DT'), ('band', 'NN'), ('.', '.'), ('could', 'MD'), ('you', 'PRP'), ('give', 'VB'), ('us', 'PRP'), ('any', 'DT'), ('advice', 'NN'), ('please', 'NN'), ('?', '.'), ('it', 'PRP'), ("'", "''"), ('s', 'VBZ'), ('means', 'NNS'), ('a', 'DT'), ('lot', 'NN'), ('for', 'IN'), ('us', 'PRP'), (':)', ':')])

Page 36

Find Named-Entities (Code)

Similarly simple: just call two NLTK functions in sequence. The performance of the POS tagger and NE chunker is quite poor on Twitter messages, however.
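
The two functions are presumably nltk.pos_tag followed by nltk.ne_chunk, which matches the Tree output above; a sketch of tp.getNamedEntityTree chaining them (the wrapper is an assumption):

import nltk

def getNamedEntityTree(tokens):
    # POS-tag first, then chunk named entities into an nltk.Tree.
    return nltk.ne_chunk(nltk.pos_tag(tokens))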

