Date posted: 09-May-2015 | Category: Technology | Uploaded by: ben-healey
Text Analytics with Python and R
(with examples from Tobacco Control)
@BenHealey
The Process
• Bright Idea
• Gather
• Clean
• Standardise
• De-dup and select
• Look intensely: Frequencies, Classification
http://scrapy.org
Spiders, Items, Pipelines
• R: readLines; XML / RCurl / scrapeR packages; tm package (Factiva plugin); twitteR
• Python: Beautiful Soup; pandas (eg, financial data)
http://blog.siliconstraits.vn/building-web-crawler-scrapy/
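The spider → items → pipelines flow above can be sketched in plain Python. This is a toy stand-in for how Scrapy's pieces relate, not real Scrapy API: the page data, URL, and pipeline stage here are all invented for illustration.

```python
def spider(pages):
    """Stand-in for a Spider: yields one item dict per 'page'."""
    for url, text in pages.items():
        yield {"url": url, "content": text}

def strip_whitespace_pipeline(item):
    """Stand-in for an item Pipeline stage: normalises runs of whitespace."""
    item["content"] = " ".join(item["content"].split())
    return item

def crawl(pages, pipelines):
    """Run every yielded item through each pipeline stage in order."""
    items = []
    for item in spider(pages):
        for stage in pipelines:
            item = stage(item)
        items.append(item)
    return items

pages = {"http://example.com/a": "Some   scraped\n text"}
items = crawl(pages, [strip_whitespace_pipeline])
print(items[0]["content"])  # -> "Some scraped text"
```

In real Scrapy the framework does the plumbing: you subclass `scrapy.Spider`, yield items from `parse()`, and register pipeline classes in settings.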
• Translating text to a consistent form
– Scrapy returns unicode strings
– Māori → Maori
• SWAPSET = [[u"Ā", u"A"], [u"ā", u"a"], [u"ä", u"a"]]
• translation_table = dict([(ord(k), unicode(v)) for k, v in settings.SWAPSET])
• cleaned_content = html_content.translate(translation_table)
• Or, if you already have unicode:
– test = u'Māori'
– unidecode(test)  # returns 'Maori'
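The swap-table idea condenses to a few lines in Python 3, where `str` is already unicode and `str.translate` takes a `{codepoint: replacement}` dict (a minimal sketch; the three-entry SWAPSET is just the example pairs from the slide):

```python
# Map macron/diaeresis variants onto plain ASCII letters.
SWAPSET = [["Ā", "A"], ["ā", "a"], ["ä", "a"]]

# str.translate wants Unicode code points as keys, hence ord().
translation_table = {ord(k): v for k, v in SWAPSET}

print("Māori".translate(translation_table))  # -> "Maori"
```

The `unidecode` package generalises this: it transliterates arbitrary Unicode to ASCII without you maintaining a swap table.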
• Dealing with non-Unicode
– http://nedbatchelder.com/text/unipain.html
– Some scraped HTML will be in latin-1 (mismatched with UTF-8)
– Have your datastore default to UTF-8
– Learn to love whack-a-mole
• Dealing with too many spaces:
– newstring = ' '.join(mystring.split())
– Or use the re module
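Both whitespace fixes in one runnable sketch (the sample string is invented):

```python
import re

mystring = "too   many \n spaces"

# split()/join collapses any run of whitespace (spaces, tabs, newlines):
print(" ".join(mystring.split()))             # -> "too many spaces"

# re.sub does the same, with more control over what counts as whitespace:
print(re.sub(r"\s+", " ", mystring).strip())  # -> "too many spaces"
```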
• Don’t forget the metadata!
– Define a common data structure early if you have multiple sources
Text Standardisation
• Stopwords
– "a, about, above, across, ... yourself, yourselves, you've, z"
• Stemmers
– "some sample stemmed words" → "some sampl stem word"
• Tokenisers (eg, for bigrams)
– BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
– tdm <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))
– eg, 'and said', 'and security'
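The bigram tokeniser has a direct plain-Python parallel (a sketch mirroring what `NGramTokenizer(min = 2, max = 2)` produces; the sample phrase reuses the slide's bigrams):

```python
def ngram_tokenize(text, n=2):
    """Return the n-grams (as strings) of a whitespace-tokenised text."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(ngram_tokenize("and said and security"))
# -> ['and said', 'said and', 'and security']
```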
Natural Language Toolkit (Python) | tm package (R)
Text Standardisation
libs = c("RODBC", "RWeka", "Snowball", "wordcloud", "tm", "topicmodels")
…
cleanCorpus = function(corpus) {
  # Newer tm versions need content_transformer() around base R functions:
  corpus.tmp = tm_map(corpus, content_transformer(tolower))
  corpus.tmp = tm_map(corpus.tmp, removePunctuation)
  corpus.tmp = tm_map(corpus.tmp, removeWords, stopwords("english"))
  corpus.tmp = tm_map(corpus.tmp, stripWhitespace)
  return(corpus.tmp)
}

posts.corpus = cleanCorpus(posts.corpus)
posts.corpus_stemmed = tm_map(posts.corpus, stemDocument)
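The same cleaning steps translate to a few lines of stdlib Python (a sketch: the tiny stopword list and sample sentence are invented stand-ins; tm's `stopwords("english")` list is much longer):

```python
import string

# Illustrative stopword list only -- use a real one (eg, NLTK's) in practice.
STOPWORDS = {"a", "the", "and", "is", "of", "to"}

def clean_text(text):
    """Mirror cleanCorpus: lower-case, drop punctuation, drop stopwords,
    and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    words = [w for w in text.split() if w not in STOPWORDS]
    return " ".join(words)

print(clean_text("The  Minister announced a Bill, and MORE funding!"))
# -> "minister announced bill more funding"
```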
Text Standardisation
• Using dictionaries for stem completion

politi.tdm <- TermDocumentMatrix(politi.corpus)
politi.tdm = removeSparseTerms(politi.tdm, 0.99)
politi.tdm = as.matrix(politi.tdm)

# Get word counts in decreasing order; put these into a plain text doc.
word_freqs = sort(rowSums(politi.tdm), decreasing=TRUE)
length(word_freqs)
smalldict = PlainTextDocument(names(word_freqs))

politi.corpus_final = tm_map(politi.corpus_stemmed, stemCompletion, dictionary=smalldict, type="first")
Deduplication
• Python sets
– shingles1 = set(get_shingles(record1['standardised_content']))
• Shingling and Jaccard similarity
– (a, rose, is, a, rose, is, a, rose)
– All 4-shingles: {(a,rose,is,a), (rose,is,a,rose), (is,a,rose,is), (a,rose,is,a), (rose,is,a,rose)}
– As a set: {(a,rose,is,a), (rose,is,a,rose), (is,a,rose,is)}
– http://infolab.stanford.edu/~ullman/mmds/ch3.pdf (a free text)
– http://www.cs.utah.edu/~jeffp/teaching/cs5955/L4-Jaccard+Shingle.pdf
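The shingle-and-compare step can be sketched end to end (the `get_shingles` name matches the slide's usage; its body here is an assumed implementation):

```python
def get_shingles(text, k=4):
    """Return the set of k-word shingles of a text. Duplicates collapse
    because this is a set -- as in the 'a rose is a rose...' example."""
    words = text.split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets: |A & B| / |A | B|."""
    return len(a & b) / len(a | b)

s1 = get_shingles("a rose is a rose is a rose")
print(len(s1))  # -> 3 distinct shingles

s2 = get_shingles("a rose is a rose is a flower")
print(jaccard(s1, s2))  # -> 0.75
```

Near-duplicate records are then the pairs whose Jaccard similarity exceeds a chosen threshold.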
Frequency Analysis
• Document-Term Matrix
– politi.dtm <- DocumentTermMatrix(politi.corpus_stemmed, control = list(wordLengths=c(4,Inf)))
• Frequent and co-occurring terms
– findFreqTerms(politi.dtm, 5000)
  [1] "2011"     "also"     "announc"  "area"     "around"
  [6] "auckland" "better"   "bill"     "build"    "busi"
– findAssocs(politi.dtm, "smoke", 0.5)
     smoke  tobacco     quit smokefre   smoker     2025 cigarett
      1.00     0.74     0.68     0.62     0.62     0.58     0.57
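The frequency side of this is easy to sketch in stdlib Python: the row sums of a term-document matrix are just pooled term counts. The three-document mini-corpus below is invented for illustration, and the threshold of 3 stands in for the 5000 used on the real corpus:

```python
from collections import Counter

# Hypothetical mini-corpus standing in for the (stemmed) political documents.
docs = [
    "auckland bill announc",
    "auckland busi build",
    "bill build auckland",
]

# Pooled term counts == row sums of the term-document matrix.
word_freqs = Counter(word for doc in docs for word in doc.split())

# Terms occurring at least 3 times (cf. findFreqTerms(politi.dtm, 5000)):
print([t for t, n in word_freqs.items() if n >= 3])  # -> ['auckland']
```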
Mentions of the 2025 goal
Top 100 terms: Tariana Turia
Note: Documents from Aug 2011 – July 2012 (wordcloud package)
Top 100 terms: Tony Ryall
Note: Documents from Aug 2011 – July 2012
• Exploration and feature extraction
– Metadata gathered at time of collection (eg, Scrapy)
– RODBC or MySQLdb with plain ol’ SQL
– Native or package functions for length of strings, social network analysis (sna), etc.
• Unsupervised
– nltk.cluster
– tm, topicmodels; as.matrix(dtm) then kmeans, etc.
• Supervised
– First hurdle: a training set
– nltk.classify
– tm, e1071, others…
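Once you have a training set, a supervised classifier like the Naive Bayes in nltk.classify reduces to counting and smoothing. A hand-rolled stdlib sketch (the four labelled samples and both labels are invented; real training sets need far more data):

```python
import math
from collections import Counter, defaultdict

def train(samples):
    """samples: list of (text, label) pairs. Returns label counts,
    per-label word counts, and the vocabulary."""
    label_counts = Counter(label for _, label in samples)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in samples:
        for w in text.split():
            word_counts[label][w] += 1
            vocab.add(w)
    return label_counts, word_counts, vocab

def classify(text, label_counts, word_counts, vocab):
    """Pick the label maximising log P(label) + sum log P(word|label),
    with add-one smoothing so unseen words don't zero things out."""
    total = sum(label_counts.values())
    best, best_lp = None, float("-inf")
    for label, n in label_counts.items():
        lp = math.log(n / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in text.split():
            lp += math.log((word_counts[label][w] + 1) / denom)
        if lp > best_lp:
            best, best_lp = label, lp
    return best

samples = [
    ("quit smoke today", "quitting"),
    ("day one no cigarett", "quitting"),
    ("feel great well done", "support"),
    ("stay strong awesome", "support"),
]
model = train(samples)
print(classify("no smoke today", *model))  # -> "quitting"
```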
Classification
Cohort: New users (posters) in Q1 2012

                      Users          Posts
2 posts or fewer      846 (41.0%)    1,157 (1.3%)
More than 750 posts   23 (1.1%)      45,499 (50.1%)
• LDA (topicmodels)
– New users:

Topic 1   Topic 2   Topic 3   Topic 4   Topic 5
good      smoke     just      smoke     feel
day       time      day       quit      day
thank     week      get       can       dont
well      patch     realli    one       like
will      start     think     will      still

– Highly active users:

Topic 1   Topic 2   Topic 3   Topic 4   Topic 5
quit      good      day       like      feel
smoke     one       well      day       thing
can       take      great     your      just
will      stay      done      now       get
luck      strong    awesom    get       time
• LDA (topicmodels)
– Highly active users (HAU)

Topic 1   Topic 2   Topic 3   Topic 4   Topic 5
quit      good      day       like      feel
smoke     one       well      day       thing
can       take      great     your      just
will      stay      done      now       get
luck      strong    awesom    get       time

Topic proportions by user:
– HAU1 (F, 38, PI):   18%   14%   40%    8%   20%
– HAU2 (F, 33, NZE):  31%   21%   27%    6%   16%
– HAU3 (M, 48, NZE):  16%    9%   21%   49%    5%
Recap• Your text will probably be messy– Python, R-based tools reduce the pain
• Simple analyses can generate useful insight
• Combine with data of other types for context– source, quantities, dates, network position, history
• May surface useful features for classification
Slides, Code: [email protected]