UNIVERSITY COLLEGE DUBLIN DUBLIN CITY UNIVERSITY TYNDALL NATIONAL INSTITUTE
Ranking for Sentiment
DCU at TREC 2008: The Blog Track
Adam Bermingham
DCU: Team Sentiment!
Adam Bermingham
Prof. Alan Smeaton
Dr. Deirdre Hogan
Dr. Jennifer Foster
CLARITY: Centre for Sensor Web Technologies
Centre for Digital Video Processing
National Centre For Language Technology
Sentiment Analysis I
Who is favourite to win the match, Ireland or New Zealand?
What is the sentiment towards Barack Obama / the new iPod / Lehman Brothers Holdings Inc?
How opinionated is the discussion around Mary Harney, Enda Kenny, Zig & Zag?
Sentiment Analysis II
Identification of subjectivity and polarity of opinion in textual information
Crossover of Information Retrieval, NLP, Text Mining
The Challenges
– Document classification, scoring
– Opinion extraction
– Opinion summarization, visualization
– Real-world correlation
SA: Document Scoring
Permits reranking, fusion with other document information
– E.g. relevance, authority, PageRank etc.
Machine Learning approaches
– Bag-of-words + variants
Lexicon approaches
– Dictionaries for sentiment, polarity and subjectivity
Alternative text features
– Out-of-vocabulary words, punctuation etc.
The Blog Track
Run at TREC since 2006
Three tasks (2008):
1. Find relevant blog posts
2. Find opinionated blog posts
3. Find positive & negative blog posts
Results: 1000 ranked documents per topic (query), per task
50 topics per year
Primary evaluation metric: MAP (Mean Average Precision)
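As a reminder of how the primary metric is computed, here is a minimal sketch of average precision and MAP. The function names and the dictionary-based inputs are illustrative assumptions, not the official trec_eval implementation; in particular, this sketch scores a topic with no relevant documents as 0.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision of one ranked list against a set of relevant ids."""
    relevant_ids = set(relevant_ids)
    if not relevant_ids:
        return 0.0  # assumption: topics without relevant documents score 0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            # Precision at this rank, counted only where a relevant doc appears.
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids)


def mean_average_precision(runs, qrels):
    """runs: topic -> ranked list of ids; qrels: topic -> set of relevant ids."""
    scores = [average_precision(runs.get(topic, []), relevant)
              for topic, relevant in qrels.items()]
    return sum(scores) / len(scores) if scores else 0.0
```

For a ranking `["a", "b", "c", "d"]` with relevant documents `{"a", "c"}`, the precisions at the two relevant ranks are 1/1 and 2/3, so the average precision is 5/6.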
Topic Example
<num> Number: 1049 </num>
<title> YouTube </title>
<desc> Description: Find views about the YouTube video-sharing website.</desc>
<narr> Narrative: The YouTube video-sharing website provides internet users with a relatively new way to share videos. Documents which express views about how well it succeeds in meeting the needs of users are relevant.</narr>
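A topic in this form can be read into a dictionary with a few regular expressions. This is a minimal sketch that assumes every field is closed with a matching tag, as in the example above; `parse_topic` and the field tables are hypothetical names, not part of any TREC tooling.

```python
import re

TOPIC_FIELDS = ("num", "title", "desc", "narr")
# Leading labels that appear inside some fields and carry no information.
FIELD_LABELS = {"num": "Number:", "desc": "Description:", "narr": "Narrative:"}


def parse_topic(text):
    """Extract the topic fields from one topic, assuming closed tags."""
    topic = {}
    for field in TOPIC_FIELDS:
        match = re.search(rf"<{field}>(.*?)</{field}>", text, flags=re.DOTALL)
        if match is None:
            continue
        value = " ".join(match.group(1).split())  # collapse whitespace
        label = FIELD_LABELS.get(field)
        if label and value.startswith(label):
            value = value[len(label):].strip()
        topic[field] = value
    return topic
```

Typically only the `title` field would be used as the query, with `desc` and `narr` available for longer query formulations.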
Assessments
Relevance judgements
– Pooling
– Human assessors
QRELS:
– Not relevant
– Relevant, non-opinionated
– Relevant, positively opinionated
– Relevant, negatively opinionated
– Relevant, mixed opinionated
– Not judged
32,021 QRELS from 2006, 2007 available
Corpus – Blog06
>3 million blog posts
Crawled over a few weeks in 2006
Permalink HTML
– Also available: homepage HTML, RSS
Real-world
– Includes: spam blogs (“splogs”), multilingual blogs, inappropriate content
Blogs
“Weblog” coined 1997
A website containing regular timestamped posts in chronological order
Universal McCann (March 2008)
– 184 million worldwide have started a blog | 26.4 million in the US
– 346 million worldwide read blogs | 60.3 million in the US
– 77% of active Internet users read blogs
Blog – an example
[Figure: example blog page with labelled parts: Blog, Date, Post, Links, Tags, Comments]
Approach
1. Get relevant documents
2. Assess results for sentiment using three feature sets
3. Re-rank relevant results using late fusion of feature sets
Approach – feature set
Lexicon Features
– Aggregate sentiment scores for a document’s constituent words in a sentiment lexicon.
Surface Features
– Textual features which do not require parsing or syntactic understanding of the sentence structure.
Syntactic Features
– Textual features derived from parsing and part-of-speech tagging documents.
System Architecture
Retrieval
Terrier
– University of Glasgow
– Open source / Java
Retrieval: Okapi BM25
Query Expansion: Bo1 (Bose-Einstein), Divergence From Randomness
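For readers unfamiliar with BM25, the following is a minimal sketch of one common formulation of the scoring function over tokenised documents. It is not Terrier's implementation: the `k1` and `b` defaults, the IDF smoothing and the function name are assumptions chosen for illustration, and query expansion is not shown.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenised document against the query (one BM25 variant)."""
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5)
                           / (doc_freq[term] + 0.5))
            # Length-normalised term-frequency saturation.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Documents that do not contain any query term score 0; repeated occurrences of a term raise the score with diminishing returns, and longer documents are penalised through `b`.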
Preprocessing
Parse HTML
– HTMLParser tool
– Divide into text sections according to breaking HTML elements
Noise Removal
Discard sections with:
– A high anchor text to non-anchor text ratio (e.g. ad, blogroll)
– A high non-alphabetic character to alphabetic character ratio (e.g. date, code, gobbledegook)
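The two noise-removal heuristics can be sketched as a single predicate over a text section. The thresholds, the character-level counting and the name `is_noise` are assumptions for illustration; the slides do not state the values or units actually used.

```python
def _ratio(numerator, denominator):
    """Ratio that stays defined when the denominator is zero."""
    if denominator == 0:
        return float("inf") if numerator > 0 else 0.0
    return numerator / denominator


def is_noise(anchor_text, other_text,
             max_anchor_ratio=1.0, max_nonalpha_ratio=0.5):
    """Return True if a section fails either ratio test (assumed thresholds)."""
    visible = [c for c in anchor_text + other_text if not c.isspace()]
    alpha = sum(c.isalpha() for c in visible)
    non_alpha = len(visible) - alpha
    # Anchor (link) text versus the rest of the section, e.g. ads, blogrolls.
    anchor_ratio = _ratio(len(anchor_text.strip()), len(other_text.strip()))
    # Non-alphabetic versus alphabetic characters, e.g. dates, code.
    nonalpha_ratio = _ratio(non_alpha, alpha)
    return anchor_ratio > max_anchor_ratio or nonalpha_ratio > max_nonalpha_ratio
```

Here `anchor_text` is the text found inside links in a section and `other_text` is everything else, so a blogroll (mostly links) and a bare timestamp (mostly digits and punctuation) would both be discarded.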
Machine Learning
WEKA - Waikato Environment for Knowledge Analysis
– Java
– Good entry point
– Performance issues (?)
Three-way Binary Logistic Regression Classification
– Scores are obtained from distributions for classified documents
Syntactic Features
Parsed using Charniak and Johnson re-ranking parser*
– ICHEC – Irish Centre for High End Computing
The 50 most discriminative part-of-speech unigrams, bigrams and trigrams
Penn Treebank phrasal types:
– Normalised counts of types
– Normalised counts of types as root of tree
– Normalised counts of parse tree structures likely to reflect subjectivity
*Thanks to Joachim Wagner!
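The part-of-speech n-gram features can be illustrated as follows. This is a minimal sketch that turns an already-tagged document into normalised unigram, bigram and trigram counts; the tagging itself and the selection of the 50 most discriminative n-grams are not shown, and `pos_ngram_features` is a hypothetical name.

```python
from collections import Counter


def pos_ngram_features(tags, max_n=3):
    """Normalised counts of POS-tag n-grams (n = 1..max_n) for one document."""
    features = {}
    for n in range(1, max_n + 1):
        grams = [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]
        total = len(grams)
        for gram, count in Counter(grams).items():
            # Normalise by the number of n-grams of this order in the document.
            features["_".join(gram)] = count / total
    return features
```

For the tag sequence `["PRP", "VBP", "DT", "NN"]`, each unigram gets 0.25, each of the three bigrams gets 1/3, and each of the two trigrams gets 0.5.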
Surface Features
Normalized word counts for a manually created lexicon of obscenities and emotive and polarised words
Non-word characters and character sequences such as punctuation and emoticons.
Regex patterns to detect unusual word and punctuation structures, e.g. “arrrrgh”, “?!?!”, “....”, “b****”
Document measurements
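The regex-based surface features might look something like the sketch below. The four patterns are assumptions written to match the examples on the slide ("arrrrgh", "?!?!", "....", "b****"); the patterns actually used are not given.

```python
import re

# Hypothetical patterns, one per example structure named on the slide.
SURFACE_PATTERNS = {
    # A letter repeated three or more times inside a word: "arrrrgh".
    "elongated_word": re.compile(r"\b\w*([a-zA-Z])\1{2,}\w*\b"),
    # A run of punctuation mixing "?" and "!": "?!?!".
    "mixed_punct": re.compile(r"[?!]*(?:\?!|!\?)[?!]*"),
    # Three or more consecutive full stops: "....".
    "ellipsis": re.compile(r"\.{3,}"),
    # Letters followed by two or more asterisks: "b****".
    "censored_word": re.compile(r"\b[a-zA-Z]+\*{2,}[a-zA-Z]*"),
}


def surface_counts(text):
    """Count the matches of each surface pattern in a text."""
    return {name: sum(1 for _ in pattern.finditer(text))
            for name, pattern in SURFACE_PATTERNS.items()}
```

Raw counts like these would typically be normalised by document length before being passed to a classifier.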
Lexicon Features
SentiWordNet
– Positivity, Negativity score for each Synset in WordNet
Scoring
– Weighted sum of mean positivity and negativity scores per document
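The lexicon scoring step can be sketched as below. SentiWordNet assigns scores per synset, so this sketch assumes they have already been collapsed into a word-level dictionary of (positivity, negativity) pairs; that dictionary, the weights, the choice to average only over words found in the lexicon, and the name `lexicon_score` are all illustrative assumptions.

```python
def lexicon_score(tokens, lexicon, w_pos=1.0, w_neg=1.0):
    """Weighted sum of a document's mean positivity and mean negativity."""
    # Keep only the tokens that the lexicon covers (an assumption).
    scored = [lexicon[token] for token in tokens if token in lexicon]
    if not scored:
        return 0.0
    mean_pos = sum(pos for pos, _ in scored) / len(scored)
    mean_neg = sum(neg for _, neg in scored) / len(scored)
    return w_pos * mean_pos + w_neg * mean_neg
```

With both weights positive the score behaves as an opinionatedness measure; giving the negativity weight a negative sign turns it into a polarity score instead.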
Weighting
Weighted CombSUM
Learning weights from 2006, 2007 MAP
– Rather than cross validation
Scores from 3 classifiers fused before merging with relevance score
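The two-stage late fusion can be sketched with a weighted CombSUM over normalised scores. The min-max normalisation, the function names and any weight values passed in are assumptions for illustration; the learned weights are not reproduced here.

```python
def min_max_normalise(scores):
    """Rescale a doc -> score mapping to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def weighted_comb_sum(score_lists, weights):
    """Weighted CombSUM: sum each document's weighted, normalised scores."""
    fused = {}
    for scores, weight in zip(score_lists, weights):
        for doc, s in min_max_normalise(scores).items():
            fused[doc] = fused.get(doc, 0.0) + weight * s
    return fused


def rerank(relevance, classifier_scores, classifier_weights, relevance_weight):
    """Fuse the classifier scores first, then merge with the relevance score."""
    sentiment = weighted_comb_sum(classifier_scores, classifier_weights)
    final = weighted_comb_sum([relevance, sentiment],
                              [relevance_weight, 1.0 - relevance_weight])
    return sorted(final, key=final.get, reverse=True)
```

Fusing the three sentiment scores before merging with relevance means the balance between relevance and sentiment is controlled by a single weight, independently of how the classifiers are weighted against each other.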
Weighting
Polarised Opinion finding:
Opinion finding:
Results
Baseline (Opinion)
Opinion Finding
Polarised Opinion Finding
Results – per topic
Preliminary Conclusions
Syntactic features appear to subsume surface features
– Observed when training weights
Significant gains can be had through an efficient, uniform baseline
Subjectivity is important in polarity detection
– Bigger difference in writing style between objective and subjective texts than between negative and positive texts
Future Work
Further work on parse trees for sentiment classification
– Movie review classification (Wolfgang Seeker)
Sub-document relevance and sentiment modelling
– Unstructured text
– Logical levels: Sentence? Phrase? Paragraph? Passage? N.O.T.A?
Thanks!
TREC – Blog Track Wiki: http://ir.dcs.gla.ac.uk/wiki/TREC-BLOG/
Opinion Mining and Sentiment Analysis Survey: http://www.cs.cornell.edu/home/llee/opinion-mining-sentiment-analysis-survey.html
TREC Blog Track 2007 overview: http://trec.nist.gov/pubs/trec16/papers/BLOG.OVERVIEW08.pdf
Tools:
– Weka: http://www.cs.waikato.ac.nz/ml/weka/
– Terrier: http://ir.dcs.gla.ac.uk/terrier/
– HTMLParser: http://htmlparser.sourceforge.net/
– SentiWordNet: http://sentiwordnet.isti.cnr.it/