Date post: | 25-Jun-2015 |
Category: |
Data & Analytics |
Upload: | ali-belcaid |
DASA Project: Data Acquisition for Sentiment Analysis
Ali Belcaid © AB Advisory & Consulting
High level architecture and components overview – March 2013
Objectives
• Streamline and facilitate the process of unstructured data acquisition
• Create and manage corpora for contextual opinions and sentiments
• Detect trends based on contextual reviews, comments, discussions…
• Run and train models for sentiment or opinion analysis
• Provide figures, results and graphs as outputs
Software components
• Python: Programming language
• Django: Web application framework
• Scrapy: Web crawler
• Libraries: Twitter, …
• MySQL / MongoDB / HBase: For the time being, no definitive choice has been made, but the final solution could be a mix of different databases depending on the nature of the use.
• R Project: R will be used whenever specific text-mining libraries are missing in Python or when it becomes easier to use R instead of Python. In that case, the R scripts will be encapsulated in Python programs.
• Hadoop: For massive storage we will use Hadoop. The architecture is not yet defined.
– It is used for raw data storage.
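As an illustration of how R scripts might be encapsulated in Python programs, here is a minimal sketch that launches a script through the `Rscript` command-line tool. The function names and the script name `tm_topics.R` are hypothetical; the deck does not specify how the encapsulation is done.

```python
import subprocess
from typing import List, Sequence


def build_rscript_command(script_path: str, args: Sequence[str] = ()) -> List[str]:
    """Build the argument list used to launch an R script via Rscript."""
    # --vanilla skips user profiles and saved workspaces for reproducible runs.
    return ["Rscript", "--vanilla", script_path, *args]


def run_r_script(script_path: str, args: Sequence[str] = ()) -> str:
    """Run an R script and return its standard output as text.

    Raises subprocess.CalledProcessError if R exits with a non-zero status.
    """
    completed = subprocess.run(
        build_rscript_command(script_path, args),
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout


# Hypothetical script and input file, shown only to illustrate the call shape.
command = build_rscript_command("tm_topics.R", ["corpus.csv"])
```

Passing data through files or standard output keeps the two languages loosely coupled; a tighter binding is possible but is not assumed here.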
Simplified Solution Architecture
[Diagram: Web Interface (Django); Crawl Engine & API (Scrapy); Text Mining Engine (NLTK, TM – R project); Pre-processing & Corpora; Output results; Configuration; Crawl Content. Components are numbered 1 to 5.]
Architecture components
1 Data sources: Access will be managed via APIs or crawls. Sources are all those related to social media: blogs, forums, advisors, social web… In general, any media where sentiments or opinions are expressed.
2 Web interface to interact with the system: manage inputs, configurations, outputs…
3 There will be a mix of Scrapy (the crawler) and Python scripts that use APIs. The engine will gather data from all sources and store it for further processing (pre-processing and analysis).
4 There will be a mix of Scrapy (the crawler) and Python scripts that use APIs. The engine will gather data from all sources and store it for further processing.
5 The target database solution is not yet selected. The objective is to store all related content, whether raw data, configuration items or output results.
Characteristics of Sentiment Analysis
Sentiment = Holder + Polarity + Target + Auxiliary
– Holder: who expresses the sentiment
– Target: what/whom the sentiment is expressed to
– Polarity: the nature of the sentiment (e.g., positive or negative)
“The games in iPhone 4s are pretty funny!”
Feature/Aspect Target Polarity : Positive
Holder = the user/reviewer
Auxiliary
• Strength: differentiates the intensity
• Confidence: measures the reliability of the sentiment
• Summary: explains the reason inducing the sentiment
• Time
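The decomposition above can be pictured as a small record type. This is only a sketch: the class name, field names and types are assumptions chosen for illustration, and the auxiliary fields are optional because the deck does not say how they are measured.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sentiment:
    holder: str                         # who expresses the sentiment
    target: str                         # what/whom the sentiment is about
    polarity: str                       # "positive", "negative" or "neutral"
    aspect: Optional[str] = None        # feature/aspect of the target, if any
    # Auxiliary information (all optional in this sketch)
    strength: Optional[float] = None    # intensity of the sentiment
    confidence: Optional[float] = None  # reliability of the sentiment
    summary: Optional[str] = None       # reason inducing the sentiment
    time: Optional[str] = None          # when the sentiment was expressed


# The iPhone example from the slide, encoded with this record.
example = Sentiment(
    holder="reviewer",
    target="iPhone 4s",
    polarity="positive",
    aspect="games",
)
```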
Basic Tasks
• Holder detection – Find who expresses the sentiment
• Target recognition – Find whom/what the sentiment is expressed towards
• Sentiment (Polarity) classification – Positive, negative, neutral
• Opinion summarization
• Opinion spam detection
Subjectivity versus Sentiment
• Sentiment analysis is also known as opinion mining.
• It attempts to identify the opinion/sentiment that a person may hold towards an object.
• It is a finer-grained analysis compared to subjectivity analysis.
Lexicon Based Sentiment Classification
Basic idea
• Use the dominant polarity of the opinion words in the sentence to determine its polarity:
• If positive/negative opinion prevails, the opinion sentence is regarded as positive/negative
• Lexicon + Counting
• Lexicon + Grammar Rule + Inference Method
Example lexicons:
http://www.wjh.harvard.edu/~inquirer
http://www.cs.uic.edu/~liub/FBS/opinion-lexicon-English.rar
http://sentiwordnet.isti.cnr.it/
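The "Lexicon + Counting" variant can be sketched in a few lines. The two word sets below are tiny hand-made stand-ins, not extracts from the lexicons listed above, and the tokenizer is a simplification. Negation and grammar rules are deliberately left out.

```python
import re

# Toy lexicons for illustration only; a real run would load one of the
# published opinion lexicons instead.
POSITIVE = {"good", "great", "funny", "love", "excellent"}
NEGATIVE = {"bad", "poor", "boring", "hate", "terrible"}


def classify_sentence(sentence, positive=POSITIVE, negative=NEGATIVE):
    """Return the dominant polarity of the opinion words in a sentence."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    pos = sum(1 for token in tokens if token in positive)
    neg = sum(1 for token in tokens if token in negative)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"  # no opinion words, or a tie


label = classify_sentence("The games in iPhone 4s are pretty funny!")
```

Because the rule only counts words, a sentence such as "not funny" would still be labeled positive; handling that is what the grammar-rule variant is for.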
Sentiment Analysis Tasks
Level Task Description
Document • Task: sentiment classification of reviews
• Classes: positive, negative, and neutral
• Assumption: each document (or review) focuses on a single object (not true in many discussion posts) and contains opinion from a single opinion holder.
Sentence • Task 1: identifying subjective/opinionated sentences
• Classes: objective and subjective (opinionated)
• Task 2: sentiment classification of sentences
• Classes: positive, negative and neutral.
• Assumption: a sentence contains only one opinion; not true in many cases.
• Then we can also consider clauses or phrases.
Feature • Task 1: Identify and extract object features that have been commented on by an opinion holder (e.g., a reviewer).
• Task 2: Determine whether the opinions on the features are positive, negative or neutral.
• Task 3: Group feature synonyms.
• Produce a feature-based opinion summary of multiple reviews.
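A feature-based opinion summary could be assembled as follows, assuming Tasks 1 and 2 have already produced (feature, polarity) pairs. The synonym table and the sample pairs are invented for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical synonym table for Task 3 (grouping feature synonyms).
SYNONYMS = {"photo": "picture", "image": "picture", "battery life": "battery"}


def summarize(opinions, synonyms=SYNONYMS):
    """Count polarities per canonical feature across multiple reviews."""
    summary = defaultdict(Counter)
    for feature, polarity in opinions:
        key = feature.lower()
        canonical = synonyms.get(key, key)
        summary[canonical][polarity] += 1
    return {feature: dict(counts) for feature, counts in summary.items()}


# Made-up output of Tasks 1 and 2 for a handful of reviews.
opinions = [
    ("photo", "positive"),
    ("picture", "positive"),
    ("image", "negative"),
    ("battery life", "negative"),
]
result = summarize(opinions)
```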
Some tools
Lexicon-based tools
• Use sentiment and subjectivity lexicons
• Rule-based classifier
• A sentence is subjective if it has at least two words in the lexicon
• A sentence is objective otherwise
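The rule above translates directly into code. The lexicon here is a small made-up set, and every occurrence of a lexicon word counts toward the threshold, which is one possible reading of the rule.

```python
import re

# Toy subjectivity lexicon, for illustration only.
SUBJECTIVITY_LEXICON = {
    "believe", "love", "hate", "awful", "wonderful", "funny", "pretty",
}


def is_subjective(sentence, lexicon=SUBJECTIVITY_LEXICON, threshold=2):
    """Subjective if at least `threshold` tokens appear in the lexicon."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    hits = sum(1 for token in tokens if token in lexicon)
    return hits >= threshold


subjective = is_subjective("The games in iPhone 4s are pretty funny!")
objective = not is_subjective("The phone weighs 140 grams.")
```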
Corpus-based tools
• Use corpora annotated for subjectivity and/or sentiment
• Train machine learning algorithms:
• Naïve Bayes
• Decision trees
• SVM
• …
• Learn to automatically annotate new text
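To make the corpus-based approach concrete, here is a minimal multinomial Naïve Bayes classifier with add-one smoothing, written in plain Python. The four training sentences are hand-made toy data, not a real annotated corpus, and the class is a teaching sketch rather than a replacement for a library implementation.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over whitespace tokens, add-one smoothing."""

    def __init__(self):
        self.class_counts = Counter()            # documents per label
        self.word_counts = defaultdict(Counter)  # word frequencies per label
        self.vocabulary = set()

    def train(self, examples):
        for text, label in examples:
            self.class_counts[label] += 1
            for word in text.lower().split():
                self.word_counts[label][word] += 1
                self.vocabulary.add(word)

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, float("-inf")
        for label, doc_count in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in text.lower().split():
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy annotated corpus (invented for this sketch).
TRAINING = [
    ("great phone love the games", "positive"),
    ("funny and great screen", "positive"),
    ("terrible battery hate it", "negative"),
    ("boring games and poor screen", "negative"),
]

model = NaiveBayes()
model.train(TRAINING)
prediction = model.predict("love the great screen")
```

Once trained, the same `predict` call annotates new, unlabeled text, which is the last step in the list above.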
Sentiment Analysis : Levels
• Document level – e.g., product/movie review
• Sentence level – e.g., news sentence
• Expression level – e.g., word/phrase
Sentiment Analysis : Holder detection
Identifying Sources of Opinions with Conditional Random Fields and Extraction Patterns
International officers believe that the EU will prevail. International officers said US officials want the EU to prevail.
• View source identification as an information extraction task and tackle the problem using sequence tagging and pattern matching techniques simultaneously
• Linear-chain CRF model to identify opinion sources
• Patterns incorporated as features
Sentiment Analysis : Twitter
1. Tweet normalization – A simple rule-based model – “gooood” to “good”, “luve” to “love”
2. POS tagging – OpenNLP POS tagger
3. Word stemming – A word stem mapping table (about 20,000 entries)
4. Syntactic parsing – A Maximum Spanning Tree dependency parser
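Step 1 (tweet normalization) can be sketched with two simple rules: collapse any character repeated three or more times down to two, then look the token up in a small replacement table. Both rules and the table entries are assumptions made for illustration; the deck only states that the model is rule-based.

```python
import re

# Hypothetical replacement table for informal spellings.
SLANG = {"luve": "love", "luv": "love"}


def normalize_tweet(text, slang=SLANG):
    """Lowercase, shorten elongated words, and replace known slang tokens."""
    tokens = []
    for token in text.lower().split():
        # "gooood" -> "good": runs of 3+ identical characters become 2.
        token = re.sub(r"(.)\1{2,}", r"\1\1", token)
        tokens.append(slang.get(token, token))
    return " ".join(tokens)


normalized = normalize_tweet("I luve this gooood phone")
```

Collapsing to two characters rather than one preserves legitimate double letters such as in "good", at the cost of leaving forms like "soo" that a dictionary check would still need to resolve.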
Crawling scenario : Definition
[Diagram: Scenario x with Instance 1, Instance 2, … Instance n; labels: Selected URLs, Configuration parameters, Name, Key words, …]
• Scenario : 1 -> n : Category
• Theme : n -> n : Scenario
• Scenario : 1 -> n : Instance
• The scenario defines the type of crawl we want to run. It is tied to the notion of instance, which is a specific configuration of a scenario.
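The scenario-to-instance relationship could be modeled roughly as below. The field names are a guess based on the diagram labels (name, key words, selected URLs, configuration parameters), the sample values are invented, and themes and categories are left out of this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Instance:
    """One specific configuration of a scenario."""
    name: str
    keywords: List[str]
    urls: List[str]                                   # selected URLs
    parameters: Dict[str, str] = field(default_factory=dict)


@dataclass
class Scenario:
    """A type of crawl; it owns one or more instances (1 -> n)."""
    name: str
    instances: List[Instance] = field(default_factory=list)

    def add_instance(self, instance: Instance) -> Instance:
        self.instances.append(instance)
        return instance


# Hypothetical scenario with two instances.
scenario = Scenario(name="Product reviews")
scenario.add_instance(
    Instance(
        name="Phones",
        keywords=["iPhone 4s"],
        urls=["http://example.com/forum/phones"],
        parameters={"depth": "2"},
    )
)
scenario.add_instance(
    Instance(name="Games", keywords=["games"], urls=["http://example.com/games"])
)
```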
URL management module
Configuration parameter management module
We should study the Nutch GUI currently under development and draw on it for managing parameters and URLs.
Theme
Category