SEASR Text

Post on 26-Jan-2015

Pathway to SEASR Workshop in March 2009 in North Carolina

Text

National Center for Supercomputing Applications University of Illinois at Urbana-Champaign

MONK Project

MONK provides:

•  1400 works of literature in English from the 16th to the 19th century (108 million words), POS-tagged, TEI-tagged, in a MySQL database.

•  Several different open-source interfaces for working with this data

•  A public API to the datastore

•  SEASR under the hood, for analytics

MONK Project

Executes flows for each analysis requested

–  Predictive modeling using Naïve Bayes

–  Predictive modeling using Support Vector Machines (SVM)

Dunning Loglikelihood TagCloud

•  Words that are under-represented in writings by Victorian women as compared to Victorian men. —Sara Steger
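As a rough illustration of how under-represented words can be scored, the sketch below computes a Dunning log-likelihood value for one word across two corpora. It uses the two-term form commonly applied in corpus comparison, which is an assumption here: the slides do not say which variant MONK uses. The function names and the sign convention are hypothetical.

```python
import math

def dunning_g2(count_a, total_a, count_b, total_b):
    """Dunning log-likelihood (two-term form) for one word in two corpora."""
    # Expected counts if the word were equally frequent in both corpora.
    pooled_rate = (count_a + count_b) / (total_a + total_b)
    expected_a = total_a * pooled_rate
    expected_b = total_b * pooled_rate
    g2 = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2.0 * g2

def signed_g2(count_a, total_a, count_b, total_b):
    """Same score, made negative when the word is under-represented in corpus A.

    The sign convention is an assumption added for illustration.
    """
    g2 = dunning_g2(count_a, total_a, count_b, total_b)
    return -g2 if count_a / total_a < count_b / total_b else g2
```

Sorting words by the signed score would put the most under-represented words of corpus A at one end of the list, which is the kind of ranking a tag cloud could then display.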

Feature Lens

“The discussion of the children introduces each of the short internal narratives. This champions the view that her method of repetition was patterned: controlled, intended, and a measured means to an end.

It would have been impossible to discern through traditional reading.”

Semantic Analysis: Information Extraction

•  Definition: Information extraction is the identification of specific semantic elements within a text (e.g., entities, properties, relations)

•  Extract the relevant information and ignore non-relevant information (important!)

•  Link related information and output in a predetermined format

Information Extraction

Information Type: State of the Art (Accuracy)

•  Entities (90-98%): an object of interest such as a person or organization.

•  Attributes (80%): a property of an entity such as its name, alias, descriptor, or type.

•  Facts (60-70%): a relationship held between two or more entities, such as the Position of a Person in a Company.

•  Events (50-60%): an activity involving several entities, such as a terrorist act, airline crash, management change, or new product introduction.

“Introduction to Text Mining,” Ronen Feldman, Computer Science Department, Bar-Ilan University, ISRAEL

Information Extraction Approaches

•  Terminology (name) lists

–  This works very well if the list of names and name expressions is stable and available

•  Tokenization and morphology

–  This works well for things like formulas or dates, which are readily recognized by their internal format (e.g., DD/MM/YY or chemical formulas)

•  Use of characteristic patterns

–  This works fairly well for novel entities

–  Rules can be created by hand or learned via machine learning or statistical algorithms

–  Rules capture local patterns that characterize entities from instances of annotated training data
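The first two approaches above can be sketched in a few lines: a lookup against a terminology list, and a regular expression that recognizes dates by their internal format. The list contents, the labels, and the sample sentence are hypothetical and only for illustration.

```python
import re

# Hypothetical terminology list; a real one would come from a curated resource.
KNOWN_NAMES = {"Rex Luthor": "Person", "Alderwood": "Location"}

# Dates in DD/MM/YY form are recognized purely by their internal format.
DATE_PATTERN = re.compile(r"\b(0[1-9]|[12]\d|3[01])/(0[1-9]|1[0-2])/(\d{2})\b")

def extract(text):
    """Return (surface string, label) pairs found by list lookup and by format."""
    found = []
    for name, label in KNOWN_NAMES.items():
        if name in text:
            found.append((name, label))
    for match in DATE_PATTERN.finditer(text):
        found.append((match.group(0), "Date"))
    return found

# Made-up sentence, not taken from the slides.
results = extract("Rex Luthor visited Alderwood on 14/03/09.")
```

The list lookup only works for names that are already known, which is why the slides reserve pattern-based rules for novel entities.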

Mayor Rex Luthor announced today the establishment of a new research facility in Alderwood. It will be known as Boynton Laboratory.

•  NE:Person: Rex Luthor

•  NE:Time: today

•  NE:Location: Alderwood

•  NE:Organization: Boynton Laboratory

Semantic Analytics

Named Entity (NE) Tagging

Mayor Rex Luthor announced today the establishment of a new research facility in Alderwood. It will be known as Boynton Laboratory.

UNE:Organization

Semantic Analysis

Co-reference Resolution for entities and unnamed entities

Mayor Rex Luthor announced today the establishment of a new research facility in Alderwood. It will be known as Boynton Laboratory.

Role labels in the figure: ACTION, ACTOR, WHEN, OBJECT, WHERE, ACTION, OBJECT, COMPL

Semantic Analysis

Semantic Role Analysis

Figure nodes: Rex Luthor (person), announce (action), establ. (event), Boynton Lab (organiz.), today (time), Alderwood (location)

Figure edge labels: actor (who), time (when), object (what), object (what), location (where)

Semantic Analysis

Concept-Relation Extraction
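One way to hold the output of concept-relation extraction is a set of typed entities plus labelled relations between them. The sketch below uses the entities and labels from the example sentence. The edge directions, the `Relation` type, and the `roles_of` helper are assumptions, since the figure's layout is not recoverable from the text; "establ." is spelled out as "establishment".

```python
from collections import namedtuple

# Hypothetical container for one labelled edge in the concept graph.
Relation = namedtuple("Relation", ["source", "role", "target"])

# Entity types as listed for the example sentence.
entity_types = {
    "Rex Luthor": "person",
    "announce": "action",
    "establishment": "event",
    "Boynton Lab": "organization",
    "today": "time",
    "Alderwood": "location",
}

# Which node each edge connects is an assumption made for illustration.
relations = [
    Relation("announce", "actor (who)", "Rex Luthor"),
    Relation("announce", "time (when)", "today"),
    Relation("announce", "object (what)", "establishment"),
    Relation("establishment", "object (what)", "Boynton Lab"),
    Relation("establishment", "location (where)", "Alderwood"),
]

def roles_of(source):
    """Map each role label to its target for one source node."""
    return {r.role: r.target for r in relations if r.source == source}
```

A structure like this is what makes results such as timelines (from "when" edges) and maps (from "where" edges) straightforward to generate.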

Results: Timeline

Results: Maps

UIMA Structured data

•  Two SEASR examples using UIMA POS data

–  Frequent patterns (association rules) on nouns (FP-growth)

–  Sentiment analysis on adjectives

UIMA

Unstructured Information Management Architecture

UIMA + P.O.S. tagging

Four analysis engines analyze a document and record part-of-speech information:

•  OpenNLP Tokenizer

•  OpenNLP PosTagger

•  OpenNLP SentenceDetector

•  POSWriter

Serialization of the UIMA CAS

UIMA to SEASR: Experiment I

•  Finding patterns

SEASR + UIMA: Frequent Patterns

Frequent Pattern Analysis on nouns

•  Goal:

–  Discover a cast of characters within the text

–  Discover nouns that frequently occur together

•  character relationships

Frequent Patterns: visualization

Analysis of Tom Sawyer: 10-paragraph window, support set to 10%
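The idea of finding nouns that frequently occur together can be sketched with plain pair counting over paragraph windows. This is a simplification, not FP-growth itself, and it assumes non-overlapping windows and that the nouns have already been extracted per paragraph. The function name and parameters are hypothetical.

```python
from collections import Counter
from itertools import combinations

def frequent_noun_pairs(paragraph_nouns, window=10, min_support=0.10):
    """Return noun pairs whose support across windows meets min_support.

    paragraph_nouns: a list of noun lists, one list per paragraph.
    """
    # Each window of consecutive paragraphs becomes one "transaction".
    transactions = [
        set().union(*paragraph_nouns[i:i + window])
        for i in range(0, len(paragraph_nouns), window)
    ]
    pair_counts = Counter()
    for nouns in transactions:
        pair_counts.update(combinations(sorted(nouns), 2))
    threshold = min_support * len(transactions)
    return {
        pair: count / len(transactions)
        for pair, count in pair_counts.items()
        if count >= threshold
    }
```

With character names as the nouns, the surviving pairs hint at which characters tend to appear in the same stretch of text.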

UIMA to SEASR: Experiment II

•  Sentiment Analysis

UIMA + SEASR: Sentiment Analysis

•  Classifying text based on its sentiment

–  Determining the attitude of a speaker or a writer

–  Determining whether a review is positive/negative

•  Ask: What emotion is being conveyed within a body of text?

–  Look at only adjectives (UIMA POS)

•  Lots of issues, challenges, and “but …” caveats

•  Need to answer:

–  What emotions to track?

–  How to measure/classify an adjective to one of the selected emotions?

–  How to visualize the results?

UIMA + SEASR: Sentiment Analysis

•  Which emotions:

–  http://en.wikipedia.org/wiki/List_of_emotions

–  http://changingminds.org/explanations/emotions/basic%20emotions.htm

–  http://www.emotionalcompetency.com/recognizing.htm

•  Parrott’s classification (2001)

–  six core emotions

–  Love, Joy, Surprise, Anger, Sadness, Fear

UIMA + SEASR: Sentiment Analysis

•  How to classify adjectives:

–  Lots of metrics we could use …

•  Lists of adjectives already classified

–  http://www.derose.net/steve/resources/emotionwords/ewords.html

–  Need a “nearness” metric for missing adjectives

–  How about the thesaurus game?

•  Using only a thesaurus, find a path between two words

–  no antonyms

–  no colloquialisms or slang

UIMA + SEASR: Sentiment Analysis

•  How to get from delightful to rainy?

–  ['delightful', 'fair', 'balmy', 'moist', 'rainy']

•  sexy to joyless?

–  ['sexy', 'provocative', 'blue', 'joyless']

•  bitter to lovable?

–  ['bitter', 'acerbic', 'tangy', 'sweet', 'lovable']

UIMA + SEASR: Sentiment Analysis

•  Use this game as a metric for measuring how close a given adjective is to each of the six emotions.

•  Assume the longer the path, the “farther away” the two words are.

•  Address some of the issues

SynNet: rainy to pleasant
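The thesaurus game can be sketched as a breadth-first search over a synonym graph, which returns a shortest path when one exists. The toy graph below contains only words that appear on the slides. Treating every synonym link as two-way is an assumption, and `build_graph` and `shortest_path` are hypothetical names rather than SynNet functions.

```python
from collections import deque

# Toy synonym lists; a real run would load a full thesaurus.
SYNONYMS = {
    "delightful": ["fair", "beatific"],
    "fair": ["balmy"],
    "balmy": ["moist"],
    "moist": ["rainy"],
    "beatific": ["joyful"],
}

def build_graph(synonyms):
    """Build an undirected graph (assumes synonym links are symmetric)."""
    graph = {}
    for word, neighbours in synonyms.items():
        for other in neighbours:
            graph.setdefault(word, set()).add(other)
            graph.setdefault(other, set()).add(word)
    return graph

def shortest_path(graph, start, goal):
    """Breadth-first search; returns a list of words or None if unreachable."""
    if start == goal:
        return [start]
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        for nxt in sorted(graph.get(path[-1], ())):
            if nxt in seen:
                continue
            if nxt == goal:
                return path + [nxt]
            seen.add(nxt)
            queue.append(path + [nxt])
    return None

GRAPH = build_graph(SYNONYMS)
path = shortest_path(GRAPH, "delightful", "rainy")
```

The path length then serves as the "nearness" measure: the fewer hops between two words, the closer they are taken to be.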

UIMA + SEASR: Sentiment Analysis

•  SynNet Metrics

•  Common nodes

•  Path length

•  Symmetric: a->b->c and c->b->a

•  Link strength:

•  tangy->sweet

•  sweet->lovable

•  Use of slang or informal usage

UIMA + SEASR: Sentiment Analysis

•  Common Nodes

•  Depth of common nodes

UIMA + SEASR: Sentiment Analysis

•  Symmetry of path in common nodes

UIMA + SEASR: Sentiment Analysis

•  Find the shortest path between adjective and each emotion:

•  ['delightful', 'beatific', 'joyful']

•  ['delightful', 'ineffable', 'unspeakable', 'fearful']

•  Pick the emotion with shortest path length

•  tie breaking procedures
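Picking the emotion with the shortest path might look like the sketch below, which takes paths that have already been found and returns the emotion word with the fewest steps. The slides do not describe the tie-breaking procedures, so the alphabetical fallback here is purely an assumption, and `pick_emotion` is a hypothetical name.

```python
def pick_emotion(paths):
    """Choose the emotion whose path is shortest.

    paths: dict mapping an emotion word to the path found for it, or None
    when no path exists.
    """
    reachable = {emotion: path for emotion, path in paths.items() if path}
    if not reachable:
        return None
    # Shortest path wins; ties fall back to alphabetical order (an assumption).
    return min(reachable, key=lambda emotion: (len(reachable[emotion]), emotion))

# Paths for "delightful" as listed above.
paths_for_delightful = {
    "joyful": ["delightful", "beatific", "joyful"],
    "fearful": ["delightful", "ineffable", "unspeakable", "fearful"],
}
label = pick_emotion(paths_for_delightful)
```

Here "joyful" wins because its path has three words against four for "fearful".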

UIMA + SEASR: Sentiment Analysis

•  Not a perfect solution

–  still need context to get quality

•  Vain

–  ['vain', 'insignificant', 'contemptible', 'hateful']

–  ['vain', 'misleading', 'puzzling', 'surprising']

•  Animal

–  ['animal', 'sensual', 'pleasing', 'joyful']

–  ['animal', 'bestial', 'vile', 'hateful']

–  ['animal', 'gross', 'shocking', 'fearful']

–  ['animal', 'gross', 'grievous', 'sorrowful']

•  Negation

–  “My mother was not a hateful person.”

UIMA + SEASR: Sentiment Analysis

•  Process Overview

•  Extract the adjectives (UIMA POS analysis)

•  Read in adjectives (SEASR library)

•  Label each adjective (SynNet)

•  Summarize windows of adjectives

•  lots of experimentation here

•  Visualize the windows
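The process above can be sketched end to end on a small scale: filter adjectives out of POS-tagged tokens, label them, and count labels per window. The sketch assumes the POS output uses Penn Treebank tags (JJ, JJR, JJS for adjectives) and substitutes a hypothetical lookup table for the SynNet labelling step. The fixed, non-overlapping windows are also an assumption, since the slides note that this step involved a lot of experimentation.

```python
from collections import Counter

ADJECTIVE_TAGS = {"JJ", "JJR", "JJS"}  # Penn Treebank adjective tags

def extract_adjectives(tagged_tokens):
    """Keep only the adjectives from (token, tag) pairs."""
    return [token.lower() for token, tag in tagged_tokens if tag in ADJECTIVE_TAGS]

def summarize_windows(adjectives, labels, window=3):
    """Count emotion labels in consecutive, non-overlapping windows."""
    summaries = []
    for start in range(0, len(adjectives), window):
        chunk = adjectives[start:start + window]
        # Adjectives without a label are skipped rather than guessed.
        summaries.append(Counter(labels[a] for a in chunk if a in labels))
    return summaries

# Hypothetical tagged input and label table, for illustration only.
tagged = [("The", "DT"), ("delightful", "JJ"), ("day", "NN"),
          ("vain", "JJ"), ("hateful", "JJ"), ("rainy", "JJ")]
labels = {"delightful": "joy", "vain": "anger", "hateful": "anger"}

adjectives = extract_adjectives(tagged)
summaries = summarize_windows(adjectives, labels, window=2)
```

Each per-window counter is the kind of summary a visualization component could plot along the length of the text.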

UIMA + SEASR: Sentiment Analysis

•  Visualization

•  New SEASR visualization component

•  Based on flare ActionScript Library

•  http://flare.prefuse.org/

•  Still in development

•  http://demo.seasr.org:1714/public/resources/data/emotions/ev/EmotionViewer.html