How computers understand text content
a presentation for the Auckland content strategy meetup
by Anna Divoli (@annadivoli)
Ph.D. in Biomedical Text Mining | Text Analytics Researcher | Head of R&D at Pingar
Who am I?
• 14 years in academia + 4 years in industry
• academically exposed to different disciplines:
biomedicine, bioinformatics, computational linguistics, information retrieval, information extraction, semantic technologies, human-computer interaction, search user interface usability, knowledge acquisition, visualizations
• lived in different countries: Greece, UK, US, NZ
• learned English as a second language (hint: I empathize with computer systems)
Anna Divoli Auckland content strategy meetup Aug 2015
Who are you?
• Marketing?
• Digital content?
• Information Architecture?
• Journalists?
• UX?
• Business Analysis?
• Software Development?
• CS research (incl. “text” people)?
• Other?
What is “text”? Where is it?
Image sources:
www.nailingit.com/images/websites.jpg
www.bu.edu/today/files/2012/10/t_journals1.jpg
web.clarku.edu/offices/its/images/filepile.jpg
www.flickr.com/photos/jlconfor/14191286471
Human – Text Content Interaction
Humans: slow, inconsistent, expensive
Text content: overwhelmingly fast growing, disseminated across multiple sources
NLP ∈ Artificial Intelligence
Machine Learning
NLP
Computational Linguistics
Applied Text Analytics
Storage
Memory
Security
Friendly UIs
Visualizations
So, what’s in the text?
• Entities
• Facts
• Relations
• Themes/topics
• Opinions & sentiment
• …
+ Time/Location dimensions:
• Trends & paradigm shifts
• Networks
• …
Named Entity Recognition
Find and classify names…
S. Arlington initiated partnership discussions during his visit to Eureka’s Ltd offices last month.
John Smith went to Washington to see the Smithsonian and also met up with Virginia for a coffee.
People Locations Organizations
Methods:
• lexicon-based (gazetteers)
• grammar-based (rule-based)
✓ statistical models (machine learning: algorithms + features)
✓ hybrids
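The lexicon-based method can be illustrated with a minimal sketch. The gazetteer below is hypothetical and hand-built for the second example sentence only; a real system would load far larger lists and usually combine them with rules or statistical models.

```python
import re

# Hypothetical gazetteers (assumption): tiny hand-made lists for illustration only.
GAZETTEER = {
    "PERSON": ["John Smith", "Virginia"],
    "LOCATION": ["Washington"],
    "ORGANIZATION": ["Smithsonian"],
}

def tag_entities(text):
    """Return (start, surface, label) for every gazetteer entry found in text."""
    found = []
    for label, names in GAZETTEER.items():
        for name in names:
            # The name must not be glued to other word characters on either side.
            pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
            for match in re.finditer(pattern, text):
                found.append((match.start(), match.group(), label))
    return sorted(found)

sentence = ("John Smith went to Washington to see the Smithsonian "
            "and also met up with Virginia for a coffee.")

for start, surface, label in tag_entities(sentence):
    print(f"{surface!r:15} {label}")
```

A pure lookup like this labels "Virginia" and "Washington" with whatever the list says; it cannot tell a person from a place, which is one reason statistical and hybrid methods are favoured.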
People, Dates, Locations, Organizations
Who? Where? When?
Disambiguation & Normalization: Word Sense Disambiguation & Text Normalization
S. Arlington initiated partnership discussions during his visit to Eureka’s Ltd offices last month.
John Smith went to Washington to see the Smithsonian and also met up with Virginia for a coffee.
Word Sense Disambiguation: identifying which sense/meaning of a word is used in a sentence, when the word has multiple meanings. Synonyms & homonyms. Use context!!
Text normalization: transforming text into a single canonical form that it might not have had before.
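"Use context" can be sketched as a simple overlap count between the words around an ambiguous name and a set of cue words per sense. The sense inventory, the cue words, and the window size below are all assumptions made up for this illustration; real systems learn such evidence from data.

```python
import re

# Hypothetical sense inventory (assumption): cue words that hint at each sense.
SENSES = {
    "Virginia": {
        "PERSON": {"met", "coffee", "said", "she", "her"},
        "LOCATION": {"went", "travelled", "state", "visit", "in"},
    },
    "Washington": {
        "PERSON": {"met", "said", "president", "he"},
        "LOCATION": {"went", "to", "visit", "see", "city"},
    },
}

def disambiguate(word, sentence, window=3):
    """Pick the sense whose cue words overlap most with the nearby context."""
    tokens = re.findall(r"\w+", sentence)
    i = tokens.index(word)
    nearby = tokens[max(0, i - window): i + window + 1]
    context = {t.lower() for t in nearby} - {word.lower()}
    scores = {sense: len(cues & context) for sense, cues in SENSES[word].items()}
    # Ties fall back to the first sense listed.
    return max(scores, key=scores.get)

sentence = ("John Smith went to Washington to see the Smithsonian "
            "and also met up with Virginia for a coffee.")

print(disambiguate("Washington", sentence))
print(disambiguate("Virginia", sentence))
```

With these cue words, "went to ... to see" pulls Washington towards a location, while "met up with ... coffee" pulls Virginia towards a person.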
Word Sense Disambiguation & Text Normalization
S. Arlington initiated partnership discussions during his visit to Eureka’s Ltd offices last month.
Sam Arlington initiated partnership discussions during his visit to Eureka offices in July.
John Smith went to Washington to see the Smithsonian and also met up with Virginia for a coffee.
J. Smith went to Washington DC to see the Smithsonian Institution and also met up with Virginia Peterson for a coffee.
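A rough sketch of how the first sentence could be normalized: an alias table maps surface forms to canonical names, and the relative date "last month" is resolved against a reference date. The alias table and the reference date (August 2015, the month of this talk) are assumptions for illustration, and the sentence is typed with a plain apostrophe.

```python
import calendar
from datetime import date

# Hypothetical alias table (assumption): surface form -> canonical form.
CANONICAL = {
    "S. Arlington": "Sam Arlington",
    "Eureka's Ltd": "Eureka",
}

def resolve_last_month(reference):
    """Name of the month before the reference date."""
    month = 12 if reference.month == 1 else reference.month - 1
    return calendar.month_name[month]

def normalize(text, reference):
    """Replace known aliases, then resolve the relative date expression."""
    for surface, canonical in CANONICAL.items():
        text = text.replace(surface, canonical)
    return text.replace("last month", "in " + resolve_last_month(reference))

sentence = ("S. Arlington initiated partnership discussions during his visit "
            "to Eureka's Ltd offices last month.")

print(normalize(sentence, date(2015, 8, 1)))
```

Plain string replacement is only safe here because the aliases do not overlap; a real normalizer would work on recognized entity spans rather than raw substrings.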
Fact & Relationship extraction
S. Arlington initiated partnership discussions during his visit to Eureka’s Ltd offices last month.
John Smith went to Washington to see the Smithsonian and also met up with Virginia for a coffee.
What?
Deeper knowledge & Sentiment
S. Arlington initiated partnership discussions during his visit to Eureka’s Ltd offices last month.
John Smith went to Washington to see the Smithsonian and also met up with Virginia for a coffee.
How? Why? How do we feel about it?
S. Arlington visited the Eureka’s Ltd offices last month to initiate partnership discussions.
John Smith was delighted to go to Washington to see the Smithsonian and also met up with Virginia for a coffee.
Sentiment analysis & opinion mining
• Dictionary-based (e.g. LIWC)
• Statistical
• Hybrid

• Polarity & strength: Pos | Neu | Neg & score
• Feelings: Angry, sad…
• Mood: Happy, depressed…
• Aspects: Location, cleanliness…
• Who has this sentiment (source): Employees, customers…
• What is the target of the sentiment: Product, event, person…
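The dictionary-based approach with polarity and strength can be sketched in a few lines. The lexicon below is a tiny made-up word list with invented weights; it is not LIWC and is only meant to show the mechanics.

```python
import re

# Tiny hypothetical lexicon (assumption, not LIWC): word -> polarity weight.
LEXICON = {
    "delighted": 2,
    "love": 2,
    "happy": 1,
    "sad": -1,
    "upset": -1,
    "crap": -2,
}

def polarity(text):
    """Return (label, score): the sum of lexicon weights over the tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(LEXICON.get(token, 0) for token in tokens)
    label = "Pos" if score > 0 else "Neg" if score < 0 else "Neu"
    return label, score

for text in [
    "John Smith was delighted to go to Washington to see the Smithsonian.",
    "I slept like crap last night.",
    "John Smith went to Washington.",
]:
    print(polarity(text), text)
```

Word counting of this kind ignores negation, sarcasm, and who feels what about which target, which is where statistical and hybrid approaches come in.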
So, what’s in the text?
• Entities
• Facts
• Relations
• Themes/topics: no training or ontologies needed! Can utilize web resources (e.g., Wikipedia)
• Opinions & sentiment
• …
+ Time/Location dimensions:
• Trends & paradigm shifts
• Networks
• …
So, what ELSE is in the text?
• Ambiguity: I want an apple.
• Metaphors: He drowned in a sea of grief.
• Sarcasm: George W Bush. Love him!
• Colloquialism/Slang: I slept like crap last night.
• Negation: I am not sure I want to go to NYC.
• Hedging: The results indicate this.
• Conditional statements: When it rains I feel sad.
• Inconsistencies/Bad grammar: I think your smart.
• Text speak: C u l8r @Jacks
• Anaphora: John met with Nick. He was upset.
• Humor: Did you take a bath today? No. Is one missing?
Consider: distributed information (dialogue), technical/scientific text, legal text, creative/poetry…
Human language!
Eye drops off shelf.
Include your children when baking cookies.
Turn right here.
John saw the man on the mountain with a telescope.
He gave her cat food.
They are hunting dogs.
Examples: Biology…
Looking for: interactions between SAF and viral LTR elements
(SAF is a transcription factor, LTR stands for ‘long terminal repeat’)
(Also: SAF = single and free, LTR = long term relationship)
Gene names: tinman, lilliputian, dreadlocks, lush, cheap date, methuselah, Van Gogh, maggie, brainiac, grim, reaper, cleopatra, swiss cheese, fucK, out cold, ken and barbie, kenny, lava lamp, hamlet, sonic hedgehog, werewolf, half pint, drop dead, chardonnay, agnostic, I’m not dead yet…
Current State of NLP
• Rule-based systems for high precision results
• Hybrid systems for more robust performance (rules + dictionaries/ontologies + statistical models)
• Limitation: specialized systems perform better (much like humans!)
• Workflows offer a workaround for more generic systems, e.g., check language → check category → choose model
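The workflow idea (check language, check category, choose model) can be sketched as a small routing function. The stopword lists, category keywords, and model names below are all hypothetical placeholders; a production system would use proper language identification and text classification at each step.

```python
import re

# All tables below are hypothetical placeholders (assumptions) for illustration.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "is"},
    "de": {"der", "und", "die", "das", "ist"},
}
CATEGORY_KEYWORDS = {
    "biomedical": {"gene", "protein", "transcription"},
    "legal": {"court", "plaintiff", "contract"},
}
MODELS = {
    ("en", "biomedical"): "en-biomedical-ner",
    ("en", "legal"): "en-legal-ner",
}

def tokenize(text):
    return set(re.findall(r"\w+", text.lower()))

def check_language(text):
    """Guess the language by counting overlapping stopwords."""
    tokens = tokenize(text)
    return max(STOPWORDS, key=lambda lang: len(STOPWORDS[lang] & tokens))

def check_category(text):
    """Guess the category by keyword overlap; fall back to 'general'."""
    tokens = tokenize(text)
    best = max(CATEGORY_KEYWORDS, key=lambda c: len(CATEGORY_KEYWORDS[c] & tokens))
    return best if CATEGORY_KEYWORDS[best] & tokens else "general"

def choose_model(text):
    """Route the text to a specialized model name, or a generic one."""
    lang = check_language(text)
    category = check_category(text)
    return MODELS.get((lang, category), f"{lang}-general")

print(choose_model("SAF is a transcription factor and the LTR is a repeat."))
print(choose_model("Der Hund und die Katze."))
```

The point of the routing step is that a generic system can still benefit from specialized models by deciding, per document, which one applies.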
Examples of applications
(some are very specialized!)
Take-home messages
• Machines can do a lot of consistent, fast information extraction
• Specialization is needed in several fields but systems can have internal workflows
• Big data + statistics = magic!
• Always room for improvement
• Information management AND decisions AND predictions