Post on 31-Aug-2018
Social Media Analysis and Recommending Systems:
An Introduction to Question Answering
Roberto Basili (Università di Roma, Tor Vergata)
Master in Big Data, June 2016
Most slides from the teaching material of “Introduction to Information Retrieval”, Manning, Raghavan & Schutze, Cambridge University Press 2008, by C. Manning
Overview
• From query-based and keyword-based IR to Question Answering
• QA over Structured data
• From NL questions to SQL queries
• Major Approaches to QA
• Knowledge-based approaches to QA
• Text-based QA systems
• An architecture for a text-based QA system
• Question Classification and Answer matching as a classification task
Motivations
[Diagram: four overlapping research areas and their subtopics]
• Language Processing: parsing, semantic interpretation, NERC, relation extraction, coreference
• Social Media Analysis: trend analysis, community detection, recommending, opinion mining, emotional analysis, reputation management
• Machine Learning: Bayesian modeling, SVM, kernel machines, NN, clustering, language modeling, embeddings
• Information Retrieval: classification, indexing, search, QA, ranking, user modeling
Motivations
• Unstructured data are epistemologically opaque
• Queries are often poor surrogates for expressing user needs
• The Web provides a context for any search interaction, able to better characterize any query
• NL interactions, i.e. interactive Question Answering, are a possible and viable evolution of Web search, and actually very promising
Web Search in 2020?
The web, it is a-changing.
What will people do in 2020?
• Type key words into a search box?
• Use the Semantic Web?
• Speak to your computer with natural language search?
• Use social or “human powered” search?
What’s been happening at Google? (2014)
• New search index at Google: “Hummingbird”
• Answering long, “natural language” questions better
• Partly to deal with spoken queries on mobile
• More use of the Google Knowledge Graph
• Concepts versus words
What’s been happening
• Google Knowledge Graph
• Facebook Graph Search
• Bing’s Satori
• Things like Wolfram Alpha
Common theme: Doing graph search over structured knowledge rather than traditional text search
What’s been happening
• More semi-structured information embedded in web pages
• schema.org
What’s been happening
• Move to mobile favors a move to speech which favors “natural language information search”
• Will we move to a time when over half of searches are spoken?
Towards intelligent agents
Two goals
• Things not strings
• Inference not only search
Two paradigms for question answering
• Text-based approaches
• TREC QA, IBM Watson
• Structured knowledge-based approaches
• Apple Siri, Wolfram Alpha, Facebook Graph Search
(And, of course, there are hybrids, including some of the above.)
At the moment, structured knowledge is back in fashion, but it may or may not last
Example from Fernando Pereira (GOOG)
(Then) current experience [Patrick Pantel talk]
Desired experience: Towards actions
Learning actions from web usage logs
Entity disambiguation and linking
• Key requirement is that entities get identified
• Named entity recognition (e.g., Stanford NER!)
• and disambiguated
• Entity linking (or sometimes “Wikification”)
• e.g., Michael Jordan the basketballer or the ML guy
Sergio talked to Ennio about Eli’s role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio’s trilogy of western films.
Mentions, Meanings, Mappings [G. Weikum]
Sergio means Sergio_Leone; Sergio means Serge_Gainsbourg
Ennio means Ennio_Antonelli; Ennio means Ennio_Morricone
Eli means Eli_(bible); Eli means ExtremeLightInfrastructure; Eli means Eli_Wallach
Ecstasy means Ecstasy_(drug); Ecstasy means Ecstasy_of_Gold
trilogy means Star_Wars_Trilogy; trilogy means Lord_of_the_Rings; trilogy means Dollars_Trilogy
…
[Diagram: mapping from mentions (surface names) to candidate KB entities (meanings), e.g. Eli → {Eli (bible), Eli Wallach}; trilogy → {Dollars Trilogy, Lord of the Rings, Star Wars Trilogy}; Ecstasy → {Ecstasy of Gold, Ecstasy (drug)}; Benny → {Benny Andersson, Benny Goodman}]
• and linked to a canonical reference
• Freebase, dbPedia, Yago2, (WordNet)
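A common baseline for the disambiguation step above is to pick, for each mention, the candidate entity whose description best overlaps the mention's context. The sketch below illustrates this on the Sergio/Ennio example; the candidate lists and descriptions are illustrative stand-ins, not a real KB such as dbPedia or Yago2.

```python
# Toy entity linker: pick the candidate entity whose description
# shares the most words with the mention's sentence context.
# The candidate entities and descriptions below are illustrative only.

CANDIDATES = {
    "Sergio": {
        "Sergio_Leone": "italian director of western films and the dollars trilogy",
        "Serge_Gainsbourg": "french singer songwriter and actor",
    },
    "Ennio": {
        "Ennio_Morricone": "composer of film music for western films",
        "Ennio_Antonelli": "italian catholic cardinal",
    },
}

def link(mention, context):
    """Return the candidate entity with the largest word overlap."""
    ctx = set(context.lower().split())
    best, best_score = None, -1
    for entity, desc in CANDIDATES[mention].items():
        score = len(ctx & set(desc.lower().split()))
        if score > best_score:
            best, best_score = entity, score
    return best

context = "Sergio talked to Ennio about the music in his trilogy of western films"
print(link("Sergio", context))  # -> Sergio_Leone
```

Real linkers replace the bag-of-words overlap with entity popularity priors, embedding similarity, and joint (collective) disambiguation over all mentions in the text.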
Three approaches to question answering: Knowledge-based approaches (Siri)
• Build a semantic representation of the query
• Times, dates, locations, entities, numeric quantities
• Map from this semantics to query structured data or resources (SQL or SparQL)
• Geospatial databases
• Ontologies (Wikipedia infoboxes, dbPedia, WordNet, Yago)
• Restaurant review sources and reservation services
• Scientific databases
• Wolfram Alpha
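The "map semantics to a structured query" step can be made concrete with a minimal pattern-to-SPARQL sketch. The single regex pattern and the `dbo:capital` predicate are illustrative assumptions; a real system needs a full semantic parser and an entity linker rather than string templates.

```python
import re

# Minimal sketch: map one NL question pattern to a SPARQL query string
# against dbPedia. The pattern and predicate choice are assumptions
# for illustration; no query is actually sent to an endpoint here.

def question_to_sparql(question):
    m = re.match(r"What is the capital of (\w+)\?", question)
    if not m:
        return None  # pattern not covered by this toy grammar
    country = m.group(1)
    return (
        "PREFIX dbo: <http://dbpedia.org/ontology/> "
        "PREFIX dbr: <http://dbpedia.org/resource/> "
        f"SELECT ?capital WHERE {{ dbr:{country} dbo:capital ?capital }}"
    )

print(question_to_sparql("What is the capital of Italy?"))
```
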
Types of Questions in Modern Systems
• Factoid questions
• Who wrote “The Universal Declaration of Human Rights”?
• How many calories are there in two slices of apple pie?
• What is the average age of the onset of autism?
• Where is Apple Computer based?
• Complex (narrative) questions:
• In children with an acute febrile illness, what is the efficacy of acetaminophen in reducing fever?
• What do scholars think about Jefferson’s position on dealing with pirates?
Text-based (mainly factoid) QA
• QUESTION PROCESSING
• Detect question type, answer type, focus, relations
• Formulate queries to send to a search engine
• PASSAGE RETRIEVAL
• Retrieve ranked documents
• Break into suitable passages and rerank
• ANSWER PROCESSING
• Extract candidate answers (as named entities)
• Rank candidates
• using evidence from relations in the text and external sources
Hybrid approaches (IBM Watson)
• Build a shallow semantic representation of the query
• Generate answer candidates using IR methods
• Augmented with ontologies and semi-structured data
• Score each candidate using richer knowledge sources
• Geospatial databases
• Temporal reasoning
• Taxonomical classification
IR-based Factoid QA
[Architecture diagram: Question → Question Processing (query formulation, answer type detection) → Document Retrieval over an indexed document collection → Passage Retrieval over the relevant documents → Answer Processing over the passages → Answer]
IR-based Factoid QA
• QUESTION PROCESSING
• Detect question type, answer type, focus, relations
• Formulate queries to send to a search engine
• PASSAGE RETRIEVAL
• Retrieve ranked documents
• Break into suitable passages and rerank
• ANSWER PROCESSING
• Extract candidate answers
• Rank candidates
• using evidence from the text and external sources
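The passage-retrieval stage above ("break into suitable passages and rerank") can be sketched with a deliberately simple scorer: split documents into fixed-length passages and rank them by how many distinct query keywords each contains. Real systems use richer features (n-gram overlap, keyword proximity, named-entity match); the window length and scoring here are illustrative choices.

```python
# Sketch of passage retrieval: split each document into fixed-length
# passages, then rerank passages by the number of distinct query
# keywords they contain. Window size and scoring are toy assumptions.

def rank_passages(docs, query_keywords, passage_len=30):
    keywords = {k.lower() for k in query_keywords}
    passages = []
    for doc in docs:
        words = doc.split()
        for i in range(0, len(words), passage_len):
            passages.append(" ".join(words[i:i + passage_len]))
    return sorted(passages,
                  key=lambda p: len(keywords & set(p.lower().split())),
                  reverse=True)

docs = ["Jack Lemmon was awarded the Best Actor Oscar for Save the Tiger in 1973 .",
        "Kevin Spacey said that Jack Lemmon always made time for others ."]
print(rank_passages(docs, ["Oscar", "actor", "1973"])[0])
```
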
Question Processing: Things to extract from the question
• Answer Type Detection
• Decide the named entity type (person, place) of the answer
• Query Formulation
• Choose query keywords for the IR system
• Question Type classification
• Is this a definition question, a math question, a list question?
• Focus Detection
• Find the question words that are replaced by the answer
• Relation Extraction
• Find relations between entities in the question
Question Processing
They’re the two states you could be reentering if you’re crossing Florida’s northern border
• Answer Type: US state
• Query: two states, border, Florida, north
• Focus: the two states
• Relations: borders(Florida, ?x, north)
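The query-formulation part of this step (turning the clue into the keywords "two states, border, Florida, north") can be approximated by tokenizing and dropping stopwords. The stopword list below is a tiny illustrative subset of a real one, and a real system would also normalize morphology ("northern" → "north").

```python
import re

# Sketch of query formulation: lowercase, tokenize, and drop stopwords
# to obtain query keywords for the IR engine. The stopword list is a
# tiny illustrative subset of a real one.

STOPWORDS = {"they", "re", "the", "you", "could", "be", "if", "s",
             "is", "are", "a", "an", "of", "in", "to", "what", "who"}

def query_keywords(question):
    tokens = re.findall(r"[a-z]+", question.lower())
    return [t for t in tokens if t not in STOPWORDS]

clue = ("They're the two states you could be reentering "
        "if you're crossing Florida's northern border")
print(query_keywords(clue))
# -> ['two', 'states', 'reentering', 'crossing', 'florida', 'northern', 'border']
```
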
Answer Type Detection: Named Entities
• Who founded Virgin Airlines?
• PERSON
• What Canadian city has the largest population?
• CITY.
Answer Type Taxonomy
• 6 coarse classes
• ABBREVIATION, ENTITY, DESCRIPTION, HUMAN, LOCATION, NUMERIC
• 50 finer classes
• LOCATION: city, country, mountain…
• HUMAN: group, individual, title, description
• ENTITY: animal, body, color, currency…
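A first cut at answer-type detection over these coarse classes can be written as a handful of wh-word rules. These hand-written rules are illustrative only; Li & Roth actually train a classifier over much richer syntactic and semantic features, which is why learned approaches dominate.

```python
# Rule-based sketch of coarse answer-type detection over the six
# Li & Roth classes. The rules are illustrative assumptions, not the
# learned classifier of the actual Li & Roth system.

def coarse_answer_type(question):
    q = question.lower()
    if q.startswith("who"):
        return "HUMAN"
    if q.startswith("where"):
        return "LOCATION"
    if q.startswith(("how many", "how much", "when")):
        return "NUMERIC"
    if "stand for" in q:
        return "ABBREVIATION"
    if q.startswith(("why", "define")):
        return "DESCRIPTION"
    if any(w in q for w in ("city", "country", "state", "mountain")):
        return "LOCATION"
    return "ENTITY"  # fallback coarse class

print(coarse_answer_type("Who founded Virgin Airlines?"))                    # HUMAN
print(coarse_answer_type("What Canadian city has the largest population?"))  # LOCATION
```
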
Xin Li, Dan Roth. 2002. Learning Question Classifiers. COLING'02
Part of Li & Roth’s Answer Type Taxonomy
• LOCATION: country, city, state
• NUMERIC: date, percent, money, size, distance
• HUMAN: individual, title, group
• ENTITY: food, currency, animal
• DESCRIPTION: definition, reason
• ABBREVIATION: abbreviation, expression
Answer Types
Answer types in Jeopardy
• 2500 answer types in 20,000 Jeopardy question sample
• The most frequent 200 answer types cover < 50% of data
• The 40 most frequent Jeopardy answer types
he, country, city, man, film, state, she, author, group, here, company, president, capital, star, novel, character, woman, river, island, king, song, part, series, sport, singer, actor, play, team, show, actress, animal, presidential, composer, musical, nation, book, title, leader, game
Ferrucci et al. 2010. Building Watson: An Overview of the DeepQA Project. AI Magazine, Fall 2010, 59-79.
Knowledge: Not just semantics but pragmatics
Pragmatics = taking account of context in determining meaning
Search engines are great because they inherently take into account pragmatics (“associations and contexts”)
• [the national] The National (a band)
• [the national ohio] The National - Bloodbuzz Ohio – YouTube
• [the national broadband] www.broadband.gov
Task – Answer Sentence Selection
• Given a factoid question, find the sentence that
• Contains the answer
• Can sufficiently support the answer
Q: Who won the best actor Oscar in 1973?
S1: Jack Lemmon was awarded the Best Actor Oscar for Save the Tiger (1973).
S2: Academy award winner Kevin Spacey said that Jack Lemmon is remembered as always making time for others.
From the paper by Scott Wen-tau Yih et al. (ACL 2013)
Assume that there is an underlying alignment that describes which words in the question and in the candidate sentence can be associated.
What is the fastest car in the world?
The Jaguar XJ220 is the dearest, fastest and most sought after car on the planet.
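A minimal version of this alignment associates a question word with every sentence position holding the identical (lowercased) word. This exact-match sketch is an assumption for illustration; Yih et al. and later systems also align through lemmas, synonyms, and semantic similarity, which is what would link "world" to "planet" here.

```python
import re

# Minimal word-alignment sketch: a question word aligns to every
# sentence position holding the identical lowercased word. Exact match
# only; lemma/synonym/similarity matching is left out deliberately.

def align(question, sentence):
    s_words = re.findall(r"[a-z0-9]+", sentence.lower())
    alignment = {}
    for qw in re.findall(r"[a-z0-9]+", question.lower()):
        positions = [i for i, sw in enumerate(s_words) if sw == qw]
        if positions:
            alignment[qw] = positions
    return alignment

q = "What is the fastest car in the world?"
s = "The Jaguar XJ220 is the dearest, fastest and most sought after car on the planet."
print(align(q, s))  # aligns "is", "the", "fastest", "car"; misses "world" vs "planet"
```
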
Word Alignment for Question Answering
TREC QA (1999-2005)
See if the (syntactic/semantic) relations support the answer
[Harabagiu & Moldovan, 2001]
Full NLP QA: LCC (Harabagiu/Moldovan) [below is the architecture of LCC’s QA system circa 2003]
[Architecture diagram of the LCC system:
• Question Processing (factoid and list questions): question parse, semantic transformation, recognition of the expected answer type (for NER), keyword extraction; supported by Named Entity Recognition (CICERO LITE) and an Answer Type Hierarchy (WordNet)
• Question Processing (definition questions): question parse, pattern matching, keyword extraction
• Passage Retrieval and Document Processing: document index over the document collection, yielding single factoid passages, multiple list passages, or multiple definition passages
• Factoid Answer Processing: answer extraction (NER), answer justification (alignment, relations), answer reranking (~ theorem prover) over an axiomatic knowledge base → factoid answer
• List Answer Processing: answer extraction with threshold cutoff → list answer
• Definition Answer Processing: answer extraction by pattern matching against a pattern repository → definition answer]
Question Answering: IBM’s Watson
• Won Jeopardy on February 16, 2011!
WILLIAM WILKINSON’S
“AN ACCOUNT OF THE PRINCIPALITIES OF
WALLACHIA AND MOLDAVIA”
INSPIRED THIS AUTHOR’S
MOST FAMOUS NOVEL
Bram Stoker
IBM Watson: between Intelligence and Data
• IBM’s Watson
• http://www-03.ibm.com/innovation/us/watson/science-behind_watson.shtml
Jeopardy!
[Screenshots: Watson on the Jeopardy! set, answering the Wallachia and Moldavia clue]
Semantic Inference in Watson QA
… Intelligence in Watson
Watson: a DeepQA architecture
Ready for Jeopardy!
References
• NLP, IR & ML:
• “Statistical Methods for Speech Recognition”, F. Jelinek, MIT Press, 1998.
• “Speech and Language Processing”, D. Jurafsky and J. H. Martin, Prentice-Hall, 2009.
• “Introduction to Information Retrieval”, Manning, Raghavan & Schutze, Cambridge University Press 2008.
• Social Networks and Data Analytics:
• Community Detection and Mining in Social Media, Lei Tang, Huan Liu, Morgan & Claypool Publishers, 2010.
• Analyzing the Social Web, Jennifer Golbeck, Elsevier, 2015.
• Web resources:
• SAG, Univ. Roma Tor Vergata: http://sag.art.uniroma2.it/