
Social Media Analysis and Recommending Systems:

An Introduction to Question Answering

Roberto Basili (Università di Roma, Tor Vergata)

Master in Big Data, June 2016

Most slides from the teaching material of “Introduction to Information Retrieval”, Manning, Raghavan & Schutze, Cambridge University Press 2008, by C. Manning

Overview

• From query-based and keyword-based IR to Question Answering

• QA over Structured data

• From NL questions to SQL queries

• Major Approaches to QA

• Knowledge-based approaches to QA

• Text-based QA systems

• An architecture for a text-based QA system

• Question Classification and Answer matching as a classification task

Motivations

[Figure: the fields surrounding QA and their main topics]

• Language Processing: parsing, semantic interpretation, NERC, relation extraction, coreference

• Social Media Analysis: trend analysis, community detection, recommending, opinion mining, emotional analysis, reputation management

• Machine Learning: Bayesian modeling, SVM, kernel machines, NN, clustering, language modeling, embeddings

• Information Retrieval: classification, indexing, search, QA, ranking, user modeling


• Unstructured data are epistemologically opaque

• Queries are often poor surrogates for expressing user needs

• The Web provides a context for any search interaction, able to better characterize any query

• NL interactions, i.e. interactive Question Answering, are a possible and viable evolution of Web search, and actually a very promising one

Web Search in 2020?

The web, it is a-changing.

What will people do in 2020?

• Type key words into a search box?

• Use the Semantic Web?

• Speak to your computer with natural language search?

• Use social or “human powered” search?

Google: What’s been happening? (2014)

• New search index at Google: “Hummingbird”

• Answering long, “natural language” questions better

• Partly to deal with spoken queries on mobile

• More use of the Google Knowledge Graph

• Concepts versus words

What’s been happening

• Google Knowledge Graph

• Facebook Graph Search

• Bing’s Satori

• Things like Wolfram Alpha

Common theme: Doing graph search over structured knowledge rather than traditional text search

What’s been happening

• More semi-structured information embedded in web pages

• schema.org

What’s been happening

• Move to mobile favors a move to speech which favors “natural language information search”

• Will we move to a time when over half of searches are spoken?

Towards intelligent agents

Two goals

• Things not strings

• Inference not only search

Two paradigms for question answering

• Text-based approaches

• TREC QA, IBM Watson

• Structured knowledge-based approaches

• Apple Siri, Wolfram Alpha, Facebook Graph Search

(And, of course, there are hybrids, including some of the above.)

At the moment, structured knowledge is back in fashion, but it may or may not last

Example from Fernando Pereira (GOOG)

(Then) current experience [Patrick Pantel talk]

Desired experience: Towards actions

Learning actions from web usage logs

Entity disambiguation and linking

• Key requirement is that entities get identified

• Named entity recognition (e.g., Stanford NER!)

• and disambiguated

• Entity linking (or sometimes “Wikification”)

• e.g., Michael Jordan the basketballer or the ML guy

Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films.

Mentions, Meanings, Mappings [G. Weikum]

• Sergio means Sergio_Leone | Serge_Gainsbourg

• Ennio means Ennio_Antonelli | Ennio_Morricone

• Eli means Eli_(bible) | ExtremeLightInfrastructure | Eli_Wallach

• Ecstasy means Ecstasy_(drug) | Ecstasy_of_Gold

• trilogy means Star_Wars_Trilogy | Lord_of_the_Rings | Dollars_Trilogy

• … … …

[Figure: mapping mentions (surface names) to entities (meanings) in the KB, e.g. Eli → {Eli_(bible), Eli_Wallach}, trilogy → {Dollars_Trilogy, Lord_of_the_Rings, Star_Wars_Trilogy}, Benny → {Benny_Andersson, Benny_Goodman}, Ecstasy → {Ecstasy_of_Gold, Ecstasy_(drug)}]

• and linked to a canonical reference

• Freebase, dbPedia, Yago2, (WordNet)
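The two steps above, generating candidate entities for a surface mention and then disambiguating against the context, can be sketched in a few lines of Python. The candidate table and the context-word profiles below are invented for the example; a real linker would use a KB like dbPedia or Yago2 and richer context models.

```python
# Minimal entity-linking sketch: dictionary-based candidate generation
# followed by context-overlap disambiguation (all profiles are toy data).

# Candidate table: surface mention -> possible KB entities with context words
CANDIDATES = {
    "Sergio": {
        "Sergio_Leone": {"western", "film", "director", "trilogy"},
        "Serge_Gainsbourg": {"singer", "french", "song"},
    },
    "Ennio": {
        "Ennio_Morricone": {"composer", "western", "score", "film"},
        "Ennio_Antonelli": {"cardinal", "church"},
    },
}

def link(mention, context_words):
    """Pick the candidate entity whose profile overlaps the context most."""
    candidates = CANDIDATES.get(mention, {})
    if not candidates:
        return None
    return max(candidates,
               key=lambda e: len(candidates[e] & set(context_words)))

context = "sergio talked to ennio about the western film trilogy".split()
print(link("Sergio", context))  # -> Sergio_Leone
print(link("Ennio", context))   # -> Ennio_Morricone
```

Overlap counting stands in here for what real systems compute with entity priors, coherence among all mentions in the document, and learned similarity models.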

3 approaches to question answering: Knowledge-based approaches (Siri)

• Build a semantic representation of the query

• Times, dates, locations, entities, numeric quantities

• Map from this semantics to query structured data or resources (SQL or SparQL)

• Geospatial databases

• Ontologies (Wikipedia infoboxes, dbPedia, WordNet, Yago)

• Restaurant review sources and reservation services

• Scientific databases

• Wolfram Alpha
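The mapping from a NL question to a query over structured data can be sketched for one fixed question pattern. Everything below (the table schema, the toy demonym lexicon, the data) is hypothetical; a real system would build a full semantic representation and target SQL or SPARQL over an actual database.

```python
# Hypothetical sketch: one NL question pattern compiled to SQL over a
# toy table held in an in-memory sqlite3 database.
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (name TEXT, country TEXT, population INTEGER)")
conn.executemany("INSERT INTO cities VALUES (?, ?, ?)",
                 [("Toronto", "Canada", 2930000),
                  ("Montreal", "Canada", 1780000),
                  ("Rome", "Italy", 2870000)])

def answer(question):
    """Translate 'What <X> city has the largest population?' into SQL."""
    m = re.match(r"What (\w+) city has the largest population\?", question)
    if not m:
        return None
    demonym_to_country = {"Canadian": "Canada", "Italian": "Italy"}  # toy lexicon
    country = demonym_to_country[m.group(1)]
    row = conn.execute(
        "SELECT name FROM cities WHERE country = ? "
        "ORDER BY population DESC LIMIT 1", (country,)).fetchone()
    return row[0]

print(answer("What Canadian city has the largest population?"))  # -> Toronto
```

The hard part in practice is exactly what this sketch dodges: covering open-ended phrasings rather than one regular-expression pattern.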


Types of Questions in Modern Systems

• Factoid questions

• Who wrote “The Universal Declaration of Human Rights”?

• How many calories are there in two slices of apple pie?

• What is the average age of the onset of autism?

• Where is Apple Computer based?

• Complex (narrative) questions:

• In children with an acute febrile illness, what is the efficacy of acetaminophen in reducing fever?

• What do scholars think about Jefferson’s position on dealing with pirates?

Text-based (mainly factoid) QA

• QUESTION PROCESSING

• Detect question type, answer type, focus, relations

• Formulate queries to send to a search engine

• PASSAGE RETRIEVAL

• Retrieve ranked documents

• Break into suitable passages and rerank

• ANSWER PROCESSING

• Extract candidate answers (as named entities)

• Rank candidates

• using evidence from relations in the text and external sources
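The three stages above can be strung together as a toy pipeline. Every component below (the stop-word filter, the overlap ranker, the capitalized-bigram "NER") is a deliberately naive stand-in for the real modules, and the two documents are invented:

```python
# Skeleton of the three-stage text-based QA pipeline:
# question processing -> passage retrieval -> answer processing.
import re

DOCS = [
    "Richard Branson founded Virgin Airlines in 1984.",
    "Virgin Airlines operates flights across the Atlantic.",
]

def process_question(q):
    """Question processing: guess the answer type and pick query keywords."""
    answer_type = "PERSON" if q.lower().startswith("who") else "OTHER"
    keywords = [w for w in re.findall(r"\w+", q.lower())
                if w not in {"who", "what", "where", "the", "a"}]
    return answer_type, keywords

def retrieve_passages(keywords):
    """Passage retrieval: rank sentences by keyword overlap."""
    return sorted(DOCS, key=lambda d: -len(
        set(re.findall(r"\w+", d.lower())) & set(keywords)))

def extract_answer(passages, answer_type):
    """Answer processing: capitalized bigrams as PERSON candidates (toy NER)."""
    for p in passages:
        names = re.findall(r"[A-Z][a-z]+ [A-Z][a-z]+", p)
        if answer_type == "PERSON" and names:
            return names[0]
    return None

atype, kws = process_question("Who founded Virgin Airlines?")
print(extract_answer(retrieve_passages(kws), atype))  # -> Richard Branson
```

Each function marks a module boundary where a production system would plug in a real classifier, a search engine, and a named entity recognizer.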

Hybrid approaches (IBM Watson)

• Build a shallow semantic representation of the query

• Generate answer candidates using IR methods

• Augmented with ontologies and semi-structured data

• Score each candidate using richer knowledge sources

• Geospatial databases

• Temporal reasoning

• Taxonomical classification
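The candidate-scoring idea can be sketched as a weighted combination of evidence scores. The two scorers and the weights below are invented stand-ins; a system like Watson uses hundreds of scorers and learns the combination from training data.

```python
# Hedged sketch of hybrid candidate scoring: each candidate answer gets a
# score from several (here toy) evidence sources, merged by fixed weights.

def type_match_score(candidate, expected_type):
    """Toy taxonomical evidence: does the candidate's type match?"""
    toy_types = {"Bram Stoker": "AUTHOR", "Wallachia": "REGION"}
    return 1.0 if toy_types.get(candidate) == expected_type else 0.0

def retrieval_score(candidate, passages):
    """Toy IR evidence: fraction of supporting passages mentioning the candidate."""
    return sum(candidate in p for p in passages) / max(len(passages), 1)

def rank(candidates, expected_type, passages, weights=(0.6, 0.4)):
    """Combine the evidence scores and sort candidates, best first."""
    scored = [(weights[0] * type_match_score(c, expected_type)
               + weights[1] * retrieval_score(c, passages), c)
              for c in candidates]
    return [c for _, c in sorted(scored, reverse=True)]

passages = ["Bram Stoker wrote Dracula, inspired by accounts of Wallachia."]
print(rank(["Bram Stoker", "Wallachia"], "AUTHOR", passages)[0])  # -> Bram Stoker
```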


IR-based Factoid QA

[Figure: IR-based factoid QA pipeline. A Question goes through Question Processing (Query Formulation, Answer Type Detection); queries run against an indexed Document Collection (Document Retrieval); the relevant documents are split into passages (Passage Retrieval); Answer Processing extracts the final Answer.]


Question Processing: things to extract from the question

• Answer Type Detection

• Decide the named entity type (person, place) of the answer

• Query Formulation

• Choose query keywords for the IR system

• Question Type classification

• Is this a definition question, a math question, a list question?

• Focus Detection

• Find the question words that are replaced by the answer

• Relation Extraction

• Find relations between entities in the question
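Of the tasks above, question type classification is the easiest to sketch. Below is a toy rule-based classifier over coarse answer types; real systems learn this mapping from labeled questions (as in Li & Roth 2002) rather than using hand-written trigger lists.

```python
# Toy rule-based question classifier over coarse answer-type classes.
# The trigger lists are invented; learned classifiers replace them in practice.

RULES = [
    (("who", "whom"),                    "HUMAN"),
    (("where",),                         "LOCATION"),
    (("when", "how many", "how much"),   "NUMERIC"),
    (("what is", "what does", "define"), "DESCRIPTION"),
]

def classify(question):
    q = question.lower()
    for triggers, label in RULES:
        if any(q.startswith(t) or (" " + t + " ") in q for t in triggers):
            return label
    return "ENTITY"  # fallback coarse class

print(classify("Who founded Virgin Airlines?"))          # -> HUMAN
print(classify("How many calories are in apple pie?"))   # -> NUMERIC
print(classify("Where is Apple Computer based?"))        # -> LOCATION
```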


Question Processing

They’re the two states you could be reentering if you’re crossing Florida’s northern border

• Answer Type: US state

• Query: two states, border, Florida, north

• Focus: the two states

• Relations: borders(Florida, ?x, north)


Answer Type Detection: Named Entities

• Who founded Virgin Airlines?

• PERSON

• What Canadian city has the largest population?

• CITY

Answer Type Taxonomy

• 6 coarse classes

• ABBREVIATION, ENTITY, DESCRIPTION, HUMAN, LOCATION, NUMERIC

• 50 finer classes

• LOCATION: city, country, mountain…

• HUMAN: group, individual, title, description

• ENTITY: animal, body, color, currency…


Xin Li, Dan Roth. 2002. Learning Question Classifiers. COLING'02


Part of Li & Roth’s Answer Type Taxonomy

• LOCATION: country, city, state

• NUMERIC: date, percent, money, size, distance

• ENTITY: animal, food, currency

• HUMAN: individual, title, group

• DESCRIPTION: definition, reason

• ABBREVIATION: abbreviation, expression


Answer Types

Answer types in Jeopardy

• 2500 answer types in a sample of 20,000 Jeopardy questions

• The most frequent 200 answer types cover < 50% of the data

• The 40 most frequent Jeopardy answer types

he, country, city, man, film, state, she, author, group, here, company, president, capital, star, novel, character, woman, river, island, king, song, part, series, sport, singer, actor, play, team, show, actress, animal, presidential, composer, musical, nation, book, title, leader, game


Ferrucci et al. 2010. Building Watson: An Overview of the DeepQA Project. AI Magazine, Fall 2010, 59-79.

Knowledge: not just semantics but pragmatics

Pragmatics = taking account of context in determining meaning

Search engines are great because they inherently take into account pragmatics (“associations and contexts”)

• [the national] The National (a band)

• [the national ohio] The National - Bloodbuzz Ohio – YouTube

• [the national broadband] www.broadband.gov

Task – Answer Sentence Selection

• Given a factoid question, find the sentence that

• Contains the answer

• Can sufficiently support the answer

Q: Who won the best actor Oscar in 1973?

S1: Jack Lemmon was awarded the Best Actor Oscar for Save the Tiger (1973).

S2: Academy award winner Kevin Spacey said that Jack Lemmon is remembered as always making time for others.

Scott Wen-tau Yih et al. (ACL 2013)

Assume that there is an underlying alignment that describes which words in the question and in the candidate sentence can be associated.

What is the fastest car in the world?

The Jaguar XJ220 is the dearest, fastest and most sought after car on the planet.
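In that spirit, a minimal alignment score links each question word to its best-matching sentence word and sums the match strengths. The toy synonym table below is an invented stand-in for the lexical-semantic resources (WordNet relations, embeddings) a real model would use.

```python
# Toy word-alignment score for answer sentence selection: each question word
# aligns to its best-matching sentence word (exact match or toy synonym).

SYNONYMS = {("world", "planet")}  # invented lexical-match table

def word_match(qw, sw):
    if qw == sw:
        return 1.0
    if (qw, sw) in SYNONYMS or (sw, qw) in SYNONYMS:
        return 0.8
    return 0.0

def alignment_score(question, sentence):
    q_words = question.lower().rstrip("?").split()
    s_words = sentence.lower().rstrip(".").split()
    return sum(max(word_match(qw, sw) for sw in s_words) for qw in q_words)

q = "What is the fastest car in the world?"
s1 = "The Jaguar XJ220 is the dearest, fastest and most sought after car on the planet."
s2 = "Jack Lemmon was awarded the Best Actor Oscar."
print(alignment_score(q, s1) > alignment_score(q, s2))  # -> True
```

The max over sentence words is the "alignment": each question word picks one partner, which is what lets "world" collect credit from "planet".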

Word Alignment for Question Answering

TREC QA (1999-2005)

See if the (syntactic/semantic) relations support the answer

[Harabagiu & Moldovan, 2001]

Full NLP QA: LCC (Harabagiu/Moldovan) [below is the architecture of LCC’s QA system circa 2003]

[Figure: LCC QA system architecture, circa 2003]

• Question Processing (factoid and list questions): question parse, semantic transformation, recognition of the expected answer type for NER, keyword extraction; uses an answer type hierarchy derived from WordNet

• Question Processing (definition questions): question parse, pattern matching, keyword extraction

• Passage Retrieval: document processing over an indexed document collection, producing single factoid passages, multiple list passages, and multiple definition passages

• Factoid Answer Processing: answer extraction via named entity recognition (CICERO LITE), answer justification (alignment, relations), answer reranking (~ theorem prover over an axiomatic knowledge base) → factoid answer

• List Answer Processing: answer extraction (NER), threshold cutoff → list answer

• Definition Answer Processing: answer extraction, pattern matching against a pattern repository → definition answer

Question Answering: IBM’s Watson

• Won Jeopardy on February 16, 2011!

WILLIAM WILKINSON’S “AN ACCOUNT OF THE PRINCIPALITIES OF WALLACHIA AND MOLDAVIA” INSPIRED THIS AUTHOR’S MOST FAMOUS NOVEL

Bram Stoker

IBM Watson: between Intelligence and Data

• IBM’s Watson

• http://www-03.ibm.com/innovation/us/watson/science-behind_watson.shtml


Jeopardy!

[Figure: Watson on stage answering the Wilkinson clue]

Semantic Inference in Watson QA

… Intelligence in Watson


Watson: a DeepQA architecture


Ready for Jeopardy!


References

• NLP, IR & ML:

• “Statistical Methods for Speech Recognition”, F. Jelinek, MIT Press, 1998.

• “Speech and Language Processing”, D. Jurafsky and J. H. Martin, Prentice-Hall, 2009.

• “Introduction to Information Retrieval”, Manning, Raghavan & Schutze, Cambridge University Press, 2008.

• Social Networks and Data Analytics

• Community Detection and Mining in Social Media, Lei Tang, Huan Liu, Morgan & Claypool Publishers, 2010.

• Analyzing the Social Web, Jennifer Golbeck, Elsevier, 2015.

• Web resources:

• SAG, Univ. Roma Tor Vergata: http://sag.art.uniroma2.it/