CS 6120/CS4120: Natural Language Processing · 2017-11-20

Post on 09-May-2020


CS 6120/CS4120: Natural Language Processing

Instructor: Prof. Lu Wang, College of Computer and Information Science, Northeastern University

Webpage: www.ccs.neu.edu/home/luwang

Question Answering

Question: What do worms eat?

Potential answers:
• Worms eat grass
• Grass is eaten by worms
• Birds eat worms
• Horses with worms eat grass

[Figure: dependency graphs for the question (worms, eat, what) and for each potential answer: (worms, eat, grass), (birds, eat, worms), (horses, eat, grass; with worms).]

One of the oldest NLP tasks (punched card systems in 1961). Simmons, Klein, McConlogue. 1964. Indexing and Dependency Logic for Answering English Questions. American Documentation 15:30, 196-204
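The worm example can be read as matching dependency structures. The sketch below is only an illustration under stated assumptions: each sentence is reduced by hand to (head, relation, dependent) triples, the relation names `subj`, `obj` and `with` are made up for this example, and the passive sentence is normalized by hand. It is not the 1964 system.

```python
# Toy dependency matching over hand-written (head, relation, dependent) triples.
WILDCARD = "?"

def matches(question, sentence):
    """Return the wildcard's binding if every question triple is found, else None."""
    answer = None
    for head, rel, dep in question:
        found = False
        for s_head, s_rel, s_dep in sentence:
            if s_head == head and s_rel == rel and (dep == WILDCARD or dep == s_dep):
                found = True
                if dep == WILDCARD:
                    answer = s_dep  # the word standing where "what" was
                break
        if not found:
            return None
    return answer

# "What do worms eat?"
question = [("eat", "subj", "worms"), ("eat", "obj", WILDCARD)]

candidates = {
    "Worms eat grass": [("eat", "subj", "worms"), ("eat", "obj", "grass")],
    # Passive voice normalized by hand to the same logical roles (an assumption).
    "Grass is eaten by worms": [("eat", "subj", "worms"), ("eat", "obj", "grass")],
    "Birds eat worms": [("eat", "subj", "birds"), ("eat", "obj", "worms")],
    "Horses with worms eat grass": [
        ("eat", "subj", "horses"), ("eat", "obj", "grass"), ("horses", "with", "worms"),
    ],
}

results = {text: matches(question, triples) for text, triples in candidates.items()}
```

Only the first two candidates bind the wildcard (to "grass"). The last two share words with the question but not its structure, which is why plain keyword overlap is not enough.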

Question Answering: IBM’s Watson

• Won Jeopardy! on February 16, 2011

WILLIAM WILKINSON’S “AN ACCOUNT OF THE PRINCIPALITIES OF WALLACHIA AND MOLDOVIA” INSPIRED THIS AUTHOR’S MOST FAMOUS NOVEL

Bram Stoker

Apple’s Siri

Wolfram Alpha

Types of Questions in Modern Systems

• Factoid questions
  • Who wrote “The Universal Declaration of Human Rights”?
  • How many calories are there in two slices of apple pie?
  • What is the average age of the onset of autism?
  • Where is Apple Computer based?

• Complex (narrative) questions:
  • In children with an acute febrile illness, what is the efficacy of acetaminophen in reducing fever?
  • What do scholars think about Jefferson’s position on dealing with pirates?

Commercial systems: mainly factoid questions

Where is the Louvre Museum located? In Paris, France

What’s the abbreviation for limited partnership? L.P.

What are the names of Odin’s ravens? Huginn and Muninn

What currency is used in China? The yuan

What kind of nuts are used in marzipan? almonds

What instrument does Max Roach play? drums

What is the telephone number for Stanford University? 650-723-2300

Paradigms for QA

• Information Retrieval (IR)-based approaches
  • TREC; IBM Watson; Google

• Knowledge-based and Hybrid approaches
  • IBM Watson; Apple Siri; Wolfram Alpha; True Knowledge Evi

Many questions can already be answered by web search

IR-based Question Answering

IR-based Factoid QA

[Figure: IR-based factoid QA pipeline. Question → Question Processing (Query Formulation, Answer Type Detection) → Passage Retrieval (Document Retrieval over indexed documents → relevant docs → Passage Retrieval → passages) → Answer Processing → Answer.]

IR-based Factoid QA

• QUESTION PROCESSING
  • Detect question type, answer type, focus, relations
  • Formulate queries to send to a search engine

• PASSAGE RETRIEVAL
  • Retrieve ranked documents
  • Break into suitable passages and rerank

• ANSWER PROCESSING
  • Extract candidate answers
  • Rank candidates
    • using evidence from the text and external sources

Knowledge-based approaches (Siri)

• Build a semantic representation of the query
  • Times, dates, locations, entities, numeric quantities

• Map from this semantics to query structured data or resources
  • Geospatial databases
  • Ontologies (Wikipedia infoboxes, DBpedia, WordNet, Yago)
  • Restaurant review sources and reservation services
  • Scientific databases

Hybrid approaches (IBM Watson)

• Build a shallow semantic representation of the query
• Generate answer candidates using IR methods
  • Augmented with ontologies and semi-structured data

• Score each candidate using richer knowledge sources
  • Geospatial databases
  • Temporal reasoning
  • Taxonomical classification

Answer Types and Query Formulation


Question Processing: Things to extract from the question

• Answer Type Detection
  • Decide the named entity type (person, place) of the answer

• Query Formulation
  • Choose query keywords for the IR system

• Question Type classification
  • Is this a definition question, a math question, a list question?

• Focus Detection
  • Find the question words that are replaced by the answer

• Relation Extraction
  • Find relations between entities in the question

Question Processing

Jeopardy!: They’re the two states you could be reentering if you’re crossing Florida’s northern border

• Answer Type: US state
• Query: two states, border, Florida, north
• Focus: the two states
• Relations: borders(Florida, ?x, north)

Answer Type Detection: Named Entities

• Who founded Virgin Airlines?
  • PERSON

• What Canadian city has the largest population?
  • CITY

Answer Type Taxonomy

• 6 coarse classes
  • ABBREVIATION, ENTITY, DESCRIPTION, HUMAN, LOCATION, NUMERIC

• 50 finer classes
  • LOCATION: city, country, mountain…
  • HUMAN: group, individual, title, description
  • ENTITY: animal, body, color, currency…

Xin Li, Dan Roth. 2002. Learning Question Classifiers. COLING'02

Part of Li & Roth’s Answer Type Taxonomy

• LOCATION: country, city, state
• NUMERIC: date, percent, money, size, distance
• HUMAN: individual, title, group
• ENTITY: food, currency, animal
• DESCRIPTION: definition, reason
• ABBREVIATION: expression, abbreviation

Answer Types

More Answer Types

Answer types in Jeopardy

• 2500 answer types in 20,000 Jeopardy question sample
• The most frequent 200 answer types cover <50% of data
• The 40 most frequent Jeopardy answer types:
  he, country, city, man, film, state, she, author, group, here, company, president, capital, star, novel, character, woman, river, island, king, song, part, series, sport, singer, actor, play, team, show, actress, animal, presidential, composer, musical, nation, book, title, leader, game

Ferrucci et al. 2010. Building Watson: An Overview of the DeepQA Project. AI Magazine. Fall 2010. 59-79.

Answer Type Detection

• Hand-written rules
• Machine Learning
• Hybrids

Answer Type Detection

• Regular expression-based rules can get some cases:
  • Who {is|was|are|were} PERSON
  • PERSON (YEAR – YEAR)

• Other rules use the question headword (the headword of the first noun phrase after the wh-word):
  • Which city in China has the largest number of foreign financial companies? (headword: city)
  • What is the state flower of California? (headword: flower)
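Hand-written rules of this kind can be sketched in a few lines. This is a minimal illustration, not the rule set of any real system: the `HEADWORD_TYPES` lexicon is hypothetical, and taking the first lexicon word after the wh-word is a crude stand-in for finding the true headword with a parser.

```python
import re

# Hypothetical mini-lexicon mapping question headwords to answer types.
HEADWORD_TYPES = {"city": "CITY", "country": "COUNTRY", "river": "RIVER"}

WHO_RULE = re.compile(r"^\s*who\b", re.IGNORECASE)
WH_RULE = re.compile(r"^\s*(which|what)\b(.*)$", re.IGNORECASE)

def detect_answer_type(question):
    """Return an answer type label, or "UNKNOWN" when no rule fires."""
    if WHO_RULE.search(question):
        return "PERSON"
    m = WH_RULE.search(question)
    if m:
        # Approximate the headword by the first lexicon word after the wh-word.
        for word in re.findall(r"[a-z]+", m.group(2).lower()):
            if word in HEADWORD_TYPES:
                return HEADWORD_TYPES[word]
    return "UNKNOWN"

examples = [
    "Who founded Virgin Airlines?",
    "What Canadian city has the largest population?",
    "Which city in China has the largest number of foreign financial companies?",
    "What is the state flower of California?",
]
labels = [detect_answer_type(q) for q in examples]
```

The last question falls through to "UNKNOWN" because "flower" is not in the toy lexicon. Gaps like this are one reason such rules usually end up as features of a learned classifier rather than the whole system.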

Answer Type Detection

• Most often, we treat the problem as machine learning classification
  • Define a taxonomy of question types
  • Annotate training data for each question type
  • Train classifiers for each question class using a rich set of features.
    • features include those hand-written rules!

Features for Answer Type Detection

• Question words and phrases
• Part-of-speech tags
• Parse features (headwords)
• Named Entities
• Semantically related words


Keyword Selection Algorithm

1. Select all non-stop words in quotations
2. Select all NNP words in recognized named entities
3. Select all complex nominals with their adjectival modifiers
4. Select all other complex nominals
5. Select all nouns with their adjectival modifiers
6. Select all other nouns
7. Select all verbs
8. Select all adverbs
9. Select the question focus word (skipped in all previous steps)
10. Select all other words

Dan Moldovan, Sanda Harabagiu, Marius Paşca, Rada Mihalcea, Richard Goodrum, Roxana Girju and Vasile Rus. 1999. Proceedings of TREC-8.

Choosing keywords from the query

Who coined the term “cyberspace” in his novel “Neuromancer”?

cyberspace/1 Neuromancer/1 term/4 novel/4 coined/7

Slide from Mihai Surdeanu
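The ten-step keyword selection list amounts to a priority ordering over word categories. The sketch below assumes the hard part is already done: each token arrives with a hand-assigned category. The category names are made up for this example, and "term" and "novel" are labeled as complex nominals only to follow the priorities shown on the slide. A real system would derive the categories from a POS tagger, a named-entity recognizer and a parser.

```python
# Priority of each (hypothetical) category, following the ten steps in order.
PRIORITY = {
    "quoted": 1, "named_entity": 2, "nominal_with_adj": 3, "nominal": 4,
    "noun_with_adj": 5, "noun": 6, "verb": 7, "adverb": 8, "focus": 9, "other": 10,
}

def select_keywords(labeled_tokens, max_priority=10):
    """Return (word, priority) pairs sorted by priority; ties keep input order."""
    ranked = [
        (word, PRIORITY[category])
        for word, category in labeled_tokens
        if PRIORITY[category] <= max_priority
    ]
    return sorted(ranked, key=lambda pair: pair[1])  # sorted() is stable

# Who coined the term "cyberspace" in his novel "Neuromancer"?
tokens = [
    ("coined", "verb"), ("term", "nominal"), ("cyberspace", "quoted"),
    ("novel", "nominal"), ("Neuromancer", "quoted"),
]
keywords = select_keywords(tokens)
strict = select_keywords(tokens, max_priority=4)  # drop the verb for a tighter query
```

Lowering `max_priority` gives a shorter, more precise query. Raising it adds lower-priority words when the first query returns too few documents.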

Passage Retrieval and Answer Extraction


Passage Retrieval

• Step 1: IR engine retrieves documents using query terms
• Step 2: Segment the documents into shorter units
  • something like paragraphs
• Step 3: Passage ranking
  • Use answer type to help rerank passages

Features for Passage Ranking

• Number of Named Entities of the right type in passage
• Number of query words in passage
• Number of question N-grams also in passage
• Proximity of query keywords to each other in passage
• Longest sequence of question words
• Rank of the document containing passage

Either in rule-based classifiers or with supervised machine learning
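A few of these features can be computed from plain token matching. The sketch below is one possible reading, with the assumptions named: "proximity" is measured as the token span from the first to the last keyword, and "longest sequence of question words" as the longest run of consecutive passage tokens that are all keywords. The feature names are made up for this example, and the named-entity and document-rank features are left out because they need an NER tagger and the retrieval engine.

```python
import re

def tokenize(text):
    """Lowercased alphanumeric tokens; punctuation is dropped."""
    return re.findall(r"[a-z0-9]+", text.lower())

def passage_features(question_keywords, passage):
    words = tokenize(passage)
    keywords = {k.lower() for k in question_keywords}
    positions = [i for i, w in enumerate(words) if w in keywords]

    # Longest run of consecutive passage tokens that are all query keywords.
    longest = run = 0
    for w in words:
        run = run + 1 if w in keywords else 0
        longest = max(longest, run)

    return {
        # Number of distinct query words found in the passage.
        "num_query_words": len({words[i] for i in positions}),
        # Proximity: span (in tokens) from the first to the last keyword hit.
        "keyword_span": positions[-1] - positions[0] + 1 if positions else 0,
        "longest_keyword_run": longest,
    }

feats = passage_features(
    ["height", "Mount", "Everest"],
    "The official height of Mount Everest is 29035 feet",
)
```

The resulting dictionary can feed hand-written ranking rules or a supervised ranker.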


Answer Extraction

• Run an answer-type named-entity tagger on the passages
  • Each answer type requires a named-entity tagger that detects it
  • If answer type is CITY, tagger has to tag CITY
  • Can be full NER, simple regular expressions, or hybrid

• Return the string with the right type:
  • Who is the prime minister of India? (PERSON)
    Manmohan Singh, Prime Minister of India, had told left leaders that the deal would not be renegotiated.
  • How tall is Mt. Everest? (LENGTH)
    The official height of Mount Everest is 29035 feet
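For answer types with a regular surface form, the "simple regular expressions" option might look like the sketch below. The `EXTRACTORS` table and its two patterns are illustrative assumptions covering only a few units and year formats. Types such as PERSON or CITY would need a real NER tagger.

```python
import re

# One hypothetical extractor per answer type; only regex-friendly types are covered.
EXTRACTORS = {
    "LENGTH": re.compile(r"\b\d[\d,.]*\s*(?:feet|foot|ft|meters|metres|m|km|miles)\b"),
    "YEAR": re.compile(r"\b(?:1\d{3}|20\d{2})\b"),
}

def extract_candidates(answer_type, passage):
    """Return every substring of the passage that matches the answer type."""
    pattern = EXTRACTORS.get(answer_type)
    return pattern.findall(passage) if pattern else []

everest = "The official height of Mount Everest is 29035 feet"
lengths = extract_candidates("LENGTH", everest)
years = extract_candidates("YEAR", "In 1594 he took a job as a tax collector in Andalusia")
```

An unsupported type returns an empty list, which signals that the pipeline should fall back to a tagger.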

Ranking Candidate Answers

• But what if there are multiple candidate answers!

Q: Who was Queen Victoria’s second son?
• Answer Type: Person
• Passage: The Marie biscuit is named after Marie Alexandrovna, the daughter of Czar Alexander II of Russia and wife of Alfred, the second son of Queen Victoria and Prince Albert


Use machine learning: Features for ranking candidate answers

• Answer type match: Candidate contains a phrase with the correct answer type.
• Pattern match: Regular expression pattern matches the candidate.
• Question keywords: # of question keywords in the candidate.
• Keyword distance: Distance in words between the candidate and query keywords
• Novelty factor: A word in the candidate is not in the query.
• Apposition features: The candidate is an appositive to question terms
• Punctuation location: The candidate is immediately followed by a comma, period, quotation marks, semicolon, or exclamation mark.
• Sequences of question terms: The length of the longest sequence of question terms that occurs in the candidate answer.
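Three of these features (novelty factor, keyword distance, punctuation location) are enough to separate the candidates in the Queen Victoria example. The sketch below makes several assumptions: the PERSON candidates and question keywords are listed by hand, every candidate occurs in the passage, and the weights in `score` are hand-set for illustration where a real ranker would learn them from data.

```python
import re

def tokenize(text):
    """Lowercased words and punctuation marks as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def find(tokens, phrase):
    """Index of the first occurrence of the token list `phrase`, or -1."""
    n = len(phrase)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == phrase:
            return i
    return -1

def features(candidate, keywords, passage_tokens):
    cand = tokenize(candidate)
    start = find(passage_tokens, cand)  # assumes the candidate is in the passage
    end = start + len(cand)             # index just past the candidate
    key_positions = [i for i, t in enumerate(passage_tokens) if t in keywords]
    return {
        # Novelty factor: some candidate word is not a question keyword.
        "novelty": int(any(t not in keywords for t in cand)),
        # Keyword distance: tokens from the candidate to the nearest keyword.
        "distance": min(min(abs(i - start), abs(i - (end - 1))) for i in key_positions),
        # Punctuation location: candidate is immediately followed by punctuation.
        "punct_after": int(end < len(passage_tokens)
                           and passage_tokens[end] in {",", ".", ";", "!", '"'}),
    }

def score(f):
    # Hand-set weights for illustration only.
    return 2.0 * f["novelty"] + 1.0 * f["punct_after"] - 0.1 * f["distance"]

passage = tokenize(
    "The Marie biscuit is named after Marie Alexandrovna, the daughter of Czar "
    "Alexander II of Russia and wife of Alfred, the second son of Queen Victoria "
    "and Prince Albert"
)
keywords = {"queen", "victoria", "second", "son"}
candidates = ["Marie Alexandrovna", "Czar Alexander II", "Alfred",
              "Queen Victoria", "Prince Albert"]
best = max(candidates, key=lambda c: score(features(c, keywords, passage)))
```

"Queen Victoria" is the candidate closest to the keywords but fails the novelty test. "Alfred" is novel, close to the keywords, and followed by a comma, so it ranks first under these weights.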

Candidate Answer scoring in IBM Watson

• Each candidate answer gets scores from >50 components
  • (from unstructured text, semi-structured text, triple stores)
  • logical form (parse) match between question and candidate
  • passage source reliability
  • geospatial location
    • California is “southwest of Montana”
  • temporal relationships
  • taxonomic classification

Common Evaluation Metrics

1. Accuracy (does answer match gold-labeled answer?)
2. Mean Reciprocal Rank
  • For each query return a ranked list of M candidate answers.
  • Query score is 1/Rank of the first correct answer
    • If first answer is correct: 1
    • else if second answer is correct: ½
    • else if third answer is correct: ⅓, etc.
    • Score is 0 if none of the M answers are correct
  • Take the mean over all N queries

$$\mathrm{MRR} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\mathrm{rank}_i}$$
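Mean Reciprocal Rank is short enough to compute directly. The sketch below assumes exact string match against a single gold answer per query. The function name and the toy data are illustrative.

```python
def mean_reciprocal_rank(ranked_lists, gold_answers):
    """Mean over queries of 1/rank of the first correct answer (0 if absent)."""
    total = 0.0
    for candidates, gold in zip(ranked_lists, gold_answers):
        for rank, candidate in enumerate(candidates, start=1):
            if candidate == gold:
                total += 1.0 / rank
                break  # only the first correct answer counts
    return total / len(ranked_lists)

ranked_lists = [
    ["Alfred", "Prince Albert"],   # correct at rank 1 -> 1
    ["yen", "yuan", "won"],        # correct at rank 2 -> 1/2
    ["violin", "piano"],           # gold answer missing -> 0
]
gold_answers = ["Alfred", "yuan", "drums"]
mrr = mean_reciprocal_rank(ranked_lists, gold_answers)  # (1 + 0.5 + 0) / 3
```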

Knowledge in QA

Relation Extraction

• Answers: Databases of Relations
  • born-in(“Emma Goldman”, “June 27 1869”)
  • author-of(“Cao Xue Qin”, “Dream of the Red Chamber”)
  • Draw from Wikipedia infoboxes, DBpedia, FreeBase, etc.

• Questions: Extracting Relations in Questions
  Whose granddaughter starred in E.T.?
  (acted-in ?x “E.T.”)
  (granddaughter-of ?x ?y)
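Once the question is reduced to the two relations above, answering it is a join over a relation database. The sketch below uses a three-fact toy database written by hand for this example. The `query` helper and its `None`-as-variable convention are assumptions, not the interface of any real triple store.

```python
# Toy relation database of (relation, subject, object) facts, written by hand.
FACTS = {
    ("acted-in", "Drew Barrymore", "E.T."),
    ("acted-in", "Henry Thomas", "E.T."),
    ("granddaughter-of", "Drew Barrymore", "John Barrymore"),
}

def query(relation, subject=None, obj=None):
    """Return facts matching the relation; None acts as an unbound variable."""
    return [
        (r, s, o) for (r, s, o) in sorted(FACTS)
        if r == relation and subject in (None, s) and obj in (None, o)
    ]

def whose_granddaughter_starred_in(film):
    # (acted-in ?x film) joined with (granddaughter-of ?x ?y); return every ?y.
    actors = {s for _, s, _ in query("acted-in", obj=film)}
    return sorted(o for _, s, o in query("granddaughter-of") if s in actors)

answers = whose_granddaughter_starred_in("E.T.")
```

The shared variable ?x is what links the two relations: only actors who also appear as a subject of granddaughter-of contribute an answer.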

Temporal Reasoning

• Relation databases
  • (and obituaries, biographical dictionaries, etc.)

• IBM Watson: “In 1594 he took a job as a tax collector in Andalusia”
  Candidates:
  • Thoreau is a bad answer (born in 1817)
  • Cervantes is possible (was alive in 1594)
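The temporal check in this example is a lifespan filter. The sketch below hard-codes the two candidates' birth and death years in a toy `LIFESPANS` table. A real system would look them up in a relation database and would more likely use the result as one score among many than as a hard filter.

```python
# (birth year, death year) for the two candidates; a toy lookup table.
LIFESPANS = {"Thoreau": (1817, 1862), "Cervantes": (1547, 1616)}

def alive_in(candidate, year):
    """True if the clue's year falls inside the candidate's lifespan."""
    born, died = LIFESPANS[candidate]
    return born <= year <= died

# Clue: "In 1594 he took a job as a tax collector in Andalusia"
plausible = [name for name in LIFESPANS if alive_in(name, 1594)]
```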

Geospatial knowledge (containment, directionality, borders)

• Beijing is a good answer for “Asian city”
• California is “southwest of Montana”
• geonames.org:

Context and Conversation in Virtual Assistants like Siri

• Coreference helps resolve ambiguities
  U: “Book a table at Il Fornaio at 7:00 with my mom”
  U: “Also send her an email reminder”

• Clarification questions:
  U: “Chicago pizza”
  S: “Did you mean pizza restaurants in Chicago or Chicago-style pizza?”