Page 1: Evaluating Answer Validation in multi-stream Question Answering

Evaluating Answer Validation in multi-stream Question Answering

Álvaro Rodrigo, Anselmo Peñas, Felisa Verdejo

UNED NLP & IR group

nlp.uned.es

The Second International Workshop on Evaluating Information Access (EVIA-NTCIR 2008)

Tokyo, 16 December 2008

Page 2: Evaluating Answer Validation in multi-stream Question Answering

Content

1. Context and motivation
• Question Answering at CLEF
• Answer Validation Exercise at CLEF

2. Evaluating the validation of answers

3. Evaluating the selection of answers
• Correct selection
• Correct rejection

4. Analysis and discussion

5. Conclusion

Page 3: Evaluating Answer Validation in multi-stream Question Answering

Evolution of the CLEF-QA Track, 2003–2009

• Target languages: 3 in 2003, growing to 7, 8, 9, 10 and 11 in the following campaigns.
• Collections: news 1994, then + news 1995 and + Wikipedia Nov. 2006; in 2009, JRC-Acquis (EU official documents).
• Questions: 200 factoid questions at the start; later campaigns added temporal restrictions, definitions, lists, linked questions and closed lists (and stopped providing the question type); by 2009 the types were factoid, definition, motive, purpose and procedure.
• Supporting information: from document, to snippet, to paragraph.
• Pilots and exercises: temporal restrictions and lists; AVE, Real Time and WiQA (2006); AVE and QAST (2007); AVE, QAST and WSD-QA (2008); GikiCLEF and QAST (2009).

Page 4: Evaluating Answer Validation in multi-stream Question Answering

Evolution of Results, 2003–2006 (Spanish)

• Overall: best result below 60%.
• Definitions: best result above 80%, and not with an IR approach.

Page 5: Evaluating Answer Validation in multi-stream Question Answering

Pipeline Upper Bounds

Use Answer Validation to break the pipeline.

Question → Question analysis → Passage retrieval → Answer extraction → Answer ranking → Answer

Module accuracies multiply along the pipeline: 1.0 × 0.8 × 0.8 = 0.64, so two modules at 80% already cap end-to-end accuracy at 64%. When there is not enough evidence for an answer, Answer Validation can break the pipeline instead of propagating the error.
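A minimal sketch of this upper-bound arithmetic (module names and accuracies are the slide's illustration, not measured values):

```python
from math import prod

# Per-module accuracies; a pure pipeline's end-to-end accuracy is bounded by
# their product, since each module consumes the previous module's output.
stage_accuracy = {
    "question analysis": 1.0,
    "passage retrieval": 0.8,
    "answer extraction and ranking": 0.8,
}

print(f"upper bound: {prod(stage_accuracy.values()):.2f}")  # upper bound: 0.64
```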

Page 6: Evaluating Answer Validation in multi-stream Question Answering

Results in CLEF-QA 2006 (Spanish)

• A perfect combination of the participating systems would answer 81% of the questions.
• The best single system reached 52.5%.
• Different systems were the best for ORGANIZATION, PERSON and TIME answer types.

Page 7: Evaluating Answer Validation in multi-stream Question Answering

Collaborative architectures

Different systems answer different types of questions better:
• Specialisation
• Collaboration

Question → QA sys 1, QA sys 2, QA sys 3, …, QA sys n → candidate answers → Answer Validation & Selection → Answer

This architecture is the evaluation framework considered here.

Page 8: Evaluating Answer Validation in multi-stream Question Answering

Collaborative architectures

How to select the right answer?
• Redundancy
• Voting
• Confidence score
• Performance history

Why not a deeper analysis?

Page 9: Evaluating Answer Validation in multi-stream Question Answering

Answer Validation Exercise (AVE)

Objective

Validate the correctness of the answers given by real QA systems: the participants at CLEF QA.

Page 10: Evaluating Answer Validation in multi-stream Question Answering

Answer Validation Exercise (AVE)

Question Answering provides a question, a candidate answer and its supporting text.

• AVE 2006: automatic hypothesis generation combines the question and the answer into a hypothesis, and validation is posed as Textual Entailment between the supporting text and that hypothesis.
• AVE 2007–2008: systems receive the question, the candidate answer and the supporting text, and perform Answer Validation directly.

The output is either "answer is correct" or "answer is not correct, or there is not enough evidence".
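As a toy illustration of the hypothesis-generation step (participant systems used their own, typically pattern-based, methods; this one-pattern Python sketch is purely hypothetical):

```python
import re

def make_hypothesis(question: str, answer: str) -> str:
    """Turn a (question, answer) pair into a declarative hypothesis
    that the supporting text should entail. One illustrative pattern."""
    m = re.match(r"What is (.+)\?", question.strip(), re.IGNORECASE)
    if m:
        return f"{m.group(1)} is {answer.strip()}."
    # Fallback when no pattern applies: keep the raw pair.
    return f"{question.strip()} / {answer.strip()}"

print(make_hypothesis("What is Zanussi?", "an Italian producer of home appliances"))
# -> "Zanussi is an Italian producer of home appliances."
```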

Page 11: Evaluating Answer Validation in multi-stream Question Answering

Techniques in AVE 2007

Overview of AVE 2007 (number of systems using each technique):

Generates hypotheses: 6
WordNet: 3
Chunking: 3
n-grams, longest common subsequences: 5
Phrase transformations: 2
NER: 5
Numerical expressions: 6
Temporal expressions: 4
Coreference resolution: 2
Dependency analysis: 3
Syntactic similarity: 4
Functions (subj, obj, etc.): 3
Syntactic transformations: 1
Word-sense disambiguation: 2
Semantic parsing: 4
Semantic role labeling: 2
First-order logic representation: 3
Theorem prover: 3
Semantic similarity: 2

Page 12: Evaluating Answer Validation in multi-stream Question Answering

Evaluation linked to the main QA task

The Question Answering Track supplies the questions, the systems' answers and the systems' supporting texts. AVE participants return a validation (YES, NO) for each answer. The human judgements from the QA Track (R, W, X, U) are mapped to (YES, NO) and used as the gold standard for the AVE Track results, so the evaluation reuses the human assessments of the QA Track.

Page 13: Evaluating Answer Validation in multi-stream Question Answering

Content

1. Context and motivation

2. Evaluating the validation of answers

3. Evaluating the selection of answers

4. Analysis and discussion

5. Conclusion

Page 14: Evaluating Answer Validation in multi-stream Question Answering

Question → QA sys 1, QA sys 2, QA sys 3, …, QA sys n (the participant systems at CLEF-QA) → candidate answers → Answer Validation & Selection → Answer

Proposed evaluation: evaluate the Answer Validation & Selection step of this architecture.

Page 15: Evaluating Answer Validation in multi-stream Question Answering

Collections

<q id="116" lang="EN">
  <q_str> What is Zanussi? </q_str>
  <a id="116_1" value="">
    <a_str> was an Italian producer of home appliances </a_str>
    <t_str doc="Zanussi">Zanussi For the Polish film director, see Krzysztof Zanussi. For the hot-air balloon, see Zanussi (balloon). Zanussi was an Italian producer of home appliances that in 1984 was bought</t_str>
  </a>
  <a id="116_2" value="">
    <a_str> who had also been in Cassibile since August 31 </a_str>
    <t_str doc="en/p29/2998260.xml">Only after the signing had taken place was Giuseppe Castellano informed of the additional clauses that had been presented by general Ronald Campbell to another Italian general, Zanussi, who had also been in Cassibile since August 31.</t_str>
  </a>
  <a id="116_4" value="">
    <a_str> 3 </a_str>
    <t_str doc="1618911.xml">(1985) 3 Out of 5 Live (1985) What Is This?</t_str>
  </a>
</q>
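A minimal sketch for reading this collection format with the Python standard library (the file name is hypothetical; tag and attribute names are taken from the sample above):

```python
import xml.etree.ElementTree as ET

root = ET.parse("ave_collection_en.xml").getroot()  # hypothetical file name
for q in root.iter("q"):
    question = q.findtext("q_str", default="").strip()
    for a in q.iter("a"):
        answer = a.findtext("a_str", default="").strip()
        support = a.findtext("t_str", default="").strip()
        # Each record pairs a candidate answer with its supporting text.
        print(f"{q.get('id')}/{a.get('id')}: {question!r} -> {answer!r}")
```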

Page 16: Evaluating Answer Validation in multi-stream Question Answering

Evaluating the Validation

Validation: decide whether each candidate answer is correct or not (YES | NO).

The collections are not balanced.

Approach: detect whether there is enough evidence to accept an answer.

Measures: precision, recall and F over correct answers.

Baseline system: accept all answers.

Page 17: Evaluating Answer Validation in multi-stream Question Answering

Evaluating the Validation

                   Correct Answer   Incorrect Answer
Answer Accepted    n_CA             n_WA
Answer Rejected    n_CR             n_WR

(n_CR counts correct answers that were rejected; n_WR, incorrect answers that were rejected.)

precision = n_CA / (n_CA + n_WA)

recall = n_CA / (n_CA + n_CR)

F = 2 · precision · recall / (precision + recall)
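A short sketch of these measures, assuming each system decision is paired with the gold judgement of the answer (variable names are illustrative, not from the AVE tools):

```python
def validation_measures(decisions):
    """decisions: iterable of (accepted, correct) booleans, one per answer."""
    n_ca = sum(acc and cor for acc, cor in decisions)       # correct answers accepted
    n_wa = sum(acc and not cor for acc, cor in decisions)   # incorrect answers accepted
    n_cr = sum(not acc and cor for acc, cor in decisions)   # correct answers rejected
    precision = n_ca / (n_ca + n_wa) if n_ca + n_wa else 0.0
    recall = n_ca / (n_ca + n_cr) if n_ca + n_cr else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

# The "accept all answers" baseline always has recall 1.0; its precision is
# the proportion of correct answers in the (unbalanced) collection.
gold = [True, False, False, False, True]
print(validation_measures([(True, g) for g in gold]))  # (0.4, 1.0, ~0.57)
```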

Page 18: Evaluating Answer Validation in multi-stream Question Answering

Evaluating the Selection

Quantify the potential gain of Answer Validation in Question Answering:
• compare AV systems with QA systems.

Develop measures more comparable to QA accuracy:

qa_accuracy = n_correctly_answered_questions / n_questions

Page 19: Evaluating Answer Validation in multi-stream Question Answering

Evaluating the Selection

Given a question with several candidate answers, there are two options:

Selection: select an answer ≡ try to answer the question.
• Correct selection: the selected answer was correct.
• Incorrect selection: the selected answer was incorrect.

Rejection: reject all candidate answers ≡ leave the question unanswered.
• Correct rejection: all candidate answers were incorrect.
• Incorrect rejection: not all candidate answers were incorrect.

Page 20: Evaluating Answer Validation in multi-stream Question Answering

Evaluating the Selection

n questions, with n = n_CA + n_WA + n_WS + n_WR + n_CR:

                                              Question with     Question without
                                              Correct Answer    Correct Answer
Question Answered Correctly
(One Answer Selected)                         n_CA              -
Question Answered Incorrectly                 n_WA              n_WS
Question Unanswered (All Answers Rejected)    n_WR              n_CR

qa_accuracy = n_CA / n

precision = n_CA / (n_CA + n_WA)

recall = n_CA / (n_CA + n_WA + n_WR)

% of best selection = recall · 100

precision and recall are not comparable to qa_accuracy.
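A sketch of how the five counts can be tallied from per-question outcomes (the encoding of outcomes is hypothetical):

```python
from collections import Counter

def tally(questions):
    """questions: iterable of (has_correct_candidate, action), where action is
    'selected_correct', 'selected_wrong' or 'rejected_all'."""
    c = Counter()
    for has_correct, action in questions:
        if action == "selected_correct":
            c["n_CA"] += 1                             # correct answer selected
        elif action == "selected_wrong":
            c["n_WA" if has_correct else "n_WS"] += 1  # wrong selection
        else:
            c["n_WR" if has_correct else "n_CR"] += 1  # all answers rejected
    return c
```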

Page 21: Evaluating Answer Validation in multi-stream Question Answering

Evaluating the Selection

(Counts as in the previous table: n = n_CA + n_WA + n_WS + n_WR + n_CR.)

qa_accuracy = n_CA / n

rej_accuracy = n_CR / n

Page 22: Evaluating Answer Validation in multi-stream Question Answering

Evaluating the Selection

qa_accuracy = n_CA / n

rej_accuracy = n_CR / n

accuracy = (n_CA + n_CR) / n

This rewards rejection (the collections are not balanced). Interpretation for QA: all questions correctly rejected by AV will be answered correctly.

Page 23: Evaluating Answer Validation in multi-stream Question Answering

Evaluating the Selection

qa_accuracy = n_CA / n

rej_accuracy = n_CR / n

estimated = n_CA/n + (n_CR/n) · (n_CA/n) = (n_CA/n) · (1 + n_CR/n)

Interpretation for QA: questions correctly rejected by AV will be answered correctly in qa_accuracy proportion.
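Putting the selection measures together in one sketch (counts as in the table above; the function name is illustrative):

```python
def selection_measures(n_ca, n_wa, n_ws, n_wr, n_cr):
    n = n_ca + n_wa + n_ws + n_wr + n_cr
    qa_accuracy = n_ca / n
    rej_accuracy = n_cr / n
    # Rewards rejection directly; inflated on unbalanced collections.
    accuracy = (n_ca + n_cr) / n
    # Correctly rejected questions are assumed to be re-answered
    # correctly in qa_accuracy proportion.
    estimated = qa_accuracy * (1 + rej_accuracy)
    return qa_accuracy, rej_accuracy, accuracy, estimated

print(selection_measures(40, 20, 15, 10, 15))  # n = 100
```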

Page 24: Evaluating Answer Validation in multi-stream Question Answering

Content

1. Context and motivation

2. Evaluating the validation of answers

3. Evaluating the selection of answers

4. Analysis and discussion

5. Conclusion

Page 25: Evaluating Answer Validation in multi-stream Question Answering

Analysis and discussion (AVE 2007 English)

[Results tables: Validation measures and Selection measures per system.]

qa_accuracy is correlated with recall (R); the "estimated" measure adjusts for this.

Page 26: Evaluating Answer Validation in multi-stream Question Answering

Multi-stream QA performance (AVE 2007 English)

Page 27: Evaluating Answer Validation in multi-stream Question Answering

Analysis and discussion (AVE 2007 Spanish)

[Results tables: Validation and Selection measures, comparing AV and QA systems.]

Page 28: Evaluating Answer Validation in multi-stream Question Answering

Conclusion

An evaluation framework for Answer Validation & Selection systems.

Measures reward not only correct selection but also correct rejection:
• this promotes the improvement of QA systems.

The measures allow comparison between AV and QA systems:
• under what conditions multi-stream QA performs better,
• how much room for improvement there is just from multi-stream QA,
• the potential gain that AV systems can provide to QA.

Page 29: Evaluating Answer Validation in multi-stream Question Answering

Thanks!

http://nlp.uned.es/clef-qa/ave

http://www.clef-campaign.org

Acknowledgement: EU project T-CLEF (ICT-1-4-1 215231)
