Overview of the KBP 2013 Slot Filler Validation Track Hoa Trang Dang National Institute of Standards...

Overview of the KBP 2013 Slot Filler Validation Track

Hoa Trang Dang
National Institute of Standards and Technology

Slot Filler Validation (SFV)

• Track Goals
  ▫ Allow teams without a full slot-filling system to participate, focus on answer validation rather than document retrieval
  ▫ Evaluate the contribution of RTE systems on KBP slot-filling
  ▫ Allow teams to experiment with system voting and global
• SFV input:
  ▫ Candidate slot filler
  ▫ Possibly additional information about candidate slot fillers
• SFV output:
  ▫ Binary classification (Correct / Incorrect) of each candidate slot filler
• Can only improve precision, not recall of full slot-filling systems
• Evaluation metrics depend on SFV use case and availability of additional information about candidate fillers
  ▫ TAC RTE KBP Validation task (2011)
  ▫ TAC KBP Slot Filler Validation task (2012)

TAC RTE KBP Validation task (2011)

Each slot filler returned by SF systems becomes 1 RTE evaluation pair, where:
• T is the entire document supporting the slot filler
• H is a set of synonymous sentences, representing different realizations of the slot filler

Use Case 1: SFV as Textual Entailment (2011)

• SFV input:
  ▫ All regular English slot filling input (slot definitions, queries, source documents)
  ▫ Individual candidate slot fillers (filler, provenance)
• Local Approach:
  ▫ Generic textual entailment: H is the relation implied by the candidate slot filler (e.g., “Barack Obama has lived in Chicago”), T is the provenance (entire document, or smaller regions defined by justification offsets)
  ▫ Tailored textual entailment: train on different slot types; could be a validation module for a full slot filling system.
• Evaluation:
  ▫ F score on entire pool of candidate slot fillers (unique slot filler, provenance)
  ▫ Baseline: All T’s classified as entailing the corresponding H: P=R=percentage of entailing pairs in the pooled SF responses
  ▫ Weak baseline, easily beaten by all SFV systems; not a direct measure of utility of SFV to SF
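The generic textual entailment approach above can be illustrated with a very small lexical-overlap validator. This is only a sketch under stated assumptions: the tokenizer, the overlap measure, the 0.8 threshold, and the function names are all hypothetical and are not taken from any participating system.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens; a deliberately crude tokenizer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(hypothesis: str, provenance: str) -> float:
    """Fraction of hypothesis (H) tokens that also occur in the provenance (T)."""
    h = tokens(hypothesis)
    if not h:
        return 0.0
    return len(h & tokens(provenance)) / len(h)


def validate(hypothesis: str, provenance: str, threshold: float = 0.8) -> str:
    """Binary SFV decision; the threshold is an assumed, untuned value."""
    score = overlap_score(hypothesis, provenance)
    return "Correct" if score >= threshold else "Incorrect"


# H is the relation implied by the candidate filler; T is its provenance text.
H = "Barack Obama has lived in Chicago"
T = "Before moving to Washington, Barack Obama lived in Chicago for many years."

decision = validate(H, T)
```

A real system would normally restrict T to the region given by the justification offsets and use far richer features than token overlap.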

Use Case 2: SFV impact on single SF systems

• SFV input:
  ▫ All regular English slot filling input (slot definitions, queries, source documents)
  ▫ Individual candidate slot fillers (filler, provenance, confidence), broken out into individual slot filling runs
• Global Approach:
  ▫ System Voting, leveraging features across multiple SF runs
• Evaluation:
  ▫ Filter out “Incorrect” slot fillers from each run, and score according to regular English SF; compare to score for original run
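The filter-and-rescore evaluation above can be sketched as follows. The scoring here is a simplification (precision over returned fillers, recall over a known number of correct fillers); the regular English SF scorer handles more details that this sketch ignores. The function names and the tuple layout `(query_id, slot, filler)` are assumptions made for illustration.

```python
def score_run(responses, gold_correct, total_gold):
    """Simplified P/R/F1 for one SF run.

    responses: list of (query_id, slot, filler) tuples returned by the run
    gold_correct: set of tuples assessed as correct
    total_gold: number of correct fillers in the reference
    """
    returned = len(responses)
    correct = sum(1 for r in responses if r in gold_correct)
    p = correct / returned if returned else 0.0
    r = correct / total_gold if total_gold else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1


def apply_sfv(responses, sfv_labels):
    """Drop responses the validator labelled Incorrect; keep unlabelled ones."""
    return [r for r in responses if sfv_labels.get(r, "Correct") == "Correct"]


# Toy run: four responses, two of them correct, four correct fillers overall.
run = [
    ("q1", "per:cities_of_residence", "Chicago"),
    ("q1", "per:cities_of_residence", "Paris"),
    ("q2", "org:founded_by", "Jane Doe"),
    ("q2", "org:founded_by", "John Roe"),
]
gold = {run[0], run[2]}
sfv_labels = {run[1]: "Incorrect"}  # the validator rejects one wrong filler

p_before, r_before, f1_before = score_run(run, gold, total_gold=4)
p_after, r_after, f1_after = score_run(apply_sfv(run, sfv_labels), gold, total_gold=4)
```

In this toy case recall is unchanged while precision and F1 rise, which matches the point that filtering can only improve precision, not recall.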

Slot Filler Validation (SFV) 2012

• SFV input:
  ▫ All regular English slot filling input (slot definitions, queries, source documents)
  ▫ Individual candidate slot fillers (filler, provenance, confidence), broken out into individual slot filling runs
  ▫ System profile for each SF run
  ▫ Preliminary assessment of 10% of KBP 2013 Slot Filling queries
• SFV output:
  ▫ Binary classification (Correct / Incorrect) of each candidate slot filler
• Evaluation:
  ▫ Filter out “Incorrect” slot fillers from each run, and score according to regular English SF; compare to score for original run





• One SFV submission; it decreased F1 of almost all SF runs except the poorest-performing SF runs.








• Score only on the 90% of KBP 2013 slot filling queries that didn’t have preliminary assessments released as part of SFV input

SF System Profile

• SF Team ranks in KBP 2009-2012
• Did the system extract fillers from the KBP 2013 source corpus?
• Do the Confidence Values have meaning?
• Is the Confidence Value a probability?
• Tools or methods for:
  ▫ Query expansion
  ▫ Document retrieval
  ▫ Sentence retrieval
  ▫ NER nominal tagging
  ▫ Coreference resolution
  ▫ Third-party relation/event extraction
  ▫ Dependency/Constituent parsing
  ▫ POS tagging
  ▫ Chunking
  ▫ Main slot filling algorithm
  ▫ Learning algorithm
  ▫ Ensemble model
  ▫ External resources

Slot Filler Validation Teams and Approaches

• BIT: Beijing Institute of Technology [local]
  ▫ Generic RTE approach based on word overlap, cosine similarity, and token edit distance
• Stanford: Stanford University [local]
  ▫ Based on Stanford’s full slot-filling system, especially component for checking consistency and validity of candidate fillers
• UI_CCG: University of Illinois at Urbana-Champaign [local]
  ▫ Tailored RTE approach; check candidate for slot-specific constraints
• jhuapl: Johns Hopkins University Applied Physics Laboratory [weak global]
  ▫ Consider only the confidence value associated with each candidate filler and aggregate confidence values across systems.
• RPI_BLENDER: Rensselaer Polytechnic Institute [strong global]
  ▫ Based on RPI_BLENDER full slot-filling system (like Stanford), but also leveraged full set of SFV input (including SF system profile and preliminary assessments) to rank systems and apply tier-specific filtering.
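As a rough illustration of the weak global idea (using nothing but confidence values aggregated across systems), the sketch below sums the confidences that different runs assign to the same candidate filler and thresholds the total. The summation rule, the threshold of 1.0, the run names, and the key layout are all assumptions; the slides do not say how jhuapl actually aggregated confidences.

```python
from collections import defaultdict


def aggregate_confidence(runs):
    """Sum, per candidate filler, the confidence values reported by all runs.

    runs: {run_name: {(query_id, slot, filler): confidence}}
    """
    totals = defaultdict(float)
    for candidates in runs.values():
        for key, confidence in candidates.items():
            totals[key] += confidence
    return dict(totals)


def classify(runs, threshold=1.0):
    """Label a candidate Correct when its summed confidence reaches the threshold."""
    totals = aggregate_confidence(runs)
    return {
        key: ("Correct" if total >= threshold else "Incorrect")
        for key, total in totals.items()
    }


chicago = ("q1", "per:cities_of_residence", "Chicago")
paris = ("q1", "per:cities_of_residence", "Paris")

# Hypothetical runs: two systems agree on Chicago, only one proposes Paris.
runs = {
    "runA": {chicago: 0.9, paris: 0.3},
    "runB": {chicago: 0.7},
}

labels = classify(runs)
```

Summing treats every system's confidence as equally trustworthy, which is exactly what a stronger global approach avoids by ranking systems first.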

Impact of RPI_BLENDER2 SFV on SF Runs

SF Run          F1 of original SF run    Change in F1 after applying SFV filter

Top 10 SF runs
lsv1            0.371212                  0.012212
lsv5            0.368462                  0.025411
lsv3            0.367438                  0.029463
ARPANI1         0.364683                 -0.01695
lsv4            0.363441                  0.041238
RPI_BLENDER3    0.336694                  0.025749
RPI_BLENDER1    0.333909                  0.027718
lsv2            0.333333                  0.008259
RPI_BLENDER5    0.332866                  0.017108
PRIS20133       0.327384                  0.021544

Negatively impacted SF runs
NYU1            0.253842                 -0.00105
UWashington1    0.184026                 -0.011544
UWashington2    0.156271                 -0.004999
UWashington3    0.140677                 -0.013133
SAFT_KRes3      0.134615                 -0.004458
CMUML3          0.098274                 -0.002241
TALP_UPC3       0.036237                 -0.007019
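The table can be summarized with a few lines of code. The sketch assumes the third column is the change in F1 rather than an absolute score (several values are negative, which an F1 score cannot be), so the filtered F1 is the original F1 plus that change. Variable names are illustrative.

```python
# (SF run, F1 of original run, change in F1 after the RPI_BLENDER2 SFV filter)
rows = [
    ("lsv1", 0.371212, 0.012212),
    ("lsv5", 0.368462, 0.025411),
    ("lsv3", 0.367438, 0.029463),
    ("ARPANI1", 0.364683, -0.01695),
    ("lsv4", 0.363441, 0.041238),
    ("RPI_BLENDER3", 0.336694, 0.025749),
    ("RPI_BLENDER1", 0.333909, 0.027718),
    ("lsv2", 0.333333, 0.008259),
    ("RPI_BLENDER5", 0.332866, 0.017108),
    ("PRIS20133", 0.327384, 0.021544),
    ("NYU1", 0.253842, -0.00105),
    ("UWashington1", 0.184026, -0.011544),
    ("UWashington2", 0.156271, -0.004999),
    ("UWashington3", 0.140677, -0.013133),
    ("SAFT_KRes3", 0.134615, -0.004458),
    ("CMUML3", 0.098274, -0.002241),
    ("TALP_UPC3", 0.036237, -0.007019),
]

# Filtered F1 under the assumption stated above: original F1 + change.
filtered_f1 = {run: round(f1 + delta, 6) for run, f1, delta in rows}

improved = [run for run, _, delta in rows if delta > 0]
hurt = [run for run, _, delta in rows if delta < 0]
best_run = max(rows, key=lambda row: row[2])[0]
```

Under that reading, 9 of the 17 listed runs improve and 8 get worse, with lsv4 gaining the most.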

Conclusion

• Leveraging global features boosts scores of individual SF runs... if done discriminately
  ▫ Don’t treat all slot filling systems the same
• Even weak global features (e.g. raw confidence values) may help in some cases
• Caveat: other evaluation metrics also valid depending on use case.
  ▫ RTE KBP validation (2011) metric may be appropriate if goal is to make assessment more efficient

