Automating Slot Filling Validation to Assist Human Assessment
Suzanne Tamang and Heng Ji
Computer Science Department and Linguistics Department, Queens College and the Graduate Center, City University of New York
November 5, 2012
Overview
- KBP SF validation task
- Two-step validation:
  - Logistic regression based reranking
  - Predicted confidence adjustment and filtering
- Validation features: shallow, contextual, emergent (voting)
- System combination: perfect setting and limiting conditions
- Evaluation results
- Opportunities
SF Validation Task

Standard answer format:
  id, slot, run, docid, filler, start and end offsets for filler, start and end offsets for justification, confidence

Example:
  Richmond Flowers, per:title, SFV_10_1, APW_ENG_20070810.1457.LDC2009T13, Attorney General, 336, 351, 321, 44, 1.0
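A minimal sketch of parsing this answer format into a structured record. The field order follows the slide's listing; the class and field names are mine, and the comma-splitting rule is an assumption about the file format:

```python
from dataclasses import dataclass

@dataclass
class SFAnswer:
    """One slot-filling answer row in the KBP SF validation format."""
    query_id: str
    slot: str
    run: str
    docid: str
    filler: str
    filler_start: int
    filler_end: int
    just_start: int
    just_end: int
    confidence: float

def parse_answer(line: str) -> SFAnswer:
    # Split a comma-separated answer line into its ten fields.
    f = [p.strip() for p in line.split(",")]
    return SFAnswer(f[0], f[1], f[2], f[3], f[4],
                    int(f[5]), int(f[6]), int(f[7]), int(f[8]), float(f[9]))

example = ("Richmond Flowers, per:title, SFV_10_1, "
           "APW_ENG_20070810.1457.LDC2009T13, Attorney General, "
           "336, 351, 321, 44, 1.0")
ans = parse_answer(example)
```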
Validation goal: use post-processing methods to label each answer as correct (1) or incorrect (-1).

Step one:
- Combine runs, and rerank using a probabilistic classifier
- Identify a threshold for filtering the best candidates

Step two:
- Automatically assess system quality
- When available, use deeper contextual information
- Adjust confidence values to dampen noisy systems' contributions
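The step-one reranking can be sketched as follows. The logistic weights, feature names, and threshold here are illustrative assumptions, not the fitted model from the paper:

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def rerank(candidates, weights, bias):
    """Score each pooled answer with a logistic model over its features
    and sort by predicted confidence, descending."""
    scored = [(sigmoid(bias + sum(weights.get(k, 0.0) * v
                                  for k, v in feats.items())), answer)
              for feats, answer in candidates]
    return sorted(scored, reverse=True)

def filter_best(scored, threshold):
    """Keep only candidates whose predicted confidence clears the tuned threshold."""
    return [(conf, answer) for conf, answer in scored if conf >= threshold]

# Illustrative pooled candidates with a single (made-up) voting feature.
weights = {"system_votes": 3.0}
pool = [({"system_votes": 0.9}, "Attorney General"),
        ({"system_votes": 0.1}, "Governor")]
scored = rerank(pool, weights, bias=-1.0)
best = filter_best(scored, threshold=0.5)
```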
Features

Feature | Description | Value | Type
document type | provided by document collection (newswire, broadcast news, web log) | category | shallow
*number of tokens | count of white spaces (+1) between contiguous character strings | integer | shallow
*acronym | identify and concatenate first letter of each token | binary | shallow
*url | structural rules to determine if a valid URL | binary | shallow
named entity type | label with gazetteer | category | shallow
city, *state, *country, *title, ethnicity, religion | appears in specific slot-related gazetteer | binary | shallow
*alphanumeric | indicates if numbers and letters appear | binary | shallow
date | structural rules to determine if an acceptable date format | binary | shallow
capitalized | first character of token(s) capitalized | binary | shallow
same | query and fill strings match | binary | shallow
keywords | used primarily for spouse and residence slots | binary | context
dependency parse | length from query to answer | integer | context
**system votes | proportion of systems with answer agreement | 0-1 | emergent
**answer votes | proportion of answers with answer agreement | 0-1 | emergent

* statistically significant predictor in select models
** statistically significant predictor in most models
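A rough sketch of how a few of these features might be computed. The paper's exact extraction rules are not given, so these implementations are approximations:

```python
from collections import Counter

def shallow_features(filler: str, query: str) -> dict:
    """Approximate a few shallow features from the table above."""
    tokens = filler.split()
    return {
        # count of whitespace gaps + 1
        "num_tokens": len(tokens),
        # both digits and letters appear in the filler string
        "alphanumeric": int(any(c.isdigit() for c in filler)
                            and any(c.isalpha() for c in filler)),
        # first character of every token is uppercase
        "capitalized": int(all(t[:1].isupper() for t in tokens)),
        # query and fill strings match (case-insensitive here)
        "same": int(filler.lower() == query.lower()),
    }

def answer_votes(fillers):
    """Emergent 'answer votes' sketch: for each answer string, the
    proportion of pooled answers that agree with it (case-insensitive)."""
    counts = Counter(f.lower() for f in fillers)
    return {f: counts[f.lower()] / len(fillers) for f in fillers}
```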
Two-Phase Validation Approach

Step 1: Classification
- Training with 2011 KBP SF data, using features extracted from the 2011 KBP results
- Model selection using a stepwise procedure and AIC
- Threshold tuning on predicted confidence estimates

Step 2: Adjustment and filtering
- Automatic assessment of system quality
- Adjustment of predicted confidence using quality/DP
- Contextual analysis with answer provenance offsets

Features at the answer, system, and group level: shallow, contextual, emergent
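One plausible reading of the step-2 confidence adjustment, assuming a simple multiplicative dampening by estimated run quality; the slides do not specify the functional form, so both the rule and the default for unseen runs are assumptions:

```python
def adjust_confidence(answers, run_quality, default_quality=0.5):
    """Dampen each run's predicted confidence by an automatic estimate
    of that run's quality, so noisy systems contribute less."""
    return [(run, answer, conf * run_quality.get(run, default_quality))
            for run, answer, conf in answers]

# Hypothetical run IDs and quality estimates for illustration.
quality = {"SFV_10_1": 0.9, "SFV_07_2": 0.2}
adjusted = adjust_confidence([("SFV_10_1", "Attorney General", 0.8),
                              ("SFV_07_2", "Governor", 0.8)], quality)
```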
Attribute Distribution in Automatic Slot Filling
PER Attribute Distribution
ORG Attribute Distribution
SF Performance: Training and Testing
Performance, Mean Confidence & Set Size
27 distinct runs; variable F1, size, confidence, and offset use.
Results: Slot Filling Validation
Pre-/Post-Validation Results:

                 R     P     F1
LDC             0.72  0.77  0.75
w/o validation  0.71  0.03  0.06
validation P1   0.12  0.07  0.09
validation P2   0.35  0.08  0.13
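The F1 column follows from precision and recall as their harmonic mean; a quick check of the validation rows (the LDC row's 0.75 appears to come from unrounded precision/recall):

```python
def f1(p: float, r: float) -> float:
    # Harmonic mean of precision and recall.
    return 2 * p * r / (p + r)

without = round(f1(0.03, 0.71), 2)  # w/o validation
phase1 = round(f1(0.07, 0.12), 2)   # validation P1
phase2 = round(f1(0.08, 0.35), 2)   # validation P2
```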
Reranking Multiple Systems

Ideal case:
- Diversity of systems
- Comparable performance
- Rich information: reliable answer context, system approach / intermediate system results

KBP SF task:
- Twenty-seven runs, limited intermediate results, unknown strategies, and variable performance
- Inconsistencies paired with a 'rigid' framework
- Provenance: unavailable or unreliable (off a little and a lot)
- Confidence may or may not be available

What have we learned that translates to more efficient assessment?
- Confidence, provenance, approximating system quality, and flexibility
Challenges and Solutions

Labor intensive:
- Training and quality control; tedious and unfulfilling
- 22% of total answers were redundant
- 1% gain in recall over systems

Validation:
- Inconsistencies in reporting (provenance / confidence)
- Lack of intermediate output

Confidence:
- Uniform weighting
- Automatic assessment of quality: inconsistency, confidence distributions
             R     P     F1    TP
LDC         0.72  0.77  0.75  1119
Systems     0.71  0.03  0.06  1081
Answer Key   ?     1     ?    1543
Naïve Estimation of System Quality
Confidence of High and Low Performers
Shallow/emergent features reduce noise at the expense of better systems
Confidence-based Reranking
- Confidence is an important factor to a validator
- Informative at the >0.90 threshold
- Paired with quality estimates, culls more valid answers
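A hedged sketch of the culling rule this slide suggests, pairing a high confidence threshold with a run-quality estimate; the thresholds and the exact rule are illustrative assumptions:

```python
def cull_valid_answers(answers, run_quality,
                       conf_threshold=0.9, quality_floor=0.5):
    """Keep answers whose reported confidence clears a high threshold
    and whose producing run's estimated quality clears a floor."""
    return [(run, answer, conf) for run, answer, conf in answers
            if conf > conf_threshold
            and run_quality.get(run, 0.0) >= quality_floor]

# Hypothetical runs, answers, and quality estimates.
pool = [("r1", "A", 0.95), ("r2", "B", 0.95), ("r1", "C", 0.40)]
quality = {"r1": 0.8, "r2": 0.3}
kept = cull_valid_answers(pool, quality)
```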
Summary

Evaluation of a two-phase SF validation approach for KBP 2012:
- Improves overall F1: 0.06 before, 0.13 after
- Helps low performers at the expense of better systems

Key observations:
- Shallow features contribute to establishing a baseline
- Voting features did not generalize and were susceptible to system noise
- Contextual features are helpful (P1 to P2 gains)

Opportunities:
- Incorporating confidence as a classifier feature or for filtering
- More flexible frameworks for using provenance information
- Improved methods for naively estimating low and high performers in the multi-system setting
Thank you