The PYTHY Summarization System: Microsoft Research at DUC 2007
Kristina Toutanova, Chris Brockett, Michael Gamon, Jagadeesh Jagarlamudi,
Hisami Suzuki, and Lucy Vanderwende
Microsoft Research
April 26, 2007
DUC Main Task Results
• Automatic Evaluations (30 participants)

  Criterion    Rank   Score
  ROUGE-2      2      0.12028
  ROUGE-SU4    3      0.17074

• Human Evaluations

  Criterion    Rank
  Pyramid      1 (tied)
  Content      5 (tied)

• Strong results on both automatic and human measures
Overview of PYTHY
• Linear sentence ranking model
• Learns to rank sentences based on:
  • ROUGE scores against model summaries
  • Semantic Content Unit (SCU) weights of sentences selected by past peers
• Considers simplified sentences alongside original sentences
$$\text{Score}(s) = \sum_{k=1}^{K} w_k f_k(s)$$
[Figure: PYTHY Training pipeline — Docs yield Sentences and Simplified Sentences, which feed the Feature inventory; Targets (ROUGE Oracle, Pyramid/SCU, ROUGE × 2) drive Ranking/Training to produce the Model]

[Figure: PYTHY Testing pipeline — Docs yield Sentences and Simplified Sentences, which feed the Feature inventory; the Model is applied via Search with Dynamic Scoring to produce the Summary]
Sentence Simplification
• Extension of the simplification method from DUC06
  • Provides sentence alternatives rather than deterministically simplifying a sentence
  • Uses syntax-based heuristic rules
  • Simplified sentences are evaluated alongside the originals
• In DUC 2007:
  • Average new candidates generated: 1.38 per sentence
  • Simplified sentences generated for 61% of all sentences
  • Simplified sentences in final output: 60%
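As a rough illustration of alternative-generating simplification, the toy rules below each emit a shortened candidate that is kept alongside the original; PYTHY's actual rules are heuristics over syntactic parses, not regexes:

```python
import re

# Toy sketch: each rule may emit a shortened candidate; all candidates are
# kept alongside the original sentence rather than replacing it.

def simplify_candidates(sentence):
    candidates = []
    # Rule 1: drop a leading discourse connective.
    m = re.match(r"^(However|Moreover|Meanwhile),\s+(\w)(.*)$", sentence)
    if m:
        candidates.append(m.group(2).upper() + m.group(3))
    # Rule 2: drop a medial appositive introduced by "a"/"an".
    shorter = re.sub(r",\s(?:a|an)\s[^,]+,", "", sentence, count=1)
    if shorter != sentence:
        candidates.append(shorter)
    return candidates

sent = "However, the senator, a vocal critic of the plan, voted against it."
for c in simplify_candidates(sent):
    print(c)
# The senator, a vocal critic of the plan, voted against it.
# However, the senator voted against it.
```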
Sentence-Level Features
• SumFocus features: SumBasic (Nenkova et al. 2006) + task focus
  • cluster frequency and topic frequency
  • the only features used in MSR's DUC06 system
• Other content-word unigrams: headline frequency
• Sentence length features (binary)
• Sentence position features (real-valued and binary)
• N-grams (bigrams, skip bigrams, multiword phrases)
• All tokens (topic and cluster frequency)
• Simplified sentences (binary indicator and relative-length ratio)
• Inverse document frequency (idf)
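A sketch of how a few of these features might be computed; the definitions, word lists, and binning here are illustrative, not PYTHY's exact ones:

```python
from collections import Counter

# Illustrative versions of some listed features: cluster frequency,
# headline overlap, a binary length feature, and a positional feature.

def sentence_features(tokens, position, n_sents, cluster_counts, headline_words):
    total = sum(cluster_counts.values())
    return {
        # SumFocus-style content score: average cluster frequency of the words.
        "avg_cluster_freq": sum(cluster_counts[w] for w in tokens) / (len(tokens) * total),
        # Headline feature: fraction of tokens appearing in a headline.
        "headline_frac": sum(w in headline_words for w in tokens) / len(tokens),
        # Binary length feature, e.g. "longer than 10 tokens".
        "len_gt_10": float(len(tokens) > 10),
        # Real-valued position (0.0 = first sentence of its document).
        "rel_position": position / max(n_sents - 1, 1),
    }

cluster_counts = Counter("the plan was rejected the plan failed".split())
print(sentence_features("the plan failed".split(), 0, 5, cluster_counts, {"plan"}))
```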
Pairwise Ranking
• Preferences are defined over sentence pairs
  • derived from human summaries and SCU weights
• A log-linear ranking objective is used in training
• Maximizes the probability of choosing the better sentence from each pair of comparable sentences
[Dekel et al. 03], [Burges et al. 05]
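A minimal sketch of such a pairwise log-linear objective, trained here with plain SGD; the learning rate and feature values are illustrative:

```python
import math

# For each preference pair (better, worse), maximize
# P(better > worse) = sigmoid(w . (f_better - f_worse)).

def sgd_rank_step(w, f_better, f_worse, lr=0.1):
    diff = [b - c for b, c in zip(f_better, f_worse)]
    p = 1.0 / (1.0 + math.exp(-sum(wi * di for wi, di in zip(w, diff))))
    # Gradient of log P(better > worse) w.r.t. w is (1 - p) * diff.
    return [wi + lr * (1.0 - p) * di for wi, di in zip(w, diff)]

w = [0.0, 0.0]
pairs = [([0.9, 0.2], [0.1, 0.4]), ([0.8, 0.1], [0.3, 0.3])]
for _ in range(100):
    for f_b, f_w in pairs:
        w = sgd_rank_step(w, f_b, f_w)
print(w)  # the first feature, which separates every pair, gets positive weight
```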
ROUGE Oracle Metric
• Find an oracle extractive summary
  • the summary with the highest average of ROUGE-2 and ROUGE-SU4 scores
• All sentences in the oracle are considered "better" than any sentence not in the oracle
• An approximate greedy search is used to find the oracle summary
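A sketch of the greedy construction under a word budget; `rouge_avg` is a stand-in scorer (a real implementation would compute ROUGE against the human model summaries):

```python
# Greedily add the sentence that most improves the average ROUGE score of
# the partial summary; stop when no sentence helps or fits the budget.

def greedy_oracle(sentences, rouge_avg, max_words=250):
    summary, pool = [], list(sentences)
    while pool:
        base, best, best_gain = rouge_avg(summary), None, 0.0
        for s in pool:
            if sum(len(x.split()) for x in summary) + len(s.split()) > max_words:
                continue
            gain = rouge_avg(summary + [s]) - base
            if gain > best_gain:
                best, best_gain = s, gain
        if best is None:
            break
        summary.append(best)
        pool.remove(best)
    return summary

# Toy scorer rewarding coverage of reference words.
ref = set("the dam collapsed flooding the valley".split())
toy = lambda summ: len(ref & set(" ".join(summ).split())) / len(ref)
print(greedy_oracle(["the dam collapsed", "a cat slept", "flooding hit the valley"], toy))
```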
Pyramid-Derived Metric
• University of Ottawa SCU-annotated corpus (Copeck et al. 06)
• Some sentences in the 05 & 06 document collections are:
  • known to contain certain SCUs
  • known not to contain any SCUs
• A sentence's score is the sum of the weights of all its SCUs
  • for unannotated sentences, the score is undefined
• A training pair with $s_1 > s_2$ is constructed iff $w(s_1) > w(s_2)$
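A small sketch of deriving preference pairs from the SCU annotations under these definitions; the identifiers are illustrative:

```python
# A sentence's score is the sum of the weights of its SCUs; pairs are built
# only between annotated sentences (None marks an unannotated sentence,
# whose score is undefined).

def scu_pairs(scu_weights, sentence_scus):
    scores = {sid: sum(scu_weights[c] for c in scus)
              for sid, scus in sentence_scus.items() if scus is not None}
    return [(a, b) for a in scores for b in scores if scores[a] > scores[b]]

weights = {"scu1": 4, "scu2": 2}
annotated = {"s1": ["scu1", "scu2"], "s2": ["scu2"], "s3": [], "s4": None}
print(scu_pairs(weights, annotated))
# [('s1', 's2'), ('s1', 's3'), ('s2', 's3')] -- s4 is excluded
```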
Model Frequency Metrics
• Based on unigram and skip bigram frequency
• Computed for content words only
• Sentence $s_i$ is "better" than $s_j$ if $w(s_i) > w(s_j)$, where

$$w(s) = \sum_{k} \hat{p}_{\text{models}}(c_k)$$

and $\hat{p}_{\text{models}}(c_k)$ is the estimated frequency of content word $c_k$ of $s$ in the model summaries
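A sketch of the unigram version of this score; the stopword list is illustrative:

```python
from collections import Counter

# w(s) = sum over content words c_k of p_hat_models(c_k), the word's
# relative frequency in the human model summaries.

STOP = {"the", "a", "of", "in", "was"}

def model_freq_score(sentence, model_text):
    counts = Counter(w for w in model_text.split() if w not in STOP)
    total = sum(counts.values())
    return sum(counts[w] / total for w in sentence.split() if w not in STOP)

models = "floods destroyed the village floods displaced thousands"
print(model_freq_score("floods destroyed crops", models))  # 2/6 + 1/6 + 0 = 0.5
```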
Combining Multiple Metrics
• From the ROUGE oracle: all sentences in the oracle summary are better than the other sentences, giving preference pairs $D_1 = \{(i,j) : s_i \succ s_j\}$
• From the SCU annotations: sentences with higher average SCU weights are better, giving $D_2$
• From model frequency: sentences with words occurring in the models are better, giving $D_3$
• Combined loss: the losses over all three preference sets are added:

$$L = L(D_1) + L(D_2) + L(D_3)$$
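A sketch of the combined objective, reusing a pairwise log-loss over each preference set:

```python
import math

# Total loss is the sum of the pairwise log-losses over D1 (oracle),
# D2 (SCU), and D3 (model frequency).

def pair_loss(w, pairs, feats):
    """Sum of -log P(better > worse) over one preference set."""
    loss = 0.0
    for better, worse in pairs:
        margin = sum(wi * (b - c) for wi, b, c in zip(w, feats[better], feats[worse]))
        loss += math.log(1.0 + math.exp(-margin))
    return loss

def combined_loss(w, d1, d2, d3, feats):
    return pair_loss(w, d1, feats) + pair_loss(w, d2, feats) + pair_loss(w, d3, feats)
```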
Dynamic Sentence Scoring
• Eliminate redundancy by re-weighting
  • similar to SumBasic (Nenkova et al. 2006), features are re-weighted given the previously selected sentences
• Discounts apply to features that decompose into word frequency estimates
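A sketch of SumBasic-style re-weighting: after a sentence is chosen, the probability of each word it contains is squared, so redundant candidates score lower in later rounds (the probabilities here are illustrative):

```python
# Greedy selection with SumBasic's update rule p(w) <- p(w)^2 for words in
# each chosen sentence; without the discount, the redundant "floods valley"
# would be picked second instead of "cats slept".

def select_sentences(sentences, probs, k=2):
    chosen, p = [], dict(probs)
    for _ in range(k):
        best = max(sentences, key=lambda s: sum(p[w] for w in s.split()) / len(s.split()))
        chosen.append(best)
        sentences = [s for s in sentences if s != best]
        for w in best.split():
            p[w] = p[w] ** 2  # discount words already covered
    return chosen

probs = {"floods": 0.35, "hit": 0.15, "valley": 0.1, "cats": 0.12, "slept": 0.12}
print(select_sentences(["floods hit", "floods valley", "cats slept"], probs))
# ['floods hit', 'cats slept']
```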
Search
• The search constructs partial summaries and scores them:

$$\text{Score}(s_1, s_2, \ldots, s_n) = \sum_{i=1}^{n} \text{score}(s_i \mid s_1, \ldots, s_{i-1})$$

• The score of a summary does not decompose into an independent sum of sentence scores
  • global dependencies make exact search hard
• Multiple beams are used, one for each length of partial summary [McDonald 2007]
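A sketch of beam search with one beam per partial-summary length (length in sentences here for simplicity); `dyn_score` stands in for the dynamic scorer of a sentence given its predecessors:

```python
# Partial summaries are grouped by length, a separate beam is kept per
# length, and the best-scoring candidate overall is returned.

def beam_search(sentences, dyn_score, max_sents=3, beam=5):
    beams = {0: [((), 0.0)]}
    for length in range(max_sents):
        candidates = []
        for summ, sc in beams[length]:
            for s in sentences:
                if s not in summ:
                    candidates.append((summ + (s,), sc + dyn_score(s, summ)))
        candidates.sort(key=lambda x: -x[1])
        beams[length + 1] = candidates[:beam]  # one beam per length
    return max((c for b in beams.values() for c in b), key=lambda x: x[1])[0]

# Toy dynamic scorer: count words not yet covered by the partial summary.
toy = lambda s, prev: len(set(s.split()) - set(" ".join(prev).split()))
print(beam_search(["floods hit valley", "floods hit", "aid arrived"], toy))
```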
Impact of Sentence Simplification

            No Simplified          Simplified
            R-2      R-SU4         R-2      R-SU4
SumFocus    0.078    0.132         0.078    0.134
PYTHY       0.089    0.140         0.096    0.147

• Trained on 05 data, tested on 06 data
Evaluating the Metrics

                                      Content Only         All Words
Criterion     Num Pairs   Train Acc   R-2      R-SU4       R-2      R-SU4
Oracle        941K        93.1        0.076    0.107       0.093    0.143
SCUs          430K        62.0        0.078    0.108       0.086    0.134
Model Freq.   6.3M        96.9        0.076    0.106       0.096    0.147
All           7.7M        94.2        0.076    0.107       0.096    0.147

• Trained on 05 data, tested on 06 data
• Includes simplified sentences
Update Summarization Pilot
• SVM novelty classifier trained on the TREC 02 & 03 novelty tracks

                          ROUGE-2     ROUGE-SU4
PYTHY + Novelty (1)       0.07135     0.11164
PYTHY + Novelty (.5)      0.07879     0.12929
PYTHY + Novelty (.1)      0.08721     0.12958
PYTHY                     0.08686     0.12876
SumFocus                  0.07002     0.11033

$$\text{Score}(s_i \mid \text{PrevS}) = \text{Score}_{\text{Pythy}}(s_i \mid \text{PrevS}) \cdot \Pr(\text{novel}(s_i) \mid \text{BG})$$
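A sketch of this combination; treating the (1)/(.5)/(.1) variants in the table as an exponent-style weight on the novelty term is an assumption about the parameterization:

```python
# Combine the PYTHY score of s_i given previously selected sentences with the
# SVM classifier's probability that s_i is novel w.r.t. the background (BG).
# `lam` is an assumed weight corresponding to the (1)/(.5)/(.1) runs.

def update_score(pythy_score, p_novel, lam=0.5):
    return pythy_score * (p_novel ** lam)

print(update_score(0.8, 0.25, lam=0.5))  # 0.8 * 0.5 = 0.4
```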
Summary and Future Work
• Summary
  • Combination of different target metrics for training
  • Many sentence-level features
  • Pairwise ranking function
  • Dynamic scoring
• Future work
  • Boost robustness: the system is sensitive to cluster properties (e.g., size)
  • Improve grammatical quality of simplified sentences
  • Reconcile novelty and (ir)relevance
  • Learn features over whole summaries rather than individual sentences
Thank You