Breaking the Resource Bottleneck for Multilingual Parsing
Rebecca Hwa, Philip Resnik and Amy Weinberg
University of Maryland
The Treebank Bottleneck
• High-quality parsers need training examples with hand-annotated syntactic information
• Annotation is labor intensive and time consuming
• There is no sizable treebank for most languages other than English
[[S [NP-SBJ Ford Motor Co. ] [VP acquired [NP [NP 5 % ] [PP of [NP [NP the shares] [PP-LOC in [NP Jaguar PLC]]]]]]] . ]
State of the Art Parsing
Language                       Treebank           Size                        Parser Performance
English                        Penn Treebank      1M words, 40k sentences     ~90%
Chinese                        Chinese Treebank   100K words, 4k sentences    ~75%
Others (e.g., Hindi, Arabic)   ?                  ?                           ?
Research Questions
• How can we induce a non-English language treebank quickly and automatically?
  – Bootstrap from available English resources
  – Project syntactic dependency relationships across bilingual sentences
• How good is the resulting treebank?
  – Can we use it to train a new parser?
  – How can we improve its quality?
Roadmap
• Overview of the framework
  – Direct projection algorithm
    • Problematic cases
  – Post projection transformation
    • Remaining challenges
  – Filtering
• Experiment
  – Direct evaluation of the projected trees
  – Evaluation of a Chinese parser trained on the induced treebank
• Future Work
Overview of Our Framework
[Figure: framework pipeline. A bilingual corpus (English, Chinese) is processed by an English dependency parser and a word alignment model; Projection, Transformation, and Filtering yield a projected Chinese dependency treebank, which is used to train a dependency parser that produces dependency trees for unseen Chinese sentences.]
Necessary Resources: 1. Bilingual Sentences
English: The Chinese side expressed satisfaction regarding this subject
Chinese: 中国 方面 对 此 表示 满意
Necessary Resources: 2. English (Dependency) Parser
[Figure: the sentence pair with the English dependency tree drawn over the English words; link labels: subj, obj, adj, det, det, mod, mod]
Necessary Resources: 3. Word Alignment
[Figure: the sentence pair and English dependency tree, with word-alignment links between the English and Chinese words]
Projected Chinese Dependency Tree
[Figure: the aligned sentence pair with the English links projected onto the Chinese words; projected link labels: mod, obj, subj, adj, mod]
Direct Projection Algorithm
• If there is a syntactic relationship between two English words, then the same syntactic relationship also exists between their corresponding Chinese words
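The projection rule above can be sketched for the simplest (one-to-one) case. This is a minimal illustration, not the authors' implementation: the function name `project_one_to_one`, the `(head, dependent, label)` tuple layout, and the toy alignment are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# A dependency link is (head_index, dependent_index, label).
# Indices are word positions in the sentence (assumed representation).
Dependency = Tuple[int, int, str]


def project_one_to_one(
    english_deps: List[Dependency],
    alignment: Dict[int, int],
) -> List[Dependency]:
    """Copy each English link onto the aligned Chinese word pair.

    Only the one-to-one case is covered: a link is projected when both
    its head and its dependent are aligned; all other links are skipped.
    """
    projected = []
    for head, dep, label in english_deps:
        if head in alignment and dep in alignment:
            projected.append((alignment[head], alignment[dep], label))
    return projected


# English: The(0) Chinese(1) side(2) expressed(3) satisfaction(4) ...
# Chinese: 中国(0) 方面(1) 对(2) 此(3) 表示(4) 满意(5)
# Hypothetical alignment for illustration: side→方面, expressed→表示,
# satisfaction→满意. "The" is left unaligned, so its det link is skipped.
english_deps = [(3, 2, "subj"), (3, 4, "obj"), (2, 0, "det")]
alignment = {2: 1, 3: 4, 4: 5}
chinese_deps = project_one_to_one(english_deps, alignment)
```

Links that touch an unaligned or multiply aligned word are exactly the problematic cases treated on the following slides.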
Problematic Case: Unaligned English
[Figure: English "regarding this subject" (links: det, mod) over Chinese "对 此". After projection, an empty node *e* is added on the Chinese side, and the det and mod links are projected onto the Chinese words.]
Problematic Case: many-to-1
[Figure: English "regarding this subject" (links: det, mod) over Chinese "对 此". After projection, the Chinese side carries a mod link.]
Problematic Case: Unaligned Chinese
[Figure: English "The Chinese expressed" (links: subj, det) over Chinese "中国 方面 表示", with empty nodes *e* marked. After projection, the subj and det links also appear on the Chinese side.]
Problematic Case: 1-to-many
[Figure: English "The Chinese expressed" (links: subj, det) over Chinese "中国 方面 表示", with an empty node *e*. After projection, a new node *M* is added on the Chinese side with mac links, and the subj and det links are projected.]
Output of the Direct Projection Algorithm
[Figure: English "The Chinese expressed satisfaction regarding this subject" over Chinese "中国 方面 对 此 表示 满意", showing the projected Chinese tree with the added nodes *M* and *e*; link labels: subj, obj, det, mod, mac]
Post Projection Transformation
• Handles One-to-Many mapping
  – Select head based on (projected) part-of-speech categories
• Handles some Unaligned-Chinese cases
  – Only addresses closed-class words
    • Function words (e.g., aspectual, measure words)
    • Easily enumerable lexical categories (e.g., $, RMB, yen)
• Removes empty nodes introduced by the Unaligned-English cases by promoting each empty node's head child
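The last transformation, removing an empty node by promoting its head child, can be sketched as follows. The slides do not say how the head child is chosen, so the `leftmost` rule below is a placeholder assumption, as are the function names and the child-to-parent dictionary representation.

```python
from typing import Callable, Dict, List, Optional, Set


def leftmost(children: List[str]) -> str:
    """Hypothetical head rule: pick the first child in insertion order."""
    return children[0]


def remove_empty_nodes(
    parent: Dict[str, Optional[str]],
    empty: Set[str],
    choose_head: Callable[[List[str]], str] = leftmost,
) -> Dict[str, Optional[str]]:
    """Return a copy of the tree with every empty node removed.

    `parent` maps each node to its head (None for the root). For each
    empty node, one child is promoted into its place and the remaining
    children are re-attached to the promoted child.
    """
    tree = dict(parent)
    for node in sorted(empty):  # sorted for a deterministic order
        children = [c for c, p in tree.items() if p == node]
        grandparent = tree.pop(node)
        if not children:
            continue
        head = choose_head(children)
        tree[head] = grandparent
        for child in children:
            if child != head:
                tree[child] = head
    return tree


# Toy tree: root -> *e* -> {a, b}. After removal, a takes the place of *e*.
toy = {"root": None, "*e*": "root", "a": "*e*", "b": "*e*"}
cleaned = remove_empty_nodes(toy, {"*e*"})
```

A real head rule would presumably use linguistic information such as part-of-speech categories; `choose_head` is left pluggable for that reason.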
Remaining Challenges
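• Handling divergences
• Incorporating unaligned foreign words into the projected tree
• Removing crossing dependencies
[Figure: crossing-dependency illustration with English words A B over foreign words a b, and English words C D over foreign words d c]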
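To make the crossing-dependency problem concrete, here is a small sketch that finds crossing links in a projected tree. It assumes that two links "cross" when their spans over the word positions interleave; the function name and the `(head, dependent)` pair representation are illustrative choices, not taken from the slides.

```python
from itertools import combinations
from typing import List, Tuple

Link = Tuple[int, int]  # (head_index, dependent_index)


def crossing_pairs(links: List[Link]) -> List[Tuple[Link, Link]]:
    """Return every pair of links whose spans interleave.

    Each link is reduced to its (left, right) span. Two spans cross when
    exactly one endpoint of one lies strictly inside the other. Nested
    spans and spans sharing an endpoint do not count as crossing.
    """
    spans = [(min(h, d), max(h, d)) for h, d in links]
    crossing = []
    for (a1, a2), (b1, b2) in combinations(spans, 2):
        if a1 < b1 < a2 < b2 or b1 < a1 < b2 < a2:
            crossing.append(((a1, a2), (b1, b2)))
    return crossing


# Links 0-2 and 1-3 interleave, so they cross; 0-3 and 1-2 are nested.
crossed = crossing_pairs([(0, 2), (3, 1)])
nested = crossing_pairs([(0, 3), (1, 2)])
```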
Filtering
• Projected treebank is noisy
  – Mistakes introduced by the projection algorithm
  – Mistakes introduced by component errors
• Use aggressive filtering techniques to remove the worst projected trees
  – Filter out a sentence pair if many English words are unaligned
  – Filter out a sentence pair if many Chinese words are aligned to the same English word
  – Filter out a sentence pair if many of the projected links cause crossing dependencies
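The three filtering criteria could be combined along these lines. The slides give no thresholds, so the default values below are hypothetical placeholders, and the function name and argument layout are assumptions for illustration only.

```python
from collections import Counter
from typing import List, Tuple


def keep_sentence_pair(
    num_english_words: int,
    alignment: List[Tuple[int, int]],  # (english_index, chinese_index)
    num_projected_links: int,
    num_crossing_links: int,
    max_unaligned_ratio: float = 0.3,  # hypothetical threshold
    max_chinese_per_english: int = 3,  # hypothetical threshold
    max_crossing_ratio: float = 0.2,   # hypothetical threshold
) -> bool:
    """Return True if the sentence pair passes all three filters."""
    if num_english_words == 0:
        return False

    # Filter 1: too many English words without any alignment.
    aligned_english = {e for e, _ in alignment}
    unaligned_ratio = 1 - len(aligned_english) / num_english_words
    if unaligned_ratio > max_unaligned_ratio:
        return False

    # Filter 2: too many Chinese words aligned to one English word.
    fan_out = Counter(e for e, _ in set(alignment))
    if fan_out and max(fan_out.values()) > max_chinese_per_english:
        return False

    # Filter 3: too many projected links involved in crossings.
    if num_projected_links > 0:
        if num_crossing_links / num_projected_links > max_crossing_ratio:
            return False

    return True


one_to_one = [(0, 0), (1, 1), (2, 2), (3, 3)]
kept = keep_sentence_pair(4, one_to_one, num_projected_links=3, num_crossing_links=0)
```

In practice the thresholds would need tuning against held-out data, trading treebank size against noise.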
Experiments
• Direct evaluation of the projection framework
  – Compare the (pre-filtered) projected trees against a human-annotated gold standard
• Evaluation of the projected treebank
  – Use the (post-filtered) treebank to train a Chinese parser
  – Test the parser on unseen sentences and compare the output to a human-annotated gold standard
Direct Evaluation
• Bilingual data: 88 Chinese Treebank sentences with their English translations
• Apply projection and transformation under idealized conditions
  – Given human-corrected English parse trees and hand-drawn word alignments
• Apply projection and transformation under realistic conditions
  – English parse trees generated by the Collins parser (trained on the Penn Treebank)
  – Word alignments generated by the IBM MT Model (trained on ~56K Hong Kong News bilingual sentences)
Direct Evaluation Results
Condition                                Accuracy*
Ideal                                    67%
English parses from the Collins parser   62%
Word-alignments from the IBM MT Model    39%
*Accuracy = f-score based on unlabeled precision & recall
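The accuracy measure can be sketched as below. It assumes the f-score is the balanced harmonic mean of precision and recall (F1) over unlabeled `(head, dependent)` links; the slide does not state the weighting, and the function name is illustrative.

```python
from typing import Set, Tuple

Link = Tuple[int, int]  # (head_index, dependent_index), label ignored


def unlabeled_f_score(predicted: Set[Link], gold: Set[Link]) -> float:
    """Balanced F-score over unlabeled dependency links.

    precision = correct / number of predicted links
    recall    = correct / number of gold links
    """
    correct = len(predicted & gold)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy example: 2 of 3 predicted links are correct, out of 4 gold links.
predicted = {(1, 0), (1, 2), (2, 3)}
gold = {(1, 0), (1, 2), (1, 3), (3, 4)}
score = unlabeled_f_score(predicted, gold)  # precision 2/3, recall 1/2
```

Precision and recall can differ here because a projected tree may contain more or fewer links than the gold-standard tree, for example when words are left unattached.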
Evaluating Trained Parser
• Bilingual data: 56K sentence pairs from the Hong Kong News parallel corpus
• Apply the DPA (using the Collins Parser and IBM MT Model) to create a projected Chinese treebank
• Filter out badly-aligned sentence pairs to reduce noise
• Train a Chinese parser with the (filtered) projected treebank
• Test the Chinese parser on an unseen test set (88 Chinese Treebank sentences)
Parser Evaluation Results
Method                       Training Corpus     Corpus Size   Parser Accuracy
Modify Prev (baseline)       -                   -             13.5
Modify Next (baseline)       -                   -             35.7
Stat. Parser                 HKNews (Filtered)   5284          42.3
Stat. Parser (upper bound)   Chinese Treebank    3870          75.6
Conclusion
• We have presented a framework for acquiring Chinese dependency treebanks by bootstrapping from existing linguistic resources
• Although the projected trees can reach nearly 70% accuracy in principle (67% under ideal conditions), reducing noise caused by word-alignment errors is still a major challenge
• A parser trained on the induced treebank outperforms both simple baselines (42.3 vs. 13.5 and 35.7) but remains well below the treebank-trained upper bound (75.6)
Future Work
• Obtain larger parallel corpus
• Reduce error rates of the word-alignment models
• Develop more sophisticated techniques to filter out noise in the induced treebank
• Improve the projection algorithm to handle unaligned words and inconsistent trees
Reserve slides
DPA Case 1: One-to-One
[Figure: English words A, B over foreign words a, b]
DPA Case 2: Many-to-One
[Figure: English words A1, A2, A3, B, C over foreign words a, b, c]
DPA Case 3: One-to-Many
[Figure: English words A, B over foreign words a1, a2, a3, b, with an added node *a*]
DPA Case 4: Many-to-Many
[Figure: English words A1, A2, A3, B, C over foreign words a1, a2, b, c, with an added node *a*]
DPA Case 5: Unaligned English Word
[Figure: English words A, B, C over foreign words a, c]
DPA Case 6: Unaligned Foreign Word
[Figure: English words A, C over foreign words a, b, c]