A Scalable Decoder for Parsing-based Machine Translation with Equivalent Language Model State Maintenance
Zhifei Li and Sanjeev Khudanpur
Johns Hopkins University
JOSHUA: a scalable open-source parsing-based MT decoder
- Written in Java
- Chart parsing
- Beam and cube pruning
- K-best extraction over a hypergraph
- m-gram LM integration
- Parallel decoding
- Distributed LM (Zhang et al., 2006; Brants et al., 2007)
- Equivalent LM state maintenance (new!)
- We plan to add more functions soon
Grammar formalism
- Synchronous Context-Free Grammar (SCFG) (Chiang, 2007)

Chart parsing
- Bottom-up parsing
- It maintains a chart, which contains an array of cells or bins
- A cell maintains a list of items
- The parsing process starts from axioms and proceeds by applying inference rules to prove more and more items, until a goal item is proved
- The hypotheses are stored in a hypergraph
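As a concrete illustration of what a chart item carries, the sketch below keys an item by its source span, nonterminal, and boundary LM words. The class and field names are hypothetical, not Joshua's actual data structures:

```java
// A minimal sketch of a chart item (hypothetical names, not Joshua's API).
// Items agreeing on span, nonterminal, and both LM states share a signature
// and can live in the same equivalence class within a chart cell.
public class ChartItem {
    final String nonterminal;   // e.g. "X"
    final int start, end;       // source span
    final String[] leftState;   // leftmost m-1 target words
    final String[] rightState;  // rightmost m-1 target words

    ChartItem(String nt, int start, int end, String[] left, String[] right) {
        this.nonterminal = nt; this.start = start; this.end = end;
        this.leftState = left; this.rightState = right;
    }

    // Signature in the same "X | i, j | left | right" spirit as the slides.
    String signature() {
        return nonterminal + "|" + start + "," + end + "|"
             + String.join(" ", leftState) + "|" + String.join(" ", rightState);
    }

    public static void main(String[] args) {
        ChartItem a = new ChartItem("X", 0, 4,
            new String[]{"a", "cat"}, new String[]{"the", "mat"});
        System.out.println(a.signature()); // X|0,4|a cat|the mat
    }
}
```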
Chart-parsing and Hypergraph
[Figure: a hypergraph over the source sentence 垫子0 上1 的2 猫3. Nodes are items such as X | 0, 2 | the mat | NA, X | 3, 4 | a cat | NA, X | 0, 4 | a cat | the mat, and X | 0, 4 | the mat | a cat; hyperedges are labeled with rules such as X → (猫, a cat), X → (垫子 上, the mat), X → (X0 的 X1, X0 X1), X → (X0 的 X1, X0 's X1), X → (X0 的 X1, X1 of X0), and X → (X0 的 X1, X1 on X0); the goal item is derived with S → (X0, X0).]
Hypergraph and Trees
[Figure: the four derivation trees packed in the hypergraph for 垫子0 上1 的2 猫3. Each tree uses X → (猫, a cat), X → (垫子 上, the mat), and the goal rule S → (X0, X0), but differs in the rule for 的: X → (X0 的 X1, X0 X1) yields "the mat a cat"; X → (X0 的 X1, X0 's X1) yields "the mat 's a cat"; X → (X0 的 X1, X1 on X0) yields "a cat on the mat"; X → (X0 的 X1, X1 of X0) yields "a cat of the mat".]
How to Integrate an m-gram LM?
[Figure: a derivation for 奥运会0 将1 在2 中国3 的4 北京5 举行。6 producing "the olympic game will be held in beijing of china .". Items include X | 0, 1 | the olympic | olympic game; X | 5, 6 | beijing | NA; X | 3, 4 | china | NA; X | 3, 6 | beijing of | of china; X | 1, 7 | will be | china .; S | 0, 7 | the olympic | china .; and finally S | 0, 7 | <s> the | . </s>. Rules used: X → (奥运会, the olympic game), X → (中国, china), X → (北京, beijing), X → (X0 的 X1, X1 of X0), X → (将 在 X0 举行。, will be held in X0 .), S → (X0, X0), S → (S0 X1, S0 X1), and S → (<s> S0 </s>, <s> S0 </s>).]
Three functions
- Accumulate probability: multiply the antecedent items' probabilities by the probability of each newly completed m-gram. E.g., combining items with probabilities 0.4 and 0.2, where the new 3-gram "beijing of china" has probability 0.5, gives 0.04 = 0.4 × 0.2 × 0.5. The other new 3-grams created in this derivation are "will be held", "be held in", "held in beijing", and "in beijing of".
- Estimate future cost: e.g., with future probability P(beijing of) = 0.01, the estimated total probability is 0.01 × 0.04 = 0.0004.
- State extraction: record the item's left and right LM boundary words.
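The first two functions above can be sketched numerically with the slide's toy values (0.4 and 0.2 for the antecedent items, 0.5 for the newly completed 3-gram, 0.01 for the future cost); the method names are hypothetical:

```java
// Numeric sketch of probability accumulation and future-cost estimation
// during m-gram LM integration (illustrative names and values).
public class LmIntegration {
    // Accumulate probability: multiply the antecedent items' probabilities
    // by the probability of every m-gram the combination completes.
    static double accumulate(double[] antecedentProbs, double[] newNgramProbs) {
        double p = 1.0;
        for (double x : antecedentProbs) p *= x;
        for (double x : newNgramProbs) p *= x;
        return p;
    }

    // Estimate future cost: multiply in an estimate for the boundary
    // words whose full context is not yet known.
    static double estimateTotal(double accumulated, double futureProb) {
        return accumulated * futureProb;
    }

    public static void main(String[] args) {
        double acc = accumulate(new double[]{0.4, 0.2}, new double[]{0.5});
        System.out.println(acc);                      // ~0.04 (0.4 * 0.2 * 0.5)
        System.out.println(estimateTotal(acc, 0.01)); // ~0.0004
    }
}
```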
Equivalent State Maintenance: overview
- In a straightforward implementation, different LM state words lead to different items, e.g. (produced by different rules of the form X → (在 X0 的 X1 下, ...)):
  X | 0, 3 | below cat | some rat
  X | 0, 3 | below cats | many rat
  X | 0, 3 | below cat | many rat
- We merge multiple items into a single item by replacing some LM state words with an asterisk wildcard:
  X | 0, 3 | below * | * rat
- By merging items, we can explore a larger hypothesis space using less time.
- We only merge items when the length l of the English span satisfies l ≥ m − 1.
Back-off Parameterization of m-gram LMs
- LM probability computation: use P(w | h) directly if the m-gram "h w" is listed; otherwise back off to the shortened history h′, P(w | h) = β(h) · P(w | h′)
- Observations: a larger m leads to more backoff; the default backoff weight is 1 (for an m-gram not listed, β(·) = 1)
- Example ARPA bigram entries (log10 probability, bigram, log10 backoff weight where listed):
  -4.250922 party files
  -4.741889 party filled
  -4.250922 party finance -0.1434139
  -4.741889 party financed
  -4.741889 party finances -0.2361806
  -4.741889 party financially
  -3.33127 party financing -0.1119054
  -3.277455 party finished -0.4362795
  -4.012205 party fired
  -4.741889 party fires
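A minimal back-off lookup over a bigram model can be sketched as follows. The listed bigram entries come from the slide; the unigram value and the backoff weight β("party") are made up for illustration, and the class is not Joshua's actual LM code:

```java
import java.util.HashMap;
import java.util.Map;

// A toy back-off bigram model in log10 space. An unlisted backoff
// weight defaults to 1 (i.e., log10 beta = 0).
public class BackoffLm {
    static final Map<String, Double> lp = new HashMap<>();   // log10 probabilities
    static final Map<String, Double> beta = new HashMap<>(); // log10 backoff weights
    static {
        lp.put("party files", -4.250922);     // bigram from the slide
        lp.put("party finished", -3.277455);  // bigram from the slide
        lp.put("surrendered", -5.0);          // illustrative unigram
        beta.put("party", -0.5);              // illustrative backoff weight
    }

    // log10 P(w | h): direct lookup if "h w" is listed, else back off.
    static double logP(String h, String w) {
        Double direct = lp.get(h + " " + w);
        if (direct != null) return direct;
        return beta.getOrDefault(h, 0.0) + lp.getOrDefault(w, -99.0);
    }

    public static void main(String[] args) {
        System.out.println(logP("party", "files"));       // listed: -4.250922
        System.out.println(logP("party", "surrendered")); // backoff: -0.5 + -5.0 = -5.5
    }
}
```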
Equivalent State Maintenance: Right-side
- Why not right to left? Whether a word can be ignored depends on both its left and right sides, which complicates the procedure.
- Scan the right-side state words left to right; at each step, check whether the remaining state words are a prefix of some m-gram listed in the LM (IS-A-PREFIX):

  state words           check                           equivalent state
  e_{l-2} e_{l-1} e_l   IS-A-PREFIX(e_{l-1} e_l) = no   * e_{l-1} e_l
  * e_{l-1} e_l         IS-A-PREFIX(e_l) = no           * * e_l
  * * e_l               IS-A-PREFIX(e_l) = no           * * *

- Justification (state independent of e_{l-2}): IS-A-PREFIX(e_{l-1} e_l) = no implies IS-A-PREFIX(e_{l-1} e_l e_{l+1}) = no; for a 4-gram LM, P(e_{l+1} | e_{l-2} e_{l-1} e_l) = P(e_{l+1} | e_{l-1} e_l) · β(e_{l-2} e_{l-1} e_l) = P(e_{l+1} | e_{l-1} e_l), since the backoff weight is one.
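The right-side scan can be sketched as below. The `isPrefix` set stands in for a lookup answering "is this word sequence a prefix of some n-gram listed in the LM?" (toy contents here; the real decoder consults the LM itself), and the handling of the final state word is simplified:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Sketch of right-side equivalent-state computation (illustrative names).
public class RightState {
    static final Set<String> isPrefix = new HashSet<>(
        Arrays.asList("mat", "the mat")); // toy stand-in for the LM's prefix lookup

    // Scan left to right: a state word is irrelevant for future n-grams
    // once the words to its right can no longer begin a listed n-gram.
    // Dropping the final word additionally relies on the backoff weights
    // being one (simplified here).
    static String[] equivalentRightState(String[] state) {
        String[] out = state.clone();
        boolean allDropped = true;
        for (int j = 0; j + 1 < out.length; j++) {
            String rest = String.join(" ",
                Arrays.copyOfRange(out, j + 1, out.length));
            if (!isPrefix.contains(rest)) {
                out[j] = "*";          // e_j cannot influence any future n-gram
            } else {
                allDropped = false;
                break;                 // the remaining words are still needed
            }
        }
        if (allDropped && !isPrefix.contains(out[out.length - 1])) {
            out[out.length - 1] = "*";
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(String.join(" ",
            equivalentRightState(new String[]{"cat", "on", "mat"}))); // * on mat
        System.out.println(String.join(" ",
            equivalentRightState(new String[]{"a", "cat", "sat"})));  // * * *
    }
}
```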
Equivalent State Maintenance: Left-side
- Why not left to right? Whether a word can be ignored depends on both its left and right sides, which complicates the procedure.
- Scan the left-side state words right to left; at each step, check whether the leading state words are a suffix of some m-gram listed in the LM (IS-A-SUFFIX):

  state words   check                       equivalent state
  e_1 e_2 e_3   IS-A-SUFFIX(e_1 e_2) = no   e_1 e_2 *
  e_1 e_2 *     IS-A-SUFFIX(e_1) = no       e_1 * *
  e_1 * *       IS-A-SUFFIX(e_1) = no       * * *

- Justification (state independent of e_3): for a 4-gram LM, P(e_3 | e_0 e_1 e_2) = P(e_3 | e_1 e_2) · β(e_0 e_1 e_2); similarly, P(e_2 | e_{-1} e_0 e_1) = P(e_2 | e_1) · β(e_0 e_1) · β(e_{-1} e_0 e_1).
- The probability of a dropped word can thus be finalized now, e.g. P(e_1 | e_{-2} e_{-1} e_0) = P(e_1) · β(e_0) · β(e_{-1} e_0) · β(e_{-2} e_{-1} e_0); remember to factor in the backoff weights later, once the preceding words become known.
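The left-side scan mirrors the right-side one. In the sketch below, `isSuffix` stands in for "does this word sequence end some n-gram listed in the LM?" (toy contents); the bookkeeping of pending backoff weights, which must be factored in later, is omitted:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Sketch of left-side equivalent-state computation (illustrative names).
public class LeftState {
    static final Set<String> isSuffix = new HashSet<>(
        Arrays.asList("a", "a cat")); // toy stand-in for the LM's suffix lookup

    // Scan right to left: a state word can be finalized (and wildcarded)
    // once the words to its left can no longer end a listed n-gram.
    static String[] equivalentLeftState(String[] state) {
        String[] out = state.clone();
        boolean allDropped = true;
        for (int j = out.length - 1; j >= 1; j--) {
            String lead = String.join(" ", Arrays.copyOfRange(out, 0, j));
            if (!isSuffix.contains(lead)) {
                out[j] = "*";          // P(e_j | ...) can be computed already
            } else {
                allDropped = false;
                break;                 // the leading words still matter
            }
        }
        if (allDropped && !isSuffix.contains(out[0])) {
            out[0] = "*";
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(String.join(" ",
            equivalentLeftState(new String[]{"a", "x", "y"})));   // a x *
        System.out.println(String.join(" ",
            equivalentLeftState(new String[]{"p", "q", "r"})));   // * * *
    }
}
```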
Equivalent State Maintenance: summary
[Table: original vs. modified cost function, comparing the finalized probability, the estimated probability, and state extraction under equivalent state maintenance.]
Experimental Results: Decoding Speed
- System training: Chinese-to-English translation task; sub-sampling a bitext of about 3M sentence pairs to obtain 570k sentence pairs; LM training data: Gigaword and the English side of the bitext
- Decoding: 3M rules; 49M m-grams
- 38 times faster than the baseline!
Experimental Results: Distributed LM
- Distributed language model: eight 7-gram LMs
- Decoding speed: 12.2 sec/sent

Experimental Results: Equivalent LM States
- Search effort versus search quality with equivalent LM state maintenance
- Sparse LM: a 7-gram LM built on about 19M words
- Dense LM: a 3-gram LM built on about 130M words; here equivalent LM state maintenance is slower than the regular method, because backoff happens less frequently and suffix/prefix information lookup is inefficient
Summary
- We describe a scalable parsing-based MT decoder
- The decoder has been successfully used for decoding millions of sentences in a large-scale discriminative training task
- We propose a method to maintain equivalent LM states
- The decoder is available at http://www.cs.jhu.edu/~zfli/

Acknowledgements
- Thanks to Philip Resnik for letting me use the UMD Python decoder
- Thanks to UMD MT group members for very helpful discussions
- Thanks to David Chiang for Hiero and his original implementation in Python
Grammar Formalism
- Synchronous Context-Free Grammar (SCFG)
  - Ts: a set of source-language terminal symbols
  - Tt: a set of target-language terminal symbols
  - N: a shared set of nonterminal symbols
  - A set of rules, each rewriting a nonterminal into a pair of co-indexed source and target strings
  - A typical rule looks like: X → (X0 的 X1, X1 of X0)
- Decoding task: find the most probable derivation of the source sentence and read off its target-side yield
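A typical rule such as the one above can be held in a small container class. The sketch below uses hypothetical names, not Joshua's internal rule representation; the co-indexed nonterminals X0, X1 link the two sides:

```java
// A toy container for an SCFG rule (illustrative, not Joshua's API).
public class ScfgRule {
    final String lhs;        // left-hand-side nonterminal, e.g. "X"
    final String[] source;   // terminals and co-indexed nonterminals
    final String[] target;

    ScfgRule(String lhs, String[] source, String[] target) {
        this.lhs = lhs; this.source = source; this.target = target;
    }

    @Override public String toString() {
        return lhs + " -> (" + String.join(" ", source) + ", "
                   + String.join(" ", target) + ")";
    }

    public static void main(String[] args) {
        // The target side "X1 of X0" reorders the source-side nonterminals.
        ScfgRule r = new ScfgRule("X",
            new String[]{"X0", "的", "X1"},
            new String[]{"X1", "of", "X0"});
        System.out.println(r); // X -> (X0 的 X1, X1 of X0)
    }
}
```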
m-gram LM Integration
- Three functions: accumulate probability, estimate future cost, state extraction
[Figure: the cost function, with its finalized-probability, estimated-probability, and state-extraction components.]
Parallel and Distributed Decoding
- Parallel decoding: divide the test set into multiple parts; each part is decoded by a separate thread; the threads share the language/translation models in memory
- Distributed language model (DLM) training: divide the corpora into multiple parts; train a LM on each part; find the optimal interpolation weights among the LMs by maximizing the likelihood of a dev set
- Decoding with a DLM: load the LMs into different servers; the decoder remotely calls the servers to obtain the probabilities and interpolates them on the fly; to save communication overhead, a cache is maintained
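The client-side cache in front of the remote LM servers can be sketched as follows. The class name, the lambda standing in for the RPC call, and the interpolation weights are all illustrative:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Sketch of a caching client for a distributed LM: an n-gram already
// fetched from the servers is never requested again.
public class CachedDlmClient {
    private final Map<String, Double> cache = new HashMap<>();
    private final Function<String, double[]> remoteLookup; // stand-in for RPC to the LM servers
    private final double[] weights;                        // interpolation weights from dev-set tuning
    private int remoteCalls = 0;

    CachedDlmClient(Function<String, double[]> remoteLookup, double[] weights) {
        this.remoteLookup = remoteLookup;
        this.weights = weights;
    }

    double prob(String ngram) {
        Double hit = cache.get(ngram);
        if (hit != null) return hit;                  // served from the cache
        remoteCalls++;
        double[] probs = remoteLookup.apply(ngram);   // one probability per LM server
        double p = 0.0;
        for (int i = 0; i < probs.length; i++) {
            p += weights[i] * probs[i];               // interpolate on the fly
        }
        cache.put(ngram, p);
        return p;
    }

    int remoteCalls() { return remoteCalls; }

    public static void main(String[] args) {
        CachedDlmClient c = new CachedDlmClient(
            ng -> new double[]{0.2, 0.4},  // fake servers
            new double[]{0.5, 0.5});
        System.out.println(c.prob("will be held")); // 0.3
        c.prob("will be held");                     // cache hit, no remote call
        System.out.println(c.remoteCalls());        // 1
    }
}
```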
Chart-parsing
- Decoding task is defined as [equation on slide]
- The hypotheses are stored in a structure called a hypergraph
- State of an item: source span, left-side nonterminal symbol, and left/right LM state
- Decoding complexity: [equation on slide]
Hypergraph
- A hypergraph consists of a set of nodes and hyperedges; in parsing, they correspond to items and deductive steps, respectively
- Roughly, a hyperedge can be thought of as a rule with pointers to its antecedent items
- State of an item: source span, left-side nonterminal symbol, and left/right LM state
[Figure: the hypergraph over 垫子0 上1 的2 猫3, with items X | 3, 4 | a cat | NA, X | 0, 2 | the mat | NA, X | 0, 4 | a cat | the mat, and X | 0, 4 | the mat | a cat; hyperedges labeled X → (猫, a cat), X → (垫子 上, the mat), X → (X0 的 X1, X0 X1), X → (X0 的 X1, X0 's X1), X → (X0 的 X1, X1 of X0), and X → (X0 的 X1, X1 on X0); the goal item is derived with S → (X0, X0).]
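The node/hyperedge structure above can be sketched with a toy Viterbi (max-product) traversal; class names, weights, and the recursion are illustrative, not Joshua's k-best machinery:

```java
import java.util.ArrayList;
import java.util.List;

// A toy hypergraph: nodes are items; hyperedges carry a rule weight and
// point to their antecedent nodes. bestScore computes the Viterbi score
// bottom-up (axiom nodes, with no incoming edges, score 1).
public class Hypergraph {
    static class Node {
        final List<Hyperedge> incoming = new ArrayList<>();
    }

    static class Hyperedge {
        final double ruleWeight;
        final List<Node> antecedents;
        Hyperedge(double w, List<Node> ants) { ruleWeight = w; antecedents = ants; }
    }

    // Best derivation score of a node: over incoming hyperedges, take the
    // max of the rule weight times the best scores of all antecedents.
    static double bestScore(Node n) {
        if (n.incoming.isEmpty()) return 1.0;
        double best = 0.0;
        for (Hyperedge e : n.incoming) {
            double s = e.ruleWeight;
            for (Node a : e.antecedents) s *= bestScore(a);
            best = Math.max(best, s);
        }
        return best;
    }

    public static void main(String[] args) {
        Node cat = new Node(), mat = new Node(), goal = new Node();
        // Two competing hyperedges (e.g. "X1 of X0" vs. "X1 on X0")
        // deriving the goal from the same antecedent items.
        goal.incoming.add(new Hyperedge(0.3, List.of(cat, mat)));
        goal.incoming.add(new Hyperedge(0.6, List.of(cat, mat)));
        System.out.println(bestScore(goal)); // 0.6
    }
}
```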