BAYESIAN LEARNING OF N-GRAM STATISTICAL LANGUAGE MODELING
Shuanhu Bai and Haizhou Li
Institute for Infocomm Research, Republic of Singapore
Outline
• Introduction
• N-gram Model
  – Bayesian Learning
  – QB Estimation for Incremental Learning
• Continuous N-gram Model
  – Bayesian Learning
  – QB Estimation for Incremental Learning
• Experimental Results
• Conclusions
Introduction
• Even with ample training data, n-gram language models are still far from optimal
• Studies show that they are extremely sensitive to changes in style, topic or genre
• LM adaptation aims to bridge the mismatch between the models and the test domain
• A typical n-gram LM is trained under the maximum likelihood estimation (MLE) criterion
Introduction (cont.)
• One typical adaptation technique is deleted interpolation, which combines the flat but reliable general model (the baseline model) with the sharp but volatile domain-specific model
• In this paper, we will study the Bayesian learning formulation for n-gram LM adaptation
• Under the Bayesian learning framework, an incremental adaptation procedure is also proposed for dynamically updating the cache-based n-gram
N-gram Model
• N-gram model: $P(w_1^T) \approx \prod_{t=1}^{T} P(w_t \mid w_{t-n+1}^{t-1})$
• The quality of a given n-gram LM on a corpus D of size T is commonly assessed by the log-likelihood $\log P(D) = \sum_{t=1}^{T} \log P(w_t \mid w_{t-n+1}^{t-1})$ (a small sketch follows below)
• Unigram & Bigram
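As a concrete illustration (not from the paper; function names and the toy corpus are mine), the following minimal Python sketch builds relative-frequency bigram estimates and evaluates the corpus log-likelihood:

```python
# Minimal sketch: bigram MLE estimates and the corpus log-likelihood
# log P(D) = sum_t log P(w_t | w_{t-1}).
import math
from collections import Counter

def train_bigram_mle(tokens):
    """Relative-frequency (MLE) bigram estimates: P(w | v) = C(v, w) / C(v)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    history = Counter(tokens[:-1])
    return {(v, w): c / history[v] for (v, w), c in bigrams.items()}

def log_likelihood(tokens, probs, floor=1e-10):
    """log P(D); unseen bigrams get a small probability floor."""
    return sum(math.log(probs.get((v, w), floor)) for v, w in zip(tokens, tokens[1:]))

corpus = "the cat sat on the mat and the cat ran".split()
model = train_bigram_mle(corpus)
print(log_likelihood(corpus, model))
```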
N-gram Model (cont.)
• MLE
• Smoothing (a cache-interpolation sketch follows below)
  – Backoff
  – Cache
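A hedged sketch of cache smoothing, assuming simple linear interpolation between the static bigram and a unigram cache of recent words; the class name, interpolation weight, cache window, and uniform backoff are illustrative choices, not the paper's settings:

```python
from collections import Counter, deque

class CacheBigram:
    """Interpolates a static bigram LM with a unigram cache of recent words."""
    def __init__(self, static_probs, vocab_size, window=2000, lam=0.9):
        self.static = static_probs            # P_static(w | v) from the baseline LM
        self.vocab_size = vocab_size
        self.cache = deque(maxlen=window)     # most recent words of the running text
        self.lam = lam                        # interpolation weight (illustrative)

    def prob(self, w, v):
        p_static = self.static.get((v, w), 1.0 / self.vocab_size)  # crude uniform backoff
        cache_counts = Counter(self.cache)
        p_cache = cache_counts[w] / max(len(self.cache), 1)
        return self.lam * p_static + (1.0 - self.lam) * p_cache

    def observe(self, w):
        self.cache.append(w)
```

In use, observe() would be called on each word of the test text so that the cache tracks the current document.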
Bayesian Learning for N-gram Model
• Dirichlet prior: $g(\theta_1, \ldots, \theta_V) \propto \prod_{i=1}^{V} \theta_i^{m_i - 1}$
• The probability of generating a text corpus is obtained by integrating over the parameter space
• MAP estimate: $\hat{P}(w_i) = \dfrac{C_i + m_i - 1}{\sum_{i=1}^{V} (C_i + m_i - 1)}$ (a code sketch follows below)
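The MAP rule amounts to adding the Dirichlet pseudo-counts $m_i - 1$ to the observed counts before normalising; a small sketch under that assumption (function name and example numbers are mine):

```python
# MAP estimate of a multinomial (e.g. one n-gram history) under a
# Dirichlet prior: P(w_i) is proportional to C_i + m_i - 1.  Assumes all m_i > 1.
def map_estimate(counts, hyper):
    """counts[i] = C_i from the adaptation data, hyper[i] = Dirichlet m_i."""
    weights = [c + m - 1.0 for c, m in zip(counts, hyper)]
    total = sum(weights)
    return [w / total for w in weights]

# Illustrative numbers: a prior favouring word 0, a few new observations.
print(map_estimate(counts=[2, 0, 1], hyper=[5.0, 2.0, 2.0]))
```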
QB Estimation for Incremental Learning for N-gram Model
• It is of practical use to devise an incremental learning mechanism that adapts both the parameters and the prior knowledge over time
• Sub-corpora $D^{n} = \{D_1, D_2, \ldots, D_n\}$
• The updating of parameters can be iterated between the reproducible prior and posterior estimates
• ML: $\hat{P}(w_i) = \dfrac{C_i}{\sum_{i=1}^{V} C_i}$
• MAP: $\hat{P}(w_i) = \dfrac{C_i + m_i - 1}{\sum_{i=1}^{V} (C_i + m_i - 1)}$
• QB: $\hat{P}^{(n)}(w_i) = \dfrac{C_i^{(n)} + m_i^{(n-1)} - 1}{\sum_{i=1}^{V} \left(C_i^{(n)} + m_i^{(n-1)} - 1\right)}$, with the hyperparameter update $m_i^{(n)} = m_i^{(n-1)} + C_i^{(n)}$
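A quasi-Bayes sketch built directly on the update above: the posterior hyperparameters after sub-corpus $D_n$ serve as the reproducible prior for $D_{n+1}$ (function names and numbers are illustrative):

```python
# Quasi-Bayes incremental learning sketch: each sub-corpus D_n adds its
# counts to the Dirichlet hyperparameters, so the step-n posterior
# becomes the (reproducible) prior for step n+1.
def qb_update(hyper, new_counts):
    """m_i^(n) = m_i^(n-1) + C_i^(n)."""
    return [m + c for m, c in zip(hyper, new_counts)]

def qb_estimate(hyper):
    """MAP estimate under the current prior: P(w_i) proportional to m_i - 1."""
    weights = [m - 1.0 for m in hyper]
    total = sum(weights)
    return [w / total for w in weights]

hyper = [5.0, 2.0, 2.0]                      # prior carried over from the baseline model
for sub_corpus_counts in ([2, 0, 1], [0, 3, 1]):
    hyper = qb_update(hyper, sub_corpus_counts)
    print(qb_estimate(hyper))                # estimate drifts toward the new data
```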
Continuous N-gram Model
• The continuous n-gram model is also called the aggregate Markov model
• We introduce a hidden variable z, taking Z values, as "soft" word classes
• Z = 1 reduces to the unigram model; Z = I recovers the full bigram model
• The continuous bigram model has two obvious advantages over the discrete bigram:
  – Parameters: I × I → I × Z × 2
  – EM can be applied to estimate the parameters under the MLE criterion (see the sketch below)
Continuous N-gram Model (cont.)
• Parameters: $P(z \mid w_{t-1})$ and $P(w_t \mid z)$, giving $P(w_t \mid w_{t-1}) = \sum_{z=1}^{Z} P(z \mid w_{t-1})\, P(w_t \mid z)$
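A sketch of MLE training for the aggregate Markov (continuous bigram) model with EM, under the standard formulation above; the array layout, function name, initialisation, and toy counts are my own choices:

```python
# EM for the aggregate Markov / continuous bigram model.
# C is an (I, I) bigram count matrix C(w, w'); Z is the number of soft classes.
import numpy as np

def em_aggregate_markov(C, Z, iters=50, seed=0):
    """Returns (P(z|w) of shape (I, Z), P(w'|z) of shape (Z, I))."""
    rng = np.random.default_rng(seed)
    I = C.shape[0]
    p_z_w = rng.random((I, Z)); p_z_w /= p_z_w.sum(1, keepdims=True)
    p_w_z = rng.random((Z, I)); p_w_z /= p_w_z.sum(1, keepdims=True)
    for _ in range(iters):
        # E-step: responsibilities gamma[w, z, w'] = P(z|w) P(w'|z) / P(w'|w)
        joint = p_z_w[:, :, None] * p_w_z[None, :, :]            # (I, Z, I)
        gamma = joint / (joint.sum(axis=1, keepdims=True) + 1e-12)
        # M-step: normalise the expected counts
        n_wz = (C[:, None, :] * gamma).sum(axis=2)               # expected N(w, z)
        n_zw = (C[:, None, :] * gamma).sum(axis=0)               # expected N(z, w')
        p_z_w = n_wz / (n_wz.sum(1, keepdims=True) + 1e-12)
        p_w_z = n_zw / (n_zw.sum(1, keepdims=True) + 1e-12)
    return p_z_w, p_w_z

# Toy usage: 3 word types, 2 soft classes; p_z_w @ p_w_z gives the smoothed bigram.
C = np.array([[0, 4, 1], [2, 0, 3], [1, 2, 0]], dtype=float)
p_z_w, p_w_z = em_aggregate_markov(C, Z=2)
print(p_z_w @ p_w_z)
```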
Bayesian Learning for Continuous N-gram Model
• Prior
• After EM algorithm
• The result can be interpreted as a smoothing between the known priors and the current observations (the cache corpus)
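One way to read that smoothing, sketched under the assumption that the M-step uses a MAP re-estimate with Dirichlet hyperparameters over each class-conditional distribution (function name and array shapes are mine):

```python
import numpy as np

def map_m_step(expected_counts, hyper):
    """MAP M-step for P(w'|z): the prior pseudo-counts and the EM expected
    counts from the cache corpus are added, then normalised per class.
    expected_counts, hyper: arrays of shape (Z, I), with hyper > 1."""
    weights = expected_counts + hyper - 1.0
    return weights / weights.sum(axis=1, keepdims=True)
```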
QB Estimation for Incremental Learning for Continuous N-gram Model
• Updating of parameters
• Initial parameters
Experimental Results
• Corpus
  – A: 60 million words from LDC98T30, finance and business
  – B: 20 million words from LDC98T30, sports and fashion, for incremental training
  – C: A + B, for adaptation
  – D: 20 million words in the same domain as C (open test set)
• Vocabulary: 50,000 words from A + B
Experimental Results (cont.)
Conclusions
• Propose a Bayesian learning approach to n-gram modeling
  – An interpretation of the smoothing or adaptation of a language model as a weighting between prior knowledge and current observations
  – The Dirichlet conjugate prior leads not only to a batch adaptation procedure but also to a quasi-Bayes incremental learning strategy for on-line language modeling