BAYESIAN LEARNING OF N-GRAM STATISTICAL LANGUAGE MODELING
Shuanhu Bai and Haizhou Li
Institute for Infocomm Research, Republic of Singapore
Outline
• Introduction
• N-gram Model
  – Bayesian Learning
  – QB Estimation for Incremental Learning
• Continuous N-gram Model
  – Bayesian Learning
  – QB Estimation for Incremental Learning
• Experimental Results
• Conclusions
Introduction
• Even with ample training data, n-gram language models are still far from optimal
• Studies show that they are extremely sensitive to changes in style, topic or genre
• LM adaptation aims to bridge the mismatch between the models and the test domain
• A typical n-gram LM is trained under the maximum likelihood estimation (MLE) criterion
Introduction (cont.)
• One typical adaptation technique is deleted interpolation, which combines the flat but reliable general model (the baseline model) with the sharp but volatile domain-specific model
• In this paper, we will study the Bayesian learning formulation for n-gram LM adaptation
• Under the Bayesian learning framework, an incremental adaptation procedure is also proposed for dynamically updating the cache-based n-gram
N-gram Model
• N-gram model: $P(w_1^T) \approx \prod_{t=1}^{T} P(w_t \mid w_{t-n+1}^{t-1})$
• The quality of a given n-gram LM on a corpus D of size T is commonly assessed by the log-likelihood $\log P(D) = \sum_{t=1}^{T} \log P(w_t \mid w_{t-n+1}^{t-1})$ (a small sketch follows below)
• Unigram & Bigram
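As a concrete illustration (not from the paper; function names and the toy corpus are mine), the following minimal Python sketch builds relative-frequency bigram estimates and evaluates the corpus log-likelihood:

```python
# Minimal sketch: bigram MLE estimates and the corpus log-likelihood
# log P(D) = sum_t log P(w_t | w_{t-1}).
import math
from collections import Counter

def train_bigram_mle(tokens):
    """Relative-frequency (MLE) bigram estimates: P(w | v) = C(v, w) / C(v)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    history = Counter(tokens[:-1])
    return {(v, w): c / history[v] for (v, w), c in bigrams.items()}

def log_likelihood(tokens, probs, floor=1e-10):
    """log P(D); unseen bigrams get a small probability floor."""
    return sum(math.log(probs.get((v, w), floor)) for v, w in zip(tokens, tokens[1:]))

corpus = "the cat sat on the mat and the cat ran".split()
model = train_bigram_mle(corpus)
print(log_likelihood(corpus, model))
```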
N-gram Model (cont.)
• MLE
• Smoothing (a cache-interpolation sketch follows below)
  – Backoff
  – Cache
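A hedged sketch of cache smoothing, assuming simple linear interpolation between the static bigram and a unigram cache of recent words; the class name, interpolation weight, cache window, and uniform backoff are illustrative choices, not the paper's settings:

```python
from collections import Counter, deque

class CacheBigram:
    """Interpolates a static bigram LM with a unigram cache of recent words."""
    def __init__(self, static_probs, vocab_size, window=2000, lam=0.9):
        self.static = static_probs            # P_static(w | v) from the baseline LM
        self.vocab_size = vocab_size
        self.cache = deque(maxlen=window)     # most recent words of the running text
        self.lam = lam                        # interpolation weight (illustrative)

    def prob(self, w, v):
        p_static = self.static.get((v, w), 1.0 / self.vocab_size)  # crude uniform backoff
        cache_counts = Counter(self.cache)
        p_cache = cache_counts[w] / max(len(self.cache), 1)
        return self.lam * p_static + (1.0 - self.lam) * p_cache

    def observe(self, w):
        self.cache.append(w)
```

In use, observe() would be called on each word of the test text so that the cache tracks the current document.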
Bayesian Learning for N-gram Model
• Dirichlet prior: $g(\theta_1, \ldots, \theta_V) \propto \prod_{i=1}^{V} \theta_i^{m_i - 1}$
• The probability of generating a text corpus is obtained by integrating over the parameter space
• MAP estimate: $\hat{P}(w_i) = \dfrac{C_i + m_i - 1}{\sum_{i=1}^{V} (C_i + m_i - 1)}$ (a code sketch follows below)
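The MAP rule amounts to adding the Dirichlet pseudo-counts $m_i - 1$ to the observed counts before normalising; a small sketch under that assumption (function name and example numbers are mine):

```python
# MAP estimate of a multinomial (e.g. one n-gram history) under a
# Dirichlet prior: P(w_i) is proportional to C_i + m_i - 1.  Assumes all m_i > 1.
def map_estimate(counts, hyper):
    """counts[i] = C_i from the adaptation data, hyper[i] = Dirichlet m_i."""
    weights = [c + m - 1.0 for c, m in zip(counts, hyper)]
    total = sum(weights)
    return [w / total for w in weights]

# Illustrative numbers: a prior favouring word 0, a few new observations.
print(map_estimate(counts=[2, 0, 1], hyper=[5.0, 2.0, 2.0]))
```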
QB Estimation for Incremental Learning for N-gram Model
• It is of practical use to devise an incremental learning mechanism that adapts both the parameters and the prior knowledge over time
• Sub-corpora $D^{n} = \{D_1, D_2, \ldots, D_n\}$
• The updating of parameters can be iterated between the reproducible prior and posterior estimates
• ML: $\hat{P}(w_i) = \dfrac{C_i}{\sum_{i=1}^{V} C_i}$
• MAP: $\hat{P}(w_i) = \dfrac{C_i + m_i - 1}{\sum_{i=1}^{V} (C_i + m_i - 1)}$
• QB: $\hat{P}^{(n)}(w_i) = \dfrac{C_i^{(n)} + m_i^{(n-1)} - 1}{\sum_{i=1}^{V} \left(C_i^{(n)} + m_i^{(n-1)} - 1\right)}$, with the hyperparameter update $m_i^{(n)} = m_i^{(n-1)} + C_i^{(n)}$
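A quasi-Bayes sketch built directly on the update above: the posterior hyperparameters after sub-corpus $D_n$ serve as the reproducible prior for $D_{n+1}$ (function names and numbers are illustrative):

```python
# Quasi-Bayes incremental learning sketch: each sub-corpus D_n adds its
# counts to the Dirichlet hyperparameters, so the step-n posterior
# becomes the (reproducible) prior for step n+1.
def qb_update(hyper, new_counts):
    """m_i^(n) = m_i^(n-1) + C_i^(n)."""
    return [m + c for m, c in zip(hyper, new_counts)]

def qb_estimate(hyper):
    """MAP estimate under the current prior: P(w_i) proportional to m_i - 1."""
    weights = [m - 1.0 for m in hyper]
    total = sum(weights)
    return [w / total for w in weights]

hyper = [5.0, 2.0, 2.0]                      # prior carried over from the baseline model
for sub_corpus_counts in ([2, 0, 1], [0, 3, 1]):
    hyper = qb_update(hyper, sub_corpus_counts)
    print(qb_estimate(hyper))                # estimate drifts toward the new data
```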
Continuous N-gram Model
• The continuous n-gram model is also called the aggregate Markov model
• We introduce a hidden variable z, taking Z values, as "soft" word classes
• Z = 1 reduces to the unigram model; Z = I recovers the full bigram model
• The continuous bigram model has two obvious advantages over the discrete bigram:
  – Parameters: I × I → I × Z × 2
  – EM can be applied to estimate the parameters under the MLE criterion (see the sketch below)
Continuous N-gram Model (cont.)
• Parameters: $P(z \mid w_{t-1})$ and $P(w_t \mid z)$, giving $P(w_t \mid w_{t-1}) = \sum_{z=1}^{Z} P(z \mid w_{t-1})\, P(w_t \mid z)$
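A sketch of MLE training for the aggregate Markov (continuous bigram) model with EM, under the standard formulation above; the array layout, function name, initialisation, and toy counts are my own choices:

```python
# EM for the aggregate Markov / continuous bigram model.
# C is an (I, I) bigram count matrix C(w, w'); Z is the number of soft classes.
import numpy as np

def em_aggregate_markov(C, Z, iters=50, seed=0):
    """Returns (P(z|w) of shape (I, Z), P(w'|z) of shape (Z, I))."""
    rng = np.random.default_rng(seed)
    I = C.shape[0]
    p_z_w = rng.random((I, Z)); p_z_w /= p_z_w.sum(1, keepdims=True)
    p_w_z = rng.random((Z, I)); p_w_z /= p_w_z.sum(1, keepdims=True)
    for _ in range(iters):
        # E-step: responsibilities gamma[w, z, w'] = P(z|w) P(w'|z) / P(w'|w)
        joint = p_z_w[:, :, None] * p_w_z[None, :, :]            # (I, Z, I)
        gamma = joint / (joint.sum(axis=1, keepdims=True) + 1e-12)
        # M-step: normalise the expected counts
        n_wz = (C[:, None, :] * gamma).sum(axis=2)               # expected N(w, z)
        n_zw = (C[:, None, :] * gamma).sum(axis=0)               # expected N(z, w')
        p_z_w = n_wz / (n_wz.sum(1, keepdims=True) + 1e-12)
        p_w_z = n_zw / (n_zw.sum(1, keepdims=True) + 1e-12)
    return p_z_w, p_w_z

# Toy usage: 3 word types, 2 soft classes; p_z_w @ p_w_z gives the smoothed bigram.
C = np.array([[0, 4, 1], [2, 0, 3], [1, 2, 0]], dtype=float)
p_z_w, p_w_z = em_aggregate_markov(C, Z=2)
print(p_z_w @ p_w_z)
```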
Bayesian Learning for Continuous N-gram Model
• Prior
• After EM algorithm
• The result can be interpreted as a smoothing between the known priors and the current observations (the cache corpus)
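One way to read that smoothing, sketched under the assumption that the M-step uses a MAP re-estimate with Dirichlet hyperparameters over each class-conditional distribution (function name and array shapes are mine):

```python
import numpy as np

def map_m_step(expected_counts, hyper):
    """MAP M-step for P(w'|z): the prior pseudo-counts and the EM expected
    counts from the cache corpus are added, then normalised per class.
    expected_counts, hyper: arrays of shape (Z, I), with hyper > 1."""
    weights = expected_counts + hyper - 1.0
    return weights / weights.sum(axis=1, keepdims=True)
```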
QB Estimation for Incremental Learning for Continuous N-gram Model
• Updating of parameters
• Initial parameters
Experimental Results
• Corpus
  – A: 60 million words from LDC98T30, finance and business
  – B: 20 million words from LDC98T30, sports and fashion, for incremental training
  – C: A + B, for adaptation
  – D: 20 million words in the same domain as C (open test set)
• Vocabulary: 50,000 words from A + B
Experimental Results (cont.)
Conclusions
• Propose a Bayesian learning approach to n-gram modeling
  – An interpretation of the smoothing or adaptation of a language model as a weighting between prior knowledge and current observations
  – The Dirichlet conjugate prior leads not only to a batch adaptation procedure but also to a quasi-Bayes incremental learning strategy for on-line language modeling