ICASSP 05
References
• Rapid Language Model Development Using External Resources for New Spoken Dialog Domains
– Ruhi Sarikaya¹, Agustin Gravano², Yuqing Gao¹ (¹IBM, ²Columbia University)
• Maximum Entropy Based Generic Filter for Language Model Adaptation
– Dong Yu, Milind Mahajan, Peter Mau, Alex Acero (Microsoft)
• Language Model Estimation for Optimizing End-to-End Performance of a Natural Language Call Routing System
– Vaibhava Goel, Hong-Kwang Jeff Kuo, Sabine Deligne, Cheng Wu (IBM)
Introduction
• LM adaptation consists of four steps
– 1. Collection of task-specific adaptation data
– 2. Normalization
• Abbreviations, dates and times, punctuation
– 3. Analyze the adaptation data and build a task-specific LM
– 4. Interpolate the task-specific LM with the task-independent LM
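Step 4 can be sketched as a simple linear interpolation; the dict-based toy unigram models and the weight lam = 0.3 below are illustrative assumptions, not values from the paper.

```python
# Sketch of step 4: linear interpolation of a task-specific LM with a
# task-independent LM. The toy unigram tables and lam = 0.3 are
# assumptions for illustration only.

def interpolate(p_task, p_general, lam):
    """Return P(w) = lam * P_task(w) + (1 - lam) * P_general(w)."""
    def p(word):
        return lam * p_task.get(word, 0.0) + (1 - lam) * p_general.get(word, 0.0)
    return p

# Toy unigram distributions (each sums to 1 over its own vocabulary).
p_task = {"flight": 0.5, "book": 0.3, "the": 0.2}
p_general = {"the": 0.6, "book": 0.1, "cat": 0.3}

p = interpolate(p_task, p_general, lam=0.3)
print(round(p("the"), 2))  # 0.3*0.2 + 0.7*0.6 = 0.48
```

Words unseen by one model fall back on the other, which is the point of the interpolation: the small task-specific LM sharpens in-domain words while the general LM covers the rest.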
Introduction (cont.)
• Language modeling research has concentrated in two directions
– 1. Improving language model probability estimation
– 2. Obtaining additional training material
• The largest available data set is the World Wide Web (WWW)
– More than 4 billion pages
Introduction (cont.)
• Using web data for language modeling
– Query generation
– Filtering relevant text from the retrieved pages
• The web counts are certainly less sparse than the counts in a corpus of a fixed size
• The web counts are also likely to be significantly more noisy than counts obtained from a carefully cleaned and normalized corpus
• Retrieval unit
– Whole document vs. sentence (utterance)
Build LM for new domain
• In practice, when we start to build a spoken dialog system (SDS) for a new domain, the amount of in-domain data for the target domain is usually small
• Definitions
– Static resource: corpora collected for other tasks
– Dynamic resource: web data
Flow diagram for collecting relevant data
Generating search queries
• Google is used as the search engine
• The more specific a query is, the more relevant the retrieved pages are
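One plausible way to generate such specific queries is to quote frequent in-domain word n-grams as exact-phrase searches; the bigram-based scheme below is an illustrative assumption, not the papers' exact method.

```python
# Hedged sketch of query generation from in-domain utterances: quote the
# most frequent word bigrams, since more specific (quoted) queries tend to
# retrieve more relevant pages. The exact query scheme in the paper may
# differ; this is illustrative only.
from collections import Counter

def make_queries(utterances, top_k=2):
    bigrams = Counter()
    for utt in utterances:
        words = utt.lower().split()
        bigrams.update(zip(words, words[1:]))
    # Quote each frequent bigram to force an exact-phrase web search.
    return ['"%s %s"' % bg for bg, _ in bigrams.most_common(top_k)]

utts = ["book a flight to boston",
        "i want to book a flight",
        "book a hotel in boston"]
print(make_queries(utts))  # ['"book a"', '"a flight"']
```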
Similarity-based sentence selection
• Machine translation's BLEU (BiLingual Evaluation Understudy) score:
BLEU = BP · exp(Σ_{n=1..N} w_n log p_n)
• N is the maximum n-gram length; w_n and p_n are the corresponding weight and precision, respectively; BP is the brevity penalty:
BP = 1 if c > r, e^{1 - r/c} otherwise
• where r and c are the lengths of the reference and candidate sentences, respectively
• The selection threshold is 0.08
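The BLEU-style similarity can be sketched as follows; the choice N = 2, uniform weights w_n = 1/N, and the toy sentence pair are assumptions for illustration.

```python
# Minimal sentence-level BLEU sketch: geometric mean of modified n-gram
# precisions p_n (uniform weights w_n = 1/N) times the brevity penalty
# BP = min(1, e^(1 - r/c)), where r and c are the reference and candidate
# lengths. N = 2 and uniform weights are illustrative assumptions.
import math
from collections import Counter

def bleu(candidate, reference, max_n=2):
    c, r = len(candidate), len(reference)
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(zip(*[candidate[i:] for i in range(n)]))
        ref = Counter(zip(*[reference[i:] for i in range(n)]))
        # Modified precision: clip candidate counts by reference counts.
        overlap = sum(min(cnt, ref[g]) for g, cnt in cand.items())
        total = max(1, sum(cand.values()))
        if overlap == 0:
            return 0.0  # any zero precision makes the geometric mean 0
        precisions.append(overlap / total)
    bp = min(1.0, math.exp(1 - r / c))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

cand = "show me flights to boston".split()
ref = "please show me all flights from boston".split()
print(bleu(cand, ref) > 0.08)  # True: this pair clears the 0.08 threshold
```

Retrieved sentences scoring above the 0.08 threshold against the in-domain reference set would be kept for LM training.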
Experimental results
• SCLM: LM built from the static corpora
• WWW-20 / WWW-100: retrieval limited to a predefined 20 / 100 pages per sentence
E-mail corpus
• Contains both dictated and non-dictated text
Filtering the corpus
• Filtering out non-dictated text is not easy in general
– Hand-crafted rules (e.g., regular expressions) can be used
– Limitations
• They do not generalize well to situations that have not been encountered
• Rules are usually language dependent
• Developing and testing rules is very costly
Maximum Entropy based filter
• Consider the filtering task as a labeling problem: segment the adaptation data into two categories
– Category D (dictated text)
• Text which should be used for LM adaptation
– Category N (non-dictated text)
• Text which should not be used for LM adaptation
• The text is divided into a sequence of text units (such as lines); t_i is the i-th text unit and l_i is the label associated with t_i
• Label dependency
– Assume that the labels of the text units are independent of each other given the complete sequence of text units:
P(l_1 … l_m | t_1 … t_m) = Π_i P(l_i | t_1 … t_m)
– Further assume that the label for a given unit depends only upon units in a surrounding window of k units:
P(l_i | t_1 … t_m) ≈ P(l_i | t_{i-k} … t_{i+k})
– With k = 1: P(l_i | t_{i-1} t_i t_{i+1})
Classification Model
• A MaxEnt model has the form:
P(l | t) = exp(Σ_j λ_j f_j(l, t)) / Z_λ(t)
where λ is the vector of model parameters, f_j are the feature functions, and Z_λ(t) is the normalization factor
Classification Model (cont.)
• Decision threshold: Pthresh = 0.5 (a unit is labeled D when P(D | ·) exceeds it)
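The thresholded decision can be sketched with a two-class (logistic) model; the binary features and hand-set weights below are invented for illustration, whereas the paper trains the weights on labeled adaptation data and uses a richer feature set.

```python
# Illustrative two-class MaxEnt (logistic) filter for one text unit: keep a
# line as "dictated" when P(D | features) > Pthresh = 0.5. The features
# (ends-with-sentence-punctuation, high OOV rate) and the weights are
# hypothetical; real weights would be trained on labeled data.
import math

WEIGHTS = {"eos": 1.5, "high_oov": -2.0}  # hypothetical trained weights
BIAS = 0.2

def features(line, vocab):
    words = line.lower().split()
    oov_rate = sum(w not in vocab for w in words) / max(1, len(words))
    return {"eos": line.rstrip().endswith((".", "?", "!")),
            "high_oov": oov_rate > 0.3}

def p_dictated(line, vocab):
    score = BIAS + sum(WEIGHTS[f] for f, on in features(line, vocab).items() if on)
    return 1.0 / (1.0 + math.exp(-score))  # sigmoid = two-class MaxEnt

vocab = {"please", "send", "the", "report", "by", "friday"}
keep = [l for l in ["Please send the report by Friday.",
                    "==== hdr 0x1F4A9 ===="]
        if p_dictated(l, vocab) > 0.5]
print(keep)  # only the dictated-looking sentence survives
```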
Features
Space Splitting
Evaluation
• Uses only features RawCompact, EOS, and OOV
• No space splitting
• Filtering is especially important and effective for adaptation data with a high percentage of non-dictated text (U2)
Efficient linear combination for distant n-gram models
David Langlois, Kamel Smaili, Jean-Paul Haton
EUROSPEECH 2003, pp. 409–412
Introduction
• Classical n-gram model
• Distant language models
Modeling distance in SLM
• Cache model (self-relationship)
– Deals with the self-relationship between a word present in the history and itself: if a word is frequent in the history, it has a greater chance of appearing again
Modeling distance in SLM (cont.)
• Trigger model
– Models the relationship between two words, as couples v → w such that if v (the triggering word) is in the history, w (the triggered word) has a greater chance of appearing
– In fact, the majority of triggers are self-triggers (v → v): a word triggers itself
d-n-gram model
• N_d(.) is the discounted count
• The 0-n-gram model is the classical n-gram model
• P_d(w_i | w_{i-d-n+1} … w_{i-d-1}) = N_d(w_{i-d-n+1} … w_{i-d-1} w_i) / N_d(w_{i-d-n+1} … w_{i-d-1})
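The d-bigram case (n = 2) can be sketched directly from counts; raw counts stand in for the paper's discounted counts N_d(.), and the toy corpus is an illustrative assumption.

```python
# Sketch of a d-bigram: predict w_i from the single word d+1 positions back,
# i.e. P_d(w_i | w_{i-d-1}) = N_d(w_{i-d-1}, w_i) / N_d(w_{i-d-1}).
# d = 0 recovers the classical bigram. Raw counts are used here in place
# of the paper's discounted counts N_d(.).
from collections import Counter

def d_bigram(corpus, d):
    pair, hist = Counter(), Counter()
    for sent in corpus:
        for i in range(d + 1, len(sent)):
            h, w = sent[i - d - 1], sent[i]
            pair[(h, w)] += 1
            hist[h] += 1
    return lambda w, h: pair[(h, w)] / hist[h] if hist[h] else 0.0

corpus = [["the", "cat", "sat", "on", "the", "mat"]]
p0 = d_bigram(corpus, 0)  # classical bigram
p1 = d_bigram(corpus, 1)  # skips one word
print(p0("cat", "the"), p1("sat", "the"))  # 0.5 1.0
```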
Evaluation
• Vocabulary: 20k words
• Training set: 38M words
• Development set: 2M words
• Test set: 2M words
• Baseline classical n-gram models:

Model     Perplexity
unigram   739.9
bigram    132.4
trigram   97.8
Integration of distant n-gram models
• A distant n-gram model cannot be used alone, since it takes into account only part of the history
– Perplexity is 717 for n = 2 and d = 4
• Several models with distances up to d are combined with the baseline model
• The utility of distant n-gram models decreases with the distance: a distance greater than 2 does not provide more information
• Distant bigrams: 7.1% perplexity improvement
• Distant trigrams: 3.1% perplexity improvement
• Distant trigrams lead to an improvement, but it is smaller than for distant bigrams
– Possible explanation: overlap between the histories of the d-trigram and the (d+1)-trigram
Backoff smoothing
• Each component model is backoff-smoothed:
P*(w_i | w_{i-1}) = fr(w_i | w_{i-1}) if N(w_{i-1} w_i) > 0, α(w_{i-1}) P(w_i) otherwise
P*(w_i | w_{i-2}) = fr(w_i | w_{i-2}) if N(w_{i-2} w_i) > 0, α(w_{i-2}) P(w_i) otherwise
• The smoothed models are linearly interpolated:
P(w_i | w_{i-2} w_{i-1}) = λ P*(w_i | w_{i-1}) + (1 - λ) P*(w_i | w_{i-2})
• Results with backoff smoothing:
– Distant bigrams: perplexity 121.9 (7.9% improvement over the bigram baseline of 132.4)
– Distant trigrams: perplexity 86.5 (11.6% improvement over the trigram baseline of 97.8)
Combination weights
• Option 1: a unique weight per model
• Option 2: weights that depend on each history (the class of each sub-history)
Combination of distant n-gram models
• To combine K models M_1, …, M_K, a set of weights λ_1, …, λ_K is defined and the combination is expressed by:
P(w | h) = Σ_{i=1..K} λ_i P_i(w | h_i)
• The development corpus is not sufficient to estimate a huge number of parameters
– Classify the histories and assign a weight to each class:
P(w | h) = Σ_{i=1..K} λ_{C(h_i)} P_i(w | h_i)
Classification
• Break the history into several parts (sub-histories); each sub-history is analyzed to estimate its importance for prediction and is then put into a class
• Such a class is directly linked to the value of the sub-history frequency
– A class gathers all sub-histories that have approximately the same frequency
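Frequency-based bucketing of sub-histories can be sketched as follows; the log2 bucketing rule and the toy frequencies are assumptions, since the paper only requires that sub-histories of similar frequency share a class (and hence a weight).

```python
# Sketch of history classification: sub-histories are bucketed by frequency
# so that each bucket can receive its own interpolation weight. The integer
# log2 bucketing is an illustrative assumption; any grouping of similar
# frequencies would serve the same purpose.
import math

def history_class(freq):
    # Bucket 0 for unseen histories; otherwise group by order of magnitude.
    return 0 if freq == 0 else 1 + int(math.log2(freq))

freqs = {"the": 1024, "cat": 3, "zyxgl": 0}  # toy sub-history counts
classes = {h: history_class(f) for h, f in freqs.items()}
print(classes)  # {'the': 11, 'cat': 2, 'zyxgl': 0}
```

With such a mapping, the number of interpolation weights to estimate on the development corpus drops from one per history to one per class.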
• Distant bigram combination, 8000 classes: perplexity 115.4
– 12.8% improvement over the bigram baseline (132.4)
– 5.3% improvement over the single-weight combination (121.9)
• Distant trigram combination, 4000 classes: perplexity 85.2
– 12.8% improvement over the trigram baseline (97.8)
– 1.5% improvement over the single-weight combination (86.5)