ICASSP 05
References
• Rapid Language Model Development Using External Resources for New Spoken Dialog Domains
– Ruhi Sarikaya¹, Agustin Gravano², Yuqing Gao¹ (¹IBM, ²Columbia University)
• Maximum Entropy Based Generic Filter for Language Model Adaptation
– Dong Yu, Milind Mahajan, Peter Mau, Alex Acero (Microsoft)
• Language Model Estimation for Optimizing End-to-End Performance of a Natural Language Call Routing System
– Vaibhava Goel, Hong-Kwang Jeff Kuo, Sabine Deligne, Cheng Wu (IBM)
Introduction
• LM adaptation consists of four steps
– 1. Collection of task-specific adaptation data
– 2. Normalization
• Abbreviations, dates and times, punctuation
– 3. Analyze the adaptation data and build a task-specific LM
– 4. Interpolate the task-specific LM with the task-independent LM
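Step 4 can be sketched as a simple linear interpolation; the dict-based toy unigram models and the weight lam = 0.3 below are illustrative assumptions, not values from the paper.

```python
# Sketch of step 4: linear interpolation of a task-specific LM with a
# task-independent LM. The toy unigram tables and lam = 0.3 are
# assumptions for illustration only.

def interpolate(p_task, p_general, lam):
    """Return P(w) = lam * P_task(w) + (1 - lam) * P_general(w)."""
    def p(word):
        return lam * p_task.get(word, 0.0) + (1 - lam) * p_general.get(word, 0.0)
    return p

# Toy unigram distributions (each sums to 1 over its own vocabulary).
p_task = {"flight": 0.5, "book": 0.3, "the": 0.2}
p_general = {"the": 0.6, "book": 0.1, "cat": 0.3}

p = interpolate(p_task, p_general, lam=0.3)
print(round(p("the"), 2))  # 0.3*0.2 + 0.7*0.6 = 0.48
```

Words unseen by one model fall back on the other, which is the point of the interpolation: the small task-specific LM sharpens in-domain words while the general LM covers the rest.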
Introduction (cont.)
• Language modeling research has concentrated in two directions
– 1. Improving language model probability estimation
– 2. Obtaining additional training material
• The largest available data set is the World Wide Web (WWW)
– More than 4 billion pages
Introduction (cont.)
• Using web data for language modeling
– Query generation
– Filtering relevant text from the retrieved pages
• The web counts are certainly less sparse than the counts in a corpus of a fixed size
• The web counts are also likely to be significantly more noisy than counts obtained from a carefully cleaned and normalized corpus
• Retrieval unit
– Whole document vs. sentence (utterance)
Build LM for new domain
• In practice, when we start to build a spoken dialog system (SDS) for a new domain, the amount of in-domain data for the target domain is usually small
• Definitions
– Static resource: corpora collected for other tasks
– Dynamic resource: web data
Flow diagram for collecting relevant data
Generating search queries
• Google is used as the search engine
• The more specific a query is, the more relevant the retrieved pages are
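One plausible way to generate such specific queries is to quote frequent in-domain word n-grams as exact-phrase searches; the bigram-based scheme below is an illustrative assumption, not the papers' exact method.

```python
# Hedged sketch of query generation from in-domain utterances: quote the
# most frequent word bigrams, since more specific (quoted) queries tend to
# retrieve more relevant pages. The exact query scheme in the paper may
# differ; this is illustrative only.
from collections import Counter

def make_queries(utterances, top_k=2):
    bigrams = Counter()
    for utt in utterances:
        words = utt.lower().split()
        bigrams.update(zip(words, words[1:]))
    # Quote each frequent bigram to force an exact-phrase web search.
    return ['"%s %s"' % bg for bg, _ in bigrams.most_common(top_k)]

utts = ["book a flight to boston",
        "i want to book a flight",
        "book a hotel in boston"]
print(make_queries(utts))  # ['"book a"', '"a flight"']
```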
Similarity-based sentence selection
• Machine translation's BLEU (BiLingual Evaluation Understudy) score:
BLEU = BP · exp(Σ_{n=1..N} w_n log p_n)
• N is the maximum n-gram length; w_n and p_n are the corresponding weight and precision, respectively; BP is the brevity penalty:
BP = 1 if c > r, e^{1 - r/c} otherwise
• where r and c are the lengths of the reference and candidate sentences, respectively
• The selection threshold is 0.08
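The BLEU-style similarity can be sketched as follows; the choice N = 2, uniform weights w_n = 1/N, and the toy sentence pair are assumptions for illustration.

```python
# Minimal sentence-level BLEU sketch: geometric mean of modified n-gram
# precisions p_n (uniform weights w_n = 1/N) times the brevity penalty
# BP = min(1, e^(1 - r/c)), where r and c are the reference and candidate
# lengths. N = 2 and uniform weights are illustrative assumptions.
import math
from collections import Counter

def bleu(candidate, reference, max_n=2):
    c, r = len(candidate), len(reference)
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(zip(*[candidate[i:] for i in range(n)]))
        ref = Counter(zip(*[reference[i:] for i in range(n)]))
        # Modified precision: clip candidate counts by reference counts.
        overlap = sum(min(cnt, ref[g]) for g, cnt in cand.items())
        total = max(1, sum(cand.values()))
        if overlap == 0:
            return 0.0  # any zero precision makes the geometric mean 0
        precisions.append(overlap / total)
    bp = min(1.0, math.exp(1 - r / c))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

cand = "show me flights to boston".split()
ref = "please show me all flights from boston".split()
print(bleu(cand, ref) > 0.08)  # True: this pair clears the 0.08 threshold
```

Retrieved sentences scoring above the 0.08 threshold against the in-domain reference set would be kept for LM training.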
Experimental results
• SCLM: LM built from the static corpora
• WWW-20 / WWW-100: retrieval limited to a predefined 20 / 100 pages per sentence
E-mail corpus
• Contains both dictated and non-dictated text
Filtering the corpus
• Filtering out non-dictated text is not easy in general
– Hand-crafted rules (e.g., regular expressions) can be used
– Limitations
• They do not generalize well to situations that have not been encountered
• Rules are usually language dependent
• Developing and testing rules is very costly
Maximum Entropy based filter
• Consider the filtering task as a labeling problem: segment the adaptation data into two categories
– Category D (dictated text)
• Text which should be used for LM adaptation
– Category N (non-dictated text)
• Text which should not be used for LM adaptation
• The text is divided into a sequence of text units (such as lines); t_i is the i-th text unit and l_i is the label associated with t_i
• Label dependency
– Assume that the labels of the text units are independent of each other given the complete sequence of text units:
P(l_1 … l_m | t_1 … t_m) = Π_i P(l_i | t_1 … t_m)
– Further assume that the label for a given unit depends only upon units in a surrounding window of k units:
P(l_i | t_1 … t_m) ≈ P(l_i | t_{i-k} … t_{i+k})
– With k = 1: P(l_i | t_{i-1} t_i t_{i+1})
Classification Model
• A MaxEnt model has the form:
P(l | t) = exp(Σ_j λ_j f_j(l, t)) / Z_λ(t)
where λ is the vector of model parameters, f_j are the feature functions, and Z_λ(t) is the normalization factor
Classification Model (cont.)
• Decision threshold: Pthresh = 0.5 (a unit is labeled D when P(D | ·) exceeds it)
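The thresholded decision can be sketched with a two-class (logistic) model; the binary features and hand-set weights below are invented for illustration, whereas the paper trains the weights on labeled adaptation data and uses a richer feature set.

```python
# Illustrative two-class MaxEnt (logistic) filter for one text unit: keep a
# line as "dictated" when P(D | features) > Pthresh = 0.5. The features
# (ends-with-sentence-punctuation, high OOV rate) and the weights are
# hypothetical; real weights would be trained on labeled data.
import math

WEIGHTS = {"eos": 1.5, "high_oov": -2.0}  # hypothetical trained weights
BIAS = 0.2

def features(line, vocab):
    words = line.lower().split()
    oov_rate = sum(w not in vocab for w in words) / max(1, len(words))
    return {"eos": line.rstrip().endswith((".", "?", "!")),
            "high_oov": oov_rate > 0.3}

def p_dictated(line, vocab):
    score = BIAS + sum(WEIGHTS[f] for f, on in features(line, vocab).items() if on)
    return 1.0 / (1.0 + math.exp(-score))  # sigmoid = two-class MaxEnt

vocab = {"please", "send", "the", "report", "by", "friday"}
keep = [l for l in ["Please send the report by Friday.",
                    "==== hdr 0x1F4A9 ===="]
        if p_dictated(l, vocab) > 0.5]
print(keep)  # only the dictated-looking sentence survives
```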
Features
Space Splitting
Evaluation
• Uses only features RawCompact, EOS, and OOV
• No space splitting
• Filtering is especially important and effective for adaptation data with a high percentage of non-dictated text (U2)
Efficient linear combination for distant n-gram models
David Langlois, Kamel Smaili, Jean-Paul Haton
EUROSPEECH 2003, pp. 409–412
Introduction
• Classical n-gram model
• Distant language models
Modeling distance in SLM
• Cache model (self-relationship)
– Deals with the self-relationship between a word present in the history and itself: if a word is frequent in the history, it has a greater chance of appearing again
Modeling distance in SLM (cont.)
• Trigger model
– Models the relationship between two words, as couples v → w such that if v (the triggering word) is in the history, w (the triggered word) has a greater chance of appearing
– In fact, the majority of triggers are self-triggers (v → v): a word triggers itself
d-n-gram model
• N_d(.) is the discounted count
• The 0-n-gram model is the classical n-gram model
• P_d(w_i | w_{i-d-n+1} … w_{i-d-1}) = N_d(w_{i-d-n+1} … w_{i-d-1} w_i) / N_d(w_{i-d-n+1} … w_{i-d-1})
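The d-bigram case (n = 2) can be sketched directly from counts; raw counts stand in for the paper's discounted counts N_d(.), and the toy corpus is an illustrative assumption.

```python
# Sketch of a d-bigram: predict w_i from the single word d+1 positions back,
# i.e. P_d(w_i | w_{i-d-1}) = N_d(w_{i-d-1}, w_i) / N_d(w_{i-d-1}).
# d = 0 recovers the classical bigram. Raw counts are used here in place
# of the paper's discounted counts N_d(.).
from collections import Counter

def d_bigram(corpus, d):
    pair, hist = Counter(), Counter()
    for sent in corpus:
        for i in range(d + 1, len(sent)):
            h, w = sent[i - d - 1], sent[i]
            pair[(h, w)] += 1
            hist[h] += 1
    return lambda w, h: pair[(h, w)] / hist[h] if hist[h] else 0.0

corpus = [["the", "cat", "sat", "on", "the", "mat"]]
p0 = d_bigram(corpus, 0)  # classical bigram
p1 = d_bigram(corpus, 1)  # skips one word
print(p0("cat", "the"), p1("sat", "the"))  # 0.5 1.0
```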
Evaluation
• Vocabulary: 20k words
• Training set: 38M words
• Development set: 2M words
• Test set: 2M words
• Baseline classical n-gram models:

Model     Perplexity
unigram   739.9
bigram    132.4
trigram   97.8
Integration of distant n-gram models
• A distant n-gram model cannot be used alone, since it takes into account only part of the history
– Perplexity is 717 for n = 2 and d = 4
• Several models with distances up to d are combined with the baseline model
• The utility of distant n-gram models decreases with the distance: a distance greater than 2 does not provide more information
• Distant bigrams: 7.1% perplexity improvement
• Distant trigrams: 3.1% perplexity improvement
• Distant trigrams lead to an improvement, but it is smaller than for distant bigrams
– Possible explanation: overlap between the histories of the d-trigram and the (d+1)-trigram
Backoff smoothing
• Each component model is backoff-smoothed:
P*(w_i | w_{i-1}) = fr(w_i | w_{i-1}) if N(w_{i-1} w_i) > 0, α(w_{i-1}) P(w_i) otherwise
P*(w_i | w_{i-2}) = fr(w_i | w_{i-2}) if N(w_{i-2} w_i) > 0, α(w_{i-2}) P(w_i) otherwise
• The smoothed models are linearly interpolated:
P(w_i | w_{i-2} w_{i-1}) = λ P*(w_i | w_{i-1}) + (1 - λ) P*(w_i | w_{i-2})
• Results with backoff smoothing:
– Distant bigrams: perplexity 121.9 (7.9% improvement over the bigram baseline of 132.4)
– Distant trigrams: perplexity 86.5 (11.6% improvement over the trigram baseline of 97.8)
Combination weights
• Option 1: a unique weight per model
• Option 2: weights that depend on each history (the class of each sub-history)
Combination of distant n-gram models
• To combine K models M_1, …, M_K, a set of weights λ_1, …, λ_K is defined and the combination is expressed by:
P(w | h) = Σ_{i=1..K} λ_i P_i(w | h_i)
• The development corpus is not sufficient to estimate a huge number of parameters
– Classify the histories and assign a weight to each class:
P(w | h) = Σ_{i=1..K} λ_{C(h_i)} P_i(w | h_i)
Classification
• Break the history into several parts (sub-histories); each sub-history is analyzed to estimate its importance for prediction and is then put into a class
• Such a class is directly linked to the value of the sub-history frequency
– A class gathers all sub-histories that have approximately the same frequency
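Frequency-based bucketing of sub-histories can be sketched as follows; the log2 bucketing rule and the toy frequencies are assumptions, since the paper only requires that sub-histories of similar frequency share a class (and hence a weight).

```python
# Sketch of history classification: sub-histories are bucketed by frequency
# so that each bucket can receive its own interpolation weight. The integer
# log2 bucketing is an illustrative assumption; any grouping of similar
# frequencies would serve the same purpose.
import math

def history_class(freq):
    # Bucket 0 for unseen histories; otherwise group by order of magnitude.
    return 0 if freq == 0 else 1 + int(math.log2(freq))

freqs = {"the": 1024, "cat": 3, "zyxgl": 0}  # toy sub-history counts
classes = {h: history_class(f) for h, f in freqs.items()}
print(classes)  # {'the': 11, 'cat': 2, 'zyxgl': 0}
```

With such a mapping, the number of interpolation weights to estimate on the development corpus drops from one per history to one per class.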
• Distant bigram combination, 8000 classes: perplexity 115.4
– 12.8% improvement over the bigram baseline (132.4)
– 5.3% improvement over the single-weight combination (121.9)
• Distant trigram combination, 4000 classes: perplexity 85.2
– 12.8% improvement over the trigram baseline (97.8)
– 1.5% improvement over the single-weight combination (86.5)