Page 1: ICASSP 05

ICASSP 05

Page 2: ICASSP 05

Reference

• Rapid Language Model Development Using External Resources for New Spoken Dialog Domains
– Ruhi Sarikaya (IBM), Agustin Gravano (Columbia University), Yuqing Gao (IBM)

• Maximum Entropy Based Generic Filter for Language Model Adaptation
– Dong Yu, Milind Mahajan, Peter Mau, Alex Acero (Microsoft)

• Language Model Estimation for Optimizing End-to-End Performance of a Natural Language Call Routing System
– Vaibhava Goel, Hong-Kwang Jeff Kuo, Sabine Deligne, Cheng Wu (IBM)

Page 3: ICASSP 05

Introduction

• LM adaptation consists of four steps:
– 1. Collection of task-specific adaptation data
– 2. Normalization (abbreviations, dates and times, punctuation)
– 3. Analysis of the adaptation data to build a task-specific LM
– 4. Interpolation of the task-specific LM with the task-independent LM (sketched below)
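A minimal sketch of step 4, assuming each LM exposes a conditional probability function; the names and the single interpolation weight are illustrative, not taken from the papers:

```python
def interpolate_lm(p_task, p_general, lam=0.7):
    """Linear interpolation of a task-specific LM with a task-independent LM:
    P(w | h) = lam * P_task(w | h) + (1 - lam) * P_general(w | h)."""
    def p(word, history):
        return lam * p_task(word, history) + (1 - lam) * p_general(word, history)
    return p
```

In practice the weight would be tuned on held-out in-domain data.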

Page 4: ICASSP 05

Introduction

• Language modeling research has concentrated on two directions
– 1. Improving language model probability estimation

– 2. Obtaining additional training material

• The largest data set is the World Wide Web (WWW)
– More than 4 billion pages

Page 5: ICASSP 05

Introduction

• Using web data for language modeling
– Query generation

– Filtering the relevant text from the retrieved pages

• The web counts are certainly less sparse than the counts in a corpus of a fixed size

• The web counts are also likely to be significantly more noisy than counts obtained from a carefully cleaned and normalized corpus

• Retrieval unit
– Whole document vs. sentence (utterance)

Page 6: ICASSP 05

Build LM for new domain

• In practice, when we start to build an SDS (spoken dialog system) for a new domain, the amount of in-domain data for the target domain is usually small

• Definitions
– Static resource: corpora collected for other tasks

– Dynamic resource: web data

Page 7: ICASSP 05

Flow diagram for collecting relevant data

Page 8: ICASSP 05

Generating Search Queries

• Google is used as the search engine
• The more specific a query is, the more relevant the retrieved pages are (illustrated below)
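A toy illustration of that trade-off, assuming queries are built as quoted phrases taken from in-domain utterances; the construction here is hypothetical, not the authors' exact procedure:

```python
def make_queries(utterance, max_len=5):
    """Build progressively more specific quoted-phrase queries from an
    in-domain utterance; longer phrases retrieve fewer but more relevant pages."""
    words = utterance.lower().split()
    return ['"' + " ".join(words[i:i + n]) + '"'
            for n in range(2, max_len + 1)
            for i in range(len(words) - n + 1)]

print(make_queries("i want to book a flight to boston"))
```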

Page 9: ICASSP 05

Similarity based sentence selection

• Machine translation’s BLEU (BiLingual Evaluation Understudy)

• N is the maximum n-gram length, w_n and p_n are the corresponding weight and precision, respectively, and BP is the brevity penalty

• Where r and c are the lengths of the reference and candidate sentences, respectively
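For reference, the standard BLEU formulation these quantities come from (assuming the usual definition):

$$\mathrm{BLEU} = BP \cdot \exp\!\Big(\sum_{n=1}^{N} w_n \log p_n\Big), \qquad BP = \begin{cases} 1 & \text{if } c > r \\ e^{\,1 - r/c} & \text{if } c \le r \end{cases}$$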

• The similarity threshold is set to 0.08

Page 10: ICASSP 05

Experimental result

• SCLM: language model built from the static corpora
• WWW-20 / WWW-100: predefined limit of 20 / 100 pages per sentence

Page 11: ICASSP 05

E-mail corpus

• Dictated and non-dictated

Page 12: ICASSP 05

Filtering the corpus

• Filtering out these non-dictated texts is not an easy job in general
– Hand-crafted rules (e.g. regular expressions)
– Limitations:
  • They do not generalize well to situations which we have not encountered
  • Rules are usually language dependent
  • Developing and testing rules is very costly

Page 13: ICASSP 05

Maximum Entropy based filter

• Consider the filtering task as a labeling problem: segment the adaptation data into two categories
– Category D (Dictated text): text which should be used for LM adaptation
– Category N (Non-dictated text): text which should not be used for LM adaptation
• Text is divided into a sequence of text units (such as lines)
• t_i is a text unit and l_i is the label associated with t_i

Page 14: ICASSP 05

Label dependency

• Assume that the labels of the text units are independent of each other given the complete sequence of text units
• We further assume that the label for a given unit depends only on units within a surrounding window of k units
• k = 1:
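Written out, these two assumptions amount to the following factorization (a sketch; with k = 1 the label depends only on the previous, current, and next unit):

$$P(l_1 \ldots l_n \mid t_1 \ldots t_n) = \prod_{i=1}^{n} P(l_i \mid t_1 \ldots t_n) \approx \prod_{i=1}^{n} P(l_i \mid t_{i-1}, t_i, t_{i+1})$$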

Page 15: ICASSP 05

Classification Model

• A MaxEnt model has the form:
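A standard way to write the MaxEnt classifier for this task, assuming binary features f_j defined over the label and its windowed context:

$$P(l_i \mid t_{i-1}, t_i, t_{i+1}) = \frac{\exp\big(\sum_j \lambda_j f_j(l_i, t_{i-1}, t_i, t_{i+1})\big)}{\sum_{l'} \exp\big(\sum_j \lambda_j f_j(l', t_{i-1}, t_i, t_{i+1})\big)}$$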

where λ is the vector of model parameters

Page 16: ICASSP 05

Classification Model

• P_thresh = 0.5
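One way to read the threshold (an assumption about how it is applied, taking D as the positive class):

$$\hat{l}_i = \begin{cases} D & \text{if } P(D \mid t_{i-1}, t_i, t_{i+1}) \ge P_{\text{thresh}} \\ N & \text{otherwise} \end{cases}$$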

Page 17: ICASSP 05

Features

Page 18: ICASSP 05

Space Splitting

Page 19: ICASSP 05

Evaluation

• Uses only features RawCompact, EOS, and OOV

• No space splitting

Page 20: ICASSP 05

• Filtering is especially important and effective for adaptation data with a high percentage of non-dictated text (U2)

Page 21: ICASSP 05

Efficient linear combination for distant n-gram models

David Langlois, Kamel Smaili, Jean-Paul Haton

EUROSPEECH 2003, pp. 409-412

Page 22: ICASSP 05

Introduction

• Classical n-gram model

• Distant language models

Page 23: ICASSP 05

Modelization of distance in SLM

• Cache model (self-relationship)
– Deals with the self-relationship between a word present in the history and itself: if a word is frequent in the history, it has a greater chance to appear again
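A common way to write such a cache component (a generic formulation assuming interpolation with the static model, not necessarily the exact form used in the paper), where C_h(w) is the number of occurrences of w in the history h:

$$P_{\text{cache}}(w \mid h) = \lambda\, P_{\text{static}}(w \mid h) + (1 - \lambda)\, \frac{C_h(w)}{|h|}$$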

Page 24: ICASSP 05

Modelization of distance in SLM (cont.)

• Trigger model (relationship between two words)
– Deals with couples of words v → w such that if v (the triggering word) is in the history, w (the triggered word) has a greater chance to appear
– But, in fact, the majority of triggers are self-triggers (v → v): a word triggers itself

Page 25: ICASSP 05

d-n-gram model

• N_d(·) is the discounted count
• The 0-n-gram model is the classical n-gram model

$$P_d(w_i \mid w_{i-d-n+1} \ldots w_{i-d-1}) = \frac{N_d(w_{i-d-n+1} \ldots w_{i-d-1}\, w_i)}{N_d(w_{i-d-n+1} \ldots w_{i-d-1})}$$
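A minimal sketch of how distance-d bigram counts differ from ordinary bigram counts (illustrative only; no discounting is applied here):

```python
from collections import Counter

def distant_bigram_counts(tokens, d):
    """Count pairs (w_{i-d-1}, w_i): predict w_i from the word located d
    positions before its immediate predecessor. d = 0 gives the classical
    bigram counts."""
    counts = Counter()
    for i in range(d + 1, len(tokens)):
        counts[(tokens[i - d - 1], tokens[i])] += 1
    return counts

tokens = "the cat sat on the mat".split()
print(distant_bigram_counts(tokens, d=0))  # classical bigrams
print(distant_bigram_counts(tokens, d=2))  # distance-2 bigrams
```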

Page 26: ICASSP 05

Evaluation

• Vocabulary: 20k words
• Training set: 38M words
• Development set: 2M words
• Test set: 2M words
• Baseline classical n-gram models:

Model      Perplexity
unigram    739.9
bigram     132.4
trigram    97.8

Page 27: ICASSP 05

Integration of distant n-gram models

• The distant n-gram model cannot be used alone: it takes into account only a part of the history
– Perplexity is 717 for n = 2 and d = 4

• Several models with distance up to d are combined with the baseline model

Page 28: ICASSP 05

The utility of distant n-gram models decreases with the distance: a distance greater than 2 does not provide more information

Improvement 7.1%

Improvement 3.1%

Page 29: ICASSP 05

Distant trigrams lead to an improvement, but it is less important than for distant bigrams.

There is overlap between the history of the d-trigram and the (d+1)-trigram.

Page 30: ICASSP 05

Backoff smoothing

Model labels: b_u_z (backoff bigram), db_u_z (backoff distant bigram), and their linear combination (b_u_z)·(db_u_z):

$$P(w_i \mid w_{i-1}) = \begin{cases} fr^*(w_i \mid w_{i-1}) & \text{if } N(w_{i-1} w_i) > 0 \\ \alpha(w_{i-1})\, P(w_i) & \text{otherwise} \end{cases}$$

$$P(w_i \mid w_{i-2}) = \begin{cases} fr^*(w_i \mid w_{i-2}) & \text{if } N(w_{i-2} w_i) > 0 \\ \alpha(w_{i-2})\, P(w_i) & \text{otherwise} \end{cases}$$

$$P(w_i \mid w_{i-2} w_{i-1}) = \lambda\, P(w_i \mid w_{i-1}) + (1 - \lambda)\, P(w_i \mid w_{i-2})$$

Page 31: ICASSP 05

Improvements: 7.9% and 11.6%

Page 32: ICASSP 05

Combination weight

• Unique weight

• The models' weights depend on each history (the class of each sub-history)

Page 33: ICASSP 05

Combination of distant n-gram

• In order to combine K models M_1, …, M_K, a set of weights λ_1, …, λ_K is defined and the combination is expressed by:

$$P(w \mid h) = \sum_{i=1}^{K} \lambda_i(h)\, P_i(w \mid h)$$

• The development corpus is not sufficient to estimate such a huge number of parameters
– Classify histories and set a weight for each class:

$$P(w \mid h) = \sum_{i=1}^{K} \lambda_i(C(h))\, P_i(w \mid h)$$

Page 34: ICASSP 05

Classification

• Break the history into several parts (sub-histories). Each sub-history is analyzed in order to estimate its importance in terms of prediction and is then put into a class
• Such a class is directly linked to the value of the sub-history frequency
– A class gathers all sub-histories which have approximately the same frequency (sketched below)
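A minimal illustration of frequency-based history classes and class-dependent interpolation weights; the log-scale binning and all names here are assumptions, not the paper's exact scheme:

```python
import math

def frequency_class(subhistory, counts, n_classes=10):
    """Map a sub-history to a class index via logarithmic frequency bins, so
    sub-histories with roughly the same training frequency share a class."""
    return min(n_classes - 1, int(math.log1p(counts.get(subhistory, 0))))

def interpolate(word, subhistories, models, class_weights, counts):
    """P(w | h) = sum_i lambda_i(C(h_i)) * P_i(w | h_i): each model M_i looks
    at its own sub-history h_i, and its weight depends on that sub-history's
    frequency class."""
    return sum(
        class_weights[frequency_class(h_i, counts)][i] * p_i(word, h_i)
        for i, (p_i, h_i) in enumerate(zip(models, subhistories))
    )
```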

Page 35: ICASSP 05

• Bigram combination: 8000 classes, perplexity 115.4
– 12.8% improvement over the baseline bigram (132.4)
– 5.3% improvement over the single-weight combination (121.9)
• Trigram combination: 4000 classes, perplexity 85.2
– 12.8% improvement over the baseline trigram (97.8)
– 1.5% improvement over the single-weight combination (86.5)

