Page 1: Random Forests for Language Modeling

Random Forests for Language Modeling

Peng Xu, Frederick Jelinek
CLSP, Dept. of ECE, The Johns Hopkins University

Page 2: Random Forests for Language Modeling


Outline
Basic Language Modeling
  Language Models
  Smoothing in n-gram Language Models
  Decision Tree Language Models
Random Forests for Language Models
  Random Forests
  n-gram
  Structured Language Model (SLM)
Experiments
Conclusions and Future Work

Page 3: Random Forests for Language Modeling


Basic Language Modeling
Estimate the source probability

  $P(W) = P(w_1, \ldots, w_N)$

from a training corpus: a large amount of text chosen for similarity to expected sentences.

Parametric conditional models:

  $P(w_i \mid w_1, \ldots, w_{i-1}), \quad w_i \in V, \; i = 1, \ldots, N$

where $w_1, \ldots, w_{i-1}$ is the history.

Page 4: Random Forests for Language Modeling


Basic Language Modeling
Smooth models:

  $P(w_i \mid w_1, \ldots, w_{i-1}) > 0$

Perplexity (PPL):

  $\mathrm{PPL} = \exp\!\left(-\frac{1}{N} \sum_{i=1}^{N} \log P_M(w_i \mid w_1, \ldots, w_{i-1})\right)$

n-gram models:

  $P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})$
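To make the PPL definition concrete, here is a minimal Python sketch that computes corpus perplexity from per-word conditional probabilities; the `model.prob(word, history)` interface is a hypothetical stand-in for any smoothed model, not code from the talk.

```python
import math

def perplexity(sentences, model, order=3):
    """PPL = exp(-1/N * sum_i log P_M(w_i | history)), computed over all test words."""
    log_sum, n_words = 0.0, 0
    for sentence in sentences:
        history = []
        for word in sentence:
            # model.prob is assumed to return a smoothed (non-zero) probability
            log_sum += math.log(model.prob(word, tuple(history[-(order - 1):])))
            history.append(word)
            n_words += 1
    return math.exp(-log_sum / n_words)
```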

Page 5: Random Forests for Language Modeling


Estimate n-gram Parameters
Maximum Likelihood (ML) estimate:

  $P(w_i \mid w_{i-n+1}^{i-1}) = \frac{C(w_{i-n+1}^{i})}{C(w_{i-n+1}^{i-1})}$

Best on training data: lowest PPL

Data sparseness problem: n=3, |V|=10k, a trillion words needed

Zero probability for almost all test data!
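As a small illustration of the ML estimate and of why unseen events get zero probability, here is a toy relative-frequency trigram in Python; the training sentence is made up for the example.

```python
from collections import Counter

def ml_trigram(train):
    """Maximum-likelihood trigram estimate: P(w | u, v) = C(u, v, w) / C(u, v)."""
    tri, bi = Counter(), Counter()
    for sent in train:
        padded = ["<s>", "<s>"] + sent + ["</s>"]
        for u, v, w in zip(padded, padded[1:], padded[2:]):
            tri[(u, v, w)] += 1
            bi[(u, v)] += 1
    def prob(w, u, v):
        return tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] else 0.0
    return prob

# Any trigram unseen in training gets probability 0 -- the data sparseness problem.
prob = ml_trigram([["the", "dog", "barks"]])
print(prob("barks", "the", "dog"))  # 1.0
print(prob("cat", "the", "dog"))    # 0.0 for an unseen event
```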

Page 6: Random Forests for Language Modeling


Dealing with Sparsity
Smoothing: use lower-order statistics
Word clustering: reduce the size of V
History clustering: reduce the number of histories
Maximum entropy: use exponential models
Neural network: represent words in real space $\mathbb{R}^d$, use an exponential model

Page 7: Random Forests for Language Modeling


Smoothing Techniques
Good smoothing techniques: Deleted Interpolation, Katz, Absolute Discounting, Kneser-Ney (KN)
Kneser-Ney: consistently the best [Chen & Goodman, 1998]

  $P_{KN}(w_i \mid w_{i-n+1}^{i-1}) = \frac{\max\left(C(w_{i-n+1}^{i}) - D,\ 0\right)}{C(w_{i-n+1}^{i-1})} + \lambda(w_{i-n+1}^{i-1})\, P_{KN}(w_i \mid w_{i-n+2}^{i-1})$

where the backoff weight normalizes the distribution,

  $\lambda(w_{i-n+1}^{i-1}) = \frac{D\, \left|\{w_i : C(w_{i-n+1}^{i}) > 0\}\right|}{C(w_{i-n+1}^{i-1})}$

and the lower-order model is estimated from modified (continuation) counts

  $\hat{C}(w_{i-n+2}^{i}) = \left|\{w_{i-n+1} : C(w_{i-n+1}^{i}) > 0\}\right|$
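The following is a minimal sketch, for bigrams only, of the interpolated Kneser-Ney recursion shown above (absolute discount D, backoff weight chosen to normalize, continuation counts for the lower order). It assumes a closed vocabulary and a single fixed discount, and it is not the implementation used in the experiments.

```python
from collections import Counter, defaultdict

class KNBigram:
    """Sketch of interpolated Kneser-Ney for bigrams, following the recursion above."""
    def __init__(self, sentences, D=0.75):
        self.D = D
        self.bigram = Counter()        # C(v, w)
        self.context = Counter()       # C(v)
        followers = defaultdict(set)   # distinct words observed after v
        preceders = defaultdict(set)   # distinct contexts observed before w
        for sent in sentences:
            padded = ["<s>"] + sent + ["</s>"]
            for v, w in zip(padded, padded[1:]):
                self.bigram[(v, w)] += 1
                self.context[v] += 1
                followers[v].add(w)
                preceders[w].add(v)
        self.n_followers = {v: len(s) for v, s in followers.items()}
        self.cont = {w: len(s) for w, s in preceders.items()}
        self.total_cont = sum(self.cont.values())

    def p_cont(self, w):
        # Lower-order "continuation" probability: the modified count of w
        # (number of distinct contexts it follows), normalized.
        return self.cont.get(w, 0) / self.total_cont

    def prob(self, w, v):
        c_vw, c_v = self.bigram[(v, w)], self.context[v]
        if c_v == 0:                                  # unseen context: back off completely
            return self.p_cont(w)
        lam = self.D * self.n_followers[v] / c_v      # normalizing backoff weight
        return max(c_vw - self.D, 0) / c_v + lam * self.p_cont(w)
```

For example, `KNBigram([["the", "dog", "barks"]]).prob("dog", "the")` returns a discounted, interpolated probability rather than the raw relative frequency.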

Page 8: Random Forests for Language Modeling


Decision Tree Language Models
Goal: history clustering by a binary decision tree (DT)
Internal nodes: a set of histories, one or two questions
Leaf nodes: a set of histories
Node splitting algorithms
DT growing algorithms

Page 9: Random Forests for Language Modeling


Example DT
Training data: aba, aca, acb, bcb, bda

Root node {ab, ac, bc, bd} (a:3, b:2)
  "Is the first word 'a'?" → yes: leaf {ab, ac} (a:2, b:1)
  "Is the first word 'b'?" → yes: leaf {bc, bd} (a:1, b:1)

New event 'cba': stuck! The history 'cb' answers "no" to both questions, so it reaches no leaf.

Page 10: Random Forests for Language Modeling


Previous Work
DT is an appealing idea: deals with data sparseness
[Bahl et al., 1989]: 20 words in histories, slightly better than a 3-gram
[Potamianos and Jelinek, 1998]: fair comparison, negative results on letter n-grams
Both are top-down with a stopping criterion.

Why doesn't it work in practice?
  Training data fragmentation: data sparseness
  No theoretically founded stopping criterion: early termination
  Greedy algorithms: early termination

Page 11: Random Forests for Language Modeling


Outline
Basic Language Modeling
  Language Models
  Smoothing in n-gram Language Models
  Decision Tree Language Models
Random Forests for Language Models
  Random Forests
  n-gram
  Structured Language Model (SLM)
Experiments
Conclusions and Future Work

Page 12: Random Forests for Language Modeling


Random Forests
[Amit & Geman, 1997]: shape recognition with randomized trees
[Ho, 1998]: random subspace method
[Breiman, 2001]: random forests

Random Forest (RF): a classifier consisting of a collection of tree-structured classifiers.

Page 13: Random Forests for Language Modeling


Our Goal
Main problems:
  Data sparseness: smoothing
  Early termination
  Greedy algorithms

Expectations from Random Forests:
  Less greedy algorithms: randomization and voting
  Avoid early termination: randomization
  Conquer data sparseness: voting

Page 14: Random Forests for Language Modeling


Outline
Basic Language Modeling
  Language Models
  Smoothing in n-gram Language Models
  Decision Tree Language Models
Random Forests for Language Models
  Random Forests
  n-gram: general approach
  Structured Language Model (SLM)
Experiments
Conclusions and Future Work

Page 15: Random Forests for Language Modeling


General DT Growing Approach

Grow a DT to maximum depth using training data
Perform no smoothing during growing
Prune the fully grown DT to maximize heldout data likelihood
Incorporate KN smoothing during pruning

Page 16: Random Forests for Language Modeling


Node Splitting Algorithm
Questions: about identities of words in the history

Definitions:
  H(p): the set of histories in node p
  position i: the distance from a word in the history to the predicted word
  β_i(v): the set of histories with word v in position i
  split: non-empty sets A_i and B_i, each consisting of sets β_i(v)
  L(A_i): training data log-likelihood of the node under the split A_i, B_i, using relative frequencies

Page 17: Random Forests for Language Modeling


Node Splitting Algorithm
Algorithm sketch:
1. For each position i:
   a) Initialization: A_i, B_i
   b) For each β_i(v) in A_i:
      i.   Tentatively move β_i(v) to B_i
      ii.  Calculate the log-likelihood increase L(A_i - β_i(v)) - L(A_i)
      iii. If the increase is positive, move β_i(v) and modify counts
   c) Carry out the same for each β_i(v) in B_i
   d) Repeat b)-c) until no move is possible
2. Split the node according to the best position: the one with the largest increase in log-likelihood
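As a concrete (if simplified) illustration of the exchange step for a single history position, the sketch below groups the node's training statistics by the word in that position and moves groups between A_i and B_i whenever the training log-likelihood increases. The data layout and function names are assumptions for this example; a full implementation would run this for every position, use randomized initialization, and keep the best split.

```python
import math
from collections import Counter

def log_likelihood(counts):
    """Training-data log-likelihood of one side under relative-frequency estimates."""
    total = sum(counts.values())
    return sum(c * math.log(c / total) for c in counts.values() if c > 0) if total else 0.0

def exchange_split(groups, max_iters=100):
    """Sketch of the exchange algorithm for one position.
    `groups` maps each word v in that position to a Counter of predicted-word counts
    (the statistics of beta_i(v)).  Returns the two sets A, B of position words."""
    words = list(groups)
    A, B = set(words[::2]), set(words[1::2])          # simple initialization (randomized in the full algorithm)
    count_A = sum((groups[v] for v in A), Counter())
    count_B = sum((groups[v] for v in B), Counter())
    for _ in range(max_iters):
        moved = False
        for v in words:
            src, dst = (A, B) if v in A else (B, A)
            src_c, dst_c = (count_A, count_B) if v in A else (count_B, count_A)
            if len(src) == 1:
                continue                              # keep both sides non-empty
            before = log_likelihood(src_c) + log_likelihood(dst_c)
            after = log_likelihood(src_c - groups[v]) + log_likelihood(dst_c + groups[v])
            if after > before:                        # move only if log-likelihood increases
                src.discard(v); dst.add(v)
                src_c.subtract(groups[v]); dst_c.update(groups[v])
                moved = True
        if not moved:
            break
    return A, B
```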

Page 18: Random Forests for Language Modeling


Pruning a Decision Tree
Smoothing:

  $P_{DT}(w_i \mid \Phi_{DT}(w_{i-n+1}^{i-1})) = \frac{\max\left(C(w_i, \Phi_{DT}(w_{i-n+1}^{i-1})) - D,\ 0\right)}{C(\Phi_{DT}(w_{i-n+1}^{i-1}))} + \lambda(\Phi_{DT}(w_{i-n+1}^{i-1}))\, P_{KN}(w_i \mid w_{i-n+2}^{i-1})$

Define:
  L(p): the set of all leaves rooted in p
  LH(p): smoothed heldout data log-likelihood in p
  LH(L(p)): smoothed heldout data log-likelihood in L(p)
  potential: LH(L(p)) - LH(p)

Pruning: traverse all internal nodes; prune the subtree rooted in p if its potential is negative (similar to CART).
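A compact sketch of this pruning rule: compute each internal node's potential as the smoothed heldout log-likelihood of its leaves minus that of the node treated as a single leaf, and collapse subtrees with negative potential. The node attributes (`children`, `heldout_loglik`) are hypothetical, and the heldout log-likelihoods are assumed to be precomputed with the KN-style smoothing above.

```python
def heldout_loglik_of_leaves(node):
    """Smoothed heldout log-likelihood of all leaves under `node`: LH(L(node))."""
    if not node.children:
        return node.heldout_loglik           # LH at a leaf
    return sum(heldout_loglik_of_leaves(child) for child in node.children)

def prune(node):
    """Prune every subtree whose potential LH(L(p)) - LH(p) is negative.
    `node.heldout_loglik` is assumed to hold the smoothed heldout log-likelihood
    obtained by treating the node itself as a single leaf."""
    if not node.children:
        return
    for child in node.children:
        prune(child)                          # prune bottom-up
    potential = heldout_loglik_of_leaves(node) - node.heldout_loglik
    if potential < 0:
        node.children = []                    # collapse the subtree into a leaf
```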

Page 19: Random Forests for Language Modeling


Towards Random Forests
Randomized question selection:
  Randomized initialization: A_i, B_i
  Randomized position selection

Generating a random forest LM:
  M decision trees are grown randomly
  Each DT generates a probability sequence on test data
  Aggregation:

  $P_{RF}(w_i \mid w_{i-n+1}^{i-1}) = \frac{1}{M} \sum_{j=1}^{M} P_{DT_j}\!\left(w_i \mid \Phi_{DT_j}(w_{i-n+1}^{i-1})\right)$
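A minimal sketch of the aggregation step: the forest probability is the plain average of the M decision-tree probabilities for each test event. The `dt.prob(word, history)` interface, with each DT mapping the history to its own equivalence class internally, is an assumption for the example.

```python
import math

def rf_prob(word, history, dt_models):
    """P_RF(w | h) = (1/M) * sum_j P_DT_j(w | Phi_DT_j(h))."""
    return sum(dt.prob(word, history) for dt in dt_models) / len(dt_models)

def rf_test_logprob(sentence, dt_models, order=3):
    """Aggregate per-word probabilities over one test sentence."""
    padded = ["<s>"] * (order - 1) + sentence + ["</s>"]
    return sum(math.log(rf_prob(padded[i], tuple(padded[i - order + 1:i]), dt_models))
               for i in range(order - 1, len(padded)))
```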

Page 20: Random Forests for Language Modeling


Remarks on RF-LM
Random Forest Language Model (RF-LM):

A collection of randomly constructed DT-LMs

A DT-LM is an RF-LM: small forest

An n-gram LM is a DT-LM: no pruning

An n-gram LM is an RF-LM!

Single compact model

Page 21: Random Forests for Language Modeling


Outline
Basic Language Modeling
  Language Models
  Smoothing in n-gram Language Models
  Decision Tree Language Models
Random Forests for Language Models
  Random Forests
  n-gram
  Structured Language Model (SLM)
Experiments
Conclusions and Future Work

Page 22: Random Forests for Language Modeling


A Parse Tree

Page 23: Random Forests for Language Modeling


The Structured Language Model (SLM)

Page 24: Random Forests for Language Modeling


Partial Parse Tree

Page 25: Random Forests for Language Modeling


SLM Probabilities
Joint probability of words and parse:

  $P(W, T) = \prod_{i=1}^{n+1} \Big[ P(w_i \mid W_{i-1} T_{i-1})\; P(t_i \mid W_{i-1} T_{i-1}, w_i) \prod_{j=1}^{N_i} P(p_j^i \mid W_{i-1} T_{i-1}, w_i, t_i, p_1^i \ldots p_{j-1}^i) \Big]$

Word probabilities:

  $P_{SLM}(w_{i+1} \mid W_i) = \sum_{T_i \in S_i} P(w_{i+1} \mid W_i T_i)\, \rho(W_i, T_i), \qquad \rho(W_i, T_i) = \frac{P(W_i T_i)}{\sum_{T_i \in S_i} P(W_i T_i)}$

Page 26: Random Forests for Language Modeling


Using RFs for the SLM
Ideally: run the SLM one time
Parallel approximation: run the SLM M times, the m-th run using DT components

  $P^{DT_m}_{\mathrm{PREDICTOR}},\; P^{DT_m}_{\mathrm{TAGGER}},\; P^{DT_m}_{\mathrm{PARSER}}, \qquad m = 1, \ldots, M$

Aggregate the M probability sequences into

  $P^{RF}_{\mathrm{PREDICTOR}},\; P^{RF}_{\mathrm{TAGGER}},\; P^{RF}_{\mathrm{PARSER}}$
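A small sketch of the parallel approximation: each of the M SLM runs produces a sequence of per-word probabilities on the test data, and the sequences are averaged position by position. `slm_runs`, holding those M sequences, is an assumed input for this example.

```python
def aggregate_slm_runs(slm_runs):
    """Average M per-word probability sequences (one per SLM run with DT components)."""
    M = len(slm_runs)
    length = len(slm_runs[0])
    return [sum(run[i] for run in slm_runs) / M for i in range(length)]
```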

Page 27: Random Forests for Language Modeling


Outline
Basic Language Modeling
  Language Models
  Smoothing in n-gram Language Models
  Decision Tree Language Models
Random Forests for Language Models
  Random Forests
  n-gram
  Structured Language Model (SLM)
Experiments
Conclusions and Future Work

Page 28: Random Forests for Language Modeling


Experiments
Goal: compare with Kneser-Ney (KN)

Perplexity (PPL):
  UPenn Treebank: 1 million words training, 82k words test
  Normalized text

Word Error Rate (WER):
  WSJ text: 20 or 40 million words training
  WSJ DARPA '93 HUB1 test data: 213 utterances, 3,446 words
  N-best rescoring: standard trigram baseline trained on 40 million words

Page 29: Random Forests for Language Modeling


Experiments: trigram perplexity

Baseline: KN-trigram
No randomization: DT-trigram
100 random DTs: RF-trigram

Model         Heldout PPL   Gain     Test PPL   Gain
KN-trigram    160.1         -        145.0      -
DT-trigram    158.6         0.9%     163.3      -12.6%
RF-trigram    126.8         20.8%    129.7      10.5%

Page 30: Random Forests for Language Modeling


Experiments: Aggregating

Improvements within 10 trees!

Page 31: Random Forests for Language Modeling


Experiments: Why does it work?
A test event is "seen" if:
  KN-trigram: $(w_{i-n+1}^{i-1}, w_i)$ occurs in the training data
  DT-trigram: $(\Phi_{DT}(w_{i-n+1}^{i-1}), w_i)$ occurs in the training data
  RF-trigram: $(\Phi_{DT_m}(w_{i-n+1}^{i-1}), w_i)$ occurs in the training data for any m

Model         Seen %   Seen PPL   Unseen %   Unseen PPL
KN-trigram    45.6%    19.7       54.4%      773
DT-trigram    58.1%    26.2       41.9%      2069
RF-trigram    91.7%    75.6       8.3%       49818
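To make the "seen for any m" condition concrete, a one-function sketch: an event counts as seen by the forest if at least one tree maps the history to a leaf in which the predicted word occurred in training data. `dt.leaf(history)` and the per-tree `train_counts` mapping are hypothetical interfaces.

```python
def seen_under_rf(history, word, dt_models, train_counts):
    """True if (Phi_DT_m(history), word) was observed in training data for any tree m."""
    return any(train_counts[m].get((dt.leaf(history), word), 0) > 0
               for m, dt in enumerate(dt_models))
```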

Page 32: Random Forests for Language Modeling


Experiments: SLM perplexity

Baseline: KN-SLM
100 random DTs for each of the components
Parallel approximation
Interpolate with KN-trigram (interpolation weight λ)

Model     λ=0.0   λ=0.4   λ=1.0
KN-SLM    137.9   127.2   145.0
RF-SLM    122.8   117.6   145.0
Gain      10.9%   7.5%    -

Page 33: Random Forests for Language Modeling


Experiments: speech recognition

Baseline: KN-trigram, KN-SLM

100 random DTs for RF-trigram, RF-SLM-P (predictor)

Interpolate with KN-trigram (40M)

Word error rates (%):

Model               λ=0.0   λ=0.2   λ=0.4   λ=0.6   λ=0.8
KN-trigram (20M)    14.0    13.6    13.3    13.2    13.1
RF-trigram (20M)    12.9    12.9    13.0    13.0    12.7
KN-trigram (40M)    13.0    -       -       -       -
RF-trigram (40M)    12.4    12.7    12.7    12.7    12.7
KN-SLM (20M)        12.8    12.5    12.6    12.7    12.7
RF-SLM-P (20M)      11.9    12.2    12.3    12.3    12.6

Page 34: Random Forests for Language Modeling


Conclusions
New RF language modeling approach
More general LM: RF ⊇ DT ⊇ n-gram
Randomized history clustering: non-reciprocal data sharing

Good performance in PPL and WER

Generalize well to unseen data

Portable to other tasks

Page 35: Random Forests for Language Modeling


Future Work
Random samples of training data
More linguistically oriented questions
Direct implementation in the SLM
Lower-order random forests:

  $P_{DT}(w_i \mid \Phi_{DT}(w_{i-n+1}^{i-1})) = \frac{\max\left(C(w_i, \Phi_{DT}(w_{i-n+1}^{i-1})) - D,\ 0\right)}{C(\Phi_{DT}(w_{i-n+1}^{i-1}))} + \lambda(\Phi_{DT}(w_{i-n+1}^{i-1}))\, P_{RF}(w_i \mid w_{i-n+2}^{i-1})$

Larger test data for speech recognition
Language model adaptation

Page 36: Random Forests for Language Modeling


Thank you!

