CRANDEM: Conditional Random Fields for ASR

Page 1: CRANDEM: Conditional Random Fields for ASR

CRANDEM: Conditional Random Fields for ASR

Jeremy Morris

11/21/2008

Page 2: CRANDEM: Conditional Random Fields for ASR

Outline

- Background: Tandem HMMs & CRFs
- Crandem HMM
- Phone recognition
- Word recognition

Page 3: CRANDEM: Conditional Random Fields for ASR

Background

- Conditional Random Fields (CRFs)
  - Discriminative probabilistic sequence model
  - Directly defines a posterior probability of a label sequence given a set of observations

Page 4: CRANDEM: Conditional Random Fields for ASR

Background

- Problem: How do we make use of CRF classification for word recognition?
  - Attempt to use CRFs directly?
  - Attempt to fit CRFs into current state-of-the-art models for speech recognition?
- Here we focus on the latter approach
  - How can we integrate what we learn from the CRF into a standard HMM-based ASR system?

Page 5: CRANDEM: Conditional Random Fields for ASR

Background

- Tandem HMM
  - Generative probabilistic sequence model
  - Uses outputs of a discriminative model (e.g. ANN MLPs) as input feature vectors for a standard HMM

Page 6: CRANDEM: Conditional Random Fields for ASR

Background

- Tandem HMM
  - ANN MLP classifiers are trained on labeled speech data
    - Classifiers can be phone classifiers or phonological feature classifiers
  - Classifiers output posterior probabilities for each frame of data
    - E.g. P(Q|X), where Q is the phone class label and X is the input speech feature vector

Page 7: CRANDEM: Conditional Random Fields for ASR

Background

- Tandem HMM
  - Posterior feature vectors are used by an HMM as inputs
  - In practice, posteriors are not used directly
    - Log posterior outputs or “linear” outputs are more frequently used
      - “Linear” here means outputs of the MLP with no application of the softmax function to transform them into probabilities
    - Since HMMs model phones as Gaussian mixtures, the goal is to make these outputs look more “Gaussian”
    - Additionally, Principal Components Analysis (PCA) is applied to decorrelate the features for diagonal covariance matrices
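The Tandem preprocessing described on this slide can be sketched roughly as follows. This is a minimal illustration using NumPy, not the toolchain used in the talk: the function name `tandem_features`, the `eps` floor, and the random stand-in for MLP outputs are all assumptions made for the example.

```python
import numpy as np

def tandem_features(posteriors, n_components=None, eps=1e-10):
    """Turn per-frame posteriors (frames x classes) into decorrelated features.

    Hypothetical helper: takes logs so the values look more Gaussian, then
    applies PCA so a diagonal-covariance Gaussian mixture is a better fit.
    """
    log_post = np.log(posteriors + eps)           # eps avoids log(0); an assumption
    centered = log_post - log_post.mean(axis=0)   # PCA needs zero-mean data
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigh: covariance is symmetric
    order = np.argsort(eigvals)[::-1]             # largest variance first
    basis = eigvecs[:, order[:n_components]]      # n_components=None keeps all
    return centered @ basis

# Stand-in for "linear" (pre-softmax) MLP outputs: 500 frames, 6 classes.
rng = np.random.default_rng(0)
linear = rng.normal(size=(500, 6))
posteriors = np.exp(linear) / np.exp(linear).sum(axis=1, keepdims=True)

feats = tandem_features(posteriors)
cov = np.cov(feats, rowvar=False)
off_diag = cov - np.diag(np.diag(cov))            # should be numerically zero
```

In a real system the PCA basis would be estimated once on training data and then reused for test data; the sketch estimates and applies it on the same array only to keep the example short.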

Page 8: CRANDEM: Conditional Random Fields for ASR

Idea: Crandem

- Use a CRF classifier to create inputs to a Tandem-style HMM
  - CRF labels provide a better per-frame accuracy than input MLPs
  - We’ve shown CRFs to provide better phone recognition than a Tandem system with the same inputs
- This suggests that we may get some gain from using CRF features in an HMM

Page 9: CRANDEM: Conditional Random Fields for ASR

Idea: Crandem

- Problem: CRF output doesn’t match MLP output
  - MLP output is a per-frame vector of posteriors
  - CRF outputs a probability across the entire sequence
- Solution: Use the Forward-Backward algorithm to generate a vector of posterior probabilities

Page 10: CRANDEM: Conditional Random Fields for ASR

Forward-Backward Algorithm

- The Forward-Backward algorithm is already used during CRF training
  - Similar to the forward-backward algorithm for HMMs
- Forward pass collects feature functions for the timesteps prior to the current timestep
- Backward pass collects feature functions for the timesteps following the current timestep
- Information from both passes is combined to determine the probability of being in a given state at a particular timestep

Page 11: CRANDEM: Conditional Random Fields for ASR

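Forward-Backward Algorithm

$$P(Y \mid X) = \frac{\exp\left(\sum_t \left(\sum_i \lambda_i s_i(x, y_t) + \sum_j \mu_j f_j(x, y_{t-1}, y_t)\right)\right)}{Z(x)}$$

$$M_t[y', y] = \exp\left(\sum_i \lambda_i s_i(x, y) + \sum_j \mu_j f_j(x, y', y)\right)$$

$$\alpha_t = \begin{cases} \alpha_{t-1} M_t & \text{if } 0 < t \le n \\ \mathbf{1} & \text{if } t = 0 \end{cases}$$

$$\beta_t^{T} = \begin{cases} M_{t+1}\, \beta_{t+1}^{T} & \text{if } 1 \le t < n \\ \mathbf{1} & \text{if } t = n \end{cases}$$

$$P(y_{i,t} \mid X) = \frac{\alpha_{i,t}\, \beta_{i,t}}{Z(x)}$$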
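The recursions on this slide can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions rather than the implementation used in the talk: the function name is hypothetical, the weighted feature sums are collapsed into a precomputed state-score array and a single transition-score matrix, the first frame is assumed to have no incoming transition, and no rescaling or log-domain arithmetic is done (so it is only suitable for short sequences).

```python
import numpy as np

def crf_frame_posteriors(state_scores, trans_scores):
    """Per-frame label posteriors for a linear-chain CRF (hypothetical helper).

    state_scores: (T, K); entry [t, y] stands for sum_i lambda_i s_i(x, y_t = y).
    trans_scores: (K, K); entry [y_prev, y] stands for sum_j mu_j f_j(x, y_prev, y).
    """
    T, K = state_scores.shape
    # M[t][y_prev, y] = exp(state score of y at t + transition score y_prev -> y)
    M = np.exp(state_scores[:, None, :] + trans_scores[None, :, :])

    alpha = np.zeros((T, K))
    beta = np.ones((T, K))                     # beta at the last frame is all ones
    alpha[0] = np.exp(state_scores[0])         # assumption: no transition into frame 0
    for t in range(1, T):
        alpha[t] = alpha[t - 1] @ M[t]         # forward pass: row vector times matrix
    for t in range(T - 2, -1, -1):
        beta[t] = M[t + 1] @ beta[t + 1]       # backward pass: matrix times column vector

    Z = alpha[-1].sum()                        # normalizer Z(x)
    return alpha * beta / Z                    # row t is the posterior vector at frame t

# Tiny random example: 4 frames, 3 labels.
rng = np.random.default_rng(0)
state = rng.normal(size=(4, 3))
trans = rng.normal(size=(3, 3))
post = crf_frame_posteriors(state, trans)
```

Each row of `post` sums to one, so it has the same shape and interpretation as a per-frame MLP posterior vector, which is what a Tandem-style front end expects.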

Page 12: CRANDEM: Conditional Random Fields for ASR

Forward-Backward Algorithm

$$P(y_{i,t} \mid X) = \frac{\alpha_{i,t}\, \beta_{i,t}}{Z(x)}$$

- This form allows us to use the CRF to compute a vector of local posteriors y at any timestep t
- We use this to generate features for a Tandem-style system
  - Take log features, decorrelate with PCA

Page 13: CRANDEM: Conditional Random Fields for ASR

Phone Recognition

- Pilot task: phone recognition on TIMIT
  - 61 feature MLPs trained on TIMIT, mapped down to 39 features for evaluation
  - Crandem compared to Tandem and a standard PLP HMM baseline model
  - As with previous CRF work, we use the outputs of an ANN MLP as inputs to our CRF
- Various CRF models examined (state feature functions only, state+transition functions), and various input feature spaces examined (phone classifier and phonological feature classifier)

Page 14: CRANDEM: Conditional Random Fields for ASR

Phone Recognition

- Phonological feature attributes
  - Detector outputs describe phonetic features of a speech signal
    - Place, Manner, Voicing, Vowel Height, Backness, etc.
  - A phone is described with a vector of feature values
- Phone class attributes
  - Detector outputs describe the phone label associated with a portion of the speech signal
    - /t/, /d/, /aa/, etc.

Page 15: CRANDEM: Conditional Random Fields for ASR

Phone Recognition

$$P(Y \mid X) = \frac{\exp\left(\sum_t \left(\sum_i \lambda_i s_i(x, y_t) + \sum_j \mu_j f_j(x, y_{t-1}, y_t)\right)\right)}{Z(x)}$$
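To make the difference between a state-only model and a state+transition model concrete, the sequence probability above can be evaluated in the log domain as sketched below. The helper names and the random scores are assumptions for illustration; the weighted feature sums are again collapsed into a state-score array and a transition-score matrix, and the first frame is assumed to have no incoming transition.

```python
import itertools
import numpy as np

def logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def sequence_log_prob(labels, state_scores, trans_scores):
    """log P(Y|X) for one label sequence under a linear-chain CRF (hypothetical helper)."""
    T, K = state_scores.shape
    # Numerator: sum of state and transition scores along the given path.
    score = state_scores[0, labels[0]]
    for t in range(1, T):
        score += trans_scores[labels[t - 1], labels[t]] + state_scores[t, labels[t]]
    # Denominator: log Z(x) from the forward recursion, kept in the log domain.
    log_alpha = state_scores[0].copy()
    for t in range(1, T):
        log_alpha = logsumexp(log_alpha[:, None] + trans_scores, axis=0) + state_scores[t]
    return score - logsumexp(log_alpha, axis=0)

rng = np.random.default_rng(1)
state = rng.normal(size=(3, 2))        # 3 frames, 2 labels
trans = rng.normal(size=(2, 2))        # state+transition model
no_trans = np.zeros((2, 2))            # state-only model: all transition scores zero

# Probabilities over every possible label sequence should sum to one.
total = sum(np.exp(sequence_log_prob(seq, state, trans))
            for seq in itertools.product(range(2), repeat=3))

# With no transition scores the CRF reduces to independent per-frame softmaxes.
seq = (0, 1, 1)
log_softmax = state - np.log(np.exp(state).sum(axis=1, keepdims=True))
state_only_lp = sequence_log_prob(seq, state, no_trans)
per_frame_lp = sum(log_softmax[t, y] for t, y in enumerate(seq))
```

The last comparison shows why transition functions are the part that adds sequence-level information: a state-only model scores each frame independently.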

Page 16: CRANDEM: Conditional Random Fields for ASR

Phone Recognition - Results

Page 17: CRANDEM: Conditional Random Fields for ASR

Results (Fosler-Lussier & Morris 08)

Model                             Phone Accuracy
PLP HMM reference                 68.1%
Tandem (61 feas)                  70.6%
Tandem (48 feas)                  70.8%
CRF (state)                       69.9%
CRF (state+trans)                 70.7%
Crandem (state) – log             71.1%
Crandem (state+trans) – log       71.7%
Crandem (state) – unnorm          71.2%
Crandem (state+trans) – unnorm    71.8%

* Significant (p<0.05) improvement at 0.6% difference between models

Page 18: CRANDEM: Conditional Random Fields for ASR

Results (Fosler-Lussier & Morris 08)

Model                             Phone Accuracy
PLP HMM reference                 68.1%
Tandem (105 feas)                 70.9%
Tandem (48 feas)                  71.2%
CRF (state)                       71.4%
CRF (state+trans)                 71.6%
Crandem (state) – log             71.7%
Crandem (state+trans) – log       72.4%
Crandem (state) – unnorm          71.7%
Crandem (state+trans) – unnorm    72.4%

* Significant (p≤0.05) improvement at 0.6% difference between models

Page 19: CRANDEM: Conditional Random Fields for ASR

Word Recognition

- Second task: word recognition
  - Dictionary for word recognition has 54 distinct phones instead of 48, so new CRFs and MLPs were trained to provide input features
  - MLPs and CRFs again trained on TIMIT to provide both phone classifier output and phonological feature classifier output
- Initial experiments: use MLPs and CRFs trained on TIMIT to generate features for WSJ recognition
- Next pass: use MLPs and CRFs trained on TIMIT to align label files for WSJ, then train MLPs and CRFs for WSJ recognition

Page 20: CRANDEM: Conditional Random Fields for ASR

Initial Results

Model                             Word Accuracy
MFCC HMM reference                90.85%
Tandem MLP (54 feas)              90.30%
Tandem MFCC+MLP (54 feas)         90.90%
Crandem (54 feas) (state)         90.95%
Crandem (54 feas) (state+trans)   90.77%
Crandem MFCC+CRF (state)          92.29%
Crandem MFCC+CRF (state+trans)    92.40%

* Significant (p≤0.05) improvement at roughly 1% difference between models

Page 21: CRANDEM: Conditional Random Fields for ASR

Initial Results

Model                             Word Accuracy
MFCC HMM reference                90.85%
Tandem MLP (98 feas)              91.26%
Tandem MFCC+MLP (98 feas)         92.04%
Crandem (98 feas) (state)         91.31%
Crandem (98 feas) (state+trans)   90.49%
Crandem MFCC+CRF (state)          92.47%
Crandem MFCC+CRF (state+trans)    92.62%

* Significant (p≤0.05) improvement at roughly 1% difference between models

Page 22: CRANDEM: Conditional Random Fields for ASR

Initial Results

Model                             Word Accuracy
MFCC HMM reference                90.85%
Tandem MLP (WSJ 54)               90.41%
Tandem MFCC+MLP (WSJ 54)          92.21%
Crandem (WSJ 54) (state)          89.58%

* Significant (p≤0.05) improvement at roughly 1% difference between models

Page 23: CRANDEM: Conditional Random Fields for ASR

Word Recognition Problems

- Some of the models show a slight but significant improvement over their Tandem counterparts
  - Unfortunately, what will cause an improvement is not yet predictable
  - Transition features give a slight degradation when used on their own, and a slight improvement when the classifier is mixed with MFCCs
- Retraining directly on WSJ data does not give improvement for the CRF
  - Gains from CRF training are wiped away if we just retrain the MLPs on WSJ data

Page 24: CRANDEM: Conditional Random Fields for ASR

Word Recognition Problems (cont.)

- The only model that gives improvement for the Crandem system is a CRF model trained on linear outputs from MLPs
  - Softmax outputs: much worse than baseline
  - Log softmax outputs: ditto
  - This doesn’t seem right, especially given the results from the Crandem phone recognition experiments
    - These were trained on softmax outputs
  - I suspect “implementor error” here, though I haven’t tracked down my mistake yet

Page 25: CRANDEM: Conditional Random Fields for ASR

Word Recognition Problems (cont.)

- Because of the “linear inputs only” issue, certain features have yet to be examined fully
  - “Hifny”-style Gaussian scores have not provided any gain; scaling of these features may be preventing them from being useful

Page 26: CRANDEM: Conditional Random Fields for ASR

Current Work

- Sort out problems with CRF models
  - Why is it so sensitive to the input feature type? (linear vs. log vs. softmax)
  - If this sensitivity is “built in” to the model, how can I appropriately scale features to include them in the model that works?
- Move on to next problem: direct decoding on CRF lattices

