Learning Long-Term Temporal Features

Page 1: Learning Long-Term Temporal Features

May 4, 2004 Speech Lunch Talk

Learning Long-Term Temporal Features

A Comparative Study

Barry Chen

Page 2: Learning Long-Term Temporal Features

Log-Critical Band Energies

Page 3: Learning Long-Term Temporal Features

Log-Critical Band Energies

Conventional Feature Extraction

Page 4: Learning Long-Term Temporal Features

Log-Critical Band Energies

TRAPS/HATS Feature Extraction

Page 5: Learning Long-Term Temporal Features

What is a TRAP? (Background Tangent)

• TRAPs were originally developed by our colleagues at OGI: Sharma, Jain (now at SRI), Hermansky and Sivadas (both now at IDIAP)

• Stands for TempRAl Pattern

• TRAP = the speech energy pattern in a narrow frequency band over a period of time (usually 0.5 – 1 second long)
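As a minimal sketch of this definition, assume the log critical band energies are stored as a NumPy array of shape (frames, bands) and that frames are 10 ms apart (the frame step is an assumption; the slide only gives the 0.5 to 1 second span). A TRAP is then the trajectory of one band over a window centered on the current frame. The function name `extract_trap` and the edge-padding choice are illustrative and do not come from the talk.

```python
import numpy as np

def extract_trap(log_energies, band, center, context=25):
    """Return the temporal pattern of one band around frame `center`.

    log_energies: array of shape (num_frames, num_bands).
    context: frames on each side; 25 gives 51 frames, about 0.5 s
             at an assumed 10 ms frame step.
    """
    # Repeat the edge frames so windows near utterance boundaries are full length.
    padded = np.pad(log_energies[:, band], context, mode="edge")
    # After padding, original frame `center` sits at index center + context,
    # so the window starts at index `center`.
    return padded[center : center + 2 * context + 1]

# Toy data: 200 frames of 15 critical bands.
rng = np.random.default_rng(0)
energies = rng.normal(size=(200, 15))
trap = extract_trap(energies, band=4, center=100)
edge_trap = extract_trap(energies, band=0, center=0)
```

The middle element of `trap` is the energy of band 4 at frame 100; everything else is left and right context from the same band only.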

Page 6: Learning Long-Term Temporal Features

Example of TRAPS

Mean Temporal Patterns for 45 phonemes at 500 Hz

Page 7: Learning Long-Term Temporal Features

TRAPS Motivation

• Psychoacoustic studies suggest that the human peripheral auditory system integrates information on a longer time scale

• Information measurements (joint mutual information) show that information still exists more than 100 ms away within a single critical band

• Potential robustness to speech degradations

Page 8: Learning Long-Term Temporal Features

Let’s Explore

• TRAPS and HATS are examples of a specific two-stage approach to learning long-term temporal features

• Is this constrained two-stage approach better than an unconstrained one-stage approach?

• Are the non-linear transformations of critical band trajectories, provided in different ways by TRAPS and HATS, actually necessary?

Page 9: Learning Long-Term Temporal Features

Learn Everything in One Step

Page 10: Learning Long-Term Temporal Features

Learn in Individual Bands

Page 19: Learning Long-Term Temporal Features

One-Stage Approach
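The slide's diagram did not survive in this transcript. Going by the system label "15 Bands x 51 Frames" used in the results, the one-stage approach can be sketched as a single MLP that sees all bands and all frames at once. The hidden layer size, the number of phone classes, and the random weights below are hypothetical placeholders, not values from the talk.

```python
import numpy as np

NUM_BANDS, NUM_FRAMES = 15, 51      # from the "15 Bands x 51 Frames" system label
NUM_HIDDEN, NUM_PHONES = 100, 46    # hypothetical sizes, not given in the slides

def one_stage_forward(window, w1, b1, w2, b2):
    """Map a (NUM_FRAMES, NUM_BANDS) window of log energies to phone posteriors."""
    x = window.reshape(-1)                           # 15 * 51 = 765 inputs
    hidden = 1.0 / (1.0 + np.exp(-(w1 @ x + b1)))    # sigmoid hidden layer
    logits = w2 @ hidden + b2
    exp = np.exp(logits - logits.max())              # numerically stable softmax
    return exp / exp.sum()

rng = np.random.default_rng(0)
w1 = rng.normal(scale=0.01, size=(NUM_HIDDEN, NUM_BANDS * NUM_FRAMES))
b1 = np.zeros(NUM_HIDDEN)
w2 = rng.normal(scale=0.01, size=(NUM_PHONES, NUM_HIDDEN))
b2 = np.zeros(NUM_PHONES)
window = rng.normal(size=(NUM_FRAMES, NUM_BANDS))
posteriors = one_stage_forward(window, w1, b1, w2, b2)
```

Nothing here constrains the network to learn per-band patterns first, which is the contrast with the two-stage systems.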

Page 20: Learning Long-Term Temporal Features

2-Stage Linear Approaches

Page 21: Learning Long-Term Temporal Features

PCA/LDA Comments

• PCA on log critical band energy trajectories scales and rotates dimensions in directions of highest variance

• LDA projects onto directions that maximize class separability, measured as the ratio of between-class covariance to within-class covariance

• Keep top 40 dimensions for comparison with MLP-based approaches
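The two linear transforms described above can be sketched with plain NumPy. This is an illustration under assumptions: the function names are made up, the PCA here only rotates and truncates (any variance scaling the slide alludes to is left out), and the toy data uses small dimensions rather than the 40 kept in the talk.

```python
import numpy as np

def pca_transform(x, keep):
    """Project centered data onto the `keep` highest-variance directions."""
    centered = x - x.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)           # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:keep]         # largest variance first
    return centered @ eigvecs[:, order]

def lda_transform(x, labels, keep):
    """Project onto directions maximizing between-class over within-class scatter."""
    mean = x.mean(axis=0)
    dim = x.shape[1]
    s_w = np.zeros((dim, dim))                       # within-class scatter
    s_b = np.zeros((dim, dim))                       # between-class scatter
    for c in np.unique(labels):
        xc = x[labels == c]
        mc = xc.mean(axis=0)
        s_w += (xc - mc).T @ (xc - mc)
        diff = (mc - mean)[:, None]
        s_b += len(xc) * (diff @ diff.T)
    # Eigenvectors of inv(S_w) S_b; the matrix is not symmetric, so use eig.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(s_w, s_b))
    order = np.argsort(eigvals.real)[::-1][:keep]
    return (x - mean) @ eigvecs[:, order].real

# Toy data: 3 classes, 10 dimensions, with a class-dependent shift.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(3), 100)
x = rng.normal(size=(300, 10)) + labels[:, None]
pca_feats = pca_transform(x, keep=4)
lda_feats = lda_transform(x, labels, keep=2)
```

PCA ignores the labels entirely, while LDA uses them; with C classes, LDA yields at most C - 1 useful directions.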

Page 22: Learning Long-Term Temporal Features

2-Stage MLP-Based Approaches

Page 23: Learning Long-Term Temporal Features

MLP Comments

• As with the other 2-stage approaches, we first learn patterns independently in separate critical band trajectories, and then learn correlations among these discriminative trajectories

• Interpretation of various MLP layers:

1. Input to hidden weights – discriminant linear transformations

2. Hidden unit outputs – non-linear discriminant transforms

3. Before Softmax – transforms hidden activation space to unnormalized phone probability space

4. Output Activations – critical band phone probabilities
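The four layer interpretations can be made concrete with a small forward pass through one band's MLP. This is a sketch with hypothetical layer sizes and random weights. The comments linking each tap to the system names used in the results (HATS Before Sigmoid, HATS, TRAPS Before Softmax, TRAPS) reflect how those names read, and should be treated as an interpretation rather than a statement from the slides.

```python
import numpy as np

def band_mlp_taps(trap, w_ih, b_h, w_ho, b_o):
    """Return the four intermediate representations of one critical band MLP."""
    pre_sigmoid = w_ih @ trap + b_h                  # 1. linear discriminant transform
    hidden = 1.0 / (1.0 + np.exp(-pre_sigmoid))      # 2. non-linear transform (hidden outputs)
    pre_softmax = w_ho @ hidden + b_o                # 3. unnormalized phone scores
    exp = np.exp(pre_softmax - pre_softmax.max())    # stable softmax
    posteriors = exp / exp.sum()                     # 4. critical band phone probabilities
    return pre_sigmoid, hidden, pre_softmax, posteriors

# Hypothetical sizes: 51-frame TRAP, 40 hidden units, 46 phone classes.
rng = np.random.default_rng(0)
trap = rng.normal(size=51)
w_ih, b_h = rng.normal(scale=0.1, size=(40, 51)), np.zeros(40)
w_ho, b_o = rng.normal(scale=0.1, size=(46, 40)), np.zeros(46)
pre_sigmoid, hidden, pre_softmax, posteriors = band_mlp_taps(trap, w_ih, b_h, w_ho, b_o)
```

Any of the four outputs can be passed to the second stage; which tap is used is what distinguishes the MLP-based systems compared in the talk.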

Page 24: Learning Long-Term Temporal Features

Experimental Setup

• Training: ~68 hours of conversational telephone speech from English CallHome, Switchboard I, and Switchboard Cellular

– 1/10 used as the cross-validation set for the MLPs

• Testing: 2001 Hub-5 Evaluation Set (Eval2001)

– 2,255,609 frames and 62,890 words

• Back-end recognizer: SRI’s Decipher system. First-pass decoding using a bigram language model and within-word triphone acoustic models (thanks to Andreas Stolcke for all his help)

Page 25: Learning Long-Term Temporal Features

Frame Accuracy Performance

[Figure: bar chart of frame accuracy (y-axis from 62.0% to 68.0%) for 15 Bands x 51 Frames, PCA 40, LDA 40, HATS Before Sigmoid, HATS, TRAPS Before Softmax, TRAPS, and PLP 9 Frames]

Page 26: Learning Long-Term Temporal Features

Standalone Feature System

• Transform MLP outputs by:

1. log transform to make features more Gaussian

2. PCA for decorrelation

• Same as Tandem setup introduced by Hermansky, Ellis, and Sharma

• Use transformed MLP outputs as front-end features for the SRI recognizer
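The two transform steps can be sketched as follows. The function name, the `eps` floor inside the logarithm, and the toy posteriors are assumptions for illustration. In a real system the PCA statistics would be estimated on training data and then applied to test data, whereas this sketch estimates and applies them on the same array for brevity.

```python
import numpy as np

def tandem_features(posteriors, keep=None, eps=1e-10):
    """Log-transform MLP posteriors, then decorrelate them with PCA.

    posteriors: (num_frames, num_classes), each row sums to 1.
    keep: optional number of leading PCA dimensions to retain.
    """
    log_post = np.log(posteriors + eps)              # step 1: more Gaussian-like
    centered = log_post - log_post.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)           # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]                # largest variance first
    if keep is not None:
        order = order[:keep]
    return centered @ eigvecs[:, order]              # step 2: decorrelated features

# Toy posteriors: 500 frames over 5 classes.
rng = np.random.default_rng(0)
toy_posteriors = rng.dirichlet(np.ones(5), size=500)
feats = tandem_features(toy_posteriors, keep=3)
```

Decorrelation matters because recognizers of this kind commonly model features with diagonal-covariance Gaussians, which fit poorly when dimensions are strongly correlated.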

Page 27: Learning Long-Term Temporal Features

Standalone Features

[Figure: bar chart of word error rate (y-axis from 36.0% to 50.0%) for 15 Bands x 51 Frames, PCA 40, LDA 40, HATS Before Sigmoid, HATS, TRAPS Before Softmax, TRAPS, and PLP 9 Frames]

Page 28: Learning Long-Term Temporal Features

Combination W/State-of-the-Art Front-End Feature

• SRI’s 2003 PLP front-end feature is 12th-order PLP with three deltas. Heteroscedastic linear discriminant analysis (HLDA) then transforms this 52-dimensional feature vector into the 39-dimensional HLDA(PLP+3d)

• Concatenate PCA-truncated MLP features to HLDA(PLP+3d) and use the result as an augmented front-end feature

– Similar to the Qualcomm-ICSI-OGI features in AURORA
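The dimension bookkeeping and the concatenation step can be sketched as below. Several things are assumptions: that the 52 dimensions arise from 12 cepstra plus one energy term, each with three orders of deltas; the truncation size of the MLP features (25 here), which the slide does not state; and the random matrix standing in for a trained HLDA transform.

```python
import numpy as np

PLP_ORDER = 12
BASE_DIM = PLP_ORDER + 1          # assumption: 12 cepstra plus one energy/c0 term
PLP_DIM = BASE_DIM * (1 + 3)      # statics plus three orders of deltas
HLDA_DIM = 39                     # from the slide
MLP_DIM = 25                      # hypothetical PCA truncation size

def augment_features(plp_feats, hlda_matrix, mlp_feats):
    """Project PLP+deltas with HLDA, then append the MLP-based features."""
    hlda_feats = plp_feats @ hlda_matrix.T           # (frames, 39)
    return np.concatenate([hlda_feats, mlp_feats], axis=1)

rng = np.random.default_rng(0)
num_frames = 8
plp_feats = rng.normal(size=(num_frames, PLP_DIM))
hlda_matrix = rng.normal(size=(HLDA_DIM, PLP_DIM))   # placeholder, not a trained HLDA
mlp_feats = rng.normal(size=(num_frames, MLP_DIM))
augmented = augment_features(plp_feats, hlda_matrix, mlp_feats)
```

The recognizer then sees one longer feature vector per frame, so the back end needs no structural change beyond the input dimension.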

Page 29: Learning Long-Term Temporal Features

Combo W/PLP Baseline Features

[Figure: bar chart of word error rate (y-axis from 32.0% to 38.0%) for HLDA(PLP+3d), 15 Bands x 51 Frames, PCA 40, LDA 40, HATS Before Sigmoid, HATS, TRAPS Before Softmax, TRAPS, PLP 9 Frames, and HATS + PLP 9 Frames]

Page 30: Learning Long-Term Temporal Features

Ranking Table

System                  Frame Acc.  Standalone  Combination
15 Bands x 51 Frames    6           6           6
PCA 40                  5           2           2
LDA 40                  4           3           2
HATS Before Sigmoid     3           4           2
HATS                    1           1           1
TRAPS Before Softmax    2           4           5
TRAPS                   7           7           7

Page 31: Learning Long-Term Temporal Features

Observations

• Across the three testing setups:

1. HATS is always #1

2. The one-stage 15 Bands x 51 Frames system is always #6 (second to last)

3. TRAPS is always last

4. PCA, LDA, HATS before sigmoid, and TRAPS before softmax trade places from one setup to the next

Page 32: Learning Long-Term Temporal Features

Interpretation

• Learning constraints introduced by the 2-stage approach are helpful if done right.

• The non-linear discriminant transform of HATS is better than the linear discriminant transforms from LDA and HATS before sigmoid

• The further mapping from hidden activations to critical-band phone posteriors is not helpful

– Perhaps mapping to critical-band phones is too difficult and inherently noisy

• Finally, like TRAPS, HATS is complementary to the more conventional features and combines synergistically with PLP 9 Frames.

Page 34: Learning Long-Term Temporal Features

Frame Accuracy Performance

System                  Frame Acc.  Rel. Improvement
15 Bands x 51 Frames    64.7%       -
PCA 40                  65.5%       1.2%
LDA 40                  65.5%       1.2%
HATS Before Sigmoid     65.8%       1.7%
HATS                    66.9%       3.4%
TRAPS Before Softmax    65.9%       1.7%
TRAPS                   64.0%       -1.2%
PLP 9 Frames            67.6%       N/A

Page 35: Learning Long-Term Temporal Features

Standalone Features WER

System                  WER         Rel. Improvement
15 Bands x 51 Frames    48.0%       -
PCA 40                  45.3%       5.6%
LDA 40                  46.5%       3.1%
HATS Before Sigmoid     45.9%       4.4%
HATS                    44.5%       7.3%
TRAPS Before Softmax    45.9%       4.4%
TRAPS                   48.2%       -0.4%
PLP 9 Frames            41.2%       N/A
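The relative improvement column appears to follow the usual definition: the WER reduction divided by the baseline WER, with the one-stage 15 Bands x 51 Frames system as the baseline. The helper name below is illustrative; the WER values are copied from the table.

```python
def relative_wer_improvement(baseline_wer, system_wer):
    """Relative WER reduction in percent; positive means fewer errors."""
    return 100.0 * (baseline_wer - system_wer) / baseline_wer

BASELINE_WER = 48.0   # 15 Bands x 51 Frames, standalone
standalone_wer = {
    "PCA 40": 45.3,
    "LDA 40": 46.5,
    "HATS Before Sigmoid": 45.9,
    "HATS": 44.5,
    "TRAPS Before Softmax": 45.9,
    "TRAPS": 48.2,
}
improvements = {
    name: round(relative_wer_improvement(BASELINE_WER, wer), 1)
    for name, wer in standalone_wer.items()
}
```

Rounded to one decimal place, these values match the table's column, for example 7.3 for HATS and -0.4 for TRAPS.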

Page 36: Learning Long-Term Temporal Features

Combo W/PLP Baseline Features

System                  WER         Rel. Improvement
HLDA(PLP+3d)            37.2%       -
15 Bands x 51 Frames    37.1%       0.3%
PCA 40                  36.8%       1.1%
LDA 40                  36.8%       1.1%
HATS Before Sigmoid     36.8%       1.1%
HATS                    36.0%       3.2%
TRAPS Before Softmax    36.9%       0.8%
TRAPS                   37.2%       0.0%
PLP 9 Frames            36.1%       3.0%

Inverse Entropy Combo
HATS + PLP 9 Frames     34.0%       8.6%
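The slide names an inverse entropy combination of the HATS and PLP 9 Frames systems but gives no formula. One common formulation, shown here as an assumption rather than the talk's exact method, weights each stream's posteriors frame by frame in proportion to the inverse of that stream's posterior entropy, so the more confident stream counts for more.

```python
import numpy as np

def inverse_entropy_combine(post_a, post_b, eps=1e-10):
    """Combine two posterior streams of shape (frames, classes) per frame."""
    def entropy(p):
        return -np.sum(p * np.log(p + eps), axis=1)

    inv_a = 1.0 / (entropy(post_a) + eps)            # low entropy gives a high weight
    inv_b = 1.0 / (entropy(post_b) + eps)
    w_a = inv_a / (inv_a + inv_b)                    # weights sum to 1 per frame
    return w_a[:, None] * post_a + (1.0 - w_a)[:, None] * post_b

# One frame: stream A is confident, stream B is uniform.
post_a = np.array([[0.98, 0.01, 0.01]])
post_b = np.array([[1 / 3, 1 / 3, 1 / 3]])
combined = inverse_entropy_combine(post_a, post_b)
```

Because the weights are convex, each combined row is still a valid probability distribution.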

