Page 1

July 13, 2005, NIST Meeting Recognition Workshop

Further Progress in Meeting Recognition: The ICSI-SRI Spring 2005 Speech-to-Text Evaluation System

Andreas Stolcke Xavier Anguera Kofi Boakye

Özgür Çetin František Grézl Adam Janin

Arindam Mandal Barbara Peskin Chuck Wooters Jing Zheng

International Computer Science Institute, Berkeley, CA, USA

SRI International, Menlo Park, CA, USA

Technical University of Catalonia, Barcelona, Spain

Brno University of Technology, Czech Republic

University of Washington, Seattle, WA, USA

Page 2

Overview

• Data and Tasks
• Audio preprocessing
  – New MDM delay-sum processing
  – New IHM cross-talk suppression
• SRI decoding architecture
• Acoustic modeling
  – Baseline improvement
  – Model adaptation + MLP feature adaptation
  – CTS/BN model combination
• Language modeling
• Lecture recognition system
• Overall results
• Conclusions and future work

Page 3

Data Sets

• eval05: NIST RT-05S conference meetings
  – One previously unseen meeting source (VT)
• eval04: NIST RT-04S conference meetings
  – Unbiased test set
  – Includes lapel channel in ICSI meeting
• dev04a: RT-04S devtest meetings + 2 AMI excerpts
  – Used for development and tuning
  – In spite of lapel mics in CMU and LDC meetings
• Meeting training data
  – AMI (35 meetings, 16 hours)
  – CMU (17 meetings, 11 hours): lapel personal mics, no distant mics
  – ICSI (73 meetings, 74 hours)
  – NIST (15 meetings, 14 hours)
• Acoustic background training data
  – CTS (Switchboard + Fisher, 2300 hours)
  – BN (Hub-4 + TDT2 + TDT4, 900 hours)

Page 4

Evaluation Tasks

Conference room meetings:
• MDM   Multiple distant microphones (primary)
• IHM   Individual headset microphones (required contrast)
• SDM   Single distant microphone (optional)

Lecture room meetings (in addition to the above):
• MSLA  Multiple source-localization arrays (optional)
• MM3A  Multiple Mark III microphone arrays (optional)
  – Available only for CHIL devtest data (single array)

Results reported are for the conference room task unless noted otherwise.
(Lecture results at the end.)

Page 5

Development Strategy: Base System

[Diagram: the base SRI CTS system. Training: telephone training data and transcripts, plus web data, produce the acoustic models & features and the language models. Recognition: the SRI CTS decoding system applies these models to telephone test data, producing output such as "I I think that uh…".]

Page 6

Development Strategy: Meeting System

[Diagram: the meeting system. Training: meeting training data, meeting web data, and (for the lecture system) proceedings data adapt the CTS/BN acoustic models and CTS language models into meeting-adapted acoustic models & features and meeting-adapted language models. Recognition: meeting test data is first preprocessed (noise filtering, segmentation, and array processing for MDM only), then decoded with the SRI CTS system, producing output such as "First on the agenda…".]

Page 7

Acoustic Preprocessing

Recognition
• Distant microphones
  – Noise reduction using Wiener filtering on all input channels
  – Delay-sum beamforming of all channels into a single enhanced channel (MDM)
  – Waveform segmentation (speech/nonspeech HMM decoding)
  – Segment clustering (for cepstral normalization, unsupervised adaptation)
• Close-talking (personal) microphones
  – No noise reduction (tried it, no gains)
  – Waveform segmentation
  – Crosstalk detection and filtering

Training
• Distant microphones
  – Eliminate overlapping speech (based on personal-mic word alignment times)
  – Noise filtering
  – No delay-sum processing
  – Models trained on all distant channels

Page 8

Improved MDM Delay-Summing

• RT-04S: delay-summing after segmentation
  – One time delay of arrival (TDOA) per segment
• RT-05S: delay-summing before segmentation
  – Smoothed TDOA estimation every 250 ms
  – SDM channel used as reference for all other channels
  – GCC-PHAT, window size 500 ms
  – Details in the diarization system talk & paper
• Results on eval04 (multi-channel meetings only), RT-04S models:
  – 6.5% relative gain between old and new MDM processing
  – Mostly from improved delay-sum, not from segmentation

  Input  Delay-sum  Segmentation on   WER
  SDM    none       SDM               48.9
  MDM    old        SDM               42.9
  MDM    new        SDM               40.3
  MDM    new        delay-summed      40.1
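The GCC-PHAT delay estimation plus delay-and-sum combination can be sketched as below. This is a minimal single-shift illustration in NumPy (with hypothetical function names), not the evaluation system's implementation, which smooths TDOA estimates every 250 ms over 500 ms windows:

```python
import numpy as np

def gcc_phat_tdoa(sig, ref, fs, max_tau=0.01):
    """Estimate the time delay of arrival (TDOA) of `sig` relative to
    `ref` with GCC-PHAT (generalized cross-correlation, phase transform)."""
    n = len(sig) + len(ref)
    R = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    R /= np.abs(R) + 1e-12                 # PHAT weighting: keep phase only
    cc = np.fft.irfft(R, n=n)
    max_shift = min(int(fs * max_tau), n // 2)
    # reorder so lags run from -max_shift .. +max_shift
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs

def delay_and_sum(channels, fs, ref_idx=0):
    """Align each channel to the reference by its estimated TDOA,
    then average into a single enhanced channel."""
    ref = channels[ref_idx]
    out = np.zeros(len(ref))
    for ch in channels:
        shift = int(round(gcc_phat_tdoa(ch, ref, fs) * fs))
        out += np.roll(ch, -shift)
    return out / len(channels)
```

In the real system, per-window TDOAs are additionally smoothed over time before summing, which this sketch omits.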

Page 9

IHM Crosstalk Filtering

• Energy-based detection of foreground vs. background speech
  – Subtract the minimum energy (noise floor) from each channel
  – For each channel, subtract the average energy of all other channels
  – Threshold at zero
  – Intersect foreground segments with speech/nonspeech detector output
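The first three energy-based steps can be sketched as follows, assuming frame-level energies per channel are already computed (a simplified illustration; the intersection with the speech/nonspeech detector is omitted):

```python
import numpy as np

def foreground_mask(frame_energy):
    """Per-channel foreground-speech detection from frame energies.

    frame_energy: (n_channels, n_frames) frame energies, one row per
    personal-mic channel. Returns a boolean mask that is True where a
    channel's own speaker is judged to be the foreground talker.
    """
    e = np.asarray(frame_energy, dtype=float)
    # 1) subtract each channel's noise floor (its minimum energy)
    e = e - e.min(axis=1, keepdims=True)
    # 2) for each channel, subtract the average energy of all others
    n_ch = e.shape[0]
    others_mean = (e.sum(axis=0, keepdims=True) - e) / (n_ch - 1)
    score = e - others_mean
    # 3) threshold at zero: positive score means foreground speech
    return score > 0
```

A channel that is merely picking up another talker's speech has lower relative energy than that talker's own mic, so its score goes negative and its frames are filtered out, which is why the gains below come mostly from removed insertions.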

• Results using RT-05S eval models:
  – Crosstalk filter reduces the error difference to the reference segmentation by 1/3
  – Mostly by reducing insertion errors
  – eval05 ICSI result got worse with filtering (20.5 -> 24.5)
  – eval05 NIST meeting had a speakerphone and 3 empty channels: WER 34.5 (reference: 21.4)
  – Using Sheffield segmentation on NIST meetings: WER 28.3

  Crosstalk   eval04  eval05
  filter      WER     WER    Sub   Del   Ins
  No          35.4    29.3   11.0  10.3  8.0
  Yes         34.3    25.9   11.0  11.5  3.4
  Reference   32.1    19.5   11.2   6.7  1.6

Page 10

“Fast” Decoding Architecture

[Diagram: the "fast" pipeline. An MFCC non-crossword (nonCW) decoding generates thin lattices; a PLP crossword (CW) rescoring pass follows; confusion network combination produces the 3xRT output.
Runtime: 3xRT on a 3.4 GHz Xeon CPU.
Lattice rescoring: 4-gram LM.
N-best rescoring: 4-gram LM, word duration model, pause duration model.]

Page 11

Full System Architecture

[Diagram: the full pipeline extends the fast system. MFCC nonCW decoding and PLP CW rescoring produce the 3xRT output; additional MFCC CW, MFCC nonCW, and PLP CW passes over thick lattices, combined via confusion networks, produce the 20xRT output.
Runtime: 12xRT for CTS (with Gaussian shortlists), 25xRT on meetings (no Gaussian shortlists).]

Page 12

Acoustic Features and Models

• MFCC within-word and crossword triphone models
  – Augmented with 2 x 5 voicing features (5 frames around the current frame)
  – Augmented with 25-dim Tandem/HATS phone posterior features estimated by a multilayer perceptron (MLP features)
• PLP crossword triphone models
• Normalization and adaptation:
  – CMN + CVN, VTLN
  – HLDA
  – CMLLR (SAT) in training and test (except in the first decoding)
  – MLLR with phone loop in the first MFCC and PLP decodings
  – MLLR cross-adaptation in subsequent steps
• Baseline RT-04F models trained on 1400 h of CTS data
  – All Hub-5 (Switchboard, CallHome English)
  – Every other Fisher utterance (complementary sets for MFCC and PLP models)
  – Gender-dependent
• Also: PLP models trained on 900 h of TDT BN data
  – Gender-independent
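Of the normalizations listed, CMN + CVN is the simplest to illustrate: each cepstral dimension is shifted to zero mean and scaled to unit variance over a segment cluster. A minimal sketch (not the SRI implementation):

```python
import numpy as np

def cmn_cvn(feats):
    """Cepstral mean and variance normalization over one segment
    cluster. feats: (n_frames, n_dims) cepstral features."""
    feats = np.asarray(feats, dtype=float)
    mu = feats.mean(axis=0)                       # per-dimension mean
    sigma = np.maximum(feats.std(axis=0), 1e-8)   # guard constant dims
    return (feats - mu) / sigma
```

The segment clustering from the preprocessing stage determines which frames share one (mu, sigma) pair.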

Page 13

Baseline Model Improvements

• Compare RT-04S baseline CTS models to RT-04F CTS models
• Model improvements:
  – Added Fisher training data
  – MLP features
  – MPE training instead of MMIE
  – Decision-tree-based triphone clustering instead of bottom-up clustering
• Improvement on Fisher CTS data: 28% relative WER reduction
• Results using the RT-05S eval language model:

                    MDM              IHM
  Models            eval04  eval05   eval04  eval05
  RT-03F CTS        48.3    40.2     33.2    30.8
  RT-04F CTS        41.4    34.5     28.9    28.6
  Rel. reduction    14.3%   14.2%    10.2%   7.1%

Page 14

Acoustic Adaptation

• Model adaptation using MAP, Gaussian means only
• MLP feature adaptation:
  – 3 additional backprop iterations on the CTS MLP, using meeting data
  – Due to lack of time: adapted using only ICSI+CMU+NIST close-talking data
  – New MLP used for both IHM and MDM feature generation
• Model and MLP adaptation give similar gains
• Model and MLP adaptation are partly additive

  Base Models  Gaussians  MLP       MDM              IHM
               adapted?   adapted?  eval04  eval05   eval04  eval05
  RT-04F CTS   No         No        41.4    34.5     28.9    28.6
  RT-04F CTS   No         Yes       41.1    34.2     28.4    27.0
  RT-04F CTS   Yes        No        n/a     n/a      28.6    26.9
  RT-04F CTS   Yes        Yes       40.3    32.2     28.3    26.2
  Total relative reduction           2.7%    6.7%     2.1%    8.4%
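Means-only MAP adaptation interpolates each Gaussian mean between its CTS-trained prior and the statistics collected on meeting data, weighted by component occupancy. A minimal sketch under standard MAP assumptions (the relevance factor `tau` and the function name are illustrative, not the system's actual parameters):

```python
import numpy as np

def map_adapt_means(prior_means, gammas, data, tau=10.0):
    """Means-only MAP adaptation of Gaussian mixture components.

    prior_means: (n_comp, dim)     means of the background (CTS) models
    gammas:      (n_frames, n_comp) component posteriors on the
                                    adaptation (meeting) data
    data:        (n_frames, dim)    adaptation features
    tau:         prior weight; rarely observed components stay near
                 their CTS prior, well-observed ones move to the data
    """
    occ = gammas.sum(axis=0)            # per-component occupancy
    weighted = gammas.T @ data          # first-order statistics
    return (tau * prior_means + weighted) / (tau + occ)[:, None]
```

This is the ML-MAP form; the MMI-MAP variant on a later slide replaces the ML statistics with discriminative ones plus i-smoothing.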

Page 15

MLP Feature Portability

• How well do MLP features trained on CTS generalize?
• MLP features improve the system by 2% absolute, 10% relative, on RT-04F CTS eval data
• Compare effectiveness using meeting-adapted CTS models and MLP features:
• Improvement on eval05 is comparable to CTS

  Base Models  MLP features?   IHM
                               eval04  eval05
  RT-04F CTS   No              29.4    28.6
  RT-04F CTS   Yes             28.3    26.2
  Relative reduction           3.7%    8.4%

Page 16

Combining CTS and BN Base Models

• CTS models match the meeting domain in speaking style
• BN models are better matched to the acoustic properties
  – Signal bandwidth
  – (Some) noisy environments
  – Distant microphones
• Use meeting-adapted BN models in the PLP decoding
• Substantial gains on distant microphones
• Inconsistent gains on close-talking mics (not used in the eval system)
• Bug: adapted BN models only to male speakers (only 0.1% loss)

  Base Models        MDM              IHM
  MFCC   PLP         eval04  eval05   eval04  eval05
  CTS    CTS         40.3    32.2     28.3    26.2
  CTS    BN          37.1    30.2     28.0    26.3
  Relative reduction 7.9%    6.2%     1.1%    -0.4%

Page 17

Discriminative MAP Adaptation

• Preserve the discriminative properties of MPE base models in adaptation
• Best results using the MMIE criterion on training data with i-smoothing of the MPE base parameters (MMI-MAP, Povey et al., Eurospeech 2003)
• The MDM eval system used ML-MAP models (due to lack of time)

  Base Models  Gaussian adaptation   IHM
                                     eval04  eval05
  RT-04F CTS   ML-MAP                28.3    26.2
  RT-04F CTS   MMI-MAP               27.9    25.9
  Relative reduction                 1.4%    1.1%

Page 18

Test-time MLLR Improvement

• After the evaluation, improved unsupervised MLLR
• Replaced the hand-designed phonetic regression class tree with a more standard data-driven decision tree (generated in model training)
• Part of Arindam Mandal's UW thesis work on speaker-dependent MLLR class prediction (Eurospeech 2005)

  Regression tree     MDM              IHM
  method              eval04  eval05   eval04  eval05
  Hand-designed       37.1    30.2     27.9    25.9
  Decision tree       36.8    29.8     28.1    25.6
  Relative reduction  0.8%    1.3%     -0.7%   1.1%

Page 19

Conference Meeting LMs

• Linearly interpolated mixture N-gram LMs
  – Multiword bigram for lattice generation
  – Multiword trigram for lattice decoding
  – Word-based 4-gram for rescoring
  – All LMs entropy-pruned
• Conference meeting LM components
  (1) Switchboard CTS transcripts (6.5M words)
  (2) Fisher CTS (23M)
  (3) Hub4 and TDT4 BN transcripts (140M)
  (4) AMI, CMU, ICSI, and NIST meeting transcripts (1M)
  (5) Web data selected to match meeting (268M) and Fisher (530M) transcripts
• Perplexity optimized on held-out 2004 (non-AMI) data
• Vocabulary: 54K words
  – All words in Switchboard and RT-04S meeting transcripts
  – All non-singletons in Fisher, AMI devtest
  – OOV rates: 0.40% on eval04, 0.19% on AMI devtest
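Linear interpolation of the component LMs, with mixture weights re-estimated by EM to minimize held-out perplexity, can be sketched as follows. This toy version works on per-word probability streams from each component; production systems apply the same idea to full n-gram models with toolkit support (e.g., SRILM-style interpolation):

```python
import math

def interpolate(prob_streams, lambdas):
    """Mixture probability P(w) = sum_i lambda_i * P_i(w), given one
    held-out probability stream per component LM."""
    return [sum(l * p for l, p in zip(lambdas, probs))
            for probs in zip(*prob_streams)]

def em_update(prob_streams, lambdas):
    """One EM re-estimation of the mixture weights on held-out data:
    each weight becomes the average posterior of its component."""
    n = len(prob_streams[0])
    post = [0.0] * len(lambdas)
    for probs in zip(*prob_streams):
        denom = sum(l * p for l, p in zip(lambdas, probs))
        for i, (l, p) in enumerate(zip(lambdas, probs)):
            post[i] += l * p / denom
    return [s / n for s in post]

def perplexity(word_probs):
    """Perplexity of the held-out data under the given probabilities."""
    return math.exp(-sum(math.log(p) for p in word_probs) / len(word_probs))
```

Each EM iteration is guaranteed not to increase held-out perplexity, which is what "perplexity optimized on held-out data" amounts to.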

Page 20

Lecture Meeting LMs

• Similar to the conference meeting LM, but
  – Added TED oral transcripts (0.1M words)
  – Added speech conference proceedings (28M)
  – Removed Fisher web data
  – Added web data based on speech conference proceedings (120M)
• Vocabulary: added 3781 words from conference proceedings
  – OOV rate on CHIL devtest: 0.18%
• Perplexity optimized on CHIL devtest
  – Jackknifed for development testing
  – On the full devtest for the eval system
• No CHIL transcripts used for N-gram training

Page 21

LM Results (see the paper for perplexity results)

                    eval04                          2005 devtest
  Language model    All   CMU   ICSI  LDC   NIST   AMI   CHIL
  RT-04F (CTS)      28.7  32.1  24.5  34.1  21.5   39.8  37.6
  RT-04S (mtg)      28.7  33.1  22.0  35.7  21.1   38.3  31.5
  RT-05S, no web    28.9  33.4  22.0  36.0  21.5   38.4  27.6
  RT-05S, w/ web    27.9  32.5  21.4  34.9  20.2   37.3  26.9

• Conference WER reduced 1.2% compared to the 2004 LM, due to
  – AMI training data
  – New web data
  – Additional Fisher data
• Lecture WER reduced 4.6% without using CHIL N-gram data
• Web data still useful to reduce WER
  – Conference meetings: 1.0-1.1%
  – Lecture meetings: 0.7% (redundant with proceedings?)
• No significant gains from source-specific LMs
  – But note: best results for CMU & LDC achieved using the CTS LM!

Page 22

Overall Results: Conference Meetings

• Relative WER reduction on eval04 data:
  – 17.4% for MDM
  – 16.9% for IHM
  – (SDM results not directly comparable due to system differences)
• Narrowing IHM/MDM gap?
  – Only 17% relative difference on eval05
  – But unchanged on eval04, in spite of improved MDM preprocessing and modeling
• AMI data different in character, but not much in system performance
  – Possible issues with array mics
• Performed well on "blind" test
  – Virginia Tech WER better than average
• NIST IHM results due to unusual meeting properties

  System    MDM    SDM     IHM
  eval04
  RT-04S    44.9   51.3*   33.6    (*fast system)
  RT-05S    37.1   43.0    27.9
  eval05
  RT-05S    30.2   40.9    25.9

  WER by source (RT-05S, eval05):
         AMI    CMU    ICSI   NIST   VT
  MDM    33.4   31.7   29.9   29.2   27.8
  IHM    23.3   23.3   24.5   34.5   23.3

Page 23

Lecture System Development

• Development on CHIL Jan 2005 data (NIST devtest)
• No CHIL data used for acoustic training or adaptation
  – Reused meeting-adapted models
• Transcripts used for LM mixture weight estimation (jackknifed testing)
• Model score weights not optimized
• Crosstalk filtering never tested (just ran it)
  – Devtest data had only one IHM channel

Findings
• Best to use a single speaker cluster for distant mics
  – The lecture speaker dominates and/or clustering is not good enough
• Delay-sum processing on tabletop mics is worse than the single mic with the best SNR
  – SNRs vary greatly between tabletop mics

Page 24

Lecture Recognition Results

• IHM results comparable to conference meetings
• Distant-mic WER > 10% absolute worse
  – Lecture room acoustics more challenging than conference room
• Error rates comparable to results reported by the CHIL team
• MDM better than SDM on devtest, but not on eval
  – MDM uses the single mic with the best SNR
• MSLA: little gain on devtest, but works much better on eval
• MM3A somewhat better than MSLA on devtest
  – Where is the eval data, John?

                MDM    MSLA   MM3A             SDM    IHM
  2005 devtest  51.6   51.0   49.7 (one array) 56.2   26.9
  eval05        52.0   44.8   ?                51.9   28.0

Page 25

Conclusions

• Considerable progress in meeting recognition: 17% relative WER reduction on conference meetings
• Leveraged LVCSR progress in CTS and BN
  – 2-3% absolute gain on MDM by combining CTS- and BN-based models
• Improved preprocessing for MDM (delay-sum) and IHM (crosstalk suppression)
• Effective adaptation of all modeling components
  – Features (MLP-based)
  – Acoustic models
  – Language models (web data is important)
• System generalized well to
  – A new conference meeting type with training data (AMI)
  – A meeting source & domain without training data (VT)
  – The lecture task (adapting only the language model)

Page 26

Future Work

• Fix the things we didn't have time for
  – MMI-MAP for distant-mic models
  – Adapt MLPs for distant-mic features
• For IHM: forget recognition (already works well); solve the speaker segmentation problem!
• Do a better job on lecture recognition
  – Use TED acoustic data
  – Get MDM to work better than SDM
  – Estimate model weights (LM weight, insertion penalty, …) properly
• Explore feature mapping techniques (e.g., MLLR) to reduce the mismatch of background training data
• Adapt models to non-native speakers of American English
  – Germans, Brits, Scots, …
  – cf. Arlo Faria's poster, MLME-05
• More generally, more adaptation to meeting type & content

Page 27

Thank You!

