Page 1

July 13, 2005, NIST Meeting Recognition Workshop

Further Progress in Meeting Recognition: The ICSI-SRI Spring 2005 Speech-to-Text Evaluation System

Andreas Stolcke Xavier Anguera Kofi Boakye

Özgür Çetin František Grézl Adam Janin

Arindam Mandal Barbara Peskin Chuck Wooters Jing Zheng

International Computer Science Institute, Berkeley, CA, USA

SRI International, Menlo Park, CA, USA

Technical University of Catalonia, Barcelona, Spain

Brno University of Technology, Czech Republic

University of Washington, Seattle, WA, USA

Page 2

Overview

• Data and Tasks
• Audio preprocessing
  – New MDM delay-sum processing
  – New IHM cross-talk suppression
• SRI decoding architecture
• Acoustic modeling
  – Baseline improvement
  – Model adaptation + MLP feature adaptation
  – CTS/BN model combination
• Language modeling
• Lecture recognition system
• Overall results
• Conclusions and future work

Page 3

Data Sets

• eval05: NIST RT-05S conference meetings
  – One previously unseen meeting source (VT)
• eval04: NIST RT-04S conference meetings
  – Unbiased test set
  – Includes lapel channel in ICSI meeting
• dev04a: RT-04S devtest meetings + 2 AMI excerpts
  – Used for development and tuning
  – In spite of lapel mics in CMU and LDC meetings
• Meeting training data
  – AMI (35 meetings, 16 hours)
  – CMU (17 meetings, 11 hours): lapel personal mics, no distant mics
  – ICSI (73 meetings, 74 hours)
  – NIST (15 meetings, 14 hours)
• Acoustic background training data
  – CTS (Switchboard + Fisher, 2300 hours)
  – BN (Hub-4 + TDT2 + TDT4, 900 hours)

Page 4

Evaluation Tasks

Conference room meetings:
• MDM   Multiple distant microphones (primary)
• IHM   Individual headset microphones (required contrast)
• SDM   Single distant microphone (optional)

Lecture room meetings (in addition to the above):
• MSLA  Multiple source-localization arrays (optional)
• MM3A  Multiple Mark III microphone arrays (optional)
  – Available only for CHIL devtest data (single array)

Results reported are for the conference room task unless noted otherwise.
(Lecture results at the end.)

Page 5

Development Strategy: Base System

[Diagram: the base SRI CTS system. Training: telephone training data and transcripts, plus web data, produce the acoustic models & features and the language models. Recognition: the SRI CTS decoding system applies these models to telephone test data, producing output such as "I I think that uh…".]

Page 6

Development Strategy: Meeting System

[Diagram: the meeting system. Training: meeting training data, meeting web data, and (for the lecture system) proceedings data adapt the CTS/BN acoustic models and CTS language models into meeting-adapted acoustic models & features and meeting-adapted language models. Recognition: meeting test data is first preprocessed (noise filtering, segmentation, and array processing for MDM only), then decoded with the SRI CTS system, producing output such as "First on the agenda…".]

Page 7

Acoustic Preprocessing

Recognition
• Distant microphones
  – Noise reduction using Wiener filtering on all input channels
  – Delay-sum beamforming of all channels into a single enhanced channel (MDM)
  – Waveform segmentation (speech/nonspeech HMM decoding)
  – Segment clustering (for cepstral normalization, unsupervised adaptation)
• Close-talking (personal) microphones
  – No noise reduction (tried it, no gains)
  – Waveform segmentation
  – Crosstalk detection and filtering

Training
• Distant microphones
  – Eliminate overlapping speech (based on personal-mic word alignment times)
  – Noise filtering
  – No delay-sum processing
  – Models trained on all distant channels

Page 8

Improved MDM Delay-Summing

• RT-04S: delay-summing after segmentation
  – One time delay of arrival (TDOA) per segment
• RT-05S: delay-summing before segmentation
  – Smoothed TDOA estimation every 250 ms
  – SDM channel used as reference for all other channels
  – GCC-PHAT, window size 500 ms
  – Details in the diarization system talk & paper
• Results on eval04 (multi-channel meetings only), RT-04S models:
  – 6.5% relative gain between old and new MDM processing
  – Mostly from improved delay-sum, not from segmentation

  Input  Delay-sum  Segmentation on   WER
  SDM    none       SDM               48.9
  MDM    old        SDM               42.9
  MDM    new        SDM               40.3
  MDM    new        delay-summed      40.1
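The GCC-PHAT delay estimation plus delay-and-sum combination can be sketched as below. This is a minimal single-shift illustration in NumPy (with hypothetical function names), not the evaluation system's implementation, which smooths TDOA estimates every 250 ms over 500 ms windows:

```python
import numpy as np

def gcc_phat_tdoa(sig, ref, fs, max_tau=0.01):
    """Estimate the time delay of arrival (TDOA) of `sig` relative to
    `ref` with GCC-PHAT (generalized cross-correlation, phase transform)."""
    n = len(sig) + len(ref)
    R = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    R /= np.abs(R) + 1e-12                 # PHAT weighting: keep phase only
    cc = np.fft.irfft(R, n=n)
    max_shift = min(int(fs * max_tau), n // 2)
    # reorder so lags run from -max_shift .. +max_shift
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs

def delay_and_sum(channels, fs, ref_idx=0):
    """Align each channel to the reference by its estimated TDOA,
    then average into a single enhanced channel."""
    ref = channels[ref_idx]
    out = np.zeros(len(ref))
    for ch in channels:
        shift = int(round(gcc_phat_tdoa(ch, ref, fs) * fs))
        out += np.roll(ch, -shift)
    return out / len(channels)
```

In the real system, per-window TDOAs are additionally smoothed over time before summing, which this sketch omits.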

Page 9

IHM Crosstalk Filtering

• Energy-based detection of foreground vs. background speech
  – Subtract the minimum energy (noise floor) from each channel
  – For each channel, subtract the average energy of all other channels
  – Threshold at zero
  – Intersect foreground segments with speech/nonspeech detector output
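The first three energy-based steps can be sketched as follows, assuming frame-level energies per channel are already computed (a simplified illustration; the intersection with the speech/nonspeech detector is omitted):

```python
import numpy as np

def foreground_mask(frame_energy):
    """Per-channel foreground-speech detection from frame energies.

    frame_energy: (n_channels, n_frames) frame energies, one row per
    personal-mic channel. Returns a boolean mask that is True where a
    channel's own speaker is judged to be the foreground talker.
    """
    e = np.asarray(frame_energy, dtype=float)
    # 1) subtract each channel's noise floor (its minimum energy)
    e = e - e.min(axis=1, keepdims=True)
    # 2) for each channel, subtract the average energy of all others
    n_ch = e.shape[0]
    others_mean = (e.sum(axis=0, keepdims=True) - e) / (n_ch - 1)
    score = e - others_mean
    # 3) threshold at zero: positive score means foreground speech
    return score > 0
```

A channel that is merely picking up another talker's speech has lower relative energy than that talker's own mic, so its score goes negative and its frames are filtered out, which is why the gains below come mostly from removed insertions.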

• Results using RT-05S eval models:
  – Crosstalk filter reduces the error difference to the reference segmentation by 1/3
  – Mostly by reducing insertion errors
  – eval05 ICSI result got worse with filtering (20.5 -> 24.5)
  – eval05 NIST meeting had a speakerphone and 3 empty channels: WER 34.5 (reference: 21.4)
  – Using Sheffield segmentation on NIST meetings: WER 28.3

  Crosstalk   eval04  eval05
  filter      WER     WER    Sub   Del   Ins
  No          35.4    29.3   11.0  10.3  8.0
  Yes         34.3    25.9   11.0  11.5  3.4
  Reference   32.1    19.5   11.2   6.7  1.6

Page 10

“Fast” Decoding Architecture

[Diagram: the "fast" pipeline. An MFCC non-crossword (nonCW) decoding generates thin lattices; a PLP crossword (CW) rescoring pass follows; confusion network combination produces the 3xRT output.
Runtime: 3xRT on a 3.4 GHz Xeon CPU.
Lattice rescoring: 4-gram LM.
N-best rescoring: 4-gram LM, word duration model, pause duration model.]

Page 11

Full System Architecture

[Diagram: the full pipeline extends the fast system. MFCC nonCW decoding and PLP CW rescoring produce the 3xRT output; additional MFCC CW, MFCC nonCW, and PLP CW passes over thick lattices, combined via confusion networks, produce the 20xRT output.
Runtime: 12xRT for CTS (with Gaussian shortlists), 25xRT on meetings (no Gaussian shortlists).]

Page 12

Acoustic Features and Models

• MFCC within-word and crossword triphone models
  – Augmented with 2 x 5 voicing features (5 frames around the current frame)
  – Augmented with 25-dim Tandem/HATS phone posterior features estimated by a multilayer perceptron (MLP features)
• PLP crossword triphone models
• Normalization and adaptation:
  – CMN + CVN, VTLN
  – HLDA
  – CMLLR (SAT) in training and test (except in the first decoding)
  – MLLR with phone loop in the first MFCC and PLP decodings
  – MLLR cross-adaptation in subsequent steps
• Baseline RT-04F models trained on 1400 h of CTS data
  – All Hub-5 (Switchboard, CallHome English)
  – Every other Fisher utterance (complementary sets for MFCC and PLP models)
  – Gender-dependent
• Also: PLP models trained on 900 h of TDT BN data
  – Gender-independent
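Of the normalizations listed, CMN + CVN is the simplest to illustrate: each cepstral dimension is shifted to zero mean and scaled to unit variance over a segment cluster. A minimal sketch (not the SRI implementation):

```python
import numpy as np

def cmn_cvn(feats):
    """Cepstral mean and variance normalization over one segment
    cluster. feats: (n_frames, n_dims) cepstral features."""
    feats = np.asarray(feats, dtype=float)
    mu = feats.mean(axis=0)                       # per-dimension mean
    sigma = np.maximum(feats.std(axis=0), 1e-8)   # guard constant dims
    return (feats - mu) / sigma
```

The segment clustering from the preprocessing stage determines which frames share one (mu, sigma) pair.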

Page 13

Baseline Model Improvements

• Compare RT-04S baseline CTS models to RT-04F CTS models
• Model improvements:
  – Added Fisher training data
  – MLP features
  – MPE training instead of MMIE
  – Decision-tree-based triphone clustering instead of bottom-up clustering
• Improvement on Fisher CTS data: 28% relative WER reduction
• Results using the RT-05S eval language model:

                    MDM              IHM
  Models            eval04  eval05   eval04  eval05
  RT-03F CTS        48.3    40.2     33.2    30.8
  RT-04F CTS        41.4    34.5     28.9    28.6
  Rel. reduction    14.3%   14.2%    10.2%   7.1%

Page 14

Acoustic Adaptation

• Model adaptation using MAP, Gaussian means only
• MLP feature adaptation:
  – 3 additional backprop iterations on the CTS MLP, using meeting data
  – Due to lack of time: adapted using only ICSI+CMU+NIST close-talking data
  – New MLP used for both IHM and MDM feature generation
• Model and MLP adaptation give similar gains
• Model and MLP adaptation are partly additive

  Base Models  Gaussians  MLP       MDM              IHM
               adapted?   adapted?  eval04  eval05   eval04  eval05
  RT-04F CTS   No         No        41.4    34.5     28.9    28.6
  RT-04F CTS   No         Yes       41.1    34.2     28.4    27.0
  RT-04F CTS   Yes        No        n/a     n/a      28.6    26.9
  RT-04F CTS   Yes        Yes       40.3    32.2     28.3    26.2
  Total relative reduction           2.7%    6.7%     2.1%    8.4%
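Means-only MAP adaptation interpolates each Gaussian mean between its CTS-trained prior and the statistics collected on meeting data, weighted by component occupancy. A minimal sketch under standard MAP assumptions (the relevance factor `tau` and the function name are illustrative, not the system's actual parameters):

```python
import numpy as np

def map_adapt_means(prior_means, gammas, data, tau=10.0):
    """Means-only MAP adaptation of Gaussian mixture components.

    prior_means: (n_comp, dim)     means of the background (CTS) models
    gammas:      (n_frames, n_comp) component posteriors on the
                                    adaptation (meeting) data
    data:        (n_frames, dim)    adaptation features
    tau:         prior weight; rarely observed components stay near
                 their CTS prior, well-observed ones move to the data
    """
    occ = gammas.sum(axis=0)            # per-component occupancy
    weighted = gammas.T @ data          # first-order statistics
    return (tau * prior_means + weighted) / (tau + occ)[:, None]
```

This is the ML-MAP form; the MMI-MAP variant on a later slide replaces the ML statistics with discriminative ones plus i-smoothing.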

Page 15

MLP Feature Portability

• How well do MLP features trained on CTS generalize?
• MLP features improve the system by 2% absolute, 10% relative, on RT-04F CTS eval data
• Compare effectiveness using meeting-adapted CTS models and MLP features:
• Improvement on eval05 is comparable to CTS

  Base Models  MLP features?   IHM
                               eval04  eval05
  RT-04F CTS   No              29.4    28.6
  RT-04F CTS   Yes             28.3    26.2
  Relative reduction           3.7%    8.4%

Page 16

Combining CTS and BN Base Models

• CTS models match the meeting domain in speaking style
• BN models are better matched to the acoustic properties
  – Signal bandwidth
  – (Some) noisy environments
  – Distant microphones
• Use meeting-adapted BN models in the PLP decoding
• Substantial gains on distant microphones
• Inconsistent gains on close-talking mics (not used in the eval system)
• Bug: adapted BN models only to male speakers (only 0.1% loss)

  Base Models        MDM              IHM
  MFCC   PLP         eval04  eval05   eval04  eval05
  CTS    CTS         40.3    32.2     28.3    26.2
  CTS    BN          37.1    30.2     28.0    26.3
  Relative reduction 7.9%    6.2%     1.1%    -0.4%

Page 17

Discriminative MAP Adaptation

• Preserve the discriminative properties of MPE base models in adaptation
• Best results using the MMIE criterion on training data with i-smoothing of the MPE base parameters (MMI-MAP, Povey et al., Eurospeech 2003)
• The MDM eval system used ML-MAP models (due to lack of time)

  Base Models  Gaussian adaptation   IHM
                                     eval04  eval05
  RT-04F CTS   ML-MAP                28.3    26.2
  RT-04F CTS   MMI-MAP               27.9    25.9
  Relative reduction                 1.4%    1.1%

Page 18

Test-time MLLR Improvement

• After the evaluation, improved unsupervised MLLR
• Replaced the hand-designed phonetic regression class tree with a more standard data-driven decision tree (generated in model training)
• Part of Arindam Mandal's UW thesis work on speaker-dependent MLLR class prediction (Eurospeech 2005)

  Regression tree     MDM              IHM
  method              eval04  eval05   eval04  eval05
  Hand-designed       37.1    30.2     27.9    25.9
  Decision tree       36.8    29.8     28.1    25.6
  Relative reduction  0.8%    1.3%     -0.7%   1.1%

Page 19

Conference Meeting LMs

• Linearly interpolated mixture N-gram LMs
  – Multiword bigram for lattice generation
  – Multiword trigram for lattice decoding
  – Word-based 4-gram for rescoring
  – All LMs entropy-pruned
• Conference meeting LM components
  (1) Switchboard CTS transcripts (6.5M words)
  (2) Fisher CTS (23M)
  (3) Hub4 and TDT4 BN transcripts (140M)
  (4) AMI, CMU, ICSI, and NIST meeting transcripts (1M)
  (5) Web data selected to match meeting (268M) and Fisher (530M) transcripts
• Perplexity optimized on held-out 2004 (non-AMI) data
• Vocabulary: 54K words
  – All words in Switchboard and RT-04S meeting transcripts
  – All non-singletons in Fisher, AMI devtest
  – OOV rates: 0.40% on eval04, 0.19% on AMI devtest
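Linear interpolation of the component LMs, with mixture weights re-estimated by EM to minimize held-out perplexity, can be sketched as follows. This toy version works on per-word probability streams from each component; production systems apply the same idea to full n-gram models with toolkit support (e.g., SRILM-style interpolation):

```python
import math

def interpolate(prob_streams, lambdas):
    """Mixture probability P(w) = sum_i lambda_i * P_i(w), given one
    held-out probability stream per component LM."""
    return [sum(l * p for l, p in zip(lambdas, probs))
            for probs in zip(*prob_streams)]

def em_update(prob_streams, lambdas):
    """One EM re-estimation of the mixture weights on held-out data:
    each weight becomes the average posterior of its component."""
    n = len(prob_streams[0])
    post = [0.0] * len(lambdas)
    for probs in zip(*prob_streams):
        denom = sum(l * p for l, p in zip(lambdas, probs))
        for i, (l, p) in enumerate(zip(lambdas, probs)):
            post[i] += l * p / denom
    return [s / n for s in post]

def perplexity(word_probs):
    """Perplexity of the held-out data under the given probabilities."""
    return math.exp(-sum(math.log(p) for p in word_probs) / len(word_probs))
```

Each EM iteration is guaranteed not to increase held-out perplexity, which is what "perplexity optimized on held-out data" amounts to.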

Page 20

Lecture Meeting LMs

• Similar to the conference meeting LM, but
  – Added TED oral transcripts (0.1M words)
  – Added speech conference proceedings (28M)
  – Removed Fisher web data
  – Added web data based on speech conference proceedings (120M)
• Vocabulary: added 3781 words from conference proceedings
  – OOV rate on CHIL devtest: 0.18%
• Perplexity optimized on CHIL devtest
  – Jackknifed for development testing
  – On the full devtest for the eval system
• No CHIL transcripts used for N-gram training

Page 21

LM Results (see the paper for perplexity results)

                    eval04                          2005 devtest
  Language model    All   CMU   ICSI  LDC   NIST   AMI   CHIL
  RT-04F (CTS)      28.7  32.1  24.5  34.1  21.5   39.8  37.6
  RT-04S (mtg)      28.7  33.1  22.0  35.7  21.1   38.3  31.5
  RT-05S, no web    28.9  33.4  22.0  36.0  21.5   38.4  27.6
  RT-05S, w/ web    27.9  32.5  21.4  34.9  20.2   37.3  26.9

• Conference WER reduced 1.2% compared to the 2004 LM, due to
  – AMI training data
  – New web data
  – Additional Fisher data
• Lecture WER reduced 4.6% without using CHIL N-gram data
• Web data still useful to reduce WER
  – Conference meetings: 1.0-1.1%
  – Lecture meetings: 0.7% (redundant with proceedings?)
• No significant gains from source-specific LMs
  – But note: best results for CMU & LDC achieved using the CTS LM!

Page 22

Overall Results: Conference Meetings

• Relative WER reduction on eval04 data:
  – 17.4% for MDM
  – 16.9% for IHM
  – (SDM results not directly comparable due to system differences)
• Narrowing IHM/MDM gap?
  – Only 17% relative difference on eval05
  – But unchanged on eval04, in spite of improved MDM preprocessing and modeling
• AMI data different in character, but not much in system performance
  – Possible issues with array mics
• Performed well on "blind" test
  – Virginia Tech WER better than average
• NIST IHM results due to unusual meeting properties

  System    MDM    SDM     IHM
  eval04
  RT-04S    44.9   51.3*   33.6    (*fast system)
  RT-05S    37.1   43.0    27.9
  eval05
  RT-05S    30.2   40.9    25.9

  WER by source (RT-05S, eval05):
         AMI    CMU    ICSI   NIST   VT
  MDM    33.4   31.7   29.9   29.2   27.8
  IHM    23.3   23.3   24.5   34.5   23.3

Page 23

Lecture System Development

• Development on CHIL Jan 2005 data (NIST devtest)
• No CHIL data used for acoustic training or adaptation
  – Reused meeting-adapted models
• Transcripts used for LM mixture weight estimation (jackknifed testing)
• Model score weights not optimized
• Crosstalk filtering never tested (just ran it)
  – Devtest data had only one IHM channel

Findings
• Best to use a single speaker cluster for distant mics
  – The lecture speaker dominates and/or clustering is not good enough
• Delay-sum processing on tabletop mics is worse than the single mic with the best SNR
  – SNRs vary greatly between tabletop mics

Page 24

Lecture Recognition Results

• IHM results comparable to conference meetings
• Distant-mic WER > 10% absolute worse
  – Lecture room acoustics more challenging than conference room
• Error rates comparable to results reported by the CHIL team
• MDM better than SDM on devtest, but not on eval
  – MDM uses the single mic with the best SNR
• MSLA: little gain on devtest, but works much better on eval
• MM3A somewhat better than MSLA on devtest
  – Where is the eval data, John?

                MDM    MSLA   MM3A             SDM    IHM
  2005 devtest  51.6   51.0   49.7 (one array) 56.2   26.9
  eval05        52.0   44.8   ?                51.9   28.0

Page 25

Conclusions

• Considerable progress in meeting recognition: 17% relative WER reduction on conference meetings
• Leveraged LVCSR progress in CTS and BN
  – 2-3% absolute gain on MDM by combining CTS- and BN-based models
• Improved preprocessing for MDM (delay-sum) and IHM (crosstalk suppression)
• Effective adaptation of all modeling components
  – Features (MLP-based)
  – Acoustic models
  – Language models (web data is important)
• System generalized well to
  – A new conference meeting type with training data (AMI)
  – A meeting source & domain without training data (VT)
  – The lecture task (adapting only the language model)

Page 26

Future Work

• Fix the things we didn't have time for
  – MMI-MAP for distant-mic models
  – Adapt MLPs for distant-mic features
• For IHM: forget recognition (already works well); solve the speaker segmentation problem!
• Do a better job on lecture recognition
  – Use TED acoustic data
  – Get MDM to work better than SDM
  – Estimate model weights (LM weight, insertion penalty, …) properly
• Explore feature mapping techniques (e.g., MLLR) to reduce the mismatch of background training data
• Adapt models to non-native speakers of American English
  – Germans, Brits, Scots, …
  – cf. Arlo Faria's poster, MLME-05
• More generally, more adaptation to meeting type & content

Page 27

Thank You!

