
ASRU 2007 Survey

Presented by Shih-Hsiang

2

Outline

• LVCSR – Building a Highly Accurate Mandarin Speech Recognizer

Univ. of Washington, SRI, ICSI, NTU

– Development of the 2007 RWTH Mandarin LVCSR System RWTH

– The TITECH Large Vocabulary WFST Speech Recognition System Tokyo Institute of Technology

– Development of a Phonetic System for Large Vocabulary Arabic Speech Recognition Cambridge

– Uncertainty in Training Large Vocabulary Speech Recognizers (Focus on Graphical Model)

Univ. of Washington

– Advances in Arabic Broadcast News Transcription at RWTH RWTH

– The IBM 2007 Speech Transcription System for European Parliamentary Speeches (Focus on Language Adaptation)

IBM, Univ. of Southern California

– An Algorithm for Fast Composition of Weighted Finite-State Transducers Univ. of Saarland, Univ. of Karlsruhe

3

Outline (cont.)

– A Mandarin Lecture Speech Transcription System for Speech Summarization Univ. of Science and Technology, Hong Kong

4

Outline (cont.)

• Spoken Document Retrieval and Summarization – Fast Audio Search using Vector-Space Modeling

IBM

– Soundbite Identification Using Reference and Automatic Transcripts of Broadcast News Speech

Univ. of Texas at Dallas

– A System for Speech Driven Information Retrieval Universidad de Valladolid

– SPEECHFIND for CDP: Advances in Spoken Document Retrieval for the U.S. Collaborative Digitization Program

Univ. of Texas at Dallas

– A Study of Lattice-Based Spoken Term Detection for Chinese Spontaneous Speech (Spoken Term Detection)

Microsoft Research Asia

5

Outline (cont.)

• Speaker Diarization – Never-Ending Learning System for On-Line Speaker Diarization

NICT-ATR

– Multiple Feature Combination to Improve Speaker Diarization of Telephone Conversations

Centre de Recherche Informatique de Montreal

– Efficient Use of Overlap Information in Speaker Diarization Univ. of Washington

• Others – SENSEI: Spoken English Assessment for Call Center Agents

IBM

– The LIMSI QAST Systems: Comparison Between Human and Automatic Rules Generation for Question-Answering on Speech Transcriptions

LIMSI

– Topic Identification from Audio Recordings using Word and Phone Recognition Lattices

MIT

6

7

Reference

• [Ref 1] X. Lei et al., “Improved tone modeling for Mandarin broadcast news speech recognition,” in Proc. Interspeech, 2006

• [Ref 2] F. Valente and H. Hermansky, “Combination of acoustic classifiers based on Dempster-Shafer theory of evidence,” in Proc. ICASSP, 2007

• [Ref 3] A. Zolnay et al., “Acoustic feature combination for robust speech recognition,” in Proc. ICASSP, 2005

• [Ref 4] F. Wessel et al., “Explicit word error minimization using word hypothesis posterior probabilities,” in Proc. ICASSP, 2001

9

• Acoustic Data– 866 hours of speech data collected by LDC (Training)

Mandarin Hub4 (30 hours), TDT4 (89 hours), and GALE Year 1 (747 hours) corpora for training our acoustic models

The data span 1997 through July 2006, drawn from shows on CCTV, RFA, NTDTV, PHOENIX, ANHUI, and so on

– Test on three different test sets: the DARPA EARS RT-04 evaluation set (eval04), the DARPA GALE 2006 evaluation set (eval06), and the GALE 2007 development set (dev07)

Corpora Description

10

Corpora Description (cont.)

• Text Corpora – The transcriptions of the acoustic training data, the LDC Mandarin Gigaword corpus, GALE-related Chinese web text releases, and so on (1 billion words)

• Lexicon – Step 1: Start from the BBN-modified LDC Chinese word lexicon and manually augment it with a few thousand new words, both Chinese and English (70,000 words)

– Step 2: Re-segment the text corpora using a longest-first match algorithm and train a unigram LM; choose the most frequent 60,000 words (a small segmentation sketch follows this list)

– Step 3: Use ML word segmentation on the training text to extract the out-of-vocabulary (OOV) words

– Step 4: Retrain the N-gram LMs using modified Kneser-Ney smoothing
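As a rough illustration of the longest-first match segmentation used in Step 2, here is a minimal Python sketch; the toy lexicon and the maximum word length are assumptions, not details from the paper.

Python sketch:
def longest_first_segment(text, lexicon, max_word_len=8):
    """Greedily match the longest lexicon entry starting at each position."""
    words, i = [], 0
    while i < len(text):
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                words.append(candidate)  # unmatched single characters fall through
                i += length
                break
    return words

lexicon = {"符合", "中美", "二國", "根本", "利益"}   # hypothetical toy lexicon
print(longest_first_segment("符合中美二國根本利益", lexicon))
# ['符合', '中美', '二國', '根本', '利益']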

11

Acoustic Systems

• Create two subsystems having approximately the same error rate performance but with error behaviors as different as possible, in order to compensate for each other

• System ICSI – Phoneme Set

70 phones for pronunciations; additionally, there is one phone designated for silence and another for noises, laughter, and unknown foreign speech (context-independent)

– Front-end features (74 dimensions per frame): 13-dim MFCC and its first- and second-order derivatives; spline-smoothed pitch feature and its first- and second-order derivatives; 32-dim phoneme-posterior features generated by multi-layer perceptrons (MLPs)

12

Acoustic Systems (cont.)

• Spline smoothed pitch feature [Ref 1]

– Since pitch is present only in voiced segments, the F0 needs to be interpolated in unvoiced regions to avoid variance problems in recognition

Interpolate the F0 contour with a piecewise cubic Hermite interpolating polynomial (PCHIP)

PCHIP spline interpolation has no overshoots and less oscillation than conventional spline interpolation

Take the log of F0

Moving window normalization (MWN) subtracts the moving average of a long-span window (1-2 secs) to normalize out phrase-level intonation effects

5-point moving average (MA) smoothing reduces the noise in the F0 features (a small pipeline sketch follows the figure placeholder below)

[Figure: raw F0 feature vs. final smoothed pitch feature for the example utterance 符合中美二國根本利益]
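The following is a minimal Python sketch of the pitch feature pipeline described above (my own illustration, not the authors' code); the frame rate, window length, and voiced-frame handling are assumptions.

Python sketch:
import numpy as np
from scipy.interpolate import PchipInterpolator

def pitch_feature(f0, frame_rate=100, window_sec=1.5):
    """f0: per-frame F0 in Hz, with 0 in unvoiced frames."""
    f0 = np.asarray(f0, dtype=float)
    t = np.arange(len(f0))
    voiced = f0 > 0
    # 1) PCHIP interpolation of F0 through the voiced frames (fills unvoiced gaps)
    f0_interp = PchipInterpolator(t[voiced], f0[voiced])(t)
    # 2) log F0
    logf0 = np.log(f0_interp)
    # 3) moving window normalization: subtract a long-span (1-2 s) moving average
    win = int(window_sec * frame_rate)
    logf0 = logf0 - np.convolve(logf0, np.ones(win) / win, mode="same")
    # 4) 5-point moving-average smoothing to reduce noise
    return np.convolve(logf0, np.ones(5) / 5, mode="same")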

System ICSI

13

Acoustic Systems (cont.)

• MLP feature – Provides discriminative phonetic information at the frame level

– It involves three main steps: For each frame, concatenate its neighboring 9 frames of PLP and pitch features as the input to an MLP (43*9 inputs, 15,000 hidden units, and 71 output units). Each output unit of the MLP models the likelihood of the central frame belonging to a certain phone (Tandem phoneme posterior features). The noise phone is excluded (it is not a very discriminable class)

Next, they separately construct a two-stage MLP where the first stage contains 19 MLPs and the second stage one MLP

The purpose of each MLP in the first stage, with 60 hidden units each, is to identify a different class of phonemes, based on the log energy of a different critical band across a long temporal context (51 frames ~ 0.5 seconds)

The second stage of MLP then combines the information from all of the hidden units (60*19) from the first stage to make a grand judgment on the phoneme identity for the central frame (8,000 hidden units) (HATs phoneme posteriors features)

Finally, the 71-dim Tandem and HATs posterior vectors are combined using the Dempster-Shafer algorithm

A logarithm is then applied to the combined posteriors, followed by principal component analysis (PCA) for de-correlation and dimension reduction (a small post-processing sketch follows below)
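A minimal sketch of this posterior post-processing (my own illustration); the per-frame combination here is a plain average as a simple stand-in for the Dempster-Shafer combination used in the paper [Ref 2], and the output dimensionality is taken from the 32-dim MLP feature above.

Python sketch:
import numpy as np
from sklearn.decomposition import PCA

def combine_posteriors(tandem, hats, out_dim=32):
    """tandem, hats: (n_frames, 71) phone posterior matrices."""
    combined = 0.5 * (tandem + hats)      # stand-in for Dempster-Shafer combination
    log_post = np.log(combined + 1e-10)   # log of the combined posteriors
    pca = PCA(n_components=out_dim)       # de-correlation + reduction to 32 dims
    return pca.fit_transform(log_post)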

System ICSI

14

Acoustic Systems (cont.) – System ICSI

15

Acoustic Systems (cont.)

• System PLP – The system uses 42-dimensional features: static, first- and second-order derivatives of PLP features

– In order to compete with the ICSI-model which has a stronger feature representation, an fMPE feature transform is learned for the PLP-model.

The fMPE transform is trained by computing the high-dimension Gaussian posteriors of 5 neighboring frames, given a 3500x32 cross-word tri-phone ML-trained model with an SAT transform (3500*32*5=560K)

– For tackling spontaneous speech, they additionally introduce a few diphthongs in the PLP-Model

System PLP

16

Acoustic Systems (cont.)

• Acoustic model in more detail– Decision-tree based HMM state clustering

3500 shared states, each with 128 Gaussians

– A cross-word tri-phone model with the ICSI-feature is trained with an MPE objective function

– SAT feature transform based on 1-class constrained MLLR

17

Decoding Architecture

18

Decoding Architecture (cont.)

• Acoustic Segmentation– Their segmenter is run with a finite state grammar

– Their segmenter makes use of broad phonetic knowledge of Mandarin and models the input recording with five words

silence, noise, a Mandarin syllable with a voiceless initial, a Mandarin syllable with a voiced initial, and a non-Mandarin word

Each pronunciation phone (bg, rej, I1, I2, F, forgn) is modeled by a 3-state HMM, with 300 Gaussians per state

The minimum speech duration is reduced to 60 ms

19

Decoding Architecture (cont.)

• Auto Speaker Clustering – Using Gaussian mixture models of static MFCC features and K-means clustering

• Search with Trigrams and Cross Adaptation– The decoding is composed of three trigram recognition passes

ICSI-SI: The speaker-independent (SI) within-word tri-phone MPE-trained ICSI-model and the highly pruned trigram LM give a good initial adaptation hypothesis quickly

PLP-Adapt: Use the ICSI hypothesis to learn the speaker-dependent SAT transform and to perform MLLR adaptation per speaker, on the cross-word tri-phone SAT+fMPE MPE-trained PLP-model

ICSI-Adapt: Use the top-1 PLP hypothesis to adapt the cross-word tri-phone SAT MPE-trained ICSI-model

20

Decoding Architecture (cont.)

• Topic-Based Language Model Adaptation– Using a Latent Dirichlet Allocation (LDA) topic model

During decoding, they infer the topic mixture weights dynamically for each utterance

Then select the top few most relevant topics above a threshold, and use their weights in θ to interpolate with the topic-independent N-gram background language model

Weight the words in w based on an N-best-list-derived confidence measure; include words not only from the utterance being rescored but also from surrounding utterances in the same story chunk, via a decay factor

The adapted n-gram is then used to rescore the N-best list (a small interpolation sketch follows below)

When the entire system is applied to eval06, the CER is 15.3%
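A minimal sketch of the topic-interpolation step (my own illustration with hypothetical names; the interpolation weight, top-k, and threshold values are assumptions, not the paper's settings).

Python sketch:
def adapted_prob(ngram, background_lm, topic_lms, theta,
                 lambda_adapt=0.3, top_k=3, threshold=0.05):
    """background_lm and topic_lms[t] map an n-gram tuple to a probability;
    theta maps a topic id to its inferred mixture weight for this utterance."""
    # keep only the most relevant topics above the threshold, then renormalize
    top = dict(sorted(theta.items(), key=lambda kv: -kv[1])[:top_k])
    top = {t: w for t, w in top.items() if w >= threshold}
    if not top:
        return background_lm.get(ngram, 0.0)
    total = sum(top.values())
    p_topic = sum((w / total) * topic_lms[t].get(ngram, 0.0) for t, w in top.items())
    # interpolate the topic mixture with the topic-independent background LM
    return (1.0 - lambda_adapt) * background_lm.get(ngram, 0.0) + lambda_adapt * p_topic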

21

22

Corpora Description

• Phoneme Set– The phoneme set is a subset of SAMPA-C

14 vowels and 26 consonants

– Tone information is included following the main-vowel principle: tones 3 and 5 are merged for all vowels, and for the phoneme @’ tones 1 and 2 are merged. The resulting phoneme set consists of 81 tonemes, plus an additional garbage phone and silence

• Lexicon – Based on the LC-Star Mandarin lexicon (96k words)

– Unknown words are segmented into a sequence of known words by applying a longest-match segmenter

• Language models are the same as those of Univ. of Washington and SRI – Recognition experiments use pruned 4-gram LMs

– Word graph rescoring uses the full LMs

23

Acoustic Modeling

• The final system consists of four independent subsystems

– System 1 (s1): MFCC features (+ segment-wise CMVN). For each frame, its neighboring 9 frames are concatenated and projected to a 45-dimensional feature space (done by LDA). The tone feature and its first and second derivatives are also appended to the feature vector (a small stacking/LDA sketch follows this list)

– System 2 (s2) and system 3 (s3) are equal to s1 except for the base features: s2 uses PLP features, s3 uses gammatone cepstral coefficients

– System 4 (s4) starts with the same acoustic front-end as s1, but the features are augmented with phoneme posterior features produced by a neural network

The input to the net is multiple time resolution features (MRASTA); the dimension of the phoneme posterior features is reduced by PCA to 24
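As a rough illustration of the s1 front-end (frame stacking followed by an LDA projection to 45 dimensions), here is a minimal Python sketch; using tied-state labels as the LDA classes and a symmetric ±4-frame context are assumptions.

Python sketch:
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def stack_frames(feats, context=4):
    """feats: (n_frames, dim) -> (n_frames, dim * (2*context + 1))."""
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + len(feats)] for i in range(2 * context + 1)])

def lda_projected_features(feats, state_labels, out_dim=45):
    stacked = stack_frames(feats)                      # 9-frame context window
    lda = LinearDiscriminantAnalysis(n_components=out_dim)
    return lda.fit_transform(stacked, state_labels)    # (n_frames, 45)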

24

Acoustic Modeling

• Acoustic Training – The acoustic models for all systems are based on tri-phones with cross-word context, modeled by a 3-state left-to-right HMM

– Decision-tree-based state tying is applied (4,500 generalized tri-phone states)

– The filter-banks of the MFCC and PLP feature extraction are normalized by applying a 2-pass VTLN (not for s3 system)

– Speaker variations are compensated by using SAT/CMLLR

– Additionally, in recognition MLLR is applied to update the mean of the AMs

– MPE is used for discriminative AMs training

25

System Development

• Acoustic Feature Combination – The literature contains several ways to combine different feature streams: concatenate the individual feature vectors; feed the feature streams into a single LDA; or perform the integration in a log-linear model

• For fewer data, the log-linear model combination gives some nice improvement over the simple concatenation approach

• But with more training data the benefit declines and for the 870 hours setup we see no improvement at all

26

System Development (cont.)

• Consensus Decoding And System Combination – min.fWER (minimum time frame error) based consensus decoding

– min.fWER combination

– ROVER with confidence scores.

• The approximated character boundary times work as well as the boundaries derived from a forced alignment

• For almost all experiments, there is no difference between minimizing WER and CER

• Only ROVER seems to benefit from switching to characters

27

Decoding Framework

• Multi-Pass recognition – Pass 1: no adaptation

– Pass 2: 2-pass VTLN

– Pass 3: SAT/CMLLR

– Pass 4: MLLR

– Pass 5: LM rescoring

• The five passes result in an overall reduction in CER of about 10% relative for eval06 and about 9% for dev07

• The MPE trained models give a further reduction in the CER resulting in a 12% to 15% relative decrease over all passes

• Adding the 358 hours of extra data to the MPE training slightly decreases the CER and the total relative improvement is about 16% for both corpora

• LM.v2 (4-gram) outperforms LM.v1(5-gram) by about 0.8% absolute in CER consistently for all systems and passes

28

Decoding Framework (cont.)

• Cross Adaptation– Use s4 to adapt s1, s2 and s3

29

30

Introduction

• The goal is to build a fast, scalable, flexible decoder to operate on weighted finite-state transducer (WFST) search spaces

• WFSTs provide a common and natural representation for HMM models, context dependency, pronunciation dictionaries, grammars, and alternative recognition outputs

• Within the WFST paradigm all the knowledge sources in the search space are combined together to form a static search network
– The composition often happens off-line before decoding, and there exist powerful operations to manipulate and optimise the search networks

– The fully composed networks can often be very large, and therefore large amounts of memory can be required at both composition and decode time

Solution: on-the-fly composition of the network, disk-based search networks, and so on

31

Evaluations

• Evaluations were carried out using the Corpus of Spontaneous Japanese (CSJ)– contains a total of 228 hours of training data from 953 lectures

– 38 dimensional feature vectors with a 10ms frame rate and 25ms window size

– The language model was back-off trigram with a vocabulary of 25k words

• On the testing data, the language model perplexity was 57.8 and the out-of-vocabulary rate was 0.75% – 2328 utterances spanning 10 lectures

• The experiments were conducted on a 2.40 GHz Intel Core2 machine with 2 GB of memory and an Nvidia 8800GTX graphics processor running Linux

32

Evaluations (cont.)

• HLevel and CLevel WFST Evaluations – CLevel (C ∘ L ∘ G)

– HLevel (H ∘ C ∘ L ∘ G), where C: context dependency, L: lexicon, G: LM (grammar), H: acoustic HMMs

• Recognition experiments were run with the beam width varied from 100 to 200

• CLevel – 2.1M states and 4.3M arcs, requiring 150 MB ~ 170 MB of memory

• HLevel – 6.2M states and 7.7M arcs, requiring 330 MB ~ 400 MB of memory

• Julius – required 60 MB ~ 100 MB of memory

* For narrow beams the HLevel decoder was slightly faster and achieved marginally higher accuracy, showing that the better-optimized HLevel networks can be used with small overhead by using singleton arcs.

33

Evaluations (cont.)

• Multiprocessor Evaluations – The decoder was additionally run in multi-threaded mode, using one and two threads, to take advantage of both cores in the processor

The multi-threaded decoder using two threads is able to achieve higher accuracy for the same beam when compared to a single-threaded decoder: there are parts of the decoding where each thread uses a local best cost for pruning and not the absolute best cost at that point in time

34

Evaluations (cont.)

35

36

Introduction

• The authors presented a two-stage method for fast audio search and spoken term detection
– Using a vector-space modeling approach to retrieve a short list of candidate audio segments for a query (word-lattice based)

– The list of candidate segments is then searched using a word based index for known words and a phone-based index for out-of-vocabulary words

37

Lattice-Based Indexing for VSM

• For vector-space modeling, it is necessary to extract an unordered list of terms of interest from each document in the database– raw count, TF/IDF, … etc.

• In order to accomplish this for lattices, we can extract the expected count of each term (see the reconstructed formula and the sketch below)

• The training documents use reference transcripts, instead of lattices or the 1-best output of a recognizer

• The unordered list of terms is also extracted from the most frequently occurring 1-gram tokens in the training documents

• However, this does not account for OOV terms in a query

The expected term count of term $w_i$ in document $d_j$ is computed from the lattice as

$ETC(w_i, d_j) = \sum_{l \in L_j} P(l \mid X_j)\, C_l(w_i)$

where $L_j$ is the complete set of paths in the lattice and $C_l(w_i)$ is the count of term $w_i$ in path $l$
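A minimal Python sketch of the expected-count computation (my own illustration): the lattice is represented, for simplicity, as an explicit list of paths with posterior probabilities P(l | X_j); real systems accumulate the same quantity from arc posteriors rather than enumerating paths.

Python sketch:
from collections import Counter

def expected_term_counts(lattice_paths):
    """lattice_paths: list of (posterior, [word, ...]) pairs for one document."""
    etc = Counter()
    for posterior, words in lattice_paths:
        for word, count in Counter(words).items():
            etc[word] += posterior * count   # P(l | X_j) * C_l(w_i)
    return etc

# Hypothetical toy lattice with three paths
paths = [(0.6, ["fast", "audio", "search"]),
         (0.3, ["fast", "audio", "searches"]),
         (0.1, ["vast", "audio", "search"])]
print(expected_term_counts(paths))   # e.g. 'fast': 0.9, 'search': 0.7, ...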

38

Experimental Results

• Experiment Setup – Broadcast news audio search task: 2.79 hours / 1107 query terms / 1408 segments

– Two ASR systems
ASR System 1 (250K): SI+SA decode, 6000 quinphone context-dependent states, 250K Gaussians

ASR System 2 (30K): SI-only decode, 3000 triphone context-dependent states, 30K Gaussians

Both of these systems use a 4-gram language model, built from a 54M n-gram corpus

39

Experimental Results

40

41

Introduction

• Soundbite identification in broadcast news is important for locating information – useful for question answering, mining opinions of a particular person, and enriching speech recognition output with quotation marks

• This paper presents a systematic study of this problem under a classification framework– Problem formulation for classification

– Feature extraction

– The effect of using automatic speech recognition (ASR) output

– Automatic sentence boundary detection

42

Classification Framework for Soundbite Identification

[Figure: the classification framework, built around a support vector machine (SVM) classifier]

43

Classification Framework for Soundbite Identification (cont.)

• Problem formulation– Binary classification

Soundbite versus not

– Three-way classification: anchor, reporter, or soundbite

• Feature Extraction (each speech turn is represented as a feature vector) – Lexical features

LF-1: Unigram and bigram features in the first and the last sentence of the current speech turn (for speaker roles)

LF-2: Unigram and bigram features from the last sentence of the previous turn and from the first sentence of the following turn (functional transition among different speakers)

– Structural features: number of words in the current speech turn; number of sentences in the current speech turn; average number of words per sentence in the current speech turn (a small feature-extraction sketch follows this list)
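A minimal sketch (hypothetical data layout, not the authors' implementation) of representing each speech turn by lexical plus structural features and training a binary soundbite-vs-not SVM.

Python sketch:
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

def build_features(turns, vectorizer=None):
    """turns: list of dicts, each with 'sentences': a list of sentence strings."""
    # LF-1-style lexical features: unigrams/bigrams of the first + last sentence
    texts = [t["sentences"][0] + " " + t["sentences"][-1] for t in turns]
    if vectorizer is None:
        vectorizer = CountVectorizer(ngram_range=(1, 2))
        lexical = vectorizer.fit_transform(texts)
    else:
        lexical = vectorizer.transform(texts)
    # structural features: #words, #sentences, average words per sentence
    structural = []
    for t in turns:
        n_sent = len(t["sentences"])
        n_words = sum(len(s.split()) for s in t["sentences"])
        structural.append([n_words, n_sent, n_words / max(n_sent, 1)])
    return hstack([lexical, csr_matrix(np.array(structural, dtype=float))]), vectorizer

# train_turns, labels = ...            # labels: 1 = soundbite, 0 = not
# X, vec = build_features(train_turns)
# clf = LinearSVC().fit(X, labels)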

44

Classification Framework for Soundbite Identification (cont.)

• Feature Weighting– Notation

N is the number of speech turns in the training collection; M is the total number of features; fik is the frequency of feature φi in the k-th speech turn

ni denotes the number of speech turns containing feature φi

F(φi ) means the frequency of feature φi in the collection

wik is the weight assigned to feature φi in the k-th turn

– Frequency Weighting: $w_{ik} = f_{ik}$

– TF-IDF Weighting: $w_{ik} = f_{ik} \cdot \log(N / n_i)$

– TF-IWF Weighting: $w_{ik} = f_{ik} \cdot \log\left(\sum_{j=1}^{M} F(\varphi_j) \,/\, F(\varphi_i)\right)$

– Entropy Weighting: $w_{ik} = \log(f_{ik} + 1.0) \cdot (1 + \mathrm{entropy}_i)$, where $\mathrm{entropy}_i = \frac{1}{\log N} \sum_{j=1}^{N} \frac{f_{ij}}{f_i} \log \frac{f_{ij}}{f_i}$ and $f_i = \sum_{j=1}^{N} f_{ij}$

(a small weighting sketch follows below)
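A minimal numpy sketch of these four weighting schemes (my own illustration), operating on a feature-by-turn frequency matrix.

Python sketch:
import numpy as np

def feature_weights(f, scheme="tfidf"):
    """f: (M features, N turns) matrix with f[i, k] = frequency of feature i in turn k."""
    f = np.asarray(f, dtype=float)
    M, N = f.shape
    if scheme == "freq":
        return f
    if scheme == "tfidf":
        n = np.count_nonzero(f, axis=1)                     # turns containing feature i
        return f * np.log(N / np.maximum(n, 1))[:, None]
    if scheme == "tfiwf":
        F = f.sum(axis=1)                                   # collection frequency F(phi_i)
        return f * np.log(F.sum() / np.maximum(F, 1e-12))[:, None]
    if scheme == "entropy":
        p = f / np.maximum(f.sum(axis=1, keepdims=True), 1e-12)
        ent = (p * np.log(np.maximum(p, 1e-12))).sum(axis=1) / np.log(N)
        return np.log(f + 1.0) * (1.0 + ent)[:, None]
    raise ValueError(scheme)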

45

Experimental Results

• Experimental Setup– TDT4 Mandarin broadcast news data

335 news shows

– Performance Measures: accuracy, precision, recall, F-measure

46

Experimental Results

• Comparison of Different Weighting Methods – Weighting methods that use global information generally perform much better than those that simply use local information

– Different problem formulations seem to prefer different weighting methods

– Entropy-based weighting is more theoretically motivated and seems to be a promising weighting choice

• Contribution of Different Types of Features – Adding contextual features improves the performance

– Removing low-frequency features (i.e., Cutoff-1) helps in classification

47

Experimental Results

• Impact of Using ASR Output – Speech recognition errors hurt the system performance

– Automatic sentence boundary detection degrades performance even more

• The three-way classification strategy generally outperforms the binary setup

REF: human transcripts; ASR_ASB: ASR output with automatic sentence segmentation; ASR_RSB: ASR output with manual (reference) segmentation

48

49

Introduction

• Speech-driven information retrieval is a more difficult task than text-based information retrieval
– Because spoken queries contain less redundancy to overcome speech recognition errors (longer queries are more robust to errors than shorter ones)

• Three types of errors that affect retrieval performance– out of vocabulary (OOV) words

– errors produced by words in a foreign language

– regular speech recognition errors

• Solutions– OOV problem

Two-pass strategy to adapt the Lexicons and LMs

– Foreign words problem Added the pronunciation of foreign words to pronunciation lexicon

50

System Overview

IR Engine: VSM + Rocchio’s pseudo relevance feedback

51

Experimental Setup

• Corpus – CLEF’01 (Cross-Language Evaluation Forum) Spanish monolingual test suite
The evaluation set includes a document collection, a set of topics, and relevance judgments

215,738 documents from the year 1994 from the EFE newswire agency (511 MB)

49 topics, each with three parts: a brief title statement, a one-sentence description, and a more complex narrative

– Queries were formulated from the description field of each topic

– 10 different speakers reading the queries (5 male and 5 female)

• Baseline System – The ASR best hypothesis was processed by the IR engine to obtain the list of documents relevant to that query (top 1000 most relevant docs)

Three types of error: Type I: errors produced by OOV words; Type II: errors caused by words in a foreign language; Type III: regular speech recognition errors

52

Experiment Results

• To reduce the Type I errors – Vocabulary adaptation

Created a list with every word that appeared in the documents retrieved in the first pass (the average number of words in a document is about 27,000)

Added the most frequent words from the general vocabulary until reaching a vocabulary of 60,000 words

– Language Model Adaptation: Trained a new LM with the documents obtained in the first pass, then interpolated this new LM with the general LM using the adapted vocabulary

Linear interpolation (weight 0.5); a small interpolation sketch follows below
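A minimal sketch of the interpolation step (my own illustration; the dict-of-probabilities LM representation is an assumption for illustration only).

Python sketch:
def interpolate_lm(general_lm, adapted_lm, vocab, lam=0.5):
    """general_lm, adapted_lm: dicts mapping n-gram tuples to probabilities."""
    merged = {}
    for ngram in set(general_lm) | set(adapted_lm):
        if all(word in vocab for word in ngram):
            merged[ngram] = (lam * adapted_lm.get(ngram, 0.0)
                             + (1.0 - lam) * general_lm.get(ngram, 0.0))
    return merged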

53

Experiment Results (cont.)

• Inclusion of foreign words pronunciation– Mapping English phonemes to Spanish ones

– Included the pronunciation of English words in the pronunciation lexicon CMU pronouncing dictionary

Add 8,891 English words

– A one-pass strategy with alternate pronunciations reduced the number of Type II errors; however, some new Type III errors appeared

• Final System– Combined the two-pass strategy with foreign words modeling

In the first pass, obtained the 1000 most relevant documents to the query (using the pronunciation lexicon that included English words pronunciation)

Then, adapted the vocabulary and the LM and expanded the pronunciation lexicon to include English words pronunciation

54

55

Introduction

• SpeechFind is an SDR system serving as the platform for several programs across the United States for audio indexing and retrieval – the National Gallery of the Spoken Word (NGSW) and the Collaborative Digitization Program (CDP)

• The system includes the following modules– An audio spider and transcoder

automatically fetching available audio archives from a range of available servers and converting the incoming audio files into the designed audio formats

parses the metadata and extracts relevant information into a “rich” transcript database to guide future information retrieval

– Spoken documents transcriber includes an audio segmenter and transcriber

– Transcription database

– An online public accessible search engine responsible for information retrieval tasks

a web-based user interface, search and index engines

56

Introduction (cont.)

57

Structure of CDP Audio Corpus

• The structure of the CDP audio corpus – CDP audio files include interviews, discussions/debates, and lectures, each with 2-5 participating speakers

– The recorded audio documents are spontaneously articulated, with many overlapping speakers and burst noise events such as clapping, laughing, etc.

– Recordings were conducted from the 1960s to the 2000s in libraries, offices, classrooms, and homes

58

Transcript Verification Process with CDP

• An online web-interface was developed in order to improve the quality of the ASR-generated transcripts

• The transcript verification process is as follows– Automatic Transcription

– Online Verification

– Model Enhancement

59

Transcript Improvement via Feature/Model Enhancement

• Speech/Feature Enhancement
• Lexicon Update and Language Model Adaptation
• Acoustic Model Adaptation Using a Selective Training Set

– Document-dependent acoustic conditions: speaker-dependent characteristics, time-varying/short-term background noise and channel interference, and others

– Across-document acoustic conditions: gender/age/accent-dependent speech traits and background noise/channel distortions observed broadly

60

Notes

• The Dempster-Shafer (DS) Theory of Evidence allows representation and combination of different measures of evidence [Ref 2] [back]

61

Piecewise cubic Hermite interpolating polynomial (PCHIP)

Matlab Code:
x = -3:3; y = [-1 -1 -1 0 1 1 1];   % sample data with a flat region
t = -3:.01:3;                       % dense query grid
p = pchip(x,y,t);                   % shape-preserving PCHIP interpolation (no overshoot)
s = spline(x,y,t);                  % conventional cubic spline (overshoots near the step)
plot(x,y,'o',t,p,'-',t,s,'-.')
legend('data','pchip','spline',4)

62

Log-Linear Model

• Different acoustic features are combined indirectly via the log-linear combination of acoustic probabilities [Ref 3][back]

• In the case of log-linear model combination, the posterior probability has the form

$p(W \mid X) = \frac{\exp\left(\sum_{j} \lambda_{j}\, g_{j}(W, X)\right)}{\sum_{W'} \exp\left(\sum_{j} \lambda_{j}\, g_{j}(W', X)\right)}$

• The feature functions are defined as
– Language model: $g_{\mathrm{LM}}(W, X) = \log p(W)$

– Acoustic model (one per acoustic feature stream $f_i$): $g_{f_i}(W, X) = \log p_{f_i}(X_{f_i} \mid W)$

• This yields the decision rule

$W_{\mathrm{opt}} = \operatorname*{argmax}_{W} \left\{ p(W)^{\lambda_{\mathrm{LM}}} \prod_{i} p_{f_i}(X_{f_i} \mid W)^{\lambda_{f_i}} \right\}$

(a small numeric sketch of this combination follows below)
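A minimal numeric sketch of the log-linear combination above (my own illustration; the hypotheses, log-scores, and weights are made up).

Python sketch:
import math

def loglinear_best(hypotheses, lambdas):
    """hypotheses: {W: {"lm": log p(W), "mfcc": log p(X_mfcc|W), "plp": log p(X_plp|W)}}"""
    scores = {w: sum(lambdas[name] * logp for name, logp in feats.items())
              for w, feats in hypotheses.items()}
    z = math.log(sum(math.exp(s) for s in scores.values()))  # normalizer over hypotheses
    best = max(scores, key=scores.get)
    return best, math.exp(scores[best] - z)                   # argmax W and its posterior

hyps = {"w1": {"lm": -2.0, "mfcc": -10.0, "plp": -11.0},
        "w2": {"lm": -2.5, "mfcc": -9.5, "plp": -10.5}}
print(loglinear_best(hyps, {"lm": 1.0, "mfcc": 0.6, "plp": 0.4}))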

63

Minimum Time Frame Error (MTFE)

• The time frame errors are caused either by word deletions, insertions, and substitutions or by differing word boundaries [Ref 4][back]

• MTFE is to overcome the mismatch between Bayes’ decision rule which aims at minimizing the expected sentence error rate and the word error rate which is used to assess the performance of speech recognition systems

• The decision rule is rewritten to minimize the expected time frame error, in contrast to the standard approach, which minimizes the expected sentence error rate (SER)

64

Composition

• Composition is the transducer operation for combining different levels of representation [Ref 5] [back]

– e.g. a pronunciation lexicon can be composed with a word-level grammar to produce a phone-to-word transducer whose word sequences are restricted to the grammar (a small composition sketch follows the figure placeholder below)

[Figure: grammar transducer G, lexicon transducer L, and their composition L ∘ G]
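A minimal sketch (my own illustration) of epsilon-free, unweighted FST composition on a toy arc-list representation; real systems use toolkits such as OpenFst, and weighted composition also multiplies arc weights in the semiring.

Python sketch:
def compose(fst1, start1, fst2, start2):
    """fst1, fst2: dicts {state: [(in_label, out_label, next_state), ...]}.
    Output labels of fst1 are matched against input labels of fst2."""
    start = (start1, start2)
    arcs, stack, seen = {}, [start], {start}
    while stack:
        q1, q2 = stack.pop()
        out = arcs.setdefault((q1, q2), [])
        for i1, o1, n1 in fst1.get(q1, []):
            for i2, o2, n2 in fst2.get(q2, []):
                if o1 == i2:                       # match on the shared label
                    nxt = (n1, n2)
                    out.append((i1, o2, nxt))      # composed arc: input of fst1, output of fst2
                    if nxt not in seen:
                        seen.add(nxt)
                        stack.append(nxt)
    return arcs                                    # {composed state: composed arcs}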

65

Speaker Diarization

• Problem formulation: the “who spoke when” task on a continuous audio stream (NIST RT03 Spring Eval.) [back]

