
Rapid and Accurate Spoken Term Detection

David R. H. Miller

BBN Technologies
14 December 2006

Overview of Talk

• BBN English system description

• Evaluation results

• Development experiments

• BBN explored STD across languages, but with limited evaluation resources we chose to field systems only in CTS for each language.

BBN Evaluation Team

Core Team
• Chia-lin Kao
• Owen Kimball
• Michael Kleber
• David Miller

Additional assistance
• Thomas Colthurst
• Herb Gish
• Steve Lowe
• Rich Schwartz

BBN System Overview

[Figure: system block diagram.
Indexing: audio → Byblos STT → lattices and phonetic transcripts → indexer → index.
Searching: search terms and index → detector → scored detection lists → decider (using ATWV cost parameters) → final output with YES/NO decisions.]

BBN System Overview: STT

[Figure: system block diagram repeated to introduce the Byblos STT stage, which turns audio into lattices and phonetic transcripts.]

Primary STT configuration

• STT generates a lattice of hypotheses and a phonetic transcript for each input audio file.

• 2300-hour EARS RT04 CTS acoustic model training corpus

• 946M words language model training

• 14.9% WER on Std.Dev06 CTS data

Primary STT English Architecture

[Figure: block diagram of the primary English STT system. Two decoding passes, each made of forward-backward decoding followed by lattice rescoring, with speaker adaptation between them.
Input: waveform → segmentation + feature extraction (RDLT features).
Speaker-independent pass: Fw SI STM AM, bigram LM; Bw SI SCTM AM, approx. trigram LM; rescoring with SI crossword SCTM AM, trigram LM. Intermediate outputs: trigram lattice, N-best hypothesis.
Speaker adaptation: adaptation parameters.
Adapted pass: Fw HLDA-SAT STM AM, bigram LM; Bw HLDA-SAT SCTM AM, approx. trigram LM; rescoring with HLDA-SAT crossword SCTM AM, trigram LM. Intermediate output: trigram lattice.
Outputs: final lattice, final 1-best.]

System described in detail in B. Zhang et al., “Discriminatively trained region dependent feature transforms for speech recognition,” Proc. ICASSP 2006, Toulouse, France.

BBN System Overview: Indexer

[Figure: system block diagram repeated to introduce the indexer stage, which turns lattices and phonetic transcripts into the index.]

Indexer

• Indexer precomputes single-word detection records from lattices.
  – Stores them as hashed, sorted lists for fast lookup.
• Computes the fraction of likelihood that flows over each arc.
  – Uses the forward-backward algorithm.
  – Optimistic posterior: ignores the possibility that the true word is missing from the lattice.
• Clusters detections of the same word at close times, summing their scores.

[Figure: example lattice whose arcs carry a word and two scores (a, l): WHICH [a=-205 l=-5], WITCH [a=-200 l=-4], WITCH [a=-203 l=-4], CAT [a=-170 l=-2], CUT [a=-175 l=-3], IS [a=-18 l=-2], THAT [a=-92 l=-3], A [a=-12 l=-2].]
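The arc-posterior and clustering steps described for the indexer can be sketched as below. This is a minimal illustration under stated assumptions, not BBN's implementation: the lattice encoding (tuples of source node, destination node, word, combined log score), the requirement that node ids be topologically ordered integers, the function names, and the 0.1 s clustering tolerance are all hypothetical choices made for the example.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def arc_posteriors(arcs, start, final):
    """Fraction of the total lattice likelihood that flows over each arc.

    arcs: list of (src, dst, word, log_score). Node ids are assumed to be
    topologically ordered and every node to lie on a start-to-final path.
    """
    nodes = sorted({n for src, dst, _, _ in arcs for n in (src, dst)})
    alpha = {start: 0.0}                      # forward log-likelihoods
    for n in nodes:
        incoming = [alpha[s] + w for s, d, _, w in arcs if d == n]
        if incoming:
            alpha[n] = logsumexp(incoming)
    beta = {final: 0.0}                       # backward log-likelihoods
    for n in reversed(nodes):
        outgoing = [w + beta[d] for s, d, _, w in arcs if s == n]
        if outgoing:
            beta[n] = logsumexp(outgoing)
    total = alpha[final]
    return [math.exp(alpha[s] + w + beta[d] - total) for s, d, _, w in arcs]

def cluster_detections(detections, tolerance=0.1):
    """Merge same-word detections with close begin times, summing scores.

    detections: list of (word, begin, duration, posterior) from one file.
    The merged record keeps the times of its highest-scoring member
    (an assumption; the slides do not say which times are kept).
    """
    clusters = []
    for word, begin, duration, p in sorted(detections):
        last = clusters[-1] if clusters else None
        if last and last["word"] == word and begin - last["latest"] <= tolerance:
            last["score"] += p
            last["latest"] = begin
            if p > last["best"]:
                last.update(begin=begin, duration=duration, best=p)
        else:
            clusters.append({"word": word, "begin": begin, "duration": duration,
                             "score": p, "best": p, "latest": begin})
    return clusters

# Toy two-slot lattice with made-up arc probabilities.
arcs = [(0, 1, "WHICH", math.log(0.6)), (0, 1, "WITCH", math.log(0.4)),
        (1, 2, "CAT", math.log(0.7)), (1, 2, "CUT", math.log(0.3))]
posteriors = arc_posteriors(arcs, start=0, final=2)

merged = cluster_detections([("WITCH", 39.10, 0.30, 0.30),
                             ("WITCH", 39.12, 0.28, 0.20),
                             ("CAT", 39.50, 0.30, 0.83)])
```

Because the posterior is normalized by the total likelihood of paths inside the lattice, it cannot account for a word that was pruned away, which is the "optimistic" property noted on the slide.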

Index Structure

[Figure: the index holds the phonetic transcripts and, for each word (e.g. CAT, WITCH, WHICH), a list of detection records giving file, begin time b, duration d, and posterior p, for example “file9: b=39.1 d=0.3 p=0.83”, “file3: b=25.2 d=0.1 p=0.77”, “file5: b=173.8 d=0.2 p=0.52”.]
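One way to picture the word-to-records index is a hash map from each word to a sorted list of detection records. The sketch below is only illustrative: the `Detection` type, the `build_index` name, the choice to sort by descending posterior, and the assignment of the sample records to particular words are assumptions, not details given in the slides.

```python
from collections import defaultdict
from typing import NamedTuple

class Detection(NamedTuple):
    file: str        # audio file identifier
    begin: float     # begin time in seconds (b)
    duration: float  # duration in seconds (d)
    score: float     # posterior probability (p)

def build_index(word_detections):
    """Group (word, Detection) pairs by word; sort each list by descending score."""
    index = defaultdict(list)
    for word, detection in word_detections:
        index[word].append(detection)
    for detections in index.values():
        detections.sort(key=lambda d: d.score, reverse=True)
    return dict(index)

index = build_index([
    ("CAT", Detection("file5", 173.8, 0.2, 0.52)),
    ("CAT", Detection("file9", 39.1, 0.3, 0.83)),
    ("WITCH", Detection("file3", 25.2, 0.1, 0.77)),
])

# Single-word lookup is a hash access; unseen words return an empty list.
cat_hits = index.get("CAT", [])
dog_hits = index.get("DOG", [])
```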

BBN System Overview: Detector

[Figure: system block diagram repeated to introduce the detector stage, which takes search terms and the index and produces scored detection lists.]

Detector

• Detector generates a sorted, scored list of candidate detection records for each search term supplied.
• For single-word in-vocabulary (IV) terms, performs trivial retrieval from the index.
• For multi-word IV terms, looks for acceptable sequences of single-word detections.
  – Component detections must satisfy adjacency timing constraints.
  – Assigns the minimum component score to the multi-word detection.
• Out-of-vocabulary (OOV) terms are not a significant factor in English CTS; see the Levantine talk.

Candidates for term “bombing”:

Audio File      Begin   Duration   Score
fsh_60262_exA   83.1    0.23       0.93
fsh_61228_exA   29.7    0.18       0.85
fsh_60844_exA   101.5   0.28       0.47
fsh_60650_exA   2.71    0.30       0.13
fsh_61228_exA   55.9    0.21       0.01
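The multi-word logic described for the detector (chain single-word detections that are ordered and close in time, then score the phrase by its weakest component) can be sketched as follows. The `Detection` type, the `detect_phrase` name, the dictionary index, and the 0.5 s `max_gap` are hypothetical; the slides do not give the actual timing constraint.

```python
from typing import NamedTuple

class Detection(NamedTuple):
    file: str
    begin: float
    duration: float
    score: float

def detect_phrase(index, words, max_gap=0.5):
    """Return candidate detections for a phrase, best score first.

    index: dict mapping word -> list of Detection.
    A following word must start after the previous word starts, in the same
    file, and no more than max_gap seconds after the previous word ends.
    """
    candidates = list(index.get(words[0], []))
    for word in words[1:]:
        extended = []
        for prev in candidates:
            prev_end = prev.begin + prev.duration
            for nxt in index.get(word, []):
                if nxt.file != prev.file:
                    continue
                if nxt.begin > prev.begin and nxt.begin - prev_end <= max_gap:
                    extended.append(Detection(
                        prev.file,
                        prev.begin,
                        nxt.begin + nxt.duration - prev.begin,
                        min(prev.score, nxt.score),   # minimum component score
                    ))
        candidates = extended
    return sorted(candidates, key=lambda d: d.score, reverse=True)

# Made-up records for illustration.
toy_index = {
    "NEW": [Detection("file1", 10.0, 0.2, 0.9), Detection("file2", 40.0, 0.2, 0.8)],
    "YORK": [Detection("file1", 10.25, 0.3, 0.6), Detection("file2", 90.0, 0.3, 0.7)],
}
hits = detect_phrase(toy_index, ["NEW", "YORK"])
```

With a one-word term the loop body never runs, so the function reduces to the trivial retrieval used for single-word terms.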

BBN System Overview: Decider

[Figure: system block diagram repeated to introduce the decider stage, which takes scored detection lists and ATWV cost parameters and produces the final output with YES/NO decisions.]

Decider

Candidates for term “bombing”:

Audio File      Begin   Duration   Score   YES/NO
fsh_60262_exA   83.1    0.23       0.93    ?
fsh_61228_exA   29.7    0.18       0.85    ?
fsh_60844_exA   101.5   0.28       0.47    ?
fsh_60650_exA   2.71    0.30       0.13    ?
fsh_61228_exA   55.9    0.21       0.01    ?

• Decider picks and applies a score threshold for each list to make YES/NO decisions.
  – Processes each list of candidates independently
  – Processes all detection records in a list jointly
  – Aims to maximize ATWV metric

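Primary Evaluation Metric

• “Actual Term Weighted Value” (ATWV) is the primary metric

$$\mathrm{ATWV} = \frac{1}{N_{\mathrm{terms}}} \sum_{\mathrm{term}} \mathrm{Value}(\mathrm{term})$$

$$\mathrm{Value}(\mathrm{term}) = 1 - P_{\mathrm{Miss}}(\mathrm{term}) - \beta \, P_{\mathrm{FA}}(\mathrm{term})$$

$$P_{\mathrm{Miss}}(\mathrm{term}) = 1 - \frac{N_{\mathrm{correct}}(\mathrm{term})}{N_{\mathrm{true}}(\mathrm{term})}, \qquad P_{\mathrm{FA}}(\mathrm{term}) = \frac{N_{\mathrm{spurious}}(\mathrm{term})}{T_{\mathrm{speech}} - N_{\mathrm{true}}(\mathrm{term})}$$

where: $T_{\mathrm{speech}}$ = duration of search corpus in seconds, $\beta = 1{,}000$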
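A small numeric sketch may make the metric concrete. It assumes the per-term form Value(term) = 1 − P_Miss(term) − β·P_FA(term) with β = 1,000, P_Miss = 1 − N_correct/N_true and P_FA = N_spurious/(T_speech − N_true); the function names, the toy counts, and the choice to skip terms with no true occurrences are assumptions made for the example.

```python
def term_value(n_correct, n_spurious, n_true, t_speech, beta=1000.0):
    """Per-term value: 1 - P_miss - beta * P_fa (requires n_true > 0)."""
    p_miss = 1.0 - n_correct / n_true
    p_fa = n_spurious / (t_speech - n_true)
    return 1.0 - p_miss - beta * p_fa

def atwv(counts, t_speech, beta=1000.0):
    """Average term value over terms that actually occur in the corpus.

    counts: dict mapping term -> (n_correct, n_spurious, n_true).
    Terms with n_true == 0 are skipped here (an assumption of this sketch).
    """
    values = [term_value(c, s, n, t_speech, beta)
              for c, s, n in counts.values() if n > 0]
    return sum(values) / len(values)

T_SPEECH = 10_800.0  # hypothetical 3-hour corpus, in seconds
toy_counts = {
    "bombing": (4, 1, 5),    # 4 of 5 found, 1 false alarm
    "hello": (10, 0, 10),    # all found, no false alarms
}
score = atwv(toy_counts, T_SPEECH)
```

In this toy case a single false alarm costs about 0.09 of term value, roughly half as much as missing one of the five true occurrences (0.2).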

Understanding ATWV

• Perfect ATWV = 1.0
• Mute detector has ATWV = 0.0
• Negative ATWV is possible.
• Motivated by application-based costs:

$$\mathrm{Value} = \frac{V \cdot N_{\mathrm{correct}} - C \cdot N_{\mathrm{spurious}}}{V \cdot N_{\mathrm{true}}}$$

• All search terms are weighted equally.
• False alarm cost is almost constant, but miss cost varies by term.
  – Missing an instance of a rare term is expensive.
  – Missing an instance of a frequent term is cheap.

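Decider Theory

• Given unbiased, independent posterior probabilities on detections and a known constant value/cost for each outcome, the optimal decision threshold $\theta$ satisfies

$$\theta \, V_{\mathrm{hit}} - (1 - \theta) \, C_{\mathrm{fa}} = 0 \quad \Longrightarrow \quad \theta = \frac{C_{\mathrm{fa}}}{V_{\mathrm{hit}} + C_{\mathrm{fa}}}$$

where $V_{\mathrm{hit}}$ = value of a correct hit ($> 0$) and $C_{\mathrm{fa}}$ = cost of a false alarm ($> 0$).

• In the ATWV metric, if $N_{\mathrm{true}}(\mathrm{term}) > 0$:

$$V_{\mathrm{hit}} = \frac{1}{N_{\mathrm{true}}(\mathrm{term})}, \qquad C_{\mathrm{fa}} = \frac{\beta}{T_{\mathrm{speech}} - N_{\mathrm{true}}(\mathrm{term})}$$

with $\beta = 1{,}000$ the false-alarm weight of the metric and $T_{\mathrm{speech}}$ the duration of the search corpus in seconds.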
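As a worked illustration, the threshold can be written in closed form by substituting the ATWV value and cost. This assumes $V_{\mathrm{hit}} = 1/N_{\mathrm{true}}$ and $C_{\mathrm{fa}} = \beta/(T_{\mathrm{speech}} - N_{\mathrm{true}})$ with $\beta = 1{,}000$; the corpus length of 10,800 s (3 hours) and the term counts are hypothetical numbers chosen only for the example.

```latex
\theta = \frac{C_{\mathrm{fa}}}{V_{\mathrm{hit}} + C_{\mathrm{fa}}}
       = \frac{\beta \, N_{\mathrm{true}}}{T_{\mathrm{speech}} + (\beta - 1)\, N_{\mathrm{true}}}

% Rare term, N_true = 5:
\theta = \frac{1000 \cdot 5}{10800 + 999 \cdot 5} = \frac{5000}{15795} \approx 0.317

% Frequent term, N_true = 50:
\theta = \frac{1000 \cdot 50}{10800 + 999 \cdot 50} = \frac{50000}{60750} \approx 0.823
```

The rare term gets the lower threshold: each of its hits is worth more, so weaker candidates are still worth accepting, while candidates for a frequent term must score much higher before a YES pays off.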

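Decider Approximations

• $N_{\mathrm{true}}(\mathrm{term})$ is unknown, and detection scores are biased.
• For each term, estimate from detections $D_i$, using corrected scores $\hat{p}(D_i)$ derived from the raw scores $p(D_i)$:

$$\hat{N}_{\mathrm{true}}(\mathrm{term}) = \frac{1}{1 - P_{\mathrm{missing\ from\ lattice}}} \sum_i \hat{p}(D_i)$$

$$\hat{V}_{\mathrm{hit}} = \frac{1}{\hat{N}_{\mathrm{true}}(\mathrm{term})}, \qquad \hat{C}_{\mathrm{fa}} = \frac{\beta}{T_{\mathrm{speech}} - \hat{N}_{\mathrm{true}}(\mathrm{term})}$$

$$\hat{\theta} = \frac{\hat{C}_{\mathrm{fa}}}{\hat{V}_{\mathrm{hit}} + \hat{C}_{\mathrm{fa}}}$$

with $\beta = 1{,}000$ the false-alarm weight of the metric and $T_{\mathrm{speech}}$ the duration of the search corpus in seconds.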
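Put together, a per-term decider along these lines might look like the sketch below. It is an illustration under several assumptions rather than the fielded system: raw scores stand in for the bias-corrected scores, the expected count is the score sum scaled by 1/(1 − P_missing), β = 1,000, the corpus is a hypothetical 3 hours long, and the names `decide` and `p_missing` are invented for the example.

```python
def decide(scores, t_speech, beta=1000.0, p_missing=0.0):
    """Pick a per-term threshold and make YES/NO decisions.

    scores: posterior-like scores of all candidate detections for one term.
    Returns (threshold, list of booleans aligned with scores).
    """
    # Estimated number of true occurrences, inflated for words pruned
    # from the lattice (p_missing is a hypothetical tuning parameter).
    n_true_hat = sum(scores) / (1.0 - p_missing)
    if n_true_hat <= 0.0:
        return 1.0, [False] * len(scores)
    v_hit = 1.0 / n_true_hat                  # value of one correct hit
    c_fa = beta / (t_speech - n_true_hat)     # cost of one false alarm
    theta = c_fa / (v_hit + c_fa)
    return theta, [s >= theta for s in scores]

# Scores of the five "bombing" candidates shown on the Decider slide.
bombing_scores = [0.93, 0.85, 0.47, 0.13, 0.01]
theta, decisions = decide(bombing_scores, t_speech=10_800.0)
```

Under these assumptions the threshold comes out near 0.18, so the three highest-scoring candidates are accepted and the two lowest are rejected. Because the threshold depends on the whole list through the estimated count, the records of a term are processed jointly while different terms are handled independently.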

2006 STD Evaluation English Results

English CTS Results

Site      Accuracy   Search speed   Indexing time   Index size
          (ATWV)     (sec.p/Hs)     (Hp/Hs)         (MB/Hs)
BBN:P      0.83       0.004          43.0             1.0
BBN:C      0.76       0.004           2.7             0.5
but:p      0.52       0.038         126.8           688.6
dod:p     -0.41       0.077          16.1             0.4
ibm:p      0.74       0.004           7.6             0.3
idiap:p   -6.19      11.312           0.3            24.5
ogi:p      0.65       0.456           0.3             7.2
qut:p      0.09       0.330          18.1           558.2
sri:p      0.67       1.383          10.7            19.7
stbu:p     0.22      13.580         157.7           688.6
stell:p    0.00       2.992           0.2             8.7
tub:p      0.16       0.173           0.2             0.8

NIST English DET curves

[Figure: NIST DET curves for the English systems.]

Effect of STT Error Rate

• STT WER has a strong effect on ATWV:

System         WER    Dev06 ATWV   DryRun06 ATWV
BBN primary    15.5   0.847        0.852
BBN contrast   18.0   0.786        0.766

• A loss of 2.5 points of WER caused ATWV to drop by 0.06-0.09.
  – Magnified effect because changes in lattice word posteriors don’t show up in WER
• WER is affected by scoring conventions.
  – Contraction, hyphenation normalization
  – Rigorous match definition for this eval causes WER to increase by 0.5

Importance of Lattice Output

• Searching lattices is more accurate than searching 1-best transcripts

            Dev06                DryRun06
            1-best   lattices    1-best   lattices
primary     0.787    0.847       0.735    0.852
contrast    0.740    0.786       0.704    0.766

• Lattice searching reduces Pmiss
  – 8-fold increase in number of candidate detections from STT
• Improves estimate of Ntrue for decisions
  – Holds PFA down

Effect of Multi-word Detection Logic

• Exact detection of multi-word search terms is possible:
  – Store full lattice
  – Search for words on adjacent edges
  – Use fw-bw to get true posterior probability
• Approximate multi-word detection:
  – Store only individual words, forget topology
  – Search for words ordered & close in time
  – Pr(phrase) = min Pr(words in phrase)

Effect of Approximate Multi-word Detection

Search time          Index size         ATWV
decreased by 99.5%   decreased by 97%   increased by 0.01

BBN STD Summary

• Accurate detection (83% of perfect ATWV)
• Fast search time
• Small index size
• Configurable indexing speed
  – Fast index speed maintains good accuracy.
• Encapsulated decision logic
  – Easy to tailor for cost metrics other than ATWV

Contrast STT configuration

• 2300hrs/800hrs/1500hrs AM training data (complementary MPE).
• Same LM training data as primary system
• Somewhat smaller model than primary
• 18.1% WER on Std.Dev06 CTS data
  – compared to 14.9% for primary

Contrast STT English Architecture

[Figure: block diagram of the contrast English STT system.
Stages: segmentation + feature extraction, forward-backward decoding, speaker adaptation, lattice rescoring.
Models: Fw SI STM AM, bigram LM; Bw SI SCTM AM, approx. trigram LM; HLDA-SAT crossword SCTM AM, trigram LM.
Data passed between stages: waveform, cepstra + energy features, 1-best hypothesis, adaptation parameters, trigram lattice, final result.]

Architecture same as S. Matsoukas et al., “The 2004 BBN 1xRT Recognition Systems for English Broadcast News and Conversational Telephone Speech,” Proc. Interspeech 2005, Lisboa, Portugal.

