  • Indexation, Retrieval and Detection Techniques for Spoken Term Detection

    Doğan Can

    Boğaziçi University, Department of Electrical & Electronics Engineering

    BUSIM Lab

    January 26, 2010

  • Outline

    Introduction

    Lattice Indexation/Search Framework
      Preliminaries
      Spoken Utterance Retrieval with Factor Transducer
      2-Pass Spoken Term Detection with Factor Transducer
      Spoken Term Detection with Modified Factor Transducer
      Spoken Term Detection with Timed Factor Transducer
      Experimental Results

    Retrieval
      Query Forming and Expansion for Phonetic Search
      Experimental Results

    Thresholding for Spoken Term Detection
      Global Thresholding
      Term Weighted Value Based Term Specific Thresholding
      Score Distribution Based Term Specific Thresholding
      Experimental Results


  • Application: Sign Dictionary

    [Embedded video: demo.mov (QuickTime)]

  • Comparison of Speech Retrieval Tasks
    Spoken Document Retrieval vs. Spoken Utterance Retrieval vs. Spoken Term Detection

          Query   Relation            Return (Text Analogue)
    SDR   long    lexical+semantic    documents (relevant pages)
    SUR   short   inclusion (exact)   utterances (sentences)
    STD   short   exact match         occurrences (positions)

  • Challenges of the Spoken Term Detection Task

    - Aim: Open vocabulary search
      Reference: “Taipei night view”
    - Challenge: Unreliable transcriptions
      ASR Output: “tie bay light view”

    1. High error rate of one-best transcripts
       Alternative transcriptions: [tie bay [light 0.6, night 0.4] view]
    2. Out-Of-Vocabulary queries
       Phonetic search: /t ay b ey n ay t v iy w/
    3. Boost in false alarms due to 1 and 2


  • Challenges of the Spoken Term Detection Task: Proposed Solutions

    - Aim: Open vocabulary search (Reference: “Taipei night view”)
    - Challenge: Unreliable transcriptions (ASR Output: “tie bay light view”)

    1. High error rate of one-best transcripts
       → Efficient Indexing and Search Framework for STD
    2. Out-Of-Vocabulary queries
       → Utilizing Weighted OOV Query Pronunciations
    3. Boost in false alarms due to 1 and 2
       → Exploiting Score Statistics for STD

  • Previous Work in the Field

    How to Index and Search Lattices
    - General Indexation of Weighted Automata [Saraclar and Sproat, 2004; Allauzen et al., 2004]
    - Position Specific Posterior Lattices (PSPL) [Chelba and Acero, 2005]
    - Time-based Merging for Indexing (TMI) [Zhou et al., 2006]

    How to Alleviate the OOV Issue
    - Search on sub-word decoding [Saraclar and Sproat, 2004; Siohan and Bacchiani, 2005; Mamou et al., 2007]
    - Search on the sub-word representation of word decoding [Chaudhari and Picheny, 2007]
    - Phonetic query expansion [Li et al., 2000]

  • Anatomy of a Spoken Term Detection (STD) System

    [Block diagram. INDEXING: Speech Database → ASR → ASR Output → Index.
    RETRIEVAL: User → Query → Preprocess → Search Engine (over the Index).
    DETECTION: is the score larger than τ? yes → Return, no → Omit.]

  • Outline (current section: Lattice Indexation/Search Framework)

  • Notation & Definitions: Semirings

    Definition: A system (K, ⊕, ⊗, 0̄, 1̄) is a semiring if:
    - (K, ⊕, 0̄) is a commutative monoid with identity element 0̄;
    - (K, ⊗, 1̄) is a monoid with identity element 1̄;
    - ⊗ distributes over ⊕;
    - 0̄ is an annihilator for ⊗: for all a ∈ K, a ⊗ 0̄ = 0̄ ⊗ a = 0̄.

                     Set (K)      ⊕       ⊗   0̄    1̄
    Boolean     B    {0, 1}       ∨       ∧   0    1
    Probability R    R+           +       ×   0    1
    Log         L    R ∪ {+∞}     ⊕log    +   +∞   0
    Tropical    T    R ∪ {+∞}     min     +   +∞   0
    Tropical    T′   R ∪ {−∞}     max     +   −∞   0

    where a ⊕log b = −log(e⁻ᵃ + e⁻ᵇ)
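    The log and tropical ⊕ operations map directly onto code. A minimal sketch
    (mine, not from the talk) of the two semirings over -log probabilities:

        import math

        # Log semiring: a (+) b = -log(e^-a + e^-b); (x) is +; 0-bar = +inf; 1-bar = 0.
        def log_plus(a: float, b: float) -> float:
            if math.isinf(a):
                return b
            if math.isinf(b):
                return a
            m = min(a, b)
            return m - math.log1p(math.exp(-abs(a - b)))  # numerically stable log-add

        # Tropical semiring: (+) is min; (x) is +; 0-bar = +inf; 1-bar = 0.
        def trop_plus(a: float, b: float) -> float:
            return min(a, b)

        def times(a: float, b: float) -> float:  # (x) is + in both semirings
            return a + b

        w1, w2 = 0.7, 1.2         # -log probabilities of two parallel paths
        print(log_plus(w1, w2))   # total mass of both paths (log semiring)
        print(trop_plus(w1, w2))  # weight of the single best path (tropical)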

  • Notation & Definitions: Weighted Finite-State Automata

    Definition: A weighted finite-state transducer T over a semiring K is an
    8-tuple T = (Σ, Δ, Q, I, F, E, λ, ρ):
    - Σ : input alphabet; e.g. Σ = {m, e}
    - Δ : output alphabet; e.g. Δ = {h, a, v}
    - Q : set of states;
    - I ⊆ Q : set of initial states;
    - F ⊆ Q : set of final states;
    - E ⊆ Q × (Σ ∪ {ε}) × (Δ ∪ {ε}) × K × Q : set of transitions;
    - λ : I → K : initial weight function;
    - ρ : F → K : final weight function.

    [Example transducer with states qa/1, qb, qc/1 and arcs
    m:h/.5, e:a/.5, e:a/.7, e:v/.3, ε:ε/1]

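    For concreteness, the 8-tuple can be written down directly. A minimal
    sketch (my own encoding; the arc endpoints and initial/final weights are
    guessed from the figure residue, not stated in the talk):

        from dataclasses import dataclass

        EPS = "<eps>"  # the epsilon label

        @dataclass
        class Arc:
            src: str       # p[e]: previous state
            ilabel: str    # i[e]: input label, in Sigma or eps
            olabel: str    # o[e]: output label, in Delta or eps
            weight: float  # w[e]: weight, here in the probability semiring
            dst: str       # n[e]: next state

        @dataclass
        class WFST:
            arcs: list     # E: the transition set
            initial: dict  # lambda: initial state -> weight
            final: dict    # rho: final state -> weight

        # The example from the slide: states qa/1, qb, qc/1.
        T = WFST(
            arcs=[
                Arc("qa", "m", "h", 0.5, "qb"),
                Arc("qa", "e", "a", 0.5, "qb"),
                Arc("qb", "e", "a", 0.7, "qc"),
                Arc("qb", "e", "v", 0.3, "qc"),
                Arc("qc", EPS, EPS, 1.0, "qa"),  # the eps:eps/1 arc (placement guessed)
            ],
            initial={"qa": 1.0},
            final={"qa": 1.0, "qc": 1.0},
        )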

  • Notation & Definitions: Factor Automaton

    Definition: Given two strings u, v ∈ Σ*, v is a factor (substring) of u if
    u = xvy for some x, y ∈ Σ*. More generally, v is a factor of L ⊆ Σ* if v
    is a factor of some u ∈ L.

    Definition: The factor automaton F(u) of u is the minimal deterministic
    finite-state acceptor recognizing exactly X_u, the set of factors of u.

    Definition: Similarly, the factor automaton F(A) of A is the minimal
    deterministic finite-state acceptor recognizing exactly X_A, the set of
    factors of the strings recognized by A.


  • Indexing WFSA for Spoken Utterance Retrieval

    [Example word lattice: 0 --good/1--> 1 --evening/.6 | morning/.4--> 2]

    Setup: For each speech utterance u_i, i = 1, ..., n,
    - a weighted automaton A_i over Σ and L; i.e. the word lattice output by ASR.

    Objective: Create a full index to directly search for any factor of these
    automata, e.g. “evening”, “good morning”, etc.

    Notes:
    - Different from classical indexation: the input data is uncertain
    - Must make use of the weights

  • Factor Transducer Construction: Toy Example (Factor Selection)

    [Lattice with states 0, 1, 2 and arcs lose:ε/.4, find:ε/.6, yourself:ε/1,
    augmented with a new initial state s (ε:ε/1 arcs into 0, 1, 2) and a new
    final state e (ε:i/1 arcs out of 0, 1, 2)]

    1. Replace each transition (p, a, w, q) by (p, a, ε, w, q)
    2. Create a new state s ∉ Q_i and make s the unique initial state
    3. Create a new state e ∉ Q_i and make e the unique final state
    4. Create a new transition (s, ε, ε, d[q], q) for each state q ∈ Q_i
    5. Create a new transition (q, ε, i, f[q], e) for each state q ∈ Q_i

    Here d[q] and f[q] denote the shortest-distance weights (over L) from the
    initial state to q and from q to the final states, and i is the utterance
    index emitted as the output label. A sketch of these steps in code follows.
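    A minimal sketch of steps 1-5, assuming an acyclic lattice whose states
    are numbered in topological order and whose weights are probabilities (so
    d[q] and f[q] become forward/backward sums); all names are mine:

        # Arcs: (src, label, weight, dst), sorted by source state.
        def factor_selection(n_states, arcs, initial, finals, utt_id):
            EPS = "<eps>"
            # d[q]: total probability of paths from the initial state to q.
            d = [0.0] * n_states
            d[initial] = 1.0
            for p, a, w, q in arcs:                 # topological order
                d[q] += d[p] * w
            # f[q]: total probability of paths from q to a final state.
            f = [0.0] * n_states
            for q in finals:
                f[q] = 1.0
            for p, a, w, q in reversed(arcs):       # reverse topological order
                f[p] += w * f[q]

            s, e = "s", "e"                         # steps 2-3: new endpoints
            out = [(p, a, EPS, w, q) for p, a, w, q in arcs]              # step 1
            out += [(s, EPS, EPS, d[q], q) for q in range(n_states)]      # step 4
            out += [(q, EPS, utt_id, f[q], e) for q in range(n_states)]   # step 5
            return out

        # Toy lattice from the slide: find/.6 and lose/.4 from 0 to 1,
        # then yourself/1 from 1 to 2.
        arcs = [(0, "find", 0.6, 1), (0, "lose", 0.4, 1), (1, "yourself", 1.0, 2)]
        for arc in factor_selection(3, arcs, initial=0, finals={2}, utt_id=1):
            print(arc)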

  • Factor Transducer Construction: Toy Example (Optimization)

    1. Weighted ε-removal
    2. Weighted determinization over L
    3. Weighted minimization over L, by viewing T_i as an acceptor

    Before: [the s/e construction above, with its ε:ε/1 and ε:i/1 arcs]

    After: [4-state transducer: 0 --lose:ε/.4 | find:ε/.6--> 1
    --yourself:ε/1--> 2, with ε:i/1 arcs into the final state 3 and a direct
    yourself:ε/1 path]

  • Full Factor Transducer Construction & Search

    Full Factor Transducer: Given T_i for each A_i, i = 1, ..., n:
    1. Take the union, U = ⋃_i T_i
    2. Weighted ε-removal, determinization (and minimization)
    3. Define T as the transducer obtained after sorting input labels

    Search:
    1. The user query X can be any weighted automaton! (regular expressions)
    2. Compose X with T on the input side and project the result onto the
       output labels: P = Π₂(X ∘ T)
    3. Weighted ε-removal + pruning, or n-shortest paths + sorting with a
       shortest path algorithm

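    For a plain string query, composition followed by output projection
    reduces to walking the index on its input side and then reading off the
    ε:i final arcs. A rough sketch (mine), assuming an ε-removed index in the
    probability semiring, stored as (src, ilabel, olabel, weight, dst) arcs:

        from collections import defaultdict

        EPS = "<eps>"

        def search(index_arcs, start_states, query):
            by_src = defaultdict(list)
            for arc in index_arcs:
                by_src[arc[0]].append(arc)

            # frontier: state -> accumulated weight after a query prefix
            frontier = {q: 1.0 for q in start_states}
            for sym in query:
                nxt = defaultdict(float)
                for q, w in frontier.items():
                    for _, il, _, aw, dst in by_src[q]:
                        if il == sym:
                            nxt[dst] += w * aw
                frontier = nxt

            results = defaultdict(float)  # utterance id -> expected count
            for q, w in frontier.items():
                for _, il, ol, aw, _ in by_src[q]:
                    if il == EPS and ol != EPS:  # an eps:i arc to the final state
                        results[ol] += w * aw
            return dict(results)

        # Hand-made miniature mirroring the next slide's result for query "a":
        toy = [
            (0, "a", EPS, 1.0, 1),
            (1, EPS, 1, 2.0, 9),    # eps:1/2   -> (utterance 1, count 2)
            (1, EPS, 2, 1.4, 9),    # eps:2/1.4 -> (utterance 2, count 1.4)
        ]
        print(search(toy, start_states={0}, query=["a"]))  # {1: 2.0, 2: 1.4}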

  • Spoken Utterance Retrieval with Factor Transducer [Allauzen et al., 2004]

    Database:
    1. “a a”:            0 --a/1--> 1 --a/1--> 2
    2. “[b .6, a .4] a”: 0 --b/.6 | a/.4--> 1 --a/1--> 2

    Query: 0 --a/1--> 1

    Index: [6-state factor transducer with arcs a:ε/1, b:ε/1, a:ε/1, a:ε/1
    and final arcs ε:1/2, ε:2/1.4, ε:2/.6, ε:1/1, ε:2/.4, ε:2/.6]

    Results: 0 --a:ε/1--> 1, with final arcs ε:1/2 and ε:2/1.4

    (Utterance ID, Expected Count):
    1. (1, 2)
    2. (2, 1.4)

  • 2-pass STD with Factor Transducer [Parlak and Saraclar, 2008; Can et al., 2009]

    Procedure
    - For each query:
      - Obtain (utterance ID, expected count) pairs (1st pass)
      - For each utterance with expected count > τ:
        - Align the query with the utterance → time interval (2nd pass)
          [Parlak and Saraclar, 2008]
        - Align the query with the lattice → time interval (2nd pass)
          [Can et al., 2009]
      - Return (utterance ID, time interval, expected count) triplets

    Problems
    - The 2nd pass takes time → slow
    - Multiple occurrences of a query in the same utterance contribute to the
      same expected count.
    - Ideal for Spoken Utterance Retrieval; not so for Spoken Term Detection

  • Modified Factor Transducer Construction: Toy Example (Factor Selection)

    [State time labels: L[0] = 0, L[1] = 1, L[2] = 3. Lattice with states
    0, 1, 2 and arcs lose:0-1/.4, find:0-1/.6, yourself:1-3/1, augmented with
    a new initial state s (ε:ε/1 arcs) and a new final state e (ε:i/1 arcs)]

    1. Replace each transition (p, a, w, q) by (p, a, L_i[p]-L_i[q], w, q)
    2. Create a new state s ∉ Q_i and make s the unique initial state
    3. Create a new state e ∉ Q_i and make e the unique final state
    4. Create a new transition (s, ε, ε, d[q], q) for each state q ∈ Q_i
    5. Create a new transition (q, ε, i, f[q], e) for each state q ∈ Q_i

  • Factor Transducer vs. Modified Factor Transducer (After Optimization)

    Factor Transducer:
    [0 --lose:ε/.4 | find:ε/.6--> 1 --yourself:ε/1--> 2, with ε:i/1 arcs into
    the final state 3 and a direct yourself:ε/1 path]

    Modified Factor Transducer:
    [same topology, with time intervals as output labels: lose:0-1/.4,
    find:0-1/.6, yourself:1-3/1, ε:i/1, yourself:1-3/1, ε:i/1]

  • Spoken Term Detection with Modified Factor Transducer [Can et al., 2009]

    Database (raw time intervals):
    1. “a a”:            0 --a:0.1-1/1--> 1 --a:1-1.8/.6 | a:1-1.9/.4--> 2
    2. “[b .6, a .4] a”: 0 --b:0.2-1/.6 | a:0.1-1/.4--> 1 --a:1-1.9/1--> 2

    CLUSTERING merges overlapping intervals:
    1. “a a”:            0 --a:0.1-1/1--> 1 --a:1-1.9/.6 | a:1-1.9/.4--> 2

    QUANTIZATION snaps the intervals to a coarse grid:
    1. “a a”:            0 --a:0-1/1--> 1 --a:1-2/.6 | a:1-2/.4--> 2
    2. “[b .6, a .4] a”: 0 --b:0-1/.6 | a:0-1/.4--> 1 --a:1-2/1--> 2

    Query: 0 --a/1--> 1

    Index: [7-state transducer with arcs a:0-1/1, a:1-2/1, b:0-1/1, a:1-2/1,
    a:1-2/1 and final arcs ε:1/1, ε:2/.4, ε:2/1, ε:1/1, ε:2/.6, ε:1/1,
    ε:2/.4, ε:2/.6]

    Results: [sub-transducer with arcs a:0-1/1, a:1-2/1 and final arcs
    ε:1/1, ε:2/.4, ε:2/1, ε:1/1]

    (Utterance ID, Time Interval, Posterior Probability):
    1. (1, 0-1, 1)
    2. (1, 1-2, 1)
    3. (2, 0-1, .4)
    4. (2, 1-2, 1)

  • 1-pass STD with Modified Factor Transducer [Can et al., 2009]

    Procedure
    - For each query:
      - Obtain (utterance ID, time interval, posterior probability) triplets
      - Return triplets with posterior probability > τ

    Highlights
    - No 2nd pass → fast
    - No multiple occurrence problem: every distinct interval leads to another
      index entry → overlapping intervals are clustered
    - Time interval mismatches → common paths are reduced → larger index →
      time intervals are quantized

    Problems
    - The index is non-deterministic!

  • Timed Factor Transducer Construction: Toy Example (Factor Selection)

    [State time labels: L[0] = 0, L[1] = 1, L[2] = 3. Lattice with states
    0, 1, 2 and arcs lose:1/(.4,0,0), find:1/(.6,0,0), yourself:1/(1,0,0),
    augmented with a new initial state s (arcs ε:ε/(1,0,0), ε:ε/(1,1,0),
    ε:ε/(1,3,0)) and a new final state e (arcs ε:i/(1,0,0), ε:i/(1,0,1),
    ε:i/(1,0,3))]

    1. Replace each arc weight w ∈ L with (w, 1̄, 1̄) ∈ L × T × T′
    2. Create a new state s ∉ Q_i and make s the unique initial state
    3. Create a new state e ∉ Q_i and make e the unique final state
    4. Create a new arc (s, ε, ε, (d[q], L_i[q], 1̄), q) for q ∈ Q_i
    5. Create a new arc (q, ε, i, (f[q], 1̄, L_i[q]), e) for q ∈ Q_i

  • Factor Transducer vs. Timed Factor Transducer (After Optimization)

    Factor Transducer:
    [0 --lose:ε/.4 | find:ε/.6--> 1 --yourself:ε/1--> 2, with ε:i/1 arcs into
    the final state 3 and a direct yourself:ε/1 path]

    Timed Factor Transducer:
    [same topology with weight triples: lose:1/(.4,0,1), find:1/(.6,0,1),
    yourself:1/(1,1,3), ε:i/(1,0,0), yourself:1/(1,0,2), ε:i/(1,0,0)]

  • Spoken Term Detection with Timed Factor Transducer

    Database (state times in brackets):
    1. “a b”, [0.1, 1, 1.8]: 0 --a:a/1--> 1 --b:b/1--> 2
    2. “b a”, [0.2, 1, 1.9]: 0 --b:b/1--> 1 --a:a/1--> 2

    After CLUSTERING:
    1. “a b”: 0 --a:1/1--> 1 --b:1/1--> 2
    2. “b a”: 0 --b:1/1--> 1 --a:1/1--> 2

    Query: 0 --b/(1,1,1)--> 1

    Index: [6-state transducer with arcs a:1/(1,.1,1), b:1/(1,0,.8),
    a:1/(1,0,.9), b:1/(1,.2,1) and final arcs ε:1/(1,0,0), ε:2/(1,.9,.9),
    ε:1/(1,0,0), ε:2/(1,0,0), ε:1/(1,.8,.8), ε:2/(1,0,0)]

    Results: 0 --b:1/(1,.2,1)--> 1, with final arcs ε:2/(1,0,0) and
    ε:1/(1,.8,.8)

    (Utterance ID, Time Interval, Posterior Probability):
    1. (1, 1-1.8, 1)
    2. (2, 0.2-1, 1)

  • 1-pass STD with Timed Factor Transducer

    Procedure
    - For each query:
      - Obtain (utterance ID, time interval, posterior probability) triplets
      - Return triplets with posterior probability > τ

    Highlights
    - No 2nd pass → fast
    - No multiple occurrence problem: every distinct interval leads to another
      index entry → overlapping intervals are clustered
    - No time interval mismatch problem → efficient optimization
      (almost deterministic)

  • Index Size vs. Beam Width
    BUTBN-R data-set, > 160 hours

    [Plot: index size in MB (0-4500) vs. lattice beam width (1-10) for the
    Timed Factor Transducer, Modified Factor Transducer, and Factor Transducer]

  • Search Time vs. Beam Width
    BUTBN-R data-set, > 160 hours; R-IV query-set: 4400 IV terms

    [Two plots of total search time in seconds vs. beam width (1-10): the
    2-stage Factor Transducer (400-1200 s) vs. the Timed and Modified Factor
    Transducers (5-30 s)]

  • Per Query Search Time w.r.t. Query Length
    BUTBN-R data-set, > 160 hours; R-IV query-set: 4400 IV terms

    [Four plots of average search time in ms vs. beam width (1-10), for query
    lengths 1-4, comparing the Timed and Modified Factor Transducers]

  • Per Result Search Time vs. Query Length
    BUTBN-R data-set, > 160 hours; R-IV query-set: 4400 IV terms

    [Plot: average search time in ms vs. query length (0-4) at beam width 4,
    comparing the Timed and Modified Factor Transducers]

  • Summary

    - WFST-based indexing provides a fast, mathematically sound retrieval
      solution for the STD task.
    - The Modified Factor Transducer is disk-space friendly but
      non-deterministic.
    - The Timed Factor Transducer is “almost deterministic” → search time is
      linear in the query length.

  • Outline (current section: Retrieval)

  • Query Forming for Phonetic Search

    Motivation: to search for OOV queries

    Preparation
    - Convert word/subword lattices to phonetic lattices
    - Build a phonetic index

    How to search for OOVs?
    - The orthographic form (text) is available; we need the phonetic form
      (pronunciation)
    - Use a letter-to-sound (L2S) system to obtain likely pronunciations
    - Use multiple pronunciations to search for OOV queries

  • L2S Pronunciations

    L2S System
    - n-gram model over (letter, phone) pairs
    - Scores have a wide dynamic range due to the conditional independence
      assumption
    - Pointless to use the L2S scores as they are

    Unweighted L2S Pronunciations
    1. Obtain weighted pronunciations from the L2S transducer
    2. Pick the n-best alternatives and remove the weights
    3. Search: compose the unweighted automaton representing the alternatives
       with the phonetic index

  • Weighted L2S Pronunciations (Query: Taipei)

    1. Obtain weighted pronunciations from the L2S transducer
       [/t ay b ey/ .5, /t ay p ey/ .05, /d ay b ey/ .005, ...]
    2. Pick the n-best alternatives to prevent false alarms
       [/t ay b ey/ .5, /t ay p ey/ .05]  (n = 2)
    3. Scale the weights with the query length
       [/t ay b ey/ .5^(1/6) ≈ .9, /t ay p ey/ .05^(1/6) ≈ .6]  (query length = 6)
    4. Normalize the scaled weights to obtain posterior scores
       [/t ay b ey/ .6, /t ay p ey/ .4]
    5. Search: compose the weighted automaton representing the alternatives
       with the phonetic index
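    Steps 2-4 are easy to make concrete. A minimal sketch (mine, not from the
    talk) of the length-scaling and normalization:

        def weight_pronunciations(pron_scores, n_best, query_length):
            # pron_scores: list of (pronunciation, L2S probability), best first
            top = pron_scores[:n_best]                                  # step 2
            scaled = [(p, w ** (1.0 / query_length)) for p, w in top]   # step 3
            z = sum(w for _, w in scaled)
            return [(p, w / z) for p, w in scaled]                      # step 4

        prons = [("t ay b ey", 0.5), ("t ay p ey", 0.05), ("d ay b ey", 0.005)]
        print(weight_pronunciations(prons, n_best=2, query_length=6))
        # [('t ay b ey', ~0.59), ('t ay p ey', ~0.41)] -- the [.6, .4] on the slide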

  • Experiment I: Reference Lexicon (Reflex) Pronunciations
    MSTD data-set, MSTD-OOV query-set: 1290 OOVs, phonetic indexes
    (subwords obtained by pruning a phone n-gram model)

    Actual Term Weighted Value:

    Data                      P(FA)     P(Miss)   ATWV
    Word 1-best               .00001    .770      .215
    Word Consensus Nets       .00002    .687      .294
    Word Lattices             .00002    .657      .322
    Fragment 1-best           .00001    .680      .306
    Fragment Consensus Nets   .00003    .584      .390
    Fragment Lattices         .00003    .485      .484


  • Experiment II: ATWV vs. N-best L2S Pronunciations
    MSTD data-set, MSTD-OOV query-set: 1290 OOVs, phonetic indexes

    [Plot: ATWV (0.2-0.5) vs. N (0-10) for Fragment/Word Lattices with
    Weighted/Unweighted L2S Pronunciations; reference lines at .484
    (Fragment Lattices + Reflex) and .322 (Word Lattices + Reflex)]

  • Combined DET Plot for Weighted L2S Pronunciations
    MSTD data-set, MSTD-OOV query-set: 1290 OOVs, phonetic indexes

    [DET plot: miss probability vs. false alarm probability for weighted
    letter-to-sound 1-5 best pronunciations over fragment lattices]

    1-best: MTWV = 0.334, ATWV = 0.372
    2-best: MTWV = 0.354, ATWV = 0.422
    3-best: MTWV = 0.352, ATWV = 0.440
    4-best: MTWV = 0.339, ATWV = 0.447
    5-best: MTWV = 0.316, ATWV = 0.451

    Maximum Term Weighted Value with global thresholding peaks at 2-best;
    Actual Term Weighted Value with term specific thresholding (β = 1000)
    keeps improving through 5-best.

  • Summary

    - Lattice indexes perform better than CN indexes in the OOV retrieval task.
    - Phone indexes generated from sub-word (fragment) lattices represent OOVs
      better.
    - Using multiple pronunciations from the L2S system improves the
      performance, particularly when they are properly weighted.

  • Outline (current section: Thresholding for Spoken Term Detection)

  • Global Thresholding

    [Normalized histogram of posterior scores for an example query, with the
    incorrect/correct class distributions and their EM estimates; a threshold
    splits the score axis into Reject and Accept regions]

    - Pick a global threshold θ for all query terms
    - Apply binary thresholding
    - Vary θ for different operating points

    No term specific behavior, no joint processing of candidates, hence poor
    performance!

  • Term Weighted Value (TWV) [NIST, 2006]

    TWV = 1 − (1/Q) Σ_{k=1..Q} { P_miss(q_k) + β · P_FA(q_k) }

    P_miss(q_k) = 1 − C(q_k)/R(q_k),   P_FA(q_k) = (A(q_k) − C(q_k)) / (T − C(q_k))

    Q        Number of queries
    R(q_k)   Number of occurrences of query q_k
    A(q_k)   Total number of retrieved documents for q_k
    C(q_k)   Number of correctly retrieved documents for q_k
    T        Total duration of the speech archive
    β        Cost of false alarms relative to hits
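    The metric is simple to compute from per-term counts. A minimal sketch
    (mine) directly following the definition above:

        def twv(terms, T, beta):
            # terms: list of dicts with R (true occurrences), A (retrieved),
            # C (correctly retrieved) per query term; T: archive duration.
            total = 0.0
            for t in terms:
                p_miss = 1.0 - t["C"] / t["R"]
                p_fa = (t["A"] - t["C"]) / (T - t["C"])
                total += p_miss + beta * p_fa
            return 1.0 - total / len(terms)

        # e.g. one term: 10 true occurrences, 9 retrieved, 8 correct, 10h archive
        print(twv([{"R": 10, "A": 9, "C": 8}], T=36000, beta=1000))  # ~0.77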

  • TWV Based Term Specific Thresholding [Miller et al., 2007]

    V̂_hit(q_k) = 1 / R̂(q_k),   Ĉ_FA(q_k) = β / (T − R̂(q_k))

    θ̂(q_k) = Ĉ_FA(q_k) / ( Ĉ_FA(q_k) + V̂_hit(q_k) )

    V̂_hit(q_k)   Expected value of a hit for q_k
    Ĉ_FA(q_k)    Expected cost of a false alarm for q_k
    R̂(q_k)       Expected count of occurrences of q_k
    θ̂(q_k)       Optimal threshold for q_k, maximizing TWV in the expected sense

    - Term specific expected counts → term specific thresholds
    - Vary β for different operating points

    Only the sum of the individual scores affects the threshold!

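    A minimal sketch (mine) of the threshold above; R̂(q_k) is the sum of the
    posterior scores retrieved for the term:

        def twv_threshold(expected_count, T, beta=1000.0):
            v_hit = 1.0 / expected_count        # expected value of a hit
            c_fa = beta / (T - expected_count)  # expected cost of a false alarm
            return c_fa / (c_fa + v_hit)

        print(twv_threshold(expected_count=100.0, T=36000))  # ~0.74, frequent term
        print(twv_threshold(expected_count=2.0, T=36000))    # ~0.05, rare term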

  • Exploiting Score Distributions [Manmatha et al., 2001; Can and Saraclar, 2009]

    [Normalized histogram of posterior scores for an example query, with the
    incorrect/correct class distributions and their EM estimates]

    - Scores follow exponential-like distributions
    - Model both classes (0, 1) with exponential distributions:

      p_0(y) = λ_0 e^{−λ_0 y}
      p_1(y) = λ_1 e^{−λ_1 (1−y)}

    - Model all candidates as a mixture of exponentials:

      p(y) = π_0 p_0(y) + (1 − π_0) p_1(y)

    - Use EM to estimate the parameters (λ_0, λ_1, π_0)

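    A minimal EM sketch (mine) for this mixture, assuming scores y in (0, 1]
    and ignoring the truncation of the exponentials to the unit interval:

        import math
        import random

        def em_exponential_mixture(scores, iters=50):
            lam0, lam1, pi0 = 10.0, 10.0, 0.5      # rough initial guesses
            for _ in range(iters):
                # E-step: posterior probability that each score is a false alarm
                g0 = []
                for y in scores:
                    a = pi0 * lam0 * math.exp(-lam0 * y)
                    b = (1 - pi0) * lam1 * math.exp(-lam1 * (1 - y))
                    g0.append(a / (a + b))
                # M-step: weighted maximum-likelihood rates and prior
                n0 = sum(g0)
                n1 = len(scores) - n0
                lam0 = n0 / sum(g * y for g, y in zip(g0, scores))
                lam1 = n1 / sum((1 - g) * (1 - y) for g, y in zip(g0, scores))
                pi0 = n0 / len(scores)
            return lam0, lam1, pi0

        # Synthetic check: mix draws from the two classes and re-estimate.
        random.seed(0)
        ys = [min(random.expovariate(8.0), 1.0) for _ in range(300)]         # class 0
        ys += [1.0 - min(random.expovariate(5.0), 1.0) for _ in range(200)]  # class 1
        print(em_exponential_mixture(ys))  # roughly (8, 5, 0.6)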

  • Computing Term Specific Thresholds

    Cost Scheme

    C = [ 0  1 ]
        [ α  0 ]

    where α is a user defined parameter specifying the cost of false alarms
    relative to hits.

    - Estimate the mixture parameters → each component ~ a class, the mixture
      weights ~ the priors
    - For k = 1, ..., Q, the Bayes-optimal threshold θ̂(q_k) is given as:

      θ̂(q_k) = [ λ̂_1(q_k) + log(λ̂_0(q_k)/λ̂_1(q_k)) + log(π̂_0(q_k)/π̂_1(q_k)) + log α ]
               / ( λ̂_0(q_k) + λ̂_1(q_k) )

    - Different operating points can be achieved by changing α.
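    Plugging the EM estimates into the formula above (a sketch, continuing
    the previous snippet):

        import math

        def bayes_threshold(lam0, lam1, pi0, alpha):
            pi1 = 1.0 - pi0
            num = lam1 + math.log(lam0 / lam1) + math.log(pi0 / pi1) + math.log(alpha)
            return num / (lam0 + lam1)

        print(bayes_threshold(lam0=8.0, lam1=5.0, pi0=0.6, alpha=1.0))  # ~0.45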

  • STD Evaluation Metrics

    Precision-Recall Curves
    Precision and recall are the most popular IR evaluation metrics. Given a
    set of queries q_k, k = 1, ..., Q, let
    - R(q_k) be the number of segments in the collection that are related to
      the query q_k,
    - A(q_k) be the total number of retrieved segments, and
    - C(q_k) be the number of correctly retrieved segments.

    Precision = (1/Q) Σ_{k=1..Q} C(q_k)/A(q_k),   Recall = (1/Q) Σ_{k=1..Q} C(q_k)/R(q_k)

    ROC Curves
    These curves use NIST’s P_Miss and P_FA definitions for STD.

  • Precision-Recall Comparison
    BUTBN-R data-set, R-IV query-set, lattice beam = 4

    [Plot: recall vs. precision for Global Thresholding, TWV Based TST, and
    Score Distribution Based TST]

  • ROC Comparison
    BUTBN-R data-set, R-IV query-set, lattice beam = 4

    [Plot: P_Miss vs. P_FA (10⁻⁶ to 10⁻³) for Global Thresholding, TWV Based
    TST, and Score Distribution Based TST]

  • Summary

    - Exploiting score distributions leads to a viable term specific
      thresholding method
    - SD-TST optimizes the precision metric → superior to TWV-TST over a large
      interval of precision values
    - TWV-TST optimizes the false alarm metric → has much better ROC
      performance

  • Publications

    Can, D., Cooper, E., Ghoshal, A., Jansche, M., Khudanpur, S.,
    Ramabhadran, B., Riley, M., Saraclar, M., Sethy, A., Ulinski, M., and
    White, C. (2009a). Web derived pronunciations for spoken term detection.
    In SIGIR, pages 83-90.

    Can, D., Cooper, E., Sethy, A., White, C., Ramabhadran, B., and
    Saraclar, M. (2009b). Effect of pronunciations on OOV queries in spoken
    term detection. In Proc. ICASSP, pages 3957-3960.

    Can, D. and Saraclar, M. (2009). Score distribution based term specific
    thresholding for spoken term detection. In Proc. NAACL-HLT 2009, pages
    269-272, Boulder, Colorado. Association for Computational Linguistics.

  • References

    Allauzen, C., Mohri, M., and Saraclar, M. (2004). General indexation of
    weighted automata: application to spoken utterance retrieval. In Proc.
    HLT-NAACL.

    Can, D., Cooper, E., Sethy, A., White, C., Ramabhadran, B., and
    Saraclar, M. (2009). Effect of pronunciations on OOV queries in spoken
    term detection. In Proc. ICASSP, pages 3957-3960.

    Can, D. and Saraclar, M. (2009). Score distribution based term specific
    thresholding for spoken term detection. In Proc. NAACL-HLT 2009, pages
    269-272, Boulder, Colorado. Association for Computational Linguistics.

    Chaudhari, U. V. and Picheny, M. (2007). Improvements in phone based
    audio search via constrained match with high order confusion estimates.
    In Proc. ASRU.

    Chelba, C. and Acero, A. (2005). Position specific posterior lattices for
    indexing speech. In Proc. ACL.

    Li, Y. C., Lo, W. K., Meng, H. M., and Ching, P. C. (2000). Query
    expansion using phonetic confusions for Chinese spoken document retrieval.
    In Proc. IRAL.

    Mamou, J., Ramabhadran, B., and Siohan, O. (2007). Vocabulary independent
    spoken term detection. In Proc. ACM SIGIR.

    Manmatha, R., Rath, T., and Feng, F. (2001). Modeling score distributions
    for combining the outputs of search engines. In SIGIR ’01, pages 267-275,
    New York, NY, USA. ACM.

    Miller, D. R. H., Kleber, M., Kao, C., Kimball, O., Colthurst, T.,
    Lowe, S. A., Schwartz, R. M., and Gish, H. (2007). Rapid and accurate
    spoken term detection. In Proc. Interspeech.

    NIST (2006). The spoken term detection (STD) 2006 evaluation plan.
    http://www.itl.nist.gov/iad/mig/tests/std/

    Parlak, S. and Saraclar, M. (2008). Spoken term detection for Turkish
    Broadcast News. In Proc. ICASSP.

    Saraclar, M. and Sproat, R. (2004). Lattice-based search for spoken
    utterance retrieval. In Proc. HLT-NAACL.

    Siohan, O. and Bacchiani, M. (2005). Fast vocabulary independent audio
    search using path based graph indexing. In Proc. Interspeech.

    Zhou, Z. Y., Yu, P., Chelba, C., and Seide, F. (2006). Towards
    spoken-document retrieval for the internet: lattice indexing for
    large-scale web-search architectures. In Proc. HLT-NAACL.

  • Notation & Definitions: Product Semiring

    Definition: For two partially-ordered semirings A = (A, ⊕_A, ⊗_A, 0̄_A, 1̄_A)
    and B = (B, ⊕_B, ⊗_B, 0̄_B, 1̄_B), the product semiring over A × B is

    A × B = (A × B, ⊕_×, ⊗_×, (0̄_A, 0̄_B), (1̄_A, 1̄_B))

    where ⊕_× and ⊗_× are component-wise operators:

    (a1, b1) ⊕_× (a2, b2) = (a1 ⊕_A a2, b1 ⊕_B b2)
    (a1, b1) ⊗_× (a2, b2) = (a1 ⊗_A a2, b1 ⊗_B b2)

    The natural order over A × B, given by

    ((a1, b1) ≤_× (a2, b2)) ⇔ (a1 ⊕_A a2 = a1 and b1 ⊕_B b2 = b1),

    is a partial order, even if A and B are totally ordered.

  • Notation & Definitions: Lexicographic Semiring

    Definition
    For two partially-ordered semirings A = (A, ⊕A, ⊗A, 0A, 1A) and
    B = (B, ⊕B, ⊗B, 0B, 1B), the lexicographic semiring over A × B is:

    A ∗ B = (A × B, ⊕∗, ⊗∗, 0A × 0B, 1A × 1B)

    where ⊕∗ is a lexicographic priority operator

    (a1, b1) ⊕∗ (a2, b2) =
      (a1, b1 ⊕B b2)  if a1 = a2,
      (a1, b1)        if a1 = a1 ⊕A a2 ≠ a2,
      (a2, b2)        if a1 ≠ a1 ⊕A a2 = a2,

    and ⊗∗ is the component-wise multiplication operator.
    A ∗ B is totally-ordered when A and B are:

    ((a1, b1) ≤∗ (a2, b2)) ⇔ (a1 = a1 ⊕A a2 ≠ a2)
                             or (a1 = a2 and b1 = b1 ⊕B b2)

    7 / 10
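    A minimal sketch of ⊕∗ in Python, assuming the first component
    semiring is totally ordered by its own ⊕ (x dominates y exactly when
    plus_a(x, y) == x); the function names are illustrative.

    def lex_plus(plus_a, plus_b):
        def plus(x, y):
            a1, b1 = x
            a2, b2 = y
            if a1 == a2:                     # tie on the first component:
                return (a1, plus_b(b1, b2))  # resolve with the second
            if plus_a(a1, a2) == a1:         # x wins on the first component
                return (a1, b1)
            return (a2, b2)                  # y wins on the first component
        return plus

    # With tropical (min) in both components:
    plus = lex_plus(min, min)
    print(plus((1.0, 5.0), (2.0, 0.0)))  # (1.0, 5.0): first component decides
    print(plus((1.0, 5.0), (1.0, 3.0)))  # (1.0, 3.0): tie broken by the second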


  • Notation & Definitions: Weighted Finite-State Automata

    Definition
    I Given a transition e ∈ E:

    I p[e] : its previous state,
    I n[e] : its next state,
    I i[e] : its input label,
    I o[e] : its output label,
    I w[e] : its weight.

    I A path π = e1 · · · ek ∈ E∗ satisfies n[ei−1] = p[ei], i = 2, . . . , k.

    I We extend p, n, i, o, w to paths:
    I p[π] = p[e1],
    I n[π] = n[ek],
    I i[π] = i[e1] · · · i[ek],
    I o[π] = o[e1] · · · o[ek],
    I w[π] = w[e1] ⊗ · · · ⊗ w[ek].

    [Figure: two example WFSTs; left: a single arc from state qa to state qb
    labeled m:h/.5; right: states qa, qb and a final state qc/1, with arcs
    labeled m:h/.5, e:a/.7 and e:v/.3]

    8 / 10

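    To make the path definitions concrete, a minimal Python sketch; the
    5-tuple transition encoding (p, n, i, o, w) is illustrative, not a
    format from any WFST toolkit, and ⊗ is instantiated as + as in the
    tropical/log semirings.

    def path_weight(path, otimes=lambda a, b: a + b):
        # A valid path satisfies n[e_{i-1}] == p[e_i] for i = 2, ..., k.
        for prev, cur in zip(path, path[1:]):
            assert prev[1] == cur[0], "transitions are not connected"
        w = path[0][4]
        for e in path[1:]:
            w = otimes(w, e[4])
        return w

    def path_labels(path):
        # i[pi] and o[pi]: concatenations of input and output labels.
        return [e[2] for e in path], [e[3] for e in path]

    # A path through the figure's transducer: qa --m:h/.5--> qb --e:a/.7--> qc
    pi = [("qa", "qb", "m", "h", 0.5), ("qb", "qc", "e", "a", 0.7)]
    print(path_weight(pi))  # 1.2
    print(path_labels(pi))  # (['m', 'e'], ['h', 'a'])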

  • Factor Transducer

    Definition
    For a finite-state automaton Ai, i = 1 . . . n, the factor transducer
    of Ai is defined as the weighted finite-state transducer Ti for which:

    ⟦Ti⟧(x, i) = − log(E_Pi[Ci(x)]) ∈ T

    where x ∈ XAi and E_Pi[Ci(x)] is the expected count of x in Ai.
    For each state q ∈ Qi,
    d[q]: shortest distance from Ii to q (− log of forward prob.),
    f[q]: shortest distance from q to Fi (− log of backward prob.).

    d[q] = ⊕log_{π∈P(Ii,q)} (λi(p[π]) + w[π]),
    f[q] = ⊕log_{π∈P(q,Fi)} (w[π] + ρi(n[π]))

    − log(E_Pi[Ci(x)]) = ⊕log_{π: i[π]=x} (d[p[π]] + w[π] + f[n[π]])

    9 / 10

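    The shortest-distance expressions above translate directly into a
    forward-backward computation over a lattice. A minimal sketch,
    assuming an acyclic lattice whose arcs are already sorted in
    topological order of their source states; the arc encoding is
    illustrative, not a library format, and x is a one-label factor.

    import math

    def log_add(a, b):
        # -log(exp(-a) + exp(-b)): the oplus of the log semiring.
        if a == math.inf:
            return b
        if b == math.inf:
            return a
        m = min(a, b)
        return m - math.log1p(math.exp(-abs(a - b)))

    def expected_count(states, arcs, initial, final, x):
        # arcs: (src, dst, label, -log prob), in topological order of src.
        d = {q: math.inf for q in states}
        f = {q: math.inf for q in states}
        d[initial] = 0.0
        f[final] = 0.0
        for src, dst, lab, w in arcs:             # forward pass: d[q]
            d[dst] = log_add(d[dst], d[src] + w)
        for src, dst, lab, w in reversed(arcs):   # backward pass: f[q]
            f[src] = log_add(f[src], w + f[dst])
        c = math.inf
        for src, dst, lab, w in arcs:             # sum over arcs with i[e] = x
            if lab == x:
                c = log_add(c, d[src] + w + f[dst])
        return c  # -log of the expected count of x

    arcs = [(0, 1, "night", -math.log(0.4)),
            (0, 1, "light", -math.log(0.6)),
            (1, 2, "view", 0.0)]
    print(math.exp(-expected_count([0, 1, 2], arcs, 0, 2, "night")))  # ~0.4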

  • EM Parameter Updates

    I Model all candidate scores as a mixture of exponentials

    p(y) = π0 p0(y) + (1 − π0) p1(y)

    I Use EM to estimate the parameters (λ0(qk), λ1(qk) and π0(qk)) given
      the candidate scores yk,n, n = 1, . . . , Nk, of a query term qk.

    I First compute the posteriors

    P(j|yk,n) = π̂j(qk) pj(yk,n) / p(yk,n),    j = 0, 1,  n = 1, . . . , Nk

    I Then update

    λ̂0(qk) = Σn P(0|yk,n) / Σn P(0|yk,n) yk,n,

    λ̂1(qk) = Σn P(1|yk,n) / Σn P(1|yk,n) (1 − yk,n),

    π̂j(qk) = (1/Nk) Σn P(j|yk,n).

    10 / 10
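    A minimal sketch of this EM loop, assuming p0(y) = λ0 exp(−λ0 y)
    (false alarms, mass near 0) and p1(y) = λ1 exp(−λ1 (1 − y)) (hits,
    mass near 1), which is the model the closed-form updates above imply;
    the initial values and iteration count are illustrative.

    import math

    def em_exponential_mixture(scores, lam0=5.0, lam1=5.0, pi0=0.5, iters=50):
        for _ in range(iters):
            # E-step: P(0 | y) for every candidate score y.
            post0 = []
            for y in scores:
                p0 = pi0 * lam0 * math.exp(-lam0 * y)
                p1 = (1.0 - pi0) * lam1 * math.exp(-lam1 * (1.0 - y))
                post0.append(p0 / (p0 + p1))
            # M-step: the closed-form updates from the slide.
            n0 = sum(post0)
            n1 = len(scores) - n0
            lam0 = n0 / sum(g * y for g, y in zip(post0, scores))
            lam1 = n1 / sum((1 - g) * (1 - y) for g, y in zip(post0, scores))
            pi0 = n0 / len(scores)
        return lam0, lam1, pi0

    # Candidate scores of one query term: low ~ false alarms, high ~ hits.
    print(em_exponential_mixture([0.05, 0.1, 0.2, 0.85, 0.9, 0.95]))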
