Acoustic data-driven pronunciation lexicon for LVCSR

Liang Lu, Arnab Ghoshal, and Steve Renals
The University of Edinburgh, {liang.lu, a.ghoshal, s.renals}@ed.ac.uk

Introduction

This paper is about learning the pronunciation model in the low-resource condition, focusing on:

• a review of the probabilistic pronunciation model in the ASR framework

• WFST- and Viterbi-based implementations of EM training for the pronunciation model

• iterative learning of the acoustic and pronunciation models

Probabilistic Pronunciation Model

Conventional ASR systems implicitly assume a static pronunciation model:

$$\hat{\mathbf{W}} = \arg\max_{\mathbf{W}} \, p(\mathbf{O} \mid \mathcal{M}, \mathbf{W}) \, P(\mathbf{W}), \tag{1}$$

where $\mathcal{M}$ denotes the acoustic model parameters, $P(\mathbf{W})$ is the prior probability of the word sequence $\mathbf{W}$, and $\mathbf{O}$ is the sequence of acoustic observations. With an explicit pronunciation model, the framework becomes

$$\hat{\mathbf{W}} = \arg\max_{\mathbf{W}} \, P(\mathbf{W}) \sum_{\mathbf{B} \in \Psi_{\mathbf{W}}} p(\mathbf{O} \mid \mathcal{M}, \mathbf{B}) \, P(\mathbf{B} \mid \mathbf{W}), \tag{2}$$

where $\mathbf{B} = \{b_1, \ldots, b_n\}$ denotes a valid pronunciation sequence for the word transcription $\mathbf{W} = \{w_1, \ldots, w_n\}$, $P(\mathbf{B} \mid \mathbf{W})$ denotes its probability, $b_i$ is the pronunciation of word $w_i$, and $\Psi_{\mathbf{W}}$ denotes the set of all possible pronunciation sequences of $\mathbf{W}$.

Context-independent pronunciation model

In this case:

$$P(\mathbf{B} \mid \mathbf{W}) = P(b_1 \mid w_1) \cdots P(b_n \mid w_n). \tag{3}$$

Like many others, we assume that each word may have multiple surface pronunciations, each with a corresponding probability weight. Then

$$P(b_i = p_j \mid w_i) = \theta_{ij}, \quad j = 1, \ldots, J_i, \tag{4}$$

subject to:

$$\sum_j \theta_{ij} = 1, \tag{5}$$

where $J_i$ is the number of alternative pronunciations of $w_i$, and $p_j$ denotes one of those surface pronunciations, with weight $\theta_{ij}$.
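
To make Eqs. (3)-(5) concrete, here is a minimal Python sketch of such a weighted lexicon; the toy words, pronunciations, and weights are hypothetical, not taken from the paper.

```python
# A minimal sketch of the context-independent pronunciation model,
# Eqs. (3)-(5). The toy lexicon and weights below are hypothetical.
from math import prod

# theta[word][pronunciation] = weight; weights for each word sum to 1 (Eq. 5).
theta = {
    "the":  {"dh ah": 0.7, "dh iy": 0.3},
    "data": {"d ey t ah": 0.6, "d ae t ah": 0.4},
}

def pron_prob(words, prons):
    """P(B|W) as a product of per-word weights, Eq. (3)."""
    return prod(theta[w][b] for w, b in zip(words, prons))

print(pron_prob(["the", "data"], ["dh iy", "d ey t ah"]))  # ~0.18 (0.3 * 0.6)
```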

EM-based maximum likelihood training

The EM auxiliary function used to update the pronunciation weight of the word-pronunciation pair $(w_i, p_j)$ is

$$\begin{aligned}
Q(\theta_{ij}) &= \sum_{r=1}^{R} \sum_{\mathbf{B}_r \in \Psi^{*}_{\mathbf{W}_r}} \underbrace{P(\mathbf{B}_r \mid \mathbf{O}_r, \mathcal{M}, \mathbf{W}_r)}_{\text{posterior of } \mathbf{B}_r} \log P(\mathbf{B}_r \mid \mathbf{W}_r) + k \\
&= \sum_{r=1}^{R} \sum_{\mathbf{B}_r \in \Psi^{*}_{\mathbf{W}_r}} \underbrace{P(\mathbf{B}_r \mid \mathbf{O}_r, \mathcal{M}, \mathbf{W}_r)\, C_{ij|\mathbf{B}_r}}_{\text{how to compute it?}} \log \theta_{ij} + k, \tag{6}
\end{aligned}$$

where $C_{ij|\mathbf{B}_r}$ denotes the number of times that $(w_i, p_j)$ appears in the pronunciation sequence $\mathbf{B}_r$.
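
The poster leaves the M-step implicit; maximizing Eq. (6) subject to the constraint (5), via a Lagrange multiplier, gives the standard multinomial update: the posterior-weighted count of $(w_i, p_j)$, normalized over all pronunciations of $w_i$:

$$\hat{\theta}_{ij} = \frac{\sum_{r} \sum_{\mathbf{B}_r} P(\mathbf{B}_r \mid \mathbf{O}_r, \mathcal{M}, \mathbf{W}_r)\, C_{ij|\mathbf{B}_r}}{\sum_{j'} \sum_{r} \sum_{\mathbf{B}_r} P(\mathbf{B}_r \mid \mathbf{O}_r, \mathcal{M}, \mathbf{W}_r)\, C_{ij'|\mathbf{B}_r}}.$$

A minimal Python sketch of this accumulation follows; the data layout (`posteriors`) is hypothetical, standing in for whatever the lattice or WFST machinery actually produces.

```python
# A minimal sketch of the EM update implied by Eq. (6): accumulate
# posterior-weighted counts C_{ij|B_r} and renormalize per word (Eq. 5).
# `posteriors` is a hypothetical layout: for each utterance r, a list of
# (pron_sequence, posterior) pairs, where pron_sequence is a list of
# (word, pronunciation) pairs.
from collections import defaultdict

def em_update(posteriors):
    counts = defaultdict(lambda: defaultdict(float))
    for utt in posteriors:
        for pron_seq, post in utt:
            for word, pron in pron_seq:
                counts[word][pron] += post  # posterior-weighted count
    # M-step: theta_ij = expected count / total expected count for word i
    theta = {}
    for word, cs in counts.items():
        total = sum(cs.values())
        theta[word] = {pron: c / total for pron, c in cs.items()}
    return theta
```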

1) WFST-based training

All possible pronunciation sequences for a word sequence $\mathbf{W}_r$ are represented as a WFST:

$$\mathcal{P}_r = \min(\det(\mathcal{L} \circ \mathcal{G}_r)), \tag{7}$$

where $\mathcal{L}$ is the lexicon transducer that maps words to their corresponding pronunciations, and $\mathcal{G}_r$ is a linear acceptor representing $\mathbf{W}_r$. Scoring is then done using a path counting transducer, as in Fig. 1.

[Fig 1. (a): The path counting transducer for pronunciation p1 = "a b". (b): A decoding graph that contains p1. (c): The path, with its corresponding weights, obtained by the composition of (a) and (b).]
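
The poster does not name a toolkit; as an illustration only, here is a sketch of Eq. (7) using the pynini WFST library, assuming `L` and `G_r` have been built elsewhere with compatible symbol tables.

```python
# An illustrative sketch of Eq. (7) using pynini (an assumption; not the
# authors' implementation). Determinizing a composed transducer can require
# disambiguation symbols in practice, which this sketch glosses over.
import pynini

def pronunciation_fst(L: pynini.Fst, G_r: pynini.Fst) -> pynini.Fst:
    """Build P_r = min(det(L o G_r)): all pronunciation sequences of W_r."""
    P_r = pynini.compose(L, G_r)   # restrict the lexicon to word sequence W_r
    P_r = pynini.determinize(P_r)  # det(.)
    return P_r.minimize()          # min(.); mutates and returns the FST
```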

2) Viterbi-based training

The Viterbi approximation uses only the most likely pronunciation sequence, which is more computationally efficient:

$$Q(\theta_{ij}) = \sum_{r=1}^{R} C_{ij|\hat{\mathbf{B}}_r} \log \theta_{ij} + k, \tag{8}$$

$$\hat{\mathbf{B}}_r = \arg\max_{\mathbf{B}_r} P(\mathbf{B}_r \mid \mathbf{O}_r, \mathcal{M}, \mathbf{W}_r). \tag{9}$$
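
Under this approximation the update reduces to hard counts from the single best pronunciation sequence per utterance. A minimal sketch, reusing the hypothetical data layout from the EM sketch above, with `best_pron_seqs` holding one decoded (word, pronunciation) sequence per utterance:

```python
# A minimal sketch of the Viterbi-style update, Eqs. (8)-(9): count
# (word, pronunciation) pairs in the 1-best sequence B_r and renormalize.
from collections import defaultdict

def viterbi_update(best_pron_seqs):
    counts = defaultdict(lambda: defaultdict(float))
    for pron_seq in best_pron_seqs:    # one best B_r per utterance
        for word, pron in pron_seq:
            counts[word][pron] += 1.0  # hard count C_{ij|B_r}
    theta = {}
    for word, cs in counts.items():
        total = sum(cs.values())
        theta[word] = {pron: c / total for pron, c in cs.items()}
    return theta
```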

Experiments and Results

We run experiments on the Switchboard corpus, with:

• a seed lexicon with 5k words and an expert lexicon with 30k words

• 110-hour and 300-hour training set configurations

• an iterative acoustic and lexicon model training scheme, shown below

Starting from an initial lexicon:

1. Train the G2P model
2. Generate the lexicon
3. Train the acoustic model
4. Update the lexicon
5. If converged, update the acoustic model and stop
6. Otherwise, update the G2P model and iterate
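
A schematic Python sketch of this loop; every step function here is a hypothetical placeholder (the poster does not specify implementations), stubbed only so the control flow runs end to end.

```python
# A schematic sketch of the iterative training scheme above. All step
# functions are hypothetical placeholders for the corresponding stages.
def train_g2p(lexicon): return "g2p-model"
def generate_lexicon(g2p): return {"data": {"d ey t ah": 1.0}}
def train_acoustic_model(lexicon): return "acoustic-model"
def update_lexicon(am, lexicon): return lexicon
def update_acoustic_model(am, lexicon): return am
def converged(old, new): return old == new

def iterative_training(initial_lexicon):
    g2p = train_g2p(initial_lexicon)        # 1. train the G2P model
    lexicon = generate_lexicon(g2p)         # 2. generate the lexicon
    am = train_acoustic_model(lexicon)      # 3. train the acoustic model
    while True:
        new_lexicon = update_lexicon(am, lexicon)  # 4. update the lexicon
        if converged(lexicon, new_lexicon):
            return update_acoustic_model(am, new_lexicon)  # 5. stop
        g2p = train_g2p(new_lexicon)        # 6. update the G2P model
        lexicon = generate_lexicon(g2p)     # back to step 2
```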

1) Results of the training scheme, and a comparison of WFST- and Viterbi-based training, on the 110-hour training set

2) Results on the 300-hour training set using Viterbi-based training (word error rates, %):

System                         Callhome   Swb    Avg
G2P baseline (ML)              46.3       29.0   37.7
110-hr lexicon baseline (ML)   44.2       27.2   35.9
G2P iter1 (ML)                 43.6       26.1   35.0
  + ML-SAT                     38.2       23.2   30.8
  + bMMI-SAT                   35.1       20.5   27.8
Expert lexicon baseline (ML)   42.3       25.3   34.0
  + ML-SAT                     36.8       22.0   29.4
  + bMMI-SAT                   33.5       19.3   26.4

Conclusion

This paper is about learning the pronunciation lexicon from transcribed acoustic data. Two training algorithms, based on the WFST and Viterbi methods, are compared. The approach requires an initial seed lexicon and uses context-independent pronunciation models; future work will move beyond these constraints.

Acknowledgement

The research was supported by EPSRC Programme Grant EP/I031022/1 (Natural Speech Technology).
