Large vocabulary ASR
Peter Bell
Automatic Speech Recognition – ASR Lecture 9
10 February 2020
HMM Speech Recognition
[Figure: HMM speech recognition pipeline — recorded speech undergoes signal analysis; an acoustic model and a language model, both estimated from training data, define the search space from which the decoded text (transcription) is produced.]
The Search Problem in ASR
Find the most probable word sequence W = w_1, w_2, ..., w_M given the acoustic observations X = x_1, x_2, ..., x_T:

    W* = argmax_W P(W | X) = argmax_W p(X | W) P(W)

where p(X | W) is the acoustic model and P(W) is the language model.
Use pronunciation knowledge to construct HMMs for all possible words
Finding the most probable state sequence allows us to recover the most probable word sequence
Viterbi decoding is an efficient way of finding the most probable state sequence, but even this becomes infeasible as the vocabulary gets very large or when a stronger language model is used
Recap: the word HMM
[Figure: word HMM for "right" — 3-state phone HMMs for /r/, /ai/ and /t/ concatenated, with states r_1 r_2 r_3, ai_1 ai_2 ai_3, t_1 t_2 t_3.]
HMM naturally generates an alignment between hidden states and observation sequence
Viterbi algorithm for state alignment
[Figure: Viterbi trellis — HMM states r_1–r_3, ai_1–ai_3, t_1–t_3, with start state 0 and end state E, on the vertical axis against time frames x1, … on the horizontal axis.]
Viterbi algorithm finds the best path through the trellis – giving the highest p(X, Q).
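To make the trellis computation concrete, here is a minimal log-domain Viterbi sketch (not from the lecture; the array names and layout are assumptions made for illustration):

```python
import math

def viterbi(obs_loglikes, log_trans, log_init):
    """Return (best log prob p(X, Q), best state path) through the trellis.

    obs_loglikes[t][j]: log p(x_t | state j), one row per frame
    log_trans[i][j]:    log transition probability from state i to j
    log_init[j]:        log initial probability of state j
    """
    T, N = len(obs_loglikes), len(log_init)
    delta = [log_init[j] + obs_loglikes[0][j] for j in range(N)]
    backptrs = []
    for t in range(1, T):
        prev = delta
        delta, ptrs = [0.0] * N, [0] * N
        for j in range(N):
            # best predecessor state for reaching state j at frame t
            best_i = max(range(N), key=lambda i: prev[i] + log_trans[i][j])
            ptrs[j] = best_i
            delta[j] = prev[best_i] + log_trans[best_i][j] + obs_loglikes[t][j]
        backptrs.append(ptrs)
    # trace back from the best final state
    state = max(range(N), key=lambda j: delta[j])
    path = [state]
    for ptrs in reversed(backptrs):
        state = ptrs[state]
        path.append(state)
    return max(delta), path[::-1]
```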
Simplified version with one state per phone
[Figure: simplified trellis with one state per phone (r, ai, t, plus start state 0 and end state E), over time frames x1–x7.]
Isolated word recognition
[Figure: isolated word recognition — parallel word HMMs for "right" (/r/ /ai/ /t/), "ten" (/t/ /eh/ /n/) and "pin" (/p/ /ih/ /n/), each a sequence of 3-state phone models, all starting from state 0 and ending at state E.]
Viterbi algorithm: isolated word recognition
[Figure: Viterbi trellis for isolated word recognition — the phone states of "right", "ten" and "pin" stacked between start state 0 and end state E, over time frames x1–x7.]
Connected word recognition
Even worse when recognising connected words...
The number of words in the utterance is not known
Word boundaries are not known: any of the V words may potentially start at each frame.
Connected word recognition
[Figure: the word HMMs for "right", "ten" and "pin" again, now connected for continuous recognition between states 0 and E.]
Viterbi algorithm: connected word recognition
[Figure: connected-word trellis — the phone states of "right", "ten" and "pin" over frames x1–x7.]
Add transitions between all word-final and word-initial states
Connected word recognition
[Figure: the same connected-word trellis with inter-word transitions added.]
Viterbi decoding finds the best word sequence
BUT: have to consider |V|^2 inter-word transitions at every time step
Integrating the language model
So far we’ve estimated HMM transition probabilities from audio data, as part of the acoustic model
Transitions between words → use a language model
n-gram language model:

    p(w_i | h_i) = p(w_i | w_{i-n+1}, ..., w_{i-1})
Integrate the language model directly in the Viterbi search
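As a sketch of how the bigram LM enters the search: at each frame boundary, word-final Viterbi scores are propagated into word-initial states with the bigram log probability added. The data layout (final_scores, log_bigram) is illustrative, not from the lecture:

```python
import math

def interword_transitions(final_scores, log_bigram, vocab):
    """Propagate word-final token scores into word-initial states,
    adding log P(next_word | prev_word) on each transition.

    final_scores: {word: best log score in that word's final state}
    log_bigram:   {(prev, nxt): log P(nxt | prev)}
    Returns {word: (best entry log score, best predecessor word)}.
    """
    entries = {}
    for nxt in vocab:
        # every (prev, nxt) pair is considered: |V|^2 transitions per frame
        best_prev, best = None, -math.inf
        for prev in vocab:
            score = final_scores[prev] + log_bigram.get((prev, nxt), -math.inf)
            if score > best:
                best_prev, best = prev, score
        entries[nxt] = (best, best_prev)
    return entries
```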
Incorporating a bigram language model
[Figure: the three word HMMs with inter-word transitions weighted by bigram probabilities, e.g. P(ten | pin), P(pin | pin), P(pin | right).]
Incorporating a bigram language model
[Figure: connected-word trellis with bigram-weighted inter-word transitions, e.g. P(ten | pin), P(pin | pin), P(right | pin).]
Incorporating a trigram language model
[Figure: the word HMMs duplicated for each one-word history so that trigram probabilities can be applied on inter-word transitions, e.g. P(pin | ten pin), P(pin | pin pin), P(ten | pin pin), P(right | pin ten), P(right | ten ten), P(right | ten right).]
Need to duplicate HMM states to incorporate extended word history
Computational Issues
Viterbi decoding performs an exact search in an efficient manner
But exact search is not possible for large vocabulary tasks
Long-span language models and the use of cross-word triphones greatly increase the size of the search space
Solutions:
- Beam search (prune low probability hypotheses)
- Tree-structured lexicons
- Language model look-ahead
- Dynamic search structures
- Multipass search (→ two-stage decoding)
- Best-first search (→ stack decoding / A* search)
An alternative approach: Weighted Finite State Transducers (WFSTs)
Pruning
[Figure: the connected-word trellis again, with low-probability paths pruned away.]
During Viterbi decoding, don't propagate tokens whose probability falls a certain amount below the current best path
Result is only an approximation to the best path
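A minimal sketch of the pruning rule itself, assuming tokens are (state, log score) pairs and beam is a tunable log-probability width:

```python
def beam_prune(tokens, beam):
    """Keep only tokens within `beam` (in log probability) of the best
    token at the current frame; the rest are not propagated."""
    best = max(score for _, score in tokens)
    return [(state, score) for state, score in tokens if score >= best - beam]

# e.g. beam_prune([("r_1", -10.2), ("ai_1", -35.0)], beam=20.0)
# drops the second token, at the risk of discarding the true best path.
```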
Tree-structured lexicon
[Figure: tree-structured lexicon — words sharing phone prefixes share states: say (s ey), speak (s p ee k), speech (s p ee ch), spell (s p eh l), talk (t aw k), tell (t eh l), all branching from root state 0.]
Figure adapted from Ortmanns & Ney, “The time-conditioned approach in dynamic programming search for LVCSR”
Tree-structured lexicon
[Figure: trellis for the tree-structured lexicon — the shared prefix s-p-ee is traversed once before branching to k (speak) or ch (speech); other branches lead to say (s ey) and spell (s p eh l); frames x1–x7.]
Reduces the number of state transition computations
For clarity, not all the connections are shown
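One way to realise a tree-structured lexicon is a phone-prefix trie; the node layout below is my own illustration, not the lecture's:

```python
def build_lexicon_tree(lexicon):
    """Build a tree-structured lexicon: words sharing a phone prefix
    share the corresponding arcs/states.

    lexicon: {word: list of phones}
    A node is {"children": {phone: node}, "word": word or None}.
    """
    root = {"children": {}, "word": None}
    for word, phones in lexicon.items():
        node = root
        for ph in phones:
            node = node["children"].setdefault(
                ph, {"children": {}, "word": None})
        node["word"] = word  # word identity is only known at the leaf
    return root

# the lexicon from the figure: "speak", "speech" and "spell" share s-p
tree = build_lexicon_tree({
    "say": ["s", "ey"],
    "speak": ["s", "p", "ee", "k"],
    "speech": ["s", "p", "ee", "ch"],
    "spell": ["s", "p", "eh", "l"],
    "talk": ["t", "aw", "k"],
    "tell": ["t", "eh", "l"],
})
```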
Language model look-ahead
Aim to make pruning more efficient
In tree-structured decoding, look ahead to find out the best LM score for any words further down the tree
This information can be pre-computed and stored at each node in the tree
States in the tree are pruned early if we know that none of the possibilities will receive good enough probabilities from the LM.
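A sketch of that pre-computation, reusing the trie nodes from the previous sketch and, for simplicity, a unigram look-ahead (real decoders use the current LM context):

```python
import math

def attach_lm_lookahead(node, log_unigram):
    """Store at each node the best LM log probability of any word
    reachable below it; pruning can then compare a partial path's
    acoustic score plus this bound against the beam."""
    best = -math.inf if node["word"] is None else log_unigram[node["word"]]
    for child in node["children"].values():
        best = max(best, attach_lm_lookahead(child, log_unigram))
    node["lm_lookahead"] = best
    return best
```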
Weighted Finite State Transducers
Weighted finite state automaton that transduces an input sequence to an output sequence (Mohri et al. 2008)
States connected by transitions. Each transition has:
- an input label
- an output label
- a weight
Weights use the log semiring or tropical semiring, with operations that correspond to multiplication and addition of probabilities
There is a single start state. Any state can optionally be a final state (with a weight)
Used by Kaldi
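A toy in-memory representation in this style (a sketch for illustration only, not Kaldi's or OpenFst's actual API):

```python
from dataclasses import dataclass, field

@dataclass
class Arc:
    ilabel: str      # input label; "<eps>" consumes no input
    olabel: str      # output label; "<eps>" produces no output
    weight: float    # tropical semiring: -log probability
    nextstate: int

@dataclass
class WFST:
    start: int = 0
    arcs: dict = field(default_factory=dict)    # state -> [Arc, ...]
    finals: dict = field(default_factory=dict)  # final state -> final weight

    def add_arc(self, state, arc):
        self.arcs.setdefault(state, []).append(arc)

# In the tropical semiring, weights along a path are added (multiplying
# probabilities) and alternative paths are combined by taking the min
# (a Viterbi-style "addition").
```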
Weighted Finite State Acceptors

[Figure 1 from Mohri et al. (2008): weighted finite-state acceptor examples — (a) a word-level grammar with arcs such as using/1, data/0.66, intuition/0.33, is/1, better/0.7, worse/0.3; (b) a toy pronunciation with weighted phone arcs d/1, ey/0.5, ae/0.5, t/0.3, dx/0.7, ax/1; (c) a 3-state left-to-right HMM topology. States are circles marked with unique numbers; the initial state is bold and final states are double circles; an arc with label l and weight w is marked l/w.]
Weighted Finite State Transducers
Acceptor:

[Figure 1 from Mohri et al. (2008), as on the previous slide: weighted finite-state acceptors for a word grammar, a pronunciation and an HMM.]
Transducer:

[Figure 2 from Mohri et al. (2008): weighted finite-state transducer examples — (a) the grammar of Figure 1(a) with identical input and output labels on each arc; (b) a toy pronunciation lexicon mapping phone strings to the words “data” and “dew”, where each word is output by the arc consuming its first phone (e.g. d:data/1) and the remaining arcs output the null symbol ε (e.g. ey:ε/0.5). An arc with input label i, output label o and weight w is marked i:o/w.]
The HMM as a WFST
[Figure: a 3-state HMM for the phone AX written as a WFST — states 0–3; self-loop and forward arcs consume the HMM-state symbols ax_1, ax_2, ax_3 and output <eps>, except the final arc into state 3, labelled ax_3:AX, which emits the phone.]
WFST Algorithms
Composition: combines transducers T1 and T2 into a single transducer acting as if the output of T1 was passed into T2 (see the sketch below)
Determinisation: ensures that each state has no more than a single outgoing transition for a given input label
Minimisation: transforms a transducer into an equivalent transducer with the fewest possible states and transitions
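A minimal composition sketch for epsilon-free transducers, assuming a dict-based representation (real implementations such as OpenFst also need epsilon-filter machinery):

```python
from collections import deque

def compose(t1, t2):
    """Compose epsilon-free transducers T1 and T2: an arc (i, m, w1) in T1
    and an arc (m, o, w2) in T2 combine into (i, o, w1 + w2) on pair states.

    Each transducer: {"arcs": {state: [(ilab, olab, w, next)]},
                      "finals": set of states}, start state 0.
    The composed machine starts at pair state (0, 0).
    """
    arcs, finals = {}, set()
    queue, seen = deque([(0, 0)]), {(0, 0)}
    while queue:
        s1, s2 = queue.popleft()
        if s1 in t1["finals"] and s2 in t2["finals"]:
            finals.add((s1, s2))
        for i1, o1, w1, n1 in t1["arcs"].get(s1, []):
            for i2, o2, w2, n2 in t2["arcs"].get(s2, []):
                if o1 == i2:  # T1's output feeds T2's input
                    arcs.setdefault((s1, s2), []).append(
                        (i1, o2, w1 + w2, (n1, n2)))
                    if (n1, n2) not in seen:
                        seen.add((n1, n2))
                        queue.append((n1, n2))
    return {"arcs": arcs, "finals": finals}
```

Composing a toy lexicon transducer L (phones in, words out) with a grammar acceptor G in this way is exactly the L ◦ G construction described under “Applying WFSTs to speech recognition” below.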
Applying WFSTs to speech recognition
Represent the following components as WFSTs
     transducer              input sequence   output sequence
G    word-level grammar      words            words
L    pronunciation lexicon   phones           words
C    context-dependency      CD phones        phones
H    HMM                     HMM states       CD phones

Composing L and G results in a transducer L ◦ G that maps a phone sequence to a word sequence
H ◦ C ◦ L ◦ G results in a transducer that maps from HMM states to a word sequence
Reading
Ortmanns and Ney (2000). “The time-conditioned approach in dynamic programming search for LVCSR”. IEEE Transactions on Speech and Audio Processing, Vol. 8, No. 6.
Mohri et al. (2008). “Speech recognition with weighted finite-state transducers”. In Springer Handbook of Speech Processing, pp. 559–584. Springer. http://www.cs.nyu.edu/~mohri/pub/hbka.pdf
WFSTs in Kaldi: http://danielpovey.com/files/Lecture4.pdf