1
Graphical Model Architectures for Speech Recognition
Jeff A. Bilmes and Chris Bartels
Presenter: Shih-Hsiang (士翔)
IEEE Signal Processing Magazine, September 2005
2
References
Jeff A. Bilmes and Chris Bartels, "Graphical Model Architectures for Speech Recognition," IEEE Signal Processing Magazine, Sep. 2005.
Jeff Bilmes, "Graphical Models in Speech and Language Research," tutorial presented at HLT/NAACL'04, 2004.
3
Introduction
A graph is a two-dimensional visual formalism that can be used to describe many phenomena, e.g., computer science, data and control flow, entity relationships, and social networks. Graphs represent complex situations in an intuitive and visually appealing way.
Statistical graphical models are a family of graphical abstractions of statistical models in which important aspects of the model are represented using graphs. They offer a mathematically formal but widely flexible means for solving many problems.
4
Introduction (cont.)
The graphs used in ASR can represent events at:
High-level information (e.g., relationships between linguistic classes)
Very low-level information (e.g., correlations between spectral features or acoustic landmarks)
Or all levels in between (e.g., lexical pronunciation)
The fundamental advantage of graphical models is rapidity: they make it possible to quickly express a novel, complicated idea in an intuitive, concise, and mathematically precise way, to speedily and visually communicate that idea to colleagues, and, moreover, to rapidly prototype that idea.
5
Introduction (cont.)
This article discusses the foundations of the use of graphical models for speech recognition, using dynamic Bayesian networks (DBNs) and a DBN extension, the Graphical Model Toolkit's (GMTK's) basic template, a dynamic graphical model representation that is more suitable for speech and language systems.
It should be noted that many of the ideas presented here are also applicable to natural language processing and general time-series analysis.
6
Notation Conventions
1:N denotes the set of integers {1, 2, ..., N}.
X_{1:N} denotes a set of N random variables (RVs).
Given any subset S ⊆ 1:N, where S = {S_1, S_2, ..., S_{|S|}}, the corresponding subset of random variables is denoted X_S = {X_{S_1}, X_{S_2}, ..., X_{S_{|S|}}}.
Upper-case letters (such as X and Q) refer to random variables; lower-case letters (such as x and q) refer to random variable values.
7
Cause and Effect
Fact: At least 196 cows died in Thailand in Jan/Feb 2004, as did 16 people.
Consequence: In April 2004, Canadian officials killed 19 million birds in British Columbia (chickens, ducks, geese, etc.).
Possible cause I: The original deaths were due to avian influenza (H5N1, or bird flu).
Possible cause II: They died of old age!
8
Cause and Effect (cont.)
Simple directed graphs can be used: directed edges go from parent (possible cause) to child (possible effect).
[Figure: a DAG with edges Bird Flu → Deaths, Old Age → Deaths, and Deaths → Canadian Action]
9
Cause and Effect (cont.)
Quantities of interest:
Computing probabilities: Pr(Deaths | Flu and Old), Pr(Deaths | Old), Pr(Bird Flu | Canadian Action)
Asking questions: In general, does old age increase the chance that a cow has contracted bird flu (if at all)? If we know the Canadian action occurred, does having bird flu decrease the chance that the cow was old when it died?
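As a hedged illustration of these queries, here is a brute-force enumeration over the four-node graph above; all probability values are made up for the sketch, and the variable names are mine:

```python
# A minimal sketch (hypothetical probability values) of inference by brute-force
# enumeration in the bird-flu DAG: Flu -> Deaths <- Old, Deaths -> Action.
from itertools import product

p_flu = {True: 0.1, False: 0.9}              # P(Flu)
p_old = {True: 0.3, False: 0.7}              # P(Old)
p_deaths = {                                 # P(Deaths=True | Flu, Old)
    (True, True): 0.99, (True, False): 0.9,
    (False, True): 0.4, (False, False): 0.01,
}
p_action = {True: 0.95, False: 0.01}         # P(Action=True | Deaths)

def joint(flu, old, deaths, action):
    pd = p_deaths[(flu, old)] if deaths else 1 - p_deaths[(flu, old)]
    pa = p_action[deaths] if action else 1 - p_action[deaths]
    return p_flu[flu] * p_old[old] * pd * pa

def prob(query, evidence):
    """P(query | evidence); both are dicts over {'flu','old','deaths','action'}."""
    num = den = 0.0
    for flu, old, deaths, action in product([True, False], repeat=4):
        world = dict(flu=flu, old=old, deaths=deaths, action=action)
        if any(world[k] != v for k, v in evidence.items()):
            continue
        p = joint(flu, old, deaths, action)
        den += p
        if all(world[k] == v for k, v in query.items()):
            num += p
    return num / den

print(prob({'flu': True}, {'action': True}))
print(prob({'flu': True}, {'action': True, 'old': True}))
```

Running it shows the "explaining away" effect the second question asks about: given the Canadian action, also learning that the cow was old lowers the probability that it had bird flu.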
10
Graphical Models (GMs)
Graphical models can give us SALAD:
Structure: a method to explore the structure of "natural" phenomena (causal vs. correlated relations, properties of natural signals and scenes, factorization)
Algorithms: a set of algorithms that provide "efficient" probabilistic inference and statistical decision making
Language: a mathematically formal, abstract, visual language with which to efficiently discuss families of probabilistic models and their properties
11
Graphical Models (GMs) (cont.)
Approximation: methods to explore systems of approximation and their implications, e.g., what are the consequences of a (perhaps known to be) wrong assumption? Inferential approximation; task-dependent structural approximation
Database: a probabilistic "database" and corresponding "search algorithms" for making queries about properties of such model families
12
Topology of Graphical Models
13
Bayesian network (BN)
A Bayesian network is one type of graphical model in which the graph is directed and acyclic. In a BN, the probability distribution over a set of variables X_{1:N} factorizes with respect to a directed acyclic graph (DAG) as

p(x_{1:N}) = ∏_i p(x_i | x_{π_i})

where π_i is the subset of indices of X_i's immediate parents according to the BN's DAG.
[Figure: the alarm network, a DAG with edges Burglary → Alarm, Earthquake → Alarm, Earthquake → Radio, and Alarm → Call]
Nodes: random variables. Edges: direct "influence".
Factorization property: P(B,E,A,C,R) = P(B) P(E) P(A|B,E) P(C|A) P(R|E)
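To make the factorization concrete, here is a minimal sketch in Python; the CPT values are hypothetical, chosen only to give the tables the right shape:

```python
# A minimal sketch (hypothetical CPT values) of the factorization property
# P(B,E,A,C,R) = P(B) P(E) P(A|B,E) P(C|A) P(R|E) for the alarm network.
P_B = 0.01                                   # P(Burglary=True)
P_E = 0.02                                   # P(Earthquake=True)
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(Alarm=True | B, E)
P_C = {True: 0.90, False: 0.05}              # P(Call=True | Alarm)
P_R = {True: 0.99, False: 0.0001}            # P(Radio=True | Earthquake)

def bern(p_true, value):
    """Probability that a binary variable takes `value`, given P(True)."""
    return p_true if value else 1.0 - p_true

def joint(b, e, a, c, r):
    # One factor per node, each conditioned only on its parents in the DAG.
    return (bern(P_B, b) * bern(P_E, e) * bern(P_A[(b, e)], a) *
            bern(P_C[a], c) * bern(P_R[e], r))

print(joint(b=True, e=False, a=True, c=True, r=False))
```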
14
Dynamic Bayesian networks (DBNs)
Speech is inherently a temporal process, and any graphical model for speech must take this into account. Accordingly, dynamic graphical models are graphs that represent the temporal evolution of the statistical properties of a speech signal; among them, DBNs have been the most successfully used.
DBNs are simply BNs with a repeated "template" structure over time:
They are specified using a "rolled up" template giving the nodes that are repeated in each slice (time frame).
In the unrolled DBN, all variables sharing the same origin in the template have tied parameters.
This allows a graph to be specified over series of arbitrary (unbounded) length.
As in any BN, the collection of edges pointing into a node corresponds to a conditional probability function (CPF).
15
Dynamic Bayesian networks (DBNs) (cont.)
It is well known that the hidden Markov model (HMM) is one type of DBN. However, the HMM is only one small model within the enormous family of statistical techniques represented by DBNs.
[Figure: an HMM unrolled over four frames, with hidden chain Q1 → Q2 → Q3 → Q4 and emissions Q_t → X_t; this corresponds to the two-slice template Q1 → Q2 with X1, X2 attached]
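The template view can be made concrete with a small sketch; the parameters below are toy values, and the function simply multiplies one CPF per node of the unrolled graph, with every slice sharing the same tied tables:

```python
# A minimal sketch (hypothetical toy parameters) of an HMM viewed as a DBN:
# a two-slice template (prior p(q1), transition p(qt|qt-1), emission p(xt|qt))
# is "unrolled" T times, with every slice sharing (tying) the same parameters.
import numpy as np

prior = np.array([0.6, 0.4])                 # p(q_1)
trans = np.array([[0.7, 0.3],                # p(q_t | q_{t-1})
                  [0.2, 0.8]])
emit = np.array([[0.9, 0.1],                 # p(x_t | q_t), two discrete symbols
                 [0.3, 0.7]])

def joint_log_prob(states, observations):
    """log p(q_{1:T}, x_{1:T}) for one fully specified path: the product of
    one CPF per node of the unrolled DBN."""
    lp = np.log(prior[states[0]]) + np.log(emit[states[0], observations[0]])
    for t in range(1, len(states)):
        lp += np.log(trans[states[t - 1], states[t]])    # tied across slices
        lp += np.log(emit[states[t], observations[t]])
    return lp

print(joint_log_prob(states=[0, 0, 1, 1], observations=[0, 1, 1, 1]))
```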
16
Dynamic Bayesian networks (DBNs) (cont.)
More generally, a DBN specifies a template to be unrolled.
[Figure: a DBN template with intra-slice and inter-slice edges, shown alongside the same template unrolled three times]
17
Dynamic Bayesian networks (DBNs) (cont.)
In fact, many (but not all) DBNs can be "flattened" into a corresponding HMM, but staying within the DBN framework has several advantages:
There can be exploitable computational advantages, since the DBN explicitly represents factorization properties, and factorization is the key to tractable probabilistic inference.
The factorization specified by a DBN implies constraints that the model must obey.
Information about a domain is visually and intuitively portrayed.
18
Dynamic Bayesian networks (DBNs) (cont.)
[Figure: a two-Markov-chain DBN contrasted with a flattened HMM; the flattened HMM has one chain and ignores the factorization constraint expressed by the graph]
19
The GMTK Dynamic Template
GMTK: the Graphical Models Toolkit.
A GMTK template extends a standard DBN template in five distinct ways:
First, it allows not only forward but also backward directed time links.
Second, network slices may span multiple time frames, so slices are called chunks.
Third, a GMTK template includes a built-in specification mechanism for switching parents.
Fourth, parents of a variable may be multiple chunks in the past or in the future.
Fifth, it allows different multiframe structures to occur at both the beginning and the end of the unrolled network.
20
The GMTK Dynamic Template (cont.)
21
The GMTK Features
Textual graph language
Switching parent functionality
Forward and backward time links
Multirate models with extended DBN templates
Linear dependencies on observations
Arbitrary low-level parameter sharing (EM/GEM training)
Gaussian vanishing/splitting algorithm
Decision-tree-based implementations of dependencies (deterministic, sparse, formula leaf nodes)
Full inference; single-pass decoding possible
Sampling methods
Exact inference via the linear and island algorithms (O(log T))
22
Why Graphical Models for Speech and Language Processing
An expressive but concise way to describe properties of families of distributions
Rapid movement from novel idea to implementation
All graphs use exactly the same inference algorithm
The researcher concentrates on the model and can stay focused on the domain
Holds promise to replace the ubiquitous HMM
Dynamic Bayesian networks and dynamic graphical models can represent important structure in "natural" time signals such as speech and language
23
Four Main Goals for Using Graphical Models in ASR
Explicitly and efficiently represent typical or novel ASR control constructs: derive graph structures that themselves explicitly represent control constructs, e.g., parameter tying/sharing, state sequencing, smoothing, mixing, backing off, etc.
Latent knowledge modeling: graphical models can provide the infrastructure in which this knowledge can be exploited, e.g., dialog act, word/phrase category, pronunciation variant, speaking rate, model/style/gender, acoustic channel/noise condition, etc.
Proper observation modeling: the model can more appropriately match the underlying statistics extant in speech as represented by the current series of feature vectors.
Automatic structure learning: derive structure automatically, ideally to improve error rate while simultaneously minimizing computational cost.
24
Graphical model speech architectures
In this paper, the authors demonstrate:
A phone-based bigram decoding structure
A phone-based trigram architecture
A cross-word triphone architecture
A tree-structured lexicon architecture
A transition-explicit model
A multiobservation and multihidden-stream semi-asynchronous architecture
Architectures over observed variables
25
Phone-based bigram decoding structure
[Figure: the graph template, with a prologue, a chunk unrolled one time, and an epilogue, spanning time (frames) 1-4; some edges are deterministic dependencies, others purely random. Annotated variables: the position of the current phone within a word; a variable that ensures a proper final word is decoded; and a binary indicator that specifies when the model should advance to the next phone.]
26
Phone-based bigram decoding structure (cont.)
Word position:
Does not change from one frame to the next (no phone transition)
Increments in value by one (phone transition, and the model is not in the last position of the word)
Resets to zero (phone transition, and the model is in the last position of the word)
(A code sketch of these rules, together with the word-transition and word CPFs, follows the next slide.)
27
Phone-based bigram decoding structure (cont.)
Word transition: becomes one when the model makes a phone transition out of the last position of a given word with k total phones.
Word: when the word transition is zero, the word stays the same with probability 1; when a word transition occurs, the next word is drawn using the bigram language model probability.
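A minimal sketch of these three CPFs (word position, word transition, and the word variable itself); the toy lexicon, bigram table, and function names are all hypothetical:

```python
# A minimal sketch (hypothetical names and toy tables) of the CPFs driving the
# phone-based bigram structure: word-position update, word-transition
# indicator, and the word variable's bigram CPF.
import random

PHONES_PER_WORD = {'cat': 3, 'dog': 3, 'a': 1}           # k phones per word
BIGRAM = {('cat', 'dog'): 0.4, ('cat', 'a'): 0.6,         # p(w_t | w_{t-1})
          ('dog', 'cat'): 0.5, ('dog', 'a'): 0.5,
          ('a', 'cat'): 0.7, ('a', 'dog'): 0.3}

def next_position(pos, phone_transition, word):
    """Deterministic word-position CPF."""
    if not phone_transition:
        return pos                                        # stay put
    if pos < PHONES_PER_WORD[word] - 1:
        return pos + 1                                    # advance within word
    return 0                                              # reset at word end

def word_transition(pos, phone_transition, word):
    """1 exactly when we leave the last phone of a word with k total phones."""
    return int(phone_transition and pos == PHONES_PER_WORD[word] - 1)

def next_word(word, transition):
    """Word CPF: copy with probability 1, or draw from the bigram model."""
    if not transition:
        return word
    candidates = [(w2, p) for (w1, w2), p in BIGRAM.items() if w1 == word]
    words, probs = zip(*candidates)
    return random.choices(words, weights=probs)[0]

print(next_position(2, phone_transition=True, word='cat'))  # 0: word ended
print(word_transition(2, True, 'cat'))                      # 1: transition fires
```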
28
Phone-based bigram decoding structure (cont.)
29
Phone-based trigram decoding structure
Moving from a bigram to a trigram language model can require a significant change.
All the variables evolve at the rate of the frame rather than at the rate of the word, so it is not sufficient to just add an edge from W_{t-2} to W_t, because the word from two frames ago is most often the same as the current word.
We must explicitly keep track of the identity of the word that was most recently different: the identity of the word just before the previous word transition.
30
Phone-based trigram decoding structure (cont.)
When there is no word transition, neither the word nor the previous-word variable changes.
Otherwise, the new previous-word variable gets a copy of the old current word with probability one.
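A sketch of that bookkeeping, with hypothetical names; `sample_trigram` stands in for whatever draws from the trigram CPT:

```python
# A minimal sketch (hypothetical names) of the previous-word bookkeeping a
# trigram decoding structure needs: track the most recent *different* word so
# that p(w_t | w_{t-1}, w_{t-2}) conditions on real word history, not frames.
def next_word_state(word, prev_word, word_transition, sample_trigram):
    """Advance (word, prev_word) by one frame.

    sample_trigram(w_minus2, w_minus1) draws the next word from
    p(w | w_minus1, w_minus2)."""
    if not word_transition:
        return word, prev_word            # both variables held fixed
    # On a transition, the old current word becomes the new previous word
    # (copied with probability one), and a new word is drawn.
    return sample_trigram(prev_word, word), word
```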
31
Cross-word triphone architecture
Triphone models are those in which the acoustic observation is conditioned not only on the currently hypothesized phone; they also make the assumption that the current acoustics are significantly influenced by the preceding and following phonetic context (i.e., coarticulation).
Triphone models accomplish this by saying that the distribution over the acoustic frame depends on the current, previous, and next phone.
32
Cross-word triphone architecture (cont.)
[Figure: the triphone graph; the dependence on the next phone is implemented using a backward time edge]
33
Cross-word triphone architecture (cont.)
If a phone transition does not occur, the three phone variables keep the same values; if a phone transition does occur, the three phone variables must change appropriately.
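A sketch of the context update, again with hypothetical names; in the real graph the "next phone" arrives via the backward time edge rather than a lookahead callback:

```python
# A minimal sketch (hypothetical names) of the triphone-context update: the
# (previous, current, next) phone variables shift only on a phone transition.
def next_triphone_context(prev_phone, cur_phone, nxt_phone,
                          phone_transition, peek_next_phone):
    """Advance the triphone context by one frame.

    peek_next_phone() supplies the phone that follows the new current one; in
    the actual graph this information comes from a backward time edge."""
    if not phone_transition:
        return prev_phone, cur_phone, nxt_phone       # all three held fixed
    return cur_phone, nxt_phone, peek_next_phone()    # shift the context window
```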
34
Tree-structured lexicon architecture
The decoding search space can be computationally prohibitive, so it is most often necessary to arrange and reorganize states in this space in a manner that is efficient to search.
One way to accomplish this is to use a tree-structured lexicon: the prefixes of each word are represented explicitly, and the state space probabilistically branches only when extending the prefixes of words that no longer match.
A variable is used that corresponds to a phone tree; it encodes the prefix tree for the entire word lexicon in a large, but very sparse, stochastic matrix.
35
Tree-structured lexicon architecture (cont.)
If there is a phone-tree transition, the next phone-tree state is governed by the sparse phone-tree probability table; if there is no phone-tree transition, the state stays the same with unity probability.
We can moreover extend this model to provide early-state pruning by allowing the phone-tree probability table to contain scores for the most optimistic of the subsequent words.
A variable L_t = 1 is used to insist on consistency between the word corresponding to the terminal state of the phone-tree variable and the next word chosen by the trigram language model. When there is no word transition, L_t is uncoupled from its other parents; otherwise, the event L_t = 1 is explained only when the next word is the word corresponding to the terminal state of the current phone-tree variable.
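A sketch of how such a sparse phone-tree table and the consistency variable might look; the toy lexicon, node names, and restart behavior are all assumptions of this sketch:

```python
# A minimal sketch (hypothetical names, toy lexicon) of a sparse phone-tree
# transition table: each tree node maps to a few successor nodes, and terminal
# nodes identify a completed word for the consistency check L_t = 1.
SUCCESSORS = {                        # sparse stochastic matrix, node -> node
    'root': {'k': 0.5, 'd': 0.5},
    'k':    {'k.ae': 1.0},
    'k.ae': {'k.ae.t': 1.0},          # terminal node: "cat"
    'd':    {'d.ao': 1.0},
    'd.ao': {'d.ao.g': 1.0},          # terminal node: "dog"
}
TERMINAL_WORD = {'k.ae.t': 'cat', 'd.ao.g': 'dog'}

def phone_tree_cpf(state, tree_transition):
    """p(next state | state): sparse table on a transition, identity otherwise."""
    if not tree_transition:
        return {state: 1.0}                      # stay with unity probability
    return SUCCESSORS.get(state, {'root': 1.0})  # assumed: terminals restart

def consistency(state, next_word):
    """The L_t = 1 event: the finished tree word must match the LM's word."""
    return TERMINAL_WORD.get(state) == next_word
```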
36
Transition-explicit model
In the past, it has been argued that speech regions corresponding to spectral transitions might carry much, if not all, of the underlying message.
The observations could, for example, include features that are designed to provide information about spectral transitions.
Sometimes this novel information might be relevant only part of the time, so it is not appropriate to simply append the new information to the standard feature vector.
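One hedged way to realize "relevant only part of the time" is a switching parent on the observation (a mechanism GMTK supports); the model layout and names below are assumptions of this sketch, not the article's construction:

```python
# A minimal sketch (hypothetical names, toy Gaussians) of a switching
# observation model: the transition-oriented features influence the score
# only when a transition indicator is active.
import numpy as np
from scipy.stats import multivariate_normal

# One hypothetical state with a 2-D standard-feature model and a 3-D joint
# model over standard features plus one transition feature.
models = {
    0: {'std': multivariate_normal(mean=np.zeros(2), cov=np.eye(2)),
        'joint': multivariate_normal(mean=np.zeros(3), cov=np.eye(3))},
}

def obs_log_prob(x_std, x_trans, state, at_transition):
    """Observation score with a switching parent: the transition features are
    consulted only when the transition indicator is active."""
    if at_transition:
        return models[state]['joint'].logpdf(np.concatenate([x_std, x_trans]))
    return models[state]['std'].logpdf(x_std)

print(obs_log_prob(np.array([0.1, -0.2]), np.array([0.5]), 0, at_transition=True))
```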
37
Multiobservation and multihidden stream semi-asynchronous architecture
It is also easy to define a generic, semi-asynchronous multistream and/or multimodal model for ASR.
The streams may lie both over the observation space (e.g., multiple streams of feature vectors) and over the hidden space (multiple semi-synchronous streams of hidden variables).
The word variable can be composed of two or more sequences of generic "states," which allows us to have two independent representations of a word, e.g., an audio and a video feature stream, differing streams of audio features, different articulatory streams, or different spectral subband streams.
38
Multiobservation and multihidden stream semi-asynchronous architecture (cont.)
When a word transition occurs, both sequences must transition out of their last state for the current word.
There is no requirement that the two sequences use the same number of states per word, nor is there any requirement that the two sequences line up in any way.
Effectively, in accumulating the probability of the word, all alignments are considered along with their corresponding alignment probabilities.
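A sketch of the coupling constraint; the stream names and per-word state counts are hypothetical:

```python
# A minimal sketch (hypothetical names) of the semi-asynchronous coupling: the
# word-transition indicator may fire only when *both* hidden streams sit in
# the final state of their per-stream models for the current word.
def word_transition_ok(state_a, state_b, word, last_state):
    """last_state[stream][word] gives that stream's final state for the word;
    within a word the streams evolve independently (any alignment allowed)."""
    return (state_a == last_state['audio'][word] and
            state_b == last_state['video'][word])

last_state = {'audio': {'cat': 4}, 'video': {'cat': 2}}   # toy per-word models
print(word_transition_ok(4, 2, 'cat', last_state))        # True: both may leave
```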
39
Architectures over observed variables
[Figure: graphs with various factorization properties over the observation vectors themselves]
These models have been called autoregressive HMMs or, when specific element-wise dependencies are used, buried Markov models (BMMs).
BMMs provide a promising vehicle for structure learning, since the structure to be learned is over sequences of only observed feature vectors (rather than hidden variables).
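As a rough illustration (the exact conditioning sets vary from model to model), an autoregressive HMM adds direct edges between successive observations to the HMM factorization:

```latex
p(x_{1:T}, q_{1:T}) = p(q_1)\,p(x_1 \mid q_1)
  \prod_{t=2}^{T} p(q_t \mid q_{t-1})\, p(x_t \mid q_t,\, x_{t-1})
```

A BMM further restricts this dependence to specific, hidden-state-dependent elements of the past observation vectors, which is what makes the structure learnable over observed variables alone.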
40
Conclusion
A key advantage of graphical models is that all of these constructs can be represented in a single, unified framework; in general, under this framework, many other modifications can be quickly utilized.
Graphical model machinery can make it easy to express a new model and, with the right software tools, to rapidly prototype it in a real ASR system.
A key promise of graphical models is that, by allowing researchers to quickly reject ideas that perform poorly and advance ideas that perform well, they can accelerate progress in ASR research.