1
Graphical Model Architectures for Speech Recognition
Jeff A. Bilmes and Chris Bartels
Presenter: Shih-Hsiang (士翔)
IEEE Signal Processing Magazine, September 2005
2
References
Jeff A. Bilmes and Chris Bartels, "Graphical Model Architectures for Speech Recognition," IEEE Signal Processing Magazine, Sep. 2005.
Jeff Bilmes, "Graphical Models in Speech and Language Research," tutorial presented at HLT/NAACL'04, 2004.
3
Introduction
A graph is a two-dimensional visual formalism that can be used to describe many phenomena, e.g., computer science, data and control flow, entity relationships, and social networks. Graphs represent complex situations in an intuitive and visually appealing way.
Statistical graphical models are a family of graphical abstractions of statistical models in which important aspects of the model are represented using graphs. They offer a mathematically formal but widely flexible means for solving many problems.
4
Introduction (cont.)
The graphs used in ASR can represent events at:
High-level information (e.g., relationships between linguistic classes)
Very low-level information (e.g., correlations between spectral features or acoustic landmarks)
Or all levels in between (e.g., lexical pronunciation)
The fundamental advantage of graphical models is rapidity: they make it possible to quickly express a novel, complicated idea in an intuitive, concise, and mathematically precise way, to speedily and visually communicate that idea to colleagues, and, moreover, to rapidly prototype that idea.
5
Introduction (cont.)
This article discusses the foundations of the use of graphical models for speech recognition, using dynamic Bayesian networks (DBNs) and a DBN extension, the Graphical Model Toolkit's (GMTK's) basic template, a dynamic graphical model representation that is more suitable for speech and language systems.
It should be noted that many of the ideas presented here are also applicable to natural language processing and general time-series analysis.
6
Notation Conventions
1:N denotes the set of integers {1, 2, ..., N}.
X_{1:N} denotes a set of N random variables (RVs).
Given any subset S ⊆ 1:N, where S = {S_1, S_2, ..., S_{|S|}}, the corresponding subset of random variables is denoted X_S = {X_{S_1}, X_{S_2}, ..., X_{S_{|S|}}}.
Upper-case letters (such as X and Q) refer to random variables; lower-case letters (such as x and q) refer to random variable values.
7
Cause and Effect
Fact: At least 196 cows died in Thailand in Jan/Feb 2004, as did 16 people.
Consequence: In April 2004, Canadian officials killed 19 million birds in British Columbia (chickens, ducks, geese, etc.).
Possible cause I: The original deaths were due to avian influenza (H5N1, or bird flu).
Possible cause II: They died of old age!
8
Cause and Effect (cont.)
Simple directed graphs can be used: directed edges go from parent (possible cause) to child (possible effect).
[Figure: a DAG with edges Bird Flu → Deaths, Old Age → Deaths, and Deaths → Canadian Action]
9
Cause and Effect (cont.)
Quantities of interest:
Computing probabilities: Pr(Deaths | Flu and Old), Pr(Deaths | Old), Pr(Bird Flu | Canadian Action)
Asking questions: In general, does old age increase the chance that a cow has contracted bird flu (if at all)? If we know the Canadian action occurred, does having bird flu decrease the chance that the cow was old when it died?
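As a hedged illustration of these queries, here is a brute-force enumeration over the four-node graph above; all probability values are made up for the sketch, and the variable names are mine:

```python
# A minimal sketch (hypothetical probability values) of inference by brute-force
# enumeration in the bird-flu DAG: Flu -> Deaths <- Old, Deaths -> Action.
from itertools import product

p_flu = {True: 0.1, False: 0.9}              # P(Flu)
p_old = {True: 0.3, False: 0.7}              # P(Old)
p_deaths = {                                 # P(Deaths=True | Flu, Old)
    (True, True): 0.99, (True, False): 0.9,
    (False, True): 0.4, (False, False): 0.01,
}
p_action = {True: 0.95, False: 0.01}         # P(Action=True | Deaths)

def joint(flu, old, deaths, action):
    pd = p_deaths[(flu, old)] if deaths else 1 - p_deaths[(flu, old)]
    pa = p_action[deaths] if action else 1 - p_action[deaths]
    return p_flu[flu] * p_old[old] * pd * pa

def prob(query, evidence):
    """P(query | evidence); both are dicts over {'flu','old','deaths','action'}."""
    num = den = 0.0
    for flu, old, deaths, action in product([True, False], repeat=4):
        world = dict(flu=flu, old=old, deaths=deaths, action=action)
        if any(world[k] != v for k, v in evidence.items()):
            continue
        p = joint(flu, old, deaths, action)
        den += p
        if all(world[k] == v for k, v in query.items()):
            num += p
    return num / den

print(prob({'flu': True}, {'action': True}))
print(prob({'flu': True}, {'action': True, 'old': True}))
```

Running it shows the "explaining away" effect the second question asks about: given the Canadian action, also learning that the cow was old lowers the probability that it had bird flu.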
10
Graphical Models (GMs)
Graphical models can give us SALAD:
Structure: a method to explore the structure of "natural" phenomena (causal vs. correlated relations, properties of natural signals and scenes, factorization)
Algorithms: a set of algorithms that provide "efficient" probabilistic inference and statistical decision making
Language: a mathematically formal, abstract, visual language with which to efficiently discuss families of probabilistic models and their properties
11
Graphical Models (GMs) (cont.)
Approximation: methods to explore systems of approximation and their implications, e.g., what are the consequences of a (perhaps known to be) wrong assumption? Inferential approximation; task-dependent structural approximation
Database: a probabilistic "database" and corresponding "search algorithms" for making queries about properties of such model families
12
Topology of Graphical Models
13
Bayesian network (BN)
A Bayesian network is one type of graphical model in which the graph is directed and acyclic. In a BN, the probability distribution over a set of variables X_{1:N} factorizes with respect to a directed acyclic graph (DAG) as

p(x_{1:N}) = ∏_i p(x_i | x_{π_i})

where π_i is the subset of indices of X_i's immediate parents according to the BN's DAG.
[Figure: the alarm network, a DAG with edges Burglary → Alarm, Earthquake → Alarm, Earthquake → Radio, and Alarm → Call]
Nodes: random variables. Edges: direct "influence".
Factorization property: P(B,E,A,C,R) = P(B) P(E) P(A|B,E) P(C|A) P(R|E)
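To make the factorization concrete, here is a minimal sketch in Python; the CPT values are hypothetical, chosen only to give the tables the right shape:

```python
# A minimal sketch (hypothetical CPT values) of the factorization property
# P(B,E,A,C,R) = P(B) P(E) P(A|B,E) P(C|A) P(R|E) for the alarm network.
P_B = 0.01                                   # P(Burglary=True)
P_E = 0.02                                   # P(Earthquake=True)
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(Alarm=True | B, E)
P_C = {True: 0.90, False: 0.05}              # P(Call=True | Alarm)
P_R = {True: 0.99, False: 0.0001}            # P(Radio=True | Earthquake)

def bern(p_true, value):
    """Probability that a binary variable takes `value`, given P(True)."""
    return p_true if value else 1.0 - p_true

def joint(b, e, a, c, r):
    # One factor per node, each conditioned only on its parents in the DAG.
    return (bern(P_B, b) * bern(P_E, e) * bern(P_A[(b, e)], a) *
            bern(P_C[a], c) * bern(P_R[e], r))

print(joint(b=True, e=False, a=True, c=True, r=False))
```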
14
Dynamic Bayesian networks (DBNs)
Speech is inherently a temporal process, and any graphical model for speech must take this into account. Accordingly, dynamic graphical models are graphs that represent the temporal evolution of the statistical properties of a speech signal; among them, DBNs have been the most successfully used.
DBNs are simply BNs with a repeated "template" structure over time:
They are specified using a "rolled up" template giving the nodes that are repeated in each slice (time frame).
In the unrolled DBN, all variables sharing the same origin in the template have tied parameters.
This allows a graph to be specified over series of arbitrary (unbounded) length.
As in any BN, the collection of edges pointing into a node corresponds to a conditional probability function (CPF).
15
Dynamic Bayesian networks (DBNs) (cont.)
It is well known that the hidden Markov model (HMM) is one type of DBN. However, the HMM is only one small model within the enormous family of statistical techniques represented by DBNs.
[Figure: an HMM unrolled over four frames, with hidden chain Q1 → Q2 → Q3 → Q4 and emissions Q_t → X_t; this corresponds to the two-slice template Q1 → Q2 with X1, X2 attached]
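The template view can be made concrete with a small sketch; the parameters below are toy values, and the function simply multiplies one CPF per node of the unrolled graph, with every slice sharing the same tied tables:

```python
# A minimal sketch (hypothetical toy parameters) of an HMM viewed as a DBN:
# a two-slice template (prior p(q1), transition p(qt|qt-1), emission p(xt|qt))
# is "unrolled" T times, with every slice sharing (tying) the same parameters.
import numpy as np

prior = np.array([0.6, 0.4])                 # p(q_1)
trans = np.array([[0.7, 0.3],                # p(q_t | q_{t-1})
                  [0.2, 0.8]])
emit = np.array([[0.9, 0.1],                 # p(x_t | q_t), two discrete symbols
                 [0.3, 0.7]])

def joint_log_prob(states, observations):
    """log p(q_{1:T}, x_{1:T}) for one fully specified path: the product of
    one CPF per node of the unrolled DBN."""
    lp = np.log(prior[states[0]]) + np.log(emit[states[0], observations[0]])
    for t in range(1, len(states)):
        lp += np.log(trans[states[t - 1], states[t]])    # tied across slices
        lp += np.log(emit[states[t], observations[t]])
    return lp

print(joint_log_prob(states=[0, 0, 1, 1], observations=[0, 1, 1, 1]))
```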
16
Dynamic Bayesian networks (DBNs) (cont.)
More generally, a DBN specifies a template to be unrolled.
[Figure: a DBN template with intra-slice and inter-slice edges, shown alongside the same template unrolled three times]
17
Dynamic Bayesian networks (DBNs) (cont.)
In fact, many (but not all) DBNs can be "flattened" into a corresponding HMM, but staying within the DBN framework has several advantages:
There can be exploitable computational advantages, since the DBN explicitly represents factorization properties, and factorization is the key to tractable probabilistic inference.
The factorization specified by a DBN implies constraints that the model must obey.
Information about a domain is visually and intuitively portrayed.
18
Dynamic Bayesian networks (DBNs) (cont.)
[Figure: a two-Markov-chain DBN contrasted with a flattened HMM; the flattened HMM has one chain and ignores the factorization constraint expressed by the graph]
19
The GMTK Dynamic Template
GMTK: the Graphical Models Toolkit.
A GMTK template extends a standard DBN template in five distinct ways:
First, it allows not only forward but also backward directed time links.
Second, network slices may span multiple time frames, so slices are called chunks.
Third, a GMTK template includes a built-in specification mechanism for switching parents.
Fourth, parents of a variable may be multiple chunks in the past or in the future.
Fifth, it allows different multiframe structures to occur at both the beginning and the end of the unrolled network.
20
The GMTK Dynamic Template (cont.)
21
The GMTK Features
Textual graph language
Switching parent functionality
Forward and backward time links
Multirate models with extended DBN templates
Linear dependencies on observations
Arbitrary low-level parameter sharing (EM/GEM training)
Gaussian vanishing/splitting algorithm
Decision-tree-based implementations of dependencies (deterministic, sparse, formula leaf nodes)
Full inference; single-pass decoding possible
Sampling methods
Exact inference via the linear and island algorithms (O(log T))
22
Why Graphical Models for Speech and Language Processing
An expressive but concise way to describe properties of families of distributions
Rapid movement from novel idea to implementation
All graphs use exactly the same inference algorithm
The researcher concentrates on the model and can stay focused on the domain
Holds promise to replace the ubiquitous HMM
Dynamic Bayesian networks and dynamic graphical models can represent important structure in "natural" time signals such as speech and language
23
Four Main Goals for Using Graphical Models in ASR
Explicitly and efficiently represent typical or novel ASR control constructs: derive graph structures that themselves explicitly represent control constructs, e.g., parameter tying/sharing, state sequencing, smoothing, mixing, backing off, etc.
Latent knowledge modeling: graphical models can provide the infrastructure in which this knowledge can be exploited, e.g., dialog act, word/phrase category, pronunciation variant, speaking rate, model/style/gender, acoustic channel/noise condition, etc.
Proper observation modeling: the model can more appropriately match the underlying statistics extant in speech as represented by the current series of feature vectors.
Automatic structure learning: derive structure automatically, ideally to improve error rate while simultaneously minimizing computational cost.
24
Graphical model speech architectures
In this paper, the authors demonstrate:
A phone-based bigram decoding structure
A phone-based trigram architecture
A cross-word triphone architecture
A tree-structured lexicon architecture
A transition-explicit model
A multiobservation and multihidden-stream semi-asynchronous architecture
Architectures over observed variables
25
Phone-based bigram decoding structure
[Figure: the graph template, with a prologue, a chunk unrolled one time, and an epilogue, spanning time (frames) 1-4; some edges are deterministic dependencies, others purely random. Annotated variables: the position of the current phone within a word; a variable that ensures a proper final word is decoded; and a binary indicator that specifies when the model should advance to the next phone.]
26
Phone-based bigram decoding structure (cont.)
Word position:
Does not change from one frame to the next (no phone transition)
Increments in value by one (phone transition, and the model is not in the last position of the word)
Resets to zero (phone transition, and the model is in the last position of the word)
(A code sketch of these rules, together with the word-transition and word CPFs, follows the next slide.)
27
Phone-based bigram decoding structure (cont.)
Word transition: becomes one when the model makes a phone transition out of the last position of a given word with k total phones.
Word: when the word transition is zero, the word stays the same with probability 1; when a word transition occurs, the next word is drawn using the bigram language model probability.
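A minimal sketch of these three CPFs (word position, word transition, and the word variable itself); the toy lexicon, bigram table, and function names are all hypothetical:

```python
# A minimal sketch (hypothetical names and toy tables) of the CPFs driving the
# phone-based bigram structure: word-position update, word-transition
# indicator, and the word variable's bigram CPF.
import random

PHONES_PER_WORD = {'cat': 3, 'dog': 3, 'a': 1}           # k phones per word
BIGRAM = {('cat', 'dog'): 0.4, ('cat', 'a'): 0.6,         # p(w_t | w_{t-1})
          ('dog', 'cat'): 0.5, ('dog', 'a'): 0.5,
          ('a', 'cat'): 0.7, ('a', 'dog'): 0.3}

def next_position(pos, phone_transition, word):
    """Deterministic word-position CPF."""
    if not phone_transition:
        return pos                                        # stay put
    if pos < PHONES_PER_WORD[word] - 1:
        return pos + 1                                    # advance within word
    return 0                                              # reset at word end

def word_transition(pos, phone_transition, word):
    """1 exactly when we leave the last phone of a word with k total phones."""
    return int(phone_transition and pos == PHONES_PER_WORD[word] - 1)

def next_word(word, transition):
    """Word CPF: copy with probability 1, or draw from the bigram model."""
    if not transition:
        return word
    candidates = [(w2, p) for (w1, w2), p in BIGRAM.items() if w1 == word]
    words, probs = zip(*candidates)
    return random.choices(words, weights=probs)[0]

print(next_position(2, phone_transition=True, word='cat'))  # 0: word ended
print(word_transition(2, True, 'cat'))                      # 1: transition fires
```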
28
Phone-based bigram decoding structure (cont.)
29
Phone-based trigram decoding structure
Moving from a bigram to a trigram language model can require a significant change.
All the variables evolve at the rate of the frame rather than at the rate of the word, so it is not sufficient to just add an edge from W_{t-2} to W_t, because the word from two frames ago is most often the same as the current word.
We must explicitly keep track of the identity of the word that was most recently different: the identity of the word just before the previous word transition.
30
Phone-based trigram decoding structure (cont.)
When there is no word transition, neither the word nor the previous-word variable changes.
Otherwise, the new previous-word variable gets a copy of the old current word with probability one.
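A sketch of that bookkeeping, with hypothetical names; `sample_trigram` stands in for whatever draws from the trigram CPT:

```python
# A minimal sketch (hypothetical names) of the previous-word bookkeeping a
# trigram decoding structure needs: track the most recent *different* word so
# that p(w_t | w_{t-1}, w_{t-2}) conditions on real word history, not frames.
def next_word_state(word, prev_word, word_transition, sample_trigram):
    """Advance (word, prev_word) by one frame.

    sample_trigram(w_minus2, w_minus1) draws the next word from
    p(w | w_minus1, w_minus2)."""
    if not word_transition:
        return word, prev_word            # both variables held fixed
    # On a transition, the old current word becomes the new previous word
    # (copied with probability one), and a new word is drawn.
    return sample_trigram(prev_word, word), word
```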
31
Cross-word triphone architecture
Triphone models are those in which the acoustic observation is conditioned not only on the currently hypothesized phone; they also make the assumption that the current acoustics are significantly influenced by the preceding and following phonetic context (i.e., coarticulation).
Triphone models accomplish this by saying that the distribution over the acoustic frame depends on the current, previous, and next phone.
32
Cross-word triphone architecture (cont.)
[Figure: the triphone graph; the dependence on the next phone is implemented using a backward time edge]
33
Cross-word triphone architecture (cont.)
If a phone transition does not occur, the three phone variables keep the same values; if a phone transition does occur, the three phone variables must change appropriately.
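A sketch of the context update, again with hypothetical names; in the real graph the "next phone" arrives via the backward time edge rather than a lookahead callback:

```python
# A minimal sketch (hypothetical names) of the triphone-context update: the
# (previous, current, next) phone variables shift only on a phone transition.
def next_triphone_context(prev_phone, cur_phone, nxt_phone,
                          phone_transition, peek_next_phone):
    """Advance the triphone context by one frame.

    peek_next_phone() supplies the phone that follows the new current one; in
    the actual graph this information comes from a backward time edge."""
    if not phone_transition:
        return prev_phone, cur_phone, nxt_phone       # all three held fixed
    return cur_phone, nxt_phone, peek_next_phone()    # shift the context window
```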
34
Tree-structured lexicon architecture
The decoding search space can be computationally prohibitive, so it is most often necessary to arrange and reorganize states in this space in a manner that is efficient to search.
One way to accomplish this is to use a tree-structured lexicon: the prefixes of each word are represented explicitly, and the state space probabilistically branches only when extending the prefixes of words that no longer match.
A variable is used that corresponds to a phone tree; it encodes the prefix tree for the entire word lexicon in a large, but very sparse, stochastic matrix.
35
Tree-structured lexicon architecture (cont.)
If there is a phone-tree transition, the next phone-tree state is governed by the sparse phone-tree probability table; if there is no phone-tree transition, the state stays the same with unity probability.
We can moreover extend this model to provide early-state pruning by allowing the phone-tree probability table to contain scores for the most optimistic of the subsequent words.
A variable L_t = 1 is used to insist on consistency between the word corresponding to the terminal state of the phone-tree variable and the next word chosen by the trigram language model. When there is no word transition, L_t is uncoupled from its other parents; otherwise, the event L_t = 1 is explained only when the next word is the word corresponding to the terminal state of the current phone-tree variable.
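A sketch of how such a sparse phone-tree table and the consistency variable might look; the toy lexicon, node names, and restart behavior are all assumptions of this sketch:

```python
# A minimal sketch (hypothetical names, toy lexicon) of a sparse phone-tree
# transition table: each tree node maps to a few successor nodes, and terminal
# nodes identify a completed word for the consistency check L_t = 1.
SUCCESSORS = {                        # sparse stochastic matrix, node -> node
    'root': {'k': 0.5, 'd': 0.5},
    'k':    {'k.ae': 1.0},
    'k.ae': {'k.ae.t': 1.0},          # terminal node: "cat"
    'd':    {'d.ao': 1.0},
    'd.ao': {'d.ao.g': 1.0},          # terminal node: "dog"
}
TERMINAL_WORD = {'k.ae.t': 'cat', 'd.ao.g': 'dog'}

def phone_tree_cpf(state, tree_transition):
    """p(next state | state): sparse table on a transition, identity otherwise."""
    if not tree_transition:
        return {state: 1.0}                      # stay with unity probability
    return SUCCESSORS.get(state, {'root': 1.0})  # assumed: terminals restart

def consistency(state, next_word):
    """The L_t = 1 event: the finished tree word must match the LM's word."""
    return TERMINAL_WORD.get(state) == next_word
```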
36
Transition-explicit model
In the past, it has been argued that speech regions corresponding to spectral transitions might carry much, if not all, of the underlying message.
The observations could, for example, include features that are designed to provide information about spectral transitions.
Sometimes this novel information might be relevant only part of the time, so it is not appropriate to simply append the new information to the standard feature vector.
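One hedged way to realize "relevant only part of the time" is a switching parent on the observation (a mechanism GMTK supports); the model layout and names below are assumptions of this sketch, not the article's construction:

```python
# A minimal sketch (hypothetical names, toy Gaussians) of a switching
# observation model: the transition-oriented features influence the score
# only when a transition indicator is active.
import numpy as np
from scipy.stats import multivariate_normal

# One hypothetical state with a 2-D standard-feature model and a 3-D joint
# model over standard features plus one transition feature.
models = {
    0: {'std': multivariate_normal(mean=np.zeros(2), cov=np.eye(2)),
        'joint': multivariate_normal(mean=np.zeros(3), cov=np.eye(3))},
}

def obs_log_prob(x_std, x_trans, state, at_transition):
    """Observation score with a switching parent: the transition features are
    consulted only when the transition indicator is active."""
    if at_transition:
        return models[state]['joint'].logpdf(np.concatenate([x_std, x_trans]))
    return models[state]['std'].logpdf(x_std)

print(obs_log_prob(np.array([0.1, -0.2]), np.array([0.5]), 0, at_transition=True))
```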
37
Multiobservation and multihidden stream semi-asynchronous architecture
It is also easy to define a generic, semi-asynchronous multistream and/or multimodal model for ASR.
The streams may lie both over the observation space (e.g., multiple streams of feature vectors) and over the hidden space (multiple semi-synchronous streams of hidden variables).
The word variable can be composed of two or more sequences of generic "states," which allows us to have two independent representations of a word, e.g., an audio and a video feature stream, differing streams of audio features, different articulatory streams, or different spectral subband streams.
38
Multiobservation and multihidden stream semi-asynchronous architecture (cont.)
When a word transition occurs, both sequences must transition out of their last state for the current word.
There is no requirement that the two sequences use the same number of states per word, nor is there any requirement that the two sequences line up in any way.
Effectively, in accumulating the probability of the word, all alignments are considered along with their corresponding alignment probabilities.
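A sketch of the coupling constraint; the stream names and per-word state counts are hypothetical:

```python
# A minimal sketch (hypothetical names) of the semi-asynchronous coupling: the
# word-transition indicator may fire only when *both* hidden streams sit in
# the final state of their per-stream models for the current word.
def word_transition_ok(state_a, state_b, word, last_state):
    """last_state[stream][word] gives that stream's final state for the word;
    within a word the streams evolve independently (any alignment allowed)."""
    return (state_a == last_state['audio'][word] and
            state_b == last_state['video'][word])

last_state = {'audio': {'cat': 4}, 'video': {'cat': 2}}   # toy per-word models
print(word_transition_ok(4, 2, 'cat', last_state))        # True: both may leave
```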
39
Architectures over observed variables
[Figure: graphs with various factorization properties over the observation vectors themselves]
These models have been called autoregressive HMMs or, when specific element-wise dependencies are used, buried Markov models (BMMs).
BMMs provide a promising vehicle for structure learning, since the structure to be learned is over sequences of only observed feature vectors (rather than hidden variables).
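As a rough illustration (the exact conditioning sets vary from model to model), an autoregressive HMM adds direct edges between successive observations to the HMM factorization:

```latex
p(x_{1:T}, q_{1:T}) = p(q_1)\,p(x_1 \mid q_1)
  \prod_{t=2}^{T} p(q_t \mid q_{t-1})\, p(x_t \mid q_t,\, x_{t-1})
```

A BMM further restricts this dependence to specific, hidden-state-dependent elements of the past observation vectors, which is what makes the structure learnable over observed variables alone.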
40
Conclusion
A key advantage of graphical models is that all of these constructs can be represented in a single, unified framework; in general, under this framework, many other modifications can be quickly utilized.
Graphical model machinery can make it easy to express a new model and, with the right software tools, to rapidly prototype it in a real ASR system.
A key promise of graphical models is that, by allowing researchers to quickly reject ideas that perform poorly and advance ideas that perform well, they can accelerate progress in ASR research.