Date post: 17-Dec-2015
The Center for Signal & Image Processing
Georgia Institute of Technology

Kernel-Based Detectors and Fusion of Phonological Attributes

Brett Matthews
Mark Clements


Outline

Frame-Based Detection
– One-vs-all detectors
– Context-dependent framewise detection
– Probabilistic outputs

Kernel-Based Attribute Detection
– SVM
– Least-Squares SVM

Evaluating Probabilistic Estimates
– Naïve Bayes Combinations
– Hierarchical Manner Classification

Detector Fusion
– Genetic Programming


Frame-Based Detection

One-vs-All classifiers
– Manner of articulation: vowel, fricative, stop, nasal, glide/semivowel, silence
– Place of articulation: dental, labial, coronal, palatal, velar, glottal, back, front
– Vowel manners: high, mid, low, back, round

Framewise Detection
– 10 ms frame rate
– 12 MFCCs + energy
– 8 context-dependent frames

Classifier Types & Posterior Probs
– Artificial neural nets: probabilistic outputs
– Kernel-based classifiers
  – SVM: empirically determined posterior probs
  – LS-SVMs: probabilistic outputs

[Figure: attribute detectors (vowel, silence, dental, velar, voicing) feeding an Event Fusion stage]
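Context-dependent framewise detection can be sketched as stacking each frame with its neighbours before it reaches a detector. The slide only says "8 context dependent frames"; the symmetric split of 4 frames on each side, the edge padding and the name `stack_context` are assumptions made for illustration.

```python
import numpy as np

def stack_context(frames: np.ndarray, left: int = 4, right: int = 4) -> np.ndarray:
    """Append neighbouring frames to every frame.

    frames: (T, D) array, e.g. D = 13 for 12 MFCCs + energy at a 10 ms rate.
    Returns a (T, D * (left + 1 + right)) array.
    The 4 + 4 split and the edge handling are assumptions, not taken from the slides.
    """
    T, _ = frames.shape
    # Repeat the first and last frame so that edge frames still get full context.
    padded = np.concatenate([
        np.repeat(frames[:1], left, axis=0),
        frames,
        np.repeat(frames[-1:], right, axis=0),
    ])
    # Column block j holds the frame at offset (j - left) from the centre frame.
    blocks = [padded[j:j + T] for j in range(left + 1 + right)]
    return np.concatenate(blocks, axis=1)
```

With 13-dimensional frames and these defaults, each detector input has 117 dimensions.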


Kernel-Based Classifiers

Support Vector Machines (SVM)

LS-SVM Classifier
– Kernel-based classifier like SVM
– Least-squares formulation
– Probabilistic output scores
– LS-SVMlab package (Katholieke Universiteit Leuven)
– Same decision function as SVM
– Subject to equality constraints, instead of inequality constraints
– No margin optimization
– Linear system solution
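The "linear system solution" can be illustrated with the standard LS-SVM classifier dual from the literature, in which the bias and the multipliers come from a single call to a linear solver. This is a minimal NumPy sketch and not the LS-SVMlab API; the RBF kernel, the parameter names `reg` and `gamma`, and their default values are assumptions.

```python
import numpy as np

def rbf_kernel_matrix(A, B, gamma=1.0):
    """Pairwise RBF kernel exp(-gamma * ||a - b||^2) between rows of A and B."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def lssvm_train(X, y, reg=10.0, gamma=1.0):
    """Solve the (n+1) x (n+1) LS-SVM system for alpha and b; y must be +/-1."""
    n = len(y)
    omega = np.outer(y, y) * rbf_kernel_matrix(X, X, gamma)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = y                              # constraint: sum_k alpha_k * y_k = 0
    A[1:, 0] = y
    A[1:, 1:] = omega + np.eye(n) / reg       # equality constraints with error terms
    rhs = np.concatenate([[0.0], np.ones(n)])
    sol = np.linalg.solve(A, rhs)
    return sol[1:], sol[0]                    # alpha, b

def lssvm_decision(X_train, y_train, alpha, b, X_new, gamma=1.0):
    """Same functional form as the SVM decision function, before taking the sign."""
    return rbf_kernel_matrix(X_new, X_train, gamma) @ (alpha * y_train) + b
```

Because every training point generally receives a nonzero multiplier, the system grows with the square of the training set size, which is one reason memory limits the amount of training data for this kind of model.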


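Least Squares SVMs

– "Support vectors" found by solving a linear system

Kernel Functions
– Linear: $K(\mathbf{x}, \mathbf{y}) = \mathbf{x}^T \mathbf{y}$
– Polynomial: $K(\mathbf{x}, \mathbf{y}) = (\gamma\, \mathbf{x}^T \mathbf{y} + r)^d,\ \gamma > 0$
– RBF: $K(\mathbf{x}, \mathbf{y}) = \exp(-\gamma \lVert \mathbf{x} - \mathbf{y} \rVert^2),\ \gamma > 0$

Probabilistic Outputs
– Bayesian inference for posterior probs
– Moderated outputs can be directly interpreted as posterior probabilities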
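The three kernels named on this slide (linear, polynomial, RBF) can be written in a few lines. The parameterisation with `gamma`, `r` and `d` follows the common LIBSVM convention, which is an assumption here, and the default values are illustrative only.

```python
import numpy as np

def linear_kernel(x, y):
    """K(x, y) = x^T y"""
    return float(np.dot(x, y))

def polynomial_kernel(x, y, gamma=1.0, r=1.0, d=2):
    """K(x, y) = (gamma * x^T y + r)^d, with gamma > 0"""
    return float((gamma * np.dot(x, y) + r) ** d)

def rbf_kernel(x, y, gamma=1.0):
    """K(x, y) = exp(-gamma * ||x - y||^2), with gamma > 0"""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))
```

The RBF kernel of a vector with itself is always 1, and it decays towards 0 as the two vectors move apart; `gamma` controls how fast.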


Evaluating Probabilistic Estimates

– Reliability and accuracy of probabilistic scores
– Initial fusion experiments
  – Hierarchical manner classification: LS-SVM, SVM
  – Naïve Bayes combination for phone detection: LS-SVM, SVM, ANN

[Figure: ROC curves (probability of detection vs. false alarm rate) for the attribute detectors, one panel for LS-SVM and one for SVM. Legend: stop, vowel, fricative, nasal, sil, gs, back, coronal, front, high, labial, low, mid, nilfrontback, nilround, retroflex, round, unround, velar, voiced.]


Hierarchical Combinations

– Probabilistic phonetic feature hierarchy for classifying frames into 6 manner classes
– Train binary detectors on each split in the hierarchy
– 5 detectors, 6 classes
  – silence vs. speech
  – sonorant | speech
  – vowel | sonorant
  – stop | non-sonorant
  – semivowel | sonorant consonant

[Figure: fricative detection and ground truth]

$P(\text{fric} \mid x) = (1 - P(\text{st} \mid \text{non-sc})) \cdot (1 - P(\text{son} \mid \text{spch})) \cdot P(\text{spch} \mid x)$
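The fricative formula is one path through the hierarchy; the other five classes follow by multiplying along their own paths. The sketch below assumes the tree implied by the five detectors listed on the slide, with nasal taken as the complement of semivowel among sonorant consonants. That tree shape and the function name `manner_posteriors` are assumptions.

```python
def manner_posteriors(p_spch, p_son, p_vow, p_stop, p_semi):
    """Combine five binary detector outputs into six manner-class posteriors.

    p_spch: P(speech | x)            p_son:  P(sonorant | speech)
    p_vow:  P(vowel | sonorant)      p_stop: P(stop | non-sonorant)
    p_semi: P(semivowel | sonorant consonant)
    """
    non_son = p_spch * (1 - p_son)           # mass reaching the non-sonorant node
    son_cons = p_spch * p_son * (1 - p_vow)  # mass reaching the sonorant-consonant node
    return {
        "silence": 1 - p_spch,
        "vowel": p_spch * p_son * p_vow,
        "semivowel": son_cons * p_semi,
        "nasal": son_cons * (1 - p_semi),    # assumed complement of semivowel
        "stop": non_son * p_stop,
        "fricative": non_son * (1 - p_stop), # matches the formula on the slide
    }
```

Because each split divides its parent's probability mass in two, the six outputs sum to 1 whenever the five inputs lie in [0, 1].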


Hierarchical Combinations

1. Reliability of posterior probs (right)
– Plot probabilistic estimates of combinations vs. observed frequencies
– Hierarchical combinations much more reliable for SVM than LS-SVM

2. Classification accuracy (below)
– Higher classification accuracy for SVMs, especially fricatives

3. Upper-bound comparison (below)
– One-vs-all classifiers trained directly for each class
– Combinations nearly as accurate as one-vs-all for classification performance
– LS-SVM combinations not good for semivowel and nasal

[Figure: reliability diagrams for the hierarchical combinations, one panel per manner class (stop, vowel, fricative, nasal, silence, semivowel/glide), shown for LS-SVM (Combined) and SVM (Combined). Each panel plots a posteriori probability estimates against raw score bins (about 0.1 wide), together with a histogram of bin counts. Panel titles follow the patterns "Reliability Diagram - LSSVM, stopGVNx, Exp36015" and "Reliability Diagram - LIBSVM, stopGVNx, Exp36011".]

[Table: classification accuracy (%); values not preserved in the transcript]
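A reliability diagram of the kind described here bins the raw scores and compares each bin with the fraction of frames in it that really belong to the class. A minimal sketch, assuming scores in [0, 1], ten equal-width bins and 0/1 ground-truth labels; the function name is hypothetical.

```python
import numpy as np

def reliability_diagram(scores, labels, n_bins=10):
    """Return bin centres, observed class frequency per bin, and bin counts."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Digitising against the interior edges gives bin indices 0 .. n_bins - 1,
    # so a score of exactly 1.0 lands in the last bin.
    idx = np.digitize(scores, edges[1:-1])
    counts = np.bincount(idx, minlength=n_bins)
    hits = np.bincount(idx, weights=labels, minlength=n_bins)
    with np.errstate(invalid="ignore", divide="ignore"):
        observed = np.where(counts > 0, hits / counts, np.nan)  # NaN for empty bins
    centres = (edges[:-1] + edges[1:]) / 2
    return centres, observed, counts
```

A well-calibrated detector gives `observed` close to `centres` in every populated bin; the counts show how much data supports each point.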


Naïve Bayes Combinations

– One-vs-all frameworks desired; phonetic hierarchies are cumbersome
– Phone detection: combine phonological attribute scores with a Naïve Bayes product
– Initial experiments in evaluating probabilities
  – Compare accuracy and reliability of probabilistic outputs for ANN, SVM and LS-SVM
  – Limited training data (LS-SVM limit is 3000 due to memory restrictions)
– Detect phones with combinations of relevant phonetic attributes

$P(/f/ \mid x) = P(\text{labial} \mid x) \cdot P(\text{fric} \mid x) \cdot (1 - P(\text{voicing} \mid x))$
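The /f/ product generalises to any phone by listing which attributes should be present and which absent. A minimal sketch, assuming per-frame attribute posteriors stored in a dictionary of arrays; the function name and the attribute keys are illustrative.

```python
import numpy as np

def phone_score(attr_post, present, absent=()):
    """Naive Bayes product of attribute posteriors for one phone.

    attr_post: dict mapping attribute name -> array of per-frame posteriors.
    present:   attributes the phone has (their posteriors are multiplied in).
    absent:    attributes the phone lacks (1 - posterior is multiplied in).
    """
    score = 1.0
    for name in present:
        score = score * np.asarray(attr_post[name], dtype=float)
    for name in absent:
        score = score * (1.0 - np.asarray(attr_post[name], dtype=float))
    return np.asarray(score)
```

For /f/ the call would be `phone_score(post, ["labial", "fricative"], ["voicing"])`. The product treats the attributes as independent given the frame, which is the naive part of the method.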


Naïve Bayes Combinations

1. Phone detection
– Compare combined attributes with direct training on phones as an upper bound

2. ROC stats (right)
– SVMs best for attribute detection
– Mixed results for NB combinations: no clear winner between LS-SVM and SVM
– Direct training outperforms combinations

3. Reliability
– Naïve Bayes combinations give poor reliability for all detector types

4. Rare phones & vowels
– For /v/, /ng/ and /oy/, improvements in EER and AUC across detector types (lower right)
– Most vowels saw improvements as well

[Panel: ROC Stats]

[Panel: Direct vs. Combined]
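EER and AUC, the two ROC statistics quoted here, can both be read off the same sweep over detector scores. This is a plain NumPy sketch under simple assumptions: labels are 0/1, both classes are present, and tied scores are not grouped into a single threshold.

```python
import numpy as np

def roc_stats(scores, labels):
    """Return (auc, eer) for detector scores and 0/1 ground-truth labels."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    order = np.argsort(-scores, kind="stable")     # sweep threshold from high to low
    sorted_labels = labels[order]
    tp = np.cumsum(sorted_labels)
    fp = np.cumsum(1.0 - sorted_labels)
    p_det = np.concatenate([[0.0], tp / labels.sum()])
    p_fa = np.concatenate([[0.0], fp / (len(labels) - labels.sum())])
    # Area under the curve by the trapezoidal rule.
    auc = float(np.sum(np.diff(p_fa) * (p_det[1:] + p_det[:-1]) / 2.0))
    # Equal error rate: operating point where false alarms and misses are closest.
    p_miss = 1.0 - p_det
    i = int(np.argmin(np.abs(p_fa - p_miss)))
    eer = float((p_fa[i] + p_miss[i]) / 2.0)
    return auc, eer
```

Higher AUC and lower EER both indicate a better detector, so an improvement for a rare phone would show up as movement in those directions.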


[Figure: ROC curves (probability of detection vs. false alarm rate) for the phones /v/, /f/, /z/ and /s/, in two panels: Combined attributes (SVM) and Direct Training (SVM).]


Genetic Programming

– Evolutionary algorithm for tree-structured feature "creation" (extraction)
– Maximize a fitness function across a number of generations (iterations)
– Operations like crossover & mutation control the evolution of the algorithm
– Trees are algebraic networks
  – Inputs are multi-dimensional features
  – Tree nodes are unary or binary mathematical operators (+, -, *, (·)^2, log)
  – Algebraic networks simpler and more transparent than neural nets
– GPLab package from Universidade de Coimbra, Portugal: http://gplab.sourceforge.net
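An algebraic network of this kind is just an expression tree evaluated on a feature vector. The sketch below is a hypothetical Python representation, not GPLab's own format: leaves are feature indices, inner nodes are tuples, and the protected `log` (a common GP convention) is an assumption.

```python
import math

UNARY = {
    "square": lambda a: a * a,
    "log": lambda a: math.log(abs(a) + 1e-12),  # protected log, assumed convention
    "cos": math.cos,
}
BINARY = {
    "plus": lambda a, b: a + b,
    "minus": lambda a, b: a - b,
    "times": lambda a, b: a * b,
}

def evaluate(tree, x):
    """Evaluate a tree on feature vector x.

    tree is either an int (index into x) or a tuple ("op", child) or
    ("op", left, right) whose children are trees themselves.
    """
    if isinstance(tree, int):
        return x[tree]
    op, *children = tree
    args = [evaluate(child, x) for child in children]
    if len(args) == 1:
        return UNARY[op](args[0])
    return BINARY[op](args[0], args[1])
```

For example, `("plus", ("times", 0, 1), ("square", 2))` computes `x[0] * x[1] + x[2] ** 2`, which can be read directly off the tree; that readability is what makes such networks more transparent than a neural net.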


Genetic Programming

– Trained GP trees on SVM outputs
– Develop algebraic networks for combining detector outputs
– Produce a 1-D feature from a nonlinear combination of detector outputs
– Choose fitness function, set of node operators, tree depth, etc. to maximize separation
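The slides do not say which fitness function was used. One plausible way to score "separation" of a 1-D feature is a Fisher-style ratio between target and non-target frames; the sketch below is that assumption only, with a hypothetical function name.

```python
import numpy as np

def fisher_separation(feature, is_target):
    """Squared distance between class means over the sum of class variances.

    feature:   1-D array, the output of one GP tree per frame.
    is_target: boolean array, True where the frame belongs to the phone.
    This criterion is an illustrative assumption, not the fitness from the slides.
    """
    feature = np.asarray(feature, dtype=float)
    is_target = np.asarray(is_target, dtype=bool)
    pos, neg = feature[is_target], feature[~is_target]
    # The small constant avoids division by zero for constant features.
    return float((pos.mean() - neg.mean()) ** 2 / (pos.var() + neg.var() + 1e-12))
```

A tree whose output pulls the two classes apart scores high, while one that mixes them scores near zero, so a GP run could maximize this value across generations.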

[Figure: attribute detector outputs (vowel, silence, dental, velar, voicing) feed one GP tree per phone (/aa/, /ae/, /zh/). Each tree combines detector outputs such as X7, X8, X9, X12 and X13 through plus, minus, times and cos nodes to produce a 1-D feature.]


Genetic Programming

– System is complex for speech recognition (tree + classifier for each phone), but GP trees themselves provide insights for combination
  – Fitness function
  – Tree node operators
  – Important features
– Initial results: mixed
  – Good separation for some phones, not good for most
  – GP trees select attributes of interest, discard others
– Still in progress

[Figure: example GP trees for /oy/ and /th/, built from the vowel, coronal, back, glide/semivowel, fricative, high, unround and voiced detector outputs with times, plus, minus, min, sqrt and atan nodes.]


Summary

– Evaluating posterior probs: ANNs, SVMs, LS-SVMs
  – SVMs are best for reliability and accuracy
  – In limited training data, rare phones may benefit from overlapping phonetic classes
– Genetic Programming for detector fusion
  – Small, transparent algebraic networks for combining attribute detectors
  – GP trees select relevant attributes, but much room for improvement
  – Limiting tree node operators and selecting "fitness functions" should provide insights into detector fusion


Extras

[Annotated equation not preserved in the transcript. Surviving labels:]
– Feature space correlation matrix (1)
– Feature space correlation matrix (2)
– Feature space correlation matrix (3)
– Training data
– Represents the kernel function K and the range of kernel parameters


Extras

Determine w and b by solving the optimization problem

[Annotated equation not preserved in the transcript. Surviving labels:]
– Positive scale parameters
– Regression error for training sample k
– Generalization/regularization term
– Subject to (equality constraints)
– Expression for the trade-off between generalization and training set error
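The labels on this slide match the standard LS-SVM classifier formulation from the literature, so a likely form of the missing equation can be sketched. The symbols $\mu$, $\zeta$, $e_k$ and $\varphi$ are taken from that literature and are assumptions here, not recovered from the slide.

```latex
\min_{w,\,b,\,e}\; J(w, e) \;=\; \mu\,E_W + \zeta\,E_D
  \;=\; \frac{\mu}{2}\, w^T w \;+\; \frac{\zeta}{2} \sum_{k=1}^{N} e_k^2
\qquad \text{subject to} \qquad
y_k \left[ w^T \varphi(x_k) + b \right] = 1 - e_k, \quad k = 1, \dots, N
```

In this reading, $\mu$ and $\zeta$ are the positive scale parameters, $E_W$ is the generalization/regularization term, $e_k$ is the regression error for training sample $k$, and the ratio $\zeta / \mu$ expresses the trade-off between generalization and training set error.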


Extras

Support Vector Machines
– Good performance, but the majority of training points became support vectors
– Posterior probabilities

