The Center for Signal & Image Processing, Georgia Institute of Technology
Kernel-Based Detectors and Fusion of Phonological Attributes
Brett Matthews
Mark Clements
Outline
Frame-Based Detection
– One-vs-all detectors
– Context-dependent framewise detection
– Probabilistic outputs
Kernel-Based Attribute Detection
– SVM
– Least-Squares SVM
Evaluating Probabilistic Estimates
– Naïve Bayes combinations
– Hierarchical manner classification
Detector Fusion
– Genetic programming
Frame-Based Detection
One-vs-all classifiers
– Manner of articulation: vowel, fricative, stop, nasal, glide/semivowel, silence
– Place of articulation: dental, labial, coronal, palatal, velar, glottal, back, front
– Vowel manners: high, mid, low, back, round
Framewise detection
– 10 ms frame rate
– 12 MFCCs + energy
– 8 context-dependent frames
Classifier types and posterior probabilities
– Artificial neural nets: probabilistic outputs
– Kernel-based classifiers
  – SVM: empirically determined posterior probabilities
  – LS-SVM: probabilistic outputs
[Figure: attribute detector outputs (vowel, silence, dental, velar, voicing) followed by an Event Fusion block]
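The framewise input described on this slide can be sketched as follows. This is only an illustration: the slide does not say how the 8 context frames are arranged, so the sketch assumes 4 frames on each side of the current frame, and the edge padding and the function name `stack_context` are likewise assumptions.

```python
import numpy as np

def stack_context(frames: np.ndarray, left: int = 4, right: int = 4) -> np.ndarray:
    """Concatenate each frame with its neighbouring frames.

    frames: (num_frames, num_coeffs), e.g. 13 columns for 12 MFCCs + energy.
    Returns an array of shape (num_frames, (left + 1 + right) * num_coeffs).
    Edges are padded by repeating the first/last frame (an assumption).
    """
    padded = np.pad(frames, ((left, right), (0, 0)), mode="edge")
    num_frames = frames.shape[0]
    # Window i holds, for every frame t, the frame at offset (i - left) from t.
    windows = [padded[i:i + num_frames] for i in range(left + 1 + right)]
    return np.hstack(windows)

# One second of speech at a 10 ms frame rate gives 100 frames.
features = np.random.default_rng(0).normal(size=(100, 13))
stacked = stack_context(features)
print(stacked.shape)
```

With 4 + 1 + 4 frames of 13 coefficients each, every detector input would be a 117-dimensional vector under these assumptions.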
Kernel-Based Classifiers
Support Vector Machines (SVM)
LS-SVM classifier
– Kernel-based classifier like SVM
– Least-squares formulation
– Probabilistic output scores
– LS-SVMlab package (Katholieke Universiteit Leuven)
– Same decision function as SVM: $y(x) = \operatorname{sign}(w^T \varphi(x) + b)$
– Subject to equality constraints instead of inequality constraints
– No margin optimization
– Linear system solution
Least Squares SVMs
“Support vectors” found by solving a linear system
Kernel functions
– Linear: $K(x, y) = x^T y$
– Polynomial: $K(x, y) = (\gamma x^T y + r)^d, \quad \gamma > 0$
– RBF: $K(x, y) = \exp(-\gamma \|x - y\|^2), \quad \gamma > 0$
Probabilistic outputs
– Bayesian inference for posterior probabilities
– Moderated outputs can be directly interpreted as posterior probabilities
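To make "solving a linear system" concrete, here is a minimal sketch of LS-SVM classifier training in the standard textbook form, using an RBF kernel. It is not the LS-SVMlab implementation; the function names, the toy data, and the values of `reg` (regularization trade-off) and `gamma_k` (kernel width) are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(a, b, gamma_k=0.5):
    """K(x, y) = exp(-gamma_k * ||x - y||^2) for all row pairs of a and b."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-gamma_k * np.maximum(sq, 0.0))

def lssvm_train(x, y, reg=10.0, gamma_k=0.5):
    """Solve the LS-SVM classifier system for (alpha, b).

    [ 0   y^T           ] [ b     ]   [ 0 ]
    [ y   Omega + I/reg ] [ alpha ] = [ 1 ]
    with Omega_kl = y_k * y_l * K(x_k, x_l).
    """
    n = x.shape[0]
    omega = np.outer(y, y) * rbf_kernel(x, x, gamma_k)
    system = np.zeros((n + 1, n + 1))
    system[0, 1:] = y
    system[1:, 0] = y
    system[1:, 1:] = omega + np.eye(n) / reg
    rhs = np.concatenate(([0.0], np.ones(n)))
    solution = np.linalg.solve(system, rhs)
    return solution[1:], solution[0]

def lssvm_decision(x_train, y_train, alpha, b, x_new, gamma_k=0.5):
    """Real-valued score; its sign is the predicted class."""
    return rbf_kernel(x_new, x_train, gamma_k) @ (alpha * y_train) + b

# Toy two-class problem (synthetic, for illustration only).
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(-2.0, 0.5, size=(20, 2)), rng.normal(2.0, 0.5, size=(20, 2))])
y = np.concatenate([-np.ones(20), np.ones(20)])
alpha, b = lssvm_train(x, y)
scores = lssvm_decision(x, y, alpha, b, x)
print("training accuracy:", np.mean(np.sign(scores) == y))
```

Because every training sample receives a (generally nonzero) weight `alpha_k`, the solution is not sparse the way an SVM solution is, and the dense system grows with the square of the training set size.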
Evaluating Probabilistic Estimates
Reliability and accuracy of probabilistic scores
Initial fusion experiments
– Hierarchical manner classification: LS-SVM, SVM
– Naïve Bayes combination for phone detection: LS-SVM, SVM, ANN
[Figure: ROC curves (probability of detection vs. false alarm rate) for the attribute detectors, one panel for LS-SVM and one for SVM. Attributes plotted: stop, vowel, fricative, nasal, sil, gs, back, coronal, front, high, labial, low, mid, nilfrontback, nilround, retroflex, round, unround, velar, voiced.]
Hierarchical Combinations
Probabilistic phonetic feature hierarchy for classifying frames into 6 manner classes
Train binary detectors on each split in the hierarchy: 5 detectors, 6 classes
– silence vs. speech
– sonorant | speech
– vowel | sonorant
– stop | non-sonorant
– semivowel | sonorant consonant
[Figure: fricative detection and ground truth]
$P(\text{fric} \mid x) = \big(1 - P(\text{stop} \mid \text{non-son})\big) \cdot \big(1 - P(\text{son} \mid \text{speech})\big) \cdot P(\text{speech} \mid x)$
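The fricative formula generalizes to all six manner classes by multiplying detector outputs along each path of the hierarchy. The sketch below assumes the tree implied by the five detectors listed on this slide; the nasal leaf (sonorant consonant, not semivowel) is inferred rather than stated, and the function name is hypothetical.

```python
def manner_posteriors(p_speech, p_son, p_vowel, p_stop, p_semivowel):
    """Combine five binary detector outputs into six manner-class posteriors.

    p_speech    = P(speech | x)
    p_son       = P(sonorant | speech)
    p_vowel     = P(vowel | sonorant)
    p_stop      = P(stop | non-sonorant)
    p_semivowel = P(semivowel | sonorant consonant)
    """
    p_son_cons = p_speech * p_son * (1.0 - p_vowel)   # reached the sonorant-consonant node
    p_nonson = p_speech * (1.0 - p_son)               # reached the non-sonorant node
    return {
        "silence": 1.0 - p_speech,
        "vowel": p_speech * p_son * p_vowel,
        "semivowel": p_son_cons * p_semivowel,
        "nasal": p_son_cons * (1.0 - p_semivowel),    # inferred leaf, see lead-in
        "stop": p_nonson * p_stop,
        "fricative": p_nonson * (1.0 - p_stop),       # matches the formula on the slide
    }

# Detector outputs for a single frame (made-up numbers).
posteriors = manner_posteriors(0.9, 0.3, 0.6, 0.2, 0.5)
best = max(posteriors, key=posteriors.get)
print(best, posteriors[best])
```

Because each split contributes p and 1 - p to its two branches, the six products sum to one by construction, whatever the individual detector outputs are.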
Hierarchical Combinations
1. Reliability of posterior probabilities (right)
– Plot probabilistic estimates of combinations vs. observed frequencies
– Hierarchical combinations much more reliable for SVM than LS-SVM
2. Classification accuracy (below)
– Higher classification accuracy for SVMs, especially fricatives
3. Upper-bound comparison (below)
– One-vs-all classifiers trained directly for each class
– Combinations nearly as accurate as one-vs-all for classification performance
– LS-SVM combinations not good for semivowel and nasal
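The reliability check in item 1 (estimated probabilities plotted against observed frequencies) can be computed along these lines. The 0.1-wide bins follow the plot labels; the function name and the synthetic, perfectly calibrated scores used in the demo are assumptions.

```python
import numpy as np

def reliability_curve(probs, labels, num_bins=10):
    """Return (mean predicted prob, observed frequency, count) per non-empty bin."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, num_bins + 1)
    # Interior edges only, so a score of exactly 1.0 lands in the last bin.
    idx = np.clip(np.digitize(probs, edges[1:-1]), 0, num_bins - 1)
    counts = np.bincount(idx, minlength=num_bins)
    keep = counts > 0
    mean_pred = np.bincount(idx, weights=probs, minlength=num_bins)[keep] / counts[keep]
    observed = np.bincount(idx, weights=labels, minlength=num_bins)[keep] / counts[keep]
    return mean_pred, observed, counts[keep]

# Synthetic scores that are calibrated by construction: the label is 1 with probability p.
rng = np.random.default_rng(0)
probs = rng.random(20000)
labels = rng.random(20000) < probs
mean_pred, observed, counts = reliability_curve(probs, labels)
print(np.round(np.abs(observed - mean_pred).max(), 3))
```

A reliable detector gives points close to the diagonal (observed frequency roughly equal to the mean predicted probability in each bin); the bin counts show how much data supports each point.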
[Figure: reliability diagrams for the hierarchical combinations, LS-SVM (combined) and SVM (LIBSVM, combined), one panel per manner class: stop, vowel, fricative, nasal, silence, semivowel/glide. Each panel plots a posteriori probability estimates against raw score bins (about 0.1 wide), with a histogram of bin counts.]
[Table: classification accuracy (%)]
Naïve Bayes Combinations
One-vs-all frameworks desired
– Phonetic hierarchies are cumbersome
Phone detection
– Combine phonological attribute scores with a Naïve Bayes product
Initial experiments in evaluating probabilities
– Compare accuracy and reliability of probabilistic outputs for ANN, SVM and LS-SVM
– Limited training data (LS-SVM limit is 3000 due to memory restrictions)
– Detect phones with combinations of relevant phonetic attributes
$P(/f/ \mid x) = P(\text{labial} \mid x) \, P(\text{fric} \mid x) \, \big(1 - P(\text{voicing} \mid x)\big)$
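The product rule for /f/ can be written generically as a lookup of which attributes a phone requires to be present or absent. In this sketch only the /f/ entry comes from the slide; the /v/ entry (its voiced counterpart) and the names `PHONE_ATTRIBUTES` and `phone_score` are illustrative assumptions.

```python
import numpy as np

# True: the attribute must be present; False: it must be absent.
PHONE_ATTRIBUTES = {
    "f": {"labial": True, "fricative": True, "voicing": False},  # from the slide
    "v": {"labial": True, "fricative": True, "voicing": True},   # assumed for illustration
}

def phone_score(phone, posteriors):
    """Naive Bayes product of attribute posteriors for one phone.

    posteriors maps attribute name -> P(attribute | x), a scalar or per-frame array.
    """
    score = 1.0
    for attribute, present in PHONE_ATTRIBUTES[phone].items():
        p = np.asarray(posteriors[attribute], dtype=float)
        score = score * (p if present else 1.0 - p)
    return score

# Detector outputs for one frame (made-up numbers).
frame = {"labial": 0.8, "fricative": 0.9, "voicing": 0.1}
score_f = float(phone_score("f", frame))
score_v = float(phone_score("v", frame))
print(score_f, score_v)
```

The product treats the attribute detectors as independent given the frame, which is the "naïve" part of the combination.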
Naïve Bayes Combinations
1. Phone detection
– Compare combined attributes with direct training on phones as an upper bound
2. ROC stats (right)
– SVMs best for attribute detection
– Mixed results for NB combinations: no clear winner between LS-SVM and SVM
– Direct training outperforms combinations
3. Reliability
– Naïve Bayes combinations give poor reliability for all detector types
4. Rare phones & vowels
– For /v/, /ng/ and /oy/, improvements in EER and AUC across detector types (lower right)
– Most vowels saw improvements as well
[Tables: ROC stats; direct vs. combined]
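For reference, the ROC statistics named above (AUC and EER) can be obtained from detector scores roughly as follows. This is a generic sketch, not the evaluation code behind the slides; it ignores tied scores, and the function names and toy scores are assumptions.

```python
import numpy as np

def roc_curve(scores, labels):
    """False-alarm and detection rates as the threshold sweeps from high to low."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    order = np.argsort(-scores, kind="stable")
    hits = np.cumsum(labels[order])
    false_alarms = np.cumsum(~labels[order])
    p_det = np.concatenate(([0.0], hits / labels.sum()))
    p_fa = np.concatenate(([0.0], false_alarms / (~labels).sum()))
    return p_fa, p_det

def auc_and_eer(p_fa, p_det):
    """Area under the ROC curve (trapezoid rule) and approximate equal error rate."""
    auc = float(np.sum(np.diff(p_fa) * (p_det[1:] + p_det[:-1]) / 2.0))
    miss = 1.0 - p_det
    k = int(np.argmin(np.abs(miss - p_fa)))   # operating point where miss ~= false alarm
    eer = float((miss[k] + p_fa[k]) / 2.0)
    return auc, eer

# Four frames, two of which contain the target phone (toy numbers).
p_fa, p_det = roc_curve([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0])
auc, eer = auc_and_eer(p_fa, p_det)
print(auc, eer)
```

Higher AUC and lower EER both indicate better detection, which is the sense in which the rare phones are reported to improve.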
Naïve Bayes Combinations
[Figure: ROC curves (probability of detection vs. false alarm rate) for the phones /v/, /f/, /z/ and /s/, in two panels: combined attributes (SVM) and direct training (SVM)]
Genetic Programming
Evolutionary algorithm for tree-structured feature “creation” (extraction)
– Maximize a fitness function across a number of generations (iterations)
– Operations like crossover & mutation control the evolution of the algorithm
Trees are algebraic networks
– Inputs are multi-dimensional features
– Tree nodes are unary or binary mathematical operators (+, -, *, (·)², log)
– Algebraic networks are simpler and more transparent than neural nets
GPLab package from Universidade de Coimbra, Portugal: http://gplab.sourceforge.net
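An algebraic network of this kind is just an expression tree evaluated on the detector outputs. The sketch below is independent of GPLab (which is a MATLAB toolbox): the nested-tuple representation, the protected square root, and the example tree are all assumptions made for illustration, with operator names borrowed from the tree diagrams in this deck.

```python
import math

# Unary and binary node operators; sqrt is "protected" against negative inputs.
UNARY = {"cos": math.cos, "atan": math.atan, "sqrt": lambda v: math.sqrt(abs(v))}
BINARY = {
    "plus": lambda a, b: a + b,
    "minus": lambda a, b: a - b,
    "times": lambda a, b: a * b,
    "min": min,
}

def evaluate(tree, inputs):
    """Evaluate a nested-tuple expression tree on a dict of detector outputs.

    A leaf is an input name (str); an internal node is (operator, child, ...).
    """
    if isinstance(tree, str):
        return inputs[tree]
    op, *args = tree
    values = [evaluate(arg, inputs) for arg in args]
    if op in UNARY:
        return UNARY[op](values[0])
    return BINARY[op](values[0], values[1])

# Hypothetical tree: atan(fricative) * (high - voiced). Not a tree from the slides.
tree = ("times", ("atan", "fricative"), ("minus", "high", "voiced"))
frame = {"fricative": 0.9, "high": 0.2, "voiced": 0.1}
feature = evaluate(tree, frame)
print(feature)
```

Each tree maps a frame's detector outputs to a single number, which is what makes the result easy to inspect compared with a neural network's hidden layers.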
Genetic Programming
Trained GP trees on SVM outputs
– Develop algebraic networks for combining detector outputs
– Produce a 1-D feature from a nonlinear combination of detector outputs
– Choose fitness function, set of node operators, tree depth, etc. to maximize separation
[Figure: detector outputs (vowel, silence, dental, velar, voicing) feed one GP tree per phone (/aa/, /ae/, /zh/), each producing a 1-D feature. The example tree combines inputs X7, X8, X9, X12 and X13 through plus, minus, times and cos nodes.]
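The slides do not say which fitness function was used. One plausible choice for "maximize separation" of a 1-D feature is a Fisher-style ratio of between-class distance to within-class spread; the sketch below uses that as an assumption, with synthetic detector outputs standing in for real data.

```python
import numpy as np

def fisher_fitness(feature, labels, eps=1e-12):
    """Squared distance between class means divided by the summed class variances."""
    feature = np.asarray(feature, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = feature[labels], feature[~labels]
    return float((pos.mean() - neg.mean()) ** 2 / (pos.var() + neg.var() + eps))

# Synthetic data: det_a tracks the target phone, det_b is uninformative noise.
rng = np.random.default_rng(0)
labels = rng.random(500) < 0.3
det_a = np.clip(0.2 + 0.6 * labels + rng.normal(0.0, 0.1, size=500), 0.0, 1.0)
det_b = rng.random(500)

fitness_a = fisher_fitness(det_a, labels)           # candidate tree: det_a alone
fitness_ab = fisher_fitness(det_a * det_b, labels)  # candidate tree: times(det_a, det_b)
print(fitness_a, fitness_ab)
```

Under a fitness of this kind, a tree that multiplies in an irrelevant detector scores lower than one that leaves it out, which is one way trees could end up selecting attributes of interest and discarding others.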
Genetic Programming
System is complex for speech recognition (tree + classifier for each phone), but the GP trees themselves provide insights for combination
– Fitness function
– Tree node operators
– Important features
Initial results: mixed
– Good separation for some phones, not good for most
– GP trees select attributes of interest, discard others
Still in progress
[Figure: GP trees for /oy/ and /th/. Leaf attributes include vowel, coronal, back, glide/semivowel, fricative, high, unround and voiced; node operators include times, plus, minus, min, sqrt and atan.]
Summary
Evaluating posterior probabilities
– ANNs, SVMs, LS-SVMs
– SVMs are best for reliability and accuracy
– With limited training data, rare phones may benefit from overlapping phonetic classes
Genetic programming for detector fusion
– Small, transparent algebraic networks for combining attribute detectors
– GP trees select relevant attributes, but there is much room for improvement
– Limiting tree node operators and selecting “fitness functions” should provide insights into detector fusion
Extras
Feature Space correlation matrix (1)
Feature Space correlation matrix (2)
Feature Space correlation matrix (3)
Training Data
Represents the kernel function K and the range of kernel
parameters
Extras
Positive scale parameters
Determine w and b by solving the optimization problem
Regression error for training sample k
Generalization/ Regularization term
Subject to
Expression for the trade-off between generalization and training set error
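The equations these labels annotated did not survive in the text. For orientation, the standard LS-SVM classifier formulation in the notation commonly used with LS-SVMlab's Bayesian framework is given below; the symbols $\mu$, $\zeta$, $e_k$ and $\gamma$ are assumed to be the ones the labels refer to, not recovered from the slide.

```latex
\begin{aligned}
\min_{w,\,b,\,e}\quad & J(w, e) = \mu E_W + \zeta E_D,
  \qquad E_W = \tfrac{1}{2}\, w^{T} w,
  \qquad E_D = \tfrac{1}{2} \sum_{k=1}^{N} e_k^{2} \\
\text{subject to}\quad & y_k \left[ w^{T} \varphi(x_k) + b \right] = 1 - e_k,
  \qquad k = 1, \dots, N \\
\text{trade-off}\quad & \gamma = \zeta / \mu
\end{aligned}
```

Read against the labels: $\mu$ and $\zeta$ are the positive scale parameters, $E_W$ is the generalization/regularization term, $e_k$ is the regression error for training sample $k$, and $\gamma$ expresses the trade-off between generalization and training set error.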