Philip Jackson, Boon-Hooi Lo and Martin Russell
Electronic Electrical and Computer Engineering
Models of speech dynamics for ASR, using intermediate linear representations
http://web.bham.ac.uk/p.jackson/balthasar/
INTRODUCTION
Abstract
Speech dynamics into ASR
• dynamics of speech production to constrain recognizer
  – noisy environments
  – conversational speech
  – speaker adaptation
• efficient, complete and trainable models
  – for recognition
  – for analysis
  – for synthesis
Articulatory trajectories
from West (2000)
Articulatory-trajectory model
[Figure: model levels labelled finite-state, intermediate and surface; annotation "source dependent"]
Multi-level Segmental HMM
• segmental finite-state process
• intermediate “articulatory” layer
  – linear trajectories
• mapping required
  – linear transformation
  – radial basis function network
Linear-trajectory model
[Figure: segmental HMM (states 1 to 5), intermediate layer, articulatory-to-acoustic mapping, acoustic layer]
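THEORY
Linear-trajectory equations
Defined as
$$\mathbf{f}_i(t) = (t - \bar{t})\,\mathbf{m}_i + \mathbf{c}_i,$$
where
$$\bar{t} = \frac{\tau + 1}{2}.$$
Segment probability:
$$b_i(\mathbf{y}_1^{\tau}) = \prod_{t=1}^{\tau} \mathcal{N}\big(\mathbf{y}(t);\, W\mathbf{f}_i(t),\, R_i\big)$$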
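The linear-trajectory segment probability can be sketched numerically as follows. This is a minimal illustration, not the authors' implementation: it assumes a segment of tau frames indexed t = 1..tau, a full covariance matrix R, and a single mapping matrix W; the function names are hypothetical.

```python
import numpy as np

def linear_trajectory(m, c, tau):
    """Return f(t) = (t - t_bar) * m + c for t = 1..tau, shape (tau, d)."""
    t = np.arange(1, tau + 1)
    t_bar = (tau + 1) / 2.0               # segment mid-point
    return (t - t_bar)[:, None] * m[None, :] + c[None, :]

def segment_log_prob(y, m, c, W, R):
    """Log of prod_t N(y(t); W f(t), R) for one segment y of shape (tau, D)."""
    tau, D = y.shape
    f = linear_trajectory(m, c, tau)      # (tau, d) intermediate trajectory
    resid = y - f @ W.T                   # (tau, D) acoustic residuals
    _, logdet = np.linalg.slogdet(R)      # log-determinant of the covariance
    maha = np.einsum("ti,ij,tj->", resid, np.linalg.inv(R), resid)
    return -0.5 * (tau * (D * np.log(2 * np.pi) + logdet) + maha)
```

Working in the log domain avoids underflow when the per-frame densities are multiplied over a long segment.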
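Linear mapping
Objective function
$$E = \sum_{t=1}^{T} \big(W\mathbf{x}(t) - \mathbf{y}(t)\big)^{\top} R_i^{-1} \big(W\mathbf{x}(t) - \mathbf{y}(t)\big),$$
with matched sequences $\mathbf{x}_1^T$ and $\mathbf{y}_1^T$.
[Equations not recoverable: two further expressions in W, X and Y, the second a minimisation of a quantity D]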
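One common way to fit a linear mapping from matched sequences is ordinary least squares via the pseudo-inverse. The sketch below assumes the frames are stacked column-wise into matrices X (d by T) and Y (D by T) and that the error is unweighted; this is an illustrative choice, not necessarily the slide's solution, and the function name is hypothetical. If the same weighting matrix is applied to every frame, the minimiser over W is unchanged, so the unweighted form covers that case too.

```python
import numpy as np

def fit_linear_mapping(X, Y):
    """Least-squares W minimising ||Y - W X||_F, with X: (d, T) and Y: (D, T)."""
    # pinv gives the minimum-norm least-squares solution even if X is rank deficient.
    return Y @ np.linalg.pinv(X)
```

For example, with 3 intermediate features and 13 acoustic features, `fit_linear_mapping` returns a 13 by 3 matrix.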
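Trajectory parameters
Utterance probability,
$$\Pr(\mathbf{y} \mid \mathbf{s}, M) = \prod_{i=1}^{S} b_i\big(\mathbf{y}_{t_i}^{t_{i+1}-1}\big),$$
and, for the optimal (ML) state sequence $\hat{\mathbf{s}}$,
[Equations: closed-form estimates $\hat{\mathbf{m}}_i$ and $\hat{\mathbf{c}}_i$, each a sum over the frames $t = t_i, \ldots, t_{i+1}-1$ of segment $i$ of the observations transformed by the matrices D and W; the slope estimate weights each frame by $(t - \bar{t}_i)$ and is normalised by $\sum_t (t - \bar{t}_i)^2$, the mid-point estimate is normalised by the segment duration T. The matrix subscripts are not recoverable.]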
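For a fixed segmentation and a linear mapping, slope and mid-point estimates can be obtained in closed form by weighted least squares. The sketch below is one standard derivation under stated assumptions (a single mapping W with full column rank, a symmetric positive-definite covariance R, frames indexed 1..tau within the segment); it is not claimed to match the slide's expressions term for term, and the function name is hypothetical. Because the offsets (t - t_bar) sum to zero, the two estimates decouple.

```python
import numpy as np

def fit_trajectory(y, W, R):
    """Weighted least-squares slope m and mid-point c for one segment.

    y: (tau, D) observations, W: (D, d) mapping, R: (D, D) covariance.
    Minimises sum_t (y_t - W f(t))^T R^-1 (y_t - W f(t)), f(t) = (t - t_bar) m + c.
    """
    tau = y.shape[0]
    dt = np.arange(1, tau + 1) - (tau + 1) / 2.0    # t - t_bar, sums to zero
    Rinv_W = np.linalg.solve(R, W)                  # R^-1 W, shape (D, d)
    P = np.linalg.solve(W.T @ Rinv_W, Rinv_W.T)     # (W^T R^-1 W)^-1 W^T R^-1
    c = P @ y.mean(axis=0)                          # mid-point estimate
    denom = np.sum(dt ** 2)
    # A one-frame segment carries no slope information; fall back to zero slope.
    m = P @ (dt @ y) / denom if denom > 0 else np.zeros(W.shape[1])
    return m, c
```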
Non-linear (RBF) mapping
[Figure: RBF network with layers f_i(t) (formant trajectories), x_j(t) and y_k(t) (acoustic layer)]
Trajectory parameters
With the RBF, the least-squares solution is sought by gradient descent:
[Equations: the gradients $\partial E / \partial \mathbf{m}_i$ and $\partial E / \partial \mathbf{c}_i$, each a double sum over frames $t$ and RBF units $j$]
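Gradient descent on the trajectory parameters through a non-linear mapping can be sketched as below. The sketch makes several assumptions that are not taken from the slides: Gaussian basis functions with centres `mu` and widths `sigma`, a linear output layer `V`, an unweighted squared error, and a fixed step size. All names are hypothetical, and the slide's own gradient expressions are not reproduced here.

```python
import numpy as np

def rbf_forward(f, mu, sigma, V):
    """Gaussian RBF network: f (tau, d) -> activations x (tau, J) -> output (tau, D)."""
    diff = f[:, None, :] - mu[None, :, :]                     # (tau, J, d)
    x = np.exp(-np.sum(diff ** 2, axis=2) / (2 * sigma ** 2)) # (tau, J)
    return x @ V.T, x, diff

def error_and_grads(y, m, c, mu, sigma, V):
    """Squared error E and its gradients with respect to slope m and mid-point c."""
    tau = y.shape[0]
    dt = np.arange(1, tau + 1) - (tau + 1) / 2.0
    f = dt[:, None] * m + c                                   # linear trajectory
    y_hat, x, diff = rbf_forward(f, mu, sigma, V)
    e = y - y_hat                                             # (tau, D) residuals
    a = (e @ V) * x / sigma ** 2                              # (tau, J)
    g = 2.0 * np.einsum("tj,tjd->td", a, diff)                # dE/df(t), (tau, d)
    return np.sum(e ** 2), dt @ g, g.sum(axis=0)

def fit_by_gradient_descent(y, m, c, mu, sigma, V, rate=0.01, steps=200):
    """Plain fixed-step gradient descent on (m, c); the step size is a guess."""
    for _ in range(steps):
        _, grad_m, grad_c = error_and_grads(y, m, c, mu, sigma, V)
        m, c = m - rate * grad_m, c - rate * grad_c
    return m, c
```

The chain rule gives the slope gradient as the per-frame gradient weighted by (t - t_bar), and the mid-point gradient as its plain sum over frames.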
METHOD
Tests on TIMIT
• N. American English, at 8 kHz
  – MFCC13 acoustic features (incl. zero’th)
  a) F1-3: formants F1, F2 and F3, estimated by Holmes formant tracker
  b) F1-3+BE5: five band energies added
  c) PFS12: synthesiser control parameters
RESULTS
TIMIT baseline performance
[Figure: Accuracy (%), axis 47 to 54, against Features (ID_0, ID_1)]
• Constant-trajectory SHMM (ID_0)
• Linear-trajectory SHMM (ID_1)
Performance across feature sets
[Figure: Accuracy (%), axis 47 to 54, against Features (ID_0, (a) F1-3, (b) F1-3+BE5, (c) PFS12, ID_1)]
METHOD
Phone categorisation
     No.  Description
A    1    all data
B    2    silence; speech
C    6    linguistic categories: silence/stop; vowel; liquid; nasal; fricative; affricate
D    10   as Deng and Ma (2000): silence; vowel; liquid; nasal; UV fric; /s,ch/; V fric; /z,jh/; UV stop; V stop
E    10   discrete articulatory regions
F    49   silence; individual phones
Discrete articulatory regions
     Features                       Description
0    -voice                         Silence, non-speech
1    +voice, VT open                Vowel, glide
2    +voice, VT part.               Liquid, approximant
3    +voice, VT closed, +velum      Nasal
4    +voice, VT closed              Voiced plosive (closure)
5    -voice, VT closed              Voiceless plosive (closure)
6    +voice, VT open, +plosion      Voiced plosive (release)
7    -voice, VT open, +plosion      Voiceless plosive (release)
8    +voice, VT part., +fric/asp    Voiced fricative
9    -voice, VT part., +fric/asp    Voiceless fricative
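The ten regions in the table above can be held in a small lookup structure, with one linear mapping per region. The sketch below only encodes the table and shows how a region-specific mapping might be selected; the names `REGIONS`, `is_voiced` and `map_to_acoustic` are hypothetical, and how individual phones are assigned to regions is not specified here.

```python
import numpy as np

# (features, description) for each discrete articulatory region, as tabulated.
REGIONS = {
    0: ("-voice", "Silence, non-speech"),
    1: ("+voice, VT open", "Vowel, glide"),
    2: ("+voice, VT part.", "Liquid, approximant"),
    3: ("+voice, VT closed, +velum", "Nasal"),
    4: ("+voice, VT closed", "Voiced plosive (closure)"),
    5: ("-voice, VT closed", "Voiceless plosive (closure)"),
    6: ("+voice, VT open, +plosion", "Voiced plosive (release)"),
    7: ("-voice, VT open, +plosion", "Voiceless plosive (release)"),
    8: ("+voice, VT part., +fric/asp", "Voiced fricative"),
    9: ("-voice, VT part., +fric/asp", "Voiceless fricative"),
}

def is_voiced(region):
    """True if the region's feature string starts with +voice."""
    return REGIONS[region][0].startswith("+voice")

def map_to_acoustic(f, region, mappings):
    """Apply the region's own mapping matrix to an intermediate vector f."""
    return mappings[region] @ f
```

Here `mappings` would be a dictionary from region index to a matrix W, one per region, so that grouping E uses ten mappings in total.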
RESULTS
Performance across groupings
[Figure: Accuracy (%), axis 47 to 54, against Mappings (ID_0, A (1), B (2), C (6), D (10), E (10), F (49), ID_1)]
Results across groupings
[Figure: Accuracy (%), axis 47 to 54, against Mappings (ID_0, A (1), B (2), C (6), D (10), E (10), F (49), ID_1); series: (a) F1-3, (b) F1-3+BE5, (c) PFS12]
METHOD
Tests on MOCHA
• S. British English, at 16 kHz
  – MFCC13 acoustic features (incl. zero’th)
  – articulatory x- & y-coords from 7 EMA coils
  – PCA9+Lx: first nine articulatory modes plus the laryngograph log energy
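A PCA9+Lx feature vector of the kind listed above could be built roughly as follows. This is a sketch under assumptions: the 7 coils give 14 coordinates per frame, the modes are taken from an SVD of the mean-centred data, and the laryngograph log energy is supplied as a ready-made per-frame array. The function names are hypothetical and no scaling or smoothing is applied.

```python
import numpy as np

def pca_modes(ema, n_modes=9):
    """ema: (T, 14) frames of x/y coordinates. Returns (mean, modes, scores)."""
    mean = ema.mean(axis=0)
    centred = ema - mean
    # Rows of vt are orthonormal principal directions, ordered by variance.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    modes = vt[:n_modes]                       # (n_modes, 14)
    return mean, modes, centred @ modes.T      # scores: (T, n_modes)

def pca9_plus_lx(ema, lx_log_energy):
    """Stack the first nine articulatory mode scores with the Lx log energy."""
    _, _, scores = pca_modes(ema, 9)
    return np.column_stack([scores, lx_log_energy])   # (T, 10)
```

Keeping nine of the fourteen directions discards the lowest-variance components of the coil coordinates.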
RESULTS
MOCHA baseline performance
[Figure: Accuracy (%), axis 53 to 56, against Mappings (ID_0, ID_1)]
Performance across mappings
[Figure: Accuracy (%), axis 53 to 56, against Mappings (ID_0, A (1), B (2), C (6), D (10), E (10), F (49), ID_1)]
DISCUSSION
Model visualisation
[Figure panels: Original acoustic data; Constant-trajectory model; Linear-trajectory model, (F) PFS12 (c)]
SUMMARY
Conclusions
• Theory of Multi-level Segmental HMMs
• Benefits of linear trajectories
• Results show near-optimal performance with linear mappings
• Progress towards unified models of the speech production process
• What next?
  – unsupervised (embedded) training, to derive pseudo-articulatory representations
  – implement non-linear mapping (i.e., RBF)
  – include biphone language model, and segment duration models