550, rue Sherbrooke Ouest, bureau 100, Montréal (Québec) H3A 1B9
Tel.: (514) 840-1234; Fax: (514) 840-1244
888, rue St-Jean, bureau 555, Québec (Québec) G1R 5H6
Tel.: (418) 648-8080; Fax: (418) 648-8141
http://www.crim.ca
CRIM - Documentation/Communications
E-Inclusion Core Speech - Linear Transforms
2005-2006
Technical Report
CRIM-06/05-04
Gilles Boulianne
Speech Group
May 2006
Preliminary
Collection scientifique et technique
ISBN-13: 978-2-89522-073-5
ISBN-10: 2-89522-073-5
Proposition no
E-Inclusion Core Speech - Linear Transforms
Contents

1 Literature review of optimal linear transforms
2 Implementation
  2.1 Feature space Maximum Likelihood Linear Regression
    2.1.1 Transform estimation
    2.1.2 Computation of a row w_i
    2.1.3 Cofactor matrix computation
    2.1.4 Initialisation
  2.2 Heteroscedastic Linear Discriminant Analysis
    2.2.1 Transform optimization
    2.2.2 Implementation details
3 Results
All rights reserved © 2006 CRIM, May 2006
1 Literature review of optimal linear transforms

Since we have already experimented with MLLR and FMLLR/CMLLR, the review focused on state-of-the-art techniques such as STC, MLLT, EMLLT, HLDA, and SPAM, which can all be seen as forms of precision matrix modelling.

Comparative reviews of all these techniques can be found in a recent Ph.D. thesis [1] and a technical report [2]. The latter also discusses implementation details at length, and provides comparative results on combinations of the above methods. The following table summarizes some of the information found in these two reviews.
Transformation   Estimation                  Space     Dim. reduction
SPAM             Numerical                   Model     No
EMLLT            Closed-form and numerical   Model     No
MLLT             Closed-form                 Model     No
HDA              Numerical                   Feature   Yes
HLDA             Closed-form                 Feature   Yes
LDA              Closed-form                 Feature   Yes
Table 1. Comparison of linear transformations for environment normalization.
Remarks:
• SPAM is a generalization of EMLLT (any symmetric matrices as bases instead of rank-1)
• EMLLT is equivalent to an HDA without dimensionality reduction for the particular values ∆ = D and ∆ = D(D + 1)/2
• MLLT is a special case of EMLLT and HDA assuming diagonal covariances (and no dimensionality reduction for HDA)
• HDA requires quadratic optimisation, which may fail in some cases (falling back to steepest descent)
• HLDA is similar to HDA but optimizes a slightly different maximum likelihood criterion, which has LDA as a special case when a common diagonal covariance is assumed
• HLDA-PMM is a special case of HLDA without dimensionality reduction (retaining the nuisance parameters) which implements precision matrix modelling
• LDA is a special case of HDA or HLDA assuming a common diagonal covariance
2 Implementation
In this section we first describe the FMLLR implementation, which served as the basis for implementing HLDA.

2.1 Feature space Maximum Likelihood Linear Regression

Feature space Maximum Likelihood Linear Regression (FMLLR) is also known as "constrained model-space transformation".

Section 2.1.1 reproduces the equations found in [3], while sections 2.1.2 and 2.1.3 describe additional steps developed at CRIM during implementation.
2.1.1 Transform estimation
Here the transformation applied to the variances must correspond to the transform applied to the means:

\hat{\mu} = A' \mu - b'    (1)

and

\hat{\Sigma} = A' \Sigma A'^T    (2)

The following equations present the solution for the full-transformation case with diagonal covariance matrices for the models to be adapted.

If W is the extended transformation matrix [ b^T A^T ]^T and \zeta(\tau) is the extended observation vector [ 1 \; o(\tau)^T ]^T, we have:

\hat{o}(\tau) = A o(\tau) + b = W \zeta(\tau)    (3)
We want to maximize the following auxiliary function in terms of A and b:

Q(\mathcal{M}, \hat{\mathcal{M}}) = K - \frac{1}{2} \sum_{m=1}^{M} \sum_{\tau=1}^{T} \gamma_m(\tau) \left[ K^{(m)} + \log(|\Sigma^{(m)}|) - \log(|A|^2) + (A o(\tau) + b - \mu^{(m)})^T \Sigma^{(m)-1} (A o(\tau) + b - \mu^{(m)}) \right]    (4)
Restricting to diagonal covariances and ignoring terms that are independent of W, we can rewrite equation 4 as:

Q(\mathcal{M}, \hat{\mathcal{M}}) = \beta \log(p_i w_i^T) - \frac{1}{2} \sum_{j=1}^{n} \left( w_j G^{(j)} w_j^T - 2 w_j k^{(j)T} \right)    (5)
To find the W that maximizes Q(\mathcal{M}, \hat{\mathcal{M}}) we will need, for each i = 1, \ldots, N, where N is the number of rows of A:

G^{(i)} = \sum_{m=1}^{M} \frac{1}{\sigma_i^{(m)2}} \sum_{\tau=1}^{T} \gamma_m(\tau) \, \zeta(\tau) \zeta(\tau)^T    (6)

k^{(i)} = \sum_{m=1}^{M} \frac{1}{\sigma_i^{(m)2}} \, \mu_i^{(m)} \sum_{\tau=1}^{T} \gamma_m(\tau) \, \zeta(\tau)^T    (7)

\beta = \sum_{m=1}^{M} \sum_{\tau=1}^{T} \gamma_m(\tau)    (8)
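As an illustration, here is a minimal numpy sketch of accumulating the statistics of equations 6 to 8 on synthetic posteriors; the variable names and data are ours, not taken from [3]:

```python
import numpy as np

rng = np.random.default_rng(0)
n, M, T = 3, 2, 50                        # feature dim, components, frames

obs   = rng.normal(size=(T, n))           # observations o(tau)
gamma = rng.random(size=(M, T))           # component posteriors gamma_m(tau)
mu    = rng.normal(size=(M, n))           # component means mu^(m)
var   = rng.random(size=(M, n)) + 0.5     # diagonal variances sigma_i^(m)2

zeta = np.hstack([np.ones((T, 1)), obs])  # extended observations [1 o(tau)^T]^T

beta = gamma.sum()                        # equation 8
G = np.zeros((n, n + 1, n + 1))           # one G^(i) per row of A
k = np.zeros((n, n + 1))                  # one k^(i) per row of A
for i in range(n):
    for m in range(M):
        w = gamma[m] / var[m, i]          # gamma_m(tau) / sigma_i^(m)2
        G[i] += (zeta * w[:, None]).T @ zeta   # equation 6
        k[i] += mu[m, i] * (w @ zeta)          # equation 7
```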
The value of W which maximizes Q(\mathcal{M}, \hat{\mathcal{M}}) is found row by row, each row w_i of W being given by:

w_i = (\alpha p_i + k^{(i)}) G^{(i)-1}    (9)

where \alpha satisfies the quadratic equation:

\alpha^2 \, p_i G^{(i)-1} p_i^T + \alpha \, p_i G^{(i)-1} k^{(i)T} - \beta = 0    (10)

and p_i is the extended cofactor row vector [ 0 \; c_{i1} \ldots c_{in} ], with c_{ij} = \mathrm{Cof}(A_{ij}). Equation 10 can be written in quadratic form:

\alpha^2 \varepsilon_1 + \alpha \varepsilon_2 - \beta = 0    (11)

where

\varepsilon_1 = p_i G^{(i)-1} p_i^T    (12)

\varepsilon_2 = p_i G^{(i)-1} k^{(i)T}    (13)

Plugging these values back into equation 5, the auxiliary function becomes:

Q^{(i)}(\mathcal{M}, \hat{\mathcal{M}}) = \beta \log(|\alpha \varepsilon_1 + \varepsilon_2|) - \frac{1}{2} \alpha^2 \varepsilon_1    (14)

Since equation 11 has two solutions for \alpha, we select the one that maximizes equation 14. Also note that each row w_i depends on the cofactors p_i, and thus on all the other rows of W; we must therefore optimize row by row in an iterative process.
2.1.2 Computation of a row w_i

Consider the following systems of equations, with solutions X and Y:

G^{(i)} X = p_i^T    (15)

G^{(i)} Y = k^{(i)T}    (16)

Then:

\varepsilon_1 = p_i X    (17)

\varepsilon_2 = p_i Y    (18)

and equation 9 can be rewritten:

w_i = \alpha X^T + Y^T    (19)

Equations 15 and 16 can be solved in a single step. Solving for Z = [ X \; Y ]:

G^{(i)} Z = [ p_i^T \; k^{(i)T} ]    (20)

we can compute \varepsilon_1 and \varepsilon_2 directly:

[ \varepsilon_1 \; \varepsilon_2 ] = p_i Z    (21)

Then, solving equation 11 for \alpha, we can compute w_i directly from Z as:

w_i = [ \alpha \; 1 ] Z^T    (22)
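The row computation of equations 20 to 22, including the root selection of equation 14, can be sketched as follows (a hypothetical numpy helper, assuming G^{(i)} is positive definite and the statistics are already accumulated):

```python
import numpy as np

def fmllr_row(G, k, p, beta):
    """One row update (equations 20-22): solve for Z, get eps1/eps2, then pick
    the root of the quadratic (equation 11) that maximizes the auxiliary
    function (equation 14)."""
    Z = np.linalg.solve(G, np.column_stack([p, k]))   # equation 20: G Z = [p^T k^T]
    eps1, eps2 = p @ Z                                # equation 21
    roots = np.roots([eps1, eps2, -beta]).real        # equation 11 (both roots real)
    aux = lambda a: beta * np.log(abs(a * eps1 + eps2)) - 0.5 * a * a * eps1
    alpha = max(roots, key=aux)                       # root maximizing equation 14
    return np.array([alpha, 1.0]) @ Z.T               # equation 22: w_i = [alpha 1] Z^T

# Synthetic statistics just to exercise the helper (not real FMLLR accumulators).
rng = np.random.default_rng(0)
d = 4                                  # extended dimension n + 1
B = rng.normal(size=(d, d))
G = B @ B.T + d * np.eye(d)            # SPD stand-in for G^(i)
k = rng.normal(size=d)
p = rng.normal(size=d)
w = fmllr_row(G, k, p, beta=100.0)
```

Since \varepsilon_1 > 0 and \beta > 0, the discriminant of equation 11 is always positive, so both roots are real and the selection in equation 14 is well defined.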
2.1.3 Cofactor matrix computation

The preceding equations require the computation of p_i for each row. This corresponds to the computation of one column of the inverse matrix A^{-1}, since by definition of the adjoint matrix A^a we have:

A^{-1} = \frac{1}{|A|} A^a    (23)

where

A^a = \mathrm{Cof}(A_{ij})^T    (24)

Thus:

p_i = [ 0 \;\; |A| \, \mathrm{col}_i(A^{-1})^T ]    (25)
For a given row i, we only need to compute a single column of the inverse, since for the next row the matrix A (or W) will have changed. The most efficient way to compute a column of A^{-1} is to use the LU decomposition:

A = PLU    (26)

Solving the following equation gives column i of A^{-1}:

A X = E_i    (27)

where E_i = [ \delta_j ]^T, with \delta_j = 1 when i = j and \delta_j = 0 when i \neq j.

The determinant of A is easy to obtain, since the matrices L and U are triangular. In addition, the diagonal elements of L are 1, so we have:

|A| = |P| |L| |U| = |P| \prod_{i=1}^{N} u_{ii}    (28)

To prevent numerical problems, the product is computed as a sum of logarithms. We have |P| = +1 if P corresponds to an even number of row exchanges, and |P| = -1 for an odd number.
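A sketch of this computation using scipy's LU routines (the helper name is ours; the actual CRIM code is not shown here):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def inverse_column_and_logdet(A, i):
    """Column i of A^{-1} (equations 26-27), plus sign(|A|) and log|A|
    (equation 28), all from a single LU factorization with partial pivoting."""
    lu, piv = lu_factor(A)                          # A = P L U
    e = np.zeros(A.shape[0])
    e[i] = 1.0                                      # E_i
    col = lu_solve((lu, piv), e)                    # solves A X = E_i
    u = np.diag(lu)                                 # diagonal of U
    swaps = np.count_nonzero(piv != np.arange(len(piv)))
    sign = (-1.0) ** swaps * np.prod(np.sign(u))    # |P| times signs of u_ii
    logdet = np.sum(np.log(np.abs(u)))              # product as a sum of logarithms
    return col, sign, logdet

A = np.array([[4., 2., 1.], [2., 5., 3.], [1., 3., 6.]])
col, sign, logdet = inverse_column_and_logdet(A, 1)
```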
2.1.4 Initialisation
The number of iterations necessary to reach a final value for W depends on its initial value. We observed, as in [3], that only a few iterations are necessary if the diagonal transform is used as the initialization. The diagonal transform is obtained directly, without iterations.

Thus the computation of a full transform begins by computing the diagonal transform, then iterating from it until the auxiliary function converges.
2.2 Heteroscedastic Linear Discriminant Analysis
In the following, section 2.2.1 reproduces equations found in [4] and [5], while section 2.2.2 describes additional steps developed at CRIM during implementation.

Heteroscedastic Linear Discriminant Analysis [6] finds a transformation A of the feature space:

\hat{o}(\tau) = A o(\tau)    (29)

that maximizes the likelihood of the observations given Gaussian models, as the n-dimensional feature vectors are projected by a p × n projection matrix A_{[p]} into a p-dimensional retained subspace. A_{[n-p]} projects the original feature space into an (n-p)-dimensional nuisance subspace, which is explicitly modeled using a global Gaussian distribution.

If we assume diagonal covariance models, we can take advantage of an iterative row-by-row optimization, originally developed in [3], that is guaranteed to increase the likelihood at each iteration.
Let the transformed component means \hat{\mu}^{(m)} and covariances \hat{\Sigma}^{(m)} be:

\hat{\mu}^{(m)}_{[p]} = A_{[p]} \mu^{(m)}, \qquad \hat{\mu}^{(g)}_{[n-p]} = A_{[n-p]} \mu^{(g)}

\hat{\Sigma}^{(m)}_{[p]} = \mathrm{diag}(A_{[p]} \Sigma^{(m)} A_{[p]}^T), \qquad \hat{\Sigma}^{(g)}_{[n-p]} = \mathrm{diag}(A_{[n-p]} \Sigma^{(g)} A_{[n-p]}^T)

\hat{\mu}^{(m)} = \begin{bmatrix} \hat{\mu}^{(m)}_{[p]} \\ \hat{\mu}^{(g)}_{[n-p]} \end{bmatrix}, \qquad \hat{\Sigma}^{(m)} = \begin{bmatrix} \hat{\Sigma}^{(m)}_{[p]} & 0 \\ 0 & \hat{\Sigma}^{(g)}_{[n-p]} \end{bmatrix}    (30)
where the global covariance \Sigma^{(g)} is given by:

\Sigma^{(g)} = \frac{\sum_{m,\tau} \gamma_m(\tau) (o(\tau) - \mu^{(g)})(o(\tau) - \mu^{(g)})^T}{\sum_{m,\tau} \gamma_m(\tau)}    (31)

The component covariance \Sigma^{(m)} is computed via the component-specific second-order statistics:

\Sigma^{(m)} = \frac{\sum_{\tau} \gamma_m(\tau) (o(\tau) - \mu^{(m)})(o(\tau) - \mu^{(m)})^T}{\sum_{\tau} \gamma_m(\tau)}    (32)

and the means \mu^{(m)} and \mu^{(g)} are computed via the component-specific first-order statistics:

\mu^{(m)} = \frac{\sum_{\tau} \gamma_m(\tau) o(\tau)}{\sum_{\tau} \gamma_m(\tau)}, \qquad \mu^{(g)} = \frac{\sum_{m,\tau} \gamma_m(\tau) o(\tau)}{\sum_{m,\tau} \gamma_m(\tau)}    (33)
The HLDA transform can be applied directly to acoustic features. The Gaussian likelihood calculation uses the transformed means and variances in the retained subspace. The Jacobian term, and the likelihood computation in the nuisance subspace, can be ignored in the following equation, since both are constant across all components:

p(o(\tau); \mu^{(m)}, \Sigma^{(m)}, A) = |A| \, \mathcal{N}(A o(\tau); \hat{\mu}^{(m)}, \hat{\Sigma}^{(m)})    (34)
2.2.1 Transform optimization
Let a_i be the i-th row of A, T the total number of observations in the training set, and:

G^{(i)} = \begin{cases} \sum_{m,\tau} \dfrac{\gamma_m(\tau)}{a_i \Sigma^{(m)} a_i^T} \, \Sigma^{(m)} & i \le p \\ \sum_{m,\tau} \dfrac{\gamma_m(\tau)}{a_i \Sigma^{(g)} a_i^T} \, \Sigma^{(g)} & i > p \end{cases}    (35)

The rows of A are iteratively updated using the following formula:

a_i = c_i G^{(i)-1} \sqrt{\frac{T}{c_i G^{(i)-1} c_i^T}}    (36)
where ci is the cofactor row vector for the current estimate of A.
The algorithm iteratively repeats the following steps:

• Given the current ML estimates of the model parameters, µ^(m), µ^(g), Σ^(m) and Σ^(g) are computed and the G^(i) statistics are accumulated.

• The transform matrix A is iteratively updated according to equation 36.

• The model parameters are updated using equations 30.

There remains the problem of finding an initialisation for the procedure. One simple method for selecting the nuisance dimensions is to choose the last few dimensions of the original feature space. According to [4], no significant performance difference has been observed between this initialization scheme and one based on Fisher ratio values.
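The update steps above can be sketched as a single row-by-row sweep (a simplified illustration with our own names; occupancy counts stand in for the sums over m and τ in equation 35):

```python
import numpy as np

def hlda_sweep(A, sigma_m, occ_m, sigma_g, occ_g, p):
    """One row-by-row sweep of equations 35-36 over all rows of A.
    sigma_m: (M, n, n) component covariances; occ_m: (M,) occupancies;
    sigma_g: (n, n) global covariance; occ_g: total occupancy (= T);
    p: number of retained dimensions (0-based rows i < p are retained)."""
    n = A.shape[0]
    T = occ_g
    for i in range(n):
        a = A[i]
        if i < p:    # retained subspace: per-component covariances (eq. 35, i <= p)
            G = sum(o / (a @ S @ a) * S for o, S in zip(occ_m, sigma_m))
        else:        # nuisance subspace: global covariance (eq. 35, i > p)
            G = occ_g / (a @ sigma_g @ a) * sigma_g
        c = np.linalg.det(A) * np.linalg.inv(A)[:, i]   # cofactor row c_i
        x = np.linalg.solve(G, c)                        # G^(i)-1 c_i^T
        A[i] = x * np.sqrt(T / (c @ x))                  # equation 36
    return A

# Synthetic SPD covariances just to exercise the sweep.
rng = np.random.default_rng(4)
n, M = 3, 2
B = rng.normal(size=(M, n, n))
sigma_m = np.einsum('mij,mkj->mik', B, B) + np.eye(n)
occ_m = np.array([40.0, 60.0])
sigma_g = sigma_m.mean(axis=0)
A = hlda_sweep(np.eye(n), sigma_m, occ_m, sigma_g, occ_g=100.0, p=2)
```

In a real system the sweep would be repeated, re-accumulating G^(i) and re-estimating the model parameters between sweeps as described above.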
2.2.2 Implementation details
Setting up the following system of equations:

G^{(i)} X = c_i^T    (37)
and using the symmetry of G^{(i)}, we can rewrite equation 36:

a_i = X^T \sqrt{\frac{T}{c_i X}}    (38)
We still need to compute the row vector c_i = \mathrm{row}_i(\mathrm{Cof}(A_{ij})). This corresponds to the computation of one column of the inverse matrix A^{-1}, since by definition of the adjoint matrix A^a we have:

A^{-1} = \frac{1}{|A|} A^a    (39)

where

A^a = \mathrm{Cof}(A_{ij})^T    (40)

Thus:

c_i = \mathrm{row}_i(\mathrm{Cof}(A_{ij})) = \mathrm{row}_i((|A| A^{-1})^T) = |A| \, (\mathrm{col}_i(A^{-1}))^T    (41)

An efficient way of computing a single column of A^{-1} and the determinant of A was given earlier in equations 26 to 28.

Note that \mathrm{Cof}(A_{ij}) and |A| A^{-1} are transposes of each other, so it is a column of A^{-1} (not a row) that is required here. A useful sanity check, already present in the FMLLR code, is the cofactor expansion of the determinant along row i:

|A| = c_i a_i^T    (42)
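The identity of equation 42 is easy to verify numerically (a standalone check, not the report's code):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.normal(size=(4, 4))
i = 2
# c_i = |A| col_i(A^{-1})^T, i.e. row i of the cofactor matrix (equation 41)
c = np.linalg.det(A) * np.linalg.inv(A)[:, i]
# Laplace expansion along row i recovers the determinant (equation 42)
check = c @ A[i]
```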
First- and second-order statistics

The following first-order statistics are computed in a first pass over the observations:

\gamma = \sum_{m,\tau} \gamma_m(\tau)    (43)

o = \sum_{m,\tau} \gamma_m(\tau) \, o(\tau)    (44)
3 Results
Here are the results obtained on the multispeaker database, in speaker-independent recognition. Note that the transforms were used as a global adaptation, not for speaker adaptation as is usually done with FMLLR, for example.

We note that without projection, HLDA (which is then equivalent to MLLT) provides no improvement in accuracy, but produces a speed improvement of 40% for a beam of 160 (experiment 3 compared to experiment 2). FMLLR, in a well-tested implementation, also yields no accuracy improvement in that setup (experiment 6). Finally, projecting from a vector of 52 dimensions (which included static, delta, acceleration and third-order derivatives) to a vector of 39 dimensions produced a 3% absolute degradation in accuracy.
Exp   Description                        %Acc    CPU xRT
1     baseline, beam 140, 39 dim         82.8%
2     baseline, beam 160, 39 dim         83.3%   3.6x
3     HLDA, beam 140, 39 dim to 39 dim   82.2%   2.2x
4     HLDA, beam 160, 39 dim to 39 dim   83.0%   2.7x
5     HLDA, beam 140, 52 dim to 39 dim   79.3%   3.1x
6     FMLLR, beam 140, 39 dim            82.7%   3.5x
Table 2. Linear transform results on the multi-speaker database.
References
[1] M. Pitz, "Investigations on Linear Transformations for Speaker Adaptation and Normalization," Ph.D. thesis, RWTH Aachen, 2005. Published by R. Oldenbourg Verlag, Munich.

[2] K. C. Sim and M. J. F. Gales, "Precision matrix modelling for large vocabulary continuous speech recognition," Tech. Rep. CUED/F-INFENG/TR485, Cambridge University Engineering Department, Cambridge, England, June 2004.

[3] M. J. F. Gales, "Maximum likelihood linear transformations for HMM-based speech recognition," Computer Speech and Language, vol. 12, 1998.

[4] X. Liu and M. J. F. Gales, "Discriminative training of multiple subspace projections for large vocabulary speech recognition," Tech. Rep. CUED/F-INFENG/TR-489, Cambridge University Engineering Department, Cambridge, England, Aug. 2004.

[5] M. Karafiát, L. Burget, and J. Černocký, "Using smoothed heteroscedastic linear discriminant analysis in large vocabulary continuous speech recognition system," in Proc. MLMI'05, Edinburgh, UK, 2005, p. 8.