Performance Evaluation of Deep Bottleneck Features for Spoken Language Identification
Bing Jiang¹, Yan Song¹, Si Wei², Meng-Ge Wang¹, Ian McLoughlin¹, Li-Rong Dai¹
¹ National Engineering Laboratory of Speech and Language Information Processing, University of Science and Technology of China
² iFlytek Research, Anhui USTC iFlytek Co. Ltd.
Outline
• Background
• Our Method
• Experiments
• Conclusions
Friday, September 19, 2014 2
Background
• Language identification (LID) is a typical machine learning problem
[Figure: an LID system split into a feature domain (e.g. SDC) and a model domain (e.g. GMM)]
Background
• The original acoustic features carry many nuisance factors that are unrelated to language:
  – Speaker variations
  – Channel variations
  – Content variations
  – Noise variations
• Feature improvement
  – MFCC → SDC
    • Temporal extension
  – Compensation in the feature domain
    • Factor analysis
• These variations are what make LID so difficult
Background
• Model
  – Generative model → discriminative model
    • GMM-UBM
    • SVM
    • MMI
  – The i-vector approach is the state of the art
    • Factor analysis
    • With compensation methods:
      – LDA
      – WCCN
      – PLDA
• More suitable features are still needed
Background
• Recently, DNNs have attracted a lot of attention
  – Non-linear modeling capability
    • Deep layered structure
    • Non-linear activation functions
  – Feature learning capability
    • Extracts information about the target layer by layer
• Can a neural network extract discriminative features for the LID task?
  – PLLR
  – MLP
  – Deep bottleneck features (DBF)
Our Method
• What are Deep bottleneck features?
[Figure: DNN with a bottleneck layer. Labels: Input Feature, Current Frame, Bottleneck layer, Output Feature Y = [y1, y2, ..., yf], Target Class]
Our Method
• What are Deep bottleneck features?
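$$
y_m = f(x; W^1, W^2, \ldots, W^l)
    = \sigma\Big( \sum_{j=1}^{M_{l-1}} W^{l}_{mj}\, \sigma\Big( \sum_{i=1}^{M_{l-2}} W^{l-1}_{ji} \big( \cdots \sigma\big( \sum_{d=1}^{M_0} W^{1}_{id}\, x_d + b^{1}_{i} \big) \cdots \big) + b^{l-1}_{j} \Big) + b^{l}_{m} \Big)
$$
Here x is the input feature, W^k and b^k are the weights and biases of layer k, M_k is the number of units in layer k, σ is the non-linear activation function, and y_m is the m-th output of the bottleneck layer l.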
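The layer-by-layer computation of the bottleneck output can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the sigmoid activation, the random toy weights, the toy layer sizes, and the helper names `affine`, `forward_to_layer` and `make_layers` are all assumptions made for the example.

```python
import math
import random

def sigmoid(a):
    """Logistic activation (assumed; the slides do not name the non-linearity)."""
    return 1.0 / (1.0 + math.exp(-a))

def affine(W, b, x):
    """Compute W x + b, where each row of W holds the weights of one output unit."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]

def forward_to_layer(x, layers, stop):
    """Propagate x through the first `stop` layers and return their activations.

    `layers` is a list of (W, b) pairs. Stopping at the bottleneck layer
    yields the bottleneck feature for the current frame.
    """
    h = x
    for W, b in layers[:stop]:
        h = [sigmoid(a) for a in affine(W, b, h)]
    return h

def make_layers(sizes, seed=0):
    """Random toy weights (hypothetical; a real extractor uses trained weights)."""
    rng = random.Random(seed)
    return [
        ([[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)],
         [0.0] * n_out)
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]

# Toy topology: 12 inputs, two hidden layers, a 3-unit bottleneck,
# two more hidden layers and 5 output classes (sizes are illustrative only).
sizes = [12, 8, 8, 3, 8, 8, 5]
layers = make_layers(sizes)
x = [0.5] * sizes[0]
y = forward_to_layer(x, layers, stop=3)  # the bottleneck is the third layer
print(len(y))
```

Only the layers up to the bottleneck are needed at extraction time; the layers after it matter only while the network is trained against its target classes.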
Our Method
• Why do we use deep bottleneck features?
  – The target class
    • Phonemes or phoneme states are suitable targets for the language identification task
  – Statistical method
    • A low-dimensional, compact representation of the original inputs
  – Non-linear transformation
  – Discriminative features
Our Method
• Why do we use Deep bottleneck features?
[Figure: SDC vs. DBF]
Our Method
• How to train the DBF extractor?
Experiments
• DNN training database
  – 500 hours of Mandarin telephone speech
• Evaluation database
  – NIST LRE 2009
Experiments
• Experiment 1: Comparison with SDC
  – DBF: 43x11-2048-2048-43-2048-2048-6004
  – 2048-mixture GMM-UBM
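The topology string can be unpacked into layer sizes with a short helper. The sketch below assumes that "43x11" means 43-dimensional features spliced over 11 frames and that the narrowest hidden layer is the bottleneck; the function names are hypothetical.

```python
def parse_topology(spec):
    """Split e.g. '43x11-2048-2048-43-2048-2048-6004' into a list of layer sizes.

    The first field is read as feature_dim x context_frames (an assumption).
    """
    first, *rest = spec.split("-")
    feat_dim, context = (int(v) for v in first.split("x"))
    return [feat_dim * context] + [int(v) for v in rest]

def bottleneck_index(sizes):
    """Index of the narrowest hidden layer (input and output layers excluded)."""
    hidden = sizes[1:-1]
    return 1 + hidden.index(min(hidden))

def parameter_count(sizes):
    """Weights plus biases of a fully connected network with these layer sizes."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))

sizes = parse_topology("43x11-2048-2048-43-2048-2048-6004")
print(sizes)                    # [473, 2048, 2048, 43, 2048, 2048, 6004]
print(bottleneck_index(sizes))  # 3
print(parameter_count(sizes))
```

Under this reading the network takes a 473-dimensional input, the DBF is the 43-dimensional activation of the third hidden layer, and the output layer has 6004 targets.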
Experiments
• Experiment 2: Context window size of the DNN input
  – Motivation
    • LID performance is sensitive to the context window size
    • The SDC parameters (7-1-3-7)
      – Cover 21 frames
    • For LID, the input window should be longer than for speech recognition
      – Speech recognition: 5-1-5
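The frame counts on this slide follow from simple arithmetic. The sketch below uses the standard SDC parameterisation N-d-P-k (N coefficients, delta spread d, block shift P, k blocks) and reads 5-1-5 as left context, current frame, right context; the function names are hypothetical.

```python
def sdc_span(d, p, k):
    """Number of frames covered by SDC features with delta spread d, shift p, k blocks.

    Block i (i = 0..k-1) is the delta c(t + i*p + d) - c(t + i*p - d), so the
    stacked blocks reach from frame t - d to frame t + (k - 1)*p + d.
    """
    return (k - 1) * p + 2 * d + 1

def window_span(left, centre, right):
    """Number of frames in a left-centre-right context window."""
    return left + centre + right

print(sdc_span(d=1, p=3, k=7))  # 21 frames for SDC 7-1-3-7
print(window_span(5, 1, 5))     # 11 frames for the 5-1-5 window
```

So the 7-1-3-7 SDC configuration already spans roughly twice as many frames as the 5-1-5 window typical of speech recognition.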
Experiments
• Experiment 2: Context window size of the DNN input
  – DBF: 43xn-2048-2048-43-2048-2048-6004
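Varying n changes how many neighbouring frames are stacked into each DNN input vector. A minimal sketch of that splicing step follows; padding the edges by repeating the first and last frame is an assumed convention, and `splice` is a hypothetical helper name.

```python
def splice(frames, n):
    """Stack n consecutive frames (n odd) centred on each frame.

    `frames` is a list of equal-length feature vectors. Edges are padded by
    repeating the first and last frame (an assumed, common convention).
    """
    if n % 2 == 0:
        raise ValueError("context window size n must be odd")
    half = n // 2
    padded = [frames[0]] * half + list(frames) + [frames[-1]] * half
    return [
        [v for frame in padded[t:t + n] for v in frame]
        for t in range(len(frames))
    ]

# 100 synthetic 43-dimensional frames; frame t is filled with the value t.
frames = [[float(t)] * 43 for t in range(100)]
spliced = splice(frames, 21)
print(len(spliced), len(spliced[0]))  # 100 903
```

With n = 21 each input vector has 43 x 21 = 903 dimensions, so the first weight matrix grows linearly with the context size.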
Experiments
• Experiment 3: Dimension of DBF
  – Motivation
    • DNN training forces the activations in the bottleneck layer to form a low-dimensional, compact representation of the original inputs
    • Goal: find the relationship between the feature dimension and the performance
Experiments
• Experiment 3: Dimension of DBF
  – DBF: 43x21-2048-2048-d-2048-2048-6004
Experiments
• Experiment 4: DBF generated in different layers
  – Motivation
    • Is a feature that is more discriminative for the DNN targets also more suitable for LID?
    • Does performance improve as the bottleneck layer moves closer to the output layer?
Experiments
• Experiment 4: DBF generated in different layers
  – Layer3: 43x21-2048-2048-43-2048-2048-6004
  – Layer4: 43x21-2048-2048-2048-43-2048-6004
  – Layer5: 43x21-2048-2048-2048-2048-43-6004
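The three configurations differ only in which hidden layer is narrowed to 43 units. A small, hypothetical helper makes the numbering explicit (Layer3 means the third hidden layer); the default values simply mirror the topology strings on this slide.

```python
def topology(bottleneck_layer, context=21, feat_dim=43, hidden=2048,
             bottleneck=43, n_hidden=5, targets=6004):
    """Topology string with the bottleneck as hidden layer `bottleneck_layer` (1-based)."""
    if not 1 <= bottleneck_layer <= n_hidden:
        raise ValueError("bottleneck_layer must be between 1 and n_hidden")
    layers = [hidden] * n_hidden
    layers[bottleneck_layer - 1] = bottleneck
    return "-".join([f"{feat_dim}x{context}"] + [str(v) for v in layers + [targets]])

for k in (3, 4, 5):
    print(f"Layer{k}: {topology(k)}")
```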
Experiments
• Experiment 5: DBF with PCA
  – Motivation
    • Because the GMM uses diagonal covariance matrices, the dimensions of the input feature need to be de-correlated
    • For SDC, the discrete cosine transform (DCT) already does this
    • For DBF, we try classical PCA
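The de-correlation idea can be illustrated with a small NumPy sketch of PCA. The data here is synthetic (random correlated 43-dimensional vectors standing in for real DBF frames), and `fit_pca` / `apply_pca` are hypothetical helper names, not the authors' code.

```python
import numpy as np

def fit_pca(features):
    """Return the mean and the eigenvectors (as columns) of the feature covariance."""
    mean = features.mean(axis=0)
    cov = np.cov(features - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: covariance is symmetric
    order = np.argsort(eigvals)[::-1]       # largest variance first
    return mean, eigvecs[:, order]

def apply_pca(features, mean, components):
    """Centre the features and rotate them onto the principal axes."""
    return (features - mean) @ components

# Synthetic stand-in for DBF frames: 2000 correlated 43-dimensional vectors.
rng = np.random.default_rng(0)
mixing = rng.normal(size=(43, 43))
dbf = rng.normal(size=(2000, 43)) @ mixing

mean, components = fit_pca(dbf)
projected = apply_pca(dbf, mean, components)

# After the rotation the covariance is (numerically) diagonal.
cov = np.cov(projected, rowvar=False)
off_diag = cov - np.diag(np.diag(cov))
print(np.abs(off_diag).max() < 1e-8 * np.diag(cov).max())
```

Keeping all 43 components only rotates the features; it makes a diagonal-covariance GMM a better fit without discarding information.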
Experiments
• Experiment 5: DBF with PCA
  – DBF: 43x21-2048-2048-43-2048-2048-6004
Conclusions
• In this paper, we investigated the use of bottleneck features for the LID task.
  – DBF can significantly improve LID performance, especially for short-duration utterances.
  – DBF is a new milestone for LID research.
• We believe that using DNNs to extract more suitable features for LID will bring great progress to the LID community.
Related Papers
• For more information about DBF for LID, see the following papers:
  – Song Yan, Jiang Bing, Bao Ye-Bo, Wei Si, Dai Li-Rong, “I-vector representation based on bottleneck features for language identification,” Electronics Letters, vol. 49, no. 24, pp. 1569-1570, 2013.
  – Jiang Bing, Song Yan, Si Wei, Liu Jun-hua, Ian McLoughlin and Dai Li-Rong, “Deep bottleneck features for spoken language identification,” PLoS ONE 9(7): e100795, doi:10.1371/journal.pone.0100795, 2014.
  – Jiang Bing, Song Yan, Si Wei, Ian McLoughlin and Dai Li-Rong, “Task-aware deep bottleneck features for spoken language identification,” in Proc. of INTERSPEECH 2014, Singapore, 2014.