Goal: Use Less to Perform More
• Identify an informative subset of a large corpus for Acoustic Model (AM) training.
• Expectations for the selected set: good in performance, fast in selection.
Motivation
The improvement of the system becomes increasingly smaller as we keep adding data. Training an acoustic model is time-consuming, so we need guidance on which data are most needed.
Approach Overview
Applied to well-transcribed data Selection based on transcription Choose subset that have “uniform”
distribution on speech unit (word, phoneme, character)
How to sample data wisely? -- A simple example
$k$ Gaussian distributions with known priors $\omega_i$ and unknown density functions $f_i(\mu_i, \sigma_i)$.
How to sample wisely? -- A simplified example
We are given access to at most N examples and may choose how many to draw from each class. We train the model with the MLE estimator. When a new sample is generated, we use our model to determine its class.
Question: how should we sample to achieve minimum error?
The Optimal Bayes Classifier
$$\arg\max_i \,\omega_i f_i(x) = \arg\max_i \left(\log \omega_i + \log f_i(x)\right)$$
If we have the exact form of $f_i(x)$, this classification is optimal.
To Approximate the Optimal
We replace the true $f_i$ with our MLE estimate $\hat{f}_i$:
$$\arg\max_i \left(\log \omega_i + \log \hat{f}_i(x)\right)$$
The true error is then bounded by the optimal Bayes error plus an error term for our worst-estimated $\hat{f}_i$.
Sample Uniformly
We want to sample each class equally. The selected data will then have good coverage of each class, giving a robust estimate for every class, as the sketch below illustrates.
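A minimal sketch of this toy setting (my illustration, not the authors' code): with a fixed budget of N labeled draws, it fits each Gaussian by MLE and measures the plug-in classifier's error when the budget is split in proportion to the priors versus uniformly across classes. The priors, means, variances, and N are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative assumptions: three 1-D Gaussian classes.
priors = np.array([0.7, 0.2, 0.1])      # known priors omega_i
true_mu = np.array([-2.0, 0.0, 3.0])    # unknown to the learner
true_sigma = np.array([1.0, 0.5, 1.5])  # unknown to the learner
N = 300                                 # total training budget

def fit_mle(counts):
    """Draw counts[i] samples from class i; return MLE (mu_hat, sigma_hat)."""
    mu_hat, sigma_hat = np.empty(3), np.empty(3)
    for i, n in enumerate(counts):
        x = rng.normal(true_mu[i], true_sigma[i], size=n)
        mu_hat[i], sigma_hat[i] = x.mean(), x.std()  # MLE estimates
    return mu_hat, sigma_hat

def classify(x, mu, sigma):
    """Plug-in Bayes rule: argmax_i [log(omega_i) + log f_hat_i(x)]."""
    log_score = (np.log(priors) - np.log(sigma)
                 - 0.5 * ((x[:, None] - mu) / sigma) ** 2)
    return np.argmax(log_score, axis=1)

# Test data is drawn from the true mixture.
labels = rng.choice(3, size=50_000, p=priors)
x_test = rng.normal(true_mu[labels], true_sigma[labels])

for name, counts in [("prior-proportional", (priors * N).astype(int)),
                     ("uniform", np.full(3, N // 3))]:
    mu_hat, sigma_hat = fit_mle(counts)
    err = np.mean(classify(x_test, mu_hat, sigma_hat) != labels)
    print(f"{name:>18}: test error = {err:.3f}")
```

With the uniform split the rare class is estimated from 100 samples instead of 30, so its density estimate (the worst $\hat{f}_i$ in the bound above) is typically much tighter, which is the intuition behind sampling each class equally.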
Data Selection for an ASR System
The prior $p(W)$ has already been estimated independently by the language model, so the acoustic model must estimate $p(X|W)$ well for every $W$; to make it accurate, we want to sample the $W$ uniformly. The unit can be the phoneme, character, or word, and we want its distribution to be uniform.
Entropy: a Measure of “Uniformness”
Use the entropy of the word (phoneme) distribution for evaluation. Suppose the words (phonemes) have sample distribution $p_1, p_2, \dots, p_n$. Choose the subset with maximum entropy $-\sum_{i=1}^{n} p_i \log p_i$.
Maximizing the entropy is equivalent to minimizing the KL divergence from the uniform distribution $u$, since $H(p) = \log n - D_{KL}(p\,\|\,u)$. A sketch of the criterion follows.
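A minimal sketch of the criterion (my illustration): the entropy of the empirical unit distribution of a candidate set, with the unit chosen by a tokenizer. Transcriptions are assumed to be plain whitespace-separated strings.

```python
import math
from collections import Counter

def unit_entropy(transcriptions, tokenize=str.split):
    """-sum_i p_i*log(p_i) over the empirical unit distribution of the set."""
    counts = Counter(u for t in transcriptions for u in tokenize(t))
    total = sum(counts.values())
    return -sum(c / total * math.log(c / total) for c in counts.values())

subset = ["the cat sat", "the dog ran", "a cat ran"]
print(unit_entropy(subset))                 # word units
print(unit_entropy(subset, tokenize=list))  # character units (spaces included)
```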
Computational Issue
It is computationally intractable to find the transcription subset that maximizes the entropy, so we use a forward greedy search: starting from the empty set, repeatedly add the utterance that increases the entropy most (sketched below).
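A minimal sketch of the greedy loop, reusing the unit_entropy() above; the budget is counted in utterances rather than hours, which is an illustrative simplification.

```python
def greedy_select(corpus, budget, tokenize=str.split):
    """Forward greedy search: at each step add the utterance whose
    inclusion raises the unit entropy of the selected set the most."""
    selected, remaining = [], list(corpus)
    for _ in range(min(budget, len(corpus))):
        best = max(remaining,
                   key=lambda t: unit_entropy(selected + [t], tokenize))
        selected.append(best)
        remaining.remove(best)
    return selected
```

Rescoring every remaining utterance from scratch is quadratic overall; a practical implementation would keep running unit counts and update the entropy incrementally for each candidate.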
Combination
There are multiple entropies (word, phoneme, character) we want to maximize. Combination methods: a weighted sum of the entropies, or adding the criteria sequentially; see the sketch below.
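A minimal sketch of the weighted-sum variant (the weights and tokenizers are illustrative assumptions; a real phoneme tokenizer needs a pronunciation lexicon). The sequential variant instead runs the greedy selection under one entropy and then continues it under the next, which is what Experiment 2 later does.

```python
def combined_entropy(transcriptions, weights):
    """Weighted sum of entropies over several unit types."""
    tokenizers = {"word": str.split, "char": list}  # phoneme: via a lexicon
    return sum(w * unit_entropy(transcriptions, tokenizers[unit])
               for unit, w in weights.items())

# Plug into greedy_select by swapping the scoring key, e.g.:
#   key=lambda t: combined_entropy(selected + [t], {"word": 1.0, "char": 0.5})
```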
Experiment Setup
System: Sphinx III
Feature: 39-dimensional MFCC
Training corpus: Chinese BN 97 (30 hr) + Gale Y1 (810 hr)
Test set: RT04 (60 min)
Experiment 1 (using the word distribution)

Table 1
Time (hours)    30     50     100    840
Random (all)    27.6   27.1   26.1   24.3
Max-entropy     27.0   26.2   24.8
More Results

                     30 h        50 h        100 h       150 h       840 h
random (all)         27.6        27.1        26.1        25.0        24.3
  cctv (bn)          17          15.7        13.2        13.6        12.9
  ntdtv (bn)         24.7        24.2        23.3        22.2        21.0
  rfa (bc)           42.9        43.6        44          41.1        41.0
  bc/bn (hours)      15.4/14.6   25.7/24.3   51.2/49.8   76.8/73.2   431/409
max-entropy (all)    27          26.2        24.8
  cctv (bn)          15          14          13
  ntdtv (bn)         23          22.3        21.1
  rfa (bc)           45.8        44.8        42.7
  bc/bn (hours)      11.0/19.0   18.2/31.8   50.6/49.8
Experiment 2 (sequential combination with phoneme and character, 150 hr)

Table 2
                            CCTV   NTDTV   RFA    ALL
Random (150 h)              13.6   22.2    44.1   25.0
Max-entropy (word+char)     12.2   21.8    42.3   24.7
Max-entropy (word+phone)    13.1   20.5    41.8   24.4
All data (840 hr)           12.9   21.0    41.0   24.3
Experiments 1, 2
[Figure: word counts (log-scale axes, 10^0 to 10^6) for the max-entropy, random, and all-data selections.]
Experiment 3 (with VTLN)

Table 3
                        CCTV   NTDTV   RFA    ALL
150 hr (word+phone)     13.1   20.5    41.8   24.4
With VTLN               11.8   17.8    40.1   22.5