ICCS-NTUA : WP1+WP2
Prof. Petros Maragos
NTUA, School of ECE
URL: http://cvsp.cs.ntua.gr
Computer Vision, Speech Communication and Signal Processing Research Group
HIWIRE
HIWIRE Meeting, July 2006, ICCS-NTUA
ICCS-NTUA in HIWIRE
Evaluation
• Databases & Baseline: Completed
• Platform Front-end Release: 1st Version
WP1
• Noise Robust Features: Completed
• Multi-mic. array Enhancement: Prelim. Results
• Fusion: Prelim. Results
• Audio-Visual ASR: Baseline + Adv. Visual Features
• VAD: Completed + Integration
WP2
• VTLN Platform Integration: Completed
• Speaker Normalization Research: Prelim. Results
• Non-native Speech Database: Completed
HIWIRE Advanced Front-end: Challenges
Points Considered during Implementation
• Modular Architecture
• Implementation in C-Code
• Incorporation of Different Ideas/Algorithms
• User-friendly interface providing additional options for dealing with on-site demands of the project
HIWIRE Advanced Front-end: Options
[Flowchart: Speech Signals → Speech Pre-Processing (Denoising) → Speech Processing (Features). Pre-processing asks "Want VAD?" (Yes: LTSD-VAD / MTE-VAD) and "Want Denoising?" (Yes: Wiener Denoising); feature extraction outputs MFCC / TECC.]
Support for Input Speech Signals
Different Sampling Frequencies
• 8 kHz
• 11 kHz
• 16 kHz
Different Byte-Ordering
• Little-endian
• Big-endian
Different Input File Formats
• RAW
• NIST
• HTK
Provides Flags/Options:
Preprocessing Smoothing of Speech Signals
• Hamming Windowing
• Pre-emphasis
Denoising/VAD Algorithms
• LTSD-VAD Algorithm (UGR)
• MTE-VAD Algorithm (ICCS-NTUA)
• Wiener Denoising Algorithm (used only with a VAD algorithm)
Output Features
• MFCC
• TECC
• C0 or logE
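The option set above can be pictured as a small configuration object. The following Python sketch is only an illustration: the class name, field names, and string values are hypothetical and are not the identifiers of the actual C front-end. It encodes the listed choices and the one stated dependency, that Wiener denoising is used only together with a VAD algorithm.

```python
from dataclasses import dataclass
from typing import Optional

# Choices as listed on the slide; all identifiers here are hypothetical.
SAMPLE_RATES_KHZ = (8, 11, 16)
BYTE_ORDERS = ("little", "big")
FILE_FORMATS = ("RAW", "NIST", "HTK")
VAD_ALGORITHMS = ("LTSD", "MTE")
FEATURE_TYPES = ("MFCC", "TECC")
ENERGY_TERMS = ("C0", "logE")


@dataclass(frozen=True)
class FrontEndOptions:
    sample_rate_khz: int = 8
    byte_order: str = "little"
    file_format: str = "RAW"
    vad: Optional[str] = None          # None means "no VAD"
    wiener_denoising: bool = False
    features: str = "MFCC"
    energy_term: str = "C0"

    def validate(self) -> None:
        """Raise ValueError if the combination of options is not allowed."""
        checks = [
            (self.sample_rate_khz, SAMPLE_RATES_KHZ, "sampling frequency"),
            (self.byte_order, BYTE_ORDERS, "byte order"),
            (self.file_format, FILE_FORMATS, "file format"),
            (self.features, FEATURE_TYPES, "feature type"),
            (self.energy_term, ENERGY_TERMS, "energy term"),
        ]
        for value, allowed, label in checks:
            if value not in allowed:
                raise ValueError(f"unsupported {label}: {value!r}")
        if self.vad is not None and self.vad not in VAD_ALGORITHMS:
            raise ValueError(f"unsupported VAD algorithm: {self.vad!r}")
        # Stated on the slide: Wiener denoising is used only with a VAD.
        if self.wiener_denoising and self.vad is None:
            raise ValueError("Wiener denoising requires a VAD algorithm")
```

For example, `FrontEndOptions(vad="MTE", wiener_denoising=True, features="TECC").validate()` passes, while requesting Wiener denoising without a VAD raises an error.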
HIWIRE Advanced Front-end: Things to Be Done
• Script is in Testing Phase
• Create a CVS repository where Additional Modules should be included
• To be Tested Further on Speech Databases (Evaluation in progress)
• Fine-Tuning is Necessary
• Final Version should be Faster (Real-Time Processing)
• Incorporate it into the HIWIRE Platform
ICCS-NTUA in HIWIRE: 1st, 2nd Year
Evaluation
• Databases & Baseline: Completed
• Platform Front-end Release: 1st Version
WP1
• Noise Robust Features: Completed
• Multi-mic. array Enhancement: Prelim. Results
• Fusion: Prelim. Results
• Audio-Visual ASR: Baseline + Adv. Visual Features
• VAD: Completed + Integration?
WP2
• VTLN Platform Integration: Completed
• Speaker Normalization Research: Prelim. Results
• Non-native Speech Database: Completed
Microphone Arrays
• Multi-channel Speech Enhancement for Diffuse Noise Fields
– MVDR (Minimum Variance Distortionless Response) Beamforming
– Single-Channel Linear and Non-linear Post-Filtering
• The MSE criterion leads to the linear Wiener post-filter.
• The MSE STSA and MSE log-STSA criteria lead to non-linear post-filters.
Microphone Arrays
The overall speech enhancement system includes the following steps:
• The noisy channel inputs are fed into a time-alignment module (each input channel has a different propagation path).
• The time-aligned noisy observations are projected to a single-channel output with minimum noise variance through the MVDR beamformer.
• The beamformer output is further processed by a post-filter chosen according to the speech enhancement criterion (MSE, MSE STSA, MSE log-STSA).
• Because the post-filters depend on second-order statistics of the source and noise signals, an estimation scheme for these statistics has to be developed.
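The beamforming and linear post-filtering steps can be sketched per frequency bin as follows. This is a minimal illustration under stated assumptions, not the group's implementation: the noise field is modeled with the standard spherically diffuse (sinc) coherence, the channels are assumed already time-aligned so the steering vector is all ones, and the function names, the diagonal loading value, and the speed of sound are choices made here for the example.

```python
import numpy as np


def diffuse_coherence(mic_positions, freq_hz, c=343.0):
    """Coherence matrix of a spherically diffuse noise field (assumed model).

    Gamma_ij = sin(2*pi*f*d_ij/c) / (2*pi*f*d_ij/c), d_ij = mic distance in m.
    """
    diff = mic_positions[:, None, :] - mic_positions[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # np.sinc(x) is sin(pi*x)/(pi*x), hence the factor 2*f*d/c.
    return np.sinc(2.0 * freq_hz * dist / c)


def mvdr_weights(noise_coherence, steering, diag_load=1e-3):
    """MVDR weights w = G^-1 d / (d^H G^-1 d); the output is y = w^H x."""
    n = noise_coherence.shape[0]
    gamma = noise_coherence + diag_load * np.eye(n)  # loading for stability
    g_inv_d = np.linalg.solve(gamma, steering)
    return g_inv_d / (steering.conj() @ g_inv_d)


def wiener_gain(speech_psd, noise_psd):
    """Linear (MSE-optimal) post-filter gain from estimated power spectra."""
    return speech_psd / np.maximum(speech_psd + noise_psd, 1e-12)


# Usage: 4-microphone linear array, 5 cm spacing, one bin at 1 kHz.
mics = np.array([[0.00, 0, 0], [0.05, 0, 0], [0.10, 0, 0], [0.15, 0, 0]])
gamma = diffuse_coherence(mics, 1000.0)
d = np.ones(4, dtype=complex)   # time-aligned channels
w = mvdr_weights(gamma, d)      # satisfies w^H d = 1 (distortionless)
```

The speech and noise spectra passed to `wiener_gain` are exactly the second-order statistics that the last step says must be estimated; the sketch leaves that estimation open.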
Results on CMU Database
• 10 Speakers (13 utterances)
• Diffuse Noise
• SSNR Enhancement: SSNR_output − E[SSNR_input] (E[·] stands for the mean value over the N input channels)
• LAR, LSD, IS, LLR: low values signify high speech quality. These measures are found to correlate highly with human perception.
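The SSNR enhancement figure of merit can be computed along the following lines. This sketch assumes the usual frame-based definition of segmental SNR with per-frame clamping to [-10, 35] dB; the frame length and clamping limits are common conventions chosen here, not values taken from the slides.

```python
import numpy as np


def segmental_snr_db(clean, processed, frame_len=256, lo=-10.0, hi=35.0):
    """Mean over frames of the per-frame SNR in dB, clamped to [lo, hi]."""
    clean = np.asarray(clean, dtype=float)
    processed = np.asarray(processed, dtype=float)
    n_frames = len(clean) // frame_len
    values = []
    for k in range(n_frames):
        seg = slice(k * frame_len, (k + 1) * frame_len)
        signal_energy = np.sum(clean[seg] ** 2)
        error_energy = np.sum((clean[seg] - processed[seg]) ** 2)
        snr = 10.0 * np.log10((signal_energy + 1e-12) / (error_energy + 1e-12))
        values.append(np.clip(snr, lo, hi))
    return float(np.mean(values))


def ssnr_enhancement_db(clean, channel_inputs, output, **kwargs):
    """SSNR of the enhanced output minus the mean SSNR of the N input channels."""
    ssnr_in = np.mean([segmental_snr_db(clean, x, **kwargs) for x in channel_inputs])
    return segmental_snr_db(clean, output, **kwargs) - ssnr_in
```

A positive value means the enhanced output is, on average over frames, cleaner than the typical input channel.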
Results: CMU Database
Spectrograms: CMU Database
Multi-Microphone ASR Experiments
Details on Setup of ASR Tasks:
• 700 Sentences for Training and 300 for Testing
• 12-state, left-right HMM w. Gaussian mixtures
• All-pair, unweighted grammar
• MFCC+C0+D+DD (39 coefficients in total)

Word Accuracies (%) by input signal for the ASR task
Features     Original   Noisy   McCowan   Proposed
MFCC+D+DD    96.37      94.98   93.83     95.23
Multi-Cue Feature Fusion
Goal:
• Fuse heterogeneous information streams optimally & adaptively
Our approach:
• Explicitly model uncertainty in all feature measurements (due to noise or model fitting errors)
• Adjust model training to accommodate uncertainty
• Dynamically compensate for feature uncertainty during decoding
Feature uncertainty estimation in the AV-ASR case:
• For the Audio Stream/MFCC: speech enhancement process
• For the Visual Stream: model fitting variance
Properties:
• Adaptation at the frame level
• Explains and generalizes cue weighting through stream exponents
• Integrates with a wide range of models, e.g. GMM, HMM
• Applicable to both audio-audio and audio-visual scenarios
• Can be combined with asynchronous models, e.g. Product-HMM
Measurement Noise and Adaptive Fusion
Conventional View (nodes C, X): features are directly observable
Our View (nodes C, X, Y): we can only measure noise-corrupted features
Ref: Katsamanis, Papandreou, Pitsikalis, and Maragos, EUSIPCO’06
EM-Training with Partially Known Features
Conventional View (hidden: C; observed: X):
$Q(\theta, \theta') = E[\log p(X, \{C\} \mid \theta) \mid X, \theta']$
Our View (hidden: X, C; observed: Y): even training data can be uncertain
$Q(\theta, \theta') = E[\log p(Y, \{X, C\} \mid \theta) \mid Y, \theta']$
Ref: Papandreou, Katsamanis, Pitsikalis, and Maragos, submission to NIPS’06
EM-Training: Results for GMM
[E-Step and M-Step update equations; annotations on the slide:]
• Uncertainty-compensated scores
• Filtered feature estimate
• Similar to conventional update rules
Formulas for HMM are similar
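To make the E-step and M-step concrete, here is a minimal one-dimensional sketch of a single EM iteration for a GMM trained on uncertain features. It assumes an additive Gaussian measurement model y_t = x_t + e_t with known per-frame variance; the function and variable names are hypothetical, and the updates are the standard ones that follow from that assumption rather than a transcription of the slide's equations.

```python
import numpy as np


def em_step_uncertain(y, var_e, weights, means, variances):
    """One EM iteration for a 1-D GMM when only y_t = x_t + e_t is observed,
    with e_t ~ N(0, var_e[t]) (assumed measurement model)."""
    y = np.asarray(y, dtype=float)[:, None]          # shape (T, 1)
    var_e = np.asarray(var_e, dtype=float)[:, None]  # shape (T, 1)
    total_var = variances[None, :] + var_e           # shape (T, M)

    # E-step: responsibilities from uncertainty-compensated scores.
    log_p = np.log(weights) - 0.5 * (
        np.log(2.0 * np.pi * total_var) + (y - means) ** 2 / total_var)
    log_p -= log_p.max(axis=1, keepdims=True)
    resp = np.exp(log_p)
    resp /= resp.sum(axis=1, keepdims=True)

    # E-step: filtered estimate of the clean feature and its posterior variance.
    gain = variances[None, :] / total_var
    x_hat = means + gain * (y - means)
    post_var = variances[None, :] * (1.0 - gain)

    # M-step: conventional-looking updates, with x_hat in place of the data
    # and the posterior variance added to the scatter term.
    n_m = resp.sum(axis=0)
    new_weights = n_m / y.shape[0]
    new_means = (resp * x_hat).sum(axis=0) / n_m
    new_vars = (resp * ((x_hat - new_means) ** 2 + post_var)).sum(axis=0) / n_m
    return new_weights, new_means, new_vars
```

With zero measurement variance the filtered estimate equals the observation and the step reduces to ordinary GMM training. With a single component and constant measurement variance, repeated steps drive the model variance toward the sample variance of y minus the measurement variance.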
Decoding & Uncertain Features
Variance-Compensated (“Soft”) Scoring
Probabilistic Justification for Stream Exponents
Relative Measurement Error
Adaptation at each frame: stream/class/mixture-dependent stream weights
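One way to read variance-compensated scoring is sketched below for a diagonal Gaussian. Under the assumption of additive Gaussian measurement noise, the observation likelihood is a Gaussian whose variance is the model variance plus the measurement variance. The quadratic term is then scaled by var / (var + var_e), which behaves like a frame-dependent stream exponent driven by the relative measurement error var_e / var. Function names are hypothetical.

```python
import numpy as np


def compensated_log_score(y, var_e, mean, var):
    """log N(y; mean, var + var_e) for a diagonal Gaussian ("soft" scoring)."""
    total = var + var_e
    return float(-0.5 * np.sum(np.log(2.0 * np.pi * total) + (y - mean) ** 2 / total))


def effective_stream_weight(var, var_e):
    """Factor by which the quadratic term is scaled: 1 / (1 + var_e / var)."""
    return var / (var + var_e)


# Usage: an outlying observation is penalized less when it is known to be unreliable.
mean, var = np.zeros(2), np.ones(2)
y = np.array([5.0, 5.0])
reliable = compensated_log_score(y, np.zeros(2), mean, var)
unreliable = compensated_log_score(y, 10.0 * np.ones(2), mean, var)
```

Because the weight depends on the variance of each Gaussian, it is naturally stream-, class-, and mixture-dependent, and it changes at every frame with the measurement variance.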
Audio-visual Asynchrony Modeling
[Diagrams: Multi-stream HMM and Product HMM]
Ref: Gravier et al., 2002
Fusion: Multi-Cue Audio-Audio
Feature Uncertainty for Audio Features
Baseline Audio Features: MFCC
• Enhancement using a GMM of clean speech and Vector Taylor Series approximation
• Uncertainty is Gaussian, with variance given by the enhancement process
• Used for Audio-Visual Fusion
Fractal Audio Features: MFD
• Ongoing research applying a similar framework (GMM, VTS)
MFD: From Noisy Speech to Feature Uncertainty
Ongoing Research: Noise Compensation for MFD
[Figure: MFD for Clean, Noise, True Noisy, and Estimated Noisy signals; white noise at 0 dB]
Showcase: Audio-Visual Speech Recognition
[Figure: AAM shape and texture synthesis; surviving labels include the coefficients p1, p2]
• Both shape & texture can assist lipreading
• Active Appearance Models for face modeling
• Shape and texture of faces “live” in low-dim manifolds
• Features: AAM fitting (nonlinear least squares problem)
• Visual feature uncertainty is related to the sensitivity of the least-squares solution
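A common way to quantify the sensitivity of a least-squares solution is the Gauss-Newton covariance approximation, sketched below. The slides do not say which estimator the visual front-end uses, so this is an assumption for illustration only; the function name and the residual-based noise estimate are choices made here.

```python
import numpy as np


def least_squares_covariance(jacobian, residuals):
    """Approximate parameter covariance at a least-squares solution.

    Uses sigma^2 * (J^T J)^-1, with sigma^2 estimated from the residuals
    (m observations, p parameters, m > p).
    """
    m, p = jacobian.shape
    sigma2 = float(residuals @ residuals) / (m - p)
    return sigma2 * np.linalg.inv(jacobian.T @ jacobian)


# Usage: 3 residuals, 2 parameters.
J = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
r = np.array([0.1, -0.1, 0.1])
cov = least_squares_covariance(J, r)   # diagonal entries act as feature variances
```

Parameters along directions in which the residual changes slowly (small singular values of the Jacobian) receive large variances, which is the sense in which uncertainty tracks sensitivity.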
Demo: AAM fitting and uncertainty estimates
The visual front-end supplies both features and their
respective uncertainty.
Audio-Visual ASR: Database
Subset of CUAVE database used:
36 speakers (30 training, 6 testing)
5 sequences of 10 connected digits per speaker
Training set: 1500 digits (30x5x10)
Test set: 300 digits (6x5x10)
CUAVE database also contains more complex data sets: speaker moving around, speaker shows profile, continuous digits, two speakers (to be used in future evaluations)
CUAVE was kindly provided by Clemson University
Evaluation on the CUAVE Database
Audio-Visual Speech Classification with MS-HMM
Ref: Katsamanis, Papandreou, Pitsikalis, and Maragos, EUSIPCO’06
AV Digit Classification Results (Word Accuracy)
Audio: MFCC_D_Z (26 features)
Visual: 6 shape + 12 texture AAM coefficients
AV MS-HMM: Audio-Visual Multistream HMM, weights (1,1)
AV MS-HMM, Var-Comp: Audio-Visual Multistream HMM + Variance Compensation
AV P-HMM: Audio-Visual Product HMM, weights (1,1)
AV P-HMM, Var-Comp: Audio-Visual Product HMM + Variance Compensation

SNR (babble)   Audio    Visual   AV MS-HMM   AV MS-HMM Var-Comp   AV P-HMM   AV P-HMM Var-Comp
Clean          100%     68.7%    95.1%       97.0%                95.4%      99.6%
10 dB          92.8%    -        88.3%       90.2%                90.6%      92.5%
5 dB           73.9%    -        84.5%       86.8%                87.2%      89.1%
0 dB           54.7%    -        79.6%       81.1%                83.8%      82.6%
Ref: Pitsikalis, Katsamanis, Papandreou, and Maragos, ICSLP’06
AV-ASR: Results with Uncertain Training
Ref: Papandreou, Katsamanis, Pitsikalis, and Maragos, submission to NIPS’06
VTLN on the Platform
Warping in the front-end
• Piecewise linear warping function
• Warping in the filterbank domain by stretching or compressing the frequency axis
Training
• HTK implementation
Testing
• Fast implementation using a GMM representing normalized speech to estimate warping factors per utterance
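A piecewise linear warping function of the kind mentioned above can be sketched as a two-segment map: frequencies are scaled by the warping factor up to a knee, then a second segment brings the upper band edge back onto itself. The knee placement, the parameter names, and the default values below are assumptions for illustration and are not the platform's settings.

```python
import numpy as np


def piecewise_linear_warp(freq_hz, alpha, f_max_hz=4000.0, knee_ratio=0.85):
    """Warp frequencies by factor alpha below a knee, then linearly to f_max.

    The knee is placed so that the warped knee never exceeds
    knee_ratio * f_max, which keeps the map monotonic for alpha near 1.
    """
    f = np.asarray(freq_hz, dtype=float)
    knee = knee_ratio * f_max_hz * min(1.0, 1.0 / alpha)
    upper_slope = (f_max_hz - alpha * knee) / (f_max_hz - knee)
    return np.where(f <= knee, alpha * f, alpha * knee + upper_slope * (f - knee))


# Usage: warp filterbank center frequencies for a hypothetical speaker factor.
centers = np.linspace(0.0, 4000.0, 24)
warped = piecewise_linear_warp(centers, alpha=1.08)
```

At test time, one factor per utterance could be picked by scoring features computed with each candidate alpha against the GMM of normalized speech and keeping the best-scoring one; that selection loop is omitted here.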
VTLN on the Platform, Results
[Bar chart: vertical axis from 87 to 89.5; bars for MFCC (HFE), MFCC (HTK), TECC, and MFCC+VTLN (HFE)]
VTLN Research, TECC Features
• Teager Energy Cepstrum Coefficients are energy measurements at the output of a Gammatone filterbank, similar to MFCC
• VTLN can be applied in a similar manner
• The Bark scale along which the filters are uniformly positioned is stretched or shrunk to achieve warping
• Evaluation is currently in progress
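The energy measurement behind TECC is presumably the discrete Teager-Kaiser energy operator applied to each filter output. The sketch below shows only that operator; the Gammatone filterbank, averaging, and cepstral steps are omitted, and the function name is hypothetical.

```python
import numpy as np


def teager_energy(x):
    """Discrete Teager-Kaiser operator: psi[n] = x[n]^2 - x[n-1] * x[n+1].

    Returns len(x) - 2 values (the two edge samples have no neighbors).
    """
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]


# Usage: for a sinusoid A*cos(omega*n + phi) the operator is constant,
# equal to A^2 * sin(omega)^2, so it tracks both amplitude and frequency.
n = np.arange(200)
A, omega = 1.5, 0.3
psi = teager_energy(A * np.cos(omega * n + 0.7))
```

Unlike the squared amplitude used for MFCC filterbank energies, this measure grows with the oscillation frequency as well as the amplitude.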
VTLN Research, using Formants
[Bar chart: vertical axis from 82 to 90; bars for MFCC (HFE), MFCC (HTK), VTLN (LPC), and VTLN (Multi Band)]
Raw Formants-Dynamic Programming
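[Trellis diagram with axes "time" and "node"]
$C(t, n) = C_{local}(t, n) + \min_m \{ C_{trans}(m, n) + C(t-1, m) \}$
$C_{local}(t, n)$: [expression not recoverable from the slide; surviving symbols include $a$, $\beta$, $B$, $F_i$, and $F_n$]
$C_{trans}(m, n) = \sum_{i=1}^{N} \left( \frac{F_i(t) - F_i(t-1)}{F_i^{max}} \right)^2$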
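The dynamic-programming search over formant candidates can be sketched as a minimum-cost path through a time-by-node trellis: the accumulated cost of a node is its local cost plus the cheapest combination of a predecessor's accumulated cost and the transition cost. In this illustration the local costs are supplied as a plain array, since their exact form is not reproduced here, and the transition cost is one reading of the slide (squared frequency jump normalized by a maximum frequency). All names are hypothetical.

```python
import numpy as np


def dp_track(local_cost, trans_cost):
    """Minimum-cost path through a trellis.

    local_cost: shape (T, N), cost of node n at frame t.
    trans_cost: shape (T, N, N), trans_cost[t, m, n] is the cost of moving
                from node m at frame t-1 to node n at frame t (entry 0 unused).
    Returns (path, total_cost) with path[t] the chosen node at frame t.
    """
    T, N = local_cost.shape
    acc = np.empty((T, N))
    back = np.zeros((T, N), dtype=int)
    acc[0] = local_cost[0]
    for t in range(1, T):
        total = acc[t - 1][:, None] + trans_cost[t]    # indexed [m, n]
        back[t] = np.argmin(total, axis=0)
        acc[t] = local_cost[t] + total[back[t], np.arange(N)]
    path = np.empty(T, dtype=int)
    path[-1] = int(np.argmin(acc[-1]))
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path, float(acc[-1, path[-1]])


def formant_transition_cost(cands_prev, cands_cur, f_max):
    """Sum over formants of ((F_i(t) - F_i(t-1)) / f_max)^2 for every node pair.

    cands_prev, cands_cur: shape (N, K), K formant frequencies per candidate.
    Returns an (N_prev, N_cur) cost matrix.
    """
    jump = (cands_cur[None, :, :] - cands_prev[:, None, :]) / f_max
    return np.sum(jump ** 2, axis=-1)
```

Stacking the matrices from `formant_transition_cost` over frames gives the `trans_cost` array, and the returned path selects one candidate formant set per frame.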
Formant Tracking
Next...
• Fusion: Audio+Audio, Audio+Visual, Nonlinear Features+Visual
• Visual Front-end
• VAD + Nonlinear Features