Computer Vision, Speech Communication and Signal Processing (CVSP) Research Group
HIWIRE · ICCS - NTUA

HIWIRE-Involved CVSP Members
Group Leader: Prof. Petros Maragos
Ph.D. Students / Graduate Research Assistants:
- D. Dimitriadis (speech: recognition, modulations)
- V. Pitsikalis (speech: recognition, fractals/chaos, NLP)
- A. Katsamanis (speech: modulations, statistical processing, recognition)
- G. Papandreou (vision: PDEs, active contours, level sets, AV-ASR)
- G. Evangelopoulos (vision/speech: texture, modulations, fractals)
- S. Leykimiatis (speech: statistical processing, microphone arrays)
ICCS-NTUA in HIWIRE: 1st Year (progress overview)
- Evaluation databases: completed
- Baseline: completed
- WP1 Noise-robust features: 1st-year results
  - Modulation features: 1st-year results
  - Fractal features: 1st-year results
- WP1 Audio-visual ASR: baseline + visual features
- WP1 Multi-microphone array: exploratory phase
- WP1 VAD: preliminary results
- WP2 Speaker normalization: baseline
- WP2 Non-native speech database: completed
WP1: Noise Robustness
Platform: HTK
Baseline + evaluation: Aurora 2, Aurora 3, TIMIT+NOISE
Feature families:
- Modulation features: AM-FM modulations, Teager energy cepstrum
- Fractal features: dynamical denoising, correlation dimension, multiscale fractal dimension
- Hybrid/merged features
Best relative improvements: up to +62% (Aurora 3), up to +36% (Aurora 2), up to +61% (Aurora 2)
Speech Modulation Features
- Filterbank design
- Short-term AM-FM modulation features:
  - short-term mean instantaneous amplitude (IA-Mean)
  - short-term mean instantaneous frequency (IF-Mean)
  - frequency modulation percentages (FMP)
- Short-term energy modulation features:
  - average Teager energy, cepstrum coefficients (TECC)
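To make the demodulation behind IA-Mean and IF-Mean concrete, here is a minimal NumPy sketch of the Teager-Kaiser energy operator and DESA-2 energy separation, which split a band-passed signal into instantaneous amplitude and frequency tracks. The function names and the pure-sinusoid sanity check are illustrative assumptions, not taken from the slides; in the actual front-end the signal is first band-passed by the filterbank.

```python
import numpy as np

def teager(x):
    """Teager-Kaiser energy operator: Psi[x](n) = x(n)^2 - x(n-1)*x(n+1)."""
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def desa2(x):
    """DESA-2 demodulation of a (band-passed) signal into instantaneous
    amplitude a(n) and frequency Omega(n) in rad/sample."""
    y = x[2:] - x[:-2]                  # symmetric difference, centred like x[1:-1]
    psi_x = teager(x)[1:-1]             # align with teager(y)
    psi_y = teager(y)
    psi_x = np.maximum(psi_x, 1e-12)    # guard against division by ~0
    psi_y = np.maximum(psi_y, 1e-12)
    # 1 - Psi[y]/(2 Psi[x]) = cos(2 Omega)  for a pure AM-FM sinusoid
    omega = 0.5 * np.arccos(np.clip(1.0 - psi_y / (2.0 * psi_x), -1.0, 1.0))
    amp = 2.0 * psi_x / np.sqrt(psi_y)
    return amp, omega

# sanity check on a pure sinusoid: expect a(n) ~ 1.5, Omega(n) ~ 0.2*pi
n = np.arange(2000)
x = 1.5 * np.cos(0.2 * np.pi * n + 0.3)
a, w = desa2(x)
```

On real speech the raw `a` and `w` tracks are then smoothed and summarized per analysis frame.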
Modulation Acoustic Features (processing pipeline, linearized from the block diagram)
Speech -> multiband filtering -> nonlinear processing / demodulation -> regularization + statistical processing -> robust feature transformation/selection -> V.A.D.
- Energy features: Teager energy cepstrum coefficients (TECC)
- AM-FM modulation features: mean inst. amplitude (IA-Mean), mean inst. frequency (IF-Mean), frequency modulation percentage (FMP)
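The "statistical processing" stage above reduces the per-sample amplitude/frequency tracks of one filterbank channel to per-frame features. A hedged sketch follows: the frame sizes, epsilon guards, and the exact FMP normalisation are assumptions for illustration; the published features use amplitude-weighted short-time estimates in this spirit.

```python
import numpy as np

def short_term_modulation_feats(amp, freq, frame_len=200, hop=80):
    """Per-frame IA-Mean, IF-Mean and FMP from instantaneous amplitude and
    frequency tracks of one filterbank channel.
    IF-Mean: amplitude-squared-weighted mean instantaneous frequency.
    FMP: weighted frequency deviation (bandwidth) normalised by IF-Mean."""
    feats = []
    for s in range(0, len(amp) - frame_len + 1, hop):
        a = amp[s:s + frame_len]
        f = freq[s:s + frame_len]
        w = a ** 2
        w_sum = w.sum() + 1e-12
        ia_mean = a.mean()
        if_mean = (f * w).sum() / w_sum
        bw = np.sqrt(((f - if_mean) ** 2 * w).sum() / w_sum)
        feats.append((ia_mean, if_mean, bw / (if_mean + 1e-12)))
    return np.array(feats)
```

A perfectly constant-amplitude, constant-frequency channel should give FMP near zero, with IA-Mean and IF-Mean recovering the constants.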
TIMIT-based Speech Databases
TIMIT database:
- Training set: 3696 sentences, ~35 phonemes/utterance
- Testing set: 1344 utterances, 46680 phonemes
- Sampling frequency: 16 kHz
Feature vectors: MFCC+C0 + AM-FM features + 1st and 2nd time derivatives
Stream weights: 1 for MFCC, 2 for AM-FM
Back-end system: HTK v3.2.0
- 3-state left-right HMMs, 16 mixtures
- All-pair, unweighted grammar
- Performance criterion: phone accuracy rate (%)
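The appended 1st and 2nd time derivatives follow HTK's standard regression formula, d_t = sum_k k*(c_{t+k} - c_{t-k}) / (2 * sum_k k^2). A small sketch (edge frames replicated, as HTK does; `theta=2` mirrors HTK's default delta window):

```python
import numpy as np

def htk_deltas(c, theta=2):
    """HTK-style delta coefficients over a (T, D) matrix of static features:
    d_t = sum_{k=1..theta} k*(c_{t+k} - c_{t-k}) / (2 * sum_k k^2),
    with edge frames replicated so every t has theta neighbours."""
    T = c.shape[0]
    denom = 2.0 * sum(k * k for k in range(1, theta + 1))
    pad = np.concatenate([c[:1].repeat(theta, 0), c, c[-1:].repeat(theta, 0)])
    d = np.zeros_like(c, dtype=float)
    for k in range(1, theta + 1):
        d += k * (pad[theta + k:theta + k + T] - pad[theta - k:theta - k + T])
    return d / denom
```

Accelerations (2nd derivatives) are simply deltas of the deltas; a linear ramp should give delta 1 and acceleration 0 away from the edges.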
Results: TIMIT + Noise (phone accuracy %, up to +106% relative improvement)

Features        TIMIT   NTIMIT  +Babble  +White  +Pink   +Car
MFCC*           58.40   42.42   27.71    17.72   18.60   52.75
TEner. CC       58.89   42.40   41.61    34.74   38.40   54.35
MFCC*+IA-Mean   59.61   43.53   39.25    26.03   31.05   56.50
MFCC*+IF-Mean   59.34   43.70   36.87    25.38   30.92   55.30
MFCC*+FMP       59.92   43.69   38.60    26.15   32.84   55.97
Aurora 3 - Spanish
Connected digits, sampling frequency 8 kHz
Training sets:
- WM (Well-Matched): 3392 utterances (532 quiet, 1668 low and 1192 high noise)
- MM (Medium-Mismatch): 1607 utterances (396 quiet and 1211 low noise)
- HM (High-Mismatch): 1696 utterances (266 quiet, 834 low and 596 high noise)
Testing sets:
- WM: 1522 utterances (260 quiet, 754 low and 508 high noise), 8056 digits
- MM: 850 utterances (850 high noise), 4543 digits
- HM: 631 utterances (377 low and 254 high noise), 3325 digits
2 back-end ASR systems (HTK and BLasr)
Feature vectors: MFCC+AM-FM (or Auditory+AM-FM), TECC
All-pair, unweighted grammar (or word-pair grammar)
Performance criterion: word (digit) accuracy rate
Results: Aurora 3 (HTK) - word accuracy (%), up to +62% relative improvement

Features                   WM      MM      HM      Average
Aurora Front-End (WI007)   92.94   80.31   51.55   74.93
MFCC+log(Ener)+D+DD+CMS    93.68   92.73   65.18   83.86
TEnerCC+log(Ener)+CMS      93.64   91.61   86.85   90.70
MFCC*+IA-Mean              94.05   92.22   77.70   87.99
MFCC*+IF-Mean              90.71   89.52   72.36   84.20
MFCC*+FMP                  94.41   92.46   82.73   89.87

Aurora 3 (Spanish task, auditory features) - word accuracy (%)

Features              WM      MM      HM      Average
Auditory (baseline)   95.4    89.2    84.8    89.8
Aud.+IF-Mean          94.8    88.7    86.1    89.9
Aud.+IF-Var           95.4    88.9    87.4    90.6
Aud.+FMP              95.8    89.0    89.0    90.7
Aud.+FZC              95.6    95.6    86.3    90.3
Aud.+IA-Mean          95.5    89.4    86.0    90.3
Databases: Aurora 2
Task: speaker-independent recognition of digit sequences (TI-Digits at 8 kHz)
Training (8440 utterances per scenario, 55 male / 55 female speakers):
- Clean (8 kHz, G712)
- Multi-condition (8 kHz, G712): 4 artificially added noises (subway, babble, car, exhibition) at 5 SNRs (clean, 20, 15, 10, 5 dB)
Testing (artificially added noise, 7 SNRs: clean, 20, 15, 10, 5, 0, -5 dB):
- Set A: same noises as multi-condition training, G712 (28028 utterances)
- Set B: restaurant, street, airport, train station, G712 (28028 utterances)
- Set C: subway, street (MIRS filtering) (14014 utterances)
Results: Aurora 2 (word accuracy %, averaged over test sets A and B; up to +12% relative improvement)

Features   clean   20 dB   15 dB   10 dB   5 dB    0 dB
Baseline   98.72   95.96   88.85   69.52   40.52   16.95
IA-Mean    98.67   96.10   90.11   73.11   45.83   19.70
Work To Be Done on Modulation Features
Fractal Features (processing pipeline, linearized from the block diagram)
Speech signal -> noisy N-d embedding -> geometrical filtering (local SVD) -> cleaned/filtered N-d embedding -> Filtered Dynamics - Correlation Dimension (FDCD, 8 features)
Speech signal -> Multiscale Fractal Dimension (MFD, 6 features)
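The MFD branch rests on the morphological covering idea: the area of a cover of the signal graph at scale s behaves like s^(2-D), so a fractal dimension can be read off a log-log slope. The toy version below (flat min/max windows, a least-squares slope over a fixed scale range) is an illustrative sketch, not the exact multiscale extraction used in the slides:

```python
import numpy as np

def mfd(x, scales=(1, 2, 3, 4, 5, 6)):
    """Fractal dimension estimate of a 1-D signal via morphological covers:
    cover area A(s) ~ s^(2-D), hence D = 2 - slope of log A(s) vs log s."""
    logs, logA = [], []
    for s in scales:
        # flat dilation (upper envelope) and erosion (lower envelope),
        # window of half-length s
        u = np.array([x[max(0, n - s):n + s + 1].max() for n in range(len(x))])
        l = np.array([x[max(0, n - s):n + s + 1].min() for n in range(len(x))])
        logs.append(np.log(s))
        logA.append(np.log((u - l).sum() + 1e-12))
    slope = np.polyfit(logs, logA, 1)[0]
    return 2.0 - slope
```

A smooth sinusoid should come out near dimension 1, while white noise, whose graph is much rougher, should score noticeably higher.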
Results: Aurora 2 (word accuracy %, up to +40% relative improvement)

Features   clean   20 dB   15 dB   10 dB   5 dB    0 dB    Average
Baseline   98.66   95.70   89.05   71.42   43.45   16.68   63.26
+FDCD      98.58   96.34   92.72   82.92   59.02   21.36   75.16
Results: Aurora 2 (word accuracy %, up to +27% relative improvement)

Features   clean   20 dB   15 dB   10 dB   5 dB    0 dB    Average
Baseline   98.69   95.72   89.12   71.50   43.75   19.55   69.72
MFD        98.73   96.37   91.64   79.11   52.80   21.70   73.39
Results: Aurora 2 (word accuracy %, up to +61% relative improvement)

Features     20 dB   10 dB   5 dB
Baseline     95.75   69.30   38.93
+FMP         95.60   77.62   53.69
+FDCD        96.46   82.93   58.66
+FMP+FDCD    96.17   82.24   61.84
Future Directions on Fractal Features
- Refine fractal feature extraction
- Application to Aurora 3
- Fusion with other features
Visual Front-End
Aim: extract a low-dimensional visual speech feature vector from video
Visual front-end modules:
- speaker's face detection
- ROI tracking
- facial model fitting
- visual feature extraction
Challenges:
- very high-dimensional signal - which features are proper?
- robustness
- computational efficiency
Face Modeling
A well-studied problem in Computer Vision: Active Appearance Models, Morphable Models, Active Blobs
Both shape & appearance can enhance lipreading
The shape and appearance of human faces live in low-dimensional manifolds
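The low-dimensional-manifold claim is exactly what PCA-based AAMs exploit: landmark shapes are approximated by a mean plus a few eigenshapes. A minimal sketch of building such a linear shape basis (the 70% variance cutoff mirrors the AAM parameters quoted later in this deck; the data layout and function names are illustrative):

```python
import numpy as np

def build_shape_model(shapes, var_frac=0.70):
    """PCA shape model in the spirit of AAMs: stacked landmark shapes
    (N, 2L) -> mean shape + eigenshapes retaining `var_frac` of variance."""
    mean = shapes.mean(axis=0)
    X = shapes - mean
    # SVD of the centred data matrix: right singular vectors = eigenshapes
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    var = S ** 2 / (len(shapes) - 1)
    k = int(np.searchsorted(np.cumsum(var) / var.sum(), var_frac)) + 1
    return mean, Vt[:k], var[:k]

def project(shape, mean, basis):
    """Shape parameters b such that shape ~ mean + b @ basis."""
    return (shape - mean) @ basis.T
```

With one dominant deformation mode in the data, a single eigenshape should suffice and reconstruction should be accurate up to the noise level.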
Image Fitting Example (AAM fitting iterations: steps 2, 6, 10, 14, 18)
Example: Face Interpretation Using AAMs
Generative models like AAMs allow us to evaluate the output of the visual front-end:
- original video
- shape track superimposed on the original video
- reconstructed face - this is what the visual-only speech recognizer sees!
Evaluation on the CUAVE Database
Audio-Visual ASR: Database
Subset of the CUAVE database used:
- 36 speakers (30 training, 6 testing)
- 5 sequences of 10 connected digits per speaker
- Training set: 1500 digits (30x5x10); test set: 300 digits (6x5x10)
CUAVE also contains more complex data sets (speaker moving around, profile views, continuous digits, two simultaneous speakers), to be used in future evaluations.
CUAVE was kindly provided by Clemson University.
Recognition Results (Word Accuracy)
Data: training ~500 digits (29 speakers); testing ~100 digits (4 speakers)
Future Work
Visual front-end:
- better-trained AAM
- temporal tracking
Feature fusion:
- experimentation with alternative DBN architectures
- automatic stream weight determination
- integration with non-linear acoustic features
Experiments on other audio-visual databases; systematic evaluation of visual features
WP2: User Robustness, Speaker Adaptation
VTLN baseline:
- Platform: HTK
- Database: Aurora 4 (Fs = 8 kHz)
- Scenarios: training, testing
- Comparison with MLLR
Collection of non-native speech data completed: 10 speakers, 100 utterances/speaker
Vocal Tract Length Normalization (VTLN)
Implementation: HTK
Warping factor estimation: maximum likelihood (ML) criterion
Figures from Hain99, Lee96
VTLN
Training: Aurora 4 baseline setup - clean (SIC), multi-condition (SIM), noisy (SIN)
Testing:
- Supervised VTLN: estimate the warping factor from adaptation utterances; one warping factor per speaker (1, 2, 10, or 20 utterances)
- 2-pass decoding:
  - 1st pass: obtain a hypothesized transcription; alignment and ML estimation of a per-utterance warping factor
  - 2nd pass: decode the properly normalized utterance
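The supervised warping-factor estimation above reduces to a one-dimensional ML grid search. A schematic sketch, where the alpha grid and the `loglik` callback (standing in for HVite-style forced alignment plus scoring against the current HMMs) are assumptions for illustration:

```python
import numpy as np

def estimate_warp(utterances, loglik, alphas=np.arange(0.88, 1.13, 0.02)):
    """Grid-search ML estimation of a per-speaker VTLN warping factor:
    pick the alpha whose warped features maximise the summed
    log-likelihood of the adaptation utterances.
    `loglik(utt, alpha)` must warp the utterance's filterbank by alpha,
    align it against the current models, and return the log-likelihood."""
    scores = [sum(loglik(u, a) for u in utterances) for a in alphas]
    return alphas[int(np.argmax(scores))]
```

In the 2-pass decoding variant the same search runs per utterance, using the first-pass hypothesis as the alignment transcription.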
Databases: Aurora 4
Task: 5000-word continuous speech recognition
WSJ0 (16 / 8 kHz) + artificially added noise
- 2 microphones: Sennheiser, other
- Filtering: G712, P341
- Noises: car, babble, restaurant, street, airport, train station
Training (7138 utterances per scenario):
- Clean: Sennheiser mic.
- Multi-condition: Sennheiser + other mic., 75% with artificially added noise at SNR 10-20 dB
- Noisy: Sennheiser mic., artificially added noise at SNR 10-20 dB
Testing (330 or 166 utterances per set, 8 speakers, SNR 5-15 dB):
- sets 1-7: Sennheiser microphone
- sets 8-14: other microphone
VTLN Results, Clean Training - word error rate (%)

Test noise          Clean   Car     Tr. Station
SIC (baseline)      13.15   24.75   57.50
Supervised VTLN-2   11.60   21.40   53.96
MLLR-2              11.97   16.28   46.00
MLLR-20             10.94   13.33   34.44
VTLN Results, Multi-Condition Training - word error rate (%)

Test noise          Clean   Car     Tr. Station
SIM (baseline)      19.45   16.50   29.65
Supervised VTLN-2   17.53   14.66   28.21
MLLR-2              16.54   15.43   31.27
VTLN Results, Noisy Training - word error rate (%)

Test noise          Clean   Car     Tr. Station
SIN (baseline)      15.73   16.13   33.37
Supervised VTLN-2   13.11   14.36   31.12
MLLR-2              14.92   15.14   34.04
Future Directions for Speaker Normalization
- Estimate warping transforms at the signal level: exploit the instantaneous amplitude or frequency signals to estimate the warping parameters, then normalize the signal
- Effective integration with model-based adaptation techniques (collaboration with TSI)
WP1: Appendix Slides - Aurora 3
ASR Results
Experimental Results IIa (HTK)
Aurora 3 Configurations
- HM: 14 states, 12 mixtures
- MM: 16 states, 6 mixtures
- WM: 16 states, 16 mixtures
WP1: Appendix Slides - Aurora 2
Baseline: Aurora 2
Database structure: 2 training scenarios, 3 test sets, [4+4+2] conditions, 7 SNRs per condition - a total of 2x70 tests
Presentation of selected results:
- average over SNR; average over condition
- training scenarios: clean vs. multi-condition training
- noise level: low vs. high SNR
- condition: worst vs. easiest conditions
- features: MFCC+D+A vs. MFCC+D+A+CMS
Setup: 18 states [10-22], 3-32 mixtures, MFCC+D+A+CMS
Average Baseline Results: Aurora 2
- Average over all SNRs and all conditions; average HTK results as reported with the database.
- Plain: MFCC+D+A; CMS: MFCC+D+A+CMS.
- Mixture counts: clean training (both Plain and CMS) 3; multi-condition training Plain 22, CMS 32.
- Best: select, for each condition/noise, the mixture count with the best result.
Aurora 2 Distributed, Multicondition Training
Aurora 2 Distributed, Clean Training
WP1: Appendix Slides - Audio-Visual: Details
Introduction: Motivations for AV-ASR
Audio-only ASR does not work reliably in many scenarios:
- noisy background (e.g. a car's cabin, a cockpit)
- interference between talkers
- need to enhance the auditory signal when it is not reliable
Human speech perception is multimodal:
- different modalities are weighed according to their reliability
- hearing-impaired people can lipread
- McGurk effect (McGurk & MacDonald, 1976)
Machines should also be able to exploit multimodal information.
Audio-Visual Feature Fusion
Audio-visual feature integration is highly non-trivial:
- audio & visual speech asynchrony (~100 ms)
- the relative reliability of the streams can vary wildly
Many approaches to feature fusion in the literature: early, intermediate, and late integration - a highly active research area (mainly in machine learning)
The class of Dynamic Bayesian Networks (DBNs) seems particularly suited to the problem:
- stream interaction is explicitly modeled
- model parameter inference is more difficult than in HMMs
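A concrete form of stream interaction, used by the two-stream HMM baseline described in the experiment setup (equal weights (1, 1)), is the weighted combination of per-state stream log-likelihoods. The toy two-state numbers below are illustrative only:

```python
import numpy as np

def fused_log_obs(logb_audio, logb_video, w_audio=1.0, w_video=1.0):
    """Multi-stream HMM observation score: per-state stream log-likelihoods
    combined with exponential stream weights,
        log b_j(o_t) = w_a * log b_j^a(o_t^a) + w_v * log b_j^v(o_t^v).
    Raising one weight makes decoding trust that stream more."""
    return w_audio * np.asarray(logb_audio) + w_video * np.asarray(logb_video)

# toy example: two states; audio prefers state 0, video prefers state 1
la = np.array([-1.0, -4.0])   # audio stream log-likelihoods per state
lv = np.array([-6.0, -2.0])   # video stream log-likelihoods per state
```

Here equal weights let the confident video stream flip the decision, while zeroing the video weight falls back to audio-only behavior.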
Visual Front-End: AAM Parameters
- First frame of each of the 36 videos manually annotated
- 68 points on the whole face as shape landmarks
- Color appearance sampled at 10000 pixels
- Retained eigenvectors explain 70% of the variance: 5 eigenshapes & 10 eigenfaces
- Initial condition at each new frame: the converged solution at the previous frame
- Inverse-compositional gradient descent algorithm
- Coarse-to-fine refinement (Gaussian pyramid, 3 scales)
AV-ASR Experiment Setup
Features:
- Audio: 39 features (MFCC_D_A)
- Visual (upsampled from 30 Hz to 100 Hz): 5 shape features (Sh) + 10 appearance features (App)
- Audio-visual: 39+45 features (MFCC_D_A + SHAPP_D_A)
Two-stream HMM:
- 8-state, left-to-right, whole-digit models with no state skipping
- single-Gaussian observation probability densities
- separate audio & video feature streams with equal weights (1, 1)
WP2: Appendix Slides - Aurora 4
Aurora 4, Multi-Condition Training
Aurora 4, Noisy Training