ICCS-NTUA : WP1+WP2

ICCS-NTUA : WP1+WP2
Prof. Petros Maragos, NTUA, School of ECE. URL: http://cvsp.cs.ntua.gr
Computer Vision, Speech Communication and Signal Processing Research Group
HIWIRE
Transcript
  • Computer Vision, Speech Communication and Signal Processing Research Group (HIWIRE)

    HIWIRE / ICCS-NTUA

    HIWIRE: Involved CVSP Members

    Group Leader: Prof. Petros Maragos
    Ph.D. Students / Graduate Research Assistants:
    D. Dimitriadis (speech: recognition, modulations)
    V. Pitsikalis (speech: recognition, fractals/chaos, NLP)
    A. Katsamanis (speech: modulations, statistical processing, recognition)
    G. Papandreou (vision: PDEs, active contours, level sets, AV-ASR)
    G. Evangelopoulos (vision/speech: texture, modulations, fractals)
    S. Leykimiatis (speech: statistical processing, microphone arrays)

    ICCS-NTUA in HIWIRE: 1st Year (status)

    Evaluation: Databases (completed); Baseline (completed)
    WP1: Noise-Robust Features (1st-year results: modulation features, fractal features); Audio-Visual ASR (baseline + visual features); Multi-microphone array (exploratory phase); VAD (preliminary results)
    WP2: Speaker Normalization (baseline); Non-native Speech Database (completed)

    WP1: Noise Robustness
    Platform: HTK
    Baseline + evaluation: Aurora 2, Aurora 3, TIMIT+NOISE
    Modulation features: AM-FM modulations, Teager energy cepstrum
    Fractal features: dynamical denoising, correlation dimension, multiscale fractal dimension
    Hybrid/merged features: up to +62% (Aurora 3), up to +36% (Aurora 2), up to +61% (Aurora 2)

    Speech Modulation Features
    Filterbank design
    Short-term AM-FM modulation features:
    - short-term mean instantaneous amplitude (IA-Mean)
    - short-term mean instantaneous frequency (IF-Mean)
    - frequency modulation percentages (FMP)
    Short-term energy modulation features:
    - average Teager energy cepstrum coefficients (TECC)

    Modulation Acoustic Features: processing pipeline
    Blocks: speech signal; nonlinear processing; multiband filtering + regularization; demodulation; statistical processing; robust feature transformation/selection; V.A.D.
    Energy features: Teager energy cepstrum coefficients (TECC)
    AM-FM modulation features: mean instantaneous amplitude (IA-Mean), mean instantaneous frequency (IF-Mean), frequency modulation percentages (FMP)
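The demodulation block above is typically realized with the Teager-Kaiser energy operator plus an energy separation algorithm. The sketch below is illustrative only, not the project's actual front end (the filterbank, regularization, and short-term statistics are omitted); it shows how DESA-2 estimates instantaneous amplitude and frequency directly from a discrete signal:

```python
import numpy as np

def teager(x):
    """Teager-Kaiser energy operator: Psi[x](n) = x(n)^2 - x(n-1) x(n+1)."""
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def desa2(x):
    """DESA-2 energy separation: per-sample amplitude/frequency estimates."""
    psi_x = teager(x)[1:-1]           # trimmed to align with psi_y below
    y = x[2:] - x[:-2]                # y(n) = x(n+1) - x(n-1)
    psi_y = teager(y)
    ratio = np.clip(1.0 - psi_y / (2.0 * psi_x), -1.0, 1.0)
    omega = 0.5 * np.arccos(ratio)    # instantaneous frequency, rad/sample
    amp = 2.0 * psi_x / np.sqrt(psi_y)
    return amp, omega

# sanity check on a pure sinusoid: 500 Hz tone at 8 kHz, amplitude 0.7
fs, f0 = 8000.0, 500.0
x = 0.7 * np.cos(2 * np.pi * f0 / fs * np.arange(1000))
amp, omega = desa2(x)
f_est = omega * fs / (2 * np.pi)
```

Frame-level IA-Mean and IF-Mean features would then be short-term means of `amp` and `f_est` over analysis windows.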

    TIMIT-based Speech Databases
    TIMIT database: training set 3696 sentences (~35 phonemes/utterance); testing set 1344 utterances, 46680 phonemes; sampling frequency 16 kHz.
    Feature vectors: MFCC+C0+AM-FM plus 1st and 2nd time derivatives. Stream weights: 1 for MFCC, 2 for AM-FM.
    Models: 3-state left-right HMMs, 16 mixtures; all-pair unweighted grammar; performance criterion: phone accuracy rates (%); back-end system: HTK v3.2.0.
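Phone accuracy in HTK's sense penalizes insertions as well as substitutions and deletions. A small helper makes the criterion explicit; the error counts below are hypothetical, and only the 46680-phoneme test-set size comes from the slide:

```python
def htk_scores(n_ref, dels, subs, ins):
    """HTK-style %Correct and %Accuracy from alignment error counts."""
    correct = 100.0 * (n_ref - dels - subs) / n_ref
    accuracy = 100.0 * (n_ref - dels - subs - ins) / n_ref
    return correct, accuracy

# hypothetical error counts against the 46680 test phonemes above
corr, acc = htk_scores(46680, dels=4000, subs=12000, ins=1500)
```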

    Results: TIMIT+Noise (up to +106% relative improvement)


    Phone accuracy (%) on the TIMIT-based databases:

    | Features      | TIMIT | NTIMIT | TIMIT+Babble | TIMIT+White | TIMIT+Pink | TIMIT+Car |
    |---------------|-------|--------|--------------|-------------|------------|-----------|
    | MFCC*         | 58.40 | 42.42  | 27.71        | 17.72       | 18.60      | 52.75     |
    | TEner. CC     | 58.89 | 42.40  | 41.61        | 34.74       | 38.40      | 54.35     |
    | MFCC*+IA-Mean | 59.61 | 43.53  | 39.25        | 26.03       | 31.05      | 56.50     |
    | MFCC*+IF-Mean | 59.34 | 43.70  | 36.87        | 25.38       | 30.92      | 55.30     |
    | MFCC*+FMP     | 59.92 | 43.69  | 38.60        | 26.15       | 32.84      | 55.97     |
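The "+106%" headline is consistent with a relative accuracy improvement over the MFCC* baseline; for instance, the TIMIT+Pink pair (MFCC* 18.6 vs. TEner. CC 38.4) gives:

```python
def relative_gain(baseline_acc, new_acc):
    """Relative improvement (%) of an accuracy over a baseline accuracy."""
    return 100.0 * (new_acc - baseline_acc) / baseline_acc

# TIMIT+Pink: MFCC* scores 18.6, TEner. CC scores 38.4
gain = relative_gain(18.6, 38.4)   # about +106%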


    Word accuracy (%) on Aurora 3 (WM: well-matched, MM: medium-mismatch, HM: high-mismatch):

    | Features                 | WM    | MM    | HM    | Average |
    |--------------------------|-------|-------|-------|---------|
    | Aurora Front-End (WI007) | 92.94 | 80.31 | 51.55 | 74.93   |
    | MFCC+log(Ener)+D+DD+CMS  | 93.68 | 92.73 | 65.18 | 83.86   |
    | TEnerCC+log(Ener)+CMS    | 93.64 | 91.61 | 86.85 | 90.70   |
    | MFCC*+IA-Mean            | 94.05 | 92.22 | 77.70 | 87.99   |
    | MFCC*+IF-Mean            | 90.71 | 89.52 | 72.36 | 84.20   |
    | MFCC*+FMP                | 94.41 | 92.46 | 82.73 | 89.87   |


    Word accuracy (%) on Aurora 3, Spanish task (auditory front end):

    | Features            | WM   | MM   | HM   | Average |
    |---------------------|------|------|------|---------|
    | Auditory (Baseline) | 95.4 | 89.2 | 84.8 | 89.8    |
    | Aud.+IF-Mean        | 94.8 | 88.7 | 86.1 | 89.9    |
    | Aud.+IF-Var         | 95.4 | 88.9 | 87.4 | 90.6    |
    | Aud.+FMP            | 95.8 | 89.0 | 89.0 | 90.7    |
    | Aud.+FZC            | 95.6 | 95.6 | 86.3 | 90.3    |
    | Aud.+IA-Mean        | 95.5 | 89.4 | 86.0 | 90.3    |


    Aurora 3 - Spanish
    Connected digits, sampling frequency 8 kHz.
    Training sets:
    - WM (well-matched): 3392 utterances (quiet 532, low noise 1668, high noise 1192)
    - MM (medium-mismatch): 1607 utterances (quiet 396, low noise 1211)
    - HM (high-mismatch): 1696 utterances (quiet 266, low noise 834, high noise 596)
    Testing sets:
    - WM: 1522 utterances (quiet 260, low noise 754, high noise 508), 8056 digits
    - MM: 850 utterances (all high noise), 4543 digits
    - HM: 631 utterances (quiet 0, low noise 377, high noise 254), 3325 digits
    2 back-end ASR systems ( and BLasr). Feature vectors: MFCC+AM-FM (or Auditory+AM-FM), TECC. All-pair unweighted grammar (or word-pair grammar). Performance criterion: word (digit) accuracy rates.

    Results: Aurora 3 (HTK), up to +62% relative improvement


    Databases: Aurora 2
    Task: speaker-independent recognition of digit sequences (TI-Digits at 8 kHz).
    Training (8440 utterances per scenario, 55 male / 55 female speakers):
    - clean (8 kHz, G712)
    - multi-condition (8 kHz, G712): 4 artificial noises (subway, babble, car, exhibition) at 5 SNRs (5, 10, 15, 20 dB, clean)
    Testing (artificially added noise, 7 SNRs: -5, 0, 5, 10, 15, 20 dB, clean):
    - Set A: noises as in multi-condition training, G712 (28028 utterances)
    - Set B: restaurant, street, airport, train station, G712 (28028 utterances)
    - Set C: subway, street (MIRS filtering) (14014 utterances)

    Results: Aurora 2 (up to +12% relative improvement)

    Word accuracy (%), averaged over Test A + Test B:

    | Features | clean | 20 dB | 15 dB | 10 dB | 5 dB  | 0 dB  |
    |----------|-------|-------|-------|-------|-------|-------|
    | Baseline | 98.72 | 95.96 | 88.85 | 69.52 | 40.52 | 16.95 |
    | +IA-Mean | 98.67 | 96.10 | 90.11 | 73.11 | 45.83 | 19.70 |


    Work To Be Done on Modulation Features

    Fractal Features: processing pipeline
    Speech signal -> noisy N-d embedding -> geometrical filtering (local SVD) -> cleaned N-d embedding -> Filtered Dynamics Correlation Dimension (FDCD)
    Speech signal -> Multiscale Fractal Dimension (MFD)
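To make the dynamical part of the pipeline concrete, here is a generic sketch (not the HIWIRE implementation; the local-SVD geometrical filtering step is omitted): a scalar signal is delay-embedded and a Grassberger-Procaccia correlation sum estimates the correlation dimension. A noise-free sinusoid embeds onto a closed curve, so the estimate should land near 1:

```python
import numpy as np

def delay_embed(x, dim, tau):
    """Time-delay embedding of a scalar series into R^dim."""
    n = len(x) - (dim - 1) * tau
    return np.stack([x[i * tau : i * tau + n] for i in range(dim)], axis=1)

def correlation_sum(emb, r):
    """Grassberger-Procaccia C(r): fraction of point pairs closer than r."""
    d = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)
    iu = np.triu_indices(len(emb), k=1)
    return float(np.mean(d[iu] < r))

# a noise-free sinusoid traces a closed curve in the embedding space,
# so the two-radius slope estimate of the correlation dimension is ~1
x = np.sin(0.063 * np.arange(800))
emb = delay_embed(x, dim=3, tau=10)
r1, r2 = 0.15, 0.3
d2 = np.log(correlation_sum(emb, r2) / correlation_sum(emb, r1)) / np.log(r2 / r1)
```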


    Results: Aurora 2 (FDCD, up to +40% relative improvement)

    Word accuracy (%):

    | Features | clean | 20 dB | 15 dB | 10 dB | 5 dB  | 0 dB  | Average |
    |----------|-------|-------|-------|-------|-------|-------|---------|
    | Baseline | 98.66 | 95.70 | 89.05 | 71.42 | 43.45 | 16.68 | 63.26   |
    | +FDCD    | 98.58 | 96.34 | 92.72 | 82.92 | 59.02 | 21.36 | 75.16   |

    Results: Aurora 2 (MFD, up to +27% relative improvement)

    Word accuracy (%):

    | Features | clean | 20 dB | 15 dB | 10 dB | 5 dB  | 0 dB  | Average |
    |----------|-------|-------|-------|-------|-------|-------|---------|
    | Baseline | 98.69 | 95.72 | 89.12 | 71.50 | 43.75 | 19.55 | 69.72   |
    | +MFD     | 98.73 | 96.37 | 91.64 | 79.11 | 52.80 | 21.70 | 73.39   |

    Results: Aurora 2 (hybrid features, up to +61% relative improvement)

    Word accuracy (%), averaged over the Test A noises (subway, babble, car, exhibition):

    | Features   | 20 dB | 10 dB | 5 dB  |
    |------------|-------|-------|-------|
    | Baseline   | 95.75 | 69.30 | 38.93 |
    | +FMP       | 95.60 | 77.62 | 53.69 |
    | +FDCD      | 96.46 | 82.93 | 58.66 |
    | +FMP+FDCD  | 96.17 | 82.24 | 61.84 |

    Future Directions on Fractal Features
    - Refine fractal feature extraction.
    - Application to Aurora 3.
    - Fusion with other features.

    Visual Front-End
    Aim: extract a low-dimensional visual speech feature vector from video.
    Visual front-end modules: speaker's face detection; ROI tracking; facial model fitting; visual feature extraction.
    Challenges: very high-dimensional signal (which features are proper?); robustness; computational efficiency.

    Face Modeling
    A well-studied problem in Computer Vision: Active Appearance Models, Morphable Models, Active Blobs.
    Both shape and appearance can enhance lipreading; the shape and appearance of human faces live in low-dimensional manifolds.
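The low-dimensional-manifold observation is what AAM-style statistical shape models exploit: landmark shapes are approximated by a mean shape plus a few PCA modes. A self-contained sketch on synthetic landmark data (the shapes, landmark count, and mode count are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic "shapes": 2-D landmark sets generated from 3 latent modes,
# standing in for the low-dimensional face-shape manifold
n_shapes, n_landmarks, n_modes = 200, 30, 3
modes = rng.normal(size=(n_modes, 2 * n_landmarks))
coeffs = rng.normal(size=(n_shapes, n_modes))
mean_shape = rng.normal(size=2 * n_landmarks)
shapes = mean_shape + coeffs @ modes

# PCA via SVD: the statistical shape model used by AAM-style methods
centered = shapes - shapes.mean(axis=0)
_, s, vt = np.linalg.svd(centered, full_matrices=False)
explained = np.cumsum(s ** 2) / np.sum(s ** 2)

# project each shape onto the top 3 modes and reconstruct
b = centered @ vt[:3].T
recon = shapes.mean(axis=0) + b @ vt[:3]
err = float(np.max(np.abs(recon - shapes)))
```

Since the synthetic data has exactly 3 latent modes, 3 principal components capture essentially all variance and the reconstruction is near-exact.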

    Image Fitting Example: fitting iterations shown at steps 2, 6, 10, 14, and 18.

    Example: Face Interpretation Using AAMs
    Generative models like AAMs let us evaluate the output of the visual front-end: the original video, the shape track superimposed on the original video, and the reconstructed face. The reconstruction is what the visual-only speech recognizer actually "sees".

    Evaluation on the CUAVE Database

    Audio-Visual ASR: Database
    Subset of the CUAVE database used: 36 speakers (30 training, 6 testing); 5 sequences of 10 connected digits per speaker. Training set: 1500 digits (30x5x10); test set: 300 digits (6x5x10).
    CUAVE also contains more complex data sets (speaker moving around, profile views, continuous digits, two speakers) to be used in future evaluations.
    CUAVE was kindly provided by Clemson University.

    Recognition Results (Word Accuracy)
    Data: training ~500 digits (29 speakers); testing ~100 digits (4 speakers).

    Future Work
    Visual front-end: better-trained AAM; temporal tracking.
    Feature fusion: experimentation with alternative DBN architectures; automatic stream weight determination; integration with nonlinear acoustic features.
    Experiments on other audio-visual databases; systematic evaluation of visual features.
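The "automatic stream weight determination" item concerns how audio and visual stream log-likelihoods are combined in multi-stream decoding. A toy decision-level sketch (the class log-likelihoods are invented; real systems tune the weight on held-out data or estimate it from stream reliability):

```python
import numpy as np

def fuse_streams(ll_audio, ll_video, lam):
    """Multi-stream log-likelihood fusion: lam weights audio, 1-lam video."""
    return lam * ll_audio + (1.0 - lam) * ll_video

# toy per-class log-likelihoods for a 3-class decision
ll_a = np.array([-1.0, -2.0, -3.0])   # audio evidence prefers class 0
ll_v = np.array([-5.0, -1.0, -4.0])   # visual evidence prefers class 1

audio_heavy_choice = int(np.argmax(fuse_streams(ll_a, ll_v, lam=0.9)))
video_heavy_choice = int(np.argmax(fuse_streams(ll_a, ll_v, lam=0.1)))
```

Shifting the weight toward the more reliable stream changes the decision, which is exactly why the weight should track acoustic conditions.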


    WP2: User Robustness, Speaker Adaptation
    VTLN baseline: platform HTK; database AURORA 4; Fs = 8 kHz; training and testing scenarios; comparison with MLLR.
    Collection of non-native speech data completed: 10 speakers, 100 utterances per speaker.

    Vocal Tract Length Normalization (VTLN)
    Implementation: HTK.
    Warping factor estimation: maximum-likelihood (ML) criterion.
    Figures from Hain99, Lee96.
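A common VTLN realization, sketched here under the usual piecewise-linear assumption (the cutoff value and grid are illustrative, not taken from the slides, and HTK's internal details are not reproduced), warps the frequency axis by a factor alpha and picks alpha by maximizing likelihood over a small grid:

```python
import numpy as np

def plw_warp(freqs, alpha, f_cut=0.8, f_nyq=4000.0):
    """Piecewise-linear VTLN warp: scale by alpha below the cutoff, then
    interpolate linearly so the Nyquist frequency maps onto itself."""
    freqs = np.asarray(freqs, dtype=float)
    fc = f_cut * f_nyq
    lo = alpha * freqs
    hi = alpha * fc + (f_nyq - alpha * fc) * (freqs - fc) / (f_nyq - fc)
    return np.where(freqs <= fc, lo, hi)

def estimate_alpha(loglik, alphas=np.arange(0.88, 1.1201, 0.02)):
    """ML warping-factor estimation: pick the grid alpha maximizing a
    caller-supplied log-likelihood (standing in for a forced-alignment
    score of the warped features against the speaker's transcription)."""
    scores = [loglik(a) for a in alphas]
    return float(alphas[int(np.argmax(scores))])

# toy likelihood peaked at alpha = 1.04
best = estimate_alpha(lambda a: -(a - 1.04) ** 2)
```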

    VTLN
    Training: AURORA 4 baseline setup; clean (SIC), multi-condition (SIM), noisy (SIN).
    Testing:
    - Supervised VTLN: estimate the warping factor from adaptation utterances; one warping factor per speaker (1, 2, 10, or 20 utterances).
    - 2-pass decoding: the 1st pass yields a hypothesized transcription; alignment and ML estimation then give a per-utterance warping factor; the 2nd pass decodes the normalized utterance.

    Databases: Aurora 4
    Task: 5000-word continuous speech recognition.
    WSJ0 (16 / 8 kHz) + artificially added noise; 2 microphones (Sennheiser, other); filtering: G712, P341; noises: car, babble, restaurant, street, airport, train station.
    Training (7138 utterances per scenario):
    - clean: Sennheiser mic
    - multi-condition: Sennheiser + other mic, 75% with artificially added noise at SNR 10-20 dB
    - noisy: Sennheiser, artificially added noise at SNR 10-20 dB
    Testing: 330 utterances (166 utterances per set, 8 speakers), SNR 5-15 dB; sets 1-7: Sennheiser microphone; sets 8-14: other microphone.

    VTLN Results, Clean Training

    Word error rate (%) on AURORA 4:

    | Method            | Clean | Car   | Tr. Station |
    |-------------------|-------|-------|-------------|
    | SIC (baseline)    | 13.15 | 24.75 | 57.50       |
    | Supervised VTLN-2 | 11.60 | 21.40 | 53.96       |
    | MLLR-2            | 11.97 | 16.28 | 46.00       |
    | MLLR-20           | 10.94 | 13.33 | 34.44       |

    VTLN Results, Multi-Condition Training

    Word error rate (%) on AURORA 4:

    | Method            | Clean | Car   | Tr. Station |
    |-------------------|-------|-------|-------------|
    | SIM (baseline)    | 19.45 | 16.50 | 29.65       |
    | Supervised VTLN-2 | 17.53 | 14.66 | 28.21       |
    | MLLR-2            | 16.54 | 15.43 | 31.27       |

    VTLN Results, Noisy Training

    Word error rate (%) on AURORA 4:

    | Method            | Clean | Car   | Tr. Station |
    |-------------------|-------|-------|-------------|
    | SIN (baseline)    | 15.73 | 16.13 | 33.37       |
    | Supervised VTLN-2 | 13.11 | 14.36 | 31.12       |
    | MLLR-2            | 14.92 | 15.14 | 34.04       |

    Future Directions for Speaker Normalization
    - Estimate warping transforms at the signal level: exploit instantaneous-amplitude or instantaneous-frequency signals to estimate the warping parameters, then normalize the signal directly.
    - Effective integration with model-based adaptation techniques (collaboration with TSI).

  • WP1: Appendix Slides (Aurora 3)

    ASR Results

    Experimental Results IIa (HTK)

    Aurora 3 Configurations
    HM: 14 states, 12 mixtures
    MM: 16 states, 6 mixtures
    WM: 16 states, 16 mixtures

  • WP1: Appendix Slides (Aurora 2)

    Baseline: Aurora 2
    Database structure: 2 training scenarios, 3 test sets, [4+4+2] conditions, 7 SNRs per condition: a total of 2x70 tests.
    Presentation of selected results: average over SNR; average over condition; training scenario (clean vs. multi-condition training); noise level (low vs. high SNR); condition (worst vs. easiest); features (MFCC+D+A vs. MFCC+D+A+CMS).
    Setup: 18 states [10-22], 3-32 mixtures, MFCC+D+A+CMS.

    Average Baseline Results: Aurora 2
    Average HTK results as reported with the database, averaged over all SNRs and all conditions. Plain: MFCC+D+A; CMS: MFCC+D+A+CMS. Mixture counts: clean training (both Plain and CMS) 3; multi-condition training, Plain 22 and CMS 32. Best: for each condition/noise, select the mixture count with the best result.

    HIWIREICCS - NTUA

    Results: Aurora 2Up to +12%

    1

    98.717598.665

    95.9562596.10125

    88.8462590.1125

    69.5187573.11125

    40.522545.8275

    16.94519.70125

    Baseline

    IA-Mean

    SNR

    Word Accuracy (%)

    Sheet1

    TIMITNTIMITTIMIT+BabbleTIMIT+WhiteTIMIT+ PinkTIMIT+ Car

    MFCC*58.442.4227.7117.7218.652.75

    TEner. CC58.8942.441.6134.7438.454.35

    MFCC*+IA-Mean59.6143.5339.2526.0331.0556.5

    MFCC*+IF-Mean59.3443.736.8725.3830.9255.3

    MFCC*+FMP59.9243.6938.626.1532.8455.97

    0.1183606557

    0.1160714286

    0.1358024691

    Sheet2

    WMMMHMAverage

    Aurora Front-End (WI007)92.9480.3151.5574.93

    MFCC+log(Ener)+D+DD+CMS93.6892.7365.1883.86

    TEnerCC+log(Ener) +CMS93.6491.6186.8590.7

    MFCC*+IA-Mean94.0592.2277.787.99

    MFCC*+IF-Mean90.7189.5272.3684.2

    MFCC*+FMP94.4192.4682.7389.87

    Sheet3

    WMMMHMAverage

    Auditory (Baseline)95.489.284.889.8

    Aud.+IF-Mean94.888.786.189.9

    Aud.+IF-Var95.488.987.490.6

    Aud.+FMP95.8898990.7

    Aud.+FZC95.695.686.390.3

    Aud.+IA-Mean95.589.48690.3

    Sheet3

    000000

    000000

    000000

    000000

    Auditory

    +IF-Mean

    +IF-Var

    +FMP

    +FZC

    +IA-Mean

    Word Accuracy (%)

    Aurora3 (Spanish Task)

    AURORA2TESTC

    clean20 dB15 dB10 dB5 dB0 dB

    Baseline9695.6788.727568.912538.4415.6675

    GaborFMP98.707595.6788.727568.912538.4415.6675

    AURORA2TESTC

    00

    00

    00

    00

    00

    00

    Baseline

    GaborFMP

    SNR

    Word Accuracy (%)

    AURORA2

    AURORA2 Test B, Word Accuracy (%):

    Baseline:
    Noise     clean   20 dB   15 dB   10 dB   5 dB    0 dB
    N1        98.65   95.88   87.04   66.75   38.78   14.43
    N2        98.52   96.43   90.84   71.61   44.41   20.31
    N3        98.72   96.00   88.52   70.03   43.54   21.32
    N4        98.98   96.67   89.11   72.05   42.55   15.67
    Average   98.72   96.25   88.88   70.11   42.32   17.93

    GaborFMP:
    Noise     clean   20 dB   15 dB   10 dB   5 dB    0 dB
    N1        98.65   96.38   88.33   70.22   42.95   17.38
    N2        98.46   96.52   91.81   74.94   50.21   23.31
    N3        98.69   96.12   90.52   74.29   48.05   24.04
    N4        98.86   96.79   90.90   76.12   48.04   20.46
    Average   98.67   96.45   90.39   73.89   47.31   21.30

    AURORA2 Test A, Word Accuracy (%):

    Baseline:
    Noise     clean   20 dB   15 dB   10 dB   5 dB    0 dB
    N1        98.65   95.67   89.38   71.81   42.40   22.32
    N2        98.52   96.01   87.58   64.57   34.49   12.27
    N3        98.72   96.27   90.58   72.56   44.32   17.45
    N4        98.98   94.72   87.72   66.77   33.69   11.79
    Average   98.72   95.67   88.82   68.93   38.73   15.96

    IA-Mean:
    Noise     clean   20 dB   15 dB   10 dB   5 dB    0 dB
    N1        98.65   95.67   90.02   74.18   48.82   24.13
    N2        98.46   96.55   89.51   69.41   40.45   14.54
    N3        98.69   96.09   91.38   75.69   48.79   21.44
    N4        98.86   94.69   88.43   70.04   39.31   12.31
    Average   98.67   95.75   89.84   72.33   44.34   18.11

    Average over Test A + Test B:
               clean   20 dB   15 dB   10 dB   5 dB    0 dB
    Baseline   98.72   95.96   88.85   69.52   40.52   16.95
    IA-Mean    98.67   96.10   90.11   73.11   45.83   19.70



    HIWIRE ICCS-NTUA

    Results: Aurora 2 (up to +40%)

    AURORA2, Baseline vs +FDCD, Word Accuracy (%):

               clean   20 dB   15 dB   10 dB   5 dB    0 dB    Average
    Baseline   98.66   95.70   89.05   71.42   43.45   16.68   63.26
    +FDCD      98.58   96.34   92.72   82.92   59.02   21.36   75.16

    Per test set:
               Set A   Set B   Set C   Average   W. Average
    Baseline   61.48   62.92   77.55   67.32     65.27
    +FDCD      70.06   71.19   82.37   74.54     72.97

    Set C detail (MIRS channel):

    Train (MIRS):
               clean   20 dB   15 dB   10 dB   5 dB
    Baseline   98.59   94.24   87.80   73.21   47.48
    +FDCD      98.56   96.14   91.48   77.80   52.26

    Street (MIRS):
               clean   20 dB   15 dB   10 dB   5 dB
    Baseline   98.62   95.36   91.12   81.95   59.46
    +FDCD      98.40   96.11   93.39   83.16   58.39

    Results: Aurora 2 (up to +27%)

    AURORA2, Baseline vs MFD, Word Accuracy (%):

    Overall:
               clean   20 dB   15 dB   10 dB   5 dB    0 dB    Average
    Baseline   98.69   95.72   89.12   71.50   43.75   19.55   69.72
    MFD        98.73   96.37   91.64   79.11   52.80   21.70   73.39

    Per test set (Baseline / MFD):
               clean         20 dB         15 dB         10 dB         5 dB          0 dB
    Test A     98.72/98.74   95.67/96.42   88.82/91.66   68.93/77.75   38.73/49.39   15.96/18.68
    Test B     98.72/98.74   96.25/96.98   88.88/91.99   70.11/78.80   42.32/51.99   17.93/20.92
    Test C     98.65/98.72   95.26/95.70   89.67/91.28   75.46/80.80   50.20/57.02   24.75/25.50

    Results: Aurora 2 (up to +61%)

    AURORA2 Test A (Subway, Babble, Car, Exhibition), Word Accuracy (%), averages:

                20 dB   10 dB   5 dB
    Baseline    95.75   69.30   38.93
    +FMP        95.60   77.62   53.69
    +FDCD       96.46   82.93   58.66
    +FMP+FDCD   96.17   82.24   61.84

    Per-noise results (20 dB / 10 dB / 5 dB):

                Subway              Babble              Car                 Exhibition
    Baseline    95.80/72.00/42.80   96.20/64.90/34.30   96.20/73.00/44.70   94.80/67.30/33.90
    +FMP        94.81/77.25/55.97   96.83/80.35/55.11   95.74/79.12/55.00   95.03/73.77/48.69
    +FDCD       96.02/82.97/60.70   97.35/82.99/57.06   96.62/85.02/59.67   95.86/80.74/57.21
    +FMP+FDCD   95.64/81.98/64.09   97.18/85.55/62.89   96.15/81.54/62.49   95.70/79.89/57.88


    Aurora 2 Distributed, Multicondition Training


    Aurora 2 Distributed, Clean Training

  • WP1 Appendix Slides: Audio-Visual Details


    Introduction: Motivations for AV-ASR

    • Audio-only ASR does not work reliably in many scenarios:
      - Noisy background (e.g. car cabin, cockpit)
      - Interference between talkers
      - Need to enhance the auditory signal when it is not reliable
    • Human speech perception is multimodal:
      - Different modalities are weighted according to their reliability
      - Hearing-impaired people can lipread
      - McGurk effect (McGurk & MacDonald, 1976)
    • Machines should also be able to exploit multimodal information


    Audio-Visual Feature Fusion

    • Audio-visual feature integration is highly non-trivial:
      - Audio and visual speech asynchrony (~100 ms)
      - The relative reliability of the streams can vary wildly
    • Many approaches to feature fusion in the literature:
      - Early integration
      - Intermediate integration
      - Late integration
    • Highly active research area (mainly in machine learning)
    • The class of Dynamic Bayesian Networks (DBNs) seems particularly suited to the problem:
      - Stream interaction is explicitly modeled
      - Model parameter inference is more difficult than in HMMs
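    The simplest of these schemes, early integration, can be made concrete: resample the slower visual stream to the audio frame rate and concatenate the two feature vectors per frame. A minimal numpy sketch, assuming linear interpolation is acceptable for the rate conversion (`early_integration` and the feature dimensions are illustrative, not the project's actual pipeline):

    ```python
    import numpy as np

    def early_integration(audio_feats, visual_feats, audio_rate=100.0, visual_rate=30.0):
        """Early (feature-level) AV fusion: upsample the visual stream to the
        audio frame rate by linear interpolation, then concatenate per frame."""
        t_audio = np.arange(audio_feats.shape[0]) / audio_rate
        t_visual = np.arange(visual_feats.shape[0]) / visual_rate
        # Interpolate each visual feature dimension onto the audio time axis;
        # np.interp holds the edge values constant beyond the last video frame.
        visual_up = np.column_stack([
            np.interp(t_audio, t_visual, visual_feats[:, d])
            for d in range(visual_feats.shape[1])
        ])
        return np.concatenate([audio_feats, visual_up], axis=1)

    # 1 s of speech: 100 audio frames (39-dim MFCCs) + 30 video frames (15-dim shape/appearance)
    av = early_integration(np.random.randn(100, 39), np.random.randn(30, 15))
    print(av.shape)  # (100, 54)
    ```

    Late integration would instead decode each stream separately and combine the hypothesis scores; intermediate schemes (multi-stream HMMs, DBNs) combine the streams at the state level.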


    Visual Front-End: AAM Parameters

    • First frame of each of the 36 videos manually annotated
    • 68 points on the whole face as shape landmarks
    • Color appearance sampled at 10000 pixels
    • Eigenvectors retained explain 70% of the variance: 5 eigenshapes & 10 eigenfaces
    • Initial condition at each new frame: the converged solution from the previous frame
    • Inverse-compositional gradient-descent fitting algorithm
    • Coarse-to-fine refinement (Gaussian pyramid, 3 scales)
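    The "70% variance" retention rule above is a standard PCA truncation: keep the smallest number of leading eigenvectors whose cumulative explained variance reaches the target. A minimal numpy sketch (the helper name is hypothetical; the real AAM applies this separately to the shape landmarks and to the appearance pixels):

    ```python
    import numpy as np

    def retain_by_variance(samples, var_fraction=0.70):
        """PCA via SVD of the mean-centered data matrix; return the leading
        principal axes that together explain `var_fraction` of total variance."""
        centered = samples - samples.mean(axis=0)
        _, s, vt = np.linalg.svd(centered, full_matrices=False)
        explained = (s ** 2) / np.sum(s ** 2)  # per-component variance ratio
        k = int(np.searchsorted(np.cumsum(explained), var_fraction) + 1)
        return vt[:k], k

    # Toy data with one dominant direction: 70% of the variance needs only 1 axis.
    data = np.array([[10.0, 0, 0], [-10, 0, 0],
                     [0, 1, 0], [0, -1, 0],
                     [0, 0, 0.1], [0, 0, -0.1]])
    axes, k = retain_by_variance(data)
    print(k)  # 1
    ```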


    AV-ASR Experiment Setup

    • Features:
      - Audio: 39 features (MFCC_D_A)
      - Visual (upsampled from 30 Hz to 100 Hz): 5 shape features (Sh) + 10 appearance features (App)
      - Audio-visual: 39+45 features (MFCC_D_A+SHAPP_D_A)
    • Two-stream HMM:
      - 8-state, left-to-right, whole-digit models with no state skipping
      - Single-Gaussian observation probability densities
      - Separate audio & video feature streams with equal weights (1,1)
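    In a two-stream HMM, each state scores the audio and video observations with separate densities, and the stream weights act as exponents on the stream likelihoods; with equal weights (1,1) this reduces to treating the streams as conditionally independent given the state. A minimal sketch for one state with diagonal-covariance Gaussians (`two_stream_loglik` and the state-dict layout are illustrative, not the actual toolkit API):

    ```python
    import numpy as np

    def diag_gauss_loglik(x, mean, var):
        """Log N(x; mean, diag(var)) for a diagonal-covariance Gaussian."""
        return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

    def two_stream_loglik(x_audio, x_video, state, w_audio=1.0, w_video=1.0):
        """Two-stream state score: log b(o) = w_a * log b_a(o_a) + w_v * log b_v(o_v),
        i.e. the stream weights exponentiate the per-stream likelihoods."""
        la = diag_gauss_loglik(x_audio, state["mu_a"], state["var_a"])
        lv = diag_gauss_loglik(x_video, state["mu_v"], state["var_v"])
        return w_audio * la + w_video * lv

    # One state with a 39-dim audio and a 45-dim video Gaussian (toy parameters).
    state = {"mu_a": np.zeros(39), "var_a": np.ones(39),
             "mu_v": np.zeros(45), "var_v": np.ones(45)}
    score = two_stream_loglik(np.zeros(39), np.zeros(45), state)
    ```

    Down-weighting the video stream (w_v < 1) is the usual way to reflect its lower reliability; the (1,1) setting used in these experiments is the unweighted baseline.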

  • WP1 Appendix Slides: Aurora 4


    Aurora 4, Multi-Condition Training


    Aurora 4, Noisy Training


    Aurora 4, Noisy Training

