ICCS-NTUA : WP1+WP2

ICCS-NTUA : WP1+WP2
Prof. Petros Maragos, NTUA, School of ECE. URL: http://cvsp.cs.ntua.gr
Computer Vision, Speech Communication and Signal Processing Research Group
HIWIRE
Transcript
  • Computer Vision, Speech Communication and Signal Processing Research Group (HIWIRE)

    HIWIRE / ICCS-NTUA

    HIWIRE: Involved CVSP Members

    Group Leader: Prof. Petros Maragos
    Ph.D. Students / Graduate Research Assistants:
    D. Dimitriadis (speech: recognition, modulations)
    V. Pitsikalis (speech: recognition, fractals/chaos, NLP)
    A. Katsamanis (speech: modulations, statistical processing, recognition)
    G. Papandreou (vision: PDEs, active contours, level sets, AV-ASR)
    G. Evangelopoulos (vision/speech: texture, modulations, fractals)
    S. Leykimiatis (speech: statistical processing, microphone arrays)

    ICCS-NTUA in HIWIRE: 1st Year (status)

    Evaluation: Databases (completed); Baseline (completed)
    WP1: Noise-Robust Features (1st-year results: modulation features, fractal features); Audio-Visual ASR (baseline + visual features); Multi-microphone array (exploratory phase); VAD (preliminary results)
    WP2: Speaker Normalization (baseline); Non-native Speech Database (completed)

    WP1: Noise Robustness
    Platform: HTK
    Baseline + evaluation: Aurora 2, Aurora 3, TIMIT+NOISE
    Modulation features: AM-FM modulations, Teager energy cepstrum
    Fractal features: dynamical denoising, correlation dimension, multiscale fractal dimension
    Hybrid/merged features: up to +62% (Aurora 3), up to +36% (Aurora 2), up to +61% (Aurora 2)

    Speech Modulation Features
    Filterbank design
    Short-term AM-FM modulation features:
    - short-term mean instantaneous amplitude (IA-Mean)
    - short-term mean instantaneous frequency (IF-Mean)
    - frequency modulation percentages (FMP)
    Short-term energy modulation features:
    - average Teager energy cepstrum coefficients (TECC)

    Modulation Acoustic Features: processing pipeline
    Blocks: speech signal; nonlinear processing; multiband filtering + regularization; demodulation; statistical processing; robust feature transformation/selection; V.A.D.
    Energy features: Teager energy cepstrum coefficients (TECC)
    AM-FM modulation features: mean instantaneous amplitude (IA-Mean), mean instantaneous frequency (IF-Mean), frequency modulation percentages (FMP)
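The demodulation block above is typically realized with the Teager-Kaiser energy operator plus an energy separation algorithm. The sketch below is illustrative only, not the project's actual front end (the filterbank, regularization, and short-term statistics are omitted); it shows how DESA-2 estimates instantaneous amplitude and frequency directly from a discrete signal:

```python
import numpy as np

def teager(x):
    """Teager-Kaiser energy operator: Psi[x](n) = x(n)^2 - x(n-1) x(n+1)."""
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def desa2(x):
    """DESA-2 energy separation: per-sample amplitude/frequency estimates."""
    psi_x = teager(x)[1:-1]           # trimmed to align with psi_y below
    y = x[2:] - x[:-2]                # y(n) = x(n+1) - x(n-1)
    psi_y = teager(y)
    ratio = np.clip(1.0 - psi_y / (2.0 * psi_x), -1.0, 1.0)
    omega = 0.5 * np.arccos(ratio)    # instantaneous frequency, rad/sample
    amp = 2.0 * psi_x / np.sqrt(psi_y)
    return amp, omega

# sanity check on a pure sinusoid: 500 Hz tone at 8 kHz, amplitude 0.7
fs, f0 = 8000.0, 500.0
x = 0.7 * np.cos(2 * np.pi * f0 / fs * np.arange(1000))
amp, omega = desa2(x)
f_est = omega * fs / (2 * np.pi)
```

Frame-level IA-Mean and IF-Mean features would then be short-term means of `amp` and `f_est` over analysis windows.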

    TIMIT-based Speech Databases
    TIMIT database: training set 3696 sentences (~35 phonemes/utterance); testing set 1344 utterances, 46680 phonemes; sampling frequency 16 kHz.
    Feature vectors: MFCC+C0+AM-FM plus 1st and 2nd time derivatives. Stream weights: 1 for MFCC, 2 for AM-FM.
    Models: 3-state left-right HMMs, 16 mixtures; all-pair unweighted grammar; performance criterion: phone accuracy rates (%); back-end system: HTK v3.2.0.
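Phone accuracy in HTK's sense penalizes insertions as well as substitutions and deletions. A small helper makes the criterion explicit; the error counts below are hypothetical, and only the 46680-phoneme test-set size comes from the slide:

```python
def htk_scores(n_ref, dels, subs, ins):
    """HTK-style %Correct and %Accuracy from alignment error counts."""
    correct = 100.0 * (n_ref - dels - subs) / n_ref
    accuracy = 100.0 * (n_ref - dels - subs - ins) / n_ref
    return correct, accuracy

# hypothetical error counts against the 46680 test phonemes above
corr, acc = htk_scores(46680, dels=4000, subs=12000, ins=1500)
```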

    Results: TIMIT+Noise (up to +106% relative improvement)


    Phone accuracy (%) on the TIMIT-based databases:

    | Features      | TIMIT | NTIMIT | TIMIT+Babble | TIMIT+White | TIMIT+Pink | TIMIT+Car |
    |---------------|-------|--------|--------------|-------------|------------|-----------|
    | MFCC*         | 58.40 | 42.42  | 27.71        | 17.72       | 18.60      | 52.75     |
    | TEner. CC     | 58.89 | 42.40  | 41.61        | 34.74       | 38.40      | 54.35     |
    | MFCC*+IA-Mean | 59.61 | 43.53  | 39.25        | 26.03       | 31.05      | 56.50     |
    | MFCC*+IF-Mean | 59.34 | 43.70  | 36.87        | 25.38       | 30.92      | 55.30     |
    | MFCC*+FMP     | 59.92 | 43.69  | 38.60        | 26.15       | 32.84      | 55.97     |
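The "+106%" headline is consistent with a relative accuracy improvement over the MFCC* baseline; for instance, the TIMIT+Pink pair (MFCC* 18.6 vs. TEner. CC 38.4) gives:

```python
def relative_gain(baseline_acc, new_acc):
    """Relative improvement (%) of an accuracy over a baseline accuracy."""
    return 100.0 * (new_acc - baseline_acc) / baseline_acc

# TIMIT+Pink: MFCC* scores 18.6, TEner. CC scores 38.4
gain = relative_gain(18.6, 38.4)   # about +106%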


    Word accuracy (%) on Aurora 3 (WM: well-matched, MM: medium-mismatch, HM: high-mismatch):

    | Features                 | WM    | MM    | HM    | Average |
    |--------------------------|-------|-------|-------|---------|
    | Aurora Front-End (WI007) | 92.94 | 80.31 | 51.55 | 74.93   |
    | MFCC+log(Ener)+D+DD+CMS  | 93.68 | 92.73 | 65.18 | 83.86   |
    | TEnerCC+log(Ener)+CMS    | 93.64 | 91.61 | 86.85 | 90.70   |
    | MFCC*+IA-Mean            | 94.05 | 92.22 | 77.70 | 87.99   |
    | MFCC*+IF-Mean            | 90.71 | 89.52 | 72.36 | 84.20   |
    | MFCC*+FMP                | 94.41 | 92.46 | 82.73 | 89.87   |


    Word accuracy (%) on Aurora 3, Spanish task (auditory front end):

    | Features            | WM   | MM   | HM   | Average |
    |---------------------|------|------|------|---------|
    | Auditory (Baseline) | 95.4 | 89.2 | 84.8 | 89.8    |
    | Aud.+IF-Mean        | 94.8 | 88.7 | 86.1 | 89.9    |
    | Aud.+IF-Var         | 95.4 | 88.9 | 87.4 | 90.6    |
    | Aud.+FMP            | 95.8 | 89.0 | 89.0 | 90.7    |
    | Aud.+FZC            | 95.6 | 95.6 | 86.3 | 90.3    |
    | Aud.+IA-Mean        | 95.5 | 89.4 | 86.0 | 90.3    |


    Aurora 3 - Spanish
    Connected digits, sampling frequency 8 kHz.
    Training sets:
    - WM (well-matched): 3392 utterances (quiet 532, low noise 1668, high noise 1192)
    - MM (medium-mismatch): 1607 utterances (quiet 396, low noise 1211)
    - HM (high-mismatch): 1696 utterances (quiet 266, low noise 834, high noise 596)
    Testing sets:
    - WM: 1522 utterances (quiet 260, low noise 754, high noise 508), 8056 digits
    - MM: 850 utterances (all high noise), 4543 digits
    - HM: 631 utterances (quiet 0, low noise 377, high noise 254), 3325 digits
    2 back-end ASR systems ( and BLasr). Feature vectors: MFCC+AM-FM (or Auditory+AM-FM), TECC. All-pair unweighted grammar (or word-pair grammar). Performance criterion: word (digit) accuracy rates.

    Results: Aurora 3 (HTK), up to +62% relative improvement


    Databases: Aurora 2
    Task: speaker-independent recognition of digit sequences (TI-Digits at 8 kHz).
    Training (8440 utterances per scenario, 55 male / 55 female speakers):
    - clean (8 kHz, G712)
    - multi-condition (8 kHz, G712): 4 artificial noises (subway, babble, car, exhibition) at 5 SNRs (5, 10, 15, 20 dB, clean)
    Testing (artificially added noise, 7 SNRs: -5, 0, 5, 10, 15, 20 dB, clean):
    - Set A: noises as in multi-condition training, G712 (28028 utterances)
    - Set B: restaurant, street, airport, train station, G712 (28028 utterances)
    - Set C: subway, street (MIRS filtering) (14014 utterances)

    Results: Aurora 2 (up to +12% relative improvement)

    Word accuracy (%), averaged over Test A + Test B:

    | Features | clean | 20 dB | 15 dB | 10 dB | 5 dB  | 0 dB  |
    |----------|-------|-------|-------|-------|-------|-------|
    | Baseline | 98.72 | 95.96 | 88.85 | 69.52 | 40.52 | 16.95 |
    | +IA-Mean | 98.67 | 96.10 | 90.11 | 73.11 | 45.83 | 19.70 |


    Work To Be Done on Modulation Features

    Fractal Features: processing pipeline
    Speech signal -> noisy N-d embedding -> geometrical filtering (local SVD) -> cleaned N-d embedding -> Filtered Dynamics Correlation Dimension (FDCD)
    Speech signal -> Multiscale Fractal Dimension (MFD)
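To make the dynamical part of the pipeline concrete, here is a generic sketch (not the HIWIRE implementation; the local-SVD geometrical filtering step is omitted): a scalar signal is delay-embedded and a Grassberger-Procaccia correlation sum estimates the correlation dimension. A noise-free sinusoid embeds onto a closed curve, so the estimate should land near 1:

```python
import numpy as np

def delay_embed(x, dim, tau):
    """Time-delay embedding of a scalar series into R^dim."""
    n = len(x) - (dim - 1) * tau
    return np.stack([x[i * tau : i * tau + n] for i in range(dim)], axis=1)

def correlation_sum(emb, r):
    """Grassberger-Procaccia C(r): fraction of point pairs closer than r."""
    d = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)
    iu = np.triu_indices(len(emb), k=1)
    return float(np.mean(d[iu] < r))

# a noise-free sinusoid traces a closed curve in the embedding space,
# so the two-radius slope estimate of the correlation dimension is ~1
x = np.sin(0.063 * np.arange(800))
emb = delay_embed(x, dim=3, tau=10)
r1, r2 = 0.15, 0.3
d2 = np.log(correlation_sum(emb, r2) / correlation_sum(emb, r1)) / np.log(r2 / r1)
```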


    Results: Aurora 2 (FDCD, up to +40% relative improvement)

    Word accuracy (%):

    | Features | clean | 20 dB | 15 dB | 10 dB | 5 dB  | 0 dB  | Average |
    |----------|-------|-------|-------|-------|-------|-------|---------|
    | Baseline | 98.66 | 95.70 | 89.05 | 71.42 | 43.45 | 16.68 | 63.26   |
    | +FDCD    | 98.58 | 96.34 | 92.72 | 82.92 | 59.02 | 21.36 | 75.16   |

    Results: Aurora 2 (MFD, up to +27% relative improvement)

    Word accuracy (%):

    | Features | clean | 20 dB | 15 dB | 10 dB | 5 dB  | 0 dB  | Average |
    |----------|-------|-------|-------|-------|-------|-------|---------|
    | Baseline | 98.69 | 95.72 | 89.12 | 71.50 | 43.75 | 19.55 | 69.72   |
    | +MFD     | 98.73 | 96.37 | 91.64 | 79.11 | 52.80 | 21.70 | 73.39   |

    Results: Aurora 2 (hybrid features, up to +61% relative improvement)

    Word accuracy (%), averaged over the Test A noises (subway, babble, car, exhibition):

    | Features   | 20 dB | 10 dB | 5 dB  |
    |------------|-------|-------|-------|
    | Baseline   | 95.75 | 69.30 | 38.93 |
    | +FMP       | 95.60 | 77.62 | 53.69 |
    | +FDCD      | 96.46 | 82.93 | 58.66 |
    | +FMP+FDCD  | 96.17 | 82.24 | 61.84 |

    Future Directions on Fractal Features
    - Refine fractal feature extraction.
    - Application to Aurora 3.
    - Fusion with other features.

    Visual Front-End
    Aim: extract a low-dimensional visual speech feature vector from video.
    Visual front-end modules: speaker's face detection; ROI tracking; facial model fitting; visual feature extraction.
    Challenges: very high-dimensional signal (which features are proper?); robustness; computational efficiency.

    Face Modeling
    A well-studied problem in Computer Vision: Active Appearance Models, Morphable Models, Active Blobs.
    Both shape and appearance can enhance lipreading; the shape and appearance of human faces live in low-dimensional manifolds.
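The low-dimensional-manifold observation is what AAM-style statistical shape models exploit: landmark shapes are approximated by a mean shape plus a few PCA modes. A self-contained sketch on synthetic landmark data (the shapes, landmark count, and mode count are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic "shapes": 2-D landmark sets generated from 3 latent modes,
# standing in for the low-dimensional face-shape manifold
n_shapes, n_landmarks, n_modes = 200, 30, 3
modes = rng.normal(size=(n_modes, 2 * n_landmarks))
coeffs = rng.normal(size=(n_shapes, n_modes))
mean_shape = rng.normal(size=2 * n_landmarks)
shapes = mean_shape + coeffs @ modes

# PCA via SVD: the statistical shape model used by AAM-style methods
centered = shapes - shapes.mean(axis=0)
_, s, vt = np.linalg.svd(centered, full_matrices=False)
explained = np.cumsum(s ** 2) / np.sum(s ** 2)

# project each shape onto the top 3 modes and reconstruct
b = centered @ vt[:3].T
recon = shapes.mean(axis=0) + b @ vt[:3]
err = float(np.max(np.abs(recon - shapes)))
```

Since the synthetic data has exactly 3 latent modes, 3 principal components capture essentially all variance and the reconstruction is near-exact.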

    Image Fitting Example: fitting iterations shown at steps 2, 6, 10, 14, and 18.

    Example: Face Interpretation Using AAMs
    Generative models like AAMs let us evaluate the output of the visual front-end: the original video, the shape track superimposed on the original video, and the reconstructed face. The reconstruction is what the visual-only speech recognizer actually "sees".

    Evaluation on the CUAVE Database

    Audio-Visual ASR: Database
    Subset of the CUAVE database used: 36 speakers (30 training, 6 testing); 5 sequences of 10 connected digits per speaker. Training set: 1500 digits (30x5x10); test set: 300 digits (6x5x10).
    CUAVE also contains more complex data sets (speaker moving around, profile views, continuous digits, two speakers) to be used in future evaluations.
    CUAVE was kindly provided by Clemson University.

    Recognition Results (Word Accuracy)
    Data: training ~500 digits (29 speakers); testing ~100 digits (4 speakers).

    Future Work
    Visual front-end: better-trained AAM; temporal tracking.
    Feature fusion: experimentation with alternative DBN architectures; automatic stream weight determination; integration with nonlinear acoustic features.
    Experiments on other audio-visual databases; systematic evaluation of visual features.
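The "automatic stream weight determination" item concerns how audio and visual stream log-likelihoods are combined in multi-stream decoding. A toy decision-level sketch (the class log-likelihoods are invented; real systems tune the weight on held-out data or estimate it from stream reliability):

```python
import numpy as np

def fuse_streams(ll_audio, ll_video, lam):
    """Multi-stream log-likelihood fusion: lam weights audio, 1-lam video."""
    return lam * ll_audio + (1.0 - lam) * ll_video

# toy per-class log-likelihoods for a 3-class decision
ll_a = np.array([-1.0, -2.0, -3.0])   # audio evidence prefers class 0
ll_v = np.array([-5.0, -1.0, -4.0])   # visual evidence prefers class 1

audio_heavy_choice = int(np.argmax(fuse_streams(ll_a, ll_v, lam=0.9)))
video_heavy_choice = int(np.argmax(fuse_streams(ll_a, ll_v, lam=0.1)))
```

Shifting the weight toward the more reliable stream changes the decision, which is exactly why the weight should track acoustic conditions.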


    WP2: User Robustness, Speaker Adaptation
    VTLN baseline: platform HTK; database AURORA 4; Fs = 8 kHz; training and testing scenarios; comparison with MLLR.
    Collection of non-native speech data completed: 10 speakers, 100 utterances per speaker.

    Vocal Tract Length Normalization (VTLN)
    Implementation: HTK.
    Warping factor estimation: maximum-likelihood (ML) criterion.
    Figures from Hain99, Lee96.
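A common VTLN realization, sketched here under the usual piecewise-linear assumption (the cutoff value and grid are illustrative, not taken from the slides, and HTK's internal details are not reproduced), warps the frequency axis by a factor alpha and picks alpha by maximizing likelihood over a small grid:

```python
import numpy as np

def plw_warp(freqs, alpha, f_cut=0.8, f_nyq=4000.0):
    """Piecewise-linear VTLN warp: scale by alpha below the cutoff, then
    interpolate linearly so the Nyquist frequency maps onto itself."""
    freqs = np.asarray(freqs, dtype=float)
    fc = f_cut * f_nyq
    lo = alpha * freqs
    hi = alpha * fc + (f_nyq - alpha * fc) * (freqs - fc) / (f_nyq - fc)
    return np.where(freqs <= fc, lo, hi)

def estimate_alpha(loglik, alphas=np.arange(0.88, 1.1201, 0.02)):
    """ML warping-factor estimation: pick the grid alpha maximizing a
    caller-supplied log-likelihood (standing in for a forced-alignment
    score of the warped features against the speaker's transcription)."""
    scores = [loglik(a) for a in alphas]
    return float(alphas[int(np.argmax(scores))])

# toy likelihood peaked at alpha = 1.04
best = estimate_alpha(lambda a: -(a - 1.04) ** 2)
```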

    VTLN
    Training: AURORA 4 baseline setup; clean (SIC), multi-condition (SIM), noisy (SIN).
    Testing:
    - Supervised VTLN: estimate the warping factor from adaptation utterances; one warping factor per speaker (1, 2, 10, or 20 utterances).
    - 2-pass decoding: the 1st pass yields a hypothesized transcription; alignment and ML estimation then give a per-utterance warping factor; the 2nd pass decodes the normalized utterance.

    Databases: Aurora 4
    Task: 5000-word continuous speech recognition.
    WSJ0 (16 / 8 kHz) + artificially added noise; 2 microphones (Sennheiser, other); filtering: G712, P341; noises: car, babble, restaurant, street, airport, train station.
    Training (7138 utterances per scenario):
    - clean: Sennheiser mic
    - multi-condition: Sennheiser + other mic, 75% with artificially added noise at SNR 10-20 dB
    - noisy: Sennheiser, artificially added noise at SNR 10-20 dB
    Testing: 330 utterances (166 utterances per set, 8 speakers), SNR 5-15 dB; sets 1-7: Sennheiser microphone; sets 8-14: other microphone.

    VTLN Results, Clean Training

    Word error rate (%) on AURORA 4:

    | Method            | Clean | Car   | Tr. Station |
    |-------------------|-------|-------|-------------|
    | SIC (baseline)    | 13.15 | 24.75 | 57.50       |
    | Supervised VTLN-2 | 11.60 | 21.40 | 53.96       |
    | MLLR-2            | 11.97 | 16.28 | 46.00       |
    | MLLR-20           | 10.94 | 13.33 | 34.44       |

    VTLN Results, Multi-Condition Training

    Word error rate (%) on AURORA 4:

    | Method            | Clean | Car   | Tr. Station |
    |-------------------|-------|-------|-------------|
    | SIM (baseline)    | 19.45 | 16.50 | 29.65       |
    | Supervised VTLN-2 | 17.53 | 14.66 | 28.21       |
    | MLLR-2            | 16.54 | 15.43 | 31.27       |

    VTLN Results, Noisy Training

    Word error rate (%) on AURORA 4:

    | Method            | Clean | Car   | Tr. Station |
    |-------------------|-------|-------|-------------|
    | SIN (baseline)    | 15.73 | 16.13 | 33.37       |
    | Supervised VTLN-2 | 13.11 | 14.36 | 31.12       |
    | MLLR-2            | 14.92 | 15.14 | 34.04       |

    Future Directions for Speaker Normalization
    - Estimate warping transforms at the signal level: exploit instantaneous-amplitude or instantaneous-frequency signals to estimate the warping parameters, then normalize the signal directly.
    - Effective integration with model-based adaptation techniques (collaboration with TSI).

  • WP1: Appendix Slides (Aurora 3)

    ASR Results

    Experimental Results IIa (HTK)

    Aurora 3 Configurations
    HM: 14 states, 12 mixtures
    MM: 16 states, 6 mixtures
    WM: 16 states, 16 mixtures

  • WP1: Appendix Slides (Aurora 2)

    Baseline: Aurora 2
    Database structure: 2 training scenarios, 3 test sets, [4+4+2] conditions, 7 SNRs per condition: a total of 2x70 tests.
    Presentation of selected results: average over SNR; average over condition; training scenario (clean vs. multi-condition training); noise level (low vs. high SNR); condition (worst vs. easiest); features (MFCC+D+A vs. MFCC+D+A+CMS).
    Setup: 18 states [10-22], 3-32 mixtures, MFCC+D+A+CMS.

    Average Baseline Results: Aurora 2
    Average HTK results as reported with the database, averaged over all SNRs and all conditions. Plain: MFCC+D+A; CMS: MFCC+D+A+CMS. Mixture counts: clean training (both Plain and CMS) 3; multi-condition training, Plain 22 and CMS 32. Best: for each condition/noise, select the mixture count with the best result.

    HIWIREICCS - NTUA

    Results: Aurora 2Up to +12%

    1

    98.717598.665

    95.9562596.10125

    88.8462590.1125

    69.5187573.11125

    40.522545.8275

    16.94519.70125

    Baseline

    IA-Mean

    SNR

    Word Accuracy (%)

    Sheet1

    TIMITNTIMITTIMIT+BabbleTIMIT+WhiteTIMIT+ PinkTIMIT+ Car

    MFCC*58.442.4227.7117.7218.652.75

    TEner. CC58.8942.441.6134.7438.454.35

    MFCC*+IA-Mean59.6143.5339.2526.0331.0556.5

    MFCC*+IF-Mean59.3443.736.8725.3830.9255.3

    MFCC*+FMP59.9243.6938.626.1532.8455.97

    0.1183606557

    0.1160714286

    0.1358024691

    Sheet2

    WMMMHMAverage

    Aurora Front-End (WI007)92.9480.3151.5574.93

    MFCC+log(Ener)+D+DD+CMS93.6892.7365.1883.86

    TEnerCC+log(Ener) +CMS93.6491.6186.8590.7

    MFCC*+IA-Mean94.0592.2277.787.99

    MFCC*+IF-Mean90.7189.5272.3684.2

    MFCC*+FMP94.4192.4682.7389.87

    Sheet3

    WMMMHMAverage

    Auditory (Baseline)95.489.284.889.8

    Aud.+IF-Mean94.888.786.189.9

    Aud.+IF-Var95.488.987.490.6

    Aud.+FMP95.8898990.7

    Aud.+FZC95.695.686.390.3

    Aud.+IA-Mean95.589.48690.3

    Sheet3

    000000

    000000

    000000

    000000

    Auditory

    +IF-Mean

    +IF-Var

    +FMP

    +FZC

    +IA-Mean

    Word Accuracy (%)

    Aurora3 (Spanish Task)

    AURORA2TESTC

    clean20 dB15 dB10 dB5 dB0 dB

    Baseline9695.6788.727568.912538.4415.6675

    GaborFMP98.707595.6788.727568.912538.4415.6675

    AURORA2TESTC

    00

    00

    00

    00

    00

    00

    Baseline

    GaborFMP

    SNR

    Word Accuracy (%)

    AURORA2

    AURORA2 Test B, Word Accuracy (%):

    Baseline:
    Noise     clean   20 dB   15 dB   10 dB   5 dB    0 dB
    N1        98.65   95.88   87.04   66.75   38.78   14.43
    N2        98.52   96.43   90.84   71.61   44.41   20.31
    N3        98.72   96.00   88.52   70.03   43.54   21.32
    N4        98.98   96.67   89.11   72.05   42.55   15.67
    Average   98.72   96.25   88.88   70.11   42.32   17.93

    GaborFMP:
    Noise     clean   20 dB   15 dB   10 dB   5 dB    0 dB
    N1        98.65   96.38   88.33   70.22   42.95   17.38
    N2        98.46   96.52   91.81   74.94   50.21   23.31
    N3        98.69   96.12   90.52   74.29   48.05   24.04
    N4        98.86   96.79   90.90   76.12   48.04   20.46
    Average   98.67   96.45   90.39   73.89   47.31   21.30

    AURORA2 Test A, Word Accuracy (%):

    Baseline:
    Noise     clean   20 dB   15 dB   10 dB   5 dB    0 dB
    N1        98.65   95.67   89.38   71.81   42.40   22.32
    N2        98.52   96.01   87.58   64.57   34.49   12.27
    N3        98.72   96.27   90.58   72.56   44.32   17.45
    N4        98.98   94.72   87.72   66.77   33.69   11.79
    Average   98.72   95.67   88.82   68.93   38.73   15.96

    IA-Mean:
    Noise     clean   20 dB   15 dB   10 dB   5 dB    0 dB
    N1        98.65   95.67   90.02   74.18   48.82   24.13
    N2        98.46   96.55   89.51   69.41   40.45   14.54
    N3        98.69   96.09   91.38   75.69   48.79   21.44
    N4        98.86   94.69   88.43   70.04   39.31   12.31
    Average   98.67   95.75   89.84   72.33   44.34   18.11

    Average over Test A + Test B:
               clean   20 dB   15 dB   10 dB   5 dB    0 dB
    Baseline   98.72   95.96   88.85   69.52   40.52   16.95
    IA-Mean    98.67   96.10   90.11   73.11   45.83   19.70



    HIWIRE ICCS-NTUA

    Results: Aurora 2 (up to +40%)

    AURORA2, Baseline vs +FDCD, Word Accuracy (%):

               clean   20 dB   15 dB   10 dB   5 dB    0 dB    Average
    Baseline   98.66   95.70   89.05   71.42   43.45   16.68   63.26
    +FDCD      98.58   96.34   92.72   82.92   59.02   21.36   75.16

    Per test set:
               Set A   Set B   Set C   Average   W. Average
    Baseline   61.48   62.92   77.55   67.32     65.27
    +FDCD      70.06   71.19   82.37   74.54     72.97

    Set C detail (MIRS channel):

    Train (MIRS):
               clean   20 dB   15 dB   10 dB   5 dB
    Baseline   98.59   94.24   87.80   73.21   47.48
    +FDCD      98.56   96.14   91.48   77.80   52.26

    Street (MIRS):
               clean   20 dB   15 dB   10 dB   5 dB
    Baseline   98.62   95.36   91.12   81.95   59.46
    +FDCD      98.40   96.11   93.39   83.16   58.39

    Results: Aurora 2 (up to +27%)

    AURORA2, Baseline vs MFD, Word Accuracy (%):

    Overall:
               clean   20 dB   15 dB   10 dB   5 dB    0 dB    Average
    Baseline   98.69   95.72   89.12   71.50   43.75   19.55   69.72
    MFD        98.73   96.37   91.64   79.11   52.80   21.70   73.39

    Per test set (Baseline / MFD):
               clean         20 dB         15 dB         10 dB         5 dB          0 dB
    Test A     98.72/98.74   95.67/96.42   88.82/91.66   68.93/77.75   38.73/49.39   15.96/18.68
    Test B     98.72/98.74   96.25/96.98   88.88/91.99   70.11/78.80   42.32/51.99   17.93/20.92
    Test C     98.65/98.72   95.26/95.70   89.67/91.28   75.46/80.80   50.20/57.02   24.75/25.50

    Results: Aurora 2 (up to +61%)

    AURORA2 Test A (Subway, Babble, Car, Exhibition), Word Accuracy (%), averages:

                20 dB   10 dB   5 dB
    Baseline    95.75   69.30   38.93
    +FMP        95.60   77.62   53.69
    +FDCD       96.46   82.93   58.66
    +FMP+FDCD   96.17   82.24   61.84

    Per-noise results (20 dB / 10 dB / 5 dB):

                Subway              Babble              Car                 Exhibition
    Baseline    95.80/72.00/42.80   96.20/64.90/34.30   96.20/73.00/44.70   94.80/67.30/33.90
    +FMP        94.81/77.25/55.97   96.83/80.35/55.11   95.74/79.12/55.00   95.03/73.77/48.69
    +FDCD       96.02/82.97/60.70   97.35/82.99/57.06   96.62/85.02/59.67   95.86/80.74/57.21
    +FMP+FDCD   95.64/81.98/64.09   97.18/85.55/62.89   96.15/81.54/62.49   95.70/79.89/57.88


    Aurora 2 Distributed, Multicondition Training


    Aurora 2 Distributed, Clean Training

  • WP1 Appendix Slides: Audio-Visual Details


    Introduction: Motivations for AV-ASR

    • Audio-only ASR does not work reliably in many scenarios:
      - Noisy background (e.g. car cabin, cockpit)
      - Interference between talkers
      - Need to enhance the auditory signal when it is not reliable
    • Human speech perception is multimodal:
      - Different modalities are weighted according to their reliability
      - Hearing-impaired people can lipread
      - McGurk effect (McGurk & MacDonald, 1976)
    • Machines should also be able to exploit multimodal information


    Audio-Visual Feature Fusion

    • Audio-visual feature integration is highly non-trivial:
      - Audio and visual speech asynchrony (~100 ms)
      - The relative reliability of the streams can vary wildly
    • Many approaches to feature fusion in the literature:
      - Early integration
      - Intermediate integration
      - Late integration
    • Highly active research area (mainly in machine learning)
    • The class of Dynamic Bayesian Networks (DBNs) seems particularly suited to the problem:
      - Stream interaction is explicitly modeled
      - Model parameter inference is more difficult than in HMMs
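    The simplest of these schemes, early integration, can be made concrete: resample the slower visual stream to the audio frame rate and concatenate the two feature vectors per frame. A minimal numpy sketch, assuming linear interpolation is acceptable for the rate conversion (`early_integration` and the feature dimensions are illustrative, not the project's actual pipeline):

    ```python
    import numpy as np

    def early_integration(audio_feats, visual_feats, audio_rate=100.0, visual_rate=30.0):
        """Early (feature-level) AV fusion: upsample the visual stream to the
        audio frame rate by linear interpolation, then concatenate per frame."""
        t_audio = np.arange(audio_feats.shape[0]) / audio_rate
        t_visual = np.arange(visual_feats.shape[0]) / visual_rate
        # Interpolate each visual feature dimension onto the audio time axis;
        # np.interp holds the edge values constant beyond the last video frame.
        visual_up = np.column_stack([
            np.interp(t_audio, t_visual, visual_feats[:, d])
            for d in range(visual_feats.shape[1])
        ])
        return np.concatenate([audio_feats, visual_up], axis=1)

    # 1 s of speech: 100 audio frames (39-dim MFCCs) + 30 video frames (15-dim shape/appearance)
    av = early_integration(np.random.randn(100, 39), np.random.randn(30, 15))
    print(av.shape)  # (100, 54)
    ```

    Late integration would instead decode each stream separately and combine the hypothesis scores; intermediate schemes (multi-stream HMMs, DBNs) combine the streams at the state level.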


    Visual Front-End: AAM Parameters

    • First frame of each of the 36 videos manually annotated
    • 68 points on the whole face as shape landmarks
    • Color appearance sampled at 10000 pixels
    • Eigenvectors retained explain 70% of the variance: 5 eigenshapes & 10 eigenfaces
    • Initial condition at each new frame: the converged solution from the previous frame
    • Inverse-compositional gradient-descent fitting algorithm
    • Coarse-to-fine refinement (Gaussian pyramid, 3 scales)
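    The "70% variance" retention rule above is a standard PCA truncation: keep the smallest number of leading eigenvectors whose cumulative explained variance reaches the target. A minimal numpy sketch (the helper name is hypothetical; the real AAM applies this separately to the shape landmarks and to the appearance pixels):

    ```python
    import numpy as np

    def retain_by_variance(samples, var_fraction=0.70):
        """PCA via SVD of the mean-centered data matrix; return the leading
        principal axes that together explain `var_fraction` of total variance."""
        centered = samples - samples.mean(axis=0)
        _, s, vt = np.linalg.svd(centered, full_matrices=False)
        explained = (s ** 2) / np.sum(s ** 2)  # per-component variance ratio
        k = int(np.searchsorted(np.cumsum(explained), var_fraction) + 1)
        return vt[:k], k

    # Toy data with one dominant direction: 70% of the variance needs only 1 axis.
    data = np.array([[10.0, 0, 0], [-10, 0, 0],
                     [0, 1, 0], [0, -1, 0],
                     [0, 0, 0.1], [0, 0, -0.1]])
    axes, k = retain_by_variance(data)
    print(k)  # 1
    ```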


    AV-ASR Experiment Setup

    • Features:
      - Audio: 39 features (MFCC_D_A)
      - Visual (upsampled from 30 Hz to 100 Hz): 5 shape features (Sh) + 10 appearance features (App)
      - Audio-visual: 39+45 features (MFCC_D_A+SHAPP_D_A)
    • Two-stream HMM:
      - 8-state, left-to-right, whole-digit models with no state skipping
      - Single-Gaussian observation probability densities
      - Separate audio & video feature streams with equal weights (1,1)
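    In a two-stream HMM, each state scores the audio and video observations with separate densities, and the stream weights act as exponents on the stream likelihoods; with equal weights (1,1) this reduces to treating the streams as conditionally independent given the state. A minimal sketch for one state with diagonal-covariance Gaussians (`two_stream_loglik` and the state-dict layout are illustrative, not the actual toolkit API):

    ```python
    import numpy as np

    def diag_gauss_loglik(x, mean, var):
        """Log N(x; mean, diag(var)) for a diagonal-covariance Gaussian."""
        return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

    def two_stream_loglik(x_audio, x_video, state, w_audio=1.0, w_video=1.0):
        """Two-stream state score: log b(o) = w_a * log b_a(o_a) + w_v * log b_v(o_v),
        i.e. the stream weights exponentiate the per-stream likelihoods."""
        la = diag_gauss_loglik(x_audio, state["mu_a"], state["var_a"])
        lv = diag_gauss_loglik(x_video, state["mu_v"], state["var_v"])
        return w_audio * la + w_video * lv

    # One state with a 39-dim audio and a 45-dim video Gaussian (toy parameters).
    state = {"mu_a": np.zeros(39), "var_a": np.ones(39),
             "mu_v": np.zeros(45), "var_v": np.ones(45)}
    score = two_stream_loglik(np.zeros(39), np.zeros(45), state)
    ```

    Down-weighting the video stream (w_v < 1) is the usual way to reflect its lower reliability; the (1,1) setting used in these experiments is the unweighted baseline.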

  • WP1 Appendix Slides: Aurora 4


    Aurora 4, Multi-Condition Training


    Aurora 4, Noisy Training


    Aurora 4, Noisy Training

