Page 1: Automatic Speech Recognition Studies

Automatic Speech Recognition Studies

Guy Brown, Amy Beeston and Kalle Palomäki

Page 2: Automatic Speech Recognition Studies

Overview

• Aims

• The articulation index (AI) corpus

• Phone recogniser

• Results on sir/stir subset of AI corpus

• Future plans

Page 3: Automatic Speech Recognition Studies

Aims

• Aim to develop a ‘perceptual constancy’ front-end for automatic speech recognition (ASR).

• Should be compatible with the Watkins et al. findings, but also validated on a 'real world' ASR task:

– wider vocabulary

– range of reverberation conditions

– variety of speech contexts

– naturalistic speech, rather than interpolated stimuli

– consider phonetic confusions in reverberation in general

Page 4: Automatic Speech Recognition Studies

Progress to date

• Current work has focused on implementing a baseline ASR system for the articulation index (AI) corpus, which meets the requirements for speech material stated on the previous slide.

• So far we have results for phone recognition on a small test set, without any 'constancy' processing.

• We are planning an evaluation that compares phonetic confusions made by listeners and by the ASR system on the same test.

Page 5: Automatic Speech Recognition Studies

The articulation index (AI) corpus

• Recorded by Jonathan Wright (University of Pennsylvania); available via the LDC.

• Intended for speech recognition in noise experiments similar to those of Fletcher.

• Suggested to us by Hynek Hermansky; utterances are similar to those used by Watkins et al.:

– English (American)

– Target syllables are mostly nonsense, but some correspond to real words (including “sir” and “stir”)

– Target syllables are embedded in a context sentence drawn from a limited vocabulary

Page 6: Automatic Speech Recognition Studies

Details of the AI corpus

• Includes all “valid” English diphone (CV, VC) syllables.

• Triphone syllables (CVC, CCV, VCC) were chosen according to their frequency in the Switchboard corpus:

– correlated with syllable frequency in casual conversation.

• 12 male speakers, 8 female speakers.

• Approximately 2000 syllables common to all speakers.

• Small amount (10 min) of conversational data.

• All speech data sampled at 16 kHz.

Page 7: Automatic Speech Recognition Studies

AI corpus examples

• Target syllable is preceded by two context words and followed by one context word:

– CW1 CW2 SYL CW3

– CW1, CW2 and CW3 drawn from sets of 8, 51 and 44 words respectively

• Examples:

they recognise sir entirely

people ponder stir second

Page 8: Automatic Speech Recognition Studies

Phone recogniser

• Monophone recogniser implemented and trained on the TIMIT corpus.

• Based on HTK scripts by Tony Robinson [1].

• Front-end: speech encoded as 12 cepstral coefficients + energy + deltas + accelerations (39 features).

• Cepstral mean normalisation applied.

• 3 emitting states per phone model, with observations modelled by a mixture of 20 Gaussians per state.

• Approximately 58% phone accuracy on the TIMIT test set [1].

[1] http://www.cantabResearch.com/HTKtimit.html
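
As a rough, hedged illustration of this front-end (not the project's actual HTK configuration), the 39-dimensional feature stream can be approximated in Python with librosa; the filterbank and liftering details of the original system are not reproduced here, and C0 stands in for the HTK energy term.

import librosa
import numpy as np

def htk_style_features(wav_path):
    # Load at the corpus sample rate (16 kHz).
    y, sr = librosa.load(wav_path, sr=16000)
    # 13 cepstra (C0..C12); C0 approximates the log-energy term.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    # Per-utterance cepstral mean normalisation.
    mfcc = mfcc - mfcc.mean(axis=1, keepdims=True)
    # First and second time derivatives (deltas and accelerations).
    d1 = librosa.feature.delta(mfcc)
    d2 = librosa.feature.delta(mfcc, order=2)
    # Stack into 39-dimensional observation vectors: (frames, 39).
    return np.vstack([mfcc, d1, d2]).T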

Page 9: Automatic Speech Recognition Studies

Training and testing

• Trained on TIMIT training set.

• The models really need adapting to the AI corpus material; this is work in progress.

• Allophones were removed from the TIMIT labels (as is usual) to give a 41-phone set.

• Short-pause and silence models are included.

• For testing on the AI corpus, word-level transcriptions were expanded into phone sequences using the Switchboard-ICSI pronunciation dictionary.
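
A minimal sketch of this expansion step, with a toy stand-in for the ICSI dictionary (the entries below are illustrative approximations, not the dictionary's actual pronunciations):

# Hypothetical pronunciation entries for illustration only.
PRON = {
    "THEY": ["DH", "EY"],
    "RECOGNIZE": ["R", "EH", "K", "AX", "G", "N", "AY", "Z"],
    "SIR": ["S", "ER"],
    "ENTIRELY": ["EH", "N", "T", "AY", "ER", "L", "IY"],
}

def expand(words):
    # Concatenate the phone string of each word in the transcription.
    phones = []
    for w in words:
        phones.extend(PRON[w.upper()])
    return phones

print(expand(["they", "recognize", "sir", "entirely"]))
# -> ['DH', 'EY', 'R', 'EH', ..., 'S', 'ER', 'EH', 'N', ...]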

Page 10: Automatic Speech Recognition Studies

Experiments

• Initial experiments done with a subset of AI corpus utterances in which the target syllable is “sir” or “stir”.

• Small test set of 40 utterances:

          Male speakers   Female speakers
"Sir"          12                8
"Stir"         12                8

Page 11: Automatic Speech Recognition Studies

Experiment 1: Fletcher-style paradigm

• A recogniser grammar was used in which:

– The sets of context words CW1, CW2 and CW3 are specified;

– Target syllable is any sequence of two or three phones.

• Corresponds to task in which listener knows that context words are drawn from a limited set.

• Recogniser grammar is a (rather unconventional) mix of word-level and phone-level labels.

Page 12: Automatic Speech Recognition Studies

Experiment 1: recogniser grammar

$cw1 = I | YOU | WE | THEY | SOMEONE | NO-ONE | EVERYONE | PEOPLE;

$cw2 = SEE | SAW | HEAR | PERCEIVE | THINK | SAY | SAID | SPEAK | PRONOUNCE | WRITE | RECORD | OBSERVE | TRY | UNDERSTAND | ATTEMPT | REPEAT | DESCRIBE | DETECT | DETERMINE | DISTINGUISH | ECHO | EVOKE | PRODUCE | ELICIT | PROMPT | SUGGEST | UTTER | IMAGINE | PONDER | CHECK | MONITOR | RECALL | REMEMBER | RECOGNIZE | REPEAT | REPORT | USE | UTILIZE | REVIEW | SENSE | SHOW | NOTE | NOTICE | SPELL | READ | EXAMINE | STUDY | PROPOSE | WATCH | VIEW | WITNESS;

$cw3 = NOW | AGAIN | OFTEN | TODAY | WELL | CLEARLY | ENTIRELY | NICELY | PRECISELY | ANYWAY | DAILY | WEEKLY | YEARLY | HOURLY | MONTHLY | ALWAYS | EASILY | SOMETIME | TWICE | MORE | EVENLY | FLUENTLY | GLADLY | HAPPILY | NEATLY | NIGHTLY | ONLY | PROPERLY | FIRST | SECOND | THIRD | FOURTH | FIFTH | SIXTH | SEVENTH | EIGHTH | NINTH | TENTH | STEADILY | SURELY | TYPICALLY | USUALLY | WISELY;

$phn = AA | AE | AH | AO | AW | AX | AY | B | CH | D | DH | DX | EH | ER | EY | F | G | HH | IH | IY | JH | K | L | M | N | NG | OW | OY | P | R | S | SH | T | TH | UH | UW | V | W | Y | Z | ZH;

(!ENTER $cw1 $cw2 $phn $phn [$phn] $cw3 !EXIT)
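
To make the grammar's structure concrete, here is a small hypothetical Python sampler that generates utterances matching it (word and phone sets are abbreviated for brevity; all names here are mine, not part of the project code):

import random

CW1 = ["I", "YOU", "WE", "THEY", "SOMEONE", "NO-ONE", "EVERYONE", "PEOPLE"]
CW2 = ["SEE", "PONDER", "IMAGINE", "RECOGNIZE"]   # 4 of the 51 CW2 words
CW3 = ["NOW", "ENTIRELY", "SECOND", "GLADLY"]     # 4 of the 44 CW3 words
PHN = ["AA", "ER", "S", "T", "K", "P"]            # 6 of the 41 phones

def sample_utterance():
    # $phn $phn [$phn]: the target is any 2- or 3-phone sequence.
    target = [random.choice(PHN) for _ in range(random.choice([2, 3]))]
    return [random.choice(CW1), random.choice(CW2), *target, random.choice(CW3)]

print(" ".join(sample_utterance()))   # e.g. "THEY PONDER S T ER GLADLY"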

Page 13: Automatic Speech Recognition Studies

Experiment 1: results

• Overall 47.5% correct at word level (sir/stir).

• Context words were not correctly recognised in some cases, leading to a knock-on effect on recognition of the target syllable.

• Examples (reference vs. recogniser output):

they imagine stir surely      →  they imagine s t er surely    (correct)
they sense stir gladly        →  they sense s er gladly        (deletion)
I evoke sir precisely         →  I evoke s eh precisely        (substitution: /eh/ as in 'head')
they recognize sir entirely   →  they witness er n p daily     (incorrect context words)

Page 14: Automatic Speech Recognition Studies

Experiment 2: constrained sir/stir

• A recogniser grammar was used in which:

– The sets of context words CW1, CW2 and CW3 are specified;

– Target syllable is constrained to “sir” or “stir”;

– Canonical pronunciation of "sir" and "stir" is assumed (i.e. "sir" = /s er/ and "stir" = /s t er/).

• Corresponds to Watkins-style task, except that context words vary and are drawn from a limited set.

• Utterances were either presented clean or convolved with the left or right channel of the L-shaped room or corridor BRIRs.
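
A minimal sketch of the reverberation step, assuming a two-channel BRIR stored as a WAV file at the corpus sample rate (the file names are placeholders):

import soundfile as sf
from scipy.signal import fftconvolve

speech, fs = sf.read("speech.wav")   # clean (mono) AI corpus utterance
brir, fs_b = sf.read("brir.wav")     # (samples, 2) binaural room impulse response
assert fs == fs_b == 16000

# Convolve with one channel at a time, as in Experiment 2.
left = fftconvolve(speech, brir[:, 0])
right = fftconvolve(speech, brir[:, 1])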

Page 15: Automatic Speech Recognition Studies

Experiment 2: recogniser grammar

• The recogniser grammar was:

$test = SIR | STIR;

( !ENTER $cw1 $cw2 $test $cw3 !EXIT )

with $cw1, $cw2 and $cw3 defined as before.

Page 16: Automatic Speech Recognition Studies

Results: L-shaped room, left channel

Impulse response ID      % Correct   SIR→SIR   SIR→STIR   STIR→SIR   STIR→STIR

clean.wav                  95.0         18         2          0          20
outconv22feb31p5.wav       92.5         18         2          1          19
outconv22feb63.wav         85.0         18         2          4          16
outconv22feb125.wav        72.5         13         7          4          16
outconv22feb250.wav        67.5          9        11          2          18
outconv22feb500.wav        62.5         15         5         10         10
outconv22feb1000.wav       65.0         14         6          8         12

(Each row flattens the 2x2 confusion matrix into spoken→response counts, with 20 "sir" and 20 "stir" tokens per condition; the same layout is used on the following slides.)
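
For clarity, % correct is simply the diagonal of each confusion matrix over the 40 test utterances; e.g. for clean.wav, (18 + 20) / 40 = 95.0%. A one-line helper:

def percent_correct(sir_sir, sir_stir, stir_sir, stir_stir):
    # Diagonal (correct responses) over all test tokens.
    total = sir_sir + sir_stir + stir_sir + stir_stir
    return 100.0 * (sir_sir + stir_stir) / total

print(percent_correct(18, 2, 0, 20))   # clean condition -> 95.0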

Page 17: Automatic Speech Recognition Studies

Results: L-shaped room, right channel

Impulse response ID      % Correct   SIR→SIR   SIR→STIR   STIR→SIR   STIR→STIR

clean.wav                  95.0         18         2          0          20
outconv22feb31p5.wav       87.5         17         3          2          18
outconv22feb63.wav         85.0         19         1          5          15
outconv22feb125.wav        87.5         15         5          1          19
outconv22feb250.wav        82.5         16         4          3          17
outconv22feb500.wav        67.5         16         4          9          11
outconv22feb1000.wav       65.0         14         6          8          12

Page 18: Automatic Speech Recognition Studies

Results: corridor, left channel

Impulse response ID      % Correct   SIR→SIR   SIR→STIR   STIR→SIR   STIR→STIR

clean.wav                  95.0         18         2          0          20
outconv22feb31p5.wav       90.0         18         2          2          18
outconv22feb63.wav         87.5         19         1          4          16
outconv22feb125.wav        77.5         15         5          4          16
outconv22feb250.wav        72.5         17         3          8          12
outconv22feb500.wav        67.5         17         3         10         10
outconv22feb1000.wav       57.5         14         6         11          9

Page 19: Automatic Speech Recognition Studies

Results: corridor, right channel

Impulse response ID      % Correct   SIR→SIR   SIR→STIR   STIR→SIR   STIR→STIR

clean.wav                  95.0         18         2          0          20
outconv22feb31p5.wav       90.0         18         2          2          18
outconv22feb63.wav         87.5         18         2          3          17
outconv22feb125.wav        87.5         18         2          3          17
outconv22feb250.wav        85.0         16         4          2          18
outconv22feb500.wav        82.5         19         1          6          14
outconv22feb1000.wav       60.0         15         5         11          9

Page 20: Automatic Speech Recognition Studies

Conclusions

• The phone recogniser works well when constrained to recognise "sir"/"stir" only (95% correct on clean speech).

• Recognition rate falls as reverberation increases, as expected.

• The fall in performance is not only due to "stir" being reported as "sir", as would be expected from human studies; "sir" is also misreported as "stir" in the more reverberant conditions.

• There are some effects of BRIR channel on performance: the right channel of the corridor BRIR is less problematic, most likely due to a strong early reflection in that channel for the 5 m condition.

Page 21: Automatic Speech Recognition Studies

Plans for next period: experiments

• The AI corpus lends itself to experiments in which target and context are varied, as in the Watkins et al. experiments.

• Suggestion:

– Compare listener and ASR phone confusions under conditions in which the whole utterance is reverberated, and when reverberation is added to the target syllable only.

• Possible problems:

– Relatively insensitive design? Will the effect of reverberation be sufficient to show up as consistent phone confusions?

– Are the contexts long enough? (some contexts are as short as 0.5 s)

– As shown in the baseline studies, the recogniser does not necessarily make the same mistakes as human listeners.

Page 22: Automatic Speech Recognition Studies

AI corpus sir/stir stimuli

• Utterances similar to the sir/stir format:

– Wider variety of speakers/contexts (but still limited vocabulary)

– Targets mostly nonsense, but some real words (e.g. sir/stir)

– Reverberated (by Amy) according to the sir-stir paradigm

• Widening the sir/stir paradigm towards the ASR environment:

– Introduce different stop consonants first: s {t,p,k} ir

– Look for confusion in place of articulation

[Figure: example stimuli for the near-near, near-far and far-far reverberation conditions]

Page 23: Automatic Speech Recognition Studies

Test words from AI corpus

We could record our own: sigh, sty, spy, sky ("sky" is missing from the corpus).

Page 24: Automatic Speech Recognition Studies

Questions for Tony

• Generally: would this sort of thing work?

• Is the initial delay in the BRIR kept?

• How should the AI corpus signals be level-normalised when mixed reverberation distances are used?

• How to control the ordering of stimuli?

Page 25: Automatic Speech Recognition Studies

Plans: system development

• Currently the ASR system is trained on TIMIT; expect improvement if adapted to the AI corpus material.

• We only have word-level transcriptions for the AI corpus, so phone labels must be obtained by forced alignment.

• We will try the efferent model as a front end for recognition of reverberated speech; however:

– it may not be sufficiently general, having been developed/tuned only for the sir/stir task

– that said, we have shown elsewhere that efferent suppression is effective in improving ASR performance in additive noise

– there is some relationship between the efferent model and successful engineering approaches

Page 26: Automatic Speech Recognition Studies

Plans: system development

• The current efferent model is not unrelated to the engineering approach of Thomas et al. (2008):

– "the effect of reverberation is reduced when features are extracted from gain normalized temporal envelopes of long duration in narrow subbands"

• Our efferent model also does gain control over long-duration windows (and will work in narrow bands).

• The model currently produces a spectral representation, but could be modified to give cepstral features for ASR.
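
A minimal sketch of long-duration gain normalisation for one narrow subband, in the spirit of Thomas et al. (2008) rather than a reproduction of their algorithm (the filter order, band edges and 1 s window are illustrative assumptions):

import numpy as np
from scipy.signal import butter, sosfilt, hilbert

def gain_normalised_envelope(x, fs, lo_hz, hi_hz, win_s=1.0):
    # Band-pass filter into one narrow subband.
    sos = butter(4, [lo_hz, hi_hz], btype="band", fs=fs, output="sos")
    band = sosfilt(sos, x)
    # Temporal envelope via the Hilbert transform.
    env = np.abs(hilbert(band))
    # Normalise by a long-duration (here 1 s) local mean gain.
    n = int(win_s * fs)
    local_gain = np.convolve(env, np.ones(n) / n, mode="same") + 1e-8
    return env / local_gain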

Page 27: Automatic Speech Recognition Studies

Plans: other approaches

• Parallel search over room acoustics and word models?

– How would context effects be included in such a scheme?

– On-line selection of word models trained in dry or reverberant conditions, according to context characteristics?

• Recognition within individual bands, i.e. train a recogniser for each band and combine posterior probabilities (a combination sketch follows below):

– May allow modelling of the Watkins et al. 8-band results

– Performance of multiband systems is generally lower than conventional ASR
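
One plausible combination rule, offered only as an assumption (multiband systems use various merges), is a frame-wise geometric mean of the per-band phone posteriors:

import numpy as np

def combine_band_posteriors(band_posteriors):
    # band_posteriors: list of (frames, phones) arrays, one per band.
    logp = sum(np.log(p + 1e-10) for p in band_posteriors) / len(band_posteriors)
    p = np.exp(logp)
    return p / p.sum(axis=1, keepdims=True)   # renormalise per frame

# e.g. eight bands of stand-in posteriors over 41 phones, 100 frames:
bands = [np.random.dirichlet(np.ones(41), size=100) for _ in range(8)]
combined = combine_band_posteriors(bands)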

Page 28: Automatic Speech Recognition Studies

Lunch


