
ALIGNMENT OF SPEECH TO HIGHLY IMPERFECT TEXT TRANSCRIPTIONS

Alexander Haubold and John R. Kender

Department of Computer Science, Columbia University


Problem

Unsophisticated Manual Transcription

... prepared slides please. Ok, alright you'll have to bear with me until I get a little more used to this medium. I haven't taught in a 2 hour format for a while and generally speaking when I do teach I use the blackboard. They've requested because it's easier to make archiveable summaries of things if I write things on slides, because then they'll be digitized and placed online. What this means is that if you miss a lecture you don't even have to go to the engineering library ...

Near Perfect Transcript

Inexpensive Automatic Transcription

... prepared slides plea and a and nonblack to go with the unsettled of more use this medium haven't gotten a two-out, four while and generally speaking when they did teach a if blackboard they've requested because it's easier to make hot titles online of flames that by lightning funds lies the this and that will be digitized and placed on long a look this means is that in this lecture you don't even after go to the engineering library ...

Imperfect Transcripts

[Figure: the transcript words "… prepared slides plea and a and …", each marked "?" in place of a time stamp]

• Missing Temporal Alignment
  • Words or higher-level structures not time-stamped
• Linear Fit (Speech Signal to Text) unsuitable
  • Does not consider pauses in Speech
  • Does not adjust to various speeds of user speech

Need temporal alignment to index text from speech.
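To make the weakness of a linear fit concrete, the sketch below spreads word time stamps evenly over the audio duration and compares them with word onsets that include a pause. The function name `linear_fit_timestamps` and all numbers are illustrative assumptions, not material from the poster.

```python
def linear_fit_timestamps(num_words, audio_seconds):
    """Naive linear fit: word i is placed at i / num_words of the audio."""
    if num_words <= 0:
        return []
    return [i * audio_seconds / num_words for i in range(num_words)]


# Hypothetical toy case: 4 words in 8 seconds of audio, with a long
# pause after the second word. The "true" onsets are made-up values.
true_onsets = [0.0, 1.0, 6.0, 7.0]
estimated = linear_fit_timestamps(4, 8.0)  # [0.0, 2.0, 4.0, 6.0]

# Signed error per word: positive means the estimate is too late.
errors = [est - true for est, true in zip(estimated, true_onsets)]
print(errors)  # [0.0, 1.0, -2.0, -1.0]
```

Even in this small case the pause shifts every later word, which is the behavior the bullets above describe.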

Application

Multimedia Browser for Student Presentation Videos:

• Database: 5 years, >180 videos, >160 hrs, >1500 students

• Used for archival and reference by students and instructors

• ASR transcripts aligned for more accurate retrieval

• Filtered transcript text in yellow boxes

• More salient phrases highlighted in red

• Temporal occurrence preserved along horizontal timeline

• Text search results are highlighted in separate yellow box

Results

[Figure: four plots of alignment error (seconds, -100 to 100) against length of audio (seconds)]

(A) Lecture, single speaker
• t = 1:48:21
• Manual transcription
• Avg. Matching Error = 3.9 sec

(B) Lecture, single speaker
• t = 1:48:21
• Automatic transcription

(C) Student Presentation, 31 speakers
• t = 1:15:12

(D) Student Presentation, 10 speakers
• t = 0:22:32


Percentage of speech correctly aligned at accuracy thresholds from 1 to 100 seconds. More than 75% of correct alignment occurs within a 20-second error margin.

[Figure: % of speech aligned (10 to 100) against matching accuracy (seconds, 0 to 100), one curve per video (A), (B), (C), (D)]

Alignment results for 4 videos:

• Correct alignment within small error margins

• Longer pauses introduce larger errors (gray bars)

• Smaller error margin for manual transcript
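The "percentage aligned within an error margin" measure used in these results can be sketched as follows. This is only one plausible reading of the metric; the function names and the sample errors are hypothetical and are not data from the poster.

```python
def percent_within_margin(errors_sec, margin_sec):
    """Percentage of alignment errors whose magnitude is <= margin_sec."""
    if not errors_sec:
        return 0.0
    hits = sum(1 for e in errors_sec if abs(e) <= margin_sec)
    return 100.0 * hits / len(errors_sec)


def accuracy_curve(errors_sec, margins=range(1, 101)):
    """(margin, percentage) pairs for margins from 1 to 100 seconds."""
    return [(m, percent_within_margin(errors_sec, m)) for m in margins]


# Made-up signed alignment errors in seconds, for illustration only.
sample_errors = [0.5, -3.0, 12.0, -19.0, 45.0]
print(percent_within_margin(sample_errors, 20))  # 80.0
```

Plotting the output of `accuracy_curve` for each video would give a cumulative curve of the kind described in the figure above.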

Approach

Observation:

• Vowels and fricatives are the most accurately detected phonemes in automatic transcriptions

• Align Speech and Text on easily detectable phonemes

Use the Edit Distance dynamic programming algorithm to align long sequences of phonemes (60 min ~ 15,000 text phonemes, ~45,000 speech phonemes *)
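A minimal sketch of edit-distance alignment between two phoneme sequences is given below, assuming unit costs for insertion, deletion and substitution; the poster does not state its cost scheme. The function name `edit_distance_alignment` is hypothetical. A full table for 15,000 × 45,000 phonemes would hold 675 million cells, so a practical implementation would likely need a memory-saving variant that is not shown here.

```python
def edit_distance_alignment(text_ph, speech_ph):
    """Levenshtein distance plus aligned index pairs.

    Returns (distance, pairs), where pairs lists (i, j) for every text
    phoneme i that is matched or substituted with speech phoneme j.
    """
    n, m = len(text_ph), len(speech_ph)
    # dist[i][j] = edit distance between text_ph[:i] and speech_ph[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if text_ph[i - 1] == speech_ph[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # text phoneme has no partner
                dist[i][j - 1] + 1,         # speech phoneme has no partner
                dist[i - 1][j - 1] + cost,  # match or substitution
            )
    # Backtrace from the bottom-right corner to recover one alignment.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        cost = 0 if text_ph[i - 1] == speech_ph[j - 1] else 1
        if dist[i][j] == dist[i - 1][j - 1] + cost:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif dist[i][j] == dist[i - 1][j] + 1:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return dist[n][m], pairs


text = ["S", "AH", "S", "IY"]
speech = ["S", "AH", "AH", "S", "IH", "IY"]
distance, pairs = edit_distance_alignment(text, speech)
print(distance)  # 2
```

If each detected speech phoneme carries a time stamp, the index pairs can be used to transfer those times to the text phonemes and from there to the words they belong to.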

*Phoneme and Phoneme Substitution Table

Monophthongs: IY (beet), IH (bit), EH (bet), AE (bat), AH (above), UW (boot), UH (book), AA (father), ER (bird), AO (bought)

Fricatives: SH (assure), S (sign)

Diphthongs: AW (out) → AH, AY (five) → AH, EY (day) → AE, OW (crow) → UH, OY (boy) → AO

Fricatives: Z (resign) → S

Affricates: CH (church) → SH
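The table can be expressed as a small lookup, sketched here under the assumption that substitution is applied first and any phoneme outside the kept subset is then dropped. The names `KEPT`, `SUBSTITUTE` and `filter_phonemes` are hypothetical.

```python
# Phonemes kept as-is (monophthongs and the fricatives SH, S).
KEPT = {"IY", "IH", "EH", "AE", "AH", "UW", "UH", "AA", "ER", "AO", "SH", "S"}

# Phonemes mapped onto a kept phoneme (diphthongs, Z, CH).
SUBSTITUTE = {
    "AW": "AH", "AY": "AH", "EY": "AE", "OW": "UH", "OY": "AO",
    "Z": "S", "CH": "SH",
}


def filter_phonemes(phonemes):
    """Apply the substitutions, then drop every phoneme not in KEPT."""
    filtered = []
    for p in phonemes:
        p = SUBSTITUTE.get(p, p)
        if p in KEPT:
            filtered.append(p)
    return filtered


# "slides please" as S L AY D Z, P L IY Z (stress marks removed)
print(filter_phonemes(["S", "L", "AY", "D", "Z", "P", "L", "IY", "Z"]))
# ['S', 'AH', 'S', 'IY', 'S']
```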

Processing pipeline:

• Transcript → Text Phonemes, using the CMU Pronouncing Dictionary (>125,000 words with phonemes)

• Audio → Speech Phonemes, using Phoneme Detection: Vowels (Auto Regressive Model); Fricatives, Affricates (Spectrogram Energy Distribution)

• Text Phonemes → Filtered T.P. and Speech Phonemes → Filtered S.P. (keep subset *, merge sequences)

• Filtered T.P. and Filtered S.P. → Edit Dist. → Text-Speech Alignment with smallest Edit Distance
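The transcript-to-phoneme step can be sketched with a parser for the plain-text format of the CMU Pronouncing Dictionary (a word followed by ARPAbet phonemes, with stress digits on vowels). The three sample entries are illustrative, the helper names are hypothetical, and skipping words that are missing from the dictionary is an assumption rather than the authors' stated behavior.

```python
import re

# A few lines in CMUdict's text format; treat them as illustrative.
SAMPLE_DICT = """\
;;; comment lines in the dictionary file start with three semicolons
PREPARED  P R IY0 P EH1 R D
SLIDES  S L AY1 D Z
PLEASE  P L IY1 Z
"""


def parse_cmudict(text):
    """Map lowercase words to phoneme lists with stress digits removed."""
    entries = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith(";;;"):
            continue
        word, *phones = line.split()
        entries[word.lower()] = [re.sub(r"\d", "", p) for p in phones]
    return entries


def text_to_phonemes(transcript, entries):
    """Concatenate the phonemes of every transcript word found in entries."""
    phonemes = []
    for word in re.findall(r"[a-z']+", transcript.lower()):
        # Assumption: words without a dictionary entry are skipped.
        phonemes.extend(entries.get(word, []))
    return phonemes


entries = parse_cmudict(SAMPLE_DICT)
print(text_to_phonemes("... prepared slides plea and", entries))
# ['P', 'R', 'IY', 'P', 'EH', 'R', 'D', 'S', 'L', 'AY', 'D', 'Z']
```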