ALIGNMENT OF SPEECH TO HIGHLY IMPERFECT TEXT TRANSCRIPTIONS
Alexander Haubold and John R. Kender
Department of Computer Science, Columbia University
Problem
Approach
Results
Application
Unsophisticated Manual Transcription
... prepared slides please. Ok, alright you'll
have to bear with me until I get a little more
used to this medium. I haven't taught in a 2
hour format for a while and generally speaking
when I do teach I use the blackboard. They've
requested because it's easier to make
archiveable summaries of things if I write
things on slides, because then they'll be
digitized and placed online. What this means is
that if you miss a lecture you don't even have
to go to the engineering library ...
Near Perfect Transcript
Inexpensive Automatic Transcription
... prepared slides plea and a and nonblack to
go with the unsettled of more use this medium
haven't gotten a two-out, four while and
generally speaking when they did teach a if
blackboard they've requested because it's
easier to make hot titles online of flames that
by lightning funds lies the this and that will
be digitized and placed on long a look this
means is that in this lecture you don't even
after go to the engineering library ...
Imperfect Transcripts
[Figure: the words "… prepared slides plea and a and …", each with an unknown time stamp (?)]
• Missing Temporal Alignment
  • Words or higher-level structures not time-stamped
• Linear Fit (Speech Signal to Text) unsuitable
  • Does not consider pauses in speech
  • Does not adjust to various speeds of user speech
Need temporal alignment to index text from speech.
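
To illustrate why a linear fit is unsuitable, here is a minimal sketch of such a baseline. The function name `linear_fit_timestamps` and the sample values are hypothetical and are not taken from the poster; the sketch simply spreads words evenly over the audio duration.

```python
def linear_fit_timestamps(words, audio_seconds):
    """Assign each word a start time proportional to its position in the text.

    This is the naive baseline: it assumes a constant speaking rate and
    no pauses, which is exactly what real speech violates.
    """
    n = len(words)
    return [(word, audio_seconds * i / n) for i, word in enumerate(words)]


# Hypothetical example: five words spread over ten seconds of audio.
words = "prepared slides please ok alright".split()
stamps = linear_fit_timestamps(words, 10.0)
```

If the speaker pauses for several seconds after "slides", every later estimate is shifted by that pause, and the fit has no way to detect it.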
Multimedia Browser for Student Presentation Videos:
• Database: 5 years, >180 videos, >160 hrs, >1500 students
• Used for archival and reference by students and instructors
• ASR transcripts aligned for more accurate retrieval
• Filtered transcript text in yellow boxes
• More salient phrases highlighted in red
• Temporal occurrence preserved along horizontal timeline
• Text search results are highlighted in a separate yellow box
[Plot (A): alignment error (seconds, -100 to 100) vs. length of audio (seconds)]
(A) Lecture, single speaker
• t = 1:48:21
• Manual transcription
• Avg. Matching Error = 3.9 sec
[Plot (B): alignment error (seconds, -100 to 100) vs. length of audio (seconds)]
(B) Lecture, single speaker
• t = 1:48:21
• Automatic transcription
• Avg. Matching Error = 7.7 sec
[Plot (C): alignment error (seconds, -100 to 100) vs. length of audio (seconds)]
(C) Student Presentation, 31 speakers
• t = 1:15:12
• Automatic transcription
• Avg. Matching Error = 6.4 sec
[Plot (D): alignment error (seconds, -100 to 100) vs. length of audio (seconds)]
(D) Student Presentation, 10 speakers
• t = 0:22:32
• Automatic transcription
• Avg. Matching Error = 26.7 sec
[Plot: % of speech aligned vs. matching accuracy (seconds, 0 to 100), one curve each for (A), (B), (C), (D)]
Correct speech alignment for 1-100 second accuracy. More than 75% of correct alignment occurs within a 20-second error margin.
Alignment results for 4 videos:
• Correct alignment within small error margins
• Longer pauses introduce larger errors (gray bars)
• Smaller error margin for manual transcript
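
The two summary measures reported here, the average matching error and the percentage of speech aligned within a given margin, could be computed roughly as follows. This is a sketch under assumptions: it treats the average matching error as a mean absolute error, and the error values are hypothetical rather than the poster's data.

```python
def average_matching_error(errors):
    """Mean absolute alignment error in seconds (assumed definition)."""
    return sum(abs(e) for e in errors) / len(errors)


def percent_within(errors, margin):
    """Percentage of aligned points whose absolute error is <= margin seconds."""
    hits = sum(1 for e in errors if abs(e) <= margin)
    return 100.0 * hits / len(errors)


# Hypothetical per-word alignment errors in seconds (not measured data).
errors = [-2.0, 1.0, 3.0, -30.0, 4.0]
avg = average_matching_error(errors)
# One point of the accuracy curve per error margin.
curve = [percent_within(errors, m) for m in (1, 5, 20, 100)]
```

A single large error, such as one caused by a long pause, raises the average noticeably while leaving the share within 20 seconds largely unchanged.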
Observation:
• Vowels and fricatives are the most accurately recognized phonemes in automatic transcriptions
• Align speech and text on easily detectable phonemes
Use the Edit Distance dynamic programming algorithm to align long sequences of phonemes (60 min ≈ 15,000 text phonemes, ≈ 45,000 speech phonemes *)
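
As an illustration of edit-distance alignment, the following minimal sketch computes a standard Levenshtein distance with unit costs and backtraces one optimal path. The function name `align`, the unit costs, and the sample sequences are assumptions made for illustration; the poster does not specify the authors' cost model.

```python
def align(text, speech):
    """Return (edit distance, matched index pairs) for two phoneme sequences."""
    n, m = len(text), len(speech)
    # dist[i][j] = edit distance between text[:i] and speech[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if text[i - 1] == speech[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + sub,  # match or substitute
                             dist[i - 1][j] + 1,        # text phoneme unmatched
                             dist[i][j - 1] + 1)        # extra speech phoneme
    # Backtrace from the bottom-right corner, recording exact matches only.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        sub = 0 if text[i - 1] == speech[j - 1] else 1
        if dist[i][j] == dist[i - 1][j - 1] + sub:
            if sub == 0:
                pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif dist[i][j] == dist[i - 1][j] + 1:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return dist[n][m], pairs


# Hypothetical filtered sequences: the speech side contains two spurious detections.
text_phonemes = ["S", "AE", "IH", "S"]
speech_phonemes = ["S", "AH", "AE", "IH", "UH", "S"]
distance, pairs = align(text_phonemes, speech_phonemes)
```

Each matched pair links a text phoneme to a speech phoneme, so the time of the detected speech phoneme can be carried over to the text. The full table kept here is only practical for short sequences; sequences of the size quoted above would need a more memory-efficient formulation.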
Monophthongs: IY (beet), IH (bit), EH (bet), AE (bat), AH (above), UW (boot), UH (book), AA (father), ER (bird), AO (bought)
Fricatives: SH (assure), S (sign)
Diphthongs: AW (out) → AH, AY (five) → AH, EY (day) → AE, OW (crow) → UH, OY (boy) → AO
Fricatives: Z (resign) → S
Affricates: CH (church) → SH
* Phoneme and Phoneme Substitution Table
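
The table above can be applied as a simple filter. The sketch below is one possible reading: phonemes with a substitution are mapped first, and anything outside the kept set is dropped. The names `KEEP`, `SUBSTITUTE`, and `filter_phonemes` are hypothetical.

```python
# Phonemes kept as-is (monophthongs and the fricatives SH, S).
KEEP = {"IY", "IH", "EH", "AE", "AH", "UW", "UH", "AA", "ER", "AO", "SH", "S"}

# Phonemes replaced by a similar, easier-to-detect phoneme.
SUBSTITUTE = {
    "AW": "AH", "AY": "AH", "EY": "AE", "OW": "UH", "OY": "AO",  # diphthongs
    "Z": "S",                                                    # fricative
    "CH": "SH",                                                  # affricate
}


def filter_phonemes(phonemes):
    """Apply substitutions, then drop every phoneme not in the kept subset."""
    filtered = []
    for p in phonemes:
        p = SUBSTITUTE.get(p, p)
        if p in KEEP:
            filtered.append(p)
    return filtered


speech_word = filter_phonemes(["S", "P", "IY", "CH"])  # "speech"
five_word = filter_phonemes(["F", "AY", "V"])          # "five"
```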
[Diagram: processing pipeline]
Audio → Speech Phonemes → Filtered S.P.
Transcript → Text Phonemes → Filtered T.P.
Filtering: keep subset *, merge sequences
Filtered T.P. + Filtered S.P. → Edit Dist. → Text-Speech Alignment with smallest Edit Distance
Phoneme Detection: Vowels (Auto Regressive Model); Fricatives, Affricates (Spectrogram Energy Distribution)
CMU Pronouncing Dictionary: >125,000 words with phonemes
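
Converting transcript words to text phonemes amounts to a dictionary lookup. The sketch below parses a small inline excerpt written in the plain-text CMU Pronouncing Dictionary format (word followed by phonemes, with stress digits on vowels). The function name `parse_pronunciations` is hypothetical, the excerpt stands in for the real file, and alternate-pronunciation entries are not handled.

```python
import re

# Illustrative excerpt in the dictionary's text format (not the full file).
SAMPLE = """\
;;; comment lines in the dictionary start with three semicolons
SLIDES  S L AY1 D Z
SPEECH  S P IY1 CH
"""


def parse_pronunciations(text):
    """Map each word to its phoneme list, with stress digits removed."""
    entries = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith(";;;"):
            continue  # skip blank lines and comments
        word, *phones = line.split()
        entries[word] = [re.sub(r"\d", "", p) for p in phones]
    return entries


pronunciations = parse_pronunciations(SAMPLE)
```

Words missing from the dictionary, which are likely in highly imperfect transcripts, would need a fallback that this sketch does not provide.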