Page 1:

Linking transcriptions to spoken audio

John Coleman and Sergio Grau

Oxford University Phonetics Laboratory
http://www.phon.ox.ac.uk/SpokenBNC

Page 2:

Many thanks to
• Lou Burnard (re XML)
• Jiahong Yuan, UPenn (for the P2FA aligner)
• Dave de Roure & Kevin Page (for discussions re linked data)
• John Pybus & Amir Nettler (for experiments with streamed audio fragments)
• … for £££

Page 3:

Outline of our talk:

• Large audio corpora and their challenges

• Mining a Year of Speech

• Random access to audio snippets

Page 4:

Multimedia dominates the internet

• 2005: YouTube launched

• 2008: YouTube surpasses Yahoo as world’s No. 2 search engine

• 2011: video/audio dominates peak-time bandwidth in North America

Page 5:

Some browsable audio corpora
• www.oyez.org (US Supreme Court recordings)
• whitehousetapes.net (1940-1973)
• www.scottishcorpus.ac.uk (Scottish Corpus of Texts and Speech)
• http://sounds.bl.uk/ (British Library Archival Sound Recordings)

Page 6:
Page 7:

Challenges of very large audio collections of spoken language

How does a researcher find audio segments of interest?

How do audio corpus providers mark them up to facilitate searching and browsing?

How can very large-scale audio collections be made accessible?

Page 8:

Server-side challenges

Amount of material

Storage:
– CD-quality audio: 635 MB/hour
– Uncompressed .wav files: 115 MB/hour, i.e. 1.02 TB/year
– Library/archive .wav files: 1 GB/hr, 9 TB/yr

1 TB (1,000 GB) hard drive: c. £65 (now £39.95!)

Spoken audio takes roughly 250 times the storage of its XML transcription


Page 9:

Server-side challenges

Audio format issues

– Uncompressed .wav files: 115 MB/hour
– Temptation to use compressed formats
– For speech analysis, low-bitrate compression (40 kb/s) is pretty disastrous
– Spectral centre-of-gravity measures are unreliable even at higher compression rates, but pitch and formant estimation is OK

van Son (2005), Acta Acustica united with Acustica 91: 771-778

Page 10:

Challenges
• Amount of material
• Computing
 – distance measures, etc.
 – alignment of labels
 – searching and browsing
 – Just reading or copying 9 TB takes >1 day (at ~100 MB/s sustained, that is roughly 25 hours)
 – Download time: days or weeks

Page 11:

How large? Some biggish transcribed corpora:

• Switchboard corpus: 13 days (included in MYS)
• Spoken Dutch: 1 month, only a fraction transcribed
• Spoken Spanish: 110 hours
• OSU Buckeye Corpus: 2 days
• Wellington Corpus, NZ: 3 days
• Mining a Year of Speech: 218 days so far, on track towards 3.6 years (>1,200 days)

Page 12:

The “Year of Speech”
A grove of corpora, held at various sites with a common indexing scheme and search tools.

US English:
• 2,240 hours of telephone conversations
• 1,255 hours of broadcast news
• Talk-show conversations (1,000 hrs), Supreme Court oral arguments (5,000 hrs), political speeches and debates

British English: the spoken audio part of the British National Corpus
• >7.4 million words of transcribed speech
• 1,400 hours
• Digitized in collaboration with the British Library

Page 13:

Analogue audio in libraries
• British Library: >1m disks and tapes, 5% digitized
• Library of Congress Recorded Sound Reference Center: >2m items, including …
• International Storytelling Foundation: >8,000 hrs of audio and video
• European broadcast archives: >20m hrs (2,283 years); cf. the Large Hadron Collider
• Media formats: 74% on ¼" tape, 19% shellac and vinyl, 7% digital

Page 14:

Analogue audio in libraries
• Worldwide: ~100m hours (11,415 yrs) of analogue audio, i.e. 4-5 Large Hadron Colliders!
• Cost of professional digitization and cataloguing: ~£20/$32 per tape (e.g. a C-90 cassette)
• Using speech recognition and natural language technologies (e.g. summarization) could provide more detailed cataloguing/indexing without time-consuming human listening

Page 15:

Why so large? Lopsided sparsity
The top ten words (I, you, it, the, 's, and, n't, a, that, yeah) each occur 58,000 times or more, while 12,400 words (23%) occur only once.

Page 16:

Why so large? Lopsided sparsity

Page 17:

A rule of thumb

To catch most …
• English sounds, you need minutes of audio
• common words of English … a few hours
• a typical person's vocabulary … >100 hrs
• pairs of common words … >1,000 hrs
• arbitrary word-pairs … >100 years

Page 18:

Main problem in large corpora: finding needles in the haystack.

To address that challenge, we think there are two “killer apps”:
• Forced alignment
• Data linking, or at least open exposure of digital material, coupled with cross-searching

Page 19:

Practicalities

• In order to be of much practical use, such very large corpora must be indexed at word and segment level

• All included speech corpora must therefore have associated text transcriptions

• We’re using P2FA, the Penn Phonetics Laboratory Forced Aligner, to associate each word and segment with the corresponding start and end points in the sound files
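In batch, that step looks roughly like the following sketch (Python). The corpus layout and paths are hypothetical, and the align.py command line follows the distributed P2FA, so check your version:

# Minimal sketch: force-align every wav/transcript pair in a corpus
# directory with P2FA. Paths are illustrative; align.py's interface
# (wav in, transcript in, TextGrid out) follows the P2FA distribution.
import subprocess
from pathlib import Path

CORPUS = Path("bnc_audio")  # hypothetical layout: foo.wav next to foo.txt

def align_all(corpus_dir: Path) -> None:
    """Run the forced aligner over every transcribed recording."""
    for wav in sorted(corpus_dir.glob("*.wav")):
        transcript = wav.with_suffix(".txt")
        if not transcript.exists():
            continue  # skip untranscribed audio
        textgrid = wav.with_suffix(".TextGrid")
        subprocess.run(
            ["python", "p2fa/align.py", str(wav), str(transcript), str(textgrid)],
            check=True,  # stop on aligner failure
        )

if __name__ == "__main__":
    align_all(CORPUS)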

Page 20:

Mining (indexing by forced alignment)

[alignment figure] × 21 million

Page 21:

Mining (indexing by forced alignment)

Page 22:

Mining (a needle in a haystack)

Page 23:

Mining (a diamond in the rough)

Page 24:

Challenges for alignments

Problems with documentation and records

• Transcription errors
• Long untranscribed portions
• Some transcribed regions with no audio (lost in copying)

Page 25:

Challenges for alignments

• Broadcast recordings may include untranscribed commercials
• Transcripts generally edit out dysfluencies
• Political speeches may extemporize, departing from the published script

Page 26:

Challenges for alignments

• Overlapping speakers
• Background noise/music/babble
• Variable signal loudness
• Reverberation
• Distortion
• Poor speaker vocal health/voice quality
• Unexpected accents: need a multidialect pronouncing dictionary

Page 27:

Issues we’re still grappling with

• No standards for adding phonemic transcriptions and timing information to XML transcriptions

• Many different possible schemes

• How to decide?

Page 28:

Enabling other corpora to be brought in in future; promoting common standards for audio with linked transcription.

How should timing information attach to existing markup like this?

<w c5="AV0" hw="well" pos="ADV">Well </w>

Page 29:

Automatic Speech-to-Phoneme alignment

Page 30:

Aligner output to extended XML

• HTK example:

  0.5625 0.6125 "IH1"
  0.6125 0.8225 "T"
  0.5625 0.8225 "IT"

• HTK output + XML -> extended XML
• How to represent the obtained time information within the existing TEI-XML structure?
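Output in this shape is straightforward to parse before merging with the XML; a minimal sketch, assuming the whitespace-separated, quoted-label format shown above:

# Sketch: parse aligner output lines of the form `start end "LABEL"`
# into (start, end, label) tuples ready to merge into the transcription.
from typing import List, Tuple

def parse_alignment(lines: List[str]) -> List[Tuple[float, float, str]]:
    records = []
    for line in lines:
        start, end, label = line.split()
        records.append((float(start), float(end), label.strip('"')))
    return records

example = ['0.5625 0.6125 "IH1"',
           '0.6125 0.8225 "T"',
           '0.5625 0.8225 "IT"']
print(parse_alignment(example))
# -> [(0.5625, 0.6125, 'IH1'), (0.6125, 0.8225, 'T'), (0.5625, 0.8225, 'IT')]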

Page 31:

Integrating alignment information in the TEI-XML structure
• Time information
 – word level
 – phoneme level
• Phonemic representation of each word
• Timeline

Page 32:

Other representations: EXMARaLDA

EXMARaLDA: “Extensible Markup Language for Discourse Annotation”, http://www.exmaralda.org/

<common-timeline>
 <tli id="T0" time="0.0"/>
 <tli id="T1" time="1.309974117691172"/>
 <tli id="T2" time="1.899962460773455"/>
 <tli id="T3" time="2.3399537674788866"/>
 ...
</common-timeline>
<tier id="TIE0" speaker="SPK0" category="v" type="t" display-name="PRE [v]">
 <event start="T2" end="T3">Good evening. </event>
 <event start="T5" end="T6">I have with me tonight Ann Elk Mistress Ann Elk. </event>
</tier>

Page 33:

Other representations: Voices of the Holocaust

http://voices.iit.edu/xml/voth_project_tei_example.xml

<div corresp="#transcription_id">
 <!-- begin Spool XXX -->
 <div xml:lang="en">
  <u who="#interviewer_id" start="1.631">This is the first utterance of the interviewer.</u>
  <u who="#interviewee_id" start="2.465">This is the first utterance of the interviewee.</u>
 </div>
</div>

Page 34:

Other representations: IFA Dialog Video corpus, Phonetic Sciences, University of Amsterdam

van Son, R., Wesseling, W., Sanders, E., and van den Heuvel, H., The IFADV corpus: A free dialog video corpus, LREC’08, Marrakech, 2008

<TIME_ORDER>
 <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
 <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="10"/>
 <TIME_SLOT TIME_SLOT_ID="ts3" TIME_VALUE="462"/>
 <TIME_SLOT TIME_SLOT_ID="ts4" TIME_VALUE="840"/>
 ...
</TIME_ORDER>
...
<ANNOTATION>
 <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts4" TIME_SLOT_REF2="ts7">
  <ANNOTATION_VALUE>beginnen we weer opnieuw?</ANNOTATION_VALUE>
 </ALIGNABLE_ANNOTATION>
</ANNOTATION>

(The Dutch annotation value means "shall we start again?")

Page 35:

Other representations: LaBB-CAT (ONZE Miner)

http://onzeminer.sourceforge.net

Transcriber or Praat representation

Page 36:

Other representations: Transcriber

http://trans.sourceforge.net

<Turn speaker="spk2" startTime="0.557" endTime="5.851">
 <Sync time="0.557"/> so what do you know of your family 's
 <Sync time="2.255"/> history like
 <Sync time="3.410"/> do you know when and why they came to Oxford
</Turn>

Page 37:

Other representations: COLT Corpus

http://www.hd.uib.no/colt/

– Sentence level

<u who=5 id=1 time=0.112> But I must see Mr <name> [smile again.]
<u who=1 id=2 time=2.016> [<unclear> spoiled again?]
...

– Word level

<u who=5 id=1 time=0.112>
<Audio word=BUT time=0.112 durn=0.176>But</Audio>
<Audio word=I time=0.288 durn=0.064>I</Audio>
<Audio word=MUST time=0.352 durn=0.304>must</Audio>
<Audio word=SEE time=0.816 durn=0.352>see</Audio>
<Audio word=MR time=1.168 durn=0.160>Mr</Audio>
...


Page 39:

Other representations: Summary

• Mostly sentence/word-level time information
• No phoneme analysis
• No phoneme time information
• Timeline representation
• TEI standard?
• Extended TEI-XML with time and phoneme information

Page 40:

<u who="D94PSUNK">
 <s n="3">
  <w c5="VVD" hw="want" pos="VERB">Wanted </w>
  <w c5="PNP" hw="i" pos="PRON">me </w>
  <w c5="TO0" hw="to" pos="PREP">to</w>
  <c c5="PUN">.</c>
 </s>
 <!-- ... -->
</u>

Page 41:

<u who="D94PSUNK">
 <s n="3">
  <w ana="#D94:0083:11" c5="VVD" hw="want" pos="VERB">Wanted </w>
  <w ana="#D94:0083:12" c5="PNP" hw="i" pos="PRON">me </w>
  <w ana="#D94:0083:13" c5="TO0" hw="to" pos="PREP">to</w>
  <c c5="PUN">.</c>
 </s>
</u>

Page 42:

<fs xml:id="D94:0083:11">
 <f name="orth">wanted</f>
 <f name="phon_ana">
  <vcoll type="lst">
   <symbol synch="#D94:0083:11:0" value="W"/>
   <symbol synch="#D94:0083:11:1" value="AO1"/>
   <symbol synch="#D94:0083:11:2" value="N"/>
   <symbol synch="#D94:0083:11:3" value="AH0"/>
   <symbol synch="#D94:0083:11:4" value="D"/>
  </vcoll>
 </f>
</fs>

Page 43:

<timeline origin="0" unit="s" xml:id="TL0">
 ...
 <when xml:id="D94:0083:11:0" from="1.6925" to="1.8225"/>
 <when xml:id="D94:0083:11:1" from="1.8225" to="1.9225"/>
 <when xml:id="D94:0083:11:2" from="1.9225" to="2.1125"/>
 <when xml:id="D94:0083:11:3" from="2.1125" to="2.1825"/>
 <when xml:id="D94:0083:11:4" from="2.1825" to="2.3125"/>
 ...
</timeline>
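To show that this markup is easy to consume, here is a toy sketch (a cut-down document with simplified ids, not real BNC-XML) that joins each <symbol synch="#..."> pointer to its <when> element to recover phoneme timings:

# Sketch: recover phoneme timings by joining <symbol synch="#id">
# pointers to <when xml:id="id"> timeline entries. Toy document only.
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"  # parsed form of xml:id

doc = ET.fromstring("""
<doc>
 <fs>
  <f name="phon_ana"><vcoll type="lst">
   <symbol synch="#w11.0" value="W"/>
   <symbol synch="#w11.1" value="AO1"/>
  </vcoll></f>
 </fs>
 <timeline origin="0" unit="s">
  <when xml:id="w11.0" from="1.6925" to="1.8225"/>
  <when xml:id="w11.1" from="1.8225" to="1.9225"/>
 </timeline>
</doc>
""")

# Index the timeline, then resolve each phoneme symbol against it.
whens = {w.get(XML_ID): (float(w.get("from")), float(w.get("to")))
         for w in doc.iter("when")}
for sym in doc.iter("symbol"):
    start, end = whens[sym.get("synch").lstrip("#")]
    print(f"{sym.get('value')}\t{start:.4f}\t{end:.4f}")
# W    1.6925  1.8225
# AO1  1.8225  1.9225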

Page 44:

Q. When you have an indexing scheme and a big database, what do you want to do with it?

A. Random access to audio snippets

Page 45:

Random access to audio snippets

• Timing of fragments in the URL
• e.g. Gaudi (Google Labs), everyzing.com (ramp.com)
• http://audio.weei.com/search?q=something
• http://audio.weei.com/a/42828235/red-sox-pregame-show.htm#q=something&seek=311.989

Page 46:
Page 47:
Page 48:

Random access to audio snippets
• Audio objects in HTML5 (in the browser), e.g. http://www.phon.ox.ac.uk/jcoleman/useful_test.html
• W3C media fragments protocol, e.g. http://www.w3.org/2008/WebVideo/Fragments/
  Demo: http://ninsuna.elis.ugent.be/MediaFragmentsPlayer
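In the media fragments scheme, a word-sized snippet is addressable by appending the temporal fragment #t=start,end (seconds, normal play time) to the audio URL; a minimal sketch with a hypothetical base URL:

# Sketch: build a W3C Media Fragments URI for a time range.
# The base URL is hypothetical; server and player support varies.
def fragment_url(audio_url: str, start: float, end: float) -> str:
    return f"{audio_url}#t={start:.4f},{end:.4f}"

print(fragment_url("http://example.org/bnc/D94.wav", 1.6925, 2.3125))
# -> http://example.org/bnc/D94.wav#t=1.6925,2.3125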

Page 49:

URNs for audio snippets

• Linked data/semantic web approach: refer to each specific word, phoneme, etc. as a specific audio object, not just a time range inside an audio file

• Challenge: need for an ontology for sounds and sound timelines in audio recordings

• Some progress in music ontologies
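A sketch of what that might look like with rdflib; the ex: vocabulary below is invented for illustration, since no agreed speech-audio ontology exists yet:

# Sketch: describe one word token as a first-class linked-data object.
# The ex: vocabulary and all URIs are made up for illustration only.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import XSD

EX = Namespace("http://example.org/speech#")  # hypothetical ontology
g = Graph()
g.bind("ex", EX)

token = URIRef("http://example.org/bnc/D94/token/0083-11")
g.add((token, EX.orthography, Literal("wanted")))
g.add((token, EX.inRecording, URIRef("http://example.org/bnc/D94.wav")))
g.add((token, EX.startTime, Literal("1.6925", datatype=XSD.decimal)))
g.add((token, EX.endTime, Literal("2.3125", datatype=XSD.decimal)))

print(g.serialize(format="turtle"))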

Page 50:

Conclusion
• Sound and multimedia corpora/collections are getting very big
• In fact multimedia, not text, dominates the internet
• So we need standard ways of representing audio structure and accessing its parts

• Forced alignment allows us to map transcriptions to audio, reasonably accurately

• For searching, there are several “demonstration” possibilities, but this is still work in progress

Page 51:

Thank you very much!

