
Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik.

Page 1: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik.

Audio Retrieval

LBSC 708A

Session 11, November 20, 2001

Philip Resnik

Page 2: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik.

Agenda

• Questions

• Group thinking session

• Speech retrieval

• Music retrieval

Page 3: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik.

Shoah Foundation Collection

• 52,000 interviews
  – 116,000 hours (13 years)
  – 32 languages

• Full description cataloging
  – 14,000 term thesaurus
  – 4,000 interviews for $8 million
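
The numbers on this slide can be checked with a little arithmetic. The sketch below is illustrative only: the variable names are made up here, and the extrapolation to the whole collection assumes the cost per interview stays constant, which the slide does not state.

```python
# Figures taken from the slide.
interviews = 52_000
hours = 116_000
cataloged_interviews = 4_000
cataloging_cost_usd = 8_000_000

# 116,000 hours of continuous playback, expressed in years (about 13.2).
years_of_audio = hours / (24 * 365)

# Average interview length in hours (about 2.2).
hours_per_interview = hours / interviews

# Cost of full description cataloging per interview.
cost_per_interview = cataloging_cost_usd / cataloged_interviews

# ASSUMPTION: the same per-interview cost applies to every interview.
full_collection_cost = cost_per_interview * interviews

print(f"{years_of_audio:.1f} years of audio")
print(f"${cost_per_interview:,.0f} per interview")
print(f"${full_collection_cost:,.0f} to catalog everything at that rate")
```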

Page 4: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik.

Audio Retrieval

• We have already discussed three approaches
  – Controlled vocabulary indexing
  – Ranked retrieval based on associated captions
  – Social filtering based on other users’ ratings

• Today’s focus is on content-based retrieval
  – Analogue of content-based text retrieval

Page 5: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik.

Audio Retrieval

• Retrospective retrieval applications
  – Search music and nonprint media collections
  – Electronic finding aids for sound archives
  – Index audio files on the web

• Information filtering applications
  – Alerting service for a news bureau
  – Answering machine detection for telemarketing
  – Autotuner for a car radio

Page 6: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik.

The Size of the Problem

• 30,000 hours in the Maryland Libraries
  – Unique collections with limited physical access

• 116,000 hours in the Shoah collection

• Millions of hours of streaming audio each year
  – Becoming available worldwide on the web

• Broadcast news (audio/video)
  – Ex. Television archive

Page 7: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik.
Page 8: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik.

HotBot Audio Search Results

Page 9: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik.

Audio Genres

• Speech-centered
  – Radio programs
  – Telephone conversations
  – Recorded meetings

• Music-centered
  – Instrumental, vocal

• Other sources
  – Alarms, instrumentation, surveillance, …

Page 10: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik.

Detectable Speech Features

• Content
  – Phonemes, one-best word recognition, n-best

• Identity
  – Speaker identification, speaker segmentation

• Language
  – Language, dialect, accent

• Other measurable parameters
  – Time, duration, channel, environment

Page 11: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik.

How Speech Recognition Works

• Three stages
  – What sounds were made?
    • Convert from waveform to subword units (phonemes)
  – How could the sounds be grouped into words?
    • Identify the most probable word segmentation points
  – Which of the possible words were spoken?
    • Based on likelihood of possible multiword sequences

• All three stages are learned from training data
  – Using hill climbing to fit a “Hidden Markov Model”
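
The third stage, choosing among possible words by the likelihood of multiword sequences, can be sketched with a toy bigram language model. Everything below is hypothetical: the probabilities are invented for illustration, the candidate word sequences stand in for paths through a word lattice, and acoustic scores are ignored entirely.

```python
import math

# HYPOTHETICAL bigram probabilities P(word | previous word); not real data.
BIGRAM = {
    ("<s>", "recognize"): 0.02,
    ("recognize", "speech"): 0.3,
    ("<s>", "wreck"): 0.01,
    ("wreck", "a"): 0.2,
    ("a", "nice"): 0.05,
    ("nice", "beach"): 0.02,
}

def sequence_log_prob(words, floor=1e-6):
    """Sum of log bigram probabilities; unseen bigrams get a small floor value."""
    total = 0.0
    for prev, cur in zip(["<s>"] + words, words):
        total += math.log(BIGRAM.get((prev, cur), floor))
    return total

# Two candidate word sequences that could match similar-sounding audio.
candidates = [
    ["recognize", "speech"],
    ["wreck", "a", "nice", "beach"],
]

# Word selection: keep the candidate the language model finds most likely.
best = max(candidates, key=sequence_log_prob)
print(" ".join(best))
```

A real recognizer would combine these language model scores with acoustic scores and search the lattice efficiently instead of enumerating candidates.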

Page 12: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik.

Using Speech Recognition

[Diagram labels: Phone detection, Word construction, Word selection; Phone n-grams, Phone lattice, Words; Transcription dictionary, Language model; One-best transcript, Word lattice]

Page 13: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik.

ETHZ Broadcast News Retrieval

• Segment broadcasts into 20 second chunks

• Index phoneme n-grams
  – Overlapping one-best phoneme sequences
  – Trained using native German speakers

• Form phoneme trigrams from typed queries
  – Rule-based system for “open” vocabulary

• Vector space trigram matching
  – Identify ranked segments by time
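
Vector space trigram matching can be sketched as cosine similarity between trigram count vectors for the query and for each segment. This is a minimal sketch, not the ETHZ system: the function names, the sample phoneme strings, and the use of raw counts without term weighting are all assumptions made here.

```python
import math
from collections import Counter

def trigram_counts(phonemes):
    """Count the overlapping phoneme trigrams in a list of phonemes."""
    return Counter(tuple(phonemes[i:i + 3]) for i in range(len(phonemes) - 2))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_segments(query_phonemes, segments):
    """segments: list of (start_seconds, phoneme list). Returns (score, start), best first."""
    query_vector = trigram_counts(query_phonemes)
    scored = [(cosine(query_vector, trigram_counts(p)), start) for start, p in segments]
    return sorted(scored, key=lambda item: item[0], reverse=True)

# HYPOTHETICAL one-best phoneme output for two 20-second chunks.
segments = [
    (0, "dh ah m ae n ih jh er".split()),
    (20, "s ah m w ah n".split()),
]
ranked = rank_segments("m ae n ih jh".split(), segments)
for score, start in ranked:
    print(f"segment at {start:>3} s  score {score:.3f}")
```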

Page 14: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik.

Phoneme Trigrams

• Manage -> m ae n ih jh
  – Dictionaries provide accurate transcriptions
    • But valid only for a single accent and dialect
  – Rule-based transcription handles unknown words

• Index every overlapping 3-phoneme sequence
  – m ae n
  – ae n ih
  – n ih jh
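
Extracting every overlapping 3-phoneme sequence is a sliding window of width three. A minimal sketch, with a function name chosen here for illustration:

```python
def phoneme_trigrams(phonemes):
    """Return every overlapping 3-phoneme sequence, in order of appearance."""
    return [tuple(phonemes[i:i + 3]) for i in range(len(phonemes) - 2)]

# The transcription of "manage" from the slide.
manage = "m ae n ih jh".split()
for trigram in phoneme_trigrams(manage):
    print(" ".join(trigram))
```

A word with fewer than three phonemes produces no trigrams at all, so it cannot be matched by this kind of index.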

Page 15: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik.

ETHZ Broadcast News Retrieval

Page 16: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik.

Cambridge Video Mail Retrieval

• Added personal audio (and video) to email
  – But subject lines still typed on a keyboard

• Indexed most probable phoneme sequences

Page 17: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik.

Cambridge Video Mail Retrieval

• Translate queries to phonemes with dictionary
  – Skip stopwords and words with 3 phonemes

• Find no-overlap matches in the lattice
  – Queries take about 4 seconds per hour of material

• Vector space exact word match
  – No morphological variations checked
  – Normalize using most probable phoneme sequence

• Select from a ranked list of subject lines

Page 18: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik.
Page 19: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik.
Page 20: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik.

Contrast of Approaches

• Rule-based transcription
  – Potentially error-prone
  – Broad coverage, handles unknown words

• Dictionary-based transcription
  – Good for smaller settings
  – Accurate

• Both susceptible to the problem of variability

Page 21: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik.

BBN Radio News Retrieval

Page 22: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik.

Comparison with Text Retrieval

• Detection is harder
  – Speech recognition errors

• Selection is harder
  – Date and time are not very informative

• Examination is harder
  – Linear medium is hard to browse
  – Arbitrary segments produce unnatural breaks

Page 23: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik.

Speaker Identification

• Gender
  – Classify speakers as male or female

• Identity
  – Detect speech samples from same speaker
  – To assign a name, need a known training sample

• Speaker segmentation
  – Identify speaker changes
  – Count number of speakers

Page 24: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik.

A Richer View of Speech

• Speaker identification
  – Known speaker and “more like this” searches
  – Gender detection for search and browsing

• Topic segmentation via vocabulary shift
  – More natural breakpoints for browsing

• Speaker segmentation
  – Visualize turn-taking behavior for browsing
  – Classify turn-taking patterns for searching
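
Topic segmentation via vocabulary shift can be sketched by comparing the words on either side of each candidate breakpoint and proposing a boundary where they have little in common. The sketch below is one simple way to do that, not a method named in these slides: the window size, the threshold, and the sample transcript lines are all assumptions.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bags of words."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def vocabulary_shift_boundaries(sentences, window=2, threshold=0.1):
    """Return each index i where a topic boundary is proposed just before sentences[i]."""
    bags = [Counter(s.lower().split()) for s in sentences]
    boundaries = []
    for i in range(window, len(bags) - window + 1):
        left = sum(bags[i - window:i], Counter())   # words just before position i
        right = sum(bags[i:i + window], Counter())  # words from position i onward
        if cosine(left, right) < threshold:
            boundaries.append(i)
    return boundaries

# HYPOTHETICAL transcript lines from a news broadcast.
transcript = [
    "the senate passed the budget",
    "the budget vote was close",
    "rain is expected tomorrow",
    "tomorrow brings more rain",
]
print(vocabulary_shift_boundaries(transcript))
```

With recognizer output instead of clean text, recognition errors would blur the vocabulary on both sides, so the threshold would likely need tuning.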

Page 25: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik.

Other Possibly Useful Features

• Channel characteristics
  – Cell phone, landline, studio mike, ...

• Accent
  – Another way of grouping speakers

• Prosody
  – Detecting emphasis could help search or browsing

• Non-speech audio
  – Background sounds, audio cues

Page 26: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik.

Competing Demands on the Interface

• Query must result in a manageable set
  – But users prefer simple query interfaces

• Selection interface must show several segments
  – Representations must be compact, but informative

• Rapid examination should be possible
  – But complete access to the recordings is desirable

Page 27: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik.

Iterative Prototyping Strategy

• Select a user group and a collection

• Observe information seeking behaviors
  – To identify effective search strategies

• Refine the interface
  – To support effective search strategies

• Integrate needed speech technologies

• Evaluate the improvements with user studies
  – And observe changes to effective search strategies

Page 28: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik.

The VoiceGraph Project

• Exploring rich queries
  – Content-based, speaker-based, structure-based

• Multiple cues in the selection interface
  – Turn-taking, gender, query terms

• Flexible examination
  – Text transcript, audio skims

Page 29: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik.

Depicting Turn Taking Behavior

• Time is depicted from left to right

• Speakers separated vertically within a depiction

• Depictions stacked vertically in rank order

• Actual recordings are more complex

[Figure: example turn-taking depictions, numbered 1 to 4]
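
The layout rules on this slide (time from left to right, one row per speaker) can be sketched as a plain-text rendering. This is only an illustration of the idea, not the VoiceGraph display: the function name, the character-cell layout, and the sample turns are assumptions.

```python
def depict_turns(turns, total_seconds, width=40):
    """turns: list of (speaker, start_seconds, end_seconds).

    Returns one (speaker, row) pair per speaker; '#' marks when that speaker talks.
    """
    speakers = sorted({speaker for speaker, _, _ in turns})
    rows = {speaker: [" "] * width for speaker in speakers}
    for speaker, start, end in turns:
        first = int(start / total_seconds * width)
        # Draw at least one cell so very short turns stay visible.
        last = max(first + 1, int(end / total_seconds * width))
        for col in range(first, min(last, width)):
            rows[speaker][col] = "#"
    return [(speaker, "".join(rows[speaker])) for speaker in speakers]

# HYPOTHETICAL one-minute segment: an anchor hands off to a guest and returns.
turns = [("anchor", 0, 15), ("guest", 15, 50), ("anchor", 50, 60)]
for speaker, row in depict_turns(turns, total_seconds=60):
    print(f"{speaker:>6} |{row}|")
```

Stacking several such depictions, one per retrieved segment, gives the rank-ordered display the slide describes.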

Page 30: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik.
Page 31: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik.
Page 32: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik.

Bootstrapping the Prototype

• Select a user population and a collection
  – Journalists and historians
  – Broadcast news from the 1960’s and 1970’s

• Mock up an interface
  – Pilot study to see if we’re on the right track

• Integrate “back end” speech processing
  – Recognition, identification, segmentation, ...

• Observe information seeking behaviors

Page 33: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik.

New Zealand Melody Index

• Index musical tunes as contour patterns
  – Rising, descending, and repeated pitch
  – Note duration as a measure of rhythm

• Users sing queries using words or la, da, …
  – Pitch tracking accommodates off-key queries

• Rank order using approximate string match
  – Insert, delete, substitute, consolidate, fragment

• Display title, sheet music, and audio

Page 34: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik.

Contour Matching Example

• “Three Blind Mice” is indexed as:
  – *DDUDDUDRDUDRD
    • * represents the first note
    • D represents a descending pitch (U is ascending)
    • R represents a repetition (detectable split, same pitch)

• My singing produces:
  – *DDUDDUDRRUDRR

• Approximate string match finds 2 substitutions
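
The example above can be reproduced with a contour encoder and a standard edit distance. This is a sketch under assumptions: the MIDI note numbers for the melody (key of C) and the function names are chosen here, and only insert, delete, and substitute are implemented, not the consolidate and fragment operations mentioned on the previous slide.

```python
def contour(pitches, tolerance=0.5):
    """Encode a pitch sequence (in semitones) as a string of *, U, D, R symbols."""
    symbols = ["*"]  # * stands for the first note
    for prev, cur in zip(pitches, pitches[1:]):
        if cur - prev > tolerance:
            symbols.append("U")
        elif prev - cur > tolerance:
            symbols.append("D")
        else:
            symbols.append("R")  # same pitch within tolerance
    return "".join(symbols)

def edit_distance(a, b):
    """Levenshtein distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete from a
                           cur[j - 1] + 1,              # insert into a
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]

# ASSUMED MIDI note numbers for the opening of "Three Blind Mice" in C:
# E D C, E D C, G F F E, G F F E
three_blind_mice = [64, 62, 60, 64, 62, 60, 67, 65, 65, 64, 67, 65, 65, 64]

indexed = contour(three_blind_mice)
sung = "*DDUDDUDRRUDRR"  # the sung query from the slide
print(indexed)
print(edit_distance(indexed, sung))
```

Because the encoding records only the direction of each pitch change, a query sung in a different key or slightly off pitch still produces the same or a nearby contour string.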

Page 35: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik.
Page 36: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik.

Muscle Fish Audio Retrieval

• Compute 4 acoustic features for each time slice
  – Pitch, amplitude, brightness, bandwidth

• Segment at major discontinuities
  – Find average, variance, and smoothness of segments

• Store pointers to segments in 13 sorted lists
  – Use a commercial database for proximity matching
    • 4 features, 3 parameters for each, plus duration
  – Then rank order using statistical classification

• Display file name and audio
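
The count of 13 follows from 4 features times 3 parameters plus duration. The sketch below builds such a 13-value description for one segment. It is illustrative only: the slide does not define "smoothness", so lag-1 autocorrelation is used here as a stand-in, and the function names and sample values are hypothetical.

```python
from statistics import fmean, pvariance

FEATURES = ("pitch", "amplitude", "brightness", "bandwidth")

def lag1_autocorrelation(values):
    """ASSUMED smoothness measure: near 1 for slowly varying tracks."""
    mean = fmean(values)
    denom = sum((v - mean) ** 2 for v in values)
    if denom == 0:
        return 0.0  # constant track; correlation is undefined, so report 0
    num = sum((a - mean) * (b - mean) for a, b in zip(values, values[1:]))
    return num / denom

def segment_vector(slices, slice_seconds):
    """slices: one dict per time slice, keyed by FEATURES. Returns 13 numbers."""
    vector = []
    for name in FEATURES:
        track = [s[name] for s in slices]
        vector += [fmean(track), pvariance(track), lag1_autocorrelation(track)]
    vector.append(len(slices) * slice_seconds)  # duration of the segment
    return vector

# HYPOTHETICAL measurements for a four-slice segment.
slices = [
    {"pitch": 440.0, "amplitude": 0.50, "brightness": 1200.0, "bandwidth": 300.0},
    {"pitch": 442.0, "amplitude": 0.55, "brightness": 1210.0, "bandwidth": 310.0},
    {"pitch": 444.0, "amplitude": 0.60, "brightness": 1190.0, "bandwidth": 305.0},
    {"pitch": 446.0, "amplitude": 0.58, "brightness": 1205.0, "bandwidth": 298.0},
]
vector = segment_vector(slices, slice_seconds=0.25)
print(len(vector))
```

Keeping one sorted list per value lets a database find segments that are close to a query on each dimension before the final ranking step.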

Page 37: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik.

Muscle Fish Audio Retrieval

Page 38: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik.

Summary

• Limited audio indexing is practical now
  – Audio feature matching, answering machine detection

• Present interfaces focus on a single technology
  – Speech recognition, audio feature matching
  – Matching technology is outpacing interface design

Page 39: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik.


Page 40: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik.

October 1, 2001 LBSC 708R

Speech-Based Retrieval Systems

Douglas W. Oard

College of Library and Information Services

University of Maryland

Page 41: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik.

The Size of the Problem

• 30,000 hours in the Maryland Libraries
  – Unique collections with limited physical access

• Over 100,000 hours in the National Archives
  – With new material arriving at an increasing rate

• Millions of hours broadcast each year
  – Over 2,500 radio stations are now Webcasting!

Page 42: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik.

Outline

• Retrieval strategies

• Some examples

• Comparing speech and text retrieval

• Speech-based retrieval interface design

Page 43: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik.

Global Internet Audio

[Chart: over 2,500 Internet-accessible radio and television stations, divided into English and other languages; the extracted values are 1062 and 1438. Source: www.real.com, Mar 2001]

Page 44: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik.

Shoah Foundation Collection

• 52,000 interviews
  – 116,000 hours (13 years)
  – 32 languages

• Full description cataloging
  – 14,000 term thesaurus
  – 4,000 interviews for $8 million

Page 45: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik.

Speech Retrieval Approaches

• Controlled vocabulary indexing

• Ranked retrieval based on associated text

• Automatic feature-based indexing

• Social filtering based on other users’ ratings

Page 46: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik.

Supporting the Search Process

[Diagram labels: Source Selection; Query Formulation; Query; Search; IR System; Ranked List; Selection; Document; Examination; Document; Delivery; Query Reformulation and Relevance Feedback; Source Reselection; Nominate, Choose, Predict]

Page 47: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik.
Page 48: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik.

HotBot Audio Search Results

Page 49: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik.

ETH Zurich Radio News Retrieval

Page 50: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik.

BBN Radio News Retrieval

Page 51: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik.

AT&T Radio News Retrieval

Page 52: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik.

MIT “Speech Skimmer”

Page 53: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik.

Cambridge Video Mail Retrieval

Page 54: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik.
Page 55: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik.

CMU Television News Retrieval

Page 56: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik.

Comparison with Text Retrieval

• Detection and ranking are harder
  – Because of speech recognition errors

• Selection is harder
  – Useful titles are sometimes hard to obtain
  – Date and time alone may not be informative

• Examination is harder
  – Browsing is harder in strictly linear media

Page 57: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik.

A Richer View of Speech

• Speaker identification
  – Known speakers
  – Gender labeling
  – “More like this” searches

• Topic segmentation
  – Find natural breakpoints for browsing

• Speaker segmentation
  – Extract turn-taking behavior

Page 58: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik.

Visualizing Turn-Taking

Page 59: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik.

Other Available Features

• Channel characteristics
  – Cell phone, landline, studio mike, ...

• Cultural factors
  – Language, accent, speaking rate

• Prosody
  – Emphasis detection

• Non-speech audio
  – Background sounds, audio cues

Page 60: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik.

Competing Demands on the Interface

• Query must result in a manageable set
  – But users prefer simple query interfaces

• Selection interface must show several segments
  – Representations must be compact, but informative

• Rapid examination should be possible
  – But complete access to the recordings is desirable

Page 61: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik.

The VoiceGraph Project

• Exploring rich queries
  – Content-based, speaker-based, structure-based

• Multiple cues in the selection interface
  – Turn-taking, gender, query terms

• Flexible examination
  – Text transcript, audio skims

Page 62: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik.

Pilot Study

• Student focus groups
  – 15 from Journalism, 3 from Library Science

• Preliminary drawing exercise

• Static screen shots and mock-ups

• Focused discussion

• User satisfaction questionnaire

• Structured interviews with domain experts
  – Journalism and Library Science faculty

Page 63: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik.

Pilot Study Results

• Graphical speech representations appear viable
  – Expected to be useful for high level browsing
    • When coupled with text transcripts and audio replay
  – Some training will be needed

• Suggested improvements
  – Adjust result set spacing to facilitate rapid selection
  – Identify categories (monologue, conversation, …)
    • Potentially useful for search or browsing

Page 64: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik.

For More Information

• Speech-based information retrieval
  – http://www.clis.umd.edu/dlrg/speech/

• The VoiceGraph project
  – http://www.clis.umd.edu/dlrg/voicegraph/

Page 65: Audio Retrieval LBSC 708A Session 11, November 20, 2001 Philip Resnik.

Comparison with Text Retrieval

• Detection is harder
  – Speech recognition errors

• Selection is harder
  – Date and time are not very informative

• Examination is harder
  – Linear medium is hard to browse
  – Arbitrary segments produce unnatural breaks

