
Dagstuhl Seminar 13451 (Computational Audio Analysis) Inspirational Talk

Date post: 07-Jul-2015
Upload: xanguera
Description:
This is a 5-minute presentation I was invited to give at a Dagstuhl seminar on low/zero-resource processing in November 2013.
Transcript
Page 1: Dagstuhl Seminar 13451 (Computational Audio Analysis) Inspirational Talk

Multimedia analysis for the poor (in training resources)

Xavier Anguera

Telefonica Research

Dagstuhl Seminar 13451 - Inspirational talk

Page 2

Does this affect me?

• You work in areas where there is not much training data available

– Maybe it exists in domains other than your test data.

• The task you are pursuing does not have a well-annotated corpus for training

– E.g. finding structure in signals

• It is difficult / you do not know how to define training “units” in your task

• You like to work on complicated stuff

Page 3

Typical Speech paper diagram

Labeled training data → My favorite ML technique → “I am a model”

Testing data → My favorite decoding technique → My result

Page 4

…making it as complicated as you would like to

Page 11

Resource-free technologies

• Summarization

– Acoustic word cloud of most repeated acoustic items

– Repetition-based summarization (MODIS software @ INRIA-Rennes)

• Structure analysis in music

• Audio-visual unsupervised learning (e.g. the Google cats)

• Acquisition of unknown sounds (e.g. Tuomo’s talk)

• Exemplar-based ASR (Leuven Univ.)

Page 12

EXAMPLE: Spoken Audio Search (or Query-by-Example Spoken-Term Detection)

Given a single spoken query we search for instances at lexical level within spoken documents

It is similar to Spoken Term Detection (NIST STD2006, OpenKWS 2013) but…

Queries are spoken

Different speakers

Different acoustic conditions

No prior knowledge of the language(s) might be available
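The slides do not say which search method is used. As a hedged illustration only, one common way to approach query-by-example search without language knowledge is subsequence dynamic time warping (DTW) over frame-level features. The sketch below assumes the features are already extracted as NumPy arrays (frames × dimensions); the function names, the cosine distance, and the length normalization are choices made for this example, not details taken from the talk.

```python
import numpy as np

def cosine_distance_matrix(query, doc):
    """Pairwise cosine distance between query frames and document frames."""
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    d = doc / np.linalg.norm(doc, axis=1, keepdims=True)
    return 1.0 - q @ d.T

def subsequence_dtw(query, doc):
    """Find the document region that best matches the whole query.

    Returns (cost normalized by query length, start frame, end frame).
    The match may start and end anywhere in the document.
    """
    dist = cosine_distance_matrix(query, doc)
    n, m = dist.shape
    acc = np.full((n, m), np.inf)        # accumulated cost
    start = np.zeros((n, m), dtype=int)  # document frame where each path began
    acc[0, :] = dist[0, :]               # free start: no penalty for skipping a prefix
    start[0, :] = np.arange(m)
    for i in range(1, n):
        acc[i, 0] = acc[i - 1, 0] + dist[i, 0]
        start[i, 0] = start[i - 1, 0]
        for j in range(1, m):
            # Allowed moves: diagonal, vertical, horizontal.
            prev_cells = ((i - 1, j - 1), (i - 1, j), (i, j - 1))
            costs = [acc[c] for c in prev_cells]
            k = int(np.argmin(costs))
            acc[i, j] = costs[k] + dist[i, j]
            start[i, j] = start[prev_cells[k]]
    end = int(np.argmin(acc[-1, :]))     # free end: best cell in the last query row
    return float(acc[-1, end] / n), int(start[-1, end]), end
```

A detection score for a (query, document) pair could then be derived from the returned cost, for example its negative, so that higher means a better match.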

Page 13

Mediaeval SWS 2013

• 9 languages in different acoustic contexts: 4 African languages (isiXhosa, isiZulu, Sepedi, Setswana), Albanian, Basque, Czech, non-native English, Romanian

                 #utts    Time (h:mm:ss)   Avg. length/utt.
Search corpus    10762    19:57:55         6.67s
Dev queries        505     0:11:26         1.35s
Extended dev*     1046     0:08:42         0.49s
Eval queries       503     0:11:37         1.38s
Extended eval*    1037     0:08:57         0.51s
Total            13853    20:38:37

*Only Basque (3x) and Czech (10x) queries have extended versions
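As a quick sanity check on the table, the totals and per-utterance averages can be recomputed from the row values. This is a small sketch, with the row data copied from the slide; the helper names are made up for this example. The slide's averages appear to be truncated rather than rounded to two decimals, so the sketch reports averages in whole hundredths of a second using integer division.

```python
# (#utts, duration h:mm:ss) per row, copied from the slide's table.
ROWS = {
    "Search corpus": (10762, "19:57:55"),
    "Dev queries": (505, "0:11:26"),
    "Extended dev": (1046, "0:08:42"),
    "Eval queries": (503, "0:11:37"),
    "Extended eval": (1037, "0:08:57"),
}

def to_seconds(hms):
    """Convert an 'h:mm:ss' string to a number of seconds."""
    h, m, s = (int(part) for part in hms.split(":"))
    return 3600 * h + 60 * m + s

def to_hms(seconds):
    """Convert a number of seconds back to 'h:mm:ss'."""
    h, rest = divmod(seconds, 3600)
    m, s = divmod(rest, 60)
    return f"{h}:{m:02d}:{s:02d}"

def summarize(rows):
    """Return (total #utts, total duration, average length per row in 1/100 s)."""
    total_utts = sum(n for n, _ in rows.values())
    total_seconds = sum(to_seconds(t) for _, t in rows.values())
    # Integer division truncates, matching the two-decimal values on the slide.
    avg_centis = {name: (100 * to_seconds(t)) // n for name, (n, t) in rows.items()}
    return total_utts, to_hms(total_seconds), avg_centis
```

Calling `summarize(ROWS)` reproduces the "Total" row and the "Avg. length/utt." column (for instance, 667 hundredths of a second corresponds to 6.67s).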

Page 14

Mediaeval SWS 2013

[Figure: DET curves, “Primary systems (evaluation)”. Axes: False Alarm probability (in %) and Miss probability (in %).]

Legend:

Random Performance

GTTS (MTWV=0.399, Thr=5.243)

L2F (MTWV=0.342, Thr=3.551)

CUHK (MTWV=0.306, Thr=0.618)

BUT (MTWV=0.297, Thr=0.914)

CMTECHETAL (MTWV=0.257, Thr=18.153)

IIITH (MTWV=0.224, Thr=2.721)

ELIRF (MTWV=0.159, Thr=2.759)

TID (MTWV=0.093, Thr=5.051)

GTC (MTWV=0.084, Thr=3.341)

SPEED (MTWV=0.059, Thr=0.923)

LIA-Late (MTWV=0.000, Thr=1079.003)

UNIZA-Late (MTWV=0.001, Thr=1.000)

TUKE-Late (MTWV=0.000, Thr=3.000)
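For readers unfamiliar with the MTWV figures attached to each system: MTWV is the maximum, over decision thresholds, of the Term-Weighted Value, TWV(θ) = 1 − [P_miss(θ) + β·P_FA(θ)], with both probabilities averaged over query terms. The sketch below follows the NIST STD 2006 style of definition (one non-target trial per second of speech). The data layout, the function names, and the `beta` value passed in are assumptions for illustration; the slides do not give the parameters used in SWS 2013.

```python
def twv(detections, n_true, speech_seconds, threshold, beta):
    """Term-Weighted Value at one threshold.

    detections: dict term -> list of (score, is_correct) candidate hits
    n_true: dict term -> number of true occurrences in the search corpus
    speech_seconds: total duration of the search corpus, in seconds
    beta: weight of false alarms relative to misses (assumed, task-specific)
    """
    p_miss, p_fa = [], []
    for term, n_ref in n_true.items():
        if n_ref == 0:
            continue  # terms with no true occurrence do not enter the average
        kept = [ok for score, ok in detections.get(term, []) if score >= threshold]
        n_correct = sum(kept)
        n_spurious = len(kept) - n_correct
        p_miss.append(1.0 - n_correct / n_ref)
        # Non-target trials: one per second of speech, minus the true occurrences.
        p_fa.append(n_spurious / (speech_seconds - n_ref))
    return 1.0 - (sum(p_miss) / len(p_miss) + beta * sum(p_fa) / len(p_fa))

def mtwv(detections, n_true, speech_seconds, beta):
    """Maximum TWV over all observed scores; returns (value, threshold).

    Assumes at least one detection exists.
    """
    thresholds = sorted({score for hits in detections.values() for score, _ in hits})
    return max((twv(detections, n_true, speech_seconds, t, beta), t) for t in thresholds)
```

A system that outputs nothing scores TWV = 0, and a perfect system scores 1; because false alarms are weighted by `beta`, a system producing many spurious hits can score below 0.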

Page 15

Mediaeval SWS 2013 (results per language)

Page 17

How do children learn? (from someone who is not a parent…)

1. They listen to their environment and identify/isolate particular audio-visual stimuli they do not know.

2. An expert (parent/grandparent) tells them the “meaning” of those stimuli.

– If a stimulus appears in different forms (or the child is not sharp), it will need to be repeated a couple of times…

3. The child learns and is able to identify this stimulus from then on.

Page 18

[Figure: five items labeled “book”; two “Machine learning” boxes, one producing a “book” model and the other a “?” model]

Page 20

• How to incorporate acoustic modeling into dynamic programming techniques?

• How to describe the acoustic space (or whatever space) in an unsupervised (but robust) manner?

• How do we discriminate between “interesting/relevant” and “filler” events?

• Does it all make any sense? (Or can we assume we will always have enough training data?)

