Transcript
Page 1: Evaluation in Audio Music Similarity

Evaluation in Audio Music Similarity

PhD dissertation

by

Julián Urbano

Leganés, October 3rd, 2013

Picture by Javier García

Page 2: Evaluation in Audio Music Similarity

Outline

• Introduction

• Validity

• Reliability

• Efficiency

• Conclusions and Future Work

2

Page 3: Evaluation in Audio Music Similarity

Outline

• Introduction

– Scope

– The Cranfield Paradigm

• Validity

• Reliability

• Efficiency

• Conclusions and Future Work

3

Page 4: Evaluation in Audio Music Similarity

Information Retrieval

• Automatic representation, storage and search of unstructured information

– Traditionally textual information

– Lately multimedia too: images, video, music

• A user has an information need and uses an IR system that retrieves the relevant or significant information from a collection of documents

4

Page 5: Evaluation in Audio Music Similarity

Information Retrieval Evaluation

• IR systems are based on models to estimate relevance, implementing different techniques

• How good is my system? What system is better?

• Answered with IR Evaluation experiments

– “if you can’t measure it, you can’t improve it”

– But we need to be able to trust our measurements

• Research on IR Evaluation

– Improve our methods to evaluate systems

– Critical for the correct development of the field

5

Page 6: Evaluation in Audio Music Similarity

History of IR Evaluation research

6

[Timeline figure, 1960-2010: Cranfield 2, MEDLARS, SMART, SIGIR]

Page 7: Evaluation in Audio Music Similarity

History of IR Evaluation research

6

[Timeline figure, 1960-2010: Cranfield 2, MEDLARS, SMART, SIGIR, TREC, NTCIR, CLEF, INEX]

Page 8: Evaluation in Audio Music Similarity

History of IR Evaluation research

6

[Timeline figure, 1960-2010: Cranfield 2, MEDLARS, SMART, SIGIR, TREC, NTCIR, CLEF, INEX, ISMIR, MIREX, MusiCLEF, MSD Challenge]

Page 11: Evaluation in Audio Music Similarity

Audio Music Similarity

• A song (an audio signal) is given as input to the system

• It retrieves songs musically similar to it, by content

• Resembles traditional Ad Hoc retrieval in Text IR

• Arguably the most important task in Music IR

– Music recommendation

– Playlist generation

– Plagiarism detection

• Annual evaluation in MIREX

7

Page 12: Evaluation in Audio Music Similarity

Outline

• Introduction

– Scope

– The Cranfield Paradigm

• Validity

• Reliability

• Efficiency

• Conclusions and Future Work

8

Page 14: Evaluation in Audio Music Similarity

The two questions

• How good is my system?

– What does good mean?

– What is good enough?

• Is system A better than system B?

– What does better mean?

– How much better?

• Efficiency? Effectiveness? Ease?

10

Page 15: Evaluation in Audio Music Similarity

Measure user experience

• We are interested in user-measures

– Time to complete the task, idle time, success rate, failure rate, frustration, ease of learning, ease of use …

– Their distributions fully describe the user experience

• User satisfaction is the bigger picture

– How likely is it that an arbitrary user, with an arbitrary query (and with an arbitrary document collection) will be satisfied by the system?

• This is the ultimate goal: the good, the better

11

Page 16: Evaluation in Audio Music Similarity

The Cranfield Paradigm

• Estimate user-measure distributions

– Sample documents, queries and users

– Monitor user experience and behavior

– Representativeness, cost, ethics, privacy …

• Fix samples to allow reproducibility

– But cannot fix users and their behavior

– Remove users, but include a static user component, fixed across experiments: ground truth judgments

– Still need to include the dynamics of the process: user models behind effectiveness measures and scales

12

Page 17: Evaluation in Audio Music Similarity

Test collections

• Our goal is the users: user-measure = f(system)

• Cranfield measures systems: system-effectiveness = f(system, measure, scale)

• Effectiveness scores are estimators of the distributions of user-measures

– The only source of variability is the systems themselves

– Reproducibility becomes easy

– Experiments are inexpensive (collections are not)

– Research becomes systematic

13

Page 18: Evaluation in Audio Music Similarity

Validity, Reliability and Efficiency

• Validity: are we measuring what we want to?

– How well are effectiveness and satisfaction correlated?

– How good is good and how better is better?

• Reliability: how repeatable are the results?

– How large do samples have to be?

– What statistical methods should be used?

• Efficiency: how inexpensive is it to get valid and reliable results?

– Can we estimate results with fewer judgments?

14

Page 19: Evaluation in Audio Music Similarity

Goal of this dissertation

Study and improve the validity, reliability and efficiency

of the methods used to evaluate AMS systems

Additionally, improve meta-evaluation methods

15

Page 20: Evaluation in Audio Music Similarity

Outline

• Introduction

– Scope

– The Cranfield Paradigm

• Validity

• Reliability

• Efficiency

• Conclusions and Future Work

16

Page 21: Evaluation in Audio Music Similarity

Outline

• Introduction

• Validity

– System Effectiveness and User Satisfaction

– Modeling Distributions

• Reliability

• Efficiency

• Conclusions and Future Work

17

Page 22: Evaluation in Audio Music Similarity

Assumption of Cranfield

• Systems with better effectiveness are perceived by users as more useful, more satisfactory

• But different effectiveness measures and relevance scales produce different distributions

– Which one is better to predict user satisfaction?

• Map system effectiveness onto user satisfaction, experimentally

– If P@10 = 0.2, how likely is it that an arbitrary user will find the results satisfactory?

– What if DCG@20 = 0.46?

18

Page 23: Evaluation in Audio Music Similarity

Measures and scales

Measure Original Artificial Graded Artificial Binary

Broad Fine nℒ=3 nℒ=4 nℒ=5 ℓmin=20 ℓmin=40 ℓmin=60 ℓmin=80

P@5 X X X X

AP@5 X X X X

RR@5 X X X X

CGl@5 X X X X X P@5 P@5 P@5 P@5

CGe@5 X X X X P@5 P@5 P@5 P@5

DCGl@5 X X X X X X X X X

DCGe@5 X X X X DCGl@5 DCGl@5 DCGl@5 DCGl@5

EDCGl@5 X X X X X X X X X

EDCGe@5 X X X X EDCGl@5 EDCGl@5 EDCGl@5 EDCGl@5

Ql@5 X X X X X AP@5 AP@5 AP@5 AP@5

Qe@5 X X X X AP@5 AP@5 AP@5 AP@5

RBPl@5 X X X X X X X X X

RBPe@5 X X X X RBPl@5 RBPl@5 RBPl@5 RBPl@5

ERRl@5 X X X X X X X X X

ERRe@5 X X X X ERRl@5 ERRl@5 ERRl@5 ERRl@5

GAP@5 X X X X X AP@5 AP@5 AP@5 AP@5

ADR@5 X X X X X X X X

19
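
To make the table above concrete, here is a minimal Python sketch of a few of the @5 measures and scale transformations it lists. The log2 discount for DCG, the RBP persistence p = 0.8 and the example judgments are illustrative assumptions, not necessarily the exact definitions used in the dissertation.

import math

def gain(level, l_max, kind="linear"):
    if kind == "linear":
        return level / l_max
    return (2 ** level - 1) / (2 ** l_max - 1)      # exponential gain

def cg(levels, l_max, kind, k=5):                   # CG@k
    return sum(gain(l, l_max, kind) for l in levels[:k])

def dcg(levels, l_max, kind, k=5):                  # DCG@k with log2(rank+1) discount
    return sum(gain(l, l_max, kind) / math.log2(i + 2)
               for i, l in enumerate(levels[:k]))

def rbp(levels, l_max, kind, k=5, p=0.8):           # RBP@k with persistence p
    return (1 - p) * sum(gain(l, l_max, kind) * p ** i
                         for i, l in enumerate(levels[:k]))

def binarize(levels, l_min):                        # artificial binary scale
    return [1 if l >= l_min else 0 for l in levels]

broad = [2, 1, 2, 0, 1]         # Broad judgments (0-2) for a system's top 5 results
fine = [85, 55, 90, 10, 40]     # Fine judgments (0-100) for the same documents

print(cg(broad, 2, "linear"), dcg(broad, 2, "linear"), rbp(broad, 2, "linear"))
print(cg(broad, 2, "exponential"))                  # CGe@5 on the Broad scale
print(dcg(binarize(fine, 60), 1, "linear"))         # DCGl@5 on a binary scale with ℓmin=60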

Page 35: Evaluation in Audio Music Similarity

Experimental design

20

Page 36: Evaluation in Audio Music Similarity

What can we infer?

• Preference: difference noticed by user

– Positive: user agrees with evaluation

– Negative: user disagrees with evaluation

• Non-preference: difference not noticed by user

– Good: both systems are satisfactory

– Bad: neither system is satisfactory

21

Page 37: Evaluation in Audio Music Similarity

Data

• Queries, documents and judgments from MIREX

• 4115 unique, artificially constructed examples

• 432 unique queries, 5636 unique documents

• Answers collected via Crowdsourcing

– Quality control with trap questions

• 113 unique subjects

22

Page 38: Evaluation in Audio Music Similarity

Single system: how good is it?

• For 2045 examples (49%) users could not decide which system was better

What do we expect?

23

Page 39: Evaluation in Audio Music Similarity

Single system: how good is it?

• For 2045 examples (49%) users could not decide which system was better

23

Page 40: Evaluation in Audio Music Similarity

Single system: how good is it?

• Large ℓmin thresholds underestimate satisfaction

24

Page 41: Evaluation in Audio Music Similarity

Single system: how good is it?

• Users don’t pay attention to ranking?

25

Page 42: Evaluation in Audio Music Similarity

Single system: how good is it?

• Exponential gain underestimates satisfaction

26

Page 43: Evaluation in Audio Music Similarity

Single system: how good is it?

• Document utility independent of others

27

Page 44: Evaluation in Audio Music Similarity

Two systems: which one is better?

• For 2090 examples (51%) users did prefer one system over the other one

What do we expect?

28

Page 45: Evaluation in Audio Music Similarity

Two systems: which one is better?

• For 2090 examples (51%) users did prefer one system over the other one

28

Page 46: Evaluation in Audio Music Similarity

Two systems: which one is better?

• Large differences are needed for users to notice them

29

Page 47: Evaluation in Audio Music Similarity

Two systems: which one is better?

• More relevance levels discriminate better between systems

30

Page 48: Evaluation in Audio Music Similarity

Two systems: which one is better?

• Cascade and navigational user models are not appropriate

31

Page 49: Evaluation in Audio Music Similarity

Two systems: which one is better?

• Users do prefer the (supposedly) worse system

32

Page 50: Evaluation in Audio Music Similarity

Summary

• Effectiveness and satisfaction are clearly correlated

– But there is a bias of 20% because of user disagreement

– Room for improvement through personalization

• Magnitude of differences does matter

– Just looking at rankings is very naive

• Be careful with statistical significance

– Need Δλ ≈ 0.4 for users to agree with effectiveness; historically, only 20% of the time in MIREX

• Differences among measures and scales

– Linear gain slightly better than exponential gain

– Informational and positional user models better than navigational and cascade

– The more relevance levels, the better

33

Page 51: Evaluation in Audio Music Similarity

Measures and scales

Measure Original Artificial Graded Artificial Binary

Broad Fine nℒ=3 nℒ=4 nℒ=5 ℓmin=20 ℓmin=40 ℓmin=60 ℓmin=80

P@5 X X X X

AP@5 X X X X

RR@5 X X X X

CGl@5 X X X X X P@5 P@5 P@5 P@5

CGe@5 X X X X P@5 P@5 P@5 P@5

DCGl@5 X X X X X X X X X

DCGe@5 X X X X DCGl@5 DCGl@5 DCGl@5 DCGl@5

EDCGl@5 X X X X X X X X X

EDCGe@5 X X X X EDCGl@5 EDCGl@5 EDCGl@5 EDCGl@5

Ql@5 X X X X X AP@5 AP@5 AP@5 AP@5

Qe@5 X X X X AP@5 AP@5 AP@5 AP@5

RBPl@5 X X X X X X X X X

RBPe@5 X X X X RBPl@5 RBPl@5 RBPl@5 RBPl@5

ERRl@5 X X X X X X X X X

ERRe@5 X X X X ERRl@5 ERRl@5 ERRl@5 ERRl@5

GAP@5 X X X X X AP@5 AP@5 AP@5 AP@5

ADR@5 X X X X X X X X

34

Page 53: Evaluation in Audio Music Similarity

Outline

• Introduction

• Validity

– System Effectiveness and User Satisfaction

– Modeling Distributions

• Reliability

• Efficiency

• Conclusions and Future Work

36

Page 55: Evaluation in Audio Music Similarity

Evaluate in terms of user satisfaction

• So far, arbitrary users for a single query

– P(Sat | Ql@5 = 0.61) = 0.7

• Easily extended to n users and a single query

– P(Sat15 = 10 | Ql@5 = 0.61) = 0.21 (worked example below)

• What about a sample of queries 𝒬?

– Map each query separately to obtain the distribution of P(Sat)

– For easier mappings, P(Sat | λ) functions are interpolated with simple polynomials

38
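
A quick worked check of the binomial step above, assuming users are independent and each is satisfied with probability 0.7:

from math import comb

p_sat, n, k = 0.7, 15, 10
print(round(comb(n, k) * p_sat ** k * (1 - p_sat) ** (n - k), 2))   # 0.21, as on the slide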

Page 56: Evaluation in Audio Music Similarity

Expected probability of satisfaction

• Now we can compute point and interval estimates of the expected probability of satisfaction

• Intuition fails when interpreting effectiveness

39

Page 57: Evaluation in Audio Music Similarity

System success

• If P(Sat) ≥ threshold the system is successful

– Setting the threshold was rather arbitrary

– Now it is meaningful, in terms of user satisfaction

• Intuitively, we want the majority of users to find the system satisfactory

– P(Succ) = P(P(Sat) > 0.5) = 1 − FP(Sat)(0.5), sketched below

• Improving queries for which we do poorly is more worthwhile than further improving those for which we are already good

40
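
A minimal sketch of P(Succ) = 1 − FP(Sat)(0.5) using the empirical distribution over per-query satisfaction probabilities; the values below are made up for illustration only.

import numpy as np

p_sat_per_query = np.array([0.42, 0.58, 0.71, 0.35, 0.66, 0.80, 0.49, 0.62])
p_succ = np.mean(p_sat_per_query > 0.5)   # 1 - ecdf evaluated at 0.5
print(p_succ)                             # fraction of queries where a majority of users is satisfied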

Page 58: Evaluation in Audio Music Similarity

Distribution of P(Sat)

• Need to estimate the cumulative distribution function of user satisfaction: FP(Sat)

• Not described by a typical distribution family

– ecdf converges, but what is a good sample size?

– Compare with Normal, Truncated Normal and Beta

• Compared on >2M random samples from MIREX collections, at different query set sizes

• Goodness of fit measured with the Cramér-von Mises ω² statistic

41
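
A sketch of this goodness-of-fit comparison: fit Normal and Beta distributions to a sample of per-query P(Sat) values and compare their Cramér-von Mises ω² statistics with scipy. The sample below is synthetic, not the MIREX data, and the Truncated Normal case is omitted.

import numpy as np
from scipy import stats

p_sat = np.clip(np.random.default_rng(0).beta(4, 2, size=50), 1e-6, 1 - 1e-6)

mu, sigma = stats.norm.fit(p_sat)
a, b, loc, scale = stats.beta.fit(p_sat, floc=0, fscale=1)

print("Normal:", stats.cramervonmises(p_sat, "norm", args=(mu, sigma)).statistic)
print("Beta:  ", stats.cramervonmises(p_sat, "beta", args=(a, b, loc, scale)).statistic)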

Page 59: Evaluation in Audio Music Similarity

Estimated distribution of P(Sat)

• More than ≈25 queries in the collection

– ecdf approximates better

• Less than ≈25 queries in the collection

– Normal for graded scales, ecdf for binary scales

• Beta is always the best with the Fine scale

• The more levels in the relevance scale, the better

• Linear gain better than exponential gain

42

Page 60: Evaluation in Audio Music Similarity

Intuition fails, again

• Intuitive conclusions based on effectiveness alone contradict those based on user satisfaction

– E[Δλ] = −0.002

– E[ΔP(Sat)] = 0.001

– E[ΔP(Succ)] = 0.07

43

Page 63: Evaluation in Audio Music Similarity

Historically, in MIREX

• Systems are not as satisfactory as we thought

• But they are more successful

– Good (or bad) for some kinds of queries

44

Page 64: Evaluation in Audio Music Similarity

Measures and scales

Measure Original Artificial Graded Artificial Binary

Broad Fine nℒ=4 nℒ=5 ℓmin=20 ℓmin=40

P@5 X X

AP@5 X X

CGl@5 X X X X P@5 P@5

CGe@5 X X X P@5 P@5

DCGl@5 X X X X X X

DCGe@5 X X X DCGl@5 DCGl@5

Ql@5 X X X X AP@5 AP@5

Qe@5 X X X AP@5 AP@5

RBPl@5 X X X X X X

RBPe@5 X X X RBPl@5 RBPl@5

GAP@5 X X X X AP@5 AP@5

45

Page 66: Evaluation in Audio Music Similarity

Outline

• Introduction

• Validity

– System Effectiveness and User Satisfaction

– Modeling Distributions

• Reliability

• Efficiency

• Conclusions and Future Work

47

Page 67: Evaluation in Audio Music Similarity

Outline

• Introduction

• Validity

• Reliability

– Optimality of Statistical Significance Tests

– Test Collection Size

• Efficiency

• Conclusions and Future Work

48

Page 68: Evaluation in Audio Music Similarity

Random error

• Test collections are just samples from larger, possibly infinite, populations

• If we conclude system A is better than B, how confident can we be?

– Δλ𝒬 is just an estimate of the population mean μΔλ

• Usually employ some statistical significance test for differences in location

• If it is statistically significant, we have confidence that the true difference is at least that large

49

Page 69: Evaluation in Audio Music Similarity

Statistical hypothesis testing

• Set two mutually exclusive hypotheses

– H0: μΔλ = 0

– H1: μΔλ ≠ 0

• Run the test and obtain the p-value = P(Δλ ≥ Δλ𝒬 | H0)

– p ≤ α: statistically significant, high confidence

– p > α: statistically non-significant, low confidence

• Possible errors in the binary decision

– Type I: incorrectly reject H0

– Type II: incorrectly accept H0

50

Page 70: Evaluation in Audio Music Similarity

Statistical significance tests

• (Non-)parametric tests

– t-test, Wilcoxon test, Sign test

• Based on resampling

– Bootstrap test, permutation/randomization test

• They make certain assumptions about distributions and sampling methods

– Often violated in IR evaluation experiments

– Which test behaves better, in practice, knowing that assumptions are violated?

51
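
As an illustration of the kind of test being compared here, a minimal paired permutation (sign-flipping) test on per-query effectiveness differences; the Δλ values are placeholders, and the bootstrap and t-test variants operate on the same per-query differences.

import numpy as np

def permutation_test(delta, n_perm=10000, seed=0):
    rng = np.random.default_rng(seed)
    observed = abs(delta.mean())
    signs = rng.choice([-1, 1], size=(n_perm, delta.size))   # random sign flips per query
    perm_means = np.abs((signs * delta).mean(axis=1))
    return (perm_means >= observed).mean()                   # two-sided p-value

delta = np.array([0.05, -0.02, 0.10, 0.03, 0.00, 0.07, -0.01, 0.04])  # per-query Δλ for A vs. B
print(permutation_test(delta))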

Page 71: Evaluation in Audio Music Similarity

Optimality criteria

• Power

– Achieve significance as often as possible (low Type II)

– Usually increases Type I error rates

• Safety

– Minimize Type I error rates

– Usually decreases power

• Exactness

– Maintain Type I error rate at α level

– Permutation test is theoretically exact

52

Page 72: Evaluation in Audio Music Similarity

Experimental design

• Randomly split query set in two

• Evaluate all systems with both subsets

– Simulating two different test collections

• Compare p-values with both subsets

– How well do statistical tests agree with themselves?

– At different α levels

• All systems and queries from MIREX 2007-2011

– >15M p-values

53
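
A sketch of the split-half design, assuming a paired t-test as the significance test; the per-query scores of the two systems are random placeholders rather than MIREX runs.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
scores_a, scores_b = rng.random(100), rng.random(100)   # per-query scores of systems A and B

perm = rng.permutation(100)                              # random split of the query set
half1, half2 = perm[:50], perm[50:]
p1 = stats.ttest_rel(scores_a[half1], scores_b[half1]).pvalue
p2 = stats.ttest_rel(scores_a[half2], scores_b[half2]).pvalue

alpha = 0.05
print("agree" if (p1 <= alpha) == (p2 <= alpha) else "conflict")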

Page 73: Evaluation in Audio Music Similarity

Power and success

• Bootstrap test is the most powerful

• Wilcoxon, bootstrap and permutation are the most successful, depending on α level

54

Page 74: Evaluation in Audio Music Similarity

Conflicts

• Wilcoxon and t-test are the safest at low α levels

• Wilcoxon is the most exact at low α levels, but bootstrap is for usual levels

55

Page 75: Evaluation in Audio Music Similarity

Optimal measure and scale

• Power: CGl@5, GAP@5, DCGl@5 and RBPl@5

• Success: CGl@5, GAP@5, DCGl@5 and RBPl@5

• Conflicts: very similar across measures

• Power: Fine, Broad and binary

• Success: Fine, Broad and binary

• Conflicts: very similar across scales

56

Page 76: Evaluation in Audio Music Similarity

Outline

• Introduction

• Validity

• Reliability

– Optimality of Statistical Significance Tests

– Test Collection Size

• Efficiency

• Conclusions and Future Work

57

Page 78: Evaluation in Audio Music Similarity

Acceptable sample size

• Reliability is higher with larger sample sizes

– But it is also more expensive

– What is an acceptable test collection size?

• Answer with Generalizability Theory

– G-Study: estimate variance components

– D-Study: estimate reliability of different sample sizes and experimental designs

59

Page 79: Evaluation in Audio Music Similarity

G-study: variance components

• Fully crossed experimental design: s × q

λq,A = λ + λA + λq + εq,A

σ² = σs² + σq² + σsq²

60

Page 90: Evaluation in Audio Music Similarity

G-study: variance components

• Fully crossed experimental design: s × q

λq,A = λ + λA + λq + εq,A

σ² = σs² + σq² + σsq²

• Estimated with Analysis of Variance

• If σs² is small or σq² is large, we need more queries

60
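
A sketch of the G-study step, assuming the usual random-effects ANOVA estimators for a fully crossed systems × queries design; the effectiveness matrix below is a random placeholder.

import numpy as np

Y = np.random.default_rng(2).random((10, 50))     # n_s systems x n_q queries
n_s, n_q = Y.shape
grand = Y.mean()
ms_s = n_q * np.sum((Y.mean(axis=1) - grand) ** 2) / (n_s - 1)
ms_q = n_s * np.sum((Y.mean(axis=0) - grand) ** 2) / (n_q - 1)
resid = Y - Y.mean(axis=1, keepdims=True) - Y.mean(axis=0, keepdims=True) + grand
ms_e = np.sum(resid ** 2) / ((n_s - 1) * (n_q - 1))

var_e = ms_e                                      # residual/interaction component
var_s = max((ms_s - ms_e) / n_q, 0)               # system component (sigma^2_s)
var_q = max((ms_q - ms_e) / n_s, 0)               # query component (sigma^2_q)
print(var_s, var_q, var_e)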

Page 91: Evaluation in Audio Music Similarity

D-study: variance ratios

• Stability of absolute scores

Φ(nq) = σs² / (σs² + (σq² + σe²)/nq)

• Stability of relative scores

Eρ²(nq) = σs² / (σs² + σe²/nq)

• We can easily estimate how many queries are needed to reach some level of stability (reliability)

61
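
A sketch of the D-study step: plug (hypothetical) variance component estimates into Φ(nq) and Eρ²(nq) and search for the query set size that reaches a target stability of 0.95.

def phi(var_s, var_q, var_e, n_q):        # stability of absolute scores
    return var_s / (var_s + (var_q + var_e) / n_q)

def erho2(var_s, var_e, n_q):             # stability of relative scores
    return var_s / (var_s + var_e / n_q)

var_s, var_q, var_e = 0.004, 0.020, 0.015 # hypothetical G-study estimates
n_q = 1
while phi(var_s, var_q, var_e, n_q) < 0.95:
    n_q += 1
print(n_q, phi(var_s, var_q, var_e, n_q), erho2(var_s, var_e, n_q))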

Page 93: Evaluation in Audio Music Similarity

Effect of query set size

• Average absolute stability Φ = 0.97

• ≈65 queries needed for Φ = 0.95, ≈100 in worst cases

• Fine scale slightly better than Broad and binary scales

• RBPl@5 and nDCGl@5 are the most stable

62

Page 94: Evaluation in Audio Music Similarity

Effect of query set size

• Average relative stability Eρ² = 0.98

• ≈35 queries needed for Eρ² = 0.95, ≈60 in worst cases

• Fine scale better than Broad and binary scales

• CGl@5 and RBPl@5 are the most stable

63

Page 95: Evaluation in Audio Music Similarity

Effect of cutoff k

• What if we use a deeper cutoff, k=10?

– From 100 queries and k=5 to 50 queries and k=10

– Should still have stable scores

– Judging effort should decrease

– Rank-based measures should become more stable

• Tested in MIREX 2012

– Apparently in 2013 too

64

Page 96: Evaluation in Audio Music Similarity

Effect of cutoff k

• Judging effort reduced to 72% of the usual

• Generally stable

– From Φ = 0.81 to Φ = 0.83

– From Eρ² = 0.93 to Eρ² = 0.95

65

Page 97: Evaluation in Audio Music Similarity

Effect of cutoff k

• Reliability given a fixed budget for judging?

– k=10 allows us to use fewer queries, about 70%

– Slightly reduced relative stability

66

Page 98: Evaluation in Audio Music Similarity

Effect of assessor set size

• More assessors or simply more queries?

– Judging effort is multiplied

• Can be studied with MIREX 2006 data

– 3 different assessors per query

– Nested experimental design: s × (h:q)

67

Page 99: Evaluation in Audio Music Similarity

Effect of assessor set size

• Broad scale: σs² ≈ σh:q²

• Fine scale: σs² ≫ σh:q²

• Always better to spend resources on queries

68

Page 100: Evaluation in Audio Music Similarity

Summary

• MIREX collections generally larger than necessary

• For fixed budget

– More queries better than more assessors

– More queries slightly better than deeper cutoff

• Worth studying an alternative user model?

• Employ G-Theory while building the collection

• Fine better than Broad, better than binary

• CGl@5 and DCGl@5 best for relative stability

• RBPl@5 and nDCGl@5 best for absolute stability

69

Page 101: Evaluation in Audio Music Similarity

Outline

• Introduction

• Validity

• Reliability

– Optimality of Statistical Significance Tests

– Test Collection Size

• Efficiency

• Conclusions and Future Work

70

Page 102: Evaluation in Audio Music Similarity

Outline

• Introduction

• Validity

• Reliability

• Efficiency

– Learning Relevance Distributions

– Low-cost Evaluation

• Conclusions and Future Work

71

Page 103: Evaluation in Audio Music Similarity

Probabilistic evaluation

• The MIREX setting is still expensive

– Need to judge all top k documents from all systems

– Takes days, even weeks sometimes

• Model relevance probabilistically

• Relevance judgments are random variables over the space of possible assignments of relevance

• Effectiveness measures are also probabilistic

72

Page 104: Evaluation in Audio Music Similarity

Probabilistic evaluation

• Accuracy increases as we make judgments

– E[Rd] ← rd

• Reliability (confidence) increases too

– Var[Rd] ← 0

• Iteratively estimate relevance and effectiveness

– If confidence is low, make judgments

– If confidence is high, stop

• Judge as few documents as possible

73

Page 105: Evaluation in Audio Music Similarity

Learning distributions of relevance

• Uniform distribution is very uninformative

• Historical distribution in MIREX has high variance

• Estimate P(Rd = ℓ | θd) from a set of features

– For each document separately

– Ordinal Logistic Regression (sketched below)

• Three sets of features

– Output-based, can always be used

– Judgment-based, to exploit known judgments

– Audio-based, to exploit musical similarity

74
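
A sketch of learning P(Rd = ℓ | θd) with ordinal logistic regression, using statsmodels' OrderedModel as a stand-in for the dissertation's implementation; the two features (an output-overlap score and a same-artist flag) and all data are illustrative only.

import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(3)
X = pd.DataFrame({"overlap": rng.random(200),        # agreement among systems' outputs
                  "same_artist": rng.integers(0, 2, 200)})
y = pd.Series(pd.Categorical(rng.integers(0, 3, 200),
                             categories=[0, 1, 2], ordered=True))  # Broad-like labels

res = OrderedModel(y, X, distr="logit").fit(method="bfgs", disp=False)
probs = res.predict(X)                                # P(Rd = ℓ) for each level ℓ
print(probs[:3])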

Page 106: Evaluation in Audio Music Similarity

Learned models

• Mout : can be used even without judgments

– Similarity between systems’ outputs

– Genre and artist metadata

• Genre is highly correlated with similarity

– Decent fit, R² ≈ 0.35

• Mjud : can be used when there are judgments

– Similarity between systems’ outputs

– Known relevance of same system and same artist

• Artist is extremely correlated with similarity

– Excellent fit, R² ≈ 0.91

75

Page 107: Evaluation in Audio Music Similarity

Estimation errors

• Actual vs. predicted by Mout

– 0.36 with Broad and 0.34 with Fine

• Actual vs. predicted by Mjud

– 0.14 with Broad and 0.09 with Fine

• Among assessors in MIREX 2006

– 0.39 with Broad and 0.31 with Fine

• Negligible under the current MIREX setting

76

Page 108: Evaluation in Audio Music Similarity

Outline

• Introduction

• Validity

• Reliability

• Efficiency

– Learning Relevance Distributions

– Low-cost Evaluation

• Conclusions and Future Work

77

Page 110: Evaluation in Audio Music Similarity

Probabilistic effectiveness measures

• Effectiveness scores are also random variables

• Different approaches to compute estimates

– Deal with dependence of random variables

– Different definitions of confidence

• For measures based on ideal ranking (nDCGl@k and RBPl@k) we do not have a closed form

– Approximated with Delta method and Taylor series

79
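
A minimal sketch of treating an effectiveness score as a random variable: assuming independent per-document relevance distributions, the expectation and variance of CGl@5 follow directly from linearity; the probabilities below are placeholders.

import numpy as np

levels = np.array([0.0, 0.5, 1.0])            # linear gains of a 3-level scale
P = np.array([[0.1, 0.3, 0.6],                # P(Rd = l) for the 5 documents a system returned
              [0.2, 0.5, 0.3],
              [0.6, 0.3, 0.1],
              [0.3, 0.4, 0.3],
              [0.5, 0.4, 0.1]])

e_gain = P @ levels                           # E[g(Rd)] per document
var_gain = P @ levels ** 2 - e_gain ** 2      # Var[g(Rd)] per document
print("E[CG@5] =", e_gain.sum(), " Var[CG@5] =", var_gain.sum())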

Page 111: Evaluation in Audio Music Similarity

Ranking without judgments

1. Estimate relevance with Mout

2. Estimate relative differences and rank systems

• Average confidence in the rankings is 94%

• Average accuracy of the ranking is 92%

80

Page 112: Evaluation in Audio Music Similarity

Ranking without judgments

• Can we trust individual estimates?

– Ideally, we want X% accuracy when X% confidence

– Confidence slightly overestimated in [0.9, 0.99)

81

DCGl@5

Confidence bin   Broad: in bin   Broad: accuracy   Fine: in bin   Fine: accuracy
[0.5, 0.6)       23 (6.5%)       0.826             22 (6.2%)      0.636
[0.6, 0.7)       14 (4%)         0.786             16 (4.5%)      0.812
[0.7, 0.8)       14 (4%)         0.571             11 (3.1%)      0.364
[0.8, 0.9)       22 (6.2%)       0.864             21 (6%)        0.762
[0.9, 0.95)      23 (6.5%)       0.87              19 (5.4%)      0.895
[0.95, 0.99)     24 (6.8%)       0.917             27 (7.7%)      0.926
[0.99, 1)        232 (65.9%)     0.996             236 (67%)      0.996
E[Accuracy]                      0.938                            0.921

Page 113: Evaluation in Audio Music Similarity

Relative estimates with judgments

1. Estimate relevance with Mout

2. Estimate relative differences and rank systems

3. While confidence is low (<95%), repeat (loop sketched below):

1. Select a document and judge it

2. Update relevance estimates with Mjud when possible

3. Update estimates of differences and rank systems

• What documents should we judge?

– Those that are the most informative

– Measure-dependent

82
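
A schematic sketch of the judge-until-confident loop in steps 1-3 above; confidence(), most_informative(), judge() and update() are placeholders for the measure-dependent machinery, not actual implementations.

def incremental_evaluation(pool, estimate, confidence, most_informative,
                           judge, update, target=0.95):
    """Judge documents only while the current estimate is not confident enough."""
    while confidence(estimate) < target and pool:
        doc = most_informative(pool, estimate)        # pick the most informative document
        pool.remove(doc)
        estimate = update(estimate, doc, judge(doc))  # re-estimate with the new judgment
    return estimate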

Page 114: Evaluation in Audio Music Similarity

Relative estimates with judgments

• Judging effort dramatically reduced

– 1.3% with CGl@5, 9.7% with RBPl@5

• Average accuracy still 92%, but improved individually

– 74% of estimates with >99% confidence, 99.9% accurate

– Expected accuracy improves slightly from 0.927 to 0.931

83

Page 115: Evaluation in Audio Music Similarity

Absolute estimates with judgments

1. Estimate relevance with Mout

2. Estimate absolute effectiveness scores

3. While confidence is low (expected error > ±0.05):

1. Select a document and judge it

2. Update relevance estimates with Mjud when possible

3. Update estimates of absolute effectiveness scores

• What documents should we judge?

– Those that reduce variance the most

– Measure-dependent

84

Page 116: Evaluation in Audio Music Similarity

Absolute estimates with judgments

• The stopping condition is overly confident

– Virtually no judgments are even needed (supposedly)

• But effectiveness is highly overestimated

– Especially with nDCGl@5 and RBPl@5

– Mjud, and especially Mout, tend to overestimate relevance

85

Page 117: Evaluation in Audio Music Similarity

Absolute estimates with judgments

• Practical fix: correct variance

• Estimates are better, but at the cost of judging

– Need between 15% and 35% of judgments

86

Page 118: Evaluation in Audio Music Similarity

Summary

• Estimate ranking of systems with no judgments

– 92% accuracy on average, trustworthy individually

– Statistically significant differences are always correct

• If we want more confidence, judge documents

– As few as 2% needed to reach 95% confidence

– 74% of estimates have >99% confidence and accuracy

• Estimate absolute scores, judging as necessary

– Around 25% needed to ensure error <0.05

87

Page 119: Evaluation in Audio Music Similarity

Outline

• Introduction

• Validity

• Reliability

• Efficiency

– Learning Relevance Distributions

– Low-cost Evaluation

• Conclusions and Future Work

88

Page 120: Evaluation in Audio Music Similarity

Outline

• Introduction

• Validity

• Reliability

• Efficiency

• Conclusions and Future Work

– Conclusions

– Future Work

89

Page 121: Evaluation in Audio Music Similarity

Validity

• Cranfield tells us about systems, not about users

• Provide empirical mapping from system effectiveness onto user satisfaction

• Room for personalization quantified at 20%

• Large differences are needed for users to notice them

• Consider full distributions, not just averages

• Conclusions based on effectiveness tend to contradict conclusions based on user satisfaction

90

Page 122: Evaluation in Audio Music Similarity

Reliability

• Different significance tests for different needs

– Bootstrap test is the most powerful

– Wilcoxon and t-test are the safest

– Wilcoxon and bootstrap test are the most exact

• Practical interpretation of p-values

• MIREX collections generally larger than needed

• Spend resources on queries, not on assessors

• User models with deeper cutoffs are feasible

• Employ G-Theory while building collections

91

Page 123: Evaluation in Audio Music Similarity

Efficiency

• Probabilistic evaluation reduces cost, dramatically

• Two models to estimate document relevance

• System rankings 92% accurate without judgments

• 2% of judgments to reach 95% confidence

• 25% of judgments to reduce error to 0.05

92

Page 124: Evaluation in Audio Music Similarity

Measures and scales

• Best measure and scale depends on situation

• But generally speaking

– CGl@5, DCGl@5 and RBPl@5

– Fine scale

– Model distributions as Beta

93

Page 125: Evaluation in Audio Music Similarity

Outline

• Introduction

• Validity

• Reliability

• Efficiency

• Conclusions and Future Work

– Conclusions

– Future Work

94

Page 127: Evaluation in Audio Music Similarity

Validity

• User studies to understand user behavior

• What information to include in test collections

• Other forms of relevance judgment to better capture document utility

• Explicitly define judging guidelines

• Similar mapping for Text IR

96

Page 128: Evaluation in Audio Music Similarity

Reliability

• Corrections for Multiple Comparisons

• Methods to reliably estimate reliability while building test collections

97

Page 129: Evaluation in Audio Music Similarity

Efficiency

• Better models to estimate document relevance

• Correct variance when only a few relevance judgments are available

• Estimate relevance beyond k=5

• Other stopping conditions and document weights

98

Page 130: Evaluation in Audio Music Similarity

Conduct similar studies

for the wealth of tasks in

Music Information Retrieval

99

Page 131: Evaluation in Audio Music Similarity

Evaluation in Audio Music Similarity

PhD dissertation

by

Julián Urbano

Leganés, October 3rd, 2013

Picture by Javier García

