
Evaluation methods

Date post: 24-Feb-2016
Upload: wilbur

How do we judge speech technology components and applications?

Why should we talk about evaluation?
• It is, or should be, a central part of most, if not all, aspects of speech technology
• The higher grades (A, B; as tested in the home exam assignments and the project) require a measure of evaluation

What is evaluation?
• "The making of a judgment about the amount, number, or value of something" (Google)
• "The systematic determination of a subject's merit, worth and significance, using criteria governed by a set of standards" (Wikipedia)

What is evaluation?
"The systematic determination of a subject's merit, worth and significance, using criteria governed by a set of standards"
• What does this mean?
  • The method can be formalized, described in detail
• Why is this important?
  • So that evaluations can be repeated,
  • because we want to compare different systems,
  • and verify evaluation results

It can be formalized. So that it can be repeated. We want to compare. And we want to verify.

What is evaluation?
"The systematic determination of a subject's merit, worth and significance, using criteria governed by a set of standards"
• (Google had "value" instead)
• What does this mean?
  • We will return to this

Context dependent. These questions can be rephrased: for whom, to whom, when?

What is evaluation?
"The systematic determination of a subject's merit, worth and significance, using criteria governed by a set of standards"
• What are the criteria?
  • We will come back to this, too...
• Who decides on the standards?
  • Governments
  • Organizations (e.g. ISO)
  • Industry groups
  • Research groups

What if there is no standard?
• By the nature of things, there are many more things to evaluate than there are well-developed standards
• Not necessarily advisable to use a mismatched standard
• Fallback: systematic, formalized method

Why evaluate?
• Wrong question. Start with: for whom do we evaluate?
  • Researchers
  • Developers
  • Producers
  • Buyers
  • Consumer organizations
  • Special interest groups
• For whose benefit?

So now: why evaluate?
What do the groups we mentioned want from an evaluation?
• Researchers
  • Test of hypotheses
• Developers
  • Proof of progress, functionality
• Producers
  • Does the manufacturing work?
  • Is it cheaper?
• Buyers
  • More bang for the buck?
  • Does it meet expectations?
• Consumer organizations
  • Does it meet promises made?
• Special interest groups
  • Does it meet specifications and requirements?

What to evaluate?
In other words, what do merit, worth, significance and value mean?

TMH, Blade Runner, HAL, Minority Report

• It depends.
  • What is the purpose of the evaluation?
  • What is the purpose of the thing being evaluated?

In summary so far
• Objective to a point
  • But be aware of the reason for the evaluation: who wants it, and what do they want to know?
• Standards are great
  • But will not be available for all purposes
  • Squeezing one type of evaluation into another type of standard will produce unpredictable results
• If designing new methods, be very clear with the details in the description
  • Must be possible to repeat

How is evaluation done?
• We'll use speech synthesis evaluation as our example domain
• Here, we focus on evaluations that
  • Test the functionality (with respect to a user)
  • Prove a concept or an idea
  • Compare different varieties
• We largely disregard
  • Efficiency
  • Cost
  • Robustness

User studies: representativeness
• User selection
  • Demographics
• Environment
  • Sound environment
  • General situation
  • Lab environments are rarely representative of the intended usage environment of speech technology
• Stimuli/system
  • Often not possible to test the exact system one is interested in

Synthesis evaluation overview
• Overview used by MTM, the Swedish Agency for Accessible Media in education
• Provides people with print impairments with accessible media
  • Books and papers (games, calendars)
  • Braille and talking books
• Speech synthesis for about 50% of the production of university-level textbooks
• Filibuster
  • In-house developed unit selection system
  • Tora & Folke (Swedish), Brage (Norwegian bokmål), Martin (Danish)

MTM: purposes of evaluation
• Ready for release
• Comparison of voices
• Intelligibility, human-likeness
• Fatigue, habituation

Test methods: Grading tests
• Overall impression (mean opinion score, MOS)
  • Grade the utterance on a scale
• Specific aspects (categorical rating test, CRT)
  • Intelligibility
  • Human-likeness
  • Speed
  • Stress
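
As a rough illustration of how grading-test results are often summarised, the sketch below computes a mean opinion score per voice with an approximate 95% interval. The voice names, the 1-5 scale and the ratings are hypothetical, and the normal-approximation interval (1.96 standard errors) is an assumption that is only a coarse guide for small listener panels.

```python
from math import sqrt
from statistics import mean, stdev


def mos(ratings):
    """Return (mean opinion score, half-width of an approximate 95% interval)."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    score = mean(ratings)
    # Normal approximation: 1.96 standard errors on either side of the mean.
    half_width = 1.96 * stdev(ratings) / sqrt(len(ratings))
    return score, half_width


# Hypothetical ratings on a 1-5 scale, one rating per listener.
ratings_by_voice = {
    "voice_a": [4, 5, 3, 4, 4, 5, 4, 3],
    "voice_b": [3, 2, 4, 3, 3, 2, 3, 4],
}

for voice, ratings in ratings_by_voice.items():
    score, half_width = mos(ratings)
    print(f"{voice}: MOS = {score:.2f} +/- {half_width:.2f}")
```

The same function can be applied per aspect (intelligibility, human-likeness, speed, stress) for a categorical rating test, with one list of ratings per aspect.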

Test methods: Discrimination tests
• Repeat or write down what you heard
• Choose between two or more given words
  • Minimal pairs: bil / pil
• Suitable for diphone synthesis with a small voice database
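
A minimal sketch of how a forced-choice minimal-pair test could be scored, assuming each trial is recorded as the word that was played and the word the listener chose. The trial data and function names are hypothetical.

```python
from collections import Counter


def proportion_correct(trials):
    """trials: list of (played_word, chosen_word) pairs."""
    if not trials:
        raise ValueError("no trials to score")
    correct = sum(1 for played, chosen in trials if played == chosen)
    return correct / len(trials)


def confusions(trials):
    """Count each (played, chosen) pair where the listener chose the wrong word."""
    return Counter((played, chosen) for played, chosen in trials if played != chosen)


# Hypothetical responses for the minimal pair bil / pil.
trials = [("bil", "bil"), ("pil", "bil"), ("pil", "pil"), ("bil", "bil")]

print(f"proportion correct: {proportion_correct(trials):.2f}")
print(f"confusions: {dict(confusions(trials))}")
```

The confusion counts show which direction the errors go (here, "pil" heard as "bil"), which an overall score alone does not reveal.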

Test methods: Preference tests
• Comparison of two or more utterances
• Typically words or short sentences
• Choose which you like the best
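
One common way to check whether a two-way preference result could be due to chance is an exact two-sided sign (binomial) test. The sketch below is an illustration under assumptions not stated in the slides: each listener makes one independent choice between system A and system B, "no preference" answers are left out, and the counts are hypothetical.

```python
from math import comb


def sign_test_p(wins_a, wins_b):
    """Exact two-sided binomial p-value under the hypothesis of no preference."""
    n = wins_a + wins_b
    if n == 0:
        raise ValueError("no preferences recorded")
    k = max(wins_a, wins_b)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical outcome: 15 listeners preferred A, 5 preferred B.
print(f"p = {sign_test_p(15, 5):.4f}")
```

A small p-value suggests the preference is unlikely to be a coin flip, but it says nothing about why listeners preferred one utterance over the other.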

Test methods: Comprehension tests
• Listen to a text and answer questions

Test methods: Comments
• Comment fields
  • The subjects want to explain what is wrong
  • They are almost never right.
  • Time consuming!

Test methods: problems for narrative synthesis testing
• You want to evaluate large texts!
• Grading, discrimination and preference tests
  • Difficult to judge longer texts
  • Evaluation of a very small part of the possible outcome of the unit selection (US) TTS
  • Time consuming
  • You don't know what the subjects liked or disliked
• Comprehension tests
  • Do not measure anything else

Ecological validity
• Representativeness again: ecological validity means that the methods, materials and setting of the study should approximate the real world that is being examined
• Users
  • e.g. students, old people
• Material
  • University-level textbooks or newspapers with synthetic speech
• Situation
  • Reading long texts (in a learning or informational situation)

Audience response system-based tests
• Hollywood: evaluations of pilot episodes and movies
  • Clicking a button when they don't like it
• Voting in TV shows
• Classroom engagement

Audience response system-based test for TTS
• Click when you hear something
  • Unintelligible
  • Irritating
  • You just don't like it
• Longer speech chunks
• Possible to give simple instructions
• Detailed analysis
• Effectiveness
  • 5 listening minutes = 5 evaluated minutes

Results: number of clicks per subject (charts not reproduced in this transcript)
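
A minimal sketch of how click data from such a test could be summarised, assuming each click is logged as a subject identifier and a time offset in seconds from the start of the audio. The log format, the 5-second window and the data are hypothetical, not taken from the study.

```python
from collections import Counter


def clicks_per_subject(clicks):
    """clicks: iterable of (subject_id, time_in_seconds) pairs."""
    return Counter(subject for subject, _ in clicks)


def clicks_per_window(clicks, window_s=5.0):
    """Count clicks in consecutive windows; keys are window start times in seconds."""
    counts = Counter(int(t // window_s) for _, t in clicks)
    return {index * window_s: n for index, n in sorted(counts.items())}


# Hypothetical click log from three listeners.
clicks = [("s1", 2.1), ("s1", 12.4), ("s2", 11.9), ("s3", 13.0), ("s2", 47.5)]

print(dict(clicks_per_subject(clicks)))
print(clicks_per_window(clicks))
```

Windows where several subjects click close together point to passages worth inspecting in detail, while the per-subject counts show how much individual listeners differ in how readily they click.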

Evaluation of conversational systems and conversational synthesis
• Conversations are incremental and continuous
  • No straightforward way of segmenting
• They are produced by all participants in collaboration
• Errors are commonplace, but rarely have an adverse effect
• Strict information transfer is often not the primary goal
• So not much use for methods of evaluation that operate in terms of
  • Efficiency
  • Quality of single utterances
  • Grammaticality
  • Etc.

Other methods
• New methods are being developed for evaluation of complex systems and interactions
• ARS (audience response systems) is one. We'll look at some other examples.
• Analysis of captured interactions
  • Measures of machine-extractable features, e.g. tone, rhythm, interaction flow, durations, movement, gaze
  • Comparison to human-human interactions of the same type
  • The colour experiment is an example of this

3rd-party participant/spectator behaviours
• People watching spoken interaction behave predictably
• Monitoring people watching videos can give insights into their perception of the video
  • E.g. gaze patterns

Thank you!
Questions?

