1
Speech User Interfaces
2
Outline
Review
Motivation for speech UIs
Speech recognition
UI problems with speech UIs
SpeechActs: Guidelines for speech UIs
Speech UI design tools
Multimodal UIs
3
Review
Why do we prototype?
  • get feedback on our design from customers (faster & cheaper)
Why use low-fi prototypes?
  • traditional methods take too long & focus designers & customers on the wrong (visual) issues
What is the Wizard of Oz technique?
  • faking the interaction
What is the advantage of using informal tools like SILK, DENIM, & SUEDE?
  • advantages of electronic medium (editing, reuse, distribution, etc.)
  • faster than traditional UI tools
  • do not focus designers/customers on the wrong issues
  • ability to support testing & analysis of resulting data
4
Motivation for Speech UIs: Pervasive Information Access
[Figure: Information & Services. I-Land vision by Streitz et al.]
5
UIs in the Pervasive Computing Era
Future computing devices won’t have the same UI as current PCs
  • wide range of devices
      • small or embedded in environment
      • often w/ “alternative” I/O & w/o screens
      • information appliances
[Figure: I-Land vision by Streitz et al.]
6
Information Access via Speech
Read my important
7
Industry Leaders
Nuance Corporation
  • Applications: TellMe, …
  • Users: Government, Computers-
Microsoft, IBM,
8
Speech UI Motivation
Smaller devices -> difficult I/O
  • people can talk at ~ 90 wpm -> high speed
“Virtually unlimited” set of commands
Freedom for other body parts
  • imagine you are working on your car & need to know something from the manual
Natural
  • evolutionarily selected for
      • reading, writing, & typing are not (too new)
9
Why are Speech UIs Hard to Get Right?
Speech recognition far from perfect
  • imagine inputting commands w/ the mouse & getting the wrong result 5-20% of the time
Speech UIs have no visible state
  • can’t see what you have done before or what effect your commands have had
Speech UIs are hard to learn
  • how do you explore the interface? how do you find out what you can say?
10
Speech UIs Require
Speech recognition
  • the computer understanding what the customer is saying
Speech production (or synthesis)
  • the computer talking to the customer
11
Speech Recognition
Continuous vs. non-continuous
Speaker independent vs. dependent
Speech often misunderstood by people
  • feedback via speech, facial expressions, & gesture
Recognizers trained with real samples
  • often get gender-based problems
Based on probabilities (HMMs - Bayes)
  • trigrams of sounds or words
Several popular recognizers
  • Nuance, SpeechWorks, IBM ViaVoice
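
The "HMMs - Bayes" and "trigrams" bullets refer to the standard probabilistic view of recognition. A sketch in the usual notation follows; the symbols are assumptions for illustration and do not appear on the slides: A is the acoustic observation sequence and W = w_1 ... w_n is a candidate word sequence.

```latex
\hat{W} = \arg\max_{W} P(W \mid A)
        = \arg\max_{W} \frac{P(A \mid W)\, P(W)}{P(A)}
        = \arg\max_{W} P(A \mid W)\, P(W)

P(W) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-2}, w_{i-1})
```

Here P(A | W) is the acoustic model (typically an HMM), P(W) is the language model, and the second line is the trigram approximation: each word is conditioned only on the two words before it. P(A) drops out because it does not depend on W.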
12
Speech Production
Three frequency regions of great intensity visible on oscilloscope
  • come from larynx, throat, mouth
Two needed for recognition but “tinny”
Can generate emotional affect in speech
  • Demo
      • anger, disgust, gladness, sadness, fear, & surprise
      • http://cahn.www.media.mit.edu/people/cahn/emot-speech.html
13
Recognition Problems
Good recognition
  • humans < 1% error rate on dictation
  • top recognition systems get <1-X% error rates
      • computers don’t use much context
      • key is to be application specific for lower error rates
Background noise
  • even worse recognition rates (20-40% error)
Speed
  • better as hardware gets faster
      • in 10 years gone from 5 high-end workstations required to some speech systems running on laptops or even PDAs
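
The slides do not say how the error rates are measured. A common choice for dictation is word error rate (WER): the word-level edit distance between what was said and what was recognized, divided by the number of words said. A minimal sketch, assuming whitespace tokenization and case-insensitive matching (the function name and example sentences are illustrative only):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far
    # (none yet) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(prev[j] + 1,          # deletion
                          curr[j - 1] + 1,      # insertion
                          prev[j - 1] + cost)   # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of five.
print(word_error_rate("please read my new mail", "please read my new male"))  # 0.2
```

Because insertions count as errors, WER can exceed 100% when the recognizer outputs many more words than were spoken.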
14
More Recognition Problems
Isolated, short words difficult
  • common words become short
Segmentation
  • silly versus sill lea
Spelling
  • mail vs. male -> need to understand language
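
One way a recognizer can "understand language" well enough to spell mail vs. male is to ask which spelling is more probable after the preceding word. The sketch below uses bigrams rather than the trigrams mentioned earlier, to keep it short. The toy corpus, function names, and add-one smoothing are all assumptions made for illustration.

```python
from collections import Counter
from typing import Sequence

# Toy training text; a real language model is estimated from a large corpus.
CORPUS = (
    "please read my mail . "
    "send the mail today . "
    "read my new mail . "
    "the male speaker . "
    "a male voice ."
).split()

BIGRAMS = Counter(zip(CORPUS, CORPUS[1:]))
UNIGRAMS = Counter(CORPUS)
VOCAB_SIZE = len(UNIGRAMS)


def bigram_prob(prev_word: str, word: str) -> float:
    """P(word | prev_word) with add-one smoothing so unseen pairs are not zero."""
    return (BIGRAMS[(prev_word, word)] + 1) / (UNIGRAMS[prev_word] + VOCAB_SIZE)


def pick_homophone(prev_word: str, candidates: Sequence[str]) -> str:
    """Choose the spelling that the preceding word makes most probable."""
    return max(candidates, key=lambda w: bigram_prob(prev_word, w))


print(pick_homophone("my", ["mail", "male"]))  # mail
print(pick_homophone("a", ["mail", "male"]))   # male
```

With so little data many contexts tie (for example after "the"), which is one reason application-specific vocabularies and grammars lower error rates.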
15
Speech UI Problems
Speech UI no-nos
  • modes (no feedback)
      • certain commands only work when in specific states
  • deep hierarchies (aka voice mail hell)
Verbose feedback wastes time/patience
  • only confirm consequential things
  • use meaningful, short cues
Interruption
  • half-duplex communication (i.e., no barge-in support)
Too much speech on the part of customer is tiring
Speech takes up space in working memory
  • can cause problems when problem solving
16
SpeechActs: Guidelines for Speech UIs
Speech interface to computer tools
  • email, calendar, weather, stock quotes
Establish common ground & shared context
  • make sure people know where they are in the conversation
Pacing
  • recog. delays are unnatural, make it clear when this occurs
  • barge-in lets user interrupt like in real conversations
  • tapering of prompts
  • progressive assistance: short error messages at first, longer when user needs more help
  • implicit confirmation: include confirm in next command
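
Progressive assistance and implicit confirmation from the list above can be sketched in a few lines. This is not the SpeechActs implementation; the prompt wording, the calendar scenario, and the function names are hypothetical.

```python
# Hypothetical error prompts for a calendar query, shortest first.
HELP_LEVELS = [
    "Sorry?",
    "Sorry, which day?",
    "I didn't understand. Say a day, for example 'Monday' or 'tomorrow'.",
]


def error_prompt(consecutive_errors: int) -> str:
    """Return a short prompt first and longer ones as errors accumulate."""
    if consecutive_errors < 1:
        raise ValueError("call this only after at least one recognition error")
    level = min(consecutive_errors, len(HELP_LEVELS)) - 1
    return HELP_LEVELS[level]


def implicit_confirmation(recognized_day: str, answer: str) -> str:
    """Fold what was recognized into the reply instead of asking 'Did you say ...?'."""
    return f"For {recognized_day}: {answer}"


print(error_prompt(1))
print(error_prompt(3))
print(implicit_confirmation("Monday", "no meetings."))
```

The user hears what the system understood ("For Monday") without a separate yes/no turn, and can correct it with the next command if it is wrong.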
SpeechActs Video
18
Announcements
Task analysis / Contextual inquiry HW
  • average = 79/100, stdev. 8.4
Low-fi user test due Monday
  • questions
If you haven’t gotten a laptop yet, check with Wai-ling after class
19
SUEDE: Low-fi Prototyping for Speech-based UIs
Supports design practice
  • example scripts
  • Wizard of Oz
  • error simulation
  • iterative design (design-test-analysis)
Informal user interface
  • no speech recognition/synthesis
  • need not be programming expert
  • fast & fluid design
[Figure labels: machine prompt, user response]
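
Error simulation in a Wizard of Oz test can be pictured as follows: with some probability, the participant hears an error prompt instead of the response the wizard selected. This sketch is not SUEDE's implementation; the function name, prompts, and the 20% rate are assumptions chosen for illustration.

```python
import random


def simulate_recognition(intended_response: str,
                         error_response: str,
                         error_rate: float,
                         rng: random.Random) -> str:
    """Return the wizard's intended response, or the error prompt with probability error_rate."""
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be between 0 and 1")
    if rng.random() < error_rate:
        return error_response
    return intended_response


rng = random.Random(0)  # fixed seed so a test session can be replayed
outcomes = [
    simulate_recognition("You have three new messages.",
                         "Sorry, I didn't catch that.",
                         0.2, rng)
    for _ in range(1000)
]
# Fraction of turns in which a simulated recognition error was injected.
print(outcomes.count("Sorry, I didn't catch that.") / len(outcomes))
```

Injecting errors at a realistic rate lets the designer see how the dialogue holds up before any recognizer is involved.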
23
SUEDE Summary
SUEDE supports speech-based UI design
  • moving from concrete examples to abstractions
  • allows designer to accept responses that aren’t exactly what they originally had in mind
  • embeds iterative design w/ design-test-analyze
Designers using SUEDE need not be experts in speech recognition technology
24
One Vision of Future User Interfaces
Star Trek style UI
  • verbally ask the computer for information
  • may be common in mobile/hands-busy situations
  • problem: hard to design, build, & use!
      • requires perfect speech recognition & language understanding
25
Our Vision of Future User Interfaces
Multimodal, Context-aware UIs
  • multimodal
      • uses multiple input modalities (speech & gesture) to disambiguate
      • user says “move it to this screen” while pointing
  • context-aware
      • apps can be aware of location, user, what they are doing, …
      • people are talking -> don’t rely on speech I/O
Problem: how to prototype & test new ideas?
  • Informal UI Design Tools!
      • combine Wizard of Oz & informal storyboarding
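
The "move it to this screen" example can be sketched as a simple fusion rule: resolve the spoken "this" to whatever the user pointed at closest in time. The data structure, the 1.5 s window, and the names below are assumptions for illustration, not a description of any particular system.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PointingEvent:
    time_s: float   # when the gesture was observed
    target: str     # what the user pointed at, e.g. a screen id


def resolve_deictic(word_time_s: float,
                    gestures: List[PointingEvent],
                    max_gap_s: float = 1.5) -> Optional[str]:
    """Return the target of the gesture closest in time to the spoken word, if close enough."""
    if not gestures:
        return None
    nearest = min(gestures, key=lambda g: abs(g.time_s - word_time_s))
    if abs(nearest.time_s - word_time_s) > max_gap_s:
        return None
    return nearest.target


gestures = [PointingEvent(2.0, "wall display"), PointingEvent(9.4, "laptop")]
# Suppose "this" in "move it to this screen" was spoken at t = 9.1 s.
print(resolve_deictic(9.1, gestures))  # laptop
```

Returning None when no gesture is close enough lets the application fall back to asking the user, rather than guessing a target.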
26
Multimodal Error Correction
Dictation error correction study
  • found users are better at correcting recognition errors with a different input modality
  • recognizer got it wrong the first time -> it will get it wrong the second time
      • hyperarticulating aggravates the problem
Correct dictation errors with
  • vocal spelling, writing, typing, etc.
27
Summary
Speech UIs
  • may permit more natural computer access
  • allow us to use computers in more situations
  • are hard to get to work well
      • lack of visible state, tax working memory, recognition problems, etc.
UI tools are needed for speech UI design
Multimodal UIs address some of the problems with pure speech UIs
  • help disambiguate
  • help w/ correction