The Universal Speech Interface (USI) PDG Progress Report
Thomas Harris, Stefanie Tomko, Arthur Toth, James Sanders,
Alex Rudnicky, Roni Rosenfeld
School of Computer Science
Carnegie Mellon University
4 June 2003
Outline
• USI Project Summary
• USI Device Control
• USI User Studies
• Tech Transfer Initiative
  – USI Application Generator
Program Goals and Plan
• Overall program goal:
  – Design a universal (i.e. device-independent) interface for speech-based interaction with wearable and home devices
• Program plan & milestones:
  – Q1: analysis, interaction principles
  – Q2: build device-simulation environment
  – Q3: build first device prototype
  – Q4: initial user studies; development tools
Program Deliverables
• A novel universal design for speech-based interaction with wearable and home devices
• At least one demonstration system exemplifying the new interface
• A set of tools for rapid prototyping of compliant applications
The Universal Speech Interface (USI) In a Nutshell
• Unifying approach to human-machine speech communication
• Unified “look and feel” across all applications
  – analogous to the Xerox/Macintosh/Windows GUI look-and-feel
• Stylized, semi-natural interaction
  – analogous to the “Graffiti” alphabet for the Palm PDA
Existing Speech Paradigm 1: Command-and-control Systems
• Specialized language, optimized for a given application
  – each application has its own interface
• Intensive training of each user
• Daily use helps retain knowledge
Existing Speech Paradigm 2: Unconstrained Dialog Systems
• “Off-the-street” users, no training required
• System models existing human behavior
• But this comes at a cost:
  – each application requires a great deal of data, labor, human expertise
  – Speech Recognition technology is pushed to the limit
  – user does not easily grasp the application’s functional limits
    • Out-Of-Vocabulary words (OOV)
    • Out-Of-Domain concepts, requests
Is a Third Paradigm Needed?
• In practice, people are likely to use:
  – a handful of apps daily:
    • scheduler, contact manager, email, ...
  – many apps occasionally:
    • weather, restaurants, ...
• To exploit this, we need:
  – flexible, powerful interface for familiar applications
  – immediate engagement with occasional or new applications
Our Approach
• Identify application-independent universals:
  – user-side
  – machine-side
• Find suitable, general solutions
  – Human and machine meeting halfway
• Design a stylized, universal “look and feel”
• Teach it in 5 minutes
Universal Semantic primitives
• Help primitives
  – what can the machine do? how do I do X? what can I say?
• Speech channel primitives
  – detect & correct ASR errors; finished talking?
• Interaction primitives
  – turn taking; question answering; session management; undo
• Application primitives
  – environment variables: query, set
  – objects (e.g. lists): describe, navigate, create, modify, delete
USI Systems Developed
• Information Access
  – MovieLine
  – FlightLine
  – ApartmentLine
• Device Control
  – Stereo system
  – X-10 control (e.g., lights)
  – Alarm Clock applet
  – Digital Video Camera
  – Windows Media Player
Device Interaction Analysis
• Analysis was done on multiple devices
  – alarm clock / radio
  – VCR
  – cell phone
  – MP3 player
  – memo pad / email / vmail
  – copier/fax
USI/Device Design Issues
• Confirmation strategy
• Error handling strategy
• Exploration
• Navigation
• Disambiguation / context mgmt
• Orientation
• Querying state variables
USI/Device Design Issues
• Confirmation strategy: restate-&-execute
• Error handling strategy: ignore
• Exploration: “OPTIONS”
• Navigation: use concept of ‘focus’
• Disambiguation / context mgmt: implicit
• Orientation: “STATUS”
• Querying state variables: “WHAT IS THE...?”
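The design choices above can be illustrated with a small text-only sketch. It is not the USI implementation: the class name `StereoUSI`, the state variables, their values, and the rule that a bare value applies to the variable in focus are all assumptions made for illustration. Only the keywords "OPTIONS", "STATUS", and "WHAT IS THE ...?" and the restate-&-execute / ignore strategies come from the slide.

```python
class StereoUSI:
    """Toy model of the listed design choices (all names are hypothetical)."""

    def __init__(self):
        # State variables the user can set or query (illustrative values).
        self.state = {"mode": "tuner", "volume": "5"}
        # Values each state variable accepts (illustrative).
        self.options = {
            "mode": ["tuner", "auxiliary", "cd"],
            "volume": [str(n) for n in range(11)],
        }
        # Navigation: the variable the user addressed most recently.
        self.focus = None

    def _set(self, name, value):
        # Confirmation strategy: restate the command and execute it at once.
        self.focus = name
        self.state[name] = value
        return f"{name} {value}"

    def handle(self, utterance):
        words = utterance.lower().replace("?", "").split()
        if words == ["options"]:
            # Exploration: list what can be said at the current focus.
            if self.focus is None:
                return ", ".join(sorted(self.state))
            return ", ".join(self.options[self.focus])
        if words == ["status"]:
            # Orientation: report every state variable.
            return ", ".join(f"{k} {v}" for k, v in sorted(self.state.items()))
        if len(words) == 4 and words[:3] == ["what", "is", "the"]:
            # Querying a state variable.
            name = words[3]
            if name in self.state:
                self.focus = name
                return f"{name} {self.state[name]}"
            return ""
        if len(words) == 2 and words[0] in self.state:
            name, value = words
            if value in self.options[name]:
                return self._set(name, value)
            return ""
        if len(words) == 1 and self.focus is not None:
            # Implicit context: a bare value applies to the focused variable.
            if words[0] in self.options[self.focus]:
                return self._set(self.focus, words[0])
        # Error handling strategy: ignore anything not understood.
        return ""


if __name__ == "__main__":
    stereo = StereoUSI()
    for said in ["OPTIONS", "MODE CD", "tuner", "WHAT IS THE VOLUME?", "STATUS"]:
        print(f"user: {said!r:24} system: {stereo.handle(said)!r}")
```

In this sketch an unrecognized utterance returns an empty string, which stands in for the "ignore" error strategy; a real system would also have to deal with recognition confidence, which is left out here.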
Hooking up with the PUC project
• Fits within the PUC project’s vision of automatically generated interfaces with different modalities and form factors
• But, can also be used as a standalone speech interface
• Compatibility with visual design is desirable, but not always natural:
  – nameless states (speech interface must have a name for everything!)
  – speech interface can have shortcuts (“MODE: CD” vs. “CD”)
Meshing with the PUC project
• Device capabilities specified by XML doc
• States vs. Action dichotomy of the visual interface does not always conform to speech interface intuition.
• For now, creating our own interface specification document
• Ultimately, will augment XML DTD, so both interfaces can co-exist
USI Device control (a.k.a. James the Butler)
[Figure: command tree rooted at “James”. Stereo branch: on, off, x-bass, volume up / down / off, and (mode) tuner / auxiliary / cd, where choosing a mode turns the stereo on. Tuner: (radioband) am or fm, each with frequency... and station..., plus seek forward / backward. CD: (status) play / pause / stop, disc #, next track, last track, random..., repeat.... A second branch is labeled digital camera...]
Hardware hacking courtesy of the PUC project
User study
• Compared Speech Graffiti (SG) & natural language MovieLines
• How does Speech Graffiti compare to a natural language interface?
  – Subjective user satisfaction
  – Task completion rates
  – Word error rates
• How well do users "get" Speech Graffiti?
  – How often do they speak within the grammar?
  – In what ways do they deviate from the grammar?
Subjective user satisfaction
• 17 of 23 preferred Speech Graffiti (SG)
[Figure: mean user satisfaction rating (scale 1 to 7) for NL-ML and SG-ML, by category: system resp. acc., likeability, cog. demand, annoyance, habitability, speed, OVERALL]
• SG user satisfaction ratings higher than NL in all categories
• SG ratings positive except in annoyance & habitability
Computer experience & training
• Computer Science / Engineering backgrounds and / or programming experience
  – Higher user satisfaction ratings
  – Better task completion rates
• Training in-domain vs. out-of-domain
  – No differences in user satisfaction or task completion rates
Task completion
• Overall
  – 67.9% SG tasks
  – 67.4% NL tasks
• Individual means
  – 5.43 of 8 SG tasks
  – 5.30 of 8 NL tasks
[Figure: mean task completion rate (axis 0 to 8), SG-ML vs. NL-ML]
Time-to-completion
• Completed tasks
  – 67.9 seconds SG
  – 73.4 seconds NL
• Incomplete tasks:
[Figure: time, in seconds (axis 0 to 600), for SG-ML and NL-ML in “best case” and “real world” panels; “59 incompletes” is marked in each panel]
Turns-to-completion
• Completed tasks
  – 8.2 turns SG
  – 3.9 turns NL
• Incomplete tasks:
[Figure: # of turns for SG-ML and NL-ML in “best case” and “real world” panels; “59 incompletes” is marked in each panel]
Word error rates
• Very high for both systems
  – On "cleaned" set (on-task, non-noisy utts)
• Concept error is lower for USI
  – SG: -29.2% from WER
  – NL: +0.8% from WER
• Low error rate is key to acceptance
  – 6 who preferred NL-ML had highest SG WER

             WER      # of utts   subj mean   subj median
SG Movie     35.1%    3626        35.0%       30.0%
NL Movie     51.2%    1854        50.3%       48.9%
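For readers unfamiliar with the metric, word error rate is conventionally the number of word substitutions, deletions, and insertions needed to turn the recognizer output into the reference transcript, divided by the number of reference words. The sketch below computes it for a single utterance with a standard edit-distance recurrence. The function name and the example sentences are illustrative assumptions; the slides do not say which scoring tool produced the figures above.

```python
def word_error_rate(reference, hypothesis):
    """Per-utterance WER: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (initially none) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i reference words vs. an empty hypothesis: i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One of four reference words is dropped, so this prints 0.25.
    print(word_error_rate("theater is the manor", "theater is manor"))
```

Because insertions count as errors, the value can exceed 1.0 for a single utterance. A corpus-level figure is usually obtained by summing errors and reference words over all utterances before dividing, which is one reason an overall WER and a per-subject mean can differ.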
WER & user satisfaction
• Good correlation for SG
[Figure: user satisfaction rating (axis 1 to 6) vs. % word-error rate (axis 0 to 80), one panel each for SG-ML and NL-ML]
How often do users speak within the Speech Graffiti grammar?
• Actually, pretty often!
  – mean 80.5%, median 87.4%
• … and grammaticality leads to user satisfaction
[Figure: user satisfaction rating (axis 1 to 7) vs. % grammatical (axis 0% to 100%)]
How do users deviate from the grammar?
• slot only: 14.6%
• time syntax: 1.3%
• subject-verb agreement: 5.7%
• more syntax: 4%
• plural+options: 2%
• disfluency: 4.3%
• keyword problem: 8.1%
• value+options: 1%
• missing is/are: 11%
• endpoint: 1.6%
• value only: 6.7%
• out-of-vocabulary concept: 5.1%
• out-of-vocabulary word: 14.0%
• general syntax: 20.6%
Future Interface Design Work
• Redesign Help facility
  – SG works best for those who "get it"
  – Current system provides no assistance to "clueless user"
• Error analysis
  – Compare failure cases in SG and NL interfaces
  – Compare user recovery attempts in SG and NL
• Address issues of generalizability
  – Promoting transparency of slot set and response sets
  – Accessing information sets rather than single items
• Adjust grammar components
Future Architecture Work
• Integrate current USI environments
  – Information Access
  – Device Control
• Improve interface between PUC and USI components
• Identify USI-specific techniques to achieve lower WER
• Improved documentation and distribution packaging
Tech Transfer Initiative
• Tools for creating new USI apps
  – 3 days to create a new application
  – prior exposure to speech technology highly beneficial
  – decided to further reduce the barrier: create an application generator
From 3 Days to a Few Hours
• A USI Application Generator
• New USI applications w/out programming!
• XML document fully specifies the application
  – slot names
  – accepted inputs
  – data types
  – slot properties
  – ...
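As a rough illustration of an application that is fully described by an XML document, the sketch below builds and re-reads a specification holding slot names, accepted inputs, data types, and slot properties. The element and attribute names (`application`, `slot`, `input`, `property`), the application name `DemoLine`, and the sample slots are all invented for this example; the actual USI schema is not given in the slides.

```python
import xml.etree.ElementTree as ET


def build_app_spec(app_name, slots):
    """Serialize a list of slot descriptions to an XML string (hypothetical schema)."""
    root = ET.Element("application", name=app_name)
    for slot in slots:
        slot_el = ET.SubElement(
            root, "slot", name=slot["name"], datatype=slot["datatype"]
        )
        # Accepted inputs: the values a user may speak for this slot.
        for value in slot.get("accepted_inputs", []):
            ET.SubElement(slot_el, "input").text = value
        # Slot properties: free-form name/value pairs.
        for key, val in slot.get("properties", {}).items():
            ET.SubElement(slot_el, "property", name=key, value=val)
    return ET.tostring(root, encoding="unicode")


def slot_names(spec_xml):
    """Read the slot names back out of a specification string."""
    return [slot.get("name") for slot in ET.fromstring(spec_xml).findall("slot")]


# Illustrative data only; these slots are not taken from a real USI application.
spec = build_app_spec("DemoLine", [
    {"name": "title", "datatype": "string",
     "properties": {"queryable": "true"}},
    {"name": "theater", "datatype": "string",
     "accepted_inputs": ["first theater", "second theater"]},
])

if __name__ == "__main__":
    print(spec)
    print(slot_names(spec))
```

Keeping the whole application in one declarative document is what lets a generator, rather than a programmer, produce it: a form-based front end only has to emit the same structure.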
From a Few Hours to 15 minutes?
• Created a Web interface for generating the XML document
• Form filling, pulldown menus
• Strong effort to further simplify the process, minimize complexity of form
  – many defaults
  – for less common choices, edit the XML doc.
• More importantly, no computer savvy needed
Web Application Generator
• Repository and tool for creating USI database applications
• Abundant online help to guide users through process
• Accessible to anyone with an Internet connection
Web Application Generator
• Two-step process:
  – General specification
  – Slot-by-slot specification
    • choose datatype from built-in list, or create own
• Fully featured system with save, copy, delete functionality
• Hides intricacies of XML document writing
• Advanced users have ability to further alter the final XML document
Web Application Generator
• Built-in generic voice; can record own voice
• DB backend
  – Postgres
  – Oracle
  – ODBC (including ASCII files)
  – Ultimately: web tables
• Platform:
  – originally: mixed Unix/Windows, telephone based
  – converted to: pure Windows, telephone or laptop
Transferring USI to PDG members
• We do house calls!
  – Carnegie Mellon will install USI developer environment for each interested member and will train member staff in the use of the developer environment
  – Provide a short tutorial on USI principles and interface design