Google TechTalk – March 6th, 2009
Voice Browser and Multimodal Interaction in 2009
Paolo Baggia, Director of International Standards
Overview
- A Bit of History
- W3C Speech Interaction Framework Today: ASR/DTMF, TTS, Lexicons, Voice Dialog and Call Control, Voice Platforms and Next Evolutions
- W3C Multimodal Interaction Today: MMI Architecture, EMMA and InkML, a language for Emotions
- Near Future
Company Profile
- Privately held company (fully owned by Telecom Italia), founded in 2001 as a spin-off from Telecom Italia Labs, capitalizing on 30 years of experience and expertise in voice processing.
- Global company, leader in Europe and South America for award-winning, high-quality voice technologies (synthesis, recognition, authentication and identification), available in 26 languages and 62 voices.
- Multilingual, proprietary technologies protected by over 100 patents worldwide.
- Financially robust: break-even reached in 2004, revenues and earnings growing year on year.
- Growth-plan investment approved for the evolution of products and services.
- Offices in New York; headquarters in Torino; local representative sales offices in Rome, Madrid, Paris, London, Munich.
- Flexible: about 100 employees, plus a vibrant ecosystem of local freelancers.
International Awards
- "Best Innovation in Automotive Speech Synthesis" Prize, AVIOS-SpeechTEK West 2007
- "Best Innovation in Expressive Speech Synthesis" Prize, AVIOS-SpeechTEK West 2006
- "Best Innovation in Multi-Lingual Speech Synthesis" Prize, AVIOS-SpeechTEK West 2005
- "2008 Frost & Sullivan European Telematics and Infotainment Emerging Company of the Year" Award
- Winner of "Market Leader – Best Speech Engine", Speech Industry Award 2007 and 2008
- Loquendo MRCP Server: winner of the 2008 IP Contact Center Technology Pioneer Award
A Bit of History
Standard Bodies
Two main standard bodies:

W3C – World Wide Web Consortium
Founded in 1994 by Tim Berners-Lee with a mission to lead the Web to its full potential. Staff based at MIT (USA), ERCIM (France) and Keio University (Japan). About 400 members all over the world; 50 Working, Interest and Coordination Groups. W3C is where the framework of today's Web is developed (HTML, CSS, XML, DOM, SOAP, RDF, OWL, VoiceXML, SVG, XSLT, P3P, Internationalization, Web Accessibility, Device Independence).

IETF – Internet Engineering Task Force
Founded in 1986, but grew from 1991 under the Internet Society. About 1300 members. HTTP, SIP, RTP and many other protocols. The Media Resource Control Protocol (MRCP) is very relevant for speech platforms.

Two industry forums:

VoiceXML Forum (www.voicexml.org)
Inventors of VoiceXML 1.0, later submitted to W3C for standardization. Its current goal is to promote, disseminate and support VoiceXML and related standards.

SALT Forum (www.saltforum.org)
Supported by Microsoft to define a lightweight markup for telephony and multimodal applications.

Other relevant bodies: 3GPP, OMA, ETSI, NIST
The (r)evolution of VoiceXML, 1998–2009:
- 1998: W3C Voice Browser Workshop
- 1999: VoiceXML Forum born (by AT&T, IBM, Lucent, Motorola); W3C charters the Voice Browser WG
- 2000: VoiceXML 1.0 released
- 2002: W3C charters the Multimodal Interaction WG; SALT Forum born (by Cisco, Comverse, Intel, Microsoft, Philips, SpeechWorks)
- 2004: VoiceXML 2.0, SRGS 1.0 and SSML 1.0 become W3C Recommendations
- 2007: SISR 1.0 and VoiceXML 2.1 become W3C Recommendations
- 2008: PLS 1.0 becomes a W3C Recommendation
- 2009: EMMA 1.0 becomes a W3C Recommendation

[Photo: preparing to announce VoiceXML 1.0, Friday Feb. 25th, 2000, Lucent, Naperville, Illinois. Left to right: Gerald Karam (AT&T), Linda Boyer (IBM), Ken Rehor (Lucent), Bruce Lucas (IBM), Pete Danielsen (Lucent), Jim Ferrans (Motorola), Dave Ladd (Motorola).]
Speech Interface Framework in 2000 (by Jim Larson)

[Diagram: the User interacts through the Telephone System with ASR, a DTMF tone recognizer, TTS and a pre-recorded audio player; a Dialog Manager, with Language Understanding, Context Interpretation, Media Planning and Language Generation components, connects to the World Wide Web. The associated languages: Speech Recognition Grammar Spec. (SRGS), N-gram Grammar ML, Semantic Interpretation for Speech Recognition (SISR), Natural Language Semantics ML, Speech Synthesis Markup Language (SSML), Pronunciation Lexicon Specification (PLS), Call Control XML (CCXML), VoiceXML 2.0/2.1 and EMMA.]
Speech Interface Framework – Today (by Jim Larson)

[Same diagram, updated: EMMA 1.0 now fills the Natural Language Semantics role.]
Speech Interface Framework – End of 2009 (by Jim Larson)

[Same diagram again: the expected end-of-2009 snapshot of the framework.]
W3C Process
Architectural Changes

Traditional (proprietary) architecture: the user talks to a speech application built with a proprietary SCE and running on a proprietary platform, which drives ASR/DTMF and TTS/audio directly.

VoiceXML architecture: the user talks to a VoiceXML browser running on a VoiceXML platform, which drives ASR/DTMF and TTS/audio and fetches the application over HTTP from a Web application: .vxml pages, .grxml/.gram grammars, .ssml prompts, .wav/.mp3 audio, and .pls lexicons.
The VoiceXML Impact
VoiceXML changed the landscape of IVRs and speech application creation: from proprietary to standards-based speech applications.

Before:
- Proprietary platforms (HW & SW)
- Proprietary applications (built with a proprietary SCE)
- Mainly DTMF and pre-recorded prompts
- First attempts to add speech into IVR

After:
- Standard VoiceXML platforms
- Standards for speech technologies
- Standard tools for VoiceXML applications
- Integration of DTMF and ASR
- Still a predominance of DTMF, but more and more speech applications
Overview
- A Bit of History
- W3C Speech Interaction Framework Today: ASR/DTMF, TTS, Lexicons, Voice Dialog and Call Control, Voice Platforms and Next Evolutions
- W3C Multimodal Interaction Today: MMI Architecture, EMMA and InkML, a language for Emotions
- Near Future
Standards for ASR and DTMF: SRGS 1.0, SISR 1.0
W3C Standards for Speech/DTMF Grammars

SYNTAX – SRGS defines constraints on the admissible sentences for a specific recognition turn. It covers voice and DTMF grammars, in two notations: ABNF and XML.
http://www.w3.org/TR/speech-grammar/

SEMANTICS – SISR describes how to produce results after an utterance is recognized, in two flavors: literal and script.
http://www.w3.org/TR/semantic-interpretation/
SRGS/SISR Grammars for “Torino”
SRGS ABNF, SISR script:

#ABNF 1.0 iso-8859-1;
mode voice;
tag-format <semantics/1.0>;
{var unused=7;};
public $main = Torino {out="10100";};

SRGS XML, SISR script:

<?xml version="1.0" encoding="UTF-8"?>
<grammar xml:lang="en-US" version="1.0"
         xmlns="http://www.w3.org/2001/06/grammar"
         tag-format="semantics/1.0">
  <tag>var unused=7;</tag>
  <rule id="main" scope="public">
    <token>Torino</token><tag>out="10100";</tag>
  </rule>
</grammar>

SRGS ABNF, SISR literal:

#ABNF 1.0 iso-8859-1;
mode voice;
tag-format <semantics/1.0-literals>;
public $main = Torino {10100};

SRGS XML, SISR literal:

<?xml version="1.0" encoding="UTF-8"?>
<grammar xml:lang="en-US" version="1.0"
         xmlns="http://www.w3.org/2001/06/grammar"
         tag-format="semantics/1.0-literals">
  <rule id="main" scope="public">
    <token>Torino</token><tag>10100</tag>
  </rule>
</grammar>
SRGS/SISR Standards – Pros
- Powerful syntax (CFG) and very powerful semantics (ECMAScript)
- DTMF and voice input are transparent to the application
- Wide and consistent adoption among technology vendors
- The two notations, XML and ABNF, are great:
  - Developers can choose (XML validation vs. compact format)
  - Transformations are possible: XML to ABNF (easy, a simple XSLT); ABNF to XML (requires an ABNF parser)
- Open-source tools might be created to:
  - Validate grammar syntax
  - Transform grammars
  - Debug grammars on written input
  - Run coverage tests: explode covered sentences, GenSem, SemTester, etc.
SRGS/SISR Standards – Small Issues
Semantics declaration: the tag-format attribute
- If the value is "semantics/1.0": SISR script semantics is mandated inside semantic tags
- If the value is "semantics/1.0-literals": SISR literal semantics is mandated inside semantic tags
- If it is missing: unclear! Risk of interoperability troubles

SISR script semantics
- Clumsy default assignment: returns the last referenced rule only
- The developer must properly propagate results up
- Be careful when redefining "out": assigning a scalar value might result in errors

SISR literal semantics
- Only useful for very simple word-list rules
- No support for encapsulating rules
- Use SISR literal grammars as external references only!
SRGS/SISR – Encapsulated Grammars
[Diagram: a script-semantics grammar (Gr1.grxml) referencing sub-grammars with mixed SISR flavors: Gr2.gram (literal) and Gr3.grxml (script), which in turn references Gr41.grxml (literal) and Gr42.gram (script).]
SRGS/SISR Standards – Rich XML Results
Section 7 of the SISR 1.0 specification (http://www.w3.org/TR/semantic-interpretation/#SI7) gives the serialization rules from SISR ECMAScript results into XML. Edge cases: arrays, the special variables "_attributes" and "_value", and the creation of namespaces and prefixes.

ECMAScript result:

{
  drink: {
    _nsdecl: {
      _prefix: "n1",
      _name: "http://www.example.com/n1"
    },
    _nsprefix: "n1",
    liquid: {
      _nsdecl: {
        _prefix: "n2",
        _name: "http://www.example.com/n2"
      },
      _attributes: {
        color: {
          _nsprefix: "n2",
          _value: "black"
        }
      },
      _value: "coke"
    },
    size: "medium"
  }
}

XML serialization:

<n1:drink xmlns:n1="http://www.example.com/n1">
  <liquid n2:color="black"
          xmlns:n2="http://www.example.com/n2">coke</liquid>
  <size>medium</size>
</n1:drink>
SRGS/SISR Standards – Next Steps
Adoption of the PLS 1.0 lexicon
- Clear entry point into PLS lexicons: the <token> element
- Missing: a role attribute on <token> to allow homograph disambiguation

Next extensions via errata
- XML 1.1 support and IRIs
- Update normative references

No major extensions are needed!
Speech Synthesis: SSML 1.0/1.1
TTS – Functional Architecture and Markup/Non-Markup support
Structure analysis
- Markup support: <p>, <s>
- Non-markup support: infer the structure by automatic text analysis

Text normalization
- Markup support: <say-as> for dates, times, phone numbers, numbers; <sub> for acronyms and transliterations
- Non-markup support: automatically identify and convert constructs

Text-to-phoneme conversion
- Markup support: <phoneme>, <lexicon>
- Non-markup support: look up in a pronunciation dictionary

Prosody analysis
- Markup support: <emphasis>, <break>, <prosody>
- Non-markup support: automatically generate prosody through analysis of document structure and sentence syntax

Waveform production
- Markup support: <voice>, <audio>

http://www.w3.org/TR/speech-synthesis/
SSML 1.0 – Language description (I)
Document structure
- <speak> root element, with a version attribute and the SSML namespace
- xml:lang selects the language

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <p>I don't speak Japanese.</p>
  <p xml:lang="ja">Nihongo-ga wakarimasen.</p>
</speak>

Processing and pronunciation
- <p> and <s> (paragraph and sentence): give a structure to the text
- <say-as>: indicates the type of text construct contained within the element (e.g. dates, numbers)
- <phoneme>: provides a phonetic pronunciation, in IPA, for the contained text
- <sub>: provides substitutions, e.g. for expanding acronyms into a sequence of words

http://www.w3.org/TR/speech-synthesis/
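The structural elements above are easy to inspect programmatically. A small Python sketch parsing the slide's own example; the only subtlety is that SSML elements live in a namespace, and xml:lang maps to the XML namespace:

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
XML_NS = "http://www.w3.org/XML/1998/namespace"

ssml = """<speak version="1.0"
  xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <p>I don't speak Japanese.</p>
  <p xml:lang="ja">Nihongo-ga wakarimasen.</p>
</speak>"""

root = ET.fromstring(ssml)
paragraphs = root.findall(f"{{{SSML_NS}}}p")
# xml:lang on the second <p> overrides the document language
langs = [p.get(f"{{{XML_NS}}}lang") for p in paragraphs]
print(len(paragraphs), langs)  # 2 [None, 'ja']
```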
SSML 1.0 – Language description (II)
Style
- <voice> element; the voice selection attributes are name, xml:lang, gender, age, and variant

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0"
       xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  The moon is rising on the beach, when John says, looking Mary in the eyes:
  <voice name="simon">I love you!</voice>
  but she suddenly replies:
  <voice name="susan">Please, be serious!</voice>
</speak>

- <emphasis> element: requests that the contained text be spoken with emphasis; the level attribute can be set to strong, moderate, reduced, or none
- <break> element: controls the pausing between words; the time attribute takes time expressions ("5s", "20ms"), and the strength attribute takes none, x-weak, weak, medium (the default), strong, or x-strong

http://www.w3.org/TR/speech-synthesis/
SSML 1.0 – Language description (III)
Prosody
- <prosody> element: permits control of the pitch, speaking rate and volume of the speech output. Its attributes are:
  - volume: the volume for the contained text
  - rate: the speaking rate in words per minute for the contained text
  - duration: a value in seconds or milliseconds for the desired time to take to read the element contents
  - pitch: the baseline pitch for the contained text
  - range: the pitch range (variability) for the contained text, in Hertz
  - contour: sets the actual pitch contour for the contained text

Other elements
- <audio>: plays an audio file
- <mark>: places a marker into the text/tag sequence
- <desc>: provides a description of a non-speech audio source in <audio>

http://www.w3.org/TR/speech-synthesis/
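A hedged illustration of how these prosody attributes appear in markup; the attribute values below are invented for the example, not taken from the talk. The Python wrapper simply parses the snippet and reads the attributes back:

```python
import xml.etree.ElementTree as ET

NS = "http://www.w3.org/2001/10/synthesis"

# Invented SSML snippet combining <prosody>, <break>, and a contour.
snippet = """<speak version="1.0"
  xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <prosody rate="slow" pitch="+10%" volume="soft">
    This sentence is spoken slowly, a bit higher, and quietly.
  </prosody>
  <break time="500ms"/>
  <prosody contour="(0%,+20Hz) (50%,+40Hz) (100%,+0Hz)">
    A rising-then-falling pitch contour.
  </prosody>
</speak>"""

root = ET.fromstring(snippet)
# Read the rate attribute of each <prosody> element in document order
rates = [p.get("rate") for p in root.iter(f"{{{NS}}}prosody")]
print(rates)  # ['slow', None]
```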
Towards SSML 1.1 – Motivations
Internationalization needs
- Three workshops: Beijing (Nov '05), Crete (May '06), Hyderabad (Jan '07)
- Results:
  - No major needs for Eastern and Western European languages
  - Many issues for Far East languages (Mandarin, Japanese, Korean)
  - Some specific issues for Semitic languages (Arabic, Hebrew), Farsi and many Indian languages: mark input with or without vowels; mark the transliteration scheme used for the input

Extensions required by the Voice Browser WG
- More powerful error handling, selection of fall-back strategies
- Trimming attributes
- volume attribute moved to a logarithmic scale (it was linear before)

Alignment with the PLS 1.0 specification for user lexicons

http://www.w3.org/TR/speech-synthesis11/
SSML 1.1 – Language Changes
Word and lexicon extensions
- <w> element to delimit words
- <lookup> element: applies a referenced pronunciation lexicon to the contained text

Phonetic Alphabet Registry creation and adoption
- "ipa" for the International Phonetic Alphabet
- A registration policy for other phonetic alphabets, similar to LTRU for language tags
- Candidates: Pinyin for Mandarin Chinese; JEITA for Japanese; X-SAMPA, an ASCII transliteration of IPA

http://www.w3.org/TR/speech-synthesis11/
Pronunciation Lexicon: PLS 1.0
Pronunciation Lexicons
A pronunciation lexicon is a mapping between words (or short phrases), their written representations, and their pronunciations, suitable for use by an ASR engine or a TTS engine.

Pronunciation lexicons are not only useful for voice browsers. They have also proven an effective mechanism to support accessibility for the differently abled, as well as greater usability for all users; they are used to good effect in screen readers and in user agents supporting multimodal interfaces.

The W3C Pronunciation Lexicon Specification (PLS) Version 1.0 is designed to enable an interoperable specification of pronunciation lexicons.

http://www.w3.org/TR/pronunciation-lexicon/
PLS 1.0 – Language Overview
- A PLS document is a container (<lexicon>) of several lexical entries (<lexeme>)
- Each lexical entry contains one or more spellings (<grapheme>) and one or more pronunciations (<phoneme>) or substitutions (<alias>)
- Each PLS document is related to a single language (xml:lang)
- SSML 1.0 and SRGS 1.0 documents can reference one or more PLS documents
- The current version doesn't include morphological, syntactic or semantic information associated with pronunciations

http://www.w3.org/TR/pronunciation-lexicon/
PLS 1.0 – An Example
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
    xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2005/01/pronunciation-lexicon
      http://www.w3.org/TR/pronunciation-lexicon/pls.xsd"
    alphabet="ipa" xml:lang="en-US">

  <lexeme>
    <grapheme>Sepulveda</grapheme>
    <phoneme>səˈpʌlvɪdə</phoneme>
  </lexeme>

  <lexeme>
    <grapheme>W3C</grapheme>
    <alias>World Wide Web Consortium</alias>
  </lexeme>

</lexicon>

http://www.w3.org/TR/pronunciation-lexicon/
PLS 1.0 – Used for TTS
SSML 1.0:

<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.0" … xml:lang="en-US">
  <lexicon uri="http://www.example.com/SSMLexample.pls"/>
  The title of the movie is: "La vita è bella" (Life is beautiful),
  which is directed by Benigni.
</speak>

PLS 1.0:

<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0" … alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>La vita è bella</grapheme>
    <phoneme>ˈlɑ ˈviːɾə ˈʔeɪ ˈbɛlə</phoneme>
  </lexeme>
  <lexeme>
    <grapheme>Benigni</grapheme>
    <phoneme>bɛˈniːnji</phoneme>
  </lexeme>
</lexicon>

http://www.w3.org/TR/pronunciation-lexicon/
PLS 1.0 – Used for ASR
SRGS 1.0:

<?xml version="1.0" encoding="UTF-8"?>
<grammar version="1.0" xml:lang="en-US" root="movies" mode="voice">
  <lexicon uri="http://www.example.com/SRGSexample.pls"/>
  <rule id="movies" scope="public">
    <one-of>
      <item>Terminator 2: Judgment Day</item>
      <item>Pluto's Judgement Day</item>
    </one-of>
  </rule>
</grammar>

PLS 1.0:

<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0" … alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>judgment</grapheme>
    <grapheme>judgement</grapheme>
    <phoneme>ˈdʒʌdʒ.mənt</phoneme>
  </lexeme>
</lexicon>

http://www.w3.org/TR/pronunciation-lexicon/
Examples of Use
Multiple pronunciations for the same orthography
Multiple orthographies
Homophones
Homographs
Acronyms, Abbreviations, etc.
Detailed descriptions can be found in the W3C specification, on Wikipedia, and in Paolo Baggia's SpeechTEK 2008 and Voice Search 2009 talks.
PLS 1.0 – Open Issues
No wide support for IPA in speech engines
- Changes are slowly under way
- The Phonetic Alphabet Registry will open the door to other alphabets in a controlled and interoperable way

Integration in ASR/TTS
- SSML 1.1 will interoperate with PLS 1.0
- SRGS 1.0 is still missing support for a role attribute for PLS 1.0

No matching algorithm inside PLS, because it is mainly a data format

http://www.w3.org/TR/pronunciation-lexicon/
Pronunciation Alphabets: IPA, SAMPA
International Phonetic Alphabet
Pronunciation is represented by a phonetic alphabet:
- Standard phonetic alphabets: the International Phonetic Alphabet (IPA)
- Well-known phonetic alphabets: SAMPA (ASCII-based, simple to write), Pinyin (Mandarin Chinese), JEITA (Japanese), etc.
- Proprietary phonetic alphabets

International Phonetic Alphabet (IPA)
- Created by the International Phonetic Association (active since 1886), a collaborative effort by all the major phoneticians around the world
- A universally agreed system of notation for the sounds of languages
- Covers all languages
- Requires Unicode to write it
- Normatively referenced by PLS
IPA – Chart
The IPA was founded in 1886 and is the major international association of phoneticians. The IPA alphabet provides symbols making possible the phonemic transcription of all known languages.

IPA characters can be encoded in Unicode by supplementing ASCII with characters from other ranges, particularly:
- IPA Extensions (U+0250–U+02AF)
- Latin Extended-A (U+0100–U+017F)

See the detailed charts: http://www.unicode.org/charts
Phonetic Alphabets – Issues
The real problem is how to write pronunciations reliably, unless you are a trained phonetician.

There are issues with fonts, authoring tools and browsers, but Unicode fonts today support the IPA Extensions; see:
- http://www.phon.ucl.ac.uk/home/wells/phoneticsymbols.htm

There are very few tools to help write pronunciations and let you listen to what you have written.
- A way forward: make pronunciations available in IPA or other general phonetic alphabets.
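One practical workaround is to author pronunciations in an ASCII scheme like X-SAMPA and convert them to IPA mechanically. A toy Python sketch: the mapping table covers only the symbols needed for this one example, and a real converter must also handle multi-character X-SAMPA symbols.

```python
# Toy X-SAMPA -> IPA converter over a deliberately tiny table.
XSAMPA_TO_IPA = {
    '"': "ˈ",   # primary stress
    "@": "ə",   # schwa
    "V": "ʌ",
    "I": "ɪ",
    "s": "s", "p": "p", "l": "l", "v": "v", "d": "d",
}

def xsampa_to_ipa(text):
    """Map each X-SAMPA character to IPA; unknown chars pass through."""
    return "".join(XSAMPA_TO_IPA.get(ch, ch) for ch in text)

# "Sepulveda" as in the PLS 1.0 example: s@"pVlvId@
print(xsampa_to_ipa('s@"pVlvId@'))  # səˈpʌlvɪdə
```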
Voice Dialog Languages: VoiceXML 2.0, VoiceXML 2.1
VoiceXML 2.0 – Features, Elements
Menus, forms, sub-dialogs: <menu>, <form>, <subdialog>

Input
- Speech recognition: <grammar>
- Recording: <record>
- Keypad: <grammar mode="dtmf">

Output
- Audio files: <audio>
- Text-to-speech: <prompt>

Variables (ECMA-262): <var>, <assign>, <script>, with scoping rules

Events: <nomatch>, <noinput>, <help>, <catch>, <throw>

Transition and submission: <goto>, <submit>

Telephony
- Connection control: <transfer>, <disconnect>
- Telephony information

Platform specifics: <object>

Performance: fetch and caching properties

http://www.w3.org/TR/voicexml20/
VoiceXML 2.0 – Execution Model
Execution is synchronous
- Only the disconnect event is handled (somewhat) asynchronously

Execution is always in a single dialog, a <form> or a <menu>
- The Form Interpretation Algorithm drives <field> selection

Prompts are queued
- Played only when a waiting state is reached
- Played before a fetchaudio is started

Processing is always in one of two states:
- Waiting for input in an input item: <field>, <record>, <transfer>, etc.
- Transitioning between input items in response to an input

Event-driven:
- <nomatch>, <noinput>: handling of user input events
- <catch>, <throw>: generalized event mechanism
- connection.*: call event handling
- error.*: error event handling

http://www.w3.org/TR/voicexml20/
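The Form Interpretation Algorithm's select/collect/process loop can be sketched as follows. This is a heavy simplification with an assumed get_input callback standing in for the recognizer; the real FIA also handles guard conditions, events, and <filled> actions.

```python
def run_form(fields, get_input):
    """Minimal Form Interpretation Algorithm sketch:
    repeatedly select the first unfilled <field>, play its
    prompt, wait for input, and fill the form-item variable."""
    values = {name: None for name in fields}
    while True:
        # Select phase: first field whose variable is still undefined
        name = next((n for n in fields if values[n] is None), None)
        if name is None:
            return values                 # form is complete
        # Collect phase: queue the prompt, wait for the user's input
        utterance = get_input(fields[name])
        # Process phase: fill the variable (no <filled> actions here)
        values[name] = utterance

answers = iter(["Boston", "tomorrow"])
result = run_form({"city": "Where to?", "date": "When?"},
                  lambda prompt: next(answers))
print(result)  # {'city': 'Boston', 'date': 'tomorrow'}
```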
VoiceXML 2.1 – Extended Features
- Dynamically reference grammars and scripts: <grammar expr="…">, <script expr="…">
- Record the user's utterance during form filling: recordutterance property; new shadow variables recording, recordingsize, recordingduration
- Detect barge-in during prompt playback (SSML <mark>): markexpr attribute; new shadow variables markname and marktime
- Fetch XML data without a transition: <data>, with a read-only subset of the DOM
- Dynamically concatenate prompts: <foreach>, iterating through ECMAScript arrays and executing content
- Send data upon disconnect: <disconnect namelist="…">
- Additional transfer type: <transfer type="consultation">

http://www.w3.org/TR/voicexml21/
VoiceXML Applications
Static VoiceXML applications
- The VoiceXML page is always the same, so the user experience never changes
- No personalization or customization

Dynamic VoiceXML applications
- The user experience is customized, after authentication (PIN) or using the caller-id or SIP-id
- Data driven
- Dynamic pages generated at runtime, e.g. by JSP, ASP, etc.

http://www.w3.org/TR/voicexml20/
http://www.w3.org/TR/voicexml21/
A Drawback of VoiceXML 2.0
A drawback of VoiceXML is that the transition from one VoiceXML page to another is a costly activity:
- Fetch the new page, if not cached
- Parse the page
- Initialize the context, possibly loading and initializing a new application root document
- Load or pre-compile scripts

Transitions are also the only way to return data to the Web application (if the VoiceXML is dynamic), so pages must be created to include dynamic data.

VoiceXML 2.1 addresses part of this drawback by feeding dynamic data to a running VoiceXML page.

http://www.w3.org/TR/voicexml20/
http://www.w3.org/TR/voicexml21/
Advantages of VoiceXML 2.1 - AJAX
Two of the eight new features in VoiceXML 2.1 help to create more dynamic VoiceXML applications: the <data> element and the <foreach> element.

A static VoiceXML document can fetch user-specific data at runtime, without changing the VoiceXML document:
- <data> allows the retrieval of arbitrary XML data without a VoiceXML document transition; the returned XML data is accessible through a subset of DOM primitives
- <foreach> extends prompts to allow iteration over a dynamic array of information, to build a dynamic prompt

This is similar to AJAX programming for HTML services: it decouples the presentation layer (VoiceXML) from the business logic (accessed via <data>).

http://www.w3.org/TR/voicexml21/
VoiceXML 2.1 – <data> Element
Attributes:
- name: the variable to be filled with the DOM of the retrieved data
- src or srcexpr: the URI of the location of the XML data
- namelist: the list of variables to be submitted
- method: either "get" or "post"
- enctype: media encoding
- fetch and caching attributes

Like <var>, it may appear in executable content and in <form> and <vxml>. The value of name must be a declared variable; the platform fills that variable with the DOM of the fetched XML data. The <data> element is synchronous (the service stops to get the data).

http://www.w3.org/TR/voicexml21/
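The behaviour of <data> can be mimicked in Python with the standard DOM API. This is a sketch: the weather document shape is invented, and the HTTP fetch is replaced by a local string.

```python
from xml.dom.minidom import parseString

# Stand-in for what <data name="weather" src="..."/> would fetch
# over HTTP; the document shape is invented for the example.
fetched = "<weather><city>Torino</city><temp unit='C'>12</temp></weather>"

# The page script then reads the result with the read-only DOM
# subset that VoiceXML 2.1 exposes (getElementsByTagName,
# firstChild, getAttribute, ...).
dom = parseString(fetched)
city = dom.getElementsByTagName("city")[0].firstChild.data
temp = dom.getElementsByTagName("temp")[0]
print(city, temp.getAttribute("unit"))  # Torino C
```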
VoiceXML 2.1 – <foreach> Element
Attributes:
- array: an ECMAScript expression that must evaluate to an ECMAScript array
- item: the variable that stores the element being processed

<foreach> allows the application to iterate over an ECMAScript array and execute the content. It may appear:
- In executable content (all executable content elements may appear as content of <foreach>)
- In <prompt> (restrictions on the content apply)

<foreach> allows sophisticated concatenation of prompts.

http://www.w3.org/TR/voicexml21/
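The prompt concatenation that <foreach> enables can be sketched in Python; the flight array and the prompt template are invented for the example.

```python
def foreach_prompt(items, template):
    """Mimics <prompt><foreach item="f" array="flights">…</foreach></prompt>:
    one prompt fragment per array element, concatenated in order."""
    return " ".join(template.format(**item) for item in items)

# Invented data, as if fetched at runtime via <data>
flights = [
    {"time": "9:05", "dest": "Rome"},
    {"time": "11:30", "dest": "Madrid"},
]
prompt = foreach_prompt(flights, "Flight at {time} to {dest}.")
print(prompt)  # Flight at 9:05 to Rome. Flight at 11:30 to Madrid.
```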
VoiceXML – Final Remarks
The changed landscape for speech application development:
- Virtually all IVRs today support VoiceXML
- New options related to VoiceXML:
  - SIP-based VoiceXML platforms (Loquendo, Voxpilot, Voxeo, VoiceGenie)
  - Large-scale hosting of speech applications (TellMe, Voxeo)
  - Development tools (VoiceObjects, Audium, SpeechVillage, Syntellect, etc.)
- Further changes may come from CCXML adoption

… but:
- Mainly system-driven applications are actually deployed
- New challenges, incorporating more powerful dialog strategies and mixed initiative, are under discussion

http://www.w3.org/TR/voicexml20/
http://www.w3.org/TR/voicexml21/
VoiceXML Resources
Voice Browser Working Group (spec, FAQ, implementations, resources):
http://www.w3.org/Voice/

VoiceXML Forum site (resources, education, interest groups):
http://www.voicexml.org/

VoiceXML Forum Review (interesting articles related to VoiceXML and more; example code in the sections "First Words" and "Speak & Listen"):
http://www.voicexmlreview.org/

Ken Rehor's World of VoiceXML:
http://www.kenrehor.com/voicexml

Online documentation related to VoiceXML platforms:
Loquendo Café, Voxeo (http://www.vxml.org/), TellMe, VoiceGenie

Many books on VoiceXML:
- Jim Larson, "VoiceXML: Introduction to Developing Speech Applications", Prentice-Hall, 2002
- A. Hocek, D. Cuddihy, "Definitive VoiceXML", Prentice-Hall, 2002
Call Control: CCXML 1.0
CCXML 1.0 – Highlights
- Asynchronous event processing
- Acceptance or refusal of an incoming call
- Management of different types of call transfer
- Outbound call activation (interaction with an external entity)
- Use of ECMAScript, adding scripting capabilities to call control applications
- VoiceXML modularization
- Conferencing management
CCXML 1.0 – Elements Relationship

[Diagram showing the relationships among CCXML elements.]
CCXML 1.0 – Incoming Call
Event catching and processing: the CCXML interpreter receives a connection.alerting event, e.g.:

event$
  name: 'connection.alerting';
  connectionid: '0239023901903993';
  eventid: '00001'; …

and the CCXML document handles it with transitions:

<?xml version="1.0" encoding="UTF-8"?>
<ccxml version="1.0">
  […]
  <transition event="connection.alerting">
    […]
  </transition>
  <transition event="connection.disconnected">
    […]
  </transition>
</ccxml>

http://www.w3.org/TR/ccxml
CCXML 1.0 – connection.alerting Event
Basic telephony information is retrieved on the alerting event and is available in the CCXML document: local URI, remote URI, protocol used, redirection info, etc.

Based on the checked information, CCXML can accept or refuse the incoming call (<accept/> or <reject/>, after analyzing the event$ content), even before contacting the dialog server.

Any error that occurs during the phone call can be managed by the CCXML service (connection.failed, error.connection events).

[Diagram: the Call Control Adapter sends connection.alerting to the CCXML interpreter, which answers with <accept/> or <reject/> before the VoiceXML interpreter is involved.]

http://www.w3.org/TR/ccxml
CCXML 1.0 – How to activate a new dialog
CCXML actions:
- Receive the alerting event from the Call Control Adapter
- Ask the dialog server to prepare a new dialog
- Wait for the preparation
- If the dialog has been successfully prepared, accept the call
- Ask the dialog server to start the prepared dialog

[Sequence: alerting → prepare a new dialog → dialog prepared → call accepted → connected → start the prepared dialog → dialog started.]
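The handshake above can be sketched as a small state machine. Event and action names follow the slides; the sequential loop is an assumption for the sketch, since a real CCXML interpreter processes its event queue asynchronously.

```python
# Sketch of the CCXML-side handshake for an incoming call:
# alerting -> prepare dialog -> accept call -> start dialog.
def handle_call(events):
    state, actions = "idle", []
    for ev in events:
        if state == "idle" and ev == "connection.alerting":
            actions.append("dialogprepare")   # ask the dialog server
            state = "preparing"
        elif state == "preparing" and ev == "dialog.prepared":
            actions.append("accept")          # accept the incoming call
            state = "accepting"
        elif state == "accepting" and ev == "connection.connected":
            actions.append("dialogstart")     # start the prepared dialog
            state = "running"
    return state, actions

state, actions = handle_call(
    ["connection.alerting", "dialog.prepared", "connection.connected"])
print(state, actions)  # running ['dialogprepare', 'accept', 'dialogstart']
```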
Call Transfer

- CCXML supports call transfers of different modalities: "bridge", "blind", "consultation"
- Based on the features of each modality, the CCXML language allows the expected interaction with the Call Control Adapter to correctly perform the transfer
- During the different phases of a transfer, CCXML can receive any asynchronous event and manage it correctly, interrupting the call if requested

[Sequence between the CCXML interpreter, the Call Control Adapter and the VoiceXML interpreter: command1 → answer1 → […] → transfer complete.]
External Events
The CCXML Interpreter Context can receive events from any external entity able to use the HTTP protocol. Events generated in this way must be sent to a CCXML interpreter by an HTTP POST command. The event can be:
- Addressed to a new session, whose creation must be requested
- Addressed to an existing session, specifying its ID in the request

[Diagram: an external entity sends a basic HTTP event to the CCXML interpreter, which manages the event and returns the result.]

http://www.w3.org/TR/ccxml
External event on a new session: the Outbound Call
A particular request arrived to Call Control from an external entity;A particular CCXML service associated with the received event is started and a set of operations between Call Control Adapter, Call Control and Dialog Server is activated: the outbound call is so placed
[Sequence diagram: on an outbound call request, the CCXML Interpreter asks the Call Control Adapter to create a call (connection progressing, then connection connected), asks the VoiceXML Interpreter to prepare a dialog (prepared), and finally starts the prepared dialog.]
External event on a session: dialog termination request
An external entity performs an HTTP POST request to the CCXML Interpreter Context, specifying a session ID and requesting the termination of a particular dialog. The CCXML interpreter checks the session ID; if it is valid, it injects the received event into that session. The CCXML service has a transition on that event and performs the termination of the dialog identified by the given dialog identifier.
[Sequence diagram: the external entity posts the dialog termination request; the CCXML Interpreter issues dialogterminate(dialogid) to the VoiceXML Interpreter and receives dialog.exit; depending on the dialog.exit event management, it then disconnects the connection (disconnect(connId)) or prepares a new dialog (dialogprepare).]
Loading different CCXML documents: <fetch> and <goto> elements
The <fetch> and <goto> elements are used respectively to asynchronously fetch the content identified by the attributes of <fetch>, and to jump into a fetched document once it has been successfully loaded.
[Diagram: the CCXML Interpreter fetches the document "doc1.ccxml":
  <fetch next="'http://../Fetch/doc1.ccxml'" type="'application/ccxml+xml'" fetchid="result"/>
On fetch.done it can go into the new document; on error.fetch it continues to work in the current one. The first event occurring in a new document is ccxml.loaded.]
Benefits: modularization, source exemplification, more readability.
Simple CCXML Document
<?xml version="1.0" encoding="UTF-8"?>
<ccxml version="1.0" xmlns="http://www.w3.org/2002/09/ccxml">
  <var name="currentState"/>
  <var name="myDialogId"/>
  <var name="myConnId"/>
  <eventprocessor statevariable="currentState">
    <transition event="connection.alerting">
      <assign name="myConnId" expr="event$.connectionid"/>
      <accept connectionid="event$.connectionid"/>
    </transition>
    <transition event="connection.connected">
      <dialogstart src="'http://www.example.com/flight.vxml'"
                   connectionid="myConnId" dialogid="myDialogId"/>
    </transition>
    <transition event="dialog.started">
      <log expr="'VoiceXML appl is running now'"/>
    </transition>
    <transition event="connection.disconnected">
      <dialogterminate dialogid="myDialogId"/>
    </transition>
    <transition event="dialog.exit">
      <disconnect connectionid="myConnId"/>
    </transition>
    <transition event="*">
      <log expr="'Closing, unexpected:' + event$.name"/>
      <exit/>
    </transition>
  </eventprocessor>
</ccxml>
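Since the document is plain XML, its transition table can be inspected with any XML parser. A quick Python check (the document below is trimmed to its transition skeleton; the namespace and event names come from the example itself):

```python
import xml.etree.ElementTree as ET

CCXML_NS = "{http://www.w3.org/2002/09/ccxml}"

# Transition skeleton of the CCXML example above.
doc = """<ccxml version="1.0" xmlns="http://www.w3.org/2002/09/ccxml">
  <eventprocessor statevariable="currentState">
    <transition event="connection.alerting"/>
    <transition event="connection.connected"/>
    <transition event="dialog.started"/>
    <transition event="connection.disconnected"/>
    <transition event="dialog.exit"/>
    <transition event="*"/>
  </eventprocessor>
</ccxml>"""

root = ET.fromstring(doc)
events = [t.get("event") for t in root.iter(CCXML_NS + "transition")]
print(events)
```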
CCXML 1.0 – Next Steps
The CCXML specification is a Last Call Working Draft; all feature requests and clarifications have been addressed.
An Implementation Report test suite is under development.
It is very close to being published as a W3C Candidate Recommendation.
Member and external companies will be invited to send implementation reports on their CCXML platforms.
After that, the CCXML 1.0 specification will be able to become a Proposed Recommendation and then a W3C Recommendation.
Speech Interface Framework: Tour Complete!
Speech Interface Framework – End of 2009 (by Jim Larson)
[Figure: the User reaches the framework through the Telephone System; input is handled by ASR (Speech Recognition Grammar Spec., SRGS; Semantic Interpretation for Speech Recognition, SISR; N-gram Grammar ML) and a DTMF Tone Recognizer, then by Language Understanding (Natural Language Semantics ML, EMMA 1.0) and Context Interpretation; the Dialog Manager (VoiceXML 2.0, VoiceXML 2.1) drives Media Planning and Language Generation, with output through TTS (Speech Synthesis Markup Language, SSML) and a Pre-recorded Audio Player, both using the Pronunciation Lexicon Specification (PLS); Reusable Components and Call Control XML (CCXML) complete the framework, which connects to the World Wide Web.]
Architectural Changes
[Figure: VoiceXML architecture — the User talks to a VoiceXML Browser (ASR/DTMF, TTS/Audio) running on the VoiceXML platform; the browser fetches .vxml documents, .grxml/.gram grammars, .pls lexicons and .ssml / .wav / .mp3 prompts from the Web application over HTTP.]
VoxNauta – Internal Architecture
Loquendo MRCP Server/LSS 7.0 Architecture
[Figure: internal architecture — SIP (SDP) and RTSP (MRCPv1) front-ends feed MRCP v1/v2 parsers and a load balancer; an Audio Provider handles RTP; a TTS & ASR interface connects to the LTTS, LASR and LASR-SV engines and returns NLSML/EMMA results; configuration files, log files, SNMP management and a graphic management console complete the server, running on Win32/Linux.]
IETF MRCP Protocols
The Media Resource Control Protocols (MRCP) are IETF standards:
- MRCPv1 is RFC 4463, http://www.ietf.org/rfc/rfc4463.txt, based on RTSP/RTP
- MRCPv2 is an Internet Draft, http://tools.ietf.org/html/draft-ietf-speechsc-mrcpv2-17, based on SIP/RTP, offering new audio recording and Speaker Verification functionalities
An optimized client-server solution for the large-scale deployment of speech technologies in the telephony field (call centers, CRM, news and email reading, self-service applications, etc.); it allows a standard interface to speech technologies in all IVR platforms.
For more information read: Dave Burke, Speech Processing for IP Networks: Media Resource Control Protocol (MRCP), Wiley.
VoiceXML in a Call Center
[Figure: the Fixed/Mobile network reaches a PBX/ACD (with an optional Voice Gateway for non-SIP PBXs) and the VoxNauta IVR, which is connected to a CTI Server, a Data Server, a Web Server and human operators.]
VoiceXML in the IMS Architecture
[Figure: the Fixed/Mobile network reaches a Voice Gateway over TDM protocols; the gateway connects over the IP network (SIP protocols, RTP) to the VoxNauta MRF, which fetches VoiceXML on HTTPS from an Application Server.]
Overview
- A Bit of History
- W3C Speech Interaction Framework Today: ASR/DTMF, TTS, Lexicons, Voice Dialog and Call Control, Voice Platforms and Next Evolutions
- W3C Multimodal Interaction Today: MMI Architecture, EMMA and InkML, A Language for Emotions
- Next Future
Modes, Modalities and Technologies
- Speech
- Audio
- Stylus
- Touch
- Accelerometer
- Keyboard/keypad
- Mouse/touchpad
- Camera
- Geolocation
- Handwriting recognition
- Speaker verification
- Signature verification
- Fingerprint identification
- ...
Complement and Supplement
Speech: transient, linear, hands- and eyes-free, suffers noise
Visual: persistent, spatial, occupies the eyes, suffers light conditions
⇒ Enables the user to choose among different modalities or to mix them
⇒ Adaptable to different social and environmental conditions or to user preference
GUI + VUI → MUI (or MMUI)
[Figure: an Interaction Manager surrounded by many input channels — speech, text, mouse, handwriting, accelerometer, geolocation, fingerprint, drawing, video, photograph, vital signs, audio recording, speaker verification, face identification — grouped as user intent, sensor, recording and identification.]
MMI has an Intrinsic Complexity
Deborah Dahl, Voice Search 2009
MMI can Include Many Different Technologies
[Figure: an Interaction Manager connected to speech recognition, handwriting recognition, accelerometer, geolocation, touchscreen, keypad and fingerprint recognition components.]
Deborah Dahl, Voice Search 2009
Uniform Representation for MMI
Getting everything to work together is complicated. One simplification is to represent the same information from different modalities in the same format.
We need a common language for representing the same information from different modalities:
⇒ EMMA (Extensible MultiModal Annotation) 1.0, a uniform representation for multimodal information.
[Figure: the same components as before — speech recognition, handwriting recognition, accelerometer, geolocation, touchscreen, keypad, fingerprint recognition — each delivering its results to the Interaction Manager as EMMA documents.]
Deborah Dahl, Voice Search 2009
EMMA Structural Elements
Provide containers for application semantics and for multimodal annotation
<emma:emma ...>
  <emma:one-of>
    <emma:interpretation>
      ...
    </emma:interpretation>
    <emma:interpretation>
      ...
    </emma:interpretation>
  </emma:one-of>
</emma:emma>
EMMA elements: emma:emma, emma:interpretation, emma:one-of, emma:sequence, emma:group, emma:lattice
http://www.w3.org/TR/emma/
EMMA Annotations
Characteristics and processing of input, e.g.:
- emma:medium, emma:mode, emma:function: medium, mode and function of input
- emma:start, emma:end: timestamps (absolute/relative)
- emma:source: annotation of input source
- emma:confidence: confidence scores
- emma:media-type: media type
- emma:signal: reference to signal
- emma:lang: human language of input
- emma:tokens: tokens of input
- emma:process: reference to processing
- emma:uninterpreted: uninterpretable input
- emma:no-input: lack of input
- emma:hook: hook
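For instance, an interaction manager can use the `emma:confidence` annotation to rank the interpretations inside an `<emma:one-of>`. A hedged Python sketch (the destinations and confidence values are invented for the example):

```python
import xml.etree.ElementTree as ET

EMMA = "http://www.w3.org/2003/04/emma"
NS = {"emma": EMMA}

# Invented example: two competing interpretations with confidences.
doc = """<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:one-of id="r1">
    <emma:interpretation id="int1" emma:confidence="0.75">
      <destination>Boston</destination>
    </emma:interpretation>
    <emma:interpretation id="int2" emma:confidence="0.60">
      <destination>Austin</destination>
    </emma:interpretation>
  </emma:one-of>
</emma:emma>"""

root = ET.fromstring(doc)
interps = root.findall(".//emma:interpretation", NS)
# Pick the interpretation with the highest emma:confidence.
best = max(interps, key=lambda i: float(i.get("{%s}confidence" % EMMA)))
print(best.get("id"), best.find("destination").text)
```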
EMMA 1.0 – Example Travel Application
INPUT:"I want to go from Boston to Denver on March 11"
http://www.w3.org/TR/emma/ Deborah Dahl, Voice Search 2009
EMMA 1.0 – Same meaning
Speech:
<emma:interpretation medium="acoustic" mode="voice" id="int1">
  <origin>Boston</origin>
  <destination>Denver</destination>
  <date>11032009</date>
</emma:interpretation>

Mouse:
<emma:interpretation medium="tactile" mode="gui" id="int1">
  <origin>Boston</origin>
  <destination>Denver</destination>
  <date>11032009</date>
</emma:interpretation>
EMMA 1.0 – Handwriting Input
<emma:interpretation medium="tactile" mode="ink" id="int1">
  <origin>Boston</origin>
  <destination>Denver</destination>
  <date>11032009</date>
</emma:interpretation>
EMMA 1.0 – Biometrics Input
Photograph:
<emma:emma version="1.0">
  <emma:interpretation id="int1"
      emma:confidence=".75"
      emma:medium="visual" emma:mode="photograph"
      emma:verbal="false" emma:function="identification">
    <person>12345</person>
    <name>Mary Smith</name>
  </emma:interpretation>
</emma:emma>

Voice:
<emma:emma version="1.0">
  <emma:interpretation id="int1"
      emma:confidence=".80"
      emma:medium="acoustic" emma:mode="voice"
      emma:verbal="false" emma:function="identification">
    <person>12345</person>
    <name>Mary Smith</name>
  </emma:interpretation>
</emma:emma>
EMMA 1.0 – Representing Lattices
Speech recognizers, handwriting recognizers and other input-processing components may provide lattice output: a graph encoding a range of possible recognition results or interpretations.
[Figure: word lattice over nodes 1-8: flights → to → {boston | austin} → from → {portland | oakland} → {today → please | tomorrow}.]
http://www.w3.org/TR/emma/ From Michael Johnston, AT&T Research
Lattices can be represented using the EMMA elements:
<emma:lattice emma:initial="?" emma:final="?">
<emma:arc emma:from="?" emma:to="?">
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:interpretation>
    <emma:lattice emma:initial="1" emma:final="8">
      <emma:arc emma:from="1" emma:to="2">flights</emma:arc>
      <emma:arc emma:from="2" emma:to="3">to</emma:arc>
      <emma:arc emma:from="3" emma:to="4">boston</emma:arc>
      <emma:arc emma:from="3" emma:to="4">austin</emma:arc>
      <emma:arc emma:from="4" emma:to="5">from</emma:arc>
      <emma:arc emma:from="5" emma:to="6">portland</emma:arc>
      <emma:arc emma:from="5" emma:to="6">oakland</emma:arc>
      <emma:arc emma:from="6" emma:to="7">today</emma:arc>
      <emma:arc emma:from="7" emma:to="8">please</emma:arc>
      <emma:arc emma:from="6" emma:to="8">tomorrow</emma:arc>
    </emma:lattice>
  </emma:interpretation>
</emma:emma>
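The lattice is just a directed acyclic graph, so the distinct hypotheses it encodes can be enumerated with a simple depth-first walk. Illustrative Python over the example's arcs (this is not part of EMMA itself):

```python
# Arcs of the example lattice: (from_node, to_node, word).
arcs = [
    (1, 2, "flights"), (2, 3, "to"),
    (3, 4, "boston"), (3, 4, "austin"),
    (4, 5, "from"),
    (5, 6, "portland"), (5, 6, "oakland"),
    (6, 7, "today"), (7, 8, "please"),
    (6, 8, "tomorrow"),
]

def paths(node, final=8):
    """Enumerate every word sequence from `node` to the final state."""
    if node == final:
        return [[]]
    result = []
    for src, dst, word in arcs:
        if src == node:
            result += [[word] + rest for rest in paths(dst, final)]
    return result

hyps = [" ".join(p) for p in paths(1)]
print(len(hyps))   # 8 distinct hypotheses
```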
EMMA in the Multimodal Framework: http://www.w3.org/TR/mmi-framework
[Figure: the MMI framework diagram, with EMMA as the data flowing between its components.]
InkML 1.0 – Digital Ink
Ink Markup Language (InkML), http://www.w3.org/TR/InkML
A data format for representing digital ink (pen, stylus, etc.); allows the input and processing of handwriting, gestures, sketches, music, etc.
<ink>
  <trace>
    10 0, 9 14, 8 28, 7 42, 6 56, 6 70, 8 84, 8 98, 8 112, 9 126, 10 140,
    13 154, 14 168, 17 182, 18 188, 23 174, 30 160, 38 147, 49 135,
    58 124, 72 121, 77 135, 80 149, 82 163, 84 177, 87 191, 93 205
  </trace>
  <trace>
    130 155, 144 159, 158 160, 170 154, 179 143, 179 129, 166 125,
    152 128, 140 136, 131 149, 126 163, 124 177, 128 190, 137 200,
    150 208, 163 210, 178 208, 192 201, 205 192, 214 180
  </trace>
  <trace>
    227 50, 226 64, 225 78, 227 92, 228 106, 228 120, 229 134,
    230 148, 234 162, 235 176, 238 190, 241 204
  </trace>
  <trace>
    282 45, 281 59, 284 73, 285 87, 287 101, 288 115, 290 129,
    291 143, 294 157, 294 171, 294 185, 296 199, 300 213
  </trace>
  <trace>
    366 130, 359 143, 354 157, 349 171, 352 185, 359 197,
    371 204, 385 205, 398 202, 408 191, 413 177, 413 163,
    405 150, 392 143, 378 141, 365 150
  </trace>
</ink>
http://www.w3.org/TR/InkML/
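Each `<trace>` body is a comma-separated list of "x y" points, so basic geometry such as a per-stroke bounding box falls out of a few lines of parsing. A hedged Python sketch (the trace fragment is taken from the example above):

```python
def parse_trace(trace_text):
    """Parse an InkML <trace> body: 'x1 y1, x2 y2, ...' -> list of (x, y)."""
    points = []
    for pair in trace_text.split(","):
        x, y = pair.split()
        points.append((int(x), int(y)))
    return points

def bounding_box(points):
    """Return (min_x, min_y, max_x, max_y) of a stroke."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))

stroke = parse_trace("10 0, 9 14, 8 28, 7 42, 6 56, 6 70, 8 84")
print(bounding_box(stroke))   # (6, 0, 10, 84)
```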
InkML 1.0 – Status and Advances
Rich annotation for ink: traces, trace formats and trace collections; contextual information; canvases; etc.
The result of classifying InkML traces may be a semantic representation in EMMA 1.0.
Current status is Last Call Working Draft; next will be Candidate Recommendation, together with the release of an Implementation Report test suite. Raising interest from major industries.
MMI Architecture Specification
“Multimodal Architecture and Interfaces“, W3C Working Draft,http://www.w3.org/TR/mmi-arch/
The Runtime Framework provides the basic infrastructure and controls communication among the constituents. The Interaction Manager (IM) coordinates the Modality Components (MCs) through life-cycle events and contains the shared data (context). Communication between the IM and the MCs is event-based.
[Figure: the Runtime Framework contains the Interaction Manager, a Data Component and a Delivery Context Component, and talks to Modality Components 1..N through the Modality Component API.]
Ingmar Kliche, SpeechTEK 2008 – http://www.w3.org/TR/mmi-arch/
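A minimal sketch of the IM side of that contract — dispatching life-cycle events to registered modality components — might look like the following. The event names loosely follow the MMI life-cycle events; the class design is invented for illustration:

```python
class ModalityComponent:
    """Toy modality component that records the life-cycle events it receives."""
    def __init__(self, name):
        self.name = name
        self.received = []

    def handle(self, event):
        self.received.append(event)

class InteractionManager:
    """Toy IM: keeps a registry of components and broadcasts events to them."""
    def __init__(self):
        self.components = []

    def register(self, component):
        self.components.append(component)

    def broadcast(self, event):
        for c in self.components:
            c.handle(event)

im = InteractionManager()
voice = ModalityComponent("voice")
gui = ModalityComponent("gui")
im.register(voice)
im.register(gui)
# Event names loosely modeled on the MMI Architecture life-cycle events.
im.broadcast("PrepareRequest")
im.broadcast("StartRequest")
print(voice.received)
```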
MMI Arch – Laboratory Implementation
Implementation of components using W3C markup languages.
[Figure: the same Runtime Framework, with the Interaction Manager implemented in SCXML, one Modality Component in HTML (for the GUI) and one in VoiceXML (for the VUI), behind the Modality Component API.]
MMI Arch – Laboratory Implementation
SCXML-based Interaction Manager; VoiceXML + HTML modality components.
[Figure: on the server, an SCXML interpreter with an HTTP I/O Processor acts as Interaction Manager; a client-side HTML browser is the GUI modality component (Modality Component API: HTTP + XML, using AJAX), and a server-side CCXML/VoiceXML browser with a telephony interface to the phone client is the voice modality component (Modality Component API: HTTP + XML, EMMA).]
MMI Architecture – Open Issues
- Profiles
- Start-up, registration and delegation in a distributed environment
- Transport of events
- Extensibility of events
Emotion in Wikipedia
From Wikipedia definition:
“An emotion is a mental and physiological state associated with a wide variety of feelings, thoughts, and behaviours. It is a prime determinant of the sense of subjective well-being and appears to play a central role in many human activities. As a result of this generality, the subject has been explored in many, if not all of the human sciences and art forms. There is much controversy concerning how emotions are defined and classified.”
General goal: make interaction between humans and machines more natural for the humans.
Machines should become able:
- to register human emotions (and related states)
- to convey emotions (and related states)
- to “understand” the emotional relevance of events
Emotional States are Numerous
[Figure: circumplex of emotional states (Scherer et al., Univ. Geneva), arranged along the axes Active/Passive, Positive/Negative, Hi/Lo Power-Control and Conducive/Obstructive: major labels range from EXCITED, AROUSED, ASTONISHED, DELIGHTED, HAPPY, PLEASED, GLAD, SERENE, CONTENT, AT EASE, SATISFIED, RELAXED, CALM and SLEEPY to TENSE, ALARMED, ANGRY, AFRAID, ANNOYED, DISTRESSED, FRUSTRATED, MISERABLE, SAD, GLOOMY, DEPRESSED, TIRED, BORED and DROOPY, with dozens of finer-grained states (adventurous, triumphant, enthusiastic, passionate, hostile, envious, hateful, suspicious, bored, anxious, lonely, melancholic, peaceful, friendly, etc.) in between.]
HUMAINE Project
European Network of Excellence, active 01/2004 - 12/2007, with 33 partner institutions from many disciplines.
Today: HUMAINE Association (since June 2007), 125 members.
Web site: http://emotion-research.net
Online Speaker Classification
Classification Techniques
- Principal Component Analysis (PCA) or Linear Discriminant Analysis (LDA): preprocessing step to reduce feature-vector dimension
- K-nearest Neighbor
- Gaussian Mixture Models (GMMs): model training data as Gaussian densities
- Artificial Neural Networks (ANN), e.g. MLP: interesting training algorithms
- Support Vector Machines (SVM): use “kernel functions” to separate non-linear decision boundaries
- Classification and Regression Trees (CART)
- Hidden Markov Models (HMMs): used to model temporal structure
Felix Burkhardt, Colloquium Hochschule Zittau/Görlitz, 4.8.2008
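Of the techniques listed, k-nearest neighbor is simple enough to sketch in full. A toy Python illustration with made-up 2-D feature vectors and labels (real speaker-classification systems use far richer acoustic features):

```python
from collections import Counter
from math import dist

# Toy training data: (feature_vector, label). Features and labels
# are invented for illustration only.
training = [
    ((0.9, 0.8), "angry"), ((0.8, 0.9), "angry"),
    ((0.1, 0.2), "neutral"), ((0.2, 0.1), "neutral"),
    ((0.5, 0.9), "happy"),
]

def knn_classify(sample, k=3):
    """Label a sample by majority vote among its k nearest neighbors."""
    neighbors = sorted(training, key=lambda item: dist(item[0], sample))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

print(knn_classify((0.85, 0.85)))   # 'angry'
```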
Expressive TTS – Two Approaches
1. Different speech databases, one for each expressive style: an effective solution, feasible only for a very limited range of emotions. [Diagram: text + expressive tags → selection among style 1..n databases → waveform.]
2. Speech-signal manipulation according to style-dependent prosodic models: a flexible solution, but it requires accurate models and effective signal-processing capabilities. [Diagram: text + expressive tags → selection from a neutral-style database → signal processing driven by a prosodic model → waveform.]
From Enrico Zovato, Loquendo
Expressive TTS – Example Prosodic Patterns
Synthesis of two basic emotional styles (POS "happy", NEG "sad") through prosodic modification:
- different intonation contours
- different acoustic-unit durations
[Figure: spectrograms (frequency 0-500 Hz vs. time 0-1.8 s) of POS and NEG renderings, with audio examples for Male-UK and Female-UK voices.]
From Enrico Zovato, Loquendo
Emotions in ECAs
From Piero Cosi, CNR, Padova
W3C Emotion Incubator
“The W3C Incubator Activity fosters rapid development, on a time scale of a year or less, of new Web-related concepts. Target concepts include innovative ideas for specifications, guidelines, and applications that are not (or not yet) clear candidates as Web standards developed through the more thorough process afforded by the W3C Recommendation Track.”
W3C Emotion Incubator aims:
First Charter XG (2006-2007):
“...to investigate the prospects of defining a general-purpose Emotion annotation and representation language...”
“...which should be usable in a large variety of technological contexts where emotions need to be represented.”
Second Charter XG (Nov. 2007 – Nov. 2008):
- Prioritize the requirements
- Release a first specification draft
- Illustrate how to combine the Emotion Markup Language with existing markup languages
W3C Emotion Incubator – Members
W3C Members: DFKI, Loquendo, Deutsche Telekom, SRI International, NTUA, Fraunhofer, Chinese Acad. Science
Invited Experts: Emotion AI, Univ. Paris 8, Univ. Basque Country, Univ. C. Cork, OFAI (Austria), IPCA (Portugal), Tech. Univ. Munich
Chairman: Marc Schröder, DFKI
Web space: http://www.w3.org/2005/Incubator/emotion
Results:
- Use case description document
- Requirements document
- Final Report (20 Nov 2008): Elements of an EmotionML 1.0, http://www.w3.org/2005/Incubator/emotion/XGR-emotionml/
W3C Emotion Incubator – EmotionML 1.0
Document structure: container element (<emotionml>), single emotion annotation (<emotion>)
Representation of emotions: <category>, <dimensions>, <appraisals>, <action-tendency>, <intensity> elements
Meta information: confidence attribute, <modality> element, <metadata> element
Links and time: <link> element, <timing> element
Scale values: value attribute, <traces> element
EmotionML 1.0 – Examples
Expression of emotions in SSML 1.1:
<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:emo="http://www.w3.org/2008/11/emotionml"
       xml:lang="en-US">
  <s>
    <emo:emotion>
      <emo:category set="everydayEmotions" name="doubt"/>
      <emo:intensity value="0.4"/>
    </emo:emotion>
    Do you need help?
  </s>
</speak>
Detection of emotions in EMMA 1.0:
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma"
           xmlns="http://www.w3.org/2008/11/emotionml">
  <emma:interpretation start="12457990" end="12457995"
                       mode="voice" verbal="false">
    <emotion>
      <intensity value="0.1" confidence="0.8"/>
      <category set="everydayEmotions" name="boredom" confidence="0.1"/>
    </emotion>
  </emma:interpretation>
</emma:emma>
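Extracting the detected state from such an annotation is straightforward with an XML parser. A hedged Python sketch against the detection example above:

```python
import xml.etree.ElementTree as ET

EMO = "http://www.w3.org/2008/11/emotionml"

# The EMMA + EmotionML detection example from the slide.
doc = """<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma"
           xmlns="http://www.w3.org/2008/11/emotionml">
  <emma:interpretation start="12457990" end="12457995"
                       mode="voice" verbal="false">
    <emotion>
      <intensity value="0.1" confidence="0.8"/>
      <category set="everydayEmotions" name="boredom" confidence="0.1"/>
    </emotion>
  </emma:interpretation>
</emma:emma>"""

root = ET.fromstring(doc)
# <category> lives in the default (EmotionML) namespace.
category = root.find(".//{%s}category" % EMO)
print(category.get("name"), category.get("confidence"))
```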
Overview
- A Bit of History
- W3C Speech Interaction Framework Today: ASR/DTMF, TTS, Lexicons, Voice Dialog and Call Control, Voice Platforms and Next Evolutions
- W3C Multimodal Interaction Today: MMI Architecture, EMMA and InkML, A Language for Emotions
- Next Future
W3C VBWG/MMIWG – Next Future
Specifications for the next generation of voice browsing:
SCXML 1.0
VoiceXML 3.0
State Charts - SCXML
State Chart XML (SCXML): http://www.w3.org/TR/scxml/
A powerful state-machine language:
- Based on David Harel's statecharts
- Adopted in UML
- Standard under development by the W3C VBWG
States, transitions, events:
- The data model extends the basic finite state automaton
- Conditions on transitions
Nested states:
- Represent task decomposition
- The machine can be in multiple dependent states at the same time
Parallel states:
- Represent fork/join logic
Wide interest: VBWG, MMI WG, other W3C groups, universities, industries; open-source implementations are already available.
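The core state-machine semantics can be illustrated in a few lines of Python. This is a toy flat interpreter, not SCXML: nested and parallel states, guard conditions and the data model are omitted, and the event names are borrowed from the CCXML examples earlier in the talk:

```python
# Toy flat state machine in the spirit of SCXML: states, events and a
# transition table. Nesting, parallelism, guards and datamodel omitted.
transitions = {
    ("idle",      "connection.alerting"):  "answering",
    ("answering", "connection.connected"): "in_dialog",
    ("in_dialog", "dialog.exit"):          "idle",
}

def run(events, state="idle"):
    """Feed events through the machine; unknown events are ignored."""
    for event in events:
        state = transitions.get((state, event), state)
    return state

print(run(["connection.alerting", "connection.connected", "dialog.exit"]))
```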
SCXML 1.0 – Parallel State Charts
SCXML as MMI Interaction Manager
[Figure: an SCXML Interaction Manager coordinating voice, visual and gesture modality components.]
SCXML for VoiceXML 3.0
[Figure: the same arrangement, with the SCXML Interaction Manager driving the voice modality (VoiceXML 3.0) alongside the visual and gesture modalities.]
SCXML 1.0 – Open Issues
- Data model: ECMAScript (ECMA-262) or other formats?
- Definition of profiles
- Other
Re-Thinking VoiceXML – VoiceXML 3.0
- Well-founded: from a syntactic description to a semantic model
- Extensible: SIV, EMMA support, rich media, VCR control, etc.
- Profiled: light profile (mobile?), media profile (scalability), VoiceXML 2.1 profile (interoperability), etc.
- Flexible: customization of the FIA (Form Interpretation Algorithm)
VoiceXML 3.0 – Separation of Concerns
- SCXML 1.0: application and interaction logic
- VoiceXML 3.0: voice interaction only, under the control of SCXML
VoiceXML 3.0 has been published as a First Working Draft, http://www.w3.org/TR/2008/WD-voicexml30-20081219/; send your public comments.
THANK YOU
for clarifications or questions:
paolo.baggia@loquendo.com
For more information:
Keep an eye on: www.loquendo.com
Contact: paolo.baggia@loquendo.com

THANK YOU

Loquendo S.p.A.
745 Fifth Ave, 27th Floor, New York, NY 10151, USA
Tel. +1 212.310.9075, Fax +1 212.310.9001

Loquendo S.p.A.
Via Olivetti, 6, 10148 Torino, Italy
Tel. +39 011 291 3111, Fax +39 011 291 3199
www.loquendo.com
Keep in touch with Loquendo news, subscribe to the Loquendo Newsletter
Try our interactive TTS demo: insert your text, choose a language, and listen
The latest News at a click
Consult the Loquendo Newsletter online
Keep up to date on events and initiatives
For further information, fill in our Contacts Form