Speech Synthesis: Overview
11-752: Class Overview
Overview
◆ Speech: Phonetics, Prosody, Perception and Synthesis
◆ Sub-topics:
● Text analysis and normalization (Alan)
● Lexicons and pronunciation modeling
● Spectrogram reading (Maxine) Feb/Mar
● Crowdsourcing in speech (plus project) (Maxine) Feb/Mar
● Prosody Generation (Alan)
● Waveform Generation (Alan)
● Multilinguality and Limited Resources (Alan)
● Evaluation (Alan)
● Voice/Language/Style/Emotion conversion (Alan)
● Project (Alan) May
Coursework
◆ Practical course: i.e. you do things
◆ Weekly courseworks:
● Short exercises related to the current topic
● Due Mondays at noon
◆ Crowdsourcing Project (March)
◆ Final Project (May)
◆ Grading
● Best 5 weekly courseworks (50%)
● Crowdsourcing Project (20%)
● Final Project (30%)
Examples
◆ Example weekly coursework
● From given data build a duration model
● Design a diphone prompt list for some language
◆ Crowdsourcing project
● Challenge defined by Maxine
◆ Final project
● Something novel in synthesis
● Evaluation technique, machine learning, new language, singing, ...
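To give a feel for the first example coursework (building a duration model from given data), here is a minimal sketch. The toy data and the phone-mean-with-fallback design are invented for illustration; a real coursework model would use richer context features than phone identity alone.

```python
from collections import defaultdict

def train_duration_model(labelled_phones):
    """labelled_phones: list of (phone, duration-in-seconds) pairs."""
    sums, counts = defaultdict(float), defaultdict(int)
    for phone, dur in labelled_phones:
        sums[phone] += dur
        counts[phone] += 1
    global_mean = sum(sums.values()) / sum(counts.values())
    means = {p: sums[p] / counts[p] for p in sums}
    # Predict the per-phone mean; fall back to the global mean
    # for phones unseen in training.
    return lambda phone: means.get(phone, global_mean)

# Invented toy data: two examples each of /aa/ and /t/.
data = [("aa", 0.12), ("aa", 0.10), ("t", 0.05), ("t", 0.07)]
predict = train_duration_model(data)
print(round(predict("aa"), 3))  # 0.11
```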
First task
◆ Technically not a weekly coursework
● But necessary for all other coursework
◆ Install the software and test it
● Install SPTK and the Edinburgh Speech Tools
● Install Festival and the FestVox voice building tools
◆ Any platform: Windows, OSX or Linux/Unix, but
● You'll want to hear the result (i.e. working audio)
● So use a laptop/desktop rather than a server
◆ See instructions on the course website (linked from awb's home page)
◆ Ask if you get stuck
◆ And, test it by ...
Build a Talking Clock
◆ Build a talking synthesizer clock from your own voice
◆ Record 24 standard prompts
● "The time is now, about five past one, in the morning"
● "The time is now, just after ten past two, in the morning"
● ...
◆ Use the FestVox limited domain tools to build a talking clock
◆ It will work with your accent and sound just like you
◆ You may have to use Audacity to record your prompts
● Make sure you generate 16 kHz, mono, .wav (RIFF) format
● And name the files properly (time_0001.wav, time_0002.wav, ...)
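A quick way to sanity-check the recording format is Python's standard-library wave module. The sketch below writes half a second of silence in the required format (the filename just follows the naming scheme above) and then verifies it; it is a format check only, not part of the FestVox tools.

```python
import struct
import wave

def check_prompt(path):
    """True if the file is a 16 kHz, mono, 16-bit RIFF .wav."""
    with wave.open(path, "rb") as w:
        return (w.getframerate() == 16000
                and w.getnchannels() == 1
                and w.getsampwidth() == 2)

# Write 0.5 s of silence (8000 16-bit samples) in the required format.
with wave.open("time_0001.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(struct.pack("<8000h", *([0] * 8000)))

print(check_prompt("time_0001.wav"))  # True
```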
◆ Due Monday 20th Jan at noon by email to [email protected]
◆ See detailed instructions on the website
◆ Ask if you get stuck
Physical Models
• Blowing air through tubes...
– von Kempelen's synthesizer, 1791
• Synthesis by physical models
– Homer Dudley's Voder, 1939
More Computation – More Data
◆ Formant synthesis (60s-80s)
● Waveform construction from components
◆ Diphone synthesis (80s-90s)
● Waveform by concatenation of a small number of instances of speech
◆ Unit selection (90s-00s)
● Waveform by concatenation of a very large number of instances of speech
◆ Statistical Parametric Synthesis (00s-..)
● Waveform construction from parametric models
Waveform Generation
- Formant synthesis
- Random word/phrase concatenation
- Phone concatenation
- Diphone concatenation
- Sub-word unit selection
- Cluster-based unit selection
- Statistical Parametric Synthesis
Building a Research Field
◆ Tools
● Allow others to easily join the field
◆ Common Data Sets
● Be able to concentrate on techniques
● Have common comparisons
◆ Evaluation
● Realistically compare techniques
◆ Have Users
● Someone has to care about your results
◆ Don't become stifled
● Ensure there are new tasks and directions
Festival Speech Synthesis System
http://festvox.org/festival
- General system for multi-lingual TTS
- C/C++ code with Scheme scripting language
- General replaceable modules: lexicons, LTS, duration, intonation, phrasing, POS tagging, tokenizing, diphone/unit selection
- General tools: intonation analysis (F0, Tilt), signal processing, CART building, n-grams, SCFG, WFST, OLS
- No fixed theories
- New languages without new C++ code
- Multiplatform (Unix, Windows, OSX)
- Full sources in distribution
- Free software
CMU FestVox Project
http://festvox.org
"I want it to speak like me!"
- Festival is an engine; how do you make voices?
- Building Synthetic Voices
  - Tools, scripts, documentation
  - Discussion and examples for building voices
  - Example voice databases
  - Step-by-step walkthroughs of processes
- Support for English and other languages
- Support for different waveform techniques:
  - diphone, unit selection, limited domain, HMM
- Other support: lexicon, prosody, text analysers
The CMU Flite project
http://cmuflite.org
"But I want it to run on my phone!"
- FLITE: a fast, small, portable run-time synthesizer
- C based (no loaded files)
- Basic FestVox voices compiled into C/data
- Thread safe
- Suitable for embedded devices:
  - iPaq, Linux, WinCE, PalmOS, Symbian
- Scalable:
  - quality/size/speed trade-offs
  - frequency-based lexicon pruning
- Sizes:
  - 2.4 MB footprint (code + data + runtime RAM)
  - < 0.025 secs "time-to-speak"
Common Data Sets
◆ Data-driven techniques need data
◆ Diphone Databases
● CSTR and CMU US English diphone sets (kal and ked)
◆ CMU ARCTIC Databases
● 1200 phonetically balanced utterances (about 1 hour)
● 7 different speakers (2 male, 2 female, 3 accented)
● EGG, phonetically labeled
● Utterances chosen from out-of-copyright text
● Easy to say
● Freely distributable
● Tools to build your own in your own language
Blizzard Challenge
◆ Realistic evaluation
● Under the same conditions
◆ Blizzard Challenge [Black and Tokuda]
● Participants build a voice from a common dataset
● Synthesize test sentences
● Large set of listening experiments
● Since 2005, now in its 9th year
● 15-20 groups (academia, research labs and commercial companies)
How to test synthesis
◆ Blizzard tests:
● Do you like it? (MOS scores)
● Can you understand it?
→ SUS sentence
→ "The unsure steaks overcame the zippy rudder"
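As a sketch of how such listening tests are scored: MOS is just the mean of 1-5 listener ratings, and SUS intelligibility is scored from listener transcriptions of the reference sentence. The ratings below are invented, and the bag-of-words accuracy is a simplification of the word-alignment scoring real evaluations use.

```python
def mos(ratings):
    """Mean opinion score over 1-5 listener ratings."""
    return sum(ratings) / len(ratings)

def word_accuracy(reference, transcript):
    """Fraction of reference words the listener wrote down
    (bag-of-words; real scoring aligns the word sequences)."""
    hyp = set(transcript.lower().split())
    ref = reference.lower().split()
    return sum(w in hyp for w in ref) / len(ref)

print(mos([4, 5, 3, 4]))  # 4.0
print(round(word_accuracy(
    "the unsure steaks overcame the zippy rudder",
    "the unsure stakes overcame a zippy rudder"), 3))  # 0.857
```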
◆ Can't this be done automatically?
● Not yet (at least not reliably enough)
● But we now have lots of data for training techniques
◆ Why does it still sound like a robot?
● Need better (appropriate) testing
Speech Synthesis Techniques
◆ Unit selection
◆ Statistical parametric synthesis
◆ Automated voice building
● Database design
● Language portability
◆ Voice conversion
Unit Selection
• Target cost and join cost [Hunt and Black 96]
– Target cost is the distance from the desired unit to an actual unit in the database
• Based on phonetic, prosodic, metrical context
– Join cost is how well the selected units join
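A toy version of that search: dynamic programming over candidate units per target position, minimizing summed target and join costs. The candidate names and cost numbers below are invented; real systems compute costs from phonetic and prosodic feature distances.

```python
def select_units(candidates, target_cost, join_cost):
    """candidates[t] lists candidate units for target position t.
    Returns (total cost, chosen unit sequence)."""
    # best[u] = (cost, path) over paths ending in unit u
    best = {u: (target_cost(0, u), [u]) for u in candidates[0]}
    for t in range(1, len(candidates)):
        new_best = {}
        for u in candidates[t]:
            # Cheapest way to reach u from any previous unit.
            cost, path = min(
                ((pc + join_cost(p, u), pp) for p, (pc, pp) in best.items()),
                key=lambda x: x[0])
            new_best[u] = (cost + target_cost(t, u), path + [u])
        best = new_best
    return min(best.values(), key=lambda x: x[0])

# Two target positions, two candidates each; only the a2->b1 join is free.
cands = [["a1", "a2"], ["b1", "b2"]]
tc = lambda t, u: {"a1": 1, "a2": 0, "b1": 0, "b2": 1}[u]
jc = lambda prev, u: 0 if (prev, u) == ("a2", "b1") else 2
cost, path = select_units(cands, tc, jc)
print(cost, path)  # 0 ['a2', 'b1']
```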
Clustering Units
• Cluster units [Donovan et al 96, Black et al 97]
Unit Selection Issues
• Cost metrics
– Finding best weights, best techniques, etc.
• Database design
– Best database coverage
• Automatic labeling accuracy
– Finding errors/confidence
• Limited domain:
– Target the database to a particular application
– Talking clocks
– Targeted domain synthesis
Unit Selection vs Parametric
Unit Selection:
- The "standard" method
- "Select appropriate sub-word units from large databases of natural speech"
Parametric Synthesis [NITECH: Tokuda et al]:
- HMM-generation based synthesis
- Cluster units to form models
- Generate from the models
- "Take 'average' of units"
Old vs New
Unit Selection:
- large, carefully labelled database
- quality good when good examples available
- quality will sometimes be bad
- no control of prosody
Parametric Synthesis:
- smaller, less carefully labelled database
- quality consistent
- resynthesis requires a vocoder (buzzy)
- can (must) control prosody
- model size much smaller than unit DB
Parametric Synthesis
• Probabilistic models
• Simplification
• Generative model
– Predict acoustic frames from text
SPSS
◆ ASR vs SPSS
● Similar techniques, but not the same
◆ Model training techniques
● Alignment and cluster features
● MLLR (adaptation from multi-speaker models)
◆ Model improvement techniques
● Minimum generation error
● Label optimization
◆ Parameterization techniques
● MFCC, LSP, STRAIGHT, HSM
● Excitation modeling techniques
SPSS Goals
◆ Require an optimal parameterization that
● Is derivable from speech
● Can generate high quality speech
● Is predictable from text
◆ Candidates
● Spectral, F0, excitation
● Formants, nasality, aspiration
● Articulatory features
SPSS Systems
◆ HTS (NITECH)
● Based on HTK
● Predicts HMM states
● (Default) uses MCEP and the MLSA filter
● Supported in Festival
◆ Clustergen (CMU)
● No use of HTK
● Predicts frames
● (Default) uses MCEP and the MLSA filter
● More tightly coupled with Festival
Building Synthetic Voices
The "standard" voice requires ...
- A phone set
- Pronunciations:
  - Lexicon / letter-to-sound rules
- A phonetically and prosodically balanced corpus
  - Spoken by a good speaker
- Text analysis:
  - Number, symbol expansion, etc.
- Prosodic modeling:
  - Phrasing, intonation, duration, etc.
- Waveform generation:
  - Diphones, unit selection, parametric synthesis
- Something else that is hard:
  - No vowels (Arabic), no word segmentation, number declensions
Designing a good corpus
- From a large set of text
- Select "nice" utterances
  - 5 to 15 words, easy to say
  - All words in lexicon, no homographs
- Convert text to phoneme strings
  - Possibly with lexical stress, onset/coda, tone, etc.
- Select utterances that maximize di/triphone coverage
- Looking for around 1000 utterances
- Can seed initial data with "domain" data
- CMU ARCTIC databases
  - 7 x single-speaker English DBs
  - 1200 phonetically balanced utterances
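The coverage-maximizing selection step above is typically done as a greedy set cover: repeatedly take the utterance that adds the most not-yet-covered diphones. A minimal sketch, with toy phone strings invented for illustration:

```python
def diphones(phones):
    """Set of adjacent phone pairs in a phone list."""
    return {tuple(phones[i:i + 2]) for i in range(len(phones) - 1)}

def greedy_select(utterances):
    """utterances: list of (text, phone-list) pairs.
    Returns (selected texts, covered diphone set)."""
    covered, selected = set(), []
    remaining = list(utterances)
    while remaining:
        # Utterance adding the most diphones not yet covered.
        best = max(remaining, key=lambda u: len(diphones(u[1]) - covered))
        gain = diphones(best[1]) - covered
        if not gain:
            break  # no remaining utterance adds anything new
        covered |= gain
        selected.append(best[0])
        remaining.remove(best)
    return selected, covered

utts = [("u1", ["a", "b", "c"]), ("u2", ["a", "b"]), ("u3", ["c", "d"])]
sel, cov = greedy_select(utts)
print(sel)  # ['u1', 'u3']
```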
Hard Synthesis Problems
◆ Text normalization
◆ Intonation modeling
● Intonation evaluation
◆ Style modeling
● Choosing the right style
● Evaluating the result
Text Normalization
◆ Finding the words
● Tokenizing, homograph disambiguation, etc.
● "$1.25" vs "$1.25 million" vs "$1.25 song"
◆ Very large number of rare events
◆ Formalized systems exist
● Trained from data, optimized and out-of-date
◆ Long term, updated hacks rule systems
◆ ML Challenge
● Such a problem cannot be done by machine learning
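A toy rule for the "$1.25" ambiguity above, showing how the following word flips the reading. Digits are left unexpanded and the magnitude list is tiny; both are simplifications for illustration, not a real normalizer.

```python
import re

MAGNITUDES = {"thousand", "million", "billion"}

def expand_money(text):
    """Expand $D.CC, choosing the reading by the following word."""
    def repl(m):
        dollars, cents, follower = m.groups()
        if follower and follower.lower() in MAGNITUDES:
            # "$1.25 million" reads as a decimal number of dollars.
            return f"{dollars} point {cents} {follower} dollars"
        tail = f" {follower}" if follower else ""
        return f"{dollars} dollars {cents} cents{tail}"
    return re.sub(r"\$(\d+)\.(\d+)(?:\s+(\w+))?", repl, text)

print(expand_money("$1.25"))          # 1 dollars 25 cents
print(expand_money("$1.25 million"))  # 1 point 25 million dollars
```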
Intonation Modeling
◆ Accents, phrases and F0
● Lots of statistical models available
● Lots of "objective" measures:
→ RMSE, correlation
● No good subjective measures
◆ Listening tests
● Natural intonation: good
● Naïve intonation: bad
● Various cute models for intonation: meh
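The two "objective" measures named above are straightforward to compute over paired F0 contours; the contour values below are invented for illustration.

```python
import math

def f0_rmse(ref, pred):
    """Root mean squared error between two equal-length F0 contours."""
    return math.sqrt(sum((r - p) ** 2 for r, p in zip(ref, pred)) / len(ref))

def f0_correlation(ref, pred):
    """Pearson correlation between two equal-length F0 contours."""
    n = len(ref)
    mr, mp = sum(ref) / n, sum(pred) / n
    cov = sum((r - mr) * (p - mp) for r, p in zip(ref, pred))
    sr = math.sqrt(sum((r - mr) ** 2 for r in ref))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    return cov / (sr * sp)

ref = [120.0, 130.0, 125.0, 110.0]   # natural F0 values (Hz)
pred = [118.0, 133.0, 123.0, 112.0]  # model-generated F0 values
print(round(f0_rmse(ref, pred), 2))         # 2.29
print(round(f0_correlation(ref, pred), 3))  # 0.955
```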
Improving Understanding
◆ Take reading comprehension stories
● For children's reading tests, or TOEFL
◆ Synthesize with:
● Natural intonation
● Naïve models
● Various cute models
◆ Human listening tests
● Answer questions about stories
● Best system: naïve models
Style Modeling
◆ Classic emotion modeling
● Happy, sad, angry and neutral
● But no one needs that
◆ Style modeling
● Polite, command, empathic
◆ Style usage
● When can it be used?
● How much should be used?
Dialog with Style
◆ Record human-human dialog
● Label dialog states:
→ Implicit confirmation, corrections, discourse markers
◆ Build a dialog-state sensitive voice
● Using dialog state in features
◆ Must be closely integrated into the SDS
● Timing, dialog-state appropriate
◆ But how do you test it?
Voice Transformation
- Collect a small amount of data
  - 50 utterances
- Adapt an existing voice to the target voice
- Adaptation: what makes a voice?
  - Lexical choice
  - Phonetic variation
  - Prosody
  - Spectral / vocal tract / articulatory movement
  - Excitation mode
- Use articulatory modeling for transformation (Toth)
Voice Transformation
- Festvox GMM transformation suite (Toda)
- [Slide played audio examples: conversions among the ARCTIC speakers awb, bdl, jmk and slt]
Applications
◆ Speech output is only one component
◆ Need to integrate with larger applications
● Spoken dialog systems
● Speech-to-speech translation systems
● Talking heads
● Conversational participants
● Information delivery
Conclusions
• Synthesis has improved
– But there is still much to do
– Isolated sentences are clear ...
– ... but conversational speech is still in the future
• Speech systems must adapt
– To their usage
– And their funding conditions
• But we can always fall back on our talents