Speech Synthesis: Overview
11-752: Class Overview
Overview
◆ Speech: Phonetics, Prosody, Perception and Synthesis
◆ Sub-topics:
● Text analysis and normalization (Alan)
● Lexicons and pronunciation modeling
● Spectrogram reading (Maxine) Feb/Mar
● Crowdsourcing in speech (plus project) (Maxine) Feb/Mar
● Prosody Generation (Alan)
● Waveform Generation (Alan)
● Multilinguality and Limited Resources (Alan)
● Evaluation (Alan)
● Voice/Language/Style/Emotion conversion (Alan)
● Project (Alan) May
Coursework
◆ Practical course: i.e. you do things
◆ Weekly courseworks:
● Short exercises related to the current topic
● Due Mondays at noon
◆ Crowdsourcing Project (March)
◆ Final Project (May)
◆ Grading
● Best 5 weekly courseworks (50%)
● Crowdsourcing Project (20%)
● Final Project (30%)
Examples
◆ Example weekly coursework
● From given data build a duration model
● Design a diphone prompt list for some language
◆ Crowdsourcing project
● Challenge defined by Maxine
◆ Final project
● Something novel in synthesis
● Evaluation technique, machine learning, new language, singing, ...
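To give a feel for the first example coursework (building a duration model from given data), here is a minimal sketch. The toy data and the phone-mean-with-fallback design are invented for illustration; a real coursework model would use richer context features than phone identity alone.

```python
from collections import defaultdict

def train_duration_model(labelled_phones):
    """labelled_phones: list of (phone, duration-in-seconds) pairs."""
    sums, counts = defaultdict(float), defaultdict(int)
    for phone, dur in labelled_phones:
        sums[phone] += dur
        counts[phone] += 1
    global_mean = sum(sums.values()) / sum(counts.values())
    means = {p: sums[p] / counts[p] for p in sums}
    # Predict the per-phone mean; fall back to the global mean
    # for phones unseen in training.
    return lambda phone: means.get(phone, global_mean)

# Invented toy data: two examples each of /aa/ and /t/.
data = [("aa", 0.12), ("aa", 0.10), ("t", 0.05), ("t", 0.07)]
predict = train_duration_model(data)
print(round(predict("aa"), 3))  # 0.11
```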
First task
◆ Technically not a weekly coursework
● But necessary for all other coursework
◆ Install the software and test it
● Install SPTK and the Edinburgh Speech Tools
● Install Festival and the FestVox voice building tools
◆ Any platform: Windows, OSX or Linux/Unix, but
● You'll want to hear the result (i.e. working audio)
● So use a laptop/desktop rather than a server
◆ See instructions on the course website (linked from awb's home page)
◆ Ask if you get stuck
◆ And, test it by ...
Build a Talking Clock
◆ Build a talking synthesizer clock from your own voice
◆ Record 24 standard prompts
● "The time is now, about five past one, in the morning"
● "The time is now, just after ten past two, in the morning"
● ...
◆ Use the FestVox limited domain tools to build a talking clock
◆ It will work with your accent and sound just like you
◆ You may have to use Audacity to record your prompts
● Make sure you generate 16 kHz, mono, .wav (RIFF) format
● And name the files properly (time_0001.wav, time_0002.wav, ...)
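A quick way to sanity-check the recording format is Python's standard-library wave module. The sketch below writes half a second of silence in the required format (the filename just follows the naming scheme above) and then verifies it; it is a format check only, not part of the FestVox tools.

```python
import struct
import wave

def check_prompt(path):
    """True if the file is a 16 kHz, mono, 16-bit RIFF .wav."""
    with wave.open(path, "rb") as w:
        return (w.getframerate() == 16000
                and w.getnchannels() == 1
                and w.getsampwidth() == 2)

# Write 0.5 s of silence (8000 16-bit samples) in the required format.
with wave.open("time_0001.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(struct.pack("<8000h", *([0] * 8000)))

print(check_prompt("time_0001.wav"))  # True
```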
◆ Due Monday 20th Jan at noon by email to [email protected]
◆ See detailed instructions on the website
◆ Ask if you get stuck
Physical Models
• Blowing air through tubes...
– von Kempelen's synthesizer, 1791
• Synthesis by physical models
– Homer Dudley's Voder, 1939
More Computation – More Data
◆ Formant synthesis (60s-80s)
● Waveform construction from components
◆ Diphone synthesis (80s-90s)
● Waveform by concatenation of a small number of instances of speech
◆ Unit selection (90s-00s)
● Waveform by concatenation of a very large number of instances of speech
◆ Statistical Parametric Synthesis (00s-..)
● Waveform construction from parametric models
Waveform Generation
- Formant synthesis
- Random word/phrase concatenation
- Phone concatenation
- Diphone concatenation
- Sub-word unit selection
- Cluster-based unit selection
- Statistical Parametric Synthesis
Building a Research Field
◆ Tools
● Allow others to easily join the field
◆ Common Data Sets
● Be able to concentrate on techniques
● Have common comparisons
◆ Evaluation
● Realistically compare techniques
◆ Have Users
● Someone has to care about your results
◆ Don't become stifled
● Ensure there are new tasks and directions
Festival Speech Synthesis System
http://festvox.org/festival
- General system for multi-lingual TTS
- C/C++ code with Scheme scripting language
- General replaceable modules: lexicons, LTS, duration, intonation, phrasing, POS tagging, tokenizing, diphone/unit selection
- General tools: intonation analysis (F0, Tilt), signal processing, CART building, n-grams, SCFG, WFST, OLS
- No fixed theories
- New languages without new C++ code
- Multiplatform (Unix, Windows, OSX)
- Full sources in distribution
- Free software
CMU FestVox Project
http://festvox.org
"I want it to speak like me!"
- Festival is an engine; how do you make voices?
- Building Synthetic Voices
  - Tools, scripts, documentation
  - Discussion and examples for building voices
  - Example voice databases
  - Step-by-step walkthroughs of processes
- Support for English and other languages
- Support for different waveform techniques:
  - diphone, unit selection, limited domain, HMM
- Other support: lexicon, prosody, text analysers
The CMU Flite project
http://cmuflite.org
"But I want it to run on my phone!"
- FLITE: a fast, small, portable run-time synthesizer
- C based (no loaded files)
- Basic FestVox voices compiled into C/data
- Thread safe
- Suitable for embedded devices:
  - iPaq, Linux, WinCE, PalmOS, Symbian
- Scalable:
  - quality/size/speed trade-offs
  - frequency-based lexicon pruning
- Sizes:
  - 2.4 MB footprint (code + data + runtime RAM)
  - < 0.025 secs "time-to-speak"
Common Data Sets
◆ Data-driven techniques need data
◆ Diphone Databases
● CSTR and CMU US English diphone sets (kal and ked)
◆ CMU ARCTIC Databases
● 1200 phonetically balanced utterances (about 1 hour)
● 7 different speakers (2 male, 2 female, 3 accented)
● EGG, phonetically labeled
● Utterances chosen from out-of-copyright text
● Easy to say
● Freely distributable
● Tools to build your own in your own language
Blizzard Challenge
◆ Realistic evaluation
● Under the same conditions
◆ Blizzard Challenge [Black and Tokuda]
● Participants build a voice from a common dataset
● Synthesize test sentences
● Large set of listening experiments
● Since 2005, now in its 9th year
● 15-20 groups (academia, research labs and commercial companies)
How to test synthesis
◆ Blizzard tests:
● Do you like it? (MOS scores)
● Can you understand it?
→ SUS sentence
→ "The unsure steaks overcame the zippy rudder"
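As a sketch of how such listening tests are scored: MOS is just the mean of 1-5 listener ratings, and SUS intelligibility is scored from listener transcriptions of the reference sentence. The ratings below are invented, and the bag-of-words accuracy is a simplification of the word-alignment scoring real evaluations use.

```python
def mos(ratings):
    """Mean opinion score over 1-5 listener ratings."""
    return sum(ratings) / len(ratings)

def word_accuracy(reference, transcript):
    """Fraction of reference words the listener wrote down
    (bag-of-words; real scoring aligns the word sequences)."""
    hyp = set(transcript.lower().split())
    ref = reference.lower().split()
    return sum(w in hyp for w in ref) / len(ref)

print(mos([4, 5, 3, 4]))  # 4.0
print(round(word_accuracy(
    "the unsure steaks overcame the zippy rudder",
    "the unsure stakes overcame a zippy rudder"), 3))  # 0.857
```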
◆ Can't this be done automatically?
● Not yet (at least not reliably enough)
● But we now have lots of data for training techniques
◆ Why does it still sound like a robot?
● Need better (appropriate) testing
Speech Synthesis Techniques
◆ Unit selection
◆ Statistical parametric synthesis
◆ Automated voice building
● Database design
● Language portability
◆ Voice conversion
Unit Selection
• Target cost and join cost [Hunt and Black 96]
– Target cost is the distance from the desired unit to an actual unit in the database
• Based on phonetic, prosodic, metrical context
– Join cost is how well the selected units join
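A toy version of that search: dynamic programming over candidate units per target position, minimizing summed target and join costs. The candidate names and cost numbers below are invented; real systems compute costs from phonetic and prosodic feature distances.

```python
def select_units(candidates, target_cost, join_cost):
    """candidates[t] lists candidate units for target position t.
    Returns (total cost, chosen unit sequence)."""
    # best[u] = (cost, path) over paths ending in unit u
    best = {u: (target_cost(0, u), [u]) for u in candidates[0]}
    for t in range(1, len(candidates)):
        new_best = {}
        for u in candidates[t]:
            # Cheapest way to reach u from any previous unit.
            cost, path = min(
                ((pc + join_cost(p, u), pp) for p, (pc, pp) in best.items()),
                key=lambda x: x[0])
            new_best[u] = (cost + target_cost(t, u), path + [u])
        best = new_best
    return min(best.values(), key=lambda x: x[0])

# Two target positions, two candidates each; only the a2->b1 join is free.
cands = [["a1", "a2"], ["b1", "b2"]]
tc = lambda t, u: {"a1": 1, "a2": 0, "b1": 0, "b2": 1}[u]
jc = lambda prev, u: 0 if (prev, u) == ("a2", "b1") else 2
cost, path = select_units(cands, tc, jc)
print(cost, path)  # 0 ['a2', 'b1']
```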
Clustering Units
• Cluster units [Donovan et al 96, Black et al 97]
Unit Selection Issues
• Cost metrics
– Finding best weights, best techniques, etc.
• Database design
– Best database coverage
• Automatic labeling accuracy
– Finding errors/confidence
• Limited domain:
– Target the database to a particular application
– Talking clocks
– Targeted domain synthesis
Unit Selection vs Parametric
Unit Selection:
- The "standard" method
- "Select appropriate sub-word units from large databases of natural speech"
Parametric Synthesis [NITECH: Tokuda et al]:
- HMM-generation based synthesis
- Cluster units to form models
- Generate from the models
- "Take 'average' of units"
Old vs New
Unit Selection:
- large, carefully labelled database
- quality good when good examples available
- quality will sometimes be bad
- no control of prosody
Parametric Synthesis:
- smaller, less carefully labelled database
- quality consistent
- resynthesis requires a vocoder (buzzy)
- can (must) control prosody
- model size much smaller than unit DB
Parametric Synthesis
• Probabilistic models
• Simplification
• Generative model
– Predict acoustic frames from text
SPSS
◆ ASR vs SPSS
● Similar techniques, but not the same
◆ Model training techniques
● Alignment and cluster features
● MLLR (adaptation from multi-speaker models)
◆ Model improvement techniques
● Minimum generation error
● Label optimization
◆ Parameterization techniques
● MFCC, LSP, STRAIGHT, HSM
● Excitation modeling techniques
SPSS Goals
◆ Require an optimal parameterization that
● Is derivable from speech
● Can generate high quality speech
● Is predictable from text
◆ Candidates
● Spectral, F0, excitation
● Formants, nasality, aspiration
● Articulatory features
SPSS Systems
◆ HTS (NITECH)
● Based on HTK
● Predicts HMM states
● (Default) uses MCEP and the MLSA filter
● Supported in Festival
◆ Clustergen (CMU)
● No use of HTK
● Predicts frames
● (Default) uses MCEP and the MLSA filter
● More tightly coupled with Festival
Building Synthetic Voices
The "standard" voice requires ...
- A phone set
- Pronunciations:
  - Lexicon / letter-to-sound rules
- A phonetically and prosodically balanced corpus
  - Spoken by a good speaker
- Text analysis:
  - Number, symbol expansion, etc.
- Prosodic modeling:
  - Phrasing, intonation, duration, etc.
- Waveform generation:
  - Diphones, unit selection, parametric synthesis
- Something else that is hard:
  - No vowels (Arabic), no word segmentation, number declensions
Designing a good corpus
- From a large set of text
- Select "nice" utterances
  - 5 to 15 words, easy to say
  - All words in lexicon, no homographs
- Convert text to phoneme strings
  - Possibly with lexical stress, onset/coda, tone, etc.
- Select utterances that maximize di/triphone coverage
- Looking for around 1000 utterances
- Can seed initial data with "domain" data
- CMU ARCTIC databases
  - 7 x single-speaker English DBs
  - 1200 phonetically balanced utterances
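The coverage-maximizing selection step above is typically done as a greedy set cover: repeatedly take the utterance that adds the most not-yet-covered diphones. A minimal sketch, with toy phone strings invented for illustration:

```python
def diphones(phones):
    """Set of adjacent phone pairs in a phone list."""
    return {tuple(phones[i:i + 2]) for i in range(len(phones) - 1)}

def greedy_select(utterances):
    """utterances: list of (text, phone-list) pairs.
    Returns (selected texts, covered diphone set)."""
    covered, selected = set(), []
    remaining = list(utterances)
    while remaining:
        # Utterance adding the most diphones not yet covered.
        best = max(remaining, key=lambda u: len(diphones(u[1]) - covered))
        gain = diphones(best[1]) - covered
        if not gain:
            break  # no remaining utterance adds anything new
        covered |= gain
        selected.append(best[0])
        remaining.remove(best)
    return selected, covered

utts = [("u1", ["a", "b", "c"]), ("u2", ["a", "b"]), ("u3", ["c", "d"])]
sel, cov = greedy_select(utts)
print(sel)  # ['u1', 'u3']
```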
Hard Synthesis Problems
◆ Text normalization
◆ Intonation modeling
● Intonation evaluation
◆ Style modeling
● Choosing the right style
● Evaluating the result
Text Normalization
◆ Finding the words
● Tokenizing, homograph disambiguation, etc.
● "$1.25" vs "$1.25 million" vs "$1.25 song"
◆ Very large number of rare events
◆ Formalized systems exist
● Trained from data, optimized and out-of-date
◆ Long term, updated hacks rule systems
◆ ML Challenge
● Such a problem cannot be done by machine learning
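A toy rule for the "$1.25" ambiguity above, showing how the following word flips the reading. Digits are left unexpanded and the magnitude list is tiny; both are simplifications for illustration, not a real normalizer.

```python
import re

MAGNITUDES = {"thousand", "million", "billion"}

def expand_money(text):
    """Expand $D.CC, choosing the reading by the following word."""
    def repl(m):
        dollars, cents, follower = m.groups()
        if follower and follower.lower() in MAGNITUDES:
            # "$1.25 million" reads as a decimal number of dollars.
            return f"{dollars} point {cents} {follower} dollars"
        tail = f" {follower}" if follower else ""
        return f"{dollars} dollars {cents} cents{tail}"
    return re.sub(r"\$(\d+)\.(\d+)(?:\s+(\w+))?", repl, text)

print(expand_money("$1.25"))          # 1 dollars 25 cents
print(expand_money("$1.25 million"))  # 1 point 25 million dollars
```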
Intonation Modeling
◆ Accents, phrases and F0
● Lots of statistical models available
● Lots of "objective" measures:
→ RMSE, correlation
● No good subjective measures
◆ Listening tests
● Natural intonation: good
● Naïve intonation: bad
● Various cute models for intonation: meh
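The two "objective" measures named above are straightforward to compute over paired F0 contours; the contour values below are invented for illustration.

```python
import math

def f0_rmse(ref, pred):
    """Root mean squared error between two equal-length F0 contours."""
    return math.sqrt(sum((r - p) ** 2 for r, p in zip(ref, pred)) / len(ref))

def f0_correlation(ref, pred):
    """Pearson correlation between two equal-length F0 contours."""
    n = len(ref)
    mr, mp = sum(ref) / n, sum(pred) / n
    cov = sum((r - mr) * (p - mp) for r, p in zip(ref, pred))
    sr = math.sqrt(sum((r - mr) ** 2 for r in ref))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    return cov / (sr * sp)

ref = [120.0, 130.0, 125.0, 110.0]   # natural F0 values (Hz)
pred = [118.0, 133.0, 123.0, 112.0]  # model-generated F0 values
print(round(f0_rmse(ref, pred), 2))         # 2.29
print(round(f0_correlation(ref, pred), 3))  # 0.955
```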
Improving Understanding
◆ Take reading comprehension stories
● For children's reading tests, or TOEFL
◆ Synthesize with:
● Natural intonation
● Naïve models
● Various cute models
◆ Human listening tests
● Answer questions about stories
● Best system: naïve models
Style Modeling
◆ Classic emotion modeling
● Happy, sad, angry and neutral
● But no one needs that
◆ Style modeling
● Polite, command, empathic
◆ Style usage
● When can it be used?
● How much should be used?
Dialog with Style
◆ Record human-human dialog
● Label dialog states:
→ Implicit confirmation, corrections, discourse markers
◆ Build a dialog-state sensitive voice
● Using dialog state in features
◆ Must be closely integrated into the SDS
● Timing, dialog-state appropriate
◆ But how do you test it?
Voice Transformation
- Collect a small amount of data
  - 50 utterances
- Adapt an existing voice to the target voice
- Adaptation: what makes a voice?
  - Lexical choice
  - Phonetic variation
  - Prosody
  - Spectral / vocal tract / articulatory movement
  - Excitation mode
- Use articulatory modeling for transformation (Toth)
Voice Transformation
- Festvox GMM transformation suite (Toda)
- [Slide played audio examples: conversions among the ARCTIC speakers awb, bdl, jmk and slt]
Applications
◆ Speech output is only one component
◆ Need to integrate with larger applications
● Spoken dialog systems
● Speech-to-speech translation systems
● Talking heads
● Conversational participants
● Information delivery
Conclusions
• Synthesis has improved
– But there is still much to do
– Isolated sentences are clear ...
– ... but conversational speech is still in the future
• Speech systems must adapt
– To their usage
– And their funding conditions
• But we can always fall back on our talents