Kishore Prahallad ([email protected]), IIIT Hyderabad

Building a Limited Domain Voice Using Festvox

(Workshop Talk at IIT Kharagpur, Mar 4-5, 2009)

Kishore Prahallad
Email: [email protected]

International Institute of Information Technology (IIIT) Hyderabad, India
&
Language Technologies Institute, Carnegie Mellon University

Objective

• Objective: To provide an introduction to the inner details of the Festival speech synthesis system

• Best resources: Documentation of Festival, Festvox and Speech Tools, and their mailing lists

• Topics:
  – Festival, Festvox and Speech Tools
  – Modules and data structures in Festival
  – Synthesis flow
  – Building a limited domain voice

Festival & Speech Tools

• Festival
  – Full text-to-speech system
  – Multi-lingual
  – A general framework for building new voices in existing and new languages
  – APIs: shell level, C++ library, Emacs interface

• Speech Tools
  – A set of modules for common tasks found in speech processing
    • Example: feature extraction
  – Interface: stand-alone executables and a set of library calls linked into user programs

Festvox

• Voice building tool

• Interface created on top of Festival and Speech Tools to build voices

How Festival, Festvox & Speech Tools are Related

[Diagram: Speech Tools, the Festival multi-lingual synthesis engine, and the Festvox environment (to build voices)]

Output of Festvox

[Diagram: Speech Tools, the Festival multi-lingual synthesis engine, and the Festvox environment (to build voices), with a Voice as the output]

• Festvox uses Speech Tools and Festival to create a new voice

• The voice created is put back into the Festival framework to synthesize text

User Interface with Festival

[Diagram: Speech Tools, the Festival multi-lingual synthesis engine, the Festvox environment (to build voices), the Voice, and the User World]

Some Festival-Specific Terminology

• Utterance: *Name* of a data structure used in Festival

• Segment: A phone is referred to as a segment

Basic Modules of Festival TTS system

There are many modules in the Festival system. The basic modules used for text-to-speech are:

• Token_POS
  – Basic token identification
• Token
  – Applies the token-to-word rules (handles non-standard words)
• POS
  – A standard part-of-speech tagger
• Phrasify
  – A chunker; detects the phrase boundaries
• Word
  – Implements letter-to-sound rules

Tokens: white-space separated

• European languages: space, CR, newline, tab, vertical tab, etc.
• Some Asian languages (e.g. Chinese, Japanese): no white-space separators; use dictionaries
• Punctuation: The boy----was usually late-----but arrived on time!! We have orange/apple/banana flavors
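The white-space and punctuation handling described above can be sketched in a few lines of Python. This is only an illustration, not Festival's tokenizer: the function name `tokenize`, the field names and the punctuation set are assumptions made for this example.

```python
import string

# Characters treated as punctuation around a token (an assumption for this sketch).
PUNCT = string.punctuation


def tokenize(text):
    """Split text on white space and peel off leading/trailing punctuation.

    Returns a list of dicts with the hypothetical keys
    "name", "prepunc" and "punc".
    """
    tokens = []
    # str.split() with no argument splits on space, CR, newline, tab, vertical tab, ...
    for raw in text.split():
        core = raw.strip(PUNCT)
        if not core:
            # Token made only of punctuation, e.g. "----"
            tokens.append({"name": raw, "prepunc": "", "punc": ""})
            continue
        start = raw.index(core)
        tokens.append({
            "name": core,
            "prepunc": raw[:start],
            "punc": raw[start + len(core):],
        })
    return tokens


for tok in tokenize("The boy----was usually late-----but arrived on time!!"):
    print(tok)
```

Note that "boy----was" and "orange/apple/banana" each stay a single token here, because white space alone does not separate them. Cases like these are what the token-to-word rules have to handle.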

Basic Modules of Festival TTS system contd..

• Pauses
  – Prediction of pauses, inserting silences
• Intonation
  – Prediction of accents: which syllables have an accent (stress)
• PostLex
  – Post-lexical rules that can modify segments based on their context. This is used for things like vowel reduction, contractions, etc.
• Duration
  – Prediction of durations of segments
• Int_Targets
  – Realization of the F0 contour: given the accents/tones, generate an F0 contour
• Wave_Synth
  – A general function that in turn calls the appropriate method to actually generate the waveform

Data Structure in Festival

• Utterance: A dashboard data structure (all modules read from and write to a common memory)

• *Utterance* is the input and the output of every module in Festival

[Diagram: Utterance → Module → Utterance]

What Does an Utterance Consist of?

• *Items* and *Relations*
• Item:
  – An object that stores strings representing a word, segment, etc.
• Relation:
  – A graph which links the items
  – For example: “syllable” is a relation which links together the items storing segment names
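To make items and relations concrete, here is a minimal Python sketch that follows the simplified picture on this slide. The class names `Item` and `Utterance`, the method `add_relation` and the `stress` feature are assumptions for illustration; Festival's own utterance structure (implemented in C++ and accessed from Scheme) is richer.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """Holds a name plus arbitrary features (e.g. a word or a segment)."""
    name: str
    features: dict = field(default_factory=dict)


@dataclass
class Utterance:
    """A container of named relations; each relation links items."""
    relations: dict = field(default_factory=dict)

    def add_relation(self, name):
        # In this sketch a relation is just an ordered list of entries.
        self.relations[name] = []
        return self.relations[name]


utt = Utterance()

# "Segment" relation: one item per phone.
segments = utt.add_relation("Segment")
for phone in ["jh", "uu", "n"]:
    segments.append(Item(phone))

# "Syllable" relation: a syllable item linked to the *same* segment items.
syllables = utt.add_relation("Syllable")
syllables.append((Item("syl", {"stress": 1}), list(segments)))

print([seg.name for seg in syllables[0][1]])
```

The point of the design is that an item is stored once and can take part in several relations, so a change to a segment made through one relation is visible through the others.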

What Each Module Does to an Utterance

• Each module accesses *items* and *relations* in an utterance and generates new features, items and relations in the same utterance
  – For example: Token_POS
    • Input: Utterance with one item, a string representing a sentence
    • Output: Utterance with multiple items, each item representing a token

• The synthesis process in Festival is viewed as applying a set of modules to an utterance
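The idea of synthesis as a chain of modules over one shared utterance can be sketched as follows. This is a toy Python illustration under stated assumptions: the utterance is a plain dict from relation name to a list of items, and `token_module`, `word_module` and `synthesize` are hypothetical names, not Festival functions.

```python
def token_module(utt):
    """Read the Text relation and add a Token relation (one item per token)."""
    text = utt["Text"][0]
    utt["Token"] = text.split()
    return utt


def word_module(utt):
    """Read the Token relation and add a Word relation (lower-cased here)."""
    utt["Word"] = [tok.lower() for tok in utt["Token"]]
    return utt


def synthesize(text, modules):
    """Apply each module in turn to the same utterance."""
    utt = {"Text": [text]}      # utterance with a single item: the input string
    for module in modules:      # every module takes and returns the utterance
        utt = module(utt)
    return utt


utt = synthesize("We have orange flavors", [token_module, word_module])
print(utt["Token"])
print(utt["Word"])
```

Because every module has the same signature (utterance in, utterance out), modules can be added, removed or replaced without changing the others.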

Synthesis Flow

[Diagram: the text passes through modules, each of which adds relations]

Text: June 25
  – Tokenize → Token: June | 25
  – Token2Word → Word: June | Twenty | Fifth
  – POS: Noun | Num | Num

Synthesis Flow (contd.)

  – Word: June | Twenty | Fifth
  – POS: Noun | Num | Num
  – Syllable: 1 | 1 | 0 | 1
  – Segment: jh uu n | t w e n t ii | f i f th
  – Synthesize → Wave: Wave
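The Token2Word step in the flow above, where the token "25" after "June" becomes the words "Twenty Fifth", can be sketched in Python. This is an illustrative rule for days of the month only; the names `day_to_words` and `token_to_words` and the rule itself are assumptions, not Festival's token-to-word rules.

```python
MONTHS = {
    "january", "february", "march", "april", "may", "june", "july",
    "august", "september", "october", "november", "december",
}
ONES = ["", "first", "second", "third", "fourth", "fifth",
        "sixth", "seventh", "eighth", "ninth"]
TEENS = ["tenth", "eleventh", "twelfth", "thirteenth", "fourteenth",
         "fifteenth", "sixteenth", "seventeenth", "eighteenth", "nineteenth"]
TENS = {20: ("twenty", "twentieth"), 30: ("thirty", "thirtieth")}


def day_to_words(day):
    """Return the ordinal words for a day of the month (1..31)."""
    if day < 10:
        return [ONES[day]]
    if day < 20:
        return [TEENS[day - 10]]
    ones = day % 10
    cardinal, ordinal = TENS[day - ones]
    return [ordinal] if ones == 0 else [cardinal, ONES[ones]]


def token_to_words(tokens):
    """Expand a digit token that follows a month name; copy other tokens."""
    words = []
    for i, tok in enumerate(tokens):
        follows_month = i > 0 and tokens[i - 1].lower() in MONTHS
        if follows_month and tok.isdecimal() and 1 <= int(tok) <= 31:
            words.extend(day_to_words(int(tok)))
        else:
            words.append(tok)
    return words


print(token_to_words(["June", "25"]))
```

The rule looks at the neighbouring token before deciding how to read the digits, which is the kind of context a token-to-word rule needs: the same "25" would be read differently in a price or a phone number.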

Installation of Festival & Festvox

• Step 1: Install Speech Tools

• Step 2: Install Festival
  – Synthesize text in English to check the sound card, rate of speech, etc.

• Step 3: Install Festvox

• Detailed notes are available from the course web site

Building a Limited Domain Voice

• Unit selection is applied to a limited domain with a restricted vocabulary

• High-quality speech systems

• Units are words
  – Implementation in Festival:
    • The units are still phones, but each is restricted to come from a specific word
      – /p/ from “Pennsylvania” is differentiated from /p/ from “Pittsburgh”
      – To synthesize “Pittsburgh”, all the phones should come from the word “Pittsburgh” (there may be many examples of the same word)

• Examples: talking clock, weather prediction, rail/air inquiry systems
• http://www.cs.cmu.edu/~awb/papers/ICSLP2000_ldom/index.html
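The word-restricted phone units described above can be illustrated with a small Python sketch. Everything here is an assumption for illustration: the `phone_word` naming scheme, the function names, the utterance ids and the phone strings are made up, and the real clustering and selection in Festival involve much more (acoustic distances, join costs).

```python
from collections import defaultdict


def unit_type(phone, word):
    """Name a unit by its phone and the word it came from, e.g. 'p_pittsburgh'."""
    return f"{phone}_{word.lower()}"


def build_inventory(recordings):
    """Group recorded units by type.

    recordings: iterable of (utterance_id, word, [phones]) tuples.
    Returns {unit_type: [(utterance_id, position_in_word), ...]}.
    """
    inventory = defaultdict(list)
    for utt_id, word, phones in recordings:
        for pos, phone in enumerate(phones):
            inventory[unit_type(phone, word)].append((utt_id, pos))
    return inventory


def candidates(inventory, word, phones):
    """Candidate units for each phone of `word`, drawn only from that word."""
    return [inventory.get(unit_type(p, word), []) for p in phones]


# Hypothetical recordings: ids and phone strings are invented for the example.
recordings = [
    ("utt0001", "Pittsburgh", ["p", "i", "t", "s", "b", "er", "g"]),
    ("utt0002", "Pennsylvania", ["p", "e", "n"]),
    ("utt0003", "Pittsburgh", ["p", "i", "t", "s", "b", "er", "g"]),
]
inventory = build_inventory(recordings)
print(candidates(inventory, "Pittsburgh", ["p"]))
```

With this naming, the /p/ of "Pennsylvania" never competes with the /p/ of "Pittsburgh", while the two recordings of "Pittsburgh" both remain candidates for selection.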

Limited Domain Setup (http://festvox.org/bsv/bsv-ldom-ch.html)

• 1. Set up the environment
    $FESTVOXDIR/src/ldom/setup_ldom iiit time pra
  # This gives a talking clock setup.
  # To change it to another domain, replace "etc/time.data" with the domain-specific training sentences.
  # For non-English languages, these sentences are transliterated into English.

• 2. Generate prompts
  – Synthesize the sentences which *you* are going to speak
  – How can you synthesize? Mostly applicable to English only
  – Why synthesize at all? To *prompt* you on what to speak
    festival -b festvox/build_ldom.scm '(build_prompts "etc/txt.done.data")'

• 3. Record prompts
  – For new languages, switch off the *playing of the prompt* by commenting out na_play in bin/prompt_them
    bin/prompt_them etc/txt.done.data

• 4. Label automatically
  – Uses dynamic programming for labeling the speech
  – Labeling builds the correspondence between the text and the speech
    bin/make_labs prompt-wav/*.wav

• 4.1 Manually correct the labeling errors
    emulabel etc/emu_lab time0001

Contd…

• 5. Generate pitch markers
    bin/make_pm_wave wav/*.wav

• 6. Correct the pitch markers
    bin/make_pm_fix pm/*.pm

• 7. Generate Mel cepstral coefficients
    bin/make_mcep wav/*.wav

• 8. Generate utterance structure
    festival -b festvox/build_ldom.scm '(build_utts "etc/txt.done.data")'

• 9. Cluster the units
    festival -b festvox/build_ldom.scm '(build_clunits "etc/txt.done.data")'

• 10. Test the voice
    festival festvox/iiit_time_pra_ldom.scm '(voice_iiit_time_pra_ldom)'

• To see the units selected
    (set! utt (SayText "abhii samaya hai...."))
    (clunits::units_selected utt "-")
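The non-interactive steps 4 to 9 can be collected into a small driver. The Python sketch below only prints the commands by default (a dry run); the helper names `festival_build` and `run_steps` are assumptions, and the commands are copied from the slides. Recording (step 3), manual label correction (step 4.1) and testing (step 10) are interactive and are left out, so in practice the run would be paused after labeling to correct errors by hand.

```python
import subprocess

PROMPTS = "etc/txt.done.data"   # prompt list file name as used on the slides


def festival_build(function):
    """Command line that calls one function from festvox/build_ldom.scm."""
    return f"festival -b festvox/build_ldom.scm '({function} \"{PROMPTS}\")'"


# (description, command) pairs in the order given on the slides.
BATCH_STEPS = [
    ("Label automatically", "bin/make_labs prompt-wav/*.wav"),
    ("Generate pitch markers", "bin/make_pm_wave wav/*.wav"),
    ("Correct the pitch markers", "bin/make_pm_fix pm/*.pm"),
    ("Generate Mel cepstral coefficients", "bin/make_mcep wav/*.wav"),
    ("Generate utterance structure", festival_build("build_utts")),
    ("Cluster the units", festival_build("build_clunits")),
]


def run_steps(steps, dry_run=True):
    """Print each command; execute it through the shell only if dry_run is False."""
    handled = []
    for description, command in steps:
        print(f"# {description}\n{command}")
        if not dry_run:
            # shell=True so the *.wav / *.pm globs are expanded;
            # check=True stops at the first failing step.
            subprocess.run(command, shell=True, check=True)
        handled.append(command)
    return handled


commands = run_steps(BATCH_STEPS)   # dry run: prints the commands only
```

To actually execute the steps, call `run_steps(BATCH_STEPS, dry_run=False)` from the voice directory created by the setup step.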

References

• http://festvox.org
• 11-752 CMU course slides
  – http://festvox.org/festtut/
• 11-752 CMU course lecture notes
  – http://festvox.org/festtut/notes/festtut_toc.html
• Building Synthetic Voices
  – http://www.festvox.org/bsv/
• The Festival Speech Synthesis System
  – http://www.festvox.org/docs/manual-1.4.3/festival_toc.html
• Edinburgh Speech Tools Library
  – http://www.festvox.org/docs/speech_tools-1.2.0/book1.htm

