
Intonation and Multi-Language Scenarios

Andrew Rosenberg

Candidacy Exam Presentation

October 4, 2006


Talk Overview

Use and Meaning of Intonation
Automatic Analysis of Intonation
"Multi-Language Scenarios"
  Second Language Learning Systems
  Speech-to-Speech Translation


Use and Meaning of Intonation

Why do multi-language scenarios need intonation?
  Intonation indicates focus and contrast
  Intonation disambiguates meaning
  Intonation indicates how language is being used
    Discourse structure, speech acts, paralinguistics


Examples of Intonational Features: ToBI Examples

Categorical Features
  Pitch Accent
    H* - Mariana(H*) won it
    L+H* - Mariana(L+H*) won it
    L* - Will you have marmalade(L*) or jam(L*)
  Phrase Boundaries
    Intermediate Phrase Boundary (3)
    Intonational Phrase Boundary (4)
    Oh I don't know (4) it's got oregano (3) and marjoram (3) and some fresh basil (4)
Continuous Features
  Pitch
  Intensity
  Duration
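Where these categorical and continuous features need to be handled programmatically, a small per-word record is enough to carry both. This is a minimal sketch, not taken from any existing ToBI toolkit; all field names are illustrative.

```python
# Minimal sketch of a per-word record carrying ToBI-style categorical labels
# alongside continuous acoustic features. Field names are illustrative.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Word:
    text: str
    pitch_accent: Optional[str] = None  # e.g. "H*", "L+H*", "L*"; None if deaccented
    break_index: int = 1                # 3 = intermediate phrase, 4 = intonational phrase
    f0: List[float] = field(default_factory=list)         # continuous: pitch samples (Hz)
    intensity: List[float] = field(default_factory=list)  # continuous: energy samples (dB)
    duration: float = 0.0               # continuous: seconds

# "Mariana(H*) won it" with an intonational phrase boundary at the end:
utterance = [Word("Mariana", pitch_accent="H*"),
             Word("won"),
             Word("it", break_index=4)]
```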


Use and Meaning of Intonation: Paper List

Emphasis
  Accent is Predictable (If You're a Mind Reader) - Bolinger, 1972
  Prosodic Analysis and the Given/New Distinction - Brown, 1983
  The Prosody of Questions in Natural Discourse - Hedberg and Sosa, 2002
Syntax
  The Use of Prosody in Syntactic Disambiguation - Price, et al., 1991
Discourse Structure
  Prosodic Analysis of Discourse Segments in Direction Giving Monologues - Nakatani and Hirschberg, 1996
Paralinguistics
  Acoustic Correlates of Emotion Dimension in View of Speech Synthesis - Schröder, et al., 2001


Accent is Predictable (If You're a Mind Reader) - Dwight Bolinger, 1972, Harvard University

Nuclear Stress Rule
  Stress is assigned to the rightmost stress-able vowel in a major constituent (Chomsky and Halle, 1968)
  "Once the speaker has selected a sentence with a particular syntactic structure and certain lexical items...the choice of stress contour is not a matter subject to further independent decision"
Selected Counterexamples to the NSR
  Coordinated infinitives can be accented or not
    I have a clock to clean and oil v. I still have most of the garden to weed and fertilize
  Terminal prepositions are rarely accented
    I need a light to read by
  Focus v. Topic v. Comment
    Why are you coming indoors? -- I'm coming indoors because the sun is shining
  Predictable or less semantically rich items are less likely to be accented
    I have a point to make v. I have a point to emphasize
    I've got to go see a guy v. I've got to go see a friend [semi-pronouns?]


Prosodic Analysis and the Given/New Distinction - Gillian Brown, 1983

Information Structure of Discourse Entities (Prince, 1981)
  Given (or Evoked) Information: "recoverable either anaphorically or situationally" (Halliday, 1967)
  New Information: "non-recoverable..."
  Inferable Information: e.g. driver is inferable given bus
Experiment
  One subject was asked to describe a diagram to another, who would reproduce it.
  Entities are marked as new/brand-new, new/inferred, evoked/context (pen, paper, etc.), evoked/current (most recent mention) or evoked/displaced (previously mentioned)
Prominence realizations
  87% of new/brand-new and 79% of new/inferred entities were made prominent
  2% of evoked/context, 0% of evoked/current, 4% of evoked/displaced


The Prosody of Questions in Natural Discourse - Nancy Hedberg and Juan Sosa, 2002, Simon Fraser University

Accenting behavior in question types
  Wh-Questions (whq) vs. Yes/No Questions (ynq) in spontaneous speech from "McLaughlin Group" and "Washington Week"
The "locus of interrogation"
  Either the Wh-word or the fronted auxiliary verb
    Where are you? Do you like pie?
  Wh-words are often accented with L+H* and rarely deaccented
  Ynqs show no consistent accenting behavior
    70.5% of positive ynqs deaccent
    88% of negative ynqs use L+H*
Nuclear Tune
  Whqs are produced with falling intonation 80% of the time
  Only 34% of ynqs are produced with rising intonation
Topic Pitch Accent
  The topic of both whqs and ynqs is less often accented with L+H* than the locus of interrogation


The Use of Prosody in Syntactic Disambiguation - Patti Price (SRI), Mari Ostendorf (Boston University), Stefanie Shattuck-Hufnagel (MIT), Cynthia Fong (Boston University), 1991

Relationship between syntax and intonation
Methodology
  7 types of syntactically ambiguous sentences spoken by 4 professional radio announcers
  Ambiguous sentences were produced within disambiguating paragraphs.
  The speakers were not informed of the sentence of interest and only produced one version per session.
  Subjects selected the more appropriate surrounding context. Subjects only rated one version per session.
Analysis
  Manual labelling of phrase breaks and accents [not ToBI]
  Phrase breaks and their relative size differentiate the two versions.
    Characterized by lengthening, pauses and boundary tones


Example Syntactic Ambiguities
  Parenthetical v. non-parenthetical clause
    [Mary knows many languages,][you know]
    [Mary knows many languages (that) you (also) know]
  Apposition v. attached NP
    [Only one remembered,][the lady in red]
    [Only one remembered the lady in red]
  Main clauses w/ coordinating conjunction v. main and subordinate clause
    [Jane rides in the van][and Ella runs]
    [Jane rides in the van Ann Della runs]
  Tag question v. attached NP
    [Mary and I don't believe][do we?]
    [Mary and I don't believe Dewey.]


Example Syntactic Ambiguities
  Far v. near attachment of final phrase
    [Raoul murdered the man][with the gun] (Raoul has a gun)
    [Raoul murdered [the man with the gun]] (the man has a gun)
  Left v. right attachment of middle phrase
    [When you learn gradually][you worry more]
    [When you learn][gradually you worry more]
  Particles v. prepositions
    [They may wear down the road] (the treads hurt the road)
    [They may wear][down the road] (the treads erode)


Prosodic Analysis of Discourse Segments in Direction Giving Monologues - Julia Hirschberg (AT&T Labs) and Christine Nakatani (Harvard University), 1996

Intonation is used to indicate discourse structure
  Is a speaker beginning a topic? Ending one?
  Entails a broader information (linguistic, attentional, intentional) structure than given/new entity status.
Boston Directions Corpus
  Manual ToBI annotation and discourse segmentation
  Acoustic-prosodic correlates of discourse segment initial, medial and final phrases
Segment initial v. non-initial
  Higher max and mean F0 and energy
  Longer preceding and shorter following pauses
  Increases in F0 and energy from the previous phrase
Segment medial v. final
  Medial has a slower speaking rate and a shorter subsequent pause
  Relative increase in F0 and energy from the previous phrase


BDC Discourse Structure Example

[describe green line portion of journey]
and get on the Green Line
[describe direction to take on green line]
we will take the Green Line south toward Park Street
[describe which green line to take (any)]
we can get on any of the Green Lines at Government Center
and take them south to Park Street
[describe getting off the green line]
once we are at Park Street we will get off
[describe red line portion of journey]
and get on the red line of the T


Acoustic Correlates of Emotion Dimension in View of Speech Synthesis - Marc Schröder (University of Saarland), Roddy Cowie, Ellen Douglas-Cowie (Queen's University), Machiel Westerdijk, Stan Gielen (University of Nijmegen), 2001

Paralinguistic Information
  Information that is transmitted via language but is not strictly "linguistic", e.g. emotion, humor, charisma, deception
Emotional Dimensions
  Activation - degree of readiness to act
  Evaluation - positive v. negative
  Power - dominance v. submission
For example:
  Happiness - High Activation, High Evaluation, High Power
  Anger - High Activation, Low Evaluation, Low Power
  Sadness - Low Activation, Low Evaluation, Very Low Power


Acoustic Correlates of Emotion Dimension, ctd.

Manual annotation of the emotional content of spontaneous speech from 100 speakers in activation-evaluation-power space
High Activation correlates strongly with:
  High F0 mean and range, longer phrases, shorter pauses, large and fast F0 rises and falls, increased intensity, flat spectral slope
Negative Evaluation correlates with:
  Fast F0 falls, long pauses, increased intensity, more pronounced intensity maxima
High Power correlates with:
  Low F0 mean, (female) shallow F0 rises and falls, reduced intensity, (male) increased intensity


Use and Meaning of Intonation: Summary

Intonation can provide information about:
  Focus
  Contrast
    I want the red pen (...not the blue one)
  Information Status (given/new)
  Speech Acts
  Discourse Status
  Syntax
  Paralinguistics


Automatic Analysis of Intonation

How can the information transmitted via intonation be understood computationally?

What computational techniques are available?
How much human annotation is needed?


Automatic Analysis of Intonation: Paper List (1/2)

Supervised Methods
  Automatic Recognition of Intonational Features - Wightman and Ostendorf, 1992
  An Automatic Prosody Recognizer Using a Coupled Multi-Stream Acoustic Model and a Syntactic-Prosodic Language Model - Ananthakrishnan and Narayanan, 2005
  Perceptually-related Acoustic-Prosodic Features of Phrase Finals in Spontaneous Speech - Ishi, et al., 2003
Alternate Ways of Representing Intonation
  Direct Modeling of Prosody: An Overview of Applications in Automatic Speech Processing - Shriberg and Stolcke, 2004
  The Tilt Intonation Model - Taylor, 1998


Automatic Analysis of Intonation: Paper List (2/2)

Unsupervised Methods
  Unsupervised and Semi-supervised Learning of Tone and Pitch Accent - Levow, 2006
  Reliable Prominence Identification in English Spontaneous Speech - Tamburini, 2006
Feature Analysis
  Spectral Emphasis as an Additional Source of Information in Accent Detection - Heldner, 2001
  Duration Features in Prosodic Classification: Why Normalization Comes Second, and What They Really Encode - Batliner, et al., 2001


Supervised Methods

Require annotated data
Pitch accents and phrase boundaries are the two main prosodic events that are detected and classified


Automatic Recognition of Intonational Features - Colin Wightman and Mari Ostendorf, 1992, Boston University

Detection of boundary tones and pitch accents on syllables
Decision tree-based acoustic quantization for use with an HMM
Four-way classification: {Pitch Accent, Boundary Tone, Both, Neither}
Features
  Is the syllable lexically stressed?
  F0 contour representation
  Max, min F0 context normalization
  Duration
  Pause information
  Mean energy
Results
  Prominence: Correct 86%, False alarm 14%
  Boundary Tone: Correct 77%, False alarm 3%


An Automatic Prosody Recognizer Using a Coupled Multi-Stream Acoustic Model and a Syntactic-Prosodic Language Model - Shankar Ananthakrishnan and Shrikanth Narayanan, 2005, University of Southern California

Three asynchronous information streams contribute to intonation:
  Pitch - duration and distance from the mean of a piecewise linear fit of f0
  Energy - frame-level intensity normalized w.r.t. the utterance
  Duration - normalized vowel duration of the current syllable and the following pause duration
Coupled HMM trained on 1 hour of radio news speaker data with ASR hypotheses and POS tags
  Tags each syllable as long/short, stressed/unstressed, boundary/non-boundary
  Includes a language model relating POS and prosodic events
Syntax alone provides the best results for boundary tone detection: Correct 82.1%, False Alarm 12.93%
Stress detection false alarm rate is nearly halved by the inclusion of acoustic information
  Syntax alone: 79.7% correct / 22.25% false alarm
  Syntax + acoustics: 79.5% correct / 13.21% false alarm


Perceptually-related Acoustic-Prosodic Features of Phrase Finals in Spontaneous Speech - Carlos Toshinori Ishi, Parham Mokhtari, Nick Campbell, 2003, ATR/Human Information Science Labs

Phrase-final behavior can indicate speech act, certainty, discourse/topic structure, etc.
Classification of phrase-final behavior in Japanese
  Pitch features
    Mean F0 of the first and second half of the phrase final
    Pitch target of the first and second half
    Min, max, (pseudo-)slope, reset of the phrase final
Using a classification tree, 11 tone classes could be classified with 75.9% accuracy
  Majority class baseline: 19.6%


Perceptually-related Acoustic-Prosodic Features of Phrase Finals in Spontaneous Speech, ctd.

Tone Type | Perceptual Properties (Hattori 2002) | X-JToBI BPM
1a        | Low                                  | L%
1b        | Low+Falling tone                     | L%
1bE       | Low+Falling+Extended                 | L%
2a        | High                                 | L%+H%
2aA       | High+Aspirated                       | L%+H%
2b        | High+Lengthened                      | L%+H%>
2c        | Low+Rising tone                      | L%+LH%
2cE       | Low+Rising+Extended                  | L%+LH%
2cS       | Low+Rising+Short                     | L%+LH%
3         | High+Falling tone                    | L%+HL%
5         | High+Fall-Rise tone                  | L%+HLH%


Direct Modeling of Prosody: An Overview of Applications in Automatic Speech Processing - Elizabeth Shriberg and Andreas Stolcke, 2004, SRI and ICSI

Do we need to explicitly model prosodic features? Why not provide acoustic/prosodic information directly to other statistical models?
Task-based integration of features and models
  Event Language Model
    Augment a typical n-gram language model with prosodic event classes (sketched below)
  Event Prosodic Model
    Grow decision trees or use GMMs to generate P(Event|Signal)
  Continuous Prosodic Features
    Duration from ASR
    Pitch, energy, voicing normalizations and stylizations
    Task-specific features: e.g. number of repeat attempts
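As a rough illustration of the event language model idea, the sketch below treats prosodic events as extra tokens in an otherwise ordinary n-gram model. The event tag name, the vocabulary size, and the add-one smoothing are assumptions for the example, not the paper's configuration.

```python
# Sketch of an n-gram "event language model": prosodic event tags (here a
# sentence-boundary tag <SB>) are interleaved with words and counted like
# ordinary tokens. Smoothing is simple add-one over an assumed vocabulary.
from collections import defaultdict

VOCAB_SIZE = 1000  # assumed vocabulary size for add-one smoothing

def train_event_lm(sequences, n=3):
    """sequences: lists mixing words and event tags, e.g. ["i", "see", "<SB>"]."""
    ngram_counts = defaultdict(int)
    context_counts = defaultdict(int)
    for seq in sequences:
        padded = ["<s>"] * (n - 1) + list(seq) + ["</s>"]
        for i in range(n - 1, len(padded)):
            ctx = tuple(padded[i - n + 1:i])
            ngram_counts[ctx + (padded[i],)] += 1
            context_counts[ctx] += 1

    def prob(token, context):
        ctx = tuple(context[-(n - 1):])
        return (ngram_counts[ctx + (token,)] + 1) / (context_counts[ctx] + VOCAB_SIZE)

    return prob

prob = train_event_lm([["i", "see", "<SB>", "okay", "<SB>"]])
print(prob("<SB>", ["i", "see"]))  # probability of a sentence boundary after "i see"
```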


Direct Modeling of Prosody: Tasks

Structural Tagging
  Sentence/topic boundary and disfluency (interruption point) detection
  Uses Language Model + Event Prosodic Model
  Sentence boundary results
    Telephone: accuracy improved 7%
    BN: 19% error reduction
Pragmatics/Paralinguistics
  Dialog act classification and frustration detection
  Uses Language Model and Dialog "grammar" + Event Prosodic Model
  Results:
    Statement v. Question: 16% error reduction
    Agreement v. Backchannel: 16% error reduction
    Frustration: 27% error reduction (using "repeated attempt" feature)


Direct Modeling of Prosody: Tasks, ctd.

Speaker Recognition
  Typical approaches use spectral information
  Use Continuous Prosodic Features
  Including phone duration features can reduce error by 50%
Word Recognition
  Words can be recognized simultaneously with prosodic events (Event Language Model)
  Spectral and prosodic information can be used to model word hypotheses
  Phone duration and pause information, along with sentence and disfluency detection, reduce error by 3.1%


The Tilt Intonation Model - Paul Taylor, 1998, University of Edinburgh

The Tilt Model describes accent and boundary tones as “intonational events” characterized by pitch movement

Events (accent, boundary, neither, silence) are automatically detected using an HMM with pitch, energy, and first and second order differences of both
  Accuracy ranged from 35%-47%, with correct identification of events between 60.7% and 72.7%
Tilt parameters were then extracted from the HMM hypotheses
F0 synthesis with machine- and human-derived Tilt parameters differed by < 1 Hz RMSE on the DCIEM test set


Tilt parameter

  tilt = (|A_rise| - |A_fall|) / (2(|A_rise| + |A_fall|)) + (D_rise - D_fall) / (2(D_rise + D_fall))

where A_rise and A_fall are the F0 amplitudes, and D_rise and D_fall the durations, of an event's rise and fall components.
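The formula transcribes directly into code; the sketch below assumes a detected event's rise/fall F0 amplitudes (Hz) and durations (s), with non-zero total amplitude and duration.

```python
# Taylor's tilt parameter from an event's rise and fall components.
# Ranges from +1 (pure rise) through 0 (equal rise and fall) to -1 (pure fall).
def tilt(a_rise: float, a_fall: float, d_rise: float, d_fall: float) -> float:
    amp_term = (abs(a_rise) - abs(a_fall)) / (2 * (abs(a_rise) + abs(a_fall)))
    dur_term = (d_rise - d_fall) / (2 * (d_rise + d_fall))
    return amp_term + dur_term

print(tilt(a_rise=30.0, a_fall=30.0, d_rise=0.15, d_fall=0.15))  # 0.0: symmetric rise-fall
print(tilt(a_rise=30.0, a_fall=0.0, d_rise=0.20, d_fall=0.0))    # 1.0: pure rise
```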


Unsupervised Models of Intonation

Annotating intonation is expensive
  100x real time for full ToBI labeling
Human annotations are errorful
  Human agreement ranges from 80-90%
Unsupervised methods are:
  1. Inexpensive - data doesn't require manual annotation
  2. Consistent - performance is not reliant on human consistency


Unsupervised and Semi-supervised Learning of Tone and Pitch Accent - Gina-Anne Levow, 2006, University of Chicago

What can we do without gold-standard data?
Also, does lexical tone in Mandarin Chinese vary in similar dimensions as pitch accent?
Semi- and unsupervised speaker-dependent clustering into 4 accent classes (unaccented, high, low, downstepped)
Forced alignment-based syllable features:
  Speaker-normalized f0, f0 slope and intensity
  Context: previous and following syllables' values and first order differences
Semi-supervised: Laplacian Support Vector Machines
  Tone (clean speech): 94% accuracy (99% supervised)
  Pitch Accent (2-way): 81.5% accuracy (84% supervised)
Unsupervised: k-means clustering, asymmetric k-lines clustering (k-means sketched below)
  Tone (clean speech): 77% accuracy
  Pitch Accent (4-way): 78.4% accuracy (80.1% supervised)
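A hedged sketch of the fully unsupervised variant: k-means over speaker-normalized syllable features, with four clusters standing in for the four accent classes. The random feature matrix and the scikit-learn call are stand-ins; the paper's exact feature extraction and its asymmetric k-lines variant are not reproduced.

```python
# Sketch: cluster forced-aligned syllables into 4 accent classes with k-means.
# Feature rows here are random stand-ins for [normalized f0, f0 slope,
# normalized intensity] plus context deltas extracted from real speech.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))  # 500 syllables x 5 illustrative features

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
cluster_ids = kmeans.labels_
# Mapping cluster ids to accent labels (unaccented/high/low/downstepped)
# still needs a small labeled sample or linguistic inspection.
```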


Reliable Prominence Identification in English Spontaneous Speech - Fabio Tamburini, 2005, University of Bologna

Unsupervised metric to describe the prominence of a syllable, calculated over the nucleus:

  Prom = en_{500-4000} + dur + en_{ov} * (A_event + D_event)

  en_{500-4000}: high-band (500-4000 Hz) spectral energy
  dur: nucleus duration
  en_{ov}: full-spectrum energy
  A_event, D_event: Tilt event parameters (f0 amplitude and duration)

By tuning a threshold, 18.64% syllable error rate on TIMIT
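Read literally off the slide, the prominence function combines the four terms as below; the paper's exact operators and normalizations may differ, so treat this as a sketch with all inputs assumed pre-normalized to comparable ranges.

```python
# Sketch of a Tamburini-style syllable-nucleus prominence score; the exact
# combination in the paper may differ from this slide-level reading.
def prominence(en_high_band: float,       # spectral energy in the 500-4000 Hz band
               dur: float,                # nucleus duration
               en_overall: float,         # full-spectrum energy
               a_event: float,            # Tilt event f0 amplitude
               d_event: float) -> float:  # Tilt event duration
    return en_high_band + dur + en_overall * (a_event + d_event)

# A syllable is marked prominent when its score clears a tuned threshold:
is_prominent = prominence(0.8, 0.6, 0.7, 0.5, 0.4) > 1.9
```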


Feature Analysis

Intonation is generally assumed to be realized as a modification of:
  Pitch
  Energy
  Duration
How does each of these contribute to the realization of specific prosodic events?


Spectral Emphasis as an Additional Source of Information in Accent Detection - Mattias Heldner, 2001, Umeå University

Close inspection of spectral emphasis as a discriminator of accented and non-accented syllables in read Swedish
Spectral emphasis: difference (in dB) between energy in the first formant region and the full spectrum
  First formant energy was extracted using a dynamic low-pass filter with a cutoff that followed f0
Classifier: the word in a phrase with the highest spectral emphasis/intensity/pitch is "focally accented"
Results:
  Spectral Emphasis: 75% correct
  Overall Intensity: 69% correct
  Pitch peak: 67% correct
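A rough FFT-based rendering of the measure described above; the paper uses a dynamic low-pass filter tracking f0, which a per-frame band split only approximates, and the 1.5*f0 cutoff factor is an assumption.

```python
# Sketch: spectral emphasis of a frame as the dB difference between
# full-spectrum energy and energy below a cutoff that follows f0.
import numpy as np

def spectral_emphasis(frame: np.ndarray, sample_rate: int, f0: float,
                      cutoff_factor: float = 1.5) -> float:
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    low_band = spectrum[freqs <= cutoff_factor * f0].sum()  # "first formant" region
    total = spectrum.sum()
    return 10.0 * np.log10(total / max(low_band, 1e-12))
```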


Duration Features in Prosodic Classification: Why Normalization Comes Second, and What They Really Encode - Anton Batliner, Elmar Nöth, Jan Buckow, Richard Huber, Volker Warnke, Heinrich Niemann, 2001, University of Erlangen-Nuremberg

When vowels are stressed, accented or phrase-final, they tend to be lengthened. What's the best way to measure the duration of a word?
Duration is normalized in three ways (sketched below):
  DURNORM - normalized w.r.t. 'expected' duration
    Expected duration is calculated from the mean and std. dev. of a vowel, scaled by a rate-of-speech (ROS) approximation
  DURSYLL - normalized w.r.t. number of syllables
  DURABS - raw duration
In both German and English, on boundary and accent tasks, DURABS classified best, followed by DURSYLL, then DURNORM
Duration inadvertently encodes semantic information
  Complex words tend to have more syllables and tend to be accented more frequently; common words (particles, backchannels) tend to be shorter
  DURNORM and DURSYLL are able to classify well (if worse) despite obscuring this information
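The three duration encodings are straightforward to state in code; the DURNORM form below (deviation from an expected duration, scaled by a rate-of-speech estimate) is an assumption about the exact normalization, which the slide describes only at this level.

```python
# Sketch of the three duration encodings compared by Batliner et al.
def dur_abs(word_dur: float) -> float:
    return word_dur                            # DURABS: raw word duration

def dur_syll(word_dur: float, n_syllables: int) -> float:
    return word_dur / max(n_syllables, 1)      # DURSYLL: duration per syllable

def dur_norm(word_dur: float, expected_dur: float, expected_std: float,
             rate_of_speech: float) -> float:
    # DURNORM: deviation from the 'expected' duration of the word's vowels,
    # with the expectation scaled by a rate-of-speech (ROS) approximation.
    return (word_dur - expected_dur * rate_of_speech) / (expected_std * rate_of_speech)
```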


Automatic Analysis of Intonation: Summary

A variety of models can be used to analyze both pitch accents and phrase boundaries:
  Supervised
  Direct discriminative modelling
  Semi- and unsupervised learning
Research has also examined how accents and phrase breaks are realized in constrained acoustic dimensions


Second Language Learning Systems

Automated systems can be used to improve the pronunciation and intonation of second language learners.

Native intonation is rarely emphasized in classrooms and is often the last thing non-native speakers learn.

The focus here will be more on computational approaches (diagnosis, evaluation) than on pedagogical concerns


Second Language Learning Systems: Paper List (1/2)

Pronunciation Evaluation
  The SRI EduSpeak(TM) System: Recognition and Pronunciation Scoring for Language Learning - Franco, et al., 2000
  Automatic Localization and Diagnosis of Pronunciation Errors for Second-Language Learners of English - Herron, et al., 1999
  Automatic Syllable Stress Detection Using Prosodic Features for Pronunciation Evaluation of Language Learners - Tepperman and Narayanan, 2005


Second Language Learning Systems: Paper List (2/2)

Fluency, Nativeness and Intonation Evaluation
  A Visual Display for the Teaching of Intonation - Spaai and Hermes, 1993
  Quantitative Assessment of Second Language Learners' Fluency: An Automatic Approach - Cucchiarini, et al., 2002
  Prosodic Features for Automatic Text-Independent Evaluation of Degree of Nativeness for Language Learners - Teixeira, et al., 2000
  Modeling and Automatic Detection of English Sentence Stress for Computer-Assisted English Prosody Learning System - Imoto, et al., 2002
  A Study of Sentence Stress Production in Mandarin Speakers of American English - Chen, et al., 2001


Pronunciation Evaluation

The segmental context and lexical stress of a production determine whether it is pronounced correctly or not.


The SRI EduSpeak(TM) System: Recognition and Pronunciation Scoring for Language Learning - Horacio Franco, Victor Abrash, Kristin Precoda, Harry Bratt, Ramana Rao, John Butzberger, Romain Rossier, Federico Cesari, 2000, SRI

Recognition
  Non-native speech recognition is errorful. A native HMM recognizer was adapted to non-native speech.
  Non-native WER was reduced by half while not affecting native performance
Pronunciation Evaluation
  Combine scores using a regression tree to generate scores that correlate with scores from human raters
    Spectral Match: compare the spectrum of a candidate phone to a native, context-independent phone model. Also used for mispronunciation detection
    Phone Duration: compare the candidate duration to a model of native duration, normalized by rate of speech
    Speaking Rate: phones/sentence


Automatic Localization and Diagnosis of Pronunciation Errors - Daniel Herron and Wolfgang Menzel (U. of Hamburg), Erica Atwell (U. of Leeds), Roberto Bisiani (U. of Milan-Bicocca), Fabio Deaneluzzi (Dida*El S.r.l.), Rachel Morton (Entropic), Juergen Schmidt (Ernst Klett Verlag), 1999

Locating and describing errors is critical for instruction
Identifying segmental errors
  In response to a read prompt, lax recognition followed by strict recognition
  Some errors are predictable based on L1. Vowel, pre-vocalic consonant, and word-final devoicing errors are modelled explicitly and tested on artificial data.
    Vowel - /ih/ -> /ey/: "it" -> "eet"
    PV consonant - /w/ -> /v/: "was" -> "vas"
    WF devoicing - /g/ -> /k/: "thinking" -> "thinkink"
  Using a word-internal tri-phone model, vowel and PV consonant errors can be diagnosed with >80% accuracy at a false alarm rate below 5%. WF devoicing can only be diagnosed with ~40% accuracy.
Stress errors are detected by deviation from trained models of stressed and unstressed syllables


Automatic Syllable Stress Detection Using Prosodic Features for Pronunciation Evaluation of Language Learners - Joseph Tepperman and Shrikanth Narayanan, 2005, University of Southern California

Lexical stress can change POS and meaning
  "INsult" v. "inSULT" or "CONtent" v. "conTENT"
Detecting stress on read speech with content determined a priori
  Use forced alignment to identify syllable nuclei (vowels)
  Extract f0 and energy features. Duration features are manually normalized by context.
  Classified using a supervised Gaussian Mixture Model
  Post-processed to guarantee exactly 1 stressed classification per word (sketched below)
Mean f0, energy and duration discriminate with >80% accuracy on English spoken by Italian and German speakers
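The post-processing step is simple to illustrate: whatever the classifier says per syllable, keep only the highest stress posterior within each word. The function below is a sketch of that constraint, not the paper's implementation.

```python
# Sketch: force exactly one stressed syllable per word by taking the argmax
# of per-syllable stress posteriors produced by the GMM classifier.
import numpy as np

def assign_word_stress(stress_posteriors: np.ndarray) -> np.ndarray:
    """stress_posteriors: P(stressed | features) for each syllable of one word."""
    labels = np.zeros(len(stress_posteriors), dtype=int)
    labels[int(np.argmax(stress_posteriors))] = 1
    return labels

print(assign_word_stress(np.array([0.3, 0.8, 0.4])))  # [0 1 0] -> e.g. "inSULT"
```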


Nativeness, Fluency and Intonation Evaluation

Intonational information can influence the proficiency and understandability of a second-language speaker

Proficient second-language speakers often have difficulty producing native-like intonation


A Visual Display for the Teaching of Intonation - Gerard Spaai and Dik Hermes, 1993, Institute for Perception Research

Tools for guided instruction of intonation
Intonation is difficult to learn
  It is acquired early, so it is resistant to change
  Native language intonation expectations may impair the perception of foreign intonation
Intonation Meter
  Displays the pitch contour, interpolating non-voiced regions
  Marks vowel onsets


Quantitative Assessment of Second Language Learners' Fluency: An Automatic Approach - Catia Cucchiarini, Helmer Strik and Lou Boves, 2002, University of Nijmegen

Does 'fluency' always mean the same thing?
  Linguistic knowledge, segmental pronunciation, native-like intonation
Three groups of raters (3 phoneticians, 6 speech therapists) assessed the fluency of read speech on a scale from 1-10.
  With 1 exception, the raters agreed with correlation > .9
  Native speakers are consistently rated as more fluent than non-natives
Temporal/lexical correlates of fluency (sketched below):
  High rate of speech (segments/duration)
  High phonation/time (+/- pauses) ratio
  High mean length of runs
  Low number & duration of pauses
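These temporal measures can be computed from a pause-annotated alignment. The sketch below assumes (start, end, is_pause) segments for one reading and illustrative definitions of each measure; the paper's exact pause threshold and units are not reproduced.

```python
# Sketch of time-based fluency measures from (start_s, end_s, is_pause) segments.
def fluency_measures(segments, n_phones):
    total_dur = sum(end - start for start, end, _ in segments)
    pause_durs = [end - start for start, end, is_pause in segments if is_pause]
    # Speech "runs": stretches of consecutive non-pause segments.
    runs, current = [], 0.0
    for start, end, is_pause in segments:
        if is_pause:
            if current > 0:
                runs.append(current)
            current = 0.0
        else:
            current += end - start
    if current > 0:
        runs.append(current)
    return {
        "rate_of_speech": n_phones / total_dur,          # segments per second
        "phonation_time_ratio": (total_dur - sum(pause_durs)) / total_dur,
        "mean_length_of_runs": sum(runs) / max(len(runs), 1),
        "num_pauses": len(pause_durs),
        "total_pause_dur": sum(pause_durs),
    }

print(fluency_measures([(0.0, 2.0, False), (2.0, 2.5, True), (2.5, 4.0, False)], 40))
```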


Prosodic Features for Automatic Text-Independent Evaluation of Degree of Nativeness for Language Learners - Carlos Teixeira (IST-UTL/INESC and SRI), Horacio Franco, Elizabeth Shriberg, Kristin Precoda, Kemal Sönmez (SRI), 2000

Can a model be trained to assess a speaker's nativeness similarly to humans, without text information?
Construct feature-specific decision trees
  Word stress (duration of longest vowel, duration of lexically stressed vowel, duration of vowel with max f0)
  Speaking rate approximations (durations between vowels)
  Pitch (max, slope "bigram" modeling)
  Forced alignment + pitch (duration from max f0 to longest vowel nucleus, location of max f0)
  Unique events (durations of longest pauses, longest words)
Combination (max or expectation) of "posterior probabilities" from decision trees
Results
  Pitch-based features do not generate human-like scores
    Only weak correlation (<.434) between machine and human scores
  Inclusion of posterior recognition scores and rate of speech helps considerably: correlation = ~.7


Modeling and Automatic Detection of English Sentence Stress for Computer-Assisted English Prosody Learning System - Kazunori Imoto, Yasushi Tsubota, Antoine Raux, Tatsuya Kawahara and Masatake Dantsuji, 2002, Kyoto University

L1-specific errors need to be accounted for
  Japanese speakers tend not to use energy and duration to indicate stress
  Syllable structure: "strike" -> /s-u-t-o-r-ay-k-u/
  Incorrect phrasing
Classification of stress levels
  Syllable alignment was performed with a recognizer trained on common native Japanese English speech (including segmental errors)
  Supervised HMM training using pitch, power, 4th-order MFCCs & first and second order differences
  Using distinct models for each stress type/syllable structure/position combination (144 HMMs), 93.7%/79.3% native/non-native accuracies were achieved
  Two-stage recognition increased accuracy to 95.1%/84.1%
    1. Primary + secondary stress v. non-stressed
    2. Primary v. secondary stress


A Study of Sentence Stress Production in Mandarin Speakers of American English - Yang Chen (University of Wyoming), Michael Robb, Harvey Gilbert and Jay Lerman (University of Connecticut), 2001

Do native Mandarin speakers produce American English pitch accents “natively”?

Experiment
  Compare native Mandarin English and native American English productions of "I bought a cat there" with varied location of pitch accent.
  Pitch, energy and duration of vowels were calculated and compared across language group and gender
  Vowel onsets/offsets were determined manually.
Results
  Mandarin speakers produced stressed words with shorter duration than American speakers.
  Female Mandarin speakers produced stressed words with a greater rise in f0


Second Language Learning Systems: Summary

Performance assessment
  Pronunciation
  Intonation
Error diagnosis (and instruction)
Influence of L1 on L2 instruction and evaluation


Speech-to-Speech Translation

ASR, MT and TTS components all exist independently

Challenges specific to the translation of speech
  Can speech information be used to reduce the impact of ASR errors on MT?
  Can information conveyed by intonation be translated via this framework?


Speech-to-Speech Translation: Paper List

Cascaded Approaches
  Janus-III: Speech-to-Speech Translation in Multiple Languages - Lavie et al., 1997
  A Unified Approach in Speech Translation: Integrating Features of Speech Recognition and Machine Translation - Zhang et al., 2004
Explicit Use of Prosodic Information
  On the Use of Prosody in a Speech-to-Speech Translator - Strom et al., 1997
  A Japanese-to-English Speech Translation System: ATR-MATRIX - Takezawa et al., 1998
Integrated Approaches
  Finite State Speech-to-Speech Translation - Vidal, 1997
  On the Integration of Speech Recognition and Statistical Machine Translation - Matusov, 2005
  Coupling vs. Unifying: Modeling Techniques for Speech-to-Speech Translation - Gao, 2003


Cascaded Approach to Speech-to-Speech Translation

ASR -> MT -> TTS


Janus-III: Speech-to-Speech Translation in Multiple Languages - Alon Lavie, Alex Waibel, Lori Levin, Michael Finke, Donna Gates, Marsal Gavaldà, Torsten Zeppenfeld, Puming Zhan, 1998, Carnegie Mellon University and University of Karlsruhe

Interlingua and frame-slot based Spanish-English translation; limited domain (conference registration); spontaneous speech
Two semantic parsing techniques
  GLR* interlingua parsing (transcript 82.9%; ASR 54%)
    Manually constructed, robust grammar to parse input into interlingua
    Search for the maximal subset covered by the grammar
  Phoenix (transcript 76.3%; ASR 48.6%)
    Identifies key concepts and their structure
    Parsing grammar contains specific patterns which represent domain concepts and a generation structure
  Phoenix is used as a back-off when GLR* fails: transcript 83.3%; ASR 63.6%
Late-stage disambiguation
  Multiple translations are processed through the whole system.
  Translation hypothesis selection occurs just before generation, using scores from recognition, parsing and discourse processing.


A Unified Approach in Speech-to-Speech Translation: Integrating Features of Speech Recognition and Machine Translation - Ruiqiang Zhang, Genichiro Kikui, Hirofumi Yamamoto, Taro Watanabe, Frank Soong, Wai Kit Lo, 2004, ATR

Process many hypotheses, then select one. In a cascaded architecture:
  HMM-based ASR produces N-best recognition hypotheses
  IBM Model 4 MT (a noisy channel model) processes all N
  Rescore MT hypotheses with a weighted log-linear combination of ASR and MT model scores (sketched below)
  Construct the feature weight model by optimizing a translation distance metric (mWER, mPER, BLEU, NIST) using Powell's search algorithm
Experiment Results
  Corpus: 162k/510/508 Japanese-English parallel sentences
  Baseline: no optimization of MT features
  Significant improvement was obtained by optimizing MT feature weights based on the distance metric
  Additional improvement is achieved by including ASR features
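The rescoring step reduces to an argmax over hypotheses under a weighted sum of log-scores. Feature names and weights below are placeholders; in the paper the weights are tuned against a translation metric with Powell's method.

```python
# Sketch: pick the best translation hypothesis under a weighted log-linear
# combination of ASR and MT model log-scores.
def rescore(hypotheses, weights):
    """hypotheses: list of (text, {feature: log_score}); returns the best pair."""
    def combined(feats):
        return sum(weights[name] * feats[name] for name in weights)
    return max(hypotheses, key=lambda hyp: combined(hyp[1]))

best_text, _ = rescore(
    [("hypothesis A", {"asr_am": -120.0, "asr_lm": -35.0, "mt_tm": -40.0, "mt_lm": -22.0}),
     ("hypothesis B", {"asr_am": -118.0, "asr_lm": -37.0, "mt_tm": -44.0, "mt_lm": -20.0})],
    weights={"asr_am": 1.0, "asr_lm": 0.8, "mt_tm": 1.2, "mt_lm": 1.0},
)
```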


Explicit Use of Prosodic Information

How can prosodic information improve translation?

How can prosodic information be translated?


On the Use of Prosody in a Speech-to-Speech Translator - Volker Strom, Anja Elsner, Wolfgang Hess (University of Bonn), Alexandra Klein (University of Vienna), Jörg Spilker, Hans Weber, Günther Görz (University of Erlangen-Nürnberg), Walter Kasper, Hans Ulrich Krieger (DFKI GmbH), 1997

INTARC - German-English translator produced for the VERBMOBIL project
  Spontaneous, limited domain (appointment scheduling)
  80 minutes of prosodically labeled speech
Phrase Boundary (PB) Detector
  Gaussian classifier based on F0, energy and time features with a 4-syllable window (acc. 80.76%)
Focus Detector
  Rule-based approach: identifies the location of steepest F0 decline (acc. 78.5%)
Syntactic parsing search space is reduced by 65%. Baseline syntactic parsing uses:
  Decoder factor: product of acoustic and bi-gram scores
  Grammar factor: grammar model probability of a parse using the hypothesized words
  Prosody factor: 4-gram model of words and phrase boundaries


On the Use of Prosody in a Speech-to-Speech Translator, ctd.

Semantic parsing search space is reduced by 24.7%
  The semantic grammar was augmented, labeling rules as "segment-connecting" (SC) and "segment-internal" (SI)
  SC rules are applied when there is a PB between segments; SI rules are applied when there is not.
  Ideal phrase boundaries reduced the number of hypotheses by 65.4% (analysis trees by 41.9%)
  Automatically hypothesized PBs required a backoff mechanism to handle errors and PBs that are not aligned with grammatical phrase boundaries.
Prosodically driven translation is used when deep transfer (translation) fails
  A focused word determines (probabilistically) a dialog act, which is translated based on available information from the word chain.
  Correct: 50%, Incomplete: 45%, Incorrect: 5%


A Japanese-to-English Speech Translation System: ATR-MATRIX - Toshiyuki Takezawa, Tsuyoshi Morimoto, Yoshinori Sagisaka, Nick Campbell, Hitoshi Iida, Fumiaki Sugaya, Aiko Yokoo and Seiichi Yamamoto, 1998, ATR

Limited domain translation system (hotel reservations)
Cascaded approach
  ASR: sequential model, ~2k word vocabulary
  MT: syntactically driven, ~12k word vocabulary
  TTS: CHATR (concatenative synthesis)
Early example of "interactive" speech-to-speech translation
Speech information is used in three ways in ATR-MATRIX
  Voice Selection
    Based on the source voice, either a male or female voice is used for synthesis
  Hypothesized Phrase Boundaries
    Using pause information along with POS N-gram information, the source utterance is divided into "meaningful chunks" for translation.
  Phrase-Final Behavior
    If a phrase-final rise is detected, it is passed to the MT module as a "lexical" item potentially indicating a question.


Integrated Approach to Speech-to-Speech Translation

ASR+MT -> TTS


Finite-State Speech-to-Speech Translation - Enrique Vidal, 1997, Universidad Politécnica de Valencia

FSTs can naturally be applied to translation. FSTs for statistical MT can be learned from parallel corpora (OSTIA).
Speech input is handled in two ways:
  Baseline cascaded approach
  Integrated approach
    1. Create a translation FST on parallel text
    2. Replace each edge with an acoustic model of the source lexical item
A major drawback of this approach is its large training data requirement.
  Align the source and target utterances, reducing their "asynchronicity"
  Cluster lexical items, reducing the vocabulary size


Finite-State Speech-to-Speech Translation: Experiments

Proof-of-concept experiment
  Text: ~30 lexical items used in 16k paired sentences (Spanish-English)
    Greater than 99% translation accuracy is achieved
  Speech: 50k/400 (training/testing) paired utterances, spoken by 4 speakers
    Best performance: 97.2% translation accuracy, 97.4% recognition accuracy
    Requires inclusion of source and target 4-gram LMs in FST training
Travel domain experiment
  Text: ~600 lexical items in 169k/2k paired sentences
    0.7% translation WER w/ categorization; 13.3% WER w/o
  Speech: 336 test utterances (~3k words) spoken by 4 speakers
    The text transducer was used, with edges replaced by concatenations of "phonetic elements" modeled by a continuous HMM
    1.9% translation WER and 2.2% recognition WER were obtained


On the Integration of Speech Recognition and Statistical Machine Translation - E. Matusov, S. Kanthak and H. Ney, 2005

Use word lattices weighted by HMM ASR scores as input to a weighted FST for translation

Noisy channel model from source signal to target text:

  Text_target = argmax_{Text_target} max_{Text_source, Align} Pr(Text_source, Text_target | Align) * Pr(Signal | Text_source)

Material: 4 parallel corpora
  Spontaneous speech in the travel domain
  3k-66k paired sentences in Italian-English, Spanish-English and Spanish-Catalan
  Vocabulary size 1.7k-15k words

Results
  On all metrics (mWER, mPER, BLEU, NIST), the translation results rank as follows, best first:

1. Correct text

2. Word lattice w/ acoustic scores

3. Fully integrated ASR and MT (FUB Italian-English only)

4. Word lattice w/o acoustic scores

5. Single best ASR hypothesis (lower mPER than lattice w/o scores on FUB I-E)


Coupling vs. Unifying: Modeling Techniques for Speech-to-Speech Translation - Yuqing Gao, 2003

Application of discriminative modeling to ASR, with the goal of recognizing interlingua text for MT.

Composing models (e.g., noisy channel models) can lead to local or sub-optimal solutions

Discriminative modeling tries to avoid these by creating a single maximum entropy model p(text | acoustics, ...)
  Includes other non-independent observations as features
Major considerations:
  To reduce computational complexity, acoustic features are quantized.
  Since the feature vector can get very large, reliable feature selection is necessary.
    In preliminary experiments, 150M features were reduced to 500K via feature selection


Speech-to-Speech Translation: Summary

Existing ASR, MT and TTS systems can be combined to construct speech-to-speech translation systems
However, two significant problems are encountered
  Intonational information is generally ignored
    Prosodic boundaries, pitch accent, affect, etc. are important information carriers which ASR transcripts do not encode
  Local minima
    The best recognized string may not generate the best translated string


Intonation and Multi-Language Scenarios

Use and Meaning of Intonation
  What information can intonation provide?
Automatic Analysis of Intonation
  How can this information be represented computationally?
Multi-Language Scenarios
  Second Language Learning Systems
    How can computers help teach a second language?
  Speech-to-Speech Translation
    How can machines translate speech?

Thank you

Questions.


Automatic Analysis of Intonation

Paper           | Supervised? | Detected Events         | Algorithm
Wightman        | Yes         | Accent, Boundary        | DTree -> HMM
Ananthakrishnan | Yes         | Accent, Boundary        | CHMM
Ishi            | Yes         | 11 phrase final types   | DTree
Shriberg        | Yes*        | Accent, Boundary, other | Many
Taylor          | Yes*        | "Intonational Events"   | HMM
Levow           | No          | Accent                  | Spectral clustering / Laplacian SVM
Tamburini       | No          | Accent, Lex. Stress     | Threshold tuning
Heldner         | Yes         | Pitch Accent            | Manual rule
Batliner        | Yes         | Accent, Boundary        | Decision Tree


Second Language Learning

Paper       | Human corr?       | L1 Infl.         | Seg. | Stress   | Timing/Duration | Supra.
Franco      | Yes               | No               | Yes  | No       | Yes             | No
Herron      | Artificial Errors | German / Italian | Yes  | Yes      | No              | No
Tepperman   | No                | No               | No   | Yes      | No              | No
Spaai       | No                | No               | No   | Implicit | Implicit        | Yes
Cucchiarini | Yes               | No               | No   | No       | Yes             | No
Teixeira    | Yes               | No               | Yes  | No       | Yes             | Yes
Imoto       | No                | Japanese         | No   | Yes      | No              | Yes
Chen        | No                | Mandarin         | No   | No       | Yes             | Yes


Speech-to-Speech Translation

Paper    | MT Approach            | Cascaded/Integrated | Languages                 | Domain
Lavie    | Interlingua            | Cascaded            | Japanese, German, Spanish | Meeting Scheduling
Zhang    | SMT                    | Cascaded            | Japanese                  | Travel
Strom    | Interlingua            | Integrated          | German                    | Meeting Scheduling
Takezawa | SMT                    | Cascaded            | Japanese                  | Hotel Desk
Vidal    | SMT                    | Integrated          | Spanish, German, Italian  | Travel
Matusov  | SMT                    | Integrated          | Italian, Spanish, Catalan | Travel / Scheduling / Hotel Desk
Gao      | Interlingua Generation | Integrated          | NA                        | NA

