
Merging Segmental, Rhythmic and Fundamental Frequency Features for Automatic Language Identification...

Date post: 18-Jan-2018
Page 1: Merging Segmental, Rhythmic and Fundamental Frequency Features for Automatic Language Identification Jean-Luc Rouas 1, Jérôme Farinas 1 & François Pellegrino.

Merging Segmental, Rhythmic and Fundamental Frequency Features for Automatic Language Identification

Jean-Luc Rouas (1), Jérôme Farinas (1) & François Pellegrino (2)

(1) Institut de Recherche en Informatique de Toulouse, UMR 5505 CNRS, Université Paul Sabatier, Toulouse - France

(2) Laboratoire Dynamique du Langage, UMR 5596 CNRS, Université Lumière Lyon 2, Lyon - France

This research is supported by the Région Rhône-Alpes and the French Ministère de la Recherche

[email protected] [email protected] [email protected]

Overview

1. Introduction

2. Motivations

3. Rhythm unit extraction & modeling

4. Fundamental frequency extraction & modeling

5. Vowel System Modeling

6. Language identification: experiments

7. Conclusion and perspectives

1. Introduction

Standard approach to language identification
- Phonotactic modeling
- Acoustic-Phonetic modeling as a pre-processing

Alternative features are crucial
- Phonological features (structure of the vowel system, etc.)
- Prosodic features (intonation, rhythm, stress, etc.)
- High level cues (lexicon, etc.)

Importance of prosody and rhythm
- One of the most salient features for Language Identification by humans
- Difficult to define
- Even more difficult to model!

2. Motivations
2.1. Relevance of rhythm

What is Rhythm?
- Pattern periodically repeated: syllable or stress or mora
- Alternative theory (Dauer, 1983)

Is rhythm important?
- Major role in early language acquisition (e.g. Cutler & Mehler, 1993)
- Structure related to the emergence of language (Frame-Content Theory) (MacNeilage & Davis, 2000)
- Role in speech perception (numerous works)

Neural Network Modeling of Rhythm (Dominey & Ramus, 2000)
- Recurrent network dedicated to temporal sequence processing
- Results: 78 % of correct identification for L1-L2 coherent pair (EN – JA), chance for L1-L2 incoherent pair (EN – DU)
- But inputs consist of hand C/V labelling

2. Motivations
2.2. Relevance of intonation

Is intonation relevant for language discrimination?
- Linguistic grouping between languages using tone as a lexical marker or not
- Tone-driven language: Mandarin Chinese
  - The use of changes of F0, or tones, assigned to syllables distinguishes lexical items
- English uses stress at the level of the sentence
- Two groups of languages with distinctive prosodic signatures

The challenge
- Extract prosodic features in a fully unsupervised and language-independent way
- Model these features and evaluate their relevance

3. Rhythm unit extraction
3.1. Speech segmentation and vowel detection

- Speech segmentation: statistical segmentation (André-Obrecht, 1988)
- Speech Activity Detection
- Vowel detection (Pellegrino & Obrecht, 2000)

[Figure: two panels over Time (s), 0 to 1.0: Frequency (kHz), 0 to 8, and Amplitude, with segments labelled NonVowel, Pause and Vowel]

3. Rhythm unit extraction
3.2. Rhythm units

Syllable: a good candidate as rhythm unit
- Syllable seems to be crucial in speech perception (Mehler et al., 1981; Content et al., 2001)

But
- Syllable parsing seems to be a tricky language-specific mechanism
- No automatic language-independent algorithm can be derived (yet)

A roundabout trick: the "pseudo-syllable"
- Derived from the most frequent syllable structure in the world: CV
- Using the Vowel segments as milestones
- The speech signal is parsed in patterns matching the structure C^n V (n integer, can be 0)
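The pseudo-syllable parsing described above can be sketched as follows. This is only an illustration, assuming the segmentation and vowel detection stages already produced a list of (label, duration) pairs with labels "C" (non-vowel), "V" (vowel) and "P" (pause). The names `PseudoSyllable` and `parse_pseudo_syllables` are hypothetical, and the choice to drop consonants that are cut off by a pause or by the end of the signal is an assumption, not something the slides specify.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PseudoSyllable:
    """One C^n V unit: n consonant segments (n may be 0) closed by one vowel."""
    consonant_durations: List[float]
    vowel_duration: float

    @property
    def pattern(self) -> str:
        # e.g. two consonant segments followed by a vowel -> "CCV"
        return "C" * len(self.consonant_durations) + "V"


def parse_pseudo_syllables(segments: List[Tuple[str, float]]) -> List[PseudoSyllable]:
    """Group labelled segments into C^n V units, using vowels as milestones."""
    units: List[PseudoSyllable] = []
    pending: List[float] = []  # consonant durations waiting for their vowel
    for label, duration in segments:
        if label == "C":
            pending.append(duration)
        elif label == "V":
            # A vowel closes the current unit, even with no consonant (n = 0).
            units.append(PseudoSyllable(pending, duration))
            pending = []
        elif label == "P":
            # Assumption: a pause discards consonants not yet closed by a vowel.
            pending = []
        else:
            raise ValueError(f"unknown segment label: {label!r}")
    # Assumption: trailing consonants without a vowel are dropped.
    return units


segments = [("C", 0.05), ("C", 0.04), ("V", 0.10), ("V", 0.08),
            ("P", 0.20), ("C", 0.03), ("V", 0.09), ("C", 0.05)]
patterns = [u.pattern for u in parse_pseudo_syllables(segments)]
```

With the sample input, two adjacent vowels give a "CCV" unit followed by a bare "V" unit, since each vowel segment closes its own unit in this sketch.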

3. Rhythm unit extraction
3.2. Pseudo-syllable modeling

[Figure: waveform (Amplitude vs. Time (s), 0 to 1.0) segmented into 5 pseudo-syllables]

Rhythm:
- Duration C
- Duration V
- Complexity C

Intonation:
- Skewness(F0)
- Kurtosis(F0)

Values shown on the slide: 350 ms, 150 ms, 41,025,0

Pseudo-syllable patterns shown on the slide: CCVV CCV CV CCCV CV CCC CCV CCV CV CCCV CV
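The three rhythm descriptors listed above (Duration C, Duration V, Complexity C) can be computed per pseudo-syllable along these lines. This is a minimal sketch: the function name `rhythm_features` is hypothetical, and reading "Complexity C" as the number of segments in the consonant cluster is an assumption rather than something the slides state.

```python
from typing import List, Tuple


def rhythm_features(consonant_durations_ms: List[float],
                    vowel_duration_ms: float) -> Tuple[float, float, int]:
    """Return (Duration C, Duration V, Complexity C) for one pseudo-syllable."""
    # Duration C: total duration of the consonant cluster.
    duration_c = sum(consonant_durations_ms)
    # Complexity C: assumed here to be the number of consonant segments.
    complexity_c = len(consonant_durations_ms)
    return duration_c, vowel_duration_ms, complexity_c


# A "CCV" unit with two consonant segments of 60 ms and 80 ms and a 120 ms vowel.
features = rhythm_features([60, 80], 120)
```

A unit with no consonant (n = 0) simply yields a Duration C and a Complexity C of zero.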

4. Fundamental frequency modeling

Fundamental frequency extraction:
- "MESSIGNAIX" toolbox: combination of three methods (AMDF, spectral comb, autocorrelation)
- Spline interpolation of the F0 curve provides values even on unvoiced segments

Fundamental frequency modeling:
- Computation of statistics on each pseudo-syllable: skewness & kurtosis of the F0 distribution
- For each language, a Gaussian Mixture Model is trained using the EM algorithm
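The two F0 statistics named above can be sketched with plain moment formulas. The function name `f0_skewness_kurtosis` is hypothetical, and several choices are assumptions: population (biased) moments are used, kurtosis is the non-excess form (about 3 for a normal distribution), and a perfectly flat contour returns zeros. The input is taken to be the interpolated F0 samples falling inside one pseudo-syllable; the spline interpolation itself is not shown.

```python
from typing import Sequence, Tuple


def f0_skewness_kurtosis(f0_values: Sequence[float]) -> Tuple[float, float]:
    """Skewness and (non-excess) kurtosis of the F0 samples of one unit."""
    n = len(f0_values)
    if n < 2:
        raise ValueError("need at least two F0 samples")
    mean = sum(f0_values) / n
    # Central moments of order 2, 3 and 4 (population form, divided by n).
    m2 = sum((x - mean) ** 2 for x in f0_values) / n
    if m2 == 0.0:
        # Assumption: a flat contour has no shape to describe.
        return 0.0, 0.0
    m3 = sum((x - mean) ** 3 for x in f0_values) / n
    m4 = sum((x - mean) ** 4 for x in f0_values) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    return skewness, kurtosis


# A linear rise in F0 (Hz) is symmetric around its mean, so its skewness is 0.
skew, kurt = f0_skewness_kurtosis([100.0, 110.0, 120.0, 130.0, 140.0])
```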

5. Vowel system modeling

- Each vowel segment detected by the vowel detection algorithm is represented by:
  - 8 Mel Frequency Cepstral Coefficients (MFCCs)
  - 8 Delta MFCCs
  - Energy
  - Delta Energy
  - Duration of the segment
- Cepstral subtraction is applied for removal of the channel effect and speaker normalization
- For each language, a Gaussian Mixture Model is trained using the EM algorithm
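Once one Gaussian Mixture Model per language is available, an utterance can be scored against each of them. The sketch below covers only this scoring step, not EM training. It assumes diagonal covariances, independent feature vectors (so per-vector log-likelihoods add up), and a maximum-likelihood decision; none of these are stated on the slides, and all function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

# One mixture component: (weight, means, variances), diagonal covariance assumed.
Component = Tuple[float, Sequence[float], Sequence[float]]


def component_log_density(x: Sequence[float], means: Sequence[float],
                          variances: Sequence[float]) -> float:
    """Log density of x under a Gaussian with diagonal covariance."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * var) + (xi - mu) ** 2 / var
        for xi, mu, var in zip(x, means, variances)
    )


def gmm_log_likelihood(x: Sequence[float], gmm: List[Component]) -> float:
    """Log-likelihood of one vector, using log-sum-exp for numerical stability."""
    terms = [math.log(w) + component_log_density(x, mu, var) for w, mu, var in gmm]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def utterance_log_likelihood(vectors: List[Sequence[float]],
                             gmm: List[Component]) -> float:
    """Sum of per-vector log-likelihoods (vectors assumed independent)."""
    return sum(gmm_log_likelihood(x, gmm) for x in vectors)


def identify(vectors: List[Sequence[float]],
             models: Dict[str, List[Component]]) -> str:
    """Return the language whose model gives the highest log-likelihood."""
    return max(models, key=lambda lang: utterance_log_likelihood(vectors, models[lang]))


# Toy one-dimensional models; the vowel vectors on the slide would have 19 dimensions.
models = {
    "A": [(0.5, [0.0], [1.0]), (0.5, [1.0], [1.0])],
    "B": [(1.0, [5.0], [1.0])],
}
best = identify([[0.1], [-0.2], [0.8]], models)
```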

6. Experiments

Corpus: MULTEXT
- 5 European languages (EN, FR, GE, IT, SP)
- 50 different speakers (male and female)
- Read utterances from EUROM1
- Limitation: the same texts are produced on average by 3.75 speakers (possible partial text dependency of the models)

Identification task
- 20 s duration test utterances
- Very limited number of speakers: cross validation with 9 speakers for training and 1 for test
- The learning-testing procedure is iterated for each speaker of the corpus
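The cross-validation procedure above amounts to a leave-one-speaker-out loop. The sketch below generates the splits only. It assumes 10 speakers per language (inferred from 50 speakers, 5 languages and the 9/1 split), uses made-up speaker identifiers, and does not address how the models of the other languages are trained in each fold, which the slides leave open. The function name `leave_one_speaker_out` is hypothetical.

```python
from typing import Dict, Iterator, List, Tuple


def leave_one_speaker_out(
    speakers_by_language: Dict[str, List[str]]
) -> Iterator[Tuple[str, str, List[str]]]:
    """Yield (language, held-out test speaker, training speakers) for each fold."""
    for language, speakers in speakers_by_language.items():
        for held_out in speakers:
            # Train on every other speaker of the same language.
            training = [s for s in speakers if s != held_out]
            yield language, held_out, training


# Hypothetical speaker identifiers: 10 per language, 50 in total.
corpus = {
    lang: [f"{lang}_{i:02d}" for i in range(10)]
    for lang in ["EN", "FR", "GE", "IT", "SP"]
}
folds = list(leave_one_speaker_out(corpus))
```

Each speaker is used exactly once as the test speaker, so the sample corpus gives 50 folds with 9 training speakers each.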

6. Experiments
6.1. Rhythm modeling

Confusion matrix (20 s test utterances)
Average correct identification rate: 79 %

Item \ Model    EN    FR    GE    IT    SP
EN              62     4    16    11     7
FR               -   100     -     -     -
GE              11     1    86     2     -
IT              10     1     3    62    23
SP               1     4     -     3    91

6. Experiments
6.2. F0 modeling

Confusion matrix (20 s test utterances)
Average correct identification rate: 53 %

Item \ Model    EN    FR    GE    IT    SP
EN              25    44     9     -    22
FR               -    70     -     -    30
GE               -    36    51     -    12
IT               -    20     9    43    29
SP               -    14     1     -    85

6. Experiments
6.3. Vowel system modeling

Confusion matrix (20 s test utterances)
Average correct identification rate: 70 %

Item \ Model    EN    FR    GE    IT    SP
EN              44     -     -    38    18
FR               -    92     1     1     6
GE               2     -    96     2     -
IT              30     -     -    46    24
SP               5    10     -    13    72
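As a quick check on how such a matrix relates to the reported rate, the per-language correct rates sit on the diagonal and can be averaged. This is one plausible reading only: the slides do not say whether the average is taken per language or per test utterance. The sketch uses the vowel system matrix above, reads "-" as 0 (an assumption), and the function name `mean_diagonal` is hypothetical.

```python
from typing import List

LANGUAGES = ["EN", "FR", "GE", "IT", "SP"]

# Vowel system confusion matrix in percent: rows are test items, columns are models.
# Assumption: "-" in the table is read as 0.
VOWEL_CONFUSION = [
    [44, 0, 0, 38, 18],
    [0, 92, 1, 1, 6],
    [2, 0, 96, 2, 0],
    [30, 0, 0, 46, 24],
    [5, 10, 0, 13, 72],
]


def mean_diagonal(matrix: List[List[int]]) -> float:
    """Unweighted mean of the per-language correct identification rates."""
    return sum(matrix[i][i] for i in range(len(matrix))) / len(matrix)


rate = mean_diagonal(VOWEL_CONFUSION)
```

For this matrix the unweighted mean is 70.0, in line with the 70 % reported for the vowel system model.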

6. Experiments
6.4. Merging

Simple weighted addition of the log-likelihoods from the three models (Rhythm, F0 & vowel systems)

Weights (experimental):
- Rhythm model: 0.8
- F0 model: 0.1
- Vowel system model: 0.1

Confusion matrix (20 s test utterances)
Average correct identification rate: 84 %

Item \ Model    EN    FR    GE    IT    SP
EN              67     1     3    10    19
FR               -   100     -     -     -
GE               -     -   100     -     -
IT              13     -     -    64    23
SP               1     4     -     6    89
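The weighted addition of log-likelihoods used for merging can be sketched as below, with the weights given on the slide (0.8, 0.1, 0.1). The function names and the toy scores are hypothetical, and the sketch assumes the three log-likelihoods are combined as they are; whether any normalization is applied before merging is not stated.

```python
from typing import Dict

# Experimental weights reported on the slide.
WEIGHTS = {"rhythm": 0.8, "f0": 0.1, "vowel": 0.1}


def merge_scores(log_likelihoods: Dict[str, Dict[str, float]],
                 weights: Dict[str, float] = WEIGHTS) -> Dict[str, float]:
    """Weighted sum, per language, of the log-likelihoods of each model.

    log_likelihoods[model][language] is the utterance log-likelihood of
    `language` under the given model ("rhythm", "f0" or "vowel").
    """
    languages = next(iter(log_likelihoods.values())).keys()
    return {
        lang: sum(weights[m] * log_likelihoods[m][lang] for m in weights)
        for lang in languages
    }


def decide(merged: Dict[str, float]) -> str:
    """Pick the language with the highest merged score."""
    return max(merged, key=lambda lang: merged[lang])


# Toy scores for two languages (made-up values, for illustration only).
scores = {
    "rhythm": {"EN": -100.0, "FR": -90.0},
    "f0": {"EN": -50.0, "FR": -80.0},
    "vowel": {"EN": -60.0, "FR": -70.0},
}
merged = merge_scores(scores)
winner = decide(merged)
```

In the toy example the F0 and vowel models both prefer EN, but the rhythm model dominates because of its 0.8 weight, so FR wins.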

7. Conclusion and perspectives

Conclusion
- First approach dedicated to automatic LId with merging of rhythmic and intonation features
- Rhythmic modeling based on a "Pseudo-syllable" parsing
- Fundamental frequency described by high-order statistics
- 84 % correct identification rate with 5 languages (20 s utterances)

Perspectives
- Improve the rhythmic parsing
- Model the sequences of rhythmic units and fundamental frequency descriptors
- Study the impact of the nature of the corpus (read/spontaneous and studio/telephone recording)
- Merge this approach with phonetic and phonotactic modeling

8. Complementary experiments
