Merging Segmental, Rhythmic and Fundamental Frequency Features for Automatic Language Identification

Jean-Luc Rouas1, Jérôme Farinas1 & François Pellegrino2

1Institut de Recherche en Informatique de Toulouse, UMR 5505 CNRS, Université Paul Sabatier, Toulouse - France

2Laboratoire Dynamique du Langage, UMR 5596 CNRS, Université Lumière Lyon 2, Lyon - France

This research is supported by the Région Rhône-Alpes and the French Ministère de la Recherche

Overview

1. Introduction

2. Motivations

3. Rhythm unit extraction & modeling

4. Fundamental frequency extraction & modeling

5. Vowel System Modeling

6. Language identification: experiments

7. Conclusion and perspectives

1. Introduction

Standard approach to language identification
  - Phonotactic modeling
  - Acoustic-phonetic modeling as a pre-processing step

Alternative features are crucial
  - Phonological features (structure of the vowel system, etc.)
  - Prosodic features (intonation, rhythm, stress, etc.)
  - High-level cues (lexicon, etc.)

Importance of prosody and rhythm
  - One of the most salient features for language identification by humans
  - Difficult to define
  - Even more difficult to model!

2. Motivations 2.1. Relevance of rhythm

What is rhythm?
  - Pattern periodically repeated: syllable or stress or mora
  - Alternative theory (Dauer, 1983)

Is rhythm important?
  - Major role in early language acquisition (e.g. Cutler & Mehler, 1993)
  - Structure related to the emergence of language (Frame-Content Theory) (MacNeilage & Davis, 2000)
  - Role in speech perception (numerous works)

Neural network modeling of rhythm (Dominey & Ramus, 2000)
  - Recurrent network dedicated to temporal sequence processing
  - Results: 78 % correct identification for the L1-L2 coherent pair (EN – JA), chance level for the L1-L2 incoherent pair (EN – DU)
  - But the inputs consist of hand-labelled C/V segments

2. Motivations 2.2. Relevance of intonation

Is intonation relevant for language discrimination?
  - Linguistic grouping between languages using tone as a lexical marker or not
  - Tone-driven language: Mandarin Chinese. The use of changes of F0, or tones, assigned to syllables distinguishes lexical items
  - English uses stress at the level of the sentence
  - Two groups of languages with distinctive prosodic signatures

The challenge
  - Extract prosodic features in a fully unsupervised and language-independent way
  - Model these features and evaluate their relevance

[Figure: spectrogram (frequency, 0 to 8 kHz) and waveform (amplitude) over time (0 to 1.0 s), with segments labelled Vowel, NonVowel and Pause]

3. Rhythm unit extraction 3.1. Speech segmentation and vowel detection

  - Speech segmentation: statistical segmentation (André-Obrecht, 1988)
  - Speech activity detection
  - Vowel detection (Pellegrino & Obrecht, 2000)

3. Rhythm unit extraction 3.2 Rhythm units

Syllable: a good candidate as rhythm unit
  - Syllable seems to be crucial in speech perception (Mehler et al., 1981; Content et al., 2001)

But
  - Syllable parsing seems to be a tricky language-specific mechanism
  - No automatic language-independent algorithm can be derived (yet)

A roundabout trick: the “pseudo-syllable”
  - Derived from the most frequent syllable structure in the world: CV
  - Using the vowel segments as milestones
  - The speech signal is parsed in patterns matching the structure C^n V (n integer, can be 0)
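The C^n V parsing can be illustrated with a minimal Python sketch. It is not the authors' implementation: the segment labels ("C", "V", "#" for pause), the function names, the choice to discard consonants that precede a pause, and the reading of "complexity" as the number of consonant segments are all assumptions made for the example.

```python
from typing import List, Tuple

# A segment is (label, duration in seconds).
# Labels are hypothetical: "C" = non-vowel, "V" = vowel, "#" = pause.
Segment = Tuple[str, float]


def parse_pseudo_syllables(segments: List[Segment]) -> List[List[Segment]]:
    """Group segments into C^n V units, each vowel closing one unit."""
    units: List[List[Segment]] = []
    current: List[Segment] = []
    for label, duration in segments:
        if label == "#":
            # Assumption: a pause breaks the pattern, pending consonants are dropped.
            current = []
        elif label == "C":
            current.append((label, duration))
        elif label == "V":
            current.append((label, duration))
            units.append(current)  # the vowel is the milestone that ends the unit
            current = []
        else:
            raise ValueError(f"unknown segment label: {label!r}")
    # Trailing consonants without a vowel are ignored (assumption).
    return units


def rhythm_features(unit: List[Segment]) -> Tuple[float, float, int]:
    """Return (consonant duration, vowel duration, number of consonant segments)."""
    consonants = unit[:-1]
    duration_c = sum(d for _, d in consonants)
    duration_v = unit[-1][1]
    return duration_c, duration_v, len(consonants)
```

With this sketch, two adjacent vowels yield two units, the second one with n = 0.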

3. Rhythm unit extraction 3.2 Pseudo-syllable modeling

[Figure: waveform (amplitude over time, 0 to 1.0 s) parsed into 5 pseudo-syllables, with segment labels such as CCV, CV and CCCV; example durations shown: 350 ms, 150 ms]

Features computed on each pseudo-syllable:
  - Rhythm: Duration C, Duration V, Complexity C
  - Intonation: Skewness(F0), Kurtosis(F0)

4. Fundamental frequency modeling

Fundamental frequency extraction:
  - « MESSIGNAIX » toolbox: combination of three methods (AMDF, spectral comb, autocorrelation)
  - Spline interpolation of the F0 curve provides values even on unvoiced segments

Fundamental frequency modeling:
  - Computation of statistics on each pseudo-syllable: skewness & kurtosis of the F0 distribution
  - For each language, a Gaussian Mixture Model is trained using the EM algorithm
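The two F0 statistics can be sketched as follows. This assumes population (biased) moments and non-excess kurtosis (a Gaussian gives 3); the slides do not say which convention was used, and the function name and the handling of a perfectly flat contour are illustrative choices.

```python
from typing import Sequence, Tuple


def skewness_kurtosis(f0: Sequence[float]) -> Tuple[float, float]:
    """Skewness and (non-excess) kurtosis of the F0 values of one pseudo-syllable."""
    n = len(f0)
    if n < 2:
        raise ValueError("need at least two F0 values")
    mean = sum(f0) / n
    m2 = sum((x - mean) ** 2 for x in f0) / n  # variance (population)
    if m2 == 0:
        # Flat contour: both statistics are undefined; return zeros (assumption).
        return 0.0, 0.0
    m3 = sum((x - mean) ** 3 for x in f0) / n
    m4 = sum((x - mean) ** 4 for x in f0) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2
```

For a symmetric contour such as 100, 110, 120, 130, 140 Hz the skewness is 0; an asymmetric rise or fall moves it away from 0.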

5. Vowel system modeling

Each vowel segment detected by the vowel detection algorithm is represented by:
  - 8 Mel Frequency Cepstral Coefficients (MFCCs)
  - 8 Delta MFCCs
  - Energy
  - Delta Energy
  - Duration of the segment

Cepstral subtraction is applied for removal of the channel effect and speaker normalization

For each language, a Gaussian Mixture Model is trained using the EM algorithm
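The slides do not detail the cepstral subtraction step. One common form is cepstral mean subtraction, where the mean cepstral vector of an utterance is removed from every vector of that utterance. The sketch below assumes that reading; the function name is hypothetical.

```python
from typing import List


def cepstral_mean_subtraction(vectors: List[List[float]]) -> List[List[float]]:
    """Subtract the per-utterance mean from each cepstral vector.

    `vectors` holds one cepstral vector (e.g. 8 MFCCs) per vowel segment
    of a single utterance. Assumption: the mean is taken over the utterance.
    """
    if not vectors:
        return []
    n = len(vectors)
    dim = len(vectors[0])
    means = [sum(v[i] for v in vectors) / n for i in range(dim)]
    return [[v[i] - means[i] for i in range(dim)] for v in vectors]
```

A constant offset added to every cepstral vector, which is how a stationary channel appears in the cepstral domain, cancels out after this step.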

6. Experiments

Corpus: MULTEXT
  - 5 European languages (EN, FR, GE, IT, SP)
  - 50 different speakers (male and female)
  - Read utterances from EUROM1
  - Limitation: the same texts are produced on average by 3.75 speakers (possible partial text dependency of the models)

Identification task
  - 20 s duration test utterances
  - Very limited number of speakers: cross-validation with 9 speakers for training and 1 for test
  - The learning-testing procedure is iterated for each speaker of the corpus
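The cross-validation procedure can be sketched as a leave-one-speaker-out loop. The sketch assumes that the 9/1 split is applied within each language (10 speakers per language); the speaker identifiers and the function name are illustrative.

```python
from typing import Iterator, List, Tuple


def leave_one_speaker_out(speakers: List[str]) -> Iterator[Tuple[List[str], str]]:
    """Yield (training speakers, held-out test speaker) for every speaker in turn."""
    for i, held_out in enumerate(speakers):
        train = speakers[:i] + speakers[i + 1:]
        yield train, held_out


# Hypothetical speaker identifiers for one language.
speakers_fr = [f"FR{k:02d}" for k in range(10)]
for train, test in leave_one_speaker_out(speakers_fr):
    # Train the language model on `train`, then score the utterances of `test`.
    pass
```

Each speaker is used exactly once for testing, so every test utterance is scored by models that never saw that speaker.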

6. Experiments 6.1. Rhythm modeling

Confusion matrix (20 s test utterances), values in %
Average correct identification rate: 79 %

Item \ Model    EN    FR    GE    IT    SP
EN              62     4    16    11     7
FR               -   100     -     -     -
GE              11     1    86     2     -
IT              10     1     3    62    23
SP               1     4     -     3    91

6. Experiments 6.2. F0 modeling


Item \ Model    EN    FR    GE    IT    SP
EN              25    44     9     -    22
FR               -    70     -     -    30
GE               -    36    51     -    12
IT               -    20     9    43    29
SP               -    14     1     -    85

6. Experiments 6.3. Vowel system modeling


Item \ Model    EN    FR    GE    IT    SP
EN              44     -     -    38    18
FR               -    92     1     1     6
GE               2     -    96     2     -
IT              30     -     -    46    24
SP               5    10     -    13    72

6. Experiments 6.4. Merging

Simple weighted addition of the log-likelihoods from the three models (Rhythm, F0 & vowel systems)

Weights (experimental):
  - Rhythm model: 0.8
  - F0 model: 0.1
  - Vowel system model: 0.1
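The weighted addition can be sketched in a few lines. The weights are those given above; the function names, the dictionary layout and the log-likelihood values in the usage example are invented for illustration.

```python
from typing import Dict

# Weights reported on the slide.
WEIGHTS: Dict[str, float] = {"rhythm": 0.8, "f0": 0.1, "vowel": 0.1}


def fuse(log_likelihoods: Dict[str, Dict[str, float]],
         weights: Dict[str, float] = WEIGHTS) -> Dict[str, float]:
    """Weighted sum of per-model log-likelihoods, for each candidate language.

    `log_likelihoods[model][language]` is the log-likelihood of the test
    utterance under that model for that language.
    """
    languages = next(iter(log_likelihoods.values())).keys()
    return {
        lang: sum(w * log_likelihoods[model][lang] for model, w in weights.items())
        for lang in languages
    }


def identify(log_likelihoods: Dict[str, Dict[str, float]]) -> str:
    """Return the language with the highest fused score."""
    scores = fuse(log_likelihoods)
    return max(scores, key=scores.get)


# Invented scores for a two-language toy case.
example = {
    "rhythm": {"EN": -100.0, "FR": -120.0},
    "f0": {"EN": -50.0, "FR": -40.0},
    "vowel": {"EN": -80.0, "FR": -70.0},
}
print(identify(example))
```

In this toy case the F0 and vowel models both prefer FR, but the rhythm model carries most of the weight, so the fused decision follows it.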

Confusion matrix (20 s test utterances), values in %
Average correct identification rate: 84 %

Item \ Model    EN    FR    GE    IT    SP
EN              67     1     3    10    19
FR               -   100     -     -     -
GE               -     -   100     -     -
IT              13     -     -    64    23
SP               1     4     -     6    89
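As a check on the merged result, the unweighted mean of the diagonal of the merged confusion matrix gives the reported 84 %. The sketch below assumes that rows are test languages, columns are models, and that "-" stands for 0.

```python
from typing import List

LANGUAGES = ["EN", "FR", "GE", "IT", "SP"]

# Merged-model confusion matrix from the slide ("-" entered as 0).
MERGED: List[List[int]] = [
    [67, 1, 3, 10, 19],
    [0, 100, 0, 0, 0],
    [0, 0, 100, 0, 0],
    [13, 0, 0, 64, 23],
    [1, 4, 0, 6, 89],
]


def mean_diagonal_rate(matrix: List[List[int]]) -> float:
    """Unweighted average of the per-language correct identification rates."""
    return sum(matrix[i][i] for i in range(len(matrix))) / len(matrix)


print(mean_diagonal_rate(MERGED))  # 84.0
```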

7. Conclusion and perspectives

Conclusion
  - First approach dedicated to automatic LId with merging of rhythmic and intonation features
  - Rhythmic modeling based on a “Pseudo-syllable” parsing
  - Fundamental frequency described by high-order statistics
  - 84 % correct identification rate with 5 languages (20 s utterances)

Perspectives
  - Improve the rhythmic parsing
  - Model the sequences of rhythmic units and fundamental frequency descriptors
  - Study the impact of the nature of the corpus (read/spontaneous and studio/telephone recording)
  - Merge this approach with phonetic and phonotactic modeling

8. Complementary experiments
