
Recommendations Based on Speech Classification


(and examples of what recommender systems can learn from signal processing)

Christian Müller
German Research Center for Artificial Intelligence

International Computer Science Institute, Berkeley, CA

Overview

Speech as a source of information for non-intrusive user modeling

Knowledge-driven feature selection
Vocal aging -> features for speaker age recognition

Classification methods for independent “bag of observations” features
GMM/SVM supervector approach for acoustic speech features

Valid application-independent evaluation
Detection task and pseudo-NIST evaluation procedure

Feature space warping normalization
Rank and polynomial rank normalization

Conclusions
Speech/signal processing take-away messages

Speech as a Source for Non-Intrusive User Modeling (UM)

Information about the user can come from
an explicit statement (intrusive), or
inference from sensors (not intrusive); here, speech = sensor.

speaker classification -> user model -> adaptive speech dialog system

The adaptive speech dialog system
provides recommendations (e.g. a different route to the gate), and
adapts its dialog behavior (e.g. detailed map with shops vs. arrows).

Example system prompt: “Now it’s time to get to gate 38.”

Speaker Classification Systems

Audio segment (telephone quality) -> System

Age and Gender: Voice Award 2007, Telekom live operation 2009
Language: 14 languages + dialects, NIST evaluation 2007
Identity: project with BKA 2009, NIST* Evaluation 2008
Acoustic Events: project with VW 2008, Interspeech 2008
Cognitive Load: Best Research Paper Award, UM 2001



products media services actions strategies

age

gender

emotions

language

dialect

accent

identity

acoustic events

Product Recommendations Based on Age and Gender

Michael Feld and Christian Müller. Speaker Classification for Mobile Devices. In Proceedings of the 2nd IEEE International Interdisciplinary Conference on Portable Information Devices (Portable 2008). 2008

Data Flow

on mobile device

on server

Agender Server

Classification

User / Speaker

Mobile ShopAssist

Speech Recognition

Agender Client

User Profile Output

WLAN

AM

YF


How can you find features for building your models by explicitly studying the underlying phenomena?

Proposing knowledge-driven feature selection on the example of features for speaker age recognition

Speaker Classification as an Interdisciplinary Area of Research

Contributing disciplines: Phonetics, Voice Pathology, Speech Technology / Artificial Intelligence, Software Technology

What are the manifestations of age (and gender) in the speaker’s voice and speaking style?

How can the age (and the gender) of a speaker be recognized automatically?

What are the requirements of a speaker classification system, and how can they be met at the implementation layer?

Impact of Aging on Human Speech Production

Speech breathing
lungs: lighter, less elastic, lower position
thorax: stiffer
effects: more speech pauses, lower amplitude, lower expiratory volume

Laryngeal area
larynx: calcification and ossification
vocal folds: loss of tissue, stiffening
effects: rise of fundamental frequency (in men), reduced voice quality

Supralaryngeal area
facial bones and muscles: degeneration, reduced elasticity
effects: imprecise articulation, for example vowel centralization

Neurological effects
loss of tissue in the cortex, reduced performance of the neuronal transmitters
effects: reduced articulation rate, defective coordination between the articulators, vowel centralization

Development of F0 in Men / Women

[Figure: F0 (Hz, axis 90 to 170) versus age in years (20 to 90) for men and women, after Linville (2001). Legend entries: “smokers and non-smokers” and “only non-smokers”.]

Age Classes

Class      Age              Female   Male
Children   <= 13 years      CF       CM
Youth      14 - 19 years    YF       YM
Adults     20 - 64 years    AF       AM
Seniors    >= 65 years      SF       SM


Features

voice

fundamental frequency (pitch)
mean: pitch_mean
min, max and difference: pitch_min / pitch_max / pitch_diff
standard deviation: pitch_stddev

voice quality
jitter: jitt_l / jitt_la / jitt_rap / jitt_ppq / jitt_ddp
shimmer: shim_l / shim_ldb / shim_apq3 / shim_apq11 / shim_ddp
harmonics-to-noise ratio: harm_mean / harm_stddev

speaking style

articulation rate: ar_rate
speech pauses: pause_num / pause_dur
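The pitch and voice-quality features listed above can be illustrated with a minimal sketch. It assumes that glottal period lengths and peak amplitudes have already been extracted by some pitch tracker (not shown), and it uses the common "local" definitions of jitter and shimmer (mean absolute difference between consecutive cycles, divided by the mean). Whether jitt_l and shim_l in the talk follow exactly these definitions is an assumption, and the function names are illustrative.

```python
from statistics import mean

def local_perturbation(values):
    """Mean absolute difference of consecutive values, relative to their mean."""
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return mean(diffs) / mean(values)

def pitch_stats(periods_s):
    """Mean, min, max and range of F0 in Hz from period lengths in seconds."""
    f0 = [1.0 / p for p in periods_s]
    return {"pitch_mean": mean(f0), "pitch_min": min(f0),
            "pitch_max": max(f0), "pitch_diff": max(f0) - min(f0)}

# Hypothetical glottal periods (s) and peak amplitudes of five cycles.
periods = [0.0100, 0.0102, 0.0099, 0.0101, 0.0100]
amplitudes = [0.80, 0.78, 0.82, 0.79, 0.81]

jitter_local = local_perturbation(periods)      # period-to-period variation
shimmer_local = local_perturbation(amplitudes)  # amplitude variation
stats = pitch_stats(periods)
```

Both perturbation measures are relative values, so they are far smaller than pitch values in Hz, which matters for the normalization discussed later in the talk.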

Example Results

[Figures: per-class results for the classes CF, CM, YF, YM, AF, AM, SF, SM, with panels for jitter, speech pauses, and fundamental frequency (F0). Group labels in the plots: C_YFAFSF, YM_AM_SM.]

A high jitter value indicates low voice quality.

Christian Müller. Zweistufige kontextsensitive Sprecherklassifikation am Beispiel von Alter und Geschlecht [Two-layered Context-Sensitive Speaker Classification on the Example of Age and Gender]. AKA, Berlin, 2006

Hierarchical Feature Model

Low-level features (physical characteristics)

spectrum
prosody
phonetics
idiolect
dialog
semantics

High-level features (learned characteristics)

Examples shown in the figure:
word level: <s> how shall I say this <c> <s> yeah I know...
phone level: /S/ /oU/ /m/ /i:/ /D/ /&/ /m/ // /n/ /i:/ ...

How can your features be modeled, assuming that they
are multi-dimensional,
represent repeating observations of the same kind, and
can be assumed to be independent (“bag” of observations)?

Proposing the GMM/SVM supervector approach on the example of frame-by-frame acoustic features

General Classification Scheme

Preprocessing: e.g. channel compensation (not addressed in this talk)
Feature Extraction
Classification: support-vector machines, multilayer perceptron networks
Fusion, Top-Down Knowledge

[Figure: processing pipeline with a small multilayer perceptron (inputs x1, x2; hidden units y1, y2; output zk; weights wji, wkj).]

Modeling Acoustics and Prosodics

[Figure: the hierarchical feature model (spectrum, prosody, phonetics, idiolect, dialog, semantics) with the acoustic and prosodic levels in focus.]

no ASR

Generative Approach: Gaussian Mixture Model (GMM)

Training: frames of speech labeled “emergency vehicle” -> feature extraction -> probability density (“emergency vehicle” model)

Test: frame of speech of unknown class -> feature extraction -> “emergency vehicle” model -> avg. likelihood over all frames for class “emergency vehicle”

Test with background model: frame of speech of unknown class -> feature extraction -> “emergency vehicle” model and background model -> avg. log likelihood ratio over all frames for class “emergency vehicle”
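The scoring step with a background model can be sketched as follows. This is a minimal illustration, assuming one-dimensional features and already trained mixtures; the model parameters, the name `siren`, and the example frames are hypothetical and not taken from the talk.

```python
import math

def gmm_logpdf(x, weights, means, variances):
    """Log density of a one-dimensional Gaussian mixture at x."""
    terms = [
        math.log(w) - 0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(terms)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(t - top) for t in terms))

def avg_llr(frames, target, background):
    """Average log-likelihood ratio of target vs. background over all frames."""
    return sum(gmm_logpdf(x, *target) - gmm_logpdf(x, *background)
               for x in frames) / len(frames)

# Hypothetical models, each given as (weights, means, variances).
siren = ([0.5, 0.5], [2.0, 4.0], [0.5, 0.5])
background = ([1.0], [0.0], [4.0])

score_siren = avg_llr([2.1, 3.9, 2.2, 4.1], siren, background)
score_other = avg_llr([-0.5, 0.3, 0.1, -1.0], siren, background)
```

Averaging over frames makes the score independent of the segment length: a positive score favors the target class, a negative one the background.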

A Mixture of Gaussians

Means, variances, and mixture weights are optimized in training.

[Figure: black line = mixture of 3 Gaussians]
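How the means, variances, and mixture weights can be optimized is sketched below using expectation-maximization (EM) on one-dimensional data. The slide does not name the training algorithm, so EM is an assumption here (it is the usual choice for GMMs), and the variance floor is an added safeguard rather than part of the basic algorithm.

```python
import math

def gauss(x, m, v):
    return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

def log_likelihood(data, w, mu, var):
    return sum(math.log(sum(wk * gauss(x, mk, vk)
                            for wk, mk, vk in zip(w, mu, var))) for x in data)

def em_step(data, w, mu, var, var_floor=1e-3):
    """One EM iteration: re-estimate weights, means and variances."""
    k = len(w)
    # E-step: responsibility of each component for each data point.
    resp = []
    for x in data:
        p = [w[j] * gauss(x, mu[j], var[j]) for j in range(k)]
        total = sum(p)
        resp.append([pj / total for pj in p])
    # M-step: responsibility-weighted statistics per component.
    new_w, new_mu, new_var = [], [], []
    for j in range(k):
        n_j = sum(r[j] for r in resp)
        m_j = sum(r[j] * x for r, x in zip(resp, data)) / n_j
        v_j = sum(r[j] * (x - m_j) ** 2 for r, x in zip(resp, data)) / n_j
        new_w.append(n_j / len(data))
        new_mu.append(m_j)
        new_var.append(max(v_j, var_floor))
    return new_w, new_mu, new_var

# Hypothetical data with two clusters and a deliberately poor initial guess.
data = [0.9, 1.0, 1.1, 1.2, 4.8, 5.0, 5.1, 5.3]
w, mu, var = [0.5, 0.5], [0.0, 6.0], [1.0, 1.0]
history = [log_likelihood(data, w, mu, var)]
for _ in range(10):
    w, mu, var = em_step(data, w, mu, var)
    history.append(log_likelihood(data, w, mu, var))
```

The list `history` records the log-likelihood after each iteration, which should not decrease as training proceeds.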

Discriminative Method: Support Vector Machine (SVM)

Training: feature extraction -> examples labeled “em. vehic.” (1) and “not em. vehic.” (-1) -> “em. vehic.” model

Features are transformed into a higher-dimensional space where the problem is linear.
The discriminating hyperplane is learned by maximizing the margin between the classes.
There is a trade-off between training error and width of margin.
The model is stored in the form of “support vectors” (data points on the margin).



Test: segment of unknown class -> feature extraction -> “em. vehic.” model -> score (distance to hyperplane)

Discriminative methods have been shown to be superior to generative methods for similar tasks.
Feature vectors have to be of the same length, which makes the method sensitive to variable segment lengths.
Solutions: feature statistics calculated over the entire utterance, a fixed portion of the segment, or sequential kernels.
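The first of these solutions, feature statistics over the entire utterance, can be sketched as follows. The choice of mean and standard deviation as the statistics is an assumption for illustration; the talk does not say which statistics are used.

```python
from statistics import mean, pstdev

def utterance_stats(frames):
    """Map a variable-length list of frame vectors to one fixed-length vector.

    For each feature dimension the mean and the (population) standard
    deviation over all frames are concatenated, giving 2 * D entries.
    """
    dims = list(zip(*frames))  # transpose: one tuple of values per dimension
    return [mean(d) for d in dims] + [pstdev(d) for d in dims]

# Two hypothetical segments of different length with 2-dimensional frames.
short = [[1.0, 10.0], [3.0, 14.0]]
long_ = [[1.0, 10.0], [2.0, 11.0], [3.0, 12.0], [2.0, 11.0], [2.0, 11.0]]

vec_short = utterance_stats(short)
vec_long = utterance_stats(long_)
```

Both segments end up as vectors of length four, so they can be passed to the same SVM regardless of how many frames they contain.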

GMM/SVM Supervector Approach

feature extraction -> Gaussian means (MAP adapted)

Combines the discriminative power of SVMs with the length independence of GMMs.
Very successful with similar tasks such as speaker recognition.
The GMM is trained using MAP adaptation.
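A minimal sketch of how a supervector might be formed is given below. It assumes the relevance-MAP mean update that is common in the speaker recognition literature, diagonal covariances, and a relevance factor of 16; none of these details are stated on the slide, and the background model parameters are hypothetical.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return sum(-0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
               for xi, m, v in zip(x, mean, var))

def supervector(frames, weights, means, variances, relevance=16.0):
    """MAP-adapt the background means to one segment and stack them."""
    k, d = len(means), len(means[0])
    n = [0.0] * k                           # soft frame counts per component
    first = [[0.0] * d for _ in range(k)]   # responsibility-weighted sums
    for x in frames:
        logp = [math.log(weights[j]) + log_gauss_diag(x, means[j], variances[j])
                for j in range(k)]
        top = max(logp)
        post = [math.exp(lp - top) for lp in logp]
        total = sum(post)
        for j in range(k):
            r = post[j] / total
            n[j] += r
            for i in range(d):
                first[j][i] += r * x[i]
    vec = []
    for j in range(k):
        alpha = n[j] / (n[j] + relevance)   # data vs. prior trade-off
        for i in range(d):
            seg_mean = first[j][i] / n[j] if n[j] > 0 else means[j][i]
            vec.append(alpha * seg_mean + (1 - alpha) * means[j][i])
    return vec

# Hypothetical two-component background model over 2-dimensional frames.
ubm = ([0.5, 0.5], [[0.0, 0.0], [5.0, 5.0]], [[1.0, 1.0], [1.0, 1.0]])
sv_short = supervector([[0.5, 0.2], [5.5, 4.8]], *ubm)
sv_long = supervector([[0.5, 0.2]] * 30 + [[5.5, 4.8]] * 10, *ubm)
```

The supervector always has K * D entries (here 4), whatever the number of frames, which is what allows it to serve as SVM input. With more frames the adapted means move further from the background means toward the segment data.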

Evaluation Results

[Bar chart: DCF (axis 0 to 25) for the GMM-UBM and GMM-SVM systems on the entire set, the matched condition, and the unmatched condition. Data labels in the chart: 23.41, 8.09, 14.58, 19.55, 10.22, 3.45.]

Christian Müller, Joan-Isaac Biel, Edward Kim, and Daniel Rosario, “Speech-overlapped Acoustic Event Detection for Automotive Applications,” in Proceedings of the Interspeech 2008, Brisbane, Australia, 2008.

How can you evaluate your multi-class models independently from the given application?

How can you establish an appropriate evaluation procedure in order to obtain valid results?

Proposing the detection task and the “pseudo NIST” evaluation procedure on the example of acoustic event detection and speaker age recognition.

Background

With multi-class recognition problems, many test/analysis methods, e.g. confusion matrices, are very application specific. We want a method that allows results to be generalized across a large set of applications.

With home-grown databases, parameter tuning on the evaluation set often compromises the validity of the results/inferences. We want a fair “one-shot” evaluation.

The Detection Task

Given a speech segment (s) and an acoustic event to be detected (target event, E_T), the task is to decide whether E_T is present in s (yes or no).
The system's output shall also contain a score indicating its confidence, with more positive scores indicating greater confidence.

Example: target “emergency vehicle?” -> system -> “yes, 1.324326”

Terminology

Segment class: e.g. segment event, segment age class. This is the ground truth (not known to the system).
Target: the hypothesized class.
Trial: a combination of segment and target.

Evaluation

The system performance is evaluated by presenting it with a set of trials.
Each test segment is used for multiple trials.
The absence of all targets is explicitly included.

[Figure: one test segment is presented to the system with each target in turn (music?, talking?, laughing?, phone?, emergency vehicle?, no event?). The system answers every trial with a decision and a score, e.g. “yes 1.32432” for “emergency vehicle?”.]
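The trial structure can be sketched as follows: every test segment is paired with every target, including the explicit "no event" target. The scoring function, the segment identifiers, and the decision threshold of 0 are hypothetical placeholders standing in for a real detection system.

```python
from itertools import product

TARGETS = ["music", "talking", "laughing", "phone",
           "emergency vehicle", "no event"]

def make_trials(segment_ids, targets=TARGETS):
    """Pair every test segment with every target, including 'no event'."""
    return list(product(segment_ids, targets))

def run_trials(trials, score_fn, threshold=0.0):
    """Return (segment, target, decision, score) for each trial."""
    results = []
    for segment, target in trials:
        score = score_fn(segment, target)
        results.append((segment, target, score > threshold, score))
    return results

# Hypothetical scorer standing in for a real detection system.
def toy_score(segment, target):
    return 1.32432 if target == "emergency vehicle" else -0.5

trials = make_trials(["seg001", "seg002"])
results = run_trials(trials, toy_score)
accepted = [(s, t) for s, t, decision, _ in results if decision]
```

Two segments and six targets give twelve trials; each trial is decided on its own, so a segment may be accepted for several targets or for none.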

Types of Errors

MISS: the segment is “em. vehic.”, the target is “em. vehic.”, and the system answers “no”.

FALSE ALARM: the segment is “em. vehic.”, the target is “phone”, and the system answers “yes”.

Decision-Error Tradeoff

Selecting an operating point (decision threshold) along the dotted line trades misses off against false alarms.
The optimal operating point is application dependent.
Low false alarm rates are desirable for most applications.

[Figure: misses versus false alarms, with the “equal error rate” point marked.]
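The tradeoff can be made concrete with a small sketch that sweeps the decision threshold over the observed scores and reports the miss and false alarm rates at each operating point. The scores are hypothetical, and the equal error rate is approximated by the operating point at which the two rates are closest, without interpolation.

```python
def error_rates(target_scores, nontarget_scores, threshold):
    """Miss and false-alarm rates when accepting scores >= threshold."""
    p_miss = sum(s < threshold for s in target_scores) / len(target_scores)
    p_fa = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
    return p_miss, p_fa

def det_points(target_scores, nontarget_scores):
    """(threshold, P_miss, P_fa) for every observed score used as threshold."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    return [(t, *error_rates(target_scores, nontarget_scores, t))
            for t in thresholds]

def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER: the operating point where |P_miss - P_fa| is smallest."""
    _, p_miss, p_fa = min(det_points(target_scores, nontarget_scores),
                          key=lambda p: abs(p[1] - p[2]))
    return (p_miss + p_fa) / 2

# Hypothetical scores of trials whose segment class equals the target
# (target trials) or differs from it (non-target trials).
target_scores = [2.1, 1.3, 0.9, 0.4, -0.2]
nontarget_scores = [0.6, 0.1, -0.3, -0.8, -1.5]

eer = equal_error_rate(target_scores, nontarget_scores)
```

Raising the threshold lowers the false alarm rate at the price of more misses, which is the movement along the curve described above.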

Decision Cost Function

Weighted sum of misses and false alarms using variable costs and priors:

$$C(E_T, E_N) = C_{\mathrm{Miss}} \cdot P_{\mathrm{Target}} \cdot P_{\mathrm{Miss}}(E_T) + C_{\mathrm{FA}} \cdot (1 - P_{\mathrm{Target}}) \cdot P_{\mathrm{FA}}(E_T, E_N)$$

where $E_T$ and $E_N$ are the target and non-target events, and $C_{\mathrm{Miss}}$, $C_{\mathrm{FA}}$ and $P_{\mathrm{Target}}$ are application model parameters.

Application model parameters are selected according to the application.

The application parameters for EER are: $C_{\mathrm{Miss}} = C_{\mathrm{FA}} = 1$ and $P_{\mathrm{Target}} = 0.5$
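The cost function translates directly into code. The sketch below uses the EER parameters from the slide as defaults; the error rates, the event names, and the averaging over all target/non-target pairs are illustrative assumptions.

```python
def detection_cost(p_miss, p_fa, c_miss=1.0, c_fa=1.0, p_target=0.5):
    """C = C_Miss * P_Target * P_Miss + C_FA * (1 - P_Target) * P_FA."""
    return c_miss * p_target * p_miss + c_fa * (1.0 - p_target) * p_fa

def average_cost(p_miss_by_target, p_fa_by_pair, **params):
    """Average the pairwise costs over all (target, non-target) pairs."""
    costs = [detection_cost(p_miss_by_target[target], p_fa, **params)
             for (target, _nontarget), p_fa in p_fa_by_pair.items()]
    return sum(costs) / len(costs)

# Hypothetical error rates for two events.
p_miss = {"siren": 0.10, "phone": 0.20}
p_fa = {("siren", "phone"): 0.04, ("phone", "siren"): 0.08}

cost_single = detection_cost(0.10, 0.04)   # one (target, non-target) pair
cost_avg = average_cost(p_miss, p_fa)      # mean over both pairs
```

With the default parameters, misses and false alarms contribute equally; an application that cannot tolerate false alarms would raise `c_fa` or lower `p_target`.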

Example DET-Plot

[Figure: DET plot, miss probability versus false alarm probability.]

Christian Müller, Joan-Isaac Biel, Edward Kim, and Daniel Rosario, “Speech-overlapped Acoustic Event Detection for Automotive Applications,” in Proceedings of the Interspeech 2008, Brisbane, Australia, 2008.

Example Cost Chart

Costs C(At, An); columns: target class At, rows: non-target class An

An \ At          C       YF      YM      AF      AM      SF      SM
C                --      0.220   0.092   0.145   0.083   0.133   0.069
YF               0.166   --      0.081   0.201   0.080   0.198   0.070
YM               0.076   0.084   --      0.130   0.203   0.108   0.188
AF               0.088   0.161   0.110   --      0.095   0.219   0.082
AM               0.064   0.083   0.254   0.139   --      0.105   0.228
SF               0.096   0.150   0.100   0.249   0.091   --      0.095
SM               0.065   0.085   0.238   0.117   0.246   0.118   --
Avg Cost (At)    0.092   0.130   0.146   0.164   0.133   0.147   0.122

Avg Cost: 0.133

Acoustic GMM/SVM Supervector system on 7-class age task
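The summary rows of the cost chart can be reproduced from the pairwise costs. The sketch below assumes that "Avg Cost (At)" is the mean of each column (one target class, averaged over all non-target classes) and that the overall average is the mean of these per-target values; the numbers in the chart are consistent with this reading up to rounding.

```python
LABELS = ["C", "YF", "YM", "AF", "AM", "SF", "SM"]

# Rows: non-target class An; columns: target class At; None marks At == An.
COSTS = [
    [None, 0.220, 0.092, 0.145, 0.083, 0.133, 0.069],
    [0.166, None, 0.081, 0.201, 0.080, 0.198, 0.070],
    [0.076, 0.084, None, 0.130, 0.203, 0.108, 0.188],
    [0.088, 0.161, 0.110, None, 0.095, 0.219, 0.082],
    [0.064, 0.083, 0.254, 0.139, None, 0.105, 0.228],
    [0.096, 0.150, 0.100, 0.249, 0.091, None, 0.095],
    [0.065, 0.085, 0.238, 0.117, 0.246, 0.118, None],
]

def column_average(col):
    """Mean cost of one target class over all non-target classes."""
    values = [row[col] for row in COSTS if row[col] is not None]
    return sum(values) / len(values)

avg_per_target = {label: column_average(i) for i, label in enumerate(LABELS)}
avg_cost = sum(avg_per_target.values()) / len(avg_per_target)
```

Large entries cluster within the same gender (for example AM versus YM and SM), which shows where the system confuses neighboring age classes.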

Pseudo NIST Evaluation Procedure

ERL provided development and evaluation data as representative as possible for the application.
Three months before the evaluation, ICSI was provided with the development data.
At a pre-determined date, the blind evaluation data was provided to ICSI for processing.
The system's output was submitted to ERL in NIST format.
ERL downloaded the scoring software from NIST’s website and made the modifications required by the changes in the labels.
ERL ran the software on the submitted system output.
The results were then disclosed to ICSI along with the keys (truth) for further analysis.

--> Fair “one-shot” evaluation, no parameter tuning on the evaluation set.

How can you normalize your features in order to obtain a uniform scale and a uniform distribution?

Proposing rank normalization and polynomial rank normalization

Background

Fundamental frequency (pitch): 75-200 Hz
Jitter: 0.001324 PPQ
--> implicit feature weighting

Mean/Variance Normalization

$$a_i = \frac{v_i - \min(v_i)}{\max(v_i) - \min(v_i)}$$

Result: uniform scale, but non-uniform distribution.

[Figure: distribution of the normalized feature.]
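The formula above can be tried out on two features with very different raw scales. The values are hypothetical, chosen to resemble pitch in Hz and a jitter measure with one outlier.

```python
def min_max_normalize(values):
    """Scale values linearly to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("constant feature cannot be scaled")
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical raw features on very different scales.
pitch_hz = [75.0, 110.0, 120.0, 125.0, 200.0]
jitter_ppq = [0.0010, 0.0012, 0.0013, 0.0014, 0.0100]

pitch_norm = min_max_normalize(pitch_hz)
jitter_norm = min_max_normalize(jitter_ppq)
```

Both features now share the same scale, but the distribution is still uneven: four of the five jitter values are squeezed into the lowest 5% of the interval because a single large value defines the maximum.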

Rank Normalization

Create an ordered list of values using background (bg) data.
rank = position in list / number of values
No occurrence is mapped to 0.

Example lookup table built from the background model:

feature   value   normalized feature
0101      0       0
0101      0.01    0.25
0101      0.06    0.5
0101      0.13    0.75
0101      0.29    1
...
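The lookup-table procedure can be sketched as follows, using the five background values from the example table. Spreading the ranks evenly from 0 to 1, collapsing duplicate background values, clamping values outside the background range, and interpolating linearly between table entries are assumptions made to keep the sketch short.

```python
from bisect import bisect_left

def build_rank_table(background_values):
    """Sorted distinct background values with ranks spread evenly over [0, 1]."""
    values = sorted(set(background_values))
    if len(values) < 2:
        raise ValueError("need at least two distinct background values")
    ranks = [i / (len(values) - 1) for i in range(len(values))]
    return values, ranks

def rank_normalize(x, table):
    """Look up the rank of x; interpolate linearly between table entries."""
    values, ranks = table
    if x <= values[0]:
        return 0.0
    if x >= values[-1]:
        return 1.0
    hi = bisect_left(values, x)
    if values[hi] == x:
        return ranks[hi]
    lo = hi - 1
    frac = (x - values[lo]) / (values[hi] - values[lo])
    return ranks[lo] + frac * (ranks[hi] - ranks[lo])

table = build_rank_table([0.13, 0.0, 0.29, 0.01, 0.06])
seen = [rank_normalize(v, table) for v in [0.0, 0.01, 0.06, 0.13, 0.29]]
unseen = rank_normalize(0.035, table)
```

The values seen in the background data map to evenly spaced ranks, which is what produces the uniform distribution; a value that was not seen falls between the ranks of its neighbors.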

Rank Normalization

(+) uniform distribution
(-) large three-dimensional lookup tables
(-) linear interpolation for unseen values

What about larger values? Smaller values?

[Figure: feature distribution before and after rank normalization.]

Polynomial Rank Normalization

Use ranks to train a polynomial.
Apply the polynomial instead of look-up tables.

Better interpolation.
No need to store look-up tables.

Christian Müller and Joan-Isaac Biel. The ICSI 2007 Language Recognition System. In Proceedings of the Odyssey 2008 Workshop on Speaker and Language Recognition. Stellenbosch, South Africa, 2008
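One way to train such a polynomial is an ordinary least-squares fit of rank against feature value, sketched below without external libraries. The polynomial degree, the fitting method, and the clipping of the output to [0, 1] are assumptions; the slides do not specify them.

```python
def polyfit(xs, ys, degree):
    """Least-squares polynomial coefficients (lowest order first)."""
    n = degree + 1
    # Normal equations A c = b with A[i][j] = sum x^(i+j), b[i] = sum y * x^i.
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for j in range(col, n):
                a[row][j] -= factor * a[col][j]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        rest = sum(a[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (b[i] - rest) / a[i][i]
    return coeffs

def poly_rank(x, coeffs):
    """Evaluate the fitted polynomial and clip the result to [0, 1]."""
    y = sum(c * x ** i for i, c in enumerate(coeffs))
    return min(1.0, max(0.0, y))

# Hypothetical background values with evenly spaced ranks.
values = [0.0, 0.01, 0.06, 0.13, 0.29]
ranks = [i / (len(values) - 1) for i in range(len(values))]
coeffs = polyfit(values, ranks, degree=2)
normalized = [poly_rank(v, coeffs) for v in values]
```

Only the few polynomial coefficients per feature have to be stored, and unseen values are handled by evaluating the polynomial rather than by searching a table.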

Conclusions

Speech as a source of information for non-intrusive user modeling

Speech/signal processing take-away messages

Thank you!
