Generative Modeling and Classification of
Dialogs by Low-Level Features
Marco Cristani, Anna Pesarin, Alessandro Tavano, Carlo Drioli, Alessandro Perina, Vittorio Murino
Goal
• To model and classify dyadic conversational audio situations
• The situations are characterized by:
  – the kind of subjects involved (adults, children)
  – a predominant mood (flat or arguing discussion)
• Examples
Goal (2)
• Our guidelines for the modeling are:
  – to exploit the conversational turn-taking
  – not to model the content of the conversations (too difficult)
• Our contribution
  – A novel kind of feature (the Steady Conversational Periods, SCP) + a very simple generative framework
• In practice…
  – We are able to finely characterize the turn-taking, also encoding the timing of the turns
Introduction – Social signalling
• Our aim can be cast as a social signalling problem
• Social signals [Vinciarelli et al. 2008]
  – the expression of one’s attitude towards social situation and interplay
  – manifested through a multiplicity of non-verbal behavioural cues (facial expressions, gestures, and vocal outbursts)
• Social signalling
  – recent formalization
[Figure: Social Psychology, Pattern Recognition, Social Signalling]
Introduction (2) – social signals
• Bricks for social signals [Vinciarelli et al. 2008]
[Figure: the building blocks of social signals, with the one labelled "OUR FOCUS" highlighted]
Introduction (3) - Definitions
• A taxonomy for the social signals
  – behavioural/social cues (or thin slices of behaviour)
    • a set of temporal changes in neuromuscular and physiological activity that last for short intervals of time (milliseconds to minutes)
  – social signals (or social behaviours)
    • multiple behavioural cues
    • attitudes towards others or specific social situations that can last minutes to hours
Introduction (5) – Turn taking
• Turn taking
  – includes the regulation of the conversations, and the coordination (or the lack of it) during the speaker transitions
Introduction (6) – Turn taking examples
• Turn-taking
  – coordination
  – timed coordination
    • more interesting
[Figure: turn-taking examples labelled "Yes" and "No"]
Our approach - preliminaries
• Turn taking in a statistical way: Markov chaining
• Ergodic Markov model of states
[Figure: first-order Markov chain over the states … S_{t-1}, S_t, S_{t+1} …, for time steps 1, …, T]
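As a minimal sketch of what "Markov chaining" means for a single speaker, the snippet below estimates a first-order transition matrix by counting consecutive state pairs. The function name `estimate_transitions`, the toy sequence, and the uniform fallback for unseen states are illustrative assumptions, not part of the original slides.

```python
from collections import Counter


def estimate_transitions(sequence, states):
    """Maximum-likelihood transition matrix of a first-order Markov chain,
    obtained by counting consecutive state pairs in `sequence`."""
    counts = Counter(zip(sequence, sequence[1:]))
    matrix = {}
    for a in states:
        total = sum(counts[(a, b)] for b in states)
        # Assumption: a state with no observed outgoing transition
        # falls back to a uniform row.
        matrix[a] = {
            b: (counts[(a, b)] / total if total else 1.0 / len(states))
            for b in states
        }
    return matrix


# Toy speech (T) / silence (S) sequence, purely illustrative.
toy = list("TTTTSSTTTS")
A = estimate_transitions(toy, states=("S", "T"))
```

Each row of `A` is the distribution of the next state given the current one, so every row sums to 1.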
Our approach (2) - Markov structures
• Markov chaining for multiple agents: connections
• The core of the model is the transition probability (c,d=1,2)
[Figure: the two chains 1S_{t-1}, 1S_t, 1S_{t+1}, … and 2S_{t-1}, 2S_t, 2S_{t+1}, …, showing single-process states and joint-process states over time steps 1, …, T]
• Problem: computational burden
  – for C processes, the joint states give transition matrices of O(N^C x N^C), where N is the number of states of each single process
Our approach (3) – Markov relaxations
• High-order Markov models [Meyn 2005]
• each single process chooses its next state independently of the other single process(es)
  – reasonable!
  – O(N^C x N) space complexity, still hard to deal with
[Figure: the two chains 1S_{t-1}, 1S_t, 1S_{t+1}, … and 2S_{t-1}, 2S_t, 2S_{t+1}, …]
Our approach (4) – Influence model
• Mixed Memory processes, (Observed) Influence model (OIM) [Saul et al. 99, Asavathiratham 2000]
  – each single process chooses its next state without considering the joint effect of the whole system at the previous time step
  – instead, pairwise state dependencies plus influence factors {θ} are introduced
[Figure: the two chains 1S_{t-1}, 1S_t, 1S_{t+1}, … and 2S_{t-1}, 2S_t, 2S_{t+1}, …]
Our approach (5) – Influence model
• We have a weighted convex combination of probabilities
  – intra-chain transition: self-influence
  – inter-chain transition: other’s influence
[Figure: intra-chain and inter-chain transitions between 1S_{t-1}, 1S_t and 2S_{t-1}, 2S_t]
• Transition tables of O(CN^2) + influence matrix θ of O(C^2)
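The convex combination can be sketched as follows, using the usual mixed-memory form in which the next-state distribution of chain c is the θ-weighted sum of pairwise transition rows, one per chain d. The data layout (`tables[d][c]`, `theta[c][d]`), the 0-based state indices, and all numeric values are assumptions chosen for illustration only.

```python
def oim_next_state_dist(prev_states, tables, theta, c):
    """Next-state distribution of chain `c` in a mixed-memory / influence model.

    prev_states[d] : previous state index of chain d
    tables[d][c]   : N x N row-stochastic matrix,
                     P(state of c at t | state of d at t-1)
    theta[c][d]    : influence of chain d on chain c (each row sums to 1)
    """
    n_states = len(tables[c][c])
    dist = [0.0] * n_states
    for d, s_prev in enumerate(prev_states):
        row = tables[d][c][s_prev]
        for j in range(n_states):
            dist[j] += theta[c][d] * row[j]
    return dist


# Toy two-chain, two-state example (all numbers are made up).
intra1 = [[0.9, 0.1], [0.2, 0.8]]   # chain 1 -> chain 1
intra2 = [[0.6, 0.4], [0.3, 0.7]]   # chain 2 -> chain 2
inter12 = [[0.5, 0.5], [0.1, 0.9]]  # chain 1 -> chain 2
inter21 = [[0.4, 0.6], [0.7, 0.3]]  # chain 2 -> chain 1
tables = [[intra1, inter12], [inter21, intra2]]
theta = [[0.8, 0.2], [0.3, 0.7]]

dist_chain1 = oim_next_state_dist((0, 1), tables, theta, c=0)
```

Because every row of `theta` and of each table sums to 1, the result is itself a valid probability distribution, which is what makes the combination convex.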
Our approach (6) - Setting
• We focused on two-person conversations
  – The conversation yields a pair of synchronized audio signals sampled at 44100 Hz
  – NO source separation issues (see later)
  – short-term energies of the speech signals were computed on frames of 10 msec
  – speech (T)/silence (S) classification via k-means
[Figure: the two signals labelled frame by frame (10 msec) as speech (T) or silence (S)]
• How to instantiate the (Observed) Influence Model?
  – at each frame (10 msec) (no inter-chain transitions are depicted for clarity)
  – OUTPUT
    • we have more self-transitions than actual state changes
    • the parameters of the Markov chains are not informative (highly diagonal)
    • the length of the speech/silence segments is lost due to the first-order dependence
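The frame-level labelling described above can be sketched in plain Python: mean energy per 10 msec frame, followed by a one-dimensional k-means with k=2. The function names, the initialisation at the minimum and maximum energy, and the synthetic signal are assumptions; the slides do not specify how the k-means step was configured.

```python
def short_term_energy(samples, sample_rate=44100, frame_ms=10):
    """Mean squared amplitude on consecutive, non-overlapping frames."""
    frame_len = int(sample_rate * frame_ms / 1000)  # 441 samples at 44100 Hz
    n_frames = len(samples) // frame_len
    return [
        sum(x * x for x in samples[i * frame_len:(i + 1) * frame_len]) / frame_len
        for i in range(n_frames)
    ]


def two_means_labels(energies, iterations=20):
    """1-D k-means with k=2; the higher-energy cluster is labelled speech (T)."""
    lo, hi = min(energies), max(energies)
    for _ in range(iterations):
        low = [e for e in energies if abs(e - lo) <= abs(e - hi)]
        high = [e for e in energies if abs(e - lo) > abs(e - hi)]
        if not high:  # degenerate case: all frames have the same energy
            break
        lo, hi = sum(low) / len(low), sum(high) / len(high)
    return "".join("S" if abs(e - lo) <= abs(e - hi) else "T" for e in energies)


# Synthetic signal: two quiet frames followed by three loud frames.
signal = [0.001] * 882 + [0.5] * 1323
labels = two_means_labels(short_term_energy(signal))
```

Running this on each speaker's channel gives the two synchronized T/S sequences that the rest of the approach works on.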
Our approach (7) – Choose a strategy
[Figure: the two frame-level speech (T)/silence (S) sequences, segmented at every change of either chain]
• Whenever a change in the system occurs, a new SCP begins, for each chain/process
  – OUTPUT
    • we have features that address the system’s changes
    • we introduce a synchronization
    • each SCP carries two pieces of information
      1. the SPEECH (T) – SILENCE (S) label
      2. the time length
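The segmentation rule can be sketched as below: a new SCP starts in both chains as soon as either chain changes label, so the two output lists always have the same number of periods. The function name `extract_scps` and the choice of measuring length in frames are assumptions for illustration.

```python
def extract_scps(chain1, chain2):
    """Split two synchronized frame-label sequences into Steady Conversational
    Periods. Returns, for each chain, a list of (label, length_in_frames)."""
    if len(chain1) != len(chain2) or not chain1:
        raise ValueError("chains must be non-empty and of equal length")
    scps1, scps2 = [], []
    start = 0
    for t in range(1, len(chain1) + 1):
        at_end = t == len(chain1)
        # A change in EITHER chain closes the current period for BOTH chains.
        if at_end or chain1[t] != chain1[t - 1] or chain2[t] != chain2[t - 1]:
            scps1.append((chain1[start], t - start))
            scps2.append((chain2[start], t - start))
            start = t
    return scps1, scps2


# Toy example: speaker 1 stops at frame 3, speaker 2 starts at frame 5.
scps1, scps2 = extract_scps("TTTSSSSS", "SSSSSTTT")
```

Note that a chain can produce two consecutive SCPs with the same label (here speaker 1 has two silence periods), which is exactly how the synchronization between the chains is obtained.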
Our approach (8) – Steady Conversational Periods
[Figure: the two frame-level speech (T)/silence (S) sequences split into SCPs; each SCP is a pair <label, time length>]
• How to exploit SCPs for Markov modelling?
  – By renaming the states
    • <1,S> → 1 | <1,T> → 2 | <2,S> → 3 | ….
  – Training an OIM: STATE SPACE EXPLOSION, SPARSITY!!!
[Figure: example SCPs <8,S> <4,S> <5,S> <3,S> <5,S> <5,S> <4,S> <8,T> <5,T> <3,T> <9,T> <9,T>, renamed to the states <15> <7> <9> <5> <9> <9> <7> <16> <10> <6> <18> <18>]
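The renaming shown above is consistent with a simple closed form: a silence SCP of length k maps to 2k-1 and a speech SCP of length k maps to 2k. The sketch below encodes that rule; the closed form is inferred from the examples on the slide (for instance <8,S> → 15 and <8,T> → 16), and the function name is hypothetical.

```python
def rename_state(length, label):
    """Map an SCP <length, label> to a single integer state:
    <1,S> -> 1, <1,T> -> 2, <2,S> -> 3, ..."""
    if label not in ("S", "T") or length < 1:
        raise ValueError("expected label 'S' or 'T' and a length >= 1")
    return 2 * length - (1 if label == "S" else 0)


renamed = [rename_state(k, lab) for k, lab in [(8, "S"), (4, "S"), (5, "T"), (9, "T")]]
```

Since every distinct length produces its own pair of states, the number of states grows with the longest observed period, which is the state-space explosion and sparsity problem noted above.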
Our approach (9) – Steady Conversational Periods
• We consider SCP histograms
  – Gaussian clustering
  – Maximum Likelihood (ML) labeling
Our approach (9) – SCP exploitation
• The state space decreases in size
[Figure: the renamed states <15> <7> <9> <5> <9> <9> <7> <16> <10> <6> <18> <18> become, after ML labeling, <2> <1> <1> <1> <1> <1> <1> <4> <3> <3> <4> <4>]
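A minimal sketch of the ML labeling step, assuming the Gaussian clustering has already produced a codebook of (mean, standard deviation) pairs for the SCP lengths of each label. The codebook values below are invented so that the result mirrors the mapping in the figure; they are not the parameters learned in the original work.

```python
import math


def gaussian_logpdf(x, mean, std):
    """Log-density of a univariate Gaussian."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)


def ml_label(length, label, codebook):
    """Return the codebook state whose Gaussian gives `length` the highest
    likelihood, among the components attached to the same S/T label."""
    candidates = codebook[label]
    return max(candidates, key=lambda s: gaussian_logpdf(length, *candidates[s]))


# Hypothetical codebook: state -> (mean, std) of the SCP length.
TOY_CODEBOOK = {
    "S": {1: (4.0, 1.0), 2: (8.0, 1.0)},   # short / long silence
    "T": {3: (4.0, 1.0), 4: (8.5, 1.0)},   # short / long speech
}

scps = [(8, "S"), (4, "S"), (5, "S"), (3, "S"), (8, "T"), (5, "T"), (3, "T"), (9, "T")]
compact = [ml_label(k, lab, TOY_CODEBOOK) for k, lab in scps]
```

With this kind of codebook the model only needs as many states as there are Gaussian components, regardless of how long the periods get.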
Our approach (10) – SCP exploitation
Our approach (11) – Classification
• At this point the pair of SCP state sequences is used to train the OIM λ, obtaining:
  – Two intra-chain matrices (by counting state occurrences)
    • they tell how each agent produces a set of SCP states
  – Two inter-chain matrices (by counting state occurrences)
    • they tell how each SCP state of one chain is conditioned on each state of the other chain
  – An influence matrix (by gradient ascent)
    • it tells how the two chains influence each other
Our approach (12) – Remarks
• IMPORTANT: the order in which the two sequences are evaluated by the system matters
[Figure: Agent 1 and Agent 2 influencing each other, with an example influence matrix over Ag.1 and Ag.2 containing the values 0.8, 0.2, 0.7, 0.3]
• Given an OIM, we can evaluate the likelihood
Our approach (13) - Classification
• Once a model Ψ={ϴ,λ} and a test dialog I (an ordered pair of arrays O1 and O2 composed of {S,T} symbols) are provided, we want the likelihood P(I| Ψ) = P(O1 , O2 | Ψ)
  1. SCPs are extracted
  2. SCP Gaussian labels are estimated from ϴ, yielding the two SCP state sequences (ϴ acts as a codebook)
  3. The final likelihood is estimated with the OIM λ
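The third step can be sketched as follows for two chains, assuming the usual influence-model factorization in which, given all previous states, each chain picks its next state from a θ-weighted mixture of pairwise transition rows. The initial-state distribution `initial`, the table layout, and the numbers are assumptions, since the slide's formula did not survive extraction.

```python
import math


def oim_log_likelihood(seq1, seq2, tables, theta, initial):
    """log P(seq1, seq2 | OIM) for two synchronized SCP state sequences.

    tables[d][c][i][j] : P(chain c in state j at t | chain d in state i at t-1)
    theta[c][d]        : influence of chain d on chain c
    initial[c][i]      : assumed probability that chain c starts in state i
    """
    if len(seq1) != len(seq2) or not seq1:
        raise ValueError("sequences must be non-empty and of equal length")
    seqs = (seq1, seq2)
    loglik = sum(math.log(initial[c][seqs[c][0]]) for c in range(2))
    for t in range(1, len(seq1)):
        for c in range(2):
            p = sum(
                theta[c][d] * tables[d][c][seqs[d][t - 1]][seqs[c][t]]
                for d in range(2)
            )
            loglik += math.log(p)
    return loglik


# Toy model with two states per chain (all numbers are made up).
tables = [
    [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.1, 0.9]]],
    [[[0.4, 0.6], [0.7, 0.3]], [[0.6, 0.4], [0.3, 0.7]]],
]
theta = [[0.8, 0.2], [0.3, 0.7]]
initial = [[0.5, 0.5], [0.5, 0.5]]
ll = oim_log_likelihood([0, 0], [1, 1], tables, theta, initial)
```

For classification, one such model would be trained per class and the test dialog assigned to the class whose model gives the highest log-likelihood. Swapping `seq1` and `seq2` generally changes the value, which is why the order of the two sequences has to be fixed.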
Experiments - preliminaries
• Twofold aim:
  1. how the statistical signature explains turn-taking
  2. how effective our model is in the classification task
1. Analysis of the model parameters: restricted dataset
  – 27 healthy subjects (10 males, 17 females)
  – two age groups:
    • 14 preschool children ranging from 4 to 6 years (so, 14 dialogs)
    • 13 adults ranging from 22 to 40 years (13 dialogs)
  – semi-structured dialogs (lasting about 10 minutes): an adult human operator asks the subject (child or adult) to talk about predetermined topics:
    • (school time/work, hobbies, friends, food, family)
Experiments (2) – Influence factors
• High self-influence:
  – different intra-chain sequences of speech/silence SCP states characterize the subjects
  – such sequences occur independently
• Low self-influence:
  – different intra-chain sequences of speech/silence SCP states characterize the subjects
  – such sequences occur coordinated in time
[Figure: influence diagrams and example SCP state sequences for the two cases]
Experiments (3) | adult-child conv.
INTRA-CHAIN MATRICES
  – The child shows a high tendency to converge to a short silence state
  – The moderator alternates from a state of silence to a speech state, either long or short, with high probability
Experiments (4) | adult-child conv.
INTER-CHAIN MATRICES
  – the child utters a sentence whenever the moderator speaks for a long time (he gets bored of the moderator…)
  – the moderator utters a sentence whenever the child remains silent for a long time (he encourages the child…)
Experiments (5) | adult-adult conv.
INTRA-CHAIN MATRICES
  – The subject tends to speak continuously
  – The moderator alternates from a state of silence to a speech state, either long or short, with high probability
Experiments (6) | adult-adult conv.
INTER-CHAIN MATRICES
  – the moderator interacts with the subject mostly by talking to him (either to ask questions or to stop him)
Experiments (7) - Classification
• Extended dataset (the restricted dataset plus new conversations):
  – We add conversations
    • 5 flat non-structured conversations
    • 9 disputes between adults (an operator provoked the argument, the other subject reacted naturally)
  – We gather three categories of dialogs
    1. Flat dialog between adults (18 samples)
    2. Flat dialog between a child and an adult (14 samples)
    3. Dispute (9 samples, only between adults)
  – We instantiate 4 classification tasks
    (A) flat vs dispute - (cat:1 vs cat:3);
    (B) flat vs dispute, general - ((cat:1 U cat:2) vs cat:3);
    (C) with vs without child - (cat:2 vs cat:1);
    (D) all vs all;
• Comparative strategies
  – SCP histograms (SCP)
    • normalized histogram of the SCPs (silence, speech) as signature
    • Bhattacharyya distance for the classification
  – Turn taking influence model (TTIM)
    • In practice, it is as if all the “SCPs” had the same duration [Basu et al. 01]
  – Mixture of Gaussians classifier on a set of acoustic cues (MOG) [Shriberg 98] [Fernandez et al. 02]:
    • pitch range measure (for the intonation)
    • “enrate” speech rate (articulation velocity)
    • spectral flatness measure (SFM)
    • drop-off of spectral energy above 1000 Hz (DO1000) for the emotion modelling
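For the SCP-histogram baseline, the comparison between two signatures can be sketched with the common logarithmic form of the Bhattacharyya distance, D = -ln Σ_i sqrt(p_i q_i). The slides do not say which variant was used, so this form, the function names, and the toy histograms are assumptions.

```python
import math


def normalize(hist):
    """Turn raw bin counts into a histogram that sums to 1."""
    total = sum(hist)
    if total <= 0:
        raise ValueError("histogram must have positive mass")
    return [h / total for h in hist]


def bhattacharyya_distance(p, q):
    """-ln of the Bhattacharyya coefficient of two normalized histograms."""
    bc = sum(math.sqrt(a * b) for a, b in zip(p, q))
    return -math.log(bc) if bc > 0 else math.inf


# Toy SCP-length histograms (counts per bin), purely illustrative.
p = normalize([2, 2])
q = normalize([4, 0])
d_pq = bhattacharyya_distance(p, q)
```

A test dialog would then be assigned to the class whose reference histogram is closest under this distance, for example in a nearest-neighbour fashion.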
Experiments (8) – Classification
Experiments (9) – Classification
• Results:
  (A) flat vs dispute - (cat:1 vs cat:3);
  (B) flat vs dispute, general - ((cat:1 U cat:2) vs cat:3);
  (C) with vs without child - (cat:2 vs cat:1);
  (D) all vs all;
• lower accuracy in task A
  – some flat conversations are misclassified
  – sometimes the timing of flat conversations is built by subjects who utter very short sentences, similar to a dispute
  – this behavior is captured by our model and disregarded by TTIM
  – SOLUTION: augment the features, not only SCPs!
Conclusions
• A novel way to model dialogs has been proposed
• The main contributions are
  – Steady Conversational Periods (SCP), as a way to synchronize a dialog, making a first-order Markov treatment feasible
  – The embedding of SCPs in an Observed Influence Model, resulting in a detailed way to describe the turn taking of a conversation
• The future improvements
  – From a methodological point of view
    • Inserting uncertainty in the SCP states, i.e., moving to a full Influence Model
    • Enriching the model with different prosodic features
  – From a practical point of view
    • Enlarge the data set
    • Try novel situations
Publications
• A. Pesarin, M. Cristani, V. Murino, C. Drioli and A. Perina, A statistical signature for automatic dialogue classification. In Proceedings of the International Conference on Pattern Recognition (ICPR 2008), Tampa, Florida.
• M. Cristani, A. Pesarin, C. Drioli, A. Tavano, A. Perina, V. Murino, Auditory Dialog Analysis and Understanding by Generative Modelling of Interactional Dynamics. In Proceedings of the Second IEEE Workshop on CVPR 2009 for Human Communicative Behavior Analysis.
• M. Cristani, A. Tavano, A. Pesarin, C. Drioli, A. Perina, V. Murino, Generative Modeling and Classification of Dialogs by Low-Level Features, submitted to Systems, Man, and Cybernetics: Part B (under review)
References
• [Vinciarelli et al. 2008] A. Vinciarelli, M. Pantic, H. Bourlard, and A. Pentland, “Social signal processing: state-of-the-art and future perspectives of an emerging domain,” in Proceedings of the 16th ACM International Conference on Multimedia (MM '08), 2008.
• [Choudhury et al. 2004] T. Choudhury and S. Basu, “Modeling conversational dynamics as a mixed memory Markov process,” in Proc. NIPS, 2004.
• [Meyn 2005] S. P. Meyn and R. L. Tweedie, Markov Chains and Stochastic Stability, 2005. Second edition to appear, Cambridge University Press, 2008.
• [Asavathiratham 2000] C. Asavathiratham, “A tractable representation for the dynamics of networked Markov chain,” Ph.D. dissertation, Dept. of ECS, MIT, 2000.
• [Saul et al. 99] L. Saul and M. Jordan, “Mixed memory Markov models: Decomposing complex stochastic processes as mixtures of simpler ones,” Machine Learning, vol. 37, no. 1, pp. 75–87, 1999.
• [Basu et al. 01] S. Basu, T. Choudhury, B. Clarkson, and A. Pentland, “Learning human interaction with the influence model,” MIT MediaLab, Tech. Rep. 539, 2001.
• [Shriberg 98] E. Shriberg, “Can prosody aid the automatic classification of dialog acts in conversational speech?” Language and Speech, vol. 41, no. 4, pp. 439–487, 1998.
• [Fernandez et al. 02] R. Fernandez and R. Picard, “Dialog act classification from prosodic features using support vector machines,” in Proc. of Speech Prosody, 2002.
Thanks!!!