Standalone on-line speech feature extractor and emotion recognizer with visual output

Erik Marchi, Florian Eyben, Björn Schuller

DELIVERABLE D3.3

Grant Agreement no.: 289021
Project acronym: ASC-Inclusion
Project title: Integrated Internet-Based Environment for Social Inclusion of Children with Autism Spectrum Conditions
Contractual date of delivery: 31 October 2012
Actual date of delivery: 31 October 2012
Deliverable number: D3.3
Deliverable title: Standalone on-line speech feature extractor and emotion recognizer with visual output
Type: Report, Public
Number of pages: 19
WP contributing to the deliverable: WP 3 (Voice Analysis)
Responsible for task: Björn Schuller (TUM), [email protected]
Author(s): Erik Marchi (TUM), [email protected]; Florian Eyben (TUM), [email protected]; Björn Schuller (TUM), [email protected]

Table of Contents

1. Introduction
2. Suitable parameters and descriptors
3. Features and classification evaluation
   3.1 Database of prototypical emotional utterances
   3.2 Tasks
   3.3 Features
   3.4 Setup and evaluation
   3.5 Results
4. Architecture
   4.1 Audio feature extraction engine
   4.2 API for transmission of acoustic parameters
5. Graphical User Interface
6. Conclusions and self-analysis
7. References


1. Introduction

The goal of WP3 is to implement and evaluate an on-line vocal expression analyser, showing children with Autism Spectrum Condition (ASC) how they can improve their context-dependent vocal emotional expressiveness. Based on the child's speech input recorded by a standard microphone, a set of emotionally relevant low- and mid-level acoustic features is extracted and visualised in real time, so that the child gets immediate feedback about how well the vocal expression of the recorded utterance fits a pre-recorded utterance or pre-defined parameters conveying the target emotion.

The first step in implementing such an analyser was to identify speech features for emotional expression perception, deriving potential descriptors of the affective state from the recorded prototypical utterances. For this reason, as a basis for data-driven analysis of speech features that are modulated by emotion, a number of sentences spoken in different emotions by multiple ASC children and typically developed children have been recorded [1].

A core component of the final speech analyser is the on-line feature extractor calculating the speech features specified in [2] in real time. Thus, the openSMILE [3] audio feature extraction engine has been integrated into the system and extended according to the relevant descriptors that are to be tracked for vocal expression analysis. Besides relevant low- and mid-level descriptors such as pitch, energy and duration of voiced segments, higher-level information concerning the affective state of the speaker is also extracted in real time. Hence, the implemented system is able to recognise basic emotional expressions (happy, sad, angry, surprised, afraid, proud, ashamed, calm), as well as dimensional representations of emotion and mental state including arousal and valence. A visual output is generated for recognised emotions and for tracked speech parameters.

This report first describes the suitable parameters and descriptors for emotion perception (Section 2). Then an overview of features and classification performance (Section 3) is given. Next we introduce the architecture (Section 4) and the graphical user interface (Section 5) of the ASC-Inclusion Voice Analyser prototype, before concluding the report (Section 6).

2. Suitable parameters and descriptors

Considering the outcome of the evaluations described in [2], [4] and [5], we defined potential features and descriptors that are relevant for emotion perception. Applying the TUM audio feature extractor openSMILE [3] (technical details are described in the openSMILE book, which can be downloaded from http://www.openaudio.eu), a large set of prosodic, spectral and voice quality low-level features has been extracted from the recorded utterances. Based on these features we derived further potential descriptors of the affective state. The outcome of the analysis is a consistent list of speech parameters (pitch, energy, duration) that is tracked by the vocal expression evaluation system.

The top features in the mentioned automatic ranking were quartiles and inter-quartile ranges of the 7th RASTA-filtered auditory filter-bank band (centre frequency of 807 Hz), the total utterance duration, the standard deviation of pitch, the deviation of the pitch contour from a linear regression line, and spectral skewness and kurtosis. These features support various previous findings from psychologists and clinicians that ASC children differ from typically developing children in terms of pitch variation. Further, we find that the distribution of the speech-modulated energy in the 800 Hz region seems to be of importance, as well as the shape of the spectrum, as described by the spectral skewness and kurtosis parameters.

Now the crucial task is to find out which of the features that proved to be most important in classification can be employed for giving explicit feedback to the children, in order to improve their ability to correctly indicate emotional states. Note that such explicit feedback can be accompanied by implicit feedback, for instance, by replaying specific parts of the utterance, or by re-synthesising specific parts, and thereby drawing attention to characteristics of specific features and parameters.

A deeper look was taken at individual features in order to make quantitative statements on the differences between the classes. We have chosen features modelling pitch, energy, and duration; such features can most likely be conveyed to the children rather easily by prompting them to speak louder or more quietly, to raise or lower pitch, or to speak faster or slower, resulting in a list of prosodic descriptors such as pitch, energy, and duration of voiced segments, shown in Table 1.

- Pitch – mean
- Pitch – standard deviation
- Log. energy – mean
- Log. energy – standard deviation
- Duration of voiced segments – mean
- Duration of voiced segments – standard deviation
- Utterance duration

Table 1: List of suitable parameters for emotion perception.
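For illustration, the parameters in Table 1 could be approximated offline with standard Python audio tooling. The following is only a rough sketch using librosa and numpy, not the openSMILE configuration actually used in the project; sampling rate, pitch range and the segmentation heuristic are illustrative assumptions.

```python
# Rough, illustrative approximation of the Table 1 parameters
# (not the project's openSMILE configuration).
import numpy as np
import librosa

def table1_parameters(wav_path):
    y, sr = librosa.load(wav_path, sr=16000, mono=True)

    # Pitch (F0) contour via pYIN; unvoiced frames come back as NaN.
    f0, voiced_flag, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr)
    f0_voiced = f0[~np.isnan(f0)]

    # Log energy per frame (RMS-based), with a small floor to avoid log(0).
    rms = librosa.feature.rms(y=y)[0]
    log_energy = np.log(rms + 1e-10)

    # Durations of contiguous voiced segments, in seconds.
    hop_s = 512 / sr  # librosa's default hop length
    segs, run = [], 0
    for v in np.concatenate([voiced_flag, [False]]):
        if v:
            run += 1
        elif run:
            segs.append(run * hop_s)
            run = 0
    segs = np.array(segs) if segs else np.array([0.0])

    return {
        "pitch_mean": float(np.mean(f0_voiced)) if f0_voiced.size else 0.0,
        "pitch_std": float(np.std(f0_voiced)) if f0_voiced.size else 0.0,
        "log_energy_mean": float(np.mean(log_energy)),
        "log_energy_std": float(np.std(log_energy)),
        "voiced_seg_dur_mean": float(np.mean(segs)),
        "voiced_seg_dur_std": float(np.std(segs)),
        "utterance_duration": len(y) / sr,
    }
```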

3. Features and classification evaluation

We are interested in classification as well as in analysing to what extent prosodic features are relevant when the child is expressing his or her emotional state. Furthermore, given that prosodic features such as energy, pitch, and duration are easier to show and to convey as feedback than spectral and cepstral features, the child can interact with and intuitively manipulate these parameters during the game. Prosodic features can be used both for automatic modelling and for demonstrating to the children how to employ them, and they will be used as consistent parameters for the corrective feedback that will be given to the children for improving the appropriateness of their emotional expressions. Based on the evaluation database [1], we produced models to be used for the standalone voice analysis system.

This section gives a brief description of the database (Section 3.1), then defines the classification tasks (Section 3.2), the feature sets, and the evaluation set-up (Sections 3.3 and 3.4). We then comment on the evaluation results (Section 3.5).


3.1 Database of prototypical emotional utterances

As an evaluation database for the recognition of emotions and for the analysis of speech features that are modulated by emotion, a data set of prototypical emotional utterances containing sentences spoken in Hebrew by a total of 20 children has been created. The focus group consists of nine children (8 male pupils and 1 female pupil) aged 6 to 12. All of these children were diagnosed with an autism spectrum condition by trained clinicians. 11 typically developed pupils (5 female and 6 male) aged 6 to 9 served as control group.

In order to limit the effort for the children, the experimental task was designed to focus on the six "basic" emotions except disgust (happy, sad, angry, surprised, afraid), plus three further mental states (ashamed, calm, proud) and neutral. The full speech material was not collected for every subject, since the task was new for the children and requires a strong sense of comfort and a high level of cooperation. In the focus group, two pupils were not recorded because they did not feel comfortable with the task, and three others were only partially recorded since they wanted to stop their participation. In the control group, one child did not feel comfortable with the task and was not recorded. Furthermore, some samples belonging to the control group were left out because of a high level of background noise.

Hence, the actual focus group consists of seven children (6 male and 1 female) aged 6 to 10 (mean = 8.1, standard deviation = 1.6). Three of them were diagnosed with Asperger Syndrome (AS) and the other four with High-Functioning (HF) autism spectrum disorder. The actual control group is composed of 10 typically developed children (5 male and 5 female) aged 5 to 9 (mean = 7.2, standard deviation = 1.8).

The database comprises 529 utterances with a total duration of 16m 24s and an average utterance length of 1.8s. 178 utterances contain emotional speech of children with ASC, with a total recording time of 7m 1s and an average utterance duration of 2.37s. Within this group, 90 and 88 utterances are produced by children with an Asperger Syndrome and a High-Functioning diagnosis, respectively. The remaining 351 utterances are produced by the control group, with a total duration of 9m 23s and an average utterance length of 1.61s. Further details are given in [1].

Group         | H  | SA | AN | SU | AF | P  | AS | C  | N  | Arousal -/+ | Valence -/+ | AS/HF | All
Focus group   | 30 | 21 | 20 | 21 | 18 | 21 | 17 | 14 | 16 | 67/111      | 76/102      | 88/90 | 178
Control group | 49 | 38 | 38 | 38 | 38 | 46 | 37 | 27 | 40 | 142/209     | 151/200     | -     | 351
Total         | 79 | 59 | 58 | 59 | 56 | 67 | 54 | 41 | 56 | 209/320     | 227/302     | -     | 529

Table 2: Number of utterances per emotion category (# Emotion), per binary arousal/valence class, per diagnosis, and overall (# All) for the two group sets. Emotion classes: happy (H), sad (SA), angry (AN), surprised (SU), afraid (AF), proud (P), ashamed (AS), calm (C), neutral (N). Diagnosis categories: Asperger Syndrome (AS), High-Functioning autism spectrum disorder (HF).

3.2 Tasks

Four emotion-related tasks were evaluated: emotion, valence, arousal, and emotion-against-neutral. The emotion task covers the recognition of the nine target classes (eight emotions plus "neutral"). We further evaluated the discrimination between high and low arousal, as well as between positive and negative valence. Additionally, we evaluate the emotion-against-neutral tasks in order to analyse the differences and discriminate between each of the eight emotions and the neutral state. The typicality task was performed on the full database, and the diagnosis task on the focus group. All emotion-related tasks (emotion, valence, arousal and emotion-against-neutral) were performed on the focus and control group subsets separately. The mapping of the emotion categories onto the binary arousal/valence labels is shown in Table 3; a detailed description of the number of instances belonging to the classes of each task per subset is given in Table 2.

Arousal – Low:      sad, ashamed, calm, neutral
Arousal – High:     happy, angry, surprised, afraid, proud
Valence – Negative: sad, angry, afraid, ashamed, neutral
Valence – Positive: happy, surprised, proud, calm, neutral

Table 3: Arousal and valence mapping for emotion categories.

3.3 Features

We grouped all features into three categories: spectral, such as functionals of the auditory spectrum at different frequency bands with or without RASTA filtering, the magnitude spectrum and Mel Frequency Cepstral Coefficients (MFCCs); voice quality, comprising functionals of jitter, shimmer and Harmonics-to-Noise Ratio (HNR); and prosodic, such as functionals of energy, loudness, duration, the fundamental frequency contour, voicing probability and zero-crossing rate. In the following sections we refer to the features using this taxonomy.

The experiments were conducted using four feature sets: IS12, IS12-CFS, IS12-IG and PROS. The IS12 feature set, from the INTERSPEECH 2012 Speaker Trait Challenge [6], contains 6128 features and is taken as the reference in our experiments. Next, we applied feature selection to IS12 using two methods: by considering the individual predictive ability of each feature using correlation-based selection (IS12-CFS), and by measuring the information gain (IS12-IG). While the former selects a variable number of features for each task (up to 140 features), for the latter we selected the best 15 features in order to obtain a set of equal size to our manually selected prosodic feature set, which also comprises 15 features.

The prosodic set (PROS) consists of statistical functionals of: energy, i.e. the sum of the auditory spectrum at different frequency bands (from 20 Hz to 8 kHz) and the root-mean-square signal frame energy; pitch, i.e. the fundamental frequency contour; and duration, modelling temporal aspects of F0 such as the F0 onset segment length. We applied the mean, standard deviation, 1st percentile and 99th percentile to energy and pitch, and only the mean and standard deviation to duration. As mentioned before, we chose these three prosodic low-level descriptors (energy, pitch and duration) with their basic functionals as the simplest prosodic parameters that can be easily conveyed to the children. They enable the child to manipulate them intuitively throughout the game, for instance, by modulating pitch in order to accomplish a simple task such as moving a graphical object to a target, or by increasing or decreasing energy in order to jump over an obstacle. Such intuitive and easy interaction can hardly be provided by spectral and cepstral features such as MFCCs. It can be expected that automatically selected features yield a better performance than pure prosodic features; however, the latter might be correlated to some extent with the automatically selected ones, and thus still be good candidates for our envisaged game. All features were extracted with openSMILE [3].

3.4 Setup and evaluation

Since all data sets are unbalanced (i.e. the classes are not equally represented in the data), the unweighted average recall (UAR) of the classes is used as the scoring metric. Adopting the Weka toolkit [7], Support Vector Machines (SVMs) with a linear kernel were trained with the Sequential Minimal Optimization (SMO) algorithm. SVMs were chosen as classifier since they are a well-known standard method for emotion recognition, capable of handling both high- and low-dimensional data. The SVMs were trained at different values of the complexity constant. To ensure speaker-independent evaluation, Leave-One-Speaker-Out (LOSO) cross-validation was performed. In order to balance the class distribution, we applied the Synthetic Minority Over-sampling Technique (SMOTE) in all evaluation experiments. Furthermore, we adopt speaker z-normalisation (SN), since it is known to improve performance in speech-related recognition tasks. With this method, the feature values are normalised to a mean of zero and a standard deviation of one for each speaker. For the typicality and diagnosis tasks, we do not apply speaker z-normalisation, since centring and scaling the feature space is not effective there: the phenomena of interest vary considerably in range across subjects, and normalisation flattens exactly those features that characterise the subject, pushing classification performance below chance level.

For each task, we first perform classification experiments using the four different feature sets, in order to evaluate performance over feature spaces of decreasing dimensionality. Then we analyse the selected feature sets, with a detailed description of the differences and similarities between the IS12-IG and PROS sets. For this, we compute the correlation between the features belonging to the two sets and use the average mean correlation coefficient r to quantify the level of correlation between the two sets with a single parameter. Note that we first compute the absolute values of the correlation coefficients r_ij and then calculate the mean, since we are interested in both decreasing and increasing linear relationships between the features. The goal of this analysis is to reveal whether and which prosodic features are relevant for each task, and which further prosodic functionals we should include in our manually selected feature set.

3.5 Results

This section presents the evaluation and feature analysis for the targeted tasks: typicality and diagnosis, emotion, arousal, valence, and emotion-against-neutral. For further details see [4] and [5].

Emotion related tasks

For emotion classification, we perform four different tasks: a 9-class emotion task, 2-class arousal and valence tasks, and the 2-class task "e vs. Neutral", with e ∈ {Happy, Surprised, Proud, Angry, Afraid, Calm, Sad, Ashamed}. All tasks were performed on the focus and on the control set separately.
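As a rough illustration of the evaluation protocol of Section 3.4, the sketch below combines per-speaker z-normalisation, SMOTE over-sampling of the training folds, a linear SVM, and UAR scoring in a leave-one-speaker-out loop. It uses scikit-learn and imbalanced-learn rather than Weka's SMO implementation, so it approximates rather than reproduces the setup; the feature matrix, labels and speaker IDs are assumed inputs.

```python
# Approximate LOSO evaluation with speaker z-normalisation, SMOTE and a
# linear SVM, scored by unweighted average recall (UAR). Mirrors the
# protocol of Section 3.4 only loosely (scikit-learn instead of Weka/SMO).
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.metrics import recall_score
from sklearn.svm import LinearSVC

def speaker_z_norm(X, speakers):
    """Normalise every feature to zero mean / unit variance per speaker."""
    Xn = np.empty_like(X, dtype=float)
    for s in np.unique(speakers):
        idx = speakers == s
        mu, sd = X[idx].mean(axis=0), X[idx].std(axis=0) + 1e-12
        Xn[idx] = (X[idx] - mu) / sd
    return Xn

def loso_uar(X, y, speakers, C=0.1):
    """Leave-one-speaker-out CV with SMOTE-balanced training folds; returns UAR."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    speakers = np.asarray(speakers)
    X = speaker_z_norm(X, speakers)
    y_true, y_pred = [], []
    for s in np.unique(speakers):
        test = speakers == s
        # SMOTE needs enough minority samples per class in each training fold.
        X_tr, y_tr = SMOTE().fit_resample(X[~test], y[~test])
        clf = LinearSVC(C=C).fit(X_tr, y_tr)
        y_true.extend(y[test])
        y_pred.extend(clf.predict(X[test]))
    # UAR = unweighted mean of the per-class recalls ("macro" recall).
    return recall_score(y_true, y_pred, average="macro")
```

In the actual experiments, the complexity constant C was varied and the best result over complexities is reported.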

In addition to the classification, we further analyse the differences between the feature sets employed in our experiments. We adopt the same strategy as described for typicality and diagnosis discrimination, showing the best results achieved over the different complexity values for the four feature sets with and without speaker z-normalisation (SN). Since speaker normalisation led to better performance on all tasks, we only show the speaker-normalised performance trends over feature spaces of decreasing dimensionality in Figure 1 (emotion, arousal, and valence).

Emotion 9-class problem

On the focus group, we observe the influence of speaker normalisation, which improves UAR by over 4%, 10% and 8% absolute for IS12, IS12-CFS and PROS, respectively. Applying the full set of features (IS12), we obtain up to 42.6% UAR; however, reducing the feature space led to the expected decrease in performance (cf. Figure 1a). The IS12-CFS set performs quite close to the baseline. It consists of spectral features (23), one voice quality feature, and four prosodic features related to voicing probability and energy, such as the sum of the auditory spectrum with and without RASTA filtering. We further compare IS12-IG and PROS; the average mean correlation coefficient r is 0.5, showing that the two feature sets are correlated to some extent. The IS12-IG set, in addition to 3 spectral features, comprises 12 energy features related to the sum of the auditory spectrum in different frequency bands. In particular, the 1st and 99th percentiles and the standard deviation of the sum of the auditory spectrum can be found in both feature sets, so the maximum absolute correlation coefficient is 1.0. Thus, in the focus group, the emotion task relies on prosodic features, in particular on energy features, leading to 23.9% and 28.9% UAR for IS12-IG and PROS, respectively. Furthermore, the prosodic feature set performs better than the automatically selected set, showing that even with such a small set of features, prosody can be relevant for this task.

On the control group set, UAR is improved by speaker normalisation by over 9%, 20% and 4% for the IS12, IS12-CFS and PROS sets, respectively. Applying the full set of features (IS12), we obtain up to 55.9% UAR; Figure 1a shows the performance trends over feature spaces of decreasing dimensionality. IS12-CFS comes close to the IS12 performance. IS12-CFS contains spectral features (33); only six prosodic features, such as RMS energy, the sum of the auditory spectrum and F0, are included. A more detailed comparison between PROS and IS12-IG shows that the average mean correlation coefficient is below 0.5. IS12-IG consists of 5 energy features (auditory spectrum) and 10 spectral features; in particular, the 99th percentile and the standard deviation of the sum of the auditory spectrum are used in both sets. The two feature sets lead to similar results: 16.6% and 18.8% UAR for IS12-IG and PROS.

Arousal 2-class problem

An increase in performance is obtained by speaker normalisation for all four feature sets and on both the focus and the control group. On the focus data set, UAR improves to up to 86% with IS12-CFS. As in the previous tasks, reducing the feature space led to a decrease in performance, as shown in Figure 1b. The IS12-CFS and the full feature set (IS12) perform similarly; the former consists of 94 features, comprising a significant number of spectral features (77) and only a few prosodic features, such as the F0 standard deviation, the first delta coefficient of the sum of the auditory spectrum, and further functionals of the root-mean-square energy. The two smaller feature sets (IS12-IG and PROS) yield quite similar performance, 81.4% and 78.8% UAR; this is corroborated by a medium average mean correlation coefficient. The IS12-IG set consists of energy features and 7 spectral features; in particular, the 99th percentile and the standard deviation of the sum of the auditory spectrum are also found in the PROS set.

On the control group subset, we obtain up to 90.0% UAR with the IS12-CFS set. The correlation-based selected feature set (IS12-CFS) and the full feature set (IS12) perform close to each other; IS12-CFS comprises 135 features, including a vast number of spectral features (114); only a few features are related to prosody, such as F0, root-mean-square energy, the sum of the auditory spectrum with and without RASTA filtering, voicing probability and zero-crossing rate. The IS12-IG and PROS sets perform similarly: 76.8% and 77.5% UAR. The average mean correlation coefficient is 0.44, showing that the two feature sets are also moderately correlated in the control group. The IS12-IG set again contains energy features related to the auditory spectrum and 9 spectral features. In addition to the 99th percentile and the standard deviation of the sum of the auditory spectrum, which are also found in the PROS set, we observe further functionals, such as the range, the percentile range and the mean of peak absolute values. Thus, in both groups, the arousal task can rely on prosodic features without losing performance.

Valence 2-class problem

On the focus group data set, speaker normalisation improves UAR by over 8%, 7%, and 5% absolute for IS12, IS12-CFS, and IS12-IG, respectively. Applying the full set of features (IS12), we obtain up to 82.1% UAR. IS12-CFS performs close to the baseline (IS12) and consists of 41 features, including spectral features and voice quality features such as jitter and shimmer, and only 4 prosodic features such as energy and F0. We observe a very low average mean correlation coefficient, meaning that the two feature sets are hardly correlated; in fact, IS12-IG comprises mainly voice quality (3) and spectral (10) features, with only two prosodic features related to F0.

On the control group set, UAR is improved by speaker normalisation by over 9% and 3% for the IS12 and IS12-CFS feature sets. We obtain up to 81.8% UAR using the IS12 set, but reducing the feature space led to a significant decrease in performance (cf. Figure 1c). IS12-CFS comprises 61 features, which are mainly spectral (58), with only 3 prosodic features related to RMS energy and F0. As for the focus group, the average mean correlation coefficient is low and, again, the predominance of spectral features (15) in IS12-IG is maintained. Thus, the valence task performs better with spectral and voice quality features than with prosodic features, achieving up to 72.0% and 64.4% UAR with the IS12-IG set for the focus and the control group, respectively.
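The average mean correlation coefficient r quoted throughout these results is, per Section 3.4, the mean of the absolute Pearson correlations over all cross-set feature pairs; the maximum absolute correlation indicates shared features. A minimal numpy sketch of these two statistics, assuming two feature matrices with the same instances in their rows, could look as follows.

```python
# Mean and maximum absolute Pearson correlation between two feature sets
# (Section 3.4). X_a, X_b: arrays of shape (n_instances, n_features_a/b).
import numpy as np

def abs_correlation_stats(X_a, X_b):
    """Return (mean, max) of |r_ij| over all cross-set feature pairs."""
    Za = (X_a - X_a.mean(axis=0)) / (X_a.std(axis=0) + 1e-12)
    Zb = (X_b - X_b.mean(axis=0)) / (X_b.std(axis=0) + 1e-12)
    corr = Za.T @ Zb / X_a.shape[0]          # (n_features_a, n_features_b)
    return float(np.mean(np.abs(corr))), float(np.max(np.abs(corr)))
```

A feature appearing in both sets (e.g. the 99th percentile of the sum of the auditory spectrum) yields a maximum absolute correlation of 1.0, as noted above for the focus-group emotion task.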

Summing up, together with the classification evaluation we analyse how prosodic features behave in the tasks. We focus on three prosodic low-level descriptors (energy, pitch and duration) with their basic functionals (mean, standard deviation, 1st percentile and 99th percentile), as these can be easily conveyed to the children and modified by them during the game.

Figure 1: Classification of emotion, arousal, and valence: mean and standard deviation of UAR, averaged over complexity values, for the four different feature sets with speaker normalisation.

For example, the child can modulate his or her pitch in order to reach a target, or has to increase or decrease energy to jump over an obstacle. Such intuitive and easy interaction would hardly be possible with spectral and cepstral features. Speaker normalisation increases performance for all emotion-related tasks, and this technique will also be adopted in the prototype of the ASC-Inclusion platform, since we will incrementally collect more speech material from the same subject throughout the game.

The caveat has to be made that this is a pilot study with a rather small number of cases per class; the results will be reviewed, and verified or falsified, with larger databases collected in the future. However, so far the results corroborate common wisdom, for instance, that prosody is more relevant for modelling arousal and less relevant for modelling valence. ASC children do seem to employ prosodic features, albeit in a different way. The correlation between the prosodic and the automatically selected feature sets is not very high, but not low either. Moreover, we can expect that by intentionally modulating and manipulating prosodic features, other acoustic parameters will change accordingly.

4. Architecture

A description of the data flow and the architecture of the on-line feature extractor is given in Section 4.1. Next we describe the adopted format for acoustic parameter transmission (Section 4.2).


4.1 Audio feature extraction engine

The openSMILE audio feature extraction engine, which has been integrated into the system and extended according to the relevant descriptors, provides a comprehensive cross-domain feature set, flexibility and extensibility, and incremental processing support. Its key features are as follows:

- Incremental processing, where data from an arbitrary input stream (file, sound card, etc.) is pushed through the processing chain sample by sample and frame by frame (see Figures 2, 3 and 4).
- Ring-buffer memory for features requiring temporal context and/or buffering, and for reusability of data, i.e. to avoid duplicate computation of data used by multiple feature extractors, such as FFT spectra (see Figure 4).
- Fast and lightweight algorithms carefully implemented in C/C++, with no third-party dependencies for the core functionality.
- Modular architecture, which allows for arbitrary feature combinations and easy addition of new feature extractor components by the community via a well-structured API and a run-time plug-in interface.
- Configuration of feature extractor parameters and component connections in a single configuration file.

Figure 2 shows the overall data-flow architecture of openSMILE, where the Data Memory is the central link between all Data Sources (components that write data from external sources to the data memory), Data Processors (components that read data from the data memory, modify it, and write it back), and Data Sinks (components that read data from the data memory and write it to external destinations such as files).

Figure 2: Architecture


The ring-buffer based incremental processing is illustrated in Figure 3. Three levels are present in this example setup: wave, frames, and pitch. A cWaveSource component writes samples to the 'wave' level. The write positions in the levels are indicated by the vertical arrows. A cFramer produces (non-overlapping) frames of size 3 from the wave samples and writes these frames to the 'frames' level. A cPitch component (simplified for the purpose of illustration) extracts pitch features from the frames and writes them to the 'pitch' level. Since all boxes in the plot contain values (= data), the buffers have been filled and the write pointers have been warped.

Figure 3: Incremental data flow in ring-buffer memories; the (red) arrow pointing in between the columns indicates the current write pointer.

Figure 4 shows the incremental processing for higher-order features. Functionals (max and min) over two (overlapping) frames of the pitch features are extracted and saved to the level 'func'. The size of the buffers must be adjusted to the size of the block a reader or writer reads/writes from/to the data memory at once. In the above example, the read block size of the functionals component would be 2, because it reads 2 pitch frames at once. The input level buffer of 'pitch' must therefore be at least 2 frames long, otherwise the functionals component will not be able to read a complete window from this level. openSMILE handles this adjustment of the buffer size automatically.
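Purely as an illustration of this incremental scheme (and not openSMILE's actual C++ component API), the following toy pipeline pushes samples through a wave, frames and pitch-like level using fixed-size ring buffers, then applies functionals over two pitch frames; all class names and level sizes are invented for the example.

```python
# Toy ring-buffer pipeline in the spirit of Figures 3 and 4: a source pushes
# samples into the 'wave' level, a framer produces non-overlapping frames of
# size 3, and a simple "pitch-like" processor derives one value per frame.
# All names are illustrative; this is not the openSMILE API.
from collections import deque

class Level:
    """A bounded ring buffer holding the data of one level."""
    def __init__(self, name, size):
        self.name = name
        self.buf = deque(maxlen=size)  # old entries are overwritten ("warped")

    def write(self, item):
        self.buf.append(item)

    def read_block(self, n):
        """Return the last n items, or None if not enough data is available yet."""
        return list(self.buf)[-n:] if len(self.buf) >= n else None

def run_pipeline(samples):
    wave = Level("wave", size=12)
    frames = Level("frames", size=4)
    pitch = Level("pitch", size=4)
    pending = []
    for s in samples:                      # incremental, sample by sample
        wave.write(s)
        pending.append(s)
        if len(pending) == 3:              # cFramer-like step: frame size 3
            frames.write(tuple(pending))
            pending.clear()
            frame = frames.read_block(1)[0]
            pitch.write(max(frame) - min(frame))   # stand-in "pitch" feature
            block = pitch.read_block(2)    # functionals over 2 pitch frames
            if block is not None:
                print("func level gets (min, max) =", (min(block), max(block)))

run_pipeline(range(12))
```

The point of the sketch is the buffering discipline: the functionals step can only run once its input level holds a full block of 2 frames, which is exactly the buffer-size constraint described above.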


Figure 4: Incremental data flow in ring-buffer memories; the (red) arrow pointing in between the columns indicates the current write pointer.

Besides exchanging data via the data memory, components in openSMILE can exchange data via a very simple messaging system, which is useful e.g. for classification results and turn start/end messages. To extend the capabilities of the system, further components have been implemented in order to send messages over the network in a client-server architecture. For ASC-Inclusion, a new networking layer was added to openSMILE. This includes a basic platform-independent socket API available for both Linux and Windows, a network message and data sender, and a remote control interface. With this interface, openSMILE can be run on live audio input on a backend system; output data can be streamed to a client over a standard TCP/IP network. This is required for easy integration with other components of the game. It is also required for the standalone demonstrator system, to avoid the overhead involved in linking GUI and plotting libraries into openSMILE, which is purely a command line tool. The network components enable the processing in the backend to be started, paused, resumed and stopped remotely from a client process. All features, and also ASR output and phoneme alignments, which are extracted from a live audio stream (from the computer's microphone), can be sent over the network to one or more client programs at the same time.

The openSMILE engine links to the Julius Large Vocabulary Continuous Speech Recognition (LVCSR) engine for decoding speech with Hidden Markov Models (HMMs). With HMMs trained on suitable children's speech, a speech recogniser was built which can align the spoken content of the child's utterance to a given ground-truth word and phoneme sequence. In the future, we will also investigate methods of detecting word insertions, deletions, or substitutions (parts where the spoken utterance differs from the ground truth, i.e. the utterance the child was supposed to speak) based on the log-likelihoods of the models.

4.2 API for transmission of acoustic parameters

The acoustic parameters described in Sections 2 and 3 need to be shared with other components. In the initial prototype, every recognition component provides its own feedback by implementing a prototype GUI to visualise its input (the acoustic parameters, in the case of the audio input component). However, in the final system a common visualisation component shall be used, and parts of the game will need access to the acoustic parameters, for example to compute scores.

The acoustic parameters are computed on several temporal levels. The lowest level is the frame level, i.e. acoustic low-level descriptors are computed from short overlapping frames of audio data (20-60 ms), sampled at a rate of 10 ms. Pitch, energy, loudness, and spectral parameters are examples of this category. The next level is constructed by applying statistical functionals to the low-level descriptor contours over windows whose length is either constant, typically 1-5 seconds, or dynamically corresponds to linguistically meaningful units such as words, phrases, or utterances. We refer to these levels as the constant-length and dynamic-length supra-segmental levels. The data rate required for the transmission of acoustic parameters decreases from the frame level to the supra-segmental levels. The dynamic-length supra-segmental level has no fixed rate at which data is sent, while the other two levels have constant rates. In order to manage the high data rate on the frame level and the constant-length supra-segmental level efficiently, we propose to send the data as a binary stream of packets with the following format:

Field                                   | Type
Data-type (which level, acoustic, etc.) | 32-bit unsigned integer
Time-stamp                              | 64-bit unsigned integer
Number of parameter values              | 32-bit unsigned integer
Data rate (period)                      | 32-bit float
Value 1                                 | 32-bit float
Value 2                                 | 32-bit float
…                                       | …
Value N                                 | 32-bit float

Table 4: Format of the proposed packet for sending acoustic parameters to other system components.
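A minimal sketch of how such a packet could be serialised and streamed over TCP is given below, using Python's struct and socket modules. The field order and widths follow Table 4, while the byte order, the data-type code, the host/port and the omission of the standardised container packet are assumptions not fixed by the deliverable.

```python
# Pack one acoustic-parameter packet as laid out in Table 4 and send it over
# TCP. Byte order (network/big-endian), the data-type code, host and port are
# assumptions; the standardized container packet mentioned in the text is
# omitted here.
import socket
import struct
import time

def pack_packet(data_type, values, period_s, timestamp_us=None):
    if timestamp_us is None:
        timestamp_us = int(time.time() * 1e6)
    # uint32 data-type, uint64 time-stamp, uint32 value count, float32 period
    header = struct.pack("!IQIf", data_type, timestamp_us, len(values), period_s)
    payload = struct.pack("!%df" % len(values), *values)  # N float32 values
    return header + payload

def send_frame_level(values, host="127.0.0.1", port=5000):
    """Send one frame-level packet (data_type=1 is an illustrative code)."""
    packet = pack_packet(data_type=1, values=values, period_s=0.010)
    with socket.create_connection((host, port)) as sock:
        sock.sendall(packet)

# Example: pitch, log-energy and loudness for one 10 ms frame.
# send_frame_level([212.5, -4.2, 0.31])
```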

The types/names of the N parameters sent are either hard-coded via a system-wide configuration file, or set up statically during a system initialisation phase. These packets must be embedded in standardised container packets, which contain header information such as packet type, sender and receiver component, and system-state meta-information. The values on the dynamic-length supra-segmental level can be sent either in the same format as described above, or in an XML-based format in which acoustic parameters are sent as key/value pairs. Emotion-related data (classifier results, etc.) will also be sent in an XML-based format using EmotionML.

5. Graphical User Interface

The graphical interface of the ASC-Inclusion Voice Analyser prototype consists of two parts: an upper part in which the system gives visual feedback on the recognised emotion, and a bottom part in which the parameters for emotion perception are tracked in real time (Figure 5).

Figure 5: Voice analysis standalone system

The upper part of the GUI concerns emotion recognition. It shows arousal and valence in a 2D plot by colouring the four quadrants depending on the recognition results (cf. Figure 6). Four colours are used to represent the combinations of high and low arousal with negative and positive valence:

- Green: high arousal and positive valence
- Light blue: low arousal and positive valence
- Blue: low arousal and negative valence
- Red: high arousal and negative valence

Figure 6: Colours used for showing arousal and valence in a 2D plot

In addition to arousal and valence, we envision showing feedback for further emotion category groups, namely Individual, Social and Self-conscious, according to the final list of emotions provided in [8]. For basic emotions we will also show the emotion name, while for the others we will simply colour the recognised category according to the arousal and valence colour mapping. For example, if the system recognises 'Sad', the arousal-valence plot will colour the bottom-left quadrant blue and the 'Basic' field will show a blue rectangle with 'Sad' written inside it.

Basic:          Happy, Sad, Afraid, Angry, Disgusted, Surprised
Individual:     Excited, Interested, Bored, Worried, Disappointed, Frustrated, Hurt
Social:         Kind, Jealous, Unfriendly, Joking, Sneaky
Self-conscious: Ashamed, Proud

Table 5: Emotion categories.
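As a compact illustration of the quadrant-colouring logic described above, the sketch below maps a recognised category to its display colour, following the colour scheme of Figure 6 and the category mapping of Table 3. The dictionary and function names are illustrative, and 'neutral' is left out because Table 3 lists it under both valence columns.

```python
# Map a recognised emotion to the colour of the arousal-valence quadrant it
# lights up, following Figure 6 (colours) and Table 3 (category mapping).
# 'neutral' is omitted on purpose (ambiguous valence in Table 3).
AROUSAL = {"sad": "low", "ashamed": "low", "calm": "low",
           "happy": "high", "angry": "high", "surprised": "high",
           "afraid": "high", "proud": "high"}
VALENCE = {"sad": "negative", "angry": "negative", "afraid": "negative",
           "ashamed": "negative", "happy": "positive", "surprised": "positive",
           "proud": "positive", "calm": "positive"}
QUADRANT_COLOUR = {("high", "positive"): "green",
                   ("low", "positive"): "light blue",
                   ("low", "negative"): "blue",
                   ("high", "negative"): "red"}

def quadrant_colour(emotion):
    """Colour of the quadrant that the GUI highlights for this emotion."""
    return QUADRANT_COLOUR[(AROUSAL[emotion], VALENCE[emotion])]

# Example from the text: 'sad' colours the bottom-left (low/negative) quadrant blue.
assert quadrant_colour("sad") == "blue"
```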

The bottom part of the GUI is dedicated to real-time parameter tracking. It shows energy and pitch over time; moreover, in the two bottom-right quadrants, speed and pitch standard deviation are shown as moving bars. The voice analysis standalone system has two modes of operation:

Free mode: the user is asked to speak freely while pitch, energy, speed and pitch variation are tracked over time and shown to the user (cf. Figure 5). In the example shown, the user expressed happiness and the arousal-valence plot was coloured green; accordingly, the Basic field shows a green rectangle indicating the recognised emotion.


Exercise mode: the system provides the user with a given audio file to listen to, and the user is invited to repeat and imitate it. In the first phase, while the user is listening to the audio sample, the audio parameters and descriptors are plotted in green in order to provide a visual target to achieve. Then the user starts to repeat and imitate the audio file; this time the visual feedback is shown in blue and indicates how far the user's vocal production is from the target utterance (cf. Figure 7).

Figure 7: Exercise mode; in green the target parameters' variation, in blue the user's parameter variation tracked over time.
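The 'distance' between the user's production and the target utterance is not yet defined by the project (see Section 6). Purely as an illustration of the kind of comparison the exercise mode implies, the sketch below compares the Table 1 parameter vectors of the two utterances with a scaled Euclidean distance; the scaling constants and all names are assumptions, not the project's measure.

```python
# Illustrative only: one conceivable way to compare the user's prosodic
# parameters (Table 1) with those of a pre-recorded target utterance.
# The actual distance measure is left open in the deliverable (Section 6).
import math

# Rough scales used to make the parameters comparable (assumed values).
SCALES = {"pitch_mean": 50.0, "pitch_std": 25.0,
          "log_energy_mean": 1.0, "log_energy_std": 0.5,
          "voiced_seg_dur_mean": 0.2, "voiced_seg_dur_std": 0.1,
          "utterance_duration": 1.0}

def prosodic_distance(user_params, target_params):
    """Scaled Euclidean distance between two Table 1 parameter dicts."""
    return math.sqrt(sum(
        ((user_params[k] - target_params[k]) / s) ** 2 for k, s in SCALES.items()))
```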


6. Conclusions and self-analysis

The models used for classification are trained on the database reported in [1], and additional models will be added as soon as new material is recorded. The corrective visual feedback given by the system in exercise mode is only a first attempt to show the children how they can manipulate low-level descriptors such as pitch, energy and duration; it is a first step towards what will be developed in the next project year.

In order to assess a child's performance in expressing emotions via speech, the extracted audio parameters have to be compared to the respective parameters extracted from pre-recorded prototypical utterances. Suitable measures for determining the 'distance' between the child's expression and the corresponding prototypical utterance (based on relevant features) have to be defined and calibrated. This 'distance' or 'difference' has to be visualised in an easily understandable way, and the child's motivation for minimising the deviation between their vocal expression and the expression conveyed in the prototypical utterance has to be ensured by framing the task as a game. The child's success in this game shall be tracked over a longer period of time in order to create an individual 'history', leading to a personal user profile that reveals specific difficulties, changes, and examples of vocal emotional expression. Corrective feedback regarding the appropriateness of the child's vocal expression will be provided based on contextual parameters provided by the central platform, the stored library of prototypical expressions, and the child's individual history. Feedback can be provided either visually or acoustically, e.g. by replaying both the prototypical and the recorded pitch patterns.


7. References

[1] E. Marchi, B. Schuller, S. Tal, S. Fridenzon, and O. Golan, "Database of prototypical emotional utterances," Public Deliverable D3.1, The ASC-Inclusion Project (EU-FP7/2007-2013 grant No. 289021), Feb. 2012.
[2] E. Marchi, F. Eyben, A. Batliner, and B. Schuller, "Specification of suitable parameters for emotion perception and definition of API for later integration," Public Deliverable D3.2, The ASC-Inclusion Project (EU-FP7/2007-2013 grant No. 289021), Apr. 2012.
[3] F. Eyben, M. Wöllmer, and B. Schuller, "openSMILE – The Munich Versatile and Fast Open-Source Audio Feature Extractor," in Proceedings of the 18th ACM International Conference on Multimedia (MM 2010), Florence, Italy, pp. 1459–1462, ACM, October 2010.
[4] E. Marchi, B. Schuller, A. Batliner, S. Fridenzon, S. Tal, and O. Golan, "Emotion in the speech of children with autism spectrum conditions: Prosody and everything else," in Proceedings of the 3rd Workshop on Child, Computer and Interaction (WOCCI 2012), Satellite Event of INTERSPEECH 2012, Portland, OR, ISCA, September 2012.
[5] E. Marchi, A. Batliner, B. Schuller, S. Fridenzon, S. Tal, and O. Golan, "Speech, emotion, age, language, task, and typicality: Trying to disentangle performance and feature relevance," in Proceedings of the First International Workshop on Wide Spectrum Social Signal Processing (WS³P 2012), held in conjunction with the ASE/IEEE International Conference on Social Computing (SocialCom 2012), Amsterdam, The Netherlands, IEEE, September 2012.
[6] B. Schuller, S. Steidl, A. Batliner, E. Nöth, A. Vinciarelli, F. Burkhardt, R. van Son, F. Weninger, F. Eyben, T. Bocklet, G. Mohammadi, and B. Weiss, "The INTERSPEECH 2012 Speaker Trait Challenge," in Proceedings of INTERSPEECH 2012, 13th Annual Conference of the International Speech Communication Association, Portland, OR, ISCA, September 2012.
[7] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, "The WEKA data mining software: An update," SIGKDD Explor. Newsl., vol. 11, pp. 10–18, Nov. 2009.
[8] H. O'Reilly, S. Baron-Cohen, K. Baron, N. Meir, C. Rotman, S. Fridenzon, S. Tal, D. Lundqvist, S. Berggren, and S. Boelte, "Scenarios definition and content," Confidential Deliverable D6.1, The ASC-Inclusion Project (EU-FP7/2007-2013 grant No. 289021), Aug. 2012.

