Speech Signal Processing
Milos Cernak

Outline:
Introduction
Speech synthesis signal processing (Analysis, Speech parameter generation, Re-synthesis)
Synthesis vocoders
Speech quality evaluation (Subjective listening tests, Objective)
Introduction
The VODER ("Voice Operation DEmonstratoR") of Homer Dudley, demonstrated at the Bell Laboratory exhibit at the 1939 New York World's Fair, was controlled using a keyboard and foot pedals.
We can say that these peripherals enabled control of the parameters of the vocoder behind the VODER, and the operator of the VODER was a "model" that generated the control sequence.
In the case of the VODER, the "model" that synthesized the speech parameters was a human. Current vocoders incorporate the modelling of the parameters. To distinguish them from historical vocoders, we will call them hereinafter synthesis vocoders.
Analysis - MGC features
Mel-generalised cepstral (MGC) features $c_{\alpha,\gamma}(m)$ are typically used in speech vocoding.

$$
H(z) = s_\gamma^{-1}\!\left(\sum_{m=0}^{M} c_{\alpha,\gamma}(m)\, z^{-m}\right) =
\begin{cases}
\left(1 + \gamma \sum_{m=1}^{M} c_{\alpha,\gamma}(m)\, z^{-m}\right)^{1/\gamma}, & -1 \le \gamma < 0 \\
\exp \sum_{m=1}^{M} c_{\alpha,\gamma}(m)\, z^{-m}, & \gamma = 0
\end{cases}
\tag{1}
$$

where $M$ is the analysis order.
Relation of α and γ
The variable $z^{-1}$ can be expressed as the first-order all-pass function

$$
\tilde{z}^{-1} = \frac{z^{-1} - \alpha}{1 - \alpha z^{-1}} \tag{2}
$$

where $\alpha$ is a warping factor.
For 16 kHz speech, $\alpha = 0.42$ gives a good approximation to the mel scale. The parameter $\gamma$ controls the representation accuracy of poles and zeros.
As the value of $\gamma$ approaches zero, the accuracy for spectral zeros increases at the expense of formant accuracy.
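The warped frequency implied by Eq. (2) is the phase response of the all-pass, $\tilde\omega = \omega + 2\arctan\!\big(\alpha\sin\omega/(1-\alpha\cos\omega)\big)$. A minimal Python sketch (function names are illustrative) compares this warping at $\alpha = 0.42$ with the mel scale for 16 kHz speech:

```python
import math

def warped_frequency(omega, alpha):
    """Phase response of the first-order all-pass of Eq. (2):
    omega_tilde = omega + 2*atan(alpha*sin(omega) / (1 - alpha*cos(omega)))."""
    return omega + 2.0 * math.atan(
        alpha * math.sin(omega) / (1.0 - alpha * math.cos(omega)))

def mel(f_hz):
    """Standard mel scale (O'Shaughnessy formula)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

fs, alpha = 16000.0, 0.42
for f in (500.0, 1000.0, 2000.0, 4000.0, 6000.0):
    omega = 2.0 * math.pi * f / fs                      # normalised angular frequency
    w_norm = warped_frequency(omega, alpha) / math.pi   # warped, scaled to [0, 1]
    m_norm = mel(f) / mel(fs / 2.0)                     # mel, scaled to [0, 1]
    print(f"{f:6.0f} Hz  warped={w_norm:.3f}  mel={m_norm:.3f}")
```

The two normalised curves track each other closely over the speech band, which is why $\alpha = 0.42$ is the conventional choice at 16 kHz.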
Relation of MGC to other analysis methods.
Figure: Relation of MGC to other analysis methods.
For more details and explanation, please see Phil's root cepstrum notes.
Speech parameter generation
As already mentioned, vocoders enable modelling of their parameters. The models are typically hidden Markov models (HMMs).
Then an additional algorithm is needed to calculate the speech parameters (static cepstra) from continuous-mixture HMMs with dynamic features.
The iterative MLPG algorithm does this. We will not explain it here, as it is a topic for sequential speech processing systems.
Re-synthesis
In mel-generalised cepstral analysis, $H(z)$ is modelled by a set of cepstral coefficients $c_{\alpha,\gamma}(m)$.
For re-synthesis, the parameter $\gamma$ is fixed to $-1/2$. This value balances good representation of both spectral poles and zeros.
Then, the synthesis filter is realised as a rational transfer function

$$
H(z) = \frac{1}{\{B(z)\}^2} \tag{3}
$$

where

$$
B(z) = 1 + \gamma \sum_{m=0}^{M} c_{\alpha,\gamma}(m)\, z^{-m}. \tag{4}
$$
Removing delay-free loops
To remove delay-free loops from $B(z)$, the synthesis filter is re-designed as

$$
B(z) = 1 + \gamma \sum_{m=1}^{M} b'_\gamma(m)\,\Phi_m(z) \tag{5}
$$

where

$$
\Phi_m(z) = \frac{(1-\alpha^2)\, z^{-1}}{1 - \alpha z^{-1}}\,\tilde{z}^{-(m-1)}, \quad m \ge 1, \tag{6}
$$

and the filter coefficients $b'_\gamma(m)$ are obtained using the recursive formula

$$
b'_\gamma(m) =
\begin{cases}
c_{\alpha,\gamma}(M), & m = M \\
c_{\alpha,\gamma}(m) - \alpha\, b'_\gamma(m+1), & 0 \le m < M.
\end{cases}
\tag{7}
$$
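The recursion of Eq. (7) runs backwards from $m = M$. A minimal Python sketch (the function name `mgc_to_b` is illustrative):

```python
def mgc_to_b(c, alpha):
    """Eq. (7): convert MGC coefficients c[0..M] into the MGLSA filter
    coefficients b'[0..M] by the backward recursion
        b'(M) = c(M)
        b'(m) = c(m) - alpha * b'(m+1),  0 <= m < M."""
    M = len(c) - 1
    b = [0.0] * (M + 1)
    b[M] = c[M]
    for m in range(M - 1, -1, -1):
        b[m] = c[m] - alpha * b[m + 1]
    return b

# With alpha = 0 (no warping) the coefficients are unchanged:
# mgc_to_b([1.0, 2.0, 3.0], 0.0) -> [1.0, 2.0, 3.0]
```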
A structure of MGLSA filter
Figure: A structure of the MGLSA filter $1/B(z)$. (If you're not familiar with this kind of diagram, the triangles are scalers/attenuators, i.e. multiply-by-constant, the plusses are adders, and the $z^{-1}$ boxes are 1-cycle delays.)
This synthesis filter is known in the literature as the Mel-Generalised Log Spectral Approximation (MGLSA) filter.
Mel-generalized cepstral vocoder (MGC)
The MGC vocoder is based on the analysis/re-synthesis framework introduced in the previous section. The main characteristics are:
- Uses a mixture of pulse train and white Gaussian noise for excitation source modelling.
- The pulse/noise model is straightforward.
- Produces characteristic "buzzy" sounds due to strong harmonics at higher frequencies.
- Typical parameters are $\alpha = 0.42$ and $\gamma = -1/3$.
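The pulse/noise excitation scheme can be sketched as follows; a hypothetical frame-wise F0 contour drives the switch between a pulse train (voiced) and white Gaussian noise (unvoiced). Function and parameter names are illustrative:

```python
import numpy as np

def pulse_noise_excitation(f0_per_frame, frame_len, fs=16000, seed=0):
    """Mixed excitation: a pulse train at F0 for voiced frames (f0 > 0),
    white Gaussian noise for unvoiced frames (f0 == 0)."""
    rng = np.random.default_rng(seed)
    out = np.zeros(len(f0_per_frame) * frame_len)
    next_pulse = 0.0                      # running pulse position (samples)
    for i, f0 in enumerate(f0_per_frame):
        start = i * frame_len
        if f0 > 0:                        # voiced: one impulse every fs/f0 samples
            period = fs / f0
            while next_pulse < start + frame_len:
                if next_pulse >= start:
                    out[int(next_pulse)] = np.sqrt(period)  # unit-power pulse train
                next_pulse += period
        else:                             # unvoiced: white Gaussian noise
            out[start:start + frame_len] = rng.standard_normal(frame_len)
            next_pulse = start + frame_len
    return out

# e.g. one voiced frame at F0 = 100 Hz followed by one unvoiced frame:
exc = pulse_noise_excitation([100.0, 0.0], frame_len=160)
```

Feeding such an excitation through the MGLSA filter is exactly what produces the "buzzy" artefact: the pulse train has strong harmonics all the way up to the Nyquist frequency.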
STRAIGHT-MGC - 1999
Figure: Hideki Kawahara, Professor, Wakayama University
http://www.wakayama-u.ac.jp/~kawahara/STRAIGHTadv/index_e.html
STRAIGHT: Speech Transformation and Representation based on Adaptive Interpolation of weiGHTed spectrogram
Figure: A block diagram of STRAIGHT vocoder
- Extracts the fundamental frequency F0.
- F0-adaptive spectral analysis. The aperiodicity measure is defined as the lower envelope (spectral valleys) normalised by the upper envelope (spectral peaks).
STRAIGHT: synthesis
Figure: STRAIGHT synthesis
Aperiodicity is used to weight the harmonic and noise components of the excitation; it removes the periodicity effects of the fundamental frequency when extracting the vocal tract spectral shape.
Glottal vocoder
- Uses a library of glottal pulses instead of a pulse train for voiced signals.
- The glottal excitation is synthesized through interpolating and concatenating natural glottal flow pulses.
- The excitation signal is further modified to reproduce the time-varying changes in the natural voice source.
- The excitation is analysed using Iterative Adaptive Inverse Filtering.
- Energy and harmonic-to-noise ratio are used for weighting the noise component.
- Available at http://www.helsinki.fi/speechsciences/synthesis/glott.html.
Deterministic plus Stochastic vocoder – 2012
- Uses MGC analysis/re-synthesis.
- Differs in the excitation modelling:
  1 Uses GCI-synchronous LP residual extraction.
  2 The deterministic component at the low frequencies is decomposed using PCA to obtain the first eigen-residual.
  3 The stochastic component is made of an energy envelope and an autoregressive model.
Harmonics plus Noise Model based vocoder - 2013
The previously described vocoders were based on source-filter decomposition and modelling.
A completely different approach uses sinusoidal/waveform decomposition.
The harmonic plus noise model (HNM) assumes the speech spectrum to be composed of two frequency bands: harmonic and noise. The bands are separated by the maximum voiced frequency (MVF).
HNM harmonic band analysis 1
The harmonic part, the lower band, is modelled as a sum of harmonics

$$
s_h(t) = \sum_{k=-L(t)}^{L(t)} A_k(t)\, \exp\big(jk\omega_0(t)t\big) \tag{8}
$$

where $L(t)$ denotes the number of harmonics, which depends on the fundamental frequency $\omega_0(t)$ and on the MVF $F_m(t)$.
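For a stationary frame, the real-valued counterpart of Eq. (8) can be sketched as below; the `amps` dictionary (harmonic index to amplitude and phase) and the function name are illustrative assumptions:

```python
import numpy as np

def harmonic_part(f0, fm, amps, duration, fs=16000):
    """Stationary-frame, real-valued version of Eq. (8):
    s_h(t) = sum_k A_k cos(k*w0*t + phi_k), where the number of
    harmonics L is limited by the maximum voiced frequency fm."""
    L = int(fm // f0)                     # harmonics fitting below the MVF
    t = np.arange(int(duration * fs)) / fs
    w0 = 2.0 * np.pi * f0
    sh = np.zeros_like(t)
    for k in range(1, L + 1):
        a, phi = amps.get(k, (0.0, 0.0))  # (amplitude, phase) per harmonic
        sh += a * np.cos(k * w0 * t + phi)
    return sh

# 10 ms of a 100 Hz voiced sound with MVF 450 Hz (so L = 4 harmonics):
sh = harmonic_part(100.0, 450.0, {1: (1.0, 0.0), 2: (0.5, 0.3)}, duration=0.01)
```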
HNM harmonic band analysis 2
The complex amplitudes $A_k(t)$ can take one of the following forms:

$$
\begin{aligned}
A_k(t) &= a_k(t_i) \\
A_k(t) &= a_k(t_i) + t\, b_k(t_i) \\
A_k(t) &= a_k(t_i) + t\, c_k(t_i) + t^2 d_k(t_i)
\end{aligned}
\tag{9}
$$

where $a_k(t_i)$, $b_k(t_i)$, $c_k(t_i)$ and $d_k(t_i)$ are complex numbers with constant phases, measured at the analysis time instants $t_i$.
A simple stationary harmonic model using the first definition of $A_k(t)$, referred to as HNM1, is capable of generating speech perceptually indistinguishable from the original.
HNM noise band analysis
The modulated noise forms the upper band. Most important is the specification of noise bursts (where the energy is localised). Therefore the noise part $s_n(t)$ is described as a time-varying autoregressive model $h(\tau, t)$ modulated by a parametric envelope $e(t)$:

$$
s_n(t) = e(t)\,[h(\tau, t) * b(t)] \tag{10}
$$

where $b(t)$ is white Gaussian noise.
Finally, the synthetic speech $s(t)$ is

$$
s(t) = s_h(t) + s_n(t). \tag{11}
$$
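A sketch of Eq. (10), with the simplifying assumption that the AR filter is held fixed over the segment (in the model $h(\tau, t)$ is time-varying); names are illustrative:

```python
import numpy as np

def noise_part(ar_coeffs, envelope, seed=0):
    """Eq. (10) sketch: white Gaussian noise b(t) passed through an
    all-pole (autoregressive) filter and modulated by the envelope e(t).
    y[n] = b[n] - sum_i a[i] * y[n-1-i]  (filter fixed over the segment)."""
    rng = np.random.default_rng(seed)
    b = rng.standard_normal(len(envelope))   # b(t): white Gaussian noise
    y = np.zeros_like(b)
    for n in range(len(b)):
        acc = b[n]
        for i in range(min(len(ar_coeffs), n)):
            acc -= ar_coeffs[i] * y[n - 1 - i]
        y[n] = acc
    return envelope * y                      # e(t) [h * b](t)

# Eq. (11): a synthetic frame is then the sum of the harmonic and noise bands.
```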
HNM based vocoder
The HNM based vocoder thus:
- Decomposes the speech frames into a harmonic part and a stochastic part using:
  1 MGC
  2 F0
  3 MVF
- Voiced frames: the full spectral envelope may be obtained by interpolating the amplitudes at the harmonics.
- Unvoiced frames: analysed with the fast Fourier transform.
- Available at aholab.ehu.es/ahocoder/index.html
Speech quality evaluation
In the context of the last lectures about parametric speech, i.e., the speech analysis/re-synthesis methods, one may be interested in evaluating the speech quality degradation that the methods introduce.
We distinguish:
1 Subjective evaluation: asking people about the evaluated stimuli. It is costly and time consuming.
2 Objective evaluation: using computers instead. It is cheaper and faster, but the quality depends on the test.
Classification of speech quality evaluation methods
1 Conversational quality: the quality aspects of the conversation; it is a rare test.
2 Talking quality: echo, delay and sidetone distortion.
3 Listening quality: typically measures a single quality dimension such as:
  - intelligibility
  - naturalness
  - listening effort
Subjective listening tests
The subjective listening tests differ mainly in whether a reference signal is used.
1 Non-reference based tests follow absolute category rating (ACR) procedures.
2 Reference based tests are called degradation category rating (DCR) tests.
Both of the following MOS and DMOS tests are standardised by ITU-T.
Mean Opinion Score (MOS)
In an ACR test, a group of listeners rates the listening quality of the stimuli (speech examples). The quality is rated on a 5-level scale:
1 Bad,
2 Poor,
3 Fair,
4 Good,
5 Excellent,
and the average of all scores represents the speech quality metric called the mean opinion score (MOS).
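Computing the MOS is a plain average; in practice one also reports a confidence interval over the listener panel. A sketch with made-up ratings (the function name is illustrative):

```python
import math

def mean_opinion_score(ratings):
    """MOS: the average of ACR ratings on the 1 (Bad) .. 5 (Excellent)
    scale, with a normal-approximation 95% confidence interval."""
    n = len(ratings)
    mos = sum(ratings) / n
    var = sum((r - mos) ** 2 for r in ratings) / (n - 1)  # sample variance
    ci95 = 1.96 * math.sqrt(var / n)
    return mos, ci95

# Made-up ratings from 8 listeners for one stimulus:
mos, ci = mean_opinion_score([4, 3, 4, 5, 4, 3, 4, 4])
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```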
Degradation Mean Opinion Score (DMOS)
Sometimes the resolution of the MOS is not sufficient. It can be increased by a reference based DCR test.
Here the listeners first listen to the original (source) speech signal and then rate the degradation of the speech quality of the processed (modified) speech signal. The degradation is rated on the 5-level impairment scale:
1 Very annoying,
2 Annoying,
3 Slightly annoying,
4 Audible but not annoying,
5 Inaudible.
The average of all scores represents the speech quality metric called the degradation mean opinion score (DMOS).
ABX
If one wants to test the listeners' reliability as well, there is the so-called ABX test.
The listeners are provided with three speech examples, A, B, and X, and asked which of A/B is identical to X. As the signal X is a known reference, the ABX test also belongs to the DCR procedures.
The ABX test is suitable for rating small degradations using a continuous impairment scale, and expert (trained) listeners should be used.
Objective
1 Similarly to subjective listening tests, reference based tests are called intrusive.
2 Non-reference based tests are called non-intrusive.
Spectral distortion
A widely accepted objective measure is a frequency-domain measure, the gain-normalised spectral distortion (SD). The SD measure evaluates autoregressive spectra $P_{xy}(n,k)$

$$
P^R_{xy}(n, k) = \left\langle R_{xy}(k),\ e^{-j2\pi nk/N} \right\rangle \tag{12}
$$

per frame $k$:

$$
d^k_{SD}(s, t) = \frac{1}{N} \sum_{n=0}^{N-1} \left[ 10 \log_{10} \left( \frac{P^s_{xy}(n,k)}{P^t_{xy}(n,k)} \right) \right]^2 \tag{13}
$$

for the source signal $s$ and target signal $t$. The final measure, the global distortion, is the root-mean SD:

$$
d_{SD}(s, t) = \frac{1}{K} \sqrt{\sum_{k=0}^{K-1} d^k_{SD}(s, t)} \tag{14}
$$

where $K$ is the total number of frames.
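Eqs. (13)-(14) can be sketched directly from per-frame power spectra (how the spectra $P_{xy}$ are obtained, e.g. from AR models as in Eq. (12), is left out; the function name is illustrative):

```python
import numpy as np

def spectral_distortion(P_s, P_t):
    """Gain-normalised spectral distortion, Eqs. (13)-(14).
    P_s, P_t: (K, N) arrays of per-frame power spectra of the source
    and target signals (K frames, N frequency bins)."""
    K, _ = P_s.shape
    log_ratio_db = 10.0 * np.log10(P_s / P_t)     # 10 log10(Ps/Pt) per bin
    d_frame = np.mean(log_ratio_db ** 2, axis=1)  # Eq. (13): per-frame SD
    return np.sqrt(np.sum(d_frame)) / K           # Eq. (14): root-mean SD

# Identical spectra give zero distortion; a constant 10 dB offset over
# K frames gives 10 / sqrt(K).
```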
Psycho-acoustically motivated measures
Many of the intrusive objective measures are psycho-acoustically motivated. The idea is to mimic human speech listening, so the methods implement two basic modules:
1 Auditory processing: employs a perceptual transform using Bark-scale frequency warping and subjective loudness conversion. The output is the auditory (nerve) excitation.
2 Cognitive mapping: extracts key information related to anomalies in the speech signal from the auditory excitation. This area is still not well understood.
Perceptual Evaluation of Speech Quality (PESQ)
Figure: Mimicking human quality assessment.
- The widely used Perceptual Evaluation of Speech Quality (PESQ) computes internal representations based on the auditory periphery of both the reference/source signal $s$ and the distorted/target signal $y$.
- The internal representations are compared to predict the speech quality degradation $Q$.
- It mimics the human brain, which probably compares these two entities during speech quality evaluation as well.
Perceptual Objective Listening Quality Assessment (POLQA)
A recent update of the PESQ measure is Perceptual Objective Listening Quality Assessment (POLQA):
- PESQ measures one-way distortion; effects related to two-way communication, such as delay and echo, are not reflected in the scores. POLQA handles signals with variable delays.
- PESQ was designed for narrow-band signals (3.4 kHz), and although there is a wide-band (7 kHz) extension, POLQA should perform better for wide-band signals.
POLQA enhancements
- POLQA in addition predicts an "idealised" reference signal, modelling listeners' expectations of an ideal signal.
- A reference signal with a low amount of recording noise and an identical degraded signal will therefore not be scored with the maximum score.
- When the uncertainty of the subjective scores is taken into account, a statistical metric called epsilon-insensitive RMSE (RMSE*) can be used (ITU-T P.1401 (07/2012)).
- Last but not least: PESQ is free, while a binary of POLQA costs 3500 CHF.