Acoustic Laboratory November 2013, Caseros, Buenos Aires Province, Argentina
EFFECT OF DISTORTION ON SPEAKER RECOGNITION IN
TELEPHONE LINE FREQUENCY BAND
AGUSTÍN Y. ARIAS 1
1 Universidad Nacional de Tres de Febrero, Buenos Aires, Argentina.
Abstract – This paper analyzes how harmonic distortion affects the correct recognition of speakers, with the
objective of contributing to the advance of forensic acoustics. To this end, the dynamics of different voices were
processed, gradually increasing the level of distortion, in order to simulate the actual quality of a telephone line.
Several subjective listening tests were performed, in which fifteen subjects had to listen to the processed voices and
decide which person each voice belongs to, comparing the distorted voices with undistorted reference voices. The
influence of the Distortion Level and the Total Harmonic Distortion (THD) of a telephone line simulator is analyzed.
In addition, vocal formants are studied using several sonograms to analyze the results of the tests. The procedures,
considerations and limitations of the test process are explained, and the conclusions and new lines of
investigation are described.
1. INTRODUCTION
The recognition of speakers is a broadly diversified
discipline, where various systems that perform this
task automatically or semi-automatically can be found.
These systems are used for various purposes, for
example: smart locks, access to private
computer networks, mobile smart devices, among
others. However, the objective of this study is directly
related to the field of Forensic Acoustics, where
the handling of computerized techniques by a skilled
expert plays a fundamental role in producing accurate
results and conclusions. Forensic Acoustics is one of
the most complex research environments of the
Scientific Police, mainly due to the multidisciplinary
nature of its different approaches to analysis and the
need to provide its experts with continuous high-level
training. While mastery of technology and of digital
applications for analysis, calculation or processing is
essential, the involvement of a team of experts
specializing in different perspectives of study is even
more indispensable [1].
In the forensic analysis of voices it is common to
work with signals of poor quality in terms of dynamic
range, bandwidth, phase distortion, and several types of
noise (ambient noise, digital noise, quantization noise,
modulation noise, electromagnetic noise), which are
mainly related to the limitations imposed by the
equipment and transmission lines (whether physical
lines or wireless links) [2]. All these factors are
potential sources of error in the results produced
by automatic speech recognition systems, so the
judgment and skill of the expert in charge of the
analysis are of fundamental importance.
Hirsch [3] carried out a study that analyzed the
performance of various automatic speaker recognition
systems using voice recordings filtered between 300
and 3400 Hz and contaminated with various kinds of
background noise (interior of cars, trains, restaurants,
among others). The SNR (Signal-to-Noise Ratio) of
each voice + background noise mixture was
gradually modified, and the results show that the
automatic speaker recognition systems are unreliable
when the background noise level is greater than or equal
to the level of the vocal registers. On the other hand, the
phase distortion produced by different elements of
telephone lines (cables, filters, transformers and
equalizers) was characterized by Steinberg [4] at
Bell Laboratories. Several telephone lines were
measured and the technological requirements were
stated. In addition, Wang and Lim [5] studied the
importance of phase distortion in the context of speech
enhancement through listening tests. They used
two analysis blocks in parallel to estimate the phase
and magnitude spectra using both clean and noisy
speech. By altering the amount of induced noise for
phase and magnitude estimation separately, the
structure can control the amount of phase
and magnitude distortion independently. Using this
structure and carrying out listening tests, they
concluded that it is unwarranted to make an effort to
estimate the phase from noisy speech more accurately.
However, the effects produced by the level of
distortion of a telephone line on a speaker recognition
task have not been fully investigated. The objective of
this research is to subjectively evaluate the difficulties
that a listener has when performing a speaker
recognition test using sound recordings with different
levels of harmonic distortion and elimination of
spectral information. This manipulation of the sound
recordings tries to simulate the behavior of low quality
telephone transmission lines with regard to the
transmission and reception of audio signals, such as in
police radio equipment and some mobile and
landline telephones. Such poor-quality sound
recordings are commonly used in forensic acoustics
and considerably hinder the correct
performance of automated or semi-automated
speaker recognition systems [6]. To quantify the
different levels of distortion, the Total Harmonic
Distortion (THD) parameter was chosen because it is
one of the most important technical specifications of
any equipment that handles audio signals.
A subjective speaker recognition test was designed
in order to study the effects of the THD. In the test,
each subject had to compare vocal records that had been
filtered (telephone line simulator) and distorted with
reference vocal records of high sound quality.
2. EXPERIMENTAL DESIGN
Firstly, a filtering process that simulates the
frequency response of a typical telephone system was
designed. For this, a 6th order Butterworth band pass
filter was used. The lower cutoff frequency was set to
300 Hz and the upper cutoff frequency was set to 2000
Hz, as can be seen in Figure 1 [7]. This filtering
process was also used to perform a preliminary
speaker recognition test with filtered signals without
distortion, in order to analyze the influence of that
process on the results obtained from the main test,
which uses the filtering + distortion processes. Two
sonograms corresponding to the word “secuestro” are
shown in Figure 2 to compare how the formants appear.
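As a rough illustration of the filter described above, the magnitude response of an ideal analog Butterworth band pass filter can be evaluated in closed form. The sketch below assumes that "6th order" refers to the overall band pass order (that is, a 3rd-order low-pass prototype); this interpretation, the function name and the test frequencies are assumptions, not details given in the text.

```python
import math

def butter_bandpass_gain_db(f_hz, f_low=300.0, f_high=2000.0, prototype_order=3):
    """Gain in dB of an ideal analog Butterworth band pass filter at f_hz.

    prototype_order=3 gives a 6th-order band pass (assumed interpretation).
    """
    f0_squared = f_low * f_high      # squared geometric centre frequency
    bandwidth = f_high - f_low
    # Low-pass-to-band-pass frequency mapping evaluated on the frequency axis
    omega_lp = abs(f_hz ** 2 - f0_squared) / (bandwidth * f_hz)
    power_gain = 1.0 / (1.0 + omega_lp ** (2 * prototype_order))
    return 10.0 * math.log10(power_gain)

for f in (100, 300, 775, 2000, 3400):
    print(f"{f:5d} Hz: {butter_bandpass_gain_db(f):7.2f} dB")
```

With these assumptions the gain is about -3 dB at both cutoff frequencies and close to 0 dB around the geometric centre of the band (about 775 Hz).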
Figure 1. Telephone-like band pass filter
The elimination of spectral content can be clearly
seen in the sonogram of the filtered voice.
Sonograms are a useful tool to analyze the presence of
formants. Formants are the natural resonances of the
vocal tract and appear as intensity peaks in the spectrum
of a sound, corresponding to the concentration of
energy that occurs at a particular frequency.
Technically, the formants are the frequency bands where
most of the sound energy is concentrated.
The energy of the first and second formants remains
intact, which greatly favors recognition, since those
are the most energetic formants that characterize the
presence of vowels in speech. As mentioned
above, the applied filter has a lower cutoff frequency
of 300 Hz, which does not interfere too much with the
main formant regions of the vowels, which are listed in
Table 1 [8].
Figure 2. Original and filtered voice register sonograms (panels: original voice register; filtered voice register)
Table 1. Main formant regions of vowels
Vowel   Main formant region
u       200 - 400 Hz
o       400 - 600 Hz
a       800 - 1200 Hz
e       400 - 600 and 2200 - 2600 Hz
Then, since the main subjective test was designed
to study the degree of accuracy in recognizing
speakers while gradually varying the distortion, four
levels of distortion were designed. The distortion system
used to perform this task is a VST plugin available in
Adobe Audition, which consists of a typical limiter
that allows the user to modify its transfer function. The
configurations of the distortion systems were chosen
as follows:
Distortion 1 (D1): Low distortion
  o Limiter: -20 dB
Distortion 2 (D2): Moderate distortion
  o Limiter: -30 dB
Distortion 3 (D3): High distortion
  o Limiter: -40 dB
Distortion 4 (D4): Very high distortion
  o Limiter: -50 dB
Figure 3 shows the transfer functions of these four
configurations.
Figure 3. Distortion systems transfer functions
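The exact transfer functions of the plugin are only given graphically in Figure 3. As a minimal sketch, assuming the limiter behaves as an ideal symmetric hard clipper whose threshold is expressed in dB relative to full scale, the four configurations could be approximated as follows (the function names and the toy input are hypothetical):

```python
import math

def db_to_linear(level_db):
    """Convert a level in dB (full scale = 1.0) to a linear amplitude."""
    return 10.0 ** (level_db / 20.0)

def hard_limit(samples, threshold_db):
    """Clip every sample to +/- the threshold (idealized limiter, assumed)."""
    limit = db_to_linear(threshold_db)
    return [max(-limit, min(limit, s)) for s in samples]

# Limiter thresholds of the four configurations described in the text
THRESHOLDS_DB = {"D1": -20.0, "D2": -30.0, "D3": -40.0, "D4": -50.0}

# One period of a full-scale sine used as a toy input signal
sine = [math.sin(2.0 * math.pi * n / 64) for n in range(64)]

for name, threshold in THRESHOLDS_DB.items():
    clipped = hard_limit(sine, threshold)
    share = sum(1 for a, b in zip(sine, clipped) if a != b) / len(sine)
    print(f"{name}: peak = {max(clipped):.5f}, clipped samples = {share:.0%}")
```

Under this assumption, lowering the threshold flattens a larger part of the waveform, which is what generates additional harmonic components.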
These distortion processes generate a large number
of harmonic components. The energy of those
harmonics increases as the distortion level increases.
This tends to impair the correct identification of formants,
which may hinder the recognition of
low-frequency signals (vowels). In Figure 4 the sonograms
of an original voice recording and of the same recording
with the D1 and D4 distortion levels are compared.
The main formants and the spectral
harmonics generated are visually distinguishable in the
D1 distortion sonogram, where the decrease in
energy level with increasing harmonic order
can also be seen. In the case of the D4 distortion level
exactly the opposite occurs. It is very difficult to
distinguish the formants because the energy level
difference between the first-order formant (first
resonance frequency of the vocal tract) and the first
harmonic is too small (about 2 dB). The same energy
level difference was found between upper harmonics.
In the case of the D1 distortion level, the differences
found are about 5.3 and 6.4 dB. The signal energy
above 3000 Hz is less than -85 dB, so it can be
neglected.
As it was mentioned above, it was decided to use
the Total Harmonic Distortion (THD) [9] to quantify
the four distortion levels. This parameter is calculated
according to Eq.1.
$$\mathrm{THD} = \frac{10^{H_{rms}/10}}{10^{IR_{rms}/10}} \times 100\ [\%] \qquad (1)$$
where Hrms is the root mean square value of all the
harmonics and IRrms is the root mean square value of
the original impulse in the impulse response of the
filter + distortion system, both expressed as levels in dB.
The usefulness of this descriptor is that it relates the
amount of harmonic energy generated by the distortion process
to the energy of the input signal (original impulse). The results obtained from the analysis of each level of
distortion are shown in Table 2.
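As a quick check of Eq. 1, the THD values can be recomputed from the two energy levels reported in Table 2. The following is a minimal sketch; the function and variable names are illustrative only.

```python
def thd_percent(harmonics_db, impulse_db):
    """THD in percent from the two energy levels in dB, following Eq. 1."""
    return 10.0 ** (harmonics_db / 10.0) / 10.0 ** (impulse_db / 10.0) * 100.0

# Levels taken from Table 2: (original impulse [dB], harmonics [dB])
LEVELS_DB = {
    "D1": (-54.1, -72.8),
    "D2": (-53.69, -66.2),
    "D3": (-52.51, -60.5),
    "D4": (-51.37, -54.3),
}

for name, (impulse_db, harmonics_db) in LEVELS_DB.items():
    print(f"{name}: THD = {thd_percent(harmonics_db, impulse_db):.2f} %")
```

Since both levels are in dB, Eq. 1 is equivalent to converting the level difference between harmonics and original impulse into a linear energy ratio.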
Figure 4. Sonograms comparison (panels: original voice register; D1 distorted voice register; D4 distorted voice register)
Table 2. THD [%] of each filter + distortion system
Distortion   Signal energy level [dB]         THD [%]
level        Original impulse   Harmonics
D1           -54.1              -72.8          1.35
D2           -53.69             -66.2          5.61
D3           -52.51             -60.5         15.89
D4           -51.37             -54.3         50.93
3. SIGNAL ACQUISITION AND PROCESSING
3.1. Voice signal recordings
Six different voices were recorded using high
quality recording equipment to obtain "clean" signals
with a high signal-to-noise ratio, a wide bandwidth and
low harmonic distortion. The following list details the
equipment and software employed:
Notebook POSITIVO BGH C570
M-Audio Fast Track audio interface
RODE NT 2A microphone
XLR-XLR cable
Sound level meter Svantek 959 + microphone calibrator
Microphone tripod
Adobe Audition®
o Sample rate: 44100 Hz
o Resolution: 16 bits
o Channel mode: Mono
The recordings were performed in a room with the
following acoustic characteristics:
Background Noise: 42 dBA.
Reverberation Time: 0.4 s.
Those values were measured with the sound level
meter. The background noise measurement time was 5
minutes, and the reverberation time T30 was obtained
by bursting a balloon and averaging the results of
the 500, 1000 and 2000 Hz octave bands.
In order to perform a test that represents the actual
working conditions of a forensic expert, five Spanish
words commonly used in forensic acoustics were
recorded by each person. Those words are: “Dinero”
(money), “Secuestro” (kidnapping), “Policía” (police),
“Rehén” (hostage) and “Bomba” (bomb).
Finally, to minimize the temporal characteristics
that each person has in their own speech, the speakers
had to make the recordings imitating the pronunciation of
a reference speech played back through headphones, which
contained the five words defined above with a specific
rhythm.
3.2. Signal processing and considerations
Once all the sound recordings of the six persons
who contributed their voices were obtained, several
stages of signal processing were carried out. The goal
of this step is to process some words of each speech
signal to simulate the poor sound quality produced by
certain telephone lines, as defined in
Section 2.
First, a process of "normalization" to 100% was
applied to each recorded word to equate the levels of
all recordings, avoiding loudness differences between
different words and persons.
From this point, only three words (Dinero –
Secuestro – Policía) were processed for each voice
recorded, while the other two words (Rehén – Bomba)
were not modified. This is because the subjective test
seeks to compare original signals with distorted signals,
but the words used as the original signals must be
different from the words used as the
distorted signals, in order to, again, avoid any temporal
characteristics of the speech (rhythm and rate of
speech, forms of pronunciation, particular accents,
among others) that may help a listener to distinguish
a particular speaker. These are the two important
considerations taken into account in the voice
recordings and processing to minimize the influence of
temporal characteristics.
Then, the four different levels of distortion were
applied. Because this process consists of a typical
limiter, the amplitude of the processed signals is
highly attenuated. For this reason another
“normalization” process was applied to raise the
loudness of the signals, but in this instance the
normalization was set to 60% because of the noisiness
generated by the distortion, which could cause temporary
hearing damage to the listener if the normalization
value were set higher than 60%.
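The two "normalization" steps can be sketched as a simple gain change. The sketch assumes that normalizing to a percentage means scaling the signal so that its peak equals that fraction of full scale; the text does not state whether peak or loudness normalization was used, and the sample values below are made up for illustration.

```python
def normalize_peak(samples, target=1.0):
    """Scale samples so that the largest absolute value equals target."""
    peak = max(abs(s) for s in samples)
    if peak == 0.0:
        return list(samples)          # silent input: nothing to scale
    gain = target / peak
    return [s * gain for s in samples]

# Hypothetical sample values standing in for one recorded word
word = [0.02, -0.05, 0.04, -0.01]

reference = normalize_peak(word, target=1.0)   # "100%" normalization
processed = normalize_peak(word, target=0.6)   # "60%" normalization

print(max(abs(s) for s in reference))
print(max(abs(s) for s in processed))
```

Note that the same gain is applied to every sample, so any background noise present in the signal is amplified together with the voice.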
Figure 5. Block diagram of the entire signal processing.
Figure 6. Impulse response of a filter + distortion system (labels: Log-Sine sweep, band pass filter, distortion system, inverse filter convolution; original impulse and 1st to 5th harmonics).
The entire process is shown in Figure 5. As it was
explained above, the words "Dinero", "Secuestro" and
"Policía" were normalized to 100%, then filtered,
distorted and once again normalized to 60%. The
words "Rehén" and "Bomba" were only normalized to
100% because they were used as the reference signals
in the test procedure.
To calculate the THD values, the impulse response
of each Filter + Distortion system was obtained by
feeding a Log-Sine sweep (between 100 and 4500 Hz)
to the input of the system and then convolving the
output signal with the inverse filter using the AURORA
plugins [10]. As shown in Figure 6, this process makes it
possible to separate in time the different harmonics
created by the distortion processes, and thereby
to compare the energy of those harmonics with
the energy of the original impulse.
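The time separation of the harmonics follows from the properties of the exponential (log) sine sweep described in [10]: after deconvolution, the response of the N-th order harmonic appears ahead of the linear response by T·ln(N)/ln(f2/f1), where T is the sweep duration. The sketch below generates such a sweep and lists those offsets; the sweep duration is not stated in the text, so the 10 s value and the function names are assumptions.

```python
import math

def log_sine_sweep(f_start, f_end, duration_s, sample_rate):
    """Exponential (log) sine sweep from f_start to f_end as a list of samples."""
    rate = math.log(f_end / f_start)
    scale = 2.0 * math.pi * f_start * duration_s / rate
    n_samples = int(duration_s * sample_rate)
    return [
        math.sin(scale * (math.exp(n / sample_rate / duration_s * rate) - 1.0))
        for n in range(n_samples)
    ]

def harmonic_advance_s(order, f_start, f_end, duration_s):
    """Time by which the order-th harmonic response precedes the linear one."""
    return duration_s * math.log(order) / math.log(f_end / f_start)

SWEEP_S = 10.0   # hypothetical duration; not given in the text
sweep = log_sine_sweep(100.0, 4500.0, SWEEP_S, 44100)

for order in range(2, 6):
    dt = harmonic_advance_s(order, 100.0, 4500.0, SWEEP_S)
    print(f"harmonic order {order}: {dt:.3f} s before the linear response")
```

The deconvolution itself (convolution with the inverse filter) is performed by the AURORA plugins in the text and is not reproduced here.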
4. TEST PROCEDURE
The subjective tests were performed by fifteen
subjects, ten men and five women, aged between 22 and
27 years. In all cases a pair of headphones was used
in order to minimize the background noise of the
rooms where the tests were performed, allowing the
subjects to concentrate on the listening without any
external interference.
In Figure 7 a typical test table is shown. The
subjects first had to listen to the “Reference 1” track,
which contains the two words that were not modified
(Rehén – Bomba) spoken by one person.
Then, they had to listen to the six “Voice” tracks (A, B,
C, D, E and F), each one containing the three distorted
words (Dinero – Secuestro – Policía), and decide which
one of the six “Voice” tracks corresponds to the same
person as the “Reference 1” track. This procedure was
repeated for the rest of the “Reference” tracks. There
were four tables like the one shown in Figure 7, each
one corresponding to a different level of distortion of
the “Voice” tracks. The first test is the one
corresponding to the Distortion D4, the second one
corresponds to the Distortion D3 and so on. The subjects
were allowed to repeat any track as many times as
necessary, and for that reason the time required to
complete the test varied from subject to subject. On
average it took twenty minutes to complete the test.
Figure 7. Test table for the Distortion D1.
5. RESULTS
The preliminary test (speaker recognition using
filtered voices) results are shown in Figure 8. The
results obtained show that ten of the twelve subjects
(83.33%) were able to recognize all voices, while the
remaining two subjects (16.66%) achieved four
recognitions.
Figure 8. Preliminary test. Amount of successful
recognitions
These results indicate that the filtering process
does not make a successful speaker recognition task
significantly more difficult. It is therefore considered
that the various degrees of difficulty in recognizing
speakers are closely related to the distortion
processes.
Then, regarding the main test, the amount of
successful recognitions that each subject has achieved
in the four tests was studied in order to analyze the
difficulty of recognizing speakers with the different
levels of distortion. The results are shown in Figure 9.
Figure 9. Amount of successful recognitions of each subject
for each distortion level
As can be seen, the results obtained indicate that
for the lower distortion levels there is a large number
of successful recognitions, as expected. Three cases
were found in which there was no recognition, two
corresponding to the Distortion D4 and the remaining
one to the Distortion D3. On the other hand, there were
seven cases in which the subject matched all the voices
correctly, all corresponding to the
minimum level of distortion. In general, the results
obtained for the distortion levels D4 and D3 are evenly
distributed between one and four successful
recognitions. It is not possible to achieve exactly five
successful recognitions: since each “Voice” track can be
matched to only one “Reference” track, if one comparison
is erroneous then inevitably at least one other is
erroneous too.
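The impossibility of exactly five successful recognitions can be checked by enumeration. The sketch below assumes that every answer sheet is a one-to-one assignment of the six “Voice” tracks to the six “Reference” tracks, and counts how many of the 720 possible sheets give each number of correct matches; the variable names are illustrative.

```python
from collections import Counter
from itertools import permutations

N_VOICES = 6

# For every possible one-to-one answer sheet, count the correct matches
hits = Counter(
    sum(1 for ref, voice in enumerate(answer) if ref == voice)
    for answer in permutations(range(N_VOICES))
)
total = sum(hits.values())

for k in range(N_VOICES + 1):
    print(f"{k} correct: {hits.get(k, 0):3d} of {total} answer sheets")

# Expected number of correct matches for a purely random answer sheet
mean_by_chance = sum(k * count for k, count in hits.items()) / total
print(f"mean by chance: {mean_by_chance:.2f}")
```

Under this assumption no answer sheet yields exactly five correct matches, and a purely random sheet yields one correct match on average, which can serve as a reference point when reading the mean values reported for each distortion level.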
Analyzing the results for each distortion level the
following behaviors were found:
In the case of Distortion D4 the recognition was
very poor, with a mean of 1.33 successful recognitions
(22.2%), while the maximum number was three (50%),
achieved by only one subject. Moreover, two cases in
which recognition was null (0%) were found.
Analyzing the distribution of the results, it is found
that twelve of the fifteen subjects (80% of the total
population) were able to recognize only one or two
voices.
On the other hand, in the test of Distortion D3 a
slight improvement was obtained,
with an average of two successful recognitions
(33.3%). Only in one case was it possible to recognize
four voices, which was the maximum number of
successes obtained (66.6%). There was also one case
of null recognition (0%). In this case the distribution
of results indicates that only five subjects (33.3%)
recognized three or four voices.
The Distortion D2 test results did not provide
significant improvements over the previous case. The
average of successful recognitions increased only to
2.53 (42.2%), and the maximum of four (66.6%),
achieved by three subjects, was not exceeded.
Unlike the previous cases, this time no subject failed
in all the recognitions, although only 46.6% of the
subjects managed to recognize three or more voices,
without exceeding the maximum value of four
successful recognitions.
Finally, the test of Distortion D1 gave the most
accurate results, with a mean of 4.87 successful
recognitions (81.1%). Fourteen subjects (93.3%)
were able to correctly identify more than three voices,
of whom seven (46.6%) could recognize all voices.
The mean values of successful recognitions and the
associated standard deviations are listed in Table 3.
Another interesting result obtained is the amount
of successful recognitions for each one of the six
voices. To perform this analysis the number of
subjects who were able to recognize each voice
individually in each of the four levels of distortion was
determined. The results obtained are shown in Figure
10.
Table 3. Mean and Standard Deviation values of successful recognitions for each distortion level
Distortion   Successful recognitions
level        Mean    Standard deviation
1            4.87    1.13
2            2.53    0.99
3            2.00    1.07
4            1.33    0.82
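The mean values can be cross-checked against the per-voice counts reported in Table 4, since the total number of correct matches is the same whether it is summed by subject or by voice. A minimal sketch (names are illustrative; the standard deviations cannot be recovered from these counts):

```python
N_SUBJECTS = 15
N_VOICES = 6

# Number of subjects who recognized Voices A to F at each level (Table 4)
COUNTS = {
    "D1": [15, 12, 11, 13, 12, 10],
    "D2": [10, 3, 5, 9, 4, 7],
    "D3": [8, 4, 5, 5, 3, 5],
    "D4": [5, 3, 3, 3, 2, 4],
}

def mean_recognitions(counts, n_subjects=N_SUBJECTS):
    """Average number of correct matches per subject."""
    return sum(counts) / n_subjects

for level, counts in COUNTS.items():
    mean = mean_recognitions(counts)
    print(f"{level}: mean = {mean:.2f} ({100.0 * mean / N_VOICES:.1f} %)")
```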
Figure 10. Amount of successful recognitions for each
one of the six voices
These results show that, for all levels of distortion,
the voice of person A (Voice A) was the one that
had the greatest number of successful recognitions.
Voice D also has a large number of successful
recognitions, especially in the case of D1, where
86.66% of the subjects were able to correctly identify
that person.
The statistical analysis (correlation coefficient,
standard error, coefficient of determination “R
square” and p-value at a 95% confidence level)
relating the distortion level to the number of
successful recognitions for each voice is listed in Table
4. The results obtained for Voices A, D and F
indicate that the number of successful recognitions
is closely related to the distortion
level. The p-values obtained for those voices are smaller
than 0.05, which confirms that both
variables are highly correlated.
Table 4. Distortion Level. Statistical analysis
Distortion level            Successful recognitions
                            Voice A  Voice B  Voice C  Voice D  Voice E  Voice F
1                           15       12       11       13       12       10
2                           10       3        5        9        4        7
3                           8        4        5        5        3        5
4                           5        3        3        3        2        4
Correlation                 0.983    0.770    0.894    0.990    0.875    0.976
Standard Error              0.424    1.523    0.849    0.346    1.212    0.316
Determination Coefficient   0.966    0.593    0.800    0.980    0.766    0.952
p-value                     0.017    0.229    0.105    0.010    0.124    0.024
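The correlation coefficients and p-values of Table 4 can be approximated with a few lines of code. The sketch assumes that the distortion level enters the analysis as the limiter threshold in dB (-20 to -50 dB), which reproduces the positive sign of the reported coefficients; this coding is an assumption, not something stated in the text. With only four data points the t-test on the correlation has two degrees of freedom, for which the Student t distribution has a closed-form CDF, so no statistics library is needed.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def p_value_two_sided_df2(r):
    """Two-sided p-value of the t-test on r, valid only for four data points.

    For 2 degrees of freedom the t CDF is F(t) = 1/2 + t / (2*sqrt(2 + t^2)).
    """
    if abs(r) >= 1.0:
        return 0.0
    t = r * math.sqrt(2.0) / math.sqrt(1.0 - r * r)
    return 1.0 - abs(t) / math.sqrt(2.0 + t * t)

LIMITER_DB = [-20.0, -30.0, -40.0, -50.0]     # D1..D4 (assumed coding)
VOICES = {
    "A": [15, 10, 8, 5], "B": [12, 3, 4, 3], "C": [11, 5, 5, 3],
    "D": [13, 9, 5, 3], "E": [12, 4, 3, 2], "F": [10, 7, 5, 4],
}

for voice, counts in VOICES.items():
    r = pearson_r(LIMITER_DB, counts)
    print(f"Voice {voice}: r = {r:.3f}, p = {p_value_two_sided_df2(r):.3f}")
```

With four points the p-value reduces to 1 - |r|, which is why only correlations above 0.95 in absolute value reach the 0.05 threshold.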
The same analysis was performed relating the
number of successful recognitions to the four calculated
THD values. The results are listed in Table 5. The
negative correlation indicates that the higher the THD,
the lower the number of successful recognitions. However,
none of the p-values allows the null hypothesis to be
rejected, so it would not be correct to affirm that the
number of successful recognitions of each voice is
correlated with the THD. For this reason, the THD
parameter is not useful to quantify the distortion level
of the designed distortion systems.
Table 5. THD. Statistical analysis
THD [%]                     Successful recognitions
                            Voice A  Voice B  Voice C  Voice D  Voice E  Voice F
1.35                        15       12       11       13       12       10
5.61                        10       3        5        9        4        7
15.89                       8        4        5        5        3        5
50.93                       5        3        3        3        2        4
Correlation                 -0.856   -0.532   -0.717   -0.845   -0.645   -0.804
Standard Error              0.068    0.116    0.076    0.074    0.110    0.049
Determination Coefficient   0.732    0.283    0.513    0.715    0.416    0.647
p-value                     0.144    0.468    0.283    0.154    0.354    0.195
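The remaining rows of Table 5 are consistent with an ordinary least-squares line fitted to the number of successful recognitions as a function of the THD, where the "Standard Error" is the standard error of the slope. The sketch below illustrates that reading for Voice A; the interpretation of the standard error and the function name are assumptions.

```python
import math

def linear_fit(x, y):
    """Least-squares line y = a + b*x: returns slope, slope standard error, R^2."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    r_squared = sxy * sxy / (sxx * syy)
    residual_ss = syy - slope * sxy            # sum of squared residuals
    slope_se = math.sqrt(residual_ss / (n - 2) / sxx)
    return slope, slope_se, r_squared

THD_PERCENT = [1.35, 5.61, 15.89, 50.93]       # D1..D4, from Table 2
VOICE_A = [15, 10, 8, 5]                       # successful recognitions

slope, slope_se, r_squared = linear_fit(THD_PERCENT, VOICE_A)
print(f"slope = {slope:.3f} recognitions per % THD")
print(f"standard error = {slope_se:.3f}, R^2 = {r_squared:.3f}")
```

The negative slope reflects the inverse relation between THD and successful recognitions, while the large standard error relative to the slope is in line with the non-significant p-values.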
Finally, a comparison between the correlation
coefficients of the Distortion Limit Level and THD is
shown in Table 6.
Table 6. Correlation coefficients comparison
          Successful recognitions   Correlation Coefficient
          D1   D2   D3   D4         Distortion level   THD
Voice A   5    9    13   1          0.98               -0.86
Voice B   5    5    11   1          0.77               -0.53
Voice C   4    3    12   1          0.89               -0.72
Voice D   8    10   15   1          0.99               -0.85
Voice E   3    4    12   1          0.88               -0.65
Voice F   5    7    10   1          0.98               -0.80
The correlation coefficients of the Distortion Level
parameter are greater in absolute value than those
obtained for the THD parameter for all the voices. The
correlation coefficients of the Distortion Level are
positive, which indicates a direct correlation between
the variables. In contrast, the correlation coefficients
of the THD parameter are negative, indicating an
inverse correlation between the variables.
However, the results show that the THD cannot be
used as an objective parameter for quantifying the
degree of accuracy when performing a speaker
recognition task. The THD only considers the energy of the
harmonics without considering the signal-to-noise
ratio, which decreases as the level of
distortion increases, producing variations in the degree
of noisiness perceived by the listener. The increase in
noise energy between syllables and words due to the
second normalization process at 60% worsens the
quality of the vocal registers because each person's voice
is masked by the noise level of the signal. Figure 11
shows two distorted voice signals (corresponding to
the word “secuestro”, as in the previous analysis) with
the same amplitude scale before applying the 60%
normalization process: the left one has the D4
distortion level applied and the right one has the D1 level.
The intervals of “silence” (actually the
background noise of the original recordings) between
syllables are highlighted, showing that those
intervals have the same amplitude regardless of
the distortion level applied, because the limiter threshold
(-50 dB for D4) is never reached. The noise
amplitude is of the same order as the voice amplitude
for the D4 signal, but it is much lower than the voice
amplitude for the D1 signal. Because of this, after applying
the 60% normalization process the amplitude of the noise
increases in proportion to the level of distortion, as
shown in Figure 12. This suggests that it is necessary
to study the influence of the signal-to-noise ratio
produced by the distortion process in conjunction with
the THD of the system, since the latter is not sufficient
to quantify the degree of successful recognitions.
Figure 11. Noise between syllables before the 60% normalization process (panels: D4, D1)
Figure 12. Noise between syllables after the 60% normalization process (panels: D4, D1)
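The growth of the noise with the distortion level can be illustrated with simple level arithmetic. The sketch assumes that, after limiting, the voice peaks sit at the limiter threshold and that the 60% normalization is a peak normalization relative to full scale; both are simplifying assumptions. Since the noise between syllables is not affected by the limiter, it is raised by the full normalization gain.

```python
import math

def normalization_gain_db(peak_db, target_fraction=0.6):
    """Gain in dB applied when a signal peaking at peak_db is normalized."""
    return 20.0 * math.log10(target_fraction) - peak_db

# Limiter thresholds of the lowest and highest distortion levels
for name, threshold_db in (("D1", -20.0), ("D4", -50.0)):
    gain_db = normalization_gain_db(threshold_db)
    print(f"{name}: noise between syllables raised by about {gain_db:.1f} dB")

extra_noise_db = normalization_gain_db(-50.0) - normalization_gain_db(-20.0)
print(f"D4 noise exceeds D1 noise by about {extra_noise_db:.0f} dB")
```

Under these assumptions the noise in the D4 signal ends up roughly 30 dB above the noise in the D1 signal for the same voice peak level, which is consistent with the masking described above.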
Finally, the largest number of successful
recognitions corresponds to Voice A, which
belongs to a person already known by the
subjects before performing the test. All subjects know
that person and have maintained a daily relationship with
him or her for over four years. This indicates that
familiarity with a particular voice may improve the
recognition task.
6. CONCLUSIONS
From the test results it was found that the chances
of successfully recognizing three of the six voices
employed are strongly correlated with the distortion
level. For the other three voices this conclusion cannot
be affirmed. There is therefore a need to increase the
number of subjects, to perform more tests and to obtain
more subjective data in order to affirm or refute the
hypothesis that the distortion level and the number of
successful recognitions have a strong correlation. The
THD parameter of the distortion systems is not useful
to predict how difficult it is to perform a successful
speaker recognition because there is not a strong
correlation between the THD and the number of
successful recognitions.
A visual analysis of the formants shown in a
sonogram leads to the conclusion that, as the distortion
level increases, discriminating the energy of the vowels
becomes more difficult. Spectrograms are a
valuable visual tool, but their usefulness is limited by
the dynamic quality of the records analyzed, as both
distortion and noise can impair the correct
interpretation of the graphs.
These findings were reflected in the tests, where
good results were achieved only with the minimum
distortion level. This may indicate that the subjects who
performed the test use some spectral information,
related mainly to the energy of the vowels, from the
original reference tracks and then try to
characterize the voices of the different persons with that
information. Then, listening to both the reference and
the distorted tracks, the subjects try to find some common
behavior related to the distribution of vowel energy
between speakers.
In the filtered signal test, a strong dependence
between the speaker recognition performance and the
elimination of spectral information using a telephone-
like filter could not be found.
As explained above, the test was designed to avoid
any kind of temporal information that may help the
recognition process. However, some subjects
commented after completing the tests that they
used some temporal characteristics related to the
pronunciation of the persons recorded to try to
recognize them, especially with the two highest levels
of distortion.
7. FUTURE WORK
One of the most important needs that emerges from
the results of this report is to increase the number of
subjects performing the test and to add more
distortion levels, which will increase the
resolution of the results. That is, the regression lines will
be calculated with more data and the correlation
coefficients will be more accurate. In addition, it is
necessary to evaluate the influence of the reduction of
the signal-to-noise ratio together with the THD variations
in order to design an optimal parameter that allows an
accurate quantification of the degree of difficulty
when a speaker recognition task is performed.
It would also be interesting to take the signals
used for the subjective tests and process them with
automatic speaker recognition systems, which would not
only make it possible to classify the degrees of precision
of such systems but also to compare their number of
successful recognitions with those obtained
subjectively. This would allow forensic experts to
determine to what extent they can rely on their
interpretation.
On the other hand, it is necessary to investigate and
develop new technologies that allow a clear
transmission of the human voice in telephone systems,
as this is the primary means of communication used by
kidnappers.
It is also necessary that all offenders under police
custody undergo a voice recording session in order to
store their vocal registers, which could potentially be used
if any of them re-offends.
For further research, it would also be interesting to
analyze the influence of the different levels of
distortion when both male and female voices are
compared, and also to study the influence of
previously known voices.
8. REFERENCES
[1] Delgado Romero C. “Técnicas digitales de
análisis audiovisual en acústica forense”, Actas del 3º
Congreso de Investigadores Audiovisuales. Vol. 1.
Del Laberinto. Madrid, Spain. November 1999.
[2] Lane C. “Phase distortion in telephone
apparatus”, Bell System Technical Journal. Vol. 2,
pp. 493-521. New York, US. May 1983.
[3] Hirsch H. “The Aurora experimental framework
for the performance evaluation on speech recognition
systems under noisy conditions”. VI International
Conference on Spoken Language Processing. Vol.4,
pp. 29-32. ICSLP-2000. Beijing, China. October 2000.
[4] Steinberg J. C. “Effects of phase distortion
telephone quality”. Bell System Technical Journal.
Vol. 2, pp. 550-555. New York, US. May 1983.
[5] Wang D., Lim J. “The unimportance of phase in
speech enhancement”. Acoustics, Speech and Signal
Processing. IEEE Transactions. Vol. 30, pp. 679– 681.
Cambridge, US. January 2003.
[6] Dominguez, S. “Estimated weight of evidence in
forensic sound for statistical inference of identity of
the speaker by Bayesian network application to
acoustic features”. Master’s thesis. pp. 1-5. Madrid,
Spain, October 2009
[7] Steinberg J. C. “Effects of phase distortion
telephone quality”. Bell System Technical Journal.
Vol. 2, pp. 555-566. New York, US. May 1983.
[8] Ladefoged P. “A Course in Phonetics”. Fort Worth
Harcourt Brace Jovanovich College Publishers. Vol. 1,
5th Ed. Boston, US. 2006.
[9] Bohn D. “Audio specifications”. Rane Note. Vol.
12, pp. 1-12. Washington DC, US. 2000
[10] Farina A. “Advancements in Impulse Response
Measurements by Sine Sweeps”. AES E-Library.
Parma, Italy. May 2007.
APPENDIX
Subjective Test
Acoustic Laboratory Name: _________________ Age: _____ Date: ___/___/________
Six different persons' voices were recorded, each one saying the following list of five words: Dinero,
Secuestro, Policía, Rehén and Bomba.
The first three words were filtered and distorted.
The other two words were not modified.
First listen to the “Reference 1” track, which contains the two words that were not modified
(Rehén – Bomba) spoken by one person. Then listen to the six “Voice” tracks (A, B, C, D, E
and F), each one containing the three distorted words (Dinero – Secuestro – Policía), and decide
which one of the six “Voice” tracks corresponds to the same person as the “Reference 1” track. This
procedure is repeated for the rest of the “Reference” tracks.
Distortion 4 Voice A Voice B Voice C Voice D Voice E Voice F
Reference 1
Reference 2
Reference 3
Reference 4
Reference 5
Reference 6
Distortion 3 A B C D E F
Reference 1
Reference 2
Reference 3
Reference 4
Reference 5
Reference 6
Distortion 2 A B C D E F
Reference 1
Reference 2
Reference 3
Reference 4
Reference 5
Reference 6
Distortion 1 A B C D E F
Reference 1
Reference 2
Reference 3
Reference 4
Reference 5
Reference 6