Acoustic Laboratory November 2013, Caseros, Buenos Aires Province, Argentina

EFFECT OF DISTORTION ON SPEAKER RECOGNITION IN TELEPHONE LINE FREQUENCY BAND

AGUSTÍN Y. ARIAS 1

1 Universidad Nacional de Tres de Febrero, Buenos Aires, Argentina.

[email protected]

Abstract – This paper analyzes how harmonic distortion affects the correct recognition of speakers, with the aim of contributing to the advancement of forensic acoustics. To this end, the dynamics of different voices were processed with gradually increasing levels of distortion in order to simulate the actual quality of a telephone line. Several subjective listening tests were performed in which fifteen subjects listened to the processed voices and decided which person each voice belonged to by comparing the distorted voices with undistorted reference voices. The influence of the Distortion Level and of the Total Harmonic Distortion (THD) of a telephone line simulator is analyzed. In addition, vocal formants are studied using several sonograms to help interpret the test results. The procedures, considerations and limitations of the test process are explained, and the conclusions and new lines of investigation are described.

1. INTRODUCTION

Speaker recognition is a broadly diversified discipline that includes various systems performing the task automatically or semi-automatically. These systems are used for purposes such as smart locks, access to private computer networks and mobile smart devices, among others. However, the objective of this study is directly related to the field of Forensic Acoustics, where the handling of computerized techniques by a skilled expert plays a fundamental role in producing accurate results and conclusions. Forensic Acoustics is one of the most complex research environments of the Scientific Police, mainly due to the multidisciplinary nature of its different approaches to analysis and the need to provide its experts with continuous high-level training. While mastery of the technology and of digital applications for analysis, calculation or processing is essential, involving a team of experts who specialize in different perspectives of study is even more indispensable [1].

In the forensic analysis of voices it is common to work with poor signal quality in terms of dynamics, bandwidth, phase distortion and several types of noise (ambient noise, digital noise, quantization noise, modulation noise, electromagnetic noise), which is mainly related to the limitations imposed by the equipment and the transmission lines (whether physical or wireless) [2]. All these factors are potential sources of error in the results produced by automatic speech recognition systems, so the judgment and skill of the expert in charge of the analysis are of fundamental importance.

Hirsch [3] analyzed the performance of various automatic speaker recognition systems using voice recordings filtered between 300 and 3400 Hz and contaminated with various background noises (car interiors, trains and restaurants, among others). The SNR (Signal-to-Noise Ratio) of each mixture of voice and background noise was gradually modified, and the results show that automatic speaker recognition systems are unreliable when the background noise level is greater than or equal to the level of the vocal registers. On the other hand, the phase distortion produced by different elements of telephone lines (cables, filters, transformers and equalizers) was characterized by Steinberg [4] at Bell Laboratories. Several telephone lines were measured and the technological requirements were stated. In addition, Wang and Lim [5] studied the importance of phase distortion in the context of speech enhancement through listening tests. They used two analysis blocks in parallel to estimate the phase and magnitude spectra from both clean and noisy speech. By altering the amount of induced noise for the phase and magnitude estimations separately, the structure can control the amounts of phase and magnitude distortion independently. Using this structure and carrying out listening tests, they concluded that an effort to estimate the phase more accurately from noisy speech is unwarranted.

However, the effects of the distortion level of a telephone line on a speaker recognition task have not been fully investigated. The objective of this research is to evaluate subjectively the difficulties a listener faces when performing a speaker recognition test using sound recordings with different levels of harmonic distortion and with part of their spectral information removed. This manipulation of the recordings attempts to simulate the behavior of low-quality telephone transmission lines with regard to the transmission and reception of audio signals, as found in police radio equipment and in some mobile and landline telephones. Such poor-quality recordings are common in forensic acoustics, and they impair the performance of automated and semi-automated speaker recognition systems [6]. To quantify the different levels of distortion, the Total Harmonic Distortion (THD) parameter was chosen because it is one of the most important technical specifications of any equipment that handles audio signals.

A subjective speaker recognition test was designed in order to study the effects of the THD. In the test, each subject had to compare filtered (telephone line simulator) and distorted vocal records with high-quality reference vocal records.

2. EXPERIMENTAL DESIGN

First, a filtering process that simulates the frequency response of a typical telephone system was designed. For this, a 6th order Butterworth band pass filter was used. The lower cutoff frequency was set to 300 Hz and the upper cutoff frequency to 2000 Hz, as can be seen in Figure 1 [7]. This filtering process was also used to perform a preliminary speaker recognition test with filtered signals without distortion, in order to analyze the influence of the filtering on the results of the main test, which uses both the filtering and the distortion processes. Two sonograms corresponding to the word “secuestro” are shown in Figure 2 to compare how the formants appear.

Figure 1. Telephone-like band pass filter
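As a rough illustration of the band limits described above, the sketch below evaluates the magnitude response of an ideal analog Butterworth band pass with cutoffs at 300 Hz and 2000 Hz. It is an idealization and not the filter implementation used in the study: the function name `bandpass_gain_db` is hypothetical, and the 6th order band pass is assumed here to result from a 3rd order low-pass prototype.

```python
import math

def bandpass_gain_db(f_hz, f_low=300.0, f_high=2000.0, prototype_order=3):
    """Gain in dB of an ideal analog Butterworth band pass at frequency f_hz.

    Assumption: a 3rd order low-pass prototype, which yields a 6th order
    band pass after the low-pass to band-pass transformation.
    """
    if f_hz <= 0:
        raise ValueError("frequency must be positive")
    center_sq = f_low * f_high      # squared geometric center frequency
    bandwidth = f_high - f_low
    # Low-pass to band-pass mapping: normalized prototype frequency.
    x = (f_hz ** 2 - center_sq) / (f_hz * bandwidth)
    magnitude_sq = 1.0 / (1.0 + x ** (2 * prototype_order))
    return 10.0 * math.log10(magnitude_sq)

# Print the gain at a few frequencies inside and outside the pass band.
for f in (100, 300, 775, 2000, 3400):
    print(f"{f:5d} Hz: {bandpass_gain_db(f):7.2f} dB")
```

Under these assumptions the gain is about -3 dB at both cutoff frequencies and close to 0 dB near the geometric center of the band, which is the behavior expected of a Butterworth response.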

The removal of spectral content can be clearly seen in the sonogram of the filtered voice. Sonograms are a useful tool for analyzing the presence of formants. Formants are the natural resonances of the vocal tract and appear as intensity peaks in the spectrum of a sound, corresponding to a concentration of energy around a particular frequency. Technically, formants are the frequency bands where most of the sound energy is concentrated.

The energy of the first and second formants remains intact, which greatly favors recognition, since those are the most energetic formants that characterize the presence of vowels in speech. As mentioned above, the applied filter has a lower cutoff frequency of 300 Hz, which does not interfere much with the main formant regions of the vowels listed in Table 1 [8].

Figure 2. Original and filtered voice register sonograms (panels: original voice register, filtered voice register)

Table 1. Main formant regions of vowels

Vowel   Main formant region
u       200 - 400 Hz
o       400 - 600 Hz
a       800 - 1200 Hz
e       400 - 600 Hz and 2200 - 2600 Hz

Then, since the main subjective test was designed to study how recognition accuracy changes as the distortion is gradually varied, four levels of distortion were defined. The distortion system used for this task is a VST plugin available in Adobe Audition, which consists of a typical limiter that allows the user to modify its transfer function. The configurations of the distortion systems were chosen as follows:

Distortion 1 (D1): Low distortion.
   o Limiter: -20 dB
Distortion 2 (D2): Moderate distortion.
   o Limiter: -30 dB
Distortion 3 (D3): High distortion.
   o Limiter: -40 dB
Distortion 4 (D4): Very high distortion.
   o Limiter: -50 dB

Figure 3 shows the transfer functions of these four configurations.

Figure 3. Distortion systems transfer functions
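The four limiter settings can be pictured with a minimal hard-limiter sketch. This is only an approximation under stated assumptions: the plugin's real transfer functions are those of Figure 3, the thresholds are interpreted here as dB relative to full scale, and the names `db_to_linear` and `hard_limit` are hypothetical.

```python
import math

# Limiter thresholds in dB for the four configurations listed above.
THRESHOLDS_DB = {"D1": -20.0, "D2": -30.0, "D3": -40.0, "D4": -50.0}

def db_to_linear(level_db):
    """Convert an amplitude level in dB (re full scale) to a linear factor."""
    return 10.0 ** (level_db / 20.0)

def hard_limit(samples, threshold_db):
    """Clip every sample to the +/- ceiling set by the threshold.

    Assumption: an ideal hard clip; a real limiter may have a softer knee.
    """
    ceiling = db_to_linear(threshold_db)
    return [max(-ceiling, min(ceiling, s)) for s in samples]

# Hypothetical test signal: 10 ms of a full-scale 440 Hz sine at 44100 Hz.
sine = [math.sin(2.0 * math.pi * 440.0 * n / 44100.0) for n in range(441)]

for name, threshold in THRESHOLDS_DB.items():
    limited = hard_limit(sine, threshold)
    print(name, "peak after limiting:", max(abs(s) for s in limited))
```

The lower the threshold, the more of the waveform is flattened, which is what produces the growing harmonic content from D1 to D4.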

These distortion processes generate a large number of harmonic components, and the energy of those harmonics increases as the distortion level increases. This tends to impair the correct identification of formants, which may hinder the recognition of low-frequency signals (vowels). Figure 4 compares the sonograms of an original voice recording and of the same recording with the D1 and D4 distortion levels. In the D1 sonogram, the main formants and the generated harmonics can be distinguished visually, and the decrease in energy level as the harmonic number increases can also be seen. In the case of the D4 distortion level exactly the opposite occurs. It is very difficult to distinguish the formants because the energy level difference between the first formant (the first resonance frequency of the vocal tract) and the first harmonic is too small (about 2 dB). The same energy level difference was found between the upper harmonics. In the case of the D1 distortion level, the differences found are about 5.3 and 6.4 dB. The signal energy above 3000 Hz is below -85 dB, so it can be neglected.

As mentioned above, the Total Harmonic Distortion (THD) [9] was used to quantify the four distortion levels. This parameter is calculated according to Eq. 1.

$$\mathrm{THD} = \frac{10^{H_{rms}/10}}{10^{IR_{rms}/10}} \times 100 \ [\%] \qquad (1)$$

where Hrms is the root mean square level of all the harmonics and IRrms is the root mean square level of the original impulse in the impulse response of the filter + distortion system (both expressed in dB, as in Table 2). This descriptor is useful because it relates the amount of harmonic energy generated by the distortion process to the energy of the input signal (original impulse). The results obtained from the analysis of each distortion level are shown in Table 2.

Figure 4. Sonograms comparison (panels: original voice register, D1 distorted voice register, D4 distorted voice register)

Table 2. THD [%] of each filter + distortion system

Distortion   Signal energy level [dB]          THD
level        Original impulse   Harmonics
D1           -54.10             -72.8          1.35%
D2           -53.69             -66.2          5.61%
D3           -52.51             -60.5          15.89%
D4           -51.37             -54.3          50.93%
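The THD column of Table 2 can be cross-checked directly from Eq. 1. The sketch below is a minimal worked example; the function name `thd_percent` and the dictionary layout are illustrative choices, and the input levels are the dB values reported in Table 2.

```python
# Levels in dB taken from Table 2: (original impulse, harmonics).
LEVELS_DB = {
    "D1": (-54.10, -72.8),
    "D2": (-53.69, -66.2),
    "D3": (-52.51, -60.5),
    "D4": (-51.37, -54.3),
}

def thd_percent(impulse_db, harmonics_db):
    """Eq. 1: harmonic energy relative to the impulse energy, in percent."""
    return 10.0 ** (harmonics_db / 10.0) / 10.0 ** (impulse_db / 10.0) * 100.0

# Evaluate Eq. 1 for each distortion level.
THD = {name: thd_percent(ir, h) for name, (ir, h) in LEVELS_DB.items()}

for name, value in THD.items():
    print(f"{name}: {value:.2f} %")
```

Because both levels are in dB, Eq. 1 is equivalent to converting the level difference between harmonics and impulse into an energy ratio.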


3. SIGNAL ACQUISITION AND PROCESSING

3.1. Voice signal recordings

Six different voices were recorded using high

quality recording equipment to obtain "clean" signals

with a high signal-to-noise ratio, a wide bandwidth and

low harmonic distortion. The following list details the

equipment and software employed:

Notebook POSITIVO BGH C570

M-Audio Fast Track audio interface

RODE NT 2A microphone

XLR-XLR cable

Sound level meter Svantek 959 + microphone calibrator

Microphone tripod

Adobe Audition®

o Sample rate: 44100 Hz

o Resolution: 16 bits

o Channel mode: Mono

The recordings were made in a room with the following acoustic characteristics:

Background noise: 42 dBA.

Reverberation time: 0.4 s.

Those values were measured with the sound level meter. The background noise was measured for 5 minutes, and the reverberation time T30 was obtained from a balloon burst by averaging the results of the 500, 1000 and 2000 Hz octave bands.

In order to perform a test that represents the actual

working conditions of a forensic expert, five Spanish

words commonly used in forensic acoustics were

recorded by each person. Those words are: “Dinero”

(money), “Secuestro” (kidnapping), “Policía” (police),

“Rehén” (hostage) and “Bomba” (bomb).

Finally, to minimize the temporal characteristics of each person's own speech, the speakers made the recordings while imitating the pronunciation of a reference speech, played back through headphones, that contained the five words defined above with a specific rhythm.

3.2. Signals processing and considerations

Once the sound recordings of the six people who contributed their voices had been obtained, several stages of signal processing were carried out. The goal of this step is to process some words of each speech signal to simulate the poor sound quality produced by certain telephone lines, as defined in Section 2.

First, a "normalization" to 100% was applied to each recorded word to equalize the levels of all recordings, avoiding loudness differences between different words and speakers.
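The normalization step can be sketched as a simple peak normalization. This is an assumption about what "normalization to N%" means in the audio editor (scaling so that the largest absolute sample equals N% of full scale); the function name `peak_normalize` is hypothetical.

```python
def peak_normalize(samples, percent):
    """Scale samples so the largest absolute value equals percent of full scale.

    Assumption: full scale is 1.0 and "normalize to N%" is peak based.
    """
    peak = max(abs(s) for s in samples)
    if peak == 0.0:
        return list(samples)  # silent input: nothing to scale
    gain = (percent / 100.0) / peak
    return [s * gain for s in samples]

# Hypothetical word recording with a peak of 0.25 of full scale.
word = [0.10, -0.25, 0.05, 0.20]
print(peak_normalize(word, 100.0))  # peak becomes 1.0
print(peak_normalize(word, 60.0))   # peak becomes 0.6
```

Since every sample is multiplied by the same gain, any background noise in the recording is raised by the same factor as the voice.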

From this point on, only three words (Dinero, Secuestro, Policía) were processed for each recorded voice, while the other two words (Rehén, Bomba) were not modified. This is because the subjective test compares original signals with distorted signals, and the words used as original signals must differ from the words used as distorted signals in order to, again, avoid any temporal characteristics of the speech (rhythm and rate of speech, forms of pronunciation and particular accents, among others) that may help the listener to distinguish a particular speaker. These are the two considerations taken into account in the voice recording and processing to minimize the influence of temporal characteristics.

Then, the four different levels of distortion were applied. Because this process consists of a typical limiter, the amplitude of the processed signals is highly attenuated. For this reason another “normalization” process was applied to raise the loudness of the signals, but in this instance the normalization was set to 60%, because the noisiness generated by the distortion could cause temporary hearing damage to the listener if the normalization value were set higher than 60%.

Figure 5. Block diagram of the entire signal processing (blocks: Log-Sine sweep, band pass filter, distortion system, inverse filter convolution)

Figure 6. Impulse response of a filter + distortion system (labels: original impulse, 1st to 5th harmonics)

The entire process is shown in Figure 5. As explained above, the words "Dinero", "Secuestro" and "Policía" were normalized to 100%, then filtered, distorted and finally normalized again, this time to 60%. The words "Rehén" and "Bomba" were only normalized to 100% because they were used as the reference signals in the test procedure.

To calculate the THD values, the impulse response of each filter + distortion system was obtained by feeding a Log-Sine sweep (between 100 and 4500 Hz) into the system and then convolving the output signal with the inverse filter using the AURORA plugins [10]. As shown in Figure 6, this process separates in time the different harmonics created by the distortion processes, which makes it possible to compare the energy of those harmonics with the energy of the original impulse.
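The time separation of the harmonics follows from a known property of the exponential sine sweep method: after deconvolution, the response of the N-th order harmonic appears ahead of the linear impulse response by T·ln(N)/ln(f2/f1), where T is the sweep duration. The sketch below evaluates this for the 100 to 4500 Hz sweep; the sweep duration is not given in the text, so the 10 s value and the function name `harmonic_advance_s` are purely illustrative assumptions.

```python
import math

def harmonic_advance_s(order, sweep_duration_s, f_start_hz=100.0, f_stop_hz=4500.0):
    """Time in seconds by which the order-th harmonic response precedes the
    linear impulse response after deconvolving an exponential sine sweep."""
    return sweep_duration_s * math.log(order) / math.log(f_stop_hz / f_start_hz)

# Hypothetical sweep duration; the actual value used in the study is unknown.
SWEEP_DURATION_S = 10.0

for order in range(1, 6):
    advance = harmonic_advance_s(order, SWEEP_DURATION_S)
    print(f"order {order}: {advance:.3f} s before the linear response")
```

A longer sweep spreads the harmonic responses further apart, which makes it easier to window each one when measuring its energy.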

4. TEST PROCEDURE

The subjective tests were performed by fifteen subjects, ten men and five women, aged between 22 and 27 years. In all cases headphones were used in order to minimize the background noise of the rooms where the tests were performed, allowing the subjects to concentrate on listening without external interference.

A typical test table is shown in Figure 7. The subjects first had to listen to the “Reference 1” track, which contains the two unmodified words (Rehén, Bomba) spoken by one person. Then they had to listen to the six “Voice” tracks (A, B, C, D, E and F), each containing the three distorted words (Dinero, Secuestro, Policía), and decide which of the six “Voice” tracks corresponded to the same person as the “Reference 1” track. This procedure was repeated for the rest of the “Reference” tracks. There were four tables like the one shown in Figure 7, each corresponding to a different distortion level of the “Voice” tracks. The first test corresponds to Distortion D4, the second to Distortion D3, and so on. The subjects were allowed to repeat any track as many times as necessary, so the time required to complete the test varied from subject to subject. On average it took twenty minutes to complete the test.

Figure 7. Test table for the Distortion D1.

5. RESULTS

The results of the preliminary test (speaker recognition using filtered voices) are shown in Figure 8. Ten of the twelve subjects (83.33%) were able to recognize all the voices, while the remaining two subjects (16.66%) achieved four recognitions.

Figure 8. Preliminary test. Amount of successful

recognitions

These results indicate that the filtering process does not make a successful speaker recognition task significantly more difficult. The various degrees of difficulty in recognizing speakers are therefore considered to be closely related to the distortion processes.

Regarding the main test, the number of successful recognitions achieved by each subject in the four tests was studied in order to analyze the difficulty of recognizing speakers at the different distortion levels. The results are shown in Figure 9.

Figure 9. Amount of successful recognitions of each subject

for each distortion level


As can be seen, the lower distortion levels yield a large number of successful recognitions, as expected. Three cases were found in which there was no recognition at all, two corresponding to Distortion D4 and the remaining one to Distortion D3. On the other hand, there are seven cases in which the subject correctly matched all the voices, all corresponding to the minimum level of distortion. In general, the results obtained for distortion levels D4 and D3 are evenly distributed between one and four successful recognitions. It is not possible to obtain exactly five successful recognitions, because if one comparison is wrong then inevitably another one is wrong as well.
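The impossibility of exactly five successful recognitions can be checked by enumeration. The sketch below assumes, as the statement above implies, that each subject assigns every "Voice" track to exactly one "Reference" track (a one-to-one matching); the function name `correct_match_distribution` is hypothetical.

```python
from collections import Counter
from itertools import permutations

def correct_match_distribution(n_voices=6):
    """Count, over all one-to-one assignments, how many have k correct matches."""
    counts = Counter()
    for assignment in permutations(range(n_voices)):
        # A match is correct when reference i is paired with voice i.
        hits = sum(1 for ref, voice in enumerate(assignment) if ref == voice)
        counts[hits] += 1
    return counts

distribution = correct_match_distribution()
for hits in range(7):
    print(hits, "correct:", distribution[hits], "possible assignments")
```

No assignment yields exactly five correct matches: once five pairs are right, the last remaining voice can only go to the last remaining reference.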

Analyzing the results for each distortion level, the following behaviors were found:

In the case of Distortion D4 the recognition was very poor, with a mean of 1.33 successful recognitions (22.2%), while the maximum was three (50%), achieved by only one subject. Moreover, there were two cases in which recognition was null (0%). The distribution of the results shows that twelve of the fifteen subjects (80% of the total population) were able to recognize only one or two voices.

In the Distortion D3 test a slight improvement was obtained, with an average of two successful recognitions (33.3%). Only in one case was it possible to recognize four voices, which was the maximum number of successes obtained (66.6%). There was also one case of null recognition (0%). In this case the distribution of results indicates that only five subjects (33.3%) recognized three or four voices.

The Distortion D2 test results did not provide significant improvements over the previous case. The average number of successful recognitions increased only to 2.53 (42.2%), and the maximum of four (66.6%), achieved by three subjects, was not exceeded. Unlike the previous cases, this time no subject failed every recognition, although only 46.6% of the subjects recognized three or more voices, without exceeding the maximum of four successful recognitions.

Finally, the Distortion D1 test gave the most accurate results, with a mean of 4.87 successful recognitions (81.1%). Fourteen subjects (93.3%) were able to correctly identify more than three voices, of whom seven (46.6%) recognized all the voices.

The mean values of successful recognitions and the associated standard deviations are listed in Table 3.

Another interesting result is the number of successful recognitions for each of the six voices. For this analysis, the number of subjects who were able to recognize each voice individually at each of the four distortion levels was determined. The results are shown in Figure 10.

Table 3. Mean and Standard Deviation values of successful recognitions for each distortion level

Distortion   Successful recognitions
level        Mean    Standard deviation
1            4.87    1.13
2            2.53    0.99
3            2.00    1.07
4            1.33    0.82
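The means in Table 3 can be cross-checked against the per-voice counts reported in Table 4, since the total number of correct matches at each distortion level divided by the fifteen subjects gives the mean per subject. The sketch below is a minimal consistency check; the variable names are illustrative.

```python
# Subjects who recognized Voices A to F at each distortion level (Table 4).
COUNTS = {
    1: [15, 12, 11, 13, 12, 10],
    2: [10, 3, 5, 9, 4, 7],
    3: [8, 4, 5, 5, 3, 5],
    4: [5, 3, 3, 3, 2, 4],
}
N_SUBJECTS = 15
N_VOICES = 6

# Mean number of successful recognitions per subject at each level.
MEANS = {level: sum(c) / N_SUBJECTS for level, c in COUNTS.items()}
# The same mean expressed as a percentage of the six voices.
PERCENT = {level: 100.0 * m / N_VOICES for level, m in MEANS.items()}

for level in sorted(COUNTS):
    print(f"D{level}: mean {MEANS[level]:.2f} ({PERCENT[level]:.1f} %)")
```

The standard deviations cannot be recovered this way, because they depend on the individual subject scores, which are only shown graphically in Figure 9.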

Figure 10. Amount of successful recognitions for each

one of the six voices

These results show that, for all levels of distortion, the voice of person A (Voice A) had the greatest number of successful recognitions. Voice D also has a large number of successful recognitions, especially in the case of D1, where 86.66% of the subjects were able to correctly identify that person.

The statistical analysis (correlation coefficient, standard error, coefficient of determination “R square” and p-value for a 95% confidence level) relating the distortion level to the number of successful recognitions for each voice is listed in Table 4. The results obtained for Voices A, D and F indicate that the number of successful recognitions is closely related to the distortion level. The p-values obtained for those voices are smaller than 0.05, which confirms that both variables are highly correlated.


Table 4. Distortion Level. Statistical analysis

Distortion level            Voice A   Voice B   Voice C   Voice D   Voice E   Voice F
1                           15        12        11        13        12        10
2                           10        3         5         9         4         7
3                           8         4         5         5         3         5
4                           5         3         3         3         2         4
Correlation                 0.983     0.770     0.894     0.990     0.875     0.976
Standard Error              0.424     1.523     0.849     0.346     1.212     0.316
Determination Coefficient   0.966     0.593     0.800     0.980     0.766     0.952
p-value                     0.017     0.229     0.105     0.010     0.124     0.024

The same analysis was performed relating the number of successful recognitions to the four calculated THD values. The results are listed in Table 5. The negative correlation indicates that the higher the THD, the lower the number of successful recognitions. None of the p-values allows the null hypothesis to be rejected, so it would not be correct to state that the number of successful recognitions of each voice is correlated with the THD. For this reason, the THD parameter is not useful for quantifying the distortion level of the designed distortion systems.

Table 5. THD. Statistical analysis

THD [%]                     Voice A   Voice B   Voice C   Voice D   Voice E   Voice F
1.35                        15        12        11        13        12        10
5.61                        10        3         5         9         4         7
15.89                       8         4         5         5         3         5
50.93                       5         3         3         3         2         4
Correlation                 -0.856    -0.532    -0.717    -0.845    -0.645    -0.804
Standard Error              0.068     0.116     0.076     0.074     0.110     0.049
Determination Coefficient   0.732     0.283     0.513     0.715     0.416     0.674
p-value                     0.144     0.468     0.283     0.154     0.354     0.195
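The statistics reported for each voice are consistent with an ordinary least-squares line fitted to four points. The sketch below reproduces the Voice A column under two assumptions that are not stated in the text: the distortion level is coded as the limiter threshold in units of 10 dB (-2, -3, -4, -5), which matches the sign and standard error reported for the Distortion Level, and the standard error is that of the regression slope. The function name `linear_fit_stats` is hypothetical, and the p-value uses the closed form of the Student t distribution with two degrees of freedom, so it is valid only for four data points.

```python
import math

def linear_fit_stats(x, y):
    """Pearson r, R squared, slope standard error and two-sided p-value.

    The p-value formula is exact only for n = 4 (two degrees of freedom).
    """
    n = len(x)
    if n != 4 or len(y) != 4:
        raise ValueError("this sketch only supports four data points")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    r = sxy / math.sqrt(sxx * syy)
    slope = sxy / sxx
    residual_ss = max(0.0, syy - slope * sxy)
    std_err = math.sqrt(residual_ss / (n - 2) / sxx)
    if std_err == 0.0:
        p_value = 0.0  # perfect fit
    else:
        t = slope / std_err
        # Two-sided tail probability of Student's t with 2 degrees of freedom.
        p_value = 1.0 - abs(t) / math.sqrt(t * t + 2.0)
    return {"r": r, "r2": r * r, "std_err": std_err, "p": p_value}

VOICE_A = [15, 10, 8, 5]                # recognitions at D1..D4
LEVEL_CODING = [-2.0, -3.0, -4.0, -5.0] # assumed: limiter threshold / 10 dB
THD_PERCENT = [1.35, 5.61, 15.89, 50.93]

by_level = linear_fit_stats(LEVEL_CODING, VOICE_A)
by_thd = linear_fit_stats(THD_PERCENT, VOICE_A)
print("Distortion level:", {k: round(v, 3) for k, v in by_level.items()})
print("THD:", {k: round(v, 3) for k, v in by_thd.items()})
```

With only four points per voice the test has two degrees of freedom, which is why even correlations of about 0.85 in magnitude do not reach the 0.05 threshold.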

Finally, a comparison between the correlation

coefficients of the Distortion Limit Level and THD is

shown in Table 6.

Table 6. Correlation coefficients comparison

          Successful recognitions   Correlation coefficient
          D1    D2    D3    D4      Distortion level   THD
Voice A   15    10    8     5       0.98               -0.86
Voice B   12    3     4     3       0.77               -0.53
Voice C   11    5     5     3       0.89               -0.72
Voice D   13    9     5     3       0.99               -0.85
Voice E   12    4     3     2       0.88               -0.65
Voice F   10    7     5     4       0.98               -0.80

The correlation coefficients of the Distortion Level parameter are greater in magnitude than those obtained for the THD parameter for all the voices. The correlation coefficients of the Distortion Level are positive, which indicates a direct correlation between the variables. In contrast, the correlation coefficients of the THD parameter are negative, indicating an inverse correlation between the variables.

However, the results show that the THD cannot be used as an objective parameter for quantifying the degree of accuracy of a speaker recognition task. The THD only considers the energy of the harmonics and ignores the signal-to-noise ratio, which degrades as the distortion level increases and changes the noisiness perceived by the listener. The increase in noise energy between syllables and words caused by the second normalization process (60%) worsens the quality of the vocal registers because each person's voice is masked by the noise level of the signal. Figure 11 shows two distorted voice signals (corresponding to the word “secuestro”, as in the previous analysis) on the same amplitude scale before applying the 60% normalization process: the left one has the D4 distortion level applied and the right one has D1.

The intervals of “silence” between syllables (which actually contain the background noise of the original recordings) are highlighted, showing that those intervals have the same amplitude regardless of the distortion level applied, because the limiter threshold (-50 dB for D4) is never reached. The noise amplitude is of the same order as the voice amplitude in the D4 signal, but it is much lower than the voice amplitude in the D1 signal. Because of this, when the 60% normalization process is applied, the amplitude of the noise increases in proportion to the level of distortion, as shown in Figure 12. This suggests that it is necessary to study the influence of the signal-to-noise ratio produced by the distortion process in conjunction with the THD of the system, since the latter is not sufficient to quantify the degree of successful recognitions.

Figure 11. Noise between syllables before the 60% normalization process (panels: D4, D1)

Figure 12. Noise between syllables after the 60% normalization process (panels: D4, D1)
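The growth of the noise with the distortion level can be estimated with a short calculation. The sketch below assumes that the peak of each limited signal sits exactly at the limiter threshold and that the 60% normalization is peak based, so the gain applied to everything below the threshold, including the noise between syllables, is the difference between the 60% target level and the threshold. The function name `makeup_gain_db` is hypothetical.

```python
import math

# Limiter thresholds in dB for the four distortion levels.
THRESHOLDS_DB = {"D1": -20.0, "D2": -30.0, "D3": -40.0, "D4": -50.0}

def makeup_gain_db(threshold_db, target_percent=60.0):
    """Gain in dB applied by peak normalization after limiting.

    Assumption: the limited signal peaks exactly at the threshold.
    """
    target_db = 20.0 * math.log10(target_percent / 100.0)
    return target_db - threshold_db

GAINS = {name: makeup_gain_db(thr) for name, thr in THRESHOLDS_DB.items()}

for name, gain in GAINS.items():
    print(f"{name}: noise between syllables raised by about {gain:.1f} dB")
```

Under these assumptions the noise floor of the D4 signal ends up 30 dB higher than that of the D1 signal after normalization, even though both started from the same background noise.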

Finally, the greatest number of successful recognitions corresponds to Voice A, which belongs to a person already known to the subjects before the test. All the subjects have known that person for more than four years and are in daily contact with them. This indicates that familiarity with a particular voice may improve the recognition task.

6. CONCLUSIONS

From the test results it was found that the chances of successfully recognizing three of the six voices employed are strongly correlated with the distortion level. For the other three voices this conclusion cannot be affirmed. There is therefore a need to increase the number of subjects, to perform more tests and to obtain more subjective data in order to confirm or refute the hypothesis that the distortion level and the number of successful recognitions are strongly correlated. The THD parameter of the distortion systems is not useful for predicting how difficult it is to perform a successful speaker recognition, because there is no strong correlation between the THD and the number of successful recognitions.

A visual analysis of the formants displayed in a sonogram leads to the conclusion that, as the distortion level increases, discriminating the energy of the vowels becomes increasingly difficult. Spectrograms are a valuable visual tool, but their usefulness is limited by the dynamic quality of the records analyzed, as both distortion and noise can impair the correct interpretation of the graphs.

These findings were reflected in the tests, where good results were achieved only with the minimum distortion level. This may indicate that the subjects who performed the test use spectral information, mainly related to the energy of the vowels, from the original reference tracks and then try to characterize each person's voice with that information. Then, listening to both the reference and the distorted tracks, the subjects try to find some common behavior related to the distribution of vowel energy between speakers.

In the filtered signal test, no strong dependence was found between the speaker recognition performance and the elimination of spectral information using a telephone-like filter.

As explained, the test was designed to avoid any kind of temporal information that might help the recognition process. However, some subjects commented after the tests that they had used temporal characteristics related to the way the recorded persons pronounced the words to try to recognize them, especially at the two highest levels of distortion.

7. FUTURE WORK

One of the most important needs that emerges from the results of this report is to increase the number of subjects who perform the test and to add more distortion levels, which will increase the resolution of the results. That is, the regression lines will be calculated with more data and the correlation coefficients will be more accurate. In addition, it is necessary to evaluate the influence of the reduction of the signal-to-noise ratio together with the THD variations in order to design an optimal parameter that accurately quantifies the degree of difficulty of a speaker recognition task.

It would also be interesting to process the signals used for the subjective tests with automatic speaker recognition systems, which would make it possible not only to classify the degrees of precision of such systems but also to compare their number of successful recognitions with those obtained subjectively. This would allow forensic experts to determine to what extent they can rely on their interpretation.

On the other hand, it is necessary to investigate and develop new technologies that allow a clear transmission of the human voice in telephone systems, as this is the primary means of communication used by kidnappers.

It is also necessary that all offenders under police custody complete a voice recording session in order to store their vocal registers, which could potentially be used if any of them re-offend.

For further research, it would also be interesting to analyze what happens with female voices, in order to study the influence of the different levels of distortion when both male and female voices are compared, and also to study the influence of previously known voices.

8. REFERENCES

[1] Delgado Romero C. “Técnicas digitales de análisis audiovisual en acústica forense”. Actas del 3º Congreso de Investigadores Audiovisuales. Vol. 1. Del Laberinto. Madrid, Spain. November 1999.

[2] Lane C. “Phase distortion in telephone apparatus”. Bell System Technical Journal. Vol. 2, pp. 493-521. New York, US. May 1983.

[3] Hirsch H. “The Aurora experimental framework for the performance evaluation on speech recognition systems under noisy conditions”. VI International Conference on Spoken Language Processing. Vol. 4, pp. 29-32. ICSLP-2000. Beijing, China. October 2000.

[4] Steinberg J. C. “Effects of phase distortion telephone quality”. Bell System Technical Journal. Vol. 2, pp. 550-555. New York, US. May 1983.

[5] Wang D., Lim J. “The unimportance of phase in speech enhancement”. Acoustics, Speech and Signal Processing. IEEE Transactions. Vol. 30, pp. 679-681. Cambridge, US. January 2003.

[6] Dominguez S. “Estimated weight of evidence in forensic sound for statistical inference of identity of the speaker by Bayesian network application to acoustic features”. Master’s thesis. pp. 1-5. Madrid, Spain. October 2009.

[7] Steinberg J. C. “Effects of phase distortion telephone quality”. Bell System Technical Journal. Vol. 2, pp. 555-566. New York, US. May 1983.

[8] Ladefoged P. “A Course in Phonetics”. Fort Worth Harcourt Brace Jovanovich College Publishers. Vol. 1, 5th Ed. Boston, US. 2006.

[9] Bohn D. “Audio specifications”. Rane Note. Vol. 12, pp. 1-12. Washington DC, US. 2000.

[10] Farina A. “Advancements in Impulse Response Measurements by Sine Sweeps”. AES E-Library. Parma, Italy. May 2007.

APPENDIX

Subjective Test

Acoustic Laboratory     Name: _________________ Age: _____ Date: ___/___/________

Six different people's voices were recorded, each one saying the following list of five words: Dinero, Secuestro, Policía, Rehén and Bomba.

The first three words were filtered and distorted.

The other two words were not modified.

First, listen to the “Reference 1” track, which contains the two unmodified words (Rehén, Bomba) spoken by one person. Then listen to the six “Voice” tracks (A, B, C, D, E and F), each containing the three distorted words (Dinero, Secuestro, Policía), and decide which of the six “Voice” tracks corresponds to the same person as the “Reference 1” track. This procedure is repeated for the rest of the “Reference” tracks.

Distortion 4   Voice A   Voice B   Voice C   Voice D   Voice E   Voice F
Reference 1
Reference 2
Reference 3
Reference 4
Reference 5
Reference 6

Distortion 3   Voice A   Voice B   Voice C   Voice D   Voice E   Voice F
Reference 1
Reference 2
Reference 3
Reference 4
Reference 5
Reference 6

Distortion 2   Voice A   Voice B   Voice C   Voice D   Voice E   Voice F
Reference 1
Reference 2
Reference 3
Reference 4
Reference 5
Reference 6

Distortion 1   Voice A   Voice B   Voice C   Voice D   Voice E   Voice F
Reference 1
Reference 2
Reference 3
Reference 4
Reference 5
Reference 6