Effect of distortion on speaker recognition in telephone line frequency band.pdf

Date post: 23-Oct-2015
Author: agustin-arias
    Acoustic Laboratory November 2013, Caseros, Buenos Aires Province, Argentina

    EFFECT OF DISTORTION ON SPEAKER RECOGNITION IN TELEPHONE LINE FREQUENCY BAND

    AGUSTÍN Y. ARIAS 1

    1 Universidad Nacional de Tres de Febrero, Buenos Aires, Argentina. [email protected]

    Abstract This paper analyzes how harmonic distortion affects the correct recognition of speakers, with the aim of contributing to the advancement of forensic acoustics. To this end, the dynamics of different voices were processed with gradually increasing levels of distortion in order to simulate the actual quality of a telephone line. Several subjective listening tests were performed in which fifteen subjects listened to the processed voices and decided which person each voice belonged to, comparing the distorted voices with undistorted reference voices. The influence of the Distortion Level and the Total Harmonic Distortion (THD) of a telephone line simulator is analyzed. In addition, vocal formants are studied using several sonograms to interpret the results of the tests. The procedures, considerations and limitations of the test process are explained, and the conclusions and new lines of investigation are described.

    1. INTRODUCTION

    Speaker recognition is a broadly diversified discipline in which various systems perform the task automatically or semi-automatically. These systems are used for purposes such as smart locks, access to private computer networks and mobile smart devices, among others. The present study, however, is directly related to the field of Forensic Acoustics, where the handling of computerized techniques by a skilled expert plays a fundamental role in producing accurate results and conclusions. Forensic Acoustics is one of the most complex research environments of the Scientific Police, mainly because of the multidisciplinary nature of its approaches to analysis and the need to provide its experts with continuous, high-level training. While mastery of technology and of digital applications for analysis, calculation and processing is essential, involving a team of experts who specialize in different perspectives of study is even more necessary [1].

    In the forensic analysis of voices it is common to work with poor signal quality in terms of dynamics, bandwidth, phase distortion and several types of noise (ambient noise, digital noise, quantization noise, modulation noise, electromagnetic noise), mainly because of the limitations imposed by the equipment and the transmission lines (whether physical or wireless) [2]. All these factors are potential sources of error in the results produced by automatic speech recognition systems, so the judgment and skill of the expert in charge of the analysis are of fundamental importance.

    Hirsch [3] analyzed the performance of various automatic speaker recognition systems using voice recordings filtered between 300 and 3400 Hz and contaminated with various background noises (car and train interiors, restaurants, among others). The SNR (Signal-to-Noise Ratio) of each voice plus background noise was gradually modified, and the results show that automatic speaker recognition systems are unreliable when the background noise level is greater than or equal to the level of the voice recordings. On the other hand, the phase distortion produced by the different elements of telephone lines (cables, filters, transformers and equalizers) was characterized by Steinberg [4] at Bell Laboratories. Several telephone lines were measured and the technological requirements were stated. In addition, Wang and Lim [5] studied the importance of phase distortion in the context of speech enhancement through listening tests. They used two analysis blocks in parallel to estimate the phase and magnitude spectra from both clean and noisy speech. By altering the amount of induced noise for the phase and magnitude estimations separately, the structure can control the amount of phase and magnitude distortion independently. Using this structure and carrying out listening tests, they concluded that an effort to estimate the phase more accurately from noisy speech is unwarranted.

    However, the effect of the distortion level of a telephone line on a speaker recognition task has not been fully investigated. The objective of this research is to evaluate subjectively the difficulties that a listener faces when performing a speaker recognition test using sound recordings with different levels of harmonic distortion and with part of their spectral information removed. This manipulation of the recordings is intended to simulate the behavior of low-quality telephone transmission lines with regard to the transmission and reception of audio signals, as found in police radio equipment and in some mobile and landline telephones. Such poor-quality recordings are common in forensic acoustics and make the correct operation of automated or semi-automated speaker recognition systems much harder [6]. To quantify the different levels of distortion, the Total Harmonic Distortion (THD) parameter was chosen because it is one of the most important technical specifications of any equipment that handles audio signals.

    A subjective speaker recognition test was designed to study the effects of the THD. In the test, each subject had to compare filtered (telephone line simulator) and distorted voice recordings with high-quality reference voice recordings.

    2. EXPERIMENTAL DESIGN

    First, a filtering process that simulates the frequency response of a typical telephone system was designed. For this, a 6th-order Butterworth band pass filter was used, with the lower cutoff frequency set to 300 Hz and the upper cutoff frequency set to 2000 Hz, as can be seen in Figure 1 [7]. This filtering process was also used to perform a preliminary speaker recognition test with filtered but undistorted signals, in order to analyze the influence of the filtering on the results of the main test, which uses both the filtering and the distortion processes. Two sonograms of the word "secuestro" are shown in Figure 2 to compare how the formants appear.

    Figure 1. Telephone-like band pass filter
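As a rough illustration of the band pass response described above, the following sketch evaluates the magnitude of an analog Butterworth band pass with cutoffs at 300 Hz and 2000 Hz. It is not the filter implementation used in the study: the function name is hypothetical, and it assumes that "6th order" refers to the resulting band pass (a 3rd-order low pass prototype), which the text does not specify.

```python
import math

def butterworth_bandpass_gain_db(f_hz, f_low=300.0, f_high=2000.0, prototype_order=3):
    """Magnitude in dB of an analog Butterworth band pass at frequency f_hz.

    prototype_order=3 yields a 6th-order band pass (an assumption, see lead-in).
    """
    f0_squared = f_low * f_high        # squared geometric centre frequency
    bandwidth = f_high - f_low
    # Standard low-pass-to-band-pass frequency mapping
    omega = (f_hz ** 2 - f0_squared) / (bandwidth * f_hz)
    power_gain = 1.0 / (1.0 + omega ** (2 * prototype_order))
    return 10.0 * math.log10(power_gain)

# Gain at a few frequencies inside and outside the pass band
for f in (100, 300, 775, 2000, 3400):
    print(f, "Hz:", round(butterworth_bandpass_gain_db(f), 2), "dB")
```

By construction the gain is about -3 dB at both cutoff frequencies and close to 0 dB near the geometric centre of the band (about 775 Hz).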

    The removal of spectral content can be clearly seen in the sonogram of the filtered voice. Sonograms are a useful tool for analyzing the presence of formants. Formants are natural resonances of the vocal tract and appear as intensity peaks in the spectrum of a sound, that is, as frequency bands where most of the sound energy is concentrated.

    The energy of the first and second formants remains intact, which greatly favors recognition, since these are the most energetic formants and characterize the vowels in speech. As mentioned above, the applied filter has a lower cutoff frequency of 300 Hz, which does not interfere much with the main formant regions of the vowels listed in Table 1 [8].

    Figure 2. Original and filtered voice register sonograms (panels: original voice register, filtered voice register)

    Table 1. Main formant regions of vowels

    Vowel   Main formant region
    u       200 - 400 Hz
    o       400 - 600 Hz
    a       800 - 1200 Hz
    e       400 - 600 Hz and 2200 - 2600 Hz

    Since the main subjective test was designed to study the accuracy of speaker recognition as the distortion is gradually varied, four levels of distortion were defined. The distortion system used for this task is a VST plugin available in Adobe Audition, which consists of a typical limiter whose transfer function can be modified by the user. The configurations of the distortion systems were chosen as follows:

    Distortion 1 (D1): Low distortion. Limiter: -20 dB
    Distortion 2 (D2): Moderate distortion. Limiter: -30 dB
    Distortion 3 (D3): High distortion. Limiter: -40 dB
    Distortion 4 (D4): Very high distortion. Limiter: -50 dB

    Figure 3 shows the transfer functions of these four

    configurations.

    Figure 3. Distortion systems transfer functions
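The four limiter settings can be pictured with a minimal sketch. It assumes a hard limiter whose threshold is expressed in dB relative to full scale (1.0); the actual plugin transfer functions are those of Figure 3 and may be softer, so the names and the clipping law below are illustrative only.

```python
import math

# Limiter thresholds of the four configurations, in dB relative to full scale
THRESHOLDS_DB = {"D1": -20.0, "D2": -30.0, "D3": -40.0, "D4": -50.0}

def db_to_linear(level_db):
    """Convert an amplitude level in dB to a linear amplitude."""
    return 10.0 ** (level_db / 20.0)

def hard_limit(samples, threshold_db):
    """Clip every sample to +/- the threshold (assumed hard-limiter behaviour)."""
    ceiling = db_to_linear(threshold_db)
    return [max(-ceiling, min(ceiling, s)) for s in samples]

# One cycle of a full-scale sine, limited with the D1 and D4 settings
sine = [math.sin(2 * math.pi * n / 16) for n in range(16)]
for name in ("D1", "D4"):
    limited = hard_limit(sine, THRESHOLDS_DB[name])
    print(name, "peak after limiting:", max(abs(s) for s in limited))
```

The lower the threshold, the larger the portion of the waveform that is flattened, which is what produces the additional harmonic components.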

    These distortion processes generate a large number of harmonic components, whose energy increases with the distortion level. This tends to impair the correct identification of formants, which may hinder the recognition of low-frequency signals (vowels). Figure 4 compares the sonograms of an original voice recording and of the same recording with the D1 and D4 distortion levels. In the D1 sonogram the main formants and the generated harmonics are visually distinguishable, and the decrease in energy with increasing harmonic order can also be seen. For the D4 distortion level exactly the opposite occurs: it is very difficult to distinguish the formants because the energy level difference between the first formant (the first resonance frequency of the vocal tract) and the first harmonic is too small (about 2 dB). The same energy level difference was found between the upper harmonics. For the D1 distortion level, the differences found are about 5.3 and 6.4 dB. The signal energy above 3000 Hz is below -85 dB, so it can be neglected.

    As mentioned above, the Total Harmonic Distortion (THD) [9] was used to quantify the four distortion levels. This parameter is calculated according to Eq. 1.

    $$\mathrm{THD} = \frac{10^{H_{rms}/10}}{10^{IR_{rms}/10}} \times 100 \; [\%] \qquad (1)$$

    where Hrms is the root mean square level, in dB, of all the harmonics and IRrms is the root mean square level, in dB, of the original impulse in the impulse response of the filter + distortion system. This descriptor is useful because it relates the harmonic energy generated by the distortion process to the energy of the input signal (original impulse). The results obtained for each distortion level are shown in Table 2.
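As a worked check of Eq. 1, the sketch below applies it to the energy levels reported in Table 2. The function and variable names are illustrative; the only inputs are the dB levels of the original impulse and of the harmonics listed in that table.

```python
def thd_percent(harmonics_db, impulse_db):
    """THD in percent from the levels (in dB) of the harmonics and of the original impulse."""
    return 10 ** (harmonics_db / 10) / 10 ** (impulse_db / 10) * 100

# (original impulse level [dB], harmonics level [dB]) as reported in Table 2
table_2_levels = {
    "D1": (-54.1, -72.8),
    "D2": (-53.69, -66.2),
    "D3": (-52.51, -60.5),
    "D4": (-51.37, -54.3),
}

for name, (impulse_db, harmonics_db) in table_2_levels.items():
    print(name, round(thd_percent(harmonics_db, impulse_db), 2), "%")
```

The computed percentages agree with the THD column of Table 2 (1.35%, 5.61%, 15.89% and 50.93%), which confirms that the tabulated levels are power levels in dB.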

    Figure 4. Sonograms comparison (panels: original voice register, D1 distorted voice register, D4 distorted voice register)

    Table 2. THD [%] of each filter + distortion system

    Distortion level   Signal energy level [dB]             THD [%]
                       Original impulse     Harmonics
    D1                 -54.1                -72.8           1.35
    D2                 -53.69               -66.2           5.61
    D3                 -52.51               -60.5           15.89
    D4                 -51.37               -54.3           50.93

    3. SIGNAL ACQUISITION AND PROCESSING

    3.1. Voice signal recordings

    Six different voices were recorded using high-quality recording equipment to obtain "clean" signals with a high signal-to-noise ratio, a wide bandwidth and low harmonic distortion. The following list details the equipment and software employed:

    Notebook POSITIVO BGH C570
    M-Audio Fast Track audio interface
    RODE NT 2A microphone
    XLR-XLR cable
    Svantek 959 sound level meter + microphone calibrator
    Microphone tripod
    Adobe Audition (sample rate: 44100 Hz; resolution: 16 bits; channel mode: mono)

    The recordings were performed in a room with the following acoustic characteristics:

    Background noise: 42 dBA
    Reverberation time: 0.4 s

    These values were measured with the sound level meter. The background noise was measured over 5 minutes, and the reverberation time T30 was obtained from a balloon burst by averaging the results of the 500, 1000 and 2000 Hz octave bands.

    In order to perform a test that represents the actual working conditions of a forensic expert, five Spanish words commonly encountered in forensic acoustics were recorded by each person: Dinero (money), Secuestro (kidnapping), Policía (police), Rehén (hostage) and Bomba (bomb).

    Finally, to minimize the temporal characteristics of each person's speech, the speakers had to make the recordings imitating the pronunciation of a reference speech, played back through headphones, that contains the five words defined above spoken with a specific rhythm.

    3.2. Signal processing and considerations

    Once the recordings of the six persons who contributed their voices were obtained, several stages of signal processing were carried out. The goal of this step is to process some words of each speech signal to simulate the poor sound quality produced by certain telephone lines, as defined in Section 2.

    First, a "normalization" to 100% was applied to each recorded word to equalize the levels of all the recordings, avoiding loudness differences between words and persons.

    From this point on, only three words (Dinero, Secuestro, Policía) were processed for each recorded voice, while the other two words (Rehén, Bomba) were not modified. The reason is that the subjective test compares original signals with distorted signals, and the words used as original signals must differ from the words used as distorted signals in order to avoid, once again, any temporal characteristics of the speech (rhythm and rate of speech, forms of pronunciation, particular accents, among others) that may help to distinguish a particular speaker. These are the two considerations taken into account during recording and processing to minimize the influence of temporal characteristics.

    Then, the four different levels of distortion were applied. Because this process consists of a typical limiter, the amplitude of the processed signals is strongly attenuated. For this reason another normalization process was applied to raise the loudness of the signals, but in this case the normalization was set to 60% because of the noisiness generated by the distortion, which could cause temporary hearing damage to the listener if the normalization value were set higher than 60%.
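The two normalization stages can be sketched as a simple gain that scales the peak of a signal to a percentage of full scale. This assumes that "normalization" means peak normalization, which the text does not state explicitly; the function name and sample values are illustrative.

```python
def peak_normalize(samples, target_percent=100.0):
    """Scale samples so that the largest absolute value equals target_percent of full scale."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)  # silent signal: nothing to scale
    gain = (target_percent / 100.0) / peak
    return [s * gain for s in samples]

# Hypothetical limited signal: voice peaks clipped at 0.01, background noise at 0.001
limited = [0.01, -0.01, 0.001, -0.001, 0.01]
normalized = peak_normalize(limited, 60.0)
print("peak:", max(abs(s) for s in normalized))
print("noise sample after normalization:", normalized[2])
```

Because the same gain is applied to every sample, any background noise left between syllables is raised together with the voice.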

    Figure 5. Block diagram of the entire signal processing.

    Figure 6. Impulse response of a filter + distortion system.

    [Figure labels: Log-Sine sweep, band pass filter, distortion system, inverse filter convolution; original impulse and harmonics 1 to 5]

    The entire process is shown in Figure 5. As explained above, the words "Dinero", "Secuestro" and "Policía" were normalized to 100%, then filtered, distorted and normalized once again to 60%. The words "Rehén" and "Bomba" were only normalized to 100% because they were used as the reference signals in the test procedure.

    To calculate the THD values, the impulse response of each filter + distortion system was obtained by feeding a Log-Sine sweep (between 100 and 4500 Hz) to the input of the system and then convolving the output signal with the inverse filter using the AURORA plugins [10]. As shown in Figure 6, this process separates in time the different harmonics created by the distortion processes, which makes it possible to compare the energy of those harmonics with the energy of the original impulse.
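The time separation of the harmonics follows from a known property of exponential (log-sine) sweeps: after deconvolution, the response of the N-th order distortion product appears ahead of the linear impulse by T·ln(N)/ln(f2/f1), where T is the sweep duration and f1, f2 are its start and stop frequencies. The sketch below evaluates that offset for the 100 to 4500 Hz sweep; the 10 s duration is a hypothetical value, since the sweep length is not given in the text.

```python
import math

def harmonic_advance_s(order, sweep_duration_s, f_start_hz=100.0, f_stop_hz=4500.0):
    """Time (s) by which the order-th distortion product precedes the linear impulse."""
    return sweep_duration_s * math.log(order) / math.log(f_stop_hz / f_start_hz)

SWEEP_DURATION_S = 10.0  # assumed value, not stated in the paper

for order in range(1, 6):
    advance = harmonic_advance_s(order, SWEEP_DURATION_S)
    print("order", order, "advance:", round(advance, 3), "s")
```

Order 1 is the linear response itself (zero advance). Higher orders arrive progressively earlier, so each one can be windowed out and its energy measured separately.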

    4. TEST PROCEDURE

    The subjective tests were performed by fifteen subjects, ten men and five women, aged between 22 and 27 years. In all cases headphones were used in order to minimize the background noise of the rooms where the tests were performed, allowing the subjects to concentrate on listening without external interference.

    Figure 7 shows a typical test table. The subjects first had to listen to the Reference 1 track, which contains the two unmodified words (Rehén, Bomba) spoken by one person. Then they had to listen to the six Voice tracks (A, B, C, D, E and F), each containing the three distorted words (Dinero, Secuestro, Policía), and decide which of the six Voice tracks corresponds to the same person as the Reference 1 track. This procedure was repeated for the remaining Reference tracks. There were four tables like the one shown in Figure 7, each corresponding to a different distortion level of the Voice tracks. The first test corresponds to Distortion D4, the second to Distortion D3, and so on. The subjects were allowed to replay any track as many times as necessary, so the time required to complete the test varied from subject to subject. On average it took twenty minutes to complete the test.

    Figure 7. Test table for the Distortion D1.

    5. RESULTS

    The results of the preliminary test (speaker recognition using filtered voices) are shown in Figure 8. Ten of the twelve subjects (83.33%) were able to recognize all the voices, while the remaining two subjects (16.66%) achieved four recognitions.

    Figure 8. Preliminary test. Amount of successful recognitions

    These results indicate that the filtering process does not significantly hinder a successful speaker recognition task. The varying degrees of difficulty in recognizing speakers are therefore considered to be closely related to the distortion processes.

    Regarding the main test, the number of successful recognitions achieved by each subject in the four tests was studied in order to analyze the difficulty of recognizing speakers at the different distortion levels. The results are shown in Figure 9.

    Figure 9. Amount of successful recognitions of each subject for each distortion level

    As can be seen, the lower distortion levels yield a large number of successful recognitions, as expected. Three cases with no recognition at all were found, two for Distortion D4 and one for Distortion D3. On the other hand, seven subjects matched all the voices correctly, all of them at the minimum distortion level. In general, the results obtained for distortion levels D4 and D3 are evenly distributed between one and four successful recognitions. It is impossible to score exactly five, because if one comparison is wrong then inevitably another one is wrong as well.
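The impossibility of exactly five correct matches can be verified by brute force. The sketch below assumes that each subject assigns every Voice track to exactly one Reference track (a one-to-one matching), enumerates all possible assignments of six voices and counts the correct matches in each; the function name is illustrative.

```python
from collections import Counter
from itertools import permutations

def correct_match_distribution(n_voices=6):
    """Count, over all one-to-one assignments, how many yield each number of correct matches."""
    counts = Counter()
    for guess in permutations(range(n_voices)):
        hits = sum(1 for reference, voice in enumerate(guess) if reference == voice)
        counts[hits] += 1
    return counts

distribution = correct_match_distribution()
total = sum(distribution.values())
chance_mean = sum(hits * count for hits, count in distribution.items()) / total

print("assignments with exactly five correct matches:", distribution[5])
print("mean number of correct matches when guessing at random:", chance_mean)
```

No assignment gives exactly five correct matches, and purely random matching gives one correct match on average, which is a useful baseline when reading the mean scores reported for each distortion level.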

    Analyzing the results for each distortion level, the following behaviors were found:

    For Distortion D4 the recognition was very poor, with a mean of 1.33 successful recognitions (22.2%), while the maximum number was three (50%), achieved by only one subject. Moreover, two cases of null recognition (0%) were found. The distribution of the results shows that twelve of the fifteen subjects (80% of the total population) were able to recognize only one or two voices.

    In the Distortion D3 test, a slight improvement was obtained, with an average of two successful recognitions (33.3%). Only one subject recognized four voices, which was the maximum number of successes obtained (66.6%). There was also one case of null recognition (0%). In this case the distribution of the results indicates that only five subjects (33.3%) recognized three or four voices.

    The Distortion D2 test did not provide significant improvements over the previous case. The average number of successful recognitions increased only to 2.53 (42.2%), and the maximum of four (66.6%), achieved by three subjects, was not exceeded. Unlike the previous cases, no subject failed every recognition, although only 46.6% of the subjects recognized three or more voices, none of them exceeding four successful recognitions.

    Finally, the Distortion D1 test gave the most accurate results, with a mean of 4.87 successful recognitions (81.1%). Fourteen subjects (93.3%) were able to correctly identify more than three voices, seven of whom (46.6%) recognized all the voices. The mean values of successful recognitions and the associated standard deviations are listed in Table 3.

    Another interesting result is the number of successful recognitions for each of the six voices. For this analysis, the number of subjects who were able to recognize each voice individually at each of the four distortion levels was determined. The results are shown in Figure 10.

    Table 3. Mean and Standard Deviation values of successful recognitions for each distortion level

    Distortion level   Successful recognitions
                       Mean     Standard deviation
    1                  4.87     1.13
    2                  2.53     0.99
    3                  2.00     1.07
    4                  1.33     0.82

    Figure 10. Amount of successful recognitions for each one of the six voices

    These results show that, for all distortion levels, the voice of person A (Voice A) had the largest number of successful recognitions. Voice D also has a large number of successful recognitions, especially for D1, where 86.66% of the subjects were able to correctly identify that person.

    The statistical analysis (correlation coefficient, standard error, coefficient of determination R square, and p-value at a 95% confidence level) relating the distortion level to the number of successful recognitions of each voice is listed in Table 4. The results obtained for Voices A, D and F indicate that the number of successful recognitions is closely related to the distortion level. The p-values obtained for those voices are below 0.05, which confirms that the two variables are significantly correlated.


    Table 4. Distortion Level. Statistical analysis

                                Successful recognitions
    Distortion level            Voice A   Voice B   Voice C   Voice D   Voice E   Voice F
    1                           15        12        11        13        12        10
    2                           10        3         5         9         4         7
    3                           8         4         5         5         3         5
    4                           5         3         3         3         2         4
    Correlation                 0.983     0.770     0.894     0.990     0.875     0.976
    Standard error              0.424     1.523     0.849     0.346     1.212     0.316
    Determination coefficient   0.966     0.593     0.800     0.980     0.766     0.952
    p-value                     0.017     0.229     0.105     0.010     0.124     0.024

    The same analysis was performed relating the number of successful recognitions to the four calculated THD values. The results are listed in Table 5. The negative correlation indicates that the higher the THD, the lower the number of successful recognitions. However, none of the p-values allows the null hypothesis to be rejected, so it cannot be affirmed that the number of successful recognitions of each voice is correlated with the THD. For this reason, the THD parameter is not useful for quantifying the distortion level of the designed distortion systems.

    Table 5. THD. Statistical analysis

                                Successful recognitions
    THD [%]                     Voice A   Voice B   Voice C   Voice D   Voice E   Voice F
    1.35                        15        12        11        13        12        10
    5.61                        10        3         5         9         4         7
    15.89                       8         4         5         5         3         5
    50.93                       5         3         3         3         2         4
    Correlation                 -0.856    -0.532    -0.717    -0.845    -0.645    -0.804
    Standard error              0.068     0.116     0.076     0.074     0.110     0.049
    Determination coefficient   0.732     0.283     0.513     0.715     0.416     0.674
    p-value                     0.144     0.468     0.283     0.154     0.354     0.195
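The correlation coefficients and p-values of Tables 4 and 5 can be checked with a short sketch for Voice A. It computes the Pearson coefficient and the two-sided p-value of the usual t-test, which has a closed form for four data points (two degrees of freedom). Using the limiter threshold in dB as the predictor for Table 4 is an assumption that reproduces the positive sign reported there; coding the levels as 1 to 4 gives the same magnitude with the opposite sign.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def two_sided_p_four_points(r):
    """Two-sided p-value of r for n = 4, using the t distribution with 2 degrees of freedom."""
    t = r * math.sqrt(2.0 / (1.0 - r * r))
    return 1.0 - abs(t) / math.sqrt(t * t + 2.0)

limiter_db = [-20, -30, -40, -50]          # D1 to D4 limiter thresholds (assumed predictor)
thd_values = [1.35, 5.61, 15.89, 50.93]    # THD [%] of D1 to D4
voice_a = [15, 10, 8, 5]                   # successful recognitions of Voice A, D1 to D4

for label, predictor in (("distortion level", limiter_db), ("THD", thd_values)):
    r = pearson_r(predictor, voice_a)
    print(label, "r =", round(r, 3), "R^2 =", round(r * r, 3),
          "p =", round(two_sided_p_four_points(r), 3))
```

With only four distortion levels the test has two degrees of freedom, so the coefficient must be very close to 1 in magnitude for the p-value to fall below 0.05.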

    Finally, a comparison between the correlation coefficients of the Distortion Limit Level and the THD is shown in Table 6.

    Table 6. Correlation coefficients comparison

              Successful recognitions   Correlation coefficient
              D1    D2    D3    D4      Distortion level   THD
    Voice A   5     9     13    1       0.98               -0.86
    Voice B   5     5     11    1       0.77               -0.53
    Voice C   4     3     12    1       0.89               -0.72
    Voice D   8     10    15    1       0.99               -0.85
    Voice E   3     4     12    1       0.88               -0.65
    Voice F   5     7     10    1       0.98               -0.80

    The correlation coefficients of the Distortion Level parameter are greater in magnitude than those obtained for the THD parameter for all the voices. The correlation coefficients of the Distortion Level are positive, which indicates a direct correlation between the variables. In contrast, the correlation coefficients of the THD parameter are negative, indicating an inverse correlation between the variables.

    However, the results show that the THD cannot be used as an objective parameter for quantifying
    the degree of accuracy of a speaker recognition. The THD only considers the energy of the harmonics
    and ignores the signal-to-noise ratio, which decreases as the distortion level increases and changes
    the degree of noisiness perceived by the listener. The increase in noise energy between syllables
    and words caused by the second normalization process at 60% degrades the quality of the voice
    recordings, because each person's voice is masked by the noise level of the signal. Figure 11 shows
    two distorted voice signals (corresponding to the word secuestro, as in the previous analysis) on
    the same amplitude scale before the 60% normalization process is applied: the left one has the D4
    distortion level applied and the right one has D1.
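
    The limitation of the THD described above can be illustrated with a minimal sketch. It assumes the common fundamental-referenced definition of THD (RMS sum of the harmonics divided by the fundamental), which may differ from the definition used for the telephone line simulator. All amplitudes are hypothetical values chosen for illustration and are not measurements from this study.

```python
import math


def thd_percent(fundamental: float, harmonics: list[float]) -> float:
    """THD relative to the fundamental amplitude, in percent (assumed definition)."""
    return 100.0 * math.sqrt(sum(h * h for h in harmonics)) / fundamental


def snr_db(signal_rms: float, noise_rms: float) -> float:
    """Signal-to-noise ratio in decibels from RMS amplitudes."""
    return 20.0 * math.log10(signal_rms / noise_rms)


# Hypothetical tone: fundamental plus 2nd and 3rd harmonics.
fundamental = 1.0
harmonics = [0.3, 0.4]

# The THD depends only on the harmonics, not on the noise floor.
thd = thd_percent(fundamental, harmonics)

# Same harmonic content with two different hypothetical noise floors.
snr_quiet = snr_db(fundamental, 0.01)
snr_noisy = snr_db(fundamental, 0.1)

print(f"THD = {thd:.1f} %, SNR = {snr_quiet:.1f} dB or {snr_noisy:.1f} dB")
```

    Both cases give the same THD although their noise floors differ by 20 dB, which is why a THD figure alone cannot describe how strongly the noise masks the voice.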

    The intervals of silence between syllables (which actually contain the background noise of the
    original recordings) are highlighted, showing that those intervals have the same amplitude
    regardless of the distortion level applied, because the limiter starting value (-50 dB for D4) is
    never reached. The noise amplitude is of the same order as the voice amplitude for the D4 signal,
    but it is much lower than the voice amplitude for the D1 one. Because of this, when the 60%
    normalization process is applied, the amplitude of the noise increases in proportion to the level
    of distortion, as shown in Figure 12. This suggests that it is necessary to study the influence of
    the signal-to-noise ratio produced by the distortion process in conjunction with the THD of the
    system, since the latter is not sufficient to quantify the degree of successful recognitions.

    Figure 11. Noise between syllables before the 60% normalization process (panels: D4, D1)

    Figure 12. Noise between syllables after the 60% normalization process (panels: D4, D1)
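
    The growth of the noise after normalization can be sketched numerically. The example below assumes that the distortion acts as a hard limiter whose threshold is expressed in dB relative to full scale, that the speech peaks exceed the threshold while the background noise stays below it, and that the 60% normalization scales the peak to 0.6 of full scale. The noise level and the D1 threshold are hypothetical; only the -50 dB threshold for D4 comes from the text.

```python
import math


def db_to_linear(level_db: float) -> float:
    """Convert a level in dB (re full scale) to a linear amplitude."""
    return 10.0 ** (level_db / 20.0)


def noise_after_normalization(limiter_db: float, noise_db: float,
                              target_peak: float = 0.6) -> float:
    """Noise level in dB after hard limiting and peak normalization.

    Assumes the speech peaks are clipped to the limiter threshold and the
    noise lies below it, so the noise passes through the limiter unchanged.
    """
    if noise_db > limiter_db:
        raise ValueError("this sketch assumes the noise is below the threshold")
    limiter = db_to_linear(limiter_db)
    noise = db_to_linear(noise_db)
    # Normalization gain that brings the clipped peak to the target level.
    gain = target_peak / limiter
    return 20.0 * math.log10(noise * gain)


NOISE_DB = -60.0  # hypothetical background noise of the recording

noise_d4 = noise_after_normalization(-50.0, NOISE_DB)  # threshold from the text
noise_d1 = noise_after_normalization(-10.0, NOISE_DB)  # hypothetical threshold

print(f"D4: {noise_d4:.1f} dB, D1: {noise_d1:.1f} dB")
```

    With these assumed values the noise ends up 40 dB higher for D4 than for D1, which equals the difference between the two limiter thresholds: the lower the threshold, the larger the normalization gain applied to the unchanged noise.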

    Finally, the largest number of successful recognitions corresponds to Voice A, which belongs to a
    person known to the subjects before the test. All subjects have known that person, and have been in
    daily contact with them, for more than four years. This suggests that familiarity with a particular
    voice may improve the recognition task.

    6. CONCLUSIONS

    The test results show that the chances of successfully recognizing three of the six voices employed
    are strongly correlated with the distortion level. For the other three voices this conclusion cannot
    be affirmed. Therefore, more subjects and more tests are needed to obtain enough subjective data to
    confirm or refute the hypothesis that the distortion level and the number of successful recognitions
    are strongly correlated. The THD parameter of the distortion system is not useful for predicting how
    difficult a successful speaker recognition will be, because there is no strong correlation between
    the THD and the number of successful recognitions.

    A visual analysis of the formants in a sonogram leads to the conclusion that, as the distortion
    level increases, discriminating the energy of the vowels becomes more difficult. Spectrograms are a
    valuable visual tool, but their usefulness is limited by the dynamic quality of the recordings
    analyzed, as both distortion and noise can impair the correct interpretation of the graphs.

    These findings were reflected in the tests, where good results were achieved only with the minimum
    distortion level. This may indicate that the subjects who performed the test use some spectral
    information, related mainly to the vowel energy in the original reference tracks, and then try to
    characterize each person's voice with that information. Then, listening to both the reference and
    the distorted tracks, the subjects try to find some common behavior in the vowel energy
    distribution between speakers.

    In the filtered signal test, a strong dependence

    between the speaker recognition performance and the

    elimination of spectral information using a telephone-

    like filter could not be found.

    As explained, the test was designed to exclude any temporal information that might help the
    recognition process. However, after the tests some subjects commented that they had used temporal
    characteristics related to the pronunciation of the recorded persons to try to recognize them,
    especially at the two highest levels of distortion.

    7. FUTURE WORK

    One of the most important needs emerging from the results of this report is to increase the number
    of subjects performing the test and to add more distortion levels, which will increase the
    resolution of the results. That is, the regression lines will be calculated with more data and the
    correlation coefficients will be more accurate. In addition, it is necessary to evaluate the
    influence of the reduction of the signal-to-noise ratio together with the THD variations in order
    to design an optimal parameter that accurately quantifies the degree of difficulty of a speaker
    recognition task.

    It would also be interesting to process the signals used for the subjective tests with automatic
    speaker recognition systems, which would make it possible not only to classify the degrees of
    precision of such systems but also to compare their number of successful recognitions with those
    obtained subjectively. This would allow forensic experts to determine to what extent they can rely
    on their interpretation.

    On the other hand, it is necessary to investigate and develop new technologies that allow a clear
    transmission of the human voice in telephone systems, as this is the primary means of communication
    used by kidnappers.

    It is also necessary that all offenders under police custody undergo a voice recording session, so
    that their voice recordings are stored and can be used if any of them re-offends.

    For further research, it would also be interesting to include female voices in order to analyze the
    influence of the different distortion levels when both male and female voices are compared, and to
    study the influence of previously known voices.


    APPENDIX

    Subjective Test

    Acoustic Laboratory Name: _________________ Age: _____ Date: ___/___/________

    The voices of 6 different persons were recorded, each one saying the following list of 5 words:
    Dinero, Secuestro, Policía, Rehén, and Bomba.

    The first three words were filtered and distorted.

    The other two words were not modified.

    First listen to the Reference 1 track, which contains the two unmodified words (Rehén, Bomba)
    spoken by one person. Then listen to the six Voice tracks (A, B, C, D, E and F), each one
    containing the three distorted words (Dinero, Secuestro, Policía), and decide which of the six
    Voice tracks corresponds to the same person as the Reference 1 track. Repeat this procedure for
    the rest of the Reference tracks.

    Distortion 4 Voice A Voice B Voice C Voice D Voice E Voice F

    Reference 1

    Reference 2

    Reference 3

    Reference 4

    Reference 5

    Reference 6

    Distortion 3 A B C D E F

    Reference 1

    Reference 2

    Reference 3

    Reference 4

    Reference 5

    Reference 6


    Distortion 2 A B C D E F

    Reference 1

    Reference 2

    Reference 3

    Reference 4

    Reference 5

    Reference 6

    Distortion 1 A B C D E F

    Reference 1

    Reference 2

    Reference 3

    Reference 4

    Reference 5

    Reference 6
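
    As an illustration of how completed sheets could be turned into counts of successful recognitions per voice, a minimal sketch follows. The answer key and the two filled-in sheets are hypothetical and are not data from this study; the function name is also an assumption.

```python
VOICES = "ABCDEF"


def count_successes(answer_key: dict[int, str],
                    sheets: list[dict[int, str]]) -> dict[str, int]:
    """Count, per voice, how many times subjects matched it to its reference.

    answer_key maps a reference number to the voice that truly belongs to it;
    each sheet maps a reference number to the voice chosen by one subject.
    """
    successes = {voice: 0 for voice in VOICES}
    for sheet in sheets:
        for reference, chosen in sheet.items():
            if answer_key[reference] == chosen:
                successes[chosen] += 1
    return successes


# Hypothetical answer key for one distortion level.
answer_key = {1: "C", 2: "A", 3: "F", 4: "B", 5: "E", 6: "D"}

# Hypothetical answers from two subjects at that distortion level.
sheets = [
    {1: "C", 2: "A", 3: "B", 4: "F", 5: "E", 6: "D"},
    {1: "A", 2: "C", 3: "F", 4: "B", 5: "E", 6: "D"},
]

successes = count_successes(answer_key, sheets)
print(successes)
```

    Repeating the count for each of the four distortion grids would give one row of successful recognitions per voice.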

