
    Forensic Speech Science

    Part I: Forensic Phonetics

    Anders Eriksson

    Department of Linguistics,

    Gothenburg University, Gothenburg, Sweden

    Historical background

    Man has always had strong

    intuitions about the reliability of voice recognition:

    The voice of the speaker is as

    easily distinguished by the ear as

    the face is by the eye

    Quintilian, 35-96 AD


    Historical background

    An early court case:

    In 1660, William Hulet was

    accused of having executed King

    Charles I. A witness, Richard

    Gittens, testified that he knew that

    it was Hulet by his speech.

    Historical background

    On March 1, 1932 the son of the famous aviator

    Charles Lindbergh was kidnapped and was later

    found dead. The crime has been called the Crime of the Century because of the enormous publicity

    it attracted. Its relevance to forensic phonetics,

    however, has to do with voice recognition and

    memory.

    Before it became known that the boy was dead, a

    ransom was paid to the kidnapper by a negotiator. On that occasion, April 2, 1932, Lindbergh heard

    the kidnapper's voice, but could not see him.


    Historical background

    In September 1934, 29 months after hearing the voice of the kidnapper,

    Lindbergh (in disguise) was confronted

    with the suspected kidnapper, Bruno

    Hauptmann, who was instructed to

    repeat the phrase Lindbergh had heard.

    Lindbergh then claimed that he recognized the voice as the one he had

    heard 29 months earlier.

    Historical background

    At the trial in January 1935, Lindbergh testified under oath that the suspect's voice was

    the one he had heard 29 months earlier.


    Historical background

    The invention of

    the sound spectrograph meant

    a breakthrough in

    speech analysis. A

    first model was

    built at Bell Labs

    in the early forties.

    Historical background

    The original motivation behind the

    development of the spectrograph was the phonetic study of speech.

    a method of approach to studies

    of speech production and

    measurement

    Steinberg, 1934


    Historical background

    A real-time spectrograph called the Direct

    Translator was also produced, to be

    used for pronunciation training for the

    deaf and foreign language students.

    Historical background

    In spite of the general interest of the

    spectrograph as a tool and the

    suggested applications, no

    publications describing the work

    appeared from Bell labs until 1945.

    Why?

    Because the work was rated as a war project.


    Historical background

    The reason could hardly have been

    military applications of pronunciation

    training for the deaf. It

    must have been something else.

    We have reasons to believe that

    speaker identification by the use of spectrograms was what gave the

    research its war project rating.

    Historical background

    It has been suggested that one of the

    intended applications was identifying enemy warships by identifying

    their radio operators, but very little is known about it.

    The term voiceprint appears in

    some publications but without

    explicit reference to speaker

    identification.


    Historical background

    If the people at Bell Labs, sponsored by the

    military, secretly worked on voiceprints for speaker identification purposes, as we have

    reasons to believe, then the early history of

    the voiceprints follows very parallel tracks

    in the USSR, including the fact that we know

    very little about it.

    The only (?) account of the Soviet efforts we

    have is the novel The First Circle by

    Solzhenitsyn.

    Historical background

    The plot of the novel takes place within a

    time-span of only three days during the

    Christmas Holiday of 1949 and the setting is the Mavrino prison on the outskirts of

    Moscow, where the Stalinist regime held

    unreliable scientists imprisoned.

    The prison had its own acoustics laboratory

    and the so-called Clipped Speech Laboratory where work on speech coding took place.


    Historical background

    One day the focus shifted, at least

    temporarily, from voice clipping to voice

    recognition, when the people working in

    the lab were given the task of identifying an

    anonymous speaker in a tapped telephone

    conversation by comparing the recorded call

    with sample recordings of five suspects.

    They were given only two days to complete

    the task.

    Historical background

    Given that Siberia was a likely alternative

    option, it comes as no surprise that they succeeded with their task.

    There is no detailed information in the

    novel about the methods they used, but it

    is obvious that they were familiar with

    similar efforts outside the USSR.


    Historical background

    Based on the description in the novel, it seems likely that the

    spectrograph they used may

    have been based on the description

    by Steinberg published in JASA in

    1934.

    Historical background

    This diagram in Steinberg's paper fits

    the description in the novel very well.


    Historical background

    Screen shots from the American

    television (1991) series based on The First Circle.

    Historical background

    Two quotations from the novel:

    The science of phonoscopy, born today,

    December 26th 1949, does have a rational core.

    They envisioned the system, like fingerprinting ... Any criminal conversation

    would be recorded ... and the criminal

    would be caught straight off, like a thief who had left his fingerprints on the safe

    door.


    Historical background

    The term the inmates at Mavrino coined for the use of acoustic

    analysis as a means of speaker

    identification was phonoscopy. In

    Russia today and many former Eastern

    European countries this is still the term used.

    Some fundamental issues

    In the following sections we will

    present a selection of important issues in forensic phonetics, trying

    to describe problems as well as

    solutions, and what we know at

    present and do not yet know.


    Voiceprints

    Much of the story of voiceprinting in forensic phonetics revolves

    around one particular man, Lawrence G. Kersta, who was an

    engineer at Bell and head of the lab

    until he resigned in 1966 to start his own company dedicated to forensic

    phonetics.

    Voiceprints

    Between 1945, when people at

    Bell started to publish again, and 1962, there was no mention of

    voiceprints.

    But in 1962 Kersta, still at Bell,

    published a paper in Nature titled

    Voiceprint identification


    Voiceprints

    He also gave a paper at the ASA meeting that same year called

    "Voiceprint-identification

    infallibility.

    In both papers he described how

    spectrograms could be used for speaker identification.

    Voiceprints

    What made his claims so remarkable

    was, however, the accuracy he claimed for his method.

    Based on visual comparison of key

    words, his examiners achieved no

    less than 99% correct identification

    or better.


    Voiceprints

    In spite of his rather sensational claims

    and the fact that his description of the method was vague, to say the least, the

    scientific community was slow to

    react.

    Up until 1966 when he resigned from

    Bell to start his own company he

    remained largely unchallenged.

    Voiceprints

    He therefore enjoyed some initial

    success and his testimonies were

    accepted as evidence by courts in

    some, but not all, states.

    He later began to meet with resistance,

    however, when other researchers

    tested the method of visual voice recognition from spectrograms.


    Voiceprints

    Subjects in a study by Young and Campbell (1967), for example, using

    the voiceprint technique, obtained

    78.4% correct identifications for two

    words spoken in isolation but only

    38.3% when the same words were taken from different contexts.

    Voiceprints

    Many others joined in as more and

    more results indicated that the method was by no means as

    reliable as Kersta had claimed.

    But there were also those who

    supported him, most notably Tosi

    who was a qualified phonetician.


    Voiceprints

    A weak point, in addition to the fact that the results could not be

    reproduced was that there was never

    a detailed, explicit description of the

    method. We may rather safely

    assume, however, that it was largely intuitively based.

    Voiceprints

    The controversy continued until the

    late eighties and voiceprinting is still done by private detectives and other

    non-academic experts but nobody

    in the speech science community

    believes in its usefulness for forensic

    purposes any more.


    Voiceprints

    What we, as forensic phoneticians, may learn from this experience is not

    so much that the methods were not

    sufficiently reliable, but that they

    were put to use in forensic field work

    without having been thoroughly tested and that professional phoneticians

    were far too slow to react to it.

    Voice recognition and memory

    As we mentioned, the Lindbergh case

    raised questions about voice

    recognition accuracy and memory. A

    researcher who questioned whether it

    would be possible to accurately

    remember an unknown voice over a

    period of two years was a psychologist

    by the name of Francis McGehee.


    Voice recognition and memory

    In the first of her experiments the

    listeners heard a speaker read a 56-word passage. They were then

    assigned to groups who heard the

    speaker as one of the speakers in a

    voice line-up with five foils at

    intervals of 1, 2, and 3 days, 1, 2, and 3 weeks, and 1, 3, and 5 months

    respectively.

    Voice recognition and memory

    Recognition rate varied as a function

    of time, starting at a little over 80% correct identifications after a lapse of 1

    day or 1 week. After 2 weeks the

    recognition rate had fallen to 69%,

    after a month to 57%, after 3 months

    to 35%, and after 5 months it was down to 13%, which is less than chance.


    Voice recognition and memory

    Later studies have in general confirmed her

    findings although the precise decay rate

    may vary from study to study.

    (Figure: correct identifications (%) plotted as a function of time lapse in weeks.)

    Non-contemporary speech samples

    The term refers to speech samples,

    which are obtained at different points in time and later used in an

    identification process. The relevant

    question in forensic phonetics is at

    what separation in time between

    speech samples change over time becomes a problematic factor.


    Non-contemporary speech samples

    In forensic cases time spans of a year

    or more between a suspect recording

    and a later attempt at identifying the

    speaker are not unusual. It is therefore

    important to know if voice changes

    that take place over a period of one or

    a few years may affect the accuracy of

    speaker recognition.

    Non-contemporary speech samples

    This question has been addressed in

    a series of studies by Hollien and

    Schwartz (2000).

    They tested latencies between

    recordings from 4 weeks up to 20

    years.


    Non-contemporary speech samples

    There was a drop in correct

    identification from around 95% for contemporary samples to 70-85% for

    latencies from 4 weeks to 6 years

    (with no observable time trend in the

    interval). For the 20-year latency,

    however, a sharp drop down to 35% could be observed.

    Non-contemporary speech samples

    For similar voices,

    however, there was a

    dramatic effect.

    Performance dropped

    from around 95% for

    contemporary

    samples to 40% for samples recorded

    only 4 weeks later.

    In the normal case, non-contemporary speech thus

    seems to affect identification only marginally.


    Other issues involving the sample

    Other factors that may influence identification

    accuracy are primarily sample duration and

    acoustic quality.

    If we first consider the influence of sample

    duration, we may observe that in real life

    investigations samples may be very short, often

    just a few words or a phrase or two, which means that sample duration is on the order of a

    few seconds.

    Other issues involving the sample

    In an early study by Pollack et al. (1954) the

    authors observed that identification accuracy

    increased with sample size but only up to about

    1.2 seconds. For longer samples phonetic

    variation took over as the most important factor.

    They conclude that duration per se is

    relatively unimportant, except insofar as it

    admits a larger or smaller statistical sampling of the speaker's speech repertoire.


    Other issues involving the sample

    This somewhat surprising finding has,

    however, been confirmed in other

    studies. Bricker and Pruzansky (1966)

    presented stimuli which varied in

    duration as well as phonemic variation.

    They found that identification rate

    increased with duration only if the longer stimuli also contained more

    phonemic variation.

    Other issues involving the sample

    It is important to point out, however, that

    while an increase in correct identifications

    is desirable, it is equally desirable to keep the number of false alarms down.

    Yarmey and Matthys (1992) found that:

    The facilitating effect on identification of

    longer voice-sample durations was

    counteracted by the high false alarm rates in both suspect-present and suspect-absent

    line-ups.


    Other issues involving the sample

    A large proportion of threats and abuse is done over the telephone. Telephone-quality

    speech has therefore received some

    attention in forensic phonetics studies.

    An important question in the forensic

    context is whether the poorer sound quality

    of recorded telephone conversations

    adversely affects voice identification.

    Other issues involving the sample

    It is a common belief that because of the

    difference in sound quality, speaker

    identification of voices heard over the

    telephone must necessarily be performed

    using voices recorded over the telephone,

    the underlying assumption being that the

    difference in sound quality would make

    identification less reliable if directly recorded voice samples were used.


    Other issues involving the sample

    There are surprisingly few studies that

    address this question, but the results there are indicate that the problem might not be

    as serious as one might expect.

    Rathborn et al. (1981) did not find any

    significant differences in identification of a

    target voice heard over the telephone and

    tested using a taped lineup over the telephone, in contrast to voice identification

    tested directly with a taped lineup.

    Other issues involving the sample

    A question that has received some attention

    lately is the influence on acoustic analysis

    of voice samples of the band-pass filtering

    that occurs in telephone transmissions.

    Künzel (2001) found that the lower cut-off

    frequency had the effect of shifting F1 in

    German vowels upwards compared to the

    corresponding tokens in a simultaneous

    DAT-recording. The average frequency

    shift was on the order of 6%


    Familiarity with the speaker

    Hollien et al. studied

    speaker identification as a function of familiarity

    under three speaking

    conditions, normal,

    stressed and disguised.

    Listeners who were

    familiar with the speakers performed

    significantly better under

    all conditions.

    Familiarity with the speaker

    These results have generally been confirmed

    in other studies.

    It is important to point out, however, that

    although recognition rates are generally high

    for familiar speakers, recognition is by no

    means always perfect. For individual

    speakers and listeners the error rates can be

    very high if the utterances are short and belong to a fairly large open set. (Ladefoged

    & Ladefoged, 1980)


    Familiarity with the speaker

    An influence of utterance length on the recognition of familiar speakers has also

    been found in other studies.

    In a series of experiments reported by Rose

    and Duncan (1995), recognition of familiar

    speakers varied from chance level to nearly

    perfect as a function of utterance length.

    Familiarity with the speaker

    It has been generally assumed that in

    voice recognition, discrimination

    constitutes the initial step, with recognition occurring as a later phase.

    But Van Lancker et al. have shown

    that discrimination and recognition

    are not stages in one process, but are dissociated, unordered abilities.


    Familiarity with the speaker

    It is therefore entirely possible that a listener who is good at

    recognizing familiar speakers may

    perform badly if the task is to

    discriminate between unfamiliar

    speakers.

    Disguise

    Voice disguise, to the extent that it is

    used, may be a serious problem for speaker identification. In the extreme

    end of the spectrum we find electronic

    manipulation or even communicating

    via speech synthesis, which would

    make speaker identification virtually impossible.


    Disguise

    In the world of real forensic work,

    however, voice disguise tends to be of a rather unsophisticated nature.

    Künzel, based on experience from the

    German Federal Police (BKA), notes that

    falsetto, pertinent creaky voice,

    whispering, faking a foreign accent, and pinching one's nose are the most

    common types.

    Disguise

    Even unsophisticated types of disguise

    may have a considerable detrimental effect

    on speaker identification. In a study by

    Reich and Duke all types produced

    significantly fewer correct identifications.

    Hypernasality produced the greatest effect.

    Whisper resulted in markedly fewer

    correct identifications in a study by Orchard and Yarmey.


    Disguise

    Voice disguise is not as common as one might think. Künzel reports that:

    Over the last two decades, between 15

    and 25 per cent of the annual cases

    dealt with at the BKA speaker

    identification section exhibited at least one kind of disguise.

    Disguise

    Electronically manipulated messages are

    still rare, but Künzel notes that there has been an increase in recent years, mainly

    in the form of editing recorded voices.

    While at present electronic manipulation

    is rare and therefore not a significant

    problem, that may soon change with increasing availability of such devices.


    Foreign Accents

    It is generally found that foreign accent makes

    identification more difficult, but the difference is often small and not always present.

    McGehee found no difference at all using

    speakers with a German accent.

    Doty (1998) on the other hand found substantial

    differences using speakers from the US and

    England speaking English as a native language and speakers from France and Belize speaking

    English as a foreign language and native speakers

    of English as listeners (88% vs. 13%).

    Foreign Accents

    Results by Goldstein, et al. (1981) fall

    somewhere in between: With relatively long speech samples, accented voices

    were no more difficult to recognize than

    were unaccented voices; reducing the

    speech sample duration decreased

    recognition memory for accented and

    unaccented voices, but the reduction was greater for accented voices.


    Foreign languages

    Thompson (1987) recorded six bilingual male students reading messages in English,

    Spanish, and English with a strong Spanish

    accent.

    Voices were best identified by monolingual

    English speaking listeners when speaking

    English and worst when speaking Spanish. Identification accuracy was intermediate

    for the accent condition.

    Foreign languages

    Schiller and Köster (1996) tested

    Americans with no knowledge of

    German, Americans who knew some German, and native German speakers

    using recordings of German speakers.

    Subjects with no knowledge of German

    made significantly more errors than

    other subjects. Subjects who knew some German performed similarly to native

    German speakers.



    Foreign languages

    Köster and Schiller (1997) used Spanish and

    Chinese listeners.

    Spanish and Chinese listeners who were

    familiar with German showed better

    recognition rates than listeners with no

    knowledge of German.

    Spanish and Chinese listeners with a knowledge of German performed measurably

    worse than the German and English listeners

    with a knowledge of German.


    Foreign languages

    We may summarize the results by saying

    that listeners with no knowledge of a language perform worse on voice

    recognition than listeners with some

    knowledge or native speakers, while

    listeners with some knowledge of the

    language tend to perform on the same level

    as native speakers or only slightly below.



    Earwitnesses

    Factors which are relevant for speaker recognition in general, like memory,

    familiarity, disguise etc. are also

    relevant for earwitnesses, but there are

    additional factors about which we

    presently do not know as much as we

    would like.

    Earwitnesses

    The first such factor is stress.

    the majority of (the relatively few) studies of earwitnessing bear little

    resemblance to real-life witnessing

    circumstances. Most have used

    nonstressful situations with prepared

    subjects participating in laboratory

    situations. (Bull and Clifford, 1984)


    Earwitnesses

    The stress that witnesses may experience in a real-life situation can never be fully

    recreated in a laboratory experiment.

    Neither can we, or the witness, have much

    experience to draw on that will help us

    determine just how and how much the

    capabilities of a traumatized victim to recognize a voice or discriminate between

    voices may be affected.

    Earwitnesses

    Another factor is familiarity.

    personal experience of voice recognition is always of familiar voices, the voices

    that are not usually those to be identified in

    criminal situations. (Bull and Clifford)

    And as we know from the work by Van

    Lancker and Kreiman, recognizing a

    familiar voice and discriminating between

    unfamiliar ones are independent abilities.


    Earwitnesses

    A third factor is preparedness. Whereas subjects in a laboratory

    experiment are, to a greater or lesser

    degree, prepared for the situation, real

    life witnesses are in most cases not.

    Studies have shown that voice identification accuracy under unprepared

    conditions is much lower.

    Earwitness line-ups

    An earwitness line-up (or voice parade) is meant

    to be the auditory equivalent of an eyewitness

    line-up. It is used when a person has heard but not seen the perpetrator.

    Recordings of a suspect's voice and a number of

    foils are presented and the witness is to compare

    the voices with the memory of the perpetrator's

    voice and determine if any of them matches the

    memory.


    Earwitness line-ups

    Two important questions in connection with earwitness line-ups are

    1) how many voices should be present

    in the line-up?

    2) how similar to the suspect's voice

    should the voices of the foils be?

    Earwitness line-ups

    It has been found that with few voices

    there may be marked position effects and that the number of correct identifications

    decreases as lineup size increases. So the

    question is if there is an optimal size

    where the position effect is minimized

    and the decrease in correct

    identifications has bottomed out.


    Earwitness line-ups

    A number of studies have addressed the question of lineup size. They are in

    reasonable agreement that the decrease in

    identification accuracy bottoms out with

    about 6 foils and that position effects only

    appear if the target voice comes first.

    Thus, as a rule of thumb at least, 5 or 6

    foils should be used.

    Earwitness line-ups

    How similar to the target should the foils

    be?

    At least the two extremes must be avoided.

    The target voice must not stand out as

    different. The speakers must be reasonably

    matched with respect to characteristics like

    speaker age, dialect etc.

    On the other hand, they should not be sound-alikes.


    Earwitness line-ups

    When Rothman (1977) used sound-alikes

    (brothers, fathers, sons), identification dropped from 94%

    (ordinary foils) to 58% (sound-alikes).

    Similar results were obtained by Hollien

    and Schwartz (2000).

    Thus foils should be chosen so as to represent a reasonable degree of

    variation but avoiding the extremes.

    Lie detection

    Attempts have been made recently to use

    brain scanning methods in order to study the possibility of consistent differences

    in brain activity patterns which separate

    lie or deception from truthful statements.

    Although this research is only in its

    infancy, some highly interesting results

    have been obtained.


    Lie detection

    Langleben et al. (2002) used Functional

    Magnetic Resonance Imaging (fMRI) to detect differences in brain activity when

    their subjects told a lie compared to when

    they told the truth. Their results indicate

    that: There is a neurophysiological

    difference between deception and truth at

    the brain activation level that can be

    detected with fMRI. Similar results have

    been obtained in other studies.

    Lie detection

    High resolution thermal imaging

    which can detect minor regional changes in the blood flow in the

    face for example has also been used

    in an attempt to develop methods to

    detect lie and deception (Pavlidis

    and Levine, 2002).


    Lie detection

    We should be aware that these are

    very preliminary results. Whether, and

    when, these methods can be put to use in

    forensic fieldwork will not be known

    for many years to come. We must

    also be aware that there may be a

    very long way to go between research

    results and reliable field applications.

    Lie detection

    Unfortunately this is not always the

    case. Unproven technologies are becoming increasingly attractive to

    US law enforcement and security

    agencies ... Laboratory tools from

    infrared sensors to eye trackers are

    being converted into lie detectors. (Knight, 2004)


    Overgeneralization, charlatanry, fraud

    The most well-known lie detector is the so-called Polygraph. Its first

    appearance can be dated back to

    1917. A more refined version was

    used in a court case in 1923 and

    Polygraphs have been used ever since, with some refinements.

    Overgeneralization, charlatanry, fraud

    The basic idea behind the Polygraph is

    that lying increases the level of stress, and if you can register the involuntary

    reactions we know to be correlated with

    stress (respiration, pulse, blood pressure,

    and galvanic skin response, e.g. palm

    sweat), these signs can be used to detect

    lies and deception.


    Overgeneralization, charlatanry, fraud

    A typical Polygraph setup.

    Overgeneralization, charlatanry, fraud

    The problem with the Polygraph as a lie

    detector lies in the interpretation.

    Correlations between stress levels and

    pulse for example are found as group

    results. To generalize from group results

    to individuals is, of course, not a valid

    step. Neither is it a valid step to conclude

    that a person who experiences stress must necessarily be lying.


    Overgeneralization, charlatanry, fraud

    The basic idea behind lie detectors based on voice analysis is that there are

    properties in the voice signal that may be

    reliably correlated with lie or deception.

    Voice stress analysis (VSA), based on the

    monitoring of so-called micro tremor, is such a method.

    Overgeneralization, charlatanry, fraud

    But whereas there are scientifically

    established correlations between stress

    and the indicators used by the Poly-

    graph, there is no scientific basis for

    the voice stress analysis whatsoever.

    The few in-depth studies there are of

    micro tremor in the larynx indicate that it does not even exist.


    Overgeneralization, charlatanry, fraud

    But it does make pretty diagrams!

    Overgeneralization, charlatanry, fraud

    So what the VSA analyzers do is

    measure the variation in somethingthat isnt even there, in itself an

    achievement of sorts.

    If the people who use these gadgets

    don't know any better, we may be

    generous enough to call it charlatanry, the alternative being fraud of course.


    Overgeneralization, charlatanry, fraud

    Finally, an example which without the slightest doubt may be classified as fraud.

    An Israeli based company markets the

    most wonderful tools including both lie

    detectors and love detectors. The

    technique behind the lie detector is said to

    be something called Layered Voice

    Analysis (LVA).

    Overgeneralization, charlatanry, fraud

    Here is how they claim it works

    every event that passes through the brain will leave its fingerprints on

    the speech flow. LVA Technology

    ignores what your subject is saying,

    and focuses only on his brain activity.

    In other words, the how it is said is crucial and not the what.


    Overgeneralization, charlatanry, fraud

    They are careful not to explicitly call

    the gadget lie detector, but there is

    absolutely no question that that is what

    they want us to believe it is:

    LVA is capable of detecting the

    intention behind the lie, and by so doing can lead you in identifying and

    revealing the lie itself.

    Overgeneralization, charlatanry, fraud

    There is, of course, not a shred of

    evidence for a relationship between

    voice and brain activity of the proposed

    kind. And a thorough scrutiny of the

    description of the method in the

    American patent documents confirms

    the suspicion that the method is pure

    nonsense, perhaps best described as statistics based on digitization artefacts.


    Overgeneralization, charlatanry, fraud

    The statistics are based upon what are defined as

    thorns and plateaus, which have no relevance at all for voice analysis and are moreover

    dependent on how the signal is sampled.

    Overgeneralization, charlatanry, fraud

    Gadgets like these do not deserve to be

    taken seriously as such, but their use in

    forensic investigations must be. If bogus

    lie detectors like the ones described here

    are used not just by shady private

    investigators, but by insurance

    companies, police departments and

    security agencies, this poses a threat that we must oppose more actively.


    Forensic Automatic Speaker Recognition

    FORENSIC SPEECH SCIENCE

    Dr. Andrzej Drygajlo

    [email protected]

    Speech Processing and Biometrics Group

    Signal Processing Institute (ITS-LIAP)

    Swiss Federal Institute of Technology Lausanne (EPFL)

    School of Criminal Sciences

    University of Lausanne


    Biometric characteristics in forensic applications

    Biological traces: DNA (DeoxyriboNucleic Acid), blood, saliva, etc.

    Biological (physiological) characteristics: fingerprints, eye irises and retinas, hand palms and geometry, and facial geometry

    Behavioral characteristics: dynamic signature, gait, keystroke dynamics, lip motion

    Combined: voice


    Popular biometric characteristics (modalities)

    Fingerprint

    Voice

    Face

    Retina

    Signature

    Iris


    Forensic Biometric Applications

    Forensic Biometrics: Individualisation of human beings

    Challenge: to automate forensic biometric methods

    Existing systems and databases

    Automatic Fingerprint Identification System (AFIS, US-made) and fingerprint databases

    DNA sequencers and DNA databases

    Challenge: Large-scale automatic systems and databases for: speech, handwriting, face images, earmarks, etc.


    Constraints

    Systems developed according to specified recommendations from:

    Tool perspective (recognition and computer technology)

    Forensic expert perspective (methodology)

    Criminal policy perspective (investigation)

    Legal perspective (impact of the application of the data and privacy protection law on the efficiency of the methods used)

    Judicial perspective (the role of the court)


    Law enforcement and forensic applications

    The law enforcement applications include the use of biometrics to recognize individuals

    Apprehended or incarcerated because of criminal activity

    Suspected of criminal activity

    Whose movement is restricted as a result of criminal activity

    The biometric may be used to identify non-cooperative and unknown subjects, to ensure that the correct inmates are released, or to verify that individuals under home arrest are in compliance


    Forensic Speaker Recognition

    Aural-perceptual methods: earwitnesses, line-ups

    Visual methods and voiceprint?: visual comparison of spectrograms of linguistically identical utterances (utterly misleading!)

    Aural-instrumental methods: analytical acoustic approach combined with an auditory phonetic analysis

    Automatic methods:
    - Speaker verification: not adequate
    - Speaker identification: not adequate
    - Bayesian framework for the evaluation of identity


    Forensic specificity

    Short utterances

    Questioned recording - uncontrolled environment

    Investigations in controlled conditions (longer utterances)

    Telephone quality (95%)

    Clear understanding of the inferential process

    Respective duties of the actors involved in the judicial process: jurists, forensic experts, judges, etc.

    The forensic expert's role is to testify to the worth of the evidence by using, if possible, a quantitative measure of this worth.

    It is up to the judge and/or the jury to use this information as an aid to their deliberations and decision.


    Forensic Expert's Role

    A forensic expert testifying in court to a conclusion in an individual case is not an advocate, but a witness who presents factual information and offers a professional opinion based upon that factual information.

    Expert opinion testimony is, and will remain, one of the most powerful forms of evidence in the courtroom.

    In order for it to be effective, it must be carefully documented, and expressed with precision, but without overstatement, in as neutral and objective a way as the adversary system permits.

    Professional concepts must be articulated in a way lay persons (like the judge and the lawyers) can understand.


    Individual Case

    (Diagram: in casework, the trace (questioned recording) is compared with the suspect, represented by a suspected speaker single recording and a suspected speaker reference database.)


    Adversary System

    The speaker at the origin of the questioned recording is not the suspected speaker.

    The suspected speaker is the source of the questioned recording.


    Outline

    Automatic Speaker Recognition

    Voice as Evidence

    Bayesian Interpretation of Evidence

    Corpus Based Methodology: Univariate Scoring Method, Multivariate Direct Method

    Strength of Evidence

    Evaluation of the Strength of Evidence

    Mismatched Recording Conditions

    Aural Speaker Recognition


    Automatic Speaker Recognition

    Speaker recognition is the general term used to include all of the many different tasks of discriminating people based on the sound of their voices.

    Speaker identification is the task of deciding, given a sample of speech, who among many candidate speakers said it. This is an N-class decision task, where N is the number of candidate speakers.

    Speaker verification is the task of deciding, given a sample of speech, whether a specified candidate speaker said it. This is a 2-class decision task and is sometimes referred to as a speaker detection task.


    Principal structure of speaker recognition systems

    (Diagram: the speech wave undergoes feature extraction; the features are compared for similarity (distance) with reference templates/models for each speaker, built in a training phase; a decision/interpretation step produces the recognition results.)


    Principal structure of speaker recognition systems

    (Diagram: the speech wave undergoes feature extraction; similarity (distance) against models for each speaker, built in training, yields a score.)

    Text-dependent methods:
    - Dynamic Time Warping (DTW)
    - Hidden Markov Models (HMMs)

    Text-independent methods:
    - Vector Quantization (VQ)
    - Gaussian Mixture Models (GMMs)


    Feature Extraction

    (Diagram: the speech wave is divided into overlapping frames; each frame is windowed and converted into a feature vector.)
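    To make the framing step concrete, here is a minimal sketch in Python/NumPy (illustrative only, not the tutorial's own code; real forensic systems typically use MFCC or similar cepstral features rather than the raw log spectrum used here, and all names are invented for the example):

    import numpy as np

    def extract_features(signal, sample_rate=8000, frame_ms=25, shift_ms=10, n_coeffs=20):
        """Cut a speech signal into overlapping frames, window each frame, and
        return one feature vector (truncated log magnitude spectrum) per frame."""
        frame_len = int(sample_rate * frame_ms / 1000)
        shift = int(sample_rate * shift_ms / 1000)
        window = np.hamming(frame_len)
        features = []
        for start in range(0, len(signal) - frame_len + 1, shift):
            frame = signal[start:start + frame_len] * window        # windowed frame
            spectrum = np.abs(np.fft.rfft(frame))                   # magnitude spectrum
            features.append(np.log(spectrum[:n_coeffs] + 1e-10))    # one feature vector
        return np.array(features)                                   # shape (n_frames, n_coeffs)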


    Gaussian Mixture Model (GMM)

    (Diagram: D-dimensional acoustic vectors for training, v(1), v(2), ..., each of the form (v_1, v_2, ..., v_D)^T; the histograms over feature 1 to feature D are modelled by the GMM.)

    score = log-likelihood (speech | model)
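    A minimal sketch of this scoring idea, assuming scikit-learn is available (illustrative, not the authors' implementation): a GMM is fitted to a speaker's training feature vectors, and a test recording is scored by its average log-likelihood under that model.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def train_speaker_gmm(train_features, n_components=16):
        # train_features: array of shape (n_frames, n_features) from the speaker's recordings
        gmm = GaussianMixture(n_components=n_components, covariance_type="diag", max_iter=200)
        return gmm.fit(train_features)

    def log_likelihood_score(gmm, test_features):
        # average per-frame log-likelihood of the test recording under the speaker model
        return gmm.score(test_features)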


    Speaker Verification

    The odds form of Bayes' theorem:
    H0: the speaker's model and the tested recording (T) have the same source
    H1: the speaker's model and the tested recording (T) have different sources

    P(H0 | T) / P(H1 | T) = [P(T | H0) / P(T | H1)] x [P(H0) / P(H1)]

    Likelihood ratio: P(T | H0) / P(T | H1), compared against a decision threshold.


    Interpretation of Evidence

    Bayesian interpretation (BI) Principle

    The Bayesian model, proposed for forensic speaker recognition by Lewis in 1984, allows for revision, based on new information, of a measure of uncertainty (the likelihood ratio of the evidence, province of the forensic expert) which is applied to the pair of competing hypotheses.

    The Bayesian model shows how new data (the questioned recording) can be combined with prior background knowledge (prior odds, province of the court) to give posterior odds (province of the court) for judicial outcomes or issues.

    prior odds x likelihood ratio = posterior odds
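    As a worked illustration with invented numbers (not taken from any case): if the court's prior odds in favour of H0 are 1 to 100 and the forensic expert reports a likelihood ratio of 75, then

    posterior odds = (1/100) x 75 = 0.75

    so the evidence has shifted the odds considerably towards H0, although on these assumed priors they still favour H1.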


    Strength of Evidence

    Bayesian interpretation (BI)

    P(H0 | E) / P(H1 | E) = [P(E | H0) / P(E | H1)] x [P(H0) / P(H1)]

    Posterior odds (posterior knowledge on the issue, province of the court) = Likelihood Ratio (LR) (new data, province of the forensic expert) x Prior odds (prior background knowledge, province of the court)


    Voice as Evidence

    In the case of a questioned recording (trace), the evidence does not consist in the speech itself, but in the quantified degree of similarity between speaker-dependent

    features extracted from the trace, and speaker-dependent features extracted from recorded speech of a suspect, represented by his/her model.


    Voice as Evidence

    (Diagram: the questioned recording (trace) and the suspected speaker reference database (R) both undergo feature extraction; the suspect's features train the suspected speaker model; the similarity (distance) between the trace and this model yields a score, the evidence (E), whose signification is assessed by Bayesian interpretation.)


    Bayesian Interpretation of Evidence

    The odds form of Bayes' theorem:
    H0: the suspected speaker is the source of the questioned recording (within-source variability)
    H1: the speaker at the origin of the questioned recording is not the suspected speaker (between-sources variability)

    P(H0 | E) / P(H1 | E) = [P(E | H0) / P(E | H1)] x [P(H0) / P(H1)]

    Likelihood ratio P(E | H0) / P(E | H1): the strength of evidence (similarity vs. typicality).


    Uni- and Multivariate Methods

    Scoring Method: likelihood calculated from distributions of scores modeling within-source and between-sources variability
    - H0: distribution of scores of within-source variability
    - H1: distribution of scores of between-sources variability
    - 3 databases: Suspect Reference Database (R), Potential Population Database (P), Suspect Control Database (C)

    Direct Method: likelihood directly calculated from the GMM of the suspect and the GMM of the potential population
    - H0: GMM of the suspect
    - H1: GMMs of the potential population
    - 2 databases: Suspect Reference Database (R), Potential Population Database (P)

    Databases used: R = 5 utterances per speaker (2-3 min each); P = 100 speakers (2-3 min each); C = 30-40 utterances per speaker (10-20 sec each)


    Corpus Based Methodology

    3 databases (DBs):

    Potential population database (P): large-scale database used to model the potential population of speakers to evaluate the between-sources variability

    Suspected speaker reference database (R): database recorded with the suspected speaker to model her/his speech

    Suspected speaker control database (C): database recorded with the suspected speaker to evaluate her/his within-source variability


    Scoring Method

    (Diagram: in casework, the trace is compared with the relevant population, represented by the potential population database (P), and with the suspect, represented by the suspected speaker reference database (R) and the suspected speaker control database (C).)


    Within-source variability

    (Diagram: features extracted from the suspected speaker reference database (R) train the suspected speaker model; the utterances of the suspected speaker control database (C) are then scored against this model, and the resulting similarity scores give the distribution of the within-source variability.)


    Between-sources Variability

    (Diagram: the trace (questioned recording) is scored against the speaker models of the potential population, trained from the potential population database (P); the resulting similarity scores give the distribution of the between-sources variability.)


    Evaluation of the within-source variability

    (Histogram: occurrences of similarity scores.)

    Comparison of the suspected speaker models with the utterances of his control database (C)


    Evaluation of the between-sources variability

    (Histogram: occurrences of similarity scores.)

    Comparison of the trace with the speaker models of the potential population database (P)


    Likelihood ratio

    P(E | H0) / P(E | H1) = 0.15 / 0.002 = 75

    (Figure: estimated probability densities of the similarity scores under both hypotheses, evaluated at the evidence score E = 6.)
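    A minimal sketch of this calculation under the scoring method, assuming SciPy is available (illustrative; in practice the two densities are estimated from the within-source and between-sources score distributions shown on the previous slides):

    import numpy as np
    from scipy.stats import gaussian_kde

    def likelihood_ratio(within_source_scores, between_sources_scores, evidence_score):
        # kernel density estimates of the two score distributions, evaluated at E
        p_e_h0 = gaussian_kde(within_source_scores)(evidence_score)[0]    # numerator
        p_e_h1 = gaussian_kde(between_sources_scores)(evidence_score)[0]  # denominator
        return p_e_h0 / p_e_h1

    # With estimated densities of 0.15 and 0.002 at E = 6, this gives LR = 75 as on the slide.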


    Strength of Evidence - Likelihood ratio

    A likelihood ratio of 9.16 obtained means that it is 9.16 times more likely

    to observe the score (E) given the hypothesis H0 (the suspect is the source of the questioned

    recording) than given the hypothesis H1 (that another speaker from the relevant population is the source of the questioned recording).


    DET (Detection Error Tradeoff) curve

    A DET curve can be computed from the distributions of scores with a variable threshold
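    A minimal sketch of that computation (illustrative, not the authors' code): sweep a decision threshold over the scores and record, for each threshold, the miss rate on same-source (target) trials and the false-alarm rate on different-source (non-target) trials; plotting these pairs gives the DET curve.

    import numpy as np

    def det_points(target_scores, nontarget_scores):
        # candidate thresholds: every observed score
        thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
        miss_rates = np.array([np.mean(target_scores < t) for t in thresholds])    # P(miss)
        fa_rates = np.array([np.mean(nontarget_scores >= t) for t in thresholds])  # P(false alarm)
        return fa_rates, miss_rates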


    Analysis and comparison

    (Diagram: the trace, the potential population database (P), the suspected speaker control database (C) and the suspected speaker reference database (R) all undergo feature extraction; R and P are additionally modelled, giving the suspected speaker model and the relevant speakers' models; three comparative analyses produce two sets of similarity scores and the evidence (E).)


    Interpretation of the evidence

    (Diagram: the two sets of similarity scores are used for modelling the within-source and the between-sources variability; evaluating each resulting distribution at the evidence (E) gives the numerator and the denominator of the likelihood ratio (LR).)


    Individual Case

    (Diagram: in casework, the trace (questioned recording) is compared with the suspect, represented by a suspected speaker single recording and a suspected speaker reference database.)


    Scoring Method with Limited Suspect Data

    The odds form of Bayes' theorem:
    H0: the two recordings have the same source
    H1: the two recordings have different sources

    P(H0 | E) / P(H1 | E) = [P(E | H0) / P(E | H1)] x [P(H0) / P(H1)]

    Likelihood ratio P(E | H0) / P(E | H1): strength of evidence with respect to the new hypotheses.


    Direct Method

    The odds form of Bayes' theorem:
    H0: the speaker's model and the questioned recording (T) have the same source
    H1: the speaker's model and the questioned recording (T) have different sources

    P(H0 | T) / P(H1 | T) = [P(T | H0) / P(T | H1)] x [P(H0) / P(H1)]

    Likelihood ratio P(T | H0) / P(T | H1): strength of evidence?


    Multivariate (Direct) Method LR Numerator

[Block diagram] The suspected speaker reference database (R) is processed by feature extraction and modelling to obtain the suspected speaker model. Features extracted from the questioned recording (trace) are compared with this model (similarity / distance), yielding the score used as the numerator of the likelihood ratio:

    score = log-likelihood (trace | H0)


Multivariate (Direct) Method LR Denominator

[Block diagram] The potential population database (P) is processed by feature extraction and modelling to obtain a single model of all speakers in the potential population. Features extracted from the questioned recording (trace) are compared with this model (similarity / distance), yielding the score used as the denominator of the likelihood ratio:

    score = log-likelihood (trace | H1)
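A minimal sketch of the direct method under these definitions, again using scikit-learn GMMs as stand-ins for the suspect model and the potential-population model (the names and mixture sizes are illustrative assumptions):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def direct_method_log_lr(trace_feats, suspect_reference_feats, population_feats_list):
        suspect_model = GaussianMixture(n_components=16, covariance_type="diag")
        suspect_model.fit(suspect_reference_feats)

        population_model = GaussianMixture(n_components=64, covariance_type="diag")
        population_model.fit(np.vstack(population_feats_list))   # all speakers pooled

        log_num = suspect_model.score(trace_feats)      # log-likelihood (trace | H0)
        log_den = population_model.score(trace_feats)   # log-likelihood (trace | H1)
        return log_num - log_den   # average log likelihood ratio per feature vector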


    Evaluation of the Strength of Evidence

    Univariate (Scoring) Method


    Cumulative Density Functions


    Tippett plots (reliability-survival functions)

    Univariate (Scoring) Method
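Tippett plots show, separately for the cases where H0 is true and where H1 is true, the proportion of experiments whose likelihood ratio exceeds a given value. A minimal sketch of how the two curves can be computed from sets of LRs produced by either method (the LR grid limits are an arbitrary choice):

    import numpy as np

    def tippett_curves(lrs_when_h0_true, lrs_when_h1_true):
        lrs_h0 = np.asarray(lrs_when_h0_true)
        lrs_h1 = np.asarray(lrs_when_h1_true)
        grid = np.logspace(-3, 3, 200)                       # LR values 10^-3 .. 10^3
        prop_h0 = np.array([(lrs_h0 > v).mean() for v in grid])
        prop_h1 = np.array([(lrs_h1 > v).mean() for v in grid])
        return grid, prop_h0, prop_h1   # plot both proportions against the LR grid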


    Evaluation of the Strength of Evidence

    Multivariate (Direct) Method


    Tippett plots (reliability-survival functions)

    Multivariate (Direct) Method


    Outline

    Automatic Speaker Recognition

    Voice as Evidence

    Bayesian Interpretation of Evidence

Corpus Based Methodology
Univariate Scoring Method

    Multivariate Direct Method

    Strength of Evidence

    Evaluation of the Strength of Evidence

Mismatched Recording Conditions
Aural Speaker Recognition


    Using databases with mismatched recording conditions

FBI NIST 2002 Database: 2 conditions (Microphone - Telephone)

The extent of mismatch can be measured using statistical testing.
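One way such statistical testing can be done (an illustrative choice; the slides do not name the test used) is a two-sample Kolmogorov-Smirnov test between the scores obtained under the two recording conditions:

    from scipy.stats import ks_2samp

    def mismatch_extent(scores_condition_a, scores_condition_b):
        # Small p-values indicate that the two score distributions differ,
        # i.e. that the recording-condition mismatch is significant.
        statistic, p_value = ks_2samp(scores_condition_a, scores_condition_b)
        return statistic, p_value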


    Compensating for Mismatch

[Figure] Score distributions with the evidence E marked: H0 scores (matched conditions), potential-population H1 scores (matched conditions) and H1 scores (mismatched conditions).

Not compensating for mismatch can be the difference between an LR < 1 and an LR > 1.


    Outline

    Automatic Speaker Recognition

    Voice as Evidence

    Bayesian Interpretation of Evidence

Corpus Based Methodology
Univariate Scoring Method

    Multivariate Direct Method

    Strength of Evidence

Evaluation of the Strength of Evidence
Mismatched Recording Conditions
Aural Speaker Recognition


    Experimental Framework

    Listeners

90 listeners whose mother tongue is French

    Laypersons with no phonetic training

    Same computer and headphones

    Training

    No limitation on the number of listening trials

    Testing

Verbal score scale from 1 through 7

    Perceptual cues

Aural Speaker Recognition


Perceptual Verbal Scale and Perceptual Cues

Perceptual Verbal Scale:
Score 1: I am sure that the two speakers are not the same
Score 2: I am almost sure that the two speakers are not the same
Score 3: It is possible that the two speakers are not the same
Score 4: I cannot decide
Score 5: It is possible that the two speakers are the same
Score 6: I am almost sure that the two speakers are the same
Score 7: I am sure that the two speakers are the same


Strength of Evidence for Aural Recognition

[Figure: histograms of estimated probability versus perceptual verbal score (1-7) for the H0 and H1 experiment sets, with the observed score E marked]

    LR = P(E | H0) / P(E | H1)

The likelihood ratio (LR) is the ratio of the heights of the two histograms at the observed perceptual verbal score E. Because the scores are discrete, histograms are used to estimate the probability of each score under each hypothesis.
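A minimal sketch of this histogram-based estimation for discrete verbal scores (the score lists would come from the listening experiments; none are given here):

    import numpy as np

    def discrete_lr(observed_score, scores_when_h0_true, scores_when_h1_true):
        # Relative frequency of the observed verbal score under each hypothesis
        p_h0 = (np.asarray(scores_when_h0_true) == observed_score).mean()
        p_h1 = (np.asarray(scores_when_h1_true) == observed_score).mean()
        return p_h0 / p_h1   # undefined if the score never occurred under H1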


Evaluating Strength of Evidence in Matched Conditions

[Tippett plots: Aural vs. Automatic]
Similar separations between the curves for the aural and automatic systems.
Ref. PSTN vs Traces PSTN


Evaluating Strength of Evidence in Mismatched Conditions

[Tippett plots: Aural vs. Automatic]
Better curve separation in aural recognition; better evaluation of the LR for aural recognition in mismatched conditions.
Ref. PSTN vs Traces Noisy PSTN


Evaluating Strength of Evidence in Adapted Conditions

[Tippett plots: estimated probability (0 to 1) versus likelihood ratio (10^-2 to 10^1), with curves for H0 and H1, aural and adapted automatic systems]

Adaptation for noisy conditions results in the improvement of the performance of automatic recognition.
Ref. PSTN vs Traces Adapted Noisy PSTN


Admissibility of Scientific Evidence (USA)

Daubert criteria:
whether the theory or technique can be, and has been, tested,
whether the technique has been published or subjected to peer review,
whether actual or potential error rates have been considered,
whether standards exist and are maintained to control the operation of the technique,
whether the technique is widely accepted within the relevant scientific community.


    References

Ph. Rose, Forensic Speaker Identification, Taylor and Francis, London, 2002.

D. Meuwly, A. Drygajlo, "Forensic Speaker Recognition Based on a Bayesian Framework and Gaussian Mixture Modelling (GMM)", The Workshop on Speaker Recognition 2001: A Speaker Odyssey, Crete, Greece, June 2001, pp. 145-150.

A. Drygajlo, D. Meuwly, A. Alexander, "Statistical Methods and Bayesian Interpretation of Evidence in Forensic Automatic Speaker Recognition", EUROSPEECH 2003, Geneva, Switzerland, Sept. 2003, pp. 689-692.

A. Alexander, A. Drygajlo, "Scoring and Direct Methods for the Interpretation of Evidence in Forensic Speaker Recognition", ICSLP 2004, Jeju, Korea, 2004.


    References

F. Botti, A. Alexander, A. Drygajlo, "An Interpretation Framework for the Evaluation of Evidence in Forensic Automatic Speaker Recognition with Limited Suspect Data", Odyssey 2004, The Speaker and Language Recognition Workshop, Toledo, Spain, 2004, pp. 63-68.

A. Alexander, F. Botti, A. Drygajlo, "Handling Mismatch in Corpus-Based Forensic Speaker Recognition", Odyssey 2004, The Speaker and Language Recognition Workshop, Toledo, Spain, May 2004, pp. 69-74.

A. Alexander, F. Botti, D. Dessimoz, A. Drygajlo, "The Effect of Mismatched Recording Conditions on Human and Automatic Speaker Recognition in Forensic Applications", Forensic Science International, 146S (2004), pp. S95-S99.

D. Meuwly, A. Drygajlo, "A Bayesian Interpretation of Evidence in Forensic Automatic Speaker Recognition", to be published in Forensic Science International.

J. Gonzalez-Rodriguez, A. Drygajlo, D. Ramos-Castro, M. Garcia-Gomar, J. Ortega-Garcia, "Robust Estimation, Interpretation and Assessment of Likelihood Ratios in Forensic Speaker Recognition", to be published in Computer Speech and Language.


    Conclusions

The Bayes model, the interpretation framework currently used in forensic science, is adapted for forensic automatic speaker recognition.

The corpus-based methodology provides a coherent way of assessing and presenting the evidence of a questioned recording.

Distributions of likelihood ratios can be used for the evaluation of the performance of automatic and aural methods in forensic speaker recognition applications.


Conclusions

While there is certainly no perfect solution available in the field of forensic speaker recognition at present, the scientific community is under a moral obligation to contribute whatever it can to aid the course of justice and to establish scientifically founded methodology and techniques.

What is clearly needed are joint research initiatives of forensic scientists and speech engineers, in order to study problems arising from the current technology and from the practical work of forensic experts, and to gain a more complete insight into the concept of the individuality of voice.

