Chenoweth os bridge 2015 pp

Date post: 06-Aug-2015
Upload: dreamwidth
Calculating Guilt: Using open-source software in forensic DNA testing Sarah Chenoweth [email protected] @sarahquaint
Transcript

Calculating Guilt: Using open-source software in forensic DNA testing

Sarah Chenoweth

[email protected]

@sarahquaint

Disclaimer

• All opinions are my own.

• Dammit, Jim, I’m a chemist, not a programmer.

• …or a statistician.

• slideshare.net/dreamwidth

Gameplan

• Forensic DNA 101

• What sort of profiles do I obtain?

• Statistics: giving weight to those profiles

• Open-or-not software for calculating these statistics

Anne Arundel County Police Crime Lab

DNA Technical Leader

source: Wikimedia commons

Rosalind was robbed.

• 23 pairs of chromosomes

• >3 billion base pairs

• ~2% is coding DNA (genes)

• ~20-40% is regulatory

• ~50% is highly repetitive

AATGAATGAATGAATGAATGAATGAATG <— 7 repeats

AATGAATGAATGAATGAATGAATGAATGAATGAATGAAT <— 9.3 repeats

STR = Short Tandem Repeat

On chromosome 11, there is an area called TH01, where the STR “AATG” repeats over and over again.

On the chromosome from my mother, it repeats 7 times, and on the one from my father, it repeats 9.3 times.

Source: National Human Genome Research Institute
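The repeat counts in the TH01 example above can be reproduced with a few lines of code. This is only a sketch of the naming convention as the slide depicts it (whole repeats, then leftover bases after the dot); the function name `str_allele` is hypothetical, and a real profile comes from measured fragment lengths rather than from reading the sequence.

```python
def str_allele(sequence: str, motif: str = "AATG") -> str:
    """Return the allele designation for a run of tandem repeats.

    Counts whole copies of `motif` from the start of `sequence`, then
    reports any leftover bases after the dot, so 9 full repeats plus
    3 extra bases is written "9.3".
    """
    full = 0
    pos = 0
    # Step through the sequence one motif at a time.
    while sequence.startswith(motif, pos):
        full += 1
        pos += len(motif)
    leftover = len(sequence) - pos
    return f"{full}.{leftover}" if leftover else str(full)


maternal = "AATG" * 7            # the 7-repeat allele from the slide
paternal = "AATG" * 9 + "AAT"    # 9 repeats plus 3 bases
print(str_allele(maternal), str_allele(paternal))  # prints: 7 9.3
```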

You are not a special snowflake.

• Most of your DNA, including your genes, is “highly conserved”

• All humans are 99.9% identical

• Of course, 0.1% of 3 billion = 3 million base pairs of variation

This is me.

It’s like an EAN on the back of a book…

• A forensic DNA profile is the set of lengths of 23 STRs, each between 100 and 500 base pairs long

• <3% of 1% of 1% of your genome

• Unique “barcode”, except for identical siblings.

Receive evidence.

Sample. Extract.

Quantitate.

Amplify.

Measure.

What we receive.

What gives useful results.

Included or excluded?

• Single-source profiles are simple. But we mostly see mixtures.

• DNA is the gold standard, carries a lot of weight.

• Must characterize all inclusions with a statistic.

• Make the qualitative statement (excluded, or matches), characterize it with a quantitative statistic, and let the trier of fact evaluate.

Nice, 2 person mixture

• 4 alleles at Penta E: 5,7,9,13

• Say this is an assault. We can assume that the victim is present, and we know the victim is 7,9.

• So: what are the odds that a random person in the population is a 5,13?

Likelihood Ratio (LHR)

• How frequently do we see the 5 allele? About 4%

• How frequently do we see the 13 allele? About 5%

• At this one locus: 360 times more likely it’s Sarah & Robert than Sarah & someone picked at random from the population.

• Calculate this at all 22 loci, and multiply together: 1.6 x 10^23 (160,000,000,000,000,000,000,000)
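The Penta E arithmetic above can be sketched in code. This assumes the simplest textbook form, in which the evidence is certain if the suspect is the second contributor and a random person carries the 5,13 genotype with probability 2pq. The function names are hypothetical. With the rounded frequencies quoted on the slide (4% and 5%) this form gives about 250 rather than 360, so the slide's figure presumably rests on unrounded database frequencies or a formula not shown here.

```python
from math import prod


def single_locus_lr(p: float, q: float) -> float:
    """LR for a heterozygous unknown contributor with allele frequencies p and q.

    Numerator: probability of the evidence if the suspect is the second
    contributor (taken as 1). Denominator: probability that a random
    person has that heterozygous genotype, 2pq.
    """
    return 1.0 / (2.0 * p * q)


def combined_lr(per_locus_lrs) -> float:
    """Multiply per-locus LRs together, assuming the loci are independent."""
    return prod(per_locus_lrs)


# Rounded frequencies from the slide: allele 5 about 4%, allele 13 about 5%.
lr_penta_e = single_locus_lr(0.04, 0.05)
print(round(lr_penta_e))  # prints: 250
```

Multiplying 22 such per-locus values with `combined_lr` is what drives the combined figure to such a large number.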

The world is a dirty place.

A wretched hive of scum and villainy.

“A reasonable degree of scientific certainty.”

• DNA is a living, biological substance = messy

• Our testing procedure is super-sensitive. <10 cells

• The law wants a clear line between guilty and not guilty; science is full of, “Well, maybe; it depends.”

• Our classic statistical tools can’t handle these incomplete mixtures.

Nice, 2 person mixture…

Same… except.

That one little allele.

The 9 allele is just below the threshold.

…now what?

• Only use the loci where the suspect is present? That’s horribly biased.

• Throw up our hands and refuse to draw conclusions on partial data? Also biased!

• The least awful solution is to only use the loci that we know have complete info: the ones with two minor alleles.

source: my sister, who is the biological mother of this pouty kid.

The loci with 2 minor alleles:

4 out of 22: 18% of the data.

LHR: 1,400,000

Understating is just as bad as overstating.

• Well, almost. The justice system is designed to err on the side of caution, and benefit the defendant.

• Take a conservative approach.

• But not using all the data isn’t always conservative: what if that was exculpatory information?

Probabilistic genotyping

Semi-continuous

• Considers the probability of drop out when calculating the LHR.

• Open source. Fast.

• Still doesn’t use all the data (peak height ratios, stutter).

Scenario:

The victim is: 20,20

The suspect is: 19,22

What is the probability the suspect is a contributor, but the 19 dropped out?
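A toy version of the drop-out question above can be written out directly. This is a simplified sketch, not Lab Retriever's actual algorithm. It assumes the evidence shows only alleles 20 and 22, the victim's alleles are always detected, each allele of a heterozygous minor contributor drops out independently with probability `d`, a homozygote drops out with probability `d**2`, and there is no drop-in. The function name and the allele frequencies in the usage lines are hypothetical.

```python
def dropout_lr(f20: float, f22: float, d: float) -> float:
    """Toy semi-continuous LR for the 20,20 victim and 19,22 suspect scenario.

    f20, f22: population frequencies of alleles 20 and 22 (hypothetical inputs).
    d: probability that a single minor-contributor allele drops out.
    """
    # Hp: the suspect (19,22) contributed. The 22 was detected and the 19 dropped out.
    p_hp = (1 - d) * d

    # Hd: an unknown person contributed. With no drop-in, the unknown must
    # carry the 22, because the victim (20,20) cannot explain it.
    f_other = 1 - f20 - f22  # any allele that is neither 20 nor 22
    p_hd = (
        f22 ** 2 * (1 - d ** 2)              # 22,22: homozygote detected
        + 2 * f22 * f20 * (1 - d)            # 20,22: the 20 is masked by the victim
        + 2 * f22 * f_other * (1 - d) * d    # 22,x: the x allele dropped out
    )
    return p_hp / p_hd


# Hypothetical frequencies, shown over a few drop-out probabilities.
for d in (0.1, 0.3, 0.5):
    print(d, dropout_lr(f20=0.10, f22=0.05, d=d))
```

The point of the design is that the missing 19 is weighed by how plausible drop-out is, instead of forcing the locus to be either counted in full or discarded.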

Lab Retriever

• scieg.org/lab_retriever.html

• github.com/SCIEG/LabRetriever

Lab Retriever

• if we had a complete mixture, LHR = 1.6 x 10^23 = 160,000,000,000,000,000,000,000

• partial mixture, so we only use 4 loci for LHR = 1.4 x 10^6 = 1,400,000

• same partial mixture, semi-continuous LHR = 7.3 x 10^20 = 730,000,000,000,000,000,000

Probabilistic genotyping

Continuous

• Markov-chain Monte Carlo (MCMC) simulations.

• Uses all of the data, with fewer assumptions.

• Doesn’t just give you the best estimate: gives you a range.

Probable genotype of the minor contributor:

AC: 40%
BC: 25%
CC: 20%
CQ: 15%
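To give a feel for what a Markov-chain Monte Carlo simulation does, here is a toy Metropolis sampler over the four genotypes listed above. It is only an illustration of the mechanism and not how any real continuous software is implemented. The target weights are copied from the slide, so the answer is known in advance. In real casework the weights are unknown and the chain explores genotypes together with other parameters. The function name is hypothetical.

```python
import random
from collections import Counter

# Genotype weights for the minor contributor, taken from the slide.
WEIGHTS = {"AC": 0.40, "BC": 0.25, "CC": 0.20, "CQ": 0.15}


def metropolis_genotypes(weights, n_steps=200_000, seed=1):
    """Estimate genotype probabilities with a Metropolis chain targeting `weights`."""
    rng = random.Random(seed)
    states = list(weights)
    current = states[0]
    counts = Counter()
    for _ in range(n_steps):
        proposal = rng.choice(states)  # symmetric proposal: any genotype, equally likely
        # Accept with probability min(1, w(proposal) / w(current)).
        if rng.random() < weights[proposal] / weights[current]:
            current = proposal
        counts[current] += 1
    # The fraction of steps spent at each genotype approximates its probability.
    return {g: counts[g] / n_steps for g in states}


estimate = metropolis_genotypes(WEIGHTS)
for genotype, share in estimate.items():
    print(genotype, round(share, 3))
```

Running the chain with different seeds gives slightly different shares, which is one way to see why this kind of method reports a range and not a single number.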

STRmix

• Developed by the ESR (Environmental Science and Research, NZ) and FSSA (Forensic Science South Australia)

• Increasingly becoming the standard

• 20K USD initially, 5K/yr support contract

The justice system does not embrace open source.

• The data is reliable: but is my interpretation?

• But I don’t tell “the whole truth, and nothing but the truth.” I can only answer the questions I’m asked.

• Prosecutor misstatement: “That means there’s a one in a quadrillion chance it’s someone else!”

• Defense misstatement: “She didn’t test the DNA of a quadrillion people, so there’s no way that’s true!”

Currently, in forensic DNA:

• Binary statistics: yes

• Semi-continuous: yes

• Continuous: no

• Frequency databases: yes

• Data analysis: no

• CODIS: hell to the no

source: Wikimedia commons

Statistics are hard.

source: Bill Gacey @Flickr

There is too much. Let me sum up:

• Transparency is the key to credibility.

• I need to document all my observations, results, and calculations so they are reproducible.

• Open software is necessary for independent verification.

Thank you.

[email protected]

twitter: sarahquaint

slideshare.net/dreamwidth
