Model-based species identification using DNA barcodes

Bogdan Paşaniuc

CSE Department, University of Connecticut

Joint work with Ion Măndoiu and Sotirios Kentros

Outline

Existing approaches to species identification

Proposed statistical model based methods

Experimental results

Ongoing work and conclusions

Background on DNA barcoding

Recently proposed tool for species identification

Use a short DNA region as a “fingerprint” for the species

Region of choice: cytochrome c oxidase subunit 1 mitochondrial gene ("COI", 648 base pairs long)

Key assumption: inter-species variability is higher than intra-species variability

Species identification problem

Given: a database DB containing barcodes from known species, and a new barcode x

Find: a high-confidence assignment of x to a species in the DB, or UNKNOWN if the confidence is not high enough

Use additional evidence/methods to resolve UNKNOWN assignments and possibly discover new species

Existing approaches and limitations

Neighbor Joining tree for new + known barcodes [Meyer&Paulay05]

  One barcode per species

  Runtime does not scale well with #species (quadratic or worse)

Likelihood ratio test for species membership using MCMC [Matz&Nielsen06]

  Impractical runtime even for moderate #species

Distance-based [BOLD-IDS, TaxI (Steinke et al.05)]

  Unclear statistical significance

BOLD

BOLD: The Barcode of Life Data Systems [Ratnasingham&Hebert07]

  http://www.barcodinglife.org

  Currently: 28,129 species, 251,429 barcodes

Identification System: BOLD-IDS

  Distance-based (NJ tree for visualization)

  Employs a threshold (less than 1% divergence) to get a tight match to a barcode in the DB

BOLD-IDS

[Ekrem et al.07]: “…identifications by the BOLD facility must be cautiously evaluated as the system at present may return high probabilities of placements that obviously are erroneous”

Bayesian approach to species identification

Assign barcode x=x1x2x3…xn to species SPi that maximizes P(SPi|x) over all species SPi

P(SPi|x) computed using Bayes’ theorem: P(SP|x) = P(x|SP)*P(SP)/P(x)

  Uniform prior P(SP)

  P(x) constant for fixed x

  Need model for P(x|SP)

We explored three scalable models: position weight matrices, Markov chains, hidden Markov models

  Similar to models used successfully in other sequence analysis problems such as DNA motif finding and protein families

Positional weight matrix (PWM)

Assumption: independence of loci

P(x|SP) = P(x1|SP)*P(x2|SP)*…*P(xn|SP)

For each locus, P(xi|SP) is estimated as the probability of seeing each nucleotide at that locus in DB sequences from species SP
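
The PWM model and the uniform-prior assignment rule can be illustrated with a minimal Python sketch. The function names, the toy database, and the add-one pseudocount (used only so that unseen nucleotides do not give ln 0) are assumptions for illustration and do not come from the slides; barcodes are assumed to be aligned and of equal length.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def train_pwm(seqs, pseudocount=1.0):
    """Estimate P(x_i | SP) at every locus from aligned barcodes of one species."""
    total = len(seqs) + pseudocount * len(ALPHABET)
    pwm = []
    for i in range(len(seqs[0])):
        counts = Counter(s[i] for s in seqs)
        pwm.append({a: (counts[a] + pseudocount) / total for a in ALPHABET})
    return pwm

def pwm_log_likelihood(pwm, x):
    """ln P(x | SP) under the independence-of-loci assumption."""
    return sum(math.log(col[c]) for col, c in zip(pwm, x))

def classify(models, x):
    """With a uniform prior, maximizing P(SP | x) is the same as maximizing P(x | SP)."""
    return max(models, key=lambda sp: pwm_log_likelihood(models[sp], x))

# Hypothetical toy database: two species, three 4-base "barcodes" each
db = {"sp1": ["ACGT", "ACGT", "ACGA"], "sp2": ["TTGT", "TTGA", "TCGT"]}
models = {sp: train_pwm(seqs) for sp, seqs in db.items()}
best = classify(models, "ACGA")
```

Working in log space avoids numerical underflow when the product runs over hundreds of loci.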

Inhomogeneous Markov Chain (IMC)

Takes into account dependencies between consecutive loci

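[Figure: IMC state diagram. A start state leads to four states (A, C, T, G) at locus 1; every locus (locus 1, locus 2, locus 3, locus 4, …) has the same four states, and the transitions between consecutive loci are labeled t_1, t_2, t_3.]

$$P(x \mid M) = \pi(x_1) \prod_{i=1}^{n-1} t_i(x_i, x_{i+1})$$

where $\pi$ is the distribution of the first base (transitions out of the start state) and $t_i$ is the transition probability from locus $i$ to locus $i+1$.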
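
As a rough illustration of the IMC, the sketch below estimates the first-base distribution and one transition table per pair of consecutive loci, then evaluates ln P(x|M). The names `train_imc` and `imc_log_prob`, the pseudocount, and the toy training set are assumptions made for this example, not details given in the talk.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def train_imc(seqs, pseudocount=1.0):
    """Return (pi, trans): first-base distribution and one transition table per locus pair."""
    k = len(ALPHABET)
    first = Counter(s[0] for s in seqs)
    pi = {a: (first[a] + pseudocount) / (len(seqs) + k * pseudocount) for a in ALPHABET}
    trans = []
    for i in range(len(seqs[0]) - 1):
        pairs = Counter((s[i], s[i + 1]) for s in seqs)
        prev = Counter(s[i] for s in seqs)
        trans.append({
            (a, b): (pairs[(a, b)] + pseudocount) / (prev[a] + k * pseudocount)
            for a in ALPHABET for b in ALPHABET
        })
    return pi, trans

def imc_log_prob(model, x):
    """ln P(x | M) = ln pi(x_1) + sum over i of ln t_i(x_i, x_{i+1})."""
    pi, trans = model
    logp = math.log(pi[x[0]])
    for i in range(len(x) - 1):
        logp += math.log(trans[i][(x[i], x[i + 1])])
    return logp

# Hypothetical toy training set for a single species
model = train_imc(["ACGT", "ACGT", "ACGA"])
lp_seen = imc_log_prob(model, "ACGT")
lp_unseen = imc_log_prob(model, "TTTT")
```

Because each locus pair has its own table, the chain is position-specific (inhomogeneous), unlike a single transition matrix shared across the whole barcode.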

Hidden Markov Model (HMM)

Same structure as the IMC

Each state emits the associated DNA base with high probability, but can also emit the other bases with probability equal to the mutation rate

Barcode x is generated along path p with probability equal to the product of emissions and transitions along p

P(x|HMM) = sum of probabilities over all paths

  Efficiently computed by the forward algorithm

Accuracy on BOLD dataset

37 species with at least 100 barcodes from BOLD

10-50% of barcodes removed and used for testing

IMC yields better accuracy in all cases

Accuracy by fraction of barcodes removed for testing:

Method   10%      20%      30%      40%      50%
PWM      90.08%   90.01%   90.02%   89.68%   89.69%
IMC      99.97%   99.93%   99.90%   99.91%   99.89%
HMM      99.57%   99.57%   99.66%   99.70%   99.76%

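Score normalization

DB barcodes have non-uniform lengths and cover different regions of the COI gene

  Membership probabilities not always comparable

Normalization scheme:

  Species models constructed only over positions covered in DB

  Scores normalized using background IMC $\bar{M}$ constructed from all sequences in DB

$$\mathrm{Score}(x) = \ln \frac{P(x \mid M)}{P(x \mid \bar{M})} = \ln \frac{\pi(x_1)}{\bar{\pi}(x_1)} + \sum_{i=1}^{n-1} \ln \frac{t_i(x_i, x_{i+1})}{\bar{t}_i(x_i, x_{i+1})}$$

where $\bar{\pi}$ and $\bar{t}_i$ are the first-base and transition probabilities of the background model $\bar{M}$.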
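
The normalized score is a sum of per-position log-odds terms, which a short sketch makes concrete. The `(pi, trans)` model layout, the function name, and the toy species and background models below are assumptions for illustration; restricting the sum to positions covered in the DB is not shown.

```python
import math

ALPHABET = "ACGT"

def log_odds_score(species, background, x):
    """Score(x) = ln P(x | M) - ln P(x | background), accumulated term by term."""
    pi, t = species
    bpi, bt = background
    score = math.log(pi[x[0]] / bpi[x[0]])
    for i in range(len(x) - 1):
        pair = (x[i], x[i + 1])
        score += math.log(t[i][pair] / bt[i][pair])
    return score

# Hypothetical toy models over barcodes of length 3
sticky = {(a, b): 0.7 if a == b else 0.1 for a in ALPHABET for b in ALPHABET}
uniform = {(a, b): 0.25 for a in ALPHABET for b in ALPHABET}
species = ({"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}, [sticky, sticky])
background = ({a: 0.25 for a in ALPHABET}, [uniform, uniform])

s_match = log_odds_score(species, background, "AAA")  # typical of the species
s_other = log_odds_score(species, background, "ACG")  # better explained by the background
```

A positive score means the species model explains the barcode better than the background; a negative score means the opposite.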

Computing the confidence of assignment

x assigned to species SP with score s

p-value: probability that a barcode generated under the background model $\bar{M}$ has a score s' ≥ s

Methods for p-value estimation:

  Random sampling: generate random sequences and count how many exceed the score

  Exact computation (for PWMs): dynamic programming [Rahmann03], branch and bound [Zhang et al. 07], shifted FFTs [Nagarajan et al. 05]
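
The random sampling estimate can be sketched in a few lines of Python. The sampler, the number of trials, the fixed seed, and the toy models are assumptions for illustration; sequences are drawn from the background model, and a sampled sequence counts when its score is at least s, matching the p-value definition.

```python
import math
import random

ALPHABET = "ACGT"

def log_odds_score(species, background, x):
    (pi, t), (bpi, bt) = species, background
    score = math.log(pi[x[0]] / bpi[x[0]])
    for i in range(len(x) - 1):
        pair = (x[i], x[i + 1])
        score += math.log(t[i][pair] / bt[i][pair])
    return score

def sample_imc(model, rng):
    """Draw one sequence from an IMC given as (pi, trans)."""
    pi, t = model
    seq = rng.choices(ALPHABET, weights=[pi[a] for a in ALPHABET])
    for step in t:
        prev = seq[-1]
        seq += rng.choices(ALPHABET, weights=[step[(prev, b)] for b in ALPHABET])
    return "".join(seq)

def sampled_p_value(species, background, s, trials=10000, seed=0):
    """Fraction of background-generated sequences whose score is at least s."""
    rng = random.Random(seed)
    hits = sum(
        log_odds_score(species, background, sample_imc(background, rng)) >= s
        for _ in range(trials)
    )
    return hits / trials

# Hypothetical toy models over barcodes of length 3
sticky = {(a, b): 0.7 if a == b else 0.1 for a in ALPHABET for b in ALPHABET}
uniform = {(a, b): 0.25 for a in ALPHABET for b in ALPHABET}
species = ({"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}, [sticky, sticky])
background = ({a: 0.25 for a in ALPHABET}, [uniform, uniform])

s_max = log_odds_score(species, background, "AAA")
p_est = sampled_p_value(species, background, s_max)
```

In this toy setting only "AAA" reaches the maximum score, so the estimate should be close to 1/64. Small p-values need many samples to estimate reliably, which is one motivation for exact computation.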

Exact computation for PWMs [Rahmann03]

Computes the entire distribution

Scores rounded by a granularity factor

Score is a sum of n independent variables (score contribution of each position)

Probability of a random sequence of length i having a score of α is computed from the contribution of the first i-1 positions and the current position
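
A minimal sketch of this kind of dynamic program is given below. It follows the idea described above (round each position's contribution, then build the score distribution one position at a time) but is not the cited implementation; the granularity `eps`, the dictionary representation, and the toy PWM are assumptions for illustration.

```python
import math

ALPHABET = "ACGT"

def rounded_score(pwm, bg, x, eps=0.01):
    """Score of x in integer units of eps; each position is rounded separately."""
    return sum(round(math.log(col[c] / bcol[c]) / eps)
               for col, bcol, c in zip(pwm, bg, x))

def score_distribution(pwm, bg, eps=0.01):
    """Map each rounded score to its probability under the background PWM."""
    dist = {0: 1.0}  # empty prefix: score 0 with probability 1
    for col, bcol in zip(pwm, bg):
        nxt = {}
        for s, p in dist.items():
            for a in ALPHABET:
                k = s + round(math.log(col[a] / bcol[a]) / eps)
                nxt[k] = nxt.get(k, 0.0) + p * bcol[a]
        dist = nxt
    return dist

def p_value(dist, s):
    """Probability of a rounded score of at least s."""
    return sum(p for k, p in dist.items() if k >= s)

# Hypothetical toy PWM with three identical columns and a uniform background
col = {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}
flat = {a: 0.25 for a in ALPHABET}
pwm, bg = [col] * 3, [flat] * 3
dist = score_distribution(pwm, bg)
p_max = p_value(dist, rounded_score(pwm, bg, "AAA"))
```

Rounding keeps the number of distinct scores small, so the table stays compact even though the number of possible sequences grows as 4^n.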

Exact computation for IMCs

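Define $f^i_y(\alpha)$ as the probability of a random sequence of length i having score α and last letter y:

$$f^i_y(\alpha) = P\big(\mathrm{Score}(x_1, \ldots, x_i) = \alpha,\ x_i = y \mid \bar{M}\big)$$

Basic recurrence:

$$f^i_y(\alpha) = \sum_{z \in \{A,C,T,G\}} f^{i-1}_z\!\left(\alpha - \ln \frac{t_{i-1}(z,y)}{\bar{t}_{i-1}(z,y)}\right) \bar{t}_{i-1}(z,y)$$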

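IMC exact p-value computation

Initially:

$$f^1_y(\alpha) = \begin{cases} \bar{\pi}(y), & \text{if } \alpha = \ln \pi(y) - \ln \bar{\pi}(y) \\ 0, & \text{o/w} \end{cases}$$

The probability of a random barcode having score α:

$$f^n(\alpha) = \sum_{y \in \{A,C,T,G\}} f^n_y(\alpha)$$

Runtime $O(n^2 R / \epsilon)$, where R is the difference between max and min score for any i and $\epsilon$ is the granularity to which scores are rounded.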

Experimental setup (1)

Compared methods:

  IMC: assign the species with the highest score; if score < species-specific threshold, assign UNKNOWN

  Distance-based (BOLD-IDS like): assign the species containing the barcode showing the least divergence; if divergence > threshold (default 1%), assign UNKNOWN
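
The distance-based baseline can be pictured with the following sketch. It is not the BOLD-IDS implementation: the use of the uncorrected fraction of differing sites as the divergence measure, the function names, and the toy database are assumptions for illustration, and barcodes are assumed to be aligned and of equal length.

```python
def p_distance(a, b):
    """Fraction of differing sites between two aligned, equal-length barcodes."""
    return sum(c1 != c2 for c1, c2 in zip(a, b)) / len(a)

def assign_by_distance(db, x, threshold=0.01):
    """Assign x to the species of its closest DB barcode, or UNKNOWN if too divergent."""
    best_sp, best_d = None, float("inf")
    for sp, barcodes in db.items():
        for b in barcodes:
            d = p_distance(b, x)
            if d < best_d:
                best_sp, best_d = sp, d
    return best_sp if best_d <= threshold else "UNKNOWN"

# Hypothetical toy database with 200-base barcodes
db = {"sp1": ["A" * 200], "sp2": ["C" * 200]}
close = assign_by_distance(db, "A" * 199 + "G")       # 0.5% divergence from sp1
far = assign_by_distance(db, "A" * 190 + "G" * 10)    # 5% divergence from sp1
```

Every query is compared against every DB barcode, so the cost of one assignment grows linearly with the size of the database.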

Basic questions:

  What is the effect of training set size (#barcodes per species) on accuracy?

  What is the effect of the #species on accuracy?

Experimental setup (2)

Two scenarios:

  Complete DB: all new barcodes belong to species in DB

  Incomplete DB: some new barcodes belong to species not in DB

Accuracy measures

True positive rate = TP/(TP+FP)

Barcodes belonging to species present in the DB:

  TP = #barcodes assigned to correct species

  FP = #barcodes assigned to incorrect species

Barcodes belonging to species not present in DB:

  TP = #barcodes assigned UNKNOWN

  FP = #barcodes assigned to species in the DB
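
As a small worked example of these measures, the sketch below computes both TP rates plus the unknown rate. It assumes one reading of the definitions: for barcodes whose species is in the DB, an UNKNOWN assignment counts as neither TP nor FP and only contributes to the unknown rate. The function name and the toy labels are hypothetical.

```python
def tp_rates(truth, assigned, db_species):
    """Return (TP rate for in-DB barcodes, TP rate for out-of-DB barcodes, unknown rate)."""
    in_tp = in_fp = out_tp = out_fp = unknown = 0
    for t, a in zip(truth, assigned):
        if a == "UNKNOWN":
            unknown += 1
        if t in db_species:
            if a == "UNKNOWN":
                continue  # neither TP nor FP under the reading assumed here
            if a == t:
                in_tp += 1
            else:
                in_fp += 1
        elif a == "UNKNOWN":
            out_tp += 1
        else:
            out_fp += 1

    def rate(tp, fp):
        return tp / (tp + fp) if tp + fp else None

    return rate(in_tp, in_fp), rate(out_tp, out_fp), unknown / len(truth)

# Hypothetical toy run: species "c" is not in the DB
truth = ["a", "a", "b", "c"]
assigned = ["a", "b", "UNKNOWN", "UNKNOWN"]
in_db_rate, out_db_rate, unknown_rate = tp_rates(truth, assigned, {"a", "b"})
```

Raising the UNKNOWN threshold typically trades a higher unknown rate for a higher TP rate, which is the tradeoff plotted in the experiments.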

Effect of #barcodes/species

Datasets containing all BOLD species with at least 5/25 barcodes

  BOLD5: 1508 sp, 28600 barcodes

  BOLD25: 270 sp, 17197 barcodes

DB composed of randomly picked 5-20 barcodes from all species in BOLD25

Test barcodes

  Complete database scenario: all remaining barcodes from BOLD25

  Incomplete database scenario: all barcodes from BOLD5 not in DB

Effect of #barcodes/species, complete DB

[Figure: IMC and Distance (thr 1%). TP rate (0.91 to 1) vs. unknown rate (0 to 0.5); series for 5, 10, 15, and 20 barcodes per species.]

Effect of #barcodes/species, incomplete DB

[Figure, panel 1: barcodes belonging to species in DB. TP rate (0.90 to 1.00) vs. unknown rate (0.00 to 0.50); series for 5, 10, 15, and 20 barcodes per species.]

[Figure, panel 2: barcodes not belonging to species in DB. TP rate (0.40 to 1.00) vs. unknown rate (0.00 to 0.50); series for 5, 10, 15, and 20 barcodes per species.]

Effect of #species

Datasets containing all BOLD species with at least 5/10 barcodes

  BOLD5: 1508 sp, 28600 barcodes

  BOLD10: 690 sp, 23558 barcodes

DB composed of randomly picked 100 to 690 species from BOLD10, with 10 barcodes per species

Test barcodes

  Complete database scenario: all remaining barcodes from picked species

  Incomplete database scenario: all barcodes from BOLD5 not in DB

Effect of #species, complete DB

[Figure: TP rate (0.7 to 1) vs. unknown rate (0 to 0.5); series for 100, 200, 300, 400, 500, 600, and 690 species.]

Effect of #species, incomplete DB

[Figure, panel 1: barcodes belonging to species in the DB. TP rate (0.70 to 1.00) vs. unknown rate (0.00 to 0.50); series for 100, 200, 300, 400, 500, 600, and 690 species.]

[Figure, panel 2: barcodes NOT belonging to species in DB. TP rate (0.40 to 1.00) vs. unknown rate (0.00 to 0.50); series for 100, 200, 300, 400, 500, 600, and 690 species.]

Conclusions & Ongoing work

IMC provides a scalable method for species identification

  High accuracy, with useful tradeoff between TP rate and unknown rate

  Efficiently computable p-values

Comprehensive comparison of identification algorithms to be submitted to 2nd International Barcode Conference

  Broad coverage of methods: tree-based, distance-based, character-based, model-based

  Assessment of further effects besides #species and #barcodes/species: barcode length, barcode quality, number of regions, runtime scalability (up to millions of species)

  Diverse datasets (BOLD, cowries, flu viruses, simulated data, etc.)
