Comparison of Genomic DNA to cDNA Alignment Methods


Miguel Galves and Zanoni Dias

Institute of Computing – Unicamp – Campinas – SP – Brazil

{miguel.galves,zanoni}@ic.unicamp.br

Scylla Bioinformatics – Campinas – SP – Brazil

{miguel,zanoni}@scylla.com.br

Agenda

Introduction
Problem
Aligners
Data set
Subsets
Evaluation Methods
Results: Exact Alignments
Results: EST Alignments
Running Time Comparison
Conclusions

Introduction

Identifying genes in non-characterized DNA sequences is one of the greatest challenges in genomics

EST-to-DNA alignment is one of the most common methods

ESTs are key to understanding the inner workings of an organism
– Humans have between 30000 and 35000 genes
– Alternative splicing plays an important role in diversity

Problem

[Figure: RNA sequence CCCGGGAAACGAAUAU CCUCUCACCCGGGA CUUGG; labels: Mature mRNA, Intron]

Problem: How to solve it?

Classic algorithms
– Dynamic programming

Heuristic-based algorithms
– Multi-step
– Based on other tools such as BLAST and local alignments

Aligners

Java version of global and semi-global aligners
– Affine gap penalty function
– Linear space
– Global algorithm by Miller and Myers (1988)
– Semi-global based on the global algorithm

Heuristic-based algorithms
– sim4, Spidey and est_genome
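
The semi-global idea with an affine gap penalty can be sketched in a few lines of Python. This is only an illustration, not the authors' Java implementation: it computes the score alone (no traceback, so it is not the Miller and Myers linear-space alignment), it assumes that a score system is ordered as (match, mismatch, gap open, gap extension), that the first gap position costs the gap-open value and each further position costs the gap-extension value, and that unaligned genomic bases before and after the cDNA are free. The function name `semi_global_score` is hypothetical.

```python
NEG_INF = float("-inf")


def semi_global_score(cdna, genomic, match=1, mismatch=-2,
                      gap_open=-1, gap_extend=0):
    """Best score of aligning all of `cdna` against any stretch of `genomic`.

    Assumption: the first position of a gap costs `gap_open`, every
    additional position costs `gap_extend`, and genomic flanks are free.
    Only two rows are kept in memory at any time.
    """
    m = len(genomic)
    prev_h = [0] * (m + 1)        # row 0: leading genomic bases are free
    prev_f = [NEG_INF] * (m + 1)  # no gap in the genomic row is open yet
    for i in range(1, len(cdna) + 1):
        h = [NEG_INF] * (m + 1)
        f = [NEG_INF] * (m + 1)
        # Column 0: the cDNA prefix is aligned only against gaps.
        f[0] = max(prev_h[0] + gap_open, prev_f[0] + gap_extend)
        h[0] = f[0]
        e = NEG_INF  # best score ending in a gap in the cDNA row
        for j in range(1, m + 1):
            e = max(h[j - 1] + gap_open, e + gap_extend)
            f[j] = max(prev_h[j] + gap_open, prev_f[j] + gap_extend)
            s = match if cdna[i - 1] == genomic[j - 1] else mismatch
            h[j] = max(prev_h[j - 1] + s, e, f[j])
        prev_h, prev_f = h, f
    # Trailing genomic bases are free: take the best cell of the last row.
    return max(prev_h)


print(semi_global_score("ACGT", "ACTTTTGT"))                # intron-like gap
print(semi_global_score("ACGT", "ACTTTTGT", gap_open=-10))  # gap too costly
```

With a gap extension of 0, a long intron-like gap costs the same as a short one, which is why a small gap-open penalty lets the aligner skip over the inserted bases while a large one favors an ungapped placement.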

Data Set

Human genome database
– Based on FASTA and GenBank flat-format files from the NCBI repository

Filtering criteria (entries removed)
– Genes, mRNAs and CDS with the /pseudo tag
– mRNAs without any CDS
– Genes without any mRNA
– CDS matching wrong patterns

23124 genes and 27448 mRNAs stored in the database
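
The filtering criteria above could be applied roughly as follows. This is a sketch under stated assumptions, not the pipeline actually used: the nested dictionary layout (`pseudo`, `mrnas`, `cds`, `sequence`) is hypothetical, and because the slides do not say what a "wrong pattern" is, the check in `cds_looks_valid` (length divisible by 3, ATG start, standard stop codon) is purely an assumed stand-in.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def cds_looks_valid(seq):
    """Assumed pattern check; the source does not define 'wrong patterns'."""
    return (len(seq) >= 6 and len(seq) % 3 == 0
            and seq.startswith("ATG") and seq[-3:] in STOP_CODONS)


def filter_genes(genes):
    """Return the genes that survive every filtering criterion."""
    kept = []
    for gene in genes:
        if gene.get("pseudo"):
            continue  # gene carries the /pseudo tag
        mrnas = []
        for mrna in gene.get("mrnas", []):
            if mrna.get("pseudo"):
                continue  # mRNA carries the /pseudo tag
            cds_list = [c for c in mrna.get("cds", [])
                        if not c.get("pseudo")
                        and cds_looks_valid(c["sequence"])]
            if cds_list:  # drop mRNAs left without any CDS
                mrnas.append({**mrna, "cds": cds_list})
        if mrnas:  # drop genes left without any mRNA
            kept.append({**gene, "mrnas": mrnas})
    return kept


genes = [
    {"name": "g1", "mrnas": [{"cds": [{"sequence": "ATGAAATAA"}]}]},
    {"name": "g2", "pseudo": True,
     "mrnas": [{"cds": [{"sequence": "ATGAAATAA"}]}]},
    {"name": "g3", "mrnas": [{"cds": []}]},
]
print([g["name"] for g in filter_genes(genes)])
```

Filtering from the innermost feature outward means a gene is dropped automatically once none of its mRNAs keeps a usable CDS.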

Subsets

Subset 1: 66 genes from chromosome Y with fewer than 100000 bases

Subset 2: 50 complete genes from chromosome Y with fewer than 100000 bases

Subset 3: 8056 complete genes from all chromosomes with fewer than 100000 bases

Subset 4: 493 artificial ESTs based on complete genes from chromosome 6 with fewer than 100000 bases

Evaluation methods

Number of gaps introduced in the aligned gene sequence

Delta exons

Base similarity percentage

Mismatch percentage
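
The four measures can be illustrated on a pair of gapped alignment rows. The slides do not give exact formulas, so everything here is an assumption made for the sketch: gaps are counted as `-` characters in the gene row, both percentages are taken relative to the cDNA length, exons are counted as contiguous non-gap blocks in the cDNA row, and delta exons is that count minus the annotated exon count. The function name `alignment_metrics` is hypothetical.

```python
def alignment_metrics(aligned_cdna, aligned_gene, annotated_exons):
    """Compute the evaluation measures for one gapped alignment.

    Both rows must have the same length and use '-' for gaps.
    All definitions are assumptions, not taken from the source.
    """
    if len(aligned_cdna) != len(aligned_gene):
        raise ValueError("alignment rows must have equal length")
    cdna_len = len(aligned_cdna) - aligned_cdna.count("-")
    if cdna_len == 0:
        raise ValueError("cDNA row contains no bases")
    matches = mismatches = 0
    for c, g in zip(aligned_cdna, aligned_gene):
        if c == "-" or g == "-":
            continue  # gap columns are neither matches nor mismatches
        if c == g:
            matches += 1
        else:
            mismatches += 1
    # Exons: maximal runs of cDNA bases between gaps in the cDNA row.
    exon_blocks = len([b for b in aligned_cdna.split("-") if b])
    return {
        "gene_gaps": aligned_gene.count("-"),
        "delta_exons": exon_blocks - annotated_exons,
        "similarity_pct": 100.0 * matches / cdna_len,
        "mismatch_pct": 100.0 * mismatches / cdna_len,
    }


print(alignment_metrics("ACGT--ACGA", "ACGTTTACGT", annotated_exons=2))
```

Under these definitions an exact alignment scores 0 gaps, 0 delta exons, 100% similarity and 0% mismatch.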

Experimental method

Two score systems, out of 15 previously defined, and one alignment strategy were chosen using subsets 1 and 2:
– Semi-global aligner
– (1,-2,-1,0) and (1,-2,-10,0) score systems

The classic semi-global aligner was compared to sim4, Spidey and est_genome on both subsets 3 and 4

Results: Exact Alignments

Extra Gap

Strategy             Avg      SD       %Score 0
SG(1, -2, -1, 0)     0.00     0.00     100.00%
SG(1, -2, -10, 0)    0.00     0.00     100.00%
sim4                 1.11     1.63     54.56%
est_genome           16.99    21.49    27.84%
Spidey               0.15     1.39     97.43%

Delta Exons

Strategy             Avg      SD       %Score 0
SG(1, -2, -1, 0)     0.00     0.00     100.00%
SG(1, -2, -10, 0)    0.01     0.07     99.91%
sim4                 -0.01    0.20     97.46%
est_genome           -0.14    0.30     76.79%
Spidey               -4.04    3.10     0.00%

Base Similarity

Strategy             Avg       SD        %Scr. 100%
SG(1, -2, -1, 0)     99.89%    0.49%     53.56%
SG(1, -2, -10, 0)    99.89%    0.49%     53.49%
sim4                 99.39%    1.34%     22.79%
est_genome           53.83%    35.00%    18.11%
Spidey               80.34%    36.49%    44.25%

Mismatch Percentage

Strategy             Avg      SD       %Scr. 100%
SG(1, -2, -1, 0)     0.00%    0.00%    100.00%
SG(1, -2, -10, 0)    0.01%    0.03%    99.47%
sim4                 0.17%    0.21%    36.68%
est_genome           1.19%    1.26%    21.55%
Spidey               0.15%    0.98%    90.65%

Results: EST Alignments

Running Time Comparison

Aligner        EST-to-DNA (sec/alignment)    mRNA-to-DNA (sec/alignment)
sim4           0.013                         0.170
Spidey         0.066                         0.140
est_genome     0.640                         3.400
Semi-global    0.670                         5.170

Conclusions

The classic semi-global algorithm produces good results
– Running time is a problem, although it can be improved

sim4 produces the best results among the external software tools tested

Thanks
