Post on 19-Jun-2015
Comparison of Genomic DNA to cDNA Alignment Methods
Miguel Galves and Zanoni Dias
Institute of Computing – Unicamp – Campinas – SP – Brazil
{miguel.galves,zanoni}@ic.unicamp.br
Scylla Bioinformatics – Campinas – SP – Brazil
{miguel,zanoni}@scylla.com.br
Agenda
Introduction
Problem
Aligners
Data set
Subsets
Evaluation Methods
Results: Exact Alignments
Results: EST Alignments
Running Time Comparison
Conclusions
Introduction
Identifying genes in non-characterized DNA sequences is one of the greatest challenges in genomics
EST-to-DNA alignment is one of the most common methods
ESTs are key to understanding the inner workings of an organism
– Humans have between 30000 and 35000 genes
– Alternative splicing plays an important role in diversity
Problem
[Figure: mRNA and mature mRNA, with introns and exons labeled; sequence fragments shown: CCCGGGAAACGAAUAU, CCUCUCACCCGGGA, CUUGG]
Problem: How to solve it?
Classic algorithms
– Dynamic programming
Heuristic-based algorithms
– Multi-step
– Based on other tools such as BLAST and local alignments
Aligners
Java implementation of global and semi-global aligners
– Affine gap penalty function
– Linear space
– Global algorithm by Miller and Myers (1988)
– Semi-global based on the global algorithm
Heuristic-based algorithms
– sim4, Spidey and est_genome
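To make the semi-global idea concrete, here is a minimal Python sketch of semi-global scoring with an affine gap penalty. It is not the authors' Java implementation: it uses the standard three-state (Gotoh-style) recurrence, returns only the score, and does not do the linear-space traceback of Miller and Myers. It assumes a score system (a, b, c, d) means (match, mismatch, gap open, gap extend), that the first base of a gap costs the open penalty and each further base costs the extend penalty, and that unaligned genomic bases before and after the cDNA are free. The function name is hypothetical.

```python
NEG_INF = float("-inf")

def semi_global_score(cdna, genomic, scores=(1, -2, -1, 0)):
    """Best score for aligning all of cdna against any stretch of genomic.

    Assumption: scores = (match, mismatch, gap_open, gap_extend).
    """
    match, mismatch, gap_open, gap_extend = scores
    n = len(genomic)
    # Row 0: no cDNA base consumed yet, so any genomic prefix is skipped for free.
    m_prev = [0] * (n + 1)        # last column is a base-to-base pair
    x_prev = [NEG_INF] * (n + 1)  # last column is a gap in the cDNA (intron-like)
    y_prev = [NEG_INF] * (n + 1)  # last column is a gap in the genomic sequence
    for i in range(1, len(cdna) + 1):
        m_cur = [NEG_INF] * (n + 1)
        x_cur = [NEG_INF] * (n + 1)
        y_cur = [NEG_INF] * (n + 1)
        # Column 0: the first i cDNA bases are aligned to gaps only.
        y_cur[0] = gap_open + (i - 1) * gap_extend
        for j in range(1, n + 1):
            sub = match if cdna[i - 1] == genomic[j - 1] else mismatch
            m_cur[j] = max(m_prev[j - 1], x_prev[j - 1], y_prev[j - 1]) + sub
            x_cur[j] = max(m_cur[j - 1] + gap_open,
                           x_cur[j - 1] + gap_extend,
                           y_cur[j - 1] + gap_open)
            y_cur[j] = max(m_prev[j] + gap_open,
                           y_prev[j] + gap_extend,
                           x_prev[j] + gap_open)
        m_prev, x_prev, y_prev = m_cur, x_cur, y_cur
    # Taking the maximum over the last row makes the genomic suffix free as well.
    return max(max(m_prev), max(x_prev), max(y_prev))
```

Under these assumptions, a zero extension penalty means a long intron costs the same as a one-base gap, which is why a score system such as (1,-2,-1,0) tolerates introns while (1,-2,-10,0) makes every gap expensive.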
Data Set
Human genome database
– Based on FASTA and GenBank flat-format files from the NCBI repository
Filtering criteria
– Genes, mRNAs and CDS with the /pseudo tag
– mRNAs without any CDS
– Genes without any mRNA
– CDS matching wrong patterns
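The filtering criteria above could be applied roughly as in the sketch below. The record layout (dictionaries with "pseudo", "mrnas", "cds" and "seq" keys) is hypothetical, and the slides do not say which CDS patterns count as wrong, so `cds_looks_valid` is only a placeholder assumption (ATG start, stop codon at the end, length a multiple of 3). Whether an invalid CDS removes only that CDS or the whole gene is also an assumption here.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def cds_looks_valid(seq):
    # Placeholder rule (assumption, not from the slides):
    # starts with ATG, ends with a stop codon, length is a multiple of 3.
    return (len(seq) >= 6 and len(seq) % 3 == 0
            and seq.startswith("ATG") and seq[-3:] in STOP_CODONS)

def filter_gene(gene):
    """Return a cleaned copy of the gene record, or None if it is rejected."""
    if gene.get("pseudo"):
        return None  # gene carries the /pseudo tag
    kept_mrnas = []
    for mrna in gene.get("mrnas", []):
        if mrna.get("pseudo"):
            continue  # mRNA carries the /pseudo tag
        cds = [c for c in mrna.get("cds", [])
               if not c.get("pseudo") and cds_looks_valid(c["seq"])]
        if cds:  # drop mRNAs left without any CDS
            kept_mrnas.append({**mrna, "cds": cds})
    if not kept_mrnas:
        return None  # gene left without any mRNA
    return {**gene, "mrnas": kept_mrnas}
```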
23124 genes and 27448 mRNAs stored in database
Subsets
Subset 1: 66 genes from chromosome Y with less than 100000 bases
Subset 2: 50 complete genes from chromosome Y with less than 100000 bases
Subset 3: 8056 complete genes from all chromosomes with less than 100000 bases
Subset 4: 493 artificial ESTs based on complete genes from chromosome 6 with less than 100000 bases
Evaluation methods
Number of gaps introduced in the aligned gene sequence
Delta exons
Base similarity percentage
Mismatch percentage
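As an illustration, the four measures can be computed from a pairwise alignment given as two equal-length rows with "-" for gaps. The slides do not give exact formulas, so the definitions below are assumptions: gaps are counted as runs of "-" in the gene row, delta exons is the number of cDNA blocks found minus the annotated exon count, similarity is matches over cDNA length, and mismatch is mismatches over base-to-base columns. The function and key names are hypothetical.

```python
import re

def alignment_metrics(gene_row, cdna_row, annotated_exons):
    """Compute the four evaluation measures under the assumed definitions."""
    if len(gene_row) != len(cdna_row):
        raise ValueError("aligned rows must have the same length")
    # Runs of '-' in the gene row: gaps introduced in the gene sequence.
    gene_gaps = len(re.findall(r"-+", gene_row))
    # Columns where both rows hold a base.
    pairs = [(g, c) for g, c in zip(gene_row, cdna_row) if g != "-" and c != "-"]
    matches = sum(g == c for g, c in pairs)
    mismatches = len(pairs) - matches
    cdna_len = sum(c != "-" for c in cdna_row)
    # Blocks of cDNA bases separated by gaps are treated as predicted exons.
    found_exons = len(re.findall(r"[^-]+", cdna_row))
    return {
        "gene_gaps": gene_gaps,
        "delta_exons": found_exons - annotated_exons,
        "similarity_pct": 100.0 * matches / cdna_len if cdna_len else 0.0,
        "mismatch_pct": 100.0 * mismatches / len(pairs) if pairs else 0.0,
    }
```

For example, `alignment_metrics("GGACGCCCCCTTTGG", "--ACG-----TTT--", 2)` describes a two-exon cDNA placed exactly on its gene: no gaps in the gene row, delta exons 0, 100% similarity and 0% mismatch.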
Experimental method
Two score systems (out of 15 previously defined) and one alignment strategy were chosen using subsets 1 and 2:
– Semi-global aligner
– (1,-2,-1,0) and (1,-2,-10,0) score systems
The classic semi-global aligner was compared to sim4, Spidey and est_genome, using subsets 3 and 4
Results: Exact Alignments
Extra Gap
Strategy             Avg      SD       %Score 0
SG(1, -2, -1, 0)     0.00     0.00     100.00%
SG(1, -2, -10, 0)    0.00     0.00     100.00%
sim4                 1.11     1.63     54.56%
est_genome           16.99    21.49    27.84%
Spidey               0.15     1.39     97.43%
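The Avg, SD and %Score 0 columns summarize one measure over all alignments in a subset. A minimal sketch of that summary follows, assuming SD is the population standard deviation and %Score 0 is the share of alignments whose value is exactly 0; the slides do not state either choice, and the function name is hypothetical.

```python
import statistics

def summarize(values):
    """Return (average, standard deviation, percent of values equal to 0)."""
    avg = statistics.mean(values)
    sd = statistics.pstdev(values)  # population SD (assumption)
    pct_zero = 100.0 * sum(v == 0 for v in values) / len(values)
    return avg, sd, pct_zero
```

Called on the per-alignment extra-gap counts of one aligner, this would yield one row of a table like the one above.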
Results: Exact Alignments
Delta Exons
Strategy             Avg      SD       %Score 0
SG(1, -2, -1, 0)     0.00     0.00     100.00%
SG(1, -2, -10, 0)    0.01     0.07     99.91%
sim4                 -0.01    0.20     97.46%
est_genome           -0.14    0.30     76.79%
Spidey               -4.04    3.10     0.00%
Results: Exact Alignments
Base Similarity
Strategy             Avg       SD        %Scr. 100%
SG(1, -2, -1, 0)     99.89%    0.49%     53.56%
SG(1, -2, -10, 0)    99.89%    0.49%     53.49%
sim4                 99.39%    1.34%     22.79%
est_genome           53.83%    35.00%    18.11%
Spidey               80.34%    36.49%    44.25%
Results: Exact Alignments
Mismatch Percentage
Strategy             Avg      SD       %Scr. 100%
SG(1, -2, -1, 0)     0.00%    0.00%    100.00%
SG(1, -2, -10, 0)    0.01%    0.03%    99.47%
sim4                 0.17%    0.21%    36.68%
est_genome           1.19%    1.26%    21.55%
Spidey               0.15%    0.98%    90.65%
Results: EST Alignments
Running Time Comparison
Strategy       EST-to-DNA (sec/alignment)    mRNA-to-DNA (sec/alignment)
sim4           0.013                         0.170
Spidey         0.066                         0.140
est_genome     0.640                         3.400
Semi-global    0.670                         5.170
Conclusions
Classic semi-global algorithm produces good results
– Running time is a problem, although it can be improved
sim4 produces the best results among the external software tested
Thanks