Hugh E. Williams and Justin Zobel
IEEE Transactions on Knowledge and Data Engineering, Vol. 14, No. 1, January/February 2002
Presented by Jitimon Keinduangjun
Agenda
1. Introduction
2. What are the problems?
3. What are other people doing?
4. Indexed Genomic Retrieval with CAFÉ
5. Experimental Results
6. Conclusion
1. Introduction
• Biological sequence databases contain many DNA and protein sequences.
• DNA (deoxyribonucleic acid) is the primary genetic material in all living organisms
– A molecule composed of two complementary nucleotide strands connected by base pairs, in which each base pairs with only one other base:
adenine (A) pairs with thymine (T)
guanine (G) pairs with cytosine (C)
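The pairing rule above can be illustrated with a minimal sketch. The function names are hypothetical, and mapping the unknown base N to itself is an assumption rather than something the slides state.

```python
# Watson-Crick pairing from the slide: A pairs with T, G pairs with C.
# Assumption: the unknown base N is mapped to itself.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G", "N": "N"}


def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand."""
    return "".join(PAIRS[base] for base in strand.upper())


def reverse_complement(strand: str) -> str:
    """Return the complementary strand read in the opposite direction."""
    return complement(strand)[::-1]


print(complement("ATGC"))
print(reverse_complement("ATGC"))
```

Because each base has exactly one partner, one strand fully determines the other, which is why databases store only a single strand.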
1. Introduction (1)
• A DNA sequence consists of
– 4 letters: A, G, C, T
– 1 extra letter: N, for unknown bases
• DNA sequence database
> gi|1786692|gb|AE000155|ECAE000155 Escherichia coli, tesA, ybbA genes from bases 510705 to 522297 (section 45 of 400) of the complete genome
TAGAATAGATGAGAATTAGTCTGTTCTACGAAATAGACGAGAATTAGTCTAGTCTAAATAGACTAGAAATAGTCTAGTCTACGAAATAGACTAGAAATAGCCTAGTTCTGTTCTACGAAATAGACTAGAAATAGTCTAGTCTACG
> gb|L02373|ECORHSCA Escherichia coli Rhs core genes, complete cds
TAGAATAGATGAGAATTAGTCTGTTCTACGAAATAGACGAGAATTAGTCTAGTCTAAATAGACTAGAAATAGTCTAGTCTACGAAATAGACTAGAAAATAGACTAGAAATAGTCTAGTCTACGAAATAGACTAGAAATAGCCTAGTTCTGTT
:
The '>' character separates the sequences and introduces each sequence's description.
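A small reader for this record layout might look like the following sketch. The function name `parse_fasta` and the toy records are assumptions made for illustration; the only rule taken from the slide is that '>' starts a new record and its description line.

```python
from typing import Iterator, List, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (description, sequence) pairs from '>'-delimited text."""
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new '>' line closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            # Sequence data may span several lines; join them later.
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


# Toy input, not taken from any real database.
sample = """> seq1 first toy record
TAGAATAG
ATGAGAAT
> seq2 second toy record
TAGNATAG
"""
records = list(parse_fasta(sample))
for description, sequence in records:
    print(description, len(sequence))
```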
2. What are the problems?
2.1 Databases and query sequences contain low-quality sequences, so all techniques must also improve the accuracy of query results
2.2 All techniques also require long computation times
2.1 Low quality DNA sequences
• Substitutions, insertions, deletions
– Exact matching is not very effective
– Similarity search is required
• All algorithms find similar segment pairs; their scores can be improved by allowing insertions and deletions
Query: 3 LTRYCA - -GFTSLLKCNDADTIYDG 28
| | | | | | | | | | | | | | | | | | |
Subject : 3325 LTRYCAPAGFXALLKCNDADT--DG 3350
2.2 Long computation time required
• Databases are huge and varied
• A database contains many different sequences of variable length, so database search requires local similarity
3. What are other people doing?
3.1 SSEARCH Algorithm
– Uses Dynamic Programming (DP)
• Very slow, very sensitive
3.2 BLAST Algorithm
– BLAST 1.4 (old version): ungapped alignment
• Fast, sensitive
– BLAST 2.0 (new version): gapped alignment
• High speed, less sensitive
3.3 FASTA Algorithm
– Uses DP-based techniques: gapped alignment
• Slow, more sensitive
3.1 SSEARCH Algorithm
Edit distance and Dynamic Programming
• Assume that the two given sequences are A and B
– n and m are the lengths of sequence A and sequence B, respectively
– $s(a_i, b_j)$: similarity score between the aligned symbols $a_i$ and $b_j$
– Identical aligned pairs have a positive score of 1 and non-identical pairs have a score of 0
– Distance matrix D: $D_{i,0} = D_{0,j} = 0$ for i = 0,1,…,n and j = 0,1,…,m
– Time complexity is O(n*m)
\[
D_{i,j} = \max \left\{ D_{i-1,j},\; D_{i,j-1},\; D_{i-1,j-1} + s(a_i, b_j) \right\}
\]
3.1 SSEARCH Algorithm (1)
• Example: Pairwise alignment via DP
– Sequence a: ACGACA
– Sequence b: AGCAC

DP table (rows: sequence a, columns: sequence b)
     -  A  G  C  A  C
  -  0  0  0  0  0  0
  A  0  1  1  1  1  1
  C  0  1  1  2  2  2
  G  0  1  2  2  2  2
  A  0  1  2  2  3  3
  C  0  1  2  3  3  4
  A  0  1  2  3  4  4

Three possible alignments
(1) a: ACGACA-
    b: A-G-CAC
(2) a: ACG-ACA
    b: A-GCAC-
(3) a: A-CGACA
    b: AGC-AC-

Figure: cell d(i,j) is reached from d(i-1,j-1), d(i-1,j) or d(i,j-1); the moves are labelled Match, Insert and Delete.
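The example table can be reproduced with a short sketch of the recurrence as stated on the slides: zero borders, score 1 for identical symbols, 0 otherwise. This illustrates that simple recurrence only and is not the SSEARCH implementation; the function name `similarity_table` is hypothetical.

```python
from typing import List


def similarity_table(a: str, b: str) -> List[List[int]]:
    """Fill D with D[i][j] = max(D[i-1][j], D[i][j-1], D[i-1][j-1] + s)."""
    n, m = len(a), len(b)
    # Row 0 and column 0 stay at 0, matching the boundary condition.
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = 1 if a[i - 1] == b[j - 1] else 0  # identical pair scores 1
            d[i][j] = max(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1] + s)
    return d


table = similarity_table("ACGACA", "AGCAC")
for label, row in zip("-ACGACA", table):
    print(label, *row)
```

The bottom-right cell holds the final score; the two nested loops over n and m are where the O(n*m) cost comes from.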
3.2 BLAST Algorithm for DNA
• Sequence A: length N; Sequence B: length M
• Similarity scores for DNA:
– Match = 5, Mismatch = -4 (WU-BLAST)
– Match = 1, Mismatch = -3 (NCBI)
Figure: generating a keyword tree from the list of words (W = 12), scanning for exact matches (hits), and extending the hits.
Note: Extension consumes > 90% of all processing time.
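The scan-then-extend idea can be sketched roughly as follows. This is a toy illustration, not BLAST itself: a Python dictionary stands in for the keyword tree, the word length is 4 rather than 12 so the example stays small, extension runs only to the right, and the `max_drop` cut-off is a hypothetical parameter. The scores are the NCBI values quoted above.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

MATCH, MISMATCH = 1, -3  # NCBI DNA scores quoted on the slide


def word_hits(query: str, subject: str, w: int) -> List[Tuple[int, int]]:
    """Return (query_pos, subject_pos) for every exact shared word of length w."""
    words: Dict[str, List[int]] = defaultdict(list)
    for q in range(len(query) - w + 1):
        words[query[q:q + w]].append(q)  # the list of query words
    hits = []
    for s in range(len(subject) - w + 1):
        for q in words.get(subject[s:s + w], []):
            hits.append((q, s))
    return hits


def extend_right(query: str, subject: str, q: int, s: int, w: int,
                 max_drop: int = 5) -> Tuple[int, int]:
    """Ungapped extension to the right of a hit; returns (best score, length)."""
    score = best = w * MATCH
    length = best_len = w
    while q + length < len(query) and s + length < len(subject):
        same = query[q + length] == subject[s + length]
        score += MATCH if same else MISMATCH
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > max_drop:  # hypothetical cut-off: stop when far below best
            break
    return best, best_len


query, subject = "AACCGGTT", "TTAACCGGAA"
hits = word_hits(query, subject, w=4)
print(hits)
print(extend_right(query, subject, *hits[0], 4))
```

Every hit triggers a character-by-character extension loop, which gives some intuition for why extension dominates the processing time.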
3.3 FASTA Algorithm for DNA
• Sequence A: length N; Sequence B: length M
Figure: generating a keyword tree from the list of words (W = 12), scanning for exact matches (hits), and aligning subsequences.
4. Indexed Genomic Retrieval with CAFÉ
4.1 Indexing with CAFÉ
4.2 Coarse Searching with CAFÉ (Filtering)
4.3 Fine Searching with CAFÉ, using the same method as FASTA
4.1 Indexing with CAFÉ
• An inverted index consists of two components:
– A search structure
– Posting lists
• Example of an inverted index
ACCC 12,(3:144,154,962), 38,(2:47,1045)
The pattern ACCC occurs
– 3 times in the 12th sequence, at offsets 144, 154, and 962
– 2 times in the 38th sequence, at offsets 47 and 1045
• These indexes are compressed to reduce space; the compression is described in detail elsewhere.
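An uncompressed index of this shape can be sketched as below. The function names are hypothetical, a plain dictionary plays the role of the search structure, and 0-based offsets and 1-based sequence numbers are assumptions. The compression step is left out.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Posting = Tuple[int, List[int]]  # (sequence number, offsets within it)


def build_index(sequences: List[str], k: int) -> Dict[str, List[Posting]]:
    """Map every length-k substring to its posting list."""
    raw: Dict[str, Dict[int, List[int]]] = defaultdict(lambda: defaultdict(list))
    for seq_no, seq in enumerate(sequences, start=1):
        for offset in range(len(seq) - k + 1):
            raw[seq[offset:offset + k]][seq_no].append(offset)
    # Sort postings by sequence number; offsets are already in increasing order.
    return {word: sorted(per_seq.items()) for word, per_seq in raw.items()}


def format_postings(postings: List[Posting]) -> str:
    """Render postings in the 'seq,(count:offsets)' layout of the example."""
    return ", ".join(
        f"{seq_no},({len(offsets)}:{','.join(map(str, offsets))})"
        for seq_no, offsets in postings
    )


# Toy collection, not real data.
index = build_index(["ACCCTACCC", "GGACCC"], k=4)
print("ACCC", format_postings(index["ACCC"]))
```

Looking up a query word then costs one dictionary access instead of a scan over every sequence in the collection.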
4.2 Coarse Searching with CAFÉ
• A novel ranking technique using the index structure
• Score for ranking: COMBINED = COVERAGE - k * (LENGTH - COVERAGE)
Example: Ranking by CAFÉ
Homologous -chain hemoglobin
Human - Chimpanzee
Human - Rat
Human - Potato
COVERAGE = 9, LENGTH = 9
COVERAGE = 21, LENGTH = 55
COVERAGE = 6, LENGTH = 55
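The ranking formula can be worked through numerically with the three COVERAGE/LENGTH pairs from the example. The slides give no value for k, so the 0.5 used here is purely a hypothetical choice, and the function name `combined` is likewise an assumption.

```python
from typing import List, Tuple


def combined(coverage: float, length: float, k: float) -> float:
    """COMBINED = COVERAGE - k * (LENGTH - COVERAGE)."""
    return coverage - k * (length - coverage)


K = 0.5  # hypothetical weight; not stated in the slides

# (COVERAGE, LENGTH) pairs listed in the example.
candidates: List[Tuple[int, int]] = [(9, 9), (21, 55), (6, 55)]
scores = [combined(c, n, K) for c, n in candidates]

# Highest COMBINED score ranks first.
ranked = sorted(zip(scores, candidates), reverse=True)
for score, (c, n) in ranked:
    print(f"COVERAGE={c} LENGTH={n} COMBINED={score}")
```

With this k, the pair with COVERAGE = LENGTH keeps its full score, while the pairs with LENGTH = 55 are penalised in proportion to the gap between LENGTH and COVERAGE.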
5. Experimental Results
5.1 Test Data
5.2 Space
5.3 Retrieval Effectiveness
5.4 Speed
5.1 Test Data
• PIR database: used to assess the accuracy of the search system.
• GenBank database: used to assess speed and index space requirements.
5.2 Space
• Uncompressed index size: ~9.7 times the collection size
• Compressed index size (CAFÉ index): ~2.2 times the collection size
• The retrieval of uncompressed nucleotide data reduces the speed of the CAFÉ system
5.3 Retrieval Effectiveness
5.4 Speed
6. Conclusion
• The CAFÉ system affords much faster query evaluation than exhaustive searching.
• CAFÉ achieves better accuracy than BLAST 2, the most widely used search tool.
• CAFÉ indexes are smaller than the annotated source databases and the indexes of previous indexed systems.