Post on 30-Dec-2015
2 2
GeneMatcher2
• The GeneMatcher system comprises hardware and software components that significantly accelerate a number of computationally intensive sequence similarity search algorithms.
• There are two hardware components:
– GeneMatcher accelerator
– Post-Processor (Blastmachine)
• Two client interfaces:
– Unix command line
– Web-based GUI (BioView Workbench)
3 3
GeneMatcher2 Architecture
[Figure: architecture diagram. Labels: GeneMatcher2, Blast machine, Switch, CPU 1 through CPU 6912, Query #1 (agaggt..) through Query #n, Web interface.]
4 4
GeneMatcher2 System
• Massively parallel bioinformatics supercomputer
• Array of ASIC (Application Specific Integrated Circuit) chips combined with state-of-the-art Linux cluster technology
• Accelerates dynamic programming search algorithms
• 3,000 to 220,000 processors
• Thousands of times faster than general purpose computers
5 5
GeneMatcher2 Components
• 3 processor units (6,142 processors per unit)
• Up to 4 disk drives for database storage
• UltraSPARC computer
6 6
GeneMatcher2 Algorithms
• HMM and HMM-Frame
– Searches protein or DNA sequence data with domain models
– HMM-Frame aligns protein models to DNA with frame shift and optional intron tolerance
• Profile and Profile-Frame
– Position-specific scoring with profile models
– Frame shift tolerant protein profile searches against DNA sequence data
• GeneWise
– Aligns protein sequences or HMM against genomic data
– Tolerates introns and frame shifts
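Frame shift tolerance matters because a single inserted or deleted base moves the protein-coding signal into a different reading frame. The following is a minimal Python sketch of that effect, assuming the standard genetic code; the function names `translate` and `forward_frames` are illustrative and are not part of GeneMatcher2.

```python
# Standard genetic code in the usual compact TCAG ordering (assumption:
# standard code only; unknown or ambiguous codons are reported as "X").
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna, frame=0):
    """Translate a DNA string starting at offset `frame` (0, 1 or 2)."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def forward_frames(dna):
    """Return the translations of the three forward reading frames."""
    return [translate(dna, frame) for frame in range(3)]


if __name__ == "__main__":
    print(forward_frames("ATGGCCAAA"))   # original sequence
    print(forward_frames("ATGGGCCAAA"))  # same sequence with one inserted G
```

After the insertion, the residues downstream of the extra base no longer appear in frame 0 but show up in frame 1. A frame-tolerant search can follow the alignment across frames instead of losing it at the shift.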
7 7
GeneMatcher2 Algorithms cont.
• Smith-Waterman
– Comparison of DNA-DNA, Protein-Protein, Protein-DNA or DNA-DNA through protein
– Frame algorithms tolerate frame shifts, unlike BLAST counterparts
– Optional intron tolerance for searches of genomic data
– Highly sensitive search capacity finds hits BLAST potentially misses
– NCBI Blast
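Smith-Waterman is the dynamic programming local alignment method that the hardware accelerates. As a rough illustration only, here is a minimal Python sketch of the score computation with a linear gap penalty. The scoring values (`match=2`, `mismatch=-1`, `gap=-2`) are assumptions chosen for the example, not GeneMatcher2 defaults.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Uses a linear gap penalty and keeps only two rows of the
    dynamic programming matrix, since only the score is needed.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never go below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("agaggt", "ttagaggtcc"))
```

Every cell of the matrix is evaluated, so the work grows with the product of the two sequence lengths. That cost is why heuristic approximations and hardware acceleration are attractive.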
8 8
What about Blast?
• Blast is an approximation of Smith-Waterman
• So is FastA, but it's better and has protein fragment searches
• Approximations may not yield correct results in some situations:
– Data with many ambiguities or frameshifts, such as raw ESTs and unfinished genomic sequence
– Distantly related sequences
– When global alignments are desired
– Protein alignment of sequences with introns (not penalized on GeneMatcher)
9 9
Why GeneMatcher2
• Comparison of sensitivity and selectivity of various sequence search methods
• Sensitivity: What proportion of the real hits are reported? (More sensitive means more real hits)
• Selectivity: What proportion of the reported hits are real? (More selective means fewer false positives)
[Figure: comparison chart, annotated "Fewer false positives" and "More true positives".]
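The two definitions above can be written out directly. This is a small Python sketch, assuming hits are identified by sequence IDs and that the set of real hits is known; the function name and the example IDs are made up for illustration.

```python
def sensitivity_and_selectivity(reported, real):
    """Compute (sensitivity, selectivity) from reported and real hit IDs.

    sensitivity = real hits that were reported / all real hits
    selectivity = reported hits that are real  / all reported hits
    """
    reported, real = set(reported), set(real)
    true_positives = len(reported & real)
    sensitivity = true_positives / len(real) if real else 0.0
    selectivity = true_positives / len(reported) if reported else 0.0
    return sensitivity, selectivity


if __name__ == "__main__":
    real_hits = {"seqA", "seqB", "seqC", "seqD", "seqE"}
    reported_hits = {"seqA", "seqB", "seqC", "seqX"}
    print(sensitivity_and_selectivity(reported_hits, real_hits))
```

In this made-up example, three of the five real hits are found and one of the four reported hits is a false positive.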
10 10
GeneMatcher2 Performance
• Time-to-completion comparison of original methods and methods on GeneMatcher2
• TBLASTX improvement is 20-fold
• Other methods at least 100-fold improvement
Source: Genome Canada Bioinformatics Platform Project
[Figure: bar chart "Runtime for an average query". X-axis: Method (NCBI TBLASTX, Paracel TBLASTX, Decypher TBLASTX, WUSTL HMM cluster, Decypher HMM, FASTA Smith-Waterman, GeneMatcher2 SW, EBI GeneWise, Paracel GeneWise). Y-axis: Seconds, 0 to 1000. Bar labels as extracted: 376, 140.1, 161316, 270, 1000.]
11 11
Running a search
• Load a sequence (or set of sequences) as a query set if it will be used several times
• Select the appropriate search depending on the query type and database type (only suitable candidates will be displayed on the search forms)
• Check your form options!
• Watch the search queue (can raise priority of small jobs if machine is busy)
• Select a result format
12 12
Databases
• While you can load your own databases, disk space on the post-processor is not infinite! Ask us about maintaining public databases that are not currently available.
• If you upload a private database, special files need to be created to use translated database searches such as rframe.
• You can create private data sets to search against (e.g. Unigene-mouse and Unigene-rat in a data set called Unigene-rodent). These don't take up any space.
13 13
Hidden Markov Models
• Positive examples:
– Seq 1: THE LAST FAT CAT
– Seq 2: THE FAST CAT
– Seq 3: THE VERY FAST CAT
– Seq 4: THE FAT CAT
• Multiple sequence alignment (Clustalw or T-coffee) of the positive examples:
THE LAST FA T CAT
THE      FAST CAT
THE VERY FAST CAT
THE      FA T CAT
• HMM Build turns the alignment into a position-specific Hidden Markov Model for GeneMatcher2:
– THE, then LAST or VERY or gap (letter by letter), then FA, then S or gap, then T, then CAT
• Query THE LAST FAST CAT: +++ ++++ ++++ +++ (all matches)
• Query THE VAST FAST CAT: "V" from VERY, "AST" from LAST
• Query THE VAST VERY FAST CAT: only nothing, "LAST" or "VERY" in that position
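The cat example can be mimicked with a simplified position-specific column model. This is only an illustrative Python sketch, not a full probabilistic HMM and not the HMM Build tool: each alignment column becomes the set of characters seen there, and columns containing a gap ("-") may be skipped. The gapped rows and the names `build_columns` and `matches` are assumptions made for the example.

```python
# Gapped alignment of the four positive examples ("-" marks a gap).
ALIGNMENT = [
    "THE LAST FA-T CAT",
    "THE -----FAST CAT",
    "THE VERY FAST CAT",
    "THE -----FA-T CAT",
]


def build_columns(rows):
    """Collect the set of characters observed in each alignment column."""
    return [set(column) for column in zip(*rows)]


def matches(columns, query):
    """Check whether the query can be read through the columns in order.

    A column that contains "-" is optional; otherwise it must consume
    one query character that was observed in that column.
    """
    reachable = {0}  # query positions reachable after the columns seen so far
    for allowed in columns:
        following = set()
        for pos in reachable:
            if "-" in allowed:
                following.add(pos)  # skip this optional column
            if pos < len(query) and query[pos] != "-" and query[pos] in allowed:
                following.add(pos + 1)  # consume one query character
        reachable = following
    return len(query) in reachable


if __name__ == "__main__":
    model = build_columns(ALIGNMENT)
    for query in ("THE LAST FAST CAT", "THE VAST FAST CAT", "THE VAST VERY FAST CAT"):
        print(query, matches(model, query))
```

Because the model is position specific, "VAST" is accepted letter by letter from columns that saw both LAST and VERY. "VAST VERY" is rejected because only one word slot exists at that position.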