October 26, 2001
Supercomputers for BioInformatics and The Grid
Raj Godhia
Consultant, Cray Inc.
c/o Mega Computing (S) Pte Ltd
Cray-NCI Announcement
CRAY INC. AND NATIONAL CANCER INSTITUTE COLLABORATE ON MORE-POWERFUL BIOINFORMATICS RESEARCH TOOLS
Goal is to Exploit Unique Supercomputer Technologies to Identify and Analyze Genes Involved in Cancer and Other Diseases; Demonstration Project Produces Full STR Mapping of Genome
SEATTLE--(BUSINESS WIRE)--July 9, 2001--Cray Inc. (Nasdaq:CRAY) today announced it is collaborating with the National Cancer Institute (NCI) to develop bioinformatics research tools substantially more powerful than those available today. Bioinformatics is a high-potential market that involves applying computer technology to biology and medicine.
By exploiting several unique, ultra-fast technologies originally designed into Cray supercomputers for classified government use, the NCI and Cray are working to create genome analysis software capable of identifying and analyzing genes involved in cancer and other diseases.
In an initial demonstration project, scientists at the NCI's Advanced Biomedical Computing Center in Frederick, Md., produced a comprehensive map of short tandem repeat sequences (STRs) -- often used as gene markers -- for the entire human genome. Using the Cray SV1(TM) supercomputer located at the NCI, computations that previously took hours are being completed in seconds. This will enable biologists to do full-scale analyses that previously were impractical, Cray officials said.
"In preliminary testing, the unique technologies available on Cray vector supercomputers have provided enormous speed-ups for full-scale analysis of some common types of bioinformatics problems," said Bill Long, Cray's chief collaborator for the NCI work. "Assuming this validation continues, we believe there is a potential to make full-scale, exhaustive analysis of many bioinformatics problems feasible for the first time." Although exhaustive analysis typically produces results that are more complete and reliable than methods based on statistical sampling, he said, to date exhaustive analysis has been too slow and expensive to use routinely.
Short tandem repeats, also known as microsatellites, are repetitive sequences of DNA that scientists have exploited for several years as tools to map new genes, study the structure of chromosomes, and compare the DNA of different species, all of which are major areas of interest in biology and medical research.
Other bioinformatics software tools under development in the NCI-Cray collaboration include: non-tandem repeats, EST cluster assembly, CG island detection, genome assembly from BAC clones, SNP (single nucleotide polymorphism) analysis, and the extension to protein sequences for proteomic applications.
"We are excited about the initial results of our collaboration with the NCI and optimistic about the larger potential for applying our unique technologies in the field of bioinformatics," said Jim Rottsolk, Cray Inc. chairman and CEO.
Cray SV1 supercomputer systems start at under $1 million (U.S. list), are air cooled and fit easily into office environments.
About NCI's Advanced Biomedical Computing Center
The NCI's Advanced Biomedical Computing Center (Frederick, Md.) serves 1,800 biological researchers worldwide. Using a Cray supercomputer, ABCC played a critical role in solving the 3-D structure of HIV-1 protease, an enzyme that HIV utilizes to infect human immune cells. With the 3-D structure clarified, scientists were able to design highly effective protease inhibitors that are now the mainstay of AIDS therapy. For this work, ABCC was named a finalist for the prestigious Computerworld Smithsonian science award in 2000.
National Cancer Institute – Cray Collaboration
• Use the special hardware features of the Cray SV1 cluster to address genomic and proteomic issues.
• Integrate genomics, post-genomic, and proteomic methods to provide insights into the mechanism of cancer.
• NCI making results such as STR Database available via the web.
NCI’s Advanced Biomedical Computing Center
[Diagram: ABCC computing environment]
• Parallel Vector Environment (GigaRing): Cray J90SE 16PE 1GW, Cray SV1 96PE 12GW, Cray J90 8PE 256MW
• Servers: Origin 2000 64PE 32Gb, SGI Servers, Compaq 8400, IBM SP2
• Storagetek Tape Silo
• Workstations and File Servers
What Is an STR, and Why Do I Care?
• STR (Short Tandem Repeat)
– String of ‘n’ letters (nucleotides) repeated ‘m’ times (‘m’ usually >6): ATATATATATATAT
• Why STRs are important
– They can be associated with gene locations, diseases, and other important biology
– They can affect the accuracy of algorithms used to assemble the genome
– They are used for forensic identification
– …
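To make the definition concrete, here is a minimal Python sketch of a tandem repeat scan. It is an illustration only and not the NCI-Cray kernel: the function name `find_strs`, its parameters, and the defaults (unit length 2 to 8, at least 7 repeats, reading "usually >6" as a threshold) are assumptions made for this example.

```python
def find_strs(seq, min_unit=2, max_unit=8, min_repeats=7):
    """Return (start, unit, repeat_count) for each tandem run found.

    Naive scan (hypothetical helper): at each position, try every unit
    length n and count how many times seq[i:i+n] repeats back to back.
    Units are not reduced to a primitive form, so a run of a single
    letter would be reported with a two-letter unit.
    """
    seq = seq.upper()
    hits = []
    i = 0
    while i < len(seq):
        best = None  # (unit, repeat_count) covering the most bases
        for n in range(min_unit, max_unit + 1):
            unit = seq[i:i + n]
            if len(unit) < n:
                break  # ran off the end of the sequence
            m = 1
            while seq[i + m * n:i + (m + 1) * n] == unit:
                m += 1
            if m >= min_repeats and (best is None or m * n > best[1] * len(best[0])):
                best = (unit, m)
        if best:
            hits.append((i, best[0], best[1]))
            i += len(best[0]) * best[1]  # skip past the reported run
        else:
            i += 1
    return hits


if __name__ == "__main__":
    # The slide's example run, ATATATATATATAT, embedded in flanking bases.
    print(find_strs("GGC" + "ATATATATATATAT" + "CC"))
```

A scan like this does work proportional to the sequence length times the number of unit lengths tried, which hints at why an exhaustive search over more than 3 billion base pairs is demanding.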
Human Genome
• > 3 Billion Base Pairs of Nucleotides
• All Short Tandem Repeats (2-8) found in <10 minutes on Cray SV1 – 1 CPU; 150 sec on 15 CPUs of SV1e.
• NCI believes such methodologies show great promise for genome analysis and proteomics
Unique Cray Features
• Several capabilities, not just one
– Unique, hard-to-replicate combination of hardware features
– Benefits from applying multiple processors (CPUs)
• Originally created for intelligence community
– ~100x faster than anything else for classified problems
– Key bioinformatics problems look like classified problems
• Bioinformatics ‘connection’ was serendipitous
– One clever individual
• Resident in Cray SV1, MTA-2, SV2
– Experience to date is with SV1 series
Cray SV1™ Supercomputer
SV1 Kernel Performance
• Nucleotide encoding: 600M characters/sec.
• Difference counting: 200M starting points/sec.
– For a 32 nucleotide sequence, this would be 6.4G nucleotides/second
• Reverse complement: 4G nucleotides/sec.
– For example, the complete human genome can be reverse complemented in about 1 second
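The kernels above can be sketched in a few lines of Python to show what is being measured. This is a hedged illustration, not Cray's implementation: the 2-bit mapping in `CODE`, the function names, and the XOR-and-popcount approach to difference counting are all assumptions. At 2 bits per base, a 32-nucleotide sequence would fit in one 64-bit word, which may be why the slide uses 32 as its example; that reading is ours and is not stated in the source.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit mapping


def encode(seq):
    """Pack a nucleotide string into one integer, 2 bits per base."""
    value = 0
    for base in seq.upper():
        value = (value << 2) | CODE[base]
    return value


def difference_counts(text, pattern):
    """Count mismatching bases between pattern and text at every start."""
    k = len(pattern)
    if k == 0 or len(text) < k:
        return []
    target = encode(pattern)
    full = (1 << (2 * k)) - 1      # keeps only the last k bases
    low_bits = int("01" * k, 2)    # one marker bit per 2-bit base
    counts = []
    window = 0
    for i, base in enumerate(text.upper()):
        window = ((window << 2) | CODE[base]) & full  # slide by one base
        if i >= k - 1:
            diff = window ^ target
            # Fold each 2-bit difference into its low bit, then popcount.
            counts.append(bin((diff | (diff >> 1)) & low_bits).count("1"))
    return counts


# Arithmetic behind the slide's figures (rates taken from the slide).
compared_per_sec = 200e6 * 32      # starting points/sec * 32 nucleotides
genome_rc_seconds = 3e9 / 4e9      # ~3G base pairs at 4G nucleotides/sec

if __name__ == "__main__":
    print(encode("ACGT"))
    print(difference_counts("ACGTACGA", "ACGT"))
    print(compared_per_sec, genome_rc_seconds)
```

The two constants at the end reproduce the slide's arithmetic: 200M starting points per second times 32 nucleotides gives 6.4G nucleotide comparisons per second, and roughly 3 billion base pairs at 4G nucleotides per second takes under a second.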
Performance Comparisons
[Chart: Millions of Characters/Second (1 processor). Alpha: 69; Cray SV1: 9000]
[Chart: Solution Time (seconds), labeled "Benson's methodology". SGI O2K: 1200; Cray SV1: 10]
Source: NCI
Kernel Status & Plans
• Available:
– Nucleotide encoding
– Reverse complement (turn ACCTG into CAGGT)
– Difference count
– Tandem repeat search
• In progress:
– Amino acid encoding & comparison scoring
– Nucleotide sorting (for non-tandem repeats)
– Higher level drivers
– …
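Two of the kernels named above are easy to illustrate in Python. The sketch below shows reverse complement with the slide's own ACCTG example, plus one plausible reading of "nucleotide sorting (for non-tandem repeats)": sorting fixed-length substrings so that repeated ones become adjacent. The source does not describe how the Cray kernels work, so the function names and the sorting approach are assumptions.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement each base, then reverse the strand."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def repeated_kmers(seq, k):
    """Map each k-mer that occurs more than once to its start positions.

    Sorting the start positions by their k-mer places equal k-mers next
    to each other, so repeats anywhere in the sequence (not only tandem
    ones) show up as adjacent equal entries.
    """
    seq = seq.upper()
    starts = sorted(range(len(seq) - k + 1), key=lambda s: seq[s:s + k])
    groups = {}
    for a, b in zip(starts, starts[1:]):
        if seq[a:a + k] == seq[b:b + k]:
            groups.setdefault(seq[a:a + k], {a}).add(b)
    return {kmer: sorted(positions) for kmer, positions in groups.items()}


if __name__ == "__main__":
    print(reverse_complement("ACCTG"))
    print(repeated_kmers("ACGTTTACGA", 3))
```

Sorting is a natural fit here because it turns a search over all pairs of positions into a comparison of neighbors in the sorted order.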
Supercomputing and The Grid
• Several organizations in Asia intend to implement GLOBUS on Cray SV1 systems and make them available to BioInformatics users
• Cray systems will play a major role on The Grid
– Supercomputer centers like SDSC have always provided service to remote users
• Some organizations are confronting implementation issues running “coupled” jobs on The Grid using distributed memory techniques
– Shared memory supercomputers may play an important role as “couplers” for Grid-based distributed applications
Thank you.
godhia@cray.com