SLRITools Project: Providing a Platform for Bioinformatics Research
Michel Dumontier
Bioinformatics Technology Conference, February 3 - 6, 2003
Michel Dumontier – SLRITools Project
SLRITools Outline
• Introduction
• Open-source
• Toolkit Foundation
• Toolkit Projects
• Future Prospects
Christopher W.V. Hogue Lab: An Engineering Approach Towards Cellular Simulation
[Diagram: layered architecture for cellular simulation]
• Whole Cell Visualization
• GRID Computing Layer
• Modular Cell Simulation Software Layer
• Data Access Layer
– Cell Geometry
– Molecules
– Interactions, Reactions
– Kinetics, PTMs
– Initial Conditions
• Data sources and resources
– microscopy
– NCBI/EBI/DDBJ, PDB
– SeqHound
– BIND
– Expression, Concentration, Localization/distributions
– Proteomics/Genomics
SLRITools
Purpose
• Make freely available our sequence and structure manipulation and analysis infrastructure and tool software to the greater benefit of the Bioinformatics community
SLRITools
Description
• Mainly C-based cross-platform toolkit for dealing with biological information, especially protein structure/function.
• Extends the freely available NCBI C/C++ Toolkits and forms the basis for a number of powerful applications
• GPL/LGPL/PAL licenses
• Currently hosted at http://sourceforge.net/projects/slritools
• Training tutorials http://bioinfo.mshri.on.ca/tkcourse/
• Canadian Bioinformatics Workshops http://bioinformatics.ca
SLRITools
Projects
• SLRI lib - common library that extends NCBI Toolkit
• SeqHound - Sequence and Structure Database Management System
• BIND - Biomolecular Interaction Network Database
• Text Indexer - ASN.1 indexer
• NBLAST – Cluster variant of BLAST for NxN comparisons
• Kangaroo – Regular expression search of DNA/protein/CDR
Hogue Lab - Source Code
[Diagram labels: BIND, SeqHound, MoBiDiCK, TraDES; database, NCBI C, NCBI C++, SLRI]
• 2.6M lines of source code, 160 person-years of work (http://ncbi.nlm.nih.gov/IEB/)
• 450,000 lines of source code, 22 person-years of work (http://sourceforge.net/projects/slritools)
• Industry standard: 65 lines/day
Going Open Source
• Subject to the Intellectual Property Policy of Mt. Sinai Hospital
• Does the software have the potential to improve patient care?
• Does the software have economic benefits that will fund new research and development?
• Patents, Licenses & Publications
Software Licenses
Stage 1) “Not Released”
– “No license”
– Internal use only
– Protects commercial interest of MSH
• distributedfolding
Stage 2) “Free to Academics”
– Executables provided free, source upon request
– Publication
– Companies must license from MSH
• MCODE, TRADES, SSSF
Stage 3) “Public Use License”
– GNU General Public License
• SeqHound, BIND Data Manager, BIND specification
– Perl Artistic License/GNU Lesser General Public License
• SeqHound Remote Interfaces for BioPERL/C, C++ API
[Diagram labels: MSH Board subcommittee on commercialization; SLRI Industrial Liaison; Tech Transfer Office; Patent IP]
SLRITools Foundation
• National Center for Biotechnology Information (NCBI)
• NCBI Toolbox - Information Engineering Branch
– http://www.ncbi.nlm.nih.gov/IEB/
– GenBank, Entrez, BLAST, Sequin, OMIM, RefSeq
1. Data Model – An explicit, complete data model of biological sequences, structures, bibliographic data, and associated annotations
2. Data Encoding – A formal specification and encoding rules. The telecommunications standard ASN.1 has been used for this. Recently the specification has also been mapped to a similar language, XML. Automatic code generators are provided.
SLRITools Foundation II
3. Programming Libraries
– Originally written in a portable dialect of C. Recently a new generation is being written in C++.
– Compiled and occasionally tested on more than 14 operating systems
• Linux, HPUX, MacOS 9/X, Irix, Solaris, Windows 3.1/95/NT/2000/XP, BeOS, QNX, alpha, BSD, AIX, parisc-Linux, Sony PlayStation2 Linux
• 16/32/64 bit hardware
– Open Source – Free License
– ftp://ftp.ncbi.nih.gov/toolbox/
SeqHound
• SeqHound is a sequence and structure database management system that inherits the NCBI data model and mirrors the NCBI core biological sequence and structure information
• Why did we develop SeqHound?
– Too many hits to NCBI server -> banned IP!
– Data transmission & network connection issues
– Generate more sophisticated API to access data currently only available within the NCBI
– Faster, local or remote access with a variety of programming languages
– Provide functionality necessary to retrieve specialized subsets of sequences, structures and structural domains.
SeqHound
[Diagram: SeqHound architecture]
• Contents: Nucleic Acids, Proteins, 3D Structures, Domains, PubMed Links, Taxonomy Identifiers, Coding Regions, Genome Sets, Redundancy, Neighbors, GO Annotation, LocusLink, Fielded Text Index, Medline XML/DB2
• Formats: GFF, FASTA, Clustal, PDB, XML, ASN.1
• Components: NCBI ftp site; database fill, database update; databases; SeqHound API (local API, remote API server, www server); Applications
• Built on: NCBI toolkit, bzip library, CodeBase
• http://seqhound.mshri.on.ca
• 150+ functions
• Updated daily
SeqHound Resources
• SeqHound is accessible via
– http://seqhound.mshri.on.ca
– Simple web interface (under development)
– C, C++, Java (new!), Perl remote API or an optimized local API. (->SOAP?)
• Timeline
– Redundant fail-over server mid-summer
– Concurrent with Bioperl release
• Freely available article published in BMC Bioinformatics 2002, 3:32
• http://www.biomedcentral.com/1471-2105/3/32/
BIND
Biomolecular Interaction Network Database
Motivation:
• Massive influx of biomolecular interaction data requires repository, standards and access
Goals:
• Provide a standard, comprehensive and integrated interaction resource to the scientific community
• Define protein function and mechanisms
• Recover and integrate biomolecular interaction knowledge (backfilling)
• Discover new knowledge through data mining
Result:
• Database to archive and exchange molecular assembly information
• Describes
– Interactions
– Complexes
– Pathways
• BIND has an extensive data model, GNU software tools and is based on the NCBI toolkit.
• Recently funded for a 3-year effort at 25M CDN
– CIHR (1M), OGI/Genome Canada (12.5M), Ontario R&D Challenge Fund (5.2M)
– IBM, MDS Proteomics and Foundry Networks
– Sun
http://bind.ca
BIND Data Policies
GenBank Policy
– BIND data is freely available for any purpose
Direct Submission
– Submitters cannot limit the intended use of submitted BIND data
– Submitters have the right to edit/alter their records over time
– Suggestions made by a third party will be forwarded by us to the submitters to seek approval for any changes or corrections
Availability
– ftp://ftp.bind.ca – ASN.1/XML data+specification
Molecular Complex Detection (MCODE)
• Assume densely connected regions of a heterogeneous interaction network represent molecular complexes
• MCODE finds densely connected regions of a graph
• Weight nodes by local density (scoring function)
• From highest weighted node, recursively add neighbours above threshold score to complex
• Evaluation (Yeast):
– 88/221 CellZome hand annotated complexes
– 64/208 MIPS complexes (166 predicted)
• 200 complexes predicted in 15,143 protein interactions from yeast
Published: BMC Bioinformatics 2003, 4:2
• http://www.biomedcentral.com/1471-2105/4/2
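The two algorithmic steps above (weight nodes by local density, then grow a complex outward from the highest-weighted node) can be sketched roughly as follows. This is a simplified illustration, not the published MCODE implementation: the scoring function here is plain neighbourhood edge density times degree, and the names `find_complexes`, `cutoff` and `min_size` are hypothetical.

```python
from itertools import combinations


def build_graph(edges):
    """Turn a list of (a, b) interactions into an undirected adjacency map."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def local_density(graph, node):
    """Edge density of the node's closed neighbourhood (node plus neighbours)."""
    members = graph[node] | {node}
    if len(members) < 2:
        return 0.0
    possible = len(members) * (len(members) - 1) / 2
    present = sum(1 for a, b in combinations(members, 2) if b in graph[a])
    return present / possible


def find_complexes(edges, cutoff=0.2, min_size=3):
    """Assumed simplification: seed from the heaviest unused node, then
    add neighbours whose weight is within `cutoff` of the seed weight."""
    graph = build_graph(edges)
    weight = {n: local_density(graph, n) * len(graph[n]) for n in graph}
    seen = set()
    complexes = []
    for seed in sorted(graph, key=lambda n: (-weight[n], n)):
        if seed in seen:
            continue
        threshold = weight[seed] * (1 - cutoff)
        members = {seed}
        stack = [seed]
        while stack:  # iterative form of the recursive neighbour expansion
            current = stack.pop()
            for nb in graph[current]:
                if nb not in seen and nb not in members and weight[nb] >= threshold:
                    members.add(nb)
                    stack.append(nb)
        seen |= members
        if len(members) >= min_size:
            complexes.append(members)
    return complexes


if __name__ == "__main__":
    # Toy network: a 4-protein clique with a short tail attached.
    toy = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"),
           ("B", "D"), ("C", "D"), ("D", "E"), ("E", "F")]
    print(find_complexes(toy))
```

On the toy network the clique is reported as one complex and the sparsely connected tail is left out, which is the behaviour the slide describes.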
[Figure: 9-core from ~15,000 yeast interactions. Labelled regions: Dense Fibrillar Center, Fibrillar Center, Granular Component]
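For readers unfamiliar with the term, a k-core is the subgraph left after repeatedly removing every node with fewer than k neighbours. A minimal sketch of that pruning follows; the function name `k_core` and the toy data are illustrative assumptions, not code from the SLRI toolkit.

```python
def k_core(edges, k):
    """Return the set of nodes in the k-core of an undirected graph."""
    graph = {}
    for a, b in edges:
        if a == b:
            continue  # ignore self-interactions in this sketch
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

    changed = True
    while changed:
        changed = False
        # Collect nodes that currently have too few neighbours, then drop them.
        for node in [n for n, nbrs in graph.items() if len(nbrs) < k]:
            for nb in graph.pop(node):
                if nb in graph:
                    graph[nb].discard(node)
            changed = True
    return set(graph)


if __name__ == "__main__":
    toy = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"),
           ("B", "D"), ("C", "D"), ("D", "E")]
    print(sorted(k_core(toy, 3)))
```

Removing one low-degree node can push its neighbours below k, which is why the pruning loops until nothing changes.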
FAST = “parallel” RPS BLAST
Used to spot domain similarities in a protein interaction cluster
Server-generated scalable FLASH graphics – zoomable, printable.
Follow up by zooming in on FASTA-formatted sequences to see domain superposition and links to SMART/PFAM
NBLAST
Description:
• NBLAST is a cluster computer variant of BLAST
• It performs the minimum number of sequence comparisons and stores sequence alignments and the list of similar sequences (neighbours) as binary ASN.1 (XML)
• NBLAST is written in C using the NCBI C Toolkit.
• Separate function and database layers
Accessibility: via SeqHound
• http://seqhound.mshri.on.ca Neighbours DB (codebase)
• ftp://ftp.mshri.on.ca/pub/nblast
Published: BMC Bioinformatics 2002, 3:13
• http://www.biomedcentral.com/1471-2105/3/13/
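One way to read "the minimum number of sequence comparisons" for an NxN job is that each unordered pair of sequences is compared once, giving N(N-1)/2 comparisons instead of N². The sketch below illustrates that idea and a simple round-robin split across cluster nodes. It is an assumption about the scheduling, not NBLAST's actual code, and `pair_jobs` is a hypothetical name.

```python
from itertools import combinations


def pair_jobs(sequence_ids, n_workers):
    """Split all unordered sequence pairs across workers, round-robin.

    Each pair (a, b) appears exactly once, so N sequences yield
    N * (N - 1) / 2 comparisons in total.
    """
    jobs = [[] for _ in range(n_workers)]
    for index, pair in enumerate(combinations(sequence_ids, 2)):
        jobs[index % n_workers].append(pair)
    return jobs


if __name__ == "__main__":
    ids = ["seq1", "seq2", "seq3", "seq4"]
    for worker, batch in enumerate(pair_jobs(ids, 3)):
        print(worker, batch)
```

With 4 sequences and 3 workers, each worker receives 2 of the 6 pairs.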
Ookpik: CFI/ORDCF funded.
216 P-III 450, 64 GB, 1.2 TB disk
NBLAST, RPS-BLAST, TRADES, MoBiDiCK
http://bioinfo.mshri.on.ca/yac/
http://sourceforge.net/projects/slritools/
Kangaroo
Description:
• Kangaroo is implemented to facilitate a wide range of queries with no restriction on the length or complexity of the query expression
• Uses regular expressions
• Search DNA, protein, or coding region
• Web-based form and results
• Links to SeqHound
Accessibility:
• http://bioinfo.mshri.on.ca/kangaroo currently supports searches on 10 organisms (including human, mouse)
Published: BMC Bioinformatics 2002, 3:20
• http://www.biomedcentral.com/1471-2105/3/20
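The kind of query Kangaroo answers can be illustrated with a small regular-expression scan over named sequences. This sketch uses Python's standard `re` module and is not Kangaroo's implementation. The function name `search_sequences`, the sample sequences and the example motif are assumptions for illustration.

```python
import re


def search_sequences(pattern, sequences):
    """Return (name, 1-based start, matched text) for every regex hit.

    Note: finditer reports non-overlapping matches only.
    """
    regex = re.compile(pattern)
    hits = []
    for name, sequence in sequences.items():
        for match in regex.finditer(sequence):
            hits.append((name, match.start() + 1, match.group()))
    return hits


if __name__ == "__main__":
    proteins = {"p1": "MKNGSAPNPSL", "p2": "AAAA"}
    # Example motif: N, any residue but P, S or T, any residue but P.
    print(search_sequences(r"N[^P][ST][^P]", proteins))
    # prints [('p1', 3, 'NGSA')]
```

The same function works on DNA strings, since the pattern is applied to whatever text is supplied.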
Summary
• Robust tools and services based on the NCBI data model
• Flexible licensing
Future Prospects
• BIND/SeqHound Web Services (SOAP)
• SeqHound
– Web Interface
– InterPro|COG
• Larger & more sophisticated BIND (JAVA)
• Grid Engine & Cell Simulation
Christopher W.V. Hogue Lab Projects/Graduate Students
• BIND
– Gary Bader
– Doron Betel
• SeqHound
– Katerina Michalickova
• Protein Folding/CASP Predictions
– Howard Feldman
• Species Specific Protein Scoring Functions
– Michel Dumontier
• Cell Simulation/Systems Biology
– Adrian Heilbut
– Ken Lau
• FPGA Hardware Database Search Engines
– Ruth Isserlin
BIND = “Blueprint Initiative”
• Database Curation
– Vicki Lay
– Susan Moore
– Brigitte Tuekam
– Cheryl Wolting
• Software Engineering
– Neil Bahroos
– Ian Donaldson
– Marc Dumontier
– Vladimir Grytsan
– Hao Lieu
– Greg Pintile
– John Salama
• Administration
– Eric Andrade
– Marianne Rukavina
– Sue Sroka
– Greg Van Volkenburg
• IT
– Greg Clark
– Edward Lee