http://www.itb.cnr.it/bioinfogrid
Grid Enabled High Throughput Virtual Screening Against Four Different Targets Implicated in Malaria
Presented by
Vinod Kasam
CLADE workshop, HPDC conference,
June 25, 2007, Monterey Bay
25-06-2007, Monterey Bay 2
Outline
• WISDOM introduction
• Biological targets
• Resources used in WISDOM
• Production environment
• Results
• Issues
• Conclusions
Introduction to the disease : malaria
• ~300 million people worldwide are affected
• 1-1.5 million people die every year
• Widely spread
• Caused by protozoan parasites of the genus Plasmodium
• Complex life cycle with multiple stages
WISDOM-II, second large scale docking deployment against malaria
Malaria target | Involved in | Biology partners
Tubulin from Plasmodium/plant/mammal | Parasite cell replication | CEA, Acamba project, France
DHFR from Plasmodium falciparum | Parasite DNA synthesis | U. of Modena, Italy
DHFR from Plasmodium vivax | Parasite DNA synthesis | U. of Los Andes, Venezuela; U. of Modena, Italy
GST from Plasmodium falciparum | Parasite detoxification | U. of Pretoria, South Africa
WISDOM: Wide In Silico Docking On Malaria
• Biological goal
– Proposition of new inhibitors for a family of proteins produced by Plasmodium
• Biomedical informatics goal
– Deployment of in silico virtual docking on the grid
• Grid goal
– Deployment of a CPU-consuming application generating large data flows to test the grid operation and services => “data challenge”
High Throughput Virtual Docking
• Millions of chemical compounds available
– Compounds: ZINC (4.3M), ChemBridge (500,000)
• Targets
– 3D structures in PDB
– One homology model
• High Throughput Screening: 1-10$/compound, several hours
• Molecular docking (FlexX, AutoDock): 20 cents/compound, 1 minute
– Data challenge on EGEE: ~3 months on ~2000 computers
• Hits => screening using assays performed on living cells => Leads => Clinical testing => Drug
Objective of the WISDOM development
• Objective
– Dock a whole compound database in a limited time with minimal human involvement during the data challenge (DC)
• Need an optimized environment
– Production in limited time
– Performance is important
• Need a fault-tolerant environment
– Stress usage of the grid during the DC
– Grid is heterogeneous and dynamic
– Data produced are important and can’t be easily reproduced
• Need an automatic production environment
– Grid APIs are not fully adapted for bulk use at a large scale
– Ease the execution
– User-friendly high-level services
Use of a production system
• Managing thousands of jobs and files by hand is labor-intensive
– Job preparation, submission and monitoring, output retrieval, failure identification and resolution, job resubmission…
– In order to use the resources efficiently
• The amount of transferred data impacts grid performance
– The data must be installed on the grid
– The database is split into subsets
• Grid processing introduces significant delays
– The submitted jobs must be sufficiently long in order to reduce the impact of this middleware overhead
• The production system will provide automated and fault-tolerant job and file management
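The job-sizing argument above can be illustrated with a minimal sketch. It assumes about one minute per docking (the figure quoted for molecular docking earlier in the talk); the function name `split_into_subsets` and the parameter `target_job_hours` are hypothetical and are not taken from the WISDOM production system.

```python
from typing import Iterator, List, Sequence


def split_into_subsets(compounds: Sequence[str],
                       target_job_hours: float,
                       minutes_per_docking: float = 1.0) -> Iterator[List[str]]:
    """Yield compound sub-lists sized so each job runs for about target_job_hours."""
    # Number of dockings that fit in one job of the requested length (at least 1).
    per_job = max(1, int(target_job_hours * 60 / minutes_per_docking))
    for start in range(0, len(compounds), per_job):
        yield list(compounds[start:start + per_job])


if __name__ == "__main__":
    # Placeholder identifiers, not real database entries.
    compounds = [f"compound_{i:05d}" for i in range(5000)]
    subsets = list(split_into_subsets(compounds, target_job_hours=20))
    print(len(subsets), [len(s) for s in subsets])
```

Longer jobs amortize the middleware overhead over more dockings, at the price of losing more work when a job aborts; the target duration is therefore a tuning parameter rather than a fixed value.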
Grid added value for international collaboration on neglected and emerging diseases
• Grids offer unprecedented opportunities for sharing information and resources worldwide
• Grids are unique tools for:
– Collecting and sharing information (Epidemiology, Genomics)
– Networking experts
– Mobilizing resources routinely or in emergency (vaccine & drug discovery)
Grid added value of EGEE for a large scale in silico experimentation
• Large computing and storage resources
• 24 hours a day availability of resources, user support
• Workload Management Service
• Information and Monitoring Services
• Data Management Services
• Security
• Reliability of services
Simplified grid workflow
• FlexX license server:
– 6000 floating licenses offered by BioSolveIT to SCAI
– Maximum number of concurrently used licenses was 5000
[Diagram: User interface, Resource Broker, and two sites (Site1, Site2), each with a Computing Element and a Storage Element. Labelled data: compounds database, compounds list, compounds sub lists, parameter settings, target structures, software, results, statistics.]
Schema of the current WISDOM production environment
[Diagram. Components: User Interface (WISDOM production system), WMS, CEs & WNs (FlexX job), SEs, FLEXlm license server (FlexX license), HealthGrid server / local server (web site, WISDOM DB, statistics). Arrow labels: submits the jobs, checks job status, resubmits; inputs (structure file, compounds file); outputs (output file); statistics; DMS/GFTP.]
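The submit, check status and resubmit cycle shown in the schema can be sketched as a small loop. This is only an illustration under stated assumptions: `submit` and `poll` are hypothetical stand-ins for the middleware calls, `poll` is assumed to return "done" or "aborted" once a job has finished, and jobs are handled one at a time for brevity, whereas a production system would track many jobs concurrently.

```python
from typing import Callable, Dict, Iterable


def run_with_resubmission(job_ids: Iterable[str],
                          submit: Callable[[str], None],
                          poll: Callable[[str], str],
                          max_attempts: int = 3) -> Dict[str, str]:
    """Submit each job, check its final status, and resubmit it if it aborted."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    final: Dict[str, str] = {}
    for job in job_ids:
        state = "aborted"
        for _ in range(max_attempts):
            submit(job)          # hypothetical middleware submission call
            state = poll(job)    # hypothetical status check
            if state == "done":
                break
        final[job] = state       # "aborted" here means all attempts failed
    return final


if __name__ == "__main__":
    attempts: Dict[str, int] = {}

    def fake_submit(job: str) -> None:
        attempts[job] = attempts.get(job, 0) + 1

    def fake_poll(job: str) -> str:
        # Toy behaviour: job "b" aborts on its first attempt only.
        return "aborted" if job == "b" and attempts[job] == 1 else "done"

    print(run_with_resubmission(["a", "b"], fake_submit, fake_poll))
```

Capping the number of attempts keeps a permanently failing job from blocking the rest of the production.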
Grid infrastructures and projects contributing to WISDOM-II
[Diagram. Legend categories: European grid infrastructure; European grid project; regional/national grid infrastructure. Contributors shown: EGEE, EELA, EUMedGrid, EUChinaGrid, Auvergrid, TWGrid, EMBRACE, BioinfoGrid, SHARE.]
Instances on different infrastructures
Instances deployed on the different infrastructures during the WISDOM-II data challenge
Deployment on different infrastructures
• Up to 5000 computers in more than 17 countries mobilized from October 2006 to January 2007 to provide CPU
• 1.738 TB of data produced
Distribution of jobs
EGEE Germany Switzerland | 1%
EGEE Asia Pacific | 2%
EGEE Russia | 2%
Auvergrid | 3%
EuChinaGrid | 3%
EELA | 3%
EGEE South Western Europe | 3%
EGEE Central Europe | 5%
EGEE Northern Europe | 6%
EGEE Italy | 7%
EGEE South Eastern Europe | 12%
EGEE France | 15%
EGEE UKI | 38%
Statistics of deployment
• First DC:
– 80 CPU years
– 1 TB
– 1700 CPUs used in parallel
– July 1st - August 15th 2005
• 2nd DC:
– 100 CPU years
– 800 GB
– 1700 CPUs used in parallel
– May 1st - April 15th 2006
• 3rd DC:
– 413 CPU years
– 1.7 TB
– Up to 5000 CPUs in parallel
– 1st October 2006 - 31 January 2007
Number of jobs | 77,504
Total number of completed dockings | 156,407,400
Estimated duration on 1 CPU | 413 years
Duration of the experiment | 76 days
Average throughput | 78,400 dockings/hour
Maximum number of loaded licences (concurrent running jobs) | 5,000
Number of used computing elements | 98
Average duration of a job | 41 hours
Average crunching factor | 1,986
Volume of output results | 1.738 TB
The crunching factor is the ratio of the total CPU time over the duration of the experiment. It represents the average number of CPUs used simultaneously all along the data challenge and is a metric of the parallelization gain.
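As a worked check of this definition, the short sketch below recomputes the crunching factor from the rounded figures of the third data challenge (413 CPU years over 76 days). The function name and the 365-day year are assumptions made for the illustration; with these rounded inputs the result is about 1,983, close to the reported 1,986, and the small gap is presumably due to rounding of the published inputs.

```python
def crunching_factor(total_cpu_years: float,
                     experiment_days: float,
                     days_per_year: float = 365.0) -> float:
    """Ratio of total CPU time to experiment duration, both expressed in days."""
    return total_cpu_years * days_per_year / experiment_days


if __name__ == "__main__":
    # Rounded figures from the statistics table: 413 CPU years, 76 days.
    print(round(crunching_factor(413, 76)))
```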
Biological results
Distribution of docking energies of the ZINC database against the GST A structure.
(The red column represents a score of -24 kJ/mol, the docking score of a co-crystallized ligand (GTX) of the GST A chain.)
[Histogram: x-axis, docking energy (-50 to 18); y-axis, number of compounds (0 to 350,000).]
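One way to read this histogram is to keep the compounds that score better than the co-crystallized ligand. The sketch below assumes that selection rule, which is not stated on the slide, and uses made-up identifiers and scores purely as placeholders; only the -24 kJ/mol reference value comes from the figure caption.

```python
from typing import Dict, List

# Docking score of the co-crystallized ligand (GTX), in kJ/mol, from the caption.
REFERENCE_SCORE = -24.0


def better_than_reference(scores: Dict[str, float],
                          reference: float = REFERENCE_SCORE) -> List[str]:
    """Return ids of compounds with a lower (more favourable) docking energy
    than the reference ligand, best score first."""
    hits = [cid for cid, score in scores.items() if score < reference]
    return sorted(hits, key=lambda cid: scores[cid])


if __name__ == "__main__":
    # Placeholder values, not WISDOM results.
    example = {"cmpd_a": -31.5, "cmpd_b": -12.0, "cmpd_c": -24.0, "cmpd_d": -40.2}
    print(better_than_reference(example))
```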
Issues
• Scheduling efficiency of the grid is still a major issue
• The resource broker is still the main bottleneck
• Naive blacklisting of failing resources is not possible: virtually all grid resources produced aborted jobs, and any blacklisting must also take scheduled site downtimes into account
• Data should be stored and processed in a relational database
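The blacklisting issue can be made concrete with a minimal sketch of a less naive rule: a site is skipped during its scheduled downtime, and otherwise only when its abort rate is high over a meaningful number of jobs. The `SiteRecord` structure, the function `is_usable` and both thresholds are hypothetical choices for the illustration, not part of the WISDOM environment.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Tuple


@dataclass
class SiteRecord:
    name: str
    completed: int = 0
    aborted: int = 0
    # Scheduled downtime windows as (start, end) pairs.
    downtimes: List[Tuple[datetime, datetime]] = field(default_factory=list)


def is_usable(site: SiteRecord, now: datetime,
              max_abort_rate: float = 0.5, min_jobs: int = 20) -> bool:
    """Decide whether jobs should currently be sent to this site."""
    # Never submit during a scheduled downtime.
    if any(start <= now < end for start, end in site.downtimes):
        return False
    total = site.completed + site.aborted
    # Too few jobs to judge: a handful of aborts must not ban the site.
    if total < min_jobs:
        return True
    return site.aborted / total <= max_abort_rate


if __name__ == "__main__":
    now = datetime(2006, 11, 1, 12, 0)
    flaky = SiteRecord("site-a", completed=5, aborted=30)
    healthy = SiteRecord("site-b", completed=90, aborted=10)
    print(is_usable(flaky, now), is_usable(healthy, now))
```

Because almost every site aborts some jobs, the decision is based on a rate over a minimum sample rather than on the first failure.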
Interactive Web Portal
• User Friendly Interface for biologists
• Real-time output of the results
– 3D views of the docking poses and structures
• Resubmission of docking jobs
Conclusion
• Take advantage of the EGEE services, APIs and resources.
• Demonstrated the relevance of computational grids in life science applications
• Manual intervention is reduced (automatic resubmission of jobs)
• Use of AMGA to store results and statistics immediately.
• Interoperable Web Service interface: WSDL following the WS-I profile
• Improved flexibility to deploy other bioinformatics applications.
The next steps
• To address the issue of resource brokers, we are trying to submit the jobs by bypassing resource brokers
• Docking step still requires a lot of manual intervention
– Task: improve output data collection and post-docking analysis
• The next step after docking is Molecular Dynamics
– Task: deploy Molecular Dynamics computations on grid infrastructures (successfully deployed already on one target, plasmepsin)
– Contribution from CNRS-IN2P3, within the framework of BioinfoGRID, Fraunhofer SCAI and University of Modena
• Beyond virtual screening, the long term vision: building a grid for malaria
– To provide services to research labs working on malaria
– To collect and analyze epidemiological data
Long term vision: a grid for malaria
Use the grid technology to foster research and development on malaria and other neglected diseases
• Univ. Los Andes: Biological targets, Malaria biology
• LPC Clermont-Ferrand: Biomedical grid
• SCAI Fraunhofer: Knowledge extraction, Chemoinformatics
• Univ. Modena: Biological targets, Molecular Dynamics
• ITB CNR: Bioinformatics, Molecular modelling
• Univ. Pretoria: Bioinformatics, Malaria biology
• Academia Sinica: Grid user interface
• HealthGrid: Biomedical grid, Dissemination
• CEA, Acamba project: Biological targets, Chemogenomics
Contacts also established with WHO, Microsoft, TATRC, Argonne, SDSC, SERONO, NOVARTIS, Sanofi-Aventis, and hospitals in sub-Saharan Africa
Acknowledgments
Academia Sinica
BioSolveIT
CNR-ITB
CNRS
CEA
Healthgrid
IN2P3
LPC
SCAI Fraunhofer
Università di Modena e Reggio Emilia
Université Blaise Pascal
University of Pretoria
University of Los Andes
Auvergrid
Accamba
BioInfoGRID
EGEE
EMBRACE
EUChinaGRID
EUMedGRID
SHARE
TWGrid
Conseil Regional d’Auvergne
European Union
wisdom.healthgrid.org