Sys-Bio Talk, 24th Feb 2005
Towards Grid-Based System Biology
Dr Richard Sinnott
Technical Director, National e-Science Centre
Deputy Director (Technical), Bioinformatics Research Centre
University of Glasgow
24th February 2005
Grids? E-Science? E-Research?
sensor nets
Shared data archives
computers
software
colleagues
instruments
Grid
methodologies transforming science, engineering, medicine and business
driven by exponential growth in data, compute demands enabling a whole-system approach
Cambridge
Newcastle
Edinburgh
Oxford
Glasgow
Manchester
Cardiff
Southampton
London
Belfast
Daresbury Lab
RAL
Hinxton
NeSC in the UK
NeSC
Core National Grid Service
White Rose Grid
HPC(x)
CSAR
Previous work on UK e-Science Grid based on GT2
Demonstrated broad set of applications across it
– Monte Carlo simulations of ionic diffusion through radiation damaged crystal structures
– Integrated Earth system modelling
– BLAST on the Grid
– Grid Integration Test Script Suite
– …
Transition to WSRF/OGSA under discussion
Two UK OGSA Test Grid projects started in January
– UCL, Imperial College, Universities of Edinburgh and Newcastle
– Universities of Portsmouth, Reading, Manchester, Westminster and CCLRC
There are still issues to be resolved
OGSA definition and delivery
– Standards: OGSI, WSRF, …
– …and Technologies: GT3, GT4, …
Hosting environments & Platforms
Combinations of services supported
Material and grids to support adopters
Challenges/ Opportunities
?
The next Grid software
Life Sciences
Extensive Research Community
– >1000 per research university
Extensive Applications
– Many people care about them
– Health, Food, Environment
Interacts with virtually every discipline
– Physics, Chemistry, Maths/Stats, Nano-engineering, …
450+ databases relevant to bioinformatics (and growing!)
Heterogeneity, Interdependence, Complexity, Change, …
Systems Biology?
Nucleotide sequences
Nucleotide structures
Gene expressions
Protein structures
Protein functions
Protein-protein interaction (pathways)
Cell
Cell signalling
Tissues
Organs
Physiology
Organisms
Populations
+ links to plant/crops, environmental, health, … information sources
More genomes …
Arabidopsis thaliana
mouse
rat
Caenorhabditis elegans
Drosophila melanogaster
Mycobacterium leprae
Man
Plasmodium falciparum
Mycobacterium tuberculosis
Neisseria meningitidis Z2491
Helicobacter pylori
Xylella fastidiosa
Borrelia burgdorferi
Rickettsia prowazekii
Bacillus subtilis
Archaeoglobus fulgidus
Campylobacter jejuni
Aquifex aeolicus
Thermotoga maritima
Chlamydia pneumoniae
Pseudomonas aeruginosa
Ureaplasma urealyticum
Buchnera sp. APS
Escherichia coli
Saccharomyces cerevisiae
Yersinia pestis
Salmonella enterica
Thermoplasma acidophilum
Distributed and Heterogeneous data
LPSYVDWRSA GAVVDIKSQG ECGGCWAFSA IATVEGINKI TSGSLISLSE QELIDCGRTQ NTRGCDGGYI TDGFQFIIND GGINTEENYP YTAQDGDCDV
Sequence Structure Function
Gene expression Morphology
Database Growth
PDB Content Growth
• DBs growing exponentially!!!
• Bibliographic (MedLine, …)
• Amino Acid Seq (SWISS-PROT, …)
• 3D Molecular Structure (PDB, …)
• Nucleotide Seq (GenBank, EMBL, …)
• Biochemical Pathways (KEGG, WIT, …)
• Molecular Classifications (SCOP, CATH, …)
• Motif Libraries (PROSITE, Blocks, …)
Is Grid the Answer?
Some key problems to be addressed
Tools that simplify access to and usage of data
– Internet hopping is not ideal!
Tools that simplify access to and usage of large scale HPC facilities
qsub [-a date_time] [-A account_string] [-c interval] [-C directive_prefix] [-e path] [-h] [-I] [-j join] [-k keep] [-l resource_list] [-m mail_options] [-M user_list] [-N name] [-o path] [-p priority] [-q destination] [-r c] [-S path_list] [-u user_list] [-v variable_list] [-V] [-W additional_attributes] [-z] [script]
Tools designed to aid understanding of complex data sets and relationships between them
e.g. through visualisation
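As one illustration of what "simplifying access" to an HPC facility could mean, the sketch below hides a few of the qsub flags from the synopsis above behind named options. It is a minimal sketch, not an existing tool: the function `build_qsub_command`, the option names, the script name `blast_job.sh`, the queue name `short` and the resource value are all assumptions made for illustration. It only builds the command line and does not submit anything.

```python
import shlex

# Hypothetical mapping from friendlier option names to flags that
# appear in the qsub synopsis on the slide.
FLAG_FOR = {
    "name": "-N",       # job name
    "queue": "-q",      # destination queue
    "resources": "-l",  # resource list
    "stdout": "-o",     # path for standard output
    "stderr": "-e",     # path for standard error
}


def build_qsub_command(script, **options):
    """Return the qsub argument list for a job script; nothing is executed."""
    argv = ["qsub"]
    for key, value in options.items():
        if key not in FLAG_FOR:
            raise ValueError(f"unsupported option: {key}")
        argv += [FLAG_FOR[key], str(value)]
    argv.append(script)
    return argv


# Illustrative values only; real queue names and resource strings are site-specific.
cmd = build_qsub_command(
    "blast_job.sh", name="blast", queue="short", resources="walltime=01:00:00"
)
print(shlex.join(cmd))
```

A portal or Grid service would go further (staging data, choosing a resource, tracking the job), but the principle is the same: the user states what to run, not how the scheduler wants to be told.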
Access to and Usage of Data
Grid technology should hide heterogeneity, provide location transparency, address security concerns, …
Data Access and Integration Specification (DAIS) being defined by GGF
OGSA-DAI and DAIT projects play a key role in shaping these standards
Other commercial solutions
IBM Information Integrator, …
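The idea of hiding heterogeneity and location can be sketched in a few lines. This is a toy illustration only and does not reflect the OGSA-DAI, DAIS or Information Integrator interfaces: the class names, the `find` method, the site names and the gene names (`geneA`, `geneB`) are all hypothetical. Two sources with different native layouts are wrapped so that a caller sees one record format and never needs to know where a record came from.

```python
import csv
import io


class CsvSource:
    """Wraps a comma-separated resource with columns: symbol, chromosome."""

    def __init__(self, text):
        self._rows = list(csv.DictReader(io.StringIO(text)))

    def find(self, symbol):
        return [
            {"gene": row["symbol"], "chromosome": row["chromosome"]}
            for row in self._rows
            if row["symbol"] == symbol
        ]


class KeyValueSource:
    """Wraps a resource keyed by gene name, with a different field layout."""

    def __init__(self, table):
        self._table = table

    def find(self, symbol):
        entry = self._table.get(symbol)
        if entry is None:
            return []
        return [{"gene": symbol, "chromosome": entry["chr"]}]


class Federation:
    """Single entry point: queries every source and returns a common format."""

    def __init__(self, sources):
        self._sources = sources  # maps a source name to a wrapped source

    def find(self, symbol):
        results = []
        for name, source in self._sources.items():
            for record in source.find(symbol):
                results.append({**record, "source": name})
        return results


# Made-up sources and records, for illustration only.
federation = Federation(
    {
        "site1": CsvSource("symbol,chromosome\ngeneA,1\ngeneB,2\n"),
        "site2": KeyValueSource({"geneA": {"chr": "1"}}),
    }
)
print(federation.find("geneA"))
```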
Access to and Usage of HPC facilities
Consider whole genome-genome (2*3*10^9 bp) comparisons between two species
Current strategy essentially chops up one genome, fires searches for those fragments in the other, then re-assembles the results
– messy approximate matching; re-assembly difficult; important correlations can be lost
– to make this tractable, so-called junk DNA is ignored
– chopping may introduce artefacts or hide phenomena
Better to put both full genomes in memory and perform a useful complete comparison
– Only possible with very high-end machines (available via grids)
– Should not have to be a script writer/Linux sys-admin to use these facilities
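A toy example shows how chopping can hide a match. This is a deliberately simplified sketch: it uses exact substring matching and non-overlapping fragments, whereas real search tools use approximate matching, and the two 16-letter "genomes" are invented. The shared region straddles a fragment boundary, so the fragment search finds nothing, while a complete comparison (here a quadratic longest-common-substring scan, which is exactly what becomes expensive at 3*10^9 bp) recovers it.

```python
def chop(genome, size):
    """Split a genome into non-overlapping fragments of at most `size` letters."""
    return [(start, genome[start:start + size]) for start in range(0, len(genome), size)]


def fragment_hits(query, target, size):
    """Chop `query` and report exact hits of each fragment in `target`."""
    hits = []
    for start, fragment in chop(query, size):
        position = target.find(fragment)
        if position != -1:
            hits.append((start, position, len(fragment)))
    return hits


def longest_common_substring(a, b):
    """Complete comparison: dynamic programming over every pair of positions."""
    best_length, best_end = 0, 0
    previous = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                current[j] = previous[j - 1] + 1
                if current[j] > best_length:
                    best_length, best_end = current[j], i
        previous = current
    return a[best_end - best_length:best_end]


# Invented sequences: the shared region ACGTACGG sits at offset 4 in each,
# so it crosses the boundary between the two 8-letter fragments of genome_a.
genome_a = "TTTTACGTACGGCCCC"
genome_b = "GGGAACGTACGGATAT"

print(fragment_hits(genome_a, genome_b, 8))
print(longest_common_substring(genome_a, genome_b))
```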
Cognitive aspects of Data
Life science data can be “ugly”
– Raw data sets messy
– Requires significant effort to understand
– Schemas/data models evolving…
Tools needed to
– Simplify understanding
– Improve analysis
– Navigate through potentially huge data sets
e.g. to find genes of interest in chromosomes of different species
Nucleotide sequences
Nucleotide structures
Gene expressions
Protein structures
Protein functions
Protein-protein interaction (pathways)
Cell
Cell signalling
Tissues
Organs
Physiology
Organisms
Populations
BRIDGES
SBRN
VOTES
DyVOSE
GHI
JDSS
Overview of BRIDGES
Biomedical Research Informatics Delivered by Grid Enabled Services (BRIDGES)
NeSC (Edinburgh and Glasgow) and IBM
Started October 2003
Supporting project for CFG project
– Generating data on hypertension
– Rat, Mouse, Human genome databases
Variety of tools used
– BLAST, BLAT, Gene Prediction, visualisation, …
Variety of data sources and formats
– Microarray data, genome DBs, project partner research data, …
Aim is integrated infrastructure supporting
– Data federation
– Security
Bridges Project
[Diagram: CFG Virtual Organisation. Sites: Glasgow, Edinburgh, Leicester, Oxford, London, Netherlands, with private data at the sites. Publicly curated data: Ensembl, MGI, HUGO, OMIM, SWISS-PROT, RGD, … Data hub: Information Integrator, OGSA-DAI. Services: Synteny Service, MagnaVista Service, blast, VO Authorisation.]
JDSS Project
Public data resources openness
– Often cannot query directly
– Often not easy/possible to find schemas
Joint Data Standards Study investigating this
Started on 1st June and involves
– Digital Archiving Consultancy
– Bioinformatics Research Centre (Glasgow)
– NeSC (Edinburgh and Glasgow)
Look at technical, political, social, ethical etc. issues involved in accessing and using public life science resources
– Interview relevant scientists, data curators/providers
8 month project with final report due imminently
– Funded by MRC, BBSRC, Wellcome Trust, JISC, NERC, DTI
Dynamic Virtual Organisations for e-Science Education (DyVOSE) project
Two year project started 1st May 2004, funded by JISC
Exploring advanced authorisation infrastructures for security
– … in Grid Computing Module as part of advanced MSc at Glasgow
– Provide insight into rolling Grid out to the masses!
[Diagram: DyVOSE Project. Labels: ScotGrid, GU Condor pool, other (known!) Grid resources, Education VO policies, PERMIS based authorisation, authorisation checks, authorisation decisions.]
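The split between VO policies, authorisation checks and authorisation decisions can be illustrated with a very small role-based sketch. It is not the PERMIS API and makes no claim about how DyVOSE is implemented: the user names, roles, actions and policy table below are hypothetical, and the resource names are merely borrowed from the slide for flavour.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    user: str
    action: str
    resource: str


# Hypothetical education VO policy: which role a user holds, and which
# (action, resource) pairs each role is permitted.
ROLE_OF = {"alice": "lecturer", "bob": "student"}
POLICY = {
    "lecturer": {("submit", "condor_pool"), ("submit", "scotgrid")},
    "student": {("submit", "condor_pool")},
}


def is_authorised(request, role_of=ROLE_OF, policy=POLICY):
    """Authorisation decision: deny unless the user's role permits the request."""
    role = role_of.get(request.user)
    if role is None:
        return False  # unknown users are denied by default
    return (request.action, request.resource) in policy.get(role, set())


print(is_authorised(Request("alice", "submit", "scotgrid")))
print(is_authorised(Request("bob", "submit", "scotgrid")))
```

Keeping the policy as data, separate from the code that checks it, is what lets a VO change who may do what without touching the resources themselves.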
Scottish Bioinformatics Research Network
Four year proposal expected to start imminently
Funded (£2.4M) by Scottish Enterprise, Scottish Higher Education Funding Council, Scottish Executive Environment and Rural Affairs Department
Involves Glasgow, Dundee, Edinburgh, Scottish Bioinformatics Forum
Aim to provide bioinformatics infrastructure for Scottish health, agriculture and industry
Infrastructure support at Dundee, Edinburgh and Glasgow to support first-rate research in bioinformatics at each academic institute
Infrastructure support at three institutes, to support inter-institutional sharing of compute and data resources through application of Grid computing
Outreach and training activities mediated by the Scottish Bioinformatics Forum
VOTES
Virtual Organisations for Trials and Epidemiological Studies
3 year MRC (£2.8M) funded project expected to start imminently
Plans to develop Grid infrastructure to address key components of clinical trial/observational study
– Recruitment of potentially eligible participants
– Data collection during the study
– Study administration and coordination
– Involves Glasgow, Oxford, Leicester, Nottingham, Manchester
[Diagram: Clinical Virtual Organisation Framework, used to realise CVO-1 (e.g. for data collection) and CVO-2 (e.g. for recruitment). Other labels: Transfer Grid; sites Lei, Nott, GLA, OX, IMP; GPs, disease registries, hospital databases, clinical trial data sets.]
Genetics and Healthcare Initiative
Five (2+3) year proposal (£4.4M) expected to start imminently
Funded by Health Department and Department for Enterprise and Lifelong Learning
Involves Glasgow, Dundee, Edinburgh, Aberdeen
– focus of genetics as applied to healthcare
– first two years emphasis on providing a platform for research into the genetic basis of common complex diseases in Scotland
» Mental health, cardiovascular, …
» Plan to establish 15,000 family-based intensively-phenotyped cohort recruited from the East and West of Scotland
– basis for neutralising heritable (genetic) risk factors in disease surveillance, treatment optimisation, avoidance of adverse drug events and prediction of response to therapy, health care planning and drug discovery, …
Systems Biology?
Once we have (securely) connected all relevant data sets, simplified access to and usage of HPC resources, and wrapped your favourite bioinformatics applications as Grid services...
what questions would you like to ask?
– How does a cell work?
– Why do people who eat less tend to live longer?
– How many people across Scotland who had a heart attack in the last 5 years took drug X, and of those that did, were genes A or B influenced by this drug?
– Who has performed an experiment similar to mine, and were their results similar?
– …
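To make the heart-attack question concrete, here is how it might look once the clinical and gene data sit behind one query interface. This is a sketch under strong assumptions: the two tables, their columns and every row are invented for illustration, and an in-memory SQLite database stands in for what would really be separate, access-controlled resources joined through Grid data services.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database holding invented clinical and gene data."""
    db = sqlite3.connect(":memory:")
    db.executescript(
        """
        CREATE TABLE patients (
            id INTEGER PRIMARY KEY,
            heart_attack_year INTEGER,
            took_drug_x INTEGER
        );
        CREATE TABLE expression (
            patient_id INTEGER,
            gene TEXT,
            changed INTEGER
        );
        """
    )
    db.executemany(
        "INSERT INTO patients VALUES (?, ?, ?)",
        [(1, 2003, 1), (2, 2004, 1), (3, 1998, 1), (4, 2002, 0)],
    )
    db.executemany(
        "INSERT INTO expression VALUES (?, ?, ?)",
        [(1, "A", 1), (1, "B", 0), (2, "A", 0), (2, "B", 0), (3, "A", 1)],
    )
    return db


def count_treated(db, since_year):
    """How many patients had a heart attack since `since_year` and took drug X?"""
    sql = (
        "SELECT COUNT(*) FROM patients "
        "WHERE heart_attack_year >= ? AND took_drug_x = 1"
    )
    return db.execute(sql, (since_year,)).fetchone()[0]


def count_treated_with_gene_change(db, since_year, genes):
    """Of those patients, how many show a change in any of the given genes?"""
    placeholders = ",".join("?" for _ in genes)
    sql = (
        "SELECT COUNT(DISTINCT p.id) FROM patients p "
        "JOIN expression e ON e.patient_id = p.id "
        "WHERE p.heart_attack_year >= ? AND p.took_drug_x = 1 "
        f"AND e.changed = 1 AND e.gene IN ({placeholders})"
    )
    return db.execute(sql, (since_year, *genes)).fetchone()[0]


db = build_demo_db()
print(count_treated(db, 2000))
print(count_treated_with_gene_change(db, 2000, ["A", "B"]))
```

The query itself is short; the hard part, and the point of the projects described earlier, is getting the data sets securely connected so that such a join is possible at all.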
www.nesc.ac.uk
Bridges Portal
MagnaVista
QTL upload
QTL browsing
Grid Blast Client
• Allows ‘genome scale’ blasting
• Uses ScotGrid and idle compute resources of training lab Condor pool