Biogrid – Bioinformatics for the grid
Joel Hedlund <[email protected]>
Biogrid User and Developer
Linköping University, Sweden
Birds-of-a-feather session tonight: see me after this talk!
Outline
• What is it?
• What is it good for?
• Does it really work?
• Gory details.
• Why did we do this?
• Profit!
What is it?
NDGF BIO Community Grid
Bioinformatics for the Grid
What is it?
• Unified interface
...to popular bioinformatic applications
...on shared, distributed computational resources
...using versioned and cached databases
What is it good for?
• Burst computing
– High demand for short periods of time
• high during development / production
• low during analysis / writing papers
– Share resources to enable more efficient use
• Database accessibility
• Availability
• Unified interface
What is NDGF?
• Nordic Data Grid Facility
• A WLCG Tier1 facility
– Worldwide LHC Computational Grid
– Stores and processes data from LHC at CERN
• peak rate ≈ 1.6 Gb/s when the accelerator is running (and that’s after most of the data have been filtered away)
”Does it really work, this distributed thingie?”
Why yes, very well thank you!
NDGF
• 96% availability (highest of all Tier1 facilities)
• Third largest Tier1 facility in the world
• Lowest ratio of failed ATLAS jobs
• Production goals met, and beyond
– Goal: 8% of all ATLAS resources (10.5% provided)
– Goal: 9% of all ALICE resources (12% provided)
* Data graciously stolen from Leif Nixon’s NorduNet 2008 talk. Thank you Leif :-)
DISTRIBUTION
IS A
STRENGTH
It enforces unification
It ensures availability
Does it really work?
It’s good enough for LHC.
It’s good enough for Bioinformatics.
Gory details
Biogrid provides
Optimised applications:
– BLAST
– ClustalW
– HMMER
– Muscle
– Mafft
Planned: molecular dynamics, phylogeny...
Biogrid provides
Versioned, indexed and cached databases
– UniProtKB (subreleases)
– Uniref (subreleases)
Planned: genomes (EnsEMBL), nucleotides (EMBL)...
Cached database access
Database files are transferred to the cluster at most once per project.
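The caching idea can be sketched in shell. This is an illustrative sketch, not the actual Biogrid implementation: the cache directory name and the plain `cp` transfer are assumptions standing in for the real staging mechanism.

```shell
#!/bin/sh
# Sketch of per-cluster database caching: fetch a db file only if it
# is not already present in the shared cache directory.
# CACHE_DIR and the copy-based transfer are illustrative assumptions.
CACHE_DIR=${CACHE_DIR:-/var/cache/biogrid/db}

fetch_db() {
    # $1 = db file name, $2 = source path/URL on the front node
    mkdir -p "$CACHE_DIR"
    if [ ! -f "$CACHE_DIR/$1" ]; then
        # The first job of a project pays the transfer cost...
        cp "$2" "$CACHE_DIR/$1"
    fi
    # ...all subsequent jobs reuse the cached copy.
    echo "$CACHE_DIR/$1"
}
```

A later job calling `fetch_db uniprot_sprot.fasta.gz ...` then finds the file already cached and skips the transfer entirely.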
Unified Interface
• XRSL Job Description (standard in ARC Grid Middleware)
• Well-defined runtime environments
– $HMMERDIR: node-local (fast) scratch dir containing db files
– prepare_db: downloads and unpacks db files on the fly from the front node to $HMMERDIR
XRSL Job Description
(jobName=refinehmm-family023)
(runTimeEnvironment=APPS/BIO/HMMER2.3.2)
(cpuTime=3000)
(executable=refinehmm.jobscript.sh)
(inputFiles=
(sp.gz srm://srm.ndgf.org/biogrid/db/uniprot/UniProt14.8/uniprot_sprot.fasta.gz)
(tr.gz srm://srm.ndgf.org/biogrid/db/uniprot/UniProt14.8/uniprot_trembl.fasta.gz)
(family023.hmm "")
)
(outputfiles=
(family023.refined.hmm "")
)
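The executable named in the description could look roughly like this. The real refinehmm.jobscript.sh is not shown in the talk; prepare_db and $HMMERDIR come from the HMMER runtime environment described above, while the body of the function is an illustrative assumption (hmmsearch is HMMER 2.x syntax; the actual refinement loop is omitted).

```shell
#!/bin/sh
# Hypothetical sketch of what refinehmm.jobscript.sh might do.
# ARC stages sp.gz, tr.gz and family023.hmm into the job's working
# directory before the script starts.
refinehmm_job() {
    # Unpack the staged database files into the node-local scratch
    # directory $HMMERDIR (provided by APPS/BIO/HMMER2.3.2).
    prepare_db sp.gz tr.gz

    # Concatenate the unpacked databases and search the profile HMM
    # against them.
    cat "$HMMERDIR"/*.fasta > db.fasta
    hmmsearch family023.hmm db.fasta > family023.search

    # Produce the output file declared in the XRSL description
    # (here a plain copy; the real script would refine the HMM).
    cp family023.hmm family023.refined.hmm
}
```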
XRSL Job Description
(jobName=refinehmm-$HMM_NAME)
(runTimeEnvironment=APPS/BIO/HMMER2.3.2)
(cpuTime=3000)
(executable=refinehmm.jobscript.sh)
(inputFiles=
(sp.gz srm://srm.ndgf.org/biogrid/db/uniprot/UniProt14.8/uniprot_sprot.fasta.gz)
(tr.gz srm://srm.ndgf.org/biogrid/db/uniprot/UniProt14.8/uniprot_trembl.fasta.gz)
($HMM_NAME.hmm "")
)
(outputfiles=
($HMM_NAME.refined.hmm "")
)
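Since only $HMM_NAME varies between jobs, the parameterised description above can be instantiated once per family with a short shell loop. The template file name and the sed-based substitution are illustrative assumptions, not tooling from the talk.

```shell
#!/bin/sh
# Generate one XRSL file per HMM family from a template in which the
# literal string $HMM_NAME marks the varying parts.
make_xrsl() {
    # $1 = template file, $2 = family name
    sed "s/\$HMM_NAME/$2/g" "$1" > "refinehmm-$2.xrsl"
}

# Typical use: instantiate and submit each family's job, e.g.
# for family in family001 family002 family023; do
#     make_xrsl refinehmm.xrsl.template "$family"
#     ngsub "refinehmm-$family.xrsl"
# done
```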
Unified Interface
• Run on any resource I can access:
$ ngsub myjob.xrsl
• ...or run on my buddy’s cluster:
$ ngsub -c kiniini.csc.fi myjob.xrsl
• Check jobs:
$ ngstat refinehmm-family023
(or use the Grid Monitor web interface at www.nordugrid.org)
• Fetch results:
$ ngget refinehmm-family*
What do I need?
1. A resource with ARC and Biogrid REs
2. An ARC client
3. A Grid Certificate (available from a number of global certificate authorities)
4. Time allowance on the resource
5. Biogrid VO Membership (not really necessary, but it will get you 1 & 4)
What do I need?
...or you can just grab the RE scripts off the biogrid website,
and your db of choice from the biogrid dCache.
Why did we do this?
Bioinformatic applications...
– CPU intensive
– Small input and output files
– ”Large” databases can be cached
...are very well suited for distributed computing.
Profit!
Subclassification of the MDR superfamily
• 15000 members, from all kingdoms of life
• 500 families at 25% sequence identity
• 40 human members
• Different substrate specificities
• Different subunit & cofactor count
• 2 HMMs available for superfamily detection
• None for any of the individual families
Subclassification of the MDR superfamily
• We made HMMs for all MDR (sub)families with 20+ members.
• 86 families
• 34 subfamilies detected within 14 of these
• 11579 / 15000 sequences classified
• ≈5000 × hmmsearch vs UniProtKB
Manuscript in preparation
refinehmm
• Algorithm for automated HMM refinement
• Produces stable and reliable HMMs
• Developed using Biogrid REs and resources
Will also be open source software once the paper is out.
Acknowledgements
• Olli Tourunen, Biogrid developer
• Bengt Persson, Biogrid PI
• NDGF: Michael Grønager, Josva Kleist
• Biogrid co-applicants: Ann-Charlotte Berglund Sonnhammer, Erik Sonnhammer, Inge Jonassen
Supercomputing centers
• NSC: Jens Larsson, Leif Nixon
• HPC2N: Åke Sandgren
• Others: C3SE, CSC, Uppmax, Lunarc, PDC, Aalborg University, Oslo University
Birds-of-a-feather session tonight: see me after the talk!
Joel Hedlund
Biogrid User and Developer
Linköping University, Sweden