Hedlund_biogrid_BOSC2009

Date post: 29-Nov-2014
Page 1: Hedlund_biogrid_BOSC2009

Biogrid – Bioinformatics for the grid

Joel Hedlund

Biogrid User and Developer

Linköping University, Sweden

Birds-of-a-feather session tonight: see me after this talk!

Page 2: Hedlund_biogrid_BOSC2009

Outline

• What is it?

• What is it good for?

• Does it really work?

• Gory details.

• Why did we do this?

• Profit!

Page 3: Hedlund_biogrid_BOSC2009

What is it?

NDGF BIO Community Grid

Bioinformatics for the Grid

Page 4: Hedlund_biogrid_BOSC2009

What is it?

• Unified interface

...to popular bioinformatic applications

...on shared, distributed computational resources

...using versioned and cached databases

Page 5: Hedlund_biogrid_BOSC2009

What is it good for?

• Burst computing

– High demand for short periods of time

• high during development / production

• low during analysis / writing papers

– Share resources to enable more efficient use

• Database accessibility

• Availability

• Unified interface

Page 6: Hedlund_biogrid_BOSC2009

What is NDGF?

Page 7: Hedlund_biogrid_BOSC2009

What is NDGF?

• Nordic Data Grid Facility

• A WLCG Tier1 facility

– Worldwide LHC Computational Grid

– Stores and processes data from LHC at CERN

• peak rate ≈ 1.6 Gb/s when the accelerator is running (and that’s after most of the data have been filtered away)

Page 8: Hedlund_biogrid_BOSC2009
Page 9: Hedlund_biogrid_BOSC2009
Page 10: Hedlund_biogrid_BOSC2009

“Does it really work, this distributed thingie?”

Page 11: Hedlund_biogrid_BOSC2009

“Does it really work, this distributed thingie?”

Why yes, very well thank you!

Page 12: Hedlund_biogrid_BOSC2009

NDGF

• 96% availability (highest of all Tier1 facilities)

• Third largest Tier1 facility in the world

• Lowest ratio of failed ATLAS jobs

• Production goals met, and beyond

– Goal: 8% of all ATLAS resources (10.5% provided)

– Goal: 9% of all ALICE resources (12% provided)

* Data graciously stolen from Leif Nixon’s NorduNet 2008 talk. Thank you Leif :-)

Page 13: Hedlund_biogrid_BOSC2009

DISTRIBUTION

IS A

STRENGTH

Page 14: Hedlund_biogrid_BOSC2009

It enforces unification

It ensures availability

Page 15: Hedlund_biogrid_BOSC2009

Does it really work?

It’s good enough for LHC.

It’s good enough for Bioinformatics.

Page 16: Hedlund_biogrid_BOSC2009

Gory details

Page 17: Hedlund_biogrid_BOSC2009

Biogrid provides

Optimised applications:

– BLAST

– ClustalW

– HMMER

– Muscle

– Mafft

Planned: molecular dynamics, phylogeny...

Page 18: Hedlund_biogrid_BOSC2009

Biogrid provides

Versioned, indexed and cached databases

– UniProtKB (subreleases)

– Uniref (subreleases)

Planned: genomes (EnsEMBL), nucleotides (EMBL)...

Page 19: Hedlund_biogrid_BOSC2009

Cached database access

Database files are transferred to the cluster at most once per project.
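The "at most once" behaviour can be illustrated with a minimal fetch-if-absent sketch. This is not Biogrid's actual caching code: the helper name `ensure_cached`, the cache directory layout, and the `fetch` callback are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path
from typing import Callable


def ensure_cached(cache_dir: Path, filename: str,
                  fetch: Callable[[Path], None]) -> Path:
    """Return the cached path of `filename`, fetching it only if absent.

    Hypothetical helper: `fetch` stands in for whatever transfer
    mechanism the site actually uses.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / filename
    if not target.exists():
        # Write under a temporary name first, so an interrupted transfer
        # never leaves a half-written file that looks like a cache hit.
        partial = target.with_name(target.name + ".part")
        fetch(partial)
        partial.replace(target)
    return target


if __name__ == "__main__":
    transfers = []

    def fake_fetch(dest: Path) -> None:
        # Stand-in for a real transfer; records that it was called.
        transfers.append(dest.name)
        dest.write_text(">placeholder\n")

    with tempfile.TemporaryDirectory() as tmp:
        cache = Path(tmp) / "project-cache"
        for _ in range(3):  # three jobs belonging to the same project
            ensure_cached(cache, "uniprot_sprot.fasta.gz", fake_fetch)
        print(f"transfers performed: {len(transfers)}")
```

Only the first job pays for the transfer; later jobs in the same project find the file already in place.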

Page 20: Hedlund_biogrid_BOSC2009

Unified Interface

Page 21: Hedlund_biogrid_BOSC2009

Unified Interface

Page 22: Hedlund_biogrid_BOSC2009

Unified Interface

DATA

RESULTS

Page 23: Hedlund_biogrid_BOSC2009

Unified Interface

• XRSL Job Description

– Standard in ARC Grid Middleware

• Well-defined runtime environments

– $HMMERDIR: node-local (fast) scratch dir containing db files

– prepare_db: download and unpack db files on the fly from front node to $HMMERDIR
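As a rough illustration of what a step like prepare_db does, the sketch below unpacks a gzipped database file into a scratch directory. It is not the Biogrid runtime environment script: the function signature, the use of a local path as the "front node" source, and the fallback scratch location are assumptions.

```python
import gzip
import os
import shutil
import tempfile
from pathlib import Path


def prepare_db(archive: Path, scratch_dir: Path) -> Path:
    """Unpack a gzipped database file into the scratch directory.

    Hypothetical stand-in for the runtime environment's prepare_db step.
    """
    scratch_dir.mkdir(parents=True, exist_ok=True)
    # "sp.gz" -> "sp", "uniprot_sprot.fasta.gz" -> "uniprot_sprot.fasta"
    name = archive.stem if archive.suffix == ".gz" else archive.name
    unpacked = scratch_dir / name
    with gzip.open(archive, "rb") as src, open(unpacked, "wb") as dst:
        shutil.copyfileobj(src, dst)  # streams, so large files are fine
    return unpacked


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Simulated front-node copy of a (tiny) database file.
        archive = Path(tmp) / "sp.gz"
        with gzip.open(archive, "wb") as fh:
            fh.write(b">seq1\nMKV\n")
        # Use $HMMERDIR if the environment defines it, else a temp dir.
        scratch = Path(os.environ.get("HMMERDIR", str(Path(tmp) / "scratch")))
        print(prepare_db(archive, scratch))
```

Keeping the unpacked files on node-local scratch means the application reads them from fast local disk rather than over the network.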

Page 24: Hedlund_biogrid_BOSC2009

XRSL Job Description

(jobName=refinehmm-family023)
(runTimeEnvironment=APPS/BIO/HMMER2.3.2)
(cpuTime=3000)
(executable=refinehmm.jobscript.sh)
(inputFiles=
  (sp.gz srm://srm.ndgf.org/biogrid/db/uniprot/UniProt14.8/uniprot_sprot.fasta.gz)
  (tr.gz srm://srm.ndgf.org/biogrid/db/uniprot/UniProt14.8/uniprot_trembl.fasta.gz)
  (family023.hmm "")
)
(outputfiles=
  (family023.refined.hmm "")
)

Page 25: Hedlund_biogrid_BOSC2009

XRSL Job Description

(jobName=refinehmm-$HMM_NAME)
(runTimeEnvironment=APPS/BIO/HMMER2.3.2)
(cpuTime=3000)
(executable=refinehmm.jobscript.sh)
(inputFiles=
  (sp.gz srm://srm.ndgf.org/biogrid/db/uniprot/UniProt14.8/uniprot_sprot.fasta.gz)
  (tr.gz srm://srm.ndgf.org/biogrid/db/uniprot/UniProt14.8/uniprot_trembl.fasta.gz)
  ($HMM_NAME.hmm "")
)
(outputfiles=
  ($HMM_NAME.refined.hmm "")
)
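One way to turn the templated job description above into concrete per-family jobs is plain placeholder substitution. The sketch below uses Python's `string.Template`, whose `$NAME` syntax happens to match the slide; the helper name `render_job` is an assumption, and the talk does not say which tool was actually used to fill in `$HMM_NAME`.

```python
from string import Template

# Template text mirrors the slide; $HMM_NAME is the only placeholder.
XRSL_TEMPLATE = Template('''\
(jobName=refinehmm-$HMM_NAME)
(runTimeEnvironment=APPS/BIO/HMMER2.3.2)
(cpuTime=3000)
(executable=refinehmm.jobscript.sh)
(inputFiles=
  (sp.gz srm://srm.ndgf.org/biogrid/db/uniprot/UniProt14.8/uniprot_sprot.fasta.gz)
  (tr.gz srm://srm.ndgf.org/biogrid/db/uniprot/UniProt14.8/uniprot_trembl.fasta.gz)
  ($HMM_NAME.hmm "")
)
(outputfiles=
  ($HMM_NAME.refined.hmm "")
)
''')


def render_job(hmm_name: str) -> str:
    """Return the job description text for one HMM (hypothetical helper)."""
    # substitute() raises KeyError if a placeholder is left unfilled,
    # which guards against submitting a half-rendered job description.
    return XRSL_TEMPLATE.substitute(HMM_NAME=hmm_name)


if __name__ == "__main__":
    print(render_job("family023"))
```

Rendering with `family023` reproduces the concrete job description from the previous slide; looping over family names would yield one job file per HMM.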

Page 26: Hedlund_biogrid_BOSC2009

Unified Interface

• Run on any resource I can access:

$ ngsub myjob.xrsl

• ...or run on my buddy’s cluster:

$ ngsub -c kiniini.csc.fi myjob.xrsl

• Check jobs:

$ ngstat refinehmm-family023

(or use Grid Monitor web interface at www.nordugrid.org)

• Fetch results:

$ ngget refinehmm-family*

[Diagram labels: DATA, GRID, RESULTS]
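For scripted use, the submit / check / fetch cycle shown on this slide could be wrapped in a few small helpers. The sketch below only builds the argument lists and does not run anything; it uses just the invocations that appear on the slide, and the helper names are assumptions.

```python
from typing import List, Optional


def submit_command(job_file: str, cluster: Optional[str] = None) -> List[str]:
    """Build an ngsub invocation, optionally pinned to one cluster."""
    cmd = ["ngsub"]
    if cluster is not None:
        cmd += ["-c", cluster]  # same -c flag as on the slide
    cmd.append(job_file)
    return cmd


def status_command(job_name: str) -> List[str]:
    """Build an ngstat invocation for a named job."""
    return ["ngstat", job_name]


def fetch_command(job_name: str) -> List[str]:
    """Build an ngget invocation for a named job."""
    return ["ngget", job_name]


if __name__ == "__main__":
    # Print the commands; on a machine with an ARC client installed they
    # could instead be passed to subprocess.run(cmd, check=True).
    for cmd in (
        submit_command("myjob.xrsl"),
        submit_command("myjob.xrsl", cluster="kiniini.csc.fi"),
        status_command("refinehmm-family023"),
        fetch_command("refinehmm-family023"),
    ):
        print(" ".join(cmd))
```

Building argument lists rather than shell strings avoids quoting problems when job names come from a script.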

Page 27: Hedlund_biogrid_BOSC2009

What do I need?

1. A resource with ARC and Biogrid REs

2. An ARC client

3. A Grid Certificate (available from a number of global certificate authorities)

4. Time allowance on the resource

5. Biogrid VO Membership (not really necessary, but it will get you 1 & 4)

Page 28: Hedlund_biogrid_BOSC2009

What do I need?

...or you can just grab the RE scripts off the biogrid website,

and your db of choice from the biogrid dCache.

Page 29: Hedlund_biogrid_BOSC2009

Why did we do this?

Bioinformatic applications...

– CPU intensive

– Small input and output files

– “Large” databases can be cached

...are very well suited for distributed computing.

Page 30: Hedlund_biogrid_BOSC2009

Profit!

Page 31: Hedlund_biogrid_BOSC2009

Subclassification of the MDR superfamily

• 15000 members from all kingdoms of life

• 500 families, 25% sequence identity

• 40 human members

• Different substrate specificities

• Different subunit & cofactor count

• 2 HMMs available for superfamily detection

• None for any of the individual families

Page 32: Hedlund_biogrid_BOSC2009

Subclassification of the MDR superfamily

• We made HMMs for all MDR (sub)families with 20+ members.

• 86 families

• 34 subfamilies detected in 14 of these

• 11579 / 15000 sequences classified

• ≈5000 hmmsearch runs vs UniProtKB

Manuscript in preparation

Page 33: Hedlund_biogrid_BOSC2009

refinehmm

• Algorithm for automated HMM refinement

• Produces stable and reliable HMMs

• Developed using Biogrid REs and resources

Will also be open source software once the paper is out.

Page 34: Hedlund_biogrid_BOSC2009

Acknowledgements

• Olli Tourunen, Biogrid developer

• Bengt Persson, Biogrid PI

• NDGF: Michael Grønager, Josva Kleist

• Biogrid co-applicants: Ann-Charlotte Berglund Sonnhammer, Erik Sonnhammer, Inge Jonassen

Supercomputing centers

• NSC: Jens Larsson, Leif Nixon

• HPC2N: Åke Sandgren

• Others: C3SE, CSC, Uppmax, Lunarc, PDC, Aalborg University, Oslo University

Birds-of-a-feather session tonight: see me after the talk!

Joel Hedlund

Biogrid User and Developer

Linköping University, Sweden
