Biogrid – Bioinformatics for the grid
Joel Hedlund <[email protected]>
Biogrid User and Developer
Linköping University, Sweden
Birds-of-a-feather session tonight: see me after this talk!
Outline
• What is it?
• What is it good for?
• Does it really work?
• Gory details.
• Why did we do this?
• Profit!
What is it?
NDGF BIO Community Grid
Bioinformatics for the Grid
What is it?
• Unified interface
...to popular bioinformatic applications
...on shared, distributed computational resources
...using versioned and cached databases
What is it good for?
• Burst computing
– High demand for short periods of time
• high during development / production
• low during analysis / writing papers
– Share resources to enable more efficient use
• Database accessibility
• Availability
• Unified interface
What is NDGF?
• Nordic Data Grid Facility
• A WLCG Tier1 facility
– Worldwide LHC Computational Grid
– Stores and processes data from LHC at CERN
• peak rate ≈ 1.6 Gb/s when the accelerator is running (and that’s after most of the data have been filtered away)
”Does it really work, this distributed thingie?”
Why yes, very well thank you!
NDGF
• 96% availability (highest of all Tier1 facilities)
• Third largest Tier1 facility in the world
• Lowest ratio of failed ATLAS jobs
• Production goals met, and beyond
– Goal: 8% of all ATLAS resources (10.5% provided)
– Goal: 9% of all ALICE resources (12% provided)
* Data graciously stolen from Leif Nixon’s NorduNet 2008 talk. Thank you Leif :-)
DISTRIBUTION
IS A
STRENGTH
It enforces unification
It ensures availability
Does it really work?
It’s good enough for LHC.
It’s good enough for Bioinformatics.
Gory details
Biogrid provides
Optimised applications:
– BLAST
– ClustalW
– HMMER
– Muscle
– Mafft
Planned: molecular dynamics, phylogeny...
Biogrid provides
Versioned, indexed and cached databases
– UniProtKB (subreleases)
– Uniref (subreleases)
Planned: genomes (EnsEMBL), nucleotides (EMBL)...
Cached database access
Database files are transferred to the cluster at most once per project.
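The caching idea can be sketched in shell. This is an illustrative sketch, not the actual Biogrid implementation: the cache directory name and the plain `cp` transfer are assumptions standing in for the real staging mechanism.

```shell
#!/bin/sh
# Sketch of per-cluster database caching: fetch a db file only if it
# is not already present in the shared cache directory.
# CACHE_DIR and the copy-based transfer are illustrative assumptions.
CACHE_DIR=${CACHE_DIR:-/var/cache/biogrid/db}

fetch_db() {
    # $1 = db file name, $2 = source path/URL on the front node
    mkdir -p "$CACHE_DIR"
    if [ ! -f "$CACHE_DIR/$1" ]; then
        # The first job of a project pays the transfer cost...
        cp "$2" "$CACHE_DIR/$1"
    fi
    # ...all subsequent jobs reuse the cached copy.
    echo "$CACHE_DIR/$1"
}
```

A later job calling `fetch_db uniprot_sprot.fasta.gz ...` then finds the file already cached and skips the transfer entirely.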
Unified Interface
• XRSL Job Description (standard in ARC Grid Middleware)
• Well-defined runtime environments
– $HMMERDIR: node-local (fast) scratch dir containing db files
– prepare_db: downloads and unpacks db files on the fly from the front node to $HMMERDIR
XRSL Job Description
(jobName=refinehmm-family023)
(runTimeEnvironment=APPS/BIO/HMMER2.3.2)
(cpuTime=3000)
(executable=refinehmm.jobscript.sh)
(inputFiles=
(sp.gz srm://srm.ndgf.org/biogrid/db/uniprot/UniProt14.8/uniprot_sprot.fasta.gz)
(tr.gz srm://srm.ndgf.org/biogrid/db/uniprot/UniProt14.8/uniprot_trembl.fasta.gz)
(family023.hmm "")
)
(outputfiles=
(family023.refined.hmm "")
)
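The executable named in the description could look roughly like this. The real refinehmm.jobscript.sh is not shown in the talk; prepare_db and $HMMERDIR come from the HMMER runtime environment described above, while the body of the function is an illustrative assumption (hmmsearch is HMMER 2.x syntax; the actual refinement loop is omitted).

```shell
#!/bin/sh
# Hypothetical sketch of what refinehmm.jobscript.sh might do.
# ARC stages sp.gz, tr.gz and family023.hmm into the job's working
# directory before the script starts.
refinehmm_job() {
    # Unpack the staged database files into the node-local scratch
    # directory $HMMERDIR (provided by APPS/BIO/HMMER2.3.2).
    prepare_db sp.gz tr.gz

    # Concatenate the unpacked databases and search the profile HMM
    # against them.
    cat "$HMMERDIR"/*.fasta > db.fasta
    hmmsearch family023.hmm db.fasta > family023.search

    # Produce the output file declared in the XRSL description
    # (here a plain copy; the real script would refine the HMM).
    cp family023.hmm family023.refined.hmm
}
```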
XRSL Job Description
(jobName=refinehmm-$HMM_NAME)
(runTimeEnvironment=APPS/BIO/HMMER2.3.2)
(cpuTime=3000)
(executable=refinehmm.jobscript.sh)
(inputFiles=
(sp.gz srm://srm.ndgf.org/biogrid/db/uniprot/UniProt14.8/uniprot_sprot.fasta.gz)
(tr.gz srm://srm.ndgf.org/biogrid/db/uniprot/UniProt14.8/uniprot_trembl.fasta.gz)
($HMM_NAME.hmm "")
)
(outputfiles=
($HMM_NAME.refined.hmm "")
)
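Since only $HMM_NAME varies between jobs, the parameterised description above can be instantiated once per family with a short shell loop. The template file name and the sed-based substitution are illustrative assumptions, not tooling from the talk.

```shell
#!/bin/sh
# Generate one XRSL file per HMM family from a template in which the
# literal string $HMM_NAME marks the varying parts.
make_xrsl() {
    # $1 = template file, $2 = family name
    sed "s/\$HMM_NAME/$2/g" "$1" > "refinehmm-$2.xrsl"
}

# Typical use: instantiate and submit each family's job, e.g.
# for family in family001 family002 family023; do
#     make_xrsl refinehmm.xrsl.template "$family"
#     ngsub "refinehmm-$family.xrsl"
# done
```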
Unified Interface
• Run on any resource I can access:
$ ngsub myjob.xrsl
• ...or run on my buddy’s cluster:
$ ngsub -c kiniini.csc.fi myjob.xrsl
• Check jobs:
$ ngstat refinehmm-family023
(or use the Grid Monitor web interface at www.nordugrid.org)
• Fetch results:
$ ngget refinehmm-family*
What do I need?
1. A resource with ARC and Biogrid REs
2. An ARC client
3. A Grid Certificate (available from a number of global certificate authorities)
4. Time allowance on the resource
5. Biogrid VO Membership (not really necessary, but it will get you 1 & 4)
What do I need?
...or you can just grab the RE scripts off the biogrid website,
and your db of choice from the biogrid dCache.
Why did we do this?
Bioinformatic applications...
– CPU intensive
– Small input and output files
– ”Large” databases can be cached
...are very well suited for distributed computing.
Profit!
Subclassification of the MDR superfamily
• 15000 members, from all kingdoms of life
• 500 families at 25% sequence identity
• 40 human members
• Different substrate specificities
• Different subunit & cofactor count
• 2 HMMs available for superfamily detection
• None for any of the individual families
Subclassification of the MDR superfamily
• We made HMMs for all MDR (sub)families with 20+ members.
• 86 families
• 34 subfamilies detected within 14 of these
• 11579 / 15000 sequences classified
• ≈5000 × hmmsearch vs UniProtKB
Manuscript in preparation
refinehmm
• Algorithm for automated HMM refinement
• Produces stable and reliable HMMs
• Developed using Biogrid REs and resources
Will also be open source software once the paper is out.
Acknowledgements
• Olli Tourunen, Biogrid developer
• Bengt Persson, Biogrid PI
• NDGF: Michael Grønager, Josva Kleist
• Biogrid co-applicants: Ann-Charlotte Berglund Sonnhammer, Erik Sonnhammer, Inge Jonassen
Supercomputing centers
• NSC: Jens Larsson, Leif Nixon
• HPC2N: Åke Sandgren
• Others: C3SE, CSC, Uppmax, Lunarc, PDC, Aalborg University, Oslo University
Birds-of-a-feather session tonight: see me after the talk!
Joel Hedlund
Biogrid User and Developer
Linköping University, Sweden