Post on 26-Jan-2015
Reproducible Research and the Cloud
Dr Kenji Takeda (Kenji.Takeda@Microsoft.com)
Microsoft Research
@azure4research
Scientific Discovery
Credit: ROYAL INSTITUTION OF GREAT BRITAIN / SCIENCE PHOTO LIBRARY
$$\rho \frac{Dv}{Dt} = -\nabla p + \nabla \cdot \boldsymbol{T} + \boldsymbol{f}$$
The Research Lifecycle
Data acquisition & modelling
Collaboration and visualisation
Analysis & data mining
Dissemination & sharing
Archiving and preserving
fourthparadigm.org
Believe it or not: how much can we rely on published data on potential drug targets?
“at least 50% of published studies, even those in top-tier academic journals, can’t be repeated with the same conclusions by an industrial lab”
Osherovich, L. Hedging against academic risk. SciBX 14 Apr 2011 (doi:10.1038/scibx.2011.416).
CLOUD COMPUTING
Global presence
Datacenter
Edge point
The Microsoft Cloud
Cloud Computing
Choose from multiple runtimes and languages for your applications: Python, Java, PHP, .NET, Node.js
Run Linux on Windows Azure Virtual Machines (VHD)
Support multiple frameworks and popular open source applications with Windows Azure Web Sites
HDInsight Hadoop for Big Data analysis
Windows Azure
http://github.com/windowsazure
REPRODUCIBLE RESEARCH
• Computational experiments should be recomputable for all time
• Recomputation of recomputable experiments should be very easy
• It should be easier to make experiments recomputable than not to
• Tools and repositories can help recomputation become standard
• The only way to ensure recomputability is to provide virtual machines
• Runtime performance is a secondary issue
Ian Gent, Alexander Konovalov and Lars Kotthoff
Steven Crouch, Devasena Inupakutika
Recomputation.org
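The idea that recomputation "should be very easy" can be illustrated with a small check: rerun an experiment and compare a digest of its output against a recorded one. The following is a minimal sketch only; `run_experiment`, `digest` and `recompute_matches` are hypothetical names invented for this illustration, and the toy experiment stands in for real research code. Bit-for-bit agreement like this is only expected when the software environment is the same, which is part of the argument for distributing virtual machines.

```python
import hashlib
import json
import random


def run_experiment(seed: int, n: int = 1000) -> dict:
    """Toy stand-in for a computational experiment (hypothetical)."""
    rng = random.Random(seed)  # local seeded generator, no hidden global state
    samples = [rng.random() for _ in range(n)]
    return {"seed": seed, "n": n, "mean": sum(samples) / n}


def digest(result: dict) -> str:
    """Hash a canonical JSON encoding of the result."""
    payload = json.dumps(result, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def recompute_matches(seed: int, recorded_digest: str) -> bool:
    """Rerun the experiment and compare against the recorded digest."""
    return digest(run_experiment(seed)) == recorded_digest


if __name__ == "__main__":
    recorded = digest(run_experiment(42))  # stored alongside the publication
    print("recomputed result matches:", recompute_matches(42, recorded))
```

Publishing the digest next to the code and the machine image gives a reader a one-line way to confirm that a rerun produced the same result.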
Zanadu.IO
khmer-protocols:
• Effort to provide standard “cheap” assembly protocols for cloud machines.
• Entirely copy/paste; ~2-6 days from raw reads to assembly, annotations, and differential expression analysis. Estimated cost: ~$150 per data set
• Open, versioned, forkable, citable.
Open Science
C. Titus Brown, @ctitusbrown
http://ged.cse.msu.edu/
http://ivory.idyll.org/
Explicitly a “protocol” – explicit steps, copy-paste, customizable, versioned; not black box.
No requirement for computational expertise or significant computational hardware.
~1-5 days to teach a bench biologist to use.
$100-150 of rental compute (“cloud computing”)…
…for a $1000 data set.
Now adding in quality control and internal validation steps.
Some thoughts…
Reproducible computing environment (Azure)
Publicly available data (MMETSP)
Open and versioned protocol
Provenance tracking and registration (Synapse?)
Distribution Modeller
<compute + data>
Middle ground between:
Exploratory science
Procedural science
Black box that can be cracked open and modified
Interactive with auto-provenance
• Reproducing my own results
• Replicating other people’s results
• Reproducing other people’s results
Repeatability, Replicability, Reproducibility, Reuse
“reviewers have no time and no resources to reproduce data and to dig deeply into the presented work.”
Life Sci VC: Academic bias & biotech failures: http://lifescivc.com/2011/03/academic-bias-biotech-failures/
Photo: leechantmcarthur, CC-BY
Windows Azure for Research
• Azure Research Awards
• Windows Azure for Research Training Courses
– Manchester, 3-4 April’14
• Webinars
• Technical resources & curriculum
• Research community engagements
www.azure4research.com
THANK YOU
Kenji.Takeda@Microsoft.com
www.azure4research.com
Windows Azure for Research Group
@azure4research