Date post: | 09-Jan-2017 |
Category: |
Science |
Upload: | uppsala-university |
Storage and Analysis of Sensitive Large-Scale Biomedical Data in Sweden
Ola Spjuth
SNIC, UPPMAX and Science for Life Laboratory
Uppsala University, Sweden
Ola Spjuth
• Associate Professor in Pharmaceutical Bioinformatics
• Guest Researcher
• Co-Director
• Manager of Bioinformatics Compute and Storage facility
2003: First sequenced human genome - 13 years for $3 billion
2015: Human whole genome sequenced in 3 days for ~$1150
…requires supercomputers for analysis and storage
Massively parallel sequencing….
2010: Science for Life Laboratory inaugurated
An internationally leading center that develops and applies
large-scale technologies for molecular biosciences with a focus
on health and environment.
National platform since 2013
Stockholm node
Uppsala node
2. Data delivery
Data generation and delivery
3. Analysis
Scientists
www.uppmax.uu.se/uppnex
High-performance computers and large scale storage for bioinformatics analysis.
1. Sample transfer
Sequence production 2014:
• Generated > 120 Tbp of sequence data
• 13.7 Gbp/hour, 3.8 Mbp/sec (on average)
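The average rates follow from spreading the yearly total over the hours and seconds in a year. A minimal sketch of that arithmetic, assuming exactly 120 Tbp (the slide says "> 120 Tbp") and a 365-day year; the function name is illustrative only:

```python
# Assumption: exactly 120 Tbp produced over a 365-day year.
TOTAL_BP = 120e12
HOURS_PER_YEAR = 365 * 24


def average_rates(total_bp: float, hours: float) -> tuple[float, float]:
    """Return (Gbp per hour, Mbp per second) for a total produced over `hours`."""
    bp_per_hour = total_bp / hours
    bp_per_sec = bp_per_hour / 3600
    return bp_per_hour / 1e9, bp_per_sec / 1e6


gbp_per_hour, mbp_per_sec = average_rates(TOTAL_BP, HOURS_PER_YEAR)
print(f"{gbp_per_hour:.1f} Gbp/hour, {mbp_per_sec:.1f} Mbp/sec")
```

Under these assumptions the result rounds to the 13.7 Gbp/hour and 3.8 Mbp/sec quoted on the slide.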
Hardware resources
milou: HP cluster of 208 nodes
pica: 6 (7) PB Hitachi storage
halvan: 2 TB high-memory computer
Fast network via SUNET
Backup via SNIC
Long-term storage at SweStore
nestor: 48 nodes production cluster
meles: 547 TB Hitachi storage
mosler: 24 nodes, 223 TB
Smog: 100 nodes, ~300 TB
2015: 250 nodes
2016: 200 new nodes
+1 PB
+2 PB
A national e-Infrastructure for NGS
Software + reference data
Support
Education
Compute resources
Storage resources
Efficiency + automation
What we sequenced at NGI /
Chipster workbench on UPPMAX
UpCloud – smog - (OpenStack)
• Open catalogue of VMIs
• Hosted at Uppsala University
M. Dahlö, F. Haziza, A. Kallio, E. Korpelainen, E. Bongcam-Rudloff, and O. Spjuth. BioImg.org: A catalogue of virtual machine images for the life sciences. Accepted in Bioinformatics and Biology Insights.
www.bioimg.org
Managing Virtual Machine Images
Mosler overview
• e-Infrastructure for working with sensitive data
• Copy of Norwegian solution (TSD)
• Designed to look like UPPMAX clusters
Mosler specifications
• High-performance computing in a virtualized environment (OpenStack)
• 2-factor authentication
• Restricted data transfer in/out
• Only accessible over remote desktop (ThinLinc) via Mosler dashboard
• Aim: Compliant with all laws and regulations for analyzing sensitive data in Sweden
Data hosting use case
(Diagram labels: Consortia; DBA; Consortium member; MyResearch; Virtual environment on Mosler with storage and compute. Flows: data hosting, data syncing, access and analysis.)
Data extraction use case
(Diagram labels: Manager; DBA; Scientist; LifeGene; Virtual environment on Mosler with storage and compute.)
1. Request for data
2. Approval
3. Data extraction
4. Data transfer
5. Access, analysis
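The five numbered steps of the data extraction use case form a strict sequence. The sketch below is one way to model that ordering in Python; it is not how Mosler implements it. The class and member names (`Step`, `ExtractionRequest`, `advance`) are hypothetical, and the slide does not say which role performs which step, so roles are left out.

```python
from enum import IntEnum


class Step(IntEnum):
    """The five numbered steps from the slide, in order."""
    REQUEST_FOR_DATA = 1
    APPROVAL = 2
    DATA_EXTRACTION = 3
    DATA_TRANSFER = 4
    ACCESS_ANALYSIS = 5


class ExtractionRequest:
    """Hypothetical tracker that only accepts the steps in slide order."""

    def __init__(self) -> None:
        self.completed: list[Step] = []

    def advance(self, step: Step) -> None:
        if len(self.completed) == len(Step):
            raise ValueError("all steps already completed")
        # The next allowed step is the one numbered after the last completed one.
        expected = Step(len(self.completed) + 1)
        if step != expected:
            raise ValueError(f"expected {expected.name}, got {step.name}")
        self.completed.append(step)


request = ExtractionRequest()
request.advance(Step.REQUEST_FOR_DATA)
request.advance(Step.APPROVAL)

try:
    # Jumping straight to transfer without extraction is rejected.
    request.advance(Step.DATA_TRANSFER)
except ValueError as err:
    print(err)
```

Enforcing the order in one place means a data transfer can never be recorded before the approval and extraction steps have been completed.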
Nov 2014
20M € total grant
4M € IT-infrastructure
X-Ten System
• First system able to deliver the $1000 genome
• Each run 1.2 TB data
• 16 human genomes (30X)
• 3 days per run
• Population scale genomics
• 15K genomes per year
Swedish Genome Initiative
Call for a reference variation Database (1000 genomes) and for Whole Human Genome (half price).
Goal: 5,000 genomes in 2015, 10,000 genomes in 2016
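To get a rough feel for what these goals mean for storage, the X-Ten figures quoted earlier (1.2 TB and 16 genomes per run) can be combined with the genome targets. This is a back-of-the-envelope sketch only: it assumes the run output is divided evenly between genomes and stored as-is, and it ignores compression, intermediate analysis files and backups. The function name is illustrative.

```python
# Figures from the slides: 1.2 TB of data and 16 human genomes (30X) per run.
TB_PER_RUN = 1.2
GENOMES_PER_RUN = 16


def raw_storage_tb(genomes: int) -> float:
    """Estimated raw data volume in TB, assuming an even split per genome."""
    return genomes * TB_PER_RUN / GENOMES_PER_RUN


for year, goal in [(2015, 5_000), (2016, 10_000)]:
    print(f"{year}: {goal} genomes ~ {raw_storage_tb(goal):.0f} TB raw data")
```

Under these assumptions each genome accounts for about 75 GB, so the two goals correspond to roughly 375 TB and 750 TB of raw data.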
(Figure: NGI-Stockholm Production (Jan-12 to Dec-15). x-axis: Production date, Aug-11 to Jan-16; y-axis: Gigabases, 0 to 1,000,000. Series: Data production; Conservative Prediction (60% of maximum production).)
Whole Genome Sequencing
• Data on new scale, 80% expected to be sensitive
New challenges
• Funding for IT-infrastructure from KAW foundation
– Resources for data production (2 M EUR)
– Resources for scientists (2 M EUR)
• A national security project funded by Swedish Research Council (5 M EUR over 4 years) – SNIC-Sens
SNIC-Sens
• 4-year project, started Jan 2015
• Project owner: SNIC (Ann-Charlotte Sonnhammer)
• Project leader: Ola Spjuth (until end of this week)
• Aims:
– Specifications for analyzing sensitive data in SNIC (hardware, legal, contracts, processes etc.)
– Evaluation of the use of public cloud providers (Google, Amazon)
– Make available e-Infrastructure for production and research of data generated at NGI, blueprint for other domains
SNIC-Sens roadmap
• Information classification workshop (21/5)
• Risk/vulnerability analysis (2/6)
• Specifications for hardware procurement
• Public tender (end of this week)
• Installation and testing of production system (Aug-Sept)
• Installation, configuration and testing of research system (Q3-Q4)
• Research system online (Q1 2016)
Two pilots for clinical data management
CML, Lucia Cavelier
MDR, Åsa Melhus