Page 1

FireCloud NCI Cloud Resource

David Siedzik, Broad Institute Data Sciences Platform

CI4CC Spring 2018 Symposium

April 3, 2018

Page 2

Broad’s Data Sciences Platform

Our vision is to accelerate science, transform medicine, and improve lives through data technologies.

We do this by creating software, operating services, and nucleating collaborations that maximize the impact of the data sciences on the biomedical ecosystem.

Page 3

DSP areas of focus

The Data Sciences Platform organizes its work in four focus areas:

● Data Generation: processing all sequence data generated by the Broad Genomics Platform

● Platform: developing and operating a cloud-based, freely available platform for scalable, collaborative bioinformatics analysis (powers FireCloud and research/production tools for several key studies)

● Algorithms: developing and supporting GATK and Picard tools

● Engagement: user-facing websites for collecting research data directly from participants, and for delivering insights to researchers through curated, interactive data exploration tools

Page 4

FireCloud overview

An intuitive, powerful web application for collaborative research in the cloud

FireCloud is a web application, hosted by Broad, that gives researchers access to powerful compute, up-to-date tools, and convenient security within the intuitive framework of a workspace.

Users can share a workspace with others, and, if desired, a user can access data directly through Google Cloud Platform tools in addition to the FireCloud application and API.

FireCloud is free for everyone to access; the only costs are for storage and compute, which are billed by GCP.

Page 5

FireCloud’s components include:

• Workspaces: a secure sandbox to organize data, analyses, and results; and to share with collaborators

• Workflows: scientific tools chained together using the WDL language, with the Cromwell execution engine dispatching the jobs

• Notebooks: scalable and secure Jupyter notebooks for ad hoc analyses, running on powerful Spark clusters

• Tool Repository: an index of shared tools, driven by the community

• Data Library: an index of available datasets and files, with capabilities to search based upon user data access
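As a rough illustration of the Workflows idea in the list above (tools chained so that one step's outputs become the next step's inputs), here is a minimal Python sketch. It is not WDL and does not use Cromwell; the `Task` and `run_chain` names, the toy steps, and the sample name are all hypothetical and exist only to show the chaining pattern.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


# Hypothetical stand-in for one tool in a workflow.
# This is not a FireCloud, WDL, or Cromwell type.
@dataclass
class Task:
    name: str
    run: Callable[[Dict[str, str]], Dict[str, str]]


def run_chain(tasks: List[Task], inputs: Dict[str, str]) -> Dict[str, str]:
    """Run tasks in order, merging each task's outputs into the shared values."""
    values = dict(inputs)
    for task in tasks:
        outputs = task.run(values)  # each task reads earlier outputs by key
        values.update(outputs)
    return values


# Toy steps that only derive file names; real tools would process data.
align = Task("align", lambda v: {"bam": v["sample"] + ".bam"})
call_variants = Task(
    "call_variants", lambda v: {"vcf": v["bam"].replace(".bam", ".vcf")}
)

result = run_chain([align, call_variants], {"sample": "sample1"})
print(result["vcf"])  # prints "sample1.vcf"
```

In a real workflow the wiring between steps is declared in the workflow description and the execution engine decides where and when each job runs; the sketch only shows the data dependency between steps.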

Page 6

Key user scenarios

Data curation

A project manager for a data generation consortium works with contributing institutions to process data, then documents it and prepares it to be shared with collaborators for analysis: core investigators at first, and then the world.

Batch processing

An analyst wants to run a best-practices mutation detection workflow on 150 individuals from a published dataset.

She needs to find the dataset and the latest version of the workflow, and then launch the run without detailed knowledge of cloud computing tools.

Interactive analysis

A small multi-institution analysis team is preparing to explore a new dataset before performing a novel analysis.

In multiple passes, investigators use bash commands, command line utilities, and Python to inspect results before finalizing an approach.

Page 7

Not just an application: a platform for biomedical data science solutions

FireCloud’s platform comprises multiple services that can be used to drive different user interfaces or scientific portals.

For example, data can be hosted in a FireCloud workspace and used to power an interactive visualization. Or a portal can initiate a workflow analysis on behalf of a user.

...there’s more flexibility around API deployment to come!

FireCloud under the hood

[Diagram labels: FireCloud Web Portal; Platform Developer API; Data Library; Workspaces; Workflows & Pipelines; Notebooks & Clusters; Security & Authentication]

Page 8

Single Cell Portal

Vision: To communicate reproducible science through the dissemination and sharing of single-cell data.

A place for scientific data (reproducibility):

• A repository for derived data (expression matrices, visualizations, …).

• A repository for primary files (non-human); human files can be linked.

A place for scientific exploration (communication):

• Automated generation of interactive plots.

• Fidelity to publications is preserved to allow scientists to specify exact plots for key visualizations.

Supporting all phases of your scientific inquiry:

• Studies can be private, shared privately, or public.

https://portals.broadinstitute.org/single_cell

Page 9

Single Cell Portal Features

● Rich study description

● Interactive exploration

● Gene query through 2D and 3D

● Explore metadata

● Sharing gene lists

● Data sharing

https://portals.broadinstitute.org/single_cell

Page 10

Usage Examples

Page 11

Metrics

Library & Tools

Total datasets in library = 136

Total public methods = 360

User Activity

Total registered users = 3,408

Workspaces Created 2016 to present = 4k+

Workflows Run

Total workflows run = 1.4M

Submissions Created 2016 to present = 59k+

Page 12

ITCR User story: Histopath image analysis

Running HistXtract on TCGA diagnostic images in just a few clicks

HistXtract is a pipeline for extracting nuclear morphometry features from whole-slide images.

Members of the Getz Lab created an open-access FireCloud workspace preconfigured to download and analyze FFPE images for 9,600 participants across 32 types of cancer.

In just two steps, any FireCloud user can download the available images and run the HistXtract analysis workflow for some or all participants.

Kong J, Cooper LAD, Wang F, Gao J, Teodoro G, Scarpace L, et al. (2013) https://doi.org/10.1371/journal.pone.0081049

Page 13

ITCR Story: Trinity Cancer Transcriptome Analysis Toolkit

Expanding accessibility across the community

CTAT developers are actively using FireCloud to enable its use on large datasets and share best practices with the community.

● STAR-Fusion and FusionInspector workflows currently in FireCloud

● Currently used by Broad’s clinical research sequencing program in a pilot study in pediatric oncology to demonstrate the use of CTAT in a clinical setting

B Haas, A Regev - Klarman Cell Observatory

Page 14

FireCloud in production

Broad’s Genomics Platform delivers all WGS and arrays projects into workspaces

Cloud data processing & delivery:

● 46,000+ WGS CRAMs delivered to date (in 76 total workspaces)

● Arrays data delivery live this month

● WES planned for Q2 implementation

Delivery for all data types:

● For data types still processed on-premises, FireCloud transfer used in lieu of 3rd party cloud delivery services

FireCloud Explorer (alpha) launched in January 2018

- Assists users not familiar with gsutil with upload and download of data

Page 15

FireCloud in production


The Broad Genomics Platform TAG Team (Translational Analysis Group) has run 2,878 analyses on 3,605 unique samples for internal and external projects using FireCloud.

Page 16

Related efforts outside of NCI

All of Us Research Program

FireCloud is the platform for the researcher workbench that will be used to access All of Us study data.

NIH Data Commons Pilot

FireCloud workspaces are a key component of the NIH Data Commons Pilot Phase, providing preliminary FAIR access to GTEx, TOPMed, and Model Organism Databases (MODs).

NHLBI STAGE Platform

NHLBI will host all its future datasets, many of which are much larger than any previously generated, in a new platform called STAGE. FireCloud workspaces are a key part of this Commons experience.

Each of these Commons efforts, in conjunction with the NCI Commons Core Framework, will extend FireCloud to work with data stored in repositories around the globe.

Page 17

Related efforts outside of NCI

Human Cell Atlas

Broad is operating the production processing facility for the HCA Data Coordination Platform.

DSP will also load HCA data into FireCloud workspaces to enable researchers to run these same production pipelines in addition to interactive analysis.

Brain Cell Data Center

Broad and the Allen Institute will leverage HCA infrastructure to host all data generated from the Brain Initiative Cell Census Network.

This initiative is similar to HCA but focused on the brain, and FireCloud will provide a similar analysis environment.

Page 18

All of Us Research Program

Page 19

All of Us Research Program: Researcher Portal

Page 20

What lies ahead

We are continuing development to evolve the platform into a citizen of an interoperable world, guided by the principles outlined across the Data Biosphere.

● Openness: not only remaining open source, but continuing to engage with and learn from our users to build compelling software for the evolution of biomedical research

● Standards Awareness: deeply engaging in conversations around emerging and ratified standards through the Global Alliance for Genomics and Health and its driver projects (including many programs mentioned earlier)

● Modularity: enabling tailored hosted research environments through the combination of various services and interfaces

● Community: welcoming community contributions to the platform, through code, tools, or dialogue around real-world analysis

Page 21

Aligning the approach: The Data Biosphere

Page 22

Acknowledgments

Broad Institute Technical Platforms

Data Sciences Platform

Anthony Philippakis, MD, PhD - Chief Data Officer

Kristian Cibulskis - Director of Engineering

Workbench Engineering and Communications Teams

Genomics Platform

Niall Lennon, PhD - Director, TAG Team

Broad Institute Scientific Programs

Cancer Genome Analysis

Gad Getz, PhD - Director

Klarman Cell Observatory

Aviv Regev, PhD - Director

Data Biosphere

University of California, Santa Cruz

University of Chicago

Vanderbilt University

Verily Life Sciences

Chan Zuckerberg Initiative

Funding Partners

National Cancer Institute Cloud Resource Program

NIH Office of the Director Data Commons Pilot Program

National Heart, Lung, and Blood Institute STAGE Program

Page 23

Thank you!

Twitter: @BroadFireCloud

Website: www.firecloud.org

Data Biosphere: www.databiosphere.org
