
Data Enthusiasts London: Scalable and Interoperable data services. Applied to Genomics

Date post: 07-Aug-2015
Upload: andy-petrella
by Data Fellas, Data Enthusiasts v 4.0 (July, 13th ‘15) Scalable and Interoperable data services Applied to Genomics


Young Belgian Startup

The Data Fellas Startup

Data Science

Xavier Tordoir @xtordoir

Andy Petrella @noootsab

Data Processing

Scalable Machine Learning

Micro Services oriented

Data Fellas Ecosystem

We’ve worked with

First: Data Science

Analysis

Spark Notebook

First: Data Science

Analysis

Production

Project Generator

Mesos / C* / DCOS

First: Data Science

Analysis

Production

Distribution

Micro Service / Binary format

Marathon

First: Data Science

Analysis

Production

Distribution

Rendering

Schema for output

GG / D3 …

First: Data Science

Analysis

Production

Distribution

Rendering

Discovery

Service Metadata

SOLR, …
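The discovery step can be pictured as a small catalog of service metadata that is searched by tag. What follows is only a sketch under assumptions: `ServiceDescriptor`, `Catalog`, the field names and the example URLs are all hypothetical, and the in-memory lookup merely stands in for a real search index such as SOLR (no SOLR API is used here).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ServiceDescriptor:
    """Hypothetical metadata record describing one published data service."""
    name: str
    endpoint: str          # illustrative URL, not a real service
    output_schema: str     # name of the schema the service returns
    tags: Set[str] = field(default_factory=set)


class Catalog:
    """In-memory stand-in for a searchable metadata index."""

    def __init__(self) -> None:
        self._services: Dict[str, ServiceDescriptor] = {}

    def register(self, service: ServiceDescriptor) -> None:
        # Re-registering a name overwrites the previous descriptor.
        self._services[service.name] = service

    def find(self, *tags: str) -> List[ServiceDescriptor]:
        """Return services carrying all requested tags (case-insensitive), sorted by name."""
        wanted = {t.lower() for t in tags}
        matches = [
            s for s in self._services.values()
            if wanted <= {t.lower() for t in s.tags}
        ]
        return sorted(matches, key=lambda s: s.name)


if __name__ == "__main__":
    catalog = Catalog()
    catalog.register(ServiceDescriptor(
        "pca", "http://example.org/pca", "PcaResult", {"genomics", "learning"}))
    catalog.register(ServiceDescriptor(
        "allele-frequency", "http://example.org/af", "FrequencyTable", {"genomics", "query"}))
    for service in catalog.find("genomics"):
        print(service.name, service.endpoint)
```

Keeping the output schema name in the descriptor is what would let a rendering step pick a suitable chart without calling the service first.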

First: Data Science

Analysis

Production

Distribution

Rendering

Discovery

Catalog

Spark Notebook using Services too

First: Data Science

Analysis

Production

Distribution

Rendering

Discovery

Share Analyses

Share Results

Share Datasets

First: Data Science

Project Code Name:

Shar3

Next: Applied TO Genomics

Genomics data is pretty big

● 100,000’s genomes in 2015
● 1,000,000’s …
● 100,000,000’s …
● …

Next: Applied TO Genomics

Genomics data is pretty big and of high dimensionality

One genome:
○ a sequence of 3 billion bases (the basic DNA component)
○ 30-60x coverage for quality
○ 10’s to 100’s of millions of variants (bases that vary from one individual to the next)
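To put the numbers above in perspective, here is a back-of-envelope calculation of how many bases are read for one genome at the stated coverage. The one-byte-per-base figure is an assumption made for illustration only; real files also carry quality scores and metadata, and are usually compressed.

```python
# Back-of-envelope sizing for one sequenced genome.
# Assumption (not from the slides): one byte per sequenced base,
# no quality scores, no compression.

GENOME_BASES = 3_000_000_000   # ~3 billion bases, as stated on the slide
COVERAGE_RANGE = (30, 60)      # 30-60x coverage, as stated on the slide
BYTES_PER_BASE = 1             # illustrative assumption


def sequenced_bases(genome_bases: int, coverage: int) -> int:
    """Total bases read when each position is covered `coverage` times on average."""
    return genome_bases * coverage


def raw_size_gb(genome_bases: int, coverage: int,
                bytes_per_base: int = BYTES_PER_BASE) -> float:
    """Approximate uncompressed size in gigabytes (10**9 bytes)."""
    return sequenced_bases(genome_bases, coverage) * bytes_per_base / 1e9


if __name__ == "__main__":
    for cov in COVERAGE_RANGE:
        bases = sequenced_bases(GENOME_BASES, cov)
        size = raw_size_gb(GENOME_BASES, cov)
        print(f"{cov}x coverage: {bases:.1e} bases, about {size:.0f} GB")
```

Under that assumption a single genome already lands in the 90 to 180 GB range before any analysis, which is why the per-genome figure multiplies so quickly across a cohort.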

Next: Applied TO Genomics

e.g. 1000genomes project:

● 200TB compressed data
● organised in files/directories
● data formatted following specs in a … PDF

Data and services schemas are required
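As an illustration of what a machine-readable schema buys over a spec in a PDF, here is a minimal sketch of a typed variant record that validates itself and round-trips through JSON. The `Variant` class and its field names are hypothetical, chosen for this example only; they are not the ADAM, bdg-formats or GA4GH schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Variant:
    """Hypothetical minimal variant record; field names are illustrative."""
    sample_id: str
    chromosome: str
    position: int    # 1-based position on the chromosome
    reference: str   # reference base(s)
    alternate: str   # observed base(s)

    def __post_init__(self) -> None:
        # Validation lives with the schema instead of in a prose document.
        if self.position < 1:
            raise ValueError("position must be >= 1")
        for name in ("reference", "alternate"):
            bases = getattr(self, name)
            if not bases or set(bases) - set("ACGTN"):
                raise ValueError(f"{name} must be a non-empty string over A, C, G, T, N")


def to_json(variant: Variant) -> str:
    """Serialize a record with stable key order."""
    return json.dumps(asdict(variant), sort_keys=True)


def from_json(payload: str) -> Variant:
    """Parse and validate a record; malformed input raises instead of passing silently."""
    return Variant(**json.loads(payload))


if __name__ == "__main__":
    v = Variant("sample-1", "1", 12345, "A", "G")
    print(to_json(v))
    print(from_json(to_json(v)) == v)
```

A service that publishes such a schema lets any consumer check a payload mechanically, which is the property the slide asks for.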

What do we do with genomics data?

Lots of Querying and Learning:

E.G.

● Population structure is a fundamental basis
● Querying relationships between genomes and other biological features

Hey… no one has all data!

Metadata

What do we do with genomics data?

Lots of Querying and Learning:

E.G.

● We do some specific Modelling on some data…

Hey… no two serve the same computations!

Service Discovery

Interoperability

So, no one has all data … BUT all should be able to talk…

Interoperability (GA4GH)
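The idea that no one holds all the data but everyone can talk can be sketched as several providers exposing the same query interface, with a client merging their answers. This is a sketch under assumptions: `VariantSource`, `InMemorySource`, `search_variants` and `federated_search` are hypothetical names for illustration and do not reproduce the GA4GH API.

```python
from typing import Iterable, List, Protocol, Tuple

# (chromosome, position, reference, alternate); an illustrative record shape
Variant = Tuple[str, int, str, str]


class VariantSource(Protocol):
    """Shared contract: any provider that implements this can be queried."""

    def search_variants(self, chromosome: str, start: int, end: int) -> List[Variant]:
        ...


class InMemorySource:
    """Toy provider holding its own subset of the data."""

    def __init__(self, variants: Iterable[Variant]) -> None:
        self._variants = list(variants)

    def search_variants(self, chromosome: str, start: int, end: int) -> List[Variant]:
        # Half-open interval [start, end) on the requested chromosome.
        return [v for v in self._variants
                if v[0] == chromosome and start <= v[1] < end]


def federated_search(sources: Iterable[VariantSource],
                     chromosome: str, start: int, end: int) -> List[Variant]:
    """Ask every source the same question and merge the answers without duplicates."""
    merged = set()
    for source in sources:
        merged.update(source.search_variants(chromosome, start, end))
    return sorted(merged)


if __name__ == "__main__":
    lab_a = InMemorySource([("1", 100, "A", "G"), ("1", 250, "C", "T")])
    lab_b = InMemorySource([("1", 250, "C", "T"), ("2", 50, "G", "A")])
    print(federated_search([lab_a, lab_b], "1", 0, 300))
```

The client never needs to know how each provider stores its data, only that the interface is shared.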

Interoperable…

Analysis

Production

Distribution

Rendering

Discovery

Share Analyses

Share Results

Share Datasets

Interoperable & scalable…

GA4GH + Shar3 = Med@Scale

+ ADAM & Spark
+ In-memory optimization (Tachyon)
+ Deployment (e.g. DCOS)

Wrap-UP

Follow us @DataFellas and get notified about our

+ sharing platform at scale: Shar3

+ Google Genomics At Home (^.^): Med@Scale

+ future plans: modules for Trading, Geospatial, other medical data, …

References

ADAM: https://github.com/bigdatagenomics/adam

Bdg-Formats: https://github.com/bigdatagenomics/bdg-formats

GA4GH website: http://genomicsandhealth.org/

GA4GH data working group: http://ga4gh.org/

@Spark-Notebook: https://github.com/andypetrella/spark-notebook/

Med-At-Scale: https://github.com/med-at-scale/high-health

Data Fellas: http://data-fellas.guru/

Training: http://spark4devs.data-fellas.guru/

Q/A

THANKS!

