Page 1

Can Python Save Next Generation Sequencing?

Chris Mueller, Life Technologies

June 30, 2010

SciPy 2010, Austin, TX

Page 2

Human Genome Project

10 Years
Thousands of Sequencers
$3,000,000,000
21 Billion Base Pairs (Gbp)

Modern Sequencing

2 weeks
One Sequencer
$6,000
100-200 Gbp
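The comparison above can be reduced to a cost per billion base pairs. The following is a minimal sketch that uses only the figures quoted on the slide; the function and variable names are illustrative and not part of the talk.

```python
# Figures quoted on the slide; the 100-200 Gbp output is kept as a range.
HGP_COST_USD = 3_000_000_000
HGP_OUTPUT_GBP = 21

NGS_COST_USD = 6_000
NGS_OUTPUT_GBP = (100, 200)


def cost_per_gbp(cost_usd, output_gbp):
    """Return the cost in US dollars of one billion sequenced base pairs."""
    return cost_usd / output_gbp


hgp = cost_per_gbp(HGP_COST_USD, HGP_OUTPUT_GBP)
ngs_high = cost_per_gbp(NGS_COST_USD, NGS_OUTPUT_GBP[0])  # 100 Gbp run
ngs_low = cost_per_gbp(NGS_COST_USD, NGS_OUTPUT_GBP[1])   # 200 Gbp run

print(f"Human Genome Project: ${hgp:,.0f} per Gbp")
print(f"Modern sequencing:    ${ngs_low:,.0f} to ${ngs_high:,.0f} per Gbp")
print(f"Improvement:          {hgp / ngs_high:,.0f}x to {hgp / ngs_low:,.0f}x")
```

On these numbers the Human Genome Project works out to roughly $143 million per Gbp, against $30 to $60 per Gbp for a single modern run.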

Page 3

Page 4

TCA-AGCAGCAGGA
||| || ||| |||
TCATAG-AGCGGGA
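The middle row of the alignment marks the columns where the two gapped sequences agree. As a small sketch (the helper name `match_line` is hypothetical), that row can be regenerated from the two sequences:

```python
def match_line(top, bottom):
    """Return the '|' / ' ' marker row for two gapped, equal-length sequences."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    # A column counts as a match only when both characters are the same base.
    return "".join(
        "|" if a == b and a != "-" else " "
        for a, b in zip(top, bottom)
    )


top = "TCA-AGCAGCAGGA"
bottom = "TCATAG-AGCGGGA"
print(top)
print(match_line(top, bottom))
print(bottom)
```

For the pair shown on the slide this gives 11 matching columns, two gap columns, and one mismatch.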

Page 5

10 GB – 2 TB of raw data is transferred from the instrument to the analysis cluster.

Next Generation Sequencing (NGS)

...
>187_29_706_F3
T23302010303131123123022203111123200210100122001102
>187_29_800_F3
T31120012213222002222130121121122112032220323121202
>187_29_824_F3
T22211130023020133231323302310303131123123022201211
>187_29_829_F3
T23302010003130123123022203111120122123202132301212
>187_29_858_F3
T23302010303131123123022203111123222123122122321212
>187_29_885_F3
T23302010303131123123022203111121220013212122021222
...

The sequence from each bead is reported in a data file. Files can be over 100 GB.
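Because these files can exceed 100 GB, a reader should stream records rather than load the whole file. The sketch below assumes the layout shown in the excerpt (a `>` header line followed by the read) and additionally assumes that lines starting with `#` are comments; the function name `read_records` is hypothetical, and the digit strings are returned as-is without any decoding.

```python
import io


def read_records(handle):
    """Yield (name, sequence) pairs from FASTA-style text, one record at a time."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):  # assumed comment convention
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


# Two records copied from the slide, wrapped in an in-memory file for the demo.
sample = io.StringIO(
    ">187_29_706_F3\n"
    "T23302010303131123123022203111123200210100122001102\n"
    ">187_29_800_F3\n"
    "T31120012213222002222130121121122112032220323121202\n"
)
records = list(read_records(sample))
for name, sequence in records:
    print(name, len(sequence))
```

Since `read_records` is a generator, the same loop works on an open file handle without holding more than one record in memory.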

Sample prep breaks the DNA or RNA into short segments that are attached to 500 million to 1 billion beads.

+

The bead sequences are either assembled into a new genome or “mapped” to a reference genome.
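To make "mapping" concrete, here is a toy sketch of an index-based exact-match mapper. It is an illustration only, not the method used by any SOLiD™ tool: production mappers tolerate mismatches and gaps and use far more compact indexes. The names `build_index` and `map_read` and the seed length `k=4` are assumptions made for the example.

```python
from collections import defaultdict


def build_index(reference, k):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def map_read(read, reference, index, k):
    """Return the reference positions where the whole read matches exactly."""
    hits = []
    # Use the first k bases as a seed, then verify the full read at each hit.
    for pos in index.get(read[:k], []):
        if reference[pos:pos + len(read)] == read:
            hits.append(pos)
    return hits


# A short illustrative reference (30 bases).
reference = "GATCAACTTGAGGCCAGCCTGACCAACGTG"
index = build_index(reference, k=4)

print(map_read("AGGCCAGCC", reference, index, k=4))
print(map_read("CAAC", reference, index, k=4))
```

The seed lookup keeps the work per read proportional to the number of seed hits rather than to the length of the reference, which is the main reason index schemes matter at this scale.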

=

The mapped reads from RNA samples can be further analyzed to determine which genes are active in the sample.

Next Generation Sequencing (NGS) instruments such as Life Technologies’ SOLiD™ sequence genomic material in millions of small pieces, enabling a high level of throughput and sequencing depth.

Page 6

Acquire a Reference

Sequence the diseased genome

GATCAACTTG AGGCCAGCCT GACCAACGTG GCAAAACTCC ATCTCTACTA AATTAGCTGA GCGTGGTGGC ACGCACCTGT CATCCCAGTT ACTCAGGAGG AGAATGGCTT GAGCCTGGGA GACAGAGGTT GCAGTGAGCT GAGATTGCAT TAGCCTGGGT GACAGAGTGA AATGGAGGGA GGAAAAAAAA AAAAAAGGAA AGGGAGCCAG CCTAGGATGG GGAAGGCTCA CCAGAAGTGG ATGCAAAGAG GAGCTCATTC TATTTGCCTA GGAAAGAAAA ACGTCCAGAA ACCTGGCCTT GCCGAGGCCC TCCAGGAAAG CCAGGCAGAC CCTGCTCCTG CTCTGACCCC

Sequence the expressed RNA

Annotate the diseased genome

Compare against the healthy sample

The Goal: Medical Genomics

Page 7

2-3 HD Movies

6% of Google’s Index

30x Google’s Daily Search Traffic

Acquire a Reference: 10 GB

Sequence the diseased genome: 100 Gbp, 36 Hours / 4 servers

Sequence the expressed RNA: 100 Gbp, 15 Billion Search Ops

Annotate the diseased genome: Millions of Features

Compare against the healthy sample: Human Driven Analysis and Interpretation

GATCAACTTG AGGCCAGCCT GACCAACGTG GCAAAACTCC ATCTCTACTA AATTAGCTGA GCGTGGTGGC ACGCACCTGT CATCCCAGTT ACTCAGGAGG AGAATGGCTT GAGCCTGGGA GACAGAGGTT GCAGTGAGCT GAGATTGCAT TAGCCTGGGT GACAGAGTGA AATGGAGGGA GGAAAAAAAA AAAAAAGGAA

AGGGAGCCAG CCTAGGATGG GGAAGGCTCA CCAGAAGTGG ATGCAAAGAG GAGCTCATTC TATTTGCCTA GGAAAGAAAA ACGTCCAGAA ACCTGGCCTT GCCGAGGCCC TCCAGGAAAG CCAGGCAGAC CCTGCTCCTG CTCTGACCCC GATATGTAGA AAAGAGGAGA TGGGCTTTGG CCCAGAGGAC AGCAGCGTTA CAGTTCCCAG TCGGTTCAAG GTTGCTAGGC TCAGCGAACT GCAAGTCCCT TTTCTTCCTA AAGGTCCCCA GTTCCTCATG ATTCTTCTGA GGGTCTCATC GGGCCTGCAG TCAGCTAGCC ACCCCACTGC CCCATGCCTG CAGTGAAGAC ACCTGGAAAT GGCTGGTGAC AGAAAAGTCC TCAGGGCCAC AGCACTCTCT TTCAGGTGCC TTGCCTGATG GTGACAAGGC TGGTTTTGCA TAAACAGCTC ACCCAGATGT GGCTTCTGAC TTGAGTGGAC CCCCATGAAG AGCTGCAGAG AAAGACAAGA GAATGGAGAA AATGAGGAGA AAGGAAACTA CAGAGTGAGA ATCTGACAGG TGTGACCACA GCAGTGTGAC ATTTATGAAC AGTGTGGAAA GCCCCTGAAT TAAAAACTTC CTGGGGAAAT AAGCCACTCC ACATATCGGT AGAGTGGAGG AGGCTGTTGA CCCGTGTGTG TCCCCATGAC TCAAAGAGGG CTGTTTCCAA TATCCCGAAA TCAGTCTTGC TGGGAGAACT GGGAAAATAA

ACCCCCGTAG GAAGCTACCT TTAATCCCAA GTGCCCAAGG CTAGGAGAGA GGCGATCCAG GACACCAGTG ACTGACACAG CCAGAGGTGG GAAAGGGGAG GCACAAAAGT GAGGAGTGAG CAAGGGTCTG AGAGGGAAGG CCATGTGGGC ACCCACATCA GAGACTGACA TGAGGATTAA AGGAGAGCAT AGGTGATCGG ACAGAAGAGA GGCAGCTCTA CACCCCTTGC TTGCAATTCT GAGCATTCTG GTTTGGCCAT CAAACCAGAC CTCAATTGAG ACAAGGCTAT TTAAGCTTCC

vs.

50-500 GB Retained Data

Personal Genomics

Page 8

4 Million Servers (1.2 million servers were sold in Q4 2009)

The Large Hadron Collider generates 300 EB/year

3 Years on UT’s Ranger Supercomputer

Acquire a Reference: 1 Petabyte of Bandwidth, 1 Billion YouTube Videos

72 Million Hours

1 Exabyte/Day

Sequence the diseased genome

Sequence the expressed RNA

Annotate the diseased genome

Compare against the healthy sample: 27,000 Radiologists in the US

Population Genomics

Page 9

So…

Can Python save next generation sequencing?

Probably not on its own…

Better questions:

How can Python help next generation sequencing?

How can next generation sequencing help Python?

Page 10

NGS Data and Workflow Components

Exploratory Analysis (EDA, Novel Apps): Undirected analysis and unforeseen sequencing applications

Standard Scientific Workflows (Annotation, Variation, Interactions, Expression): Understand the structure and function of genomic elements

Fundamental Algorithms (Assembly, Mapping): Transform the raw data into scientifically relevant forms

References and Data (Reads, Refs, 3°): Reference genomes, domain-specific data sets, raw data, analysis results

Page 11

Reads Refs 3°

Assembly Mapping

Annotation Variation

Interactions Expression

EDA Novel Apps

Algorithms Considerations

Data and Workflows

Graphs, index schemes, dynamic programming

Generally I/O bound with opportunities for parallelism.

Standard data formats are text-based

Can easily span 100s of TBs for a small lab. References are in constant flux.

Data is smaller, but analysis may require round trips back to reads.

Clustering, statistical models, network analysis

Information visualization, data-mining

Interactivity is essential.

Algorithms and Scale
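Dynamic programming is the classic technique behind sequence alignment. As an illustration, here is a minimal sketch of the Needleman-Wunsch global alignment score that keeps only two rows of the table in memory. The scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions chosen for the example, not parameters from the talk.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the Needleman-Wunsch global alignment score of a and b."""
    # Row 0: aligning the empty prefix of `a` against each prefix of `b`.
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]  # column 0: prefix of `a` against the empty string
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = previous[j] + gap        # gap in `b`
            left = current[j - 1] + gap   # gap in `a`
            current.append(max(diagonal, up, left))
        previous = current
    return previous[-1]


print(global_alignment_score("GATTACA", "GATTACA"))
print(global_alignment_score("ACGT", "AGT"))
print(global_alignment_score("TCAAGCAGCAGGA", "TCATAGAGCGGGA"))
```

The running time grows with the product of the two sequence lengths, which is why mappers usually pair an index lookup with dynamic programming on short candidate regions instead of aligning each read against a whole genome.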

Page 12

Reads Refs 3°

Assembly Mapping

Annotation Variation

Interactions Expression

EDA Novel Apps

Software Hardware

Graph libraries, pipeline management software, job schedulers

Clusters with high-memory nodes, fast access to storage.

Databases, ORM tools, flat files

Distributed file systems, fast SANs

Single nodes with fast access to storage

Scripting languages, data analysis libraries

Genome browsers, Matlab, stats tools, R, etc.

Workstations, laptops with fast access to storage

Data and Workflows

Software and Hardware

Page 13

Reads Refs 3°

Assembly Mapping

Annotation Variation

Interactions Expression

EDA Novel Apps

Current Potential

Rapid prototyping, pipeline management, utilities

Multi-processing, queuing to build out pipeline manager

Python parsers for various formats, BioPython, SAMTools; Pygr for reference and annotation access

Disco or Hadoop for distributed read management

Galaxy, user-developed tools

GUIs, Disco

Data and Workflows

Python and NGS

User tools, utilities shipped with assemblers and mappers, NGS libraries (HTSeq), NumPy, SciPy

More libraries!
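One way to picture the "multi-processing, queuing" idea is a pipeline stage that fans work out over a pool of workers. The sketch below is an assumption-laden illustration, not the pipeline manager the talk has in mind: it uses a thread pool from the standard library's `concurrent.futures` (reasonable for I/O-bound steps), and the stage function `count_reads` and the in-memory chunks are hypothetical stand-ins for real files.

```python
from concurrent.futures import ThreadPoolExecutor


def count_reads(lines):
    """Count FASTA-style records (lines starting with '>') in one chunk."""
    return sum(1 for line in lines if line.startswith(">"))


def run_stage(chunks, workers=4):
    """Apply count_reads to every chunk in parallel, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(count_reads, chunks))


# Hypothetical chunks standing in for separate read files.
chunks = [
    [">r1", "T0123", ">r2", "T3210"],
    [">r3", "T1111"],
    [],
]
per_chunk = run_stage(chunks)
total = sum(per_chunk)
print(per_chunk, total)
```

For CPU-bound stages, `concurrent.futures.ProcessPoolExecutor` offers the same `map` interface, so the stage function can stay unchanged when the executor is swapped.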

Page 14

Example: Chromosome 20 Expression Analysis

Reads

Refs

Annotation

Mapping

Expression

Visualization tracks: Reference Genome; Positive Strand (UHR, HBR, HELA); Negative Strand (UHR, HBR, HELA)

A portion of the Chromosome 20 expression map created using reads from SOLiD™ at Life Technologies’ Austin site. The full map spans over 260 feet when printed sequentially. Chromosome 20 represents 2.1% of the human genome.

Custom Python tool rendering to PDF using ReportLab

LifeTech WT pipeline for mapping and expression analysis

Custom Python scripts for result aggregation

14 Billion reads from three sample types. 1400+ files comprising 4TB of sequence data

Human Genome 36.3, Human RefSeq 39
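The result-aggregation step can be pictured as folding many per-file count tables into one table per sample type. The sketch below is a guess at the general shape of such a script, not the one used for this analysis; the gene names and counts are made up, and only the sample labels (UHR, HBR, HELA) come from the slide.

```python
from collections import defaultdict


def aggregate(per_file_counts):
    """Sum per-file gene counts into one table per sample type.

    per_file_counts is an iterable of (sample, {gene: count}) pairs.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for sample, counts in per_file_counts:
        for gene, n in counts.items():
            totals[sample][gene] += n
    # Convert to plain dicts so the result is easy to print or serialize.
    return {sample: dict(genes) for sample, genes in totals.items()}


# Hypothetical per-file results; a real run would yield one pair per file.
per_file = [
    ("UHR", {"geneA": 10, "geneB": 2}),
    ("HBR", {"geneA": 1}),
    ("UHR", {"geneA": 5}),
    ("HELA", {"geneB": 7}),
]
table = aggregate(per_file)
print(table)
```

Because `aggregate` accepts any iterable, the per-file results can be produced lazily, which matters when there are more than a thousand input files.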

Page 15

Cultural Challenges

•  Goals
   –  Most users care about the science and just want the technology to work

•  Skills
   –  Few biologists and bioinformaticians are also experts in HPC and tera-scale data mining

•  Expectations
   –  Excel provides answers instantly. NGS analysis should, too.

•  Cost
   –  I just spent $500k on an instrument and I need to spend how much on a computer and developer???

As recently as five years ago, computing in the Life Sciences consisted of pencils, lab notebooks, and the occasional Excel spreadsheet.

Needless to say, NGS caught the community completely off guard.

Page 16

Final Thoughts

•  NGS analysis is in its infancy
   –  Fundamental methods are still being developed
   –  The scientific community is still coming to grips with its potential and challenges
   –  There won’t be any silver bullets in the near future

•  NGS analysis is complicated by the scale of data
   –  Traditional supercomputing doesn’t help much

•  Development paradigms that simplify large data processing will succeed in this space

•  Many Python projects show promise, but there’s still work to be done!

Page 17

Thank You!

Life Technologies Austin

Bioinformatics: Jeff Schageman, Joel Brockman, Penn Whitley, Dan Williams

Transcriptomics: Sheila Heater, Kelli Bramlett, Diane Ilsley

The Powers that Be: Bob Setterquist, Tim Sendera

Life Technologies Global

Software and HPC: Lee Jones, Patrick LeGresley, Somalee Datta, Asim Siddiqui, Aaron Kitzmiller, Keith Moulton, Mike Lyons, Ying Zu

IT: Michael Moore, Antoine Uzzeni, David Morgan, Joanna Curlee

Sounding Boards

Peter Wang, Glen Otero, Travis Oliphant

© 2010 Life Technologies Corporation. All rights reserved. The trademarks mentioned herein are the property of Life Technologies Corporation or their respective owners. Void where prohibited, prohibited where void. For Research Use Only. Not intended for any animal or human therapeutic or diagnostic use.

Page 18

What People are Using

•  Compiled Languages:
   –  Equally distributed between C, C++, and Java

•  Interpreted Languages:
   –  Almost everyone uses Perl, with about half the respondents using R or Python
   –  But, for daily usage, about 31% used Python and 47% used Perl

•  For Libraries:
   –  75% used BioPerl on a regular basis. SciPy is the next most used at 33%. BioPython was low at 16%

•  For Statistics:
   –  R (70%), Excel (53%), and Perl (35%) were the most common

•  For Visualization:
   –  Excel was the most common, with R and GNUPlot also getting multiple responses

Page 19

Compute Environment

•  Peta-scale storage environment

•  Multi-core processors and multiple nodes
   –  though not as many as you think

•  64+ GB of RAM helps

•  Heterogeneous processors
   –  SIMD/GPUs

•  Multiple languages
   –  C, C++, and Java are common for ‘fast’ code
   –  Many tools ship with Perl or Python utilities

•  The cloud will matter at some point

Page 20

Ideal Stack

•  Must be useable by a wide range of developers and scientists

•  Standard data formats

•  Reference Management

•  Optimized kernel operations

•  Optimized record-based operations

•  Cluster aware pipeline management

•  Horizontally scalable record storage

•  Database integration for LIMs

•  Interactive visualization

