Post on 16-Sep-2020

What are the challenges for Data Science?

Magnus Rattray
Director, University of Manchester Data Science Institute

Professor of Computational & Systems Biology
Faculty of Biology, Medicine & Health

University of Manchester

www.datascience.manchester.ac.uk

The Large Synoptic Survey Telescope:
• 3.2 Gpixel camera
• 2000 exposures per night
• 20 TB per night
• 10-year survey → 100 PB data

In its first month of operation, LSST will survey more of the Universe than all previous telescopes

Astronomy

Particle physics

Large Hadron Collider (ATLAS experiment)
• 1 billion proton-proton collisions every second
• Nominal output rate of detector: 68 TB/s
• Actual output rate to disk: 1.5 GB/s (reduced via fast identification of “interesting” events)
• Data rate of up to 100 TB per day, for up to 6 months per year, for 10-15 years → 200 PB
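As a rough check on the figures above, the quoted daily rate can be multiplied out. The sketch below assumes 30-day months and decimal units (1 PB = 1000 TB); both are assumptions made for illustration, not values from the slide.

```python
# Back-of-envelope check of the LHC storage estimate quoted above.
TB_PER_DAY = 100        # upper bound on the daily data rate (from the slide)
MONTHS_PER_YEAR = 6     # upper bound on running time per year (from the slide)
DAYS_PER_MONTH = 30     # assumption: 30-day months
TB_PER_PB = 1000        # assumption: decimal units


def total_petabytes(years: int) -> float:
    """Upper-bound data volume in PB after the given number of years."""
    days = years * MONTHS_PER_YEAR * DAYS_PER_MONTH
    return days * TB_PER_DAY / TB_PER_PB


low, high = total_petabytes(10), total_petabytes(15)
print(f"10 years: {low:.0f} PB, 15 years: {high:.0f} PB")
```

Under these assumptions the bound is 180 PB to 270 PB, so the quoted 200 PB sits inside the range.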

Commute-flow is a brand new geodemographic classification of commuting flows for England and Wales, based on origin-destination data from the 2011 Census, that has been used to analyse the spatial dynamics of commuting. An interactive toolkit is at www.commute-flow.net

26 million travel-to-work flows recorded in the 2011 Census for England and Wales

Hincks, S., Kingston, R., Webb, B. and Wong, C. (in press) A New Geodemographic Classification of Commuting Flows for England and Wales. International Journal of Geographical Information Science.

A new two-tier geodemographic typology of commuting patterns with 9 super-groups and a total of 40 groups. Each includes a pen portrait with an interactive flow map and radial chart.

Geography

Mental health

Sport

Swimming pool

Volleyball

1. Raw GPS data
2. Detection of geolocation visited
3. Geolocations visited
4. Identification of places visited
5. Places visited
6. Type of places and activities recognition
7. Out-of-home activities

Difrancesco et al. Out-of-home activity recognition from GPS data in schizophrenic patients. IEEE 29th International Symposium on Computer-Based Medical Systems (CBMS 2016).
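The first stages of the pipeline above (raw GPS data to geolocations visited) can be sketched with a generic stay-point heuristic. This is not the method of Difrancesco et al.; the distance and duration thresholds, the function names and the example coordinates are all hypothetical choices for illustration.

```python
import math
from typing import List, Tuple

Fix = Tuple[float, float, float]  # (timestamp in seconds, latitude, longitude)


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two points."""
    r = 6_371_000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def stay_points(fixes: List[Fix], max_dist_m: float = 100.0, min_stay_s: float = 600.0):
    """Return (mean_lat, mean_lon, arrival, departure) for each detected stay."""
    stays = []
    i, n = 0, len(fixes)
    while i < n:
        j = i + 1
        # Extend the window while fixes remain within max_dist_m of fix i.
        while j < n and haversine_m(fixes[i][1], fixes[i][2],
                                    fixes[j][1], fixes[j][2]) <= max_dist_m:
            j += 1
        if fixes[j - 1][0] - fixes[i][0] >= min_stay_s:
            window = fixes[i:j]
            lat = sum(f[1] for f in window) / len(window)
            lon = sum(f[2] for f in window) / len(window)
            stays.append((lat, lon, fixes[i][0], fixes[j - 1][0]))
            i = j  # continue after the stay
        else:
            i += 1
    return stays


# Made-up track: 15 one-minute fixes at one spot, then 5 fixes moving north.
track = [(t * 60.0, 53.4668, -2.2339) for t in range(15)]
track += [(900.0 + k * 60.0, 53.4668 + 0.01 * (k + 1), -2.2339) for k in range(5)]
print(stay_points(track))
```

Later stages (matching stays to known places and recognising activity types) would need external place data and are not attempted here.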

Respiratory health

Research is increasingly data-driven

Bottom-up modelling:
• Define model of system from assumed microscopic principles
• Develop a tractable approximation to “solve” the model
• Explore system properties for various parameter settings (e.g. growth rates, stationary properties, phase transitions)
• Test/refine/revise the model given experimental data

Data-driven modelling:
• Identify system variables that can be measured: the data
• Fit a generative or predictive statistical model to the data
• Make inferences, learn hidden variables, score models

Increasingly we are connecting these approaches – allowing for strong “mechanistic” prior knowledge within data-driven models
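One simple way to read the last point: take a mechanistic model (here exponential growth, log y(t) = log y0 + r·t) and estimate its growth rate r from data, with a Gaussian prior on r standing in for the mechanistic knowledge. The sketch below is an illustration under those assumptions only; the model, the known noise level and all numbers are hypothetical.

```python
import math


def posterior_growth_rate(times, log_counts, log_y0, noise_sd, prior_mean, prior_sd):
    """Posterior mean and sd of r in log y(t) = log_y0 + r*t + Gaussian noise,
    given a Gaussian prior r ~ N(prior_mean, prior_sd**2)."""
    prior_prec = 1.0 / prior_sd ** 2
    # Precision and linear term contributed by the data (conjugate Gaussian update).
    data_prec = sum(t * t for t in times) / noise_sd ** 2
    data_term = sum(t * (z - log_y0) for t, z in zip(times, log_counts)) / noise_sd ** 2
    post_prec = prior_prec + data_prec
    post_mean = (prior_prec * prior_mean + data_term) / post_prec
    return post_mean, math.sqrt(1.0 / post_prec)


# Hypothetical data generated with r = 0.5; the prior is centred on 0.3.
times = [0.0, 1.0, 2.0, 3.0, 4.0]
log_y0 = math.log(100.0)
log_counts = [log_y0 + 0.5 * t for t in times]

mean, sd = posterior_growth_rate(times, log_counts, log_y0,
                                 noise_sd=0.2, prior_mean=0.3, prior_sd=0.1)
print(f"posterior r: mean={mean:.3f}, sd={sd:.3f}")
```

The posterior mean lands between the prior mean and the value the data alone would suggest, and moves towards the data as more observations arrive.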

Challenges for Data Science

• Big data – scalability
• Complex data – modelling & inference
• Messy data – probability & statistics
• Human data – privacy, ethics, interaction
• Accessible data – openness, reproducibility

“Data handling is now the bottleneck. It costs more to analyze a genome than to sequence a genome.” David Haussler

High-throughput DNA sequencing

Example: Genomics

Genomics: big data

@SRR566546.970 HWUSI-EAS1673_11067_FC7070M:4:1:2299:1109 length=50
TTGCCTGCCTATCATTTTAGTGCCTGTGAGGTGGAGATGTGAGGATCAGT
+SRR566546.970 HWUSI-EAS1673_11067_FC7070M:4:1:2299:1109 length=50
hhhhhhhhhhghhghhhhhfhhhhhfffffe`ee[`X]b[d[ed`[Y[^Y
@SRR566546.971 HWUSI-EAS1673_11067_FC7070M:4:1:2374:1108 length=50
GATTTGTATGAAAGTATACAACTAAAACTGCAGGTGGATCAGAGTAAGTC
+SRR566546.971 HWUSI-EAS1673_11067_FC7070M:4:1:2374:1108 length=50
hhhhgfhhcghghggfcffdhfehhhhcehdchhdhahehffffde`bVd
@SRR566546.972 HWUSI-EAS1673_11067_FC7070M:4:1:2438:1109 length=50
TGCATGATCTTCAGTGCCAGGACCTTATCAAGCGGTTTGGTCCCTTTGTT
+SRR566546.972 HWUSI-EAS1673_11067_FC7070M:4:1:2438:1109 length=50
dhhhgchhhghhhfhhhhhdhhhhehhghfhhhchfddffcffafhfghe
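The records above are in FASTQ format: four lines per read (identifier, bases, separator, per-base quality characters). The sketch below parses the first record and decodes its quality string. The Phred+64 offset is an assumption, chosen because characters such as `h` fit the older Illumina encoding; many current files use an offset of 33. The function name is illustrative.

```python
from typing import Iterable, Iterator, List, Tuple

PHRED_OFFSET = 64  # assumption: older Illumina (Phred+64) quality encoding


def parse_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (read_id, sequence, quality_scores) from four-line FASTQ records."""
    it = iter(lines)
    for header in it:
        header = header.strip()
        if not header:
            continue  # skip blank lines between records
        seq = next(it, "").strip()
        plus = next(it, "").strip()
        qual = next(it, "").strip()
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        read_id = header[1:].split()[0]
        yield read_id, seq, [ord(c) - PHRED_OFFSET for c in qual]


# First record from the slide.
record = """@SRR566546.970 HWUSI-EAS1673_11067_FC7070M:4:1:2299:1109 length=50
TTGCCTGCCTATCATTTTAGTGCCTGTGAGGTGGAGATGTGAGGATCAGT
+SRR566546.970 HWUSI-EAS1673_11067_FC7070M:4:1:2299:1109 length=50
hhhhhhhhhhghhghhhhhfhhhhhfffffe`ee[`X]b[d[ed`[Y[^Y
"""

for read_id, seq, quals in parse_fastq(record.splitlines()):
    print(read_id, len(seq), min(quals), max(quals))
```

Because it consumes an iterator line by line, the same function could be fed an open file handle without loading the whole file into memory.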

200 GB data for 60x coverage over human genome
20 PB for 100K genomes

Roy et al. Science 2010

RNA-Seq: Transcriptomics

Bis-Seq, ChIP-Seq: Epigenomics

DNA-Seq: Genomics

HiC, ChIA-PET: Interactomics

Genomics: complex data
• DNA sequencing is an incredibly disruptive technology
• Genomics is not just about genomes! Many ‘omics layers

Lister, Pelizzola et al. Nature 2009

Genomics: messy data

• 111 reference “epigenomes”
• 2804 high-throughput sequencing datasets
• 1.5 × 10^11 mapped sequence reads
• >10^13 sequenced DNA bases (>1000 genomes)

Every new ‘omic layer is as big as a genome

Genomic & Precision medicine

Precision diagnosis & precision treatment

Prognostics & Theranostics

Informing prevention

New models of care at disease boundaries

Driving rapid innovation & adoption

Role of multi-omics

Linking ‘big’ data

Re-aligning incentives for commissioning – driven by science, research

Genomics – human data

“Genomics – the changing face of clinical care” Sue Hill, Chief Scientific Officer for England

• Life-course complexity indicates multiple (sub-)diseases
– Usually starts young
– May progress, remit or relapse over life

• Inconsistent gene-environment interactions indicate multiple (sub-)diseases
– Variable effects of genetic polymorphisms, e.g. CD14
– Variable treatment-setting interactions

Example: Asthmas Stretch Genomics

C allele associated

T allele associated

No association

CD14 Endotoxin Receptor

Simpson A et al. Endotoxin exposure, CD14, and allergic disease: an interaction between genes and the environment. Am J Respir Crit Care Med. 2006;174(4):386-92.

50-60% heritability in twin studies but <2% phenotype explained by current genomics

Slides from Iain Buchan

• Progression of allergy: Eczema → Asthma → Rhinitis
• Inferred from population summary →
• Assumed causal link between eczema – asthma & rhinitis
• Clinical response: target children with eczema to reduce progression to asthma

Received Wisdom: Atopic March

Spergel & Paller, 2003

World Allergy Organization, 2014

Ecologic Fallacy Revealed

Belgrave et al. Developmental Profiles of Eczema, Wheeze, and Rhinitis: Two Population-Based Birth Cohort Studies. PLoS Medicine 2014;11(10):e1001748.

MRC STELAR consortium working at scale across MAAS and ALSPAC cohorts

Model-based machine learning allowing for transitions between skin, lung and nasal allergies over time

Better Targets for ‘Omics

Belgrave et al. Developmental Profiles of Eczema, Wheeze, and Rhinitis: Two Population-Based Birth Cohort Studies. PLoS Medicine 2014;11(10):e1001748.

Disambiguate disease profiles to move toward causal modelling and efficient identification of mechanisms

Data Type | Large-scale Structural Changes | Balanced Translocations | Distant Consanguinity | Uniparental Disomy | Novel / Known Coding Variants | Novel / Known Non-coding Variants
Targeted gene sequencing | ✗ | ✗ | ✗ | ✗ | ✓ | ✗
SNP+ arrays | ✗/✓ | ✗ | ✓ | ✓ | ✗ | ✗
Array CGH* | ✗/✓ | ✗ | ✗ | ✗ | ✗ | ✗
Exome | ✗/✓ | ✗ | ✗/✓ | ✗/✓ | ✓ | ✗
Whole Genome | ✗/✓ | ✓ | ✓ | ✓ | ✓ | ✓

+ Single Nucleotide Polymorphism
* Comparative Genomic Hybridisation

[Figure: chart with a logarithmic axis from 10,000 to 10,000,000,000 and a second axis from 0 to 2.5, comparing sequencing approaches]
• Genotyping
• Panels: <10m bases, subset of exons
• Exome: 10m bases, exons only
• Whole genome: 3.3bn bases, both exons and introns

“Genomics – the changing face of clinical care” Sue Hill, Chief Scientific Officer for England

Towards genomic medicine

Genomics – accessible data?

• Sequencing 100,000 genomes from patients with cancer and rare diseases
• £24m data infrastructure award from MRC
• Genomics England Clinical Interpretation Partnerships (GeCIPs) to enhance value of data
• Sequencing facility at the Sanger Centre
• 30 PB data in a data centre on a military base
• Researchers (GeCIP members) will not be allowed to download raw data files
• Restricted access to data and compute through secure virtual desktop (Inuvika)
• Analysis has to move to the data

But how do we move this to a global scale? How do we analyse across many datasets?

100K Genomes Project

Next Genomic Revolution: Scaling down to single cells

Microfluidics sequencing/cytometry

DNA/RNA

Protein

Fluidigm C1

Single-cell data

• Existing genomic methods average over a cell population of ~10^7 cells

• Single-cell methods uncover hidden structure:
– Diverse sub-populations of immune cells
– Clonal structure within tumours
– Rare circulating tumour cells from blood
– Asynchronous cellular dynamics
– Each cell is now a high-dimensional data point

Clustering single cell protein data

Amir et al. Nature Biotech. 2013

Uncovering clonal evolution in tumours

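[Figure: clonal evolution in a tumour. Panel “Life history of the tumor”: clones arising from normal cells over time (t0, t1, t2, t3, tsample). Panel “Poly-clonal tumor at sampling”: tissue volume at time of sampling divided between clones with genotypes A, ABD and ABC (clone fractions shown: 20%, 15%, 25%, 40%). Panel “Clonal evolution tree”: nodes 0, A, AB, ABD, ABC.]

Florian Markowetz, CRUK Cambridge – from his blog “Scientific B-sides”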

Approach

Targeted:
• Basic CNA to verify CTC status
• Target 1-20 genes
• Use WBCs as –ve controls

Genome Wide:
• Copy number alteration (CNA)
• WES - comprehensive analysis
• Use WBCs as –ve controls

6 SCLC patients chosen with ≥4 single isolated CTCs and CTC pools
CNA data from 6,682 cancer-related protein-coding genes

TP53

* Pool of 10 CTCs

Circulating tumour cells (CTC) profiling

Expanded study ongoing, 2000 CTCs from 30 patients

CTC enrichment via CellSearch
CTC isolation via DEPArray

Caroline Dive and Ged Brady, CRUK Manchester Institute

Modelling challenge: confounding variation

Stegle et al. Nature Reviews Genetics 2014

Single cell data

Last year

Single-cell RNA-Seq
10^3 cells per experiment
10^7 sequence reads per cell
10^4 features extracted per cell

CyTOF protein quantification
10^3 cells per second
10^6 per experiment
30-50 features per cell

This year

Single-cell RNA-Seq
10^6 cells per experiment
10^8 reads per cell
>10^5 features per cell

Single cell multi-omics

?

What are the pinch points?

• Data volume: cost and transfer speed
• Data analysis: scalable algorithms
• Data quality: batch effects, missing data, missing metadata, concept drift
• Data integration: multi-modal modelling
• Reproducible and robust research

Data volume

• Move algorithms to the data
– Put compute close to local data
– Commercial cloud (e.g. BaseSpace, Cytobank)
– Bespoke secure cloud (e.g. 100K genomes project)

• Issues to consider
– Will your algorithms give the same results?
– Will the analysis be reproducible in the future?
– How to integrate across resources?

Data analysis

• Scaling up algorithms, e.g. Deep learning libraries integrating CPU/GPU architectures
• Fast approximate methods
• Online/streaming data processing
• Avoid solving compute-intensive intermediate tasks: e.g. avoid genomic alignment prior to counting sub-sequence matches (k-mers)
• Mixed precision numerics
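The k-mer point can be illustrated directly: sub-sequences of length k are counted straight from the reads, with no alignment to a reference genome. The sketch below is a minimal in-memory version; the reads are made up, and production tools typically use far more memory-efficient data structures.

```python
from collections import Counter
from typing import Iterable


def count_kmers(reads: Iterable[str], k: int) -> Counter:
    """Count every length-k sub-sequence across reads, skipping ambiguous bases."""
    counts: Counter = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" not in kmer:  # ignore k-mers containing uncalled bases
                counts[kmer] += 1
    return counts


# Hypothetical reads, for illustration only.
reads = ["ACGTACGT", "CGTAC"]
counts = count_kmers(reads, k=4)
print(counts.most_common(3))
```

Reads are processed one at a time, so the same loop also fits the online/streaming setting listed above: only the table of counts has to stay in memory.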

Methods for Machine Learning are no longer assessed simply on predictive accuracy

Data analysis

Data quality

Big collected data are typically not designed for a single research question (or any research question)

We need methods to deal with:

Confounders, batch effects, missing data, missing metadata, concept drift, outliers…

(while remaining scalable)

Robust and reproducible research

Publish data, code, workflows, version numbers, containers…

Results should not depend strongly on arbitrary modelling choices: “shake the model” (Chris Holmes)

“Hypothesis selection” leads to upward significance bias
• Try to break your models
• Use robust models
• Use bootstrapping
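As an illustration of the bootstrapping advice, the sketch below computes a percentile bootstrap confidence interval for a sample mean by resampling the data with replacement. The sample values, the number of resamples and the fixed seed are hypothetical choices; this is one common variant of the bootstrap, not a method prescribed by the slides.

```python
import random
import statistics


def bootstrap_ci(data, stat=statistics.mean, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat(data)."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n = len(data)
    estimates = sorted(
        stat([data[rng.randrange(n)] for _ in range(n)])  # resample with replacement
        for _ in range(n_resamples)
    )
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Made-up measurements, for illustration only.
sample = [2.1, 2.5, 1.9, 3.2, 2.8, 2.4, 2.0, 3.0, 2.6, 2.2]
lower, upper = bootstrap_ci(sample)
print(f"mean = {statistics.mean(sample):.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

Recording the seed alongside the result is a small instance of the reproducibility habits recommended on this slide.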

Keep track of all hypotheses you have considered
• Store your working history – notebook science
• Publish negative results

Robust and reproducible research
• Build reproducibility into your routine – don’t wait until after your paper is accepted
• Don’t feature here:

Conclusion

• Research is increasingly data-driven across all fields – Data Science is now ubiquitous

• New challenges come from the scale, complexity and nature of data:
– Big data: scalable algorithms and architectures
– Complex data: better models, bottom up and top down
– Messy data: statistical thinking is essential
– Human data: ethical dimensions are of key importance
– Accessible data: a valuable common resource