What are the challenges for Data Science?
Magnus Rattray, Director, University of Manchester Data Science Institute
Professor of Computational & Systems Biology, Faculty of Biology, Medicine & Health
University of Manchester
www.datascience.manchester.ac.uk
The Large Synoptic Survey Telescope:
• 3.2 Gpixel camera
• 2000 exposures per night
• 20 TB per night
• 10 year survey → 100 PB data
In its first month of operation, LSST will survey more of the Universe than all previous telescopes
Astronomy
Particle physics
Large Hadron Collider (ATLAS experiment)
• 1 billion proton-proton collisions every second
• Nominal output rate of detector: 68 TB/s
• Actual output rate to disk: 1.5 GB/s (reduced via fast identification of “interesting” events)
• Data rate of up to 100 TB per day, for up to 6 months per year, for 10-15 years → 200 PB
Commute-flow is a brand new geodemographic classification of commuting flows for England and Wales, based on origin-destination data from the 2011 Census, that has been used to analyse the spatial dynamics of commuting. An interactive toolkit is at www.commute-flow.net
26 million travel-to-work flows recorded in the 2011 Census for England and Wales
Hincks, S., Kingston, R., Webb, B. and Wong, C. (in press) A New Geodemographic Classification of Commuting Flows for England and Wales. International Journal of Geographic Information Science.
A new two-tier geodemographic typology of commuting patterns with 9 super-groups and a total of 40 groups. Each includes a pen portrait with an interactive flow map and radial chart.
Geography
Mental health
Sport
Swimming pool
Volleyball
1. Raw GPS data
2. Detection of geolocation visited
3. Geolocations visited
4. Identification of places visited
5. Places visited
6. Type of places and activities recognition
7. Out-of-home activities
Difrancesco et al. Out-of-home activity recognition from GPS data in schizophrenic patients. IEEE 29th International Symposium on Computer-Based Medical Systems (CBMS 2016).
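Step 2 of the pipeline above (detecting the geolocations visited from raw GPS fixes) can be illustrated with a simple stay-point detector. This is a minimal sketch of one common approach, not necessarily the method used in the cited paper: consecutive fixes that stay within a distance threshold of an anchor fix for at least a minimum duration are merged into one visited location. The function names, the 100 m radius, the 10 minute dwell time and the example coordinates are all assumptions chosen for illustration.

```python
import math
from typing import List, Tuple

# A GPS fix is (timestamp in seconds, latitude in degrees, longitude in degrees).
Fix = Tuple[float, float, float]


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two points on a spherical Earth."""
    radius_m = 6_371_000.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_m * math.asin(math.sqrt(a))


def stay_points(fixes: List[Fix],
                max_dist_m: float = 100.0,       # assumed radius of a "place"
                min_duration_s: float = 600.0    # assumed minimum dwell time
                ) -> List[Tuple[float, float, float, float]]:
    """Return (mean_lat, mean_lon, arrival_time, departure_time) for each stay."""
    stays = []
    i, n = 0, len(fixes)
    while i < n:
        # Extend j while successive fixes remain within max_dist_m of the anchor fix i.
        j = i + 1
        while j < n and haversine_m(fixes[i][1], fixes[i][2],
                                    fixes[j][1], fixes[j][2]) <= max_dist_m:
            j += 1
        duration = fixes[j - 1][0] - fixes[i][0]
        if duration >= min_duration_s:
            cluster = fixes[i:j]
            mean_lat = sum(f[1] for f in cluster) / len(cluster)
            mean_lon = sum(f[2] for f in cluster) / len(cluster)
            stays.append((mean_lat, mean_lon, fixes[i][0], fixes[j - 1][0]))
            i = j          # continue after the stay
        else:
            i += 1         # the anchor fix was recorded while moving
    return stays


# Synthetic track: 12 one-minute fixes at one spot, then 3 fixes moving north.
home = (53.4668, -2.2339)
track = [(60.0 * k, home[0], home[1]) for k in range(12)]
track += [(720.0 + 60.0 * k, home[0] + 0.01 * (k + 1), home[1]) for k in range(3)]
stays = stay_points(track)
```

The later steps (identifying places and recognising activity types) would then work on these stay points rather than on the raw fixes, which is what makes the pipeline tractable for long recordings.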
Respiratory health
Research is increasingly data-driven
Bottom-up modelling:
• Define model of system from assumed microscopic principles
• Develop a tractable approximation to “solve” the model
• Explore system properties for various parameter settings (e.g. growth rates, stationary properties, phase transitions)
• Test/refine/revise the model given experimental data
Data-driven modelling:
• Identify system variables that can be measured: the data
• Fit a generative or predictive statistical model to the data
• Make inferences, learn hidden variables, score models
Increasingly we are connecting these approaches – allowing for strong “mechanistic” prior knowledge within data-driven models
Challenges for Data Science
• Big data – scalability
• Complex data – modelling & inference
• Messy data – probability & statistics
• Human data – privacy, ethics, interaction
• Accessible data – openness, reproducibility
“Data handling is now the bottleneck. It costs more to analyze a genome than to sequence a genome.” David Haussler
High-throughput DNA sequencing
Example: Genomics
Genomics: big data
@SRR566546.970 HWUSI-EAS1673_11067_FC7070M:4:1:2299:1109 length=50
TTGCCTGCCTATCATTTTAGTGCCTGTGAGGTGGAGATGTGAGGATCAGT
+SRR566546.970 HWUSI-EAS1673_11067_FC7070M:4:1:2299:1109 length=50
hhhhhhhhhhghhghhhhhfhhhhhfffffe`ee[`X]b[d[ed`[Y[^[…]
@SRR566546.971 HWUSI-EAS1673_11067_FC7070M:4:1:2374:1108 length=50
GATTTGTATGAAAGTATACAACTAAAACTGCAGGTGGATCAGAGTAAGTC
+SRR566546.971 HWUSI-EAS1673_11067_FC7070M:4:1:2374:1108 length=50
hhhhgfhhcghghggfcffdhfehhhhcehdchhdhahehffffde`[…]
@SRR566546.972 HWUSI-EAS1673_11067_FC7070M:4:1:2438:1109 length=50
TGCATGATCTTCAGTGCCAGGACCTTATCAAGCGGTTTGGTCCCTTTGTT
+SRR566546.972 HWUSI-EAS1673_11067_FC7070M:4:1:2438:1109 length=50
dhhhgchhhghhhfhhhhhdhhhhehhghfhhhchfddffcffafhfghe
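The reads above are in FASTQ format: each record has four lines (an "@" header, the base sequence, a "+" separator line, and one quality character per base). A minimal sketch of reading such records is given below, using the SRR566546.972 read as the example. The parser name and record type are hypothetical, and the quality offset of 64 is an assumption based on the character range seen in these reads (older Illumina pipelines used that offset; more recent data generally uses 33).

```python
from typing import Iterable, Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    read_id: str
    sequence: str
    qualities: List[int]   # one Phred-scaled score per base


def parse_fastq(lines: Iterable[str], phred_offset: int = 64) -> Iterator[FastqRecord]:
    """Yield records from four-line FASTQ text (assumes no wrapped sequence lines)."""
    rows = [ln.rstrip("\n") for ln in lines if ln.strip()]
    if len(rows) % 4 != 0:
        raise ValueError("FASTQ input must contain four lines per record")
    for i in range(0, len(rows), 4):
        header, seq, plus, qual = rows[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality strings differ in length")
        # Each quality character encodes a score as (ASCII code - offset).
        yield FastqRecord(header[1:], seq, [ord(c) - phred_offset for c in qual])


example = """@SRR566546.972 HWUSI-EAS1673_11067_FC7070M:4:1:2438:1109 length=50
TGCATGATCTTCAGTGCCAGGACCTTATCAAGCGGTTTGGTCCCTTTGTT
+SRR566546.972 HWUSI-EAS1673_11067_FC7070M:4:1:2438:1109 length=50
dhhhgchhhghhhfhhhhhdhhhhehhghfhhhchfddffcffafhfghe
"""

records = list(parse_fastq(example.splitlines()))
record = records[0]
mean_quality = sum(record.qualities) / len(record.qualities)
```

Under the assumed offset, "h" decodes to a score of 40 and "d" to 36. Each base is stored together with a header and a quality value, which is part of why raw sequencing output is so much larger than the genome it describes.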
200 GB data for 60x coverage over human genome
20 PB for 100K genomes
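These figures can be checked with back-of-envelope arithmetic. The sketch below assumes a genome of 3.3 billion bases (the figure quoted elsewhere in this talk) and roughly one byte stored per sequenced base; the one-byte figure is an assumption standing in for the net effect of quality scores, headers and compression, not a measured value.

```python
GENOME_BASES = 3.3e9      # human genome size quoted in this talk
COVERAGE = 60             # each position sequenced about 60 times
BYTES_PER_BASE = 1.0      # assumption: about one stored byte per sequenced base
N_GENOMES = 100_000       # the "100K genomes" in the slide


def storage_bytes(genome_bases: float, coverage: float, bytes_per_base: float) -> float:
    """Approximate storage needed for one genome sequenced at a given coverage."""
    return genome_bases * coverage * bytes_per_base


per_genome_gb = storage_bytes(GENOME_BASES, COVERAGE, BYTES_PER_BASE) / 1e9
total_pb = per_genome_gb * 1e9 * N_GENOMES / 1e15
```

Under these assumptions `per_genome_gb` comes to 198 and `total_pb` to 19.8, in line with the rounded 200 GB and 20 PB on the slide.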
Roy et al. Science 2010
RNA-Seq: Transcriptomics
Bis-Seq, ChIP-Seq: Epigenomics
DNA-Seq: Genomics
HiC, ChIA-PET: Interactomics
Genomics: complex data
• DNA sequencing is an incredibly disruptive technology
• Genomics is not just about genomes! Many ‘omics layers
Lister, Pelizzola et al. Nature 2009
Genomics: messy data
• 111 reference “epigenomes”
• 2804 high-throughput sequencing datasets
• 1.5 x 10^11 mapped sequence reads
• >10^13 sequenced DNA bases (>1000 genomes)
Every new ‘omic layer is as big as a genome
Genomic & Precision medicine
Precision diagnosis & precision treatment
Prognostics & Theranostics
Informing prevention
New models of care at disease boundaries
Driving rapid innovation & adoption
Role of multi-omics
Linking ‘big’ data
Re-aligning incentives for commissioning – driven by science, research
Genomics – human data
“Genomics – the changing face of clinical care” Sue Hill, Chief Scientific Officer for England
• Life-course complexity indicates multiple (sub-)diseases
  – Usually starts young
  – May progress, remit or relapse over life
• Inconsistent gene-environment interactions indicate multiple (sub-)diseases
  – Variable effects of genetic polymorphisms, e.g. CD14
  – Variable treatment-setting interactions
Example: Asthmas Stretch Genomics
C allele associated
T allele associated
No association
CD14 Endotoxin Receptor
Simpson A et al. Endotoxin exposure, CD14, and allergic disease: an interaction between genes and the environment. Am J Respir Crit Care Med. 2006;174(4):386-92.
50-60% heritability in twin studies but <2% phenotype explained by current genomics
Slides from Iain Buchan
• Progression of allergy: Eczema → Asthma → Rhinitis
• Inferred from population summary →
• Assumed causal link between eczema – asthma & rhinitis
• Clinical response: target children with eczema to reduce progression to asthma
Received Wisdom: Atopic March
Spergel & Paller, 2003
World Allergy Organization, 2014
Ecologic Fallacy Revealed
Belgrave et al. Developmental Profiles of Eczema, Wheeze, and Rhinitis: Two Population-Based Birth Cohort Studies. PLoS Medicine 2014;21;11(10):e1001748.
MRC STELAR consortium working at scale across MAAS and ALSPAC cohorts
Model-based machine learning allowing for transitions between skin, lung and nasal allergies over time
Better Targets for ‘Omics
Disambiguate disease profiles to move toward causal modelling and efficient identification of mechanisms
Data Type | Large-scale Structural Changes | Balanced Translocations | Distant Consanguinity | Uniparental Disomy | Novel / Known Coding Variants | Novel / Known Non-coding Variants
Targeted gene sequencing | ✗ | ✗ | ✗ | ✗ | ✓ | ✗
SNP+ arrays | ✗/✓ | ✗ | ✓ | ✓ | ✗ | ✗
Array CGH* | ✗/✓ | ✗ | ✗ | ✗ | ✗ | ✗
Exome | ✗/✓ | ✗ | ✗/✓ | ✗/✓ | ✓ | ✗
Whole Genome | ✗/✓ | ✓ | ✓ | ✓ | ✓ | ✓
+ Single Nucleotide Polymorphism
* Comparative Genomic Hybridisation
[Figure: chart comparing sequencing approaches, with one axis running from 10,000 to 10,000,000,000. Labels: Genotyping; Panels (<10m bases, subset of exons); Exome (10m bases, exons only); Whole genome (3.3bn bases, both exons and introns).]
“Genomics – the changing face of clinical care” Sue Hill, Chief Scientific Officer for England
Towards genomic medicine
Genomics – accessible data?
• Sequencing 100,000 genomes from patients with cancer and rare diseases
• £24m data infrastructure award from MRC
• Genomics England Clinical Interpretation Partnerships (GeCIPs) to enhance value of data
• Sequencing facility at the Sanger Centre
• 30 PB data in a data centre on a military base
• Researchers (GeCIP members) will not be allowed to download raw data files
• Restricted access to data and compute through secure virtual desktop (Inuvika)
• Analysis has to move to the data
But how do we move this to a global scale? How do we analyse across many datasets?
100K Genomes Project
Next Genomic Revolution: Scaling down to single cells
Microfluidics sequencing/cytometry
DNA/RNA
Protein
Fluidigm C1
Single-cell data
• Existing genomic methods average over a cell population of ~10^7 cells
• Single-cell methods uncover hidden structure:
  – Diverse sub-populations of immune cells
  – Clonal structure within tumours
  – Rare circulating tumour cells from blood
  – Asynchronous cellular dynamics
  – Each cell is now a high-dimensional data point
Clustering single cell protein data
Amir et al. Nature Biotech. 2013
Uncovering clonal evolution in tumours
[Figure: clonal evolution in a tumour. Panels: “Life history of the tumor” (normal cells and clones arising over time t0, t1, t2, t3, tsample; tissue volume at time of sampling); “Poly-clonal tumor at sampling” (clones with genotypes A, AB, ABC and ABD; clone fractions of 15%, 20%, 25% and 40%); “Clonal evolution tree”.]
Florian Markowetz, CRUK Cambridge – from his blog “Scientific B-sides”
Approach
Targeted:
• Basic CNA to verify CTC status
• Target 1-20 genes
• Use WBCs as –ve controls
Genome Wide:
• Copy number alteration (CNA)
• WES - comprehensive analysis
• Use WBCs as –ve controls
6 SCLC patients chosen with ≥4 single isolated CTCs and CTC pools
CNA data from 6,682 cancer-related protein-coding genes
TP53
* Pool of 10 CTCs
Circulating tumour cells (CTC) profiling
Expanded study ongoing, 2000 CTCs from 30 patients
CTC enrichment via CellSearch
CTC isolation via DepArray
Caroline Dive and Ged Brady, CRUK Manchester Institute
Modelling challenge: confounding variation
Stegle et al. Nature Reviews Genetics 2014
Single cell data
Last year
Single-cell RNA-Seq: 10^3 cells per experiment, 10^7 sequence reads per cell, 10^4 features extracted per cell
CyTOF protein quantification: 10^3 cells per second, 10^6 per experiment, 30-50 features per cell
This year
Single-cell RNA-Seq: 10^6 cells per experiment, 10^8 reads per cell, >10^5 features per cell
Single cell multi-omics
?
What are the pinch points?
• Data volume: cost and transfer speed
• Data analysis: scalable algorithms
• Data quality: batch effects, missing data, missing metadata, concept drift
• Data integration: multi-modal modelling
• Reproducible and robust research
Data volume
• Move algorithms to the data
  – Put compute close to local data
  – Commercial cloud (e.g. BaseSpace, Cytobank)
  – Bespoke secure cloud (e.g. 100K genomes project)
• Issues to consider
  – Will your algorithms give the same results?
  – Will the analysis be reproducible in the future?
  – How to integrate across resources?
Data analysis
• Scaling up algorithms, e.g. Deep learning libraries integrating CPU/GPU architectures
• Fast approximate methods
• Online/streaming data processing
• Avoid solving compute-intensive intermediate tasks: e.g. avoid genomic alignment prior to counting sub-sequence matches (k-mers)
• Mixed precision numerics
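The k-mer point above can be made concrete: counting every length-k sub-sequence directly from the reads needs only a single pass over the data and no reference genome or alignment step. The following is a minimal in-memory sketch; the function name and toy reads are made up for illustration, and a production tool would use far more compact data structures than a Python `Counter`.

```python
from collections import Counter
from typing import Iterable


def count_kmers(reads: Iterable[str], k: int) -> Counter:
    """Count all length-k sub-sequences (k-mers) across a collection of reads."""
    if k <= 0:
        raise ValueError("k must be positive")
    counts: Counter = Counter()
    for read in reads:
        seq = read.upper()
        # Slide a window of width k along the read; reads shorter than k add nothing.
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:        # skip windows containing ambiguous base calls
                counts[kmer] += 1
    return counts


toy_reads = ["ACGTACGT", "CGTA", "ACNT"]
kmer_counts = count_kmers(toy_reads, k=4)
```

Because each read is processed independently, the same loop also fits the online/streaming setting mentioned above: counts can be updated as reads arrive rather than after the full dataset has been stored.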
Methods for Machine Learning no longer simply assessed on predictive accuracy
Data analysis
Data quality
Big collected data are typically not designed for a single research question (or any research question)
We need methods to deal with:
Confounders, batch effects, missing data, missing metadata, concept drift, outliers….
(while remaining scalable)
Robust and reproducible research
Publish data, code, workflows, version numbers, containers…
Results should not depend strongly on arbitrary modelling choices: “shake the model” (Chris Holmes)
“Hypothesis selection” leads to upward significance bias
• Try to break your models
• Use robust models
• Use bootstrapping
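As an illustration of the bootstrapping advice, the sketch below computes a percentile bootstrap confidence interval for a statistic by resampling the data with replacement. It is a minimal example under stated assumptions: the function name, the 2000 resamples, the fixed seed and the toy measurements are all choices made here for illustration, and the percentile interval is only one of several bootstrap interval constructions.

```python
import random
import statistics
from typing import Callable, Sequence, Tuple


def bootstrap_ci(data: Sequence[float],
                 stat: Callable[[Sequence[float]], float] = statistics.mean,
                 n_resamples: int = 2000,
                 alpha: float = 0.05,
                 seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for stat(data) at level 1 - alpha."""
    rng = random.Random(seed)          # fixed seed so the result is reproducible
    n = len(data)
    estimates = []
    for _ in range(n_resamples):
        # Draw n observations with replacement and recompute the statistic.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        estimates.append(stat(resample))
    estimates.sort()
    lower = estimates[int(n_resamples * alpha / 2)]
    upper = estimates[int(n_resamples * (1 - alpha / 2)) - 1]
    return lower, upper


measurements = [4.1, 5.0, 3.8, 4.7, 5.3, 4.4, 4.9, 5.1, 4.0, 4.6]
low, high = bootstrap_ci(measurements)
```

A wide interval is a warning that the estimate is sensitive to which observations happened to be collected, which is the kind of fragility the slide asks you to look for before trusting a result.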
Keep track of all hypotheses you have considered
• Store your working history – notebook science
• Publish negative results
Robust and reproducible research
• Build reproducibility into your routine – don’t wait until after your paper is accepted
• Don’t feature here:
Conclusion
• Research is increasingly data-driven across all fields – Data Science is now ubiquitous
• New challenges come from the scale, complexity and nature of data:
  – Big data – scalable algorithms and architectures
  – Complex data – better models: bottom up and top down
  – Messy data – statistical thinking is essential
  – Human data – ethical dimensions are of key importance
  – Accessible data – a valuable common resource