Date post: 18-Jun-2020
Bioinformatics: A perspective. Dr. Matthew L. Settles, Genome Center, University of California, Davis, settles@ucdavis.edu
Transcript
Page 1: Bioinformatics: A perspective · Bioinformatics: A perspective Dr. Matthew L. Settles Genome Center University of California, Davis settles@ucdavis.edu. Outline • The World we are

Bioinformatics: A perspective

Dr. Matthew L. Settles

Genome Center, University of California, Davis

settles@ucdavis.edu

Page 2:

Outline

• The World we are presented with
• Advances in DNA Sequencing
• Bioinformatics as Data Science
• Viewport into bioinformatics
• Training
• The Bottom Line

Page 3:

Sequencing Costs

[Figure: Cost per Megabase of Sequence; dollars (log scale, $0.1 to $1000) by year, 2005 to 2015]

[Figure: Cost per Human Sized Genome @ 30x; dollars (log scale, $1000 to $10,000,000) by year, 2005 to 2015]

• Includes: labor, administration, management, utilities, reagents, consumables, instruments (amortized over 3 years), informatics related to sequence production, submission, indirect costs.
• http://www.genome.gov/sequencingcosts/

October 2016: $0.014/Mb, $1245 per human-sized (30x) genome
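
As a rough sanity check on the two October 2016 figures, the per-genome cost can be approximated from the per-megabase cost. The sketch below assumes a human genome of about 3,000 Mb (an assumption, not a number from the slide) and counts raw sequence only; it is not a description of how the genome.gov figures are derived.

```python
# Rough sanity check: per-genome cost from per-megabase cost.
# GENOME_SIZE_MB is an assumed round number, not a value from the slide.
COST_PER_MB = 0.014      # dollars per megabase (October 2016, from the slide)
GENOME_SIZE_MB = 3_000   # assumed approximate human genome size in Mb
COVERAGE = 30            # 30x depth, as on the slide


def genome_cost(cost_per_mb: float, genome_mb: float, coverage: float) -> float:
    """Return the approximate cost of sequencing one genome at a given depth."""
    return cost_per_mb * genome_mb * coverage


estimate = genome_cost(COST_PER_MB, GENOME_SIZE_MB, COVERAGE)
print(f"Estimated cost at {COVERAGE}x: ${estimate:,.0f}")
```

Under that assumed genome size the estimate comes to about $1,260, within roughly 2% of the $1245 quoted on the slide.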

Page 4:

Growth in Public Sequence Database

• http://www.ncbi.nlm.nih.gov/genbank/statistics

WGS > 1 trillion bp

[Figure: Bases (log scale, 10^5 to 10^13) by year, 1990 to 2010; series: GenBank, WGS]

[Figure: Sequences (log scale, 10^2 to 10^8) by year, 1990 to 2010; series: GenBank, WGS]

Page 5:

Short Read Archive (SRA)

[Figure: Growth of the Sequence Read Archive (SRA) over time; log scale, 10^11 to 10^15, by year, 2000 to 2015; series: Bases, Bytes, Open Access Bases, Open Access Bytes]

> 1 quadrillion bp

http://www.ncbi.nlm.nih.gov/Traces/sra/

Page 6:

Increase in Genome Sequencing Projects

• JGI – Genomes Online Database (GOLD)
• 67,822 genome sequencing projects

Lists > 3700 unique genera

Page 7:

Brief History

Page 8:

Sequencing Platforms

• 1986 - Dye terminator Sanger sequencing; the technology dominated until 2005 and the arrival of “next generation sequencers”, peaking at about 900 kb/day

Page 9:

‘Next’ Generation

• 2005 – ‘Next Generation Sequencing’ as massively parallel sequencing, with advances in both throughput and speed. The first was the Genome Sequencer (GS) instrument developed by 454 Life Sciences (later acquired by Roche). Pyrosequencing, 1.5 Gb/day

Discontinued

Page 10:

Illumina

• 2006 – The second ‘Next Generation Sequencing’ platform was Solexa (later acquired by Illumina). Now the dominant platform, with 75% market share of sequencers; an estimated >90% of all bases sequenced are from an Illumina machine. Sequencing by Synthesis, >200 Gb/day.

New NovaSeq

Page 11:

Complete Genomics

• 2006 – Using DNA nanoball sequencing, has been a leader in human genome resequencing, having sequenced over 20,000 genomes to date. Purchased by BGI in 2013 and now set to release its first commercial sequencer, the Revolocity. Throughput on par with HiSeq

Human genome/exomes only.

10,000 Human Genomes per year

Page 12:

Benchtop Sequencers

• Roche 454 Junior

• Life Technologies
  • Ion Torrent
  • Ion Proton

• Illumina MiSeq

Page 13:

The ‘Next, Next’ Generation Sequencers (3rd Generation)

• 2009 – Single Molecule Real Time sequencing by Pacific Biosciences, the most successful of the third generation sequencing platforms. RS II ~2 Gb/day, newer PacBio Sequel ~14 Gb/day, near 100 kb reads.

Iso-seq on PacBio possible, transcriptome without ‘assembly’

SMRT Sequencing

Page 14:

Oxford Nanopore

• 2015 – Another 3rd generation sequencer, from a company founded in 2005, and currently in beta testing. The sequencer uses nanopore technology developed in the 90’s to sequence single molecules. Throughput is about 500 Mb per flowcell, capable of near 200 kb reads.

FYI: 4th generation sequencing is being described as in-situ sequencing

Fun to play with but results are highly variable

SmidgION: nanopore sensing for use with mobile devices

Nanopore Sequencing

Page 15:

Flexibility

[Diagram: DNA Sequence with Read 1 (50-300 bp), Read 2 (50-300 bp), Read 2 primer, Barcode (8 bp), Barcode Read primer]

[Diagram: Depth of Coverage from 1X to 100000X, across targets from Whole Genome to 1 KB; labels: Whole Genome, Reduction Techniques, RADseq, Capture Techniques, Fluidigm Access Array Amplicons, Few or Single Amplicons; Single Multiplexing to Greater Multiplexing]

Genomic reduction allows for greater coverage and multiplexing of samples.

You can fine tune your depth of coverage needs and sample size with the reduction technique.
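
The trade-off between target size, multiplexing and depth can be sketched with the usual back-of-the-envelope relation: mean depth ≈ (reads per sample × read length) / target size. The run output, sample counts and target sizes below are hypothetical values chosen only for illustration, and the sketch assumes reads are split evenly across samples.

```python
# Back-of-the-envelope depth of coverage under multiplexing.
# All numbers below are hypothetical and chosen only for illustration.

def depth_of_coverage(total_reads: int, read_length_bp: int,
                      target_size_bp: int, n_samples: int = 1) -> float:
    """Expected mean depth per sample when reads are split evenly."""
    reads_per_sample = total_reads / n_samples
    return reads_per_sample * read_length_bp / target_size_bp


RUN_READS = 400_000_000   # hypothetical reads produced by one run
READ_LENGTH = 150         # bp, within the 50-300 bp range on the slide

targets = {
    "whole genome (3 Gb, assumed)": 3_000_000_000,
    "reduced representation (30 Mb, assumed)": 30_000_000,
    "single amplicon (1 kb)": 1_000,
}

for name, size in targets.items():
    for n_samples in (1, 96):
        depth = depth_of_coverage(RUN_READS, READ_LENGTH, size, n_samples)
        print(f"{name:42s} samples={n_samples:3d} depth={depth:,.1f}x")
```

Under these assumed numbers, a 100-fold smaller target lets 96 multiplexed samples each reach about the same depth (roughly 21x) that a single whole genome gets from the entire run (20x).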

Page 16:

Sequencing Libraries

• DNA-seq
• RNA-seq
• Amplicons
• ChIP-seq
• MeDIP-seq
• RAD-seq
• ddRAD-seq
• Pool-seq
• EnD-seq
• DNase-seq
• ATAC-seq
• MNase-seq
• FAIRE-seq
• Ribose-seq
• smRNA-seq
• mRNA-seq
• Tn-seq
• QTL-seq
• tagRNA-seq
• PAT-seq
• Structure-seq
• MPE-seq
• STARR-seq
• Mod-seq
• BrAD-seq
• SLAF-seq
• G&T-seq

Page 17:

omicsmaps.com

Page 18:

The data deluge

• Plucking the biology from the noise

Page 19:

Reality

• It's much more difficult than we may first think

Page 20:

The real cost of sequencing

[Figure: Share of total cost (0% to 100%) for Pre-NGS (approximately 2000), Now (approximately 2010) and Future (approximately 2020); categories: Sample collection and experimental design, Sequencing, Data reduction, Data management, Downstream analyses]

Sboner et al. Genome Biology 2011 12:125 doi:10.1186/gb-2011-12-8-125

Page 21:

[Diagram labels: Bioinformatics, Biology, Computer Science, Math/Statistics, Biostatistics, Computational Biology]

‘The data scientist role has been described as “part analyst, part artist.”’ Anjul Bhambhri, vice president of big data products at IBM

Bioinformatics is Data Science

Page 22:

Data Science

Data science is the process of formulating a quantitative question that can be answered with data, collecting and cleaning the data, analyzing the data, and communicating the answer to the question to a relevant audience.

Five Fundamental Concepts of Data Science, statisticsviews.com, November 11, 2013, by Kirk Borne

Page 23:

7 Stages to Data Science

1. Define the question of interest
2. Get the data
3. Clean the data
4. Explore the data
5. Fit statistical models
6. Communicate the results
7. Make your analysis reproducible

Page 24:

1. Define the question of interest

Begin with the end in mind!
• what is the question
• how are we to know we are successful
• what are our expectations

The question dictates
• the data that should be collected
• the features being analyzed
• which algorithms should be used

Page 25:

2. Get the data
3. Clean the data
4. Explore the data

Know your data!
• know what the source was
• technical processing in producing data (bias, artifacts, etc.)
• “Data Profiling”

Data are never perfect, but love your data anyway!
• the collection of massive data sets often leads to unusual, surprising, unexpected and even outrageous.
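
“Data profiling” can start very simply. The sketch below is a minimal, assumed example for sequence reads: it summarizes read length, GC content and the fraction of ambiguous (N) bases. The reads are made up, and the function name `profile_reads` is hypothetical rather than part of any particular tool.

```python
# Minimal data-profiling sketch for sequence reads (made-up example data).
from statistics import mean


def profile_reads(reads: list[str]) -> dict[str, float]:
    """Summarize basic properties of a list of DNA reads."""
    lengths = [len(r) for r in reads]
    total_bases = sum(lengths)
    gc = sum(r.upper().count("G") + r.upper().count("C") for r in reads)
    n = sum(r.upper().count("N") for r in reads)
    return {
        "n_reads": len(reads),
        "min_length": min(lengths),
        "max_length": max(lengths),
        "mean_length": mean(lengths),
        "gc_fraction": gc / total_bases,
        "n_fraction": n / total_bases,
    }


# Hypothetical reads; the last one is short and full of Ns, the kind of
# artifact worth noticing before any downstream analysis.
example_reads = ["ACGTACGTAC", "GGGCCCATAT", "ACGTTTGACA", "NNNNAC"]

for key, value in profile_reads(example_reads).items():
    print(f"{key:12s} {value:.3f}")
```

Even a summary this small flags the odd read: a minimum length well below the mean and a non-zero N fraction are both prompts to go back and ask how the data were produced.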

Page 26:

5. Fit statistical models

Overfitting is a sin against data science!

Models should not be over-complicated

• If the data scientist has done their job correctly, the statistical models don't need to be incredibly complicated to identify important relationships

• In fact, if a complicated statistical model seems necessary, it often means that you don't have the right data to answer the question you really want to answer.
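
One way to see overfitting is to compare a simple model with one that memorizes its training data. The sketch below uses simulated data whose true relationship is a straight line, and uses 1-nearest-neighbor prediction as a stand-in for an over-flexible model; both choices are illustrative assumptions, not something taken from the slides.

```python
# Overfitting sketch: a model that memorizes the training data versus a
# simple straight-line fit. Simulated data; the true relationship is linear.
import random

random.seed(0)  # fixed seed so the run is repeatable


def simulate(n: int) -> list[tuple[float, float]]:
    """Draw n points from y = 2x + 1 plus Gaussian noise (sd = 1)."""
    points = []
    for _ in range(n):
        x = random.uniform(0, 10)
        points.append((x, 2 * x + 1 + random.gauss(0, 1)))
    return points


def fit_line(data):
    """Ordinary least squares for y = a*x + b; returns a predictor."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    sxy = sum((x - mx) * (y - my) for x, y in data)
    sxx = sum((x - mx) ** 2 for x, _ in data)
    a = sxy / sxx
    b = my - a * mx
    return lambda x: a * x + b


def fit_nearest_neighbor(data):
    """1-nearest-neighbor: predicts the y of the closest training x."""
    return lambda x: min(data, key=lambda p: abs(p[0] - x))[1]


def mse(model, data) -> float:
    """Mean squared error of a predictor on a data set."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)


train, test = simulate(300), simulate(300)
line, memorizer = fit_line(train), fit_nearest_neighbor(train)

for name, model in (("straight line", line), ("1-nearest-neighbor", memorizer)):
    print(f"{name:20s} train MSE={mse(model, train):.2f} "
          f"test MSE={mse(model, test):.2f}")
```

The memorizing model has zero training error by construction, yet it does worse than the straight line on the held-out data because it has fitted the noise along with the signal.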

Page 27:

6. Communicate the results
7. Make your analysis reproducible

Remember that this is ‘science’! We are experimenting with data selections, processing, algorithms, ensembles of algorithms, measurements, models. At some point these must all be tested for validity and applicability to the problem you are trying to solve.

Page 28:

Data science done well looks easy – and that’s a big problem for data scientists

simplystatistics.org, March 3, 2015, by Jeff Leek

Page 29:

Training: Data Science Bias

Data Science (data analysis, bioinformatics) is most often taught through an apprentice model

Different disciplines/regions develop their own subcultures, and decisions are based on cultural conventions rather than empirical evidence.
• Programming languages
• Statistical models (Bayes vs. Frequentist)
• Multiple testing correction
• Application choice, etc.

These (and other) decisions matter a lot in data analysis

"I saw it in a widely-cited paper in journal XX from my field"
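
To make the point about conventions concrete, here is a small sketch of how the choice of multiple testing correction changes the number of “significant” results from the same p-values. It compares no correction, Bonferroni, and the Benjamini-Hochberg procedure; the p-values are made up for illustration.

```python
# Two common multiple-testing corrections applied to the same p-values.
# The p-values are made up for illustration.

def bonferroni(pvalues: list[float], alpha: float = 0.05) -> list[bool]:
    """Reject H0 where p <= alpha / m (controls family-wise error rate)."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]


def benjamini_hochberg(pvalues: list[float], alpha: float = 0.05) -> list[bool]:
    """Benjamini-Hochberg step-up procedure (controls false discovery rate)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank k with p_(k) <= (k / m) * alpha.
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * alpha:
            largest_k = rank
    # Reject every hypothesis ranked at or below that k.
    rejected = [False] * m
    for idx in order[:largest_k]:
        rejected[idx] = True
    return rejected


pvals = [0.001, 0.008, 0.012, 0.019, 0.030, 0.045, 0.200, 0.500, 0.700, 0.900]

print("uncorrected :", sum(p <= 0.05 for p in pvals))
print("Bonferroni  :", sum(bonferroni(pvals)))
print("BH (FDR)    :", sum(benjamini_hochberg(pvals)))
```

With these made-up p-values the three choices give 6, 1 and 4 rejections respectively, so the convention a field happens to follow changes the answer it reports.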

Page 30:

The Data Science in Bioinformatics

Bioinformatics is not something you are taught, it’s a way of life

Mick Watson – Roslin Institute

“The best bioinformaticians I know are problem solvers – they start the day not knowing something, and they enjoy finding out (themselves) how to do it. It’s a great skill to have, but for most, it’s not even a skill – it’s a passion, it’s a way of life, it’s a thrill. It’s what these people would do at the weekend (if their families let them).”

Page 31:

Models

• Workshops
  • Often enrolled too late

• Collaborations
  • More experienced persons

• Apprenticeships
  • Previous lab personnel to young personnel

• Formal Education
  • Most programs are graduate level
  • Few undergraduate

Page 32:

Bioinformatics

• Know and understand the experiment
  • “The Question of Interest”

• Build a set of assumptions/expectations
  • Mix of technical and biological
  • Spend your time testing your assumptions/expectations
  • Don’t spend your time finding the “best” software

• Don’t under-estimate the time bioinformatics may take
• Be prepared to accept ‘failed’ experiments

Page 33:

Bottom Line

The Bottom Line: Spend the time (and money) planning and producing good quality, accurate and sufficient data for your experiment.

Get to know your data, develop and test expectations

As a result, you’ll spend much less time (and less money) extracting biological significance and results during analysis.

Page 34:

Substrate

Cluster Computing

Cloud Computing

BASTM Laptop & Desktop LINUX

Page 35:

Environment

“Command Line” and “Programming Languages”

VS

Bioinformatics Software Suite

Page 36:

Prerequisites

• Access to a multi-core (24 CPU or greater), ‘high’ memory (64 GB or greater) Linux server.
• Familiarity with the ‘command line’ and at least one programming language.
• Basic knowledge of how to install software
• Basic knowledge of R (or equivalent) and statistical programming
• Basic knowledge of statistics and model building

