Date post: 19-Aug-2020
2015 - BMMB 852D: Applied Bioinformatics, Week 13, Lecture 25, István Albert, Biochemistry and Molecular Biology and Bioinformatics Consulting Center, Penn State
Page 1: 2015 - BMMB 852D: Applied Bioinformatics

Week 13, Lecture 25

István Albert

Biochemistry and Molecular Biology and Bioinformatics Consulting Center

Penn State

Page 2: Genome representation concepts

•  At the simplest level of abstraction the genome is represented by a one-dimensional “space” (lines)

•  The genome is two-stranded → a line corresponds to each strand

•  Each strand has a polarity → each line has a direction

•  Strands (lines) are paired

•  The smallest unit is one base → one integer on the number line

•  Annotations (features) are segments (coordinates) on each line

Page 3: Genomic coordinates: brief overview

DNA is two-stranded and directional, but there is only one coordinate system.

Standard formats use start < end even for the reverse strand.

The upstream region: before the 5’ end relative to the direction of transcription.

[Figure: an interval from 200 to 300 drawn on both strands with the 5’ and 3’ ends labeled; labels mark “upstream for the forward strand” and “upstream for the reverse strand”.]
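As a rough illustration of the upstream rule above, the sketch below computes an upstream window for a feature on either strand. It assumes 1-based inclusive coordinates with start <= end; the function name `upstream_region` and the `size` parameter are made up for this example.

```python
def upstream_region(start, end, strand, size):
    """Return (start, end) of the `size` bases upstream of a feature.

    Assumes 1-based inclusive coordinates with start <= end on both strands.
    """
    if strand == "+":
        # Forward strand: the 5' end is at `start`, so upstream lies to the left.
        return max(1, start - size), start - 1
    if strand == "-":
        # Reverse strand: the 5' end is at `end`, so upstream lies to the right.
        return end + 1, end + size
    raise ValueError(f"unknown strand: {strand!r}")


print(upstream_region(200, 300, "+", 50))  # (150, 199)
print(upstream_region(200, 300, "-", 50))  # (301, 350)
```

Note that the reverse-strand result still has start < end, in line with the standard formats.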

Page 4: Coordinate systems

•  0-based → 0, 1, 2, … 9

•  1-based → 1, 2, 3, … 10

Typically

•  0-based are non-inclusive: 10:20 → [10, 20)

•  1-based include both ends: 10:20 → [10, 20]
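A minimal sketch of what the two conventions mean for interval length; the function names are illustrative only.

```python
def length_zero_based(start, end):
    """Length of a 0-based, half-open interval [start, end)."""
    return end - start


def length_one_based(start, end):
    """Length of a 1-based, closed interval [start, end]."""
    return end - start + 1


print(length_zero_based(10, 20))  # 10
print(length_one_based(10, 20))   # 11
```

The same pair of numbers, 10:20, therefore describes 10 bases in one system and 11 in the other.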

Page 5: Comparing coordinate systems

Vote for what you think is better

[Table: 1-based indexing compared with 0-based indexing for the cases listed below; the cell values were not recovered from the slide.]

•  Third element

•  First ten

•  Second ten

•  Third ten

•  One-base-long interval starting at the 10th element

•  Length of an interval

•  Five elements starting at index 1000

•  Empty interval
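The cases in this comparison can be worked through in code. The sketch below is a reconstruction under the usual conventions (0-based half-open, 1-based closed), not the slide's own table; all function names are hypothetical.

```python
def nth_element(n, one_based):
    """Index of the n-th element (n counts from 1)."""
    return n if one_based else n - 1


def nth_block(n, size, one_based):
    """Interval covering the n-th consecutive block of `size` elements."""
    if one_based:
        return (n - 1) * size + 1, n * size   # closed: [start, end]
    return (n - 1) * size, n * size           # half-open: [start, end)


def run_from(start, count, one_based):
    """Interval of `count` elements beginning at index `start`."""
    return (start, start + count - 1) if one_based else (start, start + count)


for one_based in (True, False):
    label = "1-based" if one_based else "0-based"
    tenth = nth_element(10, one_based)
    print(label, "third element:", nth_element(3, one_based))
    print(label, "first, second, third ten:",
          [nth_block(n, 10, one_based) for n in (1, 2, 3)])
    print(label, "one base at the 10th element:", run_from(tenth, 1, one_based))
    print(label, "five elements from 1000:", run_from(1000, 5, one_based))
    print(label, "empty interval at 7:", run_from(7, 0, one_based))
```

In the 0-based half-open form, consecutive blocks share their boundary values and an empty interval is simply [7, 7). The 1-based closed form needs the awkward [7, 6] to express emptiness.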

Page 6: Fundamental interval formats

•  SAM/BAM: Sequence Alignment Map

•  VCF/BCF → for variant calls

•  BED/GFF → gene annotation representation

•  BEDgraph, Wiggle → values over intervals

Page 7: What is a genomic feature?

•  Feature: a genomic region (interval) associated with a certain annotation (description).

Typical attributes to describe a feature

1.  chromosome
2.  start
3.  end
4.  strand
5.  name

Why do we have so many variants? There is no good rational reason… history, I guess.
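The five typical attributes listed above map naturally onto a small record type. This is only a sketch: the class name `Feature` and the sample values are invented, and the coordinate convention is deliberately left to whatever file format the data came from.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    chrom: str
    start: int   # 0- or 1-based, depending on the source format
    end: int
    strand: str  # "+" or "-"
    name: str


gene = Feature("chr1", 200, 300, "-", "geneA")
print(gene)
```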

Page 8: Values on intervals

•  A single value characterizes an entire interval → score (value) for the interval

•  Continuous values → different value for each base of the interval → analogous to a series of 1 bp long intervals

Different data representation formats

Page 9

http://genome.ucsc.edu/FAQ/FAQformat.html

Page 10: Two commonly used formats

•  BED: UCSC Genome Browser → 0-based, non-inclusive → also used to display tracks in the genome browser (US “standard”) (variants: bigBed, bedGraph)

•  GFF: Sanger Institute in Great Britain → 1-based, inclusive indexing system (“European standard”) (variants: GTF, GFF2.0)

Page 11: BED format

Search for BED format

Tab separated, 3 required and 9 optional columns. Lower-numbered fields must be filled.

1.   chrom (name of the chromosome, sequence id)
2.   chromStart (starting position on the chromosome)
3.   chromEnd (end position on the chromosome; note this base is not included!)
4.   name (feature name)
5.   score (between 0 and 1000)
6.   strand (+ or -)
7.   thickStart (the starting position at which the feature is drawn thickly)
8.   thickEnd (the ending position at which the feature is drawn thickly)
9.   itemRgb (RGB color → 255,0,0; display color of the data contained)
10.  blockCount (the number of blocks (exons) in the BED line)
11.  blockSizes (a comma-separated list of the block sizes)
12.  blockStarts (a comma-separated list of the block starts, relative to chromStart)

These files may also take a track definition line that is visualization specific.
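The column list above can be turned into a small parser. The sketch below is illustrative rather than a full implementation: it accepts 3 to 12 tab-separated columns, does not handle track or browser lines, and the names `parse_bed_line`, `BED_COLUMNS` and `INT_COLUMNS` are invented for the example.

```python
BED_COLUMNS = [
    "chrom", "chromStart", "chromEnd", "name", "score", "strand",
    "thickStart", "thickEnd", "itemRgb", "blockCount", "blockSizes", "blockStarts",
]
INT_COLUMNS = {"chromStart", "chromEnd", "score", "thickStart", "thickEnd", "blockCount"}


def parse_bed_line(line):
    """Parse one tab-separated BED record with 3 to 12 columns into a dict."""
    fields = line.rstrip("\n").split("\t")
    if not 3 <= len(fields) <= len(BED_COLUMNS):
        raise ValueError(f"expected 3-12 columns, got {len(fields)}")
    record = {}
    # zip stops at the shorter sequence, so optional columns are simply absent.
    for name, value in zip(BED_COLUMNS, fields):
        record[name] = int(value) if name in INT_COLUMNS else value
    return record


rec = parse_bed_line("chr1\t1000\t2000\texon1\t0\t+")
print(rec["chromEnd"] - rec["chromStart"])  # 1000
```

Because the end base is not included, the length is a plain subtraction.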

Page 12: BedGraph format

Tab separated, 4 required columns.

1.   chrom (name of the chromosome, sequence id)
2.   chromStart (starting position on the chromosome)
3.   chromEnd (end position on the chromosome; note this base is not included!)
4.   dataValue (value of the data for that region)
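To connect the four bedGraph columns with per-base values, here is a small sketch that writes a list of per-base values as bedGraph lines, merging consecutive bases that share a value. The function name `to_bedgraph` and the sample numbers are assumptions made for illustration.

```python
from itertools import groupby


def to_bedgraph(chrom, start, values):
    """Turn per-base values beginning at 0-based `start` into bedGraph lines.

    Consecutive bases with the same value are merged into one interval.
    """
    lines = []
    pos = start
    for value, run in groupby(values):
        length = len(list(run))
        lines.append(f"{chrom}\t{pos}\t{pos + length}\t{value}")
        pos += length
    return lines


for line in to_bedgraph("chr1", 99, [10, 10, 15, 11, 11, 11]):
    print(line)
```

The six values collapse into three intervals: 99 to 101, 101 to 102, and 102 to 105.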

Page 13: GFF format

Search for GFF3 → http://www.sequenceontology.org/gff3.shtml

Tab separated with 9 columns. Missing attributes may be replaced with a dot → .

1.   Seqid (usually chromosome)
2.   Source (where is the data coming from)
3.   Type (usually a term from the sequence ontology)
4.   Start (interval start relative to the seqid)
5.   End (interval end relative to the seqid)
6.   Score (the score of the feature, a floating point number)
7.   Strand (+ or -)
8.   Phase (used to indicate reading frame for coding sequences)
9.   Attributes (semicolon-separated attributes → Name=ABC;ID=1)

People like to stuff a lot of information into the attributes column.
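A minimal sketch of reading one GFF3 record along the lines of the nine columns above. It is not a complete GFF3 parser: it ignores directives and comment lines and does not decode escaped characters. The name `parse_gff3_line` and the sample record are made up.

```python
GFF_COLUMNS = ["seqid", "source", "type", "start", "end",
               "score", "strand", "phase", "attributes"]


def parse_gff3_line(line):
    """Parse one tab-separated GFF3 record into a dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, got {len(fields)}")
    record = dict(zip(GFF_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    attrs = {}
    if record["attributes"] != ".":  # a dot stands for a missing value
        for pair in record["attributes"].split(";"):
            if pair:
                key, _, value = pair.partition("=")
                attrs[key.strip()] = value.strip()
    record["attributes"] = attrs
    return record


rec = parse_gff3_line("chr1\tdemo\texon\t1001\t2000\t.\t+\t.\tName=ABC;ID=1")
print(rec["attributes"])
print(rec["end"] - rec["start"] + 1)  # 1000
```

Both ends are included here, so the length needs the extra + 1.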

Page 14: Wiggle format

•  Two versions → fixedStep and variableStep, each trying to optimize the amount of data storage

fixedStep chrom=chr1 start=100 step=1
10
15
11
22
…

variableStep chrom=chr1
100 10
101 15
102 11
103 22
variableStep chrom=chr2
2000 23
2005 40
…

Wiggle is a nasty format. It looks simpler than it is. Please avoid.
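One way to sidestep wiggle is to expand it into plain intervals. The sketch below handles only the simple cases shown on this slide plus the optional `span` parameter, and relies on wiggle positions being 1-based. The function name `wiggle_to_bedgraph` is hypothetical, and real files may need more careful handling.

```python
def wiggle_to_bedgraph(lines):
    """Yield (chrom, start, end, value) in 0-based half-open coordinates."""
    mode = chrom = None
    pos = step = span = 1
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        if fields[0] in ("fixedStep", "variableStep"):
            # Declaration line: key=value parameters follow the keyword.
            mode = fields[0]
            params = dict(f.split("=", 1) for f in fields[1:])
            chrom = params["chrom"]
            span = int(params.get("span", 1))
            if mode == "fixedStep":
                pos = int(params["start"])
                step = int(params.get("step", 1))
        elif mode == "fixedStep":
            # One value per line; the position advances by `step`.
            yield chrom, pos - 1, pos - 1 + span, float(fields[0])
            pos += step
        elif mode == "variableStep":
            # Each line carries its own 1-based position and a value.
            p = int(fields[0])
            yield chrom, p - 1, p - 1 + span, float(fields[1])


fixed = ["fixedStep chrom=chr1 start=100 step=1", "10", "15", "11", "22"]
for row in wiggle_to_bedgraph(fixed):
    print(row)
```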

Page 15: We may have data in different coordinate systems!

Being “one off” is one of the most common errors in bioinformatics.

Conversion from GFF to BED

(start, end) → (start - 1, end)

Conversion from BED to GFF

(start, end) → (start + 1, end)

Note that there will be differences when selecting positions that depend on the END coordinate!
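The two conversion rules above written out as code, as a minimal sketch with illustrative function names:

```python
def gff_to_bed(start, end):
    """1-based closed [start, end] -> 0-based half-open [start - 1, end)."""
    return start - 1, end


def bed_to_gff(start, end):
    """0-based half-open [start, end) -> 1-based closed [start + 1, end]."""
    return start + 1, end


print(gff_to_bed(1001, 2000))  # (1000, 2000)
print(bed_to_gff(1000, 2000))  # (1001, 2000)
```

Only the start moves. The end value stays the same number even though its meaning changes from "last base included" to "first base not included".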

Page 16: Handling coordinates relative to intervals

What are the coordinates of the bases preceding and following the interval? Seems trivial, and it is, with a catch.

GFF [start, end] → base before start is at start - 1
BED [start, end) → base before start is at start - 1
GFF [start, end] → next base after end is at end + 1
BED [start, end) → next base after end is at end
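The four rules above as a short sketch; the function names are invented. The example uses the same physical interval expressed in both conventions, so the results can be compared directly.

```python
def flanking_bases_gff(start, end):
    """1-based positions of the bases just before and just after [start, end]."""
    return start - 1, end + 1


def flanking_bases_bed(start, end):
    """0-based indices of the bases just before and just after [start, end)."""
    return start - 1, end


# GFF 1001..2000 and BED 1000..2000 describe the same 1000 bases.
print(flanking_bases_gff(1001, 2000))  # (1000, 2001)
print(flanking_bases_bed(1000, 2000))  # (999, 2000)
```

The two results name the same bases: 1-based positions 1000 and 2001 are 0-based indices 999 and 2000.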

Page 17: Representing interval relationships

•  We have a gene with three splicing variants

It starts at 1000 and ends at 8000; each exon is 1 kb and is separated by 1 kb. How to represent this relationship?

Page 18: Data representation

•  Both BED and GFF files can represent them

•  Two common versions of GFF → GTF2 and GFF3 (note: tool documentation can often be wrong and show a weird combination of these two formats)

•  In GFF the content of the ATTRIBUTE (9th) column specifies the relationship between features

Page 19: GTF/GFF formats

GTF attributes:

–  gene_id value; a globally unique identifier for the genomic source of the transcript

–  transcript_id value; a globally unique identifier for the predicted transcript

gene_id "G1"; transcript_id "T1";

GFF attributes:

ID=exon1;Parent=T1

See the GFF3 site for the exact specification of what these mean.

Important: more than one parent may be listed!
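The two attribute styles can be read with a few lines of code. This is a sketch under simplifying assumptions: GTF values are taken to be double-quoted, escaped characters are not decoded, and both function names are hypothetical.

```python
import re


def parse_gtf_attributes(text):
    """Parse GTF-style attributes such as: gene_id "G1"; transcript_id "T1";"""
    return dict(re.findall(r'(\w+)\s+"([^"]*)"', text))


def parse_gff3_parents(text):
    """Return the parent IDs from GFF3-style attributes such as ID=exon1;Parent=T1,T2."""
    for pair in text.split(";"):
        key, _, value = pair.partition("=")
        if key.strip() == "Parent":
            return value.split(",")  # more than one parent may be listed
    return []


print(parse_gtf_attributes('gene_id "G1"; transcript_id "T1";'))
print(parse_gff3_parents("ID=exon1;Parent=T1,T2"))
```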

Page 20: Example interval as GTF

A distinct line is entered for each exon, repeated for each transcript
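The slide's own example did not survive extraction, so the following is a stand-in. It assumes four 1 kb exons separated by 1 kb between 1000 and 8000, stored as 0-based half-open intervals, and three invented splice variants T1, T2 and T3. The chromosome, the source name "demo" and the gene id are all made up.

```python
# Hypothetical model: exons as 0-based half-open intervals, 1 kb long, 1 kb apart.
EXONS = [(1000, 2000), (3000, 4000), (5000, 6000), (7000, 8000)]
# Hypothetical splice variants, given as indices into EXONS.
TRANSCRIPTS = {"T1": [0, 1, 2, 3], "T2": [0, 1, 3], "T3": [0, 2, 3]}


def gtf_lines(chrom="chr1", gene_id="G1", strand="+"):
    """Write one exon line per exon per transcript, GTF style."""
    lines = []
    for tid, exon_idx in TRANSCRIPTS.items():
        for i in exon_idx:
            start, end = EXONS[i]
            attrs = f'gene_id "{gene_id}"; transcript_id "{tid}";'
            # GTF is 1-based inclusive, so shift the 0-based start by one.
            fields = [chrom, "demo", "exon", str(start + 1), str(end),
                      ".", strand, ".", attrs]
            lines.append("\t".join(fields))
    return lines


for line in gtf_lines():
    print(line)
```

Under these assumptions the output has ten lines, because the shared exons are written again for every transcript that uses them.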

Page 21: Example interval as GFF3

The same exon may be part of different transcripts (parents)
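As a stand-in for the lost slide example, this sketch writes a GFF3 file in which each exon appears once and lists every transcript that uses it as a parent. The exon layout, the three splice variants, the chromosome and the source name "demo" are all assumptions made for illustration.

```python
# Hypothetical model: exons as 0-based half-open intervals, 1 kb long, 1 kb apart.
EXONS = [(1000, 2000), (3000, 4000), (5000, 6000), (7000, 8000)]
# Hypothetical splice variants, given as indices into EXONS.
TRANSCRIPTS = {"T1": [0, 1, 2, 3], "T2": [0, 1, 3], "T3": [0, 2, 3]}


def gff3_lines(chrom="chr1", gene_id="G1", strand="+"):
    """Write gene, mRNA and exon records with Parent links, GFF3 style."""
    def row(ftype, start, end, attrs):
        # GFF3 is 1-based inclusive, so shift the 0-based start by one.
        return "\t".join([chrom, "demo", ftype, str(start + 1), str(end),
                          ".", strand, ".", attrs])

    lines = ["##gff-version 3"]
    gene_start = min(s for s, _ in EXONS)
    gene_end = max(e for _, e in EXONS)
    lines.append(row("gene", gene_start, gene_end, f"ID={gene_id}"))
    for tid, idx in TRANSCRIPTS.items():
        start, end = EXONS[idx[0]][0], EXONS[idx[-1]][1]
        lines.append(row("mRNA", start, end, f"ID={tid};Parent={gene_id}"))
    for i, (start, end) in enumerate(EXONS):
        parents = ",".join(t for t, idx in TRANSCRIPTS.items() if i in idx)
        lines.append(row("exon", start, end, f"ID=exon{i + 1};Parent={parents}"))
    return lines


for line in gff3_lines():
    print(line)
```

Compared with the GTF style, each of the four exons is written only once.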

Page 22: Example interval in BED

From the BED format specification
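The example from the specification was an image and is not reproduced here. As a substitute, the sketch below writes one 12-column BED line per transcript, using the block columns for exons. The exon layout, the three splice variants and the chromosome name are assumptions made for illustration.

```python
# Hypothetical model: exons as 0-based half-open intervals, 1 kb long, 1 kb apart.
EXONS = [(1000, 2000), (3000, 4000), (5000, 6000), (7000, 8000)]
# Hypothetical splice variants, given as indices into EXONS.
TRANSCRIPTS = {"T1": [0, 1, 2, 3], "T2": [0, 1, 3], "T3": [0, 2, 3]}


def bed12_lines(chrom="chr1", strand="+"):
    """Write one BED12 line per transcript, with exons stored as blocks."""
    lines = []
    for tid, idx in TRANSCRIPTS.items():
        blocks = [EXONS[i] for i in idx]
        chrom_start, chrom_end = blocks[0][0], blocks[-1][1]
        sizes = ",".join(str(e - s) for s, e in blocks)
        # Block starts are relative to chromStart, so the first one is 0.
        starts = ",".join(str(s - chrom_start) for s, _ in blocks)
        fields = [chrom, chrom_start, chrom_end, tid, 0, strand,
                  chrom_start, chrom_end, "0", len(blocks), sizes, starts]
        lines.append("\t".join(map(str, fields)))
    return lines


for line in bed12_lines():
    print(line)
```

BED already uses 0-based half-open coordinates, so no shift is needed, and the parent-child relationship is implicit in each line rather than spelled out in an attribute column.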

Page 23: Visualizing in IGV

Page 24: Homework 25

•  Create and visualize in IGV an interval file that contains three splice variants of a 1 kb long gene with 5 exons.

•  Show the file and a screenshot

