genomation: summary of genomic intervals

Date post: 28-Aug-2014
Upload: altuna-akalin
Description:
An R package that contains a collection of tools for visualizing and analyzing genome-wide data sets. The package works with a variety of genomic interval file types and enables easy summarization and annotation of high-throughput data sets with given genomic annotations. http://al2na.github.io/genomation/

genomation package
Altuna Akalın

  • Usage and ubiquity of genomic interval summaries
  • Using genomation
  • More information

genomation: a toolkit to summarize, annotate and visualize genomic intervals

Presented by Altuna Akalın

February 24, 2014

Package developed by Altuna Akalın and Vedran Franke

Quick introduction

genomation is an R package that expedites genomic interval summary and annotation. It has the following features:

1. Annotation of genomic intervals, e.g. see what fraction of your intervals overlap with exons/introns/promoters.

2. Summary of genomic scores or read coverages over pre-defined regions, e.g. extract the conservation profile over ChIP-seq binding sites (equi-width regions) or CpG islands (non-equi-width regions).

3. Visualization of genomic interval summaries as meta-region plots or heatmaps.

4. Support for multiple file formats, e.g. BAM, BED, bigWig, GFF and generic tabular text files containing chromosome location information.

5. All of this is done in R.

Genomic interval summaries are widely used

Summaries of genomic intervals are a useful way to communicate high-dimensional data. Traditionally, regions of interest are picked and the distribution of genomic intervals is summarized over those regions.

Genomic interval summaries are widely used
Examples from literature

Figure: Erkek S. et al. (2013). Molecular determinants of nucleosome retention at CpG-rich sequences in mouse spermatozoa. Nature Structural & Molecular Biology.

Genomic interval summaries are widely used
Examples from literature

Figure: Stadler M., Murr R., Burger L. et al. (2011). DNA-binding factors shape the mouse methylome at distal regulatory regions. Nature.

Utility and futility of average profiles

Does this mean all of the windows (viewpoints) have a similar enrichment profile?

[Figure: "average profile around anchor"; x-axis: bases (-100 to 100), y-axis: average score]

Utility and futility of average profiles

Only 1/3 of the windows have such enrichment. Be careful when you are interpreting average profiles.

[Figure: per-window view of the same data; x-axis: -100 to 100]
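The warning about average profiles can be reproduced with a small simulation. The sketch below uses base R only and entirely made-up numbers (300 windows, a Gaussian-shaped bump added to the first third of them); it is an illustration of the pitfall, not the data behind the slide.

```r
# Simulated score matrix: 300 windows x 201 positions (-100..100 around an anchor).
# All values are invented for illustration.
set.seed(1)
n_windows <- 300
positions <- -100:100
scores <- matrix(rnorm(n_windows * length(positions), mean = 4, sd = 0.5),
                 nrow = n_windows)

# Only the first third of the windows carries a bump centred on the anchor.
enriched <- seq_len(n_windows / 3)
bump <- 3 * exp(-(positions / 20)^2)
scores[enriched, ] <- sweep(scores[enriched, , drop = FALSE], 2, bump, "+")

# The average profile shows a clear peak at the anchor ...
avg_profile <- colMeans(scores)
peak_height <- avg_profile[positions == 0] - mean(avg_profile[abs(positions) > 80])

# ... but checking each window separately shows how few rows drive it.
centre <- abs(positions) <= 10
flank <- abs(positions) > 80
row_gain <- rowMeans(scores[, centre]) - rowMeans(scores[, flank])
frac_enriched <- mean(row_gain > 1)

print(round(peak_height, 2))
print(frac_enriched)
```

Plotting `avg_profile` against `positions` next to a heatmap of `scores` (for example with `image()`) makes the contrast visible: the average suggests a general trend, while the per-window view shows that a minority of rows is responsible.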

Genomic interval summaries are widely used
Examples from literature

Figure: Lister R. et al. (2009). Human DNA methylomes at base resolution show widespread epigenomic differences. Nature.

Genomic interval summaries are widely used
Examples from literature

Figure: Feng S. et al. (2010). Conservation and divergence of methylation patterning in plants and animals. PNAS.

Issues to keep in mind when developing summary methods

  • Genomic data comes in many formats: we need a method that is able to work with multiple flat file formats.
  • We need a method that is not specialized on one type of data set, such as read counts; it should also work easily with other scoring schemes (e.g. conservation scores).
  • Regions of interest are not always equi-width: you should be able to normalize for length differences by binning.
  • Multiple visualization options and fast heatmap generation should be available.
  • Clustering of regions based on multiple summaries (e.g. binding for different TFs on the same set of regions) should be possible on the heatmap.
  • Ease of use: it should not take hours of coding to generate and visualize summaries.
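The point about normalizing for length differences by binning can be sketched in a few lines of base R. `bin_scores()` below is a hypothetical helper written for this illustration, and its argument names are assumptions; it is not the package's own implementation. It reduces a per-base score vector of any length to a fixed number of bins so that regions of different widths become rows of one matrix.

```r
# Summarise a per-base score vector into a fixed number of bins.
# bin_scores() is a hypothetical helper for this sketch.
bin_scores <- function(x, bin_num = 10, bin_op = mean) {
  stopifnot(length(x) >= bin_num)
  # Assign each position to one of bin_num consecutive, near-equal bins.
  bin_id <- ceiling(seq_along(x) * bin_num / length(x))
  as.vector(tapply(x, bin_id, bin_op))
}

# Three made-up regions of different widths with a rising score.
regions <- list(
  short  = seq(0, 1, length.out = 50),
  medium = seq(0, 1, length.out = 200),
  long   = seq(0, 1, length.out = 1000)
)

# Every region becomes one row with the same number of columns.
binned <- t(sapply(regions, bin_scores, bin_num = 10))
print(dim(binned))
print(round(binned, 2))
```

After binning, the three rows are directly comparable even though the underlying regions differ twentyfold in width, which is what makes a shared heatmap or meta-region plot possible.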

Overview of genomation features

[Figure: workflow diagram. Genomic intervals (BAM, BigWig, BED, GFF, tabular text, GRanges) and annotation (BED, GFF, tabular text, GRanges) are either summarized into a ScoreMatrix/ScoreMatrixList object (regions 1 to m by base-pairs or bins 1 to n) or annotated, and then visualized. Example outputs shown for TF1 to TF4: meta-region plots, meta-region heatmaps, heatmaps for genomic interval sets, and pie charts for annotation (intergenic, intron, exon, promoter).]

Installation of the package and the example data

We can install the package and the data using the install_github() function from the devtools package.

    # install dependencies
    install.packages(c("data.table", "plyr", "reshape2", "ggplot2", "gridBase", "devtools"))
    source("http://bioconductor.org/biocLite.R")
    biocLite(c("GenomicRanges", "rtracklayer", "impute", "Rsamtools"))

    # install the packages
    library(devtools)
    install_github("genomation", username = "al2na")

    # install the data package
    # needed for examples
    install_github("genomationData", username = "al2na")

Data import

Various file formats can be used in genomation. You can read in annotation or your genomic intervals of interest.

    library(genomation)
    tabfile1 <- system.file("extdata/tab1.bed", package = "genomation")
    readGeneric(tabfile1)

    GRanges with 6 ranges and 0 metadata columns:
          seqnames             ranges strand
             <Rle>          <IRanges>  <Rle>
      [1]    chr21 [9437272, 9439473]      *
      [2]    chr21 [9483485, 9484663]      *
      [3]    chr21 [9647866, 9648116]      *
      [4]    chr21 [9708935, 9709231]      *
      [5]    chr21 [9825442, 9826296]      *
      [6]    chr21 [9909011, 9909218]      *
      ---
      seqlengths:
       chr21
          NA

Extraction of data over pre-defined genomic regions

ScoreMatrix() and ScoreMatrixBin() are the functions used to extract data over predefined windows.

  • ScoreMatrix is used when all of the windows have the same width (e.g. a region around the TSS).
  • ScoreMatrixBin is designed for use with windows of unequal width (e.g. enrichment of methylation over exons).

    data(cage)
    data(promoters)
    sm <- ScoreMatrix(target = cage, windows = promoters)
    sm

    scoreMatrix with dims: 1055 2001

Visualizing ScoreMatrix: summary of genomic intervals over pre-defined regions

plotMeta(), heatMeta(), heatMatrix() and multiHeatMatrix() are the visualization functions.

    oldoma <- par()$oma    # save the outer margins so they can be restored
    par(oma = c(0, 0, 0, 0))
    heatMatrix(sm, xcoords = c(-1000, 1000))
    plotMeta(sm, xcoords = c(-1000, 1000), line.col = "blue")
    par(oma = oldoma)

[Figure: heatmap of the score matrix and the corresponding meta-region plot; x-axis: bases (-1000 to 1000), y-axis of the meta-region plot: average score]

Working with BAM files

BAM files can also be used in the ScoreMatrix() and ScoreMatrixBin() functions.

    bamfile = system.file("tests/test.bam", package = "genomation")
    windows = GRanges(rep(c(1, 2), each = 2),
                      IRanges(rep(c(1, 2), times = 2), width = 5))
    scores3 = ScoreMatrix(target = bamfile, windows = windows, type = "bam")

Working with bigWig files

ScoreMatrix() and ScoreMatrixBin() can also handle bigWig files. Here we use ENCODE DHS scores downloaded from http://goo.gl/fEVu0g

    mybed12file = system.file("extdata/chr21.refseq.hg19.bed", package = "genomation")

    # promoters are defined as 500 bp up- and downstream of each TSS
    feats = readTranscriptFeatures(mybed12file, up.flank = 500, down.flank = 500)
    sm = ScoreMatrix(target = "wgEncodeUwDnaseA549RawRep1.bw",
                     windows = feats$promoters, type = "bigWig", strand.aware = TRUE)
    plotMeta(sm, xcoords = c(-500, 500), main = "DHS around TSS", line.col = "blue")

[Figure: meta-region plot "DHS around TSS"; x-axis: bases, y-axis: average score]

Multiple profiles

Multiple heatmap profiles can be plotted using multiHeatMatrix(), which takes a ScoreMatrixList object. Here we use CTCF, P300, Suz12, Rad21 and Znf143 BAM files from the genomationData package.

    ctcfpeaks = readRDS("ctcfpeaks.rds")
    dataPath = system.file("extdata", package = "genomationData")
    bamfiles = list.files(dataPath, full.names = TRUE, pattern = "bam$")[c(1:4, 6)]
    sml = ScoreMatrixList(bamfiles, ctcfpeaks, bin.num = 50, type = "bam")
    names(sml) = c("CTCF", "P300", "Suz12", "Rad21", "Znf143")
    multiHeatMatrix(sml, xcoords = c(-500, 500), cex.axis = 0.35, common.scale = TRUE,
                    col = c("lightgray", "blue"), winsorize = c(0, 95))

[Figure: side-by-side heatmaps for CTCF, P300, Suz12, Rad21 and Znf143 on a common colour scale; x-axis: -500 to 500]

Multiple profiles

multiHeatMatrix() can also apply k-means clustering. Extreme values are trimmed using the "winsorize" argument.

    multiHeatMatrix(sml, xcoords = c(-500, 500), kmeans = TRUE, k = 3,
                    common.scale = TRUE, cex.axis = 0.4,
                    col = c("lightgray", "blue"), winsorize = c(0, 95))

[Figure: the same heatmaps for CTCF, P300, Suz12, Rad21 and Znf143 with rows grouped into clusters 1, 2 and 3; x-axis: -500 to 500]
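What trimming extreme values and clustering rows amount to can be shown on a plain matrix. The sketch below is base R only and uses invented Poisson counts; `winsorize_matrix()` is a hypothetical helper, and nothing here is claimed to match how multiHeatMatrix() works internally.

```r
set.seed(42)
# Made-up score matrix: 90 regions x 50 bins, three groups with different signal.
mat <- rbind(
  matrix(rpois(30 * 50, lambda = 1), nrow = 30),
  matrix(rpois(30 * 50, lambda = 5), nrow = 30),
  matrix(rpois(30 * 50, lambda = 12), nrow = 30)
)
mat[1, 1] <- 500  # a single extreme value that would dominate a colour scale

# Hypothetical helper: clamp values to the given lower and upper percentiles.
winsorize_matrix <- function(m, limits = c(0, 95)) {
  q <- quantile(m, probs = limits / 100, na.rm = TRUE)
  m[m < q[1]] <- q[1]
  m[m > q[2]] <- q[2]
  m
}
mat_w <- winsorize_matrix(mat, c(0, 95))

# k-means on the rows, then order the rows by cluster for display.
km <- kmeans(mat_w, centers = 3, nstart = 25)
row_order <- order(km$cluster)
clustered <- mat_w[row_order, ]

print(max(mat))
print(max(mat_w))
print(table(km$cluster))
```

Clamping at the 95th percentile keeps the single outlier from stretching the colour scale, and ordering rows by cluster is what produces the banded look of a clustered heatmap. Cluster labels from k-means are arbitrary, so the numbering can differ between runs.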

Multiple profiles

Multiple average profiles can be visualized with heatMeta(). Here we also apply a scaling function to all the matrices.

    # take log2 of all matrices
    sml2 = scaleScoreMatrixList(sml, scalefun = function(x) log2(x + 1))
    heatMeta(sml2, legend.name = "average profiles", xcoords = c(-500, 500),
             xlab = "bp around peaks")

[Figure: heatmap of average profiles with one row each for Znf143, Rad21, Suz12, P300 and CTCF; x-axis: bp around peaks; legend: average profiles]

Multiple profiles

Multiple average profiles can also be visualized with plotMeta().

    plotMeta(sml2, profile.names = names(sml2), xcoords = c(-500, 500), main = "mult profiles")

[Figure: "mult profiles", overlaid average profiles for CTCF, P300, Suz12, Rad21 and Znf143; x-axis: bases, y-axis: average score]

Future work

  • Explore overlap statistics between two genomic data sets: do TF1 binding site locations overlap with TF2 sites more than expected? This was previously explored in the GenometriCorr package. This functionality can be included in the form of a dependency.
  • Performance improvement on certain functions: faster is always better.
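One simple way to ask whether two sets of sites overlap more than expected is a permutation test. The sketch below is base R with invented coordinates on a single made-up chromosome, and it uses the crudest possible null model (uniform reshuffling of TF1 sites). It is an assumption-laden illustration of the question only and does not reproduce the tests offered by GenometriCorr.

```r
set.seed(7)
chrom_len <- 1e6   # made-up chromosome length
width <- 200       # made-up, fixed site width

# Invented binding sites (start positions): half of the TF2 sites sit near TF1 sites.
tf1 <- sort(sample(chrom_len - width, 200))
tf2 <- c(tf1[1:100] + sample(-50:50, 100, replace = TRUE),
         sample(chrom_len - width, 100))
tf2 <- pmin(pmax(tf2, 1), chrom_len - width)

# Number of intervals in a that overlap at least one interval in b.
# Two intervals of equal width w overlap when their starts differ by less than w.
count_overlaps <- function(a, b, w) {
  sum(vapply(a, function(s) any(abs(b - s) < w), logical(1)))
}

observed <- count_overlaps(tf1, tf2, width)

# Null distribution: place the TF1 sites uniformly at random along the chromosome.
n_perm <- 200
null_counts <- replicate(n_perm, {
  shuffled <- sample(chrom_len - width, length(tf1))
  count_overlaps(shuffled, tf2, width)
})

# Empirical p-value with the usual +1 correction.
p_value <- (sum(null_counts >= observed) + 1) / (n_perm + 1)
print(observed)
print(mean(null_counts))
print(p_value)
```

A uniform null ignores that real binding sites cluster in promoters and other accessible regions, so in practice the shuffling would need to respect such constraints; that is one reason to rely on a dedicated package rather than a hand-rolled test.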

Further information

The genomation package is available at http://al2na.github.io/genomation. You can find the link to the vignette on the webpage as well. Code that generated this presentation is available at http://github.com/al2na/genomation_presentation

Questions and bug reports: you can view or open issues on GitHub at https://github.com/al2na/genomation/issues?state=open

You can ask questions by sending an e-mail to genomation@googlegroups.com or by using the web interface to Google Groups.

Developed by Altuna Akalın and Vedran Franke

Session Info

    sessionInfo()

    R version 3.0.2 (2013-09-25)
    Platform: x86_64-apple-darwin10.8.0 (64-bit)

    locale:
    [1] C

    attached base packages:
    [1] methods   grid      stats     graphics  grDevices utils     datasets
    [8] base

    other attached packages:
    [1] genomation_0.99.0.2 knitr_1.5

    loaded via a namespace (and not attached):
     [1] BSgenome_1.30.0      BiocGenerics_0.8.0   Biostrings_2.30.0
     [4] GenomicRanges_1.14.3 IRanges_1.20.5       MASS_7.3-29
     [7] RColorBrewer_1.0-5   RCurl_1.95-4.1       Rsamtools_1.14.1
    [10] XML_3.95-0.2         XVector_0.2.0        bitops_1.0-6
    [13] colorspace_1.2-4     data.table_1.8.10    dichromat_2.0-0
    [16] digest_0.6.3         evaluate_0.5.1       formatR_0.10
    [19] ggplot2_0.9.3.1      gridBase_0.4-6       gtable_0.1.2
    [22] highr_0.3            impute_1.36.0        labeling_0.2
    [25] munsell_0.4.2        parallel_3.0.2       plyr_1.8
    [28] proto_0.3-10         reshape2_1.2.2       rtracklayer_1.22.0
    [31] scales_0.2.3         stats4_3.0.2         stringr_0.6.2
    [34] tools_3.0.2          zlibbioc_1.8.0

Page 2: genomation: summary of genomic intervals

genomation

package

Altuna Akalın

Usage and

ubiquity of

genomic

interval

summaries

Using

genomation

More

information

Quick introduction

The genomation is an R package that expedites genomicinterval summary and annotation It has the following features

1 Annotation of genomic intervals eg see what of yourintervals overlap with exonintronpromoters

2 Summary of genomic scores or read coverages over pre-definedregions

eg extract the conservation profile over ChIP-seq binding sites

(equi-width regions) or CpG islands (nonequi-width regions)

3 Visualize genomic interval summaries as meta-region plots orheatmaps

4 Work with multiple file formatseg BAM BED bigWig GFF and generic tabular text files

containing chromosome location information

5 do all these in R )

genomation

package

Altuna Akalın

Usage and

ubiquity of

genomic

interval

summaries

Using

genomation

More

information

Quick introduction

The genomation is an R package that expedites genomicinterval summary and annotation It has the following features

1 Annotation of genomic intervals eg see what of yourintervals overlap with exonintronpromoters

2 Summary of genomic scores or read coverages over pre-definedregions

eg extract the conservation profile over ChIP-seq binding sites

(equi-width regions) or CpG islands (nonequi-width regions)

3 Visualize genomic interval summaries as meta-region plots orheatmaps

4 Work with multiple file formatseg BAM BED bigWig GFF and generic tabular text files

containing chromosome location information

5 do all these in R )

genomation

package

Altuna Akalın

Usage and

ubiquity of

genomic

interval

summaries

Using

genomation

More

information

Quick introduction

The genomation is an R package that expedites genomicinterval summary and annotation It has the following features

1 Annotation of genomic intervals eg see what of yourintervals overlap with exonintronpromoters

2 Summary of genomic scores or read coverages over pre-definedregions

eg extract the conservation profile over ChIP-seq binding sites

(equi-width regions) or CpG islands (nonequi-width regions)

3 Visualize genomic interval summaries as meta-region plots orheatmaps

4 Work with multiple file formatseg BAM BED bigWig GFF and generic tabular text files

containing chromosome location information

5 do all these in R )

genomation

package

Altuna Akalın

Usage and

ubiquity of

genomic

interval

summaries

Using

genomation

More

information

Quick introduction

The genomation is an R package that expedites genomicinterval summary and annotation It has the following features

1 Annotation of genomic intervals eg see what of yourintervals overlap with exonintronpromoters

2 Summary of genomic scores or read coverages over pre-definedregions

eg extract the conservation profile over ChIP-seq binding sites

(equi-width regions) or CpG islands (nonequi-width regions)

3 Visualize genomic interval summaries as meta-region plots orheatmaps

4 Work with multiple file formatseg BAM BED bigWig GFF and generic tabular text files

containing chromosome location information

5 do all these in R )

genomation

package

Altuna Akalın

Usage and

ubiquity of

genomic

interval

summaries

Using

genomation

More

information

Quick introduction

The genomation is an R package that expedites genomicinterval summary and annotation It has the following features

1 Annotation of genomic intervals eg see what of yourintervals overlap with exonintronpromoters

2 Summary of genomic scores or read coverages over pre-definedregions

eg extract the conservation profile over ChIP-seq binding sites

(equi-width regions) or CpG islands (nonequi-width regions)

3 Visualize genomic interval summaries as meta-region plots orheatmaps

4 Work with multiple file formatseg BAM BED bigWig GFF and generic tabular text files

containing chromosome location information

5 do all these in R )

genomation

package

Altuna Akalın

Usage and

ubiquity of

genomic

interval

summaries

Using

genomation

More

information

Genomic interval summaries are widely used

Summaries of genomic intervals are one of the useful ways tocommunicate high-dimensional dataTraditionally regions of interest are picked and distribution ofgenomic intervals are summarized on those regions

genomation

package

Altuna Akalın

Usage and

ubiquity of

genomic

interval

summaries

Using

genomation

More

information

Genomic interval summaries are widely usedExamples from literature

Figure Erkek S et al (2013) Molecular determinants of nucleosomeretention at CpG-rich sequences in mouse spermatozoa Nature Structuralamp Molecular Biology

genomation

package

Altuna Akalın

Usage and

ubiquity of

genomic

interval

summaries

Using

genomation

More

information

Genomic interval summaries are widely usedExamples from literature

Figure Stadler M Murr R Burger L et al (2011) DNA-bindingfactors shape the mouse methylome at distal regulatory regions Nature

genomation

package

Altuna Akalın

Usage and

ubiquity of

genomic

interval

summaries

Using

genomation

More

information

Utility and futility of average profiles

Does this mean all of the windows (viewpoints) have a similarenrichment profile

minus100 0 50 100

35

40

45

50

average profile around anchor

bases

aver

age

scor

e

genomation

package

Altuna Akalın

Usage and

ubiquity of

genomic

interval

summaries

Using

genomation

More

information

Utility and futility of average profiles

Only 13 of windows have such enrichment Be careful when you areinterpreting the average profiles

05

2 1

0 1

6 2

1 minus100 0 50 100

genomation

package

Altuna Akalın

Usage and

ubiquity of

genomic

interval

summaries

Using

genomation

More

information

Genomic interval summaries are widely usedExamples from literature

Figure Lister R et al (2009) Human DNA methylomes at baseresolution show widespread epigenomic differences Nature

genomation

package

Altuna Akalın

Usage and

ubiquity of

genomic

interval

summaries

Using

genomation

More

information

Genomic interval summaries are widely usedExamples from literature

Figure Feng S et al (2010) Conservation and divergence ofmethylation patterning in plants and animals PNAS

genomation

package

Altuna Akalın

Usage and

ubiquity of

genomic

interval

summaries

Using

genomation

More

information

Issues to keep in mind when developing summarymethods

Genomic data comes in many formats we need a method that isable to work with multiple flat file formatsWe need a method that is not specialized on one type of dataset such as read counts it should also work on other scoringschemes(eg conservation scores) easilyRegions of interest are not always equi-width you should be ableto normalize for length differences by binningMultiple visualization options and fast heatmap generationshould be availableClustering of regions based on multiple summaries (egbinding for different TFs on the same set of regions) on theheatmapEase of use it should not take hours of coding to generate andvisualize summaries

genomation

package

Altuna Akalın

Usage and

ubiquity of

genomic

interval

summaries

Using

genomation

More

information

Overview of genomation features

BAMBigWigBEDGFFTab txtGRanges

BEDGFFTab txtGRanges

Summarize

Annotation

Genomic Intervals

Annotate

Visualize

Base-pairs bins1 2 3 4 n

ScoreMatrixScoreMatrixList object

region 1

region 2

region 3

region 4

region m

IntergenicIntronExonPromoter409

116

218257

iuml iuml 0 500 1000

00

02

04

06

08

10

base-pairs around anchor

read

per

milli

on

TF4TF3TF2TF1

iuml

iuml

0

500

100

0

0 05 1 15 2

TF 4

iuml

iuml

0

500

100

0

0 05 1 15 2 25

TF 3

iuml

iuml

0

500

100

0

0 05 1 15 2 25

TF 2

iuml

iuml

0

500

100

0

0 05 1 15 2 25

TF 1

iuml iuml 0 500 1000

base-pairs around anchor

TF1

TF2

TF3

TF4

007

20

340

60

861

1

meta-region plots meta-region heatmaps

heatmaps for genomic interval sets

Piecharts for annotation

genomation

package

Altuna Akalın

Usage and

ubiquity of

genomic

interval

summaries

Using

genomation

More

information

installation of the package and the example data

We can install the package and the data using install_github()

function from the devtools package

install dependencies

installpackages( c(datatableplyrreshape2ggplot2gridBasedevtools))

source(httpbioconductororgbiocLiteR)biocLite(c(GenomicRangesrtracklayerimputeRsamtools))

install the packages

library(devtools)install_github(genomation username = al2na)

install the data package

needed for examples

install_github(genomationData username = al2na)

genomation

package

Altuna Akalın

Usage and

ubiquity of

genomic

interval

summaries

Using

genomation

More

information

Data import

Various file formats can be used in genomation You can read inannotation or your genomic intervals of interest

library(genomation)tabfile1 lt- systemfile(extdatatab1bed package = genomation)readGeneric(tabfile1)

GRanges with 6 ranges and 0 metadata columns seqnames ranges strand ltRlegt ltIRangesgt ltRlegt [1] chr21 [9437272 9439473] [2] chr21 [9483485 9484663] [3] chr21 [9647866 9648116] [4] chr21 [9708935 9709231] [5] chr21 [9825442 9826296] [6] chr21 [9909011 9909218] --- seqlengths chr21 NA

genomation

package

Altuna Akalın

Usage and

ubiquity of

genomic

interval

summaries

Using

genomation

More

information

Extraction of data over pre-defined genomic regions

ScoreMatrix() and ScoreMatrixBin() are functions used to extractdata over predefined windows

ScoreMatrix is used when all of the windows have the samewidth (eg region around TSS)ScoreMatrixBin is designed for use with windows of unequalwidth (eg enrichment of methylation over exons)

data(cage)data(promoters)sm lt- ScoreMatrix(target = cage windows = promoters)sm

scoreMatrix with dims 1055 2001

genomation

package

Altuna Akalın

Usage and

ubiquity of

genomic

interval

summaries

Using

genomation

More

information

Visualizing ScoreMatrix summary of genomicinvervals over pre-defined regions

plotMeta()heatMeta() heatMatrix() and multiHeatMatrix()are the visualization functions

oldmar lt- par()$marpar(oma = c(0 0 0 0))heatMatrix(sm xcoords = c(-1000 1000))plotMeta(sm xcoords = c(-1000 1000)linecol=blue)par(oma = oldmar)

00

751

52

2 3

minus1000 minus500 0 500 1000 minus1000 minus500 0 500 1000

000

005

010

015

020

025

bases

aver

age

scor

e

genomation

package

Altuna Akalın

Usage and

ubiquity of

genomic

interval

summaries

Using

genomation

More

information

Working with BAM files

BAM files can also be used in ScoreMatrix() and ScoreMatrixBin()functions

bamfile = systemfile(teststestbam package=genomation)windows = GRanges(rep(c(12)each=2)

IRanges(rep(c(12) times=2) width=5))scores3 = ScoreMatrix(target=bamfilewindows=windows type=bam)

genomation

package

Altuna Akalın

Usage and

ubiquity of

genomic

interval

summaries

Using

genomation

More

information

Working with bigWig files

ScoreMatrix() and ScoreMatrixBin() are functions can handlebigWig files Here we use ENCODE DHS scores downloaded fromhttpgooglfEVu0g

mybed12file=systemfile(extdatachr21refseqhg19bedpackage = genomation)

feats=readTranscriptFeatures(mybed12fileupflank=500downflank=500)sm=ScoreMatrix(target=wgEncodeUwDnaseA549RawRep1bw

windows=feats$promoterstype=bigWigstrandaware=TRUE)plotMeta(smxcoords=c(-500500)main=DHS around TSSlinecol=blue)

minus400 0 200

46

810

14

DHS around TSS

bases

aver

age

scor

e

genomation

package

Altuna Akalın

Usage and

ubiquity of

genomic

interval

summaries

Using

genomation

More

information

Multiple profiles

Multiple heatmap profiles can be plotted using multiHeatMatrix()which takes in a ScoreMatrixList object Here we used CTCF P300 Suz12 Rad21 Znf143 BAM files from genomationData package

ctcfpeaks=readRDS(ctcfpeaksrds)dataPath = systemfile(extdata package = genomationData)bamfiles = listfiles(dataPath full= Tpattern = bam$)[c(146)]sml = ScoreMatrixList(bamfiles ctcfpeaks binnum = 50type = bam)names(sml)=c(CTCFP300Suz12Rad21Znf143)multiHeatMatrix(sml xcoords = c(-500 500)cexaxis=035commonscale = T

col = c(lightgray blue)winsorize=c(095))

minus50

0 minus

250

0

250

500

0 2 4 6 8

CTCF

minus50

0 minus

250

0

250

500

0 2 4 6 8

P300

minus50

0 minus

250

0

250

500

0 2 4 6 8

Suz12

minus50

0 minus

250

0

250

500

0 2 4 6 8

Rad21

minus50

0 minus

250

0

250

500

0 2 4 6 8

Znf143

genomation

package

Altuna Akalın

Usage and

ubiquity of

genomic

interval

Quick introduction

genomation is an R package that expedites genomic interval summary and annotation. It has the following features:

1. Annotation of genomic intervals, e.g. see what percentage of your intervals overlap with exons/introns/promoters
2. Summary of genomic scores or read coverages over pre-defined regions, e.g. extract the conservation profile over ChIP-seq binding sites (equi-width regions) or CpG islands (non-equi-width regions)
3. Visualize genomic interval summaries as meta-region plots or heatmaps
4. Work with multiple file formats, e.g. BAM, BED, bigWig, GFF and generic tabular text files containing chromosome location information
5. Do all of this in R

Genomic interval summaries are widely used

  • Summaries of genomic intervals are one of the useful ways to communicate high-dimensional data.
  • Traditionally, regions of interest are picked and the distribution of genomic intervals is summarized over those regions.

Genomic interval summaries are widely used: examples from literature

Figure: Erkek S, et al. (2013). Molecular determinants of nucleosome retention at CpG-rich sequences in mouse spermatozoa. Nature Structural & Molecular Biology.

Genomic interval summaries are widely used: examples from literature

Figure: Stadler M, Murr R, Burger L, et al. (2011). DNA-binding factors shape the mouse methylome at distal regulatory regions. Nature.

Utility and futility of average profiles

Does this mean all of the windows (viewpoints) have a similar enrichment profile?

[Figure: "average profile around anchor"; x-axis: bases (-100 to 100), y-axis: average score]

Utility and futility of average profiles

Only 13 of windows have such enrichment. Be careful when you are interpreting the average profiles.

[Figure: the individual windows, -100 to 100 bases around the anchor]
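
The caution above can be reproduced with simulated data. The following base R sketch is not from the presentation: the number of windows, the share of enriched windows (one third, picked only for illustration) and the peak shape are all assumptions. It produces an average profile with a clear peak at the anchor even though most windows carry no enrichment at all.

```r
set.seed(1)
n.windows <- 300
positions <- seq(-100, 100)          # bases around the anchor
n.bases   <- length(positions)

# background signal for every window (hypothetical values)
scores <- matrix(rnorm(n.windows * n.bases, mean = 1, sd = 0.2),
                 nrow = n.windows, ncol = n.bases)

# add a peak at the anchor to the first third of the windows only
enriched <- seq_len(n.windows / 3)
peak <- 3 * exp(-(positions / 20)^2)
scores[enriched, ] <- sweep(scores[enriched, , drop = FALSE], 2, peak, "+")

# the average profile is what a meta-region plot would show
avg.profile <- colMeans(scores)

# per-window check: which windows are actually enriched at the anchor?
window.has.peak <- scores[, positions == 0] > 2

avg.profile[positions == 0]   # clearly above the flanks
mean(window.has.peak)         # fraction of windows that carry the peak
```

Plotting the full matrix as a heatmap next to the average profile is what exposes this kind of heterogeneity.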

Genomic interval summaries are widely used: examples from literature

Figure: Lister R, et al. (2009). Human DNA methylomes at base resolution show widespread epigenomic differences. Nature.

Genomic interval summaries are widely used: examples from literature

Figure: Feng S, et al. (2010). Conservation and divergence of methylation patterning in plants and animals. PNAS.

Issues to keep in mind when developing summary methods

  • Genomic data comes in many formats: we need a method that is able to work with multiple flat file formats.
  • We need a method that is not specialized on one type of data set, such as read counts; it should also work easily on other scoring schemes (e.g. conservation scores).
  • Regions of interest are not always equi-width: you should be able to normalize for length differences by binning.
  • Multiple visualization options and fast heatmap generation should be available.
  • Clustering of regions based on multiple summaries (e.g. binding for different TFs on the same set of regions) on the heatmap.
  • Ease of use: it should not take hours of coding to generate and visualize summaries.
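
The binning point in the list above can be sketched in a few lines of base R. This is an illustration only and not the package's implementation: `bin_scores()` and the toy regions are hypothetical names and data. Each region, whatever its length, is reduced to the same number of bins by averaging, so regions of different widths become comparable rows of one matrix.

```r
# Summarise a per-base score vector into a fixed number of bins by averaging.
bin_scores <- function(scores, bin.num = 10) {
  stopifnot(length(scores) >= bin.num)
  # assign every base to one of bin.num roughly equal-sized bins
  bin.id <- cut(seq_along(scores), breaks = bin.num, labels = FALSE)
  as.numeric(tapply(scores, bin.id, mean))
}

# two regions with the same shape (low first half, high second half)
# but very different lengths
short.region <- rep(c(0, 2), each = 10)    # 20 bp
long.region  <- rep(c(0, 2), each = 500)   # 1000 bp

binned <- rbind(short = bin_scores(short.region, bin.num = 4),
                long  = bin_scores(long.region,  bin.num = 4))
binned
```

After binning, both regions are described by four values each and can be stacked, averaged or clustered together.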

Overview of genomation features

[Figure: overview diagram. Inputs: genomic intervals (BAM, BigWig, BED, GFF, tab-delimited text, GRanges) and annotation (BED, GFF, tab-delimited text, GRanges). Steps: Summarize, Annotate, Visualize. Summaries are stored as ScoreMatrix / ScoreMatrixList objects (regions 1 to m by base pairs or bins 1 to n). Outputs: meta-region plots, meta-region heatmaps, heatmaps for genomic interval sets, and pie charts for annotation (intergenic, intron, exon, promoter).]

Installation of the package and the example data

We can install the package and the data using the install_github() function from the devtools package.

    # install dependencies
    install.packages(c("data.table", "plyr", "reshape2", "ggplot2", "gridBase", "devtools"))
    source("http://bioconductor.org/biocLite.R")
    biocLite(c("GenomicRanges", "rtracklayer", "impute", "Rsamtools"))

    # install the packages
    library(devtools)
    install_github("genomation", username = "al2na")

    # install the data package
    # needed for examples
    install_github("genomationData", username = "al2na")

Data import

Various file formats can be used in genomation. You can read in annotation or your genomic intervals of interest.

    library(genomation)
    tab.file1 <- system.file("extdata/tab1.bed", package = "genomation")
    readGeneric(tab.file1)

    ## GRanges with 6 ranges and 0 metadata columns:
    ##       seqnames             ranges strand
    ##          <Rle>          <IRanges>  <Rle>
    ##   [1]    chr21 [9437272, 9439473]      *
    ##   [2]    chr21 [9483485, 9484663]      *
    ##   [3]    chr21 [9647866, 9648116]      *
    ##   [4]    chr21 [9708935, 9709231]      *
    ##   [5]    chr21 [9825442, 9826296]      *
    ##   [6]    chr21 [9909011, 9909218]      *
    ##   ---
    ##   seqlengths:
    ##    chr21
    ##       NA
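
For readers without the package at hand, the idea of a generic tabular interval file can be shown with base R alone. This sketch is not how readGeneric() works internally. It simply types in the first three intervals from the output above as tab-separated text, reads them into a data frame (the column names are an assumption) and computes widths, treating the coordinates as 1-based and inclusive, as the ranges are displayed.

```r
# three intervals from the output above, as tab-separated text
bed.text <- paste("chr21\t9437272\t9439473",
                  "chr21\t9483485\t9484663",
                  "chr21\t9647866\t9648116", sep = "\n")

# chromosome, start and end are the minimum needed to locate an interval
intervals <- read.table(text = bed.text, sep = "\t",
                        col.names = c("chr", "start", "end"),
                        stringsAsFactors = FALSE)

# width of each interval, assuming 1-based inclusive coordinates
intervals$width <- intervals$end - intervals$start + 1
intervals
```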

Extraction of data over pre-defined genomic regions

ScoreMatrix() and ScoreMatrixBin() are functions used to extract data over predefined windows.

  • ScoreMatrix is used when all of the windows have the same width (e.g. region around TSS).
  • ScoreMatrixBin is designed for use with windows of unequal width (e.g. enrichment of methylation over exons).

    data(cage)
    data(promoters)
    sm <- ScoreMatrix(target = cage, windows = promoters)
    sm

    ## scoreMatrix with dims: 1055 2001
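
The dimensions reported above are windows by bases. What such a matrix holds can be sketched conceptually in base R. The code below is an illustration under stated assumptions, not the ScoreMatrix() implementation: the signal vector, the window starts and the window width are all made up. Each row is one window, each column one base inside it, and the column means give the average profile.

```r
# per-base signal along a toy chromosome of 1000 bp (hypothetical data)
set.seed(42)
signal <- rpois(1000, lambda = 2)

# equal-width windows, given by their start coordinates (1-based, inclusive)
win.start <- c(101, 301, 501, 701)
win.width <- 50

# one row per window, one column per base in the window
score.mat <- t(sapply(win.start, function(s) signal[s:(s + win.width - 1)]))
dim(score.mat)   # windows x bases

# column means give the average profile that a meta-region plot would draw
avg.profile <- colMeans(score.mat)
```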

Visualizing ScoreMatrix: summary of genomic intervals over pre-defined regions

plotMeta(), heatMeta(), heatMatrix() and multiHeatMatrix() are the visualization functions.

    # save the outer margins so that they can be restored afterwards
    oldoma <- par()$oma
    par(oma = c(0, 0, 0, 0))
    heatMatrix(sm, xcoords = c(-1000, 1000))
    plotMeta(sm, xcoords = c(-1000, 1000), line.col = "blue")
    par(oma = oldoma)

[Figure: heatMatrix() heatmap and plotMeta() average profile, each spanning -1000 to 1000 bases; the profile plots average score against bases]

Working with BAM files

BAM files can also be used in the ScoreMatrix() and ScoreMatrixBin() functions.

    bam.file = system.file("tests/test.bam", package = "genomation")
    windows = GRanges(rep(c(1, 2), each = 2),
                      IRanges(rep(c(1, 2), times = 2), width = 5))
    scores3 = ScoreMatrix(target = bam.file, windows = windows, type = "bam")

Working with bigWig files

ScoreMatrix() and ScoreMatrixBin() can also handle bigWig files. Here we use ENCODE DHS scores downloaded from http://goo.gl/fEVu0g

    my.bed12.file = system.file("extdata/chr21.refseq.hg19.bed", package = "genomation")
    feats = readTranscriptFeatures(my.bed12.file, up.flank = 500, down.flank = 500)
    sm = ScoreMatrix(target = "wgEncodeUwDnaseA549RawRep1.bw",
                     windows = feats$promoters, type = "bigWig", strand.aware = TRUE)
    plotMeta(sm, xcoords = c(-500, 500), main = "DHS around TSS", line.col = "blue")

[Figure: "DHS around TSS" average profile; x-axis: bases, y-axis: average score]

Multiple profiles

Multiple heatmap profiles can be plotted using multiHeatMatrix(), which takes in a ScoreMatrixList object. Here we used CTCF, P300, Suz12, Rad21 and Znf143 BAM files from the genomationData package.

    ctcf.peaks = readRDS("ctcf.peaks.rds")
    dataPath = system.file("extdata", package = "genomationData")
    bam.files = list.files(dataPath, full.names = TRUE, pattern = "bam$")[c(1:4, 6)]
    sml = ScoreMatrixList(bam.files, ctcf.peaks, bin.num = 50, type = "bam")
    names(sml) = c("CTCF", "P300", "Suz12", "Rad21", "Znf143")
    multiHeatMatrix(sml, xcoords = c(-500, 500), cex.axis = 0.35, common.scale = TRUE,
                    col = c("lightgray", "blue"), winsorize = c(0, 95))

[Figure: heatmaps for CTCF, P300, Suz12, Rad21 and Znf143, each spanning -500 to 500 bp, with a common color scale from 0 to 8]

Multiple profiles

multiHeatMatrix() can also apply K-means clustering. Extreme values are trimmed using the "winsorize" argument.

    multiHeatMatrix(sml, xcoords = c(-500, 500), kmeans = TRUE, k = 3,
                    common.scale = TRUE, cex.axis = 0.4,
                    col = c("lightgray", "blue"), winsorize = c(0, 95))

[Figure: the same five heatmaps (CTCF, P300, Suz12, Rad21, Znf143), -500 to 500 bp, with rows grouped into clusters 1 to 3 and a common color scale from 0 to 8]
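
The two ideas on this slide, trimming extreme values and clustering the rows, can be sketched with base R on a toy matrix. This is not the code behind multiHeatMatrix(): `winsorize_matrix()` is a hypothetical helper, the data are simulated, and the sketch assumes that trimming means clipping values to the given lower and upper percentiles.

```r
# Clip values outside the given lower/upper percentiles.
winsorize_matrix <- function(mat, probs = c(0, 95)) {
  lim <- quantile(mat, probs = probs / 100, na.rm = TRUE)
  mat[mat < lim[1]] <- lim[1]
  mat[mat > lim[2]] <- lim[2]
  mat
}

set.seed(7)
# toy score matrix: 60 windows x 20 bins, three groups with different signal
mat <- rbind(matrix(rnorm(20 * 20, mean = 0.5, sd = 0.1), nrow = 20),
             matrix(rnorm(20 * 20, mean = 3,   sd = 0.1), nrow = 20),
             matrix(rnorm(20 * 20, mean = 6,   sd = 0.1), nrow = 20))
mat[1, 1] <- 1000   # a single extreme value that would dominate a color scale

# trim first, then cluster the windows (rows) into k = 3 groups
mat.w    <- winsorize_matrix(mat, probs = c(0, 95))
clusters <- kmeans(mat.w, centers = 3, nstart = 50)$cluster
table(clusters)
```

Trimming before clustering keeps a handful of outliers from stretching the color scale or pulling a window into the wrong group.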

Multiple profiles

Multiple average profiles can be visualized with heatMeta(). Here we also apply a scaling function to all the matrices.

    # take log2 of all matrices
    sml2 = scaleScoreMatrixList(sml, scalefun = function(x) log2(x + 1))
    heatMeta(sml2, legend.name = "average profiles", xcoords = c(-500, 500),
             xlab = "bp around peaks")

[Figure: heatmap of average profiles, one row each for CTCF, P300, Suz12, Rad21 and Znf143; x-axis: bp around peaks (-400 to 400); legend: average profiles]
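
What the scaling step and the per-factor average profiles amount to can be sketched on plain matrices. The list below is a hypothetical stand-in for a ScoreMatrixList, with made-up names and simulated counts; the code uses base R only and is not the package's implementation. The same function is applied to every matrix, then each matrix is collapsed to one average profile, giving one row per factor.

```r
set.seed(3)
# hypothetical list of score matrices (10 windows x 5 bins), one per factor
mats <- list(TF_A = matrix(rpois(50, lambda = 4),  nrow = 10),
             TF_B = matrix(rpois(50, lambda = 40), nrow = 10))

# apply the same scaling function to every matrix in the list
scalefun <- function(x) log2(x + 1)
scaled   <- lapply(mats, scalefun)

# one average profile per matrix: rows are factors, columns are bins
avg.profiles <- t(sapply(scaled, colMeans))
avg.profiles
```

The log2(x + 1) transform compresses large counts while keeping zeros at zero, which makes profiles of very different magnitude easier to show on one color scale.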

Multiple profiles

Multiple average profiles can also be visualized with plotMeta().

    plotMeta(sml2, profile.names = names(sml2), xcoords = c(-500, 500),
             main = "mult profiles")

[Figure: "mult profiles" line plot with one line each for CTCF, P300, Suz12, Rad21 and Znf143; x-axis: bases (-400 to 400), y-axis: average score]

Future work

  • Explore overlap statistics between two genomic data sets: do TF1 binding site locations overlap with TF2 sites more than expected? This was previously explored with the GenometriCorr package. This functionality can be included in the form of a dependency.
  • Performance improvement on certain functions: faster is always better.
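
The overlap question raised above can be illustrated with a simple permutation test. The sketch below is one possible approach and not the method used by GenometriCorr or planned for genomation. All names and data are hypothetical: sites live on a single toy chromosome, intervals are closed, and the null model places TF1 sites uniformly at random.

```r
# TRUE for every query interval that overlaps at least one subject interval
overlaps_any <- function(q.start, q.end, s.start, s.end) {
  vapply(seq_along(q.start), function(i) {
    any(q.start[i] <= s.end & q.end[i] >= s.start)
  }, logical(1))
}

set.seed(11)
chrom.len <- 1e6
width     <- 200
tf2.start <- sort(sample(chrom.len - 2 * width, 300))
# hypothetical TF1 sites: 100 placed next to TF2 sites, 100 placed at random
tf1.start <- c(sample(tf2.start, 100) + 50, sample(chrom.len - width, 100))

observed <- sum(overlaps_any(tf1.start, tf1.start + width - 1,
                             tf2.start, tf2.start + width - 1))

# null distribution: redraw TF1 positions uniformly along the chromosome
null.counts <- replicate(200, {
  rnd <- sample(chrom.len - width, length(tf1.start))
  sum(overlaps_any(rnd, rnd + width - 1, tf2.start, tf2.start + width - 1))
})

# one-sided empirical p-value: is the observed overlap larger than expected?
p.value <- (sum(null.counts >= observed) + 1) / (length(null.counts) + 1)
c(observed = observed, expected = mean(null.counts), p.value = p.value)
```

A uniform null ignores mappability, chromatin state and other confounders, so a real analysis would need a more careful background model.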

Further information

The genomation package is available at http://al2na.github.io/genomation. You can find the link to the vignette on the webpage as well. Code that generated this presentation is available at http://github.com/al2na/genomation_presentation

Questions and bug reports

You can view/open issues on GitHub: https://github.com/al2na/genomation/issues?state=open

You can ask questions by sending an e-mail to genomation@googlegroups.com or using the web interface to Google Groups.

Developed by Altuna Akalın and Vedran Franke

Session Info

    sessionInfo()

    ## R version 3.0.2 (2013-09-25)
    ## Platform: x86_64-apple-darwin10.8.0 (64-bit)
    ##
    ## locale:
    ## [1] C
    ##
    ## attached base packages:
    ## [1] methods   grid      stats     graphics  grDevices utils     datasets
    ## [8] base
    ##
    ## other attached packages:
    ## [1] genomation_0.99.0.2 knitr_1.5
    ##
    ## loaded via a namespace (and not attached):
    ##  [1] BSgenome_1.30.0      BiocGenerics_0.8.0   Biostrings_2.30.0
    ##  [4] GenomicRanges_1.14.3 IRanges_1.20.5       MASS_7.3-29
    ##  [7] RColorBrewer_1.0-5   RCurl_1.95-4.1       Rsamtools_1.14.1
    ## [10] XML_3.95-0.2         XVector_0.2.0        bitops_1.0-6
    ## [13] colorspace_1.2-4     data.table_1.8.10    dichromat_2.0-0
    ## [16] digest_0.6.3         evaluate_0.5.1       formatR_0.10
    ## [19] ggplot2_0.9.3.1      gridBase_0.4-6       gtable_0.1.2
    ## [22] highr_0.3            impute_1.36.0        labeling_0.2
    ## [25] munsell_0.4.2        parallel_3.0.2       plyr_1.8
    ## [28] proto_0.3-10         reshape2_1.2.2       rtracklayer_1.22.0
    ## [31] scales_0.2.3         stats4_3.0.2         stringr_0.6.2
    ## [34] tools_3.0.2          zlibbioc_1.8.0

Page 5: genomation: summary of genomic intervals

genomation

package

Altuna Akalın

Usage and

ubiquity of

genomic

interval

summaries

Using

genomation

More

information

Quick introduction

genomation is an R package that expedites genomic interval summary and annotation. It has the following features:

1. Annotation of genomic intervals, e.g. see what percentage of your intervals overlap with exons, introns, or promoters.
2. Summary of genomic scores or read coverage over pre-defined regions, e.g. extract the conservation profile over ChIP-seq binding sites (equi-width regions) or CpG islands (non-equi-width regions).
3. Visualization of genomic interval summaries as meta-region plots or heatmaps.
4. Support for multiple file formats, e.g. BAM, BED, bigWig, GFF, and generic tabular text files containing chromosome location information.
5. All of this is done in R.

Genomic interval summaries are widely used

Summaries of genomic intervals are a useful way to communicate high-dimensional data. Traditionally, regions of interest are picked and the distribution of genomic intervals is summarized over those regions.

Genomic interval summaries are widely used: examples from the literature

Figure: Erkek S. et al. (2013). Molecular determinants of nucleosome retention at CpG-rich sequences in mouse spermatozoa. Nature Structural & Molecular Biology.

Figure: Stadler M., Murr R., Burger L. et al. (2011). DNA-binding factors shape the mouse methylome at distal regulatory regions. Nature.

Utility and futility of average profiles

Does this mean that all of the windows (viewpoints) have a similar enrichment profile?

[Figure: average profile around anchor; x-axis: bases from -100 to 100, y-axis: average score.]

Only 1/3 of the windows have such enrichment. Be careful when interpreting average profiles.

[Figure: the individual windows behind the average profile; x-axis from -100 to 100.]
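The caution above can be reproduced with simulated data. The following base-R sketch is not taken from the slides: the matrix dimensions, the shape of the bump, the threshold, and the names `scores`, `enriched`, and `avg_profile` are assumptions chosen for illustration. One third of the simulated windows carry a peak at the anchor, yet the column-wise mean still shows a clear bump.

```r
set.seed(1)

n_windows <- 300        # hypothetical number of windows
positions <- -100:100   # bases relative to the anchor

# Background signal for every window (rows = windows, columns = bases).
scores <- matrix(rnorm(n_windows * length(positions), mean = 4, sd = 0.5),
                 nrow = n_windows)

# Only the first third of the windows gets an enrichment bump at the anchor.
enriched <- seq_len(n_windows / 3)
bump <- 3 * exp(-(positions / 20)^2)
scores[enriched, ] <- sweep(scores[enriched, , drop = FALSE], 2, bump, "+")

# The average profile still peaks at the anchor ...
avg_profile <- colMeans(scores)
center <- which(positions == 0)

# ... although only about a third of the windows have a peak of their own
# (anchor signal compared with the left flank; the 1.5 cutoff is arbitrary).
has_peak <- scores[, center] - rowMeans(scores[, 1:20]) > 1.5
mean(has_peak)
```

Looking at the per-window matrix, for example as a heatmap, is what reveals this kind of heterogeneity.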

Genomic interval summaries are widely used: examples from the literature

Figure: Lister R. et al. (2009). Human DNA methylomes at base resolution show widespread epigenomic differences. Nature.

Figure: Feng S. et al. (2010). Conservation and divergence of methylation patterning in plants and animals. PNAS.

Issues to keep in mind when developing summary methods

• Genomic data comes in many formats: we need a method that can work with multiple flat-file formats.
• We need a method that is not specialized for one type of data set, such as read counts; it should also work easily with other scoring schemes (e.g. conservation scores).
• Regions of interest are not always equi-width: you should be able to normalize for length differences by binning.
• Multiple visualization options and fast heatmap generation should be available.
• Clustering of regions based on multiple summaries (e.g. binding of different TFs on the same set of regions) should be possible on the heatmap.
• Ease of use: it should not take hours of coding to generate and visualize summaries.

Overview of genomation features

[Figure: overview of the genomation workflow. Genomic intervals (BAM, bigWig, BED, GFF, tabular text, GRanges) are summarized over annotation (BED, GFF, tabular text, GRanges) into a ScoreMatrix or ScoreMatrixList object, with regions 1 to m as rows and base pairs or bins 1 to n as columns. Intervals can also be annotated as promoter, exon, intron, or intergenic. Visualizations: meta-region plots, meta-region heatmaps, heatmaps for genomic interval sets, and pie charts for annotation.]

Installation of the package and the example data

We can install the package and the data using the install_github() function from the devtools package.

    # install dependencies
    install.packages(c("data.table", "plyr", "reshape2", "ggplot2", "gridBase", "devtools"))
    source("http://bioconductor.org/biocLite.R")
    biocLite(c("GenomicRanges", "rtracklayer", "impute", "Rsamtools"))

    # install the package
    library(devtools)
    install_github("genomation", username = "al2na")

    # install the data package needed for the examples
    install_github("genomationData", username = "al2na")

Data import

Various file formats can be used in genomation. You can read in annotation or your genomic intervals of interest.

    library(genomation)
    tabfile1 <- system.file("extdata/tab1.bed", package = "genomation")
    readGeneric(tabfile1)

    GRanges with 6 ranges and 0 metadata columns:
          seqnames             ranges strand
             <Rle>          <IRanges>  <Rle>
      [1]    chr21 [9437272, 9439473]      *
      [2]    chr21 [9483485, 9484663]      *
      [3]    chr21 [9647866, 9648116]      *
      [4]    chr21 [9708935, 9709231]      *
      [5]    chr21 [9825442, 9826296]      *
      [6]    chr21 [9909011, 9909218]      *
      ---
      seqlengths:
       chr21
          NA

Extraction of data over pre-defined genomic regions

ScoreMatrix() and ScoreMatrixBin() are the functions used to extract data over predefined windows.

• ScoreMatrix() is used when all of the windows have the same width (e.g. regions around TSSs).
• ScoreMatrixBin() is designed for windows of unequal width (e.g. enrichment of methylation over exons).

    data(cage)
    data(promoters)
    sm <- ScoreMatrix(target = cage, windows = promoters)
    sm

    scoreMatrix with dims: 1055 2001
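The idea behind binning windows of unequal width can be sketched in base R. This is an illustration only, not the package's implementation: the helper `bin_scores`, the choice of the mean as the per-bin summary, and the toy regions are assumptions. Each region, whatever its length, is reduced to the same number of bins so that all regions fit into one matrix.

```r
# Summarize a per-base score vector into a fixed number of bins by
# averaging the bases that fall into each bin.
bin_scores <- function(x, n_bins) {
  stopifnot(length(x) >= n_bins)
  # Assign every base to one of n_bins roughly equal-sized bins.
  bin_id <- cut(seq_along(x), breaks = n_bins, labels = FALSE)
  as.numeric(tapply(x, bin_id, mean))
}

# Three hypothetical regions of different widths (e.g. exons).
regions <- list(
  short  = rep(c(0, 1), each = 5),      # 10 bases
  medium = seq(0, 1, length.out = 40),  # 40 bases
  long   = rep(2, 100)                  # 100 bases
)

# Every region becomes one row with the same number of columns.
binned <- t(sapply(regions, bin_scores, n_bins = 5))
dim(binned)
```

With equal-width windows no binning is needed, because every base position already corresponds to one column.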

Visualizing ScoreMatrix: summary of genomic intervals over pre-defined regions

plotMeta(), heatMeta(), heatMatrix() and multiHeatMatrix() are the visualization functions.

    # set the outer margins to zero and remember the previous setting
    oldpar <- par(oma = c(0, 0, 0, 0))
    heatMatrix(sm, xcoords = c(-1000, 1000))
    plotMeta(sm, xcoords = c(-1000, 1000), line.col = "blue")
    # restore the previous outer margins
    par(oldpar)

[Figure: heatmap of the score matrix and the corresponding meta-region plot; x-axis: bases from -1000 to 1000, y-axis of the meta-region plot: average score.]
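What a meta-region plot shows can be sketched with base R alone. The example below is an assumption-laden illustration rather than the package's code: the matrix `m` is simulated, and the mapping of columns to coordinates mirrors the idea of the `xcoords` argument without claiming to reproduce it exactly.

```r
set.seed(42)

# Hypothetical stand-in for a score matrix: 50 windows x 2001 bases.
m <- matrix(rpois(50 * 2001, lambda = 0.1), nrow = 50)

# Map the 2001 columns onto coordinates from -1000 to 1000.
x <- seq(-1000, 1000, length.out = ncol(m))

# Meta-region profile: average score at each position across all windows.
profile <- colMeans(m)

# Draw to a null device so the sketch runs without opening a window.
pdf(NULL)
plot(x, profile, type = "l", col = "blue",
     xlab = "bases", ylab = "average score")
invisible(dev.off())
```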

Working with BAM files

BAM files can also be used in the ScoreMatrix() and ScoreMatrixBin() functions.

    bamfile = system.file("tests/test.bam", package = "genomation")
    windows = GRanges(rep(c(1, 2), each = 2),
                      IRanges(rep(c(1, 2), times = 2), width = 5))
    scores3 = ScoreMatrix(target = bamfile, windows = windows, type = "bam")

Working with bigWig files

ScoreMatrix() and ScoreMatrixBin() can also handle bigWig files. Here we use ENCODE DHS scores downloaded from http://goo.gl/fEVu0g.

    mybed12file = system.file("extdata/chr21.refseq.hg19.bed", package = "genomation")
    feats = readTranscriptFeatures(mybed12file, up.flank = 500, down.flank = 500)
    sm = ScoreMatrix(target = "wgEncodeUwDnaseA549RawRep1.bw",
                     windows = feats$promoters, type = "bigWig", strand.aware = TRUE)
    plotMeta(sm, xcoords = c(-500, 500), main = "DHS around TSS", line.col = "blue")

[Figure: "DHS around TSS"; x-axis: bases, y-axis: average score.]

Multiple profiles

Multiple heatmap profiles can be plotted using multiHeatMatrix(), which takes a ScoreMatrixList object. Here we use the CTCF, P300, Suz12, Rad21 and Znf143 BAM files from the genomationData package.

    ctcfpeaks = readRDS("ctcfpeaks.rds")
    dataPath = system.file("extdata", package = "genomationData")
    bamfiles = list.files(dataPath, full.names = TRUE, pattern = "bam$")[c(1:4, 6)]
    sml = ScoreMatrixList(bamfiles, ctcfpeaks, bin.num = 50, type = "bam")
    names(sml) = c("CTCF", "P300", "Suz12", "Rad21", "Znf143")
    multiHeatMatrix(sml, xcoords = c(-500, 500), cex.axis = 0.35, common.scale = TRUE,
                    col = c("lightgray", "blue"), winsorize = c(0, 95))

[Figure: five heatmaps (CTCF, P300, Suz12, Rad21, Znf143) on a common color scale from 0 to 8; x-axis from -500 to 500.]

multiHeatMatrix() can also apply k-means clustering. Extreme values are trimmed using the "winsorize" argument.

    multiHeatMatrix(sml, xcoords = c(-500, 500), kmeans = TRUE, k = 3, common.scale = TRUE,
                    cex.axis = 0.4, col = c("lightgray", "blue"), winsorize = c(0, 95))

[Figure: the five heatmaps (CTCF, P300, Suz12, Rad21, Znf143) with rows grouped into clusters 1 to 3; common color scale from 0 to 8; x-axis from -500 to 500.]
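The two ideas on this slide, trimming extreme values and clustering rows, can be sketched in base R. This is a minimal illustration, not the code behind multiHeatMatrix(): the helper `winsorize_matrix`, the simulated matrix, and the use of stats::kmeans with three centers are assumptions.

```r
# Cap values outside the given lower/upper percentiles (in percent).
winsorize_matrix <- function(m, limits = c(0, 95)) {
  q <- quantile(m, probs = limits / 100)
  m[m < q[1]] <- q[1]
  m[m > q[2]] <- q[2]
  m
}

set.seed(7)
# Hypothetical score matrix: 60 regions x 20 bins.
m <- matrix(rexp(60 * 20), nrow = 60)
m[1:20, 8:12] <- m[1:20, 8:12] + 5   # a group of regions with a central peak
m[sample(length(m), 5)] <- 1000      # outliers that would dominate a color scale

w <- winsorize_matrix(m, c(0, 95))

# Cluster regions (rows) on their winsorized profiles, then order rows by cluster
# so that similar regions end up next to each other in a heatmap.
cl <- kmeans(w, centers = 3, nstart = 10)
ordered <- w[order(cl$cluster), ]
range(w)
```

Without the winsorizing step, the few extreme values would stretch the color scale and flatten everything else.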

Multiple average profiles can be visualized with heatMeta(). Here we also apply a scaling function to all the matrices.

    # take log2 of all matrices
    sml2 = scaleScoreMatrixList(sml, scalefun = function(x) log2(x + 1))
    heatMeta(sml2, legend.name = "average profiles", xcoords = c(-500, 500),
             xlab = "bp around peaks")

[Figure: heatmap of average profiles, one row per factor (Znf143, Rad21, Suz12, P300, CTCF); x-axis: bp around peaks; legend: average profiles.]
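Applying one scaling function to a list of matrices and stacking the resulting average profiles can be sketched in base R as follows. The list `mats`, its element names, and the helper `scale_fun` are hypothetical; the sketch only mirrors the idea of the step above and is not the package's implementation.

```r
set.seed(3)

# Hypothetical list of score matrices, one per factor (4 regions x 10 bins each).
mats <- list(
  TF_A = matrix(rpois(40, lambda = 10), nrow = 4),
  TF_B = matrix(rpois(40, lambda = 200), nrow = 4)
)

# Apply the same transformation to every matrix in the list.
scale_fun <- function(x) log2(x + 1)
scaled <- lapply(mats, scale_fun)

# One average profile per matrix, stacked as rows (factors x positions).
profiles <- t(sapply(scaled, colMeans))
dim(profiles)
```

The log transform brings factors with very different signal ranges onto a comparable scale before their profiles are shown side by side.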

Multiple average profiles can also be visualized with plotMeta().

    plotMeta(sml2, profile.names = names(sml2), xcoords = c(-500, 500), main = "mult profiles")

[Figure: "mult profiles", overlaid average profiles for CTCF, P300, Suz12, Rad21 and Znf143; x-axis: bases, y-axis: average score.]

Page 6: genomation: summary of genomic intervals

genomation

package

Altuna Akalın

Usage and

ubiquity of

genomic

interval

summaries

Using

genomation

More

information

Quick introduction

The genomation is an R package that expedites genomicinterval summary and annotation It has the following features

1 Annotation of genomic intervals eg see what of yourintervals overlap with exonintronpromoters

2 Summary of genomic scores or read coverages over pre-definedregions

eg extract the conservation profile over ChIP-seq binding sites

(equi-width regions) or CpG islands (nonequi-width regions)

3 Visualize genomic interval summaries as meta-region plots orheatmaps

4 Work with multiple file formatseg BAM BED bigWig GFF and generic tabular text files

containing chromosome location information

5 do all these in R )

genomation

package

Altuna Akalın

Usage and

ubiquity of

genomic

interval

summaries

Using

genomation

More

information

Genomic interval summaries are widely used

Summaries of genomic intervals are one of the useful ways tocommunicate high-dimensional dataTraditionally regions of interest are picked and distribution ofgenomic intervals are summarized on those regions

genomation

package

Altuna Akalın

Usage and

ubiquity of

genomic

interval

summaries

Using

genomation

More

information

Genomic interval summaries are widely usedExamples from literature

Figure Erkek S et al (2013) Molecular determinants of nucleosomeretention at CpG-rich sequences in mouse spermatozoa Nature Structuralamp Molecular Biology

genomation

package

Altuna Akalın

Usage and

ubiquity of

genomic

interval

summaries

Using

genomation

More

information

Genomic interval summaries are widely usedExamples from literature

Figure Stadler M Murr R Burger L et al (2011) DNA-bindingfactors shape the mouse methylome at distal regulatory regions Nature

genomation

package

Altuna Akalın

Usage and

ubiquity of

genomic

interval

summaries

Using

genomation

More

information

Utility and futility of average profiles

Does this mean all of the windows (viewpoints) have a similarenrichment profile

minus100 0 50 100

35

40

45

50

average profile around anchor

bases

aver

age

scor

e

genomation

package

Altuna Akalın

Usage and

ubiquity of

genomic

interval

summaries

Using

genomation

More

information

Utility and futility of average profiles

Only 13 of windows have such enrichment Be careful when you areinterpreting the average profiles

05

2 1

0 1

6 2

1 minus100 0 50 100

genomation

package

Altuna Akalın

Usage and

ubiquity of

genomic

interval

summaries

Using

genomation

More

information

Genomic interval summaries are widely usedExamples from literature

Figure Lister R et al (2009) Human DNA methylomes at baseresolution show widespread epigenomic differences Nature

genomation

package

Altuna Akalın

Usage and

ubiquity of

genomic

interval

summaries

Using

genomation

More

information

Genomic interval summaries are widely usedExamples from literature

Figure Feng S et al (2010) Conservation and divergence ofmethylation patterning in plants and animals PNAS

genomation

package

Altuna Akalın

Usage and

ubiquity of

genomic

interval

summaries

Using

genomation

More

information

Issues to keep in mind when developing summarymethods

Genomic data comes in many formats we need a method that isable to work with multiple flat file formatsWe need a method that is not specialized on one type of dataset such as read counts it should also work on other scoringschemes(eg conservation scores) easilyRegions of interest are not always equi-width you should be ableto normalize for length differences by binningMultiple visualization options and fast heatmap generationshould be availableClustering of regions based on multiple summaries (egbinding for different TFs on the same set of regions) on theheatmapEase of use it should not take hours of coding to generate andvisualize summaries

genomation

package

Altuna Akalın

Usage and

ubiquity of

genomic

interval

summaries

Using

genomation

More

information

Overview of genomation features

BAMBigWigBEDGFFTab txtGRanges

BEDGFFTab txtGRanges

Summarize

Annotation

Genomic Intervals

Annotate

Visualize

Base-pairs bins1 2 3 4 n

ScoreMatrixScoreMatrixList object

region 1

region 2

region 3

region 4

region m

IntergenicIntronExonPromoter409

116

218257

iuml iuml 0 500 1000

00

02

04

06

08

10

base-pairs around anchor

read

per

milli

on

TF4TF3TF2TF1

iuml

iuml

0

500

100

0

0 05 1 15 2

TF 4

iuml

iuml

0

500

100

0

0 05 1 15 2 25

TF 3

iuml

iuml

0

500

100

0

0 05 1 15 2 25

TF 2

iuml

iuml

0

500

100

0

0 05 1 15 2 25

TF 1

iuml iuml 0 500 1000

base-pairs around anchor

TF1

TF2

TF3

TF4

007

20

340

60

861

1

meta-region plots meta-region heatmaps

heatmaps for genomic interval sets

Piecharts for annotation

genomation

package

Altuna Akalın

Usage and

ubiquity of

genomic

interval

summaries

Using

genomation

More

information

installation of the package and the example data

We can install the package and the data using install_github()

function from the devtools package

install dependencies

installpackages( c(datatableplyrreshape2ggplot2gridBasedevtools))

source(httpbioconductororgbiocLiteR)biocLite(c(GenomicRangesrtracklayerimputeRsamtools))

install the packages

library(devtools)install_github(genomation username = al2na)

install the data package

needed for examples

install_github(genomationData username = al2na)

genomation

package

Altuna Akalın

Usage and

ubiquity of

genomic

interval

summaries

Using

genomation

More

information

Data import

Various file formats can be used in genomation You can read inannotation or your genomic intervals of interest

library(genomation)tabfile1 lt- systemfile(extdatatab1bed package = genomation)readGeneric(tabfile1)

GRanges with 6 ranges and 0 metadata columns seqnames ranges strand ltRlegt ltIRangesgt ltRlegt [1] chr21 [9437272 9439473] [2] chr21 [9483485 9484663] [3] chr21 [9647866 9648116] [4] chr21 [9708935 9709231] [5] chr21 [9825442 9826296] [6] chr21 [9909011 9909218] --- seqlengths chr21 NA

genomation

package

Altuna Akalın

Usage and

ubiquity of

genomic

interval

summaries

Using

genomation

More

information

Extraction of data over pre-defined genomic regions

ScoreMatrix() and ScoreMatrixBin() are functions used to extractdata over predefined windows

ScoreMatrix is used when all of the windows have the samewidth (eg region around TSS)ScoreMatrixBin is designed for use with windows of unequalwidth (eg enrichment of methylation over exons)

data(cage)data(promoters)sm lt- ScoreMatrix(target = cage windows = promoters)sm

scoreMatrix with dims 1055 2001

genomation

package

Altuna Akalın

Usage and

ubiquity of

genomic

interval

summaries

Using

genomation

More

information

Visualizing ScoreMatrix summary of genomicinvervals over pre-defined regions

plotMeta()heatMeta() heatMatrix() and multiHeatMatrix()are the visualization functions

oldmar lt- par()$marpar(oma = c(0 0 0 0))heatMatrix(sm xcoords = c(-1000 1000))plotMeta(sm xcoords = c(-1000 1000)linecol=blue)par(oma = oldmar)

00

751

52

2 3

minus1000 minus500 0 500 1000 minus1000 minus500 0 500 1000

000

005

010

015

020

025

bases

aver

age

scor

e

genomation

package

Altuna Akalın

Usage and

ubiquity of

genomic

interval

summaries

Using

genomation

More

information

Working with BAM files

BAM files can also be used in ScoreMatrix() and ScoreMatrixBin()functions

bamfile = systemfile(teststestbam package=genomation)windows = GRanges(rep(c(12)each=2)

IRanges(rep(c(12) times=2) width=5))scores3 = ScoreMatrix(target=bamfilewindows=windows type=bam)

genomation

package

Altuna Akalın

Usage and

ubiquity of

genomic

interval

summaries

Using

genomation

More

information

Working with bigWig files

ScoreMatrix() and ScoreMatrixBin() are functions can handlebigWig files Here we use ENCODE DHS scores downloaded fromhttpgooglfEVu0g

mybed12file=systemfile(extdatachr21refseqhg19bedpackage = genomation)

feats=readTranscriptFeatures(mybed12fileupflank=500downflank=500)sm=ScoreMatrix(target=wgEncodeUwDnaseA549RawRep1bw

windows=feats$promoterstype=bigWigstrandaware=TRUE)plotMeta(smxcoords=c(-500500)main=DHS around TSSlinecol=blue)

minus400 0 200

46

810

14

DHS around TSS

bases

aver

age

scor

e

genomation

package

Altuna Akalın

Usage and

ubiquity of

genomic

interval

summaries

Using

genomation

More

information

Multiple profiles

Multiple heatmap profiles can be plotted using multiHeatMatrix()which takes in a ScoreMatrixList object Here we used CTCF P300 Suz12 Rad21 Znf143 BAM files from genomationData package

ctcfpeaks=readRDS(ctcfpeaksrds)dataPath = systemfile(extdata package = genomationData)bamfiles = listfiles(dataPath full= Tpattern = bam$)[c(146)]sml = ScoreMatrixList(bamfiles ctcfpeaks binnum = 50type = bam)names(sml)=c(CTCFP300Suz12Rad21Znf143)multiHeatMatrix(sml xcoords = c(-500 500)cexaxis=035commonscale = T

col = c(lightgray blue)winsorize=c(095))

minus50

0 minus

250

0

250

500

0 2 4 6 8

CTCF

minus50

0 minus

250

0

250

500

0 2 4 6 8

P300

minus50

0 minus

250

0

250

500

0 2 4 6 8

Suz12

minus50

0 minus

250

0

250

500

0 2 4 6 8

Rad21

minus50

0 minus

250

0

250

500

0 2 4 6 8

Znf143

genomation

package

Altuna Akalın

Usage and

ubiquity of

genomic

interval

summaries

Using

genomation

More

information

Multiple profiles

multiHeatMatrix() can also apply K-means clustering Extremevalues are trimmed using with ldquowinsorizerdquo argument

multiHeatMatrix(sml xcoords = c(-500 500)kmeans=TRUEk=3commonscale = Tcexaxis=04col = c(lightgray blue)winsorize=c(095))

1

2

3

minus50

0 minus

250

0

250

500

0 2 4 6 8

CTCF

minus50

0 minus

250

0

250

500

0 2 4 6 8

P300

minus50

0 minus

250

0

250

500

0 2 4 6 8

Suz12

minus50

0 minus

250

0

250

500

0 2 4 6 8

Rad21

minus50

0 minus

250

0

250

500

0 2 4 6 8

Znf143

genomation

package

Altuna Akalın

Usage and

ubiquity of

genomic

interval

summaries

Using

genomation

More

information

Multiple profiles

Multiple average profiles can be visualized with heatMeta() Herewe also apply a scaling function to all the matrices

take log2 of all matrices

sml2=scaleScoreMatrixList(smlscalefun=function(x) log2(x+1))heatMeta(sml2legendname=average profilesxcoords=c(-500 500)

xlab=bp around peaks)

minus400 minus200 0 200 400

bp around peaks

Znf143

Rad21

Suz12

P300

CTCF

021

061

11

41

8av

erag

e pr

ofile

s

genomation

package

Altuna Akalın

Usage and

ubiquity of

genomic

interval

summaries

Using

genomation

More

information

Multiple profiles

Multiple average profiles can also be visualized with plotMeta()

plotMeta(sml2profilenames=names(sml2)xcoords=c(-500 500)main=mult profiles)

minus400 minus200 0 200 400

05

10

15

mult profiles

bases

aver

age

scor

e

CTCFP300Suz12Rad21Znf143

genomation

package

Altuna Akalın

Usage and

ubiquity of

genomic

interval

summaries

Using

genomation

More

information

Future work

Explore overlap statistics between two genomic data sets DoesTF1 binding site locations overlap with TF2 sites more thanexpectedThis is previously explored with GenometriCorr package Thesefunctionality can be included in the form of a dependencyPerformance improvement on certain functions faster is alwaysbetter

genomation

package

Altuna Akalın

Usage and

ubiquity of

genomic

interval

summaries

Using

genomation

More

information

Further information

The genomation package is available athttpal2nagithubiogenomation You can find the linkto the vignette on the webpage as wellCode that generated this presentation is available athttpgithubcomal2nagenomation_presentation

Questions and bug reportsYou can viewopen issues in githubhttpsgithubcomal2nagenomationissuesstate=open

You can ask questions by sending an e-mail togenomationgooglegroupscom or using the web interface togoogle groups

Developed by Altuna Akalın and Vedran Franke

genomation

package

Altuna Akalın

Usage and

ubiquity of

genomic

interval

summaries

Using

genomation

More

information

Session Info

sessionInfo()

R version 302 (2013-09-25) Platform x86_64-apple-darwin1080 (64-bit) locale [1] C attached base packages [1] methods grid stats graphics grDevices utils datasets [8] base other attached packages [1] genomation_09902 knitr_15 loaded via a namespace (and not attached) [1] BSgenome_1300 BiocGenerics_080 Biostrings_2300 [4] GenomicRanges_1143 IRanges_1205 MASS_73-29 [7] RColorBrewer_10-5 RCurl_195-41 Rsamtools_1141 [10] XML_395-02 XVector_020 bitops_10-6 [13] colorspace_12-4 datatable_1810 dichromat_20-0 [16] digest_063 evaluate_051 formatR_010 [19] ggplot2_0931 gridBase_04-6 gtable_012 [22] highr_03 impute_1360 labeling_02 [25] munsell_042 parallel_302 plyr_18 [28] proto_03-10 reshape2_122 rtracklayer_1220 [31] scales_023 stats4_302 stringr_062 [34] tools_302 zlibbioc_180

  • Usage and ubiquity of genomic interval summaries
  • Using genomation
  • More information
Page 7: genomation: summary of genomic intervals

genomation

package

Altuna Akalın

Usage and

ubiquity of

genomic

interval

summaries

Using

genomation

More

information

Genomic interval summaries are widely used

Summaries of genomic intervals are one of the useful ways tocommunicate high-dimensional dataTraditionally regions of interest are picked and distribution ofgenomic intervals are summarized on those regions

genomation

package

Altuna Akalın

Usage and

ubiquity of

genomic

interval

summaries

Using

genomation

More

information

Genomic interval summaries are widely usedExamples from literature

Figure Erkek S et al (2013) Molecular determinants of nucleosomeretention at CpG-rich sequences in mouse spermatozoa Nature Structuralamp Molecular Biology

genomation

package

Altuna Akalın

Usage and

ubiquity of

genomic

interval

summaries

Using

genomation

More

information

Genomic interval summaries are widely usedExamples from literature

Figure Stadler M Murr R Burger L et al (2011) DNA-bindingfactors shape the mouse methylome at distal regulatory regions Nature

genomation

package

Altuna Akalın

Usage and

ubiquity of

genomic

interval

summaries

Using

genomation

More

information

Utility and futility of average profiles

Does this mean all of the windows (viewpoints) have a similarenrichment profile

minus100 0 50 100

35

40

45

50

average profile around anchor

bases

aver

age

scor

e

genomation

package

Altuna Akalın

Usage and

ubiquity of

genomic

interval

summaries

Using

genomation

More

information

Utility and futility of average profiles

Only 13 of windows have such enrichment Be careful when you areinterpreting the average profiles

05

2 1

0 1

6 2

1 minus100 0 50 100

genomation

package

Altuna Akalın

Usage and

ubiquity of

genomic

interval

summaries

Using

genomation

More

information

Genomic interval summaries are widely usedExamples from literature

Figure Lister R et al (2009) Human DNA methylomes at baseresolution show widespread epigenomic differences Nature

genomation

package

Altuna Akalın

Usage and

ubiquity of

genomic

interval

summaries

Using

genomation

More

information

Genomic interval summaries are widely usedExamples from literature

Figure Feng S et al (2010) Conservation and divergence ofmethylation patterning in plants and animals PNAS

genomation

package

Altuna Akalın

Usage and

ubiquity of

genomic

interval

summaries

Using

genomation

More

information

Issues to keep in mind when developing summarymethods

Genomic data comes in many formats we need a method that isable to work with multiple flat file formatsWe need a method that is not specialized on one type of dataset such as read counts it should also work on other scoringschemes(eg conservation scores) easilyRegions of interest are not always equi-width you should be ableto normalize for length differences by binningMultiple visualization options and fast heatmap generationshould be availableClustering of regions based on multiple summaries (egbinding for different TFs on the same set of regions) on theheatmapEase of use it should not take hours of coding to generate andvisualize summaries

genomation

package

Altuna Akalın

Usage and

ubiquity of

genomic

interval

summaries

Using

genomation

More

information

Overview of genomation features

BAMBigWigBEDGFFTab txtGRanges

BEDGFFTab txtGRanges

Summarize

Annotation

Genomic Intervals

Annotate

Visualize

Base-pairs bins1 2 3 4 n

ScoreMatrixScoreMatrixList object

region 1

region 2

region 3

region 4

region m

IntergenicIntronExonPromoter409

116

218257

iuml iuml 0 500 1000

00

02

04

06

08

10

base-pairs around anchor

read

per

milli

on

TF4TF3TF2TF1

iuml

iuml

0

500

100

0

0 05 1 15 2

TF 4

iuml

iuml

0

500

100

0

0 05 1 15 2 25

TF 3

iuml

iuml

0

500

100

0

0 05 1 15 2 25

TF 2

iuml

iuml

0

500

100

0

0 05 1 15 2 25

TF 1

iuml iuml 0 500 1000

base-pairs around anchor

TF1

TF2

TF3

TF4

007

20

340

60

861

1

meta-region plots meta-region heatmaps

heatmaps for genomic interval sets

Piecharts for annotation

genomation

package

Altuna Akalın

Usage and

ubiquity of

genomic

interval

summaries

Using

genomation

More

information

installation of the package and the example data

We can install the package and the data using install_github()

function from the devtools package

install dependencies

installpackages( c(datatableplyrreshape2ggplot2gridBasedevtools))

source(httpbioconductororgbiocLiteR)biocLite(c(GenomicRangesrtracklayerimputeRsamtools))

install the packages

library(devtools)install_github(genomation username = al2na)

install the data package

needed for examples

install_github(genomationData username = al2na)

genomation

package

Altuna Akalın

Usage and

ubiquity of

genomic

interval

summaries

Using

genomation

More

information

Data import

Various file formats can be used in genomation. You can read in annotation or your genomic intervals of interest.

    library(genomation)
    tabfile1 <- system.file("extdata/tab1.bed", package = "genomation")
    readGeneric(tabfile1)

    ## GRanges with 6 ranges and 0 metadata columns:
    ##       seqnames             ranges strand
    ##          <Rle>          <IRanges>  <Rle>
    ##   [1]    chr21 [9437272, 9439473]      *
    ##   [2]    chr21 [9483485, 9484663]      *
    ##   [3]    chr21 [9647866, 9648116]      *
    ##   [4]    chr21 [9708935, 9709231]      *
    ##   [5]    chr21 [9825442, 9826296]      *
    ##   [6]    chr21 [9909011, 9909218]      *
    ##   ---
    ##   seqlengths:
    ##    chr21
    ##       NA
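
For readers without the package at hand, the kind of input involved can be approximated in base R. The sketch below is an assumption-laden stand-in, not what readGeneric() does internally: it writes three of the intervals printed above to a temporary tab-separated file, reads them back as a plain data frame (not a GRanges object), and treats the coordinates as closed intervals.

```r
# three of the intervals shown in the output above, as tab-separated text
bed_lines <- c(
  "chr21\t9437272\t9439473",
  "chr21\t9483485\t9484663",
  "chr21\t9647866\t9648116"
)
tmp <- tempfile(fileext = ".bed")
writeLines(bed_lines, tmp)

# read the file into a plain data frame
intervals <- read.table(tmp, sep = "\t",
                        col.names = c("chr", "start", "end"),
                        stringsAsFactors = FALSE)

# width of each interval, treating start and end as inclusive
intervals$width <- intervals$end - intervals$start + 1
print(intervals)
```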

Extraction of data over pre-defined genomic regions

ScoreMatrix() and ScoreMatrixBin() are functions used to extract data over predefined windows.

  • ScoreMatrix is used when all of the windows have the same width (e.g. region around TSS).
  • ScoreMatrixBin is designed for use with windows of unequal width (e.g. enrichment of methylation over exons).

    data(cage)
    data(promoters)
    sm <- ScoreMatrix(target = cage, windows = promoters)
    sm

    ## scoreMatrix with dims: 1055 2001
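
The dimensions reported above are windows by positions. A minimal base R sketch of the same layout, using a made-up score track and made-up anchor positions rather than the cage and promoters data, may make this concrete. It illustrates the data structure only and is not the package's implementation.

```r
# toy per-base score track on a hypothetical 100 bp chromosome
set.seed(1)
track <- rpois(100, lambda = 2)

# equal-width windows (width 11) centred on three anchor positions
anchors <- c(20, 50, 80)
flank   <- 5
score_matrix <- t(sapply(anchors, function(a) track[(a - flank):(a + flank)]))

# one row per window, one column per position in the window
print(dim(score_matrix))

# column means give the average profile that a meta-region plot draws
avg_profile <- colMeans(score_matrix)
print(avg_profile)
```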

Visualizing ScoreMatrix: summary of genomic intervals over pre-defined regions

plotMeta(), heatMeta(), heatMatrix() and multiHeatMatrix() are the visualization functions.

    oldoma <- par()$oma   # save the outer margins that are changed below
    par(oma = c(0, 0, 0, 0))
    heatMatrix(sm, xcoords = c(-1000, 1000))
    plotMeta(sm, xcoords = c(-1000, 1000), line.col = "blue")
    par(oma = oldoma)     # restore the saved outer margins

[Figure: heatMatrix() heatmap and plotMeta() meta-region plot of sm; x-axis: bases (-1000 to 1000); y-axis of the meta-region plot: average score]

Working with BAM files

BAM files can also be used in the ScoreMatrix() and ScoreMatrixBin() functions.

    bamfile = system.file("tests/test.bam", package = "genomation")
    windows = GRanges(rep(c(1, 2), each = 2),
                      IRanges(rep(c(1, 2), times = 2), width = 5))
    scores3 = ScoreMatrix(target = bamfile, windows = windows, type = "bam")

Working with bigWig files

ScoreMatrix() and ScoreMatrixBin() can also handle bigWig files. Here we use ENCODE DHS scores downloaded from http://goo.gl/fEVu0g.

    mybed12file = system.file("extdata/chr21.refseq.hg19.bed", package = "genomation")
    feats = readTranscriptFeatures(mybed12file, up.flank = 500, down.flank = 500)
    sm = ScoreMatrix(target = "wgEncodeUwDnaseA549RawRep1.bw",
                     windows = feats$promoters, type = "bigWig", strand.aware = TRUE)
    plotMeta(sm, xcoords = c(-500, 500), main = "DHS around TSS", line.col = "blue")

[Figure: meta-region plot "DHS around TSS"; x-axis: bases; y-axis: average score]

Multiple profiles

Multiple heatmap profiles can be plotted using multiHeatMatrix(), which takes in a ScoreMatrixList object. Here we used CTCF, P300, Suz12, Rad21 and Znf143 BAM files from the genomationData package.

    ctcfpeaks = readRDS("ctcfpeaks.rds")
    dataPath = system.file("extdata", package = "genomationData")
    bamfiles = list.files(dataPath, full.names = TRUE, pattern = "bam$")[c(1:4, 6)]
    sml = ScoreMatrixList(bamfiles, ctcfpeaks, bin.num = 50, type = "bam")
    names(sml) = c("CTCF", "P300", "Suz12", "Rad21", "Znf143")
    multiHeatMatrix(sml, xcoords = c(-500, 500), cex.axis = 0.35, common.scale = TRUE,
                    col = c("lightgray", "blue"), winsorize = c(0, 95))

[Figure: side-by-side heatmaps for CTCF, P300, Suz12, Rad21 and Znf143; x-axis: -500 to 500; common color scale from 0 to 8]

Multiple profiles

multiHeatMatrix() can also apply k-means clustering. Extreme values are trimmed using the “winsorize” argument.

    multiHeatMatrix(sml, xcoords = c(-500, 500), kmeans = TRUE, k = 3,
                    common.scale = TRUE, cex.axis = 0.4,
                    col = c("lightgray", "blue"), winsorize = c(0, 95))

[Figure: heatmaps for CTCF, P300, Suz12, Rad21 and Znf143 with rows grouped into clusters 1, 2 and 3; x-axis: -500 to 500; common color scale from 0 to 8]
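
What trimming and clustering do to a score matrix can be sketched with base R alone. The code below is a hedged illustration, not the code behind multiHeatMatrix(): the winsorize() helper is hypothetical, it takes probabilities on a 0 to 1 scale (the slide passes percentiles, c(0, 95)), the toy matrix is simulated, and two clusters are used instead of three.

```r
# hypothetical helper: clamp values outside the given quantiles
winsorize <- function(mat, probs = c(0, 0.95)) {
  lim <- quantile(mat, probs = probs, na.rm = TRUE)
  mat[mat < lim[1]] <- lim[1]
  mat[mat > lim[2]] <- lim[2]
  mat
}

set.seed(42)
# toy score matrix: 30 windows x 20 bins, first 10 windows enriched in the centre
mat <- matrix(rpois(600, 1), nrow = 30)
mat[1:10, 8:13] <- mat[1:10, 8:13] + 10
mat[30, 1] <- 1000                        # one extreme outlier

trimmed  <- winsorize(mat, c(0, 0.95))
clusters <- kmeans(trimmed, centers = 2, nstart = 10)$cluster

print(max(trimmed))            # the outlier has been clamped
print(table(clusters))         # cluster sizes
row_order <- order(clusters)   # rows grouped by cluster, as on a clustered heatmap
```

Without the trimming step a single extreme value would stretch the color scale and could dominate the distance calculation used by k-means.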

Multiple profiles

Multiple average profiles can be visualized with heatMeta(). Here we also apply a scaling function to all the matrices.

    # take log2 of all matrices
    sml2 = scaleScoreMatrixList(sml, scalefun = function(x) log2(x + 1))
    heatMeta(sml2, legend.name = "average profiles", xcoords = c(-500, 500),
             xlab = "bp around peaks")

[Figure: heatMeta() heatmap with one row each for Znf143, Rad21, Suz12, P300 and CTCF; x-axis: bp around peaks (-400 to 400); legend: average profiles]
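
The two steps on this slide, scaling every matrix and then reducing each one to an average profile, can be mimicked with a named list of plain matrices. This is a sketch under assumptions: sml_toy is simulated data standing in for a ScoreMatrixList, and the code does not call genomation.

```r
set.seed(7)
# toy stand-in for a ScoreMatrixList: a named list of plain matrices
sml_toy <- list(
  CTCF = matrix(rpois(200, 20), nrow = 10),
  P300 = matrix(rpois(200, 2),  nrow = 10)
)

# apply the same scaling function to every matrix
scaled <- lapply(sml_toy, function(x) log2(x + 1))

# one average profile per matrix: rows = data sets, columns = bins
profiles <- t(sapply(scaled, colMeans))
print(dim(profiles))
print(round(profiles[, 1:5], 2))
```

The pseudocount of 1 keeps zero scores finite after the log transform, which is presumably why the slide uses log2(x + 1) rather than log2(x).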

Multiple profiles

Multiple average profiles can also be visualized with plotMeta().

    plotMeta(sml2, profile.names = names(sml2), xcoords = c(-500, 500),
             main = "mult profiles")

[Figure: meta-region plot "mult profiles" with one line each for CTCF, P300, Suz12, Rad21 and Znf143; x-axis: bases (-400 to 400); y-axis: average score]

Future work

  • Explore overlap statistics between two genomic data sets: do TF1 binding site locations overlap with TF2 sites more than expected? This was previously explored in the GenometriCorr package. This functionality can be included in the form of a dependency.
  • Performance improvement on certain functions: faster is always better.
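
One common way to ask whether two interval sets overlap more than expected is a permutation test. The base R sketch below is an illustration of that general idea only: it is not the GenometriCorr method and not a planned genomation API, and the helper names, the toy chromosome length and the simulated TF1/TF2 sites are all assumptions.

```r
# count how many intervals in `a` overlap at least one interval in `b`
# (data frames with columns start and end, closed intervals)
count_overlaps <- function(a, b) {
  sum(sapply(seq_len(nrow(a)), function(i) {
    any(a$start[i] <= b$end & a$end[i] >= b$start)
  }))
}

# place intervals of the same widths at random positions on the chromosome
shuffle <- function(x, chr_len) {
  width <- x$end - x$start
  start <- sample.int(chr_len - max(width), nrow(x), replace = TRUE)
  data.frame(start = start, end = start + width)
}

set.seed(123)
chr_len <- 100000
tf1 <- data.frame(start = sort(sample.int(chr_len - 200, 50)))
tf1$end <- tf1$start + 200

# toy TF2 sites: 25 placed next to TF1 sites, 25 placed at random
tf2_start <- c(tf1$start[1:25] + 50, sample.int(chr_len - 200, 25))
tf2 <- data.frame(start = tf2_start, end = tf2_start + 200)

observed <- count_overlaps(tf1, tf2)
null     <- replicate(500, count_overlaps(tf1, shuffle(tf2, chr_len)))
p_value  <- (sum(null >= observed) + 1) / (length(null) + 1)
print(c(observed = observed, expected = mean(null), p = p_value))
```

Uniform shuffling is the simplest possible null model; real genomes have mappability and composition biases that a production test would need to account for.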

Further information

The genomation package is available at http://al2na.github.io/genomation/. You can find the link to the vignette on the webpage as well. Code that generated this presentation is available at http://github.com/al2na/genomation_presentation.

Questions and bug reports: you can view/open issues on GitHub: https://github.com/al2na/genomation/issues?state=open

You can ask questions by sending an e-mail to genomation@googlegroups.com or by using the web interface to Google Groups.

Developed by Altuna Akalın and Vedran Franke

Session Info

    sessionInfo()

    ## R version 3.0.2 (2013-09-25)
    ## Platform: x86_64-apple-darwin10.8.0 (64-bit)
    ##
    ## locale:
    ## [1] C
    ##
    ## attached base packages:
    ## [1] methods   grid      stats     graphics  grDevices utils     datasets
    ## [8] base
    ##
    ## other attached packages:
    ## [1] genomation_0.99.0.2 knitr_1.5
    ##
    ## loaded via a namespace (and not attached):
    ##  [1] BSgenome_1.30.0      BiocGenerics_0.8.0   Biostrings_2.30.0
    ##  [4] GenomicRanges_1.14.3 IRanges_1.20.5       MASS_7.3-29
    ##  [7] RColorBrewer_1.0-5   RCurl_1.95-4.1       Rsamtools_1.14.1
    ## [10] XML_3.95-0.2         XVector_0.2.0        bitops_1.0-6
    ## [13] colorspace_1.2-4     data.table_1.8.10    dichromat_2.0-0
    ## [16] digest_0.6.3         evaluate_0.5.1       formatR_0.10
    ## [19] ggplot2_0.9.3.1      gridBase_0.4-6       gtable_0.1.2
    ## [22] highr_0.3            impute_1.36.0        labeling_0.2
    ## [25] munsell_0.4.2        parallel_3.0.2       plyr_1.8
    ## [28] proto_0.3-10         reshape2_1.2.2       rtracklayer_1.22.0
    ## [31] scales_0.2.3         stats4_3.0.2         stringr_0.6.2
    ## [34] tools_3.0.2          zlibbioc_1.8.0

  • Usage and ubiquity of genomic interval summaries
  • Using genomation
  • More information
Page 8: genomation: summary of genomic intervals

Genomic interval summaries are widely used: examples from literature

Figure: Erkek S. et al. (2013) Molecular determinants of nucleosome retention at CpG-rich sequences in mouse spermatozoa. Nature Structural & Molecular Biology.

Genomic interval summaries are widely used: examples from literature

Figure: Stadler M., Murr R., Burger L. et al. (2011) DNA-binding factors shape the mouse methylome at distal regulatory regions. Nature.

Utility and futility of average profiles

Does this mean all of the windows (viewpoints) have a similar enrichment profile?

[Figure: "average profile around anchor"; x-axis: bases (-100 to 100); y-axis: average score (3.5 to 5.0)]

Utility and futility of average profiles

Only 1/3 of windows have such enrichment. Be careful when you are interpreting the average profiles.

[Figure: heatmap of the individual windows behind the average profile; x-axis: bases (-100 to 100)]
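
The warning on this slide is easy to reproduce with simulated data. In the base R sketch below all numbers (300 windows, background level, peak height) are made up for illustration; only a third of the windows carries a peak, yet the average over all windows still shows clear enrichment at the anchor.

```r
set.seed(2014)
n_windows <- 300
pos       <- seq(-100, 100)          # positions around the anchor
n_pos     <- length(pos)

# background noise for every window
mat <- matrix(rnorm(n_windows * n_pos, mean = 3.5, sd = 0.5), nrow = n_windows)

# only the first third of the windows carries a peak at the anchor
peak     <- 4.5 * exp(-(pos / 20)^2)
enriched <- seq_len(n_windows / 3)
mat[enriched, ] <- sweep(mat[enriched, ], 2, peak, "+")

avg_all      <- colMeans(mat)
avg_enriched <- colMeans(mat[enriched, ])
avg_rest     <- colMeans(mat[-enriched, ])

# average score at the anchor for all windows and for the two subsets
centre <- which(pos == 0)
print(round(c(all = avg_all[centre],
              enriched = avg_enriched[centre],
              rest = avg_rest[centre]), 2))
```

Plotting the matrix as a heatmap, or clustering its rows first, is what reveals that the signal comes from a subset of windows.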

Genomic interval summaries are widely used: examples from literature

Figure: Lister R. et al. (2009) Human DNA methylomes at base resolution show widespread epigenomic differences. Nature.

Genomic interval summaries are widely used: examples from literature

Figure: Feng S. et al. (2010) Conservation and divergence of methylation patterning in plants and animals. PNAS.

Page 10: genomation: summary of genomic intervals

genomation

package

Altuna Akalın

Usage and

ubiquity of

genomic

interval

summaries

Using

genomation

More

information

Utility and futility of average profiles

Does this mean all of the windows (viewpoints) have a similarenrichment profile

minus100 0 50 100

35

40

45

50

average profile around anchor

bases

aver

age

scor

e

genomation

package

Altuna Akalın

Usage and

ubiquity of

genomic

interval

summaries

Using

genomation

More

information

Utility and futility of average profiles

Only 13 of windows have such enrichment Be careful when you areinterpreting the average profiles

05

2 1

0 1

6 2

1 minus100 0 50 100

genomation

package

Altuna Akalın

Usage and

ubiquity of

genomic

interval

summaries

Using

genomation

More

information

Genomic interval summaries are widely usedExamples from literature

Figure Lister R et al (2009) Human DNA methylomes at baseresolution show widespread epigenomic differences Nature


Genomic interval summaries are widely used

Examples from literature

Figure: Feng S. et al. (2010) Conservation and divergence of methylation patterning in plants and animals. PNAS.


Issues to keep in mind when developing summary methods

  • Genomic data comes in many formats: we need a method that is able to work with multiple flat file formats.
  • We need a method that is not specialized on one type of data set, such as read counts; it should also work easily on other scoring schemes (e.g. conservation scores).
  • Regions of interest are not always equi-width: you should be able to normalize for length differences by binning.
  • Multiple visualization options and fast heatmap generation should be available.
  • Clustering of regions based on multiple summaries (e.g. binding for different TFs on the same set of regions) on the heatmap.
  • Ease of use: it should not take hours of coding to generate and visualize summaries.


Overview of genomation features

[Figure: overview diagram. Input labels: BAM, BigWig, BED, GFF, Tab txt, GRanges (first group) and BED, GFF, Tab txt, GRanges (second group). Step labels: Summarize, Annotate, Visualize, with Annotation and Genomic Intervals as inputs. Central object: ScoreMatrix/ScoreMatrixList, drawn as regions 1 to m (rows) by base-pairs/bins 1 to n (columns). Output panels: meta-region plots, meta-region heatmaps, heatmaps for genomic interval sets (TF1 to TF4, base-pairs around anchor) and piecharts for annotation (Intergenic, Intron, Exon, Promoter).]

Installation of the package and the example data

We can install the package and the data using the install_github() function from the devtools package.

# install dependencies
install.packages(c("data.table", "plyr", "reshape2", "ggplot2", "gridBase", "devtools"))
source("http://bioconductor.org/biocLite.R")
biocLite(c("GenomicRanges", "rtracklayer", "impute", "Rsamtools"))

# install the packages
library(devtools)
install_github("genomation", username = "al2na")

# install the data package
# needed for examples
install_github("genomationData", username = "al2na")


Data import

Various file formats can be used in genomation. You can read in annotation or your genomic intervals of interest.

library(genomation)
tabfile1 <- system.file("extdata/tab1.bed", package = "genomation")
readGeneric(tabfile1)

GRanges with 6 ranges and 0 metadata columns:
      seqnames             ranges strand
         <Rle>          <IRanges>  <Rle>
  [1]    chr21 [9437272, 9439473]      *
  [2]    chr21 [9483485, 9484663]      *
  [3]    chr21 [9647866, 9648116]      *
  [4]    chr21 [9708935, 9709231]      *
  [5]    chr21 [9825442, 9826296]      *
  [6]    chr21 [9909011, 9909218]      *
  ---
  seqlengths:
   chr21
      NA
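To make concrete what a "generic tabular text file containing chromosome location information" holds, here is a minimal base R sketch that does not use genomation. The three intervals and all variable names are made up for illustration. The coordinate shift reflects the BED convention (0-based start, exclusive end); whether a given reader applies it depends on that reader's settings.

```r
# Hypothetical three-interval BED-like table (tab-separated, no header).
bed.text <- c("chr21\t100\t200", "chr21\t300\t450", "chr22\t0\t50")
tmp <- tempfile(fileext = ".bed")
writeLines(bed.text, tmp)

intervals <- read.table(tmp, sep = "\t", header = FALSE,
                        col.names = c("chr", "start", "end"),
                        stringsAsFactors = FALSE)

# BED starts are 0-based and ends are exclusive; adding 1 to the start
# gives 1-based, closed coordinates.
intervals$start <- intervals$start + 1
intervals$width <- intervals$end - intervals$start + 1
intervals
```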


Extraction of data over pre-defined genomic regions

ScoreMatrix() and ScoreMatrixBin() are functions used to extract data over predefined windows.

  • ScoreMatrix is used when all of the windows have the same width (e.g. region around TSS).
  • ScoreMatrixBin is designed for use with windows of unequal width (e.g. enrichment of methylation over exons).

data(cage)
data(promoters)
sm <- ScoreMatrix(target = cage, windows = promoters)
sm

scoreMatrix with dims: 1055 2001
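The difference between the two cases can be sketched in base R. This is an illustration of the idea only, not the genomation implementation: the score vector is random, bin.window() is a hypothetical helper, and the per-bin mean is an assumed choice of summary.

```r
# Hypothetical per-base scores along one chromosome.
set.seed(42)
scores <- runif(1000)

# Equal-width windows: each row of the matrix is one window,
# each column one base position within the window.
starts <- c(101, 301, 501)
width <- 50
sm.equal <- t(sapply(starts, function(s) scores[s:(s + width - 1)]))
dim(sm.equal)   # windows by bases

# Unequal-width windows: summarize each window into a fixed number
# of bins (here the mean per bin) so that rows become comparable.
bin.window <- function(x, n.bins) {
  bin.id <- cut(seq_along(x), breaks = n.bins, labels = FALSE)
  as.numeric(tapply(x, bin.id, mean))
}
windows <- list(scores[1:40], scores[200:319], scores[600:999])
sm.binned <- t(sapply(windows, bin.window, n.bins = 10))
dim(sm.binned)  # windows by bins
```

After binning, windows of 40, 120 and 400 bases all occupy ten columns, which is what makes a common heatmap or average profile possible.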


Visualizing ScoreMatrix: summary of genomic intervals over pre-defined regions

plotMeta(), heatMeta(), heatMatrix() and multiHeatMatrix() are the visualization functions.

# par() returns the previous settings, so they can be restored afterwards
oldpar <- par(oma = c(0, 0, 0, 0))
heatMatrix(sm, xcoords = c(-1000, 1000))
plotMeta(sm, xcoords = c(-1000, 1000), line.col = "blue")
par(oldpar)

[Figure: heatmap of the score matrix and the corresponding meta-region plot; x-axis: bases (-1000 to 1000), y-axis of the plot: average score]

Working with BAM files

BAM files can also be used in the ScoreMatrix() and ScoreMatrixBin() functions.

bamfile = system.file("tests/test.bam", package = "genomation")
windows = GRanges(rep(c(1, 2), each = 2),
                  IRanges(rep(c(1, 2), times = 2), width = 5))
scores3 = ScoreMatrix(target = bamfile, windows = windows, type = "bam")


Working with bigWig files

ScoreMatrix() and ScoreMatrixBin() can also handle bigWig files. Here we use ENCODE DHS scores downloaded from http://goo.gl/fEVu0g

mybed12file = system.file("extdata/chr21.refseq.hg19.bed", package = "genomation")
feats = readTranscriptFeatures(mybed12file, up.flank = 500, down.flank = 500)
sm = ScoreMatrix(target = "wgEncodeUwDnaseA549RawRep1.bw",
                 windows = feats$promoters, type = "bigWig", strand.aware = TRUE)
plotMeta(sm, xcoords = c(-500, 500), main = "DHS around TSS", line.col = "blue")

[Figure: "DHS around TSS" meta-region plot; x-axis: bases, y-axis: average score]

Multiple profiles

Multiple heatmap profiles can be plotted using multiHeatMatrix(), which takes in a ScoreMatrixList object. Here we used CTCF, P300, Suz12, Rad21 and Znf143 BAM files from the genomationData package.

ctcfpeaks = readRDS("ctcfpeaks.rds")
dataPath = system.file("extdata", package = "genomationData")
bamfiles = list.files(dataPath, full.names = TRUE, pattern = "bam$")[c(1:4, 6)]
sml = ScoreMatrixList(bamfiles, ctcfpeaks, bin.num = 50, type = "bam")
names(sml) = c("CTCF", "P300", "Suz12", "Rad21", "Znf143")
multiHeatMatrix(sml, xcoords = c(-500, 500), cex.axis = 0.35, common.scale = TRUE,
                col = c("lightgray", "blue"), winsorize = c(0, 95))

[Figure: side-by-side heatmaps for CTCF, P300, Suz12, Rad21 and Znf143 on a common color scale (0 to 8); x-axis: -500 to 500]

Multiple profiles

multiHeatMatrix() can also apply K-means clustering. Extreme values are trimmed using the "winsorize" argument.

multiHeatMatrix(sml, xcoords = c(-500, 500), kmeans = TRUE, k = 3, common.scale = TRUE,
                cex.axis = 0.4, col = c("lightgray", "blue"), winsorize = c(0, 95))

[Figure: side-by-side heatmaps for CTCF, P300, Suz12, Rad21 and Znf143 with rows grouped into K-means clusters 1, 2 and 3; common color scale (0 to 8); x-axis: -500 to 500]
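What winsorizing and K-means row clustering do to a score matrix can be sketched in base R. This is an illustration under assumptions, not the code inside multiHeatMatrix(): the matrix is simulated, winsorize() is a hypothetical helper that clamps values to the chosen percentiles, and two clusters are used only because the toy data have two groups.

```r
set.seed(7)
# Hypothetical score matrix: 60 windows x 20 bins, two row groups,
# plus one extreme outlier value.
mat <- rbind(matrix(rnorm(30 * 20, mean = 5), nrow = 30),
             matrix(rnorm(30 * 20, mean = 1), nrow = 30))
mat[1, 1] <- 1000

# Winsorize: clamp values outside the chosen percentiles (0 and 95 here)
# to the percentile values, so outliers do not dominate the color scale.
winsorize <- function(m, probs = c(0, 0.95)) {
  lim <- quantile(m, probs = probs, na.rm = TRUE)
  pmin(pmax(m, lim[1]), lim[2])
}
mat.w <- winsorize(mat)

# Cluster rows with k-means and order the rows by cluster, as one
# would before drawing a clustered heatmap.
cl <- kmeans(mat.w, centers = 2, nstart = 10)
row.order <- order(cl$cluster)
mat.ordered <- mat.w[row.order, ]
```

Without the winsorizing step the single value of 1000 would stretch the color scale so far that the two row groups become indistinguishable.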

Multiple profiles

Multiple average profiles can be visualized with heatMeta(). Here we also apply a scaling function to all the matrices.

# take log2 of all matrices
sml2 = scaleScoreMatrixList(sml, scalefun = function(x) log2(x + 1))
heatMeta(sml2, legend.name = "average profiles", xcoords = c(-500, 500),
         xlab = "bp around peaks")

[Figure: heatmap of average profiles, one row each for Znf143, Rad21, Suz12, P300 and CTCF; x-axis: bp around peaks; legend: average profiles]
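The two steps on this slide, scaling every matrix with the same function and then reducing each matrix to one average profile, can be sketched in base R. The list sml.toy and its contents are simulated stand-ins, not a real ScoreMatrixList.

```r
set.seed(3)
# Hypothetical list of score matrices (windows x bins), one per factor.
sml.toy <- list(A = matrix(rpois(50 * 10, lambda = 8), nrow = 50),
                B = matrix(rpois(50 * 10, lambda = 2), nrow = 50))

# Apply the same scaling function to every matrix.
scaled <- lapply(sml.toy, function(m) log2(m + 1))

# One average profile per matrix: the column means.
profiles <- t(sapply(scaled, colMeans))
dim(profiles)  # profiles (rows) by bins (columns)
```

Each row of profiles corresponds to one row of a heatMeta()-style heatmap or one line of a plotMeta()-style plot.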

Multiple profiles

Multiple average profiles can also be visualized with plotMeta().

plotMeta(sml2, profile.names = names(sml2), xcoords = c(-500, 500), main = "mult profiles")

[Figure: "mult profiles" meta-region plot with one line each for CTCF, P300, Suz12, Rad21 and Znf143; x-axis: bases, y-axis: average score]

Future work

  • Explore overlap statistics between two genomic data sets: do TF1 binding site locations overlap with TF2 sites more than expected?
  • This was previously explored in the GenometriCorr package. This functionality can be included in the form of a dependency.
  • Performance improvement on certain functions: faster is always better.
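One common way to ask whether two site sets overlap "more than expected" is a permutation test. The sketch below is base R on simulated positions. It is neither the GenometriCorr method nor a planned genomation feature, and the chromosome length, site width, helper function and uniform-shuffling null model are all assumptions made for illustration.

```r
set.seed(11)
chrom.len <- 1e6
site.width <- 200

# Hypothetical binding sites (start positions): half of the TF2 sites
# are placed right next to TF1 sites, the rest are random.
tf1 <- sort(sample(chrom.len - site.width, 100))
tf2 <- c(tf1[1:50] + 20, sample(chrom.len - site.width, 50))

# Count sites in a that overlap at least one site in b. Two sites of
# equal width w overlap when their starts differ by less than w.
count.overlaps <- function(a, b, w) {
  sum(sapply(a, function(s) any(abs(s - b) < w)))
}
observed <- count.overlaps(tf1, tf2, site.width)

# Null distribution: place the TF2 sites uniformly at random.
null <- replicate(200, {
  shuffled <- sample(chrom.len - site.width, length(tf2))
  count.overlaps(tf1, shuffled, site.width)
})
p.value <- (sum(null >= observed) + 1) / (length(null) + 1)
```

A uniform shuffle ignores real genome structure such as gaps and gene density, so it is only the simplest possible null model.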


Further information

The genomation package is available at http://al2na.github.io/genomation. You can find the link to the vignette on the webpage as well.

Code that generated this presentation is available at http://github.com/al2na/genomation_presentation

Questions and bug reports

You can view/open issues on GitHub: https://github.com/al2na/genomation/issues?state=open

You can ask questions by sending an e-mail to genomation@googlegroups.com or by using the web interface to Google Groups.

Developed by Altuna Akalın and Vedran Franke


Session Info

sessionInfo()

R version 3.0.2 (2013-09-25)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] C

attached base packages:
[1] methods   grid      stats     graphics  grDevices utils     datasets
[8] base

other attached packages:
[1] genomation_0.99.0.2 knitr_1.5

loaded via a namespace (and not attached):
 [1] BSgenome_1.30.0      BiocGenerics_0.8.0   Biostrings_2.30.0
 [4] GenomicRanges_1.14.3 IRanges_1.20.5       MASS_7.3-29
 [7] RColorBrewer_1.0-5   RCurl_1.95-4.1       Rsamtools_1.14.1
[10] XML_3.95-0.2         XVector_0.2.0        bitops_1.0-6
[13] colorspace_1.2-4     data.table_1.8.10    dichromat_2.0-0
[16] digest_0.6.3         evaluate_0.5.1       formatR_0.10
[19] ggplot2_0.9.3.1      gridBase_0.4-6       gtable_0.1.2
[22] highr_0.3            impute_1.36.0        labeling_0.2
[25] munsell_0.4.2        parallel_3.0.2       plyr_1.8
[28] proto_0.3-10         reshape2_1.2.2       rtracklayer_1.22.0
[31] scales_0.2.3         stats4_3.0.2         stringr_0.6.2
[34] tools_3.0.2          zlibbioc_1.8.0

Page 13: genomation: summary of genomic intervals

genomation

package

Altuna Akalın

Usage and

ubiquity of

genomic

interval

summaries

Using

genomation

More

information

Genomic interval summaries are widely used

Examples from literature

Figure: Feng S. et al. (2010). Conservation and divergence of methylation patterning in plants and animals. PNAS.

Issues to keep in mind when developing summary methods

  • Genomic data comes in many formats: we need a method that is able to work with multiple flat file formats.
  • We need a method that is not specialized on one type of data set such as read counts; it should also work easily on other scoring schemes (e.g. conservation scores).
  • Regions of interest are not always equi-width: you should be able to normalize for length differences by binning.
  • Multiple visualization options and fast heatmap generation should be available.
  • Clustering of regions based on multiple summaries (e.g. binding for different TFs on the same set of regions) on the heatmap.
  • Ease of use: it should not take hours of coding to generate and visualize summaries.

Overview of genomation features

Figure: overview of the genomation workflow. Data to summarize can come from BAM, BigWig, BED, GFF, tab-delimited text files or GRanges objects; genomic intervals and annotation can come from BED, GFF, tab-delimited text files or GRanges objects. Summarizing produces a ScoreMatrix/ScoreMatrixList object (regions 1 to m in rows, base-pairs or bins 1 to n in columns). Visualization options are meta-region plots, meta-region heatmaps and heatmaps for genomic interval sets; annotation results (intergenic, intron, exon, promoter) are shown as pie charts.

Installation of the package and the example data

We can install the package and the data using the install_github() function from the devtools package.

# install dependencies
install.packages(c("data.table", "plyr", "reshape2", "ggplot2", "gridBase", "devtools"))
source("http://bioconductor.org/biocLite.R")
biocLite(c("GenomicRanges", "rtracklayer", "impute", "Rsamtools"))

# install the packages
library(devtools)
install_github("genomation", username = "al2na")

# install the data package
# needed for examples
install_github("genomationData", username = "al2na")

Data import

Various file formats can be used in genomation. You can read in annotation or your genomic intervals of interest.

library(genomation)
tabfile1 <- system.file("extdata/tab1.bed", package = "genomation")
readGeneric(tabfile1)

GRanges with 6 ranges and 0 metadata columns:
      seqnames             ranges strand
         <Rle>          <IRanges>  <Rle>
  [1]    chr21 [9437272, 9439473]      *
  [2]    chr21 [9483485, 9484663]      *
  [3]    chr21 [9647866, 9648116]      *
  [4]    chr21 [9708935, 9709231]      *
  [5]    chr21 [9825442, 9826296]      *
  [6]    chr21 [9909011, 9909218]      *
  ---
  seqlengths:
   chr21
      NA

Extraction of data over pre-defined genomic regions

ScoreMatrix() and ScoreMatrixBin() are functions used to extract data over predefined windows.

  • ScoreMatrix is used when all of the windows have the same width (e.g. the region around the TSS).
  • ScoreMatrixBin is designed for use with windows of unequal width (e.g. enrichment of methylation over exons).

data(cage)
data(promoters)
sm <- ScoreMatrix(target = cage, windows = promoters)
sm

scoreMatrix with dims: 1055 2001
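The idea behind binning windows of unequal width can be sketched in base R. This is only an illustration of the concept, not the ScoreMatrixBin() implementation; the score vector, the region coordinates and the helper bin_region() are hypothetical, and the sketch assumes every region is at least as wide as the number of bins.

```r
# Toy per-base scores along a hypothetical 1000 bp chromosome
scores <- as.numeric(seq_len(1000))

# Hypothetical regions of unequal width (inclusive coordinates)
regions <- data.frame(start = c(1, 101, 501), end = c(100, 400, 1000))

# Summarise one region into `bin_num` bins by averaging the scores in each bin
bin_region <- function(scores, start, end, bin_num) {
  values <- scores[start:end]
  # assign every base to one of `bin_num` roughly equal-sized bins
  bin_id <- cut(seq_along(values), breaks = bin_num, labels = FALSE)
  as.numeric(tapply(values, bin_id, mean))
}

bin_num <- 10
binned <- t(mapply(bin_region, regions$start, regions$end,
                   MoreArgs = list(scores = scores, bin_num = bin_num)))
dim(binned)  # one row per region, one column per bin
```

After binning, regions of 100, 300 and 500 bp all occupy one row of the same length, which is what makes them comparable in a single matrix or heatmap.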

Visualizing ScoreMatrix: summary of genomic intervals over pre-defined regions

plotMeta(), heatMeta(), heatMatrix() and multiHeatMatrix() are the visualization functions.

# save the outer margins so they can be restored afterwards
oldoma <- par()$oma
par(oma = c(0, 0, 0, 0))
heatMatrix(sm, xcoords = c(-1000, 1000))
plotMeta(sm, xcoords = c(-1000, 1000), line.col = "blue")
par(oma = oldoma)

Figure: heatmap of the score matrix and meta-region plot of the same data; x-axis: bases (-1000 to 1000), y-axis of the meta-region plot: average score.

Working with BAM files

BAM files can also be used in the ScoreMatrix() and ScoreMatrixBin() functions.

bamfile = system.file("tests/test.bam", package = "genomation")
windows = GRanges(rep(c(1, 2), each = 2),
                  IRanges(rep(c(1, 2), times = 2), width = 5))
scores3 = ScoreMatrix(target = bamfile, windows = windows, type = "bam")

Working with bigWig files

ScoreMatrix() and ScoreMatrixBin() can also handle bigWig files. Here we use ENCODE DHS scores downloaded from http://goo.gl/fEVu0g

mybed12file = system.file("extdata/chr21.refseq.hg19.bed", package = "genomation")
feats = readTranscriptFeatures(mybed12file, up.flank = 500, down.flank = 500)
sm = ScoreMatrix(target = "wgEncodeUwDnaseA549RawRep1.bw",
                 windows = feats$promoters, type = "bigWig", strand.aware = TRUE)
plotMeta(sm, xcoords = c(-500, 500), main = "DHS around TSS", line.col = "blue")

Figure: meta-region plot titled "DHS around TSS"; x-axis: bases, y-axis: average score.

Multiple profiles

Multiple heatmap profiles can be plotted using multiHeatMatrix(), which takes in a ScoreMatrixList object. Here we used the CTCF, P300, Suz12, Rad21 and Znf143 BAM files from the genomationData package.

ctcfpeaks = readRDS("ctcfpeaks.rds")
dataPath = system.file("extdata", package = "genomationData")
# five BAM files, matching the five names assigned below
bamfiles = list.files(dataPath, full.names = TRUE, pattern = "bam$")[c(1:4, 6)]
sml = ScoreMatrixList(bamfiles, ctcfpeaks, bin.num = 50, type = "bam")
names(sml) = c("CTCF", "P300", "Suz12", "Rad21", "Znf143")
multiHeatMatrix(sml, xcoords = c(-500, 500), cex.axis = 0.35, common.scale = TRUE,
                col = c("lightgray", "blue"), winsorize = c(0, 95))

Figure: side-by-side heatmaps for CTCF, P300, Suz12, Rad21 and Znf143 over the CTCF peaks; x-axis from -500 to 500, common color scale from 0 to 8.

Multiple profiles

multiHeatMatrix() can also apply K-means clustering. Extreme values are trimmed using the "winsorize" argument.

multiHeatMatrix(sml, xcoords = c(-500, 500), kmeans = TRUE, k = 3,
                common.scale = TRUE, cex.axis = 0.4,
                col = c("lightgray", "blue"), winsorize = c(0, 95))
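What trimming and clustering do to a score matrix can be sketched in base R. This is an illustration under stated assumptions rather than genomation's own code: the matrix is simulated, winsorize_matrix() is a hypothetical helper, and it takes probabilities on a 0 to 1 scale, whereas the call above passes percentiles such as c(0, 95).

```r
set.seed(42)

# Hypothetical score matrix: 60 regions (rows) x 20 bins (columns).
# The first 30 regions carry a signal in the central bins.
mat <- matrix(rexp(60 * 20), nrow = 60)
mat[1:30, 8:13] <- mat[1:30, 8:13] + 5
mat[1, 1] <- 1000  # an extreme value that would dominate the colour scale

# Clamp values outside the given quantiles to the quantile values
winsorize_matrix <- function(m, probs = c(0, 0.95)) {
  lim <- quantile(m, probs = probs, na.rm = TRUE)
  m[m < lim[1]] <- lim[1]
  m[m > lim[2]] <- lim[2]
  m
}

mat_w <- winsorize_matrix(mat)

# Cluster regions by their profile and order the rows by cluster,
# which is how clustered heatmaps group similar regions together
km <- kmeans(mat_w, centers = 2, nstart = 10)
ordered <- mat_w[order(km$cluster), ]
table(km$cluster)
```

Without the clamping step the single value of 1000 would stretch the color scale so far that the real signal in the central bins would be hard to see.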

Figure: the same heatmaps for CTCF, P300, Suz12, Rad21 and Znf143 with regions grouped into K-means clusters 1, 2 and 3; x-axis from -500 to 500, common color scale from 0 to 8.

Multiple profiles

Multiple average profiles can be visualized with heatMeta(). Here we also apply a scaling function to all the matrices.

# take log2 of all matrices
sml2 = scaleScoreMatrixList(sml, scalefun = function(x) log2(x + 1))
heatMeta(sml2, legend.name = "average profiles", xcoords = c(-500, 500),
         xlab = "bp around peaks")
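The two steps on this slide, scaling every matrix and then averaging over regions, can be sketched with plain matrices. The list below is simulated and the names score_list and scale_fun are assumptions for illustration; it shows the computation only, not how scaleScoreMatrixList() or heatMeta() are implemented.

```r
# Hypothetical list of score matrices (regions x bins), one per experiment
set.seed(7)
score_list <- list(
  CTCF = matrix(rpois(40 * 10, lambda = 8), nrow = 40),
  P300 = matrix(rpois(40 * 10, lambda = 2), nrow = 40)
)

# Apply the same scaling function to every matrix in the list
scale_fun <- function(x) log2(x + 1)
scaled_list <- lapply(score_list, scale_fun)

# A meta-profile is the column-wise average over all regions
meta_profiles <- t(sapply(scaled_list, colMeans))
meta_profiles  # one row per experiment, one column per bin
```

Adding 1 before taking the logarithm keeps bins with a score of zero finite, since log2(0 + 1) is 0.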

Figure: heatmap of average profiles, one row each for Znf143, Rad21, Suz12, P300 and CTCF; x-axis: bp around peaks (-500 to 500), legend: average profiles.

Multiple profiles

Multiple average profiles can also be visualized with plotMeta().

plotMeta(sml2, profile.names = names(sml2), xcoords = c(-500, 500), main = "mult profiles")

Figure: line plot titled "mult profiles" with one line each for CTCF, P300, Suz12, Rad21 and Znf143; x-axis: bases, y-axis: average score.




genomation

package

Altuna Akalın

Usage and

ubiquity of

genomic

interval

summaries

Using

genomation

More

information

installation of the package and the example data

We can install the package and the data using install_github()

function from the devtools package

install dependencies

installpackages( c(datatableplyrreshape2ggplot2gridBasedevtools))

source(httpbioconductororgbiocLiteR)biocLite(c(GenomicRangesrtracklayerimputeRsamtools))

install the packages

library(devtools)install_github(genomation username = al2na)

install the data package

needed for examples

install_github(genomationData username = al2na)

genomation

package

Altuna Akalın

Usage and

ubiquity of

genomic

interval

summaries

Using

genomation

More

information

Data import

Various file formats can be used in genomation You can read inannotation or your genomic intervals of interest

library(genomation)tabfile1 lt- systemfile(extdatatab1bed package = genomation)readGeneric(tabfile1)

GRanges with 6 ranges and 0 metadata columns seqnames ranges strand ltRlegt ltIRangesgt ltRlegt [1] chr21 [9437272 9439473] [2] chr21 [9483485 9484663] [3] chr21 [9647866 9648116] [4] chr21 [9708935 9709231] [5] chr21 [9825442 9826296] [6] chr21 [9909011 9909218] --- seqlengths chr21 NA

genomation

package

Altuna Akalın

Usage and

ubiquity of

genomic

interval

summaries

Using

genomation

More

information

Extraction of data over pre-defined genomic regions

ScoreMatrix() and ScoreMatrixBin() are functions used to extractdata over predefined windows

ScoreMatrix is used when all of the windows have the samewidth (eg region around TSS)ScoreMatrixBin is designed for use with windows of unequalwidth (eg enrichment of methylation over exons)

data(cage)data(promoters)sm lt- ScoreMatrix(target = cage windows = promoters)sm

scoreMatrix with dims 1055 2001

genomation

package

Altuna Akalın

Usage and

ubiquity of

genomic

interval

summaries

Using

genomation

More

information

Visualizing ScoreMatrix summary of genomicinvervals over pre-defined regions

plotMeta()heatMeta() heatMatrix() and multiHeatMatrix()are the visualization functions

oldmar lt- par()$marpar(oma = c(0 0 0 0))heatMatrix(sm xcoords = c(-1000 1000))plotMeta(sm xcoords = c(-1000 1000)linecol=blue)par(oma = oldmar)

00

751

52

2 3

minus1000 minus500 0 500 1000 minus1000 minus500 0 500 1000

000

005

010

015

020

025

bases

aver

age

scor

e

Working with BAM files

BAM files can also be used in the ScoreMatrix() and ScoreMatrixBin() functions.

library(GenomicRanges)  # provides GRanges() and IRanges()
bamfile = system.file("tests/test.bam", package = "genomation")
windows = GRanges(rep(c(1, 2), each = 2),
                  IRanges(rep(c(1, 2), times = 2), width = 5))
scores3 = ScoreMatrix(target = bamfile, windows = windows, type = "bam")

Working with bigWig files

ScoreMatrix() and ScoreMatrixBin() can also handle bigWig files. Here we use ENCODE DHS scores downloaded from http://goo.gl/fEVu0g

mybed12file = system.file("extdata/chr21.refseq.hg19.bed", package = "genomation")
feats = readTranscriptFeatures(mybed12file, up.flank = 500, down.flank = 500)
sm = ScoreMatrix(target = "wgEncodeUwDnaseA549RawRep1.bw",
                 windows = feats$promoters, type = "bigWig", strand.aware = TRUE)
plotMeta(sm, xcoords = c(-500, 500), main = "DHS around TSS", line.col = "blue")

[Figure: plotMeta() profile "DHS around TSS"; x-axis: bases; y-axis: average score]

Multiple profiles

Multiple heatmap profiles can be plotted using multiHeatMatrix(), which takes in a ScoreMatrixList object. Here we used CTCF, P300, Suz12, Rad21 and Znf143 BAM files from the genomationData package.

ctcfpeaks = readRDS("ctcfpeaks.rds")
dataPath = system.file("extdata", package = "genomationData")
bamfiles = list.files(dataPath, full.names = TRUE, pattern = "bam$")[c(1:4, 6)]
sml = ScoreMatrixList(bamfiles, ctcfpeaks, bin.num = 50, type = "bam")
names(sml) = c("CTCF", "P300", "Suz12", "Rad21", "Znf143")
multiHeatMatrix(sml, xcoords = c(-500, 500), cex.axis = 0.35, common.scale = TRUE,
                col = c("lightgray", "blue"), winsorize = c(0, 95))

[Figure: multiHeatMatrix() heatmaps, one panel each for CTCF, P300, Suz12, Rad21 and Znf143; x-axis: -500 to 500; common color scale: 0 to 8]

Multiple profiles

multiHeatMatrix() can also apply k-means clustering. Extreme values are trimmed using the "winsorize" argument.

multiHeatMatrix(sml, xcoords = c(-500, 500), kmeans = TRUE, k = 3,
                common.scale = TRUE, cex.axis = 0.4,
                col = c("lightgray", "blue"), winsorize = c(0, 95))

[Figure: multiHeatMatrix() heatmaps for CTCF, P300, Suz12, Rad21 and Znf143 with rows grouped into clusters 1, 2 and 3; x-axis: -500 to 500; common color scale: 0 to 8]
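The two ideas on this slide can be sketched in base R, independent of genomation. The helper `winsorize_matrix()` and the toy matrix are hypothetical. The sketch assumes that winsorizing means clamping values to the given lower and upper percentiles, and that clustering is applied to the rows (regions) of the matrix; how multiHeatMatrix() implements these steps internally may differ.

```r
# Toy score matrix: 6 regions x 4 positions, alternating low and high
# regions, with one extreme value (500) in the last row.
mat <- matrix(c(1, 2, 1, 2,
                9, 8, 9, 8,
                2, 1, 2, 1,
                8, 9, 8, 9,
                1, 1, 2, 2,
                9, 9, 8, 500), nrow = 6, byrow = TRUE)

# Clamp values below/above the given percentiles, e.g. c(0, 95).
winsorize_matrix <- function(m, probs = c(0, 95)) {
  lim <- quantile(m, probs = probs / 100, na.rm = TRUE)
  m[m < lim[1]] <- lim[1]
  m[m > lim[2]] <- lim[2]
  m
}

mat_w <- winsorize_matrix(mat, probs = c(0, 95))

# Cluster the regions (rows). Explicit starting centers (rows 1 and 2)
# are used only to keep this toy example deterministic.
cl <- kmeans(mat_w, centers = mat_w[c(1, 2), ])

# Reorder rows so that regions of the same cluster are plotted together.
mat_ordered <- mat_w[order(cl$cluster), ]
```

Without the trimming step, the single value of 500 would dominate both the color scale and the distances used for clustering.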

Multiple profiles

Multiple average profiles can be visualized with heatMeta(). Here we also apply a scaling function to all the matrices.

# take log2 of all matrices
sml2 = scaleScoreMatrixList(sml, scalefun = function(x) log2(x + 1))
heatMeta(sml2, legend.name = "average profiles", xcoords = c(-500, 500),
         xlab = "bp around peaks")

[Figure: heatMeta() heatmap with one row each for Znf143, Rad21, Suz12, P300 and CTCF; x-axis: bp around peaks; legend: average profiles]
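As a rough illustration of what is being plotted, the sketch below applies the same log2(x + 1) transformation to a list of toy matrices and then averages each matrix column-wise to get one profile per sample. It uses base R only. The list `toy_sml` and its values are hypothetical, and taking the column mean as the average profile is an assumption about the summary that heatMeta() displays.

```r
# Two toy score matrices (regions x positions) standing in for a
# ScoreMatrixList (hypothetical data).
toy_sml <- list(
  A = matrix(c(0, 1, 3,
               0, 3, 7), nrow = 2, byrow = TRUE),
  B = matrix(c(1, 1, 1,
               3, 3, 3), nrow = 2, byrow = TRUE)
)

# Apply the scaling function to every matrix in the list.
scale_fun <- function(x) log2(x + 1)
toy_sml2 <- lapply(toy_sml, scale_fun)

# One average profile per matrix: the mean of each column (position).
profiles <- t(sapply(toy_sml2, colMeans))
profiles  # rows = samples, columns = positions
```

Adding 1 before taking the logarithm keeps positions with a score of zero finite, since log2(0 + 1) is 0.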

Multiple profiles

Multiple average profiles can also be visualized with plotMeta().

plotMeta(sml2, profile.names = names(sml2), xcoords = c(-500, 500),
         main = "mult profiles")

[Figure: plotMeta() line plot "mult profiles" with one line each for CTCF, P300, Suz12, Rad21 and Znf143; x-axis: bases; y-axis: average score]

Future work

  • Explore overlap statistics between two genomic data sets: do TF1 binding site locations overlap with TF2 sites more than expected? This was previously explored in the GenometriCorr package, so this functionality can be included in the form of a dependency.
  • Performance improvement on certain functions: faster is always better.
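One simple way to phrase the overlap question is a permutation test, sketched below in base R. This is only an illustration of the idea: it is not the method used by GenometriCorr or planned for genomation, and the toy intervals, the chromosome length and the helper `count_overlaps()` are all hypothetical.

```r
set.seed(42)  # make the random placements reproducible

# Toy intervals on a single 10,000 bp chromosome (hypothetical data).
tf1 <- data.frame(start = c(100, 500, 900, 2000, 5000), width = 50)
tf2 <- data.frame(start = c(120, 510, 880, 7000, 9000), width = 50)
chr_len <- 10000

# Number of intervals in a that overlap at least one interval in b.
count_overlaps <- function(a, b) {
  a_end <- a$start + a$width - 1
  b_end <- b$start + b$width - 1
  sum(vapply(seq_len(nrow(a)), function(i) {
    any(a$start[i] <= b_end & a_end[i] >= b$start)
  }, logical(1)))
}

observed <- count_overlaps(tf1, tf2)

# Null distribution: place the TF1 intervals at random positions.
n_perm <- 1000
null_counts <- replicate(n_perm, {
  shuffled <- tf1
  shuffled$start <- sample.int(chr_len - 50, nrow(tf1))
  count_overlaps(shuffled, tf2)
})

# Empirical p-value for seeing at least as many overlaps as observed.
p_value <- (sum(null_counts >= observed) + 1) / (n_perm + 1)
```

Uniform random placement is a crude null model. A realistic analysis would restrict the placements to comparable genomic background, which is the kind of detail a dedicated package handles.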

Further information

The genomation package is available at http://al2na.github.io/genomation. You can find the link to the vignette on the webpage as well.
Code that generated this presentation is available at http://github.com/al2na/genomation_presentation

Questions and bug reports
You can view/open issues on GitHub: https://github.com/al2na/genomation/issues?state=open

You can ask questions by sending an e-mail to genomation@googlegroups.com or using the web interface to Google Groups.

Developed by Altuna Akalın and Vedran Franke

Session Info

sessionInfo()

## R version 3.0.2 (2013-09-25)
## Platform: x86_64-apple-darwin10.8.0 (64-bit)
##
## locale:
## [1] C
##
## attached base packages:
## [1] methods   grid      stats     graphics  grDevices utils     datasets
## [8] base
##
## other attached packages:
## [1] genomation_0.99.0.2 knitr_1.5
##
## loaded via a namespace (and not attached):
##  [1] BSgenome_1.30.0      BiocGenerics_0.8.0   Biostrings_2.30.0
##  [4] GenomicRanges_1.14.3 IRanges_1.20.5       MASS_7.3-29
##  [7] RColorBrewer_1.0-5   RCurl_1.95-4.1       Rsamtools_1.14.1
## [10] XML_3.95-0.2         XVector_0.2.0        bitops_1.0-6
## [13] colorspace_1.2-4     data.table_1.8.10    dichromat_2.0-0
## [16] digest_0.6.3         evaluate_0.5.1       formatR_0.10
## [19] ggplot2_0.9.3.1      gridBase_0.4-6       gtable_0.1.2
## [22] highr_0.3            impute_1.36.0        labeling_0.2
## [25] munsell_0.4.2        parallel_3.0.2       plyr_1.8
## [28] proto_0.3-10         reshape2_1.2.2       rtracklayer_1.22.0
## [31] scales_0.2.3         stats4_3.0.2         stringr_0.6.2
## [34] tools_3.0.2          zlibbioc_1.8.0

Page 21: genomation: summary of genomic intervals

genomation

package

Altuna Akalın

Usage and

ubiquity of

genomic

interval

summaries

Using

genomation

More

information

Working with bigWig files

ScoreMatrix() and ScoreMatrixBin() can handle bigWig files. Here we use ENCODE DHS scores downloaded from http://goo.gl/fEVu0g

mybed12file = system.file("extdata/chr21.refseq.hg19.bed", package = "genomation")
feats = readTranscriptFeatures(mybed12file, up.flank = 500, down.flank = 500)
sm = ScoreMatrix(target = "wgEncodeUwDnaseA549RawRep1.bw", windows = feats$promoters,
                 type = "bigWig", strand.aware = TRUE)
plotMeta(sm, xcoords = c(-500, 500), main = "DHS around TSS", line.col = "blue")

[Figure: meta-region plot "DHS around TSS"; x-axis: bases, y-axis: average score]
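Conceptually, a meta-region plot like the one plotMeta() draws is a column-wise summary of a score matrix that has one row per window and one column per base. The base R sketch below illustrates that idea on simulated data. It is not genomation's own code, and the names `scores`, `center`, `meta_profile` and `xcoords` are hypothetical.

```r
set.seed(1)

# Simulated score matrix: 200 windows (rows) x 1000 bases (columns),
# standing in for the kind of matrix ScoreMatrix() produces for 1000 bp windows.
n_windows <- 200
n_bases <- 1000
scores <- matrix(rpois(n_windows * n_bases, lambda = 2), nrow = n_windows)

# Add extra signal to the central 100 columns to mimic a peak around the TSS.
center <- 451:550
scores[, center] <- scores[, center] + 5

# The meta-profile is the average score at each position across all windows.
meta_profile <- colMeans(scores)

# x coordinates relative to the window center, analogous to xcoords = c(-500, 500).
xcoords <- seq(-500, 500, length.out = n_bases)
```

Plotting `meta_profile` against `xcoords`, for example with plot(xcoords, meta_profile, type = "l"), should give a curve with a bump around 0.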

Multiple profiles

Multiple heatmap profiles can be plotted using multiHeatMatrix(), which takes a ScoreMatrixList object. Here we use CTCF, P300, Suz12, Rad21 and Znf143 BAM files from the genomationData package.

ctcfpeaks = readRDS("ctcfpeaks.rds")
dataPath = system.file("extdata", package = "genomationData")
bamfiles = list.files(dataPath, full.names = TRUE, pattern = "bam$")[c(1:4, 6)]
sml = ScoreMatrixList(bamfiles, ctcfpeaks, bin.num = 50, type = "bam")
names(sml) = c("CTCF", "P300", "Suz12", "Rad21", "Znf143")
multiHeatMatrix(sml, xcoords = c(-500, 500), cex.axis = 0.35, common.scale = TRUE,
                col = c("lightgray", "blue"), winsorize = c(0, 95))

[Figure: side-by-side heatmaps for CTCF, P300, Suz12, Rad21 and Znf143; x-axis from -500 to 500; common color scale from 0 to 8]
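The bin.num = 50 argument used above asks for each window to be summarised into 50 bins instead of one value per base. Below is a minimal base R sketch of that kind of binning, assuming roughly equal-width bins that are each summarised by their mean. `bin_scores` is a hypothetical helper written for illustration, not a genomation function.

```r
# Summarise a per-base score vector into `bin_num` bins by averaging.
bin_scores <- function(x, bin_num) {
  stopifnot(length(x) >= bin_num)
  # Map position i to a bin index in 1..bin_num.
  bins <- ceiling(seq_along(x) * bin_num / length(x))
  as.numeric(tapply(x, bins, mean))
}

# A 1000 bp window: no signal in the first half, constant signal in the second.
x <- c(rep(0, 500), rep(10, 500))
binned <- bin_scores(x, bin_num = 50)
```

Binning this way makes windows of different widths comparable, since every window ends up with the same number of columns.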

Multiple profiles

multiHeatMatrix() can also apply k-means clustering. Extreme values are trimmed using the "winsorize" argument.

multiHeatMatrix(sml, xcoords = c(-500, 500), kmeans = TRUE, k = 3, common.scale = TRUE,
                cex.axis = 0.4, col = c("lightgray", "blue"), winsorize = c(0, 95))

[Figure: side-by-side heatmaps for CTCF, P300, Suz12, Rad21 and Znf143 with rows grouped into k-means clusters 1, 2 and 3; x-axis from -500 to 500; common color scale from 0 to 8]
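The two ideas on this slide, trimming extreme values and grouping windows by k-means, can be sketched in base R as follows. This is an illustration on simulated data, not genomation's implementation. The helper `winsorize` and all variable names are hypothetical, and clustering the trimmed rather than the raw values is an assumption made here for simplicity.

```r
# Clamp values outside the given percentiles to the percentile values.
winsorize <- function(m, percentiles = c(0, 95)) {
  lim <- quantile(m, probs = percentiles / 100, na.rm = TRUE, names = FALSE)
  m[m < lim[1]] <- lim[1]
  m[m > lim[2]] <- lim[2]
  m
}

set.seed(42)
# Simulated score matrix: three groups of 50 windows with different signal levels.
mat <- rbind(
  matrix(rnorm(50 * 20, mean = 0), nrow = 50),
  matrix(rnorm(50 * 20, mean = 5), nrow = 50),
  matrix(rnorm(50 * 20, mean = 10), nrow = 50)
)
mat[1, 1] <- 1000  # a single extreme value that would dominate a color scale

trimmed <- winsorize(mat, percentiles = c(0, 95))

# Cluster the windows (rows) into k = 3 groups and order rows by cluster.
clusters <- kmeans(trimmed, centers = 3, nstart = 10)
row_order <- order(clusters$cluster)
```

Drawing `trimmed[row_order, ]` as a heatmap would show the windows grouped by cluster, with the color scale no longer stretched by the single outlier.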

Multiple profiles

Multiple average profiles can be visualized with heatMeta(). Here we also apply a scaling function to all the matrices.

# take log2 of all matrices
sml2 = scaleScoreMatrixList(sml, scalefun = function(x) log2(x + 1))
heatMeta(sml2, legend.name = "average profiles", xcoords = c(-500, 500),
         xlab = "bp around peaks")

[Figure: heatmap of average profiles with one row each for Znf143, Rad21, Suz12, P300 and CTCF; x-axis: bp around peaks; legend: average profiles]
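The scaling step above applies one function to every matrix in the list, and the heatmap then shows one average profile per matrix. A base R sketch of the same pattern on simulated matrices follows. The names `mats`, `scale_fun`, `scaled` and `profiles` are hypothetical, and this is not how genomation implements scaleScoreMatrixList() or heatMeta() internally.

```r
set.seed(7)
# Two simulated score matrices (100 windows x 50 bins) with different signal levels.
mats <- list(
  CTCF = matrix(rpois(100 * 50, lambda = 8), nrow = 100),
  P300 = matrix(rpois(100 * 50, lambda = 2), nrow = 100)
)

# The same transformation used on the slide: log2 with a pseudocount of 1.
scale_fun <- function(x) log2(x + 1)
scaled <- lapply(mats, scale_fun)

# One average profile per matrix, stacked as the rows of a small matrix.
profiles <- t(sapply(scaled, colMeans))
```

The pseudocount keeps zero scores finite under the log transform, and the resulting `profiles` matrix has one row per experiment, which is the shape a heatmap of average profiles needs.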

Multiple profiles

Multiple average profiles can also be visualized with plotMeta().

plotMeta(sml2, profile.names = names(sml2), xcoords = c(-500, 500), main = "mult profiles")

[Figure: line plot "mult profiles" with one line each for CTCF, P300, Suz12, Rad21 and Znf143; x-axis: bases, y-axis: average score]

Future work

Explore overlap statistics between two genomic data sets: do TF1 binding site locations overlap with TF2 sites more than expected? This was previously explored with the GenometriCorr package; this functionality can be included in the form of a dependency.

Performance improvement on certain functions: faster is always better.
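One generic way to ask whether two sets of sites overlap more than expected is a permutation test: count the observed overlaps, then compare against counts obtained after placing one set at random positions. The base R sketch below illustrates this on simulated intervals on a single chromosome. It is neither GenometriCorr's method nor part of genomation. The helpers `n_overlapping` and `shuffle_intervals` are hypothetical, and uniform random placement is a deliberately simple null model.

```r
# Count intervals in `a` that overlap at least one interval in `b`.
# Both are data frames with numeric columns `start` and `end` (inclusive).
n_overlapping <- function(a, b) {
  sum(vapply(seq_len(nrow(a)), function(i) {
    any(a$start[i] <= b$end & a$end[i] >= b$start)
  }, logical(1)))
}

# Place intervals of the same widths at uniformly random positions.
shuffle_intervals <- function(x, chrom_len) {
  width <- x$end - x$start
  start <- floor(runif(nrow(x), min = 1, max = chrom_len - width + 1))
  data.frame(start = start, end = start + width)
}

set.seed(123)
chrom_len <- 1e6

# Simulated TF1 sites: 100 intervals of 200 bp.
tf1 <- data.frame(start = sort(sample(chrom_len - 300, 100)))
tf1$end <- tf1$start + 200

# Simulated TF2 sites: 50 placed next to TF1 sites, 50 placed at random.
tf2 <- data.frame(start = c(tf1$start[1:50] + 50, sample(chrom_len - 300, 50)))
tf2$end <- tf2$start + 200

observed <- n_overlapping(tf1, tf2)
null_counts <- replicate(200, n_overlapping(tf1, shuffle_intervals(tf2, chrom_len)))

# Empirical one-sided p-value with the usual +1 correction.
p_value <- (sum(null_counts >= observed) + 1) / (length(null_counts) + 1)
```

With real data, a null model that respects chromosome boundaries, mappability and site clustering would be more appropriate than uniform placement.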

Further information

The genomation package is available at http://al2na.github.io/genomation. You can find the link to the vignette on the webpage as well.
Code that generated this presentation is available at http://github.com/al2na/genomation_presentation

Questions and bug reports
You can view/open issues on GitHub: https://github.com/al2na/genomation/issues?state=open

You can ask questions by sending an e-mail to genomation@googlegroups.com or by using the web interface to Google Groups.

Developed by Altuna Akalın and Vedran Franke

Session Info

sessionInfo()

R version 3.0.2 (2013-09-25)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] C

attached base packages:
[1] methods   grid      stats     graphics  grDevices utils     datasets
[8] base

other attached packages:
[1] genomation_0.99.0.2 knitr_1.5

loaded via a namespace (and not attached):
 [1] BSgenome_1.30.0      BiocGenerics_0.8.0   Biostrings_2.30.0
 [4] GenomicRanges_1.14.3 IRanges_1.20.5       MASS_7.3-29
 [7] RColorBrewer_1.0-5   RCurl_1.95-4.1       Rsamtools_1.14.1
[10] XML_3.95-0.2         XVector_0.2.0        bitops_1.0-6
[13] colorspace_1.2-4     data.table_1.8.10    dichromat_2.0-0
[16] digest_0.6.3         evaluate_0.5.1       formatR_0.10
[19] ggplot2_0.9.3.1      gridBase_0.4-6       gtable_0.1.2
[22] highr_0.3            impute_1.36.0        labeling_0.2
[25] munsell_0.4.2        parallel_3.0.2       plyr_1.8
[28] proto_0.3-10         reshape2_1.2.2       rtracklayer_1.22.0
[31] scales_0.2.3         stats4_3.0.2         stringr_0.6.2
[34] tools_3.0.2          zlibbioc_1.8.0
