A Grammar of Graphics for Genomics: The ggbio Package
Michael Lawrence
Genentech
August 29, 2012
Michael Lawrence (Genentech) A Grammar of Graphics for Genomics August 29, 2012 1 / 18
Outline
1 Motivation
2 High-level Plots
3 Grammar Components
Data on the Genome
• Comes in two flavors:
  • Annotations (genes, TF binding sites, ...)
  • Experimental measurements (sequence reads)
• Both types are tied to genomic coordinates, providing a common axis that permits cross-dataset comparison and inference
• Typically stored as a table, with the range as a fundamental variable type, plus metadata
seqnames  start      end        strand  exon_id  tx_id
10        120927215  120928045  -       129230   14886,14887
10        120928689  120928854  -       129229   14886,14887
10        120931894  120931997  -       129228   14886,14887
10        120933249  120933384  -       129227   14886,14887
10        120933963  120934069  -       129226   14886
10        120933963  120934104  -       119757   14887
10        120936533  120936665  -       119756   14887
10        120936552  120936665  -       129225   14886
10        120938267  120938345  -       129224   14886,14887
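As a rough illustration of a table whose fundamental variable is a range, the sketch below stores three of the exon rows above as Python records. The `GenomicRange` class and its field names are hypothetical, invented for this example; they are not a Bioconductor API. Coordinates are assumed to be 1-based and inclusive, so a width is end - start + 1.

```python
from dataclasses import dataclass, field


@dataclass
class GenomicRange:
    """One table row: a range plus free-form metadata (hypothetical type)."""
    seqname: str
    start: int      # assumed 1-based, inclusive
    end: int        # assumed inclusive
    strand: str
    metadata: dict = field(default_factory=dict)

    @property
    def width(self) -> int:
        # Inclusive coordinates, so both endpoints count.
        return self.end - self.start + 1

    def overlaps(self, other: "GenomicRange") -> bool:
        # Two ranges overlap when they share a sequence and at least one position.
        return (self.seqname == other.seqname
                and self.start <= other.end
                and other.start <= self.end)


exons = [
    GenomicRange("10", 120927215, 120928045, "-",
                 {"exon_id": 129230, "tx_id": (14886, 14887)}),
    GenomicRange("10", 120933963, 120934069, "-",
                 {"exon_id": 129226, "tx_id": (14886,)}),
    GenomicRange("10", 120933963, 120934104, "-",
                 {"exon_id": 119757, "tx_id": (14887,)}),
]

for exon in exons:
    print(exon.metadata["exon_id"], exon.width)
```

Because every dataset carries the same kind of range column, questions such as "do these two features overlap?" can be asked across datasets without knowing anything about their metadata.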
Challenges: Big data, wide spaces
• Need summaries that are efficiently computed, communicate more with less, and expose the most interesting aspects of the data
• Need different ways of viewing the data, depending on the density and scale, from whole genome to single base pair
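One common summary of this kind is read coverage. The sketch below is a minimal, assumption-laden illustration in plain Python, not the way ggbio or Bioconductor computes coverage: `coverage_runs` and `binned_max` are hypothetical helper names, ranges are assumed 1-based and inclusive, and the bin width stands in for the zoom level.

```python
from collections import defaultdict


def coverage_runs(ranges):
    """Return coverage as (start, end, depth) runs for inclusive ranges."""
    delta = defaultdict(int)
    for start, end in ranges:
        delta[start] += 1      # depth rises at the first covered position
        delta[end + 1] -= 1    # and falls just past the last covered position
    runs, depth, prev = [], 0, None
    for pos in sorted(delta):
        if delta[pos] == 0:    # a rise and a fall cancel here: depth unchanged
            continue
        if depth > 0:
            runs.append((prev, pos - 1, depth))
        depth += delta[pos]
        prev = pos
    return runs


def binned_max(runs, bin_width):
    """Summarise runs as the maximum depth seen in each fixed-width bin."""
    bins = defaultdict(int)
    for start, end, depth in runs:
        first, last = (start - 1) // bin_width, (end - 1) // bin_width
        for b in range(first, last + 1):
            bins[b] = max(bins[b], depth)
    return dict(sorted(bins.items()))


reads = [(1, 5), (3, 8), (10, 12)]
runs = coverage_runs(reads)
print(runs)                  # run-length view, suitable for a zoomed-in plot
print(binned_max(runs, 5))   # coarser view, suitable for a zoomed-out plot
```

The run-length form costs time proportional to the number of reads (plus a sort) rather than the length of the genome, and the binned form shows how the same data can be reduced further as the viewing scale widens.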
Existing Tools
UCSC IGB IGV Circos GViz
Limitations
• Limited to one type of view (linear or circular)
• Not tightly integrated with an analysis environment through standard, abstract data structures (except GViz)
• No low-level toolkit for prototyping new types of graphics
Grammars of Graphics
• A grammar of graphics is a language for expressing plots
• Graphics are constructed through the combination of various types of primitives; like Legos for graphics
• The most prominent grammar was introduced by Wilkinson’s book The Grammar of Graphics
• Wilkinson’s grammar was extended by Wickham and the ggplot2 package
The ggbio Package
• An R/Bioconductor package that extends the Wilkinson/Wickham grammar for applications in genomics
• Integrated with Bioconductor
  • Operates on standard, abstract genomic data structures
  • Leverages efficient range-based algorithms
• Programming interface has two levels of abstraction:
  autoplot: Maps Bioconductor data structures to plots
  grammar: Mix and match to create custom plots
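The two levels can be pictured as a type-dispatched front end sitting on top of composable components. The sketch below is only an analogy in Python, not ggbio's R interface: the classes `GeneModels` and `ReadAlignments`, the helper `build_plot`, and the particular default components chosen for each data type are all assumptions made up for illustration.

```python
from functools import singledispatch


class GeneModels:
    """Stand-in for an annotation container (hypothetical)."""


class ReadAlignments:
    """Stand-in for a container of aligned reads (hypothetical)."""


def build_plot(*components):
    """Low level: mix and match grammar components, given here as labels."""
    return list(components)


@singledispatch
def autoplot(data):
    """High level: pick a default plot from the type of the data."""
    raise TypeError(f"no default plot for {type(data).__name__}")


@autoplot.register(GeneModels)
def _(data):
    # Assumed defaults, for illustration only.
    return build_plot("stat: stepping", "geom: alignment", "scale: sequence")


@autoplot.register(ReadAlignments)
def _(data):
    # Assumed defaults, for illustration only.
    return build_plot("stat: coverage", "scale: sequence")


print(autoplot(GeneModels()))
print(build_plot("stat: reduce", "geom: rect"))  # a custom combination
```

The high-level entry point does nothing the low level cannot do; it only encodes a sensible default per data structure, which is why a user can drop down a level when the default is not what they want.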
Basic Plots
Gene Structures, Read Alignments, Sequence, Multiple
Overview Plots
Grand Linear, Karyogram, Circular
[Figure, karyogram: chromosomes 1 to 22 and X; legend "seqReg" with levels Exon, Intron, Other]
[Figure, circular: chromosomes 1 to 22; legend "rearrangements" with levels interchromosomal and intrachromosomal; legend "tumreads" with values 4 to 12]
Specialized Plots
Mismatch summary + VCF, Edge-linked Intervals
[Figure, mismatch summary + VCF: tracks "mismatch", "snp" and "reference"; y-axis Counts; legend "read" with levels A, C, G, T; x-axis positions 25235720 to 25235755]
[Figure, edge-linked intervals: y-axis Expression; legend "group" with levels GM12878 and K562; transcripts uc002rau.2, uc010yjg.1, uc002rav.2, uc010yjh.1, uc002raw.2; x-axis positions 10930000 to 10980000]
The Wilkinson/Wickham Grammar of Graphics
Geom: The shape used for drawing the data
Stat: Transforms the data before plotting
Scale: Maps data to geom aesthetics and provides guides such as legends and axes
Coord: Maps from geom space to device space
Facet: Small multiples of data subsets (trellis)
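To make the division of labour between these five components concrete, here is a toy text renderer in Python. It is a sketch under stated assumptions rather than how ggplot2 is implemented: every function name is hypothetical, the "device" is a row of characters, and the only geom is a bar.

```python
from collections import Counter, defaultdict


def stat_count(rows, key):
    """Stat: transform raw rows into one (category, count) pair per category."""
    return sorted(Counter(r[key] for r in rows).items())


def scale_fraction(values):
    """Scale: map data values onto the aesthetic range [0, 1]."""
    top = max(values)
    return [v / top for v in values]


def coord_columns(fraction, width):
    """Coord: map a position in geom space to device space (text columns)."""
    return round(fraction * width)


def geom_bar(label, length):
    """Geom: the shape drawn for one row, here a bar of '#' characters."""
    return f"{label:<8}|" + "#" * length


def facet(rows, key):
    """Facet: split rows into subsets, one panel per value of `key`."""
    panels = defaultdict(list)
    for r in rows:
        panels[r[key]].append(r)
    return dict(sorted(panels.items()))


def plot(rows, x, facet_by, width=10):
    lines = []
    for panel, subset in facet(rows, facet_by).items():
        lines.append(f"[{panel}]")
        stats = stat_count(subset, x)
        fractions = scale_fraction([n for _, n in stats])
        for (label, _), frac in zip(stats, fractions):
            lines.append(geom_bar(label, coord_columns(frac, width)))
    return lines


rows = [
    {"strand": "+", "type": "exon"},
    {"strand": "+", "type": "exon"},
    {"strand": "+", "type": "intron"},
    {"strand": "-", "type": "exon"},
]
print("\n".join(plot(rows, x="type", facet_by="strand")))
```

Each function can be swapped independently, for example a different stat with the same geom, which is the property a grammar of graphics relies on.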
A Grammar of Graphics for Genomics: Extensions are marked in red
[Figure: annotated plot; x-axis 48.245 Mb to 48.270 Mb; y levels 1 and 2; strand legend + and −]
• statistical transformation: stepping
• geometric object: chevron
• geometric object: alignment
• geometric object: rect
• X scale: sequence
• Y scale: discrete from stepping
• color scale: discrete from strand
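A stepping transformation of the kind annotated above assigns each range a discrete y level so that overlapping ranges do not draw on top of one another. The following Python sketch shows one simple way to do this (greedy first fit over ranges sorted by start); it is an assumption about the general technique, not ggbio's actual algorithm, and the function name `stepping` is used here only as a label.

```python
def stepping(ranges):
    """Give each inclusive (start, end) range the lowest row where it fits.

    Returns (start, end, row) triples in start order; rows count from 1.
    """
    row_ends = []   # last occupied position in each row so far
    placed = []
    for start, end in sorted(ranges):
        for row, last in enumerate(row_ends):
            if start > last:            # fits after the row's latest range
                row_ends[row] = end
                placed.append((start, end, row + 1))
                break
        else:                           # no existing row has room
            row_ends.append(end)
            placed.append((start, end, len(row_ends)))
    return placed


print(stepping([(1, 5), (3, 8), (6, 9), (10, 12)]))
```

Because the ranges are visited in start order, the latest range in a row always has the largest end in that row, so a single comparison per row is enough to detect a collision. The resulting row numbers are what a discrete Y scale would then display.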
Components of the Genomic Grammar
Geom: alignment, chevron, arch, arrow, arrowrect
Stat: gene, reduce, stepping, coverage, mismatch, table
Scale: sequence, genome, fold-change, giemsa
Coord: truncate-gaps
Layout: tracks, range-facet
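Two of these components, the reduce stat and the truncate-gaps coord, are easy to sketch on plain intervals. The Python below is an illustration under assumptions, not ggbio code: `reduce_ranges` merges overlapping or adjacent inclusive ranges, and `truncate_gaps` (a hypothetical helper with a made-up `max_gap` parameter) returns a coordinate mapping in which every gap between merged ranges is drawn at most `max_gap` positions wide.

```python
def reduce_ranges(ranges):
    """Merge overlapping or adjacent inclusive (start, end) ranges."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1] + 1:
            # Extend the current merged range instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def truncate_gaps(ranges, max_gap):
    """Return a function mapping genomic positions to shrunken plot positions."""
    offsets, shift, prev_end = [], 0, None
    for start, end in reduce_ranges(ranges):
        if prev_end is not None:
            gap = start - prev_end - 1
            shift += max(0, gap - max_gap)   # remove the excess gap width
        offsets.append((start, end, shift))
        prev_end = end

    def to_plot(pos):
        for start, end, shift in offsets:
            if start <= pos <= end:
                return pos - shift
        raise ValueError("position lies in a gap, not in any range")

    return to_plot


exons = [(1, 10), (5, 20), (1001, 1010)]
print(reduce_ranges(exons))
to_plot = truncate_gaps(exons, max_gap=100)
print(to_plot(20), to_plot(1001))
```

In this example a 980-position gap is drawn as 100 positions, so widely spaced features such as exons separated by long introns can share one readable axis.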
Summary
• The ggbio package is a toolkit for plotting genomic data and annotations
• Available as part of the Bioconductor project
• Easy to use and flexible enough to handle the diverse use cases encountered in genomics
• Useful plots are automatically generated from Bioconductor data structures using reasonable defaults
• New types of plots can be constructed from grammar primitives specially designed for genomics
Acknowledgements
Tengfei Yin
Di Cook
Robert Gentleman