http://cs273a.stanford.edu [Bejerano Fall10/11] 1
Lecture 13
Cis-Regulation cont’d
GREAT
Gene Regulation
• gene (how to)
• control region (when & where)
DNA
DNA binding proteins
RNA gene
Protein coding
Pol II Transcription
Key components:
• Proteins
• DNA sequence
• DNA epigenetics
Protein components:
• General Transcription factors
• Activators
• Co-activators
Enhancers
Vertebrate Gene Regulation
gene (how to)
control region (when & where)
DNA
proximal: in 10^3 letters
distal: in 10^6 letters
DNA binding proteins
Gene Expression Domains: Independent
Distal Transcription Regulatory Elements
Repressors / Silencers
What are Enhancers?
What do enhancers encode?
Surely a cluster of TF binding sites. [But TFBS prediction is hard, fraught with false positives.]
What else? DNA structure related properties?
So how do we recognize enhancers?
Sequence conservation across multiple species [weak but generic].
Verifying repressors is trickier [loss vs. gain of function].
How do you tell an enhancer from a repressor by prediction alone? Duh...
Insulators
Cis-Regulatory Components
Low level (“atoms”):
• Promoter motifs (TATA box, etc.)
• Transcription factor binding sites (TFBS)
Mid level:
• Promoter
• Enhancers
• Repressors/silencers
• Insulators/boundary elements
• Cis-regulatory modules (CRM)
• Locus control regions (LCR)
High level:
• Epigenetic domains / signatures
• Gene expression domains
• Gene regulatory networks (GRN)
Disease Implications: Genes
genome
gene
protein
Limb Malformation
Over 300 genes already implicated in limb malformations.
Disease Implications: Cis-Reg
genome
gene
NO protein made
Limb Malformation
Growing number of cases (limb, deafness, etc).
Transcription Regulation & Human Disease
[Wang et al, 2000]
Critical regulatory sequences
Lettice et al. HMG 2003 12: 1725-35
Single base changes
Knock out
Other Positional Effects
[de Kok et al, 1996]
Genomewide Association Studies point to non-coding DNA
WGA Disease
9p21 Cis effects
Follow up study:
Cis-Regulatory Evolution: E.g., Mobile Elements
[Yass is a small town in New South Wales, Australia.]
Gene
Gene
What settings make these “co-option” events happen?
Gene
Gene
Britten & Davidson Hypothesis: Repeat to Rewire!
[Britten & Davidson, 1971]
[Davidson & Erwin, 2006]
Modular: Most Likely to Evolve?
Chimp Human
Human Accelerated Regions
• Human-specific substitutions in conserved sequences
[Pollard, K. et al., Nature, 2006] [Prabhakar, S. et al., Science, 2008] [Beniaminov, A. et al., RNA, 2008]
Human Chimp
http://GREAT.stanford.edu: Generating Functional Hypotheses from Genome-Wide Measurements of Mammalian Cis-Regulation
Gill Bejerano
Dept. of Developmental Biology & Dept. of Computer Science
Stanford University
http://bejerano.stanford.edu
Human Gene Regulation
All these cells have the same Genome.
Gene
Gene
Gene
Gene
20,000 Genes encode how to make proteins.
1,000,000 genomic “switches” determine which proteins to make, and how much of each.
10^13 different cells in an adult human.
Hundreds of different cell types.
Most Non-Coding Elements likely work in cis…
9Mb
“IRX1 is a member of the Iroquois homeobox gene family. Members of this family appear to play multiple roles during pattern formation of vertebrate embryos.”
gene deserts
regulatory jungles
Every orange tick mark is roughly 100-1,000bp long, each evolves under purifying selection, and does not code for protein.
Many non-coding elements tested are cis-regulatory
Combinatorial Regulatory Code
Gene
2,000 different proteins can bind specific DNA sequences.
A regulatory region encodes 3-10 such protein binding sites.
When all are bound by proteins the regulatory region turns “on”, and the nearby gene is activated to produce protein.
Proteins
DNA
DNA
Protein binding site
ChIP-Seq: first glimpses of the regulatory genome in action
Cis-regulatory peak
Peak Calling
Gene transcription start site
What is the transcription factor I just assayed doing?
Cis-regulatory peak
• Collect known literature of the form
  • Function A: Gene1, Gene2, Gene3, ...
  • Function B: Gene1, Gene2, Gene3, ...
  • Function C: ...
• Ask whether the binding sites you discovered preferentially bind near (regulate) the genes of any one or more of the functions listed above.
• Form a hypothesis and perform further experiments.
Example: inferring functions of Serum Response Factor (SRF) from its ChIP-seq binding profile
Gene transcription start site
SRF binding ChIP-seq peak
• ChIP-seq identified 2,429 SRF binding peaks in human Jurkat cells [1]
• SRF is known as a “master regulator of the actin cytoskeleton”
• In the ChIP-Seq peaks, we expect to find binding sites regulating (genes involved in) actin cytoskeleton formation.
[1] Valouev A. et al., Nat. Methods, 2008
Example: inferring functions of Serum Response Factor (SRF) from its ChIP-seq binding profile
Existing, gene-based method to analyze enrichment:
• Ignore distal binding events.
• Count affected genes.
• Rank by enrichment hypergeometric p-value.
[Figure legend: gene transcription start site; SRF binding ChIP-seq peak; ontology term (e.g. ‘actin cytoskeleton’)]
N = 8 genes in genome
K = 3 genes annotated with the ontology term
n = 2 genes selected by proximal peaks
k = 1 selected gene annotated with the ontology term
P = Pr(k ≥ 1 | n = 2, K = 3, N = 8)
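The gene-based test on this slide can be worked through as a hypergeometric tail probability. The sketch below uses only the toy numbers from the slide; the function name `hypergeom_tail` is a hypothetical helper, not part of any existing tool.

```python
from math import comb

def hypergeom_tail(k: int, n: int, K: int, N: int) -> float:
    """Pr(X >= k): n genes drawn without replacement from N genes,
    K of which carry the annotation (hypothetical helper)."""
    total = comb(N, n)
    upper = min(n, K)
    # Sum the probability of seeing exactly i annotated genes, for i = k..upper.
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total

# Toy example from the slide: N = 8, K = 3, n = 2, k = 1.
p_value = hypergeom_tail(k=1, n=2, K=3, N=8)
print(f"{p_value:.4f}")  # 18/28, about 0.6429
```

With only 8 genes the toy p-value is of course far from significant; the point is the shape of the calculation, which counts genes and ignores where the peaks fall.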
We have (reduced ChIP-Seq into) a gene list!
What is the gene list enriched for?
[Figure: microarray data and deep sequencing data both fed into a microarray tool]
Pro: A lot of tools out there for the analysis of gene lists.
Con: These tools are built for microarray analysis.
Does it matter?
SRF Gene-based enrichment results
• Original authors can only state: “basic cellular processes, particularly those related to gene expression” are enriched [1]
[1] Valouev A. et al., Nat. Methods, 2008
SRF acts on genes in both the nucleus and the cytoplasm that are involved in transcription and various types of binding.
Where’s the signal?
Top “actin” term is ranked #28 in the list.
Associating only proximal peaks loses a lot of information
Relationship of binding peaks to nearest genes for eight human (H) and mouse (M) ChIP-seq datasets
Restricting to proximal peaks often leads to complete loss of key enrichments
Bad Solution: Associating distal peaks brings in many false enrichments
Why bad? 14% of human genes tagged ‘multicellular organismal development’. But 33% of base pairs have such a gene nearest upstream/downstream.
Term                                    Bonferroni corrected p-value
nervous system development              5x10^-9
system development                      8x10^-9
anatomical structure development        7x10^-8
multicellular organismal development    1x10^-7
developmental process                   2x10^-6
SRF ChIP-seq set has 2,000+ binding events.
Throw a random set of 2,000 regions at the genome.
What do you get from a gene list analysis?
Regulatory jungles are often next to key developmental genes.
Real Solution: Do not convert to gene list.
Analyze the set of genomic regions.
[Figure legend: gene transcription start site; gene regulatory domain; genomic region (ChIP-seq peak); ontology term (‘actin cytoskeleton’)]
p = 0.33 of genome annotated with the ontology term
n = 6 genomic regions
k = 5 genomic regions hit annotation
P = Pr_binom(k ≥ 5 | n = 6, p = 0.33)
Since 33% of base pairs are near a ‘multicellular organismal development’ gene, we now expect 33% of genomic regions to hit this term by chance. => Toss 2,000 random regions at genome, get NO (false) enrichments.
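The region-based test can be sketched as a binomial tail over genomic regions rather than genes. The numbers below are the toy values from the slide; `binom_tail` is a hypothetical helper name, and this is a minimal illustration rather than GREAT's actual implementation.

```python
from math import comb

def binom_tail(k: int, n: int, p: float) -> float:
    """Pr(X >= k) for X ~ Binomial(n, p): the chance that at least k of n
    regions land in the annotated fraction p of the genome (hypothetical helper)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Toy example from the slide: p = 0.33, n = 6 regions, k = 5 hits.
p_value = binom_tail(k=5, n=6, p=0.33)
print(f"{p_value:.4f}")  # about 0.0170
```

Because p is the fraction of base pairs covered by the term's regulatory domains, a term that covers a third of the genome needs many more than a third of the regions to look enriched.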
GREAT = Genomic Regions Enrichment of Annotations Tool
How does GREAT know how to assign distal binding peaks to genes?
Future: High-throughput assays based on chromosome conformation capture (3C) methods will elucidate complex regulation mechanisms
Currently: Flexible computational definitions allow assignment of peaks to nearest gene, nearest two genes, etc.
• Default: each gene has a “basal regulatory domain” of 5 kb upstream and 1 kb downstream of its transcription start site; the domain is then extended in both directions up to the basal domain of the nearest gene, but by no more than 1 Mb
• Though some associations may be missed or incorrect, in general signal richness and robustness is greatly improved by associating distal peaks
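The default association rule above can be sketched in a few lines. This is an illustrative reading of the rule for one chromosome, not GREAT's code: the `Gene` class, the function names, and the choice to measure the 1 Mb extension from the TSS are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass(frozen=True)
class Gene:  # hypothetical record for the example
    name: str
    tss: int      # transcription start site coordinate
    strand: str   # "+" or "-"

def basal_domain(g: Gene, up: int = 5_000, down: int = 1_000) -> Tuple[int, int]:
    """Basal domain: `up` bp upstream and `down` bp downstream of the TSS."""
    if g.strand == "+":
        return g.tss - up, g.tss + down
    return g.tss - down, g.tss + up  # upstream lies to the right on the minus strand

def regulatory_domains(genes: List[Gene], chrom_len: int,
                       max_ext: int = 1_000_000) -> Dict[str, Tuple[int, int]]:
    """Extend each basal domain to the neighbours' basal domains, capped at
    max_ext from the TSS (assumption: the cap is measured from the TSS)."""
    ordered = sorted(genes, key=lambda g: g.tss)
    basal = [basal_domain(g) for g in ordered]
    domains: Dict[str, Tuple[int, int]] = {}
    for i, g in enumerate(ordered):
        b_start, b_end = basal[i]
        left_limit = basal[i - 1][1] if i > 0 else 0
        right_limit = basal[i + 1][0] if i + 1 < len(ordered) else chrom_len
        # The basal domain is always kept, even if neighbours overlap it.
        start = min(b_start, max(left_limit, g.tss - max_ext))
        end = max(b_end, min(right_limit, g.tss + max_ext))
        domains[g.name] = (max(0, start), min(chrom_len, end))
    return domains

def genes_for_peak(pos: int, domains: Dict[str, Tuple[int, int]]) -> List[str]:
    """All genes whose regulatory domain (half-open interval) contains the peak."""
    return sorted(name for name, (s, e) in domains.items() if s <= pos < e)

genes = [Gene("A", 10_000, "+"), Gene("B", 100_000, "-")]
domains = regulatory_domains(genes, chrom_len=3_000_000)
print(domains)
print(genes_for_peak(50_000, domains))
```

Note that a distal peak between two genes falls in both regulatory domains, so it is associated with both genes.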
GREAT infers many specific functions of SRF from its binding profile
Top GREAT enrichments of SRF

Ontology          Term                      # Genes   Binomial P-value   Experimental support*
Gene Ontology     actin cytoskeleton        30        7x10^-9            Miano et al. 2007
                  actin binding             31        5x10^-5            Miano et al. 2007
Pathway Commons   TRAIL signaling           32        5x10^-7            Bertolotto et al. 2000
                  Class I PI3K signaling    26        2x10^-6            Poser et al. 2000
TreeFam           FOS gene family           5         1x10^-8            Chai & Tarnawski 2002
TF Targets        Targets of SRF            84        5x10^-76           Positive control
                  Targets of GABP           28        4x10^-9            ChIP-Seq support
                  Targets of YY1            44        1x10^-6            Natesan & Gilman 1995
                  Targets of EGR1           23        2x10^-4

* Known from literature: the function is known and SOME of the genes are known, but the binding sites highlighted are NOT.

Top gene-based enrichments of SRF (top actin-related term 28th in list)
Similar results for GABP, NRSF, Stat3, p300 ChIP-Seq
[McLean et al., Nat Biotechnol., 2010]
GREAT data integrated
Michael Hiller
• Twenty ontologies spanning broad categories of biology
• 44,832 total ontology terms tested in each GREAT run
[Figure: number of terms in each of the twenty ontologies: 2,800; 5,215; 834; 5,781; 427; 456; 150; 1,253; 288; 706; 6,700; 3,079; 911; 615; 19; 222; 9; 6,857; 8,272; 238]
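With tens of thousands of terms tested per run, raw p-values need a multiple-testing correction. A minimal Bonferroni sketch follows; it assumes the correction is applied across all 44,832 terms and uses a 0.05 threshold, neither of which is stated on the slides, and the function names are hypothetical.

```python
N_TERMS = 44_832  # total ontology terms tested per run, from the slide

def bonferroni(p_raw: float, n_tests: int = N_TERMS) -> float:
    """Bonferroni-corrected p-value: multiply by the number of tests, cap at 1."""
    return min(1.0, p_raw * n_tests)

def is_significant(p_raw: float, alpha: float = 0.05, n_tests: int = N_TERMS) -> bool:
    """alpha = 0.05 is an assumed threshold for illustration."""
    return bonferroni(p_raw, n_tests) <= alpha

for p in (1e-3, 1e-5, 1e-9):
    print(p, bonferroni(p), is_significant(p))
```

A raw p-value of 1e-5 would not survive this correction, while 1e-9 would, which is why only very small binomial p-values are worth following up.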
GREAT implementation
• Can handle datasets of hundreds of thousands of genomic regions
• Testing a single ontology term takes ~1 ms
• Enables real-time calculation of enrichment results for all ontologies
Cory McLean
GREAT web app: input page
Dave Bristor
Pick a genome assembly
Input BED regions of interest
http://great.stanford.edu
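For readers unfamiliar with the input format: a BED file lists one region per line as chromosome, start, and end, with 0-based half-open coordinates. The sketch below reads the first three columns of BED text; `parse_bed` and the example regions are hypothetical and say nothing about how GREAT itself parses input.

```python
from typing import List, Tuple

def parse_bed(text: str) -> List[Tuple[str, int, int]]:
    """Read (chrom, start, end) from BED text, skipping header lines
    (hypothetical helper; extra columns are ignored)."""
    regions = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        if start < 0 or end < start:
            raise ValueError(f"bad BED interval: {line!r}")
        regions.append((chrom, start, end))
    return regions

# Made-up example regions, tab-separated as in a real BED file.
example = "track name=peaks\nchr1\t1000\t1500\tpeak1\nchr2\t200\t450\n"
regions = parse_bed(example)
print(regions)
```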
Additional ontologies, term statistics, multiple hypothesis corrections, etc.
GREAT web app: output summary
Ontology-specific enrichments
GREAT web app: term details page
Frame holding http://www.geneontology.org definition of “actin binding”
Genes annotated as “actin binding” with associated genomic regions
Genomic regions annotated with “actin binding”
Drill down to explore how a particular peak regulates Plectin and its role in actin binding
You can also submit any track straight from the UCSC Table Browser.
A simple, well documented programmatic interface allows any tool to submit directly to GREAT.
See our Help. Inquiries welcome!
GREAT web app: export data
HTML output displays all user selected rows and columns
Tab-separated values also available for additional postprocessing
External Web Stats: Catching On
last 500 entries only
• Current technologies identify cis-regulatory sequences
• GREAT accurately assesses functional enrichments of cis-regulatory sequences using a genomic region-based approach [McLean et al., Nat Biotechnol., 2010]
• Online tool available (version 1.5 coming soon, in QA): http://great.stanford.edu
• GREAT is immediately applicable to all sets with a significant cis-regulatory content:
  • Regulatory Chromatin Markers (e.g., H3K4me1)
  • Genome Wide Association Studies (GWAS)
  • Comparative Genomics sets (e.g., ultraconserved elements)
Summary
Acknowledgments
GREAT developers: Cory McLean, Dave Bristor, Michael Hiller, Shoa Clarke, Craig Lowe, Aaron Wenger, Gill Bejerano
Other help: Fah Sathira, Marina Sirota, Bruce Schaar, Terry Capellini, Christopher Meyer, Jennifer Hardee
http://great.stanford.edu http://bejerano.stanford.edu