Date post: | 24-Jan-2016 |
Upload: | merry-charles |
cDNA Sequencing, SAGE and Microarray Analysis
Outline
• Overview of transcription
• Construction of cDNA libraries
• cDNA sequencing
• Expression analysis via SAGE
• Microarray construction and use in expression analysis
The “Central Dogma” of Molecular Biology
DNA → mRNA → protein
cDNA’s
Isolate, reverse transcribe, label
We can isolate mRNA and convert it to a stable form (cDNA)
Genome in numbers
Nucleic acid content of an average human cell
Abundance distribution of mRNA species in a typical mammalian cell
Isolation of mRNA
cDNA library construction
The big picture
cDNA synthesis
cDNA synthesis occurs in the 5’ to 3’ direction and requires:
– a template
– nucleotides (dNTPs)
– reverse transcriptase (retroviral polymerase)
– a primer to initiate synthesis
Priming alternatives for cDNA construction
Cloning: Blunt end vs. sticky
cDNA library construction
Ligation of cDNA into vector
cDNA library
Ideally containing at least one copy of every expressed gene
The probability of the above is a function of:
– fragment size: the longer the fragment, the more likely the gene is represented
– genome size: smaller genome = increased chance that the gene is represented
– expression: high expression = high likelihood that the gene is represented
For 99% probability, a mammalian cDNA library must contain ~800,000 clones
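The slide does not say how the ~800,000 figure was derived. One common way to size a library is N = ln(1 − P) / ln(1 − f), where P is the desired probability and f is the fraction of the mRNA pool contributed by the transcript of interest. The sketch below assumes that formula; the rarity value (1 mRNA in 174,000) is a hypothetical input chosen only because it lands near the slide's number.

```python
import math

def clones_needed(probability, fraction):
    """Clones required so that a transcript making up `fraction` of the
    mRNA pool appears at least once with the given probability.

    Assumes each clone is an independent random draw from the pool.
    """
    if not (0 < probability < 1 and 0 < fraction < 1):
        raise ValueError("probability and fraction must lie strictly between 0 and 1")
    return math.ceil(math.log(1 - probability) / math.log(1 - fraction))

# Hypothetical rare transcript: 1 molecule in 174,000 mRNAs.
rare = clones_needed(0.99, 1 / 174_000)
print(f"Clones needed for a rare transcript: {rare:,}")

# A more abundant transcript needs far fewer clones.
common = clones_needed(0.99, 1 / 1_000)
print(f"Clones needed for a 1-in-1,000 transcript: {common:,}")
```

Under these assumptions the rare transcript needs roughly 800,000 clones, which shows how strongly the required library size depends on expression level.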
cDNA sequencing
• The advent of cDNA cloning, combined with the creation of automated sequencers, led to efforts to sequence the entire human transcriptome and to create arrays (on filters) of cDNAs (see reading materials).
• cDNA sequencing was viewed as the fastest way to get at the coding portion of the genome.
• Numerous companies sprang up to sequence and patent cDNAs.
• cDNA sequencing was also used to measure gene expression levels.
cDNA sequencing --> expression analysis
• Expression level estimates:
– Count the number of occurrences of a given cDNA sequence in a given library; highly expressed genes will have been sequenced more often.
– Use the above (in combination with the total number of sequences in the library) to estimate expression level.
Ex. PEDB (www.PEDB.org)
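The counting approach above can be sketched in a few lines. This is a minimal illustration, assuming each sequenced clone has already been assigned to a gene; the gene names and counts are made up for the example.

```python
from collections import Counter

def expression_estimates(clone_genes):
    """Estimate relative expression as (clones for a gene) / (total clones)."""
    counts = Counter(clone_genes)
    total = sum(counts.values())
    return {gene: n / total for gene, n in counts.items()}

# Hypothetical library of 10 sequenced clones.
library = ["geneA"] * 6 + ["geneB"] * 3 + ["geneC"]
estimates = expression_estimates(library)

for gene, fraction in sorted(estimates.items()):
    print(f"{gene}: {fraction:.0%} of sequenced clones")
```

A gene seen only once (geneC here) has a very uncertain estimate, which is the low-abundance problem raised later in the deck.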
Serial Analysis of Gene Expression (SAGE)
• Concept
– cDNA sequencing is expensive
– Most mRNA species can be uniquely identified by a short sequence at a defined location in the gene (9 bp tags are unique 95% of the time)
– If we could produce a library of these short sequences and ligate them together, we could sequence the ligated DNA and measure the abundance of each transcript more efficiently
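The idea of a short tag at a defined location can be sketched as follows. The slide does not name the defined location, so this sketch assumes the usual SAGE convention: the tag is the sequence immediately 3’ of the 3’-most CATG site (the NlaIII recognition site). The transcript sequences are invented for illustration.

```python
from collections import Counter

ANCHOR = "CATG"   # assumption: NlaIII anchoring site, not stated on the slide
TAG_LENGTH = 9    # 9 bp tags, as on the slide

def sage_tag(cdna, anchor=ANCHOR, tag_length=TAG_LENGTH):
    """Return the tag following the 3'-most anchor site, or None if the
    sequence has no anchor or too little sequence after it."""
    pos = cdna.rfind(anchor)
    if pos == -1:
        return None
    start = pos + len(anchor)
    tag = cdna[start:start + tag_length]
    return tag if len(tag) == tag_length else None

def count_tags(cdnas):
    """Count how often each tag occurs across a set of transcripts."""
    tags = (sage_tag(seq) for seq in cdnas)
    return Counter(tag for tag in tags if tag is not None)

# Hypothetical transcripts: two copies of one mRNA, one copy of another.
transcripts = [
    "GGACATGAAACCCGGGTTTAAA",
    "GGACATGAAACCCGGGTTTAAA",
    "TTCATGCCCATGTTTGGGAAACCC",
]
tag_counts = count_tags(transcripts)
print(tag_counts)
print("Possible 9 bp tags:", 4 ** TAG_LENGTH)
```

Counting tags instead of full-length clones is what lets one sequencing read report on many transcripts at once.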
SAGE diagram
[Figure: tags flanked by linkers A and B, with Primer A and Primer B sites. Linker: Primer A/B - Type II site - Type I site. The ligated products are then sequenced.]
Issues with SAGE (and cDNA sequencing for expression analysis)
• Low abundance clones
– SAGE
• In 1995, the estimate was that quantifying genes represented by <100 mRNAs/cell would take a single investigator a few months of work (maybe 10 times quicker today)
• Cost: even assuming a low estimate of $6/sequencing reaction, 96 lanes * 4 runs/day * 30 days * $6 ≈ $69,000 to measure ≈ 460,000 tags (assuming 40 tags per reaction).
– cDNA sequencing
• Same problem; costs/time maybe 20-40 times higher
• Hence expression information about low abundance clones is not accurate in cDNA or SAGE data in most cases.
• Leading to the advent of arrays…..
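The SAGE cost estimate above can be checked with a short calculation. All inputs come from the slide; the only assumption is that the 40 tags are obtained per sequencing reaction (lane), which is what makes the slide's totals work out.

```python
LANES_PER_RUN = 96
RUNS_PER_DAY = 4
DAYS = 30
COST_PER_REACTION = 6      # dollars, the slide's "low estimate"
TAGS_PER_REACTION = 40     # assumption: 40 tags per lane

reactions = LANES_PER_RUN * RUNS_PER_DAY * DAYS
total_cost = reactions * COST_PER_REACTION
total_tags = reactions * TAGS_PER_REACTION
cost_per_tag = total_cost / total_tags

print(f"Sequencing reactions: {reactions:,}")
print(f"Total cost: ${total_cost:,}")
print(f"Tags measured: {total_tags:,}")
print(f"Cost per tag: ${cost_per_tag:.2f}")
```

At about $0.15 per tag, a transcript present at a few copies per cell still requires hundreds of thousands of tags before its count becomes reliable.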
DNA Hybridization
[Figure: spots A and B, with labels "On the surface", "In solution", "After Hybridization", and "4 copies of gene A, 1 copy of gene B".]
Taking advantage of DNA hybridization
DNA Arrays
• Spots of DNA arranged in a particular spatial arrangement on a solid support
• Supports - filters (nylon, nitrocellulose), glass, silicon
• Types
– Spotted or placed - pre-synthesized DNA put onto a surface
– Synthesized - DNA synthesized directly on the surface
The Original DNA Array
Petri dish with bacterial colonies
Apply membrane and lift to make a filter containing DNA from each clone.
Probe and image to identify clones homologous to the probe.
Vicki - A manual Gridding tool
Gridding tool modifications by : Michèl Schummer
Vicki and the gridding frame
Frame Design by: Michèl Schummer
Robotic Spotters for Filters
Types of filter based arrays
• PCR products - ORFs or cDNAs
• Oligos - sometimes, but generally not used for short products, since oligos do not immobilize well on membranes
• Living clones
– Place the membrane on Whatman paper soaked in media; colonies can grow directly on the arrays
– Lysis of the colonies followed by cross-linking produces DNA arrays
– Good for screening large libraries
Uses for Filter Based Arrays
• In general, filter based arrays were in vogue about 8-13 years ago in the pre-genomic days.
• Typically cDNA libraries were spotted as clones and the arrays were used to perform comparative expression analysis.
• Detection was typically performed with radioactive labeling/film or phosphorimaging.
• “Interesting clones” were identified (via differential expression) and then sequenced.
• For genomes that have not yet been sequenced, this can still be a cost effective approach, but rapid sequencing is changing that.
Selected cDNA arrays
• With unselected cDNA libraries, clones for highly expressed genes are overrepresented on the arrays.
• As time progressed, a large number of cDNAs were sequenced, and hence it became possible to select unique cDNAs and to make arrays on which each spot represented a single gene.
• Around the same time, coatings for glass were developed that retained spotted DNA well.
• This allowed for arrays to be produced on glass microscope slides which in turn allowed for fluorescence based detection technology.
Typical Path for cDNA clone acquisition
[Figure: flow chart. IMAGE Consortium + others sequence cDNAs; sequences go to GenBank, and clones go to Livermore and ATCC and on to commercial distributors (Res. Genetics, InCyte, others). UniGene: reduce redundancy, giving UniGene sets. Sequence checking (us) gives sequence-verified sets.]
Spotted Arrays
[Figure: a spotting “pen” deposits a drop containing DNA in solution (e.g. CAGTTTGA) onto a reactive or coated surface.]
MD GenIII Arrayer
• Plate hotel: holds twelve 384-well plates
• Gridding head: 12 pins
• Slide holder: 36 slides
Features:
• 36 slides in 8 hours
• 7680 genes spotted in duplicate
• Built-in humidity control
[Figure: two-color workflow. Cell Population #1: extract mRNA, make cDNA, label with green fluor. Cell Population #2: extract mRNA, make cDNA, label with red fluor. Co-hybridize both to a slide with DNA from different genes, then scan.]
Glass slides enabled fluorescent detection in 2 (or more) colors
Spotted arrays
• Initially, most spotted arrays were produced by spotting PCR products produced from selected cDNA clones.
• Issues
– Must have the libraries in hand
– Must not mix clones up
– Must perform high throughput PCR to produce DNA to spot (again without mixing things up).
– LOTS of freezer space to store everything
– cDNAs are long and cross hybridization is a problem (although it is possible to spot oligos)
– Quality manufacturing is difficult to maintain.
Oligo Arrays
• Synthesized or spotted arrays of short oligos of chosen sequence (typically 20-60 bases)
• Synthesis methods - ink jet, light directed.
• Spotting using reactive coupling.
• Used for re-sequencing, genotyping, diagnostics and expression arrays.
• MUCH better than cDNA arrays at distinguishing related sequences
• Only have to store the DNAs OR (better yet), if you synthesize DNA directly on the surface, you only need to store the sequence information (and a few reagents)
Basic Oligo Synthesis
[Figure: synthesis cycle on a glass support. Diagram labels: Glass Support, Base, P, Protecting Group. Steps shown: Remove Protecting Group, Coupling, Add Next Nucleotide.]
Ink-jets Can be Used to Direct Small Volumes of Liquids to Specific Sites
[Figure: ink-jet firing cycle, < 1 msec. Fill reservoir (resistor off); resistor on, liquid vaporizes and gas expands; resistor off, drop breaks off and reservoir refills.]
Agilent InkJet Array Technology
~ 44,000 Features on 1”x3” Slide
If, instead of using ink, one fills the reservoirs with different nucleotides, inkjets can be used to make DNA on a surface
Glass Can be Treated to Produce Hydrophilic “Wells”
Agilent Printing Facility
Light-directed oligo synthesis
Number of different DNA sequences as a function of photolithographic resolution
All possible oligos of length N can be made in 4*N steps
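The 4*N claim can be illustrated with a toy simulation. The sketch assumes the usual picture of light-directed synthesis: at each of the N positions, four masked steps are run (one per base), and each step adds its base only to the features that need it at that position. The function and variable names are illustrative, not taken from any vendor software.

```python
from itertools import product

BASES = "ACGT"

def light_directed_synthesis(targets):
    """Grow every target sequence in parallel; all targets must have the
    same length. Returns the grown sequences and the number of steps."""
    length = len(targets[0])
    grown = [""] * len(targets)
    steps = 0
    for position in range(length):
        for base in BASES:
            # The "mask" exposes only features whose next base is `base`.
            for i, target in enumerate(targets):
                if target[position] == base:
                    grown[i] += base
            steps += 1
    return grown, steps

# All 4^3 = 64 possible 3-mers, made in 4 * 3 = 12 steps.
targets = ["".join(p) for p in product(BASES, repeat=3)]
grown, steps = light_directed_synthesis(targets)
print(f"{len(grown)} sequences made in {steps} steps")
```

The step count depends only on oligo length, not on how many different sequences are made, which is why feature count is limited by photolithographic resolution instead of by synthesis time.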
Affymetrix Platform
• Each gene is represented by 11 probe pairs of 25-mer oligos
• Each probe pair contains a perfect match and a mismatch to the gene sequence
• Target sample is labeled with a biotinylated nucleotide and detected via a streptavidin-phycoerythrin conjugate
• One sample per array, one-color data
Affymetrix Expression Data
Data from the 11 probe pairs are used to calculate an aggregate signal for each gene
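The slides do not say how the aggregate signal is computed, and production algorithms are more robust than what follows. As a minimal sketch, assume the aggregate is simply the mean of the perfect match minus mismatch differences across the 11 probe pairs. The intensities are invented for illustration.

```python
from statistics import mean

def aggregate_signal(probe_pairs):
    """Mean of (perfect match - mismatch) over all probe pairs.

    probe_pairs: iterable of (pm_intensity, mm_intensity) tuples.
    This simple average is an assumption, not the vendor's algorithm.
    """
    return mean(pm - mm for pm, mm in probe_pairs)

# Hypothetical intensities for the 11 probe pairs of one gene.
pairs = [(1000 + 10 * i, 400) for i in range(11)]
signal = aggregate_signal(pairs)
print(f"Aggregate signal from {len(pairs)} probe pairs: {signal}")
```

Subtracting the mismatch intensity is meant to remove non-specific hybridization, and averaging over many probes dampens the effect of any single poorly behaved probe.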
Strategies For Array Design
• Surrogate strategy - most expression arrays to date
• Annotation strategy - exon arrays, splice variants
• Tiling strategy - unbiased look at the genome
[Figure labels: Known Exons, Unknown transcript]
Affymetrix Platform
• Expression arrays
– Human, Mouse, Rat, Yeast, E. coli, Drosophila, C. elegans, Dog, Soybean, Plasmodium, Anopheles, Pseudomonas, Arabidopsis, Zebrafish, Xenopus, etc.
• Exon arrays
– Alternative splicing patterns
• Mapping arrays
– SNP analysis, loss of heterozygosity
• Tiling array sets
– Transcript mapping
• Custom arrays
Issues with synthesized oligos
• Repetitive yield - i.e. for each reaction cycle, what percentage of the oligos react as intended - estimated at 95% for the light directed method, 98-99% for the ink jet method
• (0.95)^20 = 35.8%, (0.98)^20 = 67%. Net result: Affy arrays are usually 25-mers, ink jet arrays are usually 60-mers.
• For a single oligo, it can be shown that sensitivity plateaus at 50-70 bases.
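The yield arithmetic above generalizes to any length: if each cycle succeeds with probability y, the fraction of full-length product after n cycles is y^n. The sketch below follows the slide's convention of n cycles for an n-mer and uses the slide's yield estimates.

```python
def full_length_fraction(stepwise_yield, length):
    """Fraction of oligos that are full length after `length` cycles,
    assuming every cycle has the same independent yield."""
    return stepwise_yield ** length

for label, stepwise_yield, length in [
    ("light directed, 20-mer", 0.95, 20),
    ("ink jet, 20-mer", 0.98, 20),
    ("light directed, 25-mer", 0.95, 25),
    ("ink jet, 60-mer", 0.98, 60),
    ("ink jet (99%), 60-mer", 0.99, 60),
]:
    fraction = full_length_fraction(stepwise_yield, length)
    print(f"{label}: {fraction:.1%} full length")
```

The comparison shows why a lower per-cycle yield forces shorter probes: the full-length fraction falls off exponentially with length.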
Relative merits of different methods of making oligo arrays
• Affy:
– available first, large catalogue, small feature size possible
• Inkjet:
– much more flexible to design
• Spotted:
– less practical for large numbers (> a few hundred) of oligos, but can be made with standard spotting equipment. Libraries of oligos exist for the more common organisms, so oligo deposition is feasible for some organisms.
Illumina’s Bead Arrays
Step 1 - Synthesize beads in batches, each batch with a sequence on it. Generally, color code the beads to keep track of which one has what molecule on it.
Step 2 - Etch the ends of optical fibers in a bundle, or circular spots on a glass slide, to create bead sized depressions.
[Figure: beads carrying example sequences such as ACGTGTCTACAGT, CGTGTATGCATGT, ATGCACTGTAGT and TGCATCAGTGCA.]
Illumina’s Bead Arrays (cont)
Step 3 - Allow beads to self assemble into an array on the end of the fibers or on the surface
[Figure: beads carrying the example sequences, assembled at random positions in the array.]
• These self assembled arrays can be used for the same applications as other DNA arrays.
• Since the assembly is random, one must overrepresent each desired oligo tens of times to ensure that each oligo is represented at least n times on the array.
• Decoding can also be accomplished by hybridizing short labeled oligos to the oligos on each bead. In practice, this is how it is usually done.
See www.illumina.com
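The need for overrepresentation in a randomly assembled array can be estimated with a short calculation. The sketch assumes the number of copies of any one bead type on the array is approximately Poisson distributed, with a mean equal to the average redundancy; this is a modelling assumption, not something stated on the slides.

```python
import math

def prob_at_least(n, mean_copies):
    """Probability that a bead type appears at least n times when its
    copy number is Poisson distributed with the given mean."""
    below = sum(
        math.exp(-mean_copies) * mean_copies ** k / math.factorial(k)
        for k in range(n)
    )
    return 1 - below

for mean_copies in (1, 5, 10, 30):
    p = prob_at_least(5, mean_copies)
    print(f"mean {mean_copies:>2} copies: P(at least 5 copies) = {p:.6f}")
```

With only a few copies on average, many bead types would fall below the required minimum, while tens of copies make a shortfall very unlikely.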
Detection technologies
• Radiolabeled probes
– Film or phosphorimagers
• Biotin labeled
– Post-hybridization detection with streptavidin (SA) labeled with a fluor or an enzyme
• Fluorescent probes
– Confocal scanning
Scanning with a confocal microscope
Expression Array Analysis
2-color Microarray Overview
• Prepare fluorescently labeled probes (Control, Test)
• Hybridize, wash
• Measure fluorescence in 2 channels (red/green)
• Analyze the data to identify patterns of gene expression
Slide from John Quackenbush, Dana Farber
1-color Microarray Overview
• Prepare fluorescently labeled probes (Control, Test)
• Hybridize, wash
• Measure fluorescence in 1 channel
• Analyze the data to identify patterns of gene expression
[Figure labels: Weed, Bush]
Slide adapted from John Quackenbush, Dana Farber
2-color vs. single color
• 2-color was originally designed due to problems in making reproducible arrays - e.g. the ratio on a spot is more reproducible than the absolute intensity if the spot size/concentration changes from array-to-array.
• With 2 colors, you don’t necessarily get twice as much data, since it is typical to run an extra array in the inverted color scheme.
• Experimental design and cross experiment comparisons are much more complicated with 2-color arrays.
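The reasoning behind the ratio measurement can be shown with a small sketch. It assumes the common convention of reporting a log2 ratio of the two channels per spot, and of combining an array with its inverted-color partner by averaging the two log ratios with opposite sign. The intensities are invented for illustration.

```python
import math

def log2_ratio(red, green):
    """Log2 ratio of the two channel intensities for one spot."""
    return math.log2(red / green)

def dye_swap_average(forward, swapped):
    """Combine a spot from a forward array (test in red) with the same
    spot from the inverted-color array (test in green).

    forward, swapped: (red, green) intensity tuples.
    """
    return (log2_ratio(*forward) - log2_ratio(*swapped)) / 2

# Same gene on two arrays whose spots carry different amounts of DNA:
# absolute intensities differ five-fold, but the ratio is unchanged.
big_spot = (2000, 1000)
small_spot = (400, 200)
print(log2_ratio(*big_spot), log2_ratio(*small_spot))

# Inverted color scheme: the test sample is now in the green channel.
print(dye_swap_average((2000, 1000), (1000, 2000)))
```

Both spots give a log2 ratio of 1.0 despite the five-fold difference in absolute signal, which is the reproducibility argument made in the first bullet above.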
Expression Arrays are a Natural Extension of Genomic Analysis
• Genome studies provide the source material for the arrays, e.g. clones or manufactured DNAs.
• For completely sequenced genomes, arrays allow a comprehensive survey of gene expression.
• This level of analysis is a revolution in biology.
Expression Arrays Have a Broad Range of Applicability
• Pharmaceutical studies - drug treated vs. non-treated.
• Infectious disease studies - host response to infection, infectious agent gene expression, viral diversity.
• Environmental - microbial diversity, effects of toxins, effect of growth conditions.
• Cancer Studies - tumor vs. normal.
Expression Arrays Have a Broad Range of Applicability
• Gene specific studies - deletion (“knockout”) vs. normal, over expression vs. normal.
• Agricultural studies - effects of pesticides, growth conditions, hormones.
• Developmental biology - cells from different areas/stages of developing organisms
• Many others - any two samples of interest can be compared.
Challenges for Planning Good Array Experiments
• Experimental Design
– Replicates are necessary and expensive
– A simple experiment may not give a simple answer
– What comparisons should be made?
• Data Analysis
– How will differentially expressed genes be identified?
– How will errors be estimated?
– What software does this best?
– How will the data be mined?
Where are arrays going?
• As sequencing gets cheaper and cheaper, most assays that are currently done by arrays can be done more effectively by sequencing. Hence, the analytical use of arrays will be replaced by sequencing.
• However, arrays can also be used to enrich for specific genomic regions upstream of sequencing or can be used to create many sequences for the artificial production of genomes or genomic regions.