Post on 24-Jan-2016

cDNA Sequencing, SAGE and Microarray Analysis

Outline

• Overview of transcription
• Construction of cDNA libraries
• cDNA sequencing
• Expression analysis via SAGE
• Microarray construction and their use in expression analysis

The “Central Dogma” of Molecular Biology

DNA → mRNA → protein

cDNA’s

Isolate, reverse transcribe, label

We can isolate mRNA and convert it to a stable form (cDNA)

Genome in numbers

Nucleic acid content of an average human cell

Abundance distribution of mRNA species in a typical mammalian cell

Isolation of mRNA

cDNA library construction: the big picture

cDNA synthesis

cDNA synthesis occurs in the 5’ to 3’ direction and requires:
– a template
– nucleotides (dNTPs)
– reverse transcriptase (retroviral polymerase)
– a primer to initiate synthesis

Priming alternatives for cDNA construction

Cloning: Blunt end vs. sticky

cDNA library construction

Ligation of cDNA into vector

cDNA library

Ideally containing at least one copy of every expressed gene

The probability of the above is a function of:
fragment size – the longer the fragment, the more likely the gene is represented
genome size – smaller genome = increased chance to find the gene represented
expression – high expression = high likelihood to find the gene represented

For 99% probability, a mammalian cDNA library needs to contain ~800,000 clones
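The ~800,000 figure can be related to transcript abundance with a back-of-the-envelope estimate. The sketch below assumes the commonly used relation N = ln(1 − P) / ln(1 − f), where P is the desired probability and f is the fractional abundance of the rarest mRNA of interest. The formula and the function names are assumptions for illustration; they are not stated on the slide.

```python
import math

def clones_needed(p, f):
    """Clones required to include a transcript of fractional abundance f
    at least once with probability p (assumes independent random sampling)."""
    return math.log(1.0 - p) / math.log(1.0 - f)

def rarest_abundance(p, n_clones):
    """Inverse relation: the fractional abundance that n_clones clones
    capture at least once with probability p."""
    return 1.0 - (1.0 - p) ** (1.0 / n_clones)

# Which abundance does the slide's figure (99%, ~800,000 clones) correspond to?
f = rarest_abundance(0.99, 800_000)
print(f"fractional abundance: {f:.2e} (about 1 in {1 / f:,.0f})")
print(f"clones needed at that abundance: {clones_needed(0.99, f):,.0f}")
```

Under these assumptions, 800,000 clones at 99% probability corresponds to a transcript making up roughly 1 in 170,000 of the mRNA population.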

cDNA sequencing

• The advent of cDNA cloning, combined with the creation of automated sequencers, led to efforts to sequence the entire human transcriptome and to create arrays (on filters) of cDNAs (see reading materials).

• cDNA sequencing was viewed as the fastest way to get at the coding portion of the genome.

• Numerous companies sprang up to sequence and patent cDNAs.

• cDNA sequencing was also used to measure gene expression levels.

cDNA sequencing --> expression analysis

• Expression level estimates:
– Count the number of occurrences of a given cDNA sequence in a given library; highly expressed genes will have been sequenced more often.
– Use the above (in combination with the total number of sequences in the library) to estimate expression level.
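The counting approach above can be sketched in a few lines. The library below is a hypothetical list of gene labels, one per sequenced clone; the gene names and the function name are illustrative assumptions only.

```python
from collections import Counter

def expression_estimates(clone_genes):
    """Map each gene to (clone count, fraction of all sequenced clones)."""
    counts = Counter(clone_genes)
    total = sum(counts.values())
    return {gene: (n, n / total) for gene, n in counts.items()}

# Hypothetical library: each entry is the gene assigned to one sequenced clone.
library = ["ACTB"] * 6 + ["GAPDH"] * 3 + ["TP53"]

for gene, (n, frac) in expression_estimates(library).items():
    print(f"{gene}: {n} clones, {frac:.0%} of library")
```

Note that a gene seen only once or twice has a very uncertain estimate, which is the low-abundance problem discussed later in the deck.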

Ex. PEDB (www.PEDB.org)

Web-based expression analysis (www.pedb.org): counting cDNA frequency

Serial Analysis of Gene Expression (SAGE)

• Concept
– cDNA sequencing is expensive
– Most mRNA species can be uniquely identified by a short sequence at a defined location in the gene (9 bp tags are unique 95% of the time)
– If we could produce a library of short sequences and ligate them together, then we could sequence the ligated DNA to measure the concentration of each transcript more efficiently
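The tag idea can be illustrated with a simplified sketch. It assumes the tag is the 9 bases immediately 3' of the last CATG site in each cDNA (CATG is the NlaIII site commonly used as the anchoring site in SAGE; the slide does not name the enzyme). The function names and the toy sequences are hypothetical.

```python
from collections import Counter

ANCHOR = "CATG"  # assumed anchoring-enzyme site (NlaIII); not given on the slide
TAG_LEN = 9      # tag length quoted on the slide

def sage_tag(cdna, anchor=ANCHOR, tag_len=TAG_LEN):
    """Return the tag that follows the 3'-most anchor site, or None if the
    sequence has no anchor site or too few bases after it."""
    pos = cdna.rfind(anchor)
    if pos == -1:
        return None
    start = pos + len(anchor)
    tag = cdna[start:start + tag_len]
    return tag if len(tag) == tag_len else None

def count_tags(cdnas):
    """Tally tags over a collection of cDNA sequences."""
    return Counter(t for t in map(sage_tag, cdnas) if t is not None)

# Toy transcripts (hypothetical sequences).
transcripts = ["GGGCATGAAACCCGGGTTT"] * 3 + ["CATGTTTTTTTTTCATGACGTACGTAC"]
print(count_tags(transcripts))
print("possible 9 bp tags:", 4 ** TAG_LEN)
```

Because every transcript contributes only one short tag, many tags fit into a single sequencing read once they are ligated together.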

SAGE diagram

[Figure: SAGE workflow with linkers A and B. Linker layout: Primer A/B site – Type II site – Type I site. Products flanked by Primer A and Primer B are sequenced.]

Issues with SAGE (and cDNA sequencing for expression analysis)

• Low abundance clones
– SAGE
• In 1995, the estimate was that characterizing genes represented by <100 mRNAs/cell would take a single investigator a few months of work (maybe 10 times quicker today)
• Cost - even assuming a low estimate of $6/sequencing reaction, 96 lanes * 4 runs/day * 30 days * $6 = $69,000 to measure 460,000 tags (assuming 40 tags per reaction).
– cDNA sequencing
• Same problem; costs/time maybe 20-40 times higher

• Hence expression information about low abundance clones is not accurate in cDNA or SAGE data in most cases.

• This led to the advent of arrays.
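The SAGE cost estimate quoted above can be checked with a few lines of arithmetic. All inputs are the slide's own numbers, read here as 40 tags per sequencing reaction (lane); the variable names are illustrative.

```python
lanes_per_run = 96
runs_per_day = 4
days = 30
cost_per_reaction = 6      # USD, the slide's "low estimate"
tags_per_reaction = 40     # assumed to mean tags per lane

reactions = lanes_per_run * runs_per_day * days
total_cost = reactions * cost_per_reaction
total_tags = reactions * tags_per_reaction

print(f"{reactions:,} reactions, ${total_cost:,}, {total_tags:,} tags")
print(f"cost per tag: ${total_cost / total_tags:.2f}")
```

This gives 11,520 reactions, $69,120 and 460,800 tags, which the slide rounds to $69,000 and 460,000 tags.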

DNA Hybridization

[Figure: hybridization schematic. Labels: spots A and B; “On the surface”; “4 copies of gene A, 1 copy of gene B”; “In solution”; “After hybridization”.]

Taking advantage of DNA hybridization

DNA Arrays

• Spots of DNA arranged in a particular spatial arrangement on a solid support
• Supports - filters (nylon, nitrocellulose), glass, silicon
• Types
– Spotted or placed - pre-synthesized DNA put onto a surface
– Synthesized - DNA synthesized directly on the surface

The Original DNA Array

Petri dish with bacterial colonies

Apply membrane and lift to make a filter containing DNA from each clone.

Probe and image to identify clones homologous to the probe.

Vicki - A manual Gridding tool

Gridding tool modifications by : Michèl Schummer

Vicki and the gridding frame

Frame Design by: Michèl Schummer

Robotic Spotters for Filters

Types of filter based arrays

• PCR products - ORFs or cDNAs
• Oligos - sometimes, but generally not used for short products; oligos do not immobilize well on membranes
• Living clones
– Place membrane on Whatman paper soaked in media; colonies can grow directly on the arrays
– Lysis of the colonies followed by cross-linking produces DNA arrays
– Good for screening large libraries

Uses for Filter Based Arrays

• In general, filter based arrays were in vogue about 8-13 years ago, in the pre-genomic days.
• Typically, cDNA libraries were spotted as clones and the arrays were used to perform comparative expression analysis.
• Detection was typically performed with radioactive labeling/film or phosphorimaging.
• “Interesting clones” were identified (via differential expression) and then sequenced.
• For genomes that have not yet been sequenced, this can still be a cost effective approach, but rapid sequencing is changing that.

Selected cDNA arrays

• With unselected cDNA libraries, clones for highly expressed genes are over-represented on the arrays.
• As time progressed, a large number of cDNAs were sequenced, and hence it became possible to select unique cDNAs and to make arrays on which each spot represented a single gene.
• Around the same time, coatings for glass were developed that retained spotted DNA well.
• This allowed arrays to be produced on glass microscope slides, which in turn allowed fluorescence based detection.

Typical Path for cDNA clone acquisition

[Figure: flow chart of clone acquisition. IMAGE Consortium + others sequence cDNAs. Sequences: GenBank, UniGene (reduce redundancy), UniGene sets. Clones: Livermore, ATCC; commercial distributors (Res. Genetics, InCyte, others). Sequence checking, sequence-verified sets, us.]

Spotted Arrays

[Figure: a spotting “pen” deposits a drop containing DNA in solution (example sequence CAGTTTGA) onto a reactive or coated surface.]

MD GenIII Arrayer

Plate hotel: holds twelve 384-well plates

Gridding head: 12 pins

Slide holder: 36 slides

Features:
• 36 slides in 8 hours
• 7680 genes spotted in duplicate
• Built-in humidity control

[Figure: two-color workflow. Cell population #1: extract mRNA, make cDNA, label with green fluor. Cell population #2: extract mRNA, make cDNA, label with red fluor. Co-hybridize to a slide with DNA from different genes, then scan.]

Glass slides enabled fluorescent detection in 2 (or more) colors

Spotted arrays

• Initially, most spotted arrays were produced by spotting PCR products produced from selected cDNA clones.
• Issues
– Must have the libraries in hand
– Must not mix clones up
– Must perform high-throughput PCR to produce DNA to spot (again without mixing things up)
– LOTS of freezer space to store everything
– cDNAs are long and cross-hybridization is a problem (although it is possible to spot oligos)
– Quality manufacturing is difficult to maintain

Oligo Arrays

• Synthesized or spotted arrays of short oligos of chosen sequence (typically 20-60 nucleotides)
• Synthesis methods - ink jet, light directed
• Spotting using reactive coupling
• Used for re-sequencing, genotyping, diagnostics and expression arrays
• MUCH better than cDNA arrays at distinguishing related sequences
• Only have to store the DNAs OR (better yet), if you synthesize DNA directly on the surface, you only need to store the sequence information (and a few reagents)

Basic Oligo Synthesis

[Figure: synthesis cycle on a glass support. A base on the glass support reacts with an incoming base carrying a protecting group. Steps: remove protecting group, coupling, add next nucleotide.]

Ink-jets Can be Used to Direct Small Volumes of Liquids to Specific Sites

[Figure: ink-jet firing cycle, < 1 msec. Fill reservoir (resistor off); resistor on: liquid vaporizes, gas expands; resistor off: drop breaks off; reservoir refills.]

Agilent InkJet Array Technology

~ 44,000 Features on 1”x3” Slide

If, instead of using ink, one fills the reservoirs with different nucleotides, inkjets can be used to make DNA on a surface

Glass Can be Treated to Produce Hydrophilic “Wells”

Agilent Printing Facility

Light-directed oligo synthesis

Number of different DNA sequences as a function of photolithographic resolution

All possible oligos of length N can be made in 4*N steps
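The 4*N bound can be illustrated with a small simulation. The sketch assumes a simple schedule that offers A, C, G, T in turn for N rounds, with a mask exposing only those features whose next required base matches the base on offer. The function name and the schedule are illustrative assumptions, not a description of any vendor's actual process.

```python
from itertools import product

def light_directed_synthesis(targets, bases="ACGT"):
    """Grow all target oligos in parallel, one masked base addition per step.
    Returns the synthesized sequences and the number of steps used."""
    n = max(len(t) for t in targets)
    grown = ["" for _ in targets]
    steps = 0
    for _ in range(n):              # N rounds
        for base in bases:          # 4 masked additions per round
            steps += 1
            for i, target in enumerate(targets):
                done = len(grown[i])
                # The mask is open for feature i only if this base is needed next.
                if done < len(target) and target[done] == base:
                    grown[i] += base
    return grown, steps

# All 64 possible 3-mers, each on its own feature.
targets = ["".join(p) for p in product("ACGT", repeat=3)]
grown, steps = light_directed_synthesis(targets)
print(grown == targets, steps)
```

Each round of four additions extends every unfinished feature by at least one base, so N rounds (4*N steps) are enough regardless of how many different sequences are on the array.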

Affymetrix Platform

• Each gene is represented by 11 probe pairs of 25 bp oligos

• Each probe pair contains a perfect match and a mismatch to the gene sequence

• Target sample is labeled with a biotinylated nucleotide and detected via a streptavidin-phycoerythrin conjugate

• One sample per array, one-color data

Affymetrix Expression Data

Data from the 11 probe pairs are used to calculate an aggregate signal for each gene
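One simple way to combine probe pairs is to average the perfect match minus mismatch differences. The sketch below uses that plain average purely for illustration; it is an assumption and not necessarily the algorithm used by the vendor's software, and the intensity values are made up.

```python
from statistics import mean

def aggregate_signal(pm, mm):
    """Average (perfect match - mismatch) difference across probe pairs."""
    if len(pm) != len(mm):
        raise ValueError("need one mismatch value per perfect-match value")
    return mean(p - m for p, m in zip(pm, mm))

# Hypothetical intensities for the 11 probe pairs of one gene.
pm = [820, 760, 905, 640, 700, 880, 790, 815, 730, 860, 770]
mm = [210, 190, 260, 180, 200, 240, 205, 215, 195, 230, 185]
print(aggregate_signal(pm, mm))
```

Subtracting the mismatch signal is meant to remove non-specific hybridization, and averaging over several probes reduces the influence of any single poorly behaved probe.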

Strategies for Array Design

Surrogate strategy: most expression arrays to date

Annotation strategy: exon arrays, splice variants

Tiling strategy: unbiased look at the genome

[Figure labels: known exons, unknown transcript]

Affymetrix Platform

• Expression arrays
– Human, Mouse, Rat, Yeast, E. coli, Drosophila, C. elegans, Dog, Soybean, Plasmodium, Anopheles, Pseudomonas, Arabidopsis, Zebrafish, Xenopus, etc.
• Exon arrays
– Alternative splicing patterns
• Mapping arrays
– SNP analysis, loss of heterozygosity
• Tiling array sets
– Transcript mapping
• Custom arrays

Issues with synthesized oligos

• Repetitive yield - i.e. for each reaction cycle, what percentage of the oligos react as intended - estimated at 95% for the light directed method, 98-99% for the ink jet method

• (0.95)^20 = 35.8%, (0.98)^20 = 67%. Net result: Affy arrays are usually 25-mers, ink jet arrays are usually 60-mers.

• For a single oligo, it can be shown that sensitivity plateaus at 50-70 bp.
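The yield arithmetic above generalizes to any oligo length: if each cycle succeeds with probability y, the fraction of full-length product after n cycles is y^n. The sketch below applies this to the lengths and yields quoted on the slide; the function name is illustrative.

```python
def full_length_fraction(stepwise_yield, n_steps):
    """Fraction of oligos that react as intended in every one of n_steps cycles."""
    return stepwise_yield ** n_steps

for label, y, n in [
    ("light directed, 20 steps", 0.95, 20),
    ("ink jet, 20 steps", 0.98, 20),
    ("light directed, 25-mer", 0.95, 25),
    ("ink jet, 60-mer (98%)", 0.98, 60),
    ("ink jet, 60-mer (99%)", 0.99, 60),
]:
    print(f"{label}: {full_length_fraction(y, n):.1%}")
```

At these yields a 25-mer made by the light directed method and a 60-mer made by ink jet end up with broadly similar full-length fractions, which is consistent with the typical probe lengths of the two platforms.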

Relative merits of different methods of making oligo arrays

• Affy:
– available first, large catalogue, small feature size possible
• Inkjet:
– much more flexible to design
• Spotted:
– less practical for large numbers (> a few hundred) of oligos; can be made with standard spotting equipment. Libraries of oligos exist for more common organisms, so oligo deposition is feasible for some organisms.

Illumina’s Bead Arrays

Step 1 - Synthesize beads in batches, each batch with a sequence on it. Generally, color code the beads to keep track of which one has what molecule on it.

Step 2 - Etch the ends of optical fibers in a bundle, or circular spots on a glass slide, to create bead-sized depressions.

[Figure: beads carrying example sequences ACGTGTCTACAGT, CGTGTATGCATGT, ATGCACTGTAGT and TGCATCAGTGCA]

Illumina’s Bead Arrays (cont)

Step 3 - Allow beads to self-assemble into an array on the end of the fibers or on the surface

[Figure: beads carrying example sequences ATGCACTGTAGT, TGCATCAGTGCA, ACGTGTCTACAGT and CGTGTATGCATGT, randomly assembled into wells]

• These self-assembled arrays can be used for the same applications as other DNA arrays.
• Since the assembly is random, one must over-represent each desired oligo tens of times to ensure that each oligo is represented at least n times on the array.
• Decoding can also be accomplished by hybridizing short labeled oligos to the oligos on each bead. In practice, this is how it is usually done.
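The need for over-representation can be made concrete with a rough probability estimate. The sketch assumes that, under random assembly, the number of beads of a given type on the array is approximately Poisson distributed with a mean equal to the over-representation factor. The Poisson model and the function names are assumptions for illustration.

```python
import math

def p_at_least(mean_copies, n):
    """P(a given bead type appears at least n times), Poisson approximation."""
    p_fewer = sum(
        math.exp(-mean_copies) * mean_copies ** k / math.factorial(k)
        for k in range(n)
    )
    return 1.0 - p_fewer

def p_all_types(mean_copies, n, n_types):
    """P(every one of n_types bead types appears at least n times),
    treating the bead types as independent."""
    return p_at_least(mean_copies, n) ** n_types

for mean_copies in (5, 10, 30):
    print(mean_copies, p_at_least(mean_copies, 5), p_all_types(mean_copies, 5, 1000))
```

With an average of only 5 copies per type, a large share of bead types would fall below 5 copies, whereas an average of 30 makes it very likely that all 1000 types reach that threshold.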

See www.illumina.com

Detection technologies

• Radiolabeled probes
– Film or phosphorimagers
• Biotin labeled
– Post-hybridization staining with streptavidin (SA) labeled with a fluor or an enzyme
• Fluorescent probes
– Confocal scanning

Scanning with a confocal microscope

Expression Array Analysis

2-color Microarray Overview

Prepare fluorescently labeled probes (control and test)

Hybridize, wash

Measure fluorescence in 2 channels (red/green)

Analyze the data to identify patterns of gene expression

Slide from John Quackenbush, Dana Farber

1-color Microarray Overview

Prepare fluorescently labeled probes (control and test)

Hybridize, wash

Measure fluorescence in 1 channel

Analyze the data to identify patterns of gene expression

[Figure labels: Weed, Bush]

Slide adapted from John Quackenbush, Dana Farber

2-color vs. single color

• 2-color was originally designed due to problems in making reproducible arrays - i.e. the ratio on a spot is more reproducible than the absolute intensity if the spot size/concentration changes from array to array.

• With 2 colors, you don’t necessarily get twice as much data, since it is typical to run an extra array in the inverted color scheme (a dye swap).

• Experimental design and cross experiment comparisons are much more complicated with 2-color arrays.
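Both points about 2-color data can be illustrated numerically. The sketch assumes that a change in spot size scales both channels by the same factor, and that any dye-specific bias adds the same offset to the log ratio on the forward and the inverted-color array. The function names and intensity values are hypothetical.

```python
import math

def log_ratio(red, green):
    """log2(red / green) for one spot."""
    return math.log2(red / green)

def dye_swap_average(m_forward, m_swapped):
    """Combine a forward array (test in red) with its inverted-color array
    (test in green); an offset common to both arrays cancels out."""
    return (m_forward - m_swapped) / 2

# The same gene on two arrays whose spots differ 3-fold in DNA amount.
print(log_ratio(200, 100), log_ratio(600, 300))   # ratio unchanged, intensities differ

# True log ratio of 1.0 with an assumed dye bias of +0.2 on both arrays.
print(dye_swap_average(1.0 + 0.2, -1.0 + 0.2))
```

The ratio is unaffected by spot size, but removing dye bias in this way costs a second array, which is why two colors do not simply double the amount of data.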

Expression Arrays are a Natural Extension of Genomic Analysis

• Genome studies provide the source material for the arrays - e.g. clones or manufactured DNAs.

• For completely sequenced genomes, arrays allow a comprehensive survey of gene expression.

• This level of analysis is a revolution in biology.

Expression Arrays Have a Broad Range of Applicability

• Pharmaceutical studies - drug treated vs. non-treated.

• Infectious disease studies - host response to infection, infectious agent gene expression, viral diversity.

• Environmental - microbial diversity, effects of toxins, effect of growth conditions.

• Cancer Studies - tumor vs. normal.

Expression Arrays Have a Broad Range of Applicability

• Gene specific studies - deletion (“knockout”) vs. normal, over expression vs. normal.

• Agricultural studies - effects of pesticides, growth conditions, hormones.

• Developmental biology - cells from different areas/stages of developing organisms

• Many others - any two samples of interest can be compared.

Challenges for Planning Good Array Experiments

• Experimental Design
– Replicates are necessary and expensive
– A simple experiment may not give a simple answer
– What comparisons should be made?
• Data Analysis
– How will differentially expressed genes be identified?
– How will errors be estimated?
– What software does this best?
– How will the data be mined?

Where are arrays going?

• As sequencing gets cheaper and cheaper, most assays that are currently done by arrays can be done more effectively by sequencing. Hence, the analytical use of arrays will be replaced by sequencing.

• However, arrays can also be used to enrich for specific genomic regions upstream of sequencing or can be used to create many sequences for the artificial production of genomes or genomic regions.