Microarray Technology and Applications
Introduction to Nanotechnology
Foothill College
Outline
• Gene expression
• Microarray technology
• Microarray process
• Applications in research / medicine
• Haplotyping (SNP) genotyping projects
• Future directions
– New technologies / open source science
Several Advances in Technology Make Microarrays Possible
Surface Chemistry
Robotics
Computing Power & Bioinformatics
Genomics
Gene Expression
• Cells are different because of differential gene expression.
• About 40% of human genes are expressed at any one time.
• A gene is expressed when its DNA is transcribed into single-stranded mRNA (only the exons remain after splicing)
• mRNA is later translated into a protein
• Microarrays measure the level of mRNA expression by analyzing cDNA binding
Analysis of Gene Expression
• Examine expression during development or in different tissues
• Compare genes expressed in normal vs. diseased states
• Analyze response of cells exposed to drugs or different physiological conditions
Monitoring Changes in Genomic DNA
• Identify mutations
• Examine genomic instability such as in certain cancers and tumors (gene amplifications, translocations, deletions)
• Identify polymorphisms (SNPs)
• Diagnosis: chips have been designed to detect mutations in p53, HIV, and the breast cancer gene BRCA-1
Applications in Medicine
• Gene expression studies
– Gene function for cell state change in various conditions (clustering, classification)
• Disease diagnosis (classification)
• Inferring regulatory networks
• Pathogen analysis (rapid genotyping)
Applications in Drug Discovery
• Drug Discovery
– Identify appropriate molecular targets for therapeutic intervention (small molecule / proteins)
– Monitor changes in gene expression in response to drug treatments (up / down regulation)
– Analyze patient populations (SNPs) and response
• Targeted Drug Treatment
– Pharmacogenomics: individualized treatments
– Choosing drugs with the least probable side effects
Many Genes Have Unknown Function
• Of the 25,498 predicted Arabidopsis genes:
• 30% have unknown function
• Only 9% are experimentally verified
The Arabidopsis Genome Initiative, Nature 2000
Generating DNA Sequence
• Automated sequencer produces chromatogram files
• Software pipeline: base calling, quality clipping, vector clipping, contig assembly
>GENE01
ACCTGTCAGTGTCAACTGCTTCAATAGCTAATGCTAGGCTCGATAATCGCTGGCCTCAGCTCAGTCTAGCATTACGATTACGGAGACCTATGCTTTAGCTAGTAGGAACCTCAGCTCAGTACCTGTCAGTGTCAACTGCTTCAATAGCTAATGCTACTC
output
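The pipeline output shown above is a FASTA-style record: a header line starting with ">" followed by the sequence. A minimal sketch of reading such records is given here, assuming plain-text input. The function name `parse_fasta` and the shortened example sequence are illustrative only and are not part of the original slides.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {record name: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # The record name is the first word after ">" (empty if missing).
            header = line[1:].split()
            name = header[0] if header else ""
            records[name] = []
        elif name is not None:
            # Sequence lines may be wrapped; collect them and join later.
            records[name].append(line.upper())
    return {n: "".join(parts) for n, parts in records.items()}


# Hypothetical, shortened record used only for illustration.
example = ">GENE01\nACCTGTCAGTGTCAACTGC\nTTCAATAGCTAATGC\n"
seqs = parse_fasta(example)
gc_fraction = sum(base in "GC" for base in seqs["GENE01"]) / len(seqs["GENE01"])
```

Records parsed this way can then be passed to later steps such as probe selection.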
What Is Microarray Technology?
• Different approaches
– Stanford / Pat Brown: DNA sequences are laid down by spotting; probes are cDNA (complete sequences)
– Affymetrix: DNA sequences are laid down by photolithography; probes are oligonucleotides
Oligoarrays vs. Spotted Arrays
• Oligoarrays
– Shorter nucleotide sequences
– Higher feature density
• Used for SNP detection
– Perfect match and mismatch (A, T, C, G)
• More expensive to prepare
– Higher per-unit production cost
– Not manufactured in as high numbers
Spotted Array Experiment
1. Prepare samples (test and reference).
2. Label with fluorescent dyes.
3. Combine cDNAs.
4. Print microarray.
5. Hybridize to microarray.
6. Scan.
cDNA Array Sample Preparation
Axon Instruments Scanner
• GenePix 4000
Choice of Microarray System
1. cDNA arrays
2. Oligonucleotide chips (Affymetrix)
3. cRNA arrays (Applied Biosystems)
4. SNP arrays (Applied Biosystems)
cDNA Arrays: Advantages
• Non-redundant clone sets are available for numerous organisms (human, mouse, rat, Drosophila, yeast, C. elegans, Arabidopsis)
• Prior knowledge of gene sequence is not necessary: good choice for gene discovery
• Large cDNA size is great for hybridization
• Glass or membrane spotting technology is readily available
Membrane cDNA microarray
cDNA Arrays: Disadvantages
• Processing cDNAs to generate “spotting-ready” material is cumbersome
• Low density compared to oligonucleotide arrays
• cDNAs may contain repetitive sequences (like Alu in humans)
• Common sequences from gene families (e.g., zinc fingers) are present in all cDNAs from these genes: potential for cross-hybridization
• Clone authentication can be difficult
cDNA Microarray Slide
Affymetrix gene chip
Affymetrix Microarrays
• 50 µm (feature), 1.28 cm (array)
• ~10^7 oligonucleotides: half perfectly match the mRNA (PM), half have one mismatch (MM)
• Raw gene expression is the intensity difference: PM - MM
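The PM - MM difference described above can be sketched as a per-gene calculation. The slide does not say how the differences from several probe pairs are combined, so averaging them is an assumption here. The function name `average_difference` and the intensity values are hypothetical.

```python
from statistics import mean


def average_difference(pm, mm):
    """Mean of (PM - MM) over the probe pairs of one gene.

    Averaging over probe pairs is an assumption made for this sketch;
    the slide only states that raw expression is PM - MM.
    """
    if len(pm) != len(mm) or not pm:
        raise ValueError("need the same non-zero number of PM and MM values")
    return mean(p - m for p, m in zip(pm, mm))


# Hypothetical intensities for four probe pairs of a single gene.
pm_intensities = [1200.0, 950.0, 1430.0, 1010.0]
mm_intensities = [300.0, 420.0, 380.0, 510.0]
signal = average_difference(pm_intensities, mm_intensities)
```

Because MM can exceed PM, a difference computed this way can be negative.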
Raw image
Affymetrix
• Probe Array (Photolithography)
– Synthesis of probe
Affymetrix System
Microarrays: An Example
• Leukemia: Acute Lymphoblastic (ALL) vs Acute Myeloid (AML), Golub et al., Science, vol. 286, 1999
– 72 examples (38 train, 34 test), about 7,000 genes
– well-studied (CAMDA-2000), good test example
ALL AML
Visually similar, but genetically very different
Tumor Cell Analysis
Heatmap Visualization of Selected Fields
• Heatmap visualization is done by normalizing each gene to mean 0, std. dev. 1
• Figure labels: ALL, AML; AML-related, ALL-related
• Good correlation overall
• Possible outliers
Analysis Tasks
• Identify up- and down-regulated genes.
• Find groups of genes with similar expression profiles (++ / -- , fold change).
• Find groups of experiments (tissues) with similar expression profiles (++ / -- genes).
• Find genes that explain observed differences among tissues (feature selection), and new pathways.
Types of Analysis
• Unsupervised learning: learn from data only
– visualization: find structure in data
– clustering: find clusters / classes in data
• Supervised learning: learn from data plus prior knowledge
– classification: predict discrete classes
– regression: predict a real value
Learning Gene Classes
• Training set (genes x experiments, with labels) -> Learner -> Model
• Test set (genes x experiments) -> Predictor (using the Model) -> labels
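The learner / predictor split can be illustrated with a very small classifier. The slides do not name a specific algorithm, so a nearest-centroid rule is used here purely as a stand-in. The expression values, gene count, and function names are hypothetical.

```python
from math import dist
from statistics import mean


def fit_centroids(samples, labels):
    """Learner: average the expression profiles of each class."""
    by_class = {}
    for profile, label in zip(samples, labels):
        by_class.setdefault(label, []).append(profile)
    # zip(*profiles) iterates gene by gene across the samples of one class.
    return {label: [mean(col) for col in zip(*profiles)]
            for label, profiles in by_class.items()}


def predict(model, profile):
    """Predictor: return the class whose centroid is closest (Euclidean)."""
    return min(model, key=lambda label: dist(model[label], profile))


# Hypothetical training set: each row is one sample measured on three genes.
train_samples = [[8.0, 2.0, 5.0], [7.5, 2.5, 5.2], [3.0, 7.0, 5.1], [2.5, 6.5, 4.9]]
train_labels = ["ALL", "ALL", "AML", "AML"]
model = fit_centroids(train_samples, train_labels)

# Hypothetical test set: labels are predicted, not given.
test_samples = [[7.8, 2.2, 5.0], [2.9, 7.2, 5.0]]
predicted = [predict(model, s) for s in test_samples]
```

Keeping the test set out of `fit_centroids` is what makes the predicted labels a fair check of the model.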
Microarray Data Life Cycle
Biological Question
Sample Preparation
Microarray Reaction
Microarray Detection
Data Analysis & Modeling
Spotted cDNA Array Production
Hybridization Process
Sources of Error
• Systematic
• Random
Figure axes: log signal intensity vs. log RNA abundance
Microarray Data Processing
quality & intensity filtering
normalization
background correction
expression ratios (treated / control)
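The treated / control expression ratio is often easier to read on a log2 scale, where up- and down-regulation are symmetric around zero. The sketch below assumes background-corrected, positive signals. The floor value, the two-fold threshold, and all names are illustrative choices, not values taken from the slides.

```python
from math import log2


def log2_ratio(treated, control, floor=1.0):
    """log2(treated / control), flooring signals to avoid division by zero."""
    return log2(max(treated, floor) / max(control, floor))


def call_regulation(ratio, threshold=1.0):
    """Label a log2 ratio; threshold=1.0 corresponds to a two-fold change."""
    if ratio >= threshold:
        return "up"
    if ratio <= -threshold:
        return "down"
    return "unchanged"


# Hypothetical (treated, control) signals for three genes.
signals = {"geneA": (800.0, 200.0), "geneB": (150.0, 600.0), "geneC": (500.0, 500.0)}
calls = {gene: call_regulation(log2_ratio(t, c)) for gene, (t, c) in signals.items()}
```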
Data Filtering -- Intensity
Figure: expt 2 plotted against expt 1 in four panels: all sigs, sigs > 6, sigs > 8, sigs > 10
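The effect of an intensity cutoff on replicate agreement can be sketched as follows. The slide does not state the units of "sigs" or whether both experiments must pass the cutoff, so this example assumes log-scale signals and requires both values to exceed it. The data points are made up for illustration.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def filter_by_intensity(pairs, cutoff):
    """Keep spots whose signal exceeds the cutoff in both experiments."""
    return [(a, b) for a, b in pairs if a > cutoff and b > cutoff]


# Hypothetical (expt 1, expt 2) signals; low-intensity spots are noisier.
pairs = [(4.0, 6.5), (5.0, 3.5), (5.5, 7.0), (7.0, 7.4), (8.5, 8.2),
         (9.0, 9.3), (10.5, 10.4), (11.0, 11.2), (12.0, 11.9)]

r_all = pearson(*zip(*pairs))
for cutoff in (6, 8, 10):
    kept = filter_by_intensity(pairs, cutoff)
    print(cutoff, len(kept), round(pearson(*zip(*kept)), 3))
```

In this toy data the correlation between experiments rises once the low-intensity spots are removed.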
Data Normalization Issues
• Normalization of data from different chips
– MGED normalization standards: http://www.mged.org/
• Natural biological variation is large
• Technical variation is small (~98% autocorrelation)
• MIT approach -- raw gene expression values
• Stanford approach -- ratios
Data Preparation
• Thresholding: usually min 20, max 16,000
– For older Affy chips (new Affy chips do not have negative values)
• Filtering: remove genes with insufficient variation
– e.g. MaxVal - MinVal < 500 and MaxVal/MinVal < 5
– biological reasons
– feature reduction for algorithmic reasons
• For clustering, normalize each gene separately to mean = 0, std. dev. = 1
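The three preparation steps on this slide can be sketched together. The thresholds (20, 16,000, 500, 5) are the ones quoted on the slide. The function names, the toy expression values, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def threshold(values, lo=20.0, hi=16000.0):
    """Clip every value into the range [lo, hi]."""
    return [min(max(v, lo), hi) for v in values]


def has_enough_variation(values, min_diff=500.0, min_ratio=5.0):
    """Apply the slide's rule: drop a gene when MaxVal - MinVal < 500
    and MaxVal / MinVal < 5. Values must already be thresholded (> 0)."""
    lo, hi = min(values), max(values)
    return not (hi - lo < min_diff and hi / lo < min_ratio)


def standardize(values):
    """Rescale one gene to mean 0 and standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    # sigma > 0 for any gene that passed the variation filter.
    return [(v - mu) / sigma for v in values]


# Hypothetical raw expression values (one list per gene, one entry per sample).
raw = {
    "g1": [-15.0, 40.0, 2500.0, 18000.0],
    "g2": [100.0, 120.0, 110.0, 130.0],
    "g3": [200.0, 900.0, 400.0, 1500.0],
}
clipped = {g: threshold(v) for g, v in raw.items()}
prepared = {g: standardize(v) for g, v in clipped.items() if has_enough_variation(v)}
```

Here `g2` is dropped because it varies too little, and the remaining genes are ready for clustering.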
Clustering
Goals
• Find natural classes in the data
• Identify new classes / gene correlations
• Refine existing taxonomies
• Support biological analysis / discovery
• Different methods
– Hierarchical clustering, SOMs, etc.
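Hierarchical clustering, one of the methods named above, can be sketched in a few lines. This version assumes average linkage and a distance of 1 minus the Pearson correlation, which are common choices for expression profiles but are not specified on the slide. The gene names and profiles are hypothetical.

```python
from itertools import combinations
from statistics import mean


def pearson(a, b):
    """Pearson correlation of two profiles (profiles must not be constant)."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5


def hierarchical_clusters(profiles, n_clusters):
    """Agglomerative clustering: repeatedly merge the two closest clusters."""
    clusters = [[name] for name in profiles]

    def linkage(c1, c2):
        # Average linkage: mean pairwise distance between members.
        return mean(1 - pearson(profiles[a], profiles[b]) for a in c1 for b in c2)

    while len(clusters) > n_clusters:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]  # i < j, so index i stays valid
        del clusters[j]
    return clusters


# Hypothetical profiles: two rising genes and two falling genes.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.5],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [8.0, 6.0, 4.5, 2.0],
}
clusters = hierarchical_clusters(profiles, n_clusters=2)
```

Recording the order of the merges instead of stopping at `n_clusters` would give the tree that is usually drawn next to a heatmap.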
Identifying Co-Expressed Genes
• Cluster
• Treeview
• Eisen et al., PNAS 1998
• http://rana.lbl.gov
SOM Clustering
• SOM: self-organizing maps
• Preprocessing
– filter away genes with insufficient biological variation
– normalize gene expression (across samples) to mean 0, st. dev. 1, for each gene separately.
• Run SOM for many iterations
• Plot the results
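The SOM training loop can be sketched as below. It assumes a one-dimensional map, a Gaussian neighborhood, and linearly decaying learning rate and radius. These are illustrative choices, as are all parameter values and names. The input profiles are assumed to be already filtered and standardized as described above.

```python
import math
import random


def best_node(weights, x):
    """Index of the node whose weight vector is closest to x."""
    return min(range(len(weights)), key=lambda k: math.dist(weights[k], x))


def train_som(samples, n_nodes=3, n_iter=500, lr0=0.5, sigma0=1.0, seed=0):
    """Train a one-dimensional SOM; returns one weight vector per node."""
    rng = random.Random(seed)
    # Initialise each node with a copy of a randomly chosen sample.
    weights = [list(rng.choice(samples)) for _ in range(n_nodes)]
    for t in range(n_iter):
        frac = t / n_iter
        lr = lr0 * (1 - frac)                # learning rate decays linearly
        sigma = sigma0 * (1 - frac) + 1e-3   # neighbourhood radius shrinks
        x = rng.choice(samples)
        winner = best_node(weights, x)
        for k, w in enumerate(weights):
            # Nodes near the winner on the map are pulled more strongly.
            h = math.exp(-((k - winner) ** 2) / (2 * sigma ** 2))
            for d in range(len(w)):
                w[d] += lr * h * (x[d] - w[d])
    return weights


# Hypothetical standardized gene profiles (one list per gene).
gene_profiles = [
    [-1.2, -0.4, 0.3, 1.3], [-1.3, -0.3, 0.4, 1.2],
    [1.3, 0.3, -0.4, -1.2], [1.2, 0.4, -0.3, -1.3],
    [-0.9, 1.1, 0.8, -1.0], [-1.0, 1.0, 0.9, -0.9],
]
weights = train_som(gene_profiles)
assignments = [best_node(weights, p) for p in gene_profiles]
```

Plotting the profiles grouped by `assignments` corresponds to the last step on the slide.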
Microarray Analysis
• Gene discovery
• Pattern discovery
• Inferences about biological processes
• Classification of biological processes
What genes are up-regulated, down-regulated, co-regulated, not regulated?
Identifying Differential Expression
SAM: Significance Analysis of Microarrays
Tusher et al., PNAS 2001
http://www-stat.stanford.edu/~tibs/SAM/index.html
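The core statistic in SAM is a "relative difference" d for each gene: the difference of group means divided by a pooled standard error plus a small constant s0. The sketch below computes d for one gene only. SAM chooses s0 from the data and assesses significance by permutation. Both steps are omitted here, and s0 is simply passed in as an assumed value. The expression values are hypothetical.

```python
from statistics import mean


def relative_difference(group1, group2, s0=0.0):
    """SAM-style relative difference d for one gene.

    d = (mean1 - mean2) / (s + s0), where s is the pooled standard error.
    With s0 = 0 this reduces to the ordinary two-sample t statistic.
    """
    n1, n2 = len(group1), len(group2)
    m1, m2 = mean(group1), mean(group2)
    sum_sq = (sum((x - m1) ** 2 for x in group1)
              + sum((x - m2) ** 2 for x in group2))
    a = (1 / n1 + 1 / n2) / (n1 + n2 - 2)
    s = (a * sum_sq) ** 0.5
    return (m1 - m2) / (s + s0)


# Hypothetical expression values of one gene in two conditions.
induced = [10.0, 12.0, 11.0, 13.0]
uninduced = [6.0, 7.0, 5.0, 6.0]
d_plain = relative_difference(induced, uninduced)          # s0 = 0
d_damped = relative_difference(induced, uninduced, s0=1.0)  # assumed s0
```

A positive s0 shrinks d most for genes with very small variance, which keeps low-intensity genes from dominating the ranking.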
• Diauxic shift: respiration <- fermentation
Monitoring Cell Differentiation
Gene Prediction and Annotation
...TTCTCCTAGTTCTAGAGCGTTCGGCACGAGCAAATCTTAAACGTTTTACTTTGTGCT...
raw sequence
predicted gene
predicted mRNA or EST
protein sequence database
presumed identity/function
Functional Genomics: High Throughput Platforms
• Microarrays
– spotted cDNAs/ESTs
– spotted oligonucleotides
– in situ synthesized oligos
• SAGE
• GeneCalling® www.curagen.com
• Megasort www.lynxgen.com
• MAPs (High Throughput Genomics, Tucson)
Advanced Questions in Microarray Analysis
• e.g. can we correlate patterns in other types of data with the microarray results?
– cis-elements
– protein domains
– protein-protein interactions
– orthologous genes
– gene ontologies
– textual associations
Procedures
• Preparation
– Target DNA (reference and test samples)
– Slides
• Reaction (Droplet or Pin Spotting)
– Hybridization
• Scanning
• Analysis
– Image processing
– Data mining
– Modeling
Arrayer
Scanner
Hardware
Software
The Basic System Computational Tasks
• Gene set selection
• Probe design
• Image analysis
• Normalization of chip, sample….
• …more….
A Typical GeneChip Array Experiment
Target Preparation
Isolate total RNA from tissue
Synthesis of cDNA & addition of T7 promoter
Biotin-labeling of cRNA & purification
Hybridization to GeneChip
Deposition of oligo probes on GeneChip
Image acquisition & analysis
Data Mining
Interesting genes
Pattern Recognition
Pathways
Probe Design System
• A method of hybridization of short synthetic oligonucleotide probes to cloned DNA sequences to derive genetic sequence information.
• Application field
– oligonucleotide array
– Polymerase Chain Reaction: primer design
Probe Design System
• Probe design issues
– Unique probe for each gene
– Different probe sets for each gene: fingerprints
– Probes (or a unique probe) of each gene should be able to represent the gene.
– The similarity between probes should be very low.
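The uniqueness requirement can be sketched as a search for subsequences that occur in only one gene. This example checks exact matches only. A practical design would also have to consider near-matches, reverse complements, and hybridization properties, none of which are modeled here. The sequences, the probe length k, and the function name are hypothetical.

```python
def unique_probes(genes, k):
    """For each gene, list the k-mers that occur in no other gene."""

    def kmers(seq):
        return {seq[i:i + k] for i in range(len(seq) - k + 1)}

    kmer_sets = {name: kmers(seq) for name, seq in genes.items()}
    result = {}
    for name, own in kmer_sets.items():
        # Union of the k-mers of every other gene (empty if there is none).
        others = set().union(*(s for n, s in kmer_sets.items() if n != name))
        result[name] = sorted(own - others)
    return result


# Hypothetical gene sequences sharing a common 5' region.
genes = {
    "geneA": "ATGGCGTACGTTAGC",
    "geneB": "ATGGCGTACCTTGGA",
}
probes = unique_probes(genes, k=6)
```

Candidate probes drawn from the shared region, such as "ATGGCG", are rejected because they would hybridize to both genes.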
The Analysis - Computational Tasks
• Clustering genes: which genes seem to be regulated together
• Classifying genes: which functional class does a given gene fall into
• Classifying cell samples: does this patient have ALL or AML
• Inferring regulatory networks: what is the “circuitry” of the cell
Oligo Probe Pairs (SNPs)
• Oligos are selected from a region of the gene that has low similarity to other genes.
Perfect match: ATGTTTGACGCAGCGTAGATCCGAG
Mismatch: ATGTTTGACGCAACGTAGATCCGAG
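The relationship between the two probes on this slide can be checked directly: they have the same length and differ at a single central base. The sketch below uses the two sequences quoted above. The function name is illustrative.

```python
def mismatch_positions(perfect, mismatch):
    """Return the 0-based positions at which two equal-length probes differ."""
    if len(perfect) != len(mismatch):
        raise ValueError("probes must have the same length")
    return [i for i, (p, m) in enumerate(zip(perfect, mismatch)) if p != m]


pm_probe = "ATGTTTGACGCAGCGTAGATCCGAG"
mm_probe = "ATGTTTGACGCAACGTAGATCCGAG"
positions = mismatch_positions(pm_probe, mm_probe)
```

For this pair the only difference is at index 12, the middle (13th) base of the 25-mer.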
SNPs – Single Nucleotide Polymorphisms
A GeneChip “Spot” is Composed of Numerous Cells and Includes Controls
SNPs and Alleles
SNPs and Intervals
Haplotype Mapping of Chromosome 21
Seven haplotypes make up 93% of the genetic sampling
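A statement such as "seven haplotypes make up 93% of the sample" comes from ranking haplotypes by frequency and accumulating their share. The sketch below shows that calculation on made-up data. The haplotype strings, counts, and function name are hypothetical and are not the chromosome 21 results.

```python
from collections import Counter


def haplotypes_for_coverage(observed, target):
    """Number of most-common haplotypes needed to cover `target` of the sample.

    Returns (count of haplotypes, fraction of the sample they cover).
    """
    if not observed:
        raise ValueError("no haplotypes observed")
    counts = Counter(observed)
    total = sum(counts.values())
    covered = 0
    for n, (_, count) in enumerate(counts.most_common(), start=1):
        covered += count
        if covered / total >= target:
            return n, covered / total
    return len(counts), 1.0


# Hypothetical sample of 100 chromosomes typed at four SNPs.
observed = ["ACGT"] * 50 + ["ACGA"] * 30 + ["TCGA"] * 15 + ["TTGA"] * 5
n_needed, fraction = haplotypes_for_coverage(observed, target=0.9)
```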
Summary
• Microarray process
• Microarray applications
• From genes to pathways
• Haplotyping and SNP mapping
• Using microarrays in medicine