Single Nucleotide Polymorphism, Copy Number Variations, and SNP Array
Xiaole Shirley Liu and Jun Liu
Outline
• Definition and motivation
• SNP distribution and characteristics
– Allele frequency, LD, population stratification
• SNP discovery (unknown sites) and genotyping (known sites)
– CNV detection
Polymorphism
• Polymorphism: sites/genes with “common” variation, meaning the less common allele has frequency ≥ 1%; below that, the variant is called rare and the site is not considered polymorphic
• First discovered (early 1980s): restriction fragment length polymorphism (RFLP)
• Some definitions:
– Locus: position on a chromosome where a sequence or gene is located
– Allele: alternative form of the DNA at a locus
Polymorphism
• Single Nucleotide Polymorphism (SNP)
– Occasionally short (1-3 bp) indels are considered SNPs too
– Arise from a DNA-replication mistake in an individual germ-line cell, which is then transmitted to offspring
– ~90% of human genetic variation
• Copy number variations (CNV)
– May or may not be genetic
Why Should We Care
• Disease gene discovery
– Association studies: carriers of certain SNP alleles are more susceptible to diabetes
– Chromosome aberrations: duplications / deletions might cause cancer
• Personalized medicine
– A drug may be effective only if you carry a particular allele
SNP Distribution
• Most common type of variation, 1 SNP / 100-300 bp
– Balance between the rate at which mutations are introduced and the rate at which polymorphisms are lost
– Most mutations are lost within a few generations
• 2/3 are C/T differences
• In non-coding regions, often fewer SNPs in more conserved regions
• In coding regions, often more synonymous than non-synonymous SNPs
SNP Characteristics: Allele Frequency Distribution
• Most alleles are rare (minor allele frequency < 10%)
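A minimal sketch of how a minor allele frequency could be computed from genotype calls, applying the ≥ 1% cutoff from the Polymorphism slide. The function name `minor_allele_frequency` and the sample genotypes are hypothetical and serve only as illustration.

```python
from collections import Counter

def minor_allele_frequency(genotypes):
    """Return (minor_allele, frequency) for a biallelic site.

    genotypes: list of two-letter strings such as "CT", one per diploid individual.
    """
    # Count every allele copy: each individual contributes two.
    counts = Counter(allele for g in genotypes for allele in g)
    total = sum(counts.values())
    if len(counts) < 2:
        # Only one allele seen in this sample: no minor allele.
        return None, 0.0
    minor, n = min(counts.items(), key=lambda kv: kv[1])
    return minor, n / total

# Hypothetical genotype calls for 10 individuals at one site.
calls = ["CC"] * 8 + ["CT"] * 2
allele, maf = minor_allele_frequency(calls)
is_polymorphic = maf >= 0.01   # "common" variation cutoff used in the slides
print(allele, maf, is_polymorphic)
```

Here 2 of the 20 allele copies are T, so the site counts as polymorphic under the 1% cutoff even though T is fairly rare.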
Mode of inheritance
SNP Characteristics: Hardy-Weinberg equilibrium (HWE)
– In a population with genotypes BB, bb, and Bb, if p = freq(B) and q = freq(b), the frequencies of BB, bb, and Bb will be $p^2$, $q^2$, and $2pq$ respectively at equilibrium, and will not change from one generation to the next.
– Assumptions for HWE: no mutation, no immigration or emigration, infinite population size, no selective pressure, random mating. Genotype frequencies can deviate from HWE if any assumption is violated.
– It provides a baseline against which to measure change, e.g., the inbreeding index
– Extends to more than 2 alleles
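The HWE expectations can be sketched in a few lines. The formulas for the inbreeding index and for more than two alleles are not given in the text, so the forms used here are assumed textbook definitions: $F = 1 - H_{obs} / (2pq)$, and for alleles with frequencies $p_i$, homozygote frequency $p_i^2$ and heterozygote frequency $2 p_i p_j$. All function names and counts are hypothetical.

```python
def allele_freq_B(n_BB, n_Bb, n_bb):
    """Frequency of allele B estimated from genotype counts."""
    n = n_BB + n_Bb + n_bb
    return (2 * n_BB + n_Bb) / (2 * n)

def hwe_expected(p, n):
    """Expected genotype counts (BB, Bb, bb) for n individuals under HWE."""
    q = 1.0 - p
    return (p * p * n, 2 * p * q * n, q * q * n)

def inbreeding_index(n_BB, n_Bb, n_bb):
    """F = 1 - H_obs / H_exp with H_exp = 2pq (assumed textbook definition)."""
    n = n_BB + n_Bb + n_bb
    p = allele_freq_B(n_BB, n_Bb, n_bb)
    h_exp = 2 * p * (1 - p)   # heterozygosity expected under HWE
    h_obs = n_Bb / n          # heterozygosity actually observed
    return 1 - h_obs / h_exp

def hwe_multi(freqs):
    """Expected genotype frequencies for any number of alleles (assumed form)."""
    alleles = sorted(freqs)
    out = {}
    for i, a in enumerate(alleles):
        for b in alleles[i:]:
            out[a + b] = freqs[a] ** 2 if a == b else 2 * freqs[a] * freqs[b]
    return out

# Hypothetical sample of 100 individuals with a deficit of heterozygotes.
obs = (40, 20, 40)                 # counts of BB, Bb, bb
p = allele_freq_B(*obs)
exp = hwe_expected(p, sum(obs))
F = inbreeding_index(*obs)
print(p, exp, F)
print(hwe_multi({"A": 0.5, "B": 0.3, "C": 0.2}))
```

A positive F, as in this example, indicates fewer heterozygotes than HWE predicts; F = 0 corresponds to exact agreement with HWE.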
SNP Characteristics: Linkage Disequilibrium
• Equilibrium vs. disequilibrium
• LD: if alleles at two loci occur together more often than can be accounted for by chance, this indicates that the two loci are physically close on the DNA
– In mammals, LD is often lost at ~100 kb
– In fly, LD often decays within a few hundred bases
SNP Characteristics: Linkage Disequilibrium
• Statistical significance of LD
– Chi-square test with 1 df
– $e_{ij} = n_{i\cdot} \, n_{\cdot j} / n_T$
– $\chi^2 = \sum_{i,j} \frac{(n_{ij} - e_{ij})^2}{e_{ij}}$

        B1     B2     Total
A1      n11    n12    n1.
A2      n21    n22    n2.
Total   n.1    n.2    nT
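The test above can be sketched directly from the 2x2 table of haplotype counts. The function name `ld_chi_square` and the example counts are hypothetical; 3.841 is the usual 5% critical value for a chi-square distribution with 1 df.

```python
def ld_chi_square(n11, n12, n21, n22):
    """Chi-square statistic (1 df) for a 2x2 table of haplotype counts."""
    observed = ((n11, n12), (n21, n22))
    rows = (n11 + n12, n21 + n22)      # n1., n2.
    cols = (n11 + n21, n12 + n22)      # n.1, n.2
    total = sum(rows)                  # nT
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            e = rows[i] * cols[j] / total          # e_ij = n_i. * n_.j / n_T
            chi2 += (observed[i][j] - e) ** 2 / e
    return chi2

# Hypothetical counts: A1 mostly travels with B1, A2 with B2.
chi2 = ld_chi_square(40, 10, 10, 40)
significant = chi2 > 3.841             # 5% critical value, 1 df
print(chi2, significant)
```

All four expected counts here are 25, so the statistic is 4 × (15² / 25) = 36, far above the 1 df threshold.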
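SNP Characteristics: Linkage Disequilibrium
• Three ways to calculate LD, with $p_1, p_2$ the frequencies of alleles A1, A2, $q_1, q_2$ the frequencies of B1, B2, and $p_{11}$ the frequency of the A1B1 haplotype
– $D = p_{11} - p_1 q_1$ (observed minus expected)
– $D' = D / D_{\max}$, where $D_{\max} = \min(p_1 q_2, p_2 q_1)$ if $D > 0$ and $D_{\max} = \min(p_1 q_1, p_2 q_2)$ if $D < 0$
– $r^2 = D^2 / (p_1 p_2 q_1 q_2)$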
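A minimal sketch of the three LD measures, assuming $p_1$ and $q_1$ are the frequencies of alleles A1 and B1 and $p_{11}$ is the frequency of the A1B1 haplotype. $D_{\max}$ is taken as the smaller of the two relevant frequency products, so that $|D'| \le 1$. The function name and input values are hypothetical, and both loci are assumed to be polymorphic (otherwise the denominators are zero).

```python
def ld_measures(p11, p1, q1):
    """Return (D, D', r^2) for two biallelic loci.

    p11: frequency of the A1B1 haplotype
    p1:  frequency of allele A1
    q1:  frequency of allele B1
    """
    p2, q2 = 1 - p1, 1 - q1
    d = p11 - p1 * q1                      # observed minus expected
    if d > 0:
        d_max = min(p1 * q2, p2 * q1)
    elif d < 0:
        d_max = min(p1 * q1, p2 * q2)
    else:
        d_max = None                       # D' is defined as 0 when D = 0
    d_prime = 0.0 if d_max is None else d / d_max   # sign follows D
    r2 = d * d / (p1 * p2 * q1 * q2)
    return d, d_prime, r2

# Hypothetical frequencies: A1B1 is seen more often than 0.5 * 0.5 = 0.25.
d, d_prime, r2 = ld_measures(p11=0.4, p1=0.5, q1=0.5)
print(d, d_prime, r2)
```

D depends strongly on allele frequencies, which is why the rescaled D' and r² are usually easier to compare across SNP pairs.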
SNP Characteristics: Linkage Disequilibrium
• Haplotype block: a cluster of linked SNPs
• Haplotype boundary: there is strong LD within blocks and no LD between blocks; the boundaries reflect recombination hotspots
• Haplotype size distribution
SNP Characteristics: Linkage Disequilibrium
• Haplotype blocks can be seen as clusters of linked SNPs
SNP Characteristics: Linkage Disequilibrium
• [C/T] [A/G] T X C [A/C] [T/A]
– Possible haplotypes: 2^4 = 16
– In reality, a few common haplotypes explain 90% of the variation
• Tagging SNPs:
– SNPs that capture most of the variation in haplotypes
– Remove redundancy
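The counting argument and the idea of redundancy can be sketched as follows. The four variable sites are taken from the example above; the three "common" haplotypes and the helper names are hypothetical. The selection rule used here (keep one SNP per group of perfectly correlated SNPs) is a deliberate simplification, not a full tag-SNP algorithm.

```python
from itertools import product

# The four variable sites in the example: [C/T] [A/G] ... [A/C] [T/A]
sites = [("C", "T"), ("A", "G"), ("A", "C"), ("T", "A")]
possible = ["".join(h) for h in product(*sites)]
print(len(possible))  # number of allele combinations across 4 biallelic sites

def column_pattern(haplotypes, j):
    """Relabel the alleles at SNP j by order of first appearance (0, 1, ...)."""
    seen = {}
    return tuple(seen.setdefault(h[j], len(seen)) for h in haplotypes)

def tag_snps(haplotypes):
    """Keep the first SNP of each group that splits the haplotypes identically."""
    tags, patterns = [], set()
    for j in range(len(haplotypes[0])):
        pat = column_pattern(haplotypes, j)
        if len(set(pat)) < 2:
            continue                  # monomorphic here, carries no information
        if pat not in patterns:
            patterns.add(pat)
            tags.append(j)            # a later SNP with this pattern is redundant
    return tags

# Hypothetical set of common haplotypes over the same four SNPs.
common = ["CAAT", "TGAT", "TGCA"]
print(tag_snps(common))
```

In this toy set, SNPs 0 and 1 always change together, as do SNPs 2 and 3, so genotyping one SNP from each pair is enough to tell the three haplotypes apart.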
SNP Characteristics: Population Stratification
• Population stratification: individuals are sampled from two or more genetically different populations; the stratification may be environmental, cultural, or genetic
• Could give spurious results in case-control association studies, as in the example of “chopstick genes”
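A small numerical sketch of how pooling two populations can create an association that exists in neither. All frequencies are hypothetical: in each population the allele and the trait (for example, chopstick use) are independent, but both are common in population 1 and rare in population 2.

```python
def table(n, carrier_freq, trait_freq):
    """Expected 2x2 counts when allele and trait are independent in one population.

    Returns (carrier & trait, carrier & no trait, non-carrier & trait,
    non-carrier & no trait).
    """
    a = n * carrier_freq * trait_freq
    b = n * carrier_freq * (1 - trait_freq)
    c = n * (1 - carrier_freq) * trait_freq
    d = n * (1 - carrier_freq) * (1 - trait_freq)
    return a, b, c, d

def odds_ratio(a, b, c, d):
    """Odds ratio of the trait in carriers versus non-carriers."""
    return (a * d) / (b * c)

pop1 = table(1000, carrier_freq=0.8, trait_freq=0.9)   # hypothetical population 1
pop2 = table(1000, carrier_freq=0.2, trait_freq=0.1)   # hypothetical population 2
pooled = tuple(x + y for x, y in zip(pop1, pop2))

print(odds_ratio(*pop1), odds_ratio(*pop2))  # no association within either population
print(odds_ratio(*pooled))                   # strong apparent association when pooled
```

The pooled odds ratio is well above 1 purely because population membership predicts both the allele and the trait, which is why cases and controls need to be matched or corrected for ancestry.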
Using genetic variation to study populations
SNP Discovery Methods
• Sequencing individuals to find differences: too costly
• First check whether big regions have SNPs
– Basic idea: denature and re-anneal two samples, detect heteroduplex
– Can pool samples (e.g. 10 African with 10 Caucasian) to speed screening
• Resequence to verify
• dbSNP: 12M RefSNP, 6M validated
SNP Genotyping
• For a known locus TT C/A AG, does this individual have CC, AA, or AC? Many methods
• Hybridization-based methods
– Dynamic allele-specific hybridization
– Molecular beacons
– SNP-array chip (simultaneously genotype thousands of SNPs)
• Enzyme-based methods
– RFLP
– PCR-based methods
– Flap endonuclease
– Primer extension
– Oligonucleotide ligation assay
• Other methods (based on physical properties of DNA)
SNP Array
• One SNP at a time or genome-wide (SNP array)
40 Probes Used Per SNP
• Allele call
– AA, BB, AB
• Signal
– Theoretically 1A+1B, 2A, 2B
– But could have 1A+3B: amplified!
SNP Chip for LOH
• Loss of Heterozygosity (LOH): tumor suppressor gene inactivation by allelic loss in cancers
[Figure: Normal, First genetic hit, Cancer; A/B allele calls at a SNP illustrating LOH]
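One simple way to flag candidate LOH from SNP array calls is to compare paired normal and tumor genotypes, as sketched below. This is an illustrative assumption rather than the method used in the lecture; the function name and the calls are hypothetical.

```python
def loh_sites(normal_calls, tumor_calls):
    """Indices of SNPs that are heterozygous (AB) in normal tissue
    but homozygous (AA or BB) in the paired tumor."""
    return [
        i
        for i, (n, t) in enumerate(zip(normal_calls, tumor_calls))
        if n == "AB" and t in ("AA", "BB")
    ]

# Hypothetical allele calls for six SNPs along one chromosome arm.
normal = ["AB", "AA", "AB", "AB", "BB", "AB"]
tumor  = ["AB", "AA", "AA", "BB", "BB", "AA"]
print(loh_sites(normal, tumor))
```

Only SNPs that are heterozygous in the normal sample are informative: a site that is already AA or BB in normal tissue cannot show a loss of heterozygosity.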
SNP Array for CNV
• Collect normal / diseased samples on SNP arrays
• Probe normalization, background subtraction
• Use HMM to infer CNV
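The HMM step can be sketched with a three-state model decoded by the Viterbi algorithm. Everything specific here is an assumption for illustration: the state names, the Gaussian emission model on log2 (sample / normal) ratios, and the values of `MEANS`, `SIGMA`, and `STAY` are hypothetical, not parameters from the lecture.

```python
import math

STATES = ("loss", "normal", "gain")
MEANS = {"loss": -1.0, "normal": 0.0, "gain": 0.58}  # hypothetical log2-ratio means
SIGMA = 0.2    # hypothetical noise standard deviation
STAY = 0.95    # hypothetical probability of keeping the same state at the next SNP

def log_emission(x, state):
    """Log density of a Gaussian centered on the state's mean log2 ratio."""
    z = (x - MEANS[state]) / SIGMA
    return -0.5 * z * z - math.log(SIGMA * math.sqrt(2 * math.pi))

def log_transition(a, b):
    """Log probability of moving from state a to state b."""
    if a == b:
        return math.log(STAY)
    return math.log((1 - STAY) / (len(STATES) - 1))

def viterbi(log_ratios):
    """Most likely copy-number state path for a sequence of log2 ratios."""
    # Uniform prior over the state at the first SNP.
    score = {s: math.log(1 / len(STATES)) + log_emission(log_ratios[0], s)
             for s in STATES}
    back = []
    for x in log_ratios[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda r: score[r] + log_transition(r, s))
            pointers[s] = prev
            new_score[s] = score[prev] + log_transition(prev, s) + log_emission(x, s)
        back.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Hypothetical normalized log2 ratios for 12 consecutive SNPs.
ratios = [0.05, -0.1, 0.0, 0.6, 0.5, 0.7, 0.55, 0.02, -0.05, -0.9, -1.1, -1.0]
path = viterbi(ratios)
print(path)
```

The high `STAY` value encodes the expectation that copy number changes affect runs of neighboring SNPs, so isolated noisy probes are less likely to be called as separate events.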
Integrate CNV with Expression to Identify Oncogene MITF in Melanoma
Summary
• SNP and CNV
• SNP distribution and characteristics
– Allele frequency (minor allele ≥ 1%)
– LD: linkage ~ physical proximity
– Population stratification
• SNP discovery: heteroduplex
• SNP genotyping
– SNP array
– CNV detection: HMM
Acknowledgement
• Stefano Monti
• Tim Niu
• Kenneth Kidd, Judith Kidd and Glenys Thomson
• Joel Hirschhorn
• Greg Gibson & Spencer Muse