Post on 29-Dec-2015
National Taiwan University, Department of Computer Science and Information Engineering
Introduction to SNP and Haplotype Analysis
Algorithms and Computational Biology Lab, Department of Computer Science & Information Engineering,
National Taiwan University, Taiwan.
Lecturer: Kun-Mao Chao
Assistant: Yao-Ting Huang
Thanks to Yao-Ting for preparing these wonderful lecture notes.
Genetic Variations
Genetic variations in DNA sequences (e.g., insertions, deletions, and mutations) have a major impact on genetic diseases and phenotypic differences.
All humans share 99% of the same DNA sequence.
A genetic variation in a coding region may change the codon of an amino acid and thereby alter the amino acid sequence.
Single Nucleotide Polymorphism
A Single Nucleotide Polymorphism (SNP), pronounced “snip,” is a genetic variation in which a single nucleotide (i.e., A, T, C, or G) is altered and the change is kept through heredity.
SNP: a single DNA base variation found in more than 1% of the population.
Mutation: a single DNA base variation found in less than 1% of the population.

Example    Sequence           Frequency
SNP        C T T A G C T T    94%
           C T T A G T T T    6%
Mutation   C T T A G C T T    99.9%
           C T T A G T T T    0.1%
Mutations and SNPs
[Figure: lineages descending from a common ancestor along a time axis that ends at the present; the observed genetic variations are marked as mutations or SNPs.]
Single Nucleotide Polymorphism
SNPs are the most frequent form of genetic variation.
90% of human genetic variations come from SNPs.
SNPs occur about once every 300~600 base pairs.
Millions of SNPs have been identified (e.g., by HapMap and Perlegen).
SNPs have become the preferred markers for association studies because of their high abundance and the availability of high-throughput SNP genotyping technologies.
Single Nucleotide Polymorphism
A SNP is usually assumed to be a binary variable.
The probability of a repeated mutation at the same SNP locus is quite small.
Tri-allelic cases are usually considered to be the effect of genotyping errors.
The nucleotide at a SNP locus is called a major allele (if its allele frequency is > 50%) or a minor allele (if its allele frequency is < 50%).

Sequence             Frequency   Allele at the SNP locus
A C T T A G C T T    94%         T: major allele
A C T T A G C T C    6%          C: minor allele
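The major/minor distinction above can be sketched in a few lines of Python. This is only an illustration: the function name `major_minor` and the 100-chromosome sample are assumptions chosen to reproduce the 94% / 6% split on the slide.

```python
from collections import Counter

def major_minor(alleles):
    """Return (major, minor) alleles of one bi-allelic SNP column."""
    counts = Counter(alleles)
    if len(counts) != 2:
        raise ValueError("expected exactly two alleles at a SNP locus")
    # most_common sorts by decreasing count, so the major allele comes first.
    (major, _), (minor, _) = counts.most_common(2)
    return major, minor

# Hypothetical sample matching the slide: 94 chromosomes carry T, 6 carry C.
sample = ["T"] * 94 + ["C"] * 6
print(major_minor(sample))  # ('T', 'C')
```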
Haplotypes
A haplotype is a set of linked SNPs on the same chromosome.
A haplotype can be regarded simply as a binary string, since each SNP is binary.

Sequence                SNP1  SNP2  SNP3
-A C T T A G C T T-     C     A     T      Haplotype 1
-A A T T T G C T C-     A     T     C      Haplotype 2
-A C T T T G C T C-     C     T     C      Haplotype 3
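As a rough sketch of the idea, the three sequences from the slide can be reduced to haplotypes and then to binary strings. The helper names are hypothetical, and coding the first haplotype's allele as 0 is an arbitrary assumption (the slides do not fix an encoding).

```python
sequences = ["ACTTAGCTT", "AATTTGCTC", "ACTTTGCTC"]

def snp_positions(seqs):
    """Indices where the aligned sequences do not all agree."""
    return [i for i, column in enumerate(zip(*seqs)) if len(set(column)) > 1]

def haplotypes(seqs):
    """Keep only the variable positions of each sequence."""
    positions = snp_positions(seqs)
    return ["".join(s[i] for i in positions) for s in seqs]

def to_binary(haps):
    """Encode each SNP as 0/1, using the first haplotype's allele as 0 (arbitrary)."""
    reference = haps[0]
    return ["".join("0" if a == r else "1" for a, r in zip(h, reference)) for h in haps]

haps = haplotypes(sequences)
print(haps)             # ['CAT', 'ATC', 'CTC']
print(to_binary(haps))  # ['000', '111', '011']
```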
Genotypes
The use of haplotype information has been limited because the human genome is diploid.
In large sequencing projects, genotypes rather than haplotypes are collected because of cost.

Haplotype data        Genotype data
SNP1  SNP2            SNP1  SNP2
A     T               A/C   G/T
C     G
Problems of Genotypes
Genotypes only tell us the alleles at each SNP locus, but not how the alleles at different SNP loci are connected.
There can be several possible haplotype pairs for the same genotype.

Genotype data: SNP1 = A/C, SNP2 = G/T

Possible haplotype pair 1        Possible haplotype pair 2
SNP1  SNP2                       SNP1  SNP2
A     T                  or      A     G
C     G                          C     T
We don’t know which haplotype pair is real.
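The ambiguity can be made concrete by enumerating every haplotype pair consistent with a genotype. The sketch below assumes a genotype is given as a list of two-allele tuples, one per SNP; the function name is hypothetical. A genotype with k heterozygous loci has 2^(k-1) unordered pairs when k ≥ 1.

```python
from itertools import product

def haplotype_pairs(genotype):
    """Enumerate unordered haplotype pairs consistent with a genotype.

    genotype: list of 2-tuples, the two alleles observed at each SNP.
    """
    pairs = set()
    for first in product(*genotype):
        # The second haplotype carries the other allele at every locus.
        second = tuple(b if h == a else a for h, (a, b) in zip(first, genotype))
        pairs.add(tuple(sorted(("".join(first), "".join(second)))))
    return sorted(pairs)

print(haplotype_pairs([("A", "C"), ("G", "T")]))
# [('AG', 'CT'), ('AT', 'CG')]
```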
Research Directions of SNPs and Haplotypes in Recent Years
Haplotype Inference: Maximum Parsimony, Perfect Phylogeny, Statistical Methods
Tag SNP Selection: Haplotype block, LD bin, Prediction Accuracy
SNP Database
…
Haplotype Inference
The problem of inferring haplotypes from a set of genotypes is called haplotype inference.
This problem is known to be not only NP-hard but also APX-hard.
Most combinatorial methods use the maximum parsimony model to solve this problem.
This model assumes that the number of distinct haplotypes in a natural population is small.
The solution of this problem is a minimum set of haplotypes that can explain the given genotypes.
Maximum Parsimony
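Genotype                         Candidate haplotype pairs
G1: SNP1 = A/C, SNP2 = G/T       h1 = A T and h2 = C G, or h3 = A G and h4 = C T
G2: SNP1 = A/A, SNP2 = T/T       h1 = A T and h1 = A T

Resolving G1 with h1 and h2 needs only two distinct haplotypes (h1, h2) for both genotypes; resolving G1 with h3 and h4 needs three (h1, h3, h4).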
Find a minimum set of haplotypes to explain the given genotypes.
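For an instance as small as the one on this slide, the maximum parsimony solution can be found by exhaustive search. The following is a brute-force sketch only (exponential time, not the method of any paper cited in these slides); the function names and the genotype encoding are assumptions.

```python
from itertools import combinations, product

def haplotype_pairs(genotype):
    """Unordered haplotype pairs consistent with a genotype (list of allele 2-tuples)."""
    pairs = set()
    for first in product(*genotype):
        second = tuple(b if h == a else a for h, (a, b) in zip(first, genotype))
        pairs.add(tuple(sorted(("".join(first), "".join(second)))))
    return sorted(pairs)

def min_haplotype_set(genotypes):
    """Smallest set of haplotypes such that every genotype is resolved by some pair."""
    candidate_pairs = [haplotype_pairs(g) for g in genotypes]
    candidates = sorted({h for pairs in candidate_pairs for pair in pairs for h in pair})
    for size in range(1, len(candidates) + 1):
        for subset in combinations(candidates, size):
            chosen = set(subset)
            if all(any(set(pair) <= chosen for pair in pairs) for pairs in candidate_pairs):
                return subset
    return ()

G1 = [("A", "C"), ("G", "T")]
G2 = [("A", "A"), ("T", "T")]
print(min_haplotype_set([G1, G2]))  # ('AT', 'CG'), i.e. h1 and h2
```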
Related Works
Statistical methods:
  Niu et al. (2002) developed a PL-EM algorithm called HAPLOTYPER.
  Stephens and Donnelly (2003) designed an MCMC algorithm based on Gibbs sampling called PHASE.
Combinatorial methods:
  Gusfield (2003) proposed an integer linear programming algorithm.
  Wang and Xu (2003) developed a branch-and-bound algorithm called HAPAR to find the optimal solution.
  Brown and Harrower (2004) proposed a new integer linear formulation of this problem.
Our Results
We formulated this problem as an integer quadratic programming (IQP) problem.
We proposed an iterative semidefinite programming (SDP) relaxation algorithm to solve the IQP problem.
This algorithm finds a solution with an O(log n) approximation ratio.
We implemented this algorithm in MATLAB and compared it with existing methods.
Huang, Y.-T., Chao, K.-M., and Chen, T. “An approximation algorithm for haplotype inference by pure parsimony,” to appear in Journal of Computational Biology, 2005.
Problem Formulation
Input: a set of n genotypes and m possible haplotypes.
Output: a minimum set of haplotypes that can explain the given genotypes.

Genotype                         Resolving haplotype pair
G1: SNP1 = A/C, SNP2 = G/T       h1 = A T and h2 = C G
G2: SNP1 = A/A, SNP2 = T/T       h1 = A T and h1 = A T

Selected haplotypes: h1 = A T and h2 = C G.
Integer Quadratic Programming (IQP)
Define x_i as an integer variable with value 1 or -1:
x_i = 1 if the i-th haplotype is selected;
x_i = -1 if the i-th haplotype is not selected.
Minimizing the number of selected haplotypes amounts to minimizing the following integer quadratic function:

$$\text{Minimize} \quad \sum_{i=1}^{m} \frac{(1 + x_i)^2}{4},$$

since

$$\frac{(1 + x_i)^2}{4} = \begin{cases} \dfrac{(1 + 1)^2}{4} = 1 & \text{if the } i\text{-th haplotype is selected;} \\[1ex] \dfrac{(1 - 1)^2}{4} = 0 & \text{if the } i\text{-th haplotype is not selected.} \end{cases}$$
Integer Quadratic Programming (IQP)
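Each genotype must be resolved by at least one pair of haplotypes.
For genotype G1, the following integer quadratic constraint must be satisfied:

$$\sum_{(r,t) \in \{(1,2),(3,4)\}} \frac{(1 + x_r)(1 + x_t)}{4} = \frac{(1 + x_1)(1 + x_2)}{4} + \frac{(1 + x_3)(1 + x_4)}{4} \ge 1.$$

G1: SNP1 = A/C, SNP2 = G/T, resolved by h1 = A T and h2 = C G, or by h3 = A G and h4 = C T.
Suppose h1 and h2 are selected: then x_1 = 1 and x_2 = 1, so the first term equals 1 and the constraint holds.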
Integer Quadratic Programming (IQP)
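Maximum parsimony:

$$\text{Minimize} \quad \sum_{i=1}^{m} \frac{(1 + x_i)^2}{4}$$

$$\text{Subject to} \quad \sum_{(h_r, h_t) \in S_j} \frac{(1 + x_r)(1 + x_t)}{4} \ge 1, \quad \forall j \in [1, n]; \qquad x_i \in \{-1, 1\}.$$

Objective function: find a minimum set of haplotypes.
Constraint functions: resolve all genotypes.
Here S_j denotes the set of haplotype pairs that resolve genotype j.
We use the SDP-relaxation technique to solve this IQP problem.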
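To check that the IQP objective counts selected haplotypes and that the constraints encode "every genotype is resolved", the formulation can be evaluated by brute force on the two-genotype example from the earlier slides. This sketch enumerates all x in {-1, 1}^m; it does not implement the SDP relaxation, and the 0-based index lists in `S` are an assumed encoding of h1 = AT, h2 = CG, h3 = AG, h4 = CT.

```python
from itertools import product

# S[j] lists the index pairs (r, t) of haplotypes that resolve genotype j (0-based).
S = [
    [(0, 1), (2, 3)],  # G1 = A/C, G/T: (h1, h2) or (h3, h4)
    [(0, 0)],          # G2 = A/A, T/T: (h1, h1)
]
m = 4  # number of candidate haplotypes

def objective(x):
    """Sum of (1 + x_i)^2 / 4, which equals the number of selected haplotypes."""
    return sum((1 + xi) ** 2 // 4 for xi in x)

def feasible(x):
    """Each genotype needs at least one pair with both haplotypes selected."""
    return all(
        sum((1 + x[r]) * (1 + x[t]) // 4 for r, t in pairs) >= 1
        for pairs in S
    )

best = min((x for x in product((-1, 1), repeat=m) if feasible(x)), key=objective)
print(best, objective(best))  # (1, 1, -1, -1) 2
```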
Problems of Using SNPs for Association Studies
The number of SNPs is still too large to be used for association studies.
There are millions of SNPs in the human genome.
To reduce the SNP genotyping cost, we wish to use as few SNPs as possible for association studies.
Tag SNPs are a small subset of SNPs that is sufficient for performing association studies without losing the power of using all SNPs.
There are many definitions of tag SNPs.
We will first study one definition of tag SNPs based on the haplotype block model.
Haplotype Blocks and Tag SNPs
Recent studies have shown that the chromosome can be partitioned into haplotype blocks interspersed with recombination hotspots (Daly et al., Patil et al.).
Within a haplotype block, little or no recombination has occurred.
The SNPs within a haplotype block tend to be inherited together.
Within a haplotype block, a small subset of SNPs (called tag SNPs) is sufficient to distinguish each pair of haplotype patterns in the block.
We only need to genotype the tag SNPs instead of all SNPs within a haplotype block.
Recombination Hotspots and Haplotype Blocks
[Figure: a chromosome partitioned into haplotype blocks separated by recombination hotspots. One block is shown as a matrix of SNP loci S1-S12 (rows) by haplotype patterns P1-P4 (columns), with each cell marked as a major or a minor allele.]
A Haplotype Block Example
Chromosome 21 was partitioned into 4,135 haplotype blocks over 24,047 SNPs by Patil et al. (Science, 2001).
Blue box: major allele. Yellow box: minor allele.
Examples of Tag SNPs
Suppose we wish to identify an unknown haplotype sample.
We can genotype all SNPs to identify the haplotype sample.
[Figure: matrix of SNP loci S1-S12 (rows) by haplotype patterns P1-P4 (columns), each cell marked as a major or a minor allele, shown next to an unknown haplotype sample.]
Examples of Tag SNPs
In fact, it is not necessary to genotype all SNPs.
SNPs S3, S4, and S5 can form a set of tag SNPs.
[Figure: the S1-S12 by P1-P4 haplotype pattern matrix with rows S3, S4, and S5 extracted.]
Examples of Wrong Tag SNPs
SNPs S1, S2, and S3 cannot form a set of tag SNPs because P1 and P4 would be ambiguous.
[Figure: the S1-S12 by P1-P4 haplotype pattern matrix with rows S1, S2, and S3 extracted.]
Examples of Tag SNPs
SNPs S1 and S12 can form a set of tag SNPs.
This set of SNPs is the minimum solution in this example.
[Figure: the S1-S12 by P1-P4 haplotype pattern matrix with rows S1 and S12 extracted.]
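The tag SNP test used in these examples (do the chosen SNPs give every haplotype pattern a distinct signature?) can be sketched as follows. The allele matrix on the slides is not reproduced in this transcript, so the `block` below is hypothetical: its rows were chosen only so that the three statements on the slides hold, and only six of the twelve SNPs are included.

```python
from itertools import combinations

# Hypothetical block (0 = major allele, 1 = minor allele); columns are patterns P1..P4.
block = {
    "S1":  (0, 1, 1, 0),
    "S2":  (0, 0, 1, 0),
    "S3":  (1, 0, 0, 1),
    "S4":  (0, 1, 1, 1),
    "S5":  (0, 1, 0, 0),
    "S12": (0, 0, 1, 1),
}

def is_tag_set(snps):
    """True if the chosen SNPs give every haplotype pattern a distinct signature."""
    signatures = list(zip(*(block[s] for s in snps)))
    return len(set(signatures)) == len(signatures)

def minimum_tag_set():
    """Exhaustive search by increasing size; exponential, fine for a toy block."""
    for size in range(1, len(block) + 1):
        for subset in combinations(block, size):
            if is_tag_set(subset):
                return subset
    return None

print(is_tag_set(["S3", "S4", "S5"]))  # True
print(is_tag_set(["S1", "S2", "S3"]))  # False: P1 and P4 look identical
print(minimum_tag_set())               # ('S1', 'S12')
```

Since each SNP is binary, one SNP can separate at most two patterns, so no single SNP can tag four patterns and a set of size two is the best possible.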
Problems of Finding Tag SNPs
The problem of finding the minimum set of tag SNPs is known to be NP-hard.
It is an instance of the minimum test set problem.
A number of methods have been proposed to find the minimum set of tag SNPs (Bafna et al., Zhang et al.).
In reality, we may fail to obtain some tag SNPs if they do not pass the data quality threshold.
In the current genotyping environment, the missing rate of SNPs is around 5~10%.
We proposed two greedy algorithms and one linear programming relaxation algorithm to solve this problem.
Introduction to Linkage Disequilibrium and Programming Assignment
Algorithms and Computational Biology Lab, Department of Computer Science & Information Engineering,
National Taiwan University, Taiwan.
Speaker: Yao-Ting Huang
Linkage Disequilibrium
The problem of finding tag SNPs can also be solved from a statistical point of view.
We can measure the correlation between SNPs and identify sets of highly correlated SNPs.
For each set of correlated SNPs, only one SNP needs to be genotyped; it can then be used to predict the values of the other SNPs.
Linkage Disequilibrium (LD) is a measure that estimates such correlation between two SNPs.
We will formally introduce LD in detail later.
Linkage Disequilibrium Bins
The statistical methods for finding tag SNPs are based on the analysis of LD among all SNPs.
An LD bin is a set of SNPs such that SNPs within the same bin are highly correlated with each other.
The value of a single SNP in an LD bin can predict the values of the other SNPs in the same bin.
These methods try to identify the minimum set of LD bins.
An Example of LD Bins (1/3)
SNP1 and SNP2 cannot form an LD bin; e.g., A in SNP1 may imply either G or A in SNP2.

Individual  SNP1  SNP2  SNP3  SNP4  SNP5  SNP6
1           A     G     A     C     G     T
2           T     G     C     C     G     C
3           A     A     A     T     A     T
4           T     G     C     T     A     C
5           T     A     C     C     G     C
6           T     G     C     T     A     C
7           A     A     A     T     A     T
8           A     A     A     T     A     T
An Example of LD Bins (2/3)
SNP1, SNP3, and SNP6 can form an LD bin.
Any SNP in this bin is sufficient to predict the values of the others.
An Example of LD Bins (3/3)
There are three LD bins, and only three tag SNPs need to be genotyped (e.g., SNP1, SNP2, and SNP4).
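The three bins in this example can be recovered mechanically. The sketch below is a simplification: it groups SNPs only when they are perfectly correlated (each allele of one SNP maps to exactly one allele of the other), rather than applying an LD threshold. The column strings are read off the table for individuals 1 to 8; the function names are hypothetical.

```python
# One string per SNP: alleles of individuals 1..8, read down the table columns.
table = {
    "SNP1": "ATATTTAA",
    "SNP2": "GGAGAGAA",
    "SNP3": "ACACCCAA",
    "SNP4": "CCTTCTTT",
    "SNP5": "GGAAGAAA",
    "SNP6": "TCTCCCTT",
}

def pattern(column):
    """Relabel alleles by order of first appearance, so A/T and T/C columns compare equal."""
    labels = {}
    return tuple(labels.setdefault(allele, len(labels)) for allele in column)

def ld_bins(snps):
    """Group SNPs whose allele patterns are identical up to relabeling."""
    bins = {}
    for name, column in snps.items():
        bins.setdefault(pattern(column), []).append(name)
    return list(bins.values())

print(ld_bins(table))
# [['SNP1', 'SNP3', 'SNP6'], ['SNP2'], ['SNP4', 'SNP5']]
```

Picking one SNP per bin gives three tag SNPs, for example SNP1, SNP2, and SNP4.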
Difference between Haplotype Blocks and LD Bins
Haplotype blocks are based on the assumption that SNPs in close proximity tend to be correlated with each other.
  Recombination is less likely to occur between nearby SNPs.
LD bins can group correlated SNPs that are distant from each other.
  A disease is usually affected by multiple genes rather than a single one.
The SNPs in one LD bin can be shared by other bins.
The SNPs in a haplotype block do not appear in another block.
Introduction to Linkage Disequilibrium
                SNP2
                B       b       Total
SNP1   A        P_AB    P_Ab    P_A
       a        P_aB    P_ab    P_a
       Total    P_B     P_b     1.0

The four possible haplotypes are A B, A b, a B, and a b.
A, B: major alleles
a, b: minor alleles
P_A: probability of allele A at SNP1
P_a: probability of allele a at SNP1
P_B: probability of allele B at SNP2
P_b: probability of allele b at SNP2
P_AB: probability of haplotype AB
P_ab: probability of haplotype ab
Linkage Equilibrium
P_AB = P_A P_B
P_Ab = P_A P_b = P_A (1 - P_B)
P_aB = P_a P_B = (1 - P_A) P_B
P_ab = P_a P_b = (1 - P_A) (1 - P_B)
Linkage Disequilibrium
P_AB ≠ P_A P_B
P_Ab ≠ P_A P_b = P_A (1 - P_B)
P_aB ≠ P_a P_B = (1 - P_A) P_B
P_ab ≠ P_a P_b = (1 - P_A) (1 - P_B)
An Example of Linkage Disequilibrium
-- A -- -- -- G -- -- --
-- C -- -- -- G -- -- --
-- C -- -- -- C -- -- --
SNP1: P_A = 1/3, P_C = 2/3. SNP2: P_G = 2/3, P_C = 1/3.
Suppose we have three haplotypes: AG, CG, and CC. There is no AC haplotype, i.e., P_AC = 0.
Note that P_AC = 0 while P_A P_C = 1/3 × 1/3 = 1/9 (A at SNP1, C at SNP2), so P_AC ≠ P_A P_C.
These two SNPs are in linkage disequilibrium.
An Example of Linkage Equilibrium
Before recombination:
-- A -- -- -- G -- -- --
-- C -- -- -- G -- -- --
-- C -- -- -- C -- -- --
After recombination:
-- A -- -- -- C -- -- --
-- A -- -- -- G -- -- --
-- C -- -- -- G -- -- --
-- C -- -- -- C -- -- --
SNP1: P_A = 1/2, P_C = 1/2. SNP2: P_G = 1/2, P_C = 1/2.
After recombination,
P_AG = P_A P_G = 1/4,
P_CG = P_C P_G = 1/4,
P_CC = P_C P_C = 1/4, and
P_AC = P_A P_C = 1/4.
These two SNPs are in linkage equilibrium.
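Both examples can be verified by counting. The sketch below treats each haplotype as a two-character string (allele at SNP1, allele at SNP2), assumes every listed haplotype is equally frequent, and uses exact fractions; the function name is hypothetical.

```python
from fractions import Fraction

def in_equilibrium(haplotypes):
    """True if P(xy) = P(x) P(y) for every allele x at SNP1 and y at SNP2."""
    n = len(haplotypes)
    first = {h[0] for h in haplotypes}
    second = {h[1] for h in haplotypes}
    for x in first:
        for y in second:
            p_x = Fraction(sum(h[0] == x for h in haplotypes), n)
            p_y = Fraction(sum(h[1] == y for h in haplotypes), n)
            p_xy = Fraction(haplotypes.count(x + y), n)
            if p_xy != p_x * p_y:
                return False
    return True

before = ["AG", "CG", "CC"]        # no AC haplotype: P_AC = 0 but P_A P_C = 1/9
after = ["AC", "AG", "CG", "CC"]   # every haplotype has frequency 1/4
print(in_equilibrium(before))  # False
print(in_equilibrium(after))   # True
```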
Linkage Disequilibrium
There are many formulas for computing LD between two SNPs, and most of them are normalized to the range -1~1 or 0~1.
LD = 1 (perfect positive correlation)
LD = 0 (no correlation, i.e., linkage equilibrium)
LD = -1 (perfect negative correlation)
LD = 0.8 (strong positive correlation)
LD = 0.12 (weak positive correlation)
Linkage Disequilibrium Formulas
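Mathematical formulas for computing LD:
r^2 or Δ^2:

$$r^2 = \frac{(P_{AB} - P_A P_B)^2}{P_A (1 - P_A) P_B (1 - P_B)}$$

D':

$$D' = \begin{cases} \dfrac{D}{\min(P_A P_b,\ P_a P_B)} & \text{if } D > 0; \\[1ex] \dfrac{D}{\min(P_A P_B,\ P_a P_b)} & \text{if } D < 0, \end{cases} \qquad \text{where } D = P_{AB} - P_A P_B.$$

Chi-square test.
P value.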
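A small sketch of both measures follows, assuming D = P_AB - P_A P_B and the convention that D' keeps the sign of D (other texts report |D'| instead). The function name `ld_measures` and the input frequencies are illustrative assumptions; exact fractions are used to avoid rounding.

```python
from fractions import Fraction

def ld_measures(p_ab, p_a, p_b):
    """Return (D, D', r^2) for two bi-allelic SNPs with 0 < p_a, p_b < 1."""
    d = p_ab - p_a * p_b
    if d > 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r2

# Hypothetical frequencies: P_AB = 3/5, P_A = 4/5, P_B = 3/5.
d, d_prime, r2 = ld_measures(Fraction(3, 5), Fraction(4, 5), Fraction(3, 5))
print(d, d_prime, r2)  # 3/25 1 3/8
```

With these inputs D' reaches 1 while r^2 is only 0.375, which shows that the two measures can disagree about how strong the LD is.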
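Correlation Coefficient
The correlation between two random variables A and B can be measured by the correlation coefficient:

$$r^2 = \frac{\mathrm{Cov}(A, B)^2}{\mathrm{Var}(A)\,\mathrm{Var}(B)} = \frac{(P_{AB} - P_A P_B)^2}{P_A (1 - P_A) P_B (1 - P_B)}$$

$$\mathrm{Cov}(A, B) = E[AB] - E[A]E[B] = P_{AB} - P_A P_B$$

$$\mathrm{Var}(A) = E[A^2] - E[A]^2 = P_A - P_A^2 = P_A (1 - P_A)$$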
Examples of Computing LD
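Individual  SNP1  SNP2  SNP3  SNP4  SNP5  SNP6
1           A     T     A     A     G     T
2           G     T     C     C     T     T
3           G     A     C     A     G     T
4           G     A     C     C     T     T
5           G     A     C     A     G     C

$$r^2_{12} = \frac{(P_{AB} - P_A P_B)^2}{P_A (1 - P_A) P_B (1 - P_B)} = \frac{\left(\frac{3}{5} - \frac{4}{5} \times \frac{3}{5}\right)^2}{\frac{4}{5} \times \frac{1}{5} \times \frac{3}{5} \times \frac{2}{5}} = 0.375$$

$$r^2_{13} = \frac{(P_{AB} - P_A P_B)^2}{P_A (1 - P_A) P_B (1 - P_B)} = \frac{\left(\frac{4}{5} - \frac{4}{5} \times \frac{4}{5}\right)^2}{\frac{4}{5} \times \frac{1}{5} \times \frac{4}{5} \times \frac{1}{5}} = 1$$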
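The two worked values can be reproduced from the five-individual table. In this sketch each SNP column is written as a string over individuals 1 to 5, and the most frequent allele plays the role of A (or B) in the formula; the function name is hypothetical.

```python
from collections import Counter
from fractions import Fraction

# Alleles of individuals 1..5, read down the table columns.
columns = {
    "SNP1": "AGGGG",
    "SNP2": "TTAAA",
    "SNP3": "ACCCC",
    "SNP4": "ACACA",
    "SNP5": "GTGTG",
    "SNP6": "TTTTC",
}

def r_squared(col1, col2):
    """r^2 between two bi-allelic SNP columns of equal length."""
    n = len(col1)
    a = Counter(col1).most_common(1)[0][0]  # major allele at the first SNP
    b = Counter(col2).most_common(1)[0][0]  # major allele at the second SNP
    p_a = Fraction(col1.count(a), n)
    p_b = Fraction(col2.count(b), n)
    p_ab = Fraction(sum(x == a and y == b for x, y in zip(col1, col2)), n)
    return (p_ab - p_a * p_b) ** 2 / (p_a * (1 - p_a) * p_b * (1 - p_b))

print(float(r_squared(columns["SNP1"], columns["SNP2"])))  # 0.375
print(float(r_squared(columns["SNP1"], columns["SNP3"])))  # 1.0
```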
Minimum Clique Cover Problem
This problem asks for a minimum set of LD bins.
The minimum LD value required between two SNPs in one bin is usually set to 0.8.
This problem is known to be the minimum clique cover problem (Chao, K.-M., 2005).
Consider each SNP as a node in a graph.
There is an edge between two nodes iff the LD between the two SNPs is ≥ 0.8.
Relaxation of This Problem
The minimum clique cover problem is hard to approximate.
The relaxed problem asks for a minimum set of LD bins such that at least one SNP in each LD bin has r^2 ≥ 0.8 with every other SNP in the same bin.
The relaxed problem is known to be the minimum dominating set problem.
The minimum dominating set problem is still NP-hard but is easier to approximate.
Minimum Dominating Set Problem
Given a graph G(V, E), the minimum dominating set C is a minimum-size set of nodes such that each node in V is either in C or has at least one edge connecting it to a node in C.
Consider each node as a SNP and each edge as strong LD (r^2 ≥ 0.8) between two SNPs.
The minimum dominating set of this graph is the set of tag SNPs.
We need only this set of SNPs to predict the other SNPs.
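One common heuristic for this problem is a greedy rule: repeatedly pick the node that dominates the most not-yet-dominated nodes. The sketch below is one possible starting point, not the method prescribed by these slides. The example graph is an assumption built from the earlier eight-individual LD bin table, with an edge between SNPs that are perfectly correlated there.

```python
def greedy_dominating_set(adjacency):
    """Greedy heuristic. adjacency: dict node -> set of neighbours (undirected graph)."""
    uncovered = set(adjacency)
    chosen = []
    while uncovered:
        # Pick the node covering the most still-uncovered nodes (itself included);
        # ties go to the first node in sorted order.
        best = max(sorted(adjacency), key=lambda v: len(({v} | adjacency[v]) & uncovered))
        chosen.append(best)
        uncovered -= {best} | adjacency[best]
    return chosen

# Assumed LD graph: SNP1, SNP3, SNP6 mutually linked; SNP4 linked to SNP5; SNP2 isolated.
adjacency = {
    "SNP1": {"SNP3", "SNP6"},
    "SNP2": set(),
    "SNP3": {"SNP1", "SNP6"},
    "SNP4": {"SNP5"},
    "SNP5": {"SNP4"},
    "SNP6": {"SNP1", "SNP3"},
}
print(greedy_dominating_set(adjacency))  # ['SNP1', 'SNP4', 'SNP2']
```

Dominating set is a special case of set cover, so this greedy rule inherits the standard logarithmic approximation guarantee of greedy set cover.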
Experimental Data Sets
Hinds et al. (2005) identified 1,586,383 SNPs across three human populations: Americans of African, European, and Asian ancestry.
The database provides both genotype data and inferred haplotype data.
The Programming Assignment
Conduct an experiment on the Perlegen SNP database: http://www.perlegen.com
Find the minimum set of LD bins such that at least one SNP in each bin has strong LD (r^2 ≥ 0.8) with the other SNPs in the same bin.
Please use r^2 ≥ 0.8 as the threshold to identify strong correlation between two SNPs.
The focus of this project is to design algorithms for solving the minimum dominating set problem.
Haplotype Data Format
Field                  Description
local_id               Local unique identifier for this SNP
accession              NCBI Build 34 sequence accession number
position               Position within the specified Build 34 sequence
alleles                The two SNP alleles; order is arbitrary
NA?????_A, NA?????_B   Two inferred haploid alleles
                       Columns 5-50: African American haplotypes
                       Columns 51-98: European American haplotypes
                       Columns 99-146: Han Chinese haplotypes

Download phased haplotype data from http://genome.perlegen.com/browser/download.html.
Please use the 24 phased haplotype data sets.
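A record in this format could be split by population as sketched below. Only the column ranges come from the slide. The tab separator, the function name `parse_line`, and every value in the sample record are assumptions for illustration, so check them against the actual downloaded files.

```python
# 1-based, inclusive column ranges as listed on the slide.
POPULATION_COLUMNS = {
    "African American": (5, 50),
    "European American": (51, 98),
    "Han Chinese": (99, 146),
}

def parse_line(line, sep="\t"):
    """Split one record into SNP metadata and per-population haplotype alleles.

    The tab separator is an assumption; adjust `sep` to match the real file.
    """
    fields = line.rstrip("\n").split(sep)
    record = dict(zip(("local_id", "accession", "position", "alleles"), fields[:4]))
    record["haplotypes"] = {
        pop: fields[start - 1:end] for pop, (start, end) in POPULATION_COLUMNS.items()
    }
    return record

# Made-up record with 146 fields, only to exercise the column arithmetic.
fields = ["snp_0001", "ACC_0001", "12345", "A/G"] + ["A"] * 46 + ["G"] * 48 + ["A"] * 48
record = parse_line("\t".join(fields))
print({pop: len(alleles) for pop, alleles in record["haplotypes"].items()})
# {'African American': 46, 'European American': 48, 'Han Chinese': 48}
```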
The Programming Assignment
Work in teams of up to 5 people.
The program can be written in any programming language.
Exact or approximate algorithms are both welcome (more methods, higher grades).
Please provide an analysis of the proposed algorithms (e.g., the time complexity).
If you use an existing method, please add appropriate citations or references.
The Project Report
The project report should include at least the following contents (more information, higher grades):
(1) team member information,
(2) description of your algorithms,
(3) analysis of your algorithms (e.g., time complexity, approximation ratio),
(4) summary of experimental results, and
(5) contributions of each team member.
The Experimental Setup
The summary of your experimental results should at least include some statistics of the LD bins found by your algorithm.
We encourage you to conduct a comprehensive experiment and analysis.

             All     Africa   European   Chinese
1-10 SNPs    15123   12134    13123      11134
≥10 SNPs     1234    1111     1111       1111
Total bins   16357   13245    14234      12245
The Programming Assignment
Due date: 12/14
Email your program (with a detailed running procedure) and project report to the TA, Yao-Ting Huang: d92023@csie.ntu.edu.tw
We may ask you to demo your program if necessary.
Important messages will be announced on the following web page: http://www.csie.ntu.edu.tw/~kmchao/seq05fall/