Date post: | 11-May-2015 |
Category: | Technology |
Upload: | paolo-missier |
Accelerating Disease Gene Identification Through Integrated SNP Data Analysis
Paolo Missier, S. Embury, C. Hedeler, M. Greenwood
School of Computer Science, University of Manchester, UK
J. Pennock, A. Brass
School of Biological Sciences, University of Manchester, UK
DILS ’07, Philadelphia, USA
Combining the strengths of UMIST and The Victoria University of Manchester
Overall goal
Build a flexible data infrastructure to support current biology research involving gene polymorphism (SNP)
– Add value to existing public SNP databases
– Support multiple experimental added-value SNP analysis packages
Core application: improving candidate gene selection in quantitative trait analysis
– Analysis of genetic factors in observed quantitative phenotypes
• resistance / susceptibility to a certain disease
• life span, weight, …
Example: study on a parasite worm
Trichuris trichiura
Trichuris muris
• Same life cycle
• Natural parasite of mice
Genetic Component to Susceptibility to Trichuris trichiura: Evidence from Two Asian Populations. S. Williams-Blangero et al., Genetic Epidemiol. 2002, 22 (5):254
“… 28% of the variation in Trichuris trichiura loads was attributable to genetic factors in both populations.”
[Figure: Intensity (epg) × 1000 plotted against Age (years), ages 4 to 15]
Finding candidate genes
• Candidate gene determination is an area of active research
(A. Chakravarti. Population genetics – making sense out of sequence. Nature Genetics, 21(Suppl. 1), January 1999)
• Current methodology involves QTL mapping
– Experimental method to correlate quantitative phenotype with genotype
– Associates a region on the chromosome to a specific phenotype through complex in-breeding schemes
[Figure: strain responses labelled Resistant, Susceptible and Mixed responders]
The challenge
• A QTL may contain hundreds or thousands of genes
• Quantitative phenotypes are often polygenic
• Determination of candidate genes is a difficult and slow process
[Figure: Example QTL (chr 12). K statistic plotted against position (cM) at markers D12Mit63, D12Mit136, D12Mit285, D12Mit201, D12Mit156, D12Mit52 and D12Mit144]
Automation is needed to narrow the scope of the search to a manageable size
SNPs and their role in QT analysis
• SNP: Single Nucleotide Polymorphism
– single-base change in a strain relative to a reference strain (Mus musculus)
• Strategy
– Identify areas of greatest difference between resistant / susceptible strains
– Prioritize candidate gene search using the density of highly differentiated gene regions
[Figure: number of individual SNPs per 1 Mbp division of Chr 12, divisions 31-32 to 54-55, with the priority region marked]
SNP informativeness
• Rank SNPs according to strain differences
• Strain allele: nucleotide base replacement for a SNP observed in a single strain
Group score model:
• Compare susceptible strains vs resistant strains
– Strain group 1 (resistant)
– Strain group 2 (susceptible)
Perfect score:
• Disjoint sets of alleles
• No missing alleles
Group strain score model
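For each SNP:
Strains: $S_1 = \{s_1, \ldots, s_n\}$, $S_2 = \{s_1, \ldots, s_n\}$
Corresponding alleles: $A_1 = \{a_1, \ldots, a_n\}$, $A_2 = \{a'_1, \ldots, a'_m\}$
Common, distinct non-null alleles ($A_1 \cap A_2$): $m$
Distinct non-null alleles in $A_1$, $A_2$: $n$
$gs_0(A_1, A_2) = 1 - \frac{m}{n}$
Penalties:
$p_1 = \frac{|\{a_j \in A_1 \mid a_j = N\}|}{|A_1|}$, $p_2 = \frac{|\{a_j \in A_2 \mid a_j = N\}|}{|A_2|}$
$gs_1(A_1, A_2)$: $gs_0(A_1, A_2)$ penalized by $p_1$ and $p_2$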
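The score components for a single SNP can be sketched as below. This is only an illustration under stated assumptions: a missing strain allele is represented as `None`, n is taken as the number of distinct non-null alleles across both groups, m as the number they share, and each penalty as the fraction of missing alleles in its group. The function name and the zero score for a SNP with no known alleles are hypothetical choices, and the sketch deliberately stops short of combining the penalties into a final score.

```python
from typing import Optional, Sequence, Tuple

Allele = Optional[str]  # None stands for a missing (null) strain allele


def group_score_components(
    alleles_1: Sequence[Allele], alleles_2: Sequence[Allele]
) -> Tuple[float, float, float]:
    """Return (gs0, p1, p2) for one SNP.

    Each input holds one strain allele per strain of the group.
    """
    if not alleles_1 or not alleles_2:
        raise ValueError("each strain group needs at least one strain")

    distinct_1 = {a for a in alleles_1 if a is not None}
    distinct_2 = {a for a in alleles_2 if a is not None}

    n = len(distinct_1 | distinct_2)  # distinct non-null alleles in A1, A2
    m = len(distinct_1 & distinct_2)  # common distinct non-null alleles

    # Assumption: a SNP with no known allele at all gets a base score of 0.
    gs0 = 1 - m / n if n else 0.0

    # Penalties: fraction of missing alleles in each group (assumed reading).
    p1 = sum(a is None for a in alleles_1) / len(alleles_1)
    p2 = sum(a is None for a in alleles_2) / len(alleles_2)
    return gs0, p1, p2
```

With disjoint, fully known groups such as `["A", "A"]` and `["G", "G"]`, the base score is 1 and both penalties are 0, which matches the "perfect score" conditions stated earlier.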
Example
[Figure: worked example of the group score $gs_1(A_1, A_2)$, computed from $m$, $n$, $p_1$ and $p_2$ for two strain groups]
Score model performance
• No standard test dataset
• Criteria: evaluate ranking of polymorphic genes
– Based on known candidate genes for HDL (cholesterol) QTL regions
• From SNP scores to gene scores:
– Gene score = high-score SNP density / total SNP density
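The step from SNP scores to a gene score might look like the sketch below. The `threshold` that defines a "high-score" SNP and the function name are assumptions made for illustration; the slides do not state a cut-off. Both densities are taken over the same gene, so the gene length cancels and the result equals the fraction of the gene's SNPs that are high-scoring.

```python
from typing import Sequence


def gene_score(
    snp_scores: Sequence[float], gene_length_bp: int, threshold: float = 0.9
) -> float:
    """Ratio of high-score SNP density to total SNP density for one gene.

    `threshold` is a hypothetical cut-off for "high-score".
    """
    if gene_length_bp <= 0:
        raise ValueError("gene_length_bp must be positive")
    if not snp_scores:
        return 0.0  # assumption: a gene without SNPs scores 0

    high_count = sum(score >= threshold for score in snp_scores)
    high_density = high_count / gene_length_bp
    total_density = len(snp_scores) / gene_length_bp
    return high_density / total_density
```

For example, a 10 kbp gene with SNP scores `[1.0, 0.5, 0.95, 0.2]` has two of four SNPs above the assumed threshold, giving a gene score of about 0.5.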
Score selectivity
HDL data on Perlegen
{ CAST/EiJ, C57BL/6J } vs { C3H/HeJ, FVB/NJ }
7090 / 101,896 ≈ 7.0%
Translates to < 20 candidate genes
The SNPit project
• A "lightweight" SNP database designed to support genetic research
– gene identification in QTLs is one application
– Hope to answer a broader array of research questions beyond QTL analysis
• SNPit is a secondary DB
– Primary sources: Ensembl (EBI), dbSNP (NCBI), Perlegen
• Others available, not considered in this study
– MGD: see Nucl. Acids Res., 35(Database issue), 2007
– UCSC: see Nucl. Acids Res., 35(Database issue), 2007
– Wellcome-CTC Mouse Strain SNP Genotype Set
– MPD: Mouse Phenome Database
SNPit application challenges
Supports interactive exploratory analysis over large regions
• Over 50 Mb, 200K SNPs/source/session
Typical flow:
1. Region selection (or gene set)
2. Source selection (multiple)
3. Strain group selection – per-session basis
4. Compute score for each SNP in the region – on the fly
5. (filter by gene polymorphism)
6. Rank SNPs by score, gene polymorphism – in-memory sorting
7. Plot density of high-score SNPs over the selected region
• Change parameters and repeat…
Response times are typically within 30 seconds on a Tomcat deployment running on a high-end server with a co-located DBMS (MySQL)
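Steps 6 and 7 of the flow above (in-memory ranking and the density of high-score SNPs over the region) can be sketched as follows. This is not the SNPit implementation: the `(position_bp, score)` tuples, the function names, the 0.9 threshold and the 1 Mbp window are all assumptions chosen to mirror the 1 Mbp divisions used in the chromosome 12 plot.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Snp = Tuple[int, float]  # (position in bp, score); hypothetical record shape


def rank_snps(snps: Sequence[Snp]) -> List[Snp]:
    """Sort SNPs by descending score, in memory."""
    return sorted(snps, key=lambda snp: snp[1], reverse=True)


def high_score_density(
    snps: Sequence[Snp], threshold: float = 0.9, window_bp: int = 1_000_000
) -> Dict[int, int]:
    """Count high-score SNPs per window; keys are window indices."""
    counts = Counter(
        position // window_bp for position, score in snps if score >= threshold
    )
    return dict(sorted(counts.items()))
```

Windows with the largest counts would be the natural candidates for a priority region; changing `threshold` or `window_bp` and recomputing corresponds to the "change parameters and repeat" step.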
Why multiple SNP DBs
• SNP databases differ
– Partially overlap in structure and content
– Different update policy and frequency
• Biologists like to choose their sources
– Based on experience, prior usage, confidence
• The SNPit application offers an explicit choice
• It exploits complementary features and content of the DBs
Data architecture
[Figure: data architecture. Ensembl, dbSNP and Perlegen are loaded, with periodic updates, into SNPitDB (identifiers rsId, ssId); core data processing (Score 1, Score 2, …) feeds the SNPit Web app and the SNPit Web Service]
Interdependent materialized views
• no single global schema
• queries against the views
• some queries can be directed to more than one DB
• End-user web app
• Web Service accessible as a workflow processor (Taverna)
SNPit access from a workflow
SNP DB dependencies
(all figures relative to chromosome 12)
[Figure: dependency diagram linking the primary sources (Sanger Institute, public submission from multiple sources, NCBI dbSNP, Ensembl Mouse, Perlegen) to the SNPit data (SNPs, strain alleles, SNP provenance) through load and update steps and joins on rsId and ssId. Totals shown: 407,000 (Ensembl Mouse) and 420,000; other counts shown: 147,000, 146,000, 14,000, 133,000, 132,000]
Qualitative differences
Ensembl
– Strengths: curated SNPs; evolving; SNP location info (exonic, intronic); multiple reputable sources; controlled submission
– Weaknesses: low timeliness
– Strain info: about 60 strains; not very complete
dbSNP
– Strengths: submitter info; update history (provenance); multiple sources; timely
– Weaknesses: low quality control on public submission
– Strain info: not used
Perlegen
– Strengths: good quality control; high reputation
– Weaknesses: no SNP location; not evolving
– Strain info: 16 strains (ref + 15); fairly complete
Missing strains – chr 17
[Figure: % SNPs with available strain allele, Perlegen. Y-axis: % available allele (0% to 120%). Strains: C57BL/6J, KK/HlJ, BTBR T+ tf, NZW/LacJ, AKR/J, A/J, WSB/EiJ, DBA/2J, BALB/cByJ, C3H/HeJ, 129S1/SvIm, NOD/LtJ, FVB/NJ, CAST/EiJ, MOLF/EiJ, PWD/PhJ]
[Figure: % SNPs with available strain allele, Ensembl. Y-axis: available allele info (0% to 100%). Strains: A/J, C57BL/6J, DBA/2J, 129X1/SvJ, 129S1/SvImJ, AKR/J, C3H/HeJ, KK/HlJ, NZW/LacJ, BALB/cByJ, BTBR T+ tf/J, WSB/EiJ, NOD/LtJ, MOLF/EiJ, FVB/NJ, PWD/PhJ, CAST/EiJ, MSM/Ms, NOD/DIL, CZECHII/Ei, SPRET/Ei]
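The per-strain availability plotted in these charts could be computed along the lines below. The nested-dictionary input, the SNP identifiers and the function name are hypothetical, and a missing strain allele is again represented as `None`; a strain absent from a SNP's record is treated the same as a missing allele.

```python
from typing import Dict, Mapping, Optional


def allele_availability(
    snp_alleles: Mapping[str, Mapping[str, Optional[str]]]
) -> Dict[str, float]:
    """Percentage of SNPs for which each strain has a known allele.

    `snp_alleles` maps a SNP id to {strain name: allele or None}.
    """
    total = len(snp_alleles)
    if total == 0:
        return {}

    available: Dict[str, int] = {}
    for alleles in snp_alleles.values():
        for strain, allele in alleles.items():
            available.setdefault(strain, 0)
            if allele is not None:
                available[strain] += 1

    return {strain: 100.0 * count / total for strain, count in available.items()}
```

Running this once per source over the same region gives the kind of side-by-side comparison of strain completeness shown for Perlegen and Ensembl.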
Effect of source selection
[Figure: two panels, Ensembl and Perlegen]
Summary
• SNPit complements current methodologies for candidate gene discovery in QTL regions
– Helps focus on promising genes
– Automates SNP analysis over large regions
• View-based, loose integration of three prominent DBs
• Original score models
– More study needed to exploit other features
• SNP location, submitter info, revision frequency…
• Can be invoked from workflows
– As part of larger in silico experiments
• Plan to release SNPit as a public Web Service
SNPs and their role in QT analysis
• SNP: Single Nucleotide Polymorphism
– single-base change in a strain relative to a reference strain (Mus musculus)
• Inbred strains are genetically similar
• The arrangement of SNPs across the mouse genome falls into blocks which are common among strains (haplotypes)
• e.g.: the C57 strain (susceptible) differs from the A/J and BALB/c strains (resistant)
DB overlaps
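[Figure: Venn diagram of SNP overlaps among Perlegen, Ensembl and dbSNP (Chromosome 17). Labels: Perlegen 291,718; Ensembl 253,862; dbSNP; region counts 50,564, 122,938, 105,265, 243,702]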