Paper presentation @DILS'07

Accelerating Disease Gene Identification Through Integrated SNP Data Analysis

Paolo Missier, S. Embury, C. Hedeler, M. Greenwood
School of Computer Science, University of Manchester, UK

J. Pennock, A. Brass
School of Biological Sciences, University of Manchester, UK

DILS ’07, Philadelphia, USA

Combining the strengths of UMIST and The Victoria University of Manchester

Overall goal

– Add value to existing public SNP databases

– Support multiple experimental added-value SNP analysis packages

Core application:

improving the search for candidate gene selection in quantitative trait analysis

– Analysis of genetic factors in observed quantitative phenotypes

• resistance / susceptibility to a certain disease

• life span, weight, …

Build a flexible data infrastructure to support current biology research involving gene polymorphism (SNP)



Example: study on a parasite worm

Trichuris trichiura

Trichuris muris

• Same life cycle

• Natural parasite of mice

Genetic Component to Susceptibility to Trichuris trichiura: Evidence from Two Asian Populations. S. Williams-Blangero et al., Genetic Epidemiol. 2002, 22(5):254

“… 28% of the variation in Trichuris trichiura loads was attributable to genetic factors in both populations.”

[Figure: intensity (epg) × 1000 versus age (years), ages 4 to 15]


Finding candidate genes

• Candidate gene determination is an area of active research

(A. Chakravarti. Population genetics – making sense out of sequence. Nature Genetics, 21(Suppl. 1), January 1999)

• Current methodology involves QTL mapping

– Experimental method to correlate quantitative phenotype with genotype

– Associates a region on the chromosome to a specific phenotype through complex in-breeding schemes

[Figure labels: Resistant, Susceptible, Mixed responders]


The challenge

• A QTL may contain hundreds or thousands of genes

• Quantitative phenotypes are often polygenic

• Determination of candidate genes is a difficult and slow process

[Figure: Example QTL (chr 12). K statistic by marker position (cM) for markers D12Mit63, D12Mit136, D12Mit285, D12Mit201, D12Mit156, D12Mit52, D12Mit144]

Automation is needed to narrow the scope of the search to a manageable size


SNPs and their role in QT analysis

SNP: Single Nucleotide Polymorphism

– single-base change in a strain relative to a reference strain (Mus musculus)

• Strategy

– Identify areas of greatest difference between resistant / susceptible strains

– Prioritize candidate gene search using the density of highly differentiated gene regions

[Figure: number of individual SNPs per 1 Mbp division of chromosome 12 (divisions 31-32 to 54-55), with the priority region marked]
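The density-based strategy on this slide can be sketched in a few lines. This is only an illustration, not the SNPit implementation: the `(position, score)` input format, the function names, and the default threshold of 1.0 are assumptions made for the example; the 1 Mbp bin size follows the figure.

```python
from collections import Counter


def high_score_density(snps, bin_size=1_000_000, threshold=1.0):
    """Count SNPs with score >= threshold in each bin of bin_size base pairs.

    snps: iterable of (position_in_bp, score) pairs (hypothetical input format).
    Returns a dict mapping bin index to the number of high-score SNPs.
    """
    counts = Counter()
    for position, score in snps:
        if score >= threshold:
            counts[position // bin_size] += 1
    return dict(counts)


def priority_bins(density, top=1):
    """Return the bin indexes with the most high-score SNPs, densest first."""
    return sorted(density, key=density.get, reverse=True)[:top]
```

With 1 Mbp bins, bin index 31 corresponds to the 31-32 division on the figure's x-axis.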


SNP informativeness

• Rank SNPs according to strain differences

• Strain allele: nucleotide base replacement for a SNP observed in a single strain

Strain group 1 (resistant)

Strain group 2 (susceptible)

Group score model:

• Compare susceptible strains vs resistant strains

Perfect score:

• Disjoint sets of alleles

• No missing alleles


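Group strain score model

For each SNP:

Strains: $S_1 = \{s_1, \ldots, s_n\}$, $S_2 = \{s'_1, \ldots, s'_m\}$

Corresponding alleles: $A_1 = \{a_1, \ldots, a_n\}$, $A_2 = \{a'_1, \ldots, a'_m\}$

Common, distinct non-null alleles: $m = |A_1 \cap A_2|$

Distinct non-null alleles in $A_1$, $A_2$: $n$

$gs_0(A_1, A_2) = 1 - \frac{m}{n}$

Penalties: $p_1$ and $p_2$, where $p_i$ is a ratio over $|A_i|$ of the alleles $a_j \in A_i$ compared against $N$

$gs_1(A_1, A_2)$ is obtained from $gs_0(A_1, A_2)$, $p_1$ and $p_2$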


Example

[Worked example of $gs_1(A_1, A_2) $ for one SNP, built from $1 - \frac{m}{n}$, $p_1$ and $p_2$, with $m = 1$, $n = 2$]
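A minimal sketch of the base score $gs_0 = 1 - m/n$ may help make the model concrete. It assumes that a missing allele is encoded as "N" and that n counts the distinct non-null alleles across both groups; the function names are hypothetical. The `missing_fraction` helper is only an assumed reading of what a missing-allele penalty might measure, and the sketch deliberately does not combine it into $gs_1$.

```python
def gs0(alleles_1, alleles_2, null="N"):
    """Base group score 1 - m/n for one SNP.

    m: distinct non-null alleles common to both groups
    n: distinct non-null alleles across both groups (assumed reading)
    """
    known_1 = {a for a in alleles_1 if a != null}
    known_2 = {a for a in alleles_2 if a != null}
    n = len(known_1 | known_2)
    if n == 0:
        return 0.0  # assumption: no known allele at all gives no evidence
    m = len(known_1 & known_2)
    return 1 - m / n


def missing_fraction(alleles, null="N"):
    """Share of a group's alleles that are missing (assumed penalty input)."""
    if not alleles:
        return 0.0
    return sum(1 for a in alleles if a == null) / len(alleles)
```

For instance, `gs0(["A", "A"], ["G", "G"])` gives 1.0 (disjoint allele sets), while `gs0(["A", "G"], ["G", "G"])` gives 0.5 because one of the two distinct alleles is shared.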


Score model performance

• No standard test dataset

• Criteria: evaluate ranking of polymorphic genes

– Based on known candidate genes for HDL (cholesterol) QTL regions

• From SNP scores to gene scores:

High-score SNP density / total SNP density
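The SNP-to-gene step can be sketched as follows. Within a single gene both densities are taken over the same length, so the ratio reduces to the count of high-score SNPs over the count of all SNPs in the gene. The threshold value and the function names are assumptions made for this illustration.

```python
def gene_score(snp_scores, threshold=1.0):
    """Ratio of high-score SNPs to all SNPs in one gene.

    snp_scores: scores of the SNPs that fall inside the gene.
    threshold: cut-off for a "high" score (hypothetical default).
    """
    if not snp_scores:
        return 0.0
    high = sum(1 for score in snp_scores if score >= threshold)
    return high / len(snp_scores)


def rank_genes(scores_by_gene, threshold=1.0):
    """Return (gene, gene_score) pairs sorted from highest to lowest score."""
    scored = [(gene, gene_score(scores, threshold))
              for gene, scores in scores_by_gene.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```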


Score selectivity

HDL data on Perlegen

{ CAST/EiJ, C57BL/6J } vs { C3H/HeJ, FVB/NJ }

7,090 / 101,896 = 6.9%

Translates to < 20 candidate genes


The SNPit project

• A "lightweight" SNP database designed to support genetic research

– gene identification in QTLs is one application

– Hope to answer a broader array of research questions beyond QTL analysis

• SNPit is a secondary DB

– Primary sources: Ensembl (EBI), dbSNP (NCBI), Perlegen

• Others available, not considered in this study:

MGD – see Nucl. Acids Res., 35(Database issue), 2007

UCSC – see Nucl. Acids Res., 35(Database issue), 2007

Wellcome-CTC Mouse Strain SNP Genotype Set

MPD – Mouse Phenome Database


SNPit application challenges

Supports interactive exploratory analysis over large regions

• Over 50 Mb – 200K SNPs/source/session

Typical flow:

1. Region selection (or gene set)

2. Source selection (multiple)

3. Strain group selection – per-session basis

4. Compute score for each SNP in the region – on the fly

5. (filter by gene polymorphism)

6. Rank SNPs by score, gene polymorphism – in-memory sorting

7. Plot density of high-score SNPs over the selected region

• Change parameters and repeat…

Response times are typically within 30 seconds on a Tomcat deployment running on a high-end server with a co-located DBMS (MySQL)
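Steps 3 to 6 of the flow can be sketched as a single in-memory pass. This is an illustration under stated assumptions, not the SNPit code: the dictionary layout of a SNP record, the `in_polymorphic_gene` flag, and the toy `disjoint_score` function are all hypothetical stand-ins for whatever the application uses.

```python
def rank_snps(snps, group_1, group_2, score_fn, polymorphic_only=False):
    """Score each SNP for two strain groups and rank the results in memory.

    snps: list of dicts with keys "id", "alleles" (strain -> allele) and
          "in_polymorphic_gene" (hypothetical record layout).
    score_fn: function taking the two groups' allele lists, returning a score.
    """
    ranked = []
    for snp in snps:
        if polymorphic_only and not snp["in_polymorphic_gene"]:
            continue  # step 5: optional filter by gene polymorphism
        alleles_1 = [snp["alleles"].get(strain) for strain in group_1]
        alleles_2 = [snp["alleles"].get(strain) for strain in group_2]
        ranked.append((score_fn(alleles_1, alleles_2), snp["id"]))  # step 4
    ranked.sort(key=lambda pair: pair[0], reverse=True)  # step 6
    return ranked


def disjoint_score(alleles_1, alleles_2):
    """Toy stand-in score: 1.0 when the groups share no known allele."""
    known_1 = {a for a in alleles_1 if a is not None}
    known_2 = {a for a in alleles_2 if a is not None}
    if not known_1 or not known_2:
        return 0.0
    return 0.0 if known_1 & known_2 else 1.0
```

Passing the score as a function keeps the per-session choices (strain groups, score model) as plain parameters, so changing them and repeating only means calling `rank_snps` again.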


Why multiple SNP DBs

• SNP databases differ

– Partially overlap in structure and content

– Different update policy and frequency

• Biologists like to choose their sources

– Based on experience, prior usage, confidence

• The SNPit application offers an explicit choice

• It exploits complementary features and content of the DBs


Data architecture

[Diagram: Ensembl SNP, dbSNP, and Perlegen are each loaded into SNPitDB, with periodic updates; rsId and ssId link the sources; core data processing (Score 1, Score 2, …) serves the SNPit Web app and the SNPit Web Service]

Interdependent materialized views

• no single global schema

• queries against the views

• some queries can be directed to more than one DB

• End-user web app

• Web Service accessible as a workflow processor (Taverna)


SNPit access from a workflow


SNP DB dependencies

(all figures relative to chromosome 12)

[Diagram: primary sources Ensembl Mouse (407,000), NCBI dbSNP, and Perlegen, with the Sanger Institute and public submission from multiple sources feeding them; each source is loaded and receives updates; joins on rsId and ssId connect SNPs, strain alleles, and SNP provenance across multiple strains. Counts shown: Tot 407,000; Tot 420,000; (420,000); 147,000; 146,000; 14,000; 133,000; 132,000]
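The join structure in this diagram can be illustrated with a small view over three source tables. Everything named here is hypothetical: the table and column names, the sample identifiers, and the choice of SQLite are assumptions for the sketch. SQLite views are not materialized, so the plain view below only stands in for the materialized views described in the slides.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with three toy source tables and one view."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE ensembl_snp (rs_id TEXT, position INTEGER, location TEXT);
        CREATE TABLE dbsnp_snp (rs_id TEXT, ss_id TEXT, submitter TEXT);
        CREATE TABLE perlegen_snp (ss_id TEXT, strain TEXT, allele TEXT);

        -- Ensembl joins dbSNP on rsId; dbSNP joins Perlegen on ssId.
        CREATE VIEW snp_strain_alleles AS
        SELECT e.rs_id, e.position, e.location, d.submitter, p.strain, p.allele
        FROM ensembl_snp e
        JOIN dbsnp_snp d ON d.rs_id = e.rs_id
        JOIN perlegen_snp p ON p.ss_id = d.ss_id;
    """)
    # Made-up rows, for illustration only.
    conn.executemany("INSERT INTO ensembl_snp VALUES (?, ?, ?)",
                     [("rs_demo_1", 31200000, "exonic"),
                      ("rs_demo_2", 31900000, "intronic")])
    conn.executemany("INSERT INTO dbsnp_snp VALUES (?, ?, ?)",
                     [("rs_demo_1", "ss_demo_1", "submitter_x")])
    conn.executemany("INSERT INTO perlegen_snp VALUES (?, ?, ?)",
                     [("ss_demo_1", "A/J", "G"),
                      ("ss_demo_1", "C57BL/6J", "A")])
    return conn


def alleles_for(conn, rs_id):
    """Query the view for the strain alleles of one SNP, ordered by strain."""
    cursor = conn.execute(
        "SELECT strain, allele, location FROM snp_strain_alleles "
        "WHERE rs_id = ? ORDER BY strain", (rs_id,))
    return cursor.fetchall()
```

A query against the view draws on all three sources at once: location from the Ensembl table, strain alleles from the Perlegen table, linked through the dbSNP identifiers.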


Qualitative differences

Ensembl

– Strengths: curated SNPs; evolving; SNP location info (exonic, intronic); multiple reputable sources; controlled submission

– Weaknesses: low timeliness

– Strain info: about 60 strains; not very complete

dbSNP

– Strengths: submitter info; update history (provenance); multiple sources; timely

– Weaknesses: low quality control on public submission

– Strain info: not used

Perlegen

– Strengths: good quality control; high reputation

– Weaknesses: no SNP location; not evolving

– Strain info: 16 strains (ref + 15); fairly complete


Missing strains – chr 17

[Figure: % SNPs with available strain allele, PERLEGEN. y-axis: % available allele (0% to 100%); x-axis: strain (C57BL/6J, KK/HlJ, BTBR T+ tf/J, NZW/LacJ, AKR/J, A/J, WSB/EiJ, DBA/2J, BALB/cByJ, C3H/HeJ, 129S1/SvImJ, NOD/LtJ, FVB/NJ, CAST/EiJ, MOLF/EiJ, PWD/PhJ)]

[Figure: % SNPs with available strain allele, ENSEMBL. y-axis: available allele info (0% to 100%); x-axis: strain (A/J, C57BL/6J, DBA/2J, 129X1/SvJ, 129S1/SvImJ, AKR/J, C3H/HeJ, KK/HlJ, NZW/LacJ, BALB/cByJ, BTBR T+ tf/J, WSB/EiJ, NOD/LtJ, MOLF/EiJ, FVB/NJ, PWD/PhJ, CAST/EiJ, MSM/Ms, NOD/DIL, CZECHII/Ei, SPRET/Ei)]
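The per-strain percentages plotted in these charts could be computed along the following lines. This is a sketch only: the per-SNP dictionary of strain alleles and the use of "N" for a missing allele are assumptions, and the function name is hypothetical.

```python
def allele_availability(snps, strains, null="N"):
    """Percentage of SNPs with a known allele, for each strain.

    snps: list of dicts mapping strain name to allele (hypothetical layout);
          a strain that is absent from a dict counts as missing.
    """
    result = {}
    for strain in strains:
        known = sum(1 for alleles in snps if alleles.get(strain, null) != null)
        result[strain] = 100.0 * known / len(snps) if snps else 0.0
    return result
```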


Effect of source selection

[Figure panels: Ensembl, Perlegen]


Summary

• SNPit complements current methodologies for candidate gene discovery in QTL regions

– Helps focus on promising genes

– Automates SNP analysis over large regions

• View-based, loose integration of three prominent DBs

• Original score models

– More study needed to exploit other features

• SNP location, submitter info, revision frequency…

• Can be invoked from workflows

– As part of larger in silico experiments

• Plan to release SNPit as a public Web Service



SNPs and their role in QT analysis

• SNP: Single Nucleotide Polymorphism

– single-base change in a strain relative to a reference strain (Mus musculus)

• Inbred strains are genetically similar

• The arrangement of SNPs across the mouse genome falls into blocks which are common among strains (haplotypes)

• e.g. the C57 strain (susceptible) differs from the A/J and BALB/c strains (resistant)


DB overlaps

[Venn diagram, chromosome 17: Perlegen 291,718; Ensembl 253,862; dbSNP; other counts shown: 50,564; 122,938; 105,265; 243,702]
