Drablos Composite Motifs Bosc2009

Date post: 28-Nov-2014
Page 1: Drablos Composite Motifs Bosc2009

1

Finn Drabløs [tare.medisin.ntnu.no]

Computational discovery of composite motifs in DNA

Geir Kjetil Sandve, Osman Abul and Finn Drabløs

Page 2: Drablos Composite Motifs Bosc2009

Introduction: Basic gene regulation

• Proteins (transcription factors, TFs) recognise binding sites (sequence motifs) in gene regulatory regions

• The transcription factors stabilise the transcription complex

• Distal promoters (enhancers) interact through DNA looping

Michael Lones

Page 3: Drablos Composite Motifs Bosc2009

Motivation: De novo prediction of binding sites

• Make a set of co-regulated genes
  – E.g. from microarray experiments, normally imperfect sets
• Extract assumed regulatory regions
  – Normally a fixed region upstream from TSS of each gene
• Search for overrepresented patterns in these regions
  – Use a model for what a motif should look like
    • Consensus sequence with mismatches
    • Position Weight Matrix (PWM) based on log odds scores for occurrences
  – Use a strategy to find (local) optima for this model
    • E.g. Gibbs sampling, expectation maximisation …
• Problem: More than 100 different methods
  – Which methods are reliable?
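As an illustration of the PWM model mentioned above, here is a minimal Python sketch that builds a log-odds matrix from a few aligned sites and scans a sequence for the best-scoring window. The sites, the scanned sequence, the uniform background and the pseudocount are all made-up assumptions for the example; they are not taken from the talk.

```python
import math

ALPHABET = "ACGT"

def build_pwm(sites, background=None, pseudocount=1.0):
    """Build a log-odds PWM (one dict per position) from equal-length aligned sites."""
    if background is None:
        # Assumption: uniform base frequencies in the background.
        background = {base: 0.25 for base in ALPHABET}
    pwm = []
    for pos in range(len(sites[0])):
        counts = {base: pseudocount for base in ALPHABET}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({base: math.log2((counts[base] / total) / background[base])
                    for base in ALPHABET})
    return pwm

def score_window(pwm, window):
    """Sum the log-odds scores of the bases in one window."""
    return sum(column[base] for column, base in zip(pwm, window))

def best_hit(pwm, sequence):
    """Return (score, start) of the highest-scoring window in the sequence."""
    width = len(pwm)
    return max((score_window(pwm, sequence[i:i + width]), i)
               for i in range(len(sequence) - width + 1))

# Hypothetical aligned binding sites and a hypothetical promoter fragment.
sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TGACTAA"]
pwm = build_pwm(sites)
score, position = best_hit(pwm, "CCGTTGACTCAGGA")
print(position, round(score, 2))
```

A de novo method does not know the aligned sites in advance; strategies such as Gibbs sampling or expectation maximisation search for the alignment that optimises a score of this kind.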

Page 4: Drablos Composite Motifs Bosc2009

Motivation: Benchmarking of de novo tools

• Tompa et al., Nature Biotech 23, 137-144 (2005)
• Tested 14 different tools for motif discovery
• Used 52 data sets from fly (6), human (26), mouse (12) and yeast (8)
• Used data sets with real (Transfac) binding sites in different sequence contexts
  – "real": the actual promoter sequences
  – "generic": randomly chosen promoter sequences from the same genome
  – "markov": sequences generated by a Markov chain of order 3
• Measured performance at nucleotide level
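To make the "markov" sequence context concrete, the following sketch trains an order-3 Markov chain on example DNA and samples a synthetic sequence from it. It is only an assumption about how such backgrounds can be generated; the training sequences, the uniform fallback for unseen contexts and the function names are hypothetical and not taken from the benchmark.

```python
import random
from collections import Counter, defaultdict

def train_markov(sequences, order=3):
    """Count which base follows each context of `order` bases."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1
    return counts

def generate(counts, length, order=3, rng=None):
    """Sample a sequence of the given length from the trained chain."""
    if rng is None:
        rng = random.Random(0)  # fixed seed so the example is repeatable
    out = list(rng.choice(sorted(counts)))  # start from an observed context
    while len(out) < length:
        followers = counts.get("".join(out[-order:]))
        if not followers:
            # Assumption: fall back to a uniform choice for unseen contexts.
            out.append(rng.choice("ACGT"))
            continue
        bases, weights = zip(*sorted(followers.items()))
        out.append(rng.choices(bases, weights=weights)[0])
    return "".join(out[:length])

# Hypothetical training sequences standing in for real promoter regions.
training = ["ACGTACGTTGACTCAGGATCCA", "TTGACGTCAACGTTAGCATGCA"]
model = train_markov(training)
synthetic = generate(model, 50)
print(synthetic)
```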

Page 5: Drablos Composite Motifs Bosc2009

Motivation: Average benchmark performance

Method           TP     FP    FN      TN
AlignAce        477   3789  8186  436048
ANN-Spec        754   7799  7909  432038
Consensus       178   1394  8485  438443
GLAM            223   5619  8440  434218
Improbizer      594   7942  8069  431895
MEME            581   4836  8082  435001
MEME3           673   6726  7990  433111
MITRA           272   4092  8391  435745
MotifSampler    520   4344  8143  435493
Oligo/dyad      345   1891  8318  437946
QuickScore      151   4856  8512  434981
SeSiMCMC        530  13813  8133  426024
Weeder          748   1748  7915  438089
YMF             554   3492  8109  436345

Average over the 14 methods:

          Pred_P       Pred_N
Real_P    471 (TP)     8192 (FN)
Real_N    5167 (FP)    434670 (TN)

nCC = 0.053

Performance is close to random!

Too many FP, FN
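The nCC value on this slide can be reproduced from the average confusion matrix. The sketch below uses the standard nucleotide-level correlation coefficient (the Pearson or Matthews correlation between predicted and real site positions); the function name is an assumption made for the example.

```python
import math

def ncc(tp, fp, fn, tn):
    """Nucleotide-level correlation coefficient from confusion-matrix counts."""
    numerator = tp * tn - fn * fp
    denominator = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    # Return 0.0 when a row or column of the matrix is empty.
    return numerator / denominator if denominator else 0.0

# Average counts over all methods, as given on the slide.
average_ncc = ncc(tp=471, fp=5167, fn=8192, tn=434670)
print(round(average_ncc, 3))  # 0.053
```

A value of 1 would mean perfect prediction and 0 corresponds to random prediction, which is why 0.053 is described as close to random.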

Page 6: Drablos Composite Motifs Bosc2009

Motivation: Can we improve performance?

• Use better motif representations
  – Hidden Markov Models
• Use better algorithms
  – More exhaustive searching
  – Discriminative motif discovery
• Use better background models
  – Real sequences (not Markov models)
• Filter out false positives
  – Identify “motif-like” solutions
  – Identify regulatory regions
  – Use co-occurrence of motifs
    • Modules, composite motifs

TODAY!


Page 7: Drablos Composite Motifs Bosc2009

Approach: Composite motif discovery

• TFs act together as modules
• Modules are not completely unique

Page 8: Drablos Composite Motifs Bosc2009

Algorithm: Basic definitions

• Frequent modules
  – Modules (and motifs) can be ranked by support
    • Fraction of sequences where the module (or motif) is found
  – Support is monotone
    • Adding a motif to a module can never increase module support
• Specific modules
  – Modules can be ranked by hit probability
    • Probability that a sequence supports the module
  – Hit probability is monotone (as for support)
  – Specific modules have low hit probability in background sequences
• Significant modules
  – Modules can be ranked by significance
    • Probability that support in the sequence set ≠ support in background
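The monotonicity of support follows from the support set of a module being the intersection of the support sets of its motifs. The sketch below shows this on made-up data; the motif names and hit sets are hypothetical.

```python
def support_set(module, motif_hits, n_sequences):
    """Indices of the sequences that contain every motif in the module."""
    supporting = set(range(n_sequences))
    for motif in module:
        supporting &= motif_hits[motif]
    return supporting

def support(module, motif_hits, n_sequences):
    """Fraction of sequences that support the module."""
    return len(support_set(module, motif_hits, n_sequences)) / n_sequences

# Hypothetical data: motif name -> indices of sequences where it occurs.
motif_hits = {"A": {0, 1, 2, 4}, "B": {1, 2, 3}, "C": {2, 4}}
n_sequences = 5

# Growing the module one motif at a time can only shrink the support set.
chain = [["A"], ["A", "B"], ["A", "B", "C"]]
supports = [support(module, motif_hits, n_sequences) for module in chain]
print(supports)  # [0.8, 0.4, 0.2]
```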

Page 9: Drablos Composite Motifs Bosc2009

Algorithm: Search tree

• Discretized single motifs {1, 2, 3, …} organised as an implicit search tree
• Support set H and hit probability P are iteratively computed (monotonicity)
  – Initially H is the full sequence set and P is 1
• Search tree is efficiently pruned (indicated with X) based on H and P
• Final output can be ranked by module significance
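A minimal sketch of such a search, assuming a depth-first enumeration of motif sets in which H is updated by intersection and P by multiplication. The pruning rule shown (stop extending a branch once support drops below a threshold), the use of P as an acceptance filter, the thresholds and the data are assumptions for illustration, not the exact procedure used in Compo.

```python
def find_modules(motif_hits, motif_bg_prob, n_sequences, min_support, max_hit_prob):
    """Enumerate motif sets depth-first, updating H and P incrementally."""
    motifs = sorted(motif_hits)
    results = []

    def expand(start, module, H, P):
        for i in range(start, len(motifs)):
            motif = motifs[i]
            new_H = H & motif_hits[motif]      # support set can only shrink
            new_P = P * motif_bg_prob[motif]   # hit probability can only shrink
            if len(new_H) / n_sequences < min_support:
                continue  # prune: no extension can regain the lost support
            new_module = module + [motif]
            if new_P <= max_hit_prob:
                results.append((new_module, len(new_H) / n_sequences, new_P))
            expand(i + 1, new_module, new_H, new_P)

    # Initially H is the full sequence set and P is 1.
    expand(0, [], set(range(n_sequences)), 1.0)
    return results

# Hypothetical discretized motifs 1, 2, 3 with made-up hits and probabilities.
motif_hits = {1: {0, 1, 2, 3}, 2: {0, 1, 2}, 3: {3, 4}}
motif_bg_prob = {1: 0.5, 2: 0.2, 3: 0.1}
modules = find_modules(motif_hits, motif_bg_prob, n_sequences=5,
                       min_support=0.5, max_hit_prob=0.15)
print(modules)
```

Because both quantities are monotone, a branch that fails the support threshold never needs to be revisited, and a module that meets the hit-probability threshold keeps meeting it when extended.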

Page 10: Drablos Composite Motifs Bosc2009

Implementation: Module significance

• Position-level probability in background
  – Probability of single motif at specific location
  – Estimated from real DNA background sequences
• Sequence-level probability in background
  – Probability of single motif at least once in given background sequence
  – Estimated as union of position-level probabilities
• Hit-probability in background
  – Probability of composite motif at least once in background sequence
  – Estimated as product of individual motif components
• Significance p-value of observed support
  – Probability of seeing at least observed support in background set
  – Estimated as right tail of binomial distribution
    • At least k out of n successes given hit-probability
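The four steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions: positions and motifs are treated as independent, the union is computed as one minus the probability of no hit, and all numbers are invented for the example.

```python
import math

def sequence_level_prob(position_probs):
    """P(motif occurs at least once) from per-position probabilities.

    Assumption: positions are independent, so the union is 1 - prod(1 - p_i).
    """
    miss = 1.0
    for p in position_probs:
        miss *= 1.0 - p
    return 1.0 - miss

def hit_prob(sequence_level_probs):
    """P(all motifs of the module occur in one background sequence)."""
    return math.prod(sequence_level_probs)

def binomial_right_tail(k, n, p):
    """P(at least k successes out of n), each with probability p."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Hypothetical module of two motifs scanned over 500 positions per sequence.
motif_a = sequence_level_prob([0.001] * 500)
motif_b = sequence_level_prob([0.0004] * 500)
module_hit_prob = hit_prob([motif_a, motif_b])

# Hypothetical observation: the module is found in 6 of 10 input sequences.
p_value = binomial_right_tail(k=6, n=10, p=module_hit_prob)
print(module_hit_prob, p_value)
```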

Page 11: Drablos Composite Motifs Bosc2009

Implementation: Problem specification

• Frequent and specific modules
  – Use thresholds on support and specificity
  – Complete solutions but multi-objective optimization
• Top-ranking modules
  – Combine objectives into single measure, e.g. p-value
• Pareto-optimal modules
  – Each objective is a separate dimension of optimality
  – Return Pareto front of composite motifs

http://en.wikipedia.org/wiki/Pareto_efficiency
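As an illustration of the Pareto-optimal formulation, the sketch below filters a list of candidate modules down to the non-dominated ones. It assumes two objectives, support (to be maximised) and background hit probability (to be minimised, so it is negated); the candidate names and values are made up.

```python
def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))

def pareto_front(candidates):
    """Keep the candidates whose objective vector no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other["objectives"], c["objectives"])
                       for other in candidates)]

# Hypothetical modules; objectives are (support, -hit probability).
candidates = [
    {"name": "A", "objectives": (0.8, -0.3)},
    {"name": "B", "objectives": (0.6, -0.1)},
    {"name": "C", "objectives": (0.5, -0.2)},  # dominated by B
    {"name": "D", "objectives": (0.8, -0.4)},  # dominated by A
]
front = [c["name"] for c in pareto_front(candidates)]
print(front)  # ['A', 'B']
```

Unlike a single combined measure, the front keeps every trade-off between the objectives, so no weighting between support and specificity has to be fixed in advance.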

Page 12: Drablos Composite Motifs Bosc2009

Implementation: Motif prediction flowchart

Page 13: Drablos Composite Motifs Bosc2009

Benchmarking: Benchmark data set

• Known composite motifs from the TransCompel database
• Tests performance by adding “noise matrices” to input
  – Matrices for TFs assumed not to bind in sequence set
    • Will have random (false positive) hits
  – Selected at random from Transfac
    • Max noise level includes all Transfac matrices
  – Similar to actual usage
    • Searching for motifs consisting of unknown TFs

Page 14: Drablos Composite Motifs Bosc2009

Benchmarking: General performance (nCC)

• Compo compared to several other tools
  – TransCompel benchmark set
• Compo clearly has the best performance, in particular at realistic settings (high noise level)

Page 15: Drablos Composite Motifs Bosc2009

Benchmarking: Background and support

• Compo gains performance from realistic background (real DNA) and support
  – Random DNA based on multinomial sequence model
• Performance without real DNA background or support comparable to other tools

Page 16: Drablos Composite Motifs Bosc2009

Future development: Pareto front

• Pareto front on support, max motif distance and significance (colour)
• Compo prediction not optimal
  – Compo predicted Ets and GATA
  – Annotated motif is AP1 and NFAT
• Explore alternative solutions
• Explore parameter interactions

X – NFAT
O – AP1

Page 17: Drablos Composite Motifs Bosc2009

Acknowledgements: The research group

BiGR

Drabløs, Finn

Postdocs / Researchers
• Sætrom, Pål
• Kusnierczyk, Wacek
• Rye, Morten
• Klein, Jörn
• Anderssen, Endre
• Wang, Xinhui (ERCIM)
• Capatana, Ana (ERCIM, starting 2009)

PhDs
• Bratlie, Marit Skyrud
• Klepper, Kjetil
• Saito, Takaya
• Lundbæk, Marie
• Håndstad, Tony

Programmers / Technicians
• Johansen, Jostein
• Thomas, Laurent
• Olsen, Lene C.

Others
• Solbakken, Trude

Master students
• Bolstad, Kjersti
• Muiser, Iwe
• Sponberg, Bjørn
• Brands, Stef
• Skaland, Even

Former members
• Sandve, Geir Kjetil
• Abul, Osman
• Schwalie, Petra
• Lones, Michael

