Date post: 10-May-2015
Upload: bioinformaticsinstitute
GraphClust: alignment-free structural clustering of local RNA secondary structures. Yakovlev Pavel, St-Petersburg Academic University, March 16, 2013.
Page 1: Slides4

GraphClust: alignment-free structural clustering of local RNA secondary structures.

Yakovlev Pavel

St-Petersburg Academic University

March 16, 2013

Yakovlev Pavel Graph Clust 1/33

Page 2: Slides4

Abstract

Non-protein coding RNAs (ncRNAs) have moved into a key research focus in molecular biology, fundamentally challenging established paradigms and rapidly transforming our perception of genome complexity. There is mounting evidence that the eukaryotic genome is pervasively transcribed and that ncRNAs function at various levels, regulating gene expression and cell biology.


Page 3: Slides4

Motivation

Clustering algorithms are applied to store, compare and predict ncRNAs, because:

- in contrast to protein-coding genes, ncRNAs belong to a diverse array of classes with vastly different structures, functions, and evolutionary patterns
- ncRNAs can be divided into RNA families according to inherent functional, structural, or compositional similarities


Page 4: Slides4

Motivation: Part II

Question: So, what is the problem with the «old» alignment-based algorithms?


Page 5: Slides4

Motivation: Part II

Answer: They are awful and do not work.

Why?

- Multiple Sequence Alignment (MSA) is an NP-complete problem.
- The dynamic-programming algorithms used in practice are very expensive too:
  - Sankoff algorithm (1985): O(n^6) time and O(n^4) memory for a pair of sequences
  - Carrillo-Lipman algorithm (1988)

Thus the tools based on these algorithms are limited to relatively small datasets:

- LocARNA (2007)
- FoldAlign (2007)


Page 6: Slides4

Smart quote

Gorodkin: ‘Even using substantially more sophisticated techniques, genome-scale ncRNA analyses often consume tens to hundreds of computer years. These high computational costs are one reason why ncRNA gene finding is still in its infancy.’ [1]

[1] Gorodkin, J. et al. (2010) De novo prediction of structured RNAs from genomic sequences. Trends Biotechnol, 28, 9–19.


Page 7: Slides4

Solutions

Postulate: ncRNAs can also be grouped into RNA classes whose members, in contrast with RNA families, have no discernible homology at the sequence level, but still share common structural and functional properties.

Two main classes of clustering approaches:

1. The first class uses simplifications in the representation of structures.
   - these approaches depend heavily on the correctness of the structures
   - computational predictions of secondary structures are known to be error prone
2. The second class uses sequence information as prior knowledge to speed up the computation.
   - sequences are first clustered by sequence alignment
   - family assignments of structured RNAs obtained from sequence alignments at pairwise sequence identities below 60% are often wrong


Page 8: Slides4

GraphClust: Alignment-free solution

GraphClust (University of Freiburg, Germany):

- compares and clusters RNAs according to both sequence and structure
- alignment-free
- linear-time
- scales to datasets of hundreds of thousands of sequences
- extends a fast graph kernel technique designed for chemoinformatics
- ready-to-use pipeline, available on request for free


Page 9: Slides4

GraphClust: Approach

Approach pipeline:

1. Try a small number of probable, but sufficiently different, structures for each RNA sequence.
2. Encode each structure as a labeled graph that preserves all information about the nucleotide type and the bond type. The structures of one sequence are represented together as a graph with several disconnected components.
3. Compute the similarity between the representative graphs using a [chemoinformatics] graph kernel.
4. Evaluate each neighborhood and select as candidate clusters those that contain very similar elements.
5. Dataset scanning: assign missing sequences to the clusters and remove them from the set.
6. Reiterate the procedure.


Page 10: Slides4

GraphClust: Questions

Still unclear:

- How are structures encoded as graphs?
- How can graphs or graph substructures be compared?
- Why do we not need a quadratic number of graph comparisons?


Page 11: Slides4

Answer 1: Graph encoding for structures

Minimum free energy-based secondary structure prediction has been shown to be error prone (Gardner et al., 2005). So we want to use a representation of the entire ensemble of low-energy conformations.

Problem: Resorting to a complete enumeration of near-optimal structures would yield a tremendously large number of structures.

Solution: We opt for abstract shape analysis methods.


Page 12: Slides4

Answer 1: Graph encoding for structures

We consider subsequences of our ncRNA sequence, taken as overlapping windows. The procedure depends on two parameters:

- W: window sizes (30 and 150 nt in the tests)
- O: overlap parameter for window start positions

Then we take the l most representative structures of each subsequence and encode them as disconnected components.
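
The windowing step can be sketched in Python as follows. This is a minimal illustration, not the GraphClust implementation: the function name `sliding_windows` is hypothetical, and it assumes that O is given as a fraction of the window size W and sets the step between consecutive window start positions.

```python
def sliding_windows(seq, sizes=(30, 150), shift=0.2):
    """Return (window_size, start, subsequence) for every window.

    Assumption: `shift` plays the role of O and is a fraction of the
    window size, so windows of size w start every round(w * shift) nt.
    """
    result = []
    for w in sizes:
        step = max(1, round(w * shift))
        # A sequence shorter than the window yields one window: the whole sequence.
        last_start = max(len(seq) - w, 0)
        for start in range(0, last_start + 1, step):
            result.append((w, start, seq[start:start + w]))
    return result
```

Under these assumptions, a 60 nt sequence with W = 30 and O = 20% yields six windows, starting at positions 0, 6, 12, 18, 24 and 30.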


Page 13: Slides4

Answer 2: Graph comparisons

But how do we compare these graphs?

- Graph similarity is the central problem for all learning tasks on graphs, such as clustering and classification.
- Subgraph isomorphism is an NP-complete problem.


Page 14: Slides4

Answer 2: Graph kernel

Graph kernels:

- Compare substructures of graphs that are computable in polynomial time
- Examples: walks, paths, cyclic patterns, trees

Criteria for a good graph kernel:

- Expressive
- Efficient to compute
- Positive definite
- Applicable to a wide range of graphs


Page 15: Slides4

Answer 2: Graph kernel

NSPDK: Neighborhood Subgraph Pairwise Distance Kernel (Costa and De Grave, 2010)

Key features:

- very fast kernel
- suitable for large datasets
- works with sparse graphs with discrete vertex and edge labels
- works with special pairs of subgraphs, called «neighborhood» subgraphs
- an instance of a decomposition kernel (Haussler, 1999)


Page 16: Slides4

Answer 2: Decomposition kernel

Definitions:

Let $G = (V, E)$ be a graph and $r \ge 0$. Then $N_r^v(G)$ is the neighborhood subgraph of $G$ of radius $r$: the subgraph rooted at $v$ and induced by the set of vertices whose distance from $v$ does not exceed $r$.

$R_{r,d}$ is the neighborhood-pair relation: it indicates that the distance between the roots of two neighborhood subgraphs of radius $r$ is exactly $d$.

The decomposition kernel is

$$k_{r,d}(G, G') = \sum_{\substack{A, B \in R^{-1}_{r,d}(G) \\ A', B' \in R^{-1}_{r,d}(G')}} \mathbf{1}(A \cong A') \, \mathbf{1}(B \cong B')$$

where $R^{-1}_{r,d}(G)$ denotes all possible pairs of neighborhood subgraphs of radius $r$ in $G$ whose roots are at distance $d$.


Page 17: Slides4

Answer 2: NSPDK

Then NSPDK (non-normalized form) is:

$$K(G, G') = \sum_{r} \sum_{d} k_{r,d}(G, G')$$

For efficiency reasons, the "upper bound" form is used:

$$K_{r^*,d^*}(G, G') = \sum_{r=0}^{r^*} \sum_{d=0}^{d^*} k_{r,d}(G, G')$$

Normalization of each decomposition kernel:

$$\hat{k}_{r,d}(G, G') = \frac{k_{r,d}(G, G')}{\sqrt{k_{r,d}(G, G) \, k_{r,d}(G', G')}}$$
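
To make the formulas concrete, here is a small Python sketch. It assumes that each graph has already been turned into a table of feature counts keyed by `(r, d, code_A, code_B)`, where the two codes identify the isomorphism classes of a pair of neighborhood subgraphs; this layout and the function names are illustrative assumptions, not the NSPDK reference code. Under that assumption the double sum of indicator products collapses to a product of counts per feature. Summing the normalized terms in `nspdk` is likewise an assumption about how the two formulas are combined.

```python
from collections import Counter
from math import sqrt

def k_rd(feats_g, feats_h, r, d):
    """k_{r,d}: count matching pairs of neighborhood subgraphs for one (r, d)."""
    return sum(count * feats_h.get(key, 0)
               for key, count in feats_g.items()
               if key[0] == r and key[1] == d)

def k_rd_normalized(feats_g, feats_h, r, d):
    """Normalized k_{r,d}; defined as 0 when either graph has no such feature."""
    denom = sqrt(k_rd(feats_g, feats_g, r, d) * k_rd(feats_h, feats_h, r, d))
    return k_rd(feats_g, feats_h, r, d) / denom if denom > 0 else 0.0

def nspdk(feats_g, feats_h, r_max, d_max):
    """Upper-bound form: sum of normalized k_{r,d} for r <= r_max, d <= d_max."""
    return sum(k_rd_normalized(feats_g, feats_h, r, d)
               for r in range(r_max + 1)
               for d in range(d_max + 1))

# Toy feature tables (hypothetical codes 11, 12, ...).
g = Counter({(0, 1, 11, 12): 2, (1, 1, 21, 22): 1})
h = Counter({(0, 1, 11, 12): 1, (0, 1, 13, 14): 1})
similarity = nspdk(g, h, r_max=1, d_max=1)
```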


Page 18: Slides4

Answer 2: NSPDK

A full isomorphism test is computationally expensive. The task is reduced to:

1. String encoding: uses a technique based on the distance information between pairs of vertices, and can be computed in O(|V|).
2. Hashing procedure: the string is encoded as an integer code (Costa and De Grave, 2010).
3. Isomorphism test: an equality test between the integer codes.


Page 19: Slides4

Answer 3: Explicit feature representation

NSPDK limits the number of features to $O(r^* d^* |V|^2)$ pairs of neighborhood subgraphs (one feature for each pair of vertices, times each possible combination of values for the radius and the distance), with $r^* \in [0; 5]$ and $d^* \in [0; 10]$.

For sparse graphs, the number of vertices that are reachable within a fixed small distance is typically small, so the dependency on the vertex set size can be more tightly approximated by $O(|V|)$.

Conclusion: each graph is mapped to a sparse vector in $\mathbb{R}^m$, a high-dimensional feature space. The number of non-zero features in this space is linear in $|V|$.


Page 20: Slides4

Alignment-free clustering: MinHash

Some definitions:

- $P$: the set of instances [graphs]
- $x, z, p$: any instance
- $x_j$: an element of an instance
- $N_k(x)$: the set of the $k$ closest instances to $x$

So how do we cluster graphs using NSPDK? Comparing all possible pairs of graphs is too expensive, so we compare only similar graphs.

The k-neighborhood density

$$D_k(x) = \frac{1}{k} \sum_{z \in N_k(x)} K_{r^*,d^*}(x, z)$$

is a good metric for clustering.

To make this as fast as possible, $N_k(x)$ must be found very quickly, ideally in constant time.

The MinHash algorithm helps with this task.
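
For reference, the quantity that MinHash is meant to approximate can be computed by brute force. The sketch below assumes a precomputed kernel matrix (a list of rows) and excludes an instance from its own neighborhood; both choices, and the function name, are assumptions made for illustration. It needs all pairwise kernel values, which is exactly the quadratic cost the slides want to avoid.

```python
def k_neighborhood_density(kernel, i, k):
    """D_k for instance i: mean kernel value to its k most similar other instances."""
    others = sorted(
        (kernel[i][j] for j in range(len(kernel)) if j != i),
        reverse=True,
    )
    nearest = others[:k]  # N_k(x): the k instances with the highest kernel value
    return sum(nearest) / k

# Toy 3 x 3 kernel matrix with made-up similarities.
kernel = [
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.4],
    [0.2, 0.4, 1.0],
]
density = k_neighborhood_density(kernel, 0, k=2)
```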


Page 21: Slides4

Alignment-free clustering: MinHash

MinHash algorithm:

1. Binarize all instances $P = \{p_1, \dots, p_n\}$: $\mathbb{R}^m \mapsto \{0, 1\}^m$.
2. Choose random hash functions $f_i : \mathbb{N} \mapsto \mathbb{N}$ such that $\forall x_j \ne x_k$: $f_i(x_j) \ne f_i(x_k)$ and $P(f_i(x_j) \le f_i(x_k)) = 1/2$.
3. Derive the hash functions $h_i(x) = \arg\min_{x_j \in x} f_i(x_j)$ [1]. They return the first feature indicator under a random permutation of the feature order.
   Useful facts:
   - $P(h_i(x) = h_i(z)) = \frac{|x \cap z|}{|x \cup z|} = s(x, z)$ is the Jaccard similarity: the fraction of features common to $x$ and $z$ over the total number of non-null features of $x$ and $z$.
   - The error rate is $\gamma = O(1/\sqrt{N})$, where $N$ is the number of hash functions.
4. Build the tuple $\langle h_1(x), \dots, h_N(x) \rangle$.
5. Collect the set of instances for $h^* = h_i(x)$ as $Z_i(h^*) = \{z \in P : h_i(z) = h^*\}$.
6. $Z = \{Z_i\}_{i=1}^{N}$ is the approximate neighborhood, which can be used to find $N_k(x)$.

These steps can be performed in constant time ($N$ is fixed, and $Z$ is independent of the dataset size $|P|$).

[1] These hash functions are called min-hash functions (Broder, 1997).
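
Steps 4 to 6 amount to an inverted index from min-hash values to instances. The following Python sketch shows one way to build and query such an index. The function names are hypothetical, instances are assumed to be sets of integer feature ids smaller than 2^31 - 1, and random affine maps modulo a prime stand in for the random permutations, which is a common approximation rather than something the slides specify.

```python
import random
from collections import Counter, defaultdict

PRIME = 2_147_483_647  # 2**31 - 1; feature ids are assumed to be smaller than this

def make_hash_functions(n, seed=0):
    """n random injective maps f_i(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    params = [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(n)]
    return [lambda x, a=a, b=b: (a * x + b) % PRIME for a, b in params]

def signature(features, hash_functions):
    """Step 4: the tuple <h_1(x), ..., h_N(x)>, each h_i being an argmin over features."""
    return tuple(min(features, key=f) for f in hash_functions)

def build_index(instances, hash_functions):
    """Step 5: for every i, map each min-hash value h* to the set Z_i(h*)."""
    index = [defaultdict(set) for _ in hash_functions]
    for name, features in instances.items():
        for i, h in enumerate(signature(features, hash_functions)):
            index[i][h].add(name)
    return index

def approximate_neighbors(features, index, hash_functions):
    """Step 6: collect the Z_i and count how often each instance collides with x."""
    votes = Counter()
    for i, h in enumerate(signature(features, hash_functions)):
        votes.update(index[i].get(h, ()))
    return votes

functions = make_hash_functions(50)
data = {"a": {1, 2, 3, 4}, "b": {1, 2, 3, 5}, "c": {100, 200, 300}}
index = build_index(data, functions)
votes = approximate_neighbors(data["a"], index, functions)
```

Dividing a vote count by the number of hash functions estimates the Jaccard similarity, so the instances with the most votes are natural candidates for $N_k(x)$. Instances that share no feature with the query, such as `"c"` here, never collide with it.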

Page 22: Slides4

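MinHash: Demo Time

    import functools
    import random

    def generic_hash(s, seed):
        # Polynomial rolling hash of a string, truncated to 32 bits.
        result = 1
        for char in s:
            result = (seed * result + ord(char)) & 0xFFFFFFFF
        return result

    def hashgen(seed):
        # Bind one seed to get one hash function of the family.
        return functools.partial(generic_hash, seed=seed)

    def signature(items, functions):
        # MinHash signature: the minimum hash value under each function.
        return [min(map(f, items)) for f in functions]

    def similarity(sig_a, sig_b, num_funcs):
        # The fraction of matching positions estimates the Jaccard similarity.
        return sum(x == y for x, y in zip(sig_a, sig_b)) / num_funcs

    def minhash(max_error):
        # About 1 / max_error**2 functions give an error of about max_error.
        num_funcs = int(1 / max_error ** 2)
        # Distinct odd seeds, so that no two hash functions coincide.
        seeds = random.sample(range(33, 2 ** 31, 2), num_funcs)
        functions = [hashgen(seed) for seed in seeds]
        return (functools.partial(signature, functions=functions),
                functools.partial(similarity, num_funcs=num_funcs))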


Page 23: Slides4

GraphClust pipeline: Phase 1

Phase 1. Preprocessing

- Get RNA sequences from RNA-seq or computational methods, and mask all repeats to avoid clusters made of genomic repeats.
- Split long sequences into smaller fragments (windows) [1].
- Near-identical sequences are removed using blastclust [2].

[1] To enable the detection of small signals.
[2] The program begins with pairwise matches and places a sequence in a cluster if the sequence matches at least one sequence already in the cluster. The Megablast algorithm is used.


Page 24: Slides4

GraphClust pipeline: Phase 2

Phase 2. Structure determination

Try a set of structures for each window with the RNAshapes tool.

- A 150 nt sequence is turned into a set of disconnected graphs with ≈ 2500 vertices in total.
- O = 20%; W = 30, 150; l = 3
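
As an illustration of the graph encoding, the Python sketch below converts one predicted structure into a labeled graph. It assumes the structure is given in dot-bracket notation without pseudoknots; the edge labels `"backbone"` and `"basepair"` and the function name are hypothetical choices, not GraphClust's actual label set.

```python
def structure_to_graph(sequence, dot_bracket):
    """Return (vertex_labels, edges) for one secondary structure.

    Vertices are nucleotide positions labeled with the nucleotide type;
    edges are (i, j, bond_type) triples.
    """
    if len(sequence) != len(dot_bracket):
        raise ValueError("sequence and structure must have the same length")
    labels = dict(enumerate(sequence))
    # Consecutive nucleotides are linked by backbone bonds.
    edges = [(i, i + 1, "backbone") for i in range(len(sequence) - 1)]
    open_positions = []
    for i, symbol in enumerate(dot_bracket):
        if symbol == "(":
            open_positions.append(i)
        elif symbol == ")":
            if not open_positions:
                raise ValueError("unbalanced structure")
            # A closing bracket pairs with the most recent unmatched opening one.
            edges.append((open_positions.pop(), i, "basepair"))
    if open_positions:
        raise ValueError("unbalanced structure")
    return labels, edges

labels, edges = structure_to_graph("GGGAAACCC", "(((...)))")
```

Encoding the l structures of every window this way, with vertex ids offset so that they do not clash, gives the set of disconnected components described above.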


Page 25: Slides4

GraphClust pipeline: Phase 3

Phase 3. Parallel encoding

Encode the set of structures as sparse vectors by computing the features of $K_{r^*,d^*}$.

- A 150 nt sequence is turned into a sparse vector with ≈ 8000 features.
- $r^* = 2$; $d^* = 4$


Page 26: Slides4

GraphClust pipeline: Phase 4

Phase 4. Candidate clusters

- $c_i$: candidate clusters
- the $c_i$ are sorted so that $\forall i < j$, $D(c_i) > D(c_j)$
- the union of candidate clusters is built as

$$C_j = \bigcup_{i=1}^{j} c_i$$

with $c_k$ greedily discarded if $|C_j \cap c_k| >$ threshold.
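
The greedy selection can be sketched in a few lines of Python. The sketch assumes each candidate comes as a `(D, members)` pair, where `D` is its score and `members` a set of sequence ids, and that the threshold is an absolute number of shared members, as the formula suggests; the function name is hypothetical.

```python
def select_candidates(candidates, threshold):
    """Keep candidates in order of decreasing D, skipping those that overlap too much."""
    selected = []
    covered = set()  # C_j: union of the candidates accepted so far
    for _, members in sorted(candidates, key=lambda c: c[0], reverse=True):
        if len(covered & members) > threshold:
            continue  # discard c_k: too many members are already covered
        selected.append(members)
        covered |= members
    return selected

candidates = [(0.7, {5, 6}), (0.9, {1, 2, 3}), (0.8, {2, 3, 4})]
kept = select_candidates(candidates, threshold=1)
```

With a threshold of 1, the candidate `{2, 3, 4}` is dropped because it shares two members with the denser candidate `{1, 2, 3}`.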


Page 27: Slides4

GraphClust pipeline: Phase 5

Phase 5. Parallel cluster refinement

Candidate clusters contain sequences that are similar under the NSPDK measure.

Each cluster is refined with the sequence-structure alignment tool LocARNA (Will et al., 2007). Only the top-ranked RNAs in each cluster are kept.


Page 28: Slides4

GraphClust pipeline: Phase 6

Phase 6. Parallel cluster model

The selected top-ranked RNA candidates are realigned with the tool LocARNA-P (Will et al., 2012). This identifies the highly trusted alignment region and estimates the borders of the common local motif.

Finally, a covariance model (CM) is created by applying the INFERNAL tool (Nawrocki et al., 2009) to the identified subsequence. Every cluster member gets its CM bitscore [1].

[1] A log-scaled version of the score.

Page 29: Slides4

GraphClust pipeline: Phase 7

Phase 7. Model scanning

The covariance model of each cluster is used to find additional cluster members in the full residual dataset of sequences, by CM bitscore or E-value [1].

Some sequences can be assigned to multiple clusters.

[1] In the authors' terms, the number of instances with a score equal to or better than that of the current instance.


Page 30: Slides4

GraphClust pipeline: Phase 8

Phase 8. Iteration and removal

All cluster members found in the previous phase are removed from the dataset. A new iteration starts from Phase 4.

The termination conditions are:

- a predetermined maximum number of iterations
- a time limit
- an empty dataset
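
The outer loop with its three termination conditions might look like the Python sketch below. The callback `find_clusters` is a hypothetical stand-in for Phases 4 to 7, and stopping when an iteration finds nothing is an extra safeguard added here, not a condition listed on the slide.

```python
import time

def iterate_clustering(dataset, find_clusters, max_iterations, time_limit_s):
    """Repeat clustering on the residual dataset until a stop condition holds."""
    remaining = set(dataset)
    clusters = []
    started = time.monotonic()
    for _ in range(max_iterations):              # condition 1: iteration cap
        if time.monotonic() - started > time_limit_s:
            break                                # condition 2: time limit
        if not remaining:
            break                                # condition 3: empty dataset
        found = find_clusters(remaining)
        if not found:
            break                                # assumption: nothing new, so stop
        for cluster in found:
            clusters.append(cluster)
            remaining -= cluster                 # remove members before the next round
    return clusters, remaining

def toy_find_clusters(remaining):
    # Hypothetical stand-in: group the two smallest ids into one cluster.
    return [set(sorted(remaining)[:2])]

clusters, left = iterate_clustering(range(5), toy_find_clusters, 10, 60.0)
```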


Page 31: Slides4

GraphClust pipeline: Phase 9

Phase 9. Postprocessing

- Merge clusters: for every pair of clusters, we compute the relative overlap and merge them if it is more than 50%.
- In-cluster postprocessing: the sequences in a cluster are finally ranked by their CM bitscore.
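
The merging step can be sketched as follows. The slides do not define "relative overlap"; the sketch assumes it is the number of shared members divided by the size of the smaller cluster, and the function names are hypothetical.

```python
def relative_overlap(a, b):
    """Shared members relative to the smaller cluster (an assumed definition)."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

def merge_clusters(clusters, threshold=0.5):
    """Merge pairs of clusters until no pair overlaps by more than `threshold`."""
    merged = [set(c) for c in clusters]
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if relative_overlap(merged[i], merged[j]) > threshold:
                    merged[i] |= merged.pop(j)
                    changed = True
                    break
            if changed:
                break  # restart the scan, since the merged cluster has grown
    return merged

result = merge_clusters([{1, 2, 3}, {2, 3, 4}, {7, 8}])
```

In this example the first two clusters share two of three members, so they are merged, while `{7, 8}` stays separate.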


Page 32: Slides4

Results

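| Set | Species | Type | Method | Input | Time | Clusters |
|---|---|---|---|---|---|---|
| Benchmark | Bacteria | Small ncRNAs | Misc | 363 | 6.8h | 39 |
| Benchmark | Human | Predicted RNA elements | EvoFam | 699 | 0.3h | 37 |
| Benchmark | Misc | Small ncRNAs | Rfam | 3 900 | 36h | 130 |
| De-novo discovery | Fugu | LincRNAs | RNA-seq | 5 877 | 10.3h | 99 |
| De-novo discovery | Fugu | Predicted RNA elements | RNAz | 11 287 | 13.3h | 97 |
| De-novo discovery | Fruit fly | Predicted RNA elements | RNAz | 17 765 | 20.4h | 95 |
| De-novo discovery | Human | LincRNAs | RNA-seq | 31 418 | 3.6d | 95 |
| De-novo discovery | Human | Predicted RNA elements | EvoFold | 37 258 | 5.7d | 117 |
| De-novo discovery | Human | 3’UTRs | RefSeq | 118 514 | 12.8d | 106 |
| Total | | | | 227 081 | 25.7d | 815 |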

Clustering 3901 sequences using LocARNA takes ∼ 370 days.


Page 33: Slides4

Q&A

Thank you for your attention! Q&A
