Parallel EST Clustering by Kalyanaraman, Aluru, and Kothari. Nargess Memarsadeghi, CMSC 838 Presentation
Transcript
Page 1: 1 Parallel EST Clustering by Kalyanaraman, Aluru, and Kothari Nargess Memarsadeghi CMSC 838 Presentation.

Parallel EST Clustering
by Kalyanaraman, Aluru, and Kothari

Nargess Memarsadeghi
CMSC 838 Presentation


CMSC 838T – Presentation 2

Talk Overview

• Overview of talk
• Motivation
• Background
• Techniques
• Evaluation
• Related work
• Observations


Motivation: EST Clustering

• Problem: EST clustering
  – Cluster fragments of cDNA
  – Related to the 'fragment assembly' problem
  – Detecting overlapping fragments
• Overlaps can be computed:
  – Pairwise alignment algorithm
  – Dynamic programming
• Alternative:
  – Approximate overlap detection algorithms
  – Dynamic programming
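The dynamic-programming overlap computation mentioned on this slide can be sketched as follows. This is a minimal illustration, not the alignment routine of any tool discussed in the talk: the function name `overlap_score` and the scoring values (match +1, mismatch -1, gap -2) are assumptions chosen for the example. It scores the best alignment of some suffix of one fragment against some prefix of another.

```python
def overlap_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Best score aligning some suffix of `a` with some prefix of `b`."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j]: best score of aligning a suffix of a[:i] with the prefix b[:j]
    score = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):
        score[0][j] = j * gap  # a prefix of b aligned against nothing costs gaps
    # score[i][0] stays 0: the overlap may start anywhere in a for free
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # the overlap must reach the end of a; the prefix of b may end anywhere
    return max(score[len(a)])


if __name__ == "__main__":
    # the last four characters of the first fragment equal the first four of the second
    print(overlap_score("ACGTTGCA", "TGCAGGT"))
```

The table has (|a| + 1)(|b| + 1) cells, which is why running this on every pair of fragments is the expensive step that the rest of the talk tries to avoid.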


Motivation

• Common tools:
  – Take too long: days for 100,000 ESTs
  – Run out of memory
• This paper: PaCE (Parallel Clustering of ESTs)
  – Efficient parallel EST clustering
  – Space-efficient algorithm
  – Reduce total work
  – Reduce run-time


Background: EST Clustering Tools

• Three traditional software packages, originally designed for fragment assembly:
  – TIGR Assembler
  – Phrap
  – CAP3
• One parallel software package:
  – UICLUSTER: assumes ESTs are from the 3' end


EST Clustering Tools

• Basic approach:
  – Find pairs of similar sequences
  – Align similar pairs (dynamic programming)
• Quality of EST clustering:
  – Phrap: fastest
    – Avoids dynamic programming
    – Relies on approximation, lower quality
  – CAP3: least # of erroneous clusters


EST Clustering Tools’ Performance

• With 50,000 maize ESTs, using a PC with dual Pentium 450 MHz processors and 512 MB RAM:
  – TIGR Assembler: ran out of memory
  – Phrap: 40 min
  – CAP3: > 24 hours
• With 100,000 maize ESTs:
  – All ran out of memory
  – CAP3 would require 4 days


Goal

• Space-efficient algorithm
  – Space requirement linear in the size of the input data set
• Reduce total work
  – Without sacrificing quality of clustering
• Reduce run-time and facilitate the clustering of large data sets
  – Through parallel processing
  – Scale memory with # of processors


Approach

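• Expense: pairwise alignment (time + memory)
  – Promising pairs ≈ $(\#\,\mathrm{EST})^2$
• Common string: |s| = w
  – Cost: if the common string has length |s| = l > w, it contains l − w + 1 windows of length w, so the same pair repeats l − w + 1 times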
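The l − w + 1 count can be checked with a small sketch. The helper name `shared_windows` and the two toy sequences are assumptions made up for illustration; the function counts how many length-w windows of one sequence also occur in the other, which is how often a fixed-window filter would report the same pair.

```python
def shared_windows(a: str, b: str, w: int) -> int:
    """Count positions in `a` whose length-w window also occurs somewhere in `b`."""
    windows_b = {b[i:i + w] for i in range(len(b) - w + 1)}
    return sum(1 for i in range(len(a) - w + 1) if a[i:i + w] in windows_b)


if __name__ == "__main__":
    common = "ACGTTGCA"                  # shared substring, l = 8
    a = "CCC" + common + "CCC"
    b = "GGG" + common + "GGG"
    w = 4
    print(shared_windows(a, b, w), len(common) - w + 1)  # both counts are 5
```

Reporting each pair once, at its longest shared substring, avoids this repeated work.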


Approach (Cont ..)

• Approach:
  – Use trie structure
  – Identify promising pairs
  – Merge clusters with strong overlaps
  – Avoid storing/testing all similar pairs
• Parallel EST clustering software:
  – Generalized Suffix Tree (GST)
  – Multiple processors:
    – One maintains and updates EST clusters
    – Others generate batches of promising pairs and perform alignment


Approach (Cont …)


Tries

1) Index for each char
2) N leaves
3) Height N
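The three properties can be checked on a small example. The sketch below assumes they describe the uncompacted suffix trie of a string of N characters that ends in a unique terminator `$`; that reading, the nested-dict representation, and the helper names are assumptions for illustration.

```python
def build_suffix_trie(text: str) -> dict:
    """Uncompacted suffix trie as nested dicts; `text` should end in a unique '$'."""
    root: dict = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})  # one child index per character
    return root


def count_leaves(node: dict) -> int:
    """A node without children is a leaf."""
    return 1 if not node else sum(count_leaves(child) for child in node.values())


def height(node: dict) -> int:
    """Number of edges on the longest root-to-leaf path."""
    return 0 if not node else 1 + max(height(child) for child in node.values())


if __name__ == "__main__":
    trie = build_suffix_trie("abab$")  # N = 5 characters including the terminator
    print(count_leaves(trie), height(trie))  # one leaf per suffix; height equals N
```

Because the terminator occurs only once, no suffix is a prefix of another, so each of the N suffixes ends at its own leaf.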


Suffix Tries (Cont ..)

1) TRIM suffix trie


Suffix Tries (Cont ..)

1) Indices
2) Storage O(n), though the constant is high
3) Common string
4) Longest common substring


Suffix Tries (Cont ..)

[Figure: suffix tree with edge labels "ab", "ab", "$", "ab$", "b", "$", "$" and leaves numbered 1 to 5]

Given a pattern P = ab we traverse the tree according to the pattern.
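A minimal sketch of this traversal follows. It assumes the figure shows the suffix tree of the string `abab$` (an inference from the five numbered leaves and the edge labels) and uses an uncompacted trie of nested dicts rather than a compacted tree; the helper names are hypothetical.

```python
def build_suffix_trie(text: str) -> dict:
    """Uncompacted suffix trie; each leaf stores its 1-based suffix number."""
    root: dict = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start + 1  # multi-character key, cannot clash with an edge label
    return root


def find_occurrences(root: dict, pattern: str) -> list[int]:
    """Follow the pattern from the root, then collect every leaf below that point."""
    node = root
    for ch in pattern:
        if ch not in node:
            return []  # the pattern does not occur in the text
        node = node[ch]
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == "leaf":
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)


if __name__ == "__main__":
    trie = build_suffix_trie("abab$")
    print(find_occurrences(trie, "ab"))  # [1, 3]: suffixes 1 and 3 start with "ab"
```

The walk from the root costs time proportional to the pattern length, independent of the text length; collecting the leaves costs time proportional to the size of the subtree below.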


Parallel Generation of GST

• GST: Generalized Suffix Tree
  – Compacted trie
  – Longest common prefix found in constant time
  – Used for on-demand pair generation
  – Sequential: O(nl)
  – Parallel: O(nl/p)


Parallel Generation of GST (Cont …)

• Previous implementations:
  – CRCW/CREW PRAM model
  – Work-optimal
  – Involves alphabetical ordering of characters
  – Unrealistic assumptions:
    – Synchronous operation of processors
    – Infinite network bandwidth
    – No memory contention
  – Not practically efficient


Parallel Generation of GST (Cont …)

• Paper's approach:
  – ESTs equally distributed among processors
  – Each processor partitions suffixes of ESTs into $|\Sigma|^{w}$ buckets
  – Distribute buckets to the processors:
    – All suffixes in a bucket allocated to the same processor
    – Total # of suffixes allocated to a processor ≈ $O\left(\frac{nl}{p}\right)$
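The bucketing step can be sketched in a few lines. The sketch assumes buckets are keyed by the first w characters of each suffix; the greedy largest-first assignment in `assign_buckets` is a hypothetical balancing rule chosen for illustration, since the slides do not say how buckets are assigned to processors. Function names are likewise made up.

```python
from collections import defaultdict


def bucket_suffixes(ests: list[str], w: int) -> dict[str, list[tuple[int, int]]]:
    """Group suffixes, stored as (EST index, start offset), by their first w characters."""
    buckets = defaultdict(list)
    for est_id, est in enumerate(ests):
        for start in range(len(est) - w + 1):  # suffixes shorter than w are skipped
            buckets[est[start:start + w]].append((est_id, start))
    return dict(buckets)


def assign_buckets(buckets: dict[str, list[tuple[int, int]]], p: int) -> list[list[str]]:
    """Hypothetical balancing: give the largest remaining bucket to the least-loaded processor."""
    loads = [0] * p
    owned: list[list[str]] = [[] for _ in range(p)]
    for key in sorted(buckets, key=lambda k: len(buckets[k]), reverse=True):
        target = loads.index(min(loads))
        owned[target].append(key)  # a bucket is never split across processors
        loads[target] += len(buckets[key])
    return owned


if __name__ == "__main__":
    buckets = bucket_suffixes(["ACGTAC", "GTACGT"], w=2)
    for rank, keys in enumerate(assign_buckets(buckets, p=2)):
        print(rank, keys, sum(len(buckets[k]) for k in keys))
```

Keeping every suffix of a bucket on one processor means each bucket can be turned into a subtree of the GST without communication.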


Parallel Generation of GST (Cont …)

• Each bucket's processor:
  – Computes the compacted trie of all its suffixes
  – Cannot use sequential construction: the suffixes of a string are not all in the same bucket
• Each bucket: a subtree in the GST
• Nodes:
  – Depth-first search traversal of the trie
  – Pointer to the rightmost child


On-demand Pair Generation

• A pair should be generated if the two ESTs:
  – Share a substring of length ≥ threshold
  – The shared substring is maximal
• Leaves in a common node:
  – Share a substring of length = depth of node
• Parallel algorithm:
  – Each processor works with its trie if the depth of its root in the GST < threshold


On-demand Pair Generation

• To process:
  – Sort internal nodes in decreasing order of depth
• Lists of a node:
  – Generated after the node is processed
  – Removed after its parent is processed
  – Limits space to O(nl)
• Run time ≈ # pairs generated + cost of sorting
• Rejected pairs increase run-time by a factor of 2
• Eliminating duplicates reduces run-time
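The ordering idea (deepest nodes first, duplicates dropped) can be illustrated with a brute-force stand-in. The sketch below is not the tree-based algorithm from the paper: it enumerates every substring of length at least the threshold as a "node", so it uses far more than O(nl) space, and it only checks that the shared substring cannot be extended to the right. The function name and toy sequences are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def promising_pairs(ests: list[str], threshold: int):
    """Yield (i, j, length) for ESTs sharing a right-maximal substring of at least
    `threshold` characters, longest first; each unordered pair is reported once."""
    # node (substring) -> next character -> ids of the ESTs that continue that way
    children = defaultdict(lambda: defaultdict(set))
    for est_id, est in enumerate(ests):
        for start in range(len(est)):
            for end in range(start + threshold, len(est) + 1):
                # a per-EST end marker keeps suffix ends in separate children
                nxt = est[end] if end < len(est) else ("$", est_id)
                children[est[start:end]][nxt].add(est_id)
    seen = set()
    # visit nodes in decreasing order of string depth
    for node in sorted(children, key=len, reverse=True):
        for left, right in combinations(children[node].values(), 2):
            for i in left:
                for j in right:
                    pair = (min(i, j), max(i, j))
                    if i != j and pair not in seen:
                        seen.add(pair)  # duplicate elimination
                        yield pair[0], pair[1], len(node)


if __name__ == "__main__":
    ests = ["AAACGTACGG", "TTACGTACCC", "GGGGGGTTTT"]
    print(list(promising_pairs(ests, threshold=4)))  # [(0, 1, 6)]
```

Because the function is a generator, a caller can ask for the next batch of pairs only when it needs one, which mirrors the on-demand behaviour described on the slides.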


Parallel Clustering

• Master-slave paradigm:
  – Master processor:
    – Maintains and updates clusters, using a union-find data structure
    – Receives messages from slave processors:
      – A batch of next promising pairs generated by the slave
      – Results of the pairwise alignment
    – Determines which ones to explore
    – Determines if merging should occur
  – Slave processors:
    – Generate pairs on demand
    – Perform pairwise alignments of pairs dispatched by the master processor
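The master's bookkeeping can be sketched with a standard union-find structure. This is a minimal single-process illustration with no message passing; the class, the helper `select_pairs_to_align`, and the rule "skip a pair whose ESTs are already in one cluster" are assumptions about how "determines which ones to explore" might look, not code from PaCE.

```python
class UnionFind:
    """Disjoint sets over EST indices 0..n-1."""

    def __init__(self, n: int) -> None:
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x: int) -> int:
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a: int, b: int) -> bool:
        """Merge the clusters of a and b; return False if they were already merged."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra  # attach the smaller tree under the larger one
        self.size[ra] += self.size[rb]
        return True


def select_pairs_to_align(clusters: UnionFind, batch: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Keep only the pairs whose ESTs are still in different clusters."""
    return [(i, j) for i, j in batch if clusters.find(i) != clusters.find(j)]


if __name__ == "__main__":
    clusters = UnionFind(5)
    clusters.union(0, 1)  # an alignment result said ESTs 0 and 1 overlap strongly
    clusters.union(1, 2)
    print(select_pairs_to_align(clusters, [(0, 2), (3, 4), (2, 1)]))  # [(3, 4)]
```

Filtering before dispatch is what lets the master avoid alignments whose outcome could no longer change the clustering.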


Parallel Clustering (Cont…)

Organization of Parallel Clustering Software

[Figure: one master processor (Master P) exchanging messages with several slave processors (Slave P)]

• Batch of promising pairs generated + results of pairwise alignment
• Batchsize or fewer # of pairs + results of pairwise alignment on each pair


Parallel Clustering (Cont..)

• To start: Slave P starts with 3 × batchsize pairs
  – Sends the 3rd batch to Master P
  – Starts alignment on the 1st batch
  – Sends results on the 1st batch + a newly generated batch
  – While waiting to receive results from Master P, aligns the 2nd batch
• Processor always has the next batch to work on between:
  – Submitting the results of the previous batch
  – Receiving another set of pairs


Parallel Clustering (Cont..)

• Improve and control quality
• Parameters:
  – Match and mismatch scores
  – Gap penalties
• Post-processing:
  – Detection of alternative splicing
  – Consulting protein databases
  – Organism specific


Experimental environment

• Used C and MPI
• Tested:
  – Quality of software: Arabidopsis thaliana (due to availability of its genome)
  – Run-time behavior: 50,000 maize ESTs with a 32-processor IBM SP
    – # of processors
    – Data size
    – (# of promising pairs) vs data size
    – Batchsize vs (# of processors)
    – # of clusters
    – Master processor's time


Quality Assessment

• To assess quality: need a data set and its correct clustering
  – ESTs from the plant Arabidopsis thaliana
  – Splice program:
    – Align ESTs to the genome
    – Discard ESTs that:
      – Don't align
      – Align in multiple spots


Quality Assessment (Cont …)

• False negative: a pair in the correct clustering is not paired in the output
  – 5%
• False positive: a pair not in the correct clustering appears in the results
  – Negligible (< 0.04%)
  – Due to the conservative nature of the algorithm
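These two definitions translate directly into set operations on co-clustered pairs. The sketch below counts both kinds of error for a toy example; the function names and the example clusterings are made up, and it returns raw counts because the slides do not state which denominator the percentages use.

```python
from itertools import combinations


def co_clustered_pairs(clusters: list[set[int]]) -> set[frozenset[int]]:
    """All unordered pairs of ESTs that share a cluster."""
    return {frozenset(pair) for cluster in clusters for pair in combinations(sorted(cluster), 2)}


def pair_errors(correct: list[set[int]], predicted: list[set[int]]) -> tuple[int, int]:
    """Return (false negatives, false positives) counted over EST pairs."""
    truth = co_clustered_pairs(correct)
    output = co_clustered_pairs(predicted)
    false_neg = truth - output  # paired in the correct clustering, not in the output
    false_pos = output - truth  # paired in the output, not in the correct clustering
    return len(false_neg), len(false_pos)


if __name__ == "__main__":
    correct = [{1, 2, 3}, {4, 5}, {6}]
    predicted = [{1, 2}, {3}, {4, 5, 6}]
    print(pair_errors(correct, predicted))  # (2, 2)
```

Splitting a true cluster produces false negatives, while merging unrelated ESTs produces false positives, so a conservative merge rule trades the second kind of error for the first.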


Quality Assessment

Cluster results   Number of singleton clusters   Number of non-singleton clusters
Benchmark         10,803                         18,727
CAP3              17,930                         17,556
PaCE              14,802                         19,536

Distribution of the number of singleton and non-singleton clusters for the benchmark set of 168,200 Arabidopsis ESTs.


Quality Assessment (Cont..)


Run-time Assessment

• Experiment with 50,000 maize ESTs:
  – 32-processor IBM SP-2
  – 16 minutes


Run-time Assessment (Cont …)

p     Preprocessing   Clustering   Total
4     273             102          375
8     119             50           169
16    61              26           87
32    38              15           53
64    29              10           39

Run-time (in seconds) spent in various components of PaCE for 20,000 ESTs. p: number of processors.
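The scaling in this table can be summarised as speedup and efficiency relative to the smallest configuration. The sketch below uses only the Total column; measuring against p = 4 (rather than a single-processor run, which the table does not include) and the function names are choices made for illustration.

```python
# total run-time in seconds per processor count, copied from the table above
TOTAL_SECONDS = {4: 375, 8: 169, 16: 87, 32: 53, 64: 39}


def relative_speedup(times: dict[int, float], base: int) -> dict[int, float]:
    """Speedup of each configuration relative to the run with `base` processors."""
    return {p: times[base] / t for p, t in times.items()}


def relative_efficiency(times: dict[int, float], base: int) -> dict[int, float]:
    """Speedup divided by the growth in processor count (1.0 means ideal scaling)."""
    return {p: (times[base] / t) / (p / base) for p, t in times.items()}


if __name__ == "__main__":
    speedup = relative_speedup(TOTAL_SECONDS, base=4)
    efficiency = relative_efficiency(TOTAL_SECONDS, base=4)
    for p in TOTAL_SECONDS:
        print(f"p={p:2d}  speedup={speedup[p]:.2f}  efficiency={efficiency[p]:.2f}")
```

Going from 4 to 64 processors cuts the total from 375 s to 39 s, a speedup of about 9.6 against an ideal of 16, so the efficiency relative to p = 4 is roughly 0.6 at 64 processors.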


Run-time Assessment (Cont ..)

• Run-time as a function of batchsize:
  – Small batchsize: increase in communication overhead
  – Large batchsize:
    – Slaves less responsive to the need of generating pairs
    – Slave does not use the latest clustering results
  – Optimal batchsize: determined by experiment
• Master processor's time:
  – Fixed batchsize, increase in # of processors
  – Gradual increase in Master P's time
  – With 32 processors, increase < 1%
  – Using 1 master processor is not a bottleneck


Results

• Space: linear in the size of the input data set
• Reduced total work without sacrificing quality
• Reduced run-time:
  – Parallel processors
  – Eliminating pairs
• Facilitate clustering:
  – Scale memory with # of processors


Observations

• PaCE:
  – Approaches the EST clustering problem directly
  – Better than:
    – CAP3
    – Phrap
    – TIGR Assembler
• Compare time/quality with:
  – TGICL (TIGR Gene Indices Clustering Tools): support for PVM, MegaBlast
  – STACK
• Large data sets
  – Lots of processors
• Can clustering time be improved?
  – Clustering algorithm


References

http://www.cs.berkeley.edu/~kubitron/courses/cs258-S02/lectures/eval10-logp.pdf

A. Apostolico, C. Iliopoulos, G. M. Landau, B. Schieber, and U. Vishkin. Parallel construction of a suffix tree with applications. Algorithmica, 3:347–365, 1988.

