AGENT BASED PARALLELIZATION OF BIOLOGICAL NETWORK MOTIF DETECTION

Saranya Duraisamy

Capstone Progress Report

Master of Science in Computer Science & Software Engineering

University of Washington, Bothell

March 18, 2020

Project Committee:

Dr. Munehiro Fukuda, Committee Chair

Dr. Wooyoung Kim, Committee Member

Dr. Clark Olson, Committee Member


TABLE OF CONTENTS

List of Figures
List of Tables
Chapter 1. Introduction
Chapter 2. Related Works
    2.1 Sequential Network Motif Detection
    2.2 Parallel Network Motif Detection
    2.3 MASS-based Parallel Network Motif Detection
Chapter 3. Methods
    3.1 Agent-Based Network Motif Detection
    3.2 System Flow
    3.3 Performance Improvement
Chapter 4. Results
    4.1 Execution Environment
    4.2 Performance Analysis
Chapter 5. Conclusion & Future Work
Bibliography


LIST OF FIGURES

Figure 3.1. Network Motif Detection Process
Figure 3.2. Maximum Motifs for Graphs up to 10 Vertices [1]
Figure 3.3. Subgraph Enumeration Algorithm [7]
Figure 3.4. Subgraph Enumeration Tree [7]
Figure 3.5. System Flow for Agent-Based Network Motif Detection
Figure 3.6. Graph Ordering Visualization
Figure 3.7. Execution Screenshot of MASS Network Motif Detection
Figure 4.1. Dolphin Network
Figure 4.2. Comparison of MASS Motif Synchronous vs MASS Network Motif
Figure 4.3. MASS Parallel Performance Analysis
Figure 4.4. MASS Performance Tuning Evaluation
Figure 4.5. Parallel I/O Graph
Figure 4.6. Sequential vs Parallel Performance

LIST OF TABLES

Table 3.1. Graph vertices rearrangement based on vertex degree
Table 4.1. Real Network Datasets
Table 4.2. Input graph size for different formats


Chapter 1. INTRODUCTION

Network motifs are recurrent and statistically significant patterns, or subgraphs, in biological networks [1]. A k-sized network motif is an induced subgraph on k vertices that occurs more frequently in the target network than other k-vertex subgraphs. Network motif detection and analysis have led to the discovery of previously unidentified biological interactions. Network motif detection involves computationally intensive subgraph enumeration, random graph generation, NP-complete subgraph isomorphism testing, and statistical testing. In this work, MASS [2] agent-based parallelization is applied to the computationally expensive subgraph enumeration process. Unlike sequential motif detection tools, which are restricted to the resources of a single machine, parallel motif detection can benefit from the collective memory and compute power of a cluster and thereby aid both the detection of large motifs and the analysis of large networks. The current MASS Network Motif [3] implementation achieved up to a 30x speedup over the previous MASS Motif Synchronous implementation [4], and its execution time decreased by a factor of up to 2 when the number of computing nodes used for the parallel execution was doubled from 4 to 8.

The rest of this report is organized as follows. Chapter 2 reviews the existing sequential and parallel network motif detection tools. Chapter 3 explains the architecture of agent-based motif detection. Chapter 4 presents the experimental results and a comparative analysis of the parallel implementations. Finally, Chapter 5 concludes the report and outlines future work.

Chapter 2. RELATED WORKS

2.1 SEQUENTIAL NETWORK MOTIF DETECTION

M-Finder [5] performs an exhaustive, brute-force network motif search, which leads to long running times and high memory consumption. Owing to this computational complexity, M-Finder can only detect motifs up to size 6. Fast Network Motif Detection (FANMOD) [6] employs the more efficient Enumerate Subgraph (ESU) algorithm [7], which breaks symmetry using vertex identifiers. Unlike M-Finder, ESU enumerates each subgraph only once, and hence it is faster. FANMOD can detect motifs up to size 8 in both undirected and directed networks. NemoLib Java [8] is a general-purpose network motif library used to find motif frequency, motif concentration, and motif-to-instance mappings. Similar to FANMOD, NemoLib [9] uses the ESU algorithm to enumerate subgraphs and relies on nauty labelg [10] to detect graph isomorphisms. The motif size limitation imposed by these sequential tools [5], [6] necessitates the development of parallel tools to discover large motifs and to reduce the detection time in large graphs.

2.2 PARALLEL NETWORK MOTIF DETECTION

The MPI-based Parallel Motif Detection [11] proposed by Wang et al. partitions the network and broadcasts it to workers, which then detect potential motifs in parallel. The master process gathers the results from all workers and deduces the actual motifs with an isomorphism check. This approach was faster than the sequential version only up to motif size 4. Parallel ESU [12] parallelized the recursive subgraph extension calls of the ESU algorithm and achieved linear speedup for gene and metabolic networks. However, it did not achieve linear speedup for neural and protein-protein interaction networks because of the long time taken to combine the final results. The iterative MapReduce ESU [13] parallelization achieved up to a 37x speedup over the sequential version.

2.3 MASS-BASED PARALLEL NETWORK MOTIF DETECTION

In MASS Motif Synchronous [14], Kipps et al. parallelized biological network motif enumeration in three different ways: MASS agent-based, MASS places-based, and MPI-based enumeration. Their experimental results showed the need for a MASS agent management feature to avoid the agent explosion caused by the enumeration of 5.5 million agents. Another research work, MASS NemoProfile construction [15], extended the MASS places-based parallelization [14] to map individual vertices to motifs, which revealed the possibility of achieving better parallelism using MASS.

Chapter 3. METHODS

The network motif detection process involves (1) subgraph enumeration and frequency computation of non-isomorphic patterns in the input graph, (2) random graph generation, (3) subgraph enumeration and frequency computation of non-isomorphic patterns in the random graphs, and (4) statistical testing to determine the significant network motifs, as depicted in Figure 3.1.


Figure 3.1. Network Motif Detection Process.

3.1 AGENT-BASED NETWORK MOTIF DETECTION

As evident from Figure 3.2, the number of subgraph patterns increases exponentially with the number of graph vertices for both undirected and directed graphs. This drastic increase causes the subgraph enumeration task to consume more time as the graph size or the motif size grows. The agent-based network motif detection approach parallelizes this time-consuming subgraph enumeration task in an intuitive way. Similar to the sequential tools [6], [9], this parallel approach employs the ESU algorithm for the target graph and the randomized ESU (RAND-ESU) algorithm for the random graphs to improve the speed of motif detection.

Figure 3.2. Maximum Motifs for Graphs up to 10 Vertices [1].

Figure 3.3 shows the ESU algorithm, which uses vertex identifiers to generate each subgraph exactly once. The algorithm computes all subgraphs recursively from each vertex by traversing only a limited set of neighbors, namely those whose identifiers are higher than the identifier of the vertex currently being enumerated.


Figure 3.3. Subgraph Enumeration Algorithm [7].
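
The enumeration described above can be sketched in a few lines of Python. This is a minimal sequential illustration of ESU as described in [7], not the MASS implementation; the function name `enumerate_subgraphs` and the dictionary-of-sets graph representation are assumptions made for this example. The "exclusive neighborhood" filter keeps only neighbors that have a higher identifier than the root vertex and are neither in the current subgraph nor adjacent to it.

```python
def enumerate_subgraphs(adj, k):
    """Return every connected k-vertex subgraph of an undirected graph once.

    adj maps each integer vertex identifier to the set of its neighbors.
    Each result is a frozenset of vertex identifiers.
    """
    found = []

    def extend(sub, ext, root):
        if len(sub) == k:
            found.append(frozenset(sub))
            return
        # Vertices already in the subgraph or adjacent to it.
        closed = set(sub)
        for u in sub:
            closed |= adj[u]
        ext = set(ext)  # local copy, so the caller's set is left untouched
        while ext:
            w = ext.pop()  # remove an arbitrary candidate vertex
            # Exclusive neighbors of w: larger than the root and not yet
            # reachable from the current subgraph.
            exclusive = {u for u in adj[w] if u > root and u not in closed}
            extend(sub | {w}, ext | exclusive, root)

    for v in adj:
        extend({v}, {u for u in adj[v] if u > v}, v)
    return found


# Example: a path 0-1-2-3 contains two connected 3-vertex subgraphs.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
print(sorted(sorted(s) for s in enumerate_subgraphs(path, 3)))
```

Because the candidate set only ever admits identifiers larger than the root, each subgraph is produced only from its lowest-numbered vertex, which is what removes duplicates without a separate de-duplication pass.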

In the MASS implementation, each graph vertex is mapped to a MASS place using its zero-indexed vertex identifier. Because the enumeration operations that start from different vertices are independent of each other, as seen in Figure 3.4, agent-based subgraph enumeration from each vertex (place) can be parallelized by migrating mobile agents to neighboring vertices.

Figure 3.4. Subgraph Enumeration Tree [7].

3.2 SYSTEM FLOW

Figure 3.5 shows the system flow used to detect network motifs of a given motif size in an undirected target network. The system comprises six distinct modules: the graph parser, an optional graph ordering module, the target graph analyzer, the random graph generator, the random graph analyzer, and the statistical analyzer.


Figure 3.5. System Flow for Agent-Based Network Motif Detection.


3.2.1 Graph Parser and Graph Ordering Module

The graph parser parses the input graph, which is represented in edge list format, and constructs an adjacency list representation so that every vertex can be initialized with its neighbors. The graph parser then invokes a graph ordering module that can be enabled or disabled at runtime. If enabled, this module rearranges the graph vertices in increasing order of vertex degree (number of neighbors) and then splits the ordered vertices evenly among the computing machines used in the current execution. Figure 3.6 illustrates an original input graph and the corresponding reordered graph for four computing machines.

(a) Original Graph

(b) Reordered Graph

Figure 3.6. Graph Ordering Visualization.
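
The parsing step can be illustrated with a short Python sketch. It assumes a whitespace-separated edge list with one `u v` pair of integer identifiers per line; the function name `parse_edge_list` and the handling of `#` comment lines are assumptions for this example and are not taken from the MASS implementation.

```python
def parse_edge_list(lines):
    """Build an adjacency list (vertex -> sorted neighbor list) from edge-list lines."""
    adj = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2 or parts[0].startswith("#"):
            continue  # skip blank, malformed, and comment lines
        u, v = int(parts[0]), int(parts[1])
        # The graph is undirected, so record the edge in both directions.
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return {v: sorted(nbrs) for v, nbrs in sorted(adj.items())}


# Edge list of the 8-vertex original graph used in Table 3.1.
edge_lines = ["0 2", "0 6", "0 4", "0 1", "1 3", "1 7", "1 5", "2 4", "3 5", "4 5"]
adjacency = parse_edge_list(edge_lines)
for vertex, neighbors in adjacency.items():
    print(vertex, neighbors)
```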

Table 3.1 shows the mapping of graph vertices to cluster machines for the original graph and for the reordered graph. In contrast to the original graph's non-uniform total degree distribution of 8-4-6-2 across the four machines, the reordered graph has a total degree distribution of 4-4-6-6. Thus, the graph ordering module attempts to reduce load imbalance by reordering and distributing the vertices so that every computing machine receives an approximately equal total degree.

Table 3.1. Graph vertices rearrangement based on vertex degree

Machine     Original Graph (vertex {neighbors})    Total Degree    Reordered Graph (vertex {neighbors})    Total Degree
Machine 1   0 {2, 6, 4, 1};  1 {0, 3, 7, 5}        8               0 {5};     1 {4, 3, 5}                  4
Machine 2   2 {0, 4};        3 {1, 5}              4               2 {7};     3 {7, 6, 1}                  4
Machine 3   4 {0, 2, 5};     5 {4, 1, 3}           6               4 {1, 5};  5 {0, 1, 7, 4}               6
Machine 4   6 {0};           7 {1}                 2               6 {7, 3};  7 {3, 2, 6, 5}               6
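
The numbers in Table 3.1 can be reproduced with the following Python sketch. The report does not spell out the exact assignment rule, so the rule used here is an assumption: vertices are sorted by degree (stable sort) and then dealt round-robin to the machines, and new identifiers are assigned machine by machine. The function names are hypothetical. Under this assumption the sketch yields the same renumbering and the same 4-4-6-6 distribution as the table.

```python
def reorder_by_degree(adj, machines):
    """Renumber vertices so that contiguous ID blocks have similar total degree.

    Assumed rule: stable sort by degree, then deal vertices round-robin to machines.
    Returns (old_id -> new_id mapping, reordered adjacency list).
    """
    order = sorted(adj, key=lambda v: len(adj[v]))
    new_id = {}
    for m in range(machines):
        for old in order[m::machines]:  # every machines-th vertex goes to machine m
            new_id[old] = len(new_id)
    reordered = {new_id[v]: sorted(new_id[u] for u in adj[v]) for v in adj}
    return new_id, reordered


def total_degree_per_machine(adj, machines):
    """Sum vertex degrees over equal-sized contiguous blocks of vertex IDs."""
    ids = sorted(adj)
    block = len(ids) // machines
    return [sum(len(adj[v]) for v in ids[m * block:(m + 1) * block])
            for m in range(machines)]


# Original graph from Table 3.1.
original = {0: [2, 6, 4, 1], 1: [0, 3, 7, 5], 2: [0, 4], 3: [1, 5],
            4: [0, 2, 5], 5: [4, 1, 3], 6: [0], 7: [1]}
mapping, reordered = reorder_by_degree(original, machines=4)
print(total_degree_per_machine(original, 4))
print(total_degree_per_machine(reordered, 4))
```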


3.2.2 Target Graph Analyzer

The target graph analyzer performs a full enumeration of the input graph to identify all candidate motifs of the given motif size. It runs in parallel across all the computing machines used for the execution. The target graph analyzer first instantiates all vertices (MASS Places) with the neighbor information obtained from the graph parser. Then, it populates a crawler (MASS Agent) at each place. These crawlers execute the ESU algorithm shown in Figure 3.3 simultaneously from all the vertices. Each crawler spawns child crawlers to traverse all of its valid neighbors and migrates itself to one of those neighbors. A crawler terminates when no valid neighbor exists for its vertex. Once a crawler has traversed a motif-sized subgraph, it deposits the subgraph structure in the compact graph6 representation. After all crawlers have terminated, the target graph analyzer gathers the deposited motif-sized subgraphs from all places. Finally, isomorphic occurrences are grouped together by passing the graph6 representations to the nauty labelg program [10], and the resulting canonical labels of the candidate motifs are saved along with their respective frequencies.
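
To make the graph6 representation concrete, the sketch below encodes a small undirected subgraph as a graph6 string. It follows the published graph6 format for graphs with at most 62 vertices (one size character followed by the upper triangle of the adjacency matrix packed six bits per character); the function name `to_graph6` is hypothetical and the code is an illustration, not the encoder used in the MASS implementation. Canonical labeling itself is left to labelg and is not reproduced here.

```python
def to_graph6(n, edges):
    """Encode a simple undirected graph on vertices 0..n-1 (n <= 62) as graph6."""
    if not 0 <= n <= 62:
        raise ValueError("this sketch only handles graphs with at most 62 vertices")
    edge_set = {frozenset(e) for e in edges}
    # Upper triangle of the adjacency matrix, column by column:
    # (0,1), (0,2), (1,2), (0,3), (1,3), (2,3), ...
    bits = [1 if frozenset((i, j)) in edge_set else 0
            for j in range(1, n) for i in range(j)]
    bits += [0] * (-len(bits) % 6)  # pad with zeros to a multiple of 6 bits
    chars = [chr(n + 63)]           # first character encodes the vertex count
    for pos in range(0, len(bits), 6):
        value = 0
        for bit in bits[pos:pos + 6]:
            value = (value << 1) | bit
        chars.append(chr(value + 63))
    return "".join(chars)


print(to_graph6(3, [(0, 1), (1, 2), (0, 2)]))  # triangle
print(to_graph6(3, [(0, 1), (1, 2)]))          # path on three vertices
```

A size-3 subgraph therefore fits into two characters, which is considerably smaller than an explicit edge list for every deposited subgraph.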

3.2.3 Random Graph Generator

Random graphs are generated from the input graph by preserving the degree distribution of its vertices. This work generates degree-preserving random graphs using the configuration model described in [16]. The random graph generator fetches the degree distribution sequence from the graph parser. This sequence is a list of vertex identifiers in which each identifier appears as many times as the degree of its vertex. The random graph generator shuffles the degree distribution list and repeatedly picks a random pair of vertices from it as an edge of the random graph. Consequently, the generated random graph may be connected or disconnected, and some vertices may end up with a lower degree than expected because the pairing can produce self-loops and parallel edges.
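
A minimal Python sketch of this pairing procedure is shown below. The function name `degree_preserving_random_graph` is hypothetical, and the choice to discard self-loops and parallel edges (rather than, for example, re-drawing them) is an assumption made here to match the observation that some vertices can end up with a lower degree than in the input graph.

```python
import random


def degree_preserving_random_graph(adj, seed=None):
    """Configuration-model style random graph with (at most) the input degrees.

    adj maps each vertex to a collection of its neighbors.
    """
    rng = random.Random(seed)
    # Degree sequence: each vertex identifier repeated once per incident edge.
    stubs = [v for v, nbrs in adj.items() for _ in nbrs]
    rng.shuffle(stubs)
    random_adj = {v: set() for v in adj}
    # Pair up consecutive entries of the shuffled list as edges.
    for i in range(0, len(stubs) - 1, 2):
        u, v = stubs[i], stubs[i + 1]
        if u == v or v in random_adj[u]:
            continue  # assumption: self-loops and parallel edges are dropped
        random_adj[u].add(v)
        random_adj[v].add(u)
    return random_adj


original = {0: [2, 6, 4, 1], 1: [0, 3, 7, 5], 2: [0, 4], 3: [1, 5],
            4: [0, 2, 5], 5: [4, 1, 3], 6: [0], 7: [1]}
sample = degree_preserving_random_graph(original, seed=1)
print({v: sorted(n) for v, n in sample.items()})
```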

3.2.4 Random Graph Analyzer

The random graph analyzer employs the RAND-ESU algorithm to perform an approximate enumeration based on the input sampling probabilities. Instead of traversing all neighbors, crawlers selectively traverse a limited set of neighbors at each level of the ESU tree. The RAND-ESU algorithm thereby reduces the time taken to compute the frequency of candidate motifs in a large number of random graphs. Similar to the target graph analyzer, the random graph analyzer also executes simultaneously on all the computing machines used for the execution.
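
The sampling idea can be sketched sequentially in Python as follows. This is an illustration under stated assumptions rather than the crawler code: the function name `rand_esu` is hypothetical, `probs[d]` is taken to be the probability of exploring a node at depth d+1 of the ESU tree (so `len(probs)` equals the motif size), and a seeded `random.Random` stands in for whatever random source the real implementation uses.

```python
import random


def rand_esu(adj, k, probs, seed=None):
    """Sample connected k-vertex subgraphs by pruning ESU tree branches at random.

    adj maps each integer vertex to the set of its neighbors.
    probs[d] is the probability of visiting a tree node at depth d + 1.
    """
    rng = random.Random(seed)
    found = []

    def extend(sub, ext, root):
        if len(sub) == k:
            found.append(frozenset(sub))
            return
        closed = set(sub)
        for u in sub:
            closed |= adj[u]
        ext = set(ext)
        while ext:
            w = ext.pop()
            # The child node sits at depth len(sub) + 1; skip it with
            # probability 1 - probs[len(sub)].
            if rng.random() >= probs[len(sub)]:
                continue
            exclusive = {u for u in adj[w] if u > root and u not in closed}
            extend(sub | {w}, ext | exclusive, root)

    for v in adj:
        if rng.random() < probs[0]:
            extend({v}, {u for u in adj[v] if u > v}, v)
    return found


cycle = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
print(len(rand_esu(cycle, 3, [1.0, 1.0, 1.0])))       # full enumeration
print(len(rand_esu(cycle, 3, [1.0, 1.0, 0.5], seed=7)))  # sampled enumeration
```

With every probability set to 1.0 the sketch degenerates to full ESU enumeration. In the formulation of [7], each subgraph is reached with probability equal to the product of the per-level probabilities, so dividing the sampled count by that product gives an estimate of the full count.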

3.2.5 Statistical Analyzer

The statistical analyzer performs the final computation to determine the significance of the candidate motifs. The Z-score of a motif m is the difference between its frequency in the original graph, F_G(m), and its mean frequency in the random graphs, F_R(m), divided by the standard deviation of its frequency in the random graphs. The Z-score is undefined when this standard deviation is zero.

\[
Z(m) = \frac{F_G(m) - \mathrm{Mean}(F_R(m))}{\mathrm{StandardDeviation}(F_R(m))}
\]

The p-value is the number of random networks in which the motif occurred at least as often as in the original network, divided by the number of random networks N. Hence, the p-value lies in the range between 0 and 1 inclusive. The smaller the p-value, the more significant the network motif.

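\[
p(m) = \frac{1}{N} \sum_{n=1}^{N} c(n), \quad \text{where } c(n) =
\begin{cases}
1, & \text{if } F_R(m) \ge F_G(m) \text{ in the } n\text{-th random network} \\
0, & \text{otherwise}
\end{cases}
\]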

Generally, subgraph patterns with Z(m) > 2 and p(m) < 0.01 are considered statistically significant and are recognized as network motifs [1]. The statistical analyzer computes the Z-score and the p-value for all the candidate motifs using the above relations. This module executes sequentially on the master computing machine and displays the result to the user, as seen in Figure 3.7.

Figure 3.7. Execution screenshot of MASS Network Motif Detection.
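
The two statistics can be computed as in the following Python sketch. The function names are hypothetical, the p-value uses the "greater than or equal" comparison from the formula above, and the use of the population standard deviation (`statistics.pstdev`) is an assumption, since the report does not state whether the population or the sample form is used.

```python
from statistics import mean, pstdev


def z_score(target_freq, random_freqs):
    """Z-score of a motif; returns None when the standard deviation is zero."""
    sd = pstdev(random_freqs)  # assumption: population standard deviation
    if sd == 0:
        return None  # undefined, as noted in the text
    return (target_freq - mean(random_freqs)) / sd


def p_value(target_freq, random_freqs):
    """Fraction of random graphs in which the motif is at least as frequent."""
    hits = sum(1 for f in random_freqs if f >= target_freq)
    return hits / len(random_freqs)


# Hypothetical frequencies of one candidate motif in 8 random graphs.
random_freqs = [2, 4, 4, 4, 5, 5, 7, 9]
print(z_score(10, random_freqs), p_value(10, random_freqs))
```

In this made-up example the mean is 5 and the standard deviation is 2, so a target frequency of 10 gives Z = 2.5 and p = 0, which would satisfy both significance thresholds.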


3.3 PERFORMANCE IMPROVEMENT

The initial MASS implementation suffered from excessive heap memory usage due to the creation of a large number of non-primitive Java objects. As a result, it ran into 'out of memory' errors for a large motif size (9) in the smaller graph (dolphin network) and for a small motif size (3) in the larger graph (DIP dataset). The following performance improvements were incorporated to reduce memory usage and improve execution speed.

• Reduced HashMap with String Key. The initial version maintained the data of each motif in multiple hash maps (8 hash maps), each keyed by the motif's canonical label string. These hash maps were consolidated into a single 'Motif' class to reduce the memory occupied by the canonical label string objects that were repeated across the maps.

• MASS Asynchronous Agent Migration. The Agent's callAll followed by manageAll was replaced with doAll for 'motif size' iterations to reduce the time incurred by returning control to the driver program between successive function calls.

• Changed Non-primitive Java Objects to Primitive Types. This reduces memory usage and avoids the autoboxing and unboxing performed during conversions between primitive and non-primitive types.

• Replaced Built-in Java Collections with FastUtil's Primitive Collections. Built-in Java collections such as HashMap, HashSet, and ArrayList consume an enormous amount of memory as the collection size increases and are tightly coupled to their internal data structures. To reduce memory usage and to benefit from a choice of internal data structures such as array, AVL tree, RB tree, open hash, and custom hash, FastUtil [21] primitive collections are used. In the MASS implementation, the number of agents increases exponentially for large motif sizes and large graph sizes. With primitive collections, each agent carries much less data than before.

• Moved Agent's Data to Places. The input motif size and the sampling probabilities were initially stored in the agents. This input data consumed a huge amount of memory once millions of agents were created. The input data used by the agents is now stored in all places, and an agent fetches it from the place upon arrival, thereby reducing the memory used during agent expansion.

These performance improvements reduced the execution time significantly, as explained in Section 4.2.3. The fine-tuning also enabled the detection of small motif sizes in large graphs as well as large motif sizes in small graphs, both of which were previously infeasible.


Chapter 4. RESULTS

4.1 EXECUTION ENVIRONMENT

Experiments were conducted on a cluster of 8 computing nodes made available by the University of Washington Bothell. Of these 8 nodes, 4 have an 8-core 2.33GHz CPU (Intel Xeon E5410) with 16GB of memory, and the remaining 4 have a 4-core 2.66GHz CPU (Intel Xeon 5150) with 16GB of memory. The latest stable versions of the software libraries were used in this work: MASS Java [2] core version 1.2.1, NemoLib Java [9] version 2, and Nauty [10] version 2.6 (r12). All experiments were executed with a 4GB initial heap and a 12GB maximum heap.

4.1.1 Input Datasets

Table 4.1 lists the three undirected real-world datasets used in the experiments. The downloaded datasets come in different graph formats, namely Graph Modeling Language (GML), Pajek, and Edge-List format. A Python script converts these formats to the Edge-List format expected by the current implementation. The script uses the open-source Python library NetworkX [17] for format conversion and graph visualization. Figure 4.1 illustrates the dolphin network visualized using this Python script.

Table 4.1. Real Network Datasets

Real Datasets    Vertices   Edges     Highest Degree   Connected Components
Dolphin [18]     62         159       12               1
Power [19]       4,941      6,594     19               1
DIP 2016 [20]    27,876     76,108    289              2,385
DIP Modified     26,695     73,085    289              1,204

Figure 4.1. Dolphin Network

As seen from Table 4.1 and Figure 4.1, the dolphin and power datasets are connected networks (each consists of a single connected component), while the DIP dataset is a disconnected network with multiple connected components.


4.1.2 Input Graph Preprocessing

The current MASS implementation supports only 0-indexed integer identifiers for graph vertices in order to reduce memory usage and to simplify the mapping of vertices to MASS places. The dolphin network [18] has string-based vertex identifiers, and the DIP 2016 network [20] has 3019 self-loops and 4 parallel edges. A utility program was developed in Java to remove self-loops and parallel edges from the input graphs and to construct a graph in which string-based or non-zero-indexed vertex identifiers are mapped to zero-indexed vertex identifiers. Table 4.1 lists the original and the modified DIP dataset, which differ by 1181 vertices and 3023 edges. This preprocessing avoids allocating memory for vertices with no valid neighbor and avoids the time spent traversing self-loops or parallel edges. All experiments were conducted using the input graphs generated by the utility program.
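
The cleaning and re-indexing steps can be sketched as follows. The actual utility program is written in Java; this Python version is only an illustration of the idea, and the function name `clean_and_reindex` as well as the first-seen numbering order are assumptions.

```python
def clean_and_reindex(edges):
    """Drop self-loops and parallel edges and map identifiers to 0..n-1.

    edges is an iterable of (u, v) pairs with arbitrary hashable identifiers.
    Returns (identifier -> zero-based index mapping, sorted list of cleaned edges).
    """
    mapping = {}
    cleaned = set()
    for u, v in edges:
        if u == v:
            continue  # self-loop: contributes no valid neighbor
        for name in (u, v):
            if name not in mapping:
                mapping[name] = len(mapping)  # next free zero-based index
        a, b = sorted((mapping[u], mapping[v]))
        cleaned.add((a, b))  # the set silently removes parallel edges
    return mapping, sorted(cleaned)


raw = [("a", "b"), ("b", "a"), ("c", "c"), ("b", "d"), ("a", "b")]
mapping, edges = clean_and_reindex(raw)
print(mapping)
print(edges)
```

In this sketch a vertex whose only edge is a self-loop (such as "c" above) receives no index at all, which mirrors the reduction in vertex count reported for the modified DIP dataset.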

4.2 PERFORMANCE ANALYSIS

4.2.1 Comparison with Kipps et al. MASS Motif Synchronous Implementation

The MASS Network Motif [3] and MASS Motif Synchronous [4] implementations were tested on 8 computing nodes with 1 thread per node for the dolphin and power graphs. Both implementations took similar execution times for small motif sizes up to 5. For larger motif sizes, however, the current implementation achieved a maximum speedup of 3.5-4x on the dolphin network (motif size 8) and the power network (motif sizes 6 and 7), as seen in Figure 4.2.

Figure 4.2. Comparison of MASS Motif Synchronous vs MASS Network Motif


4.2.2 MASS Parallel Performance Analysis

To evaluate the MASS parallel performance, the current implementation [3] was tested with 4 and 8 computing nodes. As evident from Figure 4.3, execution on 8 nodes decreased the execution time by a factor of 1.7-2 for large motif sizes compared to execution on 4 nodes. In conclusion, the parallel performance improved as more computing nodes were used for the parallel execution.

Figure 4.3. MASS Parallel Performance Analysis

4.2.3 MASS Performance Tuning Evaluation

To assess the performance improvements described in Section 3.3, experiments were conducted with the pre-tuned and the fine-tuned versions. The 7x speedup achieved on the dolphin data (motif size 8) and the 13x speedup achieved on the power data (motif size 7), shown in Figure 4.4, demonstrate the benefits gained from fine-tuning.

Figure 4.4. MASS Performance Tuning Evaluation


4.2.4 MASS Parallel I/O Evaluation

The MASS Parallel I/O feature expects every line of the input graph file to have the same alignment so that the file can be partitioned and read in parallel by the computing nodes. To meet this alignment constraint, the neighbor data of each vertex needs to be filled with -1 entries and spaces up to the maximum number of neighbors, as seen in Figure 4.5, which ends up creating a huge input file. As Table 4.2 shows, MASS Parallel I/O increases the input file size considerably for large graphs (for example, from 761 KB to 75,370 KB for the DIP dataset). MASS Parallel I/O provides great parallelization potential for complete graphs, in which every vertex is connected to all other vertices. However, the current MASS Parallel I/O is not well suited to biological networks, which typically have a few vertices with a high degree and many vertices with a low degree. Hence, sequential I/O was preferred over parallel I/O in the current MASS implementation.

Table 4.2. Input graph size for different formats

Dataset    Edge List File Size   Parallel I/O File Size
Dolphin    1 KB                  8 KB
Power      62 KB                 923 KB
DIP        761 KB                75,370 KB

Figure 4.5. Parallel I/O Graph
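
The padding requirement can be illustrated with the Python sketch below. The exact file layout shown in Figure 4.5 is not reproduced here, so the layout used (vertex identifier followed by right-aligned neighbor fields, with -1 as filler) and the function name `fixed_width_lines` are assumptions for illustration only.

```python
def fixed_width_lines(adj, filler=-1):
    """Format an adjacency list so that every line has the same length.

    Each line holds the vertex followed by its neighbors, padded with the
    filler value up to the maximum degree found in the graph.
    """
    max_degree = max(len(nbrs) for nbrs in adj.values())
    # Field width must fit the longest vertex identifier and the filler.
    width = max(max(len(str(v)) for v in adj), len(str(filler)))
    lines = []
    for v in sorted(adj):
        row = [v] + list(adj[v]) + [filler] * (max_degree - len(adj[v]))
        lines.append(" ".join(str(x).rjust(width) for x in row))
    return lines


graph = {0: [2, 6, 4, 1], 1: [0, 3, 7, 5], 2: [0, 4], 3: [1, 5],
         4: [0, 2, 5], 5: [4, 1, 3], 6: [0], 7: [1]}
for line in fixed_width_lines(graph):
    print(line)
```

Under such a layout the file holds on the order of (number of vertices) x (maximum degree) fields rather than one entry per edge. For the modified DIP dataset in Table 4.1 that is roughly 26,695 x 289, or about 7.7 million fields, for only 73,085 edges, which is consistent with the growth in file size reported in Table 4.2.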

4.2.5 MASS Agent Population Control Evaluation

The MASS agent population control feature enabled the detection of large motif sizes (motif size 9 in the dolphin network, motif size 8 in the power network, and motif size 4 in DIP) with up to 14.9 million agents by serializing the agents that exceed the specified maximum population limit. However, with a further increase in motif size (motif size 5 for the DIP dataset enumerates 5.1 billion agents), agent population control consumes more heap space to store the serialized inactive agent objects and slows down the execution. In one observed run, the heap usage of one cluster machine reached 11.8 GB (out of the 12 GB maximum heap), and full garbage collection was triggered 91 times, taking 1116 seconds (about 81%) of the overall execution time of 1380 seconds. When the serialized agent objects grow up to the maximum heap size, the processor spends most of its time in garbage collection rather than in useful computation. Thus, the current MASS implementation is limited by the maximum heap available on the cluster machines.


4.2.6 Overall Performance Comparison

Figure 4.6. Sequential vs Parallel Performance

Figure 4.6 compares the performance of the sequential tools against the parallel 8-node MASS execution for the power network. Although the current MASS implementation [3] achieved a 30x speedup over the previous MASS implementation [4], it still lags behind the sequential FANMOD [6] and NemoLib [9]. Because of the memory limitation faced by the MASS version, very large graphs (exceeding the memory of a single machine), which are infeasible to analyze with the sequential tools, could not be tested with MASS either.

Chapter 5. CONCLUSION & FUTURE WORK

Agent-based network motif detection extended detection up to motif size 9, which is impractical with the sequential tools [5], [6]. The MASS Network Motif version achieved a maximum speedup of 30x over the Motif Synchronous version by using MASS asynchronous agent migration and the agent population control feature, eliminating object creation at the application level, using primitive types and primitive-type-specific collections, and carrying minimal data within each agent. However, very large graphs could not be tested on the existing cluster because of the heap size limitation faced by the MASS implementation. Future work involves Spark-based parallel network motif detection in order to evaluate the parallelization intuitiveness, ease of programming, and fitness of MASS for parallelizing graph problems. Future work will also focus on possible performance improvements in the MASS implementation so that larger graph sizes and larger motif sizes can be tested.

BIBLIOGRAPHY

[1] Junker, B. H., & Schreiber, F. (2011). Analysis of biological networks (Vol. 2). John Wiley & Sons.

[2] Fukuda, M. (2010). MASS: Parallel-computing library for multi-agent spatial simulation. Distributed Systems Laboratory, Computing & Software Systems, University of Washington Bothell, Bothell, WA.

[3] MASS Network Motif Implementation. https://bitbucket.org/mass_application_developers/mass_java_appl/src/1ce403d59c772efac7ac41bc292e735301126114/?at=feature%2FNetworkMotif

[4] MASS Motif Synchronous Implementation. https://bitbucket.org/mass_application_developers/mass_java_appl/src/master/Applications/MotifSynchronous

[5] Mfinder. https://www.weizmann.ac.il/mcb/UriAlon/download/network-motif-software

[6] Wernicke, S., & Rasche, F. (2006). FANMOD: a tool for fast network motif detection. Bioinformatics, 22(9), 1152-1153.

[7] Wernicke, S. (2006). Efficient detection of network motifs. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 3(4), 347-359.

[8] Andersen, A., & Kim, W. (2017). NemoLib: A Java Library for Efficient Network Motif Detection. In International Symposium on Bioinformatics Research and Applications (pp. 403-407). Springer, Cham.

[9] NemoLib Java version 2. https://github.com/Kimw6/NemoLib-Java-V2

[10] McKay, B. D., & Piperno, A. (2014). Practical Graph Isomorphism, II. Journal of Symbolic Computation, 60, 94-112.

[11] Wang, T., Touchman, J. W., Zhang, W., Suh, E. B., & Xue, G. (2005). A parallel algorithm for extracting transcription regulatory network motifs. In Proceedings of the IEEE International Symposium on Bioinformatics and Bioengineering (pp. 193-200). IEEE Computer Society Press, Los Alamitos, CA, USA.

[12] Ribeiro, P., Silva, F., & Lopes, L. (2010, January). A parallel algorithm for counting subgraphs in complex networks. In International Joint Conference on Biomedical Engineering Systems and Technologies (pp. 380-393). Springer, Berlin, Heidelberg.

[13] Verma, V., Kwon, P. P., Joglekar, A., & Kim, W. Network motif analysis in clouds: subgraph enumeration with iterative Hadoop MapReduce. vol 4, 28-40.

[14] Kipps, M., Kim, W., & Fukuda, M. (2015). Agent and Spatial Based Parallelization of Biological Network Motif Search. In Proc. 17th IEEE International Conference on High Performance Computing and Communications - HPCC 2015 (pp. 786-791). New York, NY, August 2015.

[15] Andersen, A., Kim, W., & Fukuda, M. (2016). MASS-based NemoProfile construction for an efficient network motif search. In IEEE International Conference on Big Data and Cloud Computing in Bioinformatics - BDCloud 2016 (pp. 601-606). Atlanta, GA, October 2016.

[16] Newman, M. E. (2003). The structure and function of complex networks. SIAM Review, 45(2), 167-256.

[17] Hagberg, A., Swart, P., & Schult, D. Exploring Network Structure, Dynamics, and Function using NetworkX. In Proceedings of the 7th Python in Science Conference (SciPy2008), Gäel Varoquaux, Travis Vaught, and Jarrod Millman (Eds.), Pasadena, CA, USA, pp. 11-15.

[18] Lusseau, D., Schneider, K., Boisseau, O. J., Haase, P., Slooten, E., & Dawson, S. M. (2003). The bottlenose dolphin community of Doubtful Sound features a large proportion of long-lasting associations. Behavioral Ecology and Sociobiology, 54(4), 396-405.

[19] Watts, D. J., & Strogatz, S. H. (1998). Collective dynamics of 'small-world' networks. Nature, 393(6684), 440-442.

[20] Xenarios, I., Salwinski, L., Duan, X. J., Higney, P., Kim, S. M., & Eisenberg, D. (2002). DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Research, 30(1), 303-305.

[21] Fastutil: Fast & compact type-specific collections for Java. http://fastutil.di.unimi.it/

