2011 2nd International Conference on Computer and Communication Technology (ICCCT), Allahabad, India, 15-17 September 2011

A Modified Parallel Approach to Single Source Shortest Path Problem for Massively Dense Graphs Using CUDA

Sumit Kumar1, Alok Misra2, Raghvendra Singh Tomar3

Department of Computer Science and Engineering, Shri Ramswaroop Memorial Group of Professional Colleges, Gautam Buddh Technical University, Lucknow, India.

[email protected], [email protected], [email protected]

Abstract—Today’s Graphics Processing Units (GPUs) possess enormous computation power with a highly parallel and multithreaded architecture, and their most attractive feature is their low cost. NVIDIA’s CUDA provides an interface for developers to use GPUs for general purpose parallel computing. In this paper, we present a modified algorithm for the Single Source Shortest Path problem on the GPU using CUDA. First, we modify the standard BELLMAN-FORD algorithm to remove its drawbacks and make it suitable for parallel implementation, and then implement it using CUDA. For dense graphs, our algorithm gains a speedup of 10x-12x over the previously proposed parallel algorithm. Our algorithm also accepts graphs with negative weighted edges and can detect any reachable Negative Weighted Cycle, which widens its scope to more general problems.

Keywords-Bellman-Ford algorithm; Complete graphs; Compute Unified Device Architecture (CUDA); Graphics Processing Unit (GPU); Negative Weighted Cycle (NWC); Parallel Processing.

I. INTRODUCTION

Graphs are a common way to represent data in many engineering and scientific applications. The Single Source Shortest Path (SSSP) problem is one of the operations on graphs: finding the minimum distances of all the vertices of the graph from a single vertex (called the source). Many efficient serial algorithms exist which solve this problem in time linear in V or E. But for a graph having a large number of vertices, say 1 million, these algorithms would be impractical to use. Today’s Graphics Processing Units (GPUs) have enormous computation power compared to traditional multicore CPUs. NVIDIA’s Compute Unified Device Architecture (CUDA) provides a platform for data parallelism on many-core GPUs which can be applied to various graph algorithms.

In this paper, we provide a modified algorithm which computes the SSSP and also detects any Negative Weighted Cycle (NWC) reachable from the source, if present. First, we present a modification of the serial BELLMAN-FORD Algorithm for SSSP and then implement it using CUDA to exploit parallelism to a greater extent than the algorithm given by P. Harish et al. [2]. An NWC is a cycle in a graph the sum of whose edge weights is negative. Our main concern is to enhance the performance of the Single Source Shortest Path algorithm for dense graphs. Dense graphs are used explicitly in the modeling of “correlation networks”, in which the data may represent DNA microarrays [13, 14], time series [15] or any other multivariate feature vector. These graphs are generally fully connected as they represent pairwise similarities between objects. G. Potamias [12] used a fully connected graph to simplify genetic structure, with the objects (genes) as nodes and the distances as edge weights. Guy Melancon [7] worked on tuning the edge density of randomly generated graphs and also studied some real-world graphs where the number of vertices is small but the graphs are densely populated with edges.

II. PREVIOUS WORK

P. J. Narayanan [6] presented an efficient approach to solving SSSP on processor arrays. Edmonds et al. [8] use the Parallel Boost Graph Library, while Crobak et al. [9] use the CRAY MTA-2 for implementing SSSP. A. Crauser et al. [10] also provide a parallel implementation of Dijkstra’s algorithm. Much research has been done on performing SSSP on GPUs, but most of it provides great speedup only for sparse graphs and cannot perform better for dense graphs. P. Harish et al. [2] and P. Harish et al. [3] provide efficient parallel implementations of SSSP on GPUs using CUDA. Their algorithm calculates the SSSP of a million-vertex graph in 1-2 seconds. A. Buluc et al. [16] provide optimized algorithms for path problems on the GPU to exploit resources to a greater extent. G. J. Katz et al. [4] provide an algorithm for the All Pairs Shortest Path problem on large graphs using multiple devices (GPUs). But these algorithms do not detect the presence of any Negative Weighted Cycle (NWC) at all; rather, they presume that all the edges have positive weights or that there is no NWC present in the graph.

III. COMPUTE UNIFIED DEVICE ARCHITECTURE (CUDA)

NVIDIA’s Compute Unified Device Architecture (CUDA) provides a massively multithreaded architecture for general purpose parallel computing. CUDA was introduced in November 2006, providing a new parallel programming model and instruction set architecture. Since then, the instruction sets and APIs have been upgraded at periodic intervals by NVIDIA. Apart from graphics processing, GPUs are nowadays also used for general purpose parallel computing, which has shown a rapidly increasing trend in the research and programming fields.


CUDA provides a data-parallel approach to parallelism where all threads execute the same code (called a kernel) on different data items on the GPU. A Graphics Processing Unit (GPU) has many Streaming Multiprocessors (SMs); each SM consists of 8 Scalar Processors (cores), and NVIDIA’s latest series, Fermi, contains up to 512 cores. CUDA doesn’t need the programmer to manage the threads; it manages them itself. Threads can be arranged to form blocks in 1, 2 or 3 dimensions. The blocks can also be arranged in 1D or 2D, as per requirement. The arrangement pattern doesn’t alter the performance, i.e., whether you launch 16x16 threads in 2D or 256 threads in 1D, the performance is the same. Modern GPUs also support atomic operations to avoid concurrent read/write clashes, which are discussed later.

All we have to do is write a block of code (the kernel), copy the data from the host (CPU) to the device (GPU), call the kernel, and copy the data back from the device to the host after the kernel finishes. However, the approach differs slightly in the case of concurrent copy and execution.
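As a minimal sketch of this host-side workflow (the kernel scale_kernel and all sizes here are illustrative, not taken from the paper):

    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>

    // Illustrative kernel: each thread doubles one element of the array.
    __global__ void scale_kernel(int *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= 2;
    }

    int main(void) {
        const int n = 1 << 20;
        int *h_data = (int *)malloc(n * sizeof(int));
        for (int i = 0; i < n; ++i) h_data[i] = i;

        int *d_data;
        cudaMalloc(&d_data, n * sizeof(int));                  // allocate device memory
        cudaMemcpy(d_data, h_data, n * sizeof(int),
                   cudaMemcpyHostToDevice);                    // copy host -> device

        scale_kernel<<<(n + 255) / 256, 256>>>(d_data, n);     // launch the kernel

        cudaMemcpy(h_data, d_data, n * sizeof(int),
                   cudaMemcpyDeviceToHost);                    // copy device -> host
        printf("h_data[5] = %d\n", h_data[5]);                 // prints 10

        cudaFree(d_data);
        free(h_data);
        return 0;
    }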

IV. GRAPH REPRESENTATION

There are many ways to represent graphs, e.g., adjacency list, adjacency matrix, incidence matrix, etc., the most common of which are the adjacency matrix and the adjacency list. P. Harish et al. [2] use a compact adjacency list to minimize the memory requirement of the graph. We use the adjacency matrix representation because our focus is on dense graphs (where E = O(V²)), for which the list representation requires more memory than the matrix representation. With the adjacency matrix representation, we are also able to launch a larger number of simultaneous threads, ensuring that none of the cores are idle during the execution of the program.
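A minimal sketch of the layout assumed in the examples that follow: a row-major |V| x |V| matrix flattened into one array, with a large sentinel standing in for "infinity"/"no edge" (the sentinel value and helper name are our own choices, not from the paper).

    #include <limits.h>
    #include <stdlib.h>

    // "Infinity" sentinel, halved so that C[u] + W[u,v] cannot overflow an int.
    #define INF (INT_MAX / 2)

    // Dense graph as a flattened row-major |V| x |V| matrix:
    // W[u * V + v] holds the weight of directed edge (u, v).
    int *alloc_weight_matrix(int V) {
        int *W = (int *)malloc((size_t)V * V * sizeof(int));
        for (int u = 0; u < V; ++u)
            for (int v = 0; v < V; ++v)
                W[(size_t)u * V + v] = (u == v) ? 0 : INF;  // no edge yet
        return W;
    }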

V. SERIAL ALGORITHM

In the serial version, we have slightly modified the BELLMAN-FORD Algorithm [5] for SSSP so that it also detects NWCs. If there is no NWC reachable from the source, our algorithm correctly computes the SSSP and returns true; otherwise it returns false. Here is the pseudocode for the two procedures of the BELLMAN-FORD Algorithm: INITIALIZE-SINGLE-SOURCE (ISS) and RELAX.

SERIAL_SSSP_ISS (G, C, S)

1. for i = 1 to |V[G]| do
2.     C[i] = ∞
3. end for
4. C[S] = 0

SERIAL_SSSP_RELAX (W, C, u, v, flag)

1. if C[v] > C[u] + W[u, v] then
2.     C[v] = C[u] + W[u, v]
3.     flag = false
4. end if

In SERIAL_SSSP_ISS, the costs of all the vertices are initialized to ‘infinity’, except the source vertex, which is initialized to 0. In the SERIAL_SSSP_RELAX procedure, we have introduced a boolean variable ‘flag’, which is set to false (line 3) if the edge being processed updates the cost of its terminal vertex; the reason behind this will be explained shortly. The pseudocode for the main procedure is as follows.

SERIAL_SSSP (G, W, C, S)

1. Create a boolean variable flag.
2. SERIAL_SSSP_ISS (G, C, S)
3. for i = 1 to |V[G]| do
4.     flag = true
5.     for v = 1 to |V[G]| do
6.         for u = 1 to |V[G]| do
7.             SERIAL_SSSP_RELAX (W, C, u, v, flag)
8.         end for
9.     end for
10.    if flag == true then
11.        break
12.    end if
13. end for
14. return flag

Since we are using the matrix representation, the two nested for loops of lines 5 and 6 are used to cover all the edges. In the standard BELLMAN-FORD Algorithm [5], all the edges are relaxed exactly |V| - 1 times and then a for loop checks for the presence of a Negative Weighted Cycle. This loop checks each edge (u, v) for the condition C[v] > C[u] + W[u, v]; if the condition is true for any one of the edges, the algorithm returns false. We use the boolean variable ‘flag’ to eliminate this loop. First, the number of iterations of the main for loop of lines 3-13 is increased by 1, and then the statement ‘flag = false’ is inserted inside the if block of the SERIAL_SSSP_RELAX procedure. Hence, the last iteration of the for loop of lines 3-13 is used to check for the presence of an NWC in the graph. If there is no NWC in the graph and the for loop completes |V| - 1 iterations, the SSSP is computed correctly (Lemma 24.2 [5]). Hence, in the last iteration (i = |V[G]|), the cost of none of the vertices will change, and thus the flag, as initialized at the beginning of the loop (line 4), will remain true after the complete execution of the nested for loops. Since the value of flag is true, the algorithm will return true. However, this is neither the main objective of introducing the ‘flag’ variable, nor does it add any reduction in running time.

The main purpose of the ‘flag’ variable is to check whether the SSSP has been calculated before the completion of the |V| - 1 iterations of the main for loop of lines 3-13. If so, the algorithm terminates earlier than the standard BELLMAN-FORD Algorithm. Consider the case where |V| = 10K and the shortest paths are successfully computed in the first few iterations (say 8). Such cases are often encountered when the edge weights are small. To avoid a large amount of useless computation in those cases, we check at the end of each iteration whether any vertex had its cost updated inside the SERIAL_SSSP_RELAX calls of that iteration (in which case flag is set to false). If so, the condition in line 10 is false and the for loop iterates again; otherwise, the for loop is exited (line 11). We add this check because if no cost is modified in one iteration, then obviously none will be modified in any successive iteration, i.e., the SSSP has been computed.

Hence, there are two exit conditions for the for loop of lines 3-13: either i == |V| + 1 or flag == true. We have seen that the SSSP, if it exists, is surely computed after the end of |V| - 1 iterations. So, if there is a change in the cost of any of the vertices in the |V|th iteration, it means that there is a reachable NWC in the graph which causes the costs to keep decreasing. Thus the flag, as set inside SERIAL_SSSP_RELAX, will remain false and the algorithm will return false.

The main reason for the better performance of our algorithm is the early exit condition inside the for loop of lines 3-13. Our algorithm gains no speedup for graphs that take exactly |V| - 1 iterations to compute the SSSP, but for all our sample complete graphs it takes no more than 20 iterations (edge weights in the range [-100, 100]).
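A C rendering of SERIAL_SSSP under the representation sketched in Section IV (flattened matrix, INF sentinel, 0-based indices); the overflow guard on C[u] is our addition, since the pseudocode’s ∞ is not a machine integer:

    #include <stdbool.h>

    // Modified serial BELLMAN-FORD: fills C[] with shortest-path costs from S;
    // returns false iff a negative weighted cycle is reachable from S.
    bool serial_sssp(const int *W, int *C, int V, int S) {
        for (int i = 0; i < V; ++i) C[i] = INF;          // SERIAL_SSSP_ISS
        C[S] = 0;

        for (int i = 1; i <= V; ++i) {                   // |V| iterations: |V|-1 plus the NWC check
            bool flag = true;
            for (int v = 0; v < V; ++v)
                for (int u = 0; u < V; ++u)              // SERIAL_SSSP_RELAX on edge (u, v)
                    if (C[u] != INF && C[v] > C[u] + W[(size_t)u * V + v]) {
                        C[v] = C[u] + W[(size_t)u * V + v];
                        flag = false;
                    }
            if (flag) return true;                       // early exit: no cost changed
        }
        return false;                                    // cost changed in the |V|th pass: NWC
    }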

VI. PARALLEL IMPLEMENTATION WITH CUDA

In the CUDA_SSSP_ISS kernel, |V| 1D threads are launched, and the ith thread initializes the cost of the ith vertex to ‘infinity’, in parallel with the other threads.

CUDA_SSSP_ISS (G, C)

1. u = getThreadID
2. C[u] = ∞
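In CUDA C this could look as follows (a sketch; the launch configuration is our assumption, and the host sets C[S] = 0 afterwards, as in the main procedure below):

    __global__ void cuda_sssp_iss(int *C, int V) {
        int u = blockIdx.x * blockDim.x + threadIdx.x;   // getThreadID
        if (u < V)
            C[u] = INF;                                  // cost of vertex u = "infinity"
    }
    // launched as: cuda_sssp_iss<<<(V + 255) / 256, 256>>>(d_C, V);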

In the CUDA_SSSP_RELAX kernel, |V| x |V| 2D threads are launched, one assigned to each edge.

CUDA_SSSP_RELAX (G, W, C, flag)

1. v = getThreadIDx
2. u = getThreadIDy
3. tmp = C[v]
4. Begin Atomic
5.     if C[v] > C[u] + W[u, v] then
6.         C[v] = C[u] + W[u, v]
7.     end if
8. End Atomic
9. if C[v] < tmp then
10.    flag = false
11. end if

In the CUDA_SSSP_RELAX procedure, the atomicMin function (lines 4-8), supported on CUDA 1.1 and above, is used to avoid concurrent read/write conflicts, as suggested by P. Harish et al. [2]. We cannot put the statement of line 10 inside the if condition of line 5 because lines 4-8 are performed by the predefined function atomicMin(), which cannot be altered; hence we have to check the condition separately. P. Harish et al. [2] launch 1D threads (1 thread per vertex) which update the current cost (Cua) using the previous cost (Ca), only if the vertex is masked (i.e., currently discovered). This work is done in CUDA_SSSP_KERNEL1, and then the updated costs are copied into the current costs in a second kernel, CUDA_SSSP_KERNEL2. This causes the threads assigned to unmasked vertices (which are large in number) to remain idle while the threads for masked vertices do their work.
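A sketch of how lines 1-11 might map to CUDA C: atomicMin returns the old value of C[v], so the tmp/compare of lines 3 and 9 collapses into one comparison against the returned value. The INF guard is our addition to prevent integer overflow, and the flag write is a benign race, since every writer stores the same value.

    __global__ void cuda_sssp_relax(const int *W, int *C, int V, bool *flag) {
        int v = blockIdx.x * blockDim.x + threadIdx.x;   // getThreadIDx
        int u = blockIdx.y * blockDim.y + threadIdx.y;   // getThreadIDy
        if (u >= V || v >= V) return;

        int cu = C[u];
        if (cu == INF) return;                           // u not reached yet: nothing to relax

        int cand = cu + W[(size_t)u * V + v];
        int old = atomicMin(&C[v], cand);                // atomic form of lines 4-8
        if (cand < old)                                  // C[v] was lowered (lines 9-11)
            *flag = false;
    }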

The relaxation phase of our algorithm is similar to that of the BELLMAN-FORD Algorithm, i.e., all the edges are relaxed in each iteration. The approach of P. Harish et al. [2], by contrast, is similar to DIJKSTRA’s algorithm, i.e., only those edges are relaxed which have been discovered so far. The latter is better in serial execution (O(V²)) than the former, which takes O(V³) for dense graphs, but it hinders the power of parallelism when run on GPUs.

Here is the pseudocode for the main procedure, which is quite similar to SERIAL_SSSP.

CUDA_SSSP (G, W, S, C)

1. Create a boolean variable flag.
2. for each vertex V do in parallel (1D threads)
3.     CUDA_SSSP_ISS (V, C)
4. end for
5. C[S] = 0
6. for i = 1 to |V| do
7.     flag = true
8.     for each entry in W[G] do in parallel (2D threads)
9.         CUDA_SSSP_RELAX (G, W, C, flag)
10.    end for
11.    if flag == true then
12.        break
13.    end if
14. end for
15. return flag

Instead of using two kernels, we use a single kernel for relaxation and the termination check, so that every thread has more work to do and multiple kernel launches are avoided. The nested for loops of SERIAL_SSSP are replaced by 2D threads in CUDA_SSSP. P. Harish et al. [2] use the variable ‘terminate’ instead of ‘flag’ for the same purpose, but they do so in the second kernel, CUDA_SSSP_KERNEL2. The comparison with P. Harish et al. [2] is shown in Fig. 2.
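A host-side sketch of CUDA_SSSP using the two kernels sketched above; the 16x16 block shape follows the 256-thread example from Section III, and keeping flag in device memory with a per-iteration copy is our assumption about how the termination check is plumbed.

    bool cuda_sssp(const int *d_W, int *d_C, int V, int S) {
        cuda_sssp_iss<<<(V + 255) / 256, 256>>>(d_C, V); // lines 2-4: |V| 1D threads
        int zero = 0;
        cudaMemcpy(&d_C[S], &zero, sizeof(int), cudaMemcpyHostToDevice); // line 5

        bool h_flag = true, *d_flag;
        cudaMalloc(&d_flag, sizeof(bool));
        dim3 block(16, 16);                              // 256 threads arranged in 2D
        dim3 grid((V + 15) / 16, (V + 15) / 16);         // one thread per matrix entry

        for (int i = 1; i <= V; ++i) {                   // lines 6-14
            h_flag = true;
            cudaMemcpy(d_flag, &h_flag, sizeof(bool), cudaMemcpyHostToDevice);
            cuda_sssp_relax<<<grid, block>>>(d_W, d_C, V, d_flag);
            cudaMemcpy(&h_flag, d_flag, sizeof(bool), cudaMemcpyDeviceToHost);
            if (h_flag) break;                           // early exit: no cost changed
        }
        cudaFree(d_flag);
        return h_flag;                                   // false => reachable NWC
    }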

VII. LIMITATION ON THE SIZE OF GRAPH

Though they have enormous computation power, GPUs are limited in memory. We ran our program on a low-cost GPU, NVIDIA’s GeForce GT 240 (1 GB RAM), part of whose memory is also needed for the graphics display of the system. We are able to run our algorithm on a (|V|, |E|) = (8.7K, 75M) graph using the integer data type (4 bytes) to store the weight matrix. If we assume that the edge weights are in the range [-100, 100], we can use the character data type (1 byte) to store the matrix, and the program then runs even for (|V|, |E|) = (30K, 900M) graphs (we used ASCII values of characters to encode the edge weights, for instance, A = -100, B = -99 and so on). If approximate results are allowed, scaling the edge weights into this range may be another option. To handle graphs with even more vertices, various other approaches exist. G. J. Katz et al. [4] use multiple devices and divide the matrix of the graph across the devices. P. Harish et al. [2] use multiple streams (supported by CUDA 1.1 and higher) for concurrent copy and execution. Either method can be applied to handle large graphs by dividing the graph into small parts and assigning each part to a device or a stream.
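The paper’s ASCII encoding amounts to 1-byte storage; an equivalent sketch (our formulation, not the paper’s exact code) uses signed char directly, whose [-128, 127] range covers [-100, 100]:

    #include <stdlib.h>

    // 1-byte weights: for |V| = 30K the matrix needs 30000 * 30000 bytes = 0.9 GB,
    // versus 3.6 GB with 4-byte ints, so it fits on the 1 GB card.
    signed char *alloc_weight_matrix8(int V) {
        return (signed char *)malloc((size_t)V * V);
    }

    // Widen to int before computing C[u] + w, so the sum cannot overflow a char.
    static inline int weight8(const signed char *W8, int V, int u, int v) {
        return (int)W8[(size_t)u * V + v];
    }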

VIII. PERFORMANCE ANALYSIS

We applied our algorithm to sparse, general, almost complete and complete directed graphs of 1K to 30K vertices having up to 900M edges. Our algorithm takes equal time for all four of these categories, because no matter what an edge’s weight is, it must be relaxed. For complete graphs, 30K is approximately the maximum number of vertices that can be processed on a single graphics card with 1 GB RAM using the character data type. We have also written a 4-thread parallel CPU version of the algorithm using the Pthread library of C. We compare the running times of the serial and parallel versions of our algorithm for complete graphs. We also compare our parallel version with P. Harish et al. [2] on the basis of edges, not vertices. We do so because our algorithm is well suited only for dense graphs, where the number of edges is more significant than the number of vertices. Also, a comparison on the basis of vertices would not reflect the density of the graph: e.g., a graph with 30K vertices and average degree 12 has 360K edges, while a complete graph of 30K vertices has 900M edges.

A. Types of Graphs

For all our implementations, we use completely connected, weighted graphs (well suited for DNA microarrays [13]). To maximize the number of vertices, the edge weights are assumed to be in the range [-100, 100]. All the graphs are assumed to be directed, i.e., the entries W[u, v] and W[v, u] may differ. Since we use only complete graphs for comparison, we do not require any graph generation tool; the graphs are generated using two nested for loops and the rand function (for random edge weights) of the C library, as sketched below.
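A sketch of that generation loop (the function name is ours; rand() is seeded however the experiment requires):

    #include <stdlib.h>

    // Complete directed graph with random edge weights in [-100, 100];
    // W is the flattened |V| x |V| matrix from Section IV.
    void generate_complete_graph(int *W, int V) {
        for (int u = 0; u < V; ++u)
            for (int v = 0; v < V; ++v)
                W[(size_t)u * V + v] = (u == v) ? 0 : (rand() % 201) - 100;
    }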

B. Experimental Setup

We use a single NVIDIA GeForce GT 240 graphics card with 1 GB memory and 12 multiprocessors (96 CUDA cores), controlled by an Intel Core i3-350M processor @ 2.26 GHz (2 cores, 4 threads) with 4 GB RAM, running Red Hat Linux 6.0. The serial and parallel CPU versions of the algorithm are compiled with gcc. The parallel CPU version uses the Pthread library and schedules work on all four hardware threads of the CPU.

C. Results

Fig. 1 shows the comparison between the serial CPU (CPU-S), parallel CPU (CPU-P) and GPU versions of our algorithm. The parallel CPU version takes nearly equal time up to 15K vertices, but after that some speedup is obtained. The speedup is not especially large because the number of physical cores is still 2; they are simply used more efficiently in the parallel version than in the serial one. The GPU version, however, gains significant speedup over both CPU versions.

Figure 1. Single Source Shortest Path timings for complete graphs, edge weights in the range [-100, 100]. CPU-S is the serial CPU version, while CPU-P is the parallel CPU version of the algorithm.

The comparison with the P. Harish et al. [2] implementation is shown in Fig. 2. The number of edges for the Harish implementation is calculated from Table 1 of [2] for a 100K-vertex graph. For a 60M-edge graph, their algorithm takes nearly 1.45 sec (100K vertices with average degree 600, Table 1 of [2]), while our algorithm takes just 0.11 sec for the same number of edges (7.8K vertices with average degree 7800). The additional parallelism improves the performance of our algorithm, and the early exit of CUDA_SSSP removes the drawback of the traditional BELLMAN-FORD Algorithm, providing an extra performance benefit. Detailed timings for Fig. 1 and Fig. 2 are listed in Table I and Table II, respectively.

Unlike the previous approaches, our program accepts negative weighted edges and can also detect any reachable NWC. This makes the algorithm more general: even if, by mistake, some negative weights are supplied and they form an NWC, it is detected and the program halts, whereas the previous algorithms would go into an infinite loop. Also, the number of iterations in our algorithm is no greater than in that of P. Harish et al. [2], because they use the previous costs (Ca) to update the current costs (Cua), while we use the current costs (C) themselves for updating. So, if any thread has just updated its vertex’s cost, the new value may be used by its adjacent vertices in the same iteration. This is possible because, in practice, all the threads do not run simultaneously (as the number of cores in the GPU, though large, is limited). The thread scheduler (one per SM) performs the scheduling and context switching of threads on its own.

Figure 2. Single Source Shortest Path timings. GPU-H represents the P. Harish et al. [2] GPU implementation on a 100K-vertex graph with varying degree and edge weights in the range [1, 100], while GPU-O represents our GPU implementation on complete graphs with edge weights in the range [-100, 100].

TABLE I. SSSP TIMINGS OF SERIAL CPU, PARALLEL CPU AND PARALLEL GPU VERSIONS

Number of vertices | CPU-S (ms) | CPU-P (ms) | GPU (ms)
5K                 | 712        | 690        | 59
10K                | 2851       | 2771       | 209
15K                | 6411       | 6318       | 493
20K                | 11427      | 10561      | 880
25K                | 17816      | 16153      | 1412
30K                | 28123      | 23261      | 2265


TABLE II. SSSP TIMINGS OF P. HARISH ET AL. [2] AND OUR GPU IMPLEMENTATION

Number of edges | GPU-H (ms) | GPU-O (ms) | Speedup
20M             | 375        | 51         | 7.35x
40M             | 898        | 73         | 12.30x
60M             | 1449       | 112        | 12.94x
80M             | 1909       | 153        | 12.48x
100M            | 2157       | 209        | 10.32x

IX. CONCLUSIONS AND FUTURE WORK

In this paper, we presented a modified SSSP algorithm suitable for dense graphs on the GPU using NVIDIA’s CUDA technology. The algorithm is generalized in that it also accepts negative weighted edges. We used the adjacency matrix representation to reduce the memory requirement and to exploit parallelism to a greater extent in the case of dense graphs. In some cases, for a further reduction in memory requirement, we used the character data type instead of integer. Early termination of the algorithm reduces the amount of useless computation. For a completely connected graph of 900M edges, our algorithm takes 2.26 seconds.

We are trying to extend our algorithm to accept graphs of even larger sizes using multiple streams, as implemented by P. Harish et al. [2]. We are also developing an APSP algorithm that uses this SSSP algorithm. Finally, we have started working with OpenCL [11] to achieve better scalability of our programs; OpenCL provides a common parallel computing model for GPUs and multicore CPUs.

ACKNOWLEDGMENT

We would like to acknowledge SRMGPC for providing the hardware used in this work. We would also like to thank Saiyedul Islam for his help with CUDA technology.

REFERENCES

[1] NVIDIA, NVIDIA CUDA Programming Guide, Version 3.1.

[2] P. Harish, V. Vineet and P. J. Narayanan, “Large graph algorithms for massively multithreaded architectures,” Technical Report IIIT/TR/2009/74, International Institute of Information Technology, Hyderabad, India, 2009.

[3] P. Harish and P. J. Narayanan, “Accelerating large graph algorithms on the GPU using CUDA,” in HiPC, vol. 4873 of Lecture Notes in Computer Science, 2007, pp. 197-208.

[4] G. J. Katz and J. T. Kider, Jr., “All pairs shortest-paths for large graphs on the GPU,” Proc. 23rd ACM SIGGRAPH/EUROGRAPHICS Symp. Graphics Hardware (GH 08), 2008, pp. 47-55.

[5] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Clifford Stein, Introduction to Algorithms, 3rd ed., PHI Learning Pvt. Ltd., 2010, pp. 643-654.

[6] P. J. Narayanan, “Single source shortest path problem on Processor arrays,” Proc. IEEE Symp. Frontiers of Massively Parallel Computing (Frontiers 92), IEEE Press, Oct. 1992, pp. 553-556, doi:10.1109/FMPC.1992.234924.

[7] Guy Melancon, “Just how dense are dense graphs in the real world?: a methodological note,” Proc. AVI Workshop on Beyond time and errors: novel evaluation methods for information visualization (BELIV 06), ACM New York, 2006, pp. 1-7, doi:10.1145/1168149.1168167.

[8] N. Edmonds, A. Breuer, D. Gregor, and A. Lumsdaine, “Single-Source Shortest Paths with the Parallel Boost Graph Library,” in the Ninth DIMACS Implementation Challenge: The Shortest Path Problem, 2006.

[9] J. R. Crobak, J. W. Berry, K. Madduri, and D. A. Bader, “Advanced shortest paths algorithms on a massively-multithreaded architecture,” Proc. IEEE Symp. Parallel and Distributed Processing (IPDPS 2007), IEEE International, Mar. 2007, pp. 1-8, doi:10.1109/IPDPS.2007.370687.

[10] A. Crauser, K. Mehlhorn, and U. Meyer, “A parallelization of Dijkstra's shortest path algorithm,” Proc. 23rd International Symp. Mathematical Foundations of Computer Science (MFCS 98), Lecture Notes in Computer Science, Springer, vol. 1450/1998, pp. 722-731, 1998, doi:10.1007/BFb0055823.

[11] A. Munshi, “OpenCL: Parallel computing on the GPU and CPU,” SIGGRAPH, Tutorial, 2008.

[12] G. Potamias, “Utilizing Genes functional classification in Microarray data analysis: a hybrid clustering approach,” 9th Panhellenic Conference on Informatics, Nov. 2003.

[13] K. Lee, J. H. Kim, T. S. Chung, B. S. Moon, H. Lee and I. S. Kohane, “Evolution strategy applied to global optimization of Clusters in Gene expression data of DNA Microarrays,” Proc. 2001 IEEE Congress on Evolutionary Computation, May 2001, vol. 2, pp. 845-850, doi:10.1109/CEC.2001.934278.

[14] Joshua M. Stuart, Eran Segal, Daphne Koller and Stuart K. Kim, “A Gene coexpression network for global discovery of conserved Genetic Modules,” Science, vol. 302, Oct. 2003, pp. 249-255, doi:10.1126/science.1087447.

[15] G. Leibon, S. Pauls, D.N. Rockmore and R. Savell, “Topological structures in the equities market network,” Proc. National Academy of Sciences, vol. 105, Dec. 2008, pp. 20589-20594, doi: 10.1073/pnas.0802806106.

[16] A. Buluc, J. R. Gilbert and C. Budak, “Solving path problems on the GPU,” Parallel Computing, vol. 36, Jun. 2010, pp. 241-253, doi:10.1016/j.parco.2009.12.002.
