
ACCELERATING REAL-WORLD

APPLICATIONS

David A. Bader

Georgia Institute of Technology

Professor, School of Computational Science and Engineering


3 | Accelerating Real-world Applications | June 14, 2011


EXASCALE STREAMING DATA ANALYTICS:

REAL-WORLD CHALLENGES

All involve analyzing massive streaming complex networks:

– Health care: disease spread, detection and prevention of epidemics/pandemics (e.g. SARS, avian flu, H1N1 “swine” flu)

– Massive social networks: understanding communities, intentions, population dynamics, pandemic spread, transportation and evacuation

– Intelligence: business analytics, anomaly detection, security, knowledge discovery from massive data sets

– Systems biology: understanding complex life systems, drug design, microbial research, unravel the mysteries of the HIV virus; understand life, disease

– Electric power grid: communication, transportation, energy, water, food supply

– Modeling and simulation: perform full-scale economic-social-political simulations

[Figure: Facebook active users, in millions of users (axis from 0 to 450), plotted quarterly from Dec-04 to Dec-09.]

Exponential growth: more than 600 million active users

Sample queries:

– Allegiance switching: identify entities that switch communities.

– Community structure: identify the genesis and dissipation of communities.

– Phase change: identify significant change in the network structure.

REQUIRES PREDICTING / INFLUENCE CHANGE IN REAL-TIME AT SCALE

Ex: discovered minimal changes in an O(billions)-size complex network that could hide or reveal top influencers in the community


EXAMPLE: MINING TWITTER FOR SOCIAL GOOD

ICPP 2010

Image credit: bioethicsinstitute.org


MASSIVE DATA ANALYTICS: PROTECTING OUR NATION

US High Voltage Transmission Grid (>150,000 miles of line)

Public Health

– CDC / Nation-scale surveillance of public health

– Cancer genomics and drug design: computed Betweenness Centrality of Human Proteome

[Figure: Human Genome core protein interactions, Degree vs. Betweenness Centrality. Log-log scatter plot; x-axis: Degree (1 to 100); y-axis: Betweenness Centrality (1e-7 to 1e+0). Highlighted point: ENSG00000145332.2, Kelch-like protein 8, implicated in breast cancer.]


GRAPHS ARE PERVASIVE IN LARGE-SCALE DATA ANALYSIS

Sources of massive data: petascale simulations, experimental devices, the Internet, scientific applications.

New challenges for analysis: data sizes, heterogeneity, uncertainty, data quality.

Astrophysics

– Problem: Outlier detection.

– Challenges: massive datasets, temporal variations.

– Graph problems: clustering, matching.

Bioinformatics

– Problem: Identifying drug target proteins.

– Challenges: Data heterogeneity, quality.

– Graph problems: centrality, clustering.

Social Informatics

– Problem: Discover emergent communities, model spread of information.

– Challenges: new analytics routines, uncertainty in data.

– Graph problems: clustering, shortest paths, flows.

Image sources: (1) http://physics.nmt.edu/images/astro/hst_starfield.jpg; (2, 3) www.visualComplexity.com


NETWORK ANALYSIS FOR INTELLIGENCE AND SURVEILLANCE

[Krebs '04] Post 9/11 Terrorist Network Analysis from public domain information

Plot masterminds correctly identified from interaction patterns: centrality

A global view of entities is often more

insightful

Detect anomalous activities by

exact/approximate graph matching

Image Source: http://www.orgnet.com/hijackers.html

Image Source: T. Coffman, S. Greenblatt, S. Marcus, Graph-based technologies

for intelligence analysis, CACM, 47 (3, March 2004): pp 45-47


LOOKING AHEAD: GRAND CHALLENGE IN GRAPH ANALYTICS

The driving real-world applications are not just traditional HPC:

– health care, transportation, energy, proteomics, security, data sciences, informatics, …

FLOPS are free, Data is the challenge!

Fundamental abstractions are more irregular and complex than sparse/dense matrices

– Very sparse, very high-dimensional data

– For example, there are few (if any) efficient distributed-memory parallel implementations of even the simplest algorithm for sparse, arbitrary graphs!

We must integrate across algorithms, programming models, and architectures to address the growing requirements of grand challenge applications


INFORMATICS GRAPHS ARE EVEN TOUGHER

Very different from graphs in scientific computing!

– Graphs can be enormous

– Power-law distribution of the number of neighbors

– Small world property – no long paths

– Very limited locality, not partitionable

– Highly unstructured

– Edges and vertices have types

Experience in scientific computing applications provides only limited insight.

Six degrees of Kevin Bacon

Source: Seokhee Hong


OPEN QUESTIONS: ALGORITHMIC KERNELS FOR

SPATIO-TEMPORAL INTERACTION NETWORKS AND GRAPHS

(STING)

Traditional graph theory:

– Graph traversal (e.g. breadth-first search)

– S-T connectivity

– Single-source shortest paths

– All-pairs shortest paths

– Spanning Tree

– Connected Components

– Biconnected Components

– Subgraph isomorphism (pattern matching)

– ….


HIERARCHY OF INTERESTING GRAPH ANALYTICS

Extend single-shot graph queries to include time.

Are there s-t paths between time T1 and T2?

What are the important vertices at time T?

Use persistent queries to monitor properties.

Does the path between s and t shorten drastically?

Is some vertex suddenly very central?

Extend persistent queries to fully dynamic properties.

Does a small community stay independent rather than merge with larger groups?

When does a vertex jump between communities?

New types of queries, new challenges...


GRAPH ANALYTICS FOR SOCIAL NETWORKS

Are there new graph techniques? Do they parallelize? Can the computational systems (algorithms, machines) handle massive networks with millions to billions of individuals? Can the techniques tolerate noisy data, massive data, streaming data, etc.?

• Communities may overlap, exhibit different properties and

sizes, and be driven by different models

– Detect communities (static or emerging)

– Identify important individuals

– Detect anomalous behavior

– Given a community, find a representative member of the

community

– Given a set of individuals, find the best community that

includes them


CENTRALITY IN MASSIVE SOCIAL NETWORK ANALYSIS

Centrality metrics: quantitative measures that capture the importance of a person in a social network

– Betweenness is a global index related to the shortest paths that pass through the person

– Can be used for community detection as well

Identifying central nodes in large complex networks is a key task in a number of applications:

– Biological networks, protein-protein interactions

– Sexual networks and AIDS

– Identifying key actors in terrorist networks

– Organizational behavior

– Supply chain management

– Transportation networks

Current Social Network Analysis (SNA) packages handle 1,000's of entities; our techniques handle BILLIONS (6+ orders of magnitude larger data sets)


BETWEENNESS CENTRALITY (BC)

Key metric in social network analysis [Freeman '77, Goh '02, Newman '03, Brandes '03]

\[ BC(v) = \sum_{s \neq v \neq t \in V} \frac{\sigma_{st}(v)}{\sigma_{st}} \]

$\sigma_{st}$: Number of shortest paths between vertices s and t

$\sigma_{st}(v)$: Number of shortest paths between vertices s and t passing through v

Exact BC is compute-intensive
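To make the definition concrete, here is a minimal brute-force sketch in Python that evaluates the sum directly, one ordered pair (s, t) at a time. It is an illustration only, not the method used in this work: the function names and the five-vertex path graph are assumptions chosen for the example, and the graph is taken to be unweighted. It relies on the fact that v lies on a shortest s-t path exactly when d(s,v) + d(v,t) = d(s,t), in which case the number of such paths through v is the product of the s-v and v-t path counts.

```python
from collections import deque
from itertools import permutations


def count_shortest_paths(adj, s):
    """BFS from s: distance and number of shortest paths to each reachable vertex."""
    dist = {s: 0}
    sigma = {s: 1}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:              # first time w is reached
                dist[w] = dist[u] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:     # u precedes w on a shortest path
                sigma[w] += sigma[u]
    return dist, sigma


def betweenness_by_definition(adj):
    """Sum sigma_st(v) / sigma_st over all ordered pairs (s, t) with s != v != t."""
    info = {s: count_shortest_paths(adj, s) for s in adj}
    bc = {v: 0.0 for v in adj}
    for s, t in permutations(adj, 2):
        dist_s, sigma_s = info[s]
        if t not in dist_s:                # t unreachable from s: no paths to count
            continue
        for v in adj:
            if v == s or v == t:
                continue
            dist_v, sigma_v = info[v]
            # v is on a shortest s-t path iff d(s,v) + d(v,t) == d(s,t)
            if v in dist_s and t in dist_v and dist_s[v] + dist_v[t] == dist_s[t]:
                bc[v] += sigma_s[v] * sigma_v[t] / sigma_s[t]
    return bc


# Hypothetical example: the undirected path 0 - 1 - 2 - 3 - 4.
path_graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
bc = betweenness_by_definition(path_graph)
print(bc)
```

On the path graph the middle vertex scores highest and the two endpoints score zero, because no shortest path passes through an endpoint. The cost of this approach grows with the cube of the number of vertices, which is why exact BC needs a better algorithm at scale.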


BETWEENNESS CENTRALITY ALGORITHMS

Brandes [2003] proposed a faster sequential algorithm for BC on sparse graphs

– $O(mn + n^2 \log n)$ time and $O(n)$ space for weighted graphs

– $O(mn)$ time for unweighted graphs

We designed and implemented the first parallel algorithm:

– [Bader, Madduri; ICPP 2006]

Approximating Betweenness Centrality [Bader Kintali Madduri Mihail 2007]

– Novel approximation algorithm for determining the betweenness of a specific vertex or edge in a graph

– Adaptive in the number of samples

– Empirical result: At least 20X speedup over exact BC

[Figure: Betweenness centrality score (log scale, 1e-1 to 1e+7) versus vertex ID (0 to 4000), comparing exact (scatter data) and approximate (smooth) values. Graph: 4K vertices and 32K edges; System: Sun Fire T2000 (Niagara 1).]


RELATED WORK: PARTITIONING ALGORITHMS

FROM SCIENTIFIC COMPUTING

Theoretical and empirical evidence: existing techniques perform poorly on small-world networks

[Mihail, Papadimitriou '02] Spectral properties of power-law graphs are skewed in favor of high-degree vertices

[Lang '04] On using spectral techniques, “Cut quality varies inversely with cut balance” in social graphs: Yahoo! IM graph, DBLP collaborations

[Abou-Rjeili, Karypis '06] Multilevel partitioning heuristics give large edge-cut for small-world networks; new coarsening schemes necessary


BRANDES’ ALGORITHM

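Brandes showed that the dependency of a source vertex s on vertex v obeys the following recursion:

\[ \delta_s(v) = \sum_{w \,:\, v \in P_s(w)} \frac{\sigma_{sv}}{\sigma_{sw}} \left( 1 + \delta_s(w) \right) \]

where $\sigma_{sv}$ is the number of shortest paths from s to v and $P_s(w)$ is the set of predecessors of w on shortest paths from s.

BC is computed in two stages

1. Iterate over all vertices to compute distance and shortest path counts from s to each vertex.

2. Revisit vertices starting from the farthest vertex and accumulate dependencies to compute BC(v).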


BETWEENNESS CENTRALITY USING GPGPU (JOINT WORK WITH PUSHKAR PANDE)


SEQUENTIAL IMPLEMENTATION

for each v ∈ V do
    BC[v] = 0;
endfor
for each s ∈ V do
    // I. Initialization
    succCount[t] = 0, sigma[t] = 0, and d[t] = -1, ∀ t ∈ V
    sigma[s] = 1; d[s] = 0;
    level = 0; S[level] = {s};
    count = 1;
    // II. Shortest paths
    while count > 0 do
        count = 0
        for each v ∈ S[level] do
            for each neighbor w of v do
                if d[w] = -1 then
                    p = count++;
                    Insert w at pos p in S[level+1];
                    d[w] = level + 1;
                if d[w] = level + 1 then
                    p = succCount[v]++;
                    Insert w at pos p of succList[v];
                    sigma[w] += sigma[v];
            endfor
        endfor
        level++;
    endwhile
    // III. Dependency accumulation by back propagation
    level--;
    while level > 0 do
        for each w ∈ S[level] do
            del[w] = 0;
            for each v ∈ succList[w] do
                del[w] += (sigma[w] / sigma[v]) * (1 + del[v]);
            endfor
            BC[w] += del[w];
        endfor
        level--;
    endwhile
endfor

K. Madduri, D. Ediger, K. Jiang, D.A. Bader, and D.G. Chavarría-Miranda, “A Faster Parallel Algorithm and Efficient

Multithreaded Implementations for Evaluating Betweenness Centrality on Massive Datasets,” Third Workshop on Multithreaded

Architectures and Applications (MTAAP), Rome, Italy, May 29, 2009.
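The sequential procedure above can be sketched as runnable Python. This is a minimal illustration under stated assumptions, not the authors' code: the graph is an unweighted, simple adjacency-list dictionary, the names `betweenness_centrality`, `levels`, and `succ` are chosen for this example, and the source's own path count is set to 1 before traversal. It follows the same three phases (initialization, level-by-level traversal that records successors, and back propagation from the deepest level).

```python
def betweenness_centrality(adj):
    """Successor-based BC for an unweighted graph given as {vertex: [neighbors]}."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # I. Initialization
        sigma = {t: 0 for t in adj}    # number of shortest paths from s
        dist = {t: -1 for t in adj}    # -1 marks "not visited yet"
        succ = {t: [] for t in adj}    # successors on shortest paths from s
        sigma[s], dist[s] = 1, 0
        levels = [[s]]                 # levels[k] holds the vertices at distance k

        # II. Shortest paths (augmented breadth-first traversal)
        while levels[-1]:
            depth = len(levels)        # distance assigned to newly found vertices
            frontier = []
            for v in levels[-1]:
                for w in adj[v]:
                    if dist[w] == -1:
                        dist[w] = depth
                        frontier.append(w)
                    if dist[w] == depth:
                        succ[v].append(w)
                        sigma[w] += sigma[v]
            levels.append(frontier)

        # III. Dependency accumulation by back propagation (source level excluded)
        delta = {t: 0.0 for t in adj}
        for level in reversed(levels[1:]):
            for w in level:
                for v in succ[w]:
                    delta[w] += sigma[w] / sigma[v] * (1.0 + delta[v])
                bc[w] += delta[w]
    return bc


# Hypothetical example: the undirected path 0 - 1 - 2 - 3 - 4.
path_graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
scores = betweenness_centrality(path_graph)
print(scores)
```

Storing successors rather than predecessors means each vertex writes only to its own list during traversal and reads only its own list during accumulation, which is the property the parallel versions build on.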


PARALLELIZATION STRATEGIES

Parallelism can be exploited at three levels of

granularity:

1. Coarse grained: Run multiple Breadth-First traversals in

parallel

2. Medium grained: Process different vertices of the same

level in parallel

3. Fine grained: Process the neighbors of the same vertex in

parallel
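As an illustration of the coarse-grained level only, the sketch below runs one traversal per source vertex on a thread pool and sums the per-source dependencies afterwards. It is a structural sketch under stated assumptions rather than a performance recipe: the names `dependencies_from` and `betweenness_coarse_grained` are hypothetical, the graph is an unweighted adjacency-list dictionary, and standard CPython threads will not actually speed up this CPU-bound loop.

```python
from concurrent.futures import ThreadPoolExecutor


def dependencies_from(adj, s):
    """One augmented BFS from s plus back propagation; returns delta_s(v) for all v."""
    sigma = {v: 0 for v in adj}
    dist = {v: -1 for v in adj}
    sigma[s], dist[s] = 1, 0
    order, frontier = [], [s]
    while frontier:
        order.extend(frontier)             # vertices in non-decreasing distance
        next_frontier = []
        for v in frontier:
            for w in adj[v]:
                if dist[w] == -1:
                    dist[w] = dist[v] + 1
                    next_frontier.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
        frontier = next_frontier
    delta = {v: 0.0 for v in adj}
    for v in reversed(order):              # farthest vertices first
        for w in adj[v]:
            if dist[w] == dist[v] + 1:     # w is a successor of v
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
    delta[s] = 0.0                         # the source does not count toward its own BC
    return delta


def betweenness_coarse_grained(adj, workers=4):
    """Run independent traversals concurrently, then reduce their results."""
    bc = {v: 0.0 for v in adj}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for delta in pool.map(lambda s: dependencies_from(adj, s), adj):
            for v, d in delta.items():     # reduction done by one thread only
                bc[v] += d
    return bc


# Hypothetical example: the undirected path 0 - 1 - 2 - 3 - 4.
path_graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
parallel_scores = betweenness_coarse_grained(path_graph)
print(parallel_scores)
```

Each traversal owns private `sigma`, `dist`, and `delta` arrays, so memory use grows with the number of concurrent traversals. The final sum is done by a single thread here; if the workers updated a shared BC array directly, those updates would have to be atomic.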


MAPPING BC ALGORITHM TO GPGPU

OpenCL provides a hierarchy of threads, grouped into grids and blocks, so it looks promising to exploit medium-grained parallelism at the grid level and fine-grained parallelism at the block level.

Coarse-grained parallelism can also be exploited, but it has the following limitations:

– Memory constraints limit the amount of coarse-grained parallelism that can be exploited, because each concurrent traversal requires its own copy of the data structures.

– The centrality running sum has to be updated atomically.


OUTLINE OF THE HOST CODE

BC[v] = 0, ∀ v ∈ V
for each v ∈ V do
    // I. Initialization
    level = 0; S[level] ← empty stack;
    cuda_initialize (succCount[], sigma[], d[], del[]);   // Kernel invocation
    Enqueue v in S[level]; count = 1;
    // II. Shortest paths (Augmented Breadth-First traversal)
    while count > 0 do
        count = 0;
        for all v ∈ S[level] in parallel do
            cuda_traverse (v, S[], level, succList[], succCount[], sigma[], d[], &count);   // Kernel invocation
        level++;
    // III. Dependency accumulation by back propagation
    level--;
    chunkSize = numThreadsPerBlock;
    while level > 0 do
        for all w1…wchunkSize ∈ S[level] in parallel do
            cuda_accumulate (w1…wchunkSize, S[], level, succList[], succCount[], sigma[], del[]);   // Kernel invocation
        level--;


INITIALIZATION KERNEL

cuda_initialize (succCount[], sigma[], d[], del[], initBC)

    global (d[], sig[], del[], succCount[])
    idx = blockIdx.x * chunkSize + threadIdx.x;   // chunkSize = blockDim.x
    if (idx < n) then                             // n = |V|
        d[idx] = -1;
        sigma[idx] = 0;
        del[idx] = 0;
        succCount[idx] = 0;

[Figure: the arrays d[], sig[], del[], and succ_count[] are divided among thread blocks 0, 1, 2, …, b, with chunkSize = blockDim.x elements per block.]


GRAPH TRAVERSAL KERNEL

cuda_traverse (v, S[], level, succList[], succCount[], sigma[], d[], &count)

    shared (v, nw, myS[], mySuccList[], myCount, mySuccCount)
    global (G(V, E), sig[], d[], S[], succList[], count, succCount[])
    v = S[beg + blockIdx.x];
    w = (thid < nw) ? G.Ev[thid] : -1;
    if (w != -1 and v != w) then
        dw = atomicCAS (&d[w], -1, level + 1);
        if (dw == -1) then
            p = atomicAdd (&myCount, 1);
            myS[p] = w;
            dw = level + 1;
        if (dw == level + 1) then
            p = atomicAdd (&mySuccCount, 1);
            mySuccList[p] = w;
            atomicAdd (&sigma[w], sigma[v]);
    copy_to_global (S, &count, myS, myCount);
    copy_to_global (succList, &succCount[v], mySuccList, mySuccCount);

[Figure: each vertex v1, v2, v3, …, vr in S[level] is assigned to one thread block (1 through b); the threads of a block process the neighbors w1, w2, … of that block's vertex v.]


DEPENDENCY ACCUMULATION KERNEL

cuda_accumulate (w1…wchunkSize, S[], level, succList[], succCount[], sigma[], del[])

    global (S[], succList[], succCount[], del[], sigma[], BC[])
    offset = beg + blockIdx.x * chunkSize + threadIdx.x
    w1…wchunkSize = (offset < end) ? S[offset] : -1;
    if (w != -1) then                  // w ∈ w1…wchunkSize
        myDelw = 0.0;
        nSuccw = succCount[w];
        mySuccList = succList + |G.Ew|;
        for j = 0 to nSuccw - 1 do
            v = mySuccList[j];
            myDelw += (sigma[w] / sigma[v]) * (1 + del[v]);
        del[w] = myDelw;
        BC[w] += myDelw;

[Figure: S[level] is divided among thread blocks 0, 1, 2, …, b, with chunkSize = blockDim.x vertices per block.]


TEST GRAPH INSTANCES

Graph     #Vertices    #Edges        Average degree   Maximum degree   #Connected components
syn1.gr   262,144      2,097,152     8                4,588            53,888
syn2.gr   524,288      4,194,304     8                4,063            115,376
syn3.gr   1,048,576    8,388,608     8                5,421            247,339
syn4.gr   1,000,000    10,000,000    10               10,030           536,229
syn5.gr   1,000,000    10,000,000    10               1,090            475,952

[Figure: Degree distribution of syn4.gr and syn5.gr (log-log plots of frequency versus out-degree).]

• Unweighted, directed graphs were generated using the R-MAT graph generator in SNAP [2].

• The R-MAT [1] method generates graphs that match properties of real-world graphs, such as small diameter and a power-law degree distribution.

1. D. Chakrabarti, Y. Zhan, and C. Faloutsos. R-mat: A recursive model for graph mining. In SDM, 2004.

2. D.A. Bader and K. Madduri, “SNAP, Small-world Network Analysis and Partitioning: an open-source parallel graph framework for the exploration of large-scale networks,” 22nd IEEE International Parallel and Distributed Processing Symposium (IPDPS), Miami, FL, April 14-18, 2008.
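For readers unfamiliar with R-MAT, the sketch below shows the core idea in Python: each edge is placed by recursively choosing one of the four quadrants of the adjacency matrix with probabilities a, b, c, and d. This is not the SNAP generator, and the function name, the seed handling, and the probability values (0.57, 0.19, 0.19, 0.05, a commonly used setting) are assumptions for illustration; the parameters used to build the graphs above are not stated here. Duplicate edges and self-loops are left in.

```python
import random


def rmat_edges(scale, num_edges, a=0.57, b=0.19, c=0.19, d=0.05, seed=0):
    """Generate directed edges over 2**scale vertices by recursive quadrant choice."""
    assert abs(a + b + c + d - 1.0) < 1e-9, "quadrant probabilities must sum to 1"
    rng = random.Random(seed)
    edges = []
    for _ in range(num_edges):
        src = dst = 0
        for _ in range(scale):         # one bit of each endpoint per recursion level
            r = rng.random()
            src <<= 1
            dst <<= 1
            if r < a:
                pass                   # top-left quadrant: both bits stay 0
            elif r < a + b:
                dst |= 1               # top-right quadrant
            elif r < a + b + c:
                src |= 1               # bottom-left quadrant
            else:
                src |= 1               # bottom-right quadrant (probability d)
                dst |= 1
        edges.append((src, dst))
    return edges


# Hypothetical small instance: 2**10 vertices, average out-degree 8.
edges = rmat_edges(scale=10, num_edges=8 * 1024)
out_degree = {}
for src, _ in edges:
    out_degree[src] = out_degree.get(src, 0) + 1
print(len(edges), max(out_degree.values()))
```

Because the top-left quadrant is favoured at every level, low-numbered vertices collect far more edges than the average, which produces the skewed degree distribution the slide describes.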


PRELIMINARY PERFORMANCE RESULTS

Graph     #Vertices    #Edges        CPU Time (s)   GPU Time (s)   Speed Up
syn1.gr   262,144      2,097,152     26.0           20.7           1.26
syn2.gr   524,288      4,194,304     89.9           49.9           1.80
syn3.gr   1,048,576    8,388,608     193.2          87.1           2.22
syn4.gr   1,000,000    10,000,000    92.2           44.0           2.09
syn5.gr   1,000,000    10,000,000    252.2          79.7           3.16

• The computation was carried out for a subset of all-pairs shortest paths

by using 500 source vertices.

CPU : Intel(R) Core(TM)2 CPU 6400

• Clock rate : 2.13 GHz

• Cache : 2MB

• Memory : 2GB

GPU: Tesla C1060 with 30 MP x 8 cores/MP

• Clock rate : 1.3 GHz

• Shared Mem per block : 16 KB

• Memory : 4 GB

Runtime on CPU and GPU in seconds (with global memory stack and successor lists)


PERFORMANCE RESULTS

Graph     #Vertices    #Edges        CPU Time (s)   GPU Time, global memory (s)   GPU Time, shared memory (s)   Speed Up
syn1.gr   262,144      2,097,152     26.0           20.7                          10.3                          2.52
syn2.gr   524,288      4,194,304     89.9           49.9                          23.4                          3.84
syn3.gr   1,048,576    8,388,608     193.2          87.1                          40.2                          4.80
syn4.gr   1,000,000    10,000,000    92.2           44.0                          22.3                          4.14
syn5.gr   1,000,000    10,000,000    252.2          79.7                          24.7                          10.22

Using shared memory for the stack and successor lists improves performance by about 2x or more (from 1.97x on syn4.gr to 3.23x on syn5.gr)
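The speedup figures can be rechecked from the reported times with a few lines of Python. This is only arithmetic on the numbers printed in the two tables; the dictionary names are chosen for this sketch, and small differences from the printed speed-up column are to be expected because the times themselves are rounded to one decimal place.

```python
# Times in seconds, copied from the two performance tables (500 source vertices).
cpu = {"syn1.gr": 26.0, "syn2.gr": 89.9, "syn3.gr": 193.2,
       "syn4.gr": 92.2, "syn5.gr": 252.2}
gpu_global = {"syn1.gr": 20.7, "syn2.gr": 49.9, "syn3.gr": 87.1,
              "syn4.gr": 44.0, "syn5.gr": 79.7}
gpu_shared = {"syn1.gr": 10.3, "syn2.gr": 23.4, "syn3.gr": 40.2,
              "syn4.gr": 22.3, "syn5.gr": 24.7}

# Speedup of each GPU variant over the CPU, and gain from shared memory alone.
speedup_global = {g: cpu[g] / gpu_global[g] for g in cpu}
speedup_shared = {g: cpu[g] / gpu_shared[g] for g in cpu}
shared_gain = {g: gpu_global[g] / gpu_shared[g] for g in cpu}

for g in cpu:
    print(f"{g}: global {speedup_global[g]:.2f}x, "
          f"shared {speedup_shared[g]:.2f}x, "
          f"shared vs. global {shared_gain[g]:.2f}x")
```

The last column is the ratio the slide summarizes: the gain from moving the stack and successor lists into shared memory ranges from just under 2x (syn4.gr) to a little over 3x (syn5.gr).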


PERFORMANCE BOTTLENECK

The implementation exploits parallelism at the block level by distributing vertices at

each level of traversal to different blocks and distributing vertices to threads for

dependency accumulation.

Major Performance Bottlenecks:

– Load imbalance due to the power-law degree distribution causes different thread blocks in a kernel to execute at different speeds. Distributing the edges of the frontier to be examined, instead of its vertices, can improve the load balance.

– Irregular memory access makes it difficult to achieve coalesced memory access, which constrains the effective memory bandwidth.

– Atomic operations on global memory are slow, but they are necessary for exploiting fine-grained parallelism in graph algorithms.


PERFORMANCE SUMMARY (WORK IN PROGRESS)


FURTHER IMPROVEMENTS: LOAD BALANCE

Frontier Pre-processing

• Pre-process the frontier for balanced load

• Split adjacencies into 'warpSize' chunks

• Observed best performance with 'half-warp' sized chunks.

• Each instruction is fetched and executed in parallel over 4 cycles for 32 data elements (a warp)

• Global memory requests for a warp are split into two, one per half-warp
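The frontier pre-processing step can be pictured with a short host-side sketch in Python. It is an assumption-laden illustration, not the implementation behind these slides: the function name `split_frontier`, the (vertex, start, end) work-item layout, and the default chunk size of 16 (half of a 32-element warp) are all chosen for this example.

```python
def split_frontier(frontier, adj, chunk_size=16):
    """Split each frontier vertex's adjacency list into chunks of bounded size.

    Returns work items (vertex, start, end), each covering adj[vertex][start:end].
    """
    work = []
    for v in frontier:
        degree = len(adj[v])
        for start in range(0, degree, chunk_size):
            work.append((v, start, min(start + chunk_size, degree)))
    return work


# Hypothetical frontier with one high-degree vertex and two low-degree ones.
adj = {0: list(range(1, 41)), 1: [0], 2: [0, 1, 3]}
work_items = split_frontier([0, 1, 2], adj)
print(work_items)
```

Without the split, the unit of work for vertex 0 would cover 40 edges while the others cover 1 and 3; after the split no work item covers more than 16 edges, which is the load-balancing effect the bullets describe.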


FURTHER IMPROVEMENT:

DEPENDENCY ACCUMULATION PER WARP


CONCLUSION AND FUTURE WORK

GPGPU is a promising accelerator for real-world applications

Data-intensive irregular applications are suitable candidates for parallelization

– Use of shared memory improved performance by as much as 3x.

Demonstration of high-performance betweenness centrality, a key kernel in the DARPA

HPCS SSCA2 Benchmark, highlights the advantage of a GPU-accelerated system for an

important application.


COLLABORATORS AND ACKNOWLEDGMENTS

Jason Riedy, Research Scientist, (Georgia Tech)

Graduate Students (Georgia Tech):

– Seunghwa Kang (PNNL)

– David Ediger

– Karl Jiang

– Pushkar Pande

Bader PhD Graduates:

– Kamesh Madduri (Lawrence Berkeley National Lab Penn State)

– Guojing Cong (IBM TJ Watson Research Center)

John Feo and Daniel Chavarría-Miranda (Pacific Northwest Nat'l Lab)


RELATED RECENT PUBLICATIONS (2008-2010)

D.A. Bader and K. Madduri, “SNAP, Small-world Network Analysis and Partitioning: an open-source parallel graph framework for the

exploration of large-scale networks,” 22nd IEEE International Parallel and Distributed Processing Symposium (IPDPS), Miami, FL, April 14-18,

2008.

S. Kang, D.A. Bader, “An Efficient Transactional Memory Algorithm for Computing Minimum Spanning Forest of Sparse Graphs,” 14th

ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), Raleigh, NC, February 2009.

Karl Jiang, David Ediger, and David A. Bader. “Generalizing k-Betweenness Centrality Using Short Paths and a Parallel Multithreaded

Implementation.” The 38th International Conference on Parallel Processing (ICPP), Vienna, Austria, September 2009.

Kamesh Madduri, David Ediger, Karl Jiang, David A. Bader, Daniel Chavarría-Miranda. “A Faster Parallel Algorithm and Efficient

Multithreaded Implementations for Evaluating Betweenness Centrality on Massive Datasets.” 3rd Workshop on Multithreaded

Architectures and Applications (MTAAP), Rome, Italy, May 2009.

David A. Bader, et al. “STINGER: Spatio-Temporal Interaction Networks and Graphs (STING) Extensible Representation.” 2009.

David Ediger, Karl Jiang, E. Jason Riedy, and David A. Bader. “Massive Streaming Data Analytics: A Case Study with Clustering

Coefficients,” Fourth Workshop in Multithreaded Architectures and Applications (MTAAP), Atlanta, GA, April 2010.

Seunghwa Kang, David A. Bader. “Large Scale Complex Network Analysis using the Hybrid Combination of a MapReduce cluster and a Highly Multithreaded System,” Fourth Workshop in Multithreaded Architectures and Applications (MTAAP), Atlanta, GA, April 2010.

David Ediger, Karl Jiang, Jason Riedy, David A. Bader, Courtney Corley, Rob Farber and William N. Reynolds. “Massive Social Network

Analysis: Mining Twitter for Social Good,” The 39th International Conference on Parallel Processing (ICPP 2010), San Diego, CA, September

2010.

Virat Agarwal, Fabrizio Petrini, Davide Pasetto and David A. Bader. “Scalable Graph Exploration on Multicore Processors,” The 22nd IEEE

and ACM Supercomputing Conference (SC10), New Orleans, LA, November 2010.


RELATED RECENT PUBLICATIONS (2005-2007)

D.A. Bader, G. Cong, and J. Feo, “On the Architectural Requirements for Efficient Execution of Graph Algorithms,” The 34th International Conference on Parallel Processing (ICPP 2005), pp. 547-556, Georg Sverdrups House, University of Oslo, Norway, June 14-17, 2005.

D.A. Bader and K. Madduri, “Design and Implementation of the HPCS Graph Analysis Benchmark on Symmetric Multiprocessors,” The 12th

International Conference on High Performance Computing (HiPC 2005), D.A. Bader et al., (eds.), Springer-Verlag LNCS 3769, 465-476, Goa, India, December

2005.

D.A. Bader and K. Madduri, “Designing Multithreaded Algorithms for Breadth-First Search and st-connectivity on the Cray MTA-2,” The 35th

International Conference on Parallel Processing (ICPP 2006), Columbus, OH, August 14-18, 2006.

D.A. Bader and K. Madduri, “Parallel Algorithms for Evaluating Centrality Indices in Real-world Networks,” The 35th International Conference on Parallel

Processing (ICPP 2006), Columbus, OH, August 14-18, 2006.

K. Madduri, D.A. Bader, J.W. Berry, and J.R. Crobak, “Parallel Shortest Path Algorithms for Solving Large-Scale Instances,” 9th DIMACS Implementation

Challenge -- The Shortest Path Problem, DIMACS Center, Rutgers University, Piscataway, NJ, November 13-14, 2006.

K. Madduri, D.A. Bader, J.W. Berry, and J.R. Crobak, “An Experimental Study of A Parallel Shortest Path Algorithm for Solving Large-Scale Graph

Instances,” Workshop on Algorithm Engineering and Experiments (ALENEX), New Orleans, LA, January 6, 2007.

J.R. Crobak, J.W. Berry, K. Madduri, and D.A. Bader, “Advanced Shortest Path Algorithms on a Massively-Multithreaded Architecture,” First Workshop

on Multithreaded Architectures and Applications (MTAAP), Long Beach, CA, March 30, 2007.

D.A. Bader and K. Madduri, “High-Performance Combinatorial Techniques for Analyzing Massive Dynamic Interaction Networks,” DIMACS Workshop

on Computational Methods for Dynamic Interaction Networks, DIMACS Center, Rutgers University, Piscataway, NJ, September 24-25, 2007.

D.A. Bader, S. Kintali, K. Madduri, and M. Mihail, “Approximating Betweenness Centrality,” The 5th Workshop on Algorithms and Models for the Web-Graph (WAW2007), San Diego, CA, December 11-12, 2007.

David A. Bader, Kamesh Madduri, Guojing Cong, and John Feo, “Design of Multithreaded Algorithms for Combinatorial Problems,” in S. Rajasekaran and

J. Reif, editors, Handbook of Parallel Computing: Models, Algorithms, and Applications, CRC Press, Chapter 31, 2007.

Kamesh Madduri, David A. Bader, Jonathan W. Berry, Joseph R. Crobak, and Bruce A. Hendrickson, “Multithreaded Algorithms for Processing Massive

Graphs,” in D.A. Bader, editor, Petascale Computing: Algorithms and Applications, Chapman & Hall / CRC Press, Chapter 12, 2007.


Disclaimer & Attribution

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not

limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases,

product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no

obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to

make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes.

NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO

RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS

INFORMATION.

ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY

DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT,

SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED

HEREIN, EVEN IF EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

AMD, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in

this presentation are for informational purposes only and may be trademarks of their respective owners.

The contents of this presentation were provided by individual(s) and/or company listed on the title page. The information and opinions presented in this presentation may not represent AMD's positions, strategies or opinions. Unless explicitly stated, AMD is not responsible for the content herein and no endorsements are implied.

