Experiments Using IBM's Software Transactional Memory Compiler. Barna L. Bihari, ICON Consulting, Inc., representing IBM at LLNL, Livermore, California. John Gyllenhaal, Tom Spelce, Scott Futral, Lawrence Livermore National Laboratory. ScicomP 15 Meeting, Barcelona, Spain, May 18-22, 2009
Transcript
Page 1: stm talk 10.ppt - spscicomp.org · Title: Microsoft PowerPoint - stm_talk_10.ppt [Compatibility Mode] Author: bihari1 Created Date: 6/1/2009 3:13:40 PM

Experiments Using IBM's Software Transactional Memory Compiler

Barna L. Bihari
ICON Consulting, Inc.
Representing IBM at LLNL
Livermore, California

John Gyllenhaal, Tom Spelce, Scott Futral
Lawrence Livermore National Laboratory

ScicomP 15 Meeting, Barcelona, Spain, May 18-22, 2009

Page 2:

Acknowledgements
• Dong Ahn
• Bronis de Supinski
• Peng Wu (IBM)

Page 3:

Transactional memory: what is it?

Motivation: Increased number of cores per node for the foreseeable future, hence shared memory parallel programming or threading
Problem: Multiple threads in a shared-memory environment can lead to race conditions, hence memory conflicts and incorrect execution
Old Solution: Traditionally users had to resort to synchronization techniques such as barriers or mutex type locks
• Locks are expensive and can result in deadlocks
New Solution: Transactions allow operations to go through, but check for memory violations and roll the operations back if necessary
• Transactional memory (TM) ensures atomicity and isolation
• An abstract construct that hides its details from the programmers
• It is intended to make parallel threading easier and more efficient
• Suggested use: when conflict probability is relatively low

Transactional memory enables efficient high level threading
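The "go through, check, redo if necessary" idea can be sketched for a single shared word with a C11 compare-and-swap loop. This is only an illustration of optimistic updating, not a description of how IBM STM works internally; the names `optimistic_add` and `run_optimistic` are hypothetical, and the loop also runs correctly (serially) when OpenMP is not enabled.

```c
#include <stdatomic.h>

/* Hypothetical helper: optimistic update of one shared word.
 * Returns how many times the update had to be redone. */
static long optimistic_add(atomic_long *cell, long delta)
{
    long retries = 0;
    long seen = atomic_load(cell);      /* snapshot the current value */

    /* Try to commit seen + delta. The call fails if *cell no longer
     * equals seen (another thread got in first); it then reloads seen.
     * The weak form may also fail spuriously on some platforms. */
    while (!atomic_compare_exchange_weak(cell, &seen, seen + delta))
        retries++;                      /* redo with the fresh value */

    return retries;
}

/* Adds 1 to a shared total n_iter times from a (possibly threaded) loop. */
static long run_optimistic(long n_iter, long *total_retries)
{
    atomic_long total;
    atomic_long *cell = &total;         /* the loop shares only this pointer */
    long retries = 0;
    long i;

    atomic_init(&total, 0);

    #pragma omp parallel for reduction(+:retries)
    for (i = 0; i < n_iter; i++)
        retries += optimistic_add(cell, 1);

    *total_retries = retries;
    return atomic_load(&total);
}
```

The final total equals `n_iter` regardless of how many retries occurred, which is the property a transactional region provides for larger blocks of code.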

Page 4:

The IBM solution
• IBM has a freely available software implementation of transactional memory
• Currently available for AIX: http://www.alphaworks.ibm.com/tech/xlcstm
• It uses OpenMP constructs
• IBM STM features:
- runtime library has standard and debugging versions
- ensures weak isolation of atomic regions (as opposed to “strong isolation”)
- TM language extensions: #pragma tm atomic default(trans)
- TM function attributes
- intrinsic functions: tm_trans_read/write, tm_notrans_read/write
- runtime statistics dumped via stm_print_stats
- compiling and linking via stmxlc and stmxlc++ commands
- all other XL options valid for C/C++ Enterprise Edition for AIX, v9.0
- need to link via -lstm (standard) or -lstmstat (debug)
- compiler instrumentation report available via the -qreport link option

IBM STM is a freely available state-of-the-art compiler

Page 5:

Current stage of experimentation
• Installed and tested the IBM STM on LLNL’s AIX systems
• Current focus: identify algorithms that would benefit from STM
• Present: actual runtime and parallel speed-up is secondary
• Current goal: learn about potential fertile grounds for TM using STM, hoping for eventual HTM (hardware transactional memory)
• Future goal: actual runtime is ultimate bottom line, via HTM
• Expect much of current work to carry over (OpenMP standard?)
• Algorithms with low but nonzero chance of conflicts are prime candidates
• To date: identified two such algorithms (frequent in simulations)

TM is not for every code: round up the (usual?) suspects

Page 6:

Designing the Toy Code/Benchmark
• Spawn multiple threads
• Have at least one “parallel for” loop
• Should be relatively load balanced
• Should mimic the loop-structure of an actual simulation code or code segment
• Should generate race conditions between threads (i.e. conflicts)
• Some number crunching in between memory updates
• Error checking should be simple
• Upon request, be able to get the “right” answer
• Should sometimes be able to get the “wrong” answer

Goal: simulate thread conflicts in a multiphysics code without the actual code

Page 7:

The simplest toy code one could think of
• The code has one key variable for each thread
• Conflicts on this variable occur in a random way
• Threads increment their own or another thread’s counter
• Outer loop increments conflict probability at each pass (0 to (n_threads-1)/n_threads)
• The update is atomic
• Checksum over all threads is independent of:
- conflict probability
- number of threads
• We always get the correct checksum with STM
• No number-crunching yet

We can confirm that STM corrects unresolved conflicts
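A minimal sketch of such a toy code follows. It is an assumption-laden stand-in, not the authors' program: `#pragma omp atomic` replaces `#pragma tm atomic default(trans)` (which needs IBM's STM compiler), a small integer hash replaces `rand()` so the loop stays thread-safe, and the names `toy_pass`, `hash_u32`, and `MAX_THREADS` are hypothetical. At pass k a thread targets another thread's counter with probability k/n_threads.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

#define MAX_THREADS 64

/* Small integer hash, used as a thread-safe stand-in for rand(). */
static unsigned hash_u32(unsigned x)
{
    x ^= x >> 16; x *= 0x7feb352dU;
    x ^= x >> 15; x *= 0x846ca68bU;
    x ^= x >> 16;
    return x;
}

/* One pass of the toy code; returns the checksum over all counters.
 * Requires 1 <= n_threads <= MAX_THREADS and 0 <= pass < n_threads. */
static long toy_pass(int n_threads, int pass, long iter_count)
{
    long count_array[MAX_THREADS] = {0};
    long checksum = 0;
    long i;
    int t;

    #pragma omp parallel for num_threads(n_threads)
    for (i = 0; i < iter_count; i++) {
        unsigned r = hash_u32((unsigned)i + 1000003u * (unsigned)pass);
        int threadid = 0;
        int target;
#ifdef _OPENMP
        threadid = omp_get_thread_num();
#endif
        target = threadid;
        /* With probability pass/n_threads, pick a different counter. */
        if (n_threads > 1 && (int)(r % (unsigned)n_threads) < pass) {
            int shift = 1 + (int)((r >> 8) % (unsigned)(n_threads - 1));
            target = (threadid + shift) % n_threads;
        }

        /* Stand-in for the transactional region on the slides. */
        #pragma omp atomic
        count_array[target]++;
    }

    for (t = 0; t < n_threads; t++)
        checksum += count_array[t];
    return checksum;
}
```

Because every iteration increments exactly one counter, the checksum should equal `iter_count` for every pass and thread count as long as the update is protected; removing the protection is what allows the "wrong" answer to appear.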

Page 8:

The Syntax

Specifying a transactional memory section

    #pragma tm atomic default(trans)
    {
        count_array[threadid]++;
    }

Specifying the parallel (threaded) region

    #pragma omp parallel for
    for (i=0; i < iter_count; i++)
    {
        /* main body of threaded computation */
    }

Calling the STM diagnostic routine

    stm_print_stats();

Page 9:

Actual conflicts vs. conflict probabilities

[Figure: four panels titled “Conflicts Resolved”, one each for 2, 4, 8, and 16 threads. x-axis: Conflict Probability; y-axis: Number of conflicts (scale ×10^6). Curves: Actual retries, Potential conflicts, Intended conflicts.]

Page 10:

Benchmark code design
• Goal: quick experimentation without production code changes
• Be able to use realistic unstructured mesh connectivity
• Have both deterministic and random run modes
• Chance of conflict low or very low, but cannot be ruled out
• Easy parallelization: embarrassingly parallel loops?
• Current name: BUSTM (Benchmark for Unstructured-mesh Software Transactional Memory)
• Have at least some resemblance to some real algorithms
• Two targets so far:
- deterministic CFD
- Monte Carlo transport

Benchmark codes allow for fast feasibility studies

Page 11:

Unstructured-mesh infrastructure
• Code can handle cells with arbitrary number of faces
• Three (hierarchical) basic element types are used for implementation:
1. nodes or grid points (physical xyz-points in 3-D space)
2. faces (boundaries between cells) - contain nodes
3. cells - contain faces and nodes
• Practical cell types:
- Tetrahedron (4 faces, 4 nodes)
- Hexahedron (6 faces, 8 nodes)
- Triangular prism (5 faces, 6 nodes)
- Pyramid (5 faces, 5 nodes)
• Connections between the three basic elements:
- each face knows which cells are on either side (2 per face)
- each face knows which nodes are part of it (3 or 4 per face)
- each cell knows which nodes are part of it (4 to 8 per cell)
- each cell knows which faces are part of it (4 to 6 per cell)
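One way the connectivity described above could be laid out in C is sketched below. The type and field names are hypothetical (the slides do not show the actual data structures), and the use of -1 for "no cell on this side" is an assumed convention.

```c
enum { MAX_FACE_NODES = 4, MAX_CELL_NODES = 8, MAX_CELL_FACES = 6 };

/* A grid point in 3-D space. */
typedef struct { double x, y, z; } Node;

/* A face knows its nodes and the cells on either side. */
typedef struct {
    int n_nodes;                  /* 3 or 4 */
    int nodes[MAX_FACE_NODES];    /* indices into the node array */
    int cells[2];                 /* cell indices; -1 = boundary (assumed) */
} Face;

/* A cell knows its nodes and its faces. */
typedef struct {
    int n_nodes;                  /* 4 to 8 */
    int nodes[MAX_CELL_NODES];
    int n_faces;                  /* 4 to 6 */
    int faces[MAX_CELL_FACES];
} Cell;

/* Classify a cell by its (faces, nodes) counts from the list above. */
static const char *cell_type_name(const Cell *c)
{
    if (c->n_faces == 4 && c->n_nodes == 4) return "tetrahedron";
    if (c->n_faces == 6 && c->n_nodes == 8) return "hexahedron";
    if (c->n_faces == 5 && c->n_nodes == 6) return "triangular prism";
    if (c->n_faces == 5 && c->n_nodes == 5) return "pyramid";
    return "other";
}
```

Storing only integer indices is what makes indirect indexing pervasive in loops over faces or cells, and it is the reason two threads can end up updating the same cell.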

Page 12:

1. “Deterministic” Conflicts

E.g. Conservative finite volume schemes on unstructured meshes
- Compute-intensive loops are face-based
- Each face has 2 cells, one on either side
- Cells are updated based on face-based loops
- Face (flux) computations are compute-heavy
- Probability of conflicts is very low, but nonzero
- Traditional thread-safety can be expensive
- Potential debugging nightmare
- Transactional memory can have a huge payoff

Page 13:

Test of STM applicability
• The set of connections between the different element types forms a graph
• Two of the three basic elements form the nodes and edges of the graph
• Indirect indexing is pervasive throughout such unstructured-mesh codes
• Example of a triangular prism mesh and its corresponding graph might be:

[Figure: triangular prism mesh and its corresponding graph]

• Heart of any conservative CFD code: flux computation, done face-by-face
• Question: How to simulate flux computation without a full CFD solver?

Answer: Compute the numerical gradient instead

Page 14:

Unstructured-mesh algorithm
• Goal: compute the gradient of a function on an unstructured mesh
• Use well-known formula

$$\frac{1}{|V|}\int_{V} \nabla f \, dV = \frac{1}{|V|}\int_{\partial V} f\,\mathbf{n}\, dS$$

• Approximated at cell j by:

$$\nabla f_j \approx \frac{1}{|C_j|}\sum_{i \in j} f_i\,\mathbf{n}_i$$

- where i runs over the local faces of cell j
- n is the cell face normal (precomputed)

Similar loops compute the numerical divergence essential in conservative CFD codes on unstructured meshes
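A short worked check of the discrete formula, assuming each face normal $\mathbf{n}_i$ points out of cell j and is scaled by the face area (the slide only says the normals are precomputed) and that the cell is closed with planar faces. For a constant function $f = c$:

```latex
\nabla f_j \;\approx\; \frac{1}{|C_j|}\sum_{i \in j} c\,\mathbf{n}_i
\;=\; \frac{c}{|C_j|}\sum_{i \in j} \mathbf{n}_i
\;=\; \mathbf{0},
\qquad\text{because}\qquad
\sum_{i \in j} \mathbf{n}_i \;=\; \oint_{\partial C_j} \mathbf{n}\, dS \;=\; \mathbf{0}.
```

Under these assumptions any nonzero gradient computed for a constant field indicates a lost or duplicated face contribution, which makes the constant field a convenient test for thread conflicts.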

Page 15:

Relevant code section

Gradients computed by accumulation within TM section of compute_cell

    #pragma tm atomic default(trans)
    {
        gradient[cell_no_1] += incr;
        gradient[cell_no_2] -= incr;
    }

The general parallel (threaded) region is face-based

    #pragma omp parallel for
    for (i=0; i < max_face; i++)
    {
        left_neighbor = left_cells[i];
        right_neighbor = right_cells[i];

        compute_cell(incr, left_neighbor);    /* face increments cell */
        compute_cell(incr, right_neighbor);   /* face increments cell */
    }

Error checking: assume f = constant, then grad f = 0
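The face-based accumulation and the constant-f check can be tried on a deliberately tiny stand-in: a 1-D chain of unit-width cells, where face i sits between cell i-1 and cell i. This is a sketch under stated assumptions, not the BUSTM code: `gradient_1d` and `max_abs` are hypothetical names, and `#pragma omp atomic` stands in for the `#pragma tm atomic default(trans)` region, which requires IBM's STM compiler.

```c
/* Accumulate face contributions into gradient[] for a chain of n_cells
 * unit-width cells. f_face holds n_cells + 1 values, one per face. */
static void gradient_1d(const double *f_face, double *gradient, int n_cells)
{
    int i;

    for (i = 0; i < n_cells; i++)
        gradient[i] = 0.0;

    /* Face-based loop: neighbouring faces update the same cell, so the
     * updates must be protected when the loop is threaded. */
    #pragma omp parallel for
    for (i = 0; i <= n_cells; i++) {
        double incr = f_face[i];               /* f times the +x unit normal */
        int left  = i - 1;                     /* -1 at the left boundary */
        int right = (i < n_cells) ? i : -1;    /* -1 at the right boundary */

        if (left >= 0) {
            #pragma omp atomic
            gradient[left] += incr;            /* +x is outward for left cell */
        }
        if (right >= 0) {
            #pragma omp atomic
            gradient[right] -= incr;           /* +x is inward for right cell */
        }
    }
}

/* Largest absolute entry, used for the error check. */
static double max_abs(const double *v, int n)
{
    double m = 0.0;
    int k;
    for (k = 0; k < n; k++) {
        double a = (v[k] < 0.0) ? -v[k] : v[k];
        if (a > m) m = a;
    }
    return m;
}
```

With a constant `f_face` every cell receives one positive and one negative contribution of equal size, so `max_abs` of the result should be zero; a linear `f_face` gives the constant slope.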

Page 16:

Actual cases: 2 meshes used
• Used two 3-D meshes for 2-D testing:
1. Prism mesh, a 2-D layer of 3-D cells (medium): 119893 cells, 420060 faces (240655 BC + 179405 interior), 123132 nodes
2. Hex mesh, a 2-D layer of 3-D cells (small): 3000 cells, 12110 faces (6220 BC + 5890 interior), 6324 nodes

Test cases are borrowed from CFD research simulations

Page 17:

Prism mesh results

[Figure, left panel “Conflicts and errors”: number of conflicts/errors vs. number of threads, without STM and with STM; totals on 1-16 threads. Right panel “Conflicts detected”: number of conflicts vs. run number over 1000 runs, for 1, 2, 4, 8, and 16 threads.]

- Conflict occurrence (detected by STM) is between 0% and 0.00042%
- No conflicts on 1 or 2 threads
- Number of STM-retries is much less than that of “bad results” (non-STM)

STM fixed the conflicts: far fewer “retries” than “wrong answers”

Page 18:

Hex mesh results

[Figure, panel “Conflicts detected”: number of conflicts vs. run number over 1000 runs, for 1, 2, 4, 8, and 16 threads; totals on 1-16 threads.]

- Conflict occurrence (detected by STM) is between 0% and 0.1%
- No conflicts on 1 or 2 threads
- Number of STM-retries is comparable to number of “bad results” (non-STM)

STM fixed the conflicts: “retries” comparable to “wrong answers”

Page 19:

2. Probabilistic Conflicts
• Imagine a (large) number of randomly released particles travelling through a mesh composed of cells
• Each particle operates on many mesh cells as it hits them
• Parallelized (i.e. threaded) loop is over particles
• Conflicts can occur as more than one particle hits the same cell
• This is a simplistic view of a Monte-Carlo simulation
• Embarrassingly parallel loops, except for the conflicts

Page 20:

Test of STM applicability
• There is no one-to-one correspondence between particles and cells
• Question: How to simulate particles without the physics?
• Answer: Use unstructured mesh connectivity to “guide” particles
• Add probabilistic flavor by:
1. randomly selecting in which cell the particle is born
2. randomly selecting which cell-neighbor is next on the particle’s path

Conflict simulation: Multiple particles hit same cell

Page 21:

Relevant code section

Cell counter is incremented within TM section of routine mark_cell

    #pragma tm atomic default(trans)
    {
        cell_counter[cell_no]++;
    }

The parallel (threaded) loop is over the particles

    #pragma omp parallel for
    for (i=0; i < max_particles; i++)
    {
        next_cell = rand();
        while(inside)
        {
            mark_cell(next_cell);              /* particle increments cell */
            next_face = rand();
            next_cell = neighbor(next_face);
            if(next_cell < 0) inside = 0;
        }
    }

Error checking: total cell hits = total path lengths
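The particle loop and its error check can be sketched on a 1-D row of cells. Everything here is an illustrative assumption rather than the authors' code: `walk_particles`, `lcg_next`, and the three-argument `neighbor` are hypothetical, a per-particle linear congruential generator replaces `rand()` (which is not guaranteed to be thread-safe), and `#pragma omp atomic` stands in for the transactional region inside `mark_cell`.

```c
/* Per-particle pseudo-random generator; returns the upper 16 bits. */
static unsigned lcg_next(unsigned *state)
{
    *state = *state * 1664525u + 1013904223u;
    return *state >> 16;
}

/* 1-D stand-in for mesh connectivity: the cell across face 0 (left)
 * or face 1 (right) of `cell`, or -1 if the particle leaves the mesh. */
static int neighbor(int cell, int face, int n_cells)
{
    int next = (face == 0) ? cell - 1 : cell + 1;
    return (next < 0 || next >= n_cells) ? -1 : next;
}

/* Releases max_particles particles; returns the total path length. */
static long walk_particles(int max_particles, int n_cells, long *cell_counter)
{
    long total_path = 0;
    int i;

    for (i = 0; i < n_cells; i++)
        cell_counter[i] = 0;

    #pragma omp parallel for reduction(+:total_path)
    for (i = 0; i < max_particles; i++) {
        unsigned state = 12345u + 7919u * (unsigned)i;   /* per-particle seed */
        int next_cell = (int)(lcg_next(&state) % (unsigned)n_cells);

        while (next_cell >= 0) {
            /* Particle increments the cell it is in. */
            #pragma omp atomic
            cell_counter[next_cell]++;
            total_path++;

            /* Pick the left or right face at random and step across it. */
            next_cell = neighbor(next_cell,
                                 (int)((lcg_next(&state) >> 15) & 1u),
                                 n_cells);
        }
    }
    return total_path;
}
```

Summing `cell_counter` afterwards and comparing it with the returned path length reproduces the slide's check that total cell hits equal total path lengths.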

Page 22:

Prism mesh results

[Figure, left panel “Conflicts and errors”: number of conflicts/errors (scale ×10^6) vs. number of threads, without STM and with STM; totals on 1-16 threads. Right panel “Conflicts detected”: number of conflicts (scale ×10^4) vs. run number over 100 runs, for 1, 2, 4, 8, and 16 threads.]

- Input: number of particles = 12000 (10% of total number of cells)
- Conflict occurrence (detected by STM) is between 0% and 0.0099%
- Number of STM-retries is quite a bit higher than number of “bad results” (non-STM)

STM fixed the conflicts: far fewer “wrong answers” than “retries”

Page 23:

Hex mesh results

[Figure, left panel “Conflicts and errors”: number of conflicts/errors (scale ×10^7) vs. number of threads, without STM and with STM; totals on 1-16 threads. Right panel “Conflicts detected”: number of conflicts (scale ×10^4) vs. run number over 1000 runs, for 1, 2, 4, 8, and 16 threads.]

- Input: number of particles = 15000 (5 times the total number of cells)
- Conflict occurrence (detected by STM) is between 0% and 0.26%
- Number of STM-retries is quite a bit higher than number of “bad results” (non-STM)

STM fixed the conflicts: far fewer “wrong answers” than “retries”

Page 24:

Early results on timings
• We know that STM has large overheads, but… should we time it anyways?
• Make sure statistics/diagnostic option is turned off
• It would give us an idea about parallel behavior
• It would create a minimum expectation for HTM
• Compare to:
1. non-thread-safe code (unfair, but informative)
2. thread-safe directives, such as “critical” (fair)

Results on timings are tentative and approximate
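For reference, the “critical” baseline in the comparison could look like the sketch below, which times a shared update guarded by an OpenMP critical section. The function names are hypothetical, and the timer falls back to CPU time from `clock()` when OpenMP is not enabled, so the numbers are only indicative.

```c
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Wall-clock seconds when OpenMP is enabled, CPU seconds otherwise. */
static double timer_seconds(void)
{
#ifdef _OPENMP
    return omp_get_wtime();
#else
    return (double)clock() / CLOCKS_PER_SEC;
#endif
}

/* Thread-safe mode: every update goes through a critical section.
 * Returns the final count and stores the elapsed time in *seconds. */
static long count_with_critical(long iter_count, double *seconds)
{
    long total = 0;
    long i;
    double t0 = timer_seconds();

    #pragma omp parallel for
    for (i = 0; i < iter_count; i++) {
        #pragma omp critical
        {
            total++;
        }
    }

    *seconds = timer_seconds() - t0;
    return total;
}
```

Deleting the `critical` pragma gives the non-thread-safe mode, whose count may fall short of `iter_count` on multiple threads; the STM mode would replace it with the `#pragma tm atomic default(trans)` region.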

Page 25:

Run-times: deterministic case
Prism mesh

[Figure, two panels: “Wall clock run time in 3 modes” and “Cpu time in 3 modes”. x-axis: Number of threads; y-axis: Run time (s). Curves: without STM/critical, with critical, with STM.]

• Problem scales well (constant user time in “unsafe” mode)
• STM is slightly faster than “critical” on multiple threads
• Single thread overhead for STM is very high

Page 26:

Run-times: deterministic case
Hexahedral mesh

[Figure, two panels: “Wall clock run time in 3 modes” and “Cpu time in 3 modes”. x-axis: Number of threads; y-axis: Run time (s). Curves: without STM/critical, with critical, with STM.]

• Problem scales reasonably up to 4 threads
• STM is slightly faster than “critical” on multiple threads
• Single thread overhead for STM is very high

Page 27:

Run-times: probabilistic case
Prism mesh

[Figure, two panels: “Wall clock run time in 3 modes” and “Cpu time in 3 modes” (scale ×10^4). x-axis: Number of threads; y-axis: Run time (s). Curves: without STM/critical, with critical, with STM.]

• Problem does not scale (uneven particle path lengths)
• STM is slightly faster than “critical” on multiple threads
• Single thread overhead for STM is very high

Page 28:

Run-times: probabilistic case
Hexahedral mesh

[Figure, two panels: “Wall clock run time in 3 modes” and “Cpu time in 3 modes” (scale ×10^4). x-axis: Number of threads; y-axis: Run time (s). Curves: without STM/critical, with critical, with STM.]

• Problem does not scale (uneven particle path lengths)
• STM is slightly faster than “critical” on multiple threads
• Single thread overhead for STM is very high

Page 29:

Summary and Current/Future Work
• Transactional Memory promises to make thread-safe programming easier while keeping efficiency
• IBM has STM available now, ready to be tried out
• STMXLC has a useful statistics/diagnostics tool
• Demonstrated two algorithms which may benefit
• Pay-off is highly algorithm- and problem-dependent
• In compute-heavy codes STM over-predicts conflicts
• It is already competitive with OpenMP “critical”
• Need more accurate timing routines (CLOMP code)
• Will try fine-tuning of STM sections/routines
• Need to demonstrate on real simulations

