stm talk 10.ppt - spscicomp.org · Title: Microsoft PowerPoint - stm_talk_10.ppt [Compatibility...

Experiments Using IBM’s Software Transactional Memory Compiler

Barna L. Bihari
ICON Consulting, Inc.
Representing IBM at LLNL
Livermore, California

John Gyllenhaal, Tom Spelce, Scott Futral
Lawrence Livermore National Laboratory

ScicomP 15 Meeting, Barcelona, Spain, May 18-22, 2009

Acknowledgements
• Dong Ahn
• Bronis de Supinski
• Peng Wu (IBM)

Transactional memory: what is it?
Motivation: Increased number of cores per node for the foreseeable future – shared memory parallel programming or threading
Problem: Multiple threads in a shared-memory environment can lead to race conditions, hence memory conflicts and incorrect execution
Old Solution: Traditionally users had to resort to synchronization techniques such as barriers or mutex-type locks
• Locks are expensive and can result in deadlocks
New Solution: Transactions allow operations to go through, but check for memory violations and roll the operations back if necessary
• Transactional memory (TM) ensures atomicity and isolation
• An abstract construct that hides its details from the programmers
• It is intended to make parallel threading easier and more efficient
• Suggested use: when conflict probability is relatively low

Transactional memory enables efficient high level threading

The IBM solution
• IBM has a freely available software implementation of transactional memory
• Currently available for AIX: http://www.alphaworks.ibm.com/tech/xlcstm
• It uses OpenMP constructs
• IBM STM features:
- runtime library has standard and debugging versions
- ensures weak isolation of atomic regions (as opposed to “strong isolation”)
- TM language extensions: #pragma tm atomic default(trans)
- TM function attributes
- intrinsic functions: tm_trans_read/write, tm_notrans_read/write
- runtime statistics dumped via stm_print_stats
- compiling and linking via stmxlc and stmxlc++ commands
- all other XL options valid for C/C++ Enterprise Edition for AIX, v9.0
- need to link via -lstm (standard) or -lstmstat (debug)
- compiler instrumentation report available via the -qreport link option

IBM STM is a freely available state-of-the-art compiler

Current stage of experimentation
• Installed and tested the IBM STM on LLNL’s AIX systems
• Current focus: identify algorithms that would benefit from STM
• Present: actual runtime and parallel speed-up are secondary
• Current goal: learn about potential fertile grounds for TM using STM, hoping for eventual HTM (hardware transactional memory)
• Future goal: actual runtime is ultimate bottom line – via HTM
• Expect much of current work to carry over – OpenMP standard?
• Algorithms with low but nonzero chance of conflicts are prime candidates
• To date: identified two such algorithms (frequent in simulations)

TM is not for every code: round up the (usual?) suspects

Designing the Toy Code/Benchmark
• Spawn multiple threads
• Have at least one “parallel for” loop
• Should be relatively load balanced
• Should mimic the loop-structure of an actual simulation code or code segment
• Should generate race conditions between threads (i.e. conflicts)
• Some number crunching in between memory updates
• Error checking should be simple
• Upon request, be able to get the “right” answer
• Should sometimes be able to get the “wrong” answer

Goal: simulate thread conflicts in a multiphysics code without the actual code

The simplest toy code one could think of
• The code has one key variable for each thread
• Conflicts on this variable occur in a random way
• Threads increment their own or another thread’s counter
• Outer loop increments conflict probability at each pass (0 to (n_threads-1)/n_threads)
• The update is atomic
• Checksum over all threads is independent of:
- conflict probability
- number of threads
• We always get the correct checksum with STM
• No number-crunching yet

We can confirm that STM corrects unresolved conflicts
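The toy code described on this slide might look roughly like the sketch below. It is not the authors' code: the function name toy_checksum, the helper lcg_next, the MAX_THREADS limit and the seeding scheme are all hypothetical, and the standard OpenMP "atomic" directive is swapped in for IBM's "#pragma tm atomic default(trans)" so that the sketch compiles with any OpenMP compiler (or serially, with the pragmas ignored).

```c
#include <stdlib.h>

#define MAX_THREADS 64   /* hypothetical upper limit for this sketch */

/* Small linear congruential generator; each iteration owns its state,
   so no random-number state is shared between threads. */
static unsigned int lcg_next(unsigned int *state)
{
    *state = *state * 1664525u + 1013904223u;
    return *state >> 8;
}

/* Each iteration increments either its "own" counter or, with
   probability conflict_prob, a randomly chosen counter.
   Returns the checksum over all counters, or -1 on bad input. */
long toy_checksum(int n_threads, long iter_count, double conflict_prob,
                  unsigned int seed)
{
    long count_array[MAX_THREADS] = {0};
    long checksum = 0;
    long i;
    int t;

    if (n_threads < 1 || n_threads > MAX_THREADS)
        return -1;

    #pragma omp parallel for
    for (i = 0; i < iter_count; i++) {
        unsigned int state = seed + (unsigned int)i * 2654435761u;
        int target = (int)(i % n_threads);               /* "own" counter */
        double u = (lcg_next(&state) & 0xFFFFFFu) / 16777216.0;

        if (u < conflict_prob)                           /* intended conflict */
            target = (int)(lcg_next(&state) % (unsigned int)n_threads);

        /* Stand-in for the TM section: the update must be atomic. */
        #pragma omp atomic
        count_array[target]++;
    }

    for (t = 0; t < n_threads; t++)
        checksum += count_array[t];
    return checksum;
}
```

Because every iteration performs exactly one atomic increment, the checksum equals iter_count whatever the conflict probability or thread count, which is the invariant the slide relies on for error checking.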

The Syntax

Specifying a transactional memory section

#pragma tm atomic default(trans)
{
    count_array[threadid]++;
}

Specifying the parallel (threaded) region

#pragma omp parallel for
for (i=0; i < iter_count; i++)
{
    /* main body of threaded computation */
}

Calling the STM diagnostic routine

stm_print_stats();

Conflicts Resolved

[Figure: actual conflicts vs. conflict probabilities, four panels (2, 4, 8 and 16 threads). x-axis: Conflict Probability; y-axis: Number of conflicts (scale x 10^6); series: Actual retries, Potential conflicts, Intended conflicts.]

Benchmark code design
• Goal: quick experimentation without production code changes
• Be able to use realistic unstructured mesh connectivity
• Have both deterministic and random run modes
• Chance of conflict low or very low, but cannot be ruled out
• Easy parallelization: embarrassingly parallel loops?
• Current name: BUSTM (Benchmark for Unstructured-mesh Software Transactional Memory)
• Have at least some resemblance to some real algorithms
• Two targets so far:
- deterministic CFD
- Monte Carlo transport

Benchmark codes allow for fast feasibility studies

Unstructured-mesh infrastructure
• Code can handle cells with arbitrary number of faces
• Three (hierarchical) basic element types are used for implementation:
1. nodes or grid points (physical xyz-points in 3-D space)
2. faces (boundaries between cells) - contain nodes
3. cells - contain faces and nodes
• Practical cell types:
- Tetrahedron (4 faces, 4 nodes)
- Hexahedron (6 faces, 8 nodes)
- Triangular prism (5 faces, 6 nodes)
- Pyramid (5 faces, 5 nodes)
• Connections between the three basic elements:
- each face knows which cells are on either side (2 per face)
- each face knows which nodes are part of it (3 or 4 per face)
- each cell knows which nodes are part of it (4 to 8 per cell)
- each cell knows which faces are part of it (4 to 6 per cell)
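One way to hold the connectivity listed above is sketched below. This is an assumed layout, not BUSTM's actual data structures: the type names, the fixed-size index arrays and the use of -1 for "no entry" are all hypothetical choices made for illustration.

```c
#include <stddef.h>

#define MAX_NODES_PER_FACE 4
#define MAX_FACES_PER_CELL 6
#define MAX_NODES_PER_CELL 8

/* 1. Node: a physical xyz-point in 3-D space. */
typedef struct { double x, y, z; } Node;

/* 2. Face: knows its nodes (3 or 4) and the cells on either side. */
typedef struct {
    int n_nodes;
    int nodes[MAX_NODES_PER_FACE];
    int cells[2];                 /* assumption: -1 marks a boundary side */
} Face;

/* 3. Cell: knows its faces (4 to 6) and its nodes (4 to 8). */
typedef struct {
    int n_faces;
    int n_nodes;
    int faces[MAX_FACES_PER_CELL];
    int nodes[MAX_NODES_PER_CELL];
} Cell;

typedef enum { TETRAHEDRON, HEXAHEDRON, PRISM, PYRAMID } CellType;

/* Set the face/node counts for one of the practical cell types and
   mark every index slot as unused (-1). Returns 0 on success. */
int cell_init(Cell *c, CellType type)
{
    int k;

    if (c == NULL)
        return -1;
    switch (type) {
    case TETRAHEDRON: c->n_faces = 4; c->n_nodes = 4; break;
    case HEXAHEDRON:  c->n_faces = 6; c->n_nodes = 8; break;
    case PRISM:       c->n_faces = 5; c->n_nodes = 6; break;
    case PYRAMID:     c->n_faces = 5; c->n_nodes = 5; break;
    default:          return -1;
    }
    for (k = 0; k < MAX_FACES_PER_CELL; k++) c->faces[k] = -1;
    for (k = 0; k < MAX_NODES_PER_CELL; k++) c->nodes[k] = -1;
    return 0;
}
```

Storing indices rather than pointers keeps the indirect indexing explicit, which is the access pattern that produces the occasional conflicts studied in the rest of the talk.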

1. “Deterministic” Conflicts

E.g. Conservative finite volume schemes on unstructured meshes
- Compute-intensive loops are face-based
- Each face has 2 cells, one on either side
- Cells are updated based on face-based loops
- Face (flux) computations are compute-heavy
- Probability of conflicts is very low, but nonzero
- Traditional thread-safety can be expensive
- Potential debugging nightmare
- Transactional memory can have a huge payoff

Test of STM applicability
• The set of connections between the different element types forms a graph
• Two of the three basic elements form the nodes and edges of the graph
• Indirect indexing is pervasive throughout such unstructured-mesh codes
• Example of a triangular prism mesh and its corresponding graph might be:
[Figure: triangular prism mesh and its corresponding graph]
• Heart of any conservative CFD code: flux computation – done face-by-face
• Question: How to simulate flux computation without a full CFD solver?

Answer: Compute the numerical gradient instead

Unstructured-mesh algorithm
• Goal: compute the gradient of a function on an unstructured mesh
• Use well-known formula:
$$\nabla f = \frac{1}{|V|} \int_{\partial V} f \, \mathbf{n} \, dS$$
• Approximated at cell j by:
$$\nabla f_j \approx \frac{1}{|C_j|} \sum_i f_i \, \mathbf{n}_i$$
- where i runs over the local faces of cell j
- n is the cell face normal (precomputed)

Similar loops compute the numerical divergence essential in conservative CFD codes on unstructured meshes

Relevant code section

Gradients computed by accumulation within TM section of compute_cell

#pragma tm atomic default(trans)
{
    gradient[cell_no_1] += incr;
    gradient[cell_no_2] -= incr;
}

The general parallel (threaded) region is face-based

#pragma omp parallel for
for (i=0; i < max_face; i++)
{
    left_neighbor = left_cells[i];
    right_neighbor = right_cells[i];
    compute_cell(incr, left_neighbor);    /* face increments cell */
    compute_cell(incr, right_neighbor);   /* face increments cell */
}

Error checking: assume f = constant, then grad f = 0
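The error check on this slide can be tried on a deliberately tiny stand-in mesh. The sketch below is an assumption-laden illustration, not the benchmark itself: it uses a hypothetical 1-D periodic chain in which face i separates cell i from cell (i+1) mod n_cells, a unit face normal, and the standard OpenMP "atomic" directive in place of the IBM TM section. Two separate atomic updates are not one transaction, but for this accumulation the final sums are the same.

```c
#include <stdlib.h>

/* Face-based accumulation on a periodic chain of n_cells cells.
   With a constant field f_const every cell receives +incr from one
   face and -incr from the other, so the accumulated "gradient" must
   be zero. Returns the largest |gradient|, or -1.0 on error. */
double gradient_check(int n_cells, double f_const)
{
    double *gradient;
    double max_abs = 0.0;
    int i;

    if (n_cells < 1)
        return -1.0;
    gradient = calloc((size_t)n_cells, sizeof *gradient);
    if (gradient == NULL)
        return -1.0;

    #pragma omp parallel for
    for (i = 0; i < n_cells; i++) {          /* loop over faces */
        int left_neighbor = i;
        int right_neighbor = (i + 1) % n_cells;
        double incr = f_const * 1.0;         /* face value times unit normal */

        /* Different faces may update the same cell: protect both updates. */
        #pragma omp atomic
        gradient[left_neighbor] += incr;
        #pragma omp atomic
        gradient[right_neighbor] -= incr;
    }

    for (i = 0; i < n_cells; i++) {
        double a = gradient[i] < 0.0 ? -gradient[i] : gradient[i];
        if (a > max_abs)
            max_abs = a;
    }
    free(gradient);
    return max_abs;
}
```

Removing the two atomic directives and running on several threads is the simplest way to reproduce the occasional "wrong answers" that the non-STM runs report.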

Actual cases: 2 meshes used
• Used two 3-D meshes for 2-D testing:
1. Prism mesh – 2-D layer of 3-D cells (medium): 119893 cells, 420060 faces (240655 BC + 179405 interior), 123132 nodes
2. Hex mesh – 2-D layer of 3-D cells (small): 3000 cells, 12110 faces (6220 BC + 5890 interior), 6324 nodes

Test cases are borrowed from CFD research simulations

Prism mesh results

[Figure, two panels. "Conflicts and errors": number of conflicts/errors vs. number of threads, totals on 1-16 threads, series: without STM, with STM. "Conflicts detected": number of conflicts vs. run number over 1000 runs, series: 1, 2, 4, 8, 16 threads.]

- Conflict occurrence (detected by STM) is between 0% and 0.00042%
- No conflicts on 1 or 2 threads
- Number of STM-retries is much less than that of “bad results” (non-STM)

STM fixed the conflicts – far fewer “retries” than “wrong answers”

Hex mesh results

[Figure: "Conflicts detected": number of conflicts vs. run number, totals on 1-16 threads, series: 1, 2, 4, 8, 16 threads.]

- Conflict occurrence (detected by STM) is between 0% and 0.1%
- No conflicts on 1 or 2 threads
- Number of STM-retries is comparable to number of “bad results” (non-STM)

STM fixed the conflicts – “retries” comparable to “wrong answers”

2. Probabilistic Conflicts

• Imagine a (large) number of randomly released particles travelling through a mesh composed of cells
• Each particle operates on many mesh cells as it hits them
• Parallelized (i.e. threaded) loop is over particles
• Conflicts can occur as more than one particle hits the same cell
• This is a simplistic view of a Monte-Carlo simulation
• Embarrassingly parallel loops, except for the conflicts

Test of STM applicability
• There is no one-to-one correspondence between particles and cells
• Question: How to simulate particles without the physics?
• Answer: Use unstructured mesh connectivity to “guide” particles
• Add probabilistic flavor by:
1. randomly selecting in which cell the particle is born
2. randomly selecting which cell-neighbor is next on the particle’s path

Conflict simulation: Multiple particles hit same cell

Relevant code section

Cell counter is incremented within TM section of routine mark_cell

#pragma tm atomic default(trans)
{
    cell_counter[cell_no]++;
}

The parallel (threaded) loop is over the particles

#pragma omp parallel for
for (i=0; i < max_particles; i++)
{
    next_cell = rand();
    inside = 1;
    while(inside)
    {
        mark_cell(next_cell);             /* particle increments cell */
        next_face = rand();
        next_cell = neighbor(next_face);
        if(next_cell < 0) inside = 0;
    }
}

Error checking: total cell hits = total path lengths
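The particle loop and its error check can be exercised on a toy mesh as sketched below. Everything specific here is an assumption made for illustration: the mesh is a hypothetical 1-D line of cells, the function particle_walk_check and helper lcg_next are invented names, a per-particle generator replaces rand() so that threads share no random-number state, a step cap guarantees termination, and the standard OpenMP "atomic" directive stands in for the IBM TM section.

```c
#include <stdlib.h>

#define MAX_STEPS 10000   /* hypothetical cap so every walk terminates */

/* Small linear congruential generator with per-particle state. */
static unsigned int lcg_next(unsigned int *state)
{
    *state = *state * 1664525u + 1013904223u;
    return *state;
}

/* Random walk of max_particles particles over a line of n_cells cells.
   Returns 1 if total cell hits equal total path length, 0 if they
   differ, and -1 on error. */
int particle_walk_check(int n_cells, long max_particles, unsigned int seed)
{
    long *cell_counter;
    long total_path = 0, total_hits = 0;
    long i;
    int c;

    if (n_cells < 1 || max_particles < 0)
        return -1;
    cell_counter = calloc((size_t)n_cells, sizeof *cell_counter);
    if (cell_counter == NULL)
        return -1;

    #pragma omp parallel for reduction(+:total_path)
    for (i = 0; i < max_particles; i++) {
        unsigned int state = seed + (unsigned int)i * 2654435761u;
        /* particle is born in a randomly selected cell */
        int next_cell = (int)((lcg_next(&state) >> 8) % (unsigned int)n_cells);
        int steps = 0;
        int inside = 1;

        while (inside) {
            /* particle increments cell: several particles may hit it */
            #pragma omp atomic
            cell_counter[next_cell]++;
            total_path++;
            steps++;

            /* randomly pick the left or right neighbor */
            next_cell += (lcg_next(&state) >> 31) ? 1 : -1;
            if (next_cell < 0 || next_cell >= n_cells || steps >= MAX_STEPS)
                inside = 0;               /* particle left the mesh */
        }
    }

    for (c = 0; c < n_cells; c++)
        total_hits += cell_counter[c];
    free(cell_counter);
    return total_hits == total_path;
}
```

The path lengths are accumulated through an OpenMP reduction, so only the shared cell counters need protection; that is the single point where particles can conflict.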

Prism mesh results

[Figure, two panels. "Conflicts and errors": number of conflicts/errors (scale x 10^6) vs. number of threads, totals on 1-16 threads, series: without STM, with STM. "Conflicts detected": number of conflicts (scale x 10^4) vs. run number over 100 runs, series: 1, 2, 4, 8, 16 threads.]

- Input: number of particles = 12000 (10% of total number of cells)
- Conflict occurrence (detected by STM) is between 0% and 0.0099%
- Number of STM-retries is quite a bit higher than number of “bad results” (non-STM)

STM fixed the conflicts – far fewer “wrong answers” than “retries”

Hex mesh results

[Figure, two panels. "Conflicts and errors": number of conflicts/errors (scale x 10^7) vs. number of threads, totals on 1-16 threads, series: without STM, with STM. "Conflicts detected": number of conflicts (scale x 10^4) vs. run number, series: 1, 2, 4, 8, 16 threads.]

- Input: number of particles = 15000 (5 times the total number of cells)
- Conflict occurrence (detected by STM) is between 0% and 0.26%
- Number of STM-retries is quite a bit higher than number of “bad results” (non-STM)

STM fixed the conflicts – far fewer “wrong answers” than “retries”

Early results on timings
• We know that STM has large overheads, but should we time it anyway?
• Make sure statistics/diagnostic option is turned off
• It would give us an idea about parallel behavior
• It would create a minimum expectation for HTM
• Compare to:
1. non-thread-safe code (unfair, but informative)
2. thread-safe directives, such as “critical” (fair)

Results on timings are tentative and approximate

Run-times: deterministic case
Prism mesh

[Figure, two panels: "Wall clock run time in 3 modes" and "Cpu time in 3 modes". x-axis: Number of threads; y-axis: Run time (s); series: without STM/critical, with critical, with STM.]

• Problem scales well (constant user time in “unsafe” mode)
• STM is slightly faster than “critical” on multiple threads
• Single thread overhead for STM is very high

Run-times: deterministic case
Hexahedral mesh

[Figure, two panels of run time in 3 modes; the recoverable panel title is "Cpu time in 3 modes". x-axis: Number of threads; y-axis: Run time (s); series: without STM/critical, with critical, with STM.]

• Problem scales reasonably up to 4 threads
• STM is slightly faster than “critical” on multiple threads
• Single thread overhead for STM is very high

Run-times: probabilistic case
Prism mesh

[Figure, two panels of run time in 3 modes; the recoverable panel title is "Cpu time in 3 modes" (scale x 10^4). x-axis: Number of threads; y-axis: Run time (s); series: without STM/critical, with critical, with STM.]

• Problem does not scale (uneven particle path lengths)
• STM is slightly faster than “critical” on multiple threads
• Single thread overhead for STM is very high

Run-times: probabilistic case
Hexahedral mesh

[Figure, two panels of run time in 3 modes; the recoverable panel title is "Cpu time in 3 modes" (scale x 10^4). x-axis: Number of threads; y-axis: Run time (s); series: without STM/critical, with critical, with STM.]

• Problem does not scale (uneven particle path lengths)
• STM is slightly faster than “critical” on multiple threads
• Single thread overhead for STM is very high

Summary and Current/Future Work
• Transactional Memory promises to make thread-safe programming easier while keeping efficiency
• IBM has STM available now, ready to be tried out
• STMXLC has a useful statistics/diagnostics tool
• Demonstrated two algorithms which may benefit
• Pay-off is highly algorithm- and problem-dependent
• In compute-heavy codes STM over-predicts conflicts
• It is already competitive with OpenMP “critical”
• Need more accurate timing routines (CLOMP code)
• Will try fine-tuning of STM sections/routines
• Need to demonstrate on real simulations
