Atomic operations considered hARMful - Parallel...

M. Aldinucci, M. Drocco, M. Torquati University of Torino & Pisa - Parallel Computing group

Atomic operations considered hARMful
ARM Research Summit, 15-16 Sept 2016, Cambridge, UK

Moshe Vardi (Sep 2016). "Academic Rankings Considered Harmful!". Communications of the ACM 59 (9)
A Sotirov et al (Dec 2008). "MD5 considered harmful today - Creating a rogue CA certificate". WEB
A Mishra et al (Jun 2006). "Partially Overlapped Channels Not Considered Harmful". Sigmetrics. 34 (63)
Sergei Gorlatch (Jan 2004). "Send-receive considered harmful". TOPLAS 22(1)
J Yoon; M Liu; B Noble (April 2003). "Random Waypoint Considered Harmful". Infocom
J Hagino (Oct 2003). "IPv4-Mapped Addresses on the Wire Considered Harmful". WEB
Eric A. Meyer (Dec 2002). "'Considered Harmful' Essays Considered Harmful". WEB
Ian Hickson (Sep 2002). "Sending XHTML as text/html Considered Harmful". WEB
J Amsterdam (Feb 2002). "Java's new Considered Harmful". Software Development Magazine
Peter Miller (1998). "Recursive Make Considered Harmful". AUUGN. 19 (1)
Tom Christiansen (Oct 1996). "Csh Programming Considered Harmful". UNIX FAQs.org
Hayes, Patrick J., Ford, Ken (1995). "Turing Test Considered Harmful". IJCAI
CA Kent; JC Mogul (Jan 1995). "Fragmentation Considered Harmful". ACM SIGCOMM CCR 25
C. Ponder; B. Bush (1992). "Polymorphism considered harmful". ACM SIGPLAN Notices. 27 (6)
John McCarthy (Dec 1989). "Networks Considered Harmful for Electronic Mail". CACM 32 (12)
Rob Pike and Brian Kernighan (1983). "UNIX Style, or cat -v Considered Harmful". USENIX
Bruce A. Martin (Nov 1976). "Letter O Considered Harmful". ANSI Fortran Standards Committee
William Wulf and Mary Shaw (Feb 1973). "Global Variable Considered Harmful". ACM SIGPLAN Notices. 8 (2)
Edsger Dijkstra (Mar 1968). "Go To Statement Considered Harmful". Communications of the ACM 11 (3)

http://faqs.org

https://en.wikipedia.org/wiki/Patrick_J._Hayes

Plot

• Atomics are (considered) crucial operations in shared memory
• Typically used as global synchronisation devices for equivalent threads, e.g. a workpool
• Fast lock-free algorithms and fast atomics exist, but they may still be centralisation points
• Producer-Consumer problems do not strictly require atomics
• Single-Writer-Single-Reader structures + mediator threads
• Synchronisation in a message-passing style, data globally allocated in a shared-memory fashion
• Networks can be composed. Fully distributed approach. Not dogmatic: sometimes atomics are useful
• Performance is good. Power efficiency is good (not in this talk). Works on PGAS
• FastFlow pattern library. Developed in 4 EU projects over the last 5 years (~10M € total cost)
• LGPL. Over 50K downloads. Works everywhere because it is a plain C++11 header-only library

Atomics

• Shared memory: data transfer + synchronisation
• Atomic operations are the primary means of synchronisation
• Atomicity is (quite) orthogonal to memory consistency
• Both serve synchronisation: atomics → concurrent access, fences → true data dependencies
• x86: atomic + fence is a single operation
• ARMv7: atomics via LL/SC; lock/unlock might require a memory fence

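[Figure: three diagrams.
(1) Producer P1 runs "p = malloc(…); *p = 1; push(p);" and consumer C1 runs "q = pop(); if (*q == 1) …". A memory barrier is needed so that the write to *p is visible before p is popped.
(2) Producers P1, P2 run push(x), push(y) and consumers C1, C2 run pop on one shared structure; each side is labelled "Atomic".
(3) P1 and C1 are connected by a synch path and a data path through shared data (steps ❶ ❷ ❸).]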
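The producer-consumer snippet above (write `*p`, then hand over `p`) can be sketched in portable C++11 as a release/acquire pair. This is a minimal illustration and not code from the talk: the one-slot mailbox `slot` and the function name `publish_and_read` are assumptions made for the example. C++ requires the shared slot to be a `std::atomic` object to avoid a data race, but only plain loads and stores are performed on it, with no read-modify-write operation.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// One-slot mailbox (hypothetical): the producer publishes a pointer here.
static std::atomic<int*> slot{nullptr};

// Runs one producer and one consumer; returns the value the consumer
// read through the published pointer.
int publish_and_read() {
    int observed = 0;
    slot.store(nullptr, std::memory_order_relaxed);

    std::thread consumer([&observed] {
        int* q;
        // Acquire load: pairs with the producer's release store.
        while ((q = slot.load(std::memory_order_acquire)) == nullptr)
            std::this_thread::yield();
        observed = *q;  // the write "*p = 1" is guaranteed to be visible
        delete q;
    });

    std::thread producer([] {
        int* p = new int;
        *p = 1;                                    // data path
        slot.store(p, std::memory_order_release);  // synch path (the barrier)
    });

    producer.join();
    consumer.join();
    return observed;
}
```

Calling `publish_and_read()` from `main` should return 1. On x86 (TSO) the release store typically compiles to an ordinary store, while on ARMv7 the compiler emits a barrier, which matches the x86/ARMv7 distinction made on the slide.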

Atomics are for mutex and lock-free algorithms

• Atomic → mutex → blocking, for coarse-grain tasks
• Atomic → lock-free → non-blocking, for fine-grain tasks
• Atomics are used in Multiple-Writer-Multiple-Reader lock-free algorithms
• Fast lock-free or wait-free structures/FIFOs exist; the race is at PPoPP
  • C. Yang, J. Mellor-Crummey. A wait-free queue as fast as fetch-and-add. PPoPP 2016
  • A. Morrison and Y. Afek. Fast Concurrent Queues for x86 Processors. PPoPP 2013
  • A. Kogan and E. Petrank. Wait-free Queues with Multiple Enqueuers and Dequeuers. PPoPP 2011
• The cost of atomics depends on the operation (FAA/CAS/Swap), the processor, the cache-coherency protocol, …
  • H. Schweizer et al. Evaluating the Cost of Atomic Operations on Modern Architectures. PACT 2015
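To make the FAA/CAS distinction concrete, here is a small sketch of the same shared-counter increment written both ways with `std::atomic`. It is an illustration only: the helper names `inc_faa`, `inc_cas` and `hammer` are invented for this example, and it makes no claim about relative cost, which the slide notes depends on the operation, the processor and the coherency protocol.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Increment with a single fetch-and-add (FAA): always succeeds in one step.
void inc_faa(std::atomic<long>& c) {
    c.fetch_add(1, std::memory_order_relaxed);
}

// Increment with a compare-and-swap (CAS) retry loop: may fail under
// contention and has to retry.
void inc_cas(std::atomic<long>& c) {
    long old = c.load(std::memory_order_relaxed);
    while (!c.compare_exchange_weak(old, old + 1, std::memory_order_relaxed)) {
        // On failure 'old' now holds the current value; just try again.
    }
}

// Runs 'nthreads' threads, each applying 'inc' to one shared counter
// 'iters' times, and returns the final count.
long hammer(void (*inc)(std::atomic<long>&), int nthreads, int iters) {
    std::atomic<long> counter{0};
    std::vector<std::thread> threads;
    for (int t = 0; t < nthreads; ++t)
        threads.emplace_back([&counter, inc, iters] {
            for (int i = 0; i < iters; ++i) inc(counter);
        });
    for (auto& th : threads) th.join();
    return counter.load();
}
```

Both variants give the same final count (for example `hammer(inc_faa, 4, 10000)` and `hammer(inc_cas, 4, 10000)`), but every thread hits the same cache line, which is the centralisation point the talk argues against.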

Atomics: some considerations

• Support global synchronisations. They have a cost, worth paying if you need them
• Not really needed for master-worker, data parallelism, embarrassingly parallel, pipeline, and other unexpected cases …
• Support an on-demand approach. The scheduling policy is embedded in the protocol, hardly programmable
• Shuffle (e.g. for MapReduce): data movement path is fixed and known at run-time
• Affinity problems: parallel memory allocation, user-defined/data-dependent scheduling (Ex. BT)
• Specialised cores, e.g. big.LITTLE
• Memory subsystem and cache coherency can be bottlenecks

[Figure: a user-defined scheduling function f.]

Alternative: Single-Writer-Single-Reader FIFO

• Does not require atomics
• No fence under TSO (total store order), write fence under WO (weak ordering)
  • J. Giacomoni et al. FastForward for efficient pipeline parallelism: a cache-optimized concurrent lock-free queue. PPoPP 2008
  • M. Aldinucci et al. An Efficient Unbounded Lock-Free Queue for Multi-core Systems. Euro-Par 2012
• Enough to support Producer-Consumer
• Inherently asynchronous
• Powerful enough to build a general-purpose parallel programming model
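A bounded Single-Writer-Single-Reader FIFO of this kind can be sketched as follows. This is a simplified illustration in the spirit of the FastForward design (an empty slot holds a null pointer, so producer and consumer never share an index); it is not FastFlow's actual implementation, and the names `SpscQueue` and `stream_sum` are assumptions for the example. Only loads and stores with release/acquire ordering are used, no read-modify-write operation. Cache-line padding, which a tuned queue would add, is omitted.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Bounded single-producer/single-consumer FIFO of non-null pointers.
class SpscQueue {
public:
    explicit SpscQueue(std::size_t capacity) : buf_(capacity), head_(0), tail_(0) {
        for (auto& s : buf_) s.store(nullptr, std::memory_order_relaxed);
    }
    // Producer side only. Returns false when the queue is full.
    bool push(void* item) {
        if (buf_[tail_].load(std::memory_order_acquire) != nullptr) return false;
        buf_[tail_].store(item, std::memory_order_release);  // write fence under WO
        tail_ = (tail_ + 1) % buf_.size();
        return true;
    }
    // Consumer side only. Returns nullptr when the queue is empty.
    void* pop() {
        void* item = buf_[head_].load(std::memory_order_acquire);
        if (item == nullptr) return nullptr;
        buf_[head_].store(nullptr, std::memory_order_release);  // free the slot
        head_ = (head_ + 1) % buf_.size();
        return item;
    }
private:
    std::vector<std::atomic<void*>> buf_;
    std::size_t head_;  // touched by the consumer only
    std::size_t tail_;  // touched by the producer only
};

// Streams the integers 1..n through the queue and returns the sum
// computed on the consumer side.
long stream_sum(int n) {
    SpscQueue q(64);
    std::vector<int> data(n);
    for (int i = 0; i < n; ++i) data[i] = i + 1;
    long sum = 0;
    std::thread consumer([&] {
        for (int received = 0; received < n; ) {
            if (void* p = q.pop()) { sum += *static_cast<int*>(p); ++received; }
            else std::this_thread::yield();
        }
    });
    for (int i = 0; i < n; ++i)
        while (!q.push(&data[i])) std::this_thread::yield();
    consumer.join();
    return sum;
}
```

For instance, `stream_sum(1000)` returns 500500. Because each index has exactly one writer, no compare-and-swap is needed; the only ordering requirement is that an item's contents are visible before its pointer is.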

[Figure: FastFlow unbound queue core-to-core message latency. Y axis: nanoseconds (0 to 45); X axis: buffer size (64, 1024, 8192). Series: same core, different context; different cores, same CPU; different CPUs. Xeon E7-4820 @2.0GHz Sandy Bridge.]

SWSR queues + mediators

[Figure: building blocks of a FastFlow network.
• Nodes: node, mi-node (multiple inputs), mo-node (multiple outputs), GPU node, distributed node; each is attached to channel names or channels
• Networks: symmetric or asymmetric (scatter, gather, etc.)
• FF bound shmem FIFO channel: Single-Producer-Single-Consumer lock-free fence-free queue
• FF unbound shmem FIFO channel: Single-Producer-Single-Consumer lock-free fence-free queue
• FF (lock-free) distributed memory channel: RDMA or TCP]

Generate the network. True data dependencies move across arrows.

[Figure: networks composed from nodes and channels: a pipeline S0 → S1 (replicated) → S2 → S3; farms with emitter E, workers W1 … Wn and collector C; farms whose workers drive GPU-1 … GPU-n; nested and chained combinations of these. One composition without a mediator is annotated "Possible data race".]

Composing via mediators guarantees correctness (data-race and deadlock freedom).

Programming model: synchronisations happen by way of P2P data dependencies (thus no atomics are needed)

• Shared-memory cache-coherent or non-coherent multicore: standard pointers, i.e. synchronisation capabilities; shared memory is accessed with write and read, no atomic read-update operations
• Distributed PGAS: PGAS/opaque pointers, i.e. synchronisation capabilities; one-sided operations, either eager "put" or lazy "get"
• Synchronisations are in a message-passing style, but designers are not forced to think in a distributed way
• No copies are needed; memory fences are, but asynchrony helps

Balance

• Can be used to approximate any MWMR structure
• Coherency is enforced by messages, and could be scaled to PGAS
• User-defined code can be embedded in mediators (as callbacks)
  • User-defined and data-dependent scheduling, stateful data management, e.g. micro-batch, streaming windows, …
• Complex global behaviour is programmed as distributed algorithms
  • Low-level: termination, blocking-to-non-blocking transition, dynamic memory allocation
  • High-level: pipeline, farm, map, reduce, MapReduce
• Compositional
  • Thread graphs can be merged with a congruent behaviour. No need to re-prove correctness
  • Specialised networks can be generated from the structure of the problem (e.g. Alloc, MapReduce)
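The idea of approximating a multi-writer/multi-reader structure with SPSC channels plus mediators, with a user-defined scheduling callback in the emitter, can be sketched as below. This is a toy illustration under stated assumptions and not the FastFlow API: `Channel`, `EOS`, `farm_sum_of_squares` and the `schedule` callback signature are all hypothetical, tasks are positive integers, and termination is signalled by forwarding an end-of-stream marker along every channel.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <thread>
#include <vector>

// Minimal SPSC channel carrying non-zero ints; 0 marks an empty slot.
struct Channel {
    static constexpr std::size_t N = 64;
    std::atomic<int> slot[N];
    std::size_t head = 0, tail = 0;  // head: consumer only, tail: producer only
    Channel() { for (auto& s : slot) s.store(0, std::memory_order_relaxed); }
    bool push(int v) {
        if (slot[tail].load(std::memory_order_acquire) != 0) return false;  // full
        slot[tail].store(v, std::memory_order_release);
        tail = (tail + 1) % N;
        return true;
    }
    int pop() {  // returns 0 when empty
        int v = slot[head].load(std::memory_order_acquire);
        if (v == 0) return 0;
        slot[head].store(0, std::memory_order_release);
        head = (head + 1) % N;
        return v;
    }
};

constexpr int EOS = -1;  // end-of-stream marker (convention of this sketch)

// Emitter (mediator) routes task t to worker schedule(t, nworkers); workers
// square their tasks; the calling thread acts as collector (mediator).
long farm_sum_of_squares(int ntasks, int nworkers,
                         const std::function<int(int, int)>& schedule) {
    std::vector<std::unique_ptr<Channel>> in, out;
    for (int w = 0; w < nworkers; ++w) {
        in.emplace_back(new Channel);
        out.emplace_back(new Channel);
    }
    auto send = [](Channel& c, int v) { while (!c.push(v)) std::this_thread::yield(); };

    std::vector<std::thread> workers;
    for (int w = 0; w < nworkers; ++w)
        workers.emplace_back([&, w] {
            for (;;) {
                int t = in[w]->pop();
                if (t == 0) { std::this_thread::yield(); continue; }
                if (t == EOS) { send(*out[w], EOS); return; }
                send(*out[w], t * t);
            }
        });

    std::thread emitter([&] {
        for (int t = 1; t <= ntasks; ++t) send(*in[schedule(t, nworkers)], t);
        for (int w = 0; w < nworkers; ++w) send(*in[w], EOS);
    });

    long sum = 0;
    for (int live = nworkers; live > 0; ) {
        bool idle = true;
        for (int w = 0; w < nworkers; ++w) {
            int r = out[w]->pop();
            if (r == 0) continue;
            idle = false;
            if (r == EOS) --live; else sum += r;
        }
        if (idle) std::this_thread::yield();
    }
    emitter.join();
    for (auto& th : workers) th.join();
    return sum;
}
```

A round-robin policy would be passed as `[](int t, int n) { return t % n; }`, while a data-dependent policy can inspect the task itself. Every channel has exactly one writer and one reader, so no read-modify-write atomic appears anywhere in the network.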

Test: Torus (Intel E7)

• Intel Xeon CPU E7-4820 @ 2.00GHz
• Node 0 sends k messages; when it gets them back, it produces w messages for each of them
• A torus of threads connected by way of unbounded list-based queues
  • FastFlow (ff)
  • Michael & Scott (ms)
  • Yang & Mellor-Crummey (wf): crashes
• I have results on ARMv7 (not shown)
  • Odroid big.LITTLE, NVidia Tegra
  • Some problems with high-precision time measurement

[Figure: Torus. Y axis: communication capacity (msg/ms), 0 to 25000; X axis: number of threads (0 to 35). Series: ms, grain = 0; ms, grain = 1 us; unbound ff, grain = 0; unbound ff, grain = 1 us. An arrow marks the "Better" direction.]

Test: master-worker (Intel E7)

• Intel Xeon CPU E7-4820 @ 2.00GHz
• Tasks are sizeof(long) messages; task execution time 1 us, 10 us
• All queues are list-based and unbounded
  • ❶ FastFlow (ff)
  • ❷ Yang & Mellor-Crummey, PPoPP 2016 (wf)
  • ❸ Michael & Scott (ms)

[Figure: two master-worker layouts, one with a lock-free FIFO feeding workers W1 … Wn and one with an emitter E feeding workers W1 … Wn; markers ❶ ❷ ❸ label the queues used in each.]

[Figure: Master-Worker. Y axis: speedup (0 to 35); X axis: number of workers (0 to 35). Series: ideal; ms, wf and ff, each with task execution time = 1 us and 10 us. An arrow marks the "Better" direction; a second build of the slide adds the annotation "Methodology!".]

Test: master-worker (odroid big.LITTLE)

[Figure: Master-Worker. Y axis: execution time (ms), 0 to 500; X axis: number of workers (1 to 8). Series: ms, wf and unbound ff, each with grain = 0.1 us and 1 us. An arrow marks the "Better" direction.]

Ex. NGS applications: Bowtie (BT) and BWA (EU FP7 ParaPhrase project)

• Top tools for parallel DNA alignment
• Hand-tuned C/C++/SSE2 code
• Fast spin-locks + Pthreads
• Sequences (reads) are streamed from MMIO files to workers
• Memory bound
• FastFlow master-worker
• Memory affinity, pinning, affinity scheduling (embedded in the pattern)
• BT: up to 200% speedup; BWA: up to 10% speedup over the originals

[Figure: original design. Worker threads W1 … Wn take memory tasks (reads) from a shared-memory input (reads, genome) and write to a shared-memory output (alignment) under mutexes; one mutex is required only in some tools (Bowtie2 = yes, BWA = no).]

[Figure: FastFlow master-worker. Emitter E dispatches memory tasks (reads) from memory input/output (reads, alignment) to core-pinned worker threads W1 … Wn with lock-free synchronisations. Shared memory is interleaved and read-only. Affinity scheduling is implemented via a scheduler thread (active) or a lock-free scheduler object (passive).]

[Figure: bar chart with labels BT2.0-FF, BT2.0, BT2.2.]

C. Misale, G. Ferrero, M. Torquati, M. Aldinucci. Sequence alignment tools: one parallel pattern to rule them all? BioMed Research International, 2014.

Ex. C++14 parallel patterns (copy or move semantics), EU H2020 Rephrase project

int n=300;
// parallel_execution_tbb p{}, f{};
parallel_execution_ff p{}, f{};
// sequential_execution p{}, f{};
// parallel_execution_omp p{}, f{};

Pipeline(p,
    // Pipeline stage S0
    [&]() {
        std::vector<int> v(100);
        // C++11 business code
        return optional<std::vector<int>>(v);
    },
    // Pipeline stage S1
    Farm(f, [&](std::vector<int> v) {
        std::vector<int> acumm( v.size() );
        // C++11 business code
        return acumm;
    }),
    // Pipeline stage S2
    [&]( std::vector<int> acc ) {
        double acumm = 0;
        for ( int i = 0; i < acc.size(); i++ )
            acumm += acc[ i ];
        return acumm;
    },
    // Pipeline stage S3
    [&]( double v ) {
        // C++11 business code
    }
);

[Figure: the resulting network, S0 → S1 (farm, replicated) → S2 → S3.]

Ex: C++14 Spark-like DSL for data analytics. Distributed + multicore + GPU

typedef KeyValue<std::string, int> KV;

class Tokenizer {
public:
    Tokenizer() {};
    std::string operator()(std::string in) {
        std::istringstream f(in);
        std::string s;
        while (std::getline(f, s, ' ')) {
            //std::cout << "send out: " << s << std::endl;
            send_out(s); // defined in global.hpp
        }
        std::cout << "Line tokenized\n";
        return s;
    }
};

class countwords : public Pipe {
public:
    countwords() {
        Tokenizer t;
        add(new FlatMap<std::string, std::string>(t))
            .add(new Map<std::string, KV>([&](std::string in) {
                std::cout << "Preparing pair <" << in << ", 1>\n";
                return KV(in, 1);
            }))
            .add(new Reduce<KV>([&](KV v1, KV v2) { return v1 + v2; })); // ok, global output list name: kv_res
    }
};

int main(int argc, char** argv) {
    if (argc < 2) {
        std::cerr << "Usage: ./main <input file>\n";
        return -1;
    }
    std::string filename = argv[1];

    Pipe p2(new ReadFromFile<std::string>(filename, [](std::string s) { return s; })); // ok, global output list name: file_lines

    p2.to(countwords()).add(new WriteToDisk<KV>("kv_pairs.txt", [&](KV in) { return in; }));

    p2.run();
    p2.to_dotfile("graphviz/wordcount.dot");
}

Ex. Parallel memory allocation

[Figure: threads issue interleaved new(A), new(B), new(C), delete(A), delete(B), delete(C) requests to a single allocator: first the OS allocator, then the FF allocator.]

You can make it parallel with a mutex, or lock-free with atomics; either way you are synchronising too much.

• A network that connects data allocation-deallocation paths
• Faster than POSIX, often faster than Hoard and TBB
  • Unpublished, code available on SourceForge
• Implements deferred deallocation to avoid the ABA problem

Benchmark:

Producer P:
for (i = 0; i < 10M; i++) {
    pi = malloc(rnd(size));
    *pi = ...;
    dispatch_RR pi;
}

Consumer Ci:
while (pi = get()) {
    do_work(1μs, pi);
    free(pi);
}

[Figure: P sends to C1 … Cn over uSPSC queues. With the FF allocator, P's malloc and each consumer's free go through FFalloc instances that are themselves connected by uSPSC queues.]

[Figure: 10M alloc/dealloc (32B), 1 µs tasks, 32-core Intel E7 2GHz. Y axis: time (s), 0 to 14; X axis: number of (dealloc) threads (1 to 32). Series: Hoard-3.9, libc-6, TBB-4.0, FastFlow, Ideal. An arrow marks the "Better" direction.]

FastFlow
http://mc-fastflow.sourceforge.net/

• Toreador (EC-RIA, H2020, ICT-16-2015 big data): TrustwOrthy model-awaRE Analytics Data platfORm (2016, 36 months, total cost 6.5M €)
• Rephrase (EC-RIA, H2020, ICT-2014-1): Refactoring Parallel Heterogeneous Resource-Aware Applications – a Software Engineering Approach (2015, 36 months, total cost 3.5M €)
• HyVar (EC-RIA, H2020, ICT-2014-1): Scalable Hybrid Variability for Distributed Evolving Software Systems (2015, 36 months, total cost 2.8M €)
• REPARA (EC-STREP, 7th FP): Reengineering and Enabling Performance And poweR of Applications (2013, 36 months, total cost 3.5M €)
• ParaPhrase (EC-STREP, 7th FP): Parallel Patterns for Adaptive Heterogeneous Multicore Systems (2011, 42 months, total cost 4.2M €)
• IBM Research: 3 faculty awards 2015 (50K $)
• Noesis Solutions: Machine learning for engineering 2015 (75K €)
• A3CUBE Inc.: FastFlow/PGAS with in-memory fabric 2014
• NVidia Corp: CUDA Research Center at University of Torino 2013
