
Behavior of Synchronization Methods in Commonly Used Languages and Systems

Yiannis Nikolakopoulos, ioaniko@chalmers.se

Joint work with: D. Cederman, B. Chatterjee, N. Nguyen, M. Papatriantafilou, P. Tsigas

Distributed Computing and Systems, Chalmers University of Technology, Gothenburg, Sweden


Developing a multithreaded application…


The boss wants .NET

The client wants speed… (C++?)

Java is nice

Multicores everywhere


The worker threads need to access data

Concurrent Data Structures

Then we need Synchronization.

Developing a multithreaded application…


Implementation

Coarse Grain Locking

Fine Grain Locking

Test And Set

Array Locks

And more!


Implementing Concurrent Data Structures

Performance Bottleneck
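As a concrete illustration of one of the options listed above, here is a minimal Java sketch of an Anderson-style array lock (an assumption about which kind of array lock is meant; the class is illustrative, not the study's implementation). Each waiting thread spins on its own slot rather than on one shared flag, which spreads cache-coherence traffic.

import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicIntegerArray;

// Hedged sketch of an Anderson-style array lock.
// Capacity must be at least the maximum number of threads competing for the lock.
class ArrayLock {
    private final AtomicInteger nextSlot = new AtomicInteger(0);
    private final AtomicIntegerArray flags;   // 1 = this slot may enter, 0 = wait
    private final int size;
    private final ThreadLocal<Integer> mySlot = new ThreadLocal<>();

    ArrayLock(int capacity) {
        size = capacity;
        flags = new AtomicIntegerArray(capacity);
        flags.set(0, 1);                      // the first acquirer may enter immediately
    }

    void lock() {
        int slot = Math.floorMod(nextSlot.getAndIncrement(), size);
        mySlot.set(slot);
        while (flags.get(slot) == 0) {
            // spin on our own slot only
        }
    }

    void unlock() {
        int slot = mySlot.get();
        flags.set(slot, 0);                            // reset our slot for later reuse
        flags.set(Math.floorMod(slot + 1, size), 1);   // hand the lock to the next waiter
    }
}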


Implementation

Coarse Grain Locking

Fine Grain Locking

Test And Set

Array Locks

And more!

Lock Free


Implementing Concurrent Data Structures

Runtime System

Hardware platform

Which is the fastest / most scalable?
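To make two of the listed options concrete, below is a minimal Java sketch of a test-and-set (TAS) spinlock and its test-and-test-and-set (TTAS) variant; the classes are illustrative only, not the implementations benchmarked in the study. Under contention, TAS keeps writing the shared flag, while TTAS mostly reads a locally cached copy and writes only when the lock looks free.

import java.util.concurrent.atomic.AtomicBoolean;

// Minimal TAS spinlock: every acquisition attempt writes the shared flag,
// generating heavy cache-coherence traffic under contention.
class TasLock {
    private final AtomicBoolean locked = new AtomicBoolean(false);

    void lock() {
        while (locked.getAndSet(true)) {
            // spin
        }
    }

    void unlock() {
        locked.set(false);
    }
}

// TTAS variant: spin on a plain read first and attempt the atomic swap
// only when the lock appears free, reducing coherence traffic.
class TtasLock {
    private final AtomicBoolean locked = new AtomicBoolean(false);

    void lock() {
        while (true) {
            while (locked.get()) {
                // spin on the cached value
            }
            if (!locked.getAndSet(true)) {
                return;
            }
        }
    }

    void unlock() {
        locked.set(false);
    }
}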


Implementing concurrent data structures


Problem Statement

• How the interplay of the above parameters and the different synchronization methods affects the performance and behavior of concurrent data structures.


Outline

Introduction

Experiment Setup
Highlights of Study and Results
Conclusion


Which data structures to study?

Represent different levels of contention:
• Queue: 1 or 2 contention points
• Hash table: multiple contention points
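As a hedged illustration of why the queue has so few contention points, here is a sketch of a Michael-Scott-style lock-free enqueue in Java. The algorithm choice and class names are assumptions for illustration, not necessarily the queue used in the study; the point is that every enqueuer competes on the single tail reference (and every dequeuer on the head).

import java.util.concurrent.atomic.AtomicReference;

// Sketch of a Michael-Scott style lock-free enqueue: the tail reference is the
// single contention point that all enqueuing threads race on with CAS.
class LockFreeQueue<T> {
    private static final class Node<T> {
        final T value;
        final AtomicReference<Node<T>> next = new AtomicReference<>(null);
        Node(T value) { this.value = value; }
    }

    private final AtomicReference<Node<T>> head;
    private final AtomicReference<Node<T>> tail;

    LockFreeQueue() {
        Node<T> dummy = new Node<>(null);
        head = new AtomicReference<>(dummy);
        tail = new AtomicReference<>(dummy);
    }

    void enqueue(T value) {
        Node<T> node = new Node<>(value);
        while (true) {
            Node<T> last = tail.get();
            Node<T> next = last.next.get();
            if (last == tail.get()) {                     // tail has not moved meanwhile
                if (next == null) {
                    if (last.next.compareAndSet(null, node)) {
                        tail.compareAndSet(last, node);   // swing the tail forward
                        return;
                    }
                } else {
                    tail.compareAndSet(last, next);       // help a half-finished enqueue
                }
            }
        }
    }
}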


How do we choose an implementation?

Possible criteria:
• Framework dependencies
• Programmability
• “Good” performance


Interpreting “good”

• Throughput: the more operations completed per time unit, the better.

• Is this enough?


Non-fairness


What to measure?

• Throughput: data structure operations completed per time unit.


• Fairness: relates the operations completed by each thread i to the average operations per thread.
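The fairness formula itself did not survive extraction; only its two components (operations by thread i, average operations per thread) are recoverable. The sketch below assumes fairness is the minimum per-thread operation count divided by the average per-thread count, which yields a value between 0 and 1 consistent with the plots that follow; the actual aggregation used in the talk may differ.

// Hedged sketch of a fairness measure built from the two quantities named above.
// ASSUMPTION: fairness = min_i(ops[i]) / avg_i(ops[i]).
final class FairnessMetric {
    static double fairness(long[] opsPerThread) {
        long min = Long.MAX_VALUE;
        long sum = 0;
        for (long ops : opsPerThread) {
            min = Math.min(min, ops);
            sum += ops;
        }
        double avg = (double) sum / opsPerThread.length;
        return avg == 0 ? 1.0 : min / avg;
    }

    public static void main(String[] args) {
        // Example: 4 threads, one of them starved by the synchronization method.
        System.out.println(fairness(new long[] {1000, 950, 980, 120})); // ~0.157
    }
}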


Implementation Parameters


Programming environments: C++, Java, C# (.NET, Mono)

Synchronization methods:
• TAS, TTAS, Lock-free, Array lock
• C++: PMutex, Lock-free with memory management
• Java: Reentrant lock, synchronized
• C#: lock construct, Mutex

NUMA architectures:
• Intel Nehalem, 2 x 6 cores (24 HW threads)
• AMD Bulldozer, 4 x 12 cores (48 HW threads)

Do they influence fairness?
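For the Java column, a brief sketch of the two lock-based mechanisms named above: a synchronized method and a ReentrantLock, including the fair mode that shows up as "Reentrant Fair" in later plots. The counter classes are illustrative only.

import java.util.concurrent.locks.ReentrantLock;

// Illustrative counters guarded by the two Java mechanisms listed in the table.
class SynchronizedCounter {
    private long count;
    synchronized void increment() { count++; }   // monitor-based 'synchronized'
    synchronized long get() { return count; }
}

class ReentrantCounter {
    // 'true' requests the fair variant ("Reentrant Fair"); the default
    // constructor gives the non-fair, typically higher-throughput variant.
    private final ReentrantLock lock = new ReentrantLock(true);
    private long count;

    void increment() {
        lock.lock();
        try { count++; } finally { lock.unlock(); }
    }

    long get() {
        lock.lock();
        try { return count; } finally { lock.unlock(); }
    }
}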


Experiment Parameters

• Different levels of contention
• Number of threads
• Measured time intervals
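A hedged sketch of how these parameters might drive a measurement run: worker threads operate on a shared structure for a fixed interval, per-thread operation counts are recorded, and throughput follows from their sum. The harness shape, the use of ConcurrentLinkedQueue as the shared structure, and all constants are assumptions for illustration, not the study's benchmark code.

import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicBoolean;

// Hedged sketch of a throughput measurement run (not the actual harness).
class BenchmarkSketch {
    public static void main(String[] args) throws InterruptedException {
        final int numThreads = 24;        // experiment parameter: number of threads
        final long intervalMs = 1000;     // experiment parameter: measured time interval
        ConcurrentLinkedQueue<Integer> queue = new ConcurrentLinkedQueue<>();
        AtomicBoolean running = new AtomicBoolean(true);
        long[] ops = new long[numThreads];

        Thread[] workers = new Thread[numThreads];
        for (int i = 0; i < numThreads; i++) {
            final int id = i;
            workers[i] = new Thread(() -> {
                long count = 0;
                while (running.get()) {
                    queue.offer(id);      // one enqueue plus one dequeue = 2 operations
                    queue.poll();
                    count += 2;
                }
                ops[id] = count;
            });
            workers[i].start();
        }

        Thread.sleep(intervalMs);         // run for the measured interval
        running.set(false);
        for (Thread w : workers) w.join();

        long total = 0;
        for (long c : ops) total += c;
        System.out.printf("throughput: %.1f ops/ms%n", (double) total / intervalMs);
    }
}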


Outline

Introduction
Experiment Setup
Highlights of Study and Results
• Queue
  – Fairness
  – Intel vs AMD
  – Throughput vs Fairness
• Hash Table
  – Intel vs AMD
  – Scalability
Conclusion


Observations: Queue

Fairness can change across different measurement intervals (24 threads, high contention).

[Chart: C# (.NET) queue fairness (0 to 1) vs measurement interval (400 to 10000 ms); series: Intel - Lock-free, AMD - Lock-free, Intel - TAS, AMD - TAS]


Observations: Queue

Significantly different fairness behavior across architectures (24 threads, high contention).

[Chart: Java queue fairness (0 to 1) vs measurement interval (400 to 10000 ms); series: Intel - TAS, Intel - TTAS, Intel - Synchronized, Intel - Lock-free]


Observations: Queue

Significantly different fairness behavior across architectures (24 threads, high contention). Lock-free is less affected in this case.

[Chart: Java queue fairness (0 to 1) vs measurement interval (400 to 10000 ms); series: Intel and AMD for TAS, TTAS, Synchronized, Lock-free]


Queue: Throughput vs Fairness

[Charts: C++ queue on Intel, 0.6 s interval. Left: fairness (0 to 1) vs number of threads (2 to 48); right: throughput in operations per ms (thousands) vs number of threads; series: TTAS, Lock-free, PMutex]


Observations: Hash table

• Operations are distributed across different buckets (see the sketch after this list)
• Things get interesting when #threads > #buckets
• Tradeoff between throughput and fairness
  – Different winners and losers
  – Contention is lowered in the linked-list components
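A minimal sketch of the per-bucket locking idea behind these observations (the sketch referenced in the first bullet): with one lock per bucket, threads contend only when they hash to the same bucket, so contention stays low while operations spread out and grows once #threads > #buckets. The class is illustrative, assuming separate chaining with linked lists.

import java.util.concurrent.locks.ReentrantLock;

// Hedged sketch of a hash table with one lock per bucket (fine-grain locking).
class BucketLockedHashTable<K, V> {
    private static final class Entry<K, V> {
        final K key; V value; Entry<K, V> next;
        Entry(K key, V value, Entry<K, V> next) {
            this.key = key; this.value = value; this.next = next;
        }
    }

    private final Entry<K, V>[] buckets;
    private final ReentrantLock[] locks;

    @SuppressWarnings("unchecked")
    BucketLockedHashTable(int numBuckets) {
        buckets = (Entry<K, V>[]) new Entry[numBuckets];
        locks = new ReentrantLock[numBuckets];
        for (int i = 0; i < numBuckets; i++) {
            locks[i] = new ReentrantLock();
        }
    }

    private int bucketFor(K key) {
        return (key.hashCode() & 0x7fffffff) % buckets.length;
    }

    void put(K key, V value) {
        int b = bucketFor(key);
        locks[b].lock();                  // only threads mapping to bucket b contend here
        try {
            for (Entry<K, V> e = buckets[b]; e != null; e = e.next) {
                if (e.key.equals(key)) { e.value = value; return; }
            }
            buckets[b] = new Entry<>(key, value, buckets[b]);  // prepend to the chain
        } finally {
            locks[b].unlock();
        }
    }
}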


Observations: Hash table

Fairness differences in the hash table across architectures (24 threads, high contention).

[Chart: C# (Mono) hash table fairness (0 to 1) vs measurement interval (400 to 10000 ms); series: Intel - TAS, Intel - TTAS, Intel - Lock-free]


Observations: Hash table

Fairness differences in the hash table across architectures (24 threads, high contention). Lock-free is again not affected.

[Chart: C# (Mono) hash table fairness (0 to 1) vs measurement interval (400 to 10000 ms); series: Intel and AMD for TAS, TTAS, Lock-free]


Observations: Hash table

In C++, custom memory management and lock-free implementations excel in scalability and performance.

[Charts: Hash table throughput, successful operations per ms (thousands) vs number of threads (2 to 48). Left: C++ with TAS, TTAS, Lock-free, Array Lock, PMutex, Lock-free with MM; right: Java with TAS, TTAS, Lock-free, Array Lock, Reentrant, Reentrant Fair, Synchronized]


Conclusion

• Complex synchronization mechanisms (PMutex, Reentrant lock) pay off in heavily contended hot spots

• Scalability via more complex, inherently parallel designs and implementations

• Tradeoff between throughput and fairness
  – LF Hash table
  – Reentrant lock vs Array Lock vs LF Queue

• Fairness can be heavily influenced by HW
  – Interesting exceptions

Which is the fastest / most scalable?
Is fairness influenced by NUMA?