
Behavior of Synchronization Methods in Commonly Used Languages and Systems

Yiannis Nikolakopoulos, ioaniko@chalmers.se

Joint work with: D. Cederman, B. Chatterjee, N. Nguyen, M. Papatriantafilou, P. Tsigas

Distributed Computing and Systems, Chalmers University of Technology, Gothenburg, Sweden


Developing a multithreaded application…


The boss wants .NET

The client wants speed… (C++?)

Java is nice

Multicores everywhere


The worker threads need to access data

Concurrent Data Structures

Then we need Synchronization.

Developing a multithreaded application…


Implementation

Coarse Grain Locking

Fine Grain Locking

Test And Set

Array Locks

And more!


Implementing Concurrent Data Structures

Performance Bottleneck
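As a concrete illustration of one of the options listed above, here is a minimal Java sketch of an Anderson-style array lock (an assumption about which kind of array lock is meant; the class is illustrative, not the study's implementation). Each waiting thread spins on its own slot rather than on one shared flag, which spreads cache-coherence traffic.

import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicIntegerArray;

// Hedged sketch of an Anderson-style array lock.
// Capacity must be at least the maximum number of threads competing for the lock.
class ArrayLock {
    private final AtomicInteger nextSlot = new AtomicInteger(0);
    private final AtomicIntegerArray flags;   // 1 = this slot may enter, 0 = wait
    private final int size;
    private final ThreadLocal<Integer> mySlot = new ThreadLocal<>();

    ArrayLock(int capacity) {
        size = capacity;
        flags = new AtomicIntegerArray(capacity);
        flags.set(0, 1);                      // the first acquirer may enter immediately
    }

    void lock() {
        int slot = Math.floorMod(nextSlot.getAndIncrement(), size);
        mySlot.set(slot);
        while (flags.get(slot) == 0) {
            // spin on our own slot only
        }
    }

    void unlock() {
        int slot = mySlot.get();
        flags.set(slot, 0);                            // reset our slot for later reuse
        flags.set(Math.floorMod(slot + 1, size), 1);   // hand the lock to the next waiter
    }
}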


Implementation

Coarse Grain Locking

Fine Grain Locking

Test And Set

Array Locks

And more!

Lock Free


Implementing Concurrent Data Structures

Runtime System

Hardware platform

Which is the fastest / most scalable?
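To make two of the listed options concrete, below is a minimal Java sketch of a test-and-set (TAS) spinlock and its test-and-test-and-set (TTAS) variant; the classes are illustrative only, not the implementations benchmarked in the study. Under contention, TAS keeps writing the shared flag, while TTAS mostly reads a locally cached copy and writes only when the lock looks free.

import java.util.concurrent.atomic.AtomicBoolean;

// Minimal TAS spinlock: every acquisition attempt writes the shared flag,
// generating heavy cache-coherence traffic under contention.
class TasLock {
    private final AtomicBoolean locked = new AtomicBoolean(false);

    void lock() {
        while (locked.getAndSet(true)) {
            // spin
        }
    }

    void unlock() {
        locked.set(false);
    }
}

// TTAS variant: spin on a plain read first and attempt the atomic swap
// only when the lock appears free, reducing coherence traffic.
class TtasLock {
    private final AtomicBoolean locked = new AtomicBoolean(false);

    void lock() {
        while (true) {
            while (locked.get()) {
                // spin on the cached value
            }
            if (!locked.getAndSet(true)) {
                return;
            }
        }
    }

    void unlock() {
        locked.set(false);
    }
}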


Implementing concurrent data structures


Problem Statement

• How the interplay of the above parameters and the different synchronization methods affects the performance and behavior of concurrent data structures.


Outline

Introduction

Experiment Setup
Highlights of Study and Results
Conclusion


Which data structures to study?

Represent different levels of contention:
• Queue: 1 or 2 contention points
• Hash table: multiple contention points
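As a hedged illustration of why the queue has so few contention points, here is a sketch of a Michael-Scott-style lock-free enqueue in Java. The algorithm choice and class names are assumptions for illustration, not necessarily the queue used in the study; the point is that every enqueuer competes on the single tail reference (and every dequeuer on the head).

import java.util.concurrent.atomic.AtomicReference;

// Sketch of a Michael-Scott style lock-free enqueue: the tail reference is the
// single contention point that all enqueuing threads race on with CAS.
class LockFreeQueue<T> {
    private static final class Node<T> {
        final T value;
        final AtomicReference<Node<T>> next = new AtomicReference<>(null);
        Node(T value) { this.value = value; }
    }

    private final AtomicReference<Node<T>> head;
    private final AtomicReference<Node<T>> tail;

    LockFreeQueue() {
        Node<T> dummy = new Node<>(null);
        head = new AtomicReference<>(dummy);
        tail = new AtomicReference<>(dummy);
    }

    void enqueue(T value) {
        Node<T> node = new Node<>(value);
        while (true) {
            Node<T> last = tail.get();
            Node<T> next = last.next.get();
            if (last == tail.get()) {                     // tail has not moved meanwhile
                if (next == null) {
                    if (last.next.compareAndSet(null, node)) {
                        tail.compareAndSet(last, node);   // swing the tail forward
                        return;
                    }
                } else {
                    tail.compareAndSet(last, next);       // help a half-finished enqueue
                }
            }
        }
    }
}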


How do we choose an implementation?

Possible criteria:
• Framework dependencies
• Programmability
• “Good” performance


Interpreting “good”

• Throughput: the more operations completed per time unit, the better.

• Is this enough?


Non-fairness


What to measure?

• Throughput: data structure operations completed per time unit.


• Fairness: relates the operations completed by each thread i to the average operations per thread.
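The fairness formula itself did not survive extraction; only its two components (operations by thread i, average operations per thread) are recoverable. The sketch below assumes fairness is the minimum per-thread operation count divided by the average per-thread count, which yields a value between 0 and 1 consistent with the plots that follow; the actual aggregation used in the talk may differ.

// Hedged sketch of a fairness measure built from the two quantities named above.
// ASSUMPTION: fairness = min_i(ops[i]) / avg_i(ops[i]).
final class FairnessMetric {
    static double fairness(long[] opsPerThread) {
        long min = Long.MAX_VALUE;
        long sum = 0;
        for (long ops : opsPerThread) {
            min = Math.min(min, ops);
            sum += ops;
        }
        double avg = (double) sum / opsPerThread.length;
        return avg == 0 ? 1.0 : min / avg;
    }

    public static void main(String[] args) {
        // Example: 4 threads, one of them starved by the synchronization method.
        System.out.println(fairness(new long[] {1000, 950, 980, 120})); // ~0.157
    }
}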


Implementation Parameters


Programming environments: C++, Java, C# (.NET, Mono)

Synchronization methods:
• TAS, TTAS, Lock-free, Array lock
• C++: PMutex, Lock-free with memory management
• Java: Reentrant lock, synchronized
• C#: lock construct, Mutex

NUMA architectures:
• Intel Nehalem, 2 x 6 cores (24 HW threads)
• AMD Bulldozer, 4 x 12 cores (48 HW threads)

Do they influence fairness?
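For the Java column, a brief sketch of the two lock-based mechanisms named above: a synchronized method and a ReentrantLock, including the fair mode that shows up as "Reentrant Fair" in later plots. The counter classes are illustrative only.

import java.util.concurrent.locks.ReentrantLock;

// Illustrative counters guarded by the two Java mechanisms listed in the table.
class SynchronizedCounter {
    private long count;
    synchronized void increment() { count++; }   // monitor-based 'synchronized'
    synchronized long get() { return count; }
}

class ReentrantCounter {
    // 'true' requests the fair variant ("Reentrant Fair"); the default
    // constructor gives the non-fair, typically higher-throughput variant.
    private final ReentrantLock lock = new ReentrantLock(true);
    private long count;

    void increment() {
        lock.lock();
        try { count++; } finally { lock.unlock(); }
    }

    long get() {
        lock.lock();
        try { return count; } finally { lock.unlock(); }
    }
}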


Experiment Parameters

• Different levels of contention
• Number of threads
• Measured time intervals
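A hedged sketch of how these parameters might drive a measurement run: worker threads operate on a shared structure for a fixed interval, per-thread operation counts are recorded, and throughput follows from their sum. The harness shape, the use of ConcurrentLinkedQueue as the shared structure, and all constants are assumptions for illustration, not the study's benchmark code.

import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicBoolean;

// Hedged sketch of a throughput measurement run (not the actual harness).
class BenchmarkSketch {
    public static void main(String[] args) throws InterruptedException {
        final int numThreads = 24;        // experiment parameter: number of threads
        final long intervalMs = 1000;     // experiment parameter: measured time interval
        ConcurrentLinkedQueue<Integer> queue = new ConcurrentLinkedQueue<>();
        AtomicBoolean running = new AtomicBoolean(true);
        long[] ops = new long[numThreads];

        Thread[] workers = new Thread[numThreads];
        for (int i = 0; i < numThreads; i++) {
            final int id = i;
            workers[i] = new Thread(() -> {
                long count = 0;
                while (running.get()) {
                    queue.offer(id);      // one enqueue plus one dequeue = 2 operations
                    queue.poll();
                    count += 2;
                }
                ops[id] = count;
            });
            workers[i].start();
        }

        Thread.sleep(intervalMs);         // run for the measured interval
        running.set(false);
        for (Thread w : workers) w.join();

        long total = 0;
        for (long c : ops) total += c;
        System.out.printf("throughput: %.1f ops/ms%n", (double) total / intervalMs);
    }
}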


Outline

Introduction
Experiment Setup
Highlights of Study and Results
• Queue
  – Fairness
  – Intel vs AMD
  – Throughput vs Fairness
• Hash Table
  – Intel vs AMD
  – Scalability
Conclusion


Observations: Queue

Fairness can change across different measurement intervals (24 threads, high contention).

[Chart: C# (.NET) queue fairness (0 to 1) vs measurement interval (400 to 10000 ms); series: Intel - Lock-free, AMD - Lock-free, Intel - TAS, AMD - TAS]


Observations: Queue

Significantly different fairness behavior across architectures (24 threads, high contention).

[Chart: Java queue fairness (0 to 1) vs measurement interval (400 to 10000 ms); series: Intel - TAS, Intel - TTAS, Intel - Synchronized, Intel - Lock-free]


Observations: Queue

Significantly different fairness behavior across architectures (24 threads, high contention). Lock-free is less affected in this case.

[Chart: Java queue fairness (0 to 1) vs measurement interval (400 to 10000 ms); series: Intel and AMD for TAS, TTAS, Synchronized, Lock-free]


Queue: Throughput vs Fairness

[Charts: C++ queue on Intel, 0.6 s interval. Left: fairness (0 to 1) vs number of threads (2 to 48); right: throughput in operations per ms (thousands) vs number of threads; series: TTAS, Lock-free, PMutex]


Observations: Hash table

• Operations are distributed across different buckets (see the sketch after this list)
• Things get interesting when #threads > #buckets
• Tradeoff between throughput and fairness
  – Different winners and losers
  – Contention is lowered in the linked-list components
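A minimal sketch of the per-bucket locking idea behind these observations (the sketch referenced in the first bullet): with one lock per bucket, threads contend only when they hash to the same bucket, so contention stays low while operations spread out and grows once #threads > #buckets. The class is illustrative, assuming separate chaining with linked lists.

import java.util.concurrent.locks.ReentrantLock;

// Hedged sketch of a hash table with one lock per bucket (fine-grain locking).
class BucketLockedHashTable<K, V> {
    private static final class Entry<K, V> {
        final K key; V value; Entry<K, V> next;
        Entry(K key, V value, Entry<K, V> next) {
            this.key = key; this.value = value; this.next = next;
        }
    }

    private final Entry<K, V>[] buckets;
    private final ReentrantLock[] locks;

    @SuppressWarnings("unchecked")
    BucketLockedHashTable(int numBuckets) {
        buckets = (Entry<K, V>[]) new Entry[numBuckets];
        locks = new ReentrantLock[numBuckets];
        for (int i = 0; i < numBuckets; i++) {
            locks[i] = new ReentrantLock();
        }
    }

    private int bucketFor(K key) {
        return (key.hashCode() & 0x7fffffff) % buckets.length;
    }

    void put(K key, V value) {
        int b = bucketFor(key);
        locks[b].lock();                  // only threads mapping to bucket b contend here
        try {
            for (Entry<K, V> e = buckets[b]; e != null; e = e.next) {
                if (e.key.equals(key)) { e.value = value; return; }
            }
            buckets[b] = new Entry<>(key, value, buckets[b]);  // prepend to the chain
        } finally {
            locks[b].unlock();
        }
    }
}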


Observations: Hash table

Fairness differences in the hash table across architectures (24 threads, high contention).

[Chart: C# (Mono) hash table fairness (0 to 1) vs measurement interval (400 to 10000 ms); series: Intel - TAS, Intel - TTAS, Intel - Lock-free]


Observations: Hash table

Fairness differences in the hash table across architectures (24 threads, high contention). Lock-free is again not affected.

[Chart: C# (Mono) hash table fairness (0 to 1) vs measurement interval (400 to 10000 ms); series: Intel and AMD for TAS, TTAS, Lock-free]


Observations: Hash table

In C++, custom memory management and lock-free implementations excel in scalability and performance.

[Charts: Hash table throughput, successful operations per ms (thousands) vs number of threads (2 to 48). Left: C++ with TAS, TTAS, Lock-free, Array Lock, PMutex, Lock-free with MM; right: Java with TAS, TTAS, Lock-free, Array Lock, Reentrant, Reentrant Fair, Synchronized]


Conclusion

• Complex synchronization mechanisms (PMutex, Reentrant lock) pay off in heavily contended hot spots

• Scalability via more complex, inherently parallel designs and implementations

• Tradeoff between throughput and fairness
  – LF Hash table
  – Reentrant lock vs Array Lock vs LF Queue

• Fairness can be heavily influenced by HW
  – Interesting exceptions

Which is the fastest / most scalable?
Is fairness influenced by NUMA?