Concurrent Data Structures in Architectures with Limited Shared Memory Support
Ivan Walulya, Yiannis Nikolakopoulos, Marina Papatriantafilou, Philippas Tsigas
Distributed Computing and Systems, Chalmers University of Technology, Gothenburg, Sweden
Yiannis Nikolakopoulos [email protected]
Concurrent Data Structures
• Parallel/concurrent programming:
– Threads/processes share data in a uniform address space (shared memory)
• Inter-process/thread communication and synchronization
– Both a tool and a goal
Concurrent Data Structures: Implementations
• Coarse-grained locking
– Easy but slow
• Fine-grained locking
– Fast and scalable, but error-prone (deadlocks)
• Non-blocking
– Atomic hardware primitives (e.g. TAS, CAS)
– Good progress guarantees (lock-/wait-freedom)
– Scalable
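The TAS primitive mentioned above is enough to build a simple spinlock. A minimal sketch using C11's `atomic_flag`; on the SCC the flag would instead map to a core's hardware TestAndSet register:

```c
/* Sketch of a test-and-set (TAS) spinlock, the kind of lock used
 * later for the 2-lock queue. Uses the C11 atomic_flag type; on
 * the SCC the flag would be a per-core TestAndSet register. */
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_flag locked; } tas_lock;

void tas_lock_acquire(tas_lock *l) {
    /* Spin until test-and-set returns "was clear", i.e. we set it. */
    while (atomic_flag_test_and_set(&l->locked)) { /* busy-wait */ }
}

void tas_lock_release(tas_lock *l) {
    atomic_flag_clear(&l->locked);
}
```

The names `tas_lock_acquire`/`tas_lock_release` are illustrative; the point is that a single atomic test-and-set suffices for mutual exclusion, while CAS/LL-SC (which the SCC lacks) would be needed for the stronger non-blocking designs.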
What’s happening in hardware?
• Multi-cores → many-cores
– “Cache coherency wall” [Kumar et al 2011]
– A shared address space will not scale
– Universal atomic primitives (CAS, LL/SC) become harder to implement
• Shared memory → message passing
[Diagram: tile with an IA core, caches, and shared local memory]
• Networks on chip (NoC)
• Short distances between cores
• Message passing model support
• Shared memory support
Can we have data structures that are fast and scalable, with good progress guarantees?
[Diagram: tile with an IA core, caches, and shared local memory]
• Cache coherency eliminated
• Limited support for synchronization primitives
Outline
• Concurrent Data Structures
• Many-core architectures
• Intel’s SCC
• Concurrent FIFO Queues
• Evaluation
• Conclusion
Single-chip Cloud Computer (SCC)
• Experimental processor by Intel
• 48 independent x86 cores arranged on 24 tiles
• NoC connects all tiles
• TestAndSet register per core
SCC: Architecture Overview
• Memory controllers: to private & shared main memory
• Message Passing Buffer (MPB): 16 KB per tile
Programming Challenges in SCC
• Message passing, but…
– the MPB is small for large data transfers
– data replication is difficult
• No universal atomic primitives (CAS); no wait-free implementations [Herlihy 91]
Outline
• Concurrent Data Structures
• Many-core architectures
• Intel’s SCC
• Concurrent FIFO Queues
• Evaluation
• Conclusion
Concurrent FIFO Queues
• Main idea:
– Data are stored in shared off-chip memory
– Message passing is used for communication/coordination
• Two design methodologies:
– Lock-based synchronization (2-lock Queue)
– Message passing-based synchronization (MP-Queue, MP-Acks)
2-lock Queue
• Array-based, in shared off-chip memory (SHM)
• Head/Tail pointers in MPBs
• One lock for each pointer [Michael & Scott 96]
• TAS-based locks on 2 cores
2-lock Queue: “Traditional” Enqueue Algorithm
• Acquire lock
• Read & update Tail pointer (MPB)
• Add data (SHM)
• Release lock
2-lock Queue: Optimized Enqueue Algorithm
• Acquire lock
• Read & update Tail pointer (MPB)
• Release lock
• Add data to node (SHM)
• Set memory flag to dirty
Why? No cache coherency!
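A sequential sketch of the optimized enqueue, with an illustrative array layout (the `node` struct, slot indices, and the inline lock comments are assumptions; in the real implementation the nodes live in SHM, the Tail pointer lives in an MPB, and the lock is a TAS lock on a dedicated core):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative layout: buf[] stands in for SHM, tail/head for the
 * pointers kept in the MPB. Capacity/full checks are omitted. */
#define QSIZE 64
typedef struct { int value; bool dirty; } node;
typedef struct { node buf[QSIZE]; int head, tail; } queue2l;

/* Optimized enqueue: reserve a slot under the lock, then write the
 * data and set the dirty flag *after* releasing the lock. The flag
 * is needed because, without cache coherency, a dequeuer cannot
 * otherwise tell whether the slot's data has reached SHM yet. */
int enqueue_opt(queue2l *q, int value) {
    /* acquire tail lock (TAS lock) */
    int slot = q->tail;              /* read Tail pointer (MPB) */
    q->tail = (q->tail + 1) % QSIZE; /* update Tail pointer (MPB) */
    /* release tail lock */
    q->buf[slot].value = value;      /* add data to node (SHM) */
    q->buf[slot].dirty = true;       /* flag: data is now valid */
    return slot;
}
```

Compared with the “traditional” version, the data write moves outside the critical section, so the lock is held only for the pointer update.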
2-lock Queue: Dequeue Algorithm
• Acquire lock
• Read & update Head pointer
• Release lock
• Check flag
• Read node data
What about progress?
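The matching dequeue, sketched sequentially with the same illustrative layout as the enqueue sketch (struct names and the empty-queue check are assumptions). It also makes the progress question above concrete: the dequeuer spins on the flag of the slot it claimed.

```c
#include <assert.h>
#include <stdbool.h>

/* Same illustrative layout as the enqueue sketch: buf[] stands in
 * for SHM, head/tail for the pointers kept in the MPB. */
#define QSIZE 64
typedef struct { int value; bool dirty; } node;
typedef struct { node buf[QSIZE]; int head, tail; } queue2l;

/* Dequeue: claim a slot under the lock, then wait on the dirty flag
 * outside the lock until the matching enqueuer's data is visible in
 * SHM. Progress caveat from the slides: if that enqueuer never sets
 * the flag, this dequeue spins forever. */
int dequeue(queue2l *q, int *out) {
    if (q->head == q->tail) return -1;  /* empty */
    /* acquire head lock (TAS lock) */
    int slot = q->head;                 /* read Head pointer */
    q->head = (q->head + 1) % QSIZE;    /* update Head pointer */
    /* release head lock */
    while (!q->buf[slot].dirty) { /* spin: wait for data in SHM */ }
    *out = q->buf[slot].value;          /* read node data */
    q->buf[slot].dirty = false;         /* slot reusable */
    return 0;
}
```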
2-lock Queue: Implementation
[Diagram: Head/Tail pointers in the MPB, data nodes in SHM]
Locks? On which tile(s)?
Message Passing-based Queue
• Data nodes in SHM
• Access coordinated by a server core that keeps the Head/Tail pointers
• Enqueuers/dequeuers request access through dedicated slots in the MPB
• Successfully enqueued data are flagged with a dirty bit
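A minimal sequential sketch of the server-side coordination described above. The request codes, per-client slot arrays, and function names are assumptions standing in for the actual MPB protocol; the server hands out queue indices, and clients then read/write SHM themselves:

```c
#include <assert.h>

/* Illustrative encoding: one request/response slot pair per client,
 * standing in for dedicated MPB slots. */
#define NCLIENTS 4
#define QSIZE 64
enum req { REQ_NONE = 0, REQ_ENQ, REQ_DEQ };

typedef struct {
    int slot_req[NCLIENTS];  /* client -> server requests (MPB) */
    int slot_resp[NCLIENTS]; /* server -> client granted index */
    int head, tail;          /* kept privately by the server core */
} mp_queue_server;

/* One pass of the server loop: scan client slots, grant queue
 * indices for enqueue/dequeue requests, write responses back.
 * A dequeuer would then spin on the dirty bit of its granted slot. */
void server_poll(mp_queue_server *s) {
    for (int c = 0; c < NCLIENTS; c++) {
        if (s->slot_req[c] == REQ_ENQ) {
            s->slot_resp[c] = s->tail;      /* grant tail slot */
            s->tail = (s->tail + 1) % QSIZE;
            s->slot_req[c] = REQ_NONE;
        } else if (s->slot_req[c] == REQ_DEQ) {
            if (s->head == s->tail) {
                s->slot_resp[c] = -1;       /* queue empty */
            } else {
                s->slot_resp[c] = s->head;  /* grant head slot */
                s->head = (s->head + 1) % QSIZE;
            }
            s->slot_req[c] = REQ_NONE;
        }
    }
}
```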
MP-Queue
[Diagram: enqueuer/dequeuer MPB slots (ENQ, DEQ); server holding TAIL and HEAD; enqueuer adds data to SHM; dequeuer SPINs on the dirty flag before reading]
• What if an enqueue fails and the node is never flagged?
• “Pairwise blocking”: only one dequeue blocks
Adding Acknowledgements
• No more flags! Enqueue sends an ACK when done
• Server maintains a private queue of pointers in SHM
• On ACK: server adds the data location to its private queue
• On Dequeue: server returns only ACKed locations
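The ACK scheme above can be sketched on the server side as follows (struct and function names are illustrative, not the paper's code). The key property: a location becomes visible to dequeuers only after its enqueuer ACKs, so an in-flight enqueue cannot block anyone:

```c
#include <assert.h>

/* Illustrative sketch of the server's private queue of ACKed
 * data locations (kept in SHM by the server core). */
#define QSIZE 64
typedef struct {
    int acked[QSIZE];       /* private queue of ACKed locations */
    int ack_head, ack_tail;
} mp_acks_server;

/* An enqueuer finished writing SHM location `loc` and sent an ACK:
 * only now is the location published to dequeuers. */
void on_ack(mp_acks_server *s, int loc) {
    s->acked[s->ack_tail] = loc;
    s->ack_tail = (s->ack_tail + 1) % QSIZE;
}

/* Dequeue request: return an ACKed location, or -1 if none are
 * ready. No spinning on flags, hence no blocking between
 * enqueues and dequeues. */
int on_dequeue(mp_acks_server *s) {
    if (s->ack_head == s->ack_tail) return -1;
    int loc = s->acked[s->ack_head];
    s->ack_head = (s->ack_head + 1) % QSIZE;
    return loc;
}
```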
MP-Acks
[Diagram: enqueuer slot (ENQ) and TAIL, ACK message to server, dequeuer slot (DEQ) and HEAD]
• No blocking between enqueues/dequeues
Outline
• Concurrent Data Structures
• Many-core architectures
• Intel’s SCC
• Concurrent FIFO Queues
• Evaluation
• Conclusion
Evaluation
Benchmark:
• Each core performs Enq/Deq at random
• High/low contention
• Performance? Scalability?
• Is it the same for all cores?
Measures
• Throughput: data structure operations completed per time unit [Cederman et al 2013]
• Fairness: operations completed by core i relative to the average operations per core
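The two measures can be computed from per-core operation counts. The slides give only the ingredients (operations by core i, average operations per core), so the fairness formula below, taking the minimum per-core count over the average, is one plausible instantiation and an assumption, not necessarily the paper's exact definition:

```c
#include <assert.h>

/* Throughput: total operations completed per time unit. */
double throughput(const long ops[], int ncores, double seconds) {
    long total = 0;
    for (int i = 0; i < ncores; i++) total += ops[i];
    return (double)total / seconds;
}

/* Fairness (assumed definition): min_i(ops_i) / average ops per
 * core. A value near 1.0 means all cores progressed evenly; a low
 * value means some core was starved. */
double fairness(const long ops[], int ncores) {
    long total = 0, min = ops[0];
    for (int i = 0; i < ncores; i++) {
        total += ops[i];
        if (ops[i] < min) min = ops[i];
    }
    double avg = (double)total / ncores;
    return (double)min / avg;
}
```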
Conclusion
• Lock-based queue
– High throughput
– Less fair
– Sensitive to lock locations and NoC performance
• MP-based queues
– Lower throughput
– Fairer
– Better liveness properties
– Promising scalability
Experimental Setup
• 533 MHz cores, 800 MHz mesh, 800 MHz DDR3
• Randomized Enq/Deq operations
• High/low contention
• One thread per core
• 600 ms per execution
• Averaged over 12 runs
Concurrent FIFO Queues
• Typical 2-lock queue [Michael & Scott 96]