Page 1

A Preliminary Investigation on Optimizing Charm++ for Homogeneous Multi-core Machines

Chao Mei, Parallel Programming Lab, UIUC, 05/02/2008

The 6th Charm++ Workshop

Page 2

Motivation

Clusters are built from multicore chips:
- 4 cores/node on BG/P
- 8 cores/node on Abe (2 Intel quad-core chips)
- 16 cores/node on Ranger (4 AMD quad-core chips)
- …

Charm++ has had an SMP build version for many years, but it has not been tuned.

So, what are the issues in getting high performance?

Page 3

Start with a kNeighbor benchmark

A synthetic kNeighbor benchmark:
- Each element communicates with its neighbors within stride K (wrap-around), and the neighbors send back an acknowledgment (see the sketch below)
- An iteration: all elements finish the above communication

Environment:
- An SMP node with two quad-core Xeons, using only 7 cores
- Ubuntu 7.04; gcc 4.2
- Charm++ builds compared: net-linux-amd64-smp vs. net-linux-amd64
- 1 element/core, K=3
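The slides do not show the benchmark code; as a minimal sketch (hypothetical names, assuming "K-stride" means element i talks to elements i+1 through i+K on a ring of N elements), the neighbor computation might look like this:

    #include <vector>

    // Neighbors of element i within stride K on a ring of N elements,
    // with wrap-around at the boundary (sketch, not the actual code).
    std::vector<int> kNeighbors(int i, int K, int N) {
        std::vector<int> nbrs;
        for (int s = 1; s <= K; ++s)
            nbrs.push_back((i + s) % N);  // e.g. i=6, K=3, N=7 -> 0, 1, 2
        return nbrs;
    }

Each element sends to these neighbors and waits for K acknowledgments before the iteration counts as finished.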

Page 4

Performance at first glance

[Chart: iteration time (ms, 0–1.6) vs. message size (bytes, 0–16000); series: Non-SMP, SMP]

Page 5

Outline

Examine the communication model in Charm++ for the Non-SMP and SMP layers

Describe the current optimizations for SMP step by step

Discuss a different approach to utilizing multicore machines

Conclude with future work

Page 6

Communication model for multicore

Page 7

Possible overheads in the SMP version

Locks
- Overusing locks to ensure correctness
- Locks in message queues
- …

False sharing
- Some per-thread data structures are allocated together in array form; e.g., each element of "CmiState state[numThds]" belongs to a different thread
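As an illustration (a sketch with hypothetical fields, not the actual CmiState definition), the problem and the usual cache-line-padding remedy look like this:

    // Two threads' array entries can sit on the same 64-byte cache line,
    // so a write by one thread invalidates the line in the other thread's
    // cache even though they never touch the same variable.
    struct CmiStateSketch {
        int   myRank;       // written often by the owning thread
        void *localQueue;
    };
    CmiStateSketch state[8];  // adjacent entries may share a cache line

    // Padding each entry out to a cache-line boundary avoids the sharing:
    struct PaddedState {
        CmiStateSketch s;
        char pad[64 - sizeof(CmiStateSketch) % 64];  // assume 64-byte lines
    };
    PaddedState paddedState[8];  // one entry per thread, no shared lines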

Page 8

Reducing the usage of locks

Examining the source code revealed overuse of locks; the sections enclosed by locks were narrowed.

[Chart: iteration time (ms, 0–1.6) vs. message size (bytes, 0–16000); series: Non-SMP, SMP, SMP-Relaxed lock]

Page 9

Overhead in message queues

A micro-benchmark to show the overhead in message queues:
- N producers, 1 consumer
- lock vs. memory fence + atomic operation (fetch-and-increment)
- 1 queue vs. N queues
(a sketch of the two single-queue variants follows the chart notes below)

[Chart: average iteration time (us, 0–20000) vs. number of producers (1–8); series: multiQ-fence, singleQ-fence+atomic op, singleQ-lock]

1. Each producer produces 10K items per iteration

2. One iteration: the consumer consumes all items
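As a minimal sketch of the two single-queue enqueue variants (using GCC builtins of that era; this is not the Charm++ queue code, and the consumer side plus a per-slot "filled" flag are omitted):

    #include <pthread.h>

    enum { CAP = 1 << 20 };
    static void *slots[CAP];
    static int   tail = 0;
    static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;

    // (a) singleQ-lock: every producer serializes on one mutex.
    void enqueue_lock(void *msg) {
        pthread_mutex_lock(&qlock);
        slots[tail++] = msg;
        pthread_mutex_unlock(&qlock);
    }

    // (b) singleQ-fence+atomic: producers claim distinct slots with an
    // atomic fetch-and-increment, then fence so the consumer sees the write.
    void enqueue_atomic(void *msg) {
        int slot = __sync_fetch_and_add(&tail, 1);  // atomic slot claim
        slots[slot] = msg;
        __sync_synchronize();                       // memory fence
    }

The third variant (multiQ-fence) gives each producer a private queue, so producers never contend with one another, which is why it scales best in the chart above.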

Page 10

Applying multiQ + fence

Less than 2% improvement: there is much less contention here than in the micro-benchmark.

[Chart: iteration time (ms, 0.4–0.8) vs. message size (bytes, 0–16000); series: Non-SMP, SMP-Relaxed lock, SMP-Relaxed lock-multiQ-Fence]

Page 11

Big overhead in message allocation

We noticed that:
- We use our own default memory module, in which every memory allocation is protected by a lock (sketched below)
- It provides some useful functionality in the Charm++ system (the historic reason for not using other memory modules): memory footprint information, a memory debugger, Isomalloc
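To illustrate the cost (a sketch, not the actual Charm++ allocator), every allocation funnels through a single lock:

    #include <pthread.h>
    #include <stdlib.h>

    static pthread_mutex_t memLock = PTHREAD_MUTEX_INITIALIZER;

    // With fine-grained messaging, all worker threads contend on
    // memLock for every message they allocate.
    void *allocMsgSketch(size_t size) {
        pthread_mutex_lock(&memLock);
        void *p = malloc(size);  // plus bookkeeping for footprint info,
                                 // the memory debugger, Isomalloc, ...
        pthread_mutex_unlock(&memLock);
        return p;
    }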

Page 12

Switching to the OS memory module

Thanks to recent updates, we do not lose the aforementioned functionality.

[Chart: iteration time (ms, 0–1.6) vs. message size (bytes, 0–16000); series: Non-SMP, SMP-Relaxed lock-SingleQ-Fence, SMP-Reduced lock overhead]

Page 13

Identifying false sharing overhead

Another micro-benchmark:
- Each element repeatedly sends itself a message, but each time the message is reused (i.e., no new message is allocated)
- Benchmark the timing of 1000 iterations

Use the Intel VTune performance analysis tool, focusing on the cache misses caused by "Invalidate" in the MESI coherence protocol.

Declaring variables with the "__thread" specifier makes them thread-private (see the sketch below).
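For example (a sketch with a hypothetical struct; "__thread" is a gcc extension):

    struct CmiStateSketch { int myRank; void *localQueue; };

    // Array form: element i can false-share a cache line with element i+1.
    //   CmiStateSketch state[numThds];

    // __thread form: each thread gets its own private instance, so no two
    // threads' copies can land on the same cache line.
    __thread CmiStateSketch myState;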

Page 14

Performance for the micro-benchmark

Parameters: 1 element/core, 7 cores

Before: 1.236 us per iteration

After: 0.913 us per iteration

Page 15

Adding the gains from removing false sharing

Around 1% improvement

[Chart: iteration time (ms, 0–1.6) vs. message size (bytes, 0–16000); series: Non-SMP, SMP, SMP-Optimized]

Page 16

Rethinking communication model

POSIX shared-memory layer:
- No threads; every core still runs a process
- Inter-core message passing does not go through the NIC but through memory copy (inter-process communication), as sketched below
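A minimal sketch of the mechanism (hypothetical segment name; on Linux, link with -lrt; all synchronization between the processes is omitted):

    #include <sys/mman.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <string.h>

    int main() {
        // Both processes open and map the same named segment.
        int fd = shm_open("/charm_shm_sketch", O_CREAT | O_RDWR, 0600);
        ftruncate(fd, 4096);               // size the segment once
        char *buf = (char *)mmap(0, 4096, PROT_READ | PROT_WRITE,
                                 MAP_SHARED, fd, 0);
        memcpy(buf, "hello", 6);           // a "send" is just a memory copy
        munmap(buf, 4096);
        shm_unlink("/charm_shm_sketch");
        return 0;
    }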

Page 17

Performance comparison

[Chart: iteration time (ms, 0–1.6) vs. message size (bytes, 0–16000); series: Non-SMP, SMP-Optimized, Posix Shared Memory]

Page 18

Future work

Other platforms: BG/P

Optimize the POSIX shared-memory version

Effects on real applications: for NAMD, initial results show that SMP helps on up to 24 nodes on Abe

Other communication models: an adaptive one?

