
Post on 03-Jan-2016


1

Blue Gene Simulator

Gengbin Zheng, gzheng@uiuc.edu

Gunavardhan Kakulapati, kakulapa@uiuc.edu

Parallel Programming Laboratory, Department of Computer Science

University of Illinois at Urbana-Champaign

http://charm.cs.uiuc.edu

2

Overview

Blue Gene Emulator

Blue Gene Simulator

Timing correction schemes

Performance and results

3

Emulation on a Parallel Machine

[Figure labels: Simulating (Host) Processor; BG/C Nodes; Hardware thread]

4

Blue Gene Emulator: functional view

[Figure: one Blue Gene/C node. Labels: communication threads, worker threads, inBuffer, affinity message queues, non-affinity message queues, CorrectionQ]

5

Blue Gene Emulator: functional view

[Figure: Converse scheduler and Converse Q, with nodes each labeled: communication threads, worker threads, inBuff, affinity message queues, non-affinity message queues, CorrectionQ]

6

What is capable …

Blue Gene API support

Blue Gene Charm++

– Structured Dagger

Trace Projections

7

Emulator to Simulator

Emulator:

– Study programming model and application development

Simulator:

– Performance prediction capability

– Models communication latency based on network model

– Doesn’t model on-chip memory access or network contention

8

Simulator

Parallel performance is hard to model

– Communication subsystem: out-of-order messages, communication/computation overlap

– Event dependencies

Parallel Discrete Event Simulation

– Emulation program executes in parallel with event time stamp correction

– Exploit inherent determinacy of application

9

How to simulate?

Time stamping events

– Per-thread timer (sharing one physical timer)

– Time stamp messages: calculate communication latency based on network model

Parallel event simulation

– When a message is sent out, calculate the predicted arrival time for the destination Blue Gene processor

– When a message is received, update current time: currTime = max(currTime, recvTime)

– Time stamp correction

10

Time Stamping messages and threads

Thread timer: curT

Message sent: RecvT(msg) = curT + Latency

Message scheduled: curT = max(curT, RecvT(msg))
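The two rules above can be sketched in a few lines of Python. This is only an illustration: the class and method names (`EmulatedThread`, `send`, `schedule`) and the latency value are assumptions made for the example, not the emulator API.

```python
from dataclasses import dataclass


@dataclass
class Message:
    payload: str
    recv_time: float = 0.0  # RecvT(msg): predicted arrival time


class EmulatedThread:
    """Hypothetical emulated thread carrying its own virtual timer curT."""

    def __init__(self):
        self.cur_t = 0.0

    def compute(self, duration):
        # Local work advances the thread timer.
        self.cur_t += duration

    def send(self, msg, latency):
        # Message sent: RecvT(msg) = curT + Latency
        msg.recv_time = self.cur_t + latency
        return msg

    def schedule(self, msg):
        # Message scheduled: curT = max(curT, RecvT(msg))
        self.cur_t = max(self.cur_t, msg.recv_time)


sender, receiver = EmulatedThread(), EmulatedThread()
sender.compute(5.0)
m = sender.send(Message("halo"), latency=2.0)  # latency would come from the network model
receiver.schedule(m)
```

Here the receiver was idle at time 0, so scheduling the message moves its timer forward to the predicted arrival time.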

11

Need for timestamp correction

Time stamp correction is needed for out-of-order messages.

Out-of-order delivery can occur:

– A message arrives late while some other message has already advanced the thread time into the future

– So the late message executes in the context of the future, although its predicted time is earlier
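A small sketch of the problem, using made-up message names and times: when messages are executed in the order they physically arrive, a late message inherits a thread time that is already past its own predicted receive time.

```python
def run_in_arrival_order(messages):
    """Execute (name, recv_time, duration) tuples in the given order.

    Returns a log of (name, start, end) using curT = max(curT, RecvT(msg)).
    """
    cur_t = 0.0
    log = []
    for name, recv_time, duration in messages:
        cur_t = max(cur_t, recv_time)
        start = cur_t
        cur_t += duration
        log.append((name, start, cur_t))
    return log


# B is predicted for t=10 but physically arrives first; A (predicted t=4) arrives late.
log = run_in_arrival_order([("B", 10.0, 3.0), ("A", 4.0, 1.0)])
```

In this run A is logged as starting at 13 even though its predicted receive time is 4, which is the error the correction scheme has to repair.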

12

Parallel correction algorithm

Sort message execution by receive time

Adjust time stamps when needed

Use correction messages to announce a change in an event startTime

Send correction messages along the path the original messages were sent

Events already in the timeline may have to move
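A minimal single-timeline sketch of these steps, assuming events are plain dictionaries and that each event starts no earlier than its receive time and the end of the previous event. The function name `correct_linear` is hypothetical, and the actual forwarding of correction messages to other processors is left out.

```python
def correct_linear(events):
    """Re-sort executed events by receive time and recompute start times.

    events: list of dicts with keys name, recv_time, duration, start.
    Returns (corrected timeline, list of (name, old_start, new_start)).
    """
    ordered = sorted((dict(e) for e in events), key=lambda e: e["recv_time"])
    corrections = []
    prev_end = 0.0
    for e in ordered:
        new_start = max(e["recv_time"], prev_end)
        if new_start != e["start"]:
            # A changed startTime must be announced to the receivers of the
            # messages this event sent (not modeled in this sketch).
            corrections.append((e["name"], e["start"], new_start))
            e["start"] = new_start
        prev_end = e["start"] + e["duration"]
    return ordered, corrections


executed = [
    {"name": "B", "recv_time": 10.0, "duration": 3.0, "start": 10.0},
    {"name": "A", "recv_time": 4.0, "duration": 1.0, "start": 13.0},
]
timeline, corrections = correct_linear(executed)
```

Only A moves here: it is pulled back from 13 to 4, while B keeps its start time, so a single correction is generated.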

13

Timestamps Correction

[Figure: ExecutionTimeLine with messages M1 to M8 against RecvTime]

14

Timestamps Correction

[Figure: ExecutionTimeLine with messages M1 to M8 against RecvTime]

15

Timestamps Correction

[Figure: two ExecutionTimeLine panels with messages M1 to M8 against RecvTime; label: Correction Message]

16

Timestamps Correction

[Figure: three ExecutionTimeLine panels with messages M1 to M7 against RecvTime; labels: Correction Message (M4), Correction Message]

17

Linear-order correction

Works only when

– Programs have no alternate orders of execution possible

– Messages are processed in the same order for multiple executions

– E.g., MPI programs with no wildcard receives, structured-dagger code with no “overlap” or “forall”

18

Reasons:

Correction algorithm breaks dependency logic

– Only based on receive time

– Cases: an event depends on several messages (the last message triggers the computation); a message is buffered until some condition holds

Example where this correction scheme is invalid: Jacobi-1D

19

20

Solution

Use structured dagger to retrieve dependence information

As the program runs, form a chain of Blue Gene logs preserving the dependency information.

Blue Gene logs for entry functions and structured dagger functions

21

Timestamp correction scheme

Every event has a list of backward and forward dependents.

An event cannot start till its backward dependents have finished.

Define effRecvTime = max(recvTime, endOfBackDeps)

An event can start only after its effRecvTime.

startTime = max(effRecvTime, timeline.last.endTime)
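The two formulas can be checked with a short worked example. The function names and the numbers are illustrative assumptions; `endOfBackDeps` is taken here to mean the latest end time among the backward dependents.

```python
def eff_recv_time(recv_time, back_dep_end_times):
    # effRecvTime = max(recvTime, endOfBackDeps)
    return max([recv_time, *back_dep_end_times])


def start_time(eff_recv, timeline_last_end):
    # startTime = max(effRecvTime, timeline.last.endTime)
    return max(eff_recv, timeline_last_end)


# Message predicted to arrive at t=5, but its backward dependents end at t=8 and t=6.
eff = eff_recv_time(5.0, [8.0, 6.0])
start_idle = start_time(eff, timeline_last_end=7.0)  # timeline is free before eff
start_busy = start_time(eff, timeline_last_end=9.0)  # timeline is busy until t=9
```

With these numbers the effective receive time is 8; the event starts at 8 when the timeline is already free, and at 9 when the previous event on the timeline ends later.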

22

Timestamp correction scheme

Timeline is not sorted on the recvTime of the event as in the previous case.

Timeline is sorted based on the effRecvTime.

Steps to process a correction message

– Find the earliest updated event due to the message

– Cut the timeline from that event

– Calculate new effRecvTimes from that point on

– Reinsert into the timeline in the order of effRecvTime
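The steps above can be sketched as follows. This is a simplified, single-timeline illustration and not the simulator code: the names `Event`, `rebuild` and `apply_correction` are hypothetical, and it assumes all backward dependents live on the same timeline and form no cycles.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    name: str
    recv_time: float
    duration: float
    back_deps: list = field(default_factory=list)  # names of events that must finish first
    eff_recv: float = 0.0
    start: float = 0.0
    end: float = 0.0


def rebuild(prefix, pending):
    """Append pending events after prefix in effRecvTime order, honoring back deps."""
    timeline = list(prefix)
    done = {e.name: e for e in timeline}
    pending = list(pending)
    while pending:
        # Only events whose backward dependents are already placed can be inserted.
        ready = [e for e in pending if all(d in done for d in e.back_deps)]
        for e in ready:
            e.eff_recv = max([e.recv_time] + [done[d].end for d in e.back_deps])
        nxt = min(ready, key=lambda e: e.eff_recv)
        last_end = timeline[-1].end if timeline else 0.0
        nxt.start = max(nxt.eff_recv, last_end)
        nxt.end = nxt.start + nxt.duration
        timeline.append(nxt)
        done[nxt.name] = nxt
        pending.remove(nxt)
    return timeline


def apply_correction(timeline, name, new_recv_time):
    """Cut the timeline at the earliest affected event and reinsert the tail."""
    idx = next(i for i, e in enumerate(timeline) if e.name == name)
    timeline[idx].recv_time = new_recv_time
    # Earliest affected slot: the event's old position, or the position it may move into.
    move_to = next((i for i, e in enumerate(timeline) if e.eff_recv > new_recv_time),
                   len(timeline))
    cut = min(idx, move_to)
    return rebuild(timeline[:cut], timeline[cut:])


timeline = rebuild([], [
    Event("A", 0.0, 2.0),
    Event("B", 3.0, 2.0),
    Event("D", 4.0, 1.0),
    Event("C", 6.0, 1.0, back_deps=["B"]),
])
timeline = apply_correction(timeline, "B", 10.0)  # B is now predicted to arrive at t=10
order = [e.name for e in timeline]
```

In this example the correction pushes B behind D, and C, which depends on B, is delayed with it; events before the cut point (A) are left untouched.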

23

Non-linear order correction scheme

The new scheme:

– Takes into account the event dependencies

– Works even when messages can be received in different orders in different runs

– Requires all the dependencies to be captured using structured dagger

But the timing correction is very slow.

Several optimizations possible.

24

Optimizations to online correction scheme

Overwrite old corrections:

– An event can get multiple correction messages.

– Reduce the number of corrections

– Same scheme if correction message arrives earlier than the message itself

Use multisend

– Messages destined to same real processor but different events can be sent collectively.
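One way to picture these two optimizations is a buffer keyed by event, where a newer correction replaces an older one, plus a grouping step before sending. The names `CorrectionBuffer`, `group_for_multisend` and `real_pe_of` are assumptions for this sketch, and it assumes corrections for one event arrive in the order they were generated.

```python
from collections import defaultdict


class CorrectionBuffer:
    """Hypothetical buffer that keeps only the newest correction per event."""

    def __init__(self):
        self.pending = {}  # event id -> latest corrected recvTime

    def add(self, event_id, new_recv_time):
        # Overwrite any older correction for the same event. This also works
        # when the correction arrives before the message itself.
        self.pending[event_id] = new_recv_time

    def drain(self):
        out, self.pending = self.pending, {}
        return out


def group_for_multisend(corrections, real_pe_of):
    """Group corrections by the real processor hosting the target event."""
    batches = defaultdict(list)
    for event_id, recv_time in corrections.items():
        batches[real_pe_of(event_id)].append((event_id, recv_time))
    return dict(batches)


buf = CorrectionBuffer()
buf.add("e1", 5.0)
buf.add("e2", 7.0)
buf.add("e1", 4.0)  # replaces the earlier correction for e1
batches = group_for_multisend(buf.drain(), real_pe_of=lambda event_id: 0)
```

Three corrections arrive, but only two are processed, and both travel to real processor 0 in a single batch.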

25

More optimizations

Prioritize messages based on their predicted recvTime.

Lazy processing

– Process correction messages periodically.

– Allows corrections to be overwritten.

Batch processing

– Process many correction messages at a time

– Many events will be affected

– Choose the earliest and reinsert in the order of effRecvTime.

Ability to start corrections in the middle

– Can ignore the startup events for timing correction

26

Timing correction still very slow.

Observations:

– Don’t let the execution go far ahead of the correction wave.

– A large difference means many wrong events to be corrected.

– Closely following the execution wave also may not help.

A new scheme

– Similar to the one used for GVT (global virtual time)

27

GVT-like scheme

Use heartbeat

– Periodically broadcast asking for GVT

GVT

– Is the time after which the events are invalid due to pending corrections

– Compute the GVT as the minimum of the predicted recvTimes of all correction messages and the startTimes of all affected events

Use a parameter “leash”

– Execution of the program cannot go beyond “gvt + leash”
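A rough sketch of the GVT computation and the leash check, assuming each real processor answers the heartbeat with its pending correction recvTimes and the startTimes of its affected events. The function names and the report format are hypothetical.

```python
def compute_gvt(reports):
    """reports: one (correction_recv_times, affected_start_times) pair per processor.

    Returns the minimum over all reported times, or infinity if nothing is pending.
    """
    candidates = []
    for recv_times, start_times in reports:
        candidates.extend(recv_times)
        candidates.extend(start_times)
    return min(candidates) if candidates else float("inf")


def may_execute(event_start_time, gvt, leash):
    # Execution cannot go beyond gvt + leash.
    return event_start_time <= gvt + leash


reports = [([12.0, 9.0], [15.0]), ([], [11.0]), ([], [])]
gvt = compute_gvt(reports)
ok = may_execute(13.0, gvt, leash=5.0)
blocked = may_execute(14.5, gvt, leash=5.0)
```

With a leash of 5 and a GVT of 9, an event at time 13 may run while one at 14.5 has to wait for the next heartbeat; a larger leash lets execution run further ahead at the cost of more events to correct.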

28

Projections before correction

29

Projections after correction

30

Correctness of the scheme (using Jacobi1D)

31

Predicted time vs latency factor

32

Predicted speedup

33

More work

Ongoing work

– Make sure the GVT scheme is correct

Future work

– The presented scheme is on-line correction

– Explore the off-line (post-mortem) correction scheme using generated traces