Post on 02-Feb-2016
Efficient Optimistic Parallel Simulations Using Reverse Computation
Chris Carothers
Department of Computer Science
Rensselaer Polytechnic Institute

Kalyan Perumalla
and
Richard M. Fujimoto
College of Computing
Georgia Institute of Technology
Goal: speed up discrete-event simulation programs using multiple processors
Enabling technology for:
• making intractable simulation models tractable
• off-line decision aids
• on-line aids for time-critical situation analysis

DPAT: a distributed simulation success story
• simulation model of the National Airspace, developed @ MITRE using Georgia Tech Time Warp (GTW)
• simulates 50,000 flights in < 1 minute, which used to take 1.5 hours
• web-based user interface, to be used in the FAA Command Center for on-line “what if” planning
Parallel/distributed simulation has the potential to improve how “what if” planning strategies are evaluated
Why Parallel/Distributed Simulation?
How to Synchronize Distributed Simulations?

Parallel time-stepped simulation: lock-step execution
[Figure: PE 1, PE 2, and PE 3 advance together in virtual time, synchronizing at a barrier each step]

Parallel discrete-event simulation: must allow for sparse, irregular event computations
[Figure: PE 1, PE 2, and PE 3 process events at irregular points in virtual time]

Problem: events arriving in the past (“straggler” events behind already-processed events)
Solution: Time Warp
Time Warp...

Local control mechanism: error detection and rollback
[Figure: LP 1, LP 2, and LP 3 in virtual time; on receiving a “straggler” event behind a processed event, an LP must (1) undo state ∆’s and (2) cancel “sent” events]

Global control mechanism: compute Global Virtual Time (GVT)
[Figure: LP 1, LP 2, and LP 3 in virtual time; versions of state/events that are < GVT are collected and their I/O operations performed; legend: processed, “straggler”, unprocessed, and “committed” events]
Challenge: Efficient Implementation?

Advantages:
• automatically finds available parallelism
• makes development easier
• outperforms conservative schemes by a factor of N

Disadvantages:
• large memory requirements to support the rollback operation
• state saving incurs high overheads for fine-grain event computations
• Time Warp is out of the “performance” envelope for many applications

[Figure: Time Warp processors connected by shared memory or a high-speed network]
Our Solution: Reverse Computation
Outline...
• Reverse Computation
  • Example: ATM Multiplexor
  • Beneficial Application Properties
  • Rules for Automation
  • Reversible Random Number Generator
• Experimental Results
• Conclusions
• Future Work
Our Solution: Reverse Computation...

Use Reverse Computation (RC):
• automatically generate reverse code from the model source
• undo by executing the reverse code

Delivers better performance:
• negligible overhead for forward computation
• significantly lower memory utilization
[Figure: N-input multiplexor with a buffer of size B]

Original (on cell arrival):
    if( qlen < B )
        qlen++
        delays[qlen]++
    else
        lost++

Forward (instrumented):
    if( qlen < B )
        b1 = 1
        qlen++
        delays[qlen]++
    else
        b1 = 0
        lost++

Reverse:
    if( b1 == 1 )
        delays[qlen]--
        qlen--
    else
        lost--
Example: ATM Multiplexor

Gains...
• State size reduction from B+2 words to 1 word; e.g., B=100 => 100x reduction!
• Negligible overhead in forward computation: work is removed from the forward computation and moved to the rollback phase
• Result: significant increase in speed, significant decrease in memory

How?...
Beneficial Application Properties

1. Majority of operations are constructive, e.g., ++, --, etc.
2. Size of control state < size of data state, e.g., size of b1 < size of qlen, sent, lost, etc.
3. Perfectly reversible high-level operations gleaned from irreversible smaller operations, e.g., random number generation
Generation rules, and upper bounds on bit requirements, for various statement types:

T0 simple choice
    original:   if() s1; else s2;
    translated: if() {s1; b=1;} else {s2; b=0;}
    reverse:    if(b==1) {inv(s1);} else {inv(s2);}
    bits: self 1; children x1, x2; total 1 + max(x1, x2)

T1 compound choice (n-way)
    original:   if() s1; elseif() s2; elseif() s3; ... else sn;
    translated: if() {s1; b=1;} elseif() {s2; b=2;} ... else {sn; b=n;}
    reverse:    if(b==1) {inv(s1);} elseif(b==2) {inv(s2);} ... else {inv(sn);}
    bits: self lg(n); children x1, ..., xn; total lg(n) + max(x1, ..., xn)

T2 fixed iterations (n)
    original:   for(n) s;
    translated: for(n) s;
    reverse:    for(n) inv(s);
    bits: self 0; child x; total n*x

T3 variable iterations (maximum n)
    original:   while() s;
    translated: b=0; while() {s; b++;}
    reverse:    for(b) inv(s);
    bits: self lg(n); child x; total lg(n) + n*x

T4 function call
    original:   foo();
    translated: foo();
    reverse:    inv(foo)();
    bits: self 0; child x; total x

T5 constructive assignment
    original:   v @= w;
    translated: v @= w;
    reverse:    v =@ w;
    bits: self 0; child 0; total 0

T6 k-byte destructive assignment
    original:   v = w;
    translated: {b = v; v = w;}
    reverse:    v = b;
    bits: self 8k; child 0; total 8k

T7 sequence
    original:   s1; s2; ...; sn;
    translated: s1; s2; ...; sn;
    reverse:    inv(sn); ...; inv(s2); inv(s1);
    bits: self 0; children x1, ..., xn; total x1 + ... + xn

T8 nesting of T0–T7: recursively apply the above rules.
Rules for Automation...

Destructive assignment (DA):
• examples: x = y; x %= y;
• requires all modified bytes to be saved
• caveat: the reversing technique for DAs can degenerate to traditional incremental state saving

Good news: certain collections of DAs are perfectly reversible! Queueing network models contain collections of easily/perfectly reversible DAs:
• queue handling (swap, shift, tree insert/delete, ...)
• statistics collection (increment, decrement, ...)
• random number generation (reversible RNGs)
Destructive Assignment...
Reversing an RNG?

double RNGGenVal(Generator g)
{
    long k, s;
    double u = 0.0;

    s = Cg[0][g];
    k = s / 46693;
    s = 45991 * (s - k * 46693) - k * 25884;
    if (s < 0) s = s + 2147483647;
    Cg[0][g] = s;
    u = u + 4.65661287524579692e-10 * s;

    s = Cg[1][g];
    k = s / 10339;
    s = 207707 * (s - k * 10339) - k * 870;
    if (s < 0) s = s + 2147483543;
    Cg[1][g] = s;
    u = u - 4.65661310075985993e-10 * s;
    if (u < 0) u = u + 1.0;

    s = Cg[2][g];
    k = s / 15499;
    s = 138556 * (s - k * 15499) - k * 3979;
    if (s < 0) s = s + 2147483423;
    Cg[2][g] = s;
    u = u + 4.65661336096842131e-10 * s;
    if (u >= 1.0) u = u - 1.0;

    s = Cg[3][g];
    k = s / 43218;
    s = 49689 * (s - k * 43218) - k * 24121;
    if (s < 0) s = s + 2147483323;
    Cg[3][g] = s;
    u = u - 4.65661357780891134e-10 * s;
    if (u < 0) u = u + 1.0;

    return u;
}
Observation: k = s / 46693 is a destructive assignment.
Result: RC degrades to classic state saving... can we do better?
RNGs: A Higher-Level View

The previous RNG is based on the following recurrence:

    x_{i,n} = a_i * x_{i,n-1} mod m_i

where x_{i,n} is one of the four seed values in the nth set, m_i is one of the four largest primes less than 2^31, and a_i is a primitive root of m_i.

Now, the above recurrence is in fact reversible: since m_i is prime, the inverse of a_i modulo m_i is defined as

    b_i = a_i^(m_i - 2) mod m_i

Using b_i, we can generate the reverse recurrence as follows:

    x_{i,n-1} = b_i * x_{i,n} mod m_i
Reverse Code Efficiency...

Future RNGs may result in even greater savings. Consider the MT19937 generator:
• has a period of 2^19937 - 1
• uses 2496 bytes for a single “generator”

Property: non-reversibility of individual steps does NOT imply that the computation as a whole is not reversible. Can we automatically find this “higher-level” reversibility?

Other reversible structures include:
• circular shift operation
• insertion & deletion operations on trees (i.e., priority queues)

Reverse computation is well-suited for queueing network models!
Performance Study

Platform:
• SGI Origin 2000, 16 processors (R10000), 4 GB RAM

Model:
• 3 levels of multiplexers, fan-in N
• N^3 sources => N^3 + N^2 + N + 1 entities in total
• e.g., N=4 => 85 entities; N=64 => 266,305 entities

Reverse Computation executes significantly faster than State Saving!
[Figure: event rate of reverse computation divided by event rate of state saving (rates in millions of events/second), versus number of processors (1–12), for fan-in 4, 12, 32, and 48]
Why the large increase in parallel performance?
Cache Performance...
Faults (12 PEs):   TLB           Primary cache    Secondary cache
SS:                43,966,018    1,283,032,615    162,449,694
RC:                11,595,326    590,555,715      94,771,426
Related Work...

Reverse computation has been used in low-power processors, debugging, garbage collection, database recovery, reliability, etc.

All previous work either prohibits irreversible constructs or uses a copy-on-write implementation for every modification (corresponding to incremental state saving). Many operate at a coarse, virtual-page level.
Contributions

We identify that RC makes Time Warp usable for fine-grain models!
• disproved the previous belief that “fine-grain models can’t be optimistically simulated efficiently”
• less memory consumption, more speed, without extra user effort

RC generalizes state saving, e.g., incremental state saving, copy state saving.

For certain data types, RC is more memory efficient than SS, e.g., priority queues.
Future Work

Develop state minimization algorithms, by:
• state compression: bit size for reversibility < bit size of data variables
• state reuse: same state bits for different statements, based on liveness, analogous to register allocation

Complete the RC automation algorithm design, avoiding the straightforward incremental state saving approach, for:
• lossy integer and floating-point arithmetic
• jump statements
• recursive functions
Geronimo! System Architecture

[Figure: a distributed compute server of multiprocessor rack-mounted CPUs (not in demonstration), connected by Myrinet, running a high-performance simulation application on Geronimo]

Geronimo features: (1) “risky” or “speculative” processing of object computations, (2) reverse computation to support the “undo” operation, (3) “Active Code” in a combination heterogeneous, shared-memory, message-passing environment...
Geronimo!: “Risky” Processing...

Error detection and rollback:
[Figure: Objects 1–3 in virtual time; on receiving a “straggler” thread behind a processed thread, an object must (1) undo state ∆’s and (2) cancel “scheduled” tasks; legend: processed, “straggler”, and unprocessed threads]

Execution framework:
• Objects
• schedule Threads / Tasks
• at some “virtual time”

Applications:
• discrete-event simulations
• scientific computing applications

CAVEAT: Good performance relies on (cost of recovery × probability of failure) being less than the cost of being “safe”!
Geronimo!: Efficient “Undo”

Traditional approach: state saving
• save byte-copies of modified items
• high overhead for fine-granularity computations
• memory utilization is large
• need an alternative for large-scale, fine-grain simulations

Our approach: reverse computation (joint with Kalyan Perumalla and Richard Fujimoto)
• automatically generate reverse code from the model source
• utilize the reverse code to do rollback
• negligible overhead for forward computation
• significantly lower memory utilization

Observation: “reverse” computation treats “code” as “state”. This results in a code-state duality. Can we generalize the notion?...
Geronimo!: Active Code

Key idea: allow object methods/code to be dynamically changed during run-time.
• objects can schedule, in the future, a new method or re-define old methods of other objects and themselves
• objects can erase/delete methods on themselves or other objects
• new methods can contain “Active Code” which can re-specialize itself or other objects
• works in a heterogeneous environment

How is this useful?
• increases performance by allowing the program to consistently “execute the common case fast”
• adaptive, perturbation-free monitoring of distributed systems
• potential for increasing a language’s “expressive power”

Our approach?
• Java... no, need higher performance... maybe used in the future
• special compiler... no, can’t keep up with changes to microprocessors
Geronimo!: Active Code Implementation

Runtime infrastructure:
• modifies the source code tree
• starts a rebuild of the executable on another existing machine
• uses the system’s native compiler

Re-exec system call:
• reloads only the new text (code) segment of the new executable
• fixes up the old stack to reflect the new code changes
• fixes up pointers to functions
• will run in “user space” for portability across platforms

Language preprocessor:
• instruments code to support stack and function-pointer fix-up
• instruments code to support stack reconstruction and the re-start process
Research Issues

• Software architecture for the heterogeneous, shared-memory, message-passing environment.
• Development of distributed algorithms that are fully optimized for this “combination” environment.
• What language to use for development: C, C++, or both?
• Geronimo! API.
• Active Code language and systems support.
• Mapping relevant application types to this framework.

Homework problem: Can you find specific applications/problems where we can apply Geronimo!?