Post on 02-Feb-2016
Efficient Optimistic Parallel Simulations Using Reverse Computation
Chris Carothers
Department of Computer Science
Rensselaer Polytechnic Institute

Kalyan Perumalla
and
Richard M. Fujimoto
College of Computing
Georgia Institute of Technology
Goal: speed up discrete-event simulation programs using multiple processors
Enabling technology for:
• making intractable simulation models tractable
• off-line decision aids
• on-line aids for time-critical situation analysis

DPAT: a distributed simulation success story
• simulation model of the National Airspace, developed @ MITRE using Georgia Tech Time Warp (GTW)
• simulates 50,000 flights in < 1 minute, which used to take 1.5 hours
• web-based user interface, to be used in the FAA Command Center for on-line “what if” planning
Parallel/distributed simulation has the potential to improve how “what if” planning strategies are evaluated
Why Parallel/Distributed Simulation?
How to Synchronize Distributed Simulations?

Parallel time-stepped simulation: lock-step execution
[Figure: PE 1, PE 2, and PE 3 advance together in virtual time, synchronizing at a barrier each step]

Parallel discrete-event simulation: must allow for sparse, irregular event computations
[Figure: PE 1, PE 2, and PE 3 process events at irregular points in virtual time]

Problem: events arriving in the past (“straggler” events behind already-processed events)
Solution: Time Warp
Time Warp...

Local control mechanism: error detection and rollback
[Figure: LP 1, LP 2, and LP 3 in virtual time; on receiving a “straggler” event behind a processed event, an LP must (1) undo state ∆’s and (2) cancel “sent” events]

Global control mechanism: compute Global Virtual Time (GVT)
[Figure: LP 1, LP 2, and LP 3 in virtual time; versions of state/events that are < GVT are collected and their I/O operations performed; legend: processed, “straggler”, unprocessed, and “committed” events]
Challenge: Efficient Implementation?

Advantages:
• automatically finds available parallelism
• makes development easier
• outperforms conservative schemes by a factor of N

Disadvantages:
• large memory requirements to support the rollback operation
• state saving incurs high overheads for fine-grain event computations
• Time Warp is out of the “performance” envelope for many applications

[Figure: Time Warp processors connected by shared memory or a high-speed network]
Our Solution: Reverse Computation
Outline...
• Reverse Computation
  • Example: ATM Multiplexor
  • Beneficial Application Properties
  • Rules for Automation
  • Reversible Random Number Generator
• Experimental Results
• Conclusions
• Future Work
Our Solution: Reverse Computation...

Use Reverse Computation (RC):
• automatically generate reverse code from the model source
• undo by executing the reverse code

Delivers better performance:
• negligible overhead for forward computation
• significantly lower memory utilization
[Figure: N-input multiplexor with a buffer of size B]

Original (on cell arrival):
    if( qlen < B )
        qlen++
        delays[qlen]++
    else
        lost++

Forward (instrumented):
    if( qlen < B )
        b1 = 1
        qlen++
        delays[qlen]++
    else
        b1 = 0
        lost++

Reverse:
    if( b1 == 1 )
        delays[qlen]--
        qlen--
    else
        lost--
Example: ATM Multiplexor

Gains...
• State size reduction from B+2 words to 1 word; e.g., B=100 => 100x reduction!
• Negligible overhead in forward computation: work is removed from the forward computation and moved to the rollback phase
• Result: significant increase in speed, significant decrease in memory

How?...
Beneficial Application Properties

1. Majority of operations are constructive, e.g., ++, --, etc.
2. Size of control state < size of data state, e.g., size of b1 < size of qlen, sent, lost, etc.
3. Perfectly reversible high-level operations gleaned from irreversible smaller operations, e.g., random number generation
Generation rules, and upper bounds on bit requirements, for various statement types:

T0 simple choice
    original:   if() s1; else s2;
    translated: if() {s1; b=1;} else {s2; b=0;}
    reverse:    if(b==1) {inv(s1);} else {inv(s2);}
    bits: self 1; children x1, x2; total 1 + max(x1, x2)

T1 compound choice (n-way)
    original:   if() s1; elseif() s2; elseif() s3; ... else sn;
    translated: if() {s1; b=1;} elseif() {s2; b=2;} ... else {sn; b=n;}
    reverse:    if(b==1) {inv(s1);} elseif(b==2) {inv(s2);} ... else {inv(sn);}
    bits: self lg(n); children x1, ..., xn; total lg(n) + max(x1, ..., xn)

T2 fixed iterations (n)
    original:   for(n) s;
    translated: for(n) s;
    reverse:    for(n) inv(s);
    bits: self 0; child x; total n*x

T3 variable iterations (maximum n)
    original:   while() s;
    translated: b=0; while() {s; b++;}
    reverse:    for(b) inv(s);
    bits: self lg(n); child x; total lg(n) + n*x

T4 function call
    original:   foo();
    translated: foo();
    reverse:    inv(foo)();
    bits: self 0; child x; total x

T5 constructive assignment
    original:   v @= w;
    translated: v @= w;
    reverse:    v =@ w;
    bits: self 0; child 0; total 0

T6 k-byte destructive assignment
    original:   v = w;
    translated: {b = v; v = w;}
    reverse:    v = b;
    bits: self 8k; child 0; total 8k

T7 sequence
    original:   s1; s2; ...; sn;
    translated: s1; s2; ...; sn;
    reverse:    inv(sn); ...; inv(s2); inv(s1);
    bits: self 0; children x1, ..., xn; total x1 + ... + xn

T8 nesting of T0–T7: recursively apply the above rules.
Rules for Automation...

Destructive assignment (DA):
• examples: x = y; x %= y;
• requires all modified bytes to be saved
• caveat: the reversing technique for DAs can degenerate to traditional incremental state saving

Good news: certain collections of DAs are perfectly reversible! Queueing network models contain collections of easily/perfectly reversible DAs:
• queue handling (swap, shift, tree insert/delete, ...)
• statistics collection (increment, decrement, ...)
• random number generation (reversible RNGs)
Destructive Assignment...
Reversing an RNG?

double RNGGenVal(Generator g)
{
    long k, s;
    double u = 0.0;

    s = Cg[0][g];
    k = s / 46693;
    s = 45991 * (s - k * 46693) - k * 25884;
    if (s < 0) s = s + 2147483647;
    Cg[0][g] = s;
    u = u + 4.65661287524579692e-10 * s;

    s = Cg[1][g];
    k = s / 10339;
    s = 207707 * (s - k * 10339) - k * 870;
    if (s < 0) s = s + 2147483543;
    Cg[1][g] = s;
    u = u - 4.65661310075985993e-10 * s;
    if (u < 0) u = u + 1.0;

    s = Cg[2][g];
    k = s / 15499;
    s = 138556 * (s - k * 15499) - k * 3979;
    if (s < 0) s = s + 2147483423;
    Cg[2][g] = s;
    u = u + 4.65661336096842131e-10 * s;
    if (u >= 1.0) u = u - 1.0;

    s = Cg[3][g];
    k = s / 43218;
    s = 49689 * (s - k * 43218) - k * 24121;
    if (s < 0) s = s + 2147483323;
    Cg[3][g] = s;
    u = u - 4.65661357780891134e-10 * s;
    if (u < 0) u = u + 1.0;

    return u;
}
Observation: k = s / 46693 is a destructive assignment.
Result: RC degrades to classic state saving... can we do better?
RNGs: A Higher-Level View

The previous RNG is based on the following recurrence:

    x_{i,n} = a_i * x_{i,n-1} mod m_i

where x_{i,n} is one of the four seed values in the nth set, m_i is one of the four largest primes less than 2^31, and a_i is a primitive root of m_i.

Now, the above recurrence is in fact reversible: since m_i is prime, the inverse of a_i modulo m_i is defined as

    b_i = a_i^(m_i - 2) mod m_i

Using b_i, we can generate the reverse recurrence as follows:

    x_{i,n-1} = b_i * x_{i,n} mod m_i
Reverse Code Efficiency...

Future RNGs may result in even greater savings. Consider the MT19937 generator:
• has a period of 2^19937 - 1
• uses 2496 bytes for a single “generator”

Property: non-reversibility of individual steps does NOT imply that the computation as a whole is not reversible. Can we automatically find this “higher-level” reversibility?

Other reversible structures include:
• circular shift operation
• insertion & deletion operations on trees (i.e., priority queues)

Reverse computation is well-suited for queueing network models!
Performance Study

Platform:
• SGI Origin 2000, 16 processors (R10000), 4 GB RAM

Model:
• 3 levels of multiplexers, fan-in N
• N^3 sources => N^3 + N^2 + N + 1 entities in total
• e.g., N=4 => 85 entities; N=64 => 266,305 entities

Reverse Computation executes significantly faster than State Saving!
[Figure: event rate of reverse computation divided by event rate of state saving (rates in millions of events/second), versus number of processors (1–12), for fan-in 4, 12, 32, and 48]
Why the large increase in parallel performance?
Cache Performance...
Faults (12 PEs):   TLB           Primary cache    Secondary cache
SS:                43,966,018    1,283,032,615    162,449,694
RC:                11,595,326    590,555,715      94,771,426
Related Work...

Reverse computation has been used in low-power processors, debugging, garbage collection, database recovery, reliability, etc.

All previous work either prohibits irreversible constructs or uses a copy-on-write implementation for every modification (corresponding to incremental state saving). Many operate at a coarse, virtual-page level.
Contributions

We identify that RC makes Time Warp usable for fine-grain models!
• disproved the previous belief that “fine-grain models can’t be optimistically simulated efficiently”
• less memory consumption, more speed, without extra user effort

RC generalizes state saving, e.g., incremental state saving, copy state saving.

For certain data types, RC is more memory efficient than SS, e.g., priority queues.
Future Work

Develop state minimization algorithms, by:
• state compression: bit size for reversibility < bit size of data variables
• state reuse: same state bits for different statements, based on liveness, analogous to register allocation

Complete the RC automation algorithm design, avoiding the straightforward incremental state saving approach, for:
• lossy integer and floating-point arithmetic
• jump statements
• recursive functions
Geronimo! System Architecture

[Figure: a distributed compute server of multiprocessor rack-mounted CPUs (not in demonstration), connected by Myrinet, running a high-performance simulation application on Geronimo]

Geronimo features: (1) “risky” or “speculative” processing of object computations, (2) reverse computation to support the “undo” operation, (3) “Active Code” in a combination heterogeneous, shared-memory, message-passing environment...
Geronimo!: “Risky” Processing...

Error detection and rollback:
[Figure: Objects 1–3 in virtual time; on receiving a “straggler” thread behind a processed thread, an object must (1) undo state ∆’s and (2) cancel “scheduled” tasks; legend: processed, “straggler”, and unprocessed threads]

Execution framework:
• Objects
• schedule Threads / Tasks
• at some “virtual time”

Applications:
• discrete-event simulations
• scientific computing applications

CAVEAT: Good performance relies on (cost of recovery × probability of failure) being less than the cost of being “safe”!
Geronimo!: Efficient “Undo”

Traditional approach: state saving
• save byte-copies of modified items
• high overhead for fine-granularity computations
• memory utilization is large
• need an alternative for large-scale, fine-grain simulations

Our approach: reverse computation (joint with Kalyan Perumalla and Richard Fujimoto)
• automatically generate reverse code from the model source
• utilize the reverse code to do rollback
• negligible overhead for forward computation
• significantly lower memory utilization

Observation: “reverse” computation treats “code” as “state”. This results in a code-state duality. Can we generalize the notion?...
Geronimo!: Active Code

Key idea: allow object methods/code to be dynamically changed during run-time.
• objects can schedule, in the future, a new method or re-define old methods of other objects and themselves
• objects can erase/delete methods on themselves or other objects
• new methods can contain “Active Code” which can re-specialize itself or other objects
• works in a heterogeneous environment

How is this useful?
• increases performance by allowing the program to consistently “execute the common case fast”
• adaptive, perturbation-free monitoring of distributed systems
• potential for increasing a language’s “expressive power”

Our approach?
• Java... no, need higher performance... maybe used in the future
• special compiler... no, can’t keep up with changes to microprocessors
Geronimo!: Active Code Implementation

Runtime infrastructure:
• modifies the source code tree
• starts a rebuild of the executable on another existing machine
• uses the system’s native compiler

Re-exec system call:
• reloads only the new text (code) segment of the new executable
• fixes up the old stack to reflect the new code changes
• fixes up pointers to functions
• will run in “user space” for portability across platforms

Language preprocessor:
• instruments code to support stack and function-pointer fix-up
• instruments code to support stack reconstruction and the re-start process
Research Issues

• Software architecture for the heterogeneous, shared-memory, message-passing environment.
• Development of distributed algorithms that are fully optimized for this “combination” environment.
• What language to use for development: C, C++, or both?
• Geronimo! API.
• Active Code language and systems support.
• Mapping relevant application types to this framework.

Homework problem: Can you find specific applications/problems where we can apply Geronimo!?