Performance analysis of a Pose application -- BigNetSim
Nilesh Choudhury
BigNetSim
● A parallel simulator for performance prediction of parallel machines
● Two components:
  – Processor performance modelling
  – Interconnection Network modelling
● The two components could be used individually or in synergy.
● Two modes of operation:
  – Direct Simulation (on-line mode)
  – Trace-driven simulation (off-line mode)
Architecture of BigNetSim
Which part is most relevant to POSE performance?
● The Interconnection Network simulation
Network Simulator Modelling
● Very detailed model:
  – Switch modelled as:
    ● Collection of Input/Output ports
    ● Arbitration strategies to serve incoming requests fairly
    ● Detailed Virtual Channel selection strategies
      – Input VC and Output VC
    ● Switch delays and arbitration costs modelled here
    ● Switch load and contention measures computed and updated to assist adaptive routing strategies and fault-tolerant routing
    ● Virtual cut-through routing; store-and-forward routing
    ● Number of posers per switch = # ports!
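To make the difference between the two switching modes listed above concrete, here is a minimal Python sketch of the usual first-order, contention-free latency model. It is a textbook approximation, not BigNetSim's switch model; the function names and the parameters `hops`, `bandwidth` and `router_delay` are assumptions chosen for illustration.

```python
def store_and_forward_latency(hops, packet_bytes, bandwidth, router_delay):
    """Each hop receives the whole packet before forwarding it."""
    return hops * (router_delay + packet_bytes / bandwidth)


def cut_through_latency(hops, packet_bytes, bandwidth, router_delay):
    """The header is forwarded as soon as it is routed, so the packet
    body is serialized only once (assuming no contention)."""
    return hops * router_delay + packet_bytes / bandwidth


if __name__ == "__main__":
    # Illustrative numbers only: 4 hops, 256-byte packet,
    # 256 bytes per microsecond, 1 microsecond per router.
    print(store_and_forward_latency(4, 256, 256.0, 1.0))  # 8.0
    print(cut_through_latency(4, 256, 256.0, 1.0))        # 5.0
```

Under this simplified model the gap between the two modes grows with the hop count, which is why the choice matters for large networks.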
Network Simulator Modelling
– Network Interface Card (NIC)
  ● 'Send NIC' packetizes and sends a message at the CPU's request
  ● 'Recv NIC' unpacks, reassembles and delivers a message to the CPU on receiving incoming packets
  ● Network card send and receive latencies modelled here
  ● Number of posers per NIC = 2
– Channel
  ● Doesn't need to be very sophisticated
  ● Models a simple channel delay: receives a packet from a switch/NIC and delivers it to the corresponding switch/NIC it is connected to
  ● Number of posers per channel = 1
Topologies and Routing Algorithms
● Topology and Routing strategies provide functions which the network uses
● Extremely modular design
● Write your own routing strategies
● Write your own topology
● We have some available:
  – KaryNcube; KaryNmesh; KaryNtree; Nmesh; fattree; hypercube and some hybrid variations
Routing Algorithms
● Minimal deadlock-free; Non-minimal and Fault-tolerant variations
  – K-ary-N-mesh / N-mesh
    ● Direction Ordered
    ● Planar Routing
    ● Static Direction Reversal Routing
    ● Optimally Fully Adaptive Routing (modified too)
  – K-ary-N-tree
    ● UpDown (modified, non-minimal)
  – HyperCube
    ● Hamming
    ● P-Cube (modified too)
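As an illustration of minimal routing on a hypercube, the following Python sketch corrects the differing address bits one dimension at a time, so the hop count equals the Hamming distance between source and destination. This is the generic dimension-ordered (e-cube) scheme; whether it matches the "Hamming" strategy named above is an assumption, and the function names are hypothetical.

```python
def hamming_distance(src, dst):
    """Number of address bits in which two hypercube nodes differ."""
    return bin(src ^ dst).count("1")


def ecube_route(src, dst):
    """Return the list of nodes visited when differing bits are
    corrected from the lowest dimension to the highest."""
    path = [src]
    node = src
    diff = src ^ dst
    dim = 0
    while diff:
        if diff & 1:
            node ^= 1 << dim  # hop along dimension `dim`
            path.append(node)
        diff >>= 1
        dim += 1
    return path


if __name__ == "__main__":
    print(ecube_route(0b000, 0b101))       # [0, 1, 5]
    print(hamming_distance(0b000, 0b101))  # 2
```

Fixing the order of dimensions is what makes this scheme deadlock-free in the usual analysis; adaptive variants relax that order at the cost of extra virtual channels.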
Input/Output VC selection
● Input Virtual Channel Selection
  – RoundRobin
  – Shortest Length Queue
  – Output Buffer length
● Output Virtual Channel Selection
  – Max. available buffer length
  – Max. available buffer bubble VC
  – Output Buffer length
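Two of the selection policies named above can be sketched in a few lines of Python. This is a generic interpretation, assuming that round-robin skips empty input queues and that "max. available buffer length" means picking the output VC with the most free slots; the function names and list-based representation are hypothetical and not taken from BigNetSim.

```python
def round_robin_select(queue_lengths, last_served):
    """Pick the next non-empty input VC after the one served last."""
    n = len(queue_lengths)
    for offset in range(1, n + 1):
        vc = (last_served + offset) % n
        if queue_lengths[vc] > 0:
            return vc
    return None  # every input VC is empty


def max_available_buffer_select(free_slots):
    """Pick the output VC with the most free buffer slots."""
    if not free_slots:
        return None
    best = max(range(len(free_slots)), key=lambda vc: free_slots[vc])
    return best if free_slots[best] > 0 else None


if __name__ == "__main__":
    print(round_robin_select([0, 3, 0, 2], last_served=1))  # 3
    print(max_available_buffer_select([1, 4, 2]))           # 1
```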
Building up a machine
● Involves selecting the processor capabilities
● Selecting the Interconnection network
  – Available set of topologies, routing algorithms, virtual channel selection strategies
  – Easy to build an interconnection network closely modelling the target machine
  – All these modules are easily extendable to create and plug in new topology, routing algorithm, etc.
● Some preconfigured machines include:
  – Bluegene; RedStorm; lemieux; etc.
  – Generalized hypercube, fat-tree, tori and mesh architectures
Hardware support for Collectives
● You could also model a network with hardware collectives for multicast, reduction and broadcast
● Collective Manager is interfaced with the basic network units
● You need to define the collective manager operations for the corresponding topology
● Already available for:
  – Hypercube; fattree; densegraph and hybrid variations
Network configuration Parameters
● Apart from Routing algorithm; Topology; virtual channel selection; switch size (number of ports); number of virtual channels associated with a port:
  – Size of network
  – Channel bandwidth; Switch bandwidth
  – NIC send/recv packet latencies
  – NIC packetization costs
  – Switch buffer size
  – Size of a single packet
  – Delays in various components
  – DMA delay; Processor send overhead; etc.
How does this modelling translate in the POSE framework?
● If we model the following machine:
  – 'n' nodes
  – 's' switches
  – 'p' ports per switch
● There are:
  – 4*n + 2p*s posers
  – A proc, co-proc, send-NIC, recv-NIC per node
  – 'p' ports and 'p' channels per switch
An example
● Suppose we model a 2048 node bluegene network connected as a 3D torus:
  – n=2048
  – s=2048
  – p=6
  – Total number of posers = 4*2048 + 2*6*2048 = 16*2048 = 32,768 posers
  – Ample virtualization to run this simulation on 100 processors
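The poser count above can be checked with a short Python sketch. The function name `count_posers` is hypothetical; the formula is the 4*n + 2p*s expression from the slides, and the posers-per-processor figure is simple division, not a measured value.

```python
def count_posers(nodes, switches, ports_per_switch):
    """Posers needed to model a machine: 4 per node plus 2p per switch."""
    per_node = 4                        # proc, co-proc, send-NIC, recv-NIC
    per_switch = 2 * ports_per_switch   # p port posers + p channel posers
    return per_node * nodes + per_switch * switches


if __name__ == "__main__":
    total = count_posers(nodes=2048, switches=2048, ports_per_switch=6)
    print(total)        # 32768
    print(total / 100)  # 327.68 posers per processor on 100 processors
```

Several hundred posers per processor is the "ample virtualization" the example refers to.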
Factors related to Performance
● Number of GVT synchronizations:
  – Gives an insight into the amount of parallelism within the threshold controlled by the simulation
  – Large number of syncs: possibly little work within allowable limits
● Phase time: real time elapsed between consecutive GVT synchronizations
  – Indicates the amount of parallelism
● Rollback fraction
  – Proportion of time spent undoing speculative work
  – A high value implies too many strict dependencies in the simulation
contd...
● Communication fraction:
  – Fraction of total time spent communicating
● Simulation dependencies:
  – Posers should be distributed across processors so as to minimize dependencies
● Simulation strategy to use:
  – Optimistic; Adept; etc.
  – Control the amount of throttling (the speculative window)
● Speedup with sequential simulation:
  – Sequential simulation is faster as it gets rid of all synchronization, provided it fits in memory
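The metrics listed above are simple ratios, and a small Python sketch shows how they could be derived from raw timings. The function and argument names are hypothetical (POSE's own statistics output is not reproduced here), and the average phase time is approximated as total time divided by the number of GVT synchronizations.

```python
def summarize_run(total_time_s, gvt_syncs, rollback_time_s, comm_time_s):
    """Derive the performance indicators from raw timings in seconds."""
    if total_time_s <= 0 or gvt_syncs <= 0:
        raise ValueError("total time and GVT sync count must be positive")
    return {
        # Average real time between consecutive GVT synchronizations.
        "phase_time_s": total_time_s / gvt_syncs,
        # Share of time spent undoing speculative work.
        "rollback_fraction": rollback_time_s / total_time_s,
        # Share of time spent communicating.
        "communication_fraction": comm_time_s / total_time_s,
    }


if __name__ == "__main__":
    # Made-up timings, for illustration only.
    stats = summarize_run(total_time_s=10.0, gvt_syncs=2000,
                          rollback_time_s=1.5, comm_time_s=2.5)
    for name, value in stats.items():
        print(name, value)
```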
DetailedSim – performance case study
● DetailedSim (with switch modelled as a single poser)
  – running on 16 processors
  – simulating a 2048 node hypercube network
  – random traffic generated at each processing node
● Speculation still within reasonable limits (<20%)
● Phase time very small (<5ms)
contd...
● Poor real speedup
● Breakeven with sequential at 12 procs
● Increasing number of processors worsened the problem
  – Synchronization more expensive
● Did not scale
Identify the problem
● Large switch poser
  – Trying to do a lot of activities
  – Hence had a very complex state
  – Handles a disproportionately large number of events
  – Faces large number of rollbacks
  – Leading to frequent synchronizations
  – Not allowing the GVT to advance
  – Large state size caused each checkpoint to be expensive
  – Large number of events meant frequently checkpointing its state
The Solution
● Decompose switch into fine-grained posers
● Ports are logical parallel entities in a switch
● Refactor switch into a number of ports
● Smaller state; infrequent events
● Meticulously refactor, so as not to increase the number of events
● Output Buffered switches were refactored
● Input Buffered switches need a complex arbitration mechanism involving a central switch state
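A back-of-the-envelope Python sketch suggests why smaller posers checkpoint more cheaply. It rests on strong simplifying assumptions that are not from the slides: one checkpoint of the full poser state per event, state and events split evenly across ports, and no extra events introduced by the refactoring. The function names are hypothetical.

```python
def checkpoint_bytes(events, state_bytes):
    """Bytes copied if a poser checkpoints its whole state on every event."""
    return events * state_bytes


def decomposed_checkpoint_bytes(events, state_bytes, ports):
    """Same workload split evenly over `ports` port posers."""
    per_port = checkpoint_bytes(events // ports, state_bytes // ports)
    return ports * per_port


if __name__ == "__main__":
    # Made-up workload: 6000 events, 1200-byte switch state, 6 ports.
    print(checkpoint_bytes(6000, 1200))                # 7200000
    print(decomposed_checkpoint_bytes(6000, 1200, 6))  # 1200000
```

Under these assumptions the checkpointed volume drops by a factor equal to the number of ports, since each event now copies only one port's share of the state.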
Improved Results
● Phase time up
● # GVT iterations down
● Rollback fraction ok
● Simulation time halved
● We still had a problem:
  – Could not scale!!
● Expedited GVT calculation
  – First idle processor triggers a GVT calculation, and everyone has an updated GVT, not waiting for the phase to finish
  – GVT computation gets highest priority, if any processor is idle
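For readers unfamiliar with GVT, the following Python sketch shows the standard Time Warp definition: the minimum over every processor's local virtual time and the timestamps of messages still in transit. It also includes a toy version of the "first idle processor triggers" rule. Both functions are illustrative assumptions with hypothetical names, not POSE's actual GVT algorithm.

```python
def compute_gvt(local_virtual_times, in_transit_timestamps):
    """Global virtual time: no event earlier than this can still occur."""
    candidates = list(local_virtual_times) + list(in_transit_timestamps)
    if not candidates:
        raise ValueError("need at least one timestamp")
    return min(candidates)


def should_trigger_gvt(idle_flags):
    """Expedited rule: start a GVT computation as soon as any processor is idle."""
    return any(idle_flags)


if __name__ == "__main__":
    print(compute_gvt([120, 95, 140], [90, 200]))    # 90
    print(should_trigger_gvt([False, True, False]))  # True
```

State saved for times earlier than the GVT can be discarded, so advancing it promptly also keeps memory use in check.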
Load Imbalances
● Transient load imbalance went down
● # GVT computations up
● Improved scaling
● But, small cyclic imbalance
● Application-specific dependencies
● Distribute posers to minimize simulation dependencies
● Partition input problem randomly
Communication load
● Important consideration for fine-grained simulation is communication
● Partition along the min-cut of the application communication graph
  – decreases communication
  – might increase inherent application dependencies among various partitions
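The effect of partitioning on communication can be illustrated with a tiny Python sketch that sums the weight of edges crossing partition boundaries. The graph, weights and function name are made up for illustration; finding an actual min-cut partition would need a graph partitioner, which is outside this sketch.

```python
def cut_weight(edges, partition):
    """Total weight of edges whose endpoints lie in different partitions.

    edges: dict mapping (u, v) pairs to communication volume.
    partition: dict mapping each node to its partition id.
    """
    return sum(weight for (u, v), weight in edges.items()
               if partition[u] != partition[v])


if __name__ == "__main__":
    # Four posers in a ring; pairs (0,1) and (2,3) communicate heavily.
    edges = {(0, 1): 10, (1, 2): 1, (2, 3): 10, (3, 0): 1}
    along_min_cut = {0: "A", 1: "A", 2: "B", 3: "B"}
    interleaved = {0: "A", 1: "B", 2: "A", 3: "B"}
    print(cut_weight(edges, along_min_cut))  # 2
    print(cut_weight(edges, interleaved))    # 22
```

The low-cut partition keeps heavy communication local, but as noted above it may concentrate dependent posers in ways that raise the dependencies between partitions.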
Performance results
● Hypercube networks
● Run on Turing
● Reached over 2.5 million events/sec on 128 processors
Communication Challenges
● An 8192 node hypercube network across 128 procs
  – Fits in memory comfortably
  – Communication: 50MB/s per processor
  – Small messages (msg size ~250 bytes)
  – Myrinet just about handles this
● A step further:
  – 16384 node hypercube on 128 procs
    ● Still fits in memory
    ● Myrinet starts dropping packets at an alarming rate
    ● NIC freezes
    ● Runs out of execution time
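The figures above imply a very high message rate, which a short Python calculation makes explicit. It assumes 1 MB = 10^6 bytes and treats the ~250 byte message size as exact; the function name is hypothetical and the results are derived arithmetic, not measurements.

```python
def messages_per_second(bytes_per_second, message_bytes):
    """Message rate implied by a bandwidth and a fixed message size."""
    return bytes_per_second / message_bytes


if __name__ == "__main__":
    per_proc = messages_per_second(50e6, 250)
    print(per_proc)        # 200000.0 messages/s per processor
    print(per_proc * 128)  # 25600000.0 messages/s across 128 processors
    print(8192 // 128)     # 64 simulated nodes per processor
    print(16384 // 128)    # 128 simulated nodes per processor
```

At roughly 200,000 small messages per second per processor, per-message overhead rather than raw bandwidth is the likely bottleneck, which is consistent with the NIC problems seen when the simulated network size doubles.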
Conclusion
● Virtualization and fine decomposition coupled with adaptive synchronization strategies help to address the challenges of large-scale fine-grained PDES
● Excellent problem-size and self scaling
● Careful decomposition of complex objects required
● Modelling posers correctly is essential for the simulation to have good performance and scale
Download charm / POSE
● Charm++ / POSE / BigNetSim all freely downloadable at http://charm.cs.uiuc.edu/
● For more information on the research projects: http://charm.cs.uiuc.edu/research/
● POSE: http://charm.cs.uiuc.edu/research/pose
● BigNetSim: http://charm.cs.uiuc.edu/research/BigNetSim