Performance analysis of a Pose application -- BigNetSim


Nilesh Choudhury

BigNetSim

● A parallel simulator for performance prediction of parallel machines
● Two components:
  – Processor performance modelling
  – Interconnection Network modelling
● The two components could be used individually or in synergy.
● Two modes of operation:
  – Direct Simulation (on-line mode)
  – Trace-driven simulation (off-line mode)

Architecture of BigNetSim

Which part is most relevant to POSE performance?

● The Interconnection Network simulation

Network Simulator Modelling

● Very detailed model:
  – Switch modelled as:
    ● Collection of Input/Output ports
    ● Arbitration strategies to serve incoming requests fairly
    ● Detailed Virtual Channel selection strategies
      – Input VC and Output VC
    ● Switch delays and arbitration costs modelled here
    ● Switch load and contention measures computed and updated to assist adaptive routing strategies and fault-tolerant routing
    ● Virtual cut-through routing; store-and-forward routing
    ● Number of posers per switch = # ports!

Network Simulator Modelling

  – Network Interface Card (NIC)
    ● 'Send NIC' packetizes and sends a message at the CPU's request
    ● 'Recv NIC' unpacks, reassembles and delivers a message to the CPU on receiving incoming packets.
    ● Network card send and receive latencies modelled here
    ● Number of posers per NIC = 2
  – Channel
    ● Doesn't need to be very sophisticated
    ● Models a simple channel delay: it receives a packet from a switch/NIC and delivers it to the switch/NIC it is connected to.
    ● Number of posers per channel = 1
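The send/receive NIC roles described above can be illustrated with a minimal sketch. This is not BigNetSim code: the `Packet` fields and the `packetize` / `reassemble` names are hypothetical, and latencies are left out entirely.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Packet:
    # All field names are illustrative assumptions, not BigNetSim's layout.
    msg_id: int     # which message this packet belongs to
    seq: int        # position of this packet within the message
    total: int      # number of packets in the whole message
    payload: bytes


def packetize(msg_id: int, message: bytes, packet_size: int) -> List[Packet]:
    """'Send NIC' role: split a message into fixed-size packets."""
    if packet_size <= 0:
        raise ValueError("packet_size must be positive")
    chunks = [message[i:i + packet_size]
              for i in range(0, len(message), packet_size)] or [b""]
    return [Packet(msg_id, seq, len(chunks), chunk)
            for seq, chunk in enumerate(chunks)]


def reassemble(packets: List[Packet]) -> bytes:
    """'Recv NIC' role: rebuild the message once every packet has arrived."""
    if not packets or len(packets) != packets[0].total:
        raise ValueError("message is incomplete")
    ordered = sorted(packets, key=lambda p: p.seq)
    return b"".join(p.payload for p in ordered)
```

Packets may arrive in any order; the receive side only delivers once the full set is present.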

Topologies and Routing Algorithms

● Topology and Routing strategies provide functions which the network uses
● Extremely modular design
● Write your own routing strategies
● Write your own topology
● We have some available:
  – KaryNcube; KaryNmesh; KaryNtree; Nmesh; fattree; hypercube and some hybrid variations

Routing Algorithms

● Minimal deadlock-free; Non-minimal and Fault-tolerant variations
  – K-ary-N-mesh / N-mesh
    ● Direction Ordered
    ● Planar Routing
    ● Static Direction Reversal Routing
    ● Optimally Fully Adaptive Routing (modified too)
  – K-ary-N-tree
    ● UpDown (modified, non-minimal)
  – HyperCube
    ● Hamming
    ● P-Cube (modified too)
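As an illustration of minimal routing on a hypercube, here is a generic textbook-style sketch that corrects the differing address bits from lowest to highest. It is an assumption made for illustration and is not taken from BigNetSim's Hamming or P-Cube implementations; the function name is hypothetical.

```python
from typing import List


def hypercube_route(src: int, dst: int) -> List[int]:
    """Return the node sequence of a minimal path from src to dst.

    Node addresses are bit strings; two nodes are neighbours when their
    addresses differ in exactly one bit. Each hop fixes the lowest bit in
    which the current node still differs from the destination.
    """
    path = [src]
    current = src
    while current != dst:
        diff = current ^ dst           # bits still to be corrected
        lowest_bit = diff & -diff      # isolate the lowest set bit
        current ^= lowest_bit          # move along that dimension
        path.append(current)
    return path
```

The number of hops equals the Hamming distance between the two addresses, which is the shortest possible path length in a hypercube.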

Input/Output VC selection

● Input Virtual Channel Selection
  – RoundRobin
  – Shortest Length Queue
  – Output Buffer length
● Output Virtual Channel Selection
  – Max. available buffer length
  – Max. available buffer bubble VC
  – Output Buffer length
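Two of the listed strategies can be sketched as follows. This is a minimal illustration under assumed data structures (a list of per-VC queues, and a list of free buffer slots per VC); the class and function names are hypothetical and do not come from BigNetSim.

```python
from typing import List, Optional, Sequence


class RoundRobinSelector:
    """Input VC selection: serve non-empty VCs in rotating order."""

    def __init__(self, num_vcs: int) -> None:
        self.num_vcs = num_vcs
        self.next_vc = 0  # where the next search starts

    def select(self, queues: Sequence[List]) -> Optional[int]:
        for offset in range(self.num_vcs):
            vc = (self.next_vc + offset) % self.num_vcs
            if queues[vc]:                       # this VC has a packet waiting
                self.next_vc = (vc + 1) % self.num_vcs
                return vc
        return None                              # nothing to serve


def max_available_buffer(free_slots: Sequence[int]) -> int:
    """Output VC selection: pick the VC with the most free buffer space.

    Ties are resolved in favour of the lowest VC index.
    """
    return max(range(len(free_slots)), key=lambda vc: free_slots[vc])
```

Round robin keeps service fair across input VCs, while choosing the emptiest output buffer spreads load over the output VCs.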

Building up a machine

● Involves selecting the processor capabilities
● Selecting the Interconnection network
  – Available set of topologies, routing algorithms, virtual channel selection strategies
  – Easy to build an interconnection network closely modelling the target machine
  – All these modules are easily extensible, so you can create and plug in a new topology, routing algorithm, etc.
● Some preconfigured machines include:
  – Bluegene; RedStorm; lemieux; etc.
  – Generalized hypercube, fat-tree, tori and mesh architectures

Hardware support for Collectives

● You could also model a network with hardware collectives for multicast, reduction and broadcast

● Collective Manager is interfaced with the basic network units

● You need to define the collective manager operations for the corresponding topology

● Already available for:
  – Hypercube; fattree; densegraph and hybrid variations

Network configuration Parameters

● Apart from routing algorithm, topology, virtual channel selection, switch size (number of ports) and number of virtual channels associated with a port, the configurable parameters include:
  – Size of network
  – Channel bandwidth; Switch bandwidth
  – NIC send/recv packet latencies
  – NIC packetization costs
  – Switch buffer size
  – Size of a single packet
  – Delays in various components
  – DMA delay; Processor send overhead; etc.

How does this modelling translate into the POSE framework?

● If we model the following machine:
  – 'n' nodes;
  – 's' switches;
  – 'p' ports per switch
● There are:
  – 4*n + 2*p*s posers.
  – A proc, co-proc, send-NIC, recv-NIC per node
  – 'p' ports and 'p' channels per switch

An example

● Suppose we model a 2048 node bluegene network connected as a 3D torus:
  – n=2048;
  – s=2048;
  – p=6;
  – Total number of posers = 4*2048 + 2*6*2048 = 16*2048 = 32,768 posers.
  – Ample virtualization to run this simulation on 100 processors.
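The poser count for a given machine can be checked with a short calculation. The function names below are hypothetical; the formula is the one stated on the previous slide (four posers per node, plus one poser per port and one per channel for each switch).

```python
def poser_count(nodes: int, switches: int, ports_per_switch: int) -> int:
    """Total posers: 4 per node (proc, co-proc, send-NIC, recv-NIC)
    plus p port posers and p channel posers per switch."""
    return 4 * nodes + 2 * ports_per_switch * switches


def posers_per_processor(total_posers: int, processors: int) -> float:
    """Average degree of virtualization on the simulating machine."""
    return total_posers / processors


total = poser_count(nodes=2048, switches=2048, ports_per_switch=6)
print(total)                              # total posers for the 3D torus example
print(posers_per_processor(total, 100))   # average posers per simulating processor
```

For the 2048-node torus this gives a few hundred posers per processor on 100 processors, which is what the slide means by ample virtualization.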

Factors related to Performance

● Number of GVT synchronizations:
  – Gives insight into the amount of parallelism available within the threshold controlled by the simulation
  – A large number of synchronizations possibly means little work within the allowable limits
● Phase Time: real time elapsed between consecutive GVT synchronizations
  – Indicates the amount of parallelism
● Rollback fraction
  – Proportion of time spent undoing speculative work
  – A high value implies too many strict dependencies in the simulation

contd...

● Communication fraction:
  – Fraction of total time spent communicating
● Simulation dependencies:
  – Posers should be distributed across processors so as to minimize dependencies
● Simulation strategy to use:
  – Optimistic; Adept; etc.
  – Control the amount of throttling (speculative window)
● Speedup relative to sequential simulation:
  – Sequential simulation is faster as it gets rid of all synchronization, provided it fits in memory
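The ratios named on these two slides can be derived from a handful of timing counters. The sketch below assumes hypothetical counter names (they are not POSE's actual statistics fields) and approximates the mean phase time as total time divided by the number of GVT synchronizations.

```python
from dataclasses import dataclass


@dataclass
class RunCounters:
    # Hypothetical counters; POSE's own statistics may be named and
    # collected differently.
    total_time_s: float        # wall-clock time of the run
    rollback_time_s: float     # time spent undoing speculative work
    comm_time_s: float         # time spent communicating
    gvt_syncs: int             # number of GVT synchronizations


def rollback_fraction(c: RunCounters) -> float:
    return c.rollback_time_s / c.total_time_s


def communication_fraction(c: RunCounters) -> float:
    return c.comm_time_s / c.total_time_s


def mean_phase_time_s(c: RunCounters) -> float:
    """Average real time between consecutive GVT synchronizations."""
    return c.total_time_s / c.gvt_syncs


def speedup(sequential_time_s: float, parallel_time_s: float) -> float:
    """Values below 1.0 mean the parallel run is slower than sequential."""
    return sequential_time_s / parallel_time_s
```

A short mean phase time together with a high synchronization count suggests that little useful work is being done between GVT computations.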

DetailedSim – performance case study

● DetailedSim (with switch modelled as a single poser)
  – running on 16 processors
  – simulating a 2048 node hypercube network
  – random traffic generated at each processing node
● Speculation still within reasonable limits (<20%)
● Phase time very small (<5ms)

contd...

● Poor real speedup
● Breakeven with sequential at 12 procs
● Increasing number of processors worsened the problem
  – Synchronization more expensive
● Did not scale

Identify the problem

● Large switch poser
  – Trying to do a lot of activities
  – Hence had a very complex state
  – Handles a disproportionately large number of events
  – Faces a large number of rollbacks
  – Leading to frequent synchronizations
  – Not allowing the GVT to advance
  – Large state size caused each checkpoint to be expensive
  – Large number of events meant frequently checkpointing its state

The Solution

● Decompose switch into fine-grained posers
● Ports are logical parallel entities in a switch.
● Refactor switch into a number of ports
● Smaller state; infrequent events
● Meticulously refactor, so as not to increase the number of events
● Output Buffered switches were refactored
● Input Buffered switches need a complex arbitration mechanism involving a central switch state

Improved Results

● Phase time up
● # GVT iterations down
● Rollback fraction ok
● Simulation time halved
● We still had a problem:
  – Could not scale!!
● Expedited GVT calculation
  – First idle processor triggers a GVT calculation, and everyone has an updated GVT, not waiting for the phase to finish
  – GVT computation gets highest priority, if any processor is idle
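The idea can be sketched with the standard definition of GVT as the minimum over every processor's local virtual time and the timestamps of messages still in transit. This is a simplified, single-process illustration and not POSE's distributed GVT algorithm; the function names and the idle-flag trigger are assumptions.

```python
from typing import Iterable, Sequence


def compute_gvt(local_virtual_times: Iterable[int],
                in_transit_timestamps: Iterable[int]) -> int:
    """GVT is a lower bound on any timestamp that can still be processed.

    No event earlier than this value can arrive, so state older than the
    GVT can be committed and its checkpoints discarded.
    """
    candidates = list(local_virtual_times) + list(in_transit_timestamps)
    if not candidates:
        raise ValueError("need at least one local virtual time")
    return min(candidates)


def should_trigger_gvt(processor_idle: Sequence[bool]) -> bool:
    """Expedited trigger: start a GVT computation as soon as any
    processor runs out of work, instead of waiting for the phase to end."""
    return any(processor_idle)
```

A message in transit with an old timestamp holds the GVT back even when every processor has advanced past it, which is why in-flight messages must be included.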

Load Imbalances

● Transient load imbalance went down
● # GVT computations up
● Improved scaling
● But, small cyclic imbalance
● Application specific dependencies
● Distribute posers to minimize simulation dependencies
● Partition input problem randomly

Communication load

● Important consideration for fine-grained simulation is communication
● Partition along the min-cut of the application communication graph
  – decreases communication
  – might increase inherent application dependencies among various partitions
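To compare candidate partitions, one can measure how much communication crosses partition boundaries. The sketch below assumes a hypothetical representation (a dict of weighted edges and a dict mapping each poser to its processor); it only evaluates a given partition and does not search for the min-cut itself.

```python
from typing import Dict, Hashable, Tuple

Edge = Tuple[Hashable, Hashable]


def cut_weight(edges: Dict[Edge, float],
               part_of: Dict[Hashable, int]) -> float:
    """Sum the weights of edges whose endpoints sit on different processors."""
    return sum(weight for (u, v), weight in edges.items()
               if part_of[u] != part_of[v])


# Hypothetical communication graph: weights are message volumes.
edges = {("a", "b"): 10, ("b", "c"): 1, ("c", "d"): 10, ("a", "d"): 1}

along_min_cut = {"a": 0, "b": 0, "c": 1, "d": 1}   # heavy edges kept local
across_heavy = {"a": 0, "c": 0, "b": 1, "d": 1}    # heavy edges cross processors

print(cut_weight(edges, along_min_cut))
print(cut_weight(edges, across_heavy))
```

A lower cut weight means less inter-processor traffic, though as the slide notes it says nothing about the dependencies that a partition introduces.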

Performance results

● Hypercube networks
● Run on Turing
● Reached over 2.5 million events/sec on 128 processors

Communication Challenges

● An 8192 node hypercube network across 128 procs
  – Fits in memory comfortably
  – Communication: 50MB/s per processor
  – Small messages (msg size ~250 bytes)
  – Myrinet just about handles this
● A step further:
  – 16384 node hypercube on 128 procs
    ● Still fits in memory
    ● Myrinet starts dropping packets at an alarming rate
    ● NIC freezes
    ● Runs out of execution time
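The figures on this slide imply a high message rate per processor, which a quick calculation makes concrete. The sketch assumes 1 MB = 1,000,000 bytes and treats the ~250 byte message size as exact; the function names are hypothetical.

```python
BYTES_PER_MB = 1_000_000  # assumption: decimal megabytes


def messages_per_second(bandwidth_mb_per_s: float, message_bytes: int) -> float:
    """Message rate needed to sustain a bandwidth with fixed-size messages."""
    return bandwidth_mb_per_s * BYTES_PER_MB / message_bytes


def simulated_nodes_per_processor(network_nodes: int, processors: int) -> int:
    return network_nodes // processors


print(messages_per_second(50, 250))                # messages/sec per processor
print(simulated_nodes_per_processor(8192, 128))    # first configuration
print(simulated_nodes_per_processor(16384, 128))   # the step further
```

At this message size the load on the network is dominated by message count rather than raw bandwidth, and the larger configuration doubles the number of simulated nodes each processor has to serve.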

Conclusion

● Virtualization and fine decomposition coupled with adaptive synchronization strategies help to address the challenges of large-scale fine-grained PDES
● Excellent problem-size and self scaling
● Careful decomposition of complex objects required
● Modelling posers correctly is essential for the simulation to have good performance and scale

Download charm / POSE

● Charm++ / POSE / BigNetSim all freely downloadable at http://charm.cs.uiuc.edu/

● For more information on the research projects: http://charm.cs.uiuc.edu/research/

● POSE: http://charm.cs.uiuc.edu/research/pose

● BigNetSim: http://charm.cs.uiuc.edu/research/BigNetSim


Date post:	02-Feb-2016