TG11 - SGI Altix UV Tutorial NCSA - PSC - RDAV
Altix UV HW/SW
• SGI Altix UV utilizes an array of advanced hardware and software features to offload:
thread synchronization
data sharing
message passing overhead from CPUs.
• This system has a rich set of hardware features that enable scalable programming models to be implemented with high efficiency and performance.
SGI MPI
• The SGI MPI software stack includes a number of software components.
SGI MPI Software Stack
• MPI
• XPMEM (cross-process memory mapping)
• GRU development kit
• NUMA tools
• Perfboost
• Perfcatcher
• MPInside
UV HUB
The UV_HUB is a custom ASIC developed by SGI. It implements the NUMAlink5 protocol, memory operations, and associated atomic operations. It provides the following capabilities:
• Cache-coherent global shared memory.
• Offloading of time-sensitive and data-intensive operations from processors to increase processing efficiency and scaling.
• Scalable, reliable, fair interconnect with other blades via NUMAlink5.
Altix UV Blade and HUB
Source : SGI Altix UV 1000 System User’s Guide
UV HUB in detail
• SI (socket interface): provides a bridge between the Hub’s LH and RH chip sets and the Intel sockets.
• To communicate with the Intel sockets, the SI implements an Intel proprietary interconnect called CSI (Common System Interface).
Source: SGI Altix UV admin manual
UV HUB in detail
• LH (local home):
manages directory operations associated with remote memory requests. The LH has a single external memory channel.
• RH (remote home):
processes coherent and non-coherent CSI transactions that are initiated by a local socket to a remote system address.
processes NUMAlink intervention and invalidate requests when remote memory is locally cached by a socket.
• LB (local block):
provides system software the ability to select, configure, and control various functions of the UV hub chip.
provides facilities to monitor, diagnose, and debug hardware states and operations on live systems.
UV HUB Units
• The NUMAlink interconnect
• The Global reference unit(GRU)
• The processor interconnect
NUMAlink Interconnect
• Shared-memory, globally addressable system interconnect.
• All physically distributed system memory is mapped into one global address space.
• Peak aggregate bidirectional bandwidth of 15 GB/s.
• 2-3x MPI latency improvement.
• Special support for block transfers and global operations.
• NUMAlink is connected directly into the memory infrastructure of the system, rather than indirectly through an I/O subsystem chip.
Fetch-Op in HUB
• Fetch-Op variables on the Hub provide fast synchronization.
• The Fetch-Op AMO (atomic memory operation) helped reduce MPI send/recv latency from 12 to 8 microseconds.
• Used by MPI_Barrier, MPI_Win_fence, and shmem_barrier_all.
[Figure: CPU, HUB, ROUTER, and Fetch-Op variable]
GRU
• Hardware in the Hub for memory-to-memory block transfers and CPU synchronization events.
• It is used by MPI, SHMEM, and UPC.
• External TLB with large page support
• Page initialization
• Scatter/Gather operations
• Update cache for AMOs
GRU API Components
• GRU resource allocators
• GRU memory access functions
• XPMEM address mapping functions
• MPT address mapping functions
MOE
• The MPI Offload Engine (MOE) is a set of functions that offloads MPI communication workload from the CPUs to the Altix UV_HUB ASIC, accelerating common MPI tasks such as barriers and reductions across GSM (global shared memory).
• Similar in concept to a TCP/IP offload engine (TOE), which offloads TCP/IP protocol processing from system CPUs.
• Frees CPUs from MPI activity.
• Faster reduction operations.
• Faster barriers and random access.
MPI and MOE
• Accessing the MOE.
• The MOE implements atomic memory operations in conjunction with a hardware multicast facility that helps to accelerate MPI_Barrier, MPI_Bcast, and MPI_Allreduce.
• Accelerates MPI point-to-point and collective communication.
MOE Advantages
• MOE provides:
MPI message queues
synchronization primitives
Advanced RDMA capabilities such as strided and indexed global memory updates.
Hardware multicast.
Determining System Configuration
topology
• topology displays general information about an SGI Altix system, with a focus on node information.
• It includes node counts for blades, node IDs, NASIDs, memory per node, UV hub, and partition number.
cpumap
• cpumap displays logical CPUs and shows the relationships between them.
• Aspects displayed include hyper-threading, last-level cache sharing, and topological placement.
• It gets its information from /proc/cpuinfo, /sys/devices/system, and /proc/sgi_uv/topology.
nodeinfo
• Hit: page was allocated on the preferred node.
• Miss: the process preferred another node that was full, so the page was allocated on this node instead.
• Foreign: this node was preferred but was full, so the page had to be allocated somewhere else.
• Interleave: allocation was made under the interleaved policy (numactl -i).
• Local: page allocated on this node by a process running on this node.
• Remote: page allocated on this node by a process running on another node.
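The same kind of per-node counters can be read directly from Linux sysfs. The following is a minimal sketch, assuming the kernel exposes /sys/devices/system/node/node*/numastat (counters such as numa_hit, numa_miss, numa_foreign, interleave_hit, local_node, other_node); show_numastat is a hypothetical helper name, not an SGI tool, and whether nodeinfo reads these exact files is an assumption.

```bash
#!/usr/bin/env bash
# show_numastat [BASE_DIR]: hypothetical helper, not part of the SGI tools.
# Prints "<node> <counter>=<value>" for every per-node numastat file found.
show_numastat() {
    local base=${1:-/sys/devices/system/node}
    local f node
    for f in "$base"/node*/numastat; do
        [ -r "$f" ] || continue            # no NUMA sysfs entries on this machine
        node=${f%/numastat}                # strip the file name
        node=${node##*/}                   # keep only "nodeN"
        # Each line of numastat has the form "<counter> <value>"
        awk -v n="$node" '{ printf "%s %s=%s\n", n, $1, $2 }' "$f"
    done
}

show_numastat
```

A steadily growing numa_miss or other_node count on a node suggests that processes are not getting memory where they run, which is the situation the placement tools later in this tutorial address.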
x86info
pmchart • Put figure here
HW Summary
• /proc/cpuinfo
• /proc/meminfo
• /sys/devices/system/node
• /dev/cpuset/torque/job#
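A minimal sketch of how these files can be combined into a one-line hardware summary. hw_summary is a hypothetical helper; the three path arguments default to the standard Linux locations and exist only so the function can be pointed at other files.

```bash
#!/usr/bin/env bash
# hw_summary [CPUINFO] [MEMINFO] [NODE_DIR]: hypothetical helper.
# Counts logical CPUs, reads total memory, and counts NUMA node directories.
hw_summary() {
    local cpuinfo=${1:-/proc/cpuinfo}
    local meminfo=${2:-/proc/meminfo}
    local nodes=${3:-/sys/devices/system/node}
    local ncpu memkb d nnode=0

    # One "processor : N" line per logical CPU (x86 layout of /proc/cpuinfo)
    ncpu=$(grep -c '^processor' "$cpuinfo" 2>/dev/null)
    # MemTotal is reported in kB
    memkb=$(awk '/^MemTotal:/ { print $2 }' "$meminfo" 2>/dev/null)
    # One nodeN directory per NUMA node
    for d in "$nodes"/node[0-9]*; do
        [ -d "$d" ] && nnode=$(( nnode + 1 ))
    done

    echo "logical_cpus=${ncpu:-0} mem_total_kB=${memkb:-unknown} numa_nodes=${nnode}"
}

hw_summary
```

Inside a batch job the numbers describe the whole machine, not the job's cpuset; the cpuset directory listed above is the place to look for the job's own allocation.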
Data Placement Tool
CPU Scheduling
• In a single-processor system, only one process can run at a time.
• CPU scheduling controls how the OS switches access to the CPU between processes.
• The kernel provides a mechanism called time slicing.
• A time slice is the maximum length of time that a process owns its CPU resource and executes at its current policy.
• Each CPU has its own run queue.
Cache Affinity
• Affinity scheduling is a special scheduling discipline used in multiprocessor systems.
• As a process executes, it causes more and more data and instruction text to be loaded into the processor cache. This creates an “affinity” between the process and the CPU.
Data Placement Tool
• NUMA machines have a shared address space: there is a single shared memory space and a single operating system instance.
• Accessing remote memory carries a performance penalty compared with accessing local memory.
• Memory access time varies over physical address ranges and between processing elements. NUMAlink is used to access memory on other blades/nodes.
• Memory latency is lowest when a processor accesses local memory.
• The NUMA tools also help run multiple instances of a serial program in a single job script with better process placement.
NUMA API
• The API is called from libcpuset.
cpuset: create, modify, or destroy a cpuset.
taskset: run a process on specific physical CPUs.
numactl: control NUMA policy for processes or shared memory.
dplace: bind a process to a specific logical CPU.
omplace: control the placement of MPI processes and OpenMP threads.
Batch systems: LSF, PBSPro, Torque, SGE
dlook, dlook-summary, pidstat, cpuset-Q
cpuset
• cpusets extend sched_setaffinity (CPU binding) and mbind/set_mempolicy (memory binding).
• Each task has a link to a cpuset structure that specifies the CPUs and memory nodes available for its use.
• All tasks sharing the same placement constraints reference the same cpuset.
Why Use a cpuset?
• Restrict consumption of designated resources (CPUs) to specified processes/threads.
• Limit run-time variability.
• Memory affinity.
• Isolates the I/O.
How Are cpusets Used
• Static cpusets (batch calls shared by queue)
Cpusets are defined by the administrator after system startup.
Users attach processes to the existing cpusets.
Cpusets continue to exist after jobs finish executing.
• Dynamic cpusets
The workload management system (WMS) creates a cpuset when it is required by a job.
The WMS attaches the job to the newly created cpuset.
The WMS destroys the cpuset at the end of the job.
cpuset Command Line Options
• cpuset
-c cpuset_name        Create CPUSET
-m cpuset_name        Modify CPUSET
-x cpuset_name        Destroy CPUSET
-d cpuset_name        Dump CPUSET attributes
-i csname -I script   Run command
-p cpuset_name        List all procs in CPUSET
-a cpuset_name        Attach pids to CPUSET
-w pid                List CPUSET the PID is attached to
-f filename           Input config file
Advantages of cpusets
• Improves cache locality and memory access time.
• Facilitates providing equal resources to each thread in a job, which results in both optimum and repeatable performance.
taskset
• taskset restricts execution to the listed set of CPUs. However, processes are still free to move among the listed CPUs.
• It is used to set or retrieve the CPU affinity of a running process given its PID, or to launch a new command with a given CPU affinity.
• The CPU affinity is represented as a hexadecimal bitmask, with the lowest-order bit corresponding to the first logical CPU:
0x00000001 ## is processor number 0.
taskset
• taskset does not pin a task to a specific CPU. It only restricts a task so that it does not run on any CPU that is not in the cpulist.
• If you are running an MPI application, do not use the taskset command. Use dplace or MPI_DSM_CPULIST instead:
mpirun -np 8 dplace -s1 -c 10,11,16-21 ./a.out
export MPI_DSM_CPULIST=10,11,16-21
mpirun -np 8 ./a.out
taskset examples
• taskset 0x1 ./a.out # executes on physical CPU 0
• taskset 0x00131 ./a.out # executes on physical CPUs 0, 4, 5, and 8
• taskset -p 0xa8 14386 # restricts PID 14386 to physical CPUs 3, 5, and 7
• taskset -c 5 ./a.out # executes a.out on physical CPU 5
• taskset -p 14386 # returns the affinity mask of PID 14386
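Reading a hexadecimal affinity mask by eye is error-prone, so a small decoder can help. The sketch below uses only bash arithmetic; mask_to_cpus is a hypothetical helper, not part of taskset, and it is limited to masks that fit in bash's signed 64-bit integers (CPUs 0 to 62), which is smaller than a large Altix UV.

```bash
#!/usr/bin/env bash
# mask_to_cpus MASK: hypothetical helper, not part of taskset.
# Prints the CPU numbers whose bits are set in a hexadecimal affinity mask.
# Bit 0 (the lowest-order bit) is CPU 0.
mask_to_cpus() {
    local mask=$(( $1 ))      # bash arithmetic understands the 0x prefix
    local cpu=0 out=""
    while (( mask > 0 )); do
        if (( mask & 1 )); then
            out+="${out:+ }${cpu}"   # add a separating space after the first entry
        fi
        mask=$(( mask >> 1 ))
        cpu=$(( cpu + 1 ))
    done
    echo "$out"
}

mask_to_cpus 0x1       # prints: 0
mask_to_cpus 0x00131   # prints: 0 4 5 8
mask_to_cpus 0xa8      # prints: 3 5 7
```

For CPU lists rather than masks, taskset -c accepts the list form directly, which avoids the conversion altogether.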
numactl
• Runs processes with a specific NUMA scheduling or memory placement policy.
• Control memory placement
Interleave node(round robin)
Membind(allocate from specified node pool)
Preferred node
Local allocation(first touch)
• Each task has a link to a cpuset structure that specifies the CPUs and memory node available for its use.
• All tasks sharing the same placement constraints reference the same cpuset.
numactl Command Line Options
• numactl
--interleave=nodes    Set a memory interleave policy across the nodes.
--membind=nodes       Only allocate memory from the nodes.
--cpunodebind=nodes   Only execute the command on the CPUs of the nodes.
--physcpubind=cpus    Only execute the process on the CPUs.
numactl examples
• numactl --physcpubind=+0-4,8-12 myapplic arguments # Run myapplic on CPUs 0-4 and 8-12 of the current cpuset.
• numactl --interleave=all bigdatabase arguments # Run bigdatabase with its memory interleaved across all nodes.
• numactl --cpunodebind=0 --membind=0,1 process # Run process on node 0 with memory allocated on nodes 0 and 1.
numactl --hardware
dplace
• dplace ensures the Linux kernel “pins” a thread (or series of threads) to a specific CPU core within a container. Once pinned, threads do not migrate.
• By default, it binds processes sequentially in a round-robin fashion to logical CPUs in the current cpuset.
• Integrates with MPT (via omplace and environment variables).
• It understands fork, exec, threads, etc.
• Helps to ensure optimal performance and to minimize run-time variability.
dplace Features
• Default memory allocation policy is node-local (first touch).
• dplace allows processes to be bound to specific logical (cpuset-relative) CPUs.
• Prevents migration (thread hopping).
• May require knowledge of the application.
• Global load balancing.
dplace Command Line Options
• dplace
-c   CPU list
-e   exact placement
-s   skip the first n processes before starting placement
-n   only processes with name
-x   skip mask
-p   placement file
-r   replicate shared text to each node
-q   list global count
dplace examples
• dplace -c 0-3 ./a.out # places threads on the first four CPUs, beginning with core 0.
• dplace -c 0-7 -x2 ./a.out # places threads on the first 8 CPUs, but uses the skip mask (-x2) to skip the second thread (which in the case of Intel OpenMP is the lightweight monitor thread).
• mpirun -np 8 dplace -s1 -c 0-7 ./a.out # skips the first process, which is essentially the MPI shepherd. dplace then handles the placement of the 8 MPI ranks.
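To see which process lands on which CPU, the sequential round-robin rule can be modeled in a few lines of bash. This is only a sketch of the behavior described on these slides, not dplace's actual implementation; expand_cpulist and simulate_placement are hypothetical helper names, and the process count of 9 assumes one MPI shepherd plus 8 ranks.

```bash
#!/usr/bin/env bash
# expand_cpulist LIST: hypothetical helper.
# Expands a CPU list such as "10,11,16-21" into "10 11 16 17 18 19 20 21".
expand_cpulist() {
    local part lo hi c
    local -a out=()
    local IFS=','
    for part in $1; do            # unquoted on purpose: split on commas
        lo=${part%-*}             # text before "-" (or the whole entry)
        hi=${part#*-}             # text after "-"  (or the whole entry)
        for (( c = lo; c <= hi; c++ )); do
            out+=("$c")
        done
    done
    IFS=' '
    echo "${out[*]}"
}

# simulate_placement NPROCS SKIP CPULIST: hypothetical model of round-robin
# placement. The first SKIP processes are left unplaced; the rest are assigned
# to the CPUs in list order, wrapping around if there are more processes than CPUs.
simulate_placement() {
    local nprocs=$1 skip=$2
    local -a cpus
    local i slot
    read -r -a cpus <<< "$(expand_cpulist "$3")"
    for (( i = 0; i < nprocs; i++ )); do
        if (( i < skip )); then
            echo "process $i: not placed (skipped)"
        else
            slot=$(( (i - skip) % ${#cpus[@]} ))
            echo "process $i: cpu ${cpus[slot]}"
        fi
    done
}

# Model of: mpirun -np 8 dplace -s1 -c 0-7 ./a.out  (shepherd + 8 ranks)
simulate_placement 9 1 0-7
```

Running the model before submitting a job is a cheap way to check that the CPU list is long enough for the number of ranks, since wrapping around puts two processes on the same CPU.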
numactl and dplace
• Consider a code that runs with 4 threads.
• What is the difference between
numactl --physcpubind=0-3 a.out
dplace -c 0-3 a.out
• With dplace, each thread is bound to a particular CPU. With numactl, the threads are bound to the range of CPUs 0-3 and are free to migrate within that range.
• numactl does have memory binding options.
omplace
• Tool for controlling the placement of MPI processes and OpenMP threads.
-c cpulist: specifies the effective CPU list.
-nt threads: specifies the number of threads per MPI process.
-s skip: specifies the number of processes to skip before placement starts.
-vv: verbose; the automatically generated placement file is displayed in its entirety.
omplace examples
• mpirun -np 2 omplace -nt 4 -vv ./a.out # runs 2 MPI processes with 4 threads per process and displays the generated placement file.
dlook
• Tool for showing process memory maps and CPU usage.
• View address space and page placement.
• Two forms:
dlook [options] pid
dlook [options] <command> [command-args]
• Run an MPI job using mpirun and print the memory map for each thread:
mpirun -np 8 dlook a.out
Summary
• Use cpumap to determine partitioning and placement.
• Use taskset to lock a process or process group onto a CPU or group of CPUs.
• Use dplace to place a process group into the system topology.
• When running an MPI/OpenMP hybrid, use omplace for pinning.
• Use numactl to control memory placement.
Tips!!!
• Use dplace, numactl, or cpuset to lock down processes, preventing thread hopping/migration.
• Strong cache affinity reduces cache misses and instruction pipeline flushes.
• Keeps processes close to their node-local memory.
• Be aware of data placement.
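One way to confirm that a lock-down took effect is to read the affinity the kernel actually recorded for a process. A minimal sketch, assuming a Linux /proc file system whose status file contains the Cpus_allowed_list field; allowed_cpus is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# allowed_cpus [STATUS_FILE]: hypothetical helper.
# Prints the Cpus_allowed_list field of a /proc/<pid>/status file.
# Defaults to the status file of the current shell.
allowed_cpus() {
    local status=${1:-/proc/$$/status}
    # The line looks like "Cpus_allowed_list:<TAB>0-3,8"; field 2 is the list
    awk '/^Cpus_allowed_list:/ { print $2 }' "$status"
}

# Show the CPUs this shell is allowed to run on, if /proc is available
if [ -r "/proc/$$/status" ]; then
    allowed_cpus
fi
```

To check a running job, pass the status file of one of its processes, for example allowed_cpus /proc/14386/status. A pinned process shows a single CPU; a process that is only restricted shows a range.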
Heisenberg Principle
• Looking at the system will impact the system.
• Tracing tools have the highest impact: strace, gprof.
• PCP and sar have the lowest impact.
• You cannot measure a system without affecting it: top will show up in the top display.
• PCP uses less than 1% of a CPU.
sar
• sar indicates normal/abnormal behavior of the system. sar can hint at performance problems and bottlenecks.
• Many people treat sar as a set of performance metrics, which it is not. It is an indicator of what a system is doing!
• PCP and sar simply tell you what to look for.
sar
• sar -vq to check kernel table sizes.
• sar -W to check swapping activity.
• sar -rsW to check what memory and swap are left.
• sar -u reports CPU utilization, including the amount of time spent executing kernel code.
top, ps, pstree
• top provides a dynamic real-time view of a running system.
• top with H provides thread information.
• ps reports a snapshot of the currently running processes on the system. Use it with grep <username> to get user-specific information.
• pstree displays a tree of processes.
vmstat, mpstat
• vmstat reports information about processes, memory, paging, block I/O, traps, and CPU activity.
• mpstat writes to standard output the activities for each available processor, processor 0 being the first one. Global average activities among all processors are also reported.
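Per-processor figures of this kind derive from the kernel's tick counters in /proc/stat. The sketch below shows the arithmetic under one simplifying assumption: it prints the average since boot, whereas mpstat reports rates over an interval. cpu_busy_since_boot is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# cpu_busy_since_boot [STAT_FILE]: hypothetical helper.
# For each "cpuN" line of /proc/stat, prints the percentage of ticks since
# boot that were not idle or iowait.
# Field layout: cpuN user nice system idle iowait irq softirq steal guest guest_nice
cpu_busy_since_boot() {
    local stat=${1:-/proc/stat}
    awk '/^cpu[0-9]+ / {
        total = 0
        # Sum user..steal (fields 2-9); guest time is already counted in user
        for (i = 2; i <= NF && i <= 9; i++) total += $i
        idle = $5 + $6
        if (total > 0) printf "%s %.1f\n", $1, 100 * (total - idle) / total
    }' "$stat"
}

if [ -r /proc/stat ]; then
    cpu_busy_since_boot
fi
```

On a well-placed job the CPUs in the job's cpuset should show similar busy percentages; one CPU far above the others can indicate that several threads were stacked on it.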
mpvis
• mpvis displays a three-dimensional bar chart of CPU utilization. The display is updated with new values retrieved from the target host or archive every interval seconds.
pidstat
• pidstat is used for monitoring individual tasks currently being managed by the Linux kernel.
-r report page faults and memory utilization
-d report I/O statistics
-u report CPU utilization
-p select tasks for which statistics are to be reported
-t display statistics for threads associated with selected tasks
• pidstat -t -p 14374
cpuset-Q
• It gives information about allocated CPUs, node, PID, WCHAN, command name, etc.
References
• UV System Analysis Manual
• UV System Administration Manual
• Technical Advances in the SGI Altix UV Architecture (white paper)
• A Hardware-Accelerated MPI Implementation on SGI Altix UV Systems (white paper)
• Linux Application Tuning Guide for SGI X86_64 Based Systems
• SGI Message Passing Toolkit (MPT) User’s Guide
• SGI NUMAlink white paper