Altix UV HW/SW · TG11 - SGI Altix UV Tutorial NCSA - PSC - RDAV

TG11 - SGI Altix UV Tutorial NCSA - PSC - RDAV

Altix UV HW/SW

•  SGI Altix UV uses an array of advanced hardware and software features to offload the following from the CPUs:

  thread synchronization

  data sharing

  message-passing overhead

•  This system has a rich set of hardware features that enable scalable programming models to be implemented with high efficiency and performance.

  SGI MPI

•  The SGI MPI software stack includes a number of software components.


SGI MPI Software Stack

•  MPI

•  XPMEM (cross-process memory mapping)

•  GRU development kit

•  NUMA tools

•  Perfboost

•  Perfcatcher

•  MPInside


UV HUB

The UV_HUB is a custom ASIC developed by SGI. It implements the NUMAlink5 protocol, memory operations, and associated atomic operations. It provides the following capabilities:

  Cache-coherent global shared memory.

  Offloading time-sensitive and data-intensive operations from processors to increase processing efficiency and scaling.

  Scalable, reliable, fair interconnect with other blades via NUMAlink5.


Altix UV Blade and HUB

Source : SGI Altix UV 1000 System User’s Guide


UV HUB in detail

•  SI (socket interface): provides a bridge between the Hub's LH and RH units and the Intel sockets.

•  To communicate with the Intel sockets, the SI implements an Intel proprietary interconnect called CSI (Common System Interface).

Source : SGI Altix UV admin manual


UV HUB in detail

•  LH (local home)

  manages directory operations associated with remote memory requests. The LH has a single external memory channel.

•  RH (remote home)

  processes coherent and non-coherent CSI transactions that are initiated by a local socket to a remote system address.

  processes NUMAlink intervention and invalidate requests when remote memory is cached locally by a socket.

•  LB(local block)

  provides system software the ability to select, configure and control various functionalities of the UV hub chip.

  provides facilities to monitor, diagnose, and debug hardware states and operations on live systems.


UV HUB Units

•  The NUMAlink interconnect

•  The Global reference unit(GRU)

•  The processor interconnect


NUMAlink Interconnect

•  Shared-memory, globally addressable system interconnect.

•  All physically distributed system memory is mapped into one global address space.

•  Peak aggregate bi-directional bandwidth of 15 GB/s.

•  2-3x MPI latency improvement.

•  Special support for block transfers and global operations.

•  NUMAlink is connected into the memory infrastructure of the system, versus being indirectly connected through an IO subsystem chip.


Fetch-Op in HUB

•  Fetch-Op variables on the Hub provide fast synchronization.

•  The Fetch-Op AMO helped reduce MPI send/recv latency from 12 to 8 microseconds.

•  Used by MPI_Barrier, MPI_Win_fence, and shmem_barrier_all.

[Figure: CPU, HUB, ROUTER, CPU, with a fetch-op variable]
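To make the role of a fetch-op variable in a barrier more concrete, here is a minimal software analogy in Python. It is only a sketch: the `FetchOpVariable` class and `run_barrier` function are hypothetical names, the atomic update is emulated with a lock rather than performed by hub hardware, and the barrier is a simple one-shot counter barrier, not SGI's implementation.

```python
import threading
import time


class FetchOpVariable:
    """Software stand-in for a hub-resident fetch-op variable (hypothetical)."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()  # emulates the atomicity the hardware provides

    def fetch_and_increment(self):
        # Atomically return the old value and add one.
        with self._lock:
            previous = self._value
            self._value += 1
            return previous

    def load(self):
        with self._lock:
            return self._value


def run_barrier(num_threads):
    """One-shot barrier: each thread increments the counter, then waits for all."""
    counter = FetchOpVariable()
    observed = []
    observed_lock = threading.Lock()

    def worker():
        counter.fetch_and_increment()        # announce arrival
        while counter.load() < num_threads:  # spin until every thread has arrived
            time.sleep(0)                    # yield so other threads can run
        with observed_lock:
            observed.append(counter.load())  # value seen after the barrier

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return observed
```

Every thread leaves the barrier only after the counter has reached `num_threads`, so each recorded value equals the thread count. On real hardware the point is that the increment is carried out at the hub instead of bouncing a cache line between CPUs.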


GRU

•  Hardware in the Hub for memory-to-memory block transfers and CPU synchronization events.

•  It is used by MPI, SHMEM, and UPC.

•  External TLB with large page support

•  Page initialization

•  Scatter/Gather operations

•  Update cache for AMOs


GRU API Components

•  GRU resource allocators

• GRU memory access functions

•  XPMEM address mapping functions

• MPT address mapping functions


MOE

•  The MPI Offload Engine (MOE) is a set of functionality that offloads MPI communication workload from the CPUs to the Altix UV_HUB ASIC, accelerating common MPI tasks such as barriers and reductions across GSM (global shared memory).

•  Similar in concept to a TCP/IP offload engine (TOE), which offloads TCP/IP protocol processing from the system CPUs.

•  Frees CPU from MPI activity.

•  Faster reduction operations.

•  Faster barriers and random access.


MPI and MOE

•  Accessing the MOE.

•  MOE implements atomic memory operations in conjunction with a hardware multicast facility that helps to accelerate MPI_Barrier, MPI_Bcast, and MPI_Allreduce.

•  Accelerates MPI point-to-point and collective communication.


MOE Advantages

•  MOE provides:

  MPI message queues

  synchronization primitives

  Advanced RDMA capabilities such as strided and indexed global memory updates.

  Hardware multicast.


Determining System Configuration


topology

•  topology: displays general information about an SGI Altix system, with a focus on node information.

•  It reports node counts for blades, node IDs, NASIDs, memory per node, the UV hub, and the partition number.


cpumap

•  cpumap: displays the logical CPUs and shows the relationships between them.

•  Aspects displayed include hyper-threading, last-level cache sharing, and topological placement.

•  It gets its information from /proc/cpuinfo, /sys/devices/system, and /proc/sgi_uv/topology.


cpumap


nodeinfo


nodeinfo

•  Hit: the page was allocated on the preferred node.

•  Miss: the preferred node was full, so the page was allocated on this node for a process running on another node.

•  Foreign: this was the preferred node but it was full, so the page had to be allocated somewhere else.

•  Interleave: the allocation was made under the interleaved policy (numactl -i).

•  Local: the page was allocated on this node by a process running on this node.

•  Remote: the page was allocated on this node by a process running on another node.
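The six categories above line up with the per-node counters the Linux kernel exposes in /sys/devices/system/node/node*/numastat (numa_hit, numa_miss, numa_foreign, interleave_hit, local_node, other_node). The sketch below assumes nodeinfo reports those same counters, which the slides do not state; `parse_numastat`, `remote_fraction`, and the sample values are made up for illustration.

```python
def parse_numastat(text):
    """Parse 'name value' lines, as found in a node's numastat file, into a dict."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def remote_fraction(counters):
    """Fraction of pages on this node that were allocated by remote processes."""
    local = counters.get("local_node", 0)
    remote = counters.get("other_node", 0)
    total = local + remote
    return remote / total if total else 0.0


# Illustrative sample only; real values come from
# /sys/devices/system/node/node<N>/numastat on a Linux system.
SAMPLE = """numa_hit 900
numa_miss 100
numa_foreign 50
interleave_hit 10
local_node 800
other_node 200
"""

sample_counters = parse_numastat(SAMPLE)
```

A high remote fraction or a growing miss count on a node is a hint that processes are not running close to their memory.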


x86info


pmchart

[Figure missing in the source slides]


HW Summary

•  /proc/cpuinfo

•  /proc/meminfo

•  /sys/devices/system/node

•  /dev/cpuset/torque/job#


Data Placement Tool


CPU Scheduling

•  In a single-processor system, only one process can run at a time.

•  CPU scheduling controls how the OS switches access to the CPU between processes.

•  The kernel provides a mechanism called time slicing.

•  A time slice is the maximum length of time that a process owns its CPU resource and executes at its current policy.

•  Each CPU has its own run queue.


Cache Affinity

•  Affinity scheduling is a special scheduling discipline used in multiprocessor systems.

•  As a process executes, it causes more and more data and instruction text to be loaded into the processor cache. This creates an “affinity” between the process and the CPU.
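One way to preserve that affinity is to restrict a process to a CPU explicitly. The following is a minimal sketch using Python's `os.sched_getaffinity` and `os.sched_setaffinity`, which are available on Linux only; the helper name `pin_to_one_cpu` is hypothetical and choosing the lowest-numbered CPU is an arbitrary choice for illustration.

```python
import os


def pin_to_one_cpu():
    """Restrict the calling process to one CPU it is already allowed to use.

    Returns the chosen CPU and the resulting affinity set.
    """
    allowed = os.sched_getaffinity(0)  # 0 means "the calling process"
    target = min(allowed)              # arbitrary choice: lowest allowed CPU
    os.sched_setaffinity(0, {target})
    return target, os.sched_getaffinity(0)
```

Calling `pin_to_one_cpu()` leaves the process runnable on a single CPU, so its working set stays in that CPU's cache. The original set can be saved with `os.sched_getaffinity(0)` beforehand and restored afterwards with `os.sched_setaffinity`.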


Data Placement Tool

•  NUMA machines have a shared address space. There is a single shared memory space and a single operating system instance.

•  Accessing remote memory carries a performance penalty compared with accessing local memory.

•  Memory access time varies across physical address ranges and between processing elements. NUMAlink is used to access memory on other blades/nodes.

•  Memory latency is lowest when a processor accesses local memory.

•  The NUMA tools also help run multiple instances of a serial program in a single job script with better process placement.


NUMA API

•  The API is called from libcpuset

  cpuset: create, modify, and destroy cpusets.

  taskset: run a process on specific physical CPUs.

  numactl: control NUMA policy for processes or shared memory.

  dplace: bind processes to specific logical CPUs.

  omplace: control the placement of MPI processes and OpenMP threads.

  Batch systems: LSF, PBSPro, Torque, SGE

  dlook, dlook-summary, pidstat, cpuset-Q


cpuset

•  cpuset covers both CPU binding (as sched_setaffinity does) and memory binding.

•  Each task has a link to a cpuset structure that specifies the CPUs and memory nodes available for its use.

•  All tasks sharing the same placement constraints reference the same cpuset.


Why Use a cpuset?

•  Restrict designated CPU resources to specified processes/threads.

•  Limit run-time variability.

•  Memory affinity.

•  Isolate I/O.


How Are cpusets Used

•  Static cpusets (batch calls shared by queue)

  Cpusets are defined by the administrator after system startup.

  Users attach processes to the existing cpusets.

  Cpusets continue to exist after jobs finish executing.

•  Dynamic cpusets

  The workload management system (WMS) creates a cpuset when it is required by a job.

  The WMS attaches the job to the newly created cpuset.

  The WMS destroys the cpuset at the end of the job.


cpuset Command Line Options

•  cpuset

   -c cpuset_name          Create cpuset
   -m cpuset_name          Modify cpuset
   -x cpuset_name          Destroy cpuset
   -d cpuset_name          Dump cpuset attributes
   -i csname -I script     Run command
   -p cpuset_name          List all processes in cpuset
   -a cpuset_name          Attach PIDs to cpuset
   -w pid                  List the cpuset the PID is attached to
   -f filename             Input config file


Advantage of Cpuset?

•  It improves cache locality and memory access time.

•  Facilitates providing equal resources to each thread in a job.

  Results in both optimum and repeatable performance.


taskset

•  taskset: restricts execution to the listed set of CPUs. However, processes are still free to move among the listed CPUs.

•  It is used to set or retrieve the CPU affinity of a running process given its PID, or to launch a new command with a given CPU affinity.

•  The CPU affinity is represented as a hexadecimal bitmask, with the lowest-order bit corresponding to the first CPU.

  0x00000001 ## is processor number 0.


taskset

•  taskset does not pin a task to a specific CPU. It only restricts a task so that it does not run on any CPU that is not in the cpulist.

•  If you are running an MPI application, do not use the taskset command; use dplace instead.

mpirun -np 8 dplace -s1 -c 10,11,16-21 ./a.out

export MPI_DSM_CPULIST=10,11,16-21

mpirun -np 8 ./a.out
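A CPU list such as 10,11,16-21 mixes single CPUs and ranges. As an illustration of how it expands, here is a small Python sketch; `parse_cpu_list` is a hypothetical helper that handles only plain numbers and inclusive ranges, not any further syntax the SGI tools may accept.

```python
def parse_cpu_list(spec):
    """Expand a list like '10,11,16-21' into individual CPU numbers.

    Only plain numbers and inclusive 'first-last' ranges are handled.
    """
    cpus = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = (int(x) for x in part.split("-", 1))
            cpus.extend(range(first, last + 1))
        else:
            cpus.append(int(part))
    return cpus
```

For the list used above, `parse_cpu_list("10,11,16-21")` yields eight CPUs, one for each of the eight MPI ranks.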


taskset examples

•  taskset 0x1 ./a.out # executes on physical CPU 0

•  taskset 0x00131 ./a.out # executes on physical CPUs 0, 4, 5 and 8

•  taskset -p 0xa8 14386 # restricts PID 14386 to physical CPUs 3, 5 and 7

•  taskset -c 5 ./a.out # executes a.out on physical CPU 5

•  taskset -p 14386 # returns the affinity mask of PID 14386
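Under taskset's convention that bit n of the mask selects CPU n (so 0x1 is CPU 0), a mask can be decoded mechanically. The helpers below are a minimal sketch for checking masks by hand; `mask_to_cpus` and `cpus_to_mask` are hypothetical names, not part of taskset.

```python
def mask_to_cpus(mask):
    """Return the CPU numbers selected by an affinity bitmask (bit n = CPU n)."""
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


def cpus_to_mask(cpus):
    """Build the affinity bitmask that selects the given CPU numbers."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return mask
```

For example, `mask_to_cpus(0x131)` gives CPUs 0, 4, 5 and 8, and `hex(cpus_to_mask([3, 5, 7]))` gives `0xa8`.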


numactl

•  Runs processes with a specific NUMA scheduling or memory placement policy.

•  Control memory placement

  Interleave node(round robin)

  Membind(allocate from specified node pool)

  Preferred node

  Local allocation(first touch)

•  Each task has a link to a cpuset structure that specifies the CPUs and memory node available for its use.

•  All tasks sharing the same placement constraints reference the same cpuset.


numactl Command Line Options

•  numactl

   --interleave=nodes     Set a memory interleave policy.
   --membind=nodes        Only allocate memory from nodes.
   --cpunodebind=nodes    Only execute command on the CPUs of nodes.
   --physcpubind=cpus     Only execute process on cpus.


numactl examples

•  numactl --physcpubind=+0-4,8-12 myapplic arguments # Run myapplic on CPUs 0-4 and 8-12 of the current cpuset.

•  numactl --interleave=all bigdatabase arguments # Run bigdatabase with its memory interleaved across all nodes.

•  numactl --cpunodebind=0 --membind=0,1 process # Run process on the CPUs of node 0 with memory allocated on nodes 0 and 1. (--cpubind is a deprecated alias for --cpunodebind.)
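The interleave policy spreads consecutive pages round-robin over the chosen nodes. The sketch below is a simplified model of that idea only: the real kernel assigns nodes when pages are first faulted in, and `interleave_pages` and `pages_per_node` are hypothetical names used for illustration.

```python
from collections import Counter


def interleave_pages(num_pages, nodes):
    """Assign page i to nodes[i % len(nodes)], a simple round-robin model."""
    return [nodes[page % len(nodes)] for page in range(num_pages)]


def pages_per_node(assignment):
    """Count how many pages landed on each node."""
    return dict(Counter(assignment))
```

With two nodes, `interleave_pages(5, [0, 1])` alternates between node 0 and node 1, which evens out memory bandwidth demand at the cost of making about half of the accesses remote.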


numactl --hardware


dplace

•  dplace ensures the Linux kernel "pins" a thread (or series of threads) to a specific CPU core within a container (cpuset). Once pinned, threads do not migrate.

•  By default, it binds processes sequentially in a round-robin fashion to the logical CPUs in the current cpuset.

•  It integrates with MPT (via omplace and environment variables).

•  It understands fork, exec, threads, etc.

• Helps to ensure optimal performance and to minimize runtime variability.


dplace Feature

•  Default memory allocation policy is node-local (first touch).

•  dplace allows processes to be bound to specific logical (within cpuset) CPUs.

•  Prevents migration (thread hopping).

•  May require knowledge of application.

•  Global load balancing.


dplace Command Line Options

•  dplace

   -c    CPU list
   -e    exact placement
   -s    skip n CPUs before starting placement
   -n    only processes with name
   -x    skip mask
   -p    placement file
   -r    replicate shared text to each node
   -q    list global count


dplace examples

•  dplace -c 0-3 ./a.out # places threads on the first four CPUs, beginning with core 0.

•  dplace -c 0-7 -x2 ./a.out # places threads on the first 8 CPUs, but uses the skip mask (-x2) to skip the second thread (which in the case of Intel OpenMP is the lightweight monitor thread).

•  mpirun -np 8 dplace -s1 -c 0-7 ./a.out # skips the first process, as this process is essentially the MPI shepherd. dplace handles the placement of the 8 MPI ranks that follow.
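The combined effect of a CPU list, a skip count, and a skip mask can be modeled in a few lines. This is a sketch under stated assumptions, not dplace's actual algorithm: processes are numbered in creation order, a set bit n in the skip mask leaves process n unplaced, skipped processes do not consume a CPU, and the way `place` combines the two skip options is an assumption made for illustration.

```python
def place(num_processes, cpus, skip_count=0, skip_mask=0):
    """Model round-robin placement with a skip count and a skip mask.

    Returns {process_index: cpu or None}; None means "left unplaced".
    """
    placement = {}
    next_slot = 0
    for index in range(num_processes):
        skipped = index < skip_count or (skip_mask >> index) & 1
        if skipped:
            placement[index] = None
            continue
        placement[index] = cpus[next_slot % len(cpus)]  # wrap around the list
        next_slot += 1
    return placement
```

For instance, `place(5, [0, 1, 2, 3], skip_count=1)` leaves a launcher process unplaced and puts the four processes that follow on CPUs 0 to 3, while `place(4, [0, 1, 2, 3], skip_mask=2)` leaves only the second process unplaced.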


numactl and dplace

•  Consider a code that runs with 4 threads.

•  What is the difference between

numactl -C 0-3 a.out

dplace -c 0-3 a.out

•  With dplace, each thread is bound to a particular CPU. With numactl -C (--physcpubind), the threads are bound to the range of CPUs 0-3 and are free to migrate within that range. Note that lowercase -c in numactl is a deprecated alias for --cpunodebind, which selects nodes rather than CPUs.

•  numactl does have memory binding options.
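The difference between the two commands comes down to the affinity set each thread ends up with. The sketch below models only that outcome; `range_binding` and `pinned_binding` are hypothetical names, and the one-CPU-per-thread assignment follows the round-robin default described for dplace.

```python
def range_binding(num_threads, cpus):
    """numactl-style: every thread may run on any CPU in the list."""
    return {thread: set(cpus) for thread in range(num_threads)}


def pinned_binding(num_threads, cpus):
    """dplace-style: each thread is pinned to exactly one CPU, round-robin."""
    return {thread: {cpus[thread % len(cpus)]} for thread in range(num_threads)}
```

With four threads and CPUs 0 to 3, `range_binding` gives every thread the set {0, 1, 2, 3}, so the scheduler may still move threads between those CPUs, whereas `pinned_binding` gives thread i the single CPU i.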


omplace

•  Tool for controlling the placement of MPI processes and OpenMP threads.

   -c cpulist     specifies the effective CPU list.
   -nt threads    specifies the number of threads per MPI process.
   -s skip        the number of processes to skip before placement starts.
   -vv            verbose: the automatically generated placement file is displayed in its entirety.


omplace examples

•  mpirun -np 2 omplace -nt 4 -vv ./a.out # runs 2 MPI processes with 4 threads per process and displays the generated placement file.


dlook

•  Tool for showing process memory maps and CPU usage.

•  View address space and page placement.

•  Two forms

dlook [options] pid

dlook [options] <command> [command-args]

•  Run an MPI job using mpirun and print the memory map for each thread:

mpirun -np 8 dlook a.out


Summary

•  Use cpumap to determine partitioning and placement.

•  Use taskset to lock a process or process group onto a CPU or group of CPUs.

•  Use dplace to place a process group into the system topology.

•  When running an MPI/OpenMP hybrid, use omplace for pinning.

•  Use numactl to control memory placement.


Tips!!!

•  Use dplace, numactl, or cpuset to lock down processes, preventing thread hopping/migration.

•  Strong cache affinity reduces cache misses and instruction pipeline flushes.

•  Keeps processes close to their node-local memory.

•  Be aware of data placement.


Heisenberg Principle

•  Looking at the system will impact the system.

•  Event tracing tools have the highest impact: strace, gprof.

•  PCP and sar have the lowest impact.

•  You cannot measure a system without affecting it; top itself shows up in the top display.

•  PCP uses less than 1% of a CPU.


sar

•  sar indicates normal/abnormal behavior of the system. sar can point to performance problems and bottlenecks.

•  Many people look at sar as a set of performance metrics when it is not. It is an indicator of what a system is doing!

•  PCP and sar simply tell you what to look for.


sar

•  sar -vq to check kernel table sizes.

•  sar -W to check swapping activity.

•  sar -rsW to check what memory and swap are left.

•  sar -u reports CPU utilization, including the amount of time spent executing kernel code.


top, ps, pstree

•  top provides a dynamic real-time view of a running system.

•  top with H provides thread information.

•  ps: reports a snapshot of the currently running processes on the system. Use with grep <username> to get user-specific information.

•  pstree: display a tree of processes.


vmstat, mpstat

•  vmstat reports information about processes, memory, paging, block I/O, traps, and CPU activity.

•  mpstat writes to standard output activities for each available processor, processor 0 being the first one. Global average activities among all processors are also reported.


mpvis

•  mpvis displays a three-dimensional bar chart of CPU utilization. The display is updated with new values retrieved from the target host or archive once per sampling interval (given in seconds).


pidstat

•  pidstat is used for monitoring individual tasks currently being managed by the Linux kernel.

   -r    report page faults and memory utilization
   -d    report I/O statistics
   -u    report CPU utilization
   -p    select tasks for which statistics are to be reported
   -t    display statistics for threads associated with selected tasks

•  pidstat -t -p 14374


cpuset-Q

•  It gives information on allocated CPUs, node, PID, WCHAN, command name, etc.




References

•  UV System Analysis Manual

•  UV System Administration Manual

•  Technical Advances in the SGI Altix UV Architecture (white paper)

•  A Hardware-Accelerated MPI Implementation on SGI Altix UV Systems (white paper)

•  Linux Application Tuning Guide for SGI X86_64 Based Systems

•  SGI Message Passing Toolkit (MPT) User’s Guide

•  SGI NUMAlink white paper

Date post:	17-Feb-2020