
- Tuning and Analysis Utilities: TAU
- Debugging on the Blue Gene/P System
- Intel Multi-core Architecture
- Hybrid / Multi-core Programming

Tulin Kaman
Department of Applied Mathematics and Statistics

Stony Brook University

Stony Brook Center for Computational Science
June 5th, 2009

Overview

PART-I
• Blue Gene Overview
• Performance Tool
• Debugging Tool

PART-II
• Intel Multi-core Architectures
• Hybrid / Multi-core Programming

Blue Gene Systems

New York Blue/L: 18 racks, 18,432 compute nodes (36,864 CPUs)

New York Blue/P: 2 racks, 2,048 compute nodes (8,192 CPUs)

Blue Gene System Software

Blue Gene system software consists of five integrated subsystems:
• System administration and management
• Partition and job management
• Application development and debugging tools
• Compute Node kernel and services
• I/O Node kernel and services

BG Hardware Subsystems

• These software subsystems run on the following hardware subsystems:

  Front-end node (fen/fenp): provides access for users to edit and compile programs, create job script files, and submit jobs.

  Service node: the manager of the system, used for controlling the system.

  I/O nodes (IO): provide access to external devices through an Ethernet port to the 10-gigabit functional network and can perform file I/O operations.

  Compute nodes (CN): run user applications.

Nodes (Compute and I/O)

• The Compute Nodes on BG/P are made of one quad-core chip with 2 GB of physical memory.

• Compute nodes are reserved for computations; I/O is carried out via the I/O nodes.

• Each Blue Gene I/O Node serves a group of Compute Nodes.

• The I/O Node includes a complete Internet Protocol (IP) stack, with TCP and UDP services.

• CNs share the IP address of the I/O node, and the socket port number is a single address space within the processor set.

CN and I/O node properties

NYBlue partition naming convention

• B denotes the Block.
• Partition size.
• The nodes are interconnected through six networks, one of which connects the nearest neighbors into a 3D torus or mesh.
• Pset ratio: specifies the I/O node to compute node ratio.
  A designates a pset ratio of 1:16
  B designates a pset ratio of 1:32
  C designates a pset ratio of 1:64
  D designates a pset ratio of 1:128

The Compute Node Execution Mode

• The CN on BG/P: quad-core on a single chip with 2 GB of memory.

• Execution process modes on BG/P:
  -mode SMP (Symmetrical MultiProcessing): one MPI task/node, four threads/task, 2 GB
  -mode DUAL (Dual Node): two MPI tasks/node, two threads/task, 1 GB
  -mode VN (Virtual Node): four single-threaded MPI tasks/node, 512 MB
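The mode is selected when the job is launched. As an illustration only (following the LoadLeveler argument style used elsewhere in these slides; the task counts and executable path are placeholders), the mpirun arguments line of a job script might look like:

# @ arguments = -mode VN -np 256 -exe <path-to-executable>
# @ arguments = -mode SMP -np 64 -env OMP_NUM_THREADS=4 -exe <path-to-executable>

The first line runs four single-threaded MPI tasks per node; the second runs one MPI task per node with four OpenMP threads.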

Compilers Overview

• IBM XL Family of Compilers:
  xlc, xlc++, xlf, … are for the FENP, not for the CN.
  bgxlc, bgxlc++, bgxlc_r, bgxlc++_r, bgxlf, bgxlf_r, … The programmer explicitly identifies all the libraries and include files.
  C and C++: /opt/ibmcmp/vacpp/bg/9.0/bin/
  Fortran: /opt/ibmcmp/xlf/bg/11.1/bin/

• GNU Compiler Collection:
  gcc, g++, gfortran are for the FENP, not for the CN.
  powerpc-bgp-linux-gcc, powerpc-bgp-linux-g++, powerpc-bgp-linux-gfortran, … The programmer explicitly identifies all the libraries and include files.
  /bgsys/drivers/ppcfloor/gnu-linux/bin/

Building Executables with MPI Wrappers

• IBM XL Family of Compilers: /bgsys/drivers/ppcfloor/comm/bin/
  MPI wrappers: mpixlc, mpixlcxx, mpixlf77, mpixlf90, …
  Thread-safe versions of the MPI wrappers: mpixlc_r, mpixlcxx_r, mpixlf77_r, mpixlf90_r, …

• GNU Compiler Collection: /bgsys/drivers/ppcfloor/comm/bin/
  MPI wrappers: mpicc, mpicxx, mpif77, mpif90, …

IBM XL Compilers Optimization Options

• The IBM XL compilers support several levels of optimization.

  Basic command-line optimization:
  -O0 = quick local optimization
  -O2 = -O0 -qmaxmem=8192

  Advanced command-line optimization:
  -O3 = -O2 -qnostrict -qmaxmem=-1 -qhot=level=0
  -O4 = -O3 -qarch=auto -qtune=auto -qcache=auto -qhot -qipa
  -O5 = "All of -O4" -qipa=level=2

• Make sure that the application is first compiled andexecuted properly at low optimization levels.

IBM XL compilers for better performance

• Specify the -qarch and -qtune values; otherwise you may get -qarch=auto -qtune=auto, which optimizes for the front-end node rather than for the compute nodes.
  -qarch=450: generates code for a single FPU
  -qarch=450d: generates code for the double FPU (double hummer)
  -qtune=450: optimizes object code for the 450 family processor

-qhot : enables and customizes high-order loop analysis andtransformation.
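As a concrete illustration (a sketch only; the source file name is a placeholder), a compile line combining these options might be:

> mpixlc_r -O3 -qarch=450d -qtune=450 -qhot -c solver.c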

Level                             Time (sec)*
-O0                               420
-O2                               290
-O3                               280
-O3 -qhot                         270
-O3 -qtune=450 -qarch=450/450d    278 / 282
-O4                               275

* The time study was performed on the FronTier code.

Usage of -qsmp

• auto: enables automatic parallelization.
• omp: automatic parallelization is disabled; only OpenMP parallelization pragmas are recognized.
• opt: instructs the compiler to optimize as well as parallelize; the optimization level is -O2 -qhot.
• The compiler can automatically locate countable loops. A countable loop is automatically parallelized if:
  The order in which loop iterations start and end does not affect the result.
  The loop does not contain I/O operations.
  The code is compiled with a thread-safe version of the compiler (_r suffix).
  The -qnostrict_induction compiler option is in effect.
  The -qsmp=auto option is in effect.

Problem Size    mpixlc_r    mpixlc_r -qsmp=auto -qnostrict_induction
512x512         420 sec     298 sec
1024x1024       1612 sec    1145 sec

Speedup: about 1.4x.
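For reference, a minimal sketch of a loop that satisfies the conditions above (no I/O, no loop-carried dependence, iteration order irrelevant); the function and file names are illustrative only:

/* scale.c -- a countable loop that -qsmp=auto can parallelize */
void scale(int n, double *a, double s)
{
    int i;
    for (i = 0; i < n; i++)    /* trip count known at loop entry        */
        a[i] = s * a[i];       /* each iteration writes its own element */
}

> mpixlc_r -O3 -qsmp=auto -qnostrict_induction -c scale.c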

GNU Profiler

• The GNU profiler is available on BG/P.
• Compile and link with -g -pg.
• Run the application; gmon.out files appear in your working directory.
• Analyze the data with:

> gprof <yourexe> gmon.out.0 > report.0
> vi report.0

Mathematical Acceleration Subsystem (MASS) Libraries

• The MASS libraries offer improved performance over the standard mathematical library routines, are thread-safe, and support compilation of C, C++, and Fortran applications.

• The library is specified explicitly on the link command:
  -L/opt/ibmcmp/xlmass/bg/4.4/bglib -lmass

  /opt/…/bglib contains the "cross-compiled" versions of the libraries.

  /opt/…/lib and /opt/../lib64 contain the "native" versions of the libraries.

Overview

PART-I
• BG/P Overview
• Performance Tool
• Debugging Tool

PART-II
• Intel Multi-core Architectures
• Hybrid / Multi-core Programming

Tuning and Analysis Utilities: TAU

-PROFILE
-PROFILECALLPATH
-MULTIPLECOUNTERS
…

TAU

• Performance evaluation tool

• Profiling and tracing toolkit for performance analysis ofparallel programs written in C, C++, Fortran, Java andPython.

• Support for multiple parallel programming paradigms:MPI, Multi-threading, Hybrid (MPI+Threads)

• Access to hardware counters.

• Automatically instruments your code.

How to use TAU?

• Set a couple of environment variables: $PATH, $TAU_MAKEFILE, $TAU_OPTIONS.

• Instrument the program manually, by inserting TAU macros, or automatically.

• To take advantage of TAU's automatic instrumentation features, the Program Database Toolkit (PDT) provides access to the high-level interface of the source code for analysis tools and applications.

• For automatic instrumentation, replace the compiler with the TAU compiler script.

TAU Configuration

• Each configuration is labeled with the options used:
  ./configure -mpi -arch=bgp -pdt=<pdt-dir> -pdt=xlC
  PROFILE (default) / PROFILECALLPATH / MULTIPLECOUNTERS

• Each configuration creates a unique Makefile under:
  /bgl/apps/TAUL/tau-2.18/bgl/lib/
  /bgsys/apps/TAUP/tau-2.18/bgp/lib/

• TAU compiler scripts are installed in:
  /bgl/apps/TAUL/tau-2.18/bgl/bin/
  /bgsys/apps/TAUP/tau-2.18/bgp/bin/

• Add the bin directory to your path.

Set TAU_MAKEFILE

• Set the environment variable TAU_MAKEFILE to the location of the TAU Makefile.

• List of TAU's Makefiles:
  Makefile.tau-mpi-pdt
  Makefile.tau-callpath-mpi-pdt
  Makefile.tau-multiplecounters-mpi-papi-pdt
  …

• Start with MPI instrumentation & PDT for automatic source instrumentation.

> export TAU_MAKEFILE=/bgl/apps/TAUL/tau-2.18/bgl/lib/Makefile.tau-mpi-pdt
> export TAU_MAKEFILE=/bgsys/apps/TAUP/tau-2.18/bgp/lib/Makefile.tau-mpi-pdt

TAU Shell Scripts

• Compile your code with TAU shell scripts.

• If your Fortran code is fixed-format, use "tau_f90.sh -qfixed".

• -D options to XLF: the XL Fortran compilers require a slightly different syntax to define preprocessor macro symbols; use "-WF,-D".

TAU shell script    IBM XL Compilers                               GNU Compilers
tau_cc.sh           mpixlc / mpixlc_r                              mpicc
tau_cxx.sh          mpixlcxx / mpixlcxx_r                          mpicxx
tau_f90.sh          mpixlf77 / mpixlf77_r, mpixlf90 / mpixlf90_r   mpif77, mpif90
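For example (a sketch; the macro name and file names are placeholders), a fixed-format Fortran source that needs a preprocessor macro could be compiled with:

> tau_f90.sh -qfixed -WF,-DDEBUG mysolver.F -o mysolver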

Analyze Performance Data

• pprof (text-based display) sorts and displays profile data generated by TAU. Execute pprof in the directory where the profile files are located.

• paraprof (GUI display) is TAU's Java-based performance data viewer. It requires Java 1.4 or above; add it to your path.
  The --pack option packs the data into packed (.ppk) format and does not launch the paraprof GUI:
  > paraprof --pack filename
  To launch the GUI:
  > paraprof filename.ppk

Generate a Flat Profile

Set environment variables:
> export PATH=/bgl/apps/TAUL/tau-2.18/bgl/bin:$PATH
> export TAU_MAKEFILE=/bgl/apps/TAUL/tau-2.18/bgl/lib/Makefile.tau-mpi-pdt

Compile your code with the TAU shell scripts:
> make CC=tau_cc.sh CXX=tau_cxx.sh F90="tau_f90.sh -qfixed"

Provide the full path to the directory where you want to store the profile files (profile.x.0.0). In your batch job script file, set the environment variable PROFILEDIR.

# @ arguments = -np 16 -env PROFILEDIR=<profile-dir> -exe …

Run your job. Go to the directory where you stored the profile files. Pack the data into packed (.ppk) format.

> paraprof --pack filename.ppk

Launch the GUI to analyze the data.> paraprof filename.ppk

Identify the routines that use the most time

• Configuration with -PROFILE(default) creates Makefile.tau-mpi-pdt.

Generate call path profiles

• f1 => f2 shows the time spent in f2 when it is called by f1

paraprof→Windows→Threads→Call Graph

-MULTIPLECOUNTERS-papi=<papi-dir>

• Modern CPUs, including Blue Gene's, provide on-chip hardware performance counters that can record several events:
  The number of instructions issued.
  The number of L1, L2 and L3 data and instruction cache misses, hits, accesses, reads, and writes.

• TAU uses the Performance Data Standard and API (PAPI, Performance Application Programming Interface) to access these performance counters.

Generate Hardware Counter Profile

• Set the environment variable TAU_MAKEFILE to Makefile.tau-multiplecounters-mpi-papi-pdt.

• Set the COUNTERx environment variables in your job script file to specify the type of counter to profile:
  # @ arguments = -np 1 -env PROFILEDIR=<profile-dir> \
    -env "COUNTER1=GET_TIME_OF_DAY COUNTER2=PAPI_L1_DCM \
    COUNTER3=PAPI_L1_ICM COUNTER4=PAPI_L1_TCM" -exe …

• The following subdirectories will be created:
  <profile-dir>/MULTI__GET_TIME_OF_DAY
  <profile-dir>/MULTI__PAPI_L1_DCM
  <profile-dir>/MULTI__PAPI_L1_ICM
  <profile-dir>/MULTI__PAPI_L1_TCM

Performance Counters

Fast Blue Gene Timers

• Blue Gene systems have a special clock-cycle counter that can be used for low-overhead timings.

  -BGLTIMERS: use fast low-overhead timers on IBM BG/L
  -BGPTIMERS: use fast low-overhead timers on IBM BG/P

PerfExplorer

• Framework for parallel performance data mining.

• Enables the development and integration of data mining operations that will be applied to large-scale parallel performance profiles.

• Requires Java Run Time Environment 5

• Requires PerfDMF (Performance DataManagement Framework) from TAU.

Running PerfExplorer

• Make sure you have Java 5 or better in your PATH.
• Configure PerfDMF: run perfdmf_configure.
• Generate .ppk files:

> llsubmit tau_app16.run
> paraprof --pack tau_np32.ppk
> …
> llsubmit tau_app512.run
> paraprof --pack tau_np512.ppk

• > paraprof
  Add the trial to the DB. Trial type: Paraprof Packed Profile. Select File(s) -> OK. Uploading Trial.

• > perfexplorer
  Choose Experiments. The options under the Chart menu provide analysis.

Overview

PART-I
• BG/P Overview
• Performance Tool
• Debugging Tool

PART-II
• Intel Multi-core Architectures
• Hybrid / Multi-core Programming

Debugging on Blue Gene/P Systems

Debugging Architecture

• The Compute Node Kernel provides the low-level primitives needed to debug an application.

• The Control and I/O Daemon (CIOD) running on the I/O Nodes supports debugger access into the compute nodes.

• The debug server, running on the I/O Nodes, is code that interfaces with the CIOD.

• The debug client, running on a Front End Node, is code that the user interacts with directly to make remote requests to the debug server.

The GNU Debugger

• Debugging an application on Blue Gene is different from debugging on a cluster because the application runs on the compute nodes while the application developer is logged into the front-end nodes.

• GDB has the ability to debug remotely.

• A debug server called gdbserver allows GDB to work with applications running on the compute nodes.

Compiling your program

• Compile your code with -g. This tells the compiler to include debug information.

• Do not use high-level compiler optimization; compile with -O0 to disable optimization.

• Specify the complementary -qarch and -qtune values: PPC440 on BG/L and PPC450 on BG/P.

Starting gdbserver

• The mpirun command is used to start gdbserver.

• The -start_gdbserver parameter on the mpirun command starts the program under debugger control.

• Open two separate console shells.
  The first shell: start the debug server on the I/O Node.
  The second shell: make remote requests to the debug server and debug the application on the Compute Node.

Step 1: Starting the gdbserver using mpirun

• Start your application using mpirun with -start_gdbserver.
• A partition called B064TC00 is used in this example.
• The executable is gas_O0_gdb.

What happens

• The application does not start running immediately.

• The application is loaded to compute nodes.

• The debug server called gdbserver is started onthe I/O node in the specific partition.

• mpirun stops and waits for you to connect GDB clients to the compute nodes.

Step 2: Finding the address of the Compute Node

• After starting the gdbserver, you will get a prompt from the gdb server.

• Find the IP address and the port number of the compute node that you want to debug. dump_proctable will give you the list of port numbers for all the compute nodes you can debug.

> dump_proctable
MPI Rank 0: Connect to 172.30.200.101:10063
MPI Rank 1: Connect to 172.30.200.101:10058

Step 3: Starting gdb client

• Open a new window.
• Start gdb on the front-end node and connect to gdbserver.
• Do not use the default gdb (/usr/bin/gdb).

• After you get the gdb prompt, use gdb's target remote command to connect to gdbserver using the address of the compute node: <IPaddress:PortNumber>.
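A sample session might look like the following (a sketch only: the path of the cross gdb is an assumption and may differ on your system; the address is the rank 0 entry from dump_proctable, and gas_O0_gdb is the executable from Step 1):

> /bgsys/drivers/ppcfloor/gnu-linux/bin/powerpc-bgp-linux-gdb gas_O0_gdb
(gdb) target remote 172.30.200.101:10063
(gdb) break main
(gdb) continue          # after pressing Enter in the mpirun window (Step 4)
...
(gdb) detach            # end remote debugging without killing the application
(gdb) quit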

Step 4: Start running the application on the compute node

• When you have connected the GDB client to the specific compute node, press Enter to send a signal to the mpirun command.

• Now you are debugging your application on the configured compute node.

• You can use the standard gdb commands after this point.

• To quit gdb, end the remote debugging session with gdb's detach command first; otherwise gdb will kill the application.

Overview

PART-I • BG/P Overview• Performance Tool• Debugging Tool

PART-II• Intel Multi-core Architectures• Hybrid / Multi-core Programming

Intel Multi-core Architecture

NEHALEM

Nehalem: Intel microarchitecture

• Intel's white paper says that this new dynamically and design-scalable microarchitecture rewrites the book on energy efficiency and performance.

• To extract greater performance from this new microarchitecture, Intel introduces Intel QuickPath Technology. With Intel QuickPath Technology, each processor features an integrated memory controller and a high-speed interconnect.

• Designed from the ground up to take advantage of 45-nanometer (nm) high-k metal gate silicon technology, which helps to dramatically increase processor energy efficiency and raises transistor switching speeds to enable higher core and bus clock frequencies, giving more performance in the same power and thermal envelope.

• The emphasis is on how the processor uses available clock cycles and power, rather than just pushing ever higher clock speeds and energy needs.

• Nehalem's biggest innovations come from new optimizations of the individual cores and of the overall multi-core microarchitecture to increase single-thread and multi-thread performance.

The microarchitecture's performance and power management innovations include:

• Dynamically managed cores, threads, cache, interfaces, and power.

• Simultaneous multi-threading (SMT), which enables running two simultaneous threads per core: 8 simultaneous threads per quad-core processor, and 16 simultaneous threads for dual-processor quad-core designs.

• Superior multi-level cache, including an inclusive shared L3 cache.

• New high-end system architecture that delivers two to three times more peak bandwidth and up to four times more realized bandwidth.

• Performance-enhanced dynamic power management.

Nehalem Goals

• Low latency to retrieve data: keep the execution engine fed without stalling.
  Latency: how long it takes for the first byte sent from one node to reach its target node.

• High data bandwidth: handle requests from multiple cores/threads seamlessly.
  Bandwidth: how many megabytes of data can be sent from one node to another node in a second.

• Scalability

Performance Improvements

Intel's core enhancements further improve the performance of the individual processor cores (Nehalem):

• Instructions Per Cycle Improvements: the more instructions that can be run per clock cycle, the greater the performance.

• Enhanced Branch Prediction: branch prediction attempts to guess whether a conditional branch will be taken or not. Branch predictors are crucial in today's processors for achieving high performance: they allow processors to fetch and execute instructions without waiting for a branch to be resolved.

Intel’s core enhancements

• Simultaneous Multi-Threading: Intel introduces Hyper-Threading Technology (HT), a technique that enables a single execution core to run two threads at the same time. A quad-core processor can therefore run up to eight threads simultaneously.

Intel’s core enhancements

• Intel Smart Cache Enhancement: enhances the Intel Smart Cache by adding an inclusive shared L3 (last-level) cache that can be up to 8 MB in size.

  Inclusive cache policy for best performance: an address residing in L1/L2 must also be present in the third (last) level cache.

  A miss of the inclusive shared L3 cache guarantees the data is outside the processor (not on-die), which is designed to eliminate unnecessary snoop traffic between cores, reducing latency and improving performance.

  With an exclusive shared L3 cache, if a data request misses in the L3 cache, each processor core must be searched (snooped) in case its individual caches contain the requested data. This can increase latency and traffic between cores.

80 core Teraflops Research Chip

• The 80 core Teraflops Research Chip is the firstprogrammable chip to deliver more than one trillionfloating point operations per second (1 Teraflops) ofperformance while consuming very little power.

• This research project focuses on exploring new, energy-efficient designs for future multi-core chips, as well as approaches to interconnect and core-to-core communications.
  http://techresearch.intel.com/articles/Tera-Scale/1449.htm

Tukwila

• Tukwila is Intel's next-generation Itanium processor with four cores.
• 30 MB total cache
• Intel QuickPath Interconnect Technology
• Dual integrated memory controllers
• Expected in 2010

IBM Blue Gene/Q Architecture

• IBM is not releasing low-level details of the Blue Gene/Q architecture.

• A Blue Gene/Q node will contain 16 cores. This can be implemented as one 16-core chip, two 8-core chips, or even four quad-core chips.

• Each node stands to have 16 GB of memory.
• In 2011, Lawrence Livermore National Laboratory will install a 20-petaflop system.

Overview

PART-I
• BG/P Overview
• Performance Tool
• Debugging Tool

PART-II
• Intel Multi-core Architectures
• Hybrid / Multi-core Programming

Hybrid / Multi-core Programming

How to program the machines that can be built?

• Distributed-Memory Machines
  Each node in the computer has a locally addressable memory space.
  Parallel programs consist of cooperating processes, each with its own memory.
  Processes send data to one another as messages.
  The Message Passing Interface (MPI) is a widely used standard for writing message-passing programs.

• Shared-Memory Machines
  Each core can access the entire data space.
  In shared-memory multi-core architectures, OpenMP or Pthreads can be used to implement parallelism.

A hybrid sketch combining the two models follows.
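As a minimal sketch (assumed file name hybrid.c; not taken from the original slides), one MPI task per node can spawn OpenMP threads inside the node:

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);                    /* message passing across nodes  */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    #pragma omp parallel                       /* shared memory within the node */
    printf("MPI rank %d, thread %d of %d\n",
           rank, omp_get_thread_num(), omp_get_num_threads());
    MPI_Finalize();
    return 0;
}

> mpixlc_r -qsmp=omp hybrid.c -o hybrid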

Start with auto parallelization

The IBM XL thread-safe compilers will automatically parallelize your code if:
  There is no branching into or out of the loop.
  The increment expression is not within a critical section.
  The order in which loop iterations start and end does not affect the result.
  The loop does not contain I/O operations.
  The program is compiled with a thread-safe version of the compiler (_r suffix: mpixlc_r, mpixlcxx_r, mpixlf77_r, …).

Shared Memory Model

• All threads have access to the same, globally shared, memory.

• Threads also have their own private memory.

• Shared data is accessible by all threads.

• Private data can be accessed only by the thread that owns it.

• Programmers are responsible for synchronizing access to (protecting) globally shared data.

Shared Memory Programming:Pthreads

Pthreads = POSIX threads

  Hardware vendors have their own versions of threads, which makes it difficult for programmers to develop portable threaded applications.

  For UNIX systems, a standardized programming interface is specified by the IEEE POSIX 1003.1c standard. Implementations of this standard are referred to as POSIX (Portable Operating System Interface) threads.

  A lower-level Unix library for building multi-threaded programs.

  Pthreads are defined as a set of C programming language types and function calls.

Shared Memory Programming:OpenMP

OpenMP = Open Multi-Processing

  The OpenMP Application Program Interface (API) is for writing shared-memory parallel programs.

  Supports multi-platform shared-memory parallel programming in C/C++ and Fortran on all architectures.

  Consists of:
    Compiler Directives
    Runtime Library Routines
    Environment Variables

  An OpenMP program is portable; compilers have OpenMP support.

  Requires little programming effort.

Labeling the data

• As programmers, we need to think about where in memory to put the data.

  Shared
    All threads can read and write the data simultaneously.
    The changes are visible to all threads.

  Private
    Each thread has its own copy of the data.
    Other threads cannot access this data.
    The changes are visible only to the thread that owns the data.

Common model for threaded programs

• Fork-Join Model

  The master thread runs from start to end.

  When parallelism is specified, the main thread gets help from threads that are called worker threads.

  At the end of the parallel portion of the work, the threads synchronize and terminate.

  Only the main thread remains.

[Figure: fork-join model: the master thread forks worker threads at the start of each parallel region and joins them at its end.]

Simple Example

• For-Loop: all the iterations are independent.

• OpenMP:
  Specify the include file "omp.h".
  Compile with:
  > mpixlc_r -qsmp=omp ex1.c -o ex1
  By default, the runtime environment uses all available threads. The default execution mode is SMP (four threads/task).
  OpenMP breaks the work between the threads.
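A minimal sketch of what ex1.c might contain (illustrative only, not the original example code):

#include <omp.h>
#include <stdio.h>

int main(void)
{
    int i;
    double a[16];
    #pragma omp parallel for          /* iterations are split among the threads */
    for (i = 0; i < 16; i++) {
        a[i] = 2.0 * i;               /* each iteration is independent          */
        printf("thread %d computed a[%d]\n", omp_get_thread_num(), i);
    }
    return 0;
}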

Setting Environment Variables

• If you want to use fewer than the available number of threads, set the XLSMPOPTS=PARTHDS or OMP_NUM_THREADS environment variable.

• In your LoadLeveler batch job file

# @ arguments = -env XLSMPOPTS=PARTHDS=2 -exe …

OR# @ arguments = -env OMP_NUM_THREADS=2 -exe …

C/C++ Directives Format

#pragma omp directive-name [clause, ...] newline

• Case sensitive

• Directives follow conventions of the C/C++ standards forcompiler directives

• Only one directive-name may be specified per directive

• You can synchronize the threads

• Run-Time Library Routines: OpenMP can perform a variety of functions:
  Get and set the number of threads to use: omp_get_num_threads(); omp_set_num_threads(4);
  Get the thread ID: omp_get_thread_num();
  Wall-clock timer functions
  Locking functions

• Environment Variables: OpenMP provides environment variables for controlling the execution of parallel code:
  Set the number of threads: OMP_NUM_THREADS
  Scheduling (determines how iterations of the loop are scheduled on processors): OMP_SCHEDULE
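A short sketch that exercises these routines (assuming plain C compiled with a thread-safe compiler such as mpixlc_r -qsmp=omp):

#include <omp.h>
#include <stdio.h>

int main(void)
{
    double t0, t1;
    omp_set_num_threads(4);                    /* set the number of threads */
    t0 = omp_get_wtime();                      /* wall-clock timer routine  */
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();        /* thread ID                 */
        if (tid == 0)
            printf("running with %d threads\n", omp_get_num_threads());
    }
    t1 = omp_get_wtime();
    printf("parallel region took %f seconds\n", t1 - t0);
    return 0;
}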

Matrix-vector multiplication with OpenMP

#pragma omp parallel for default(none) \
        shared(m,n,a,b,c) private(i,j,sum)
for (i = 0; i < m; i++) {
    sum = 0.0;
    for (j = 0; j < n; j++)
        sum += b[i*n+j] * c[j];
    a[i] = sum;
}

• default(none) forces the programmer to declare every variable as shared, private, or some other kind.
• private(i,j,sum): the loop variables and sum must be private; otherwise any thread could modify these variables.
• shared(m,n,a,b,c): every thread can access (read and write) its portion of the arrays.

OpenMP Performance

[Charts: OpenMP performance results, with a separate panel for a small problem size.]

If clause

#pragma omp parallel if(n > 256) default(none) \
        shared(m,n,a,b,c) private(i,j,sum)
{
    #pragma omp for
    for (i = 0; i < m; i++) {
        sum = 0.0;
        for (j = 0; j < n; j++)
            sum += b[i*n+j] * c[j];
        a[i] = sum;
    } /*-- End of omp for --*/
} /*-- End of omp parallel --*/

Pthreads

• int pthread_create(pthread_t *thread, const pthread_attr_t *attr,
                     void *(*start_routine)(void *), void *arg);

  thread: identifies the newly created thread; this is a unique identifier.
  attr: a thread attribute object that specifies various characteristics of the new thread; NULL gives the default values.
  start_routine: the routine that the thread starts executing.
  arg: a parameter to be passed to start_routine.

• pthread_join causes the calling thread to wait for the specified thread's termination:
  int pthread_join(pthread_t thread, void **value_ptr);

• Specify the include file "pthread.h".

Who am I? Who are you?

pthread_t io_thread;
. . .
main()
{
    . . .
    pthread_create(&io_thread, . . .);
    . . .
}

io_routine()
{
    pthread_t thread;
    thread = pthread_self();
    if (pthread_equal(io_thread, thread)) {
        . . .
    }
}

pthread_create returns a thread handle of type pthread_t.

pthread_self obtains the thread handle of the calling thread.

pthread_equal compares one thread handle to another thread handle.

Matrix-vector multiplication with Pthreads

void do_work(int row_m, int n, double *a, double *b, double *c);
void *do_work_pth(void *argpth);

main()
{
    pthread_t thread[NUMTHREADS];
    . . .
    for (t = 0; t < NUMTHREADS; t++)
        pthread_create(&thread[t], NULL, do_work_pth, (void *) t);
    for (t = 0; t < NUMTHREADS; t++)
        pthread_join(thread[t], NULL);
    . . .
}

void *do_work_pth(void *argpth)
{
    long tid = (long) argpth;
    . . .
    do_work(row_m, n, a, b, c);
}

Pthreads - OpenMPComparison

Pthreads:
• To make use of Pthreads, developers must write their code specifically for this API: they must include header files, declare Pthreads data structures, and call Pthreads-specific functions.
• Portable.

OpenMP:
• Easy to implement.
• Portable, but Pthreads offers a much greater range of primitive functions that provide finer-grained control over threading operations.

The Programming Model

• Shared-memory parallelism within the nodes• Distributed parallelism across the nodes

• Three mesh levels
  Finest (usual) mesh: for computation of the PDE solutions.
  Middle-level (thread) mesh: each block belongs to one thread and has a fraction of the chip's memory for writing.
  Coarsest-level (MPI) mesh: each block belongs to one chip; messages pass between blocks. Each thread can read from the entire memory of the chip, but can write only to the restricted memory of its thread.

Three Mesh Levels: 4x4 (MPI) x 8x8 (threads) x 8x8 (PDE computational cells)

[Figure: a processor mesh block (4x4 of these, with MPI across boundaries); each contains an 8x8 thread mesh block, allowing 64 threads (cores) per processor, with global read and local write across the processor mesh block; each thread mesh block holds 8x8 single computational cells for the PDE solution.]

FronTier - Riemann Problem

Implementation:

• A directional sweep is performed on the rectangular grid states.

• The pthread_create() routine permits the programmer to pass one argument to the thread start routine. To pass multiple arguments, create a structure that contains all of the arguments.

• Reader/writer locks protect the shared data structure.

• pthread_join() waits for the worker threads to finish.
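A sketch of the two Pthreads techniques mentioned above, packing multiple arguments into a structure and protecting shared data with a reader/writer lock (illustrative only; the struct fields and names are hypothetical, not FronTier's actual code):

#include <pthread.h>
#include <stdlib.h>

struct sweep_args {                 /* hypothetical bundle of thread arguments */
    int dir;                        /* sweep direction                         */
    int n;                          /* number of cells in the sweep            */
    double *states;                 /* grid states for this sweep              */
};

pthread_rwlock_t states_lock = PTHREAD_RWLOCK_INITIALIZER;

void *sweep(void *argp)
{
    struct sweep_args *a = (struct sweep_args *) argp;

    pthread_rwlock_rdlock(&states_lock);   /* many readers may hold the lock   */
    /* ... read shared states for direction a->dir ...                         */
    pthread_rwlock_unlock(&states_lock);

    pthread_rwlock_wrlock(&states_lock);   /* exclusive access for the update  */
    /* ... write updated states ...                                            */
    pthread_rwlock_unlock(&states_lock);
    return NULL;
}

int main(void)
{
    pthread_t tid;
    struct sweep_args *a = malloc(sizeof *a);
    a->dir = 0;  a->n = 64;  a->states = NULL;      /* placeholder values      */
    pthread_create(&tid, NULL, sweep, a);           /* single void* argument   */
    pthread_join(tid, NULL);                        /* wait for the sweep      */
    free(a);
    return 0;
}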

References

IBM Redbooks:
• IBM System Blue Gene Solution: Blue Gene/P Application Development, Chapter 9.2: Debugging Applications
• Unfolding the IBM eServer Blue Gene Solution, Chapter 6.5: Debugging

TAU Documentation: www.cs.uoregon.edu/research/tau/

PSC/Intel Multi-Core Programming and Performance Tuning Workshop, March 23-26, 2009, Pittsburgh Supercomputing Center

References

http://www.openmp.org

An Overview of OpenMP (video), Ruud van der Pas

Parallel Programming in OpenMP, by Rohit Chandra, Leo Dagum, Dave Kohr, Dror Maydan, Jeff McDonald, Ramesh Menon

PThreads Programming, by Bradford Nichols, Dick Buttlar, Jacqueline Proulx Farrell

POSIX Threads Programming, Blaise Barney, Lawrence Livermore National Laboratory, https://computing.llnl.gov/tutorials/pthreads/

OpenMP, Blaise Barney, Lawrence Livermore National Laboratory, https://computing.llnl.gov/tutorials/openMP/