Multicore Birgit Plötzeneder, 11/24/10
Intro (Why?)
Architecture
Languages: OMP, MPI
Tools
Darling, I shrunk the computer. *
• copyright by Prof. Erik Hagersten / Uppsala, who does awesome work
Signal propagation delay » transistor delay
Not enough ILP (instruction-level parallelism) to use more transistors
Power consumption
O RLY?
You want FASTER code. NOW.
- prefetching
- high computational load
- image/video
- fun
Intel Core 2 Quad
AMD Shanghai (K10)
Intel Dunnington (Xeon 74xx)
Intel i7
AMD Magny Cours
The Secret..
Moving from 1 core to 4 cores can give you a factor of
Moving from memory to L1 can give you a factor of
Disabling the L2 cache will reduce system performance more than disabling a second CPU core of a dual-core processor…
* see Iris Christadler, LRZ
OMP and MPI
OpenMP-Concept
Program start: only the master thread runs
Parallel region: a team of worker threads is generated (“fork”)
Threads synchronize when leaving the parallel region (“join”)
A First Program
Work-sharing constructs
omp for / omp do, sections, single, master
Data sharing attribute clauses
shared: visible and accessible by all threads simultaneously. Default (!). Beware of dependences such as a[i] = a[i-1]…
private: each thread will have a local copy, value is not maintained for use outside
firstprivate: like private except initialized to original value.
lastprivate: like private except original value is updated after construct.
reduction: each thread gets a private copy, combined at the end with a reduction operator (+, *, max, …)
Scheduling clauses
schedule(type, chunk):
static: iterations are divided into fixed chunks, assigned round-robin
dynamic: threads grab the next chunk as they finish one
guided: like dynamic, but the chunk size shrinks over time
Other clauses
critical: executed by only one thread at a time
atomic: similar to a critical section, but may perform better
ordered: executed in the order in which iterations would run in a sequential loop
barrier, nowait
Using clauses
MPI-Concept
mpicc <options> prog.c
mpirun -arch <architecture> -np <np> prog
MPI
MPI program: 6 basic calls
MPI_INIT, MPI_COMM_RANK, MPI_COMM_SIZE,
MPI_SEND, MPI_RECV, MPI_FINALIZE
MPI messages
data: (startbuf, count, datatype)
envelope: (destination/source, tag, communicator)
Communicators
Communication modes
• Collective vs P2P
▫ One2All, All2All, All2One
• Blocking / Nonblocking
• Synchronous / Asynchronous
Communication modes
• synchronous mode ("safest"): is the receiver ready?
• ready mode (lowest system overhead): only if there is a receiver already waiting (streaming)
• buffered mode (decouples sender from receiver): buffer size, buffer attachment!
• standard mode
Communication mode   Blocking routines   Non-blocking routines
synchronous          MPI_SSEND           MPI_ISSEND
ready                MPI_RSEND           MPI_IRSEND
buffered             MPI_BSEND           MPI_IBSEND
standard             MPI_SEND            MPI_ISEND
                     MPI_RECV            MPI_IRECV
                     MPI_SENDRECV, MPI_SENDRECV_REPLACE
Collective communication
Barrier
Broadcast
Gather
Scatter
Reduction
gprof
valgrind
PAPI
PAPI is a library that monitors hardware events when a program runs. Papiex is a tool that makes it easy to get access to performance counters using PAPI.*
*http://icl.cs.utk.edu/papi/
papiex -e <EVENT> ./my_prog
(to turn off optimizations for some tests, use the flag -O0)
Profilers: Two Types
Statistical profilers / event-based profilers

Statistical profiling: interrupts at random intervals and records which program instruction the CPU is executing.

Event-based profiling: interrupts triggered by hardware counter events are recorded. Measuring profiles affects performance. Still a lot of data saved.
Tracing
Wrappers for function calls (for example MPI_Recv)
Records when a function was called and with what parameters
Which nodes exchanged messages, message size…
Can affect performance
Intel tracing tools
Marmot MPI correctness and portability checker
MpiP - http://mpip.sourceforge.net/
Extrae + Paraver
module add paraver
mpi2prv -f TRACE.mpits -o MPImatrix.prv
Scalasca
Screenshots and examples of profilers/tracing tools are available – but not on the internet.
This talk was given to the TumFUG Linux/Unix-User group at the TU München.
Contact me via [email protected]
You may use the pictures of the processors (not the screenshots, and not the overview picture, which I only adapted), but please notify and credit me accordingly. Some of the code was copy-pasted from Wikipedia.
I've removed copyright-problematic parts.