8/14/2019 Cm Tut March01
http://slidepdf.com/reader/full/cm-tut-march01 1/63
ISCM-10
Taub Computing Center
High Performance Computing
for Computational Mechanics
Moshe Goldberg, March 29, 2001
High Performance Computing for CM

Agenda:
1) Overview
2) Alternative Architectures
3) Message Passing
4) “Shared Memory”
5) Case Study
1) High Performance Computing - Overview
Some Important Points
* Understanding HPC concepts
* Why should programmers care about the architecture?
* Do compilers make the right choices?
* Nowadays, there are alternatives
Trends in computer development
* Speed of calculation is steadily increasing
* Memory may not be in balance with high calculation speeds
* Workstations are approaching the speeds of especially efficient designs
* Are we approaching the limit of the speed of light?
* To get an answer faster, we must perform calculations in parallel
Some HPC concepts
* HPC
* HPF / Fortran90
* cc-NUMA
* Compiler directives
* OpenMP
* Message passing
* PVM / MPI
* Beowulf
[Figure: MFLOPS for parix (Origin2000), solving ax=b; x-axis: processors (1-12), y-axis: MFLOPS (0-4000); series: n=2001, n=3501, n=5001]
[Figure: ideal parallel speedup; x-axis: processors (1-12), y-axis: speedup (1-11); series: ideal]

speedup = (time for 1 cpu) / (time for n cpu's)
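The definition above can be checked with a short script; the timing numbers here are illustrative, not measurements from these slides:

```python
def speedup(t1, tn):
    # Parallel speedup: time on 1 cpu divided by time on n cpus.
    return t1 / tn

def efficiency(t1, tn, n):
    # Speedup per cpu; 1.0 means ideal (linear) scaling.
    return speedup(t1, tn) / n

# Illustrative timings: 120 s on 1 cpu, 30 s on 4 cpus.
print(speedup(120.0, 30.0))        # 4.0
print(efficiency(120.0, 30.0, 4))  # 1.0
```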
[Figure: speedup for parix (Origin2000), solving ax=b; x-axis: processors (1-12), y-axis: speedup (1-11); series: ideal, n=2001, n=3501, n=5001]
[Figure: "or" - MFLOPS for matrix multiply (n=3001); x-axis: processors (1-33), y-axis: MFLOPS (0-12000); series: source, blas]
[Figure: "or" - speedup for matrix multiply (n=3001); x-axis: processors (1-33), y-axis: speedup (1-33); series: ideal, source, blas]
[Figure: "or" - solve linear equations, MFLOPS; x-axis: processors (1-33), y-axis: MFLOPS (0-6000); series: n=2001, n=3501, n=5001]
[Figure: "or" - solve linear equations, speedup; x-axis: processors (1-33), y-axis: speedup (1-33); series: ideal, n=2001, n=3501, n=5001]
2) Alternative Architectures
[Figure: Units Shipped -- All Vectors; x-axis: year (1990-2000), y-axis: systems per year (0-700); series: Other, NEC, Fujitsu, Cray. Source: IDC, 2001]
[Figure: Units Shipped -- Capability Vector; x-axis: year (1990-2000), y-axis: systems per year (0-140); series: Other, NEC, Fujitsu, Cray. Source: IDC, 2001]
IUCC (Machba) computers (Mar 2001)
* Cray J90 -- 32 cpu, memory 4 GB (500 MW)
* Origin2000 -- 112 cpu (R12000, 400 MHz), 28.7 GB total memory
* PC cluster -- 64 cpu (Pentium III, 550 MHz), total memory 9 GB
[Figures: parallel architecture diagrams, Chris Hempel, hpc.utexas.edu]
Symmetric Multiple Processors
[Diagram: four CPUs connected to one shared memory over a memory bus]
Examples: SGI Power Challenge, Cray J90/T90
Distributed Parallel Computing
[Diagram: each CPU has its own local memory; nodes communicate over a network]
Examples: SP2, Beowulf
MPI commands -- examples

call MPI_SEND(sum,1,MPI_REAL,ito,itag,MPI_COMM_WORLD,ierror)
call MPI_RECV(sum,1,MPI_REAL,ifrom,itag,MPI_COMM_WORLD,istatus,ierror)
Some basic MPI functions
Setup: mpi_init, mpi_finalize
Environment: mpi_comm_size, mpi_comm_rank
Communication: mpi_send, mpi_recv
Synchronization: mpi_barrier
Other important MPI functions
Asynchronous communication: mpi_isend, mpi_irecv, mpi_iprobe, mpi_wait
Collective communication: mpi_barrier, mpi_bcast, mpi_gather, mpi_scatter, mpi_reduce, mpi_allreduce
Derived data types: mpi_type_contiguous, mpi_type_vector, mpi_type_indexed, mpi_type_pack, mpi_type_commit, mpi_type_free
Creating communicators: mpi_comm_dup, mpi_comm_split, mpi_intercomm_create, mpi_comm_free
4) “Shared Memory”
Fortran directives
Fortran directives -- examples

CRAY:
CMIC$ DO ALL
      do i=1,n
        a(i)=i
      enddo

SGI:
C$DOACROSS
      do i=1,n
        a(i)=i
      enddo

OpenMP:
C$OMP parallel do
      do i=1,n
        a(i)=i
      enddo
OpenMP Summary
OpenMP standard – first published Oct 1997
Directives
Run-time Library Routines
Environment Variables
Versions for f77, f90, c, c++
OpenMP Summary
Parallel Do Directive

c$omp parallel do private(I) shared(a)
      do I=1,n
        a(I)=I+1
      enddo
c$omp end parallel do      (optional)
OpenMP Summary
Defining a Parallel Region - Individual Do Loops

c$omp parallel shared(a,b)
c$omp do private(j)
      do j=1,n
        a(j)=j
      enddo
c$omp end do nowait
c$omp do private(k)
      do k=1,n
        b(k)=k
      enddo
c$omp end do
c$omp end parallel
OpenMP Summary
Parallel Do Directive - Clauses
shared
private
default(private|shared|none)
reduction({operator|intrinsic}:var)
if(scalar_logical_expression)
ordered
copyin(var)
OpenMP Summary
Run-Time Library Routines - Execution environment
omp_set_num_threads
omp_get_num_threads
omp_get_max_threads
omp_get_thread_num
omp_get_num_procs
omp_set_dynamic / omp_get_dynamic
omp_set_nested / omp_get_nested
OpenMP Summary
Run-Time Library Routines - Lock routines
omp_init_lock
omp_destroy_lock
omp_set_lock
omp_unset_lock
omp_test_lock
OpenMP Summary
Environment Variables
OMP_NUM_THREADS
OMP_DYNAMIC
OMP_NESTED
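The OpenMP runtime reads these variables from the environment at program start. The lookup itself can be sketched in Python (the fallback default of 1 thread is an assumption for illustration):

```python
import os

def omp_num_threads(default=1):
    # Parse OMP_NUM_THREADS the way an OpenMP runtime would at
    # startup; fall back to a default when the variable is unset.
    value = os.environ.get("OMP_NUM_THREADS")
    return int(value) if value else default

os.environ["OMP_NUM_THREADS"] = "4"
print(omp_num_threads())  # 4
```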
RISC memory levels
[Diagram: single CPU, with a cache between the CPU and main memory]
RISC memory levels
[Diagram: multiple CPUs (CPU 0, CPU 1), each with its own cache (Cache 0, Cache 1), sharing one main memory]
A sample program

      subroutine xmult (x1,x2,y1,y2,z1,z2,n)
      real x1(n),x2(n),y1(n),y2(n),z1(n),z2(n)
      real a,b,c,d
      do i=1,n
        a=x1(i)*x2(i); b=y1(i)*y2(i)
        c=x1(i)*y2(i); d=x2(i)*y1(i)
        z1(i)=a-b; z2(i)=c+d
      enddo
      end
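The loop body is elementwise complex multiplication: (x1 + i·y1)(x2 + i·y2) has real part x1·x2 − y1·y2 and imaginary part x1·y2 + x2·y1. The same arithmetic, transcribed into plain Python as a check:

```python
def xmult(x1, x2, y1, y2):
    # Same temporaries as the Fortran loop; z1 is the real part
    # and z2 the imaginary part of the elementwise product.
    z1, z2 = [], []
    for i in range(len(x1)):
        a = x1[i] * x2[i]; b = y1[i] * y2[i]
        c = x1[i] * y2[i]; d = x2[i] * y1[i]
        z1.append(a - b); z2.append(c + d)
    return z1, z2

z1, z2 = xmult([1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0])
print(z1)  # [-32.0, -40.0], matching (1+5j)*(3+7j) and (2+6j)*(4+8j)
print(z2)  # [22.0, 40.0]
```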
A sample program

      subroutine xmult (x1,x2,y1,y2,z1,z2,n)
      real x1(n),x2(n),y1(n),y2(n),z1(n),z2(n)
      real a,b,c,d
c$omp parallel do
      do i=1,n
        a=x1(i)*x2(i); b=y1(i)*y2(i)
        c=x1(i)*y2(i); d=x2(i)*y1(i)
        z1(i)=a-b; z2(i)=c+d
      enddo
      end
A sample program
Run on Technion Origin2000
Vector length = 1,000,000; loop repeated 50 times
Compiler optimization: low (-O1)

Elapsed time, sec
Compile        1 thread   2 threads   4 threads
No parallel    15.0       15.3        --
Parallel       16.0       26.0        26.8

Is this running in parallel?
A sample program
(same run and timings as the previous slide)
Is this running in parallel? WHY NOT?
A sample program

c$omp parallel do
      do i=1,n
        a=x1(i)*x2(i); b=y1(i)*y2(i)
        c=x1(i)*y2(i); d=x2(i)*y1(i)
        z1(i)=a-b; z2(i)=c+d
      enddo

Is this running in parallel? WHY NOT?
Answer: by default, variables a,b,c,d are defined as SHARED
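In thread terms, SHARED temporaries live in one place that every thread writes, so one thread's a can be overwritten by another between its definition and its use. PRIVATE gives each thread its own copy. A Python sketch of the PRIVATE case, where the temporary is local to each worker's own stack frame (the thread count and data here are illustrative, not OpenMP itself):

```python
import threading

results = {}

def worker(tid, xs):
    # 'a' is a local variable, so each thread has its own copy --
    # the analogue of listing it in an OpenMP private() clause.
    out = []
    for x in xs:
        a = x * x
        out.append(a - 1)
    results[tid] = out

threads = [threading.Thread(target=worker, args=(t, [t + 1, t + 2]))
           for t in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # each thread's temporaries stay consistent
```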
A sample program
Solution: define a,b,c,d as PRIVATE:

c$omp parallel do private(a,b,c,d)

Elapsed time, sec
Compile        1 thread   2 threads   4 threads
No parallel    15.0       15.3        --
Parallel       16.0       8.5         4.6

This is now running in parallel.
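From the table, speedup follows the earlier definition, taking the 1-thread parallel run (16.0 s) as the baseline:

```python
t1, t2, t4 = 16.0, 8.5, 4.6  # elapsed seconds from the table

speedup_2 = t1 / t2   # about 1.9 on 2 threads
speedup_4 = t1 / t4   # about 3.5 on 4 threads
print(round(speedup_2, 2), round(speedup_4, 2))
```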
HPC in the Technion
* SGI Origin2000 -- 22 cpu (R10000, 250 MHz), total memory 5.6 GB
* PC cluster (Linux RedHat 6.1) -- 6 cpu (Pentium II, 400 MHz), memory 500 MB/cpu
Fluent test case -- stability of a subsonic turbulent jet
Source: Viktoria Suponitsky, Faculty of Aerospace Engineering, Technion
Fluent test case
10 time steps, 20 iterations per time step

Reading "Case25unstead.cas"...
10000 quadrilateral cells, zone 1, binary.
19800 2D interior faces, zone 9, binary.
50 2D wall faces, zone 3, binary.
100 2D pressure-inlet faces, zone 7, binary.
50 2D pressure-outlet faces, zone 5, binary.
50 2D pressure-outlet faces, zone 6, binary.
50 2D velocity-inlet faces, zone 2, binary.
100 2D axis faces, zone 4, binary.
10201 nodes, binary.
10201 node flags, binary.
Fluent test case
SMP command: fluent 2d -t8 -psmpi -g < inp

Host spawning Node 0 on machine "parix".
ID   Comm. Hostname O.S.  PID   Mach ID HW ID Name
-------------------------------------------------------------
host net   parix    irix  19732 0       7     Fluent Host
n7   smpi  parix    irix  19776 0       7     Fluent Node
n6   smpi  parix    irix  19775 0       6     Fluent Node
n5   smpi  parix    irix  19771 0       5     Fluent Node
n4   smpi  parix    irix  19770 0       4     Fluent Node
n3   smpi  parix    irix  19772 0       3     Fluent Node
n2   smpi  parix    irix  19769 0       2     Fluent Node
n1   smpi  parix    irix  19768 0       1     Fluent Node
n0*  smpi  parix    irix  19767 0       0     Fluent Node
Fluent test case
Cluster command:
fluent 2d -cnf=clinux1,clinux2,clinux3,clinux4,clinux5,clinux6 -t6 -pnet -g < inp

Node 0 spawning Node 5 on machine "clinux6".
ID   Comm. Hostname O.S.       PID   Mach ID HW ID Name
-----------------------------------------------------------
n5   net   clinux6  linux-ia32 3560  5       9     Fluent Node
n4   net   clinux5  linux-ia32 19645 4       8     Fluent Node
n3   net   clinux4  linux-ia32 16696 3       7     Fluent Node
n2   net   clinux3  linux-ia32 17259 2       6     Fluent Node
n1   net   clinux2  linux-ia32 18328 1       5     Fluent Node
host net   clinux1  linux-ia32 10358 0       3     Fluent Host
n0*  net   clinux1  linux-ia32 10400 0       -1    Fluent Node
Fluent test - time for multiple cpu's
[Figure: total run time (sec, 0-450) vs number of cpu's (1-8); series: origin2000, pc cluster]
Fluent test - speedup by cpu's
[Figure: speedup (1-8) vs number of cpu's (1-8); series: ideal, origin2000, pc cluster]
TOP500 (November 2, 2000)