
Cm Tut March01

Transcript


ISCM-10

Taub Computing Center

High Performance Computing

for Computational Mechanics

Moshe Goldberg
March 29, 2001


High Performance Computing for CM

Agenda:

1) Overview
2) Alternative Architectures
3) Message Passing
4) "Shared Memory"
5) Case Study


1) High Performance Computing - Overview


Some Important Points

* Understanding HPC concepts
* Why should programmers care about the architecture?
* Do compilers make the right choices?
* Nowadays, there are alternatives


Trends in computer development

* Speed of calculation is steadily increasing
* Memory may not be in balance with high calculation speeds
* Workstations are approaching supercomputer speeds, with especially efficient designs
* Are we approaching the limit of the speed of light?
* To get an answer faster, we must perform calculations in parallel


Some HPC concepts

* HPC
* HPF / Fortran90
* cc-NUMA
* Compiler directives
* OpenMP
* Message passing
* PVM / MPI
* Beowulf


[Chart: MFLOPS for parix (origin2000), ax=b -- MFLOPS vs. number of processors (1-12), for n=2001, n=3501, n=5001]


Ideal parallel speedup

[Chart: ideal speedup vs. number of processors (1-12)]

speedup = (time for 1 cpu) / (time for n cpu's)
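For instance (illustrative numbers, not taken from the measurements shown in this deck): if a run takes 120 seconds on 1 cpu and 20 seconds on 8 cpu's, the speedup is 120/20 = 6 against an ideal value of 8, i.e. a parallel efficiency of 75%.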


[Chart: speedup for parix (origin2000), ax=b -- speedup vs. number of processors (1-12), for n=2001, n=3501, n=5001, against the ideal line]


"or" - MFLOPS for matrix multiply (n=3001)

0.0

2000.0

4000.0

6000.0

8000.0

10000.0

12000.0

1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33

processors

   M   F   L   O   P   S

source

blas


"or" - Speedup for Matrix multiply (n=3001)

1.0

3.0

5.0

7.0

9.0

11.0

13.0

15.0

17.0

19.0

21.0

23.0

25.0

27.0

29.0

31.0

33.0

1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33

processors

    s    p    e    e     d    u

    p

ideal

source

blas


"or" - solve linear equations

0.0

1000.0

2000.0

3000.0

4000.0

5000.0

6000.0

1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33

processors

     M     F     L     O     P     S

n=2001

n=3501n=5001


"or" - solve linear equations

1.0

3.0

5.0

7.0

9.0

11.0

13.0

15.0

17.0

19.0

21.0

23.0

25.0

27.0

29.0

31.0

33.0

1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33

processors

  s  p  e  e   d  u  p

ideal

n=2001

n=3501

n=5001


2) Alternative Architectures


Units Shipped -- All Vectors

[Chart: vector systems shipped per year, 1990-2000, by vendor (Cray, Fujitsu, NEC, Other). Source: IDC, 2001]


Units Shipped -- Capability Vector

[Chart: capability-class vector systems shipped per year, 1990-2000, by vendor (Cray, Fujitsu, NEC, Other). Source: IDC, 2001]



IUCC (Machba) computers

Cray J90 -- 32 cpu
  Memory - 4 GB (500 MW)

Origin2000 -- 112 cpu (R12000, 400 MHz)
  28.7 GB total memory

PC cluster -- 64 cpu (Pentium III, 550 MHz)
  Total memory - 9 GB

(Mar 2001)


[Slide figure; credit: Chris Hempel, hpc.utexas.edu]



[Slide figure; credit: Chris Hempel, hpc.utexas.edu]


Symmetric Multiple Processors

[Diagram: several CPUs connected through a memory bus to one shared memory]

Examples: SGI Power Challenge, Cray J90/T90


Distributed Parallel Computing

[Diagram: each CPU paired with its own local memory; the nodes are connected by a network]

Examples: SP2, Beowulf


3) Message Passing


MPI commands -- examples

  call MPI_SEND(sum, 1, MPI_REAL, ito, itag, MPI_COMM_WORLD, ierror)

  call MPI_RECV(sum, 1, MPI_REAL, ifrom, itag, MPI_COMM_WORLD, istatus, ierror)
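To put the two calls above in context, here is a minimal self-contained sketch (not from the original slides; the program name, the tag value and the constant being sent are illustrative choices) in which rank 1 sends one real to rank 0:

c     minimal MPI example: rank 1 sends one real to rank 0
      program sendrecv
      include 'mpif.h'
      integer ierror, myrank, nprocs, itag
      integer istatus(MPI_STATUS_SIZE)
      real sum
      itag = 1
      call MPI_INIT(ierror)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierror)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierror)
      if (myrank .eq. 1) then
         sum = 3.14
         call MPI_SEND(sum, 1, MPI_REAL, 0, itag,
     &                 MPI_COMM_WORLD, ierror)
      else if (myrank .eq. 0) then
         call MPI_RECV(sum, 1, MPI_REAL, 1, itag,
     &                 MPI_COMM_WORLD, istatus, ierror)
         print *, 'rank 0 received sum =', sum
      end if
      call MPI_FINALIZE(ierror)
      end

Built and launched on at least two processes in whatever way the local MPI installation provides (for example mpif77 and mpirun -np 2), rank 0 should print the value sent by rank 1.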


Some basic MPI functions

Setup:            mpi_init, mpi_finalize
Environment:      mpi_comm_size, mpi_comm_rank
Communication:    mpi_send, mpi_recv
Synchronization:  mpi_barrier


Other important MPI functions

Asynchronous communication:
  mpi_isend, mpi_irecv, mpi_iprobe, mpi_wait/nowait

Collective communication:
  mpi_barrier, mpi_bcast, mpi_gather, mpi_scatter, mpi_reduce, mpi_allreduce

Derived data types:
  mpi_type_contiguous, mpi_type_vector, mpi_type_indexed, mpi_type_pack, mpi_type_commit, mpi_type_free

Creating communicators:
  mpi_comm_dup, mpi_comm_split, mpi_intercomm_create, mpi_comm_free
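As an illustration of the collective calls listed above (again a sketch, not taken from the slides; names and values are illustrative), each rank contributes a partial value and mpi_reduce sums them onto rank 0:

c     collective communication example: global sum with MPI_REDUCE
      program reduce_demo
      include 'mpif.h'
      integer ierror, myrank, nprocs
      real part, total
      call MPI_INIT(ierror)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierror)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierror)
      part = real(myrank + 1)
      call MPI_REDUCE(part, total, 1, MPI_REAL, MPI_SUM, 0,
     &                MPI_COMM_WORLD, ierror)
      if (myrank .eq. 0) print *, 'global sum =', total
      call MPI_FINALIZE(ierror)
      end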


4) “Shared Memory”

Fortran directives


Fortran directives -- examples

CRAY:
CMIC$ DO ALL
      do i=1,n
         a(i)=i
      enddo

SGI:
C$DOACROSS
      do i=1,n
         a(i)=i
      enddo

OpenMP:
C$OMP parallel do
      do i=1,n
         a(i)=i
      enddo


OpenMP Summary

OpenMP standard – first published Oct 1997

Directives

Run-time Library Routines

Environment Variables

Versions for f77, f90, c, c++


OpenMP Summary

Parallel Do Directive

c$omp parallel do private(I) shared(a)
      do I=1,n
         a(I) = I+1
      enddo
c$omp end parallel do      (optional)


OpenMP Summary

Defining a Parallel Region - Individual Do Loops

c$omp parallel shared(a,b)
c$omp do private(j)
      do j=1,n
         a(j)=j
      enddo
c$omp end do nowait
c$omp do private(k)
      do k=1,n
         b(k)=k
      enddo
c$omp end do
c$omp end parallel


OpenMP Summary

Parallel Do Directive - Clauses

shared
private
default(private|shared|none)
reduction({operator|intrinsic}:var)
if(scalar_logical_expression)
ordered
copyin(var)
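A short sketch of how some of these clauses combine on one directive (an assumed example, not from the deck; names and values are illustrative): the reduction clause gives each thread a private partial sum that is combined at the end, and the if clause disables threading when the trip count is small.

c     clause example: reduction and if on a parallel do
      program clause_demo
      integer n, i
      parameter (n=100000)
      real a(n), s
      do i = 1, n
         a(i) = 1.0
      enddo
      s = 0.0
c$omp parallel do private(i) shared(a) reduction(+:s) if(n.gt.1000)
      do i = 1, n
         s = s + a(i)
      enddo
c$omp end parallel do
      print *, 'sum =', s
      end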


OpenMP Summary

Run-Time Library Routines

Execution environment

omp_set_num_threads
omp_get_num_threads
omp_get_max_threads
omp_get_thread_num
omp_get_num_procs
omp_set_dynamic / omp_get_dynamic
omp_set_nested / omp_get_nested
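A small sketch (assumed example, not from the slides) of the execution-environment routines: each thread reports its id and the team size. Note that omp_get_num_threads returns 1 outside a parallel region.

c     run-time library example: query thread number and team size
      program threads_demo
      integer omp_get_thread_num, omp_get_num_threads
      call omp_set_num_threads(4)
c$omp parallel
      print *, 'thread', omp_get_thread_num(),
     &         'of', omp_get_num_threads()
c$omp end parallel
      end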


OpenMP Summary

Run-Time Library Routines

Lock routines

omp_init_lock
omp_destroy_lock
omp_set_lock
omp_unset_lock
omp_test_lock
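A sketch of the lock routines (assumed example, not from the deck; the lock-variable declaration relies on the omp_lib module, so an f90-capable OpenMP compiler is assumed): the lock serializes updates to a shared counter.

c     lock example: serialize updates to a shared counter
      program lock_demo
      use omp_lib
      integer (kind=omp_lock_kind) :: lck
      integer :: count
      count = 0
      call omp_init_lock(lck)
c$omp parallel shared(count,lck)
      call omp_set_lock(lck)
      count = count + 1
      call omp_unset_lock(lck)
c$omp end parallel
      call omp_destroy_lock(lck)
      print *, 'count =', count
      end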


OpenMP Summary

Environment Variables

OMP_NUM_THREADS
OMP_DYNAMIC
OMP_NESTED
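For example, on a csh-type shell one would typically set the thread count with something like "setenv OMP_NUM_THREADS 4" before starting the program; the run-time routines above can then query or adjust it.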


RISC memory levels

[Diagram: Single CPU -- CPU with cache and main memory]



RISC memory levels

[Diagram: Multiple CPU's -- CPU 0 with Cache 0 and CPU 1 with Cache 1, both sharing one main memory]



A sample program

      subroutine xmult (x1,x2,y1,y2,z1,z2,n)
      real x1(n),x2(n),y1(n),y2(n),z1(n),z2(n)
      real a,b,c,d
      do i=1,n
         a=x1(i)*x2(i); b=y1(i)*y2(i)
         c=x1(i)*y2(i); d=x2(i)*y1(i)
         z1(i)=a-b; z2(i)=c+d
      enddo
      end


A sample program

      subroutine xmult (x1,x2,y1,y2,z1,z2,n)
      real x1(n),x2(n),y1(n),y2(n),z1(n),z2(n)
      real a,b,c,d
c$omp parallel do
      do i=1,n
         a=x1(i)*x2(i); b=y1(i)*y2(i)
         c=x1(i)*y2(i); d=x2(i)*y1(i)
         z1(i)=a-b; z2(i)=c+d
      enddo
      end


A sample program

Run on Technion origin2000
Vector length = 1,000,000
Loop repeated 50 times
Compiler optimization: low (-O1)

Elapsed time, sec:

                     threads
  Compile          1      2      4
  No parallel    15.0   15.3
  Parallel       16.0   26.0   26.8

Is this running in parallel?


A sample program

Run on Technion origin2000
Vector length = 1,000,000
Loop repeated 50 times
Compiler optimization: low (-O1)

Elapsed time, sec:

                     threads
  Compile          1      2      4
  No parallel    15.0   15.3
  Parallel       16.0   26.0   26.8

Is this running in parallel? WHY NOT?


A sample program

c$omp parallel do
      do i=1,n
         a=x1(i)*x2(i); b=y1(i)*y2(i)
         c=x1(i)*y2(i); d=x2(i)*y1(i)
         z1(i)=a-b; z2(i)=c+d
      enddo

Is this running in parallel? WHY NOT?

Answer: by default, variables a,b,c,d are defined as SHARED


A sample program

Solution: define a,b,c,d as PRIVATE:

  c$omp parallel do private(a,b,c,d)

Elapsed time, sec:

                     threads
  Compile          1      2      4
  No parallel    15.0   15.3
  Parallel       16.0    8.5    4.6

This is now running in parallel
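Putting the fix together with the earlier listing, the corrected routine would look like this (assembled from the preceding slides rather than copied verbatim from the deck):

      subroutine xmult (x1,x2,y1,y2,z1,z2,n)
      real x1(n),x2(n),y1(n),y2(n),z1(n),z2(n)
      real a,b,c,d
c     each thread now gets its own copy of the scratch variables a,b,c,d
c$omp parallel do private(a,b,c,d)
      do i=1,n
         a=x1(i)*x2(i); b=y1(i)*y2(i)
         c=x1(i)*y2(i); d=x2(i)*y1(i)
         z1(i)=a-b; z2(i)=c+d
      enddo
      end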


5) Case Study


HPC in the Technion

SGI Origin2000
  22 cpu (R10000, 250 MHz)
  Total memory -- 5.6 GB

PC cluster (linux redhat 6.1)
  6 cpu (Pentium II, 400 MHz)
  Memory - 500 MB/cpu


Fluent test case -- Stability of a subsonic turbulent jet

Source: Viktoria Suponitsky, Faculty of Aerospace Engineering, Technion


Reading "Case25unstead.cas"...

10000 quadrilateral cells, zone 1, binary.

19800 2D interior faces, zone 9, binary.

50 2D wall faces, zone 3, binary.

100 2D pressure-inlet faces, zone 7, binary.

50 2D pressure-outlet faces, zone 5, binary.

50 2D pressure-outlet faces, zone 6, binary.

50 2D velocity-inlet faces, zone 2, binary.

100 2D axis faces, zone 4, binary.

10201 nodes, binary.

10201 node flags, binary.

Fluent test case

10 time steps, 20 iterations per time step



Fluent test case

SMP command: fluent 2d -t8 -psmpi -g < inp

Host spawning Node 0 on machine "parix".

ID    Comm.  Hostname  O.S.   PID    Mach ID  HW ID  Name
----------------------------------------------------------
host  net    parix     irix   19732  0        7      Fluent Host
n7    smpi   parix     irix   19776  0        7      Fluent Node
n6    smpi   parix     irix   19775  0        6      Fluent Node
n5    smpi   parix     irix   19771  0        5      Fluent Node
n4    smpi   parix     irix   19770  0        4      Fluent Node
n3    smpi   parix     irix   19772  0        3      Fluent Node
n2    smpi   parix     irix   19769  0        2      Fluent Node
n1    smpi   parix     irix   19768  0        1      Fluent Node
n0*   smpi   parix     irix   19767  0        0      Fluent Node


Fluent test case

Cluster command:
fluent 2d -cnf=clinux1,clinux2,clinux3,clinux4,clinux5,clinux6 -t6 -pnet -g < inp

Node 0 spawning Node 5 on machine "clinux6".

ID    Comm.  Hostname  O.S.        PID    Mach ID  HW ID  Name
---------------------------------------------------------------
n5    net    clinux6   linux-ia32  3560   5        9      Fluent Node
n4    net    clinux5   linux-ia32  19645  4        8      Fluent Node
n3    net    clinux4   linux-ia32  16696  3        7      Fluent Node
n2    net    clinux3   linux-ia32  17259  2        6      Fluent Node
n1    net    clinux2   linux-ia32  18328  1        5      Fluent Node
host  net    clinux1   linux-ia32  10358  0        3      Fluent Host
n0*   net    clinux1   linux-ia32  10400  0       -1      Fluent Node


[Chart: Fluent test - time for multiple cpu's -- total run time vs. number of cpu's (1-8), for origin2000 and pc cluster]


[Chart: Fluent test - speedup by cpu's -- speedup vs. number of cpu's (1-8), for origin2000 and pc cluster, against the ideal line]


TOP500 (November 2, 2000)


