
Cm Tut March01

Transcript


ISCM-10

Taub Computing Center

High Performance Computing

for Computational Mechanics

Moshe Goldberg
March 29, 2001


High Performance Computing for CM

Agenda:

1) Overview
2) Alternative Architectures
3) Message Passing
4) "Shared Memory"
5) Case Study


1) High Performance Computing - Overview


Some Important Points

* Understanding HPC concepts
* Why should programmers care about the architecture?
* Do compilers make the right choices?
* Nowadays, there are alternatives


Trends in computer development

* Speed of calculation is steadily increasing
* Memory may not be in balance with high calculation speeds
* Workstations are approaching supercomputer speeds, with especially efficient designs
* Are we approaching the limit of the speed of light?
* To get an answer faster, we must perform calculations in parallel


Some HPC concepts

* HPC
* HPF / Fortran90
* cc-NUMA
* Compiler directives
* OpenMP
* Message passing
* PVM / MPI
* Beowulf


[Chart: MFLOPS for parix (origin2000), ax=b -- MFLOPS vs. number of processors (1-12), for n=2001, n=3501, n=5001]


Ideal parallel speedup

[Chart: ideal speedup vs. number of processors (1-12)]

speedup = (time for 1 cpu) / (time for n cpu's)
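For instance (illustrative numbers, not taken from the measurements shown in this deck): if a run takes 120 seconds on 1 cpu and 20 seconds on 8 cpu's, the speedup is 120/20 = 6 against an ideal value of 8, i.e. a parallel efficiency of 75%.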


[Chart: speedup for parix (origin2000), ax=b -- speedup vs. number of processors (1-12), for n=2001, n=3501, n=5001, against the ideal line]


"or" - MFLOPS for matrix multiply (n=3001)

0.0

2000.0

4000.0

6000.0

8000.0

10000.0

12000.0

1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33

processors

   M   F   L   O   P   S

source

blas


"or" - Speedup for Matrix multiply (n=3001)

1.0

3.0

5.0

7.0

9.0

11.0

13.0

15.0

17.0

19.0

21.0

23.0

25.0

27.0

29.0

31.0

33.0

1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33

processors

    s    p    e    e     d    u

    p

ideal

source

blas


"or" - solve linear equations

0.0

1000.0

2000.0

3000.0

4000.0

5000.0

6000.0

1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33

processors

     M     F     L     O     P     S

n=2001

n=3501n=5001


"or" - solve linear equations

1.0

3.0

5.0

7.0

9.0

11.0

13.0

15.0

17.0

19.0

21.0

23.0

25.0

27.0

29.0

31.0

33.0

1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33

processors

  s  p  e  e   d  u  p

ideal

n=2001

n=3501

n=5001


2) Alternative Architectures


Units Shipped -- All Vectors

[Chart: vector systems shipped per year, 1990-2000, by vendor (Cray, Fujitsu, NEC, Other). Source: IDC, 2001]


Units Shipped -- Capability Vector

[Chart: capability-class vector systems shipped per year, 1990-2000, by vendor (Cray, Fujitsu, NEC, Other). Source: IDC, 2001]



IUCC (Machba) computers

Cray J90 -- 32 cpu
  Memory - 4 GB (500 MW)

Origin2000 -- 112 cpu (R12000, 400 MHz)
  28.7 GB total memory

PC cluster -- 64 cpu (Pentium III, 550 MHz)
  Total memory - 9 GB

(Mar 2001)


[Slide figure; credit: Chris Hempel, hpc.utexas.edu]



[Slide figure; credit: Chris Hempel, hpc.utexas.edu]


Symmetric Multiple Processors

[Diagram: several CPUs connected through a memory bus to one shared memory]

Examples: SGI Power Challenge, Cray J90/T90


Distributed Parallel Computing

[Diagram: each CPU paired with its own local memory; the nodes are connected by a network]

Examples: SP2, Beowulf


3) Message Passing


MPI commands -- examples

  call MPI_SEND(sum, 1, MPI_REAL, ito, itag, MPI_COMM_WORLD, ierror)

  call MPI_RECV(sum, 1, MPI_REAL, ifrom, itag, MPI_COMM_WORLD, istatus, ierror)
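To put the two calls above in context, here is a minimal self-contained sketch (not from the original slides; the program name, the tag value and the constant being sent are illustrative choices) in which rank 1 sends one real to rank 0:

c     minimal MPI example: rank 1 sends one real to rank 0
      program sendrecv
      include 'mpif.h'
      integer ierror, myrank, nprocs, itag
      integer istatus(MPI_STATUS_SIZE)
      real sum
      itag = 1
      call MPI_INIT(ierror)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierror)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierror)
      if (myrank .eq. 1) then
         sum = 3.14
         call MPI_SEND(sum, 1, MPI_REAL, 0, itag,
     &                 MPI_COMM_WORLD, ierror)
      else if (myrank .eq. 0) then
         call MPI_RECV(sum, 1, MPI_REAL, 1, itag,
     &                 MPI_COMM_WORLD, istatus, ierror)
         print *, 'rank 0 received sum =', sum
      end if
      call MPI_FINALIZE(ierror)
      end

Built and launched on at least two processes in whatever way the local MPI installation provides (for example mpif77 and mpirun -np 2), rank 0 should print the value sent by rank 1.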


Some basic MPI functions

Setup:            mpi_init, mpi_finalize
Environment:      mpi_comm_size, mpi_comm_rank
Communication:    mpi_send, mpi_recv
Synchronization:  mpi_barrier


Other important MPI functions

Asynchronous communication:
  mpi_isend, mpi_irecv, mpi_iprobe, mpi_wait/nowait

Collective communication:
  mpi_barrier, mpi_bcast, mpi_gather, mpi_scatter, mpi_reduce, mpi_allreduce

Derived data types:
  mpi_type_contiguous, mpi_type_vector, mpi_type_indexed, mpi_type_pack, mpi_type_commit, mpi_type_free

Creating communicators:
  mpi_comm_dup, mpi_comm_split, mpi_intercomm_create, mpi_comm_free
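As an illustration of the collective calls listed above (again a sketch, not taken from the slides; names and values are illustrative), each rank contributes a partial value and mpi_reduce sums them onto rank 0:

c     collective communication example: global sum with MPI_REDUCE
      program reduce_demo
      include 'mpif.h'
      integer ierror, myrank, nprocs
      real part, total
      call MPI_INIT(ierror)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierror)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierror)
      part = real(myrank + 1)
      call MPI_REDUCE(part, total, 1, MPI_REAL, MPI_SUM, 0,
     &                MPI_COMM_WORLD, ierror)
      if (myrank .eq. 0) print *, 'global sum =', total
      call MPI_FINALIZE(ierror)
      end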


4) “Shared Memory”

Fortran directives


Fortran directives -- examples

CRAY:
CMIC$ DO ALL
      do i=1,n
         a(i)=i
      enddo

SGI:
C$DOACROSS
      do i=1,n
         a(i)=i
      enddo

OpenMP:
C$OMP parallel do
      do i=1,n
         a(i)=i
      enddo


OpenMP Summary

OpenMP standard – first published Oct 1997

Directives

Run-time Library Routines

Environment Variables

Versions for f77, f90, c, c++


OpenMP Summary

Parallel Do Directive

c$omp parallel do private(I) shared(a)
      do I=1,n
         a(I) = I+1
      enddo
c$omp end parallel do      (optional)


OpenMP Summary

Defining a Parallel Region - Individual Do Loops

c$omp parallel shared(a,b)
c$omp do private(j)
      do j=1,n
         a(j)=j
      enddo
c$omp end do nowait
c$omp do private(k)
      do k=1,n
         b(k)=k
      enddo
c$omp end do
c$omp end parallel


OpenMP Summary

Parallel Do Directive - Clauses

shared
private
default(private|shared|none)
reduction({operator|intrinsic}:var)
if(scalar_logical_expression)
ordered
copyin(var)
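A short sketch of how some of these clauses combine on one directive (an assumed example, not from the deck; names and values are illustrative): the reduction clause gives each thread a private partial sum that is combined at the end, and the if clause disables threading when the trip count is small.

c     clause example: reduction and if on a parallel do
      program clause_demo
      integer n, i
      parameter (n=100000)
      real a(n), s
      do i = 1, n
         a(i) = 1.0
      enddo
      s = 0.0
c$omp parallel do private(i) shared(a) reduction(+:s) if(n.gt.1000)
      do i = 1, n
         s = s + a(i)
      enddo
c$omp end parallel do
      print *, 'sum =', s
      end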


OpenMP Summary

Run-Time Library Routines

Execution environment

omp_set_num_threads
omp_get_num_threads
omp_get_max_threads
omp_get_thread_num
omp_get_num_procs
omp_set_dynamic / omp_get_dynamic
omp_set_nested / omp_get_nested
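A small sketch (assumed example, not from the slides) of the execution-environment routines: each thread reports its id and the team size. Note that omp_get_num_threads returns 1 outside a parallel region.

c     run-time library example: query thread number and team size
      program threads_demo
      integer omp_get_thread_num, omp_get_num_threads
      call omp_set_num_threads(4)
c$omp parallel
      print *, 'thread', omp_get_thread_num(),
     &         'of', omp_get_num_threads()
c$omp end parallel
      end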


OpenMP Summary

Run-Time Library Routines

Lock routines

omp_init_lock
omp_destroy_lock
omp_set_lock
omp_unset_lock
omp_test_lock
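A sketch of the lock routines (assumed example, not from the deck; the lock-variable declaration relies on the omp_lib module, so an f90-capable OpenMP compiler is assumed): the lock serializes updates to a shared counter.

c     lock example: serialize updates to a shared counter
      program lock_demo
      use omp_lib
      integer (kind=omp_lock_kind) :: lck
      integer :: count
      count = 0
      call omp_init_lock(lck)
c$omp parallel shared(count,lck)
      call omp_set_lock(lck)
      count = count + 1
      call omp_unset_lock(lck)
c$omp end parallel
      call omp_destroy_lock(lck)
      print *, 'count =', count
      end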


OpenMP Summary

Environment Variables

OMP_NUM_THREADS
OMP_DYNAMIC
OMP_NESTED
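For example, on a csh-type shell one would typically set the thread count with something like "setenv OMP_NUM_THREADS 4" before starting the program; the run-time routines above can then query or adjust it.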


RISC memory levels

[Diagram: Single CPU -- CPU with cache and main memory]



RISC memory levels

[Diagram: Multiple CPU's -- CPU 0 with Cache 0 and CPU 1 with Cache 1, both sharing one main memory]



A sample program

      subroutine xmult (x1,x2,y1,y2,z1,z2,n)
      real x1(n),x2(n),y1(n),y2(n),z1(n),z2(n)
      real a,b,c,d
      do i=1,n
         a=x1(i)*x2(i); b=y1(i)*y2(i)
         c=x1(i)*y2(i); d=x2(i)*y1(i)
         z1(i)=a-b; z2(i)=c+d
      enddo
      end


A sample program

      subroutine xmult (x1,x2,y1,y2,z1,z2,n)
      real x1(n),x2(n),y1(n),y2(n),z1(n),z2(n)
      real a,b,c,d
c$omp parallel do
      do i=1,n
         a=x1(i)*x2(i); b=y1(i)*y2(i)
         c=x1(i)*y2(i); d=x2(i)*y1(i)
         z1(i)=a-b; z2(i)=c+d
      enddo
      end


A sample program

Run on Technion origin2000
Vector length = 1,000,000
Loop repeated 50 times
Compiler optimization: low (-O1)

Elapsed time, sec:

                     threads
  Compile          1      2      4
  No parallel    15.0   15.3
  Parallel       16.0   26.0   26.8

Is this running in parallel?


A sample program

Run on Technion origin2000
Vector length = 1,000,000
Loop repeated 50 times
Compiler optimization: low (-O1)

Elapsed time, sec:

                     threads
  Compile          1      2      4
  No parallel    15.0   15.3
  Parallel       16.0   26.0   26.8

Is this running in parallel? WHY NOT?


A sample program

c$omp parallel do
      do i=1,n
         a=x1(i)*x2(i); b=y1(i)*y2(i)
         c=x1(i)*y2(i); d=x2(i)*y1(i)
         z1(i)=a-b; z2(i)=c+d
      enddo

Is this running in parallel? WHY NOT?

Answer: by default, variables a,b,c,d are defined as SHARED


A sample program

Solution: define a,b,c,d as PRIVATE:

  c$omp parallel do private(a,b,c,d)

Elapsed time, sec:

                     threads
  Compile          1      2      4
  No parallel    15.0   15.3
  Parallel       16.0    8.5    4.6

This is now running in parallel
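Putting the fix together with the earlier listing, the corrected routine would look like this (assembled from the preceding slides rather than copied verbatim from the deck):

      subroutine xmult (x1,x2,y1,y2,z1,z2,n)
      real x1(n),x2(n),y1(n),y2(n),z1(n),z2(n)
      real a,b,c,d
c     each thread now gets its own copy of the scratch variables a,b,c,d
c$omp parallel do private(a,b,c,d)
      do i=1,n
         a=x1(i)*x2(i); b=y1(i)*y2(i)
         c=x1(i)*y2(i); d=x2(i)*y1(i)
         z1(i)=a-b; z2(i)=c+d
      enddo
      end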


5) Case Study


HPC in the Technion

SGI Origin2000
  22 cpu (R10000, 250 MHz)
  Total memory -- 5.6 GB

PC cluster (linux redhat 6.1)
  6 cpu (Pentium II, 400 MHz)
  Memory - 500 MB/cpu


Fluent test case -- Stability of a subsonic turbulent jet

Source: Viktoria Suponitsky, Faculty of Aerospace Engineering, Technion


Reading "Case25unstead.cas"...

10000 quadrilateral cells, zone 1, binary.

19800 2D interior faces, zone 9, binary.

50 2D wall faces, zone 3, binary.

100 2D pressure-inlet faces, zone 7, binary.

50 2D pressure-outlet faces, zone 5, binary.

50 2D pressure-outlet faces, zone 6, binary.

50 2D velocity-inlet faces, zone 2, binary.

100 2D axis faces, zone 4, binary.

10201 nodes, binary.

10201 node flags, binary.

Fluent test case

10 time steps, 20 iterations per time step



Fluent test case

SMP command: fluent 2d -t8 -psmpi -g < inp

Host spawning Node 0 on machine "parix".

ID    Comm.  Hostname  O.S.   PID    Mach ID  HW ID  Name
----------------------------------------------------------
host  net    parix     irix   19732  0        7      Fluent Host
n7    smpi   parix     irix   19776  0        7      Fluent Node
n6    smpi   parix     irix   19775  0        6      Fluent Node
n5    smpi   parix     irix   19771  0        5      Fluent Node
n4    smpi   parix     irix   19770  0        4      Fluent Node
n3    smpi   parix     irix   19772  0        3      Fluent Node
n2    smpi   parix     irix   19769  0        2      Fluent Node
n1    smpi   parix     irix   19768  0        1      Fluent Node
n0*   smpi   parix     irix   19767  0        0      Fluent Node


Fluent test case

Cluster command:
fluent 2d -cnf=clinux1,clinux2,clinux3,clinux4,clinux5,clinux6 -t6 -pnet -g < inp

Node 0 spawning Node 5 on machine "clinux6".

ID    Comm.  Hostname  O.S.        PID    Mach ID  HW ID  Name
---------------------------------------------------------------
n5    net    clinux6   linux-ia32  3560   5        9      Fluent Node
n4    net    clinux5   linux-ia32  19645  4        8      Fluent Node
n3    net    clinux4   linux-ia32  16696  3        7      Fluent Node
n2    net    clinux3   linux-ia32  17259  2        6      Fluent Node
n1    net    clinux2   linux-ia32  18328  1        5      Fluent Node
host  net    clinux1   linux-ia32  10358  0        3      Fluent Host
n0*   net    clinux1   linux-ia32  10400  0       -1      Fluent Node


[Chart: Fluent test - time for multiple cpu's -- total run time vs. number of cpu's (1-8), for origin2000 and pc cluster]


[Chart: Fluent test - speedup by cpu's -- speedup vs. number of cpu's (1-8), for origin2000 and pc cluster, against the ideal line]


TOP500 (November 2, 2000)


