Parallelizing the Jacobi Iteration Algorithm. Alberto Rodriguez, Higher Institute of Technologies and Applied Sciences (InSTEC), October 14, 2016. Sections: Serial Approach; Using OpenMP (1D Decomposition, 2D Decomposition); Using MPI (1D Decomposition).
Page 1: Parallelizing the Jacobi Iteration Algorithm (indico.ictp.it/event/7659/session/19/contribution/86/material/slides/0.pdf)

Parallelizing the Jacobi Iteration Algorithm

Alberto Rodriguez

Serial Approach

Using OpenMP

1D Decomposition

2D Decomposition

Using MPI

1D Decomposition

Parallelizing the Jacobi Iteration Algorithm

Alberto Rodriguez

Higher Institute of Technologies and Applied Sciences (InSTEC)

October 14, 2016

Outline

1 Serial Approach

2 Using OpenMP
    1D Decomposition
    2D Decomposition

3 Using MPI
    1D Decomposition

Serial Approach

Analyzing the Code

With a serial code we have:

Time complexity: $O(I \cdot N^2)$ for I iterations on an N × N matrix (nested loops recompute the whole matrix in each iteration).

Memory complexity: $O(N^2)$ (the size of the matrix).
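The two bounds above can be read off a small serial sketch. The slides do not show the loop body, so the four-neighbour average used here is an assumption, as are the names `jacobi_sweep` and `jacobi_serial` and the row-major layout. The sweep is the $N^2$ part and the driver repeats it once per iteration; the scratch matrix doubles the storage but keeps it $O(N^2)$.

```c
#include <assert.h>
#include <stdlib.h>

/* One Jacobi sweep over an n x n row-major grid.
   ASSUMPTION: each interior cell becomes the average of its four
   neighbours in `old`; the slides elide the real loop body. */
static void jacobi_sweep(const double *old, double *next, int n)
{
    for (int i = 1; i < n - 1; i++)
        for (int j = 1; j < n - 1; j++)
            next[i * n + j] = 0.25 * (old[(i - 1) * n + j] + old[(i + 1) * n + j]
                                    + old[i * n + j - 1] + old[i * n + j + 1]);
}

/* Runs `iterations` sweeps in place on `u`; boundary cells stay fixed.
   Returns 0 on success, -1 if the scratch matrix cannot be allocated. */
int jacobi_serial(double *u, int n, int iterations)
{
    assert(n >= 3 && iterations >= 0);
    size_t cells = (size_t)n * (size_t)n;
    double *tmp = malloc(cells * sizeof *tmp);
    if (!tmp) return -1;
    for (size_t k = 0; k < cells; k++) tmp[k] = u[k];  /* copy boundary values */

    double *old = u, *next = tmp;
    for (int it = 0; it < iterations; it++) {
        jacobi_sweep(old, next, n);
        double *t = old; old = next; next = t;  /* newest values are now in old */
    }
    if (old != u)                               /* odd iteration count: copy back */
        for (size_t k = 0; k < cells; k++) u[k] = old[k];
    free(tmp);
    return 0;
}
```

Swapping the two pointers instead of copying the matrix after every sweep keeps the per-iteration cost at one pass over the grid.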

We expect:

Long run times for larger matrices and/or large numbers of iterations.

The matrix size is limited by the system (the memory of the node).
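To put a number on the memory limit, here is a minimal sketch that computes the storage for square matrices of doubles. The function name `jacobi_bytes` and the choice of counting two matrices (old and new) are assumptions; the slides do not say how many copies the code keeps. With 8-byte doubles, one 16384 × 16384 matrix takes 2 GiB, so an old/new pair needs 4 GiB.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes needed for `matrices` square matrices of doubles with side n.
   ASSUMPTION: dense row-major storage with no padding. */
size_t jacobi_bytes(size_t n, size_t matrices)
{
    assert(n > 0 && matrices > 0);
    return n * n * matrices * sizeof(double);
}
```

For example, `jacobi_bytes(8192, 2)` is 1 GiB on a platform where `double` is 8 bytes.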

Results

[Plot: t(s) versus Matrix Size (128 to 8192); time axis ticks from 0.000976562 to 64; series: O3]

Results

[Plot: t(s) versus Matrix Size (128 to 8192); time axis ticks from 0.000976562 to 64; series: O3, mavx]

Results

Data obtained from perf tool.

[Plot: % of cache misses versus Matrix Size (128 to 8192); vertical axis ticks from 8 to 128]

Using OpenMP

Which part of the code needs parallelization?

for (iCount = 1; iCount <= Iterations; iCount++)
    for (i = 0; i < Dimension; i++)
        for (j = 0; j < Dimension; j++) {...}

The outer loop can't be parallelized because the new matrix depends on the old matrix.

We can parallelize the inner loops.

Instructions using OpenMP

for (iCount = 1; iCount <= Iterations; iCount++)
    #pragma omp parallel for private(i, j)
    for (i = 0; i < Dimension; i++)
        for (j = 0; j < Dimension; j++) {...}
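A compilable sketch of the row-parallel sweep above may help. The loop body is elided on the slide, so the four-neighbour average, the interior-only loop bounds, and the name `sweep_rows` are assumptions made for illustration. Because `i` and `j` are declared inside the loops here, they are private to each thread without a `private(i, j)` clause; the clause on the slide is needed when the counters are declared outside the parallel region.

```c
#include <assert.h>

/* One Jacobi sweep over an n x n row-major grid, rows shared among threads.
   ASSUMPTION: four-neighbour average; the slides elide the loop body.
   Built without -fopenmp, the pragma is ignored and the loop runs serially. */
void sweep_rows(const double *old, double *next, int n)
{
    assert(n >= 3);
    /* Each thread receives a contiguous group of rows (values of i). */
    #pragma omp parallel for
    for (int i = 1; i < n - 1; i++)
        for (int j = 1; j < n - 1; j++)
            next[i * n + j] = 0.25 * (old[(i - 1) * n + j] + old[(i + 1) * n + j]
                                    + old[i * n + j - 1] + old[i * n + j + 1]);
}
```

Threads only read `old` and each writes a disjoint set of rows of `next`, so no synchronization is needed inside a sweep; the implicit barrier at the end of the parallel loop separates one iteration from the next.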

Results

[Plots: time (s) versus Number of Threads (0 to 70), one panel per Matrix Size: 256, 1024, 4096, 16384]

Instructions using OpenMP

for (iCount = 1; iCount <= Iterations; iCount++)
    #pragma omp parallel for collapse(2) private(i, j)
    for (i = 0; i < Dimension; i++)
        for (j = 0; j < Dimension; j++) {...}
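The `collapse(2)` variant can be sketched the same way. As before, the four-neighbour average, the interior-only bounds, and the name `sweep_collapsed` are assumptions, since the slide elides the loop body. The clause asks OpenMP to merge the `i` and `j` loops into one iteration space of (n - 2) × (n - 2) cells and to divide that space among the threads, which requires the two loops to be perfectly nested with nothing between them.

```c
#include <assert.h>

/* One Jacobi sweep with the i and j loops collapsed into a single
   iteration space that is divided among the threads.
   ASSUMPTION: four-neighbour average; the slides elide the loop body.
   Built without -fopenmp, the pragma is ignored and the loop runs serially. */
void sweep_collapsed(const double *old, double *next, int n)
{
    assert(n >= 3);
    #pragma omp parallel for collapse(2)
    for (int i = 1; i < n - 1; i++)
        for (int j = 1; j < n - 1; j++)
            next[i * n + j] = 0.25 * (old[(i - 1) * n + j] + old[(i + 1) * n + j]
                                    + old[i * n + j - 1] + old[i * n + j + 1]);
}
```

One plausible reason to try this form is that it exposes more units of work than there are rows, which can matter when the thread count is close to the matrix dimension; whether it pays off depends on the machine.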

Results

[Plots: time (s) versus Number of Threads (0 to 70), one panel per Matrix Size: 256, 1024, 4096, 16384]

Results

Comparison between 1D and 2D decomposition.

[Plots: time (s) versus Number of Threads (0 to 70), one panel per Matrix Size: 256, 1024, 4096, 16384; series: 1D, 2D]

Using MPI

How to divide the matrix

We want:

Minimize cache misses:

Divide the matrix into groups of rows. (In C, the elements of a row are contiguous in memory.)

Every process has roughly the same amount of work:

Every process will have the same number of rows. If that is not possible, the last ones will have 1 more row.

Every process accesses some data that is stored in the previous process and in the next one:

We need to allocate 2 more rows for each process (ghost or boundary cells).
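The division rule above can be written as a small helper. This is a sketch under stated assumptions: the `RowBlock` type, the `row_block` function, and the use of a zero-based rank are hypothetical names introduced here, not taken from the slides. It gives every process the same number of rows, hands one extra row to the last processes when the division is not exact, and adds the two ghost rows to the allocation.

```c
#include <assert.h>

typedef struct {
    int first_row;   /* global index of the first row this process owns */
    int owned_rows;  /* rows this process updates */
    int alloc_rows;  /* owned rows plus two ghost rows, as on the slide */
} RowBlock;

/* Splits `rows` matrix rows among `nprocs` processes; `rank` is 0-based. */
RowBlock row_block(int rows, int nprocs, int rank)
{
    assert(nprocs > 0 && rank >= 0 && rank < nprocs && rows >= nprocs);
    int base = rows / nprocs;
    int extra = rows % nprocs;       /* this many processes get one more row */
    int first_big = nprocs - extra;  /* ranks >= first_big own base + 1 rows */
    RowBlock b;
    b.owned_rows = base + (rank >= first_big ? 1 : 0);
    /* Every earlier rank contributes base rows, plus one for each earlier big rank. */
    b.first_row = rank * base + (rank > first_big ? rank - first_big : 0);
    b.alloc_rows = b.owned_rows + 2;
    return b;
}
```

With 10 rows and 4 processes, ranks 0 and 1 own 2 rows each (starting at rows 0 and 2) and ranks 2 and 3 own 3 rows each (starting at rows 4 and 7).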

How to divide the matrix

Communications

How many communications are we going to have between processes?

Every process (except the first and the last ones) needs to send data to the previous process and to the next one.

Every process (except the first and the last ones) needs to receive data from the previous process and the next one in order to update its ghost cells.
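The exchange pattern described above can be illustrated without MPI by simulating all processes in one program. This is only a sketch: the function `exchange_ghost_rows`, the block layout (ghost row, owned rows, ghost row), and the use of `memcpy` in place of real messages are assumptions made for illustration. In an MPI program, each copy would instead be a matched send and receive between neighbouring ranks, for example with `MPI_Sendrecv`.

```c
#include <assert.h>
#include <string.h>

/* blocks[p] holds (owned[p] + 2) rows of m doubles for simulated process p:
   row 0 and row owned[p] + 1 are ghost rows, rows 1..owned[p] are owned.
   Each memcpy stands in for one send/receive pair between neighbours. */
void exchange_ghost_rows(double **blocks, const int *owned, int nprocs, int m)
{
    assert(nprocs > 0 && m > 0);
    size_t row_bytes = (size_t)m * sizeof(double);
    for (int p = 0; p + 1 < nprocs; p++) {
        double *upper = blocks[p];      /* process p */
        double *lower = blocks[p + 1];  /* its next neighbour */
        /* p sends its last owned row; p + 1 receives it in its top ghost row. */
        memcpy(lower, upper + (size_t)owned[p] * m, row_bytes);
        /* p + 1 sends its first owned row; p receives it in its bottom ghost row. */
        memcpy(upper + (size_t)(owned[p] + 1) * m, lower + m, row_bytes);
    }
}
```

The top ghost row of the first process and the bottom ghost row of the last one are never written, which matches the text above: those two processes have only one neighbour.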

Communications

How many values are communicated per iteration?

The program has N processes.

Every process needs to communicate four times.

In every communication an entire row is communicated.

Every row has M elements.

Math: $N \cdot 4 \cdot 1 \cdot M = 4 \cdot N \cdot M$
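As a worked check of the count above, the sketch below evaluates the slide's formula next to a slightly tighter count. The function names are hypothetical, and the tighter count rests on an assumption drawn from the Communications slide: the first and last processes have a single neighbour, so they communicate twice (one send, one receive) instead of four times. Under that assumption the total is $4 \cdot (N - 1) \cdot M$, and $4 \cdot N \cdot M$ acts as an upper bound.

```c
#include <assert.h>

/* The slide's count: every one of nprocs processes communicates four
   times per iteration, one row of m values each time. */
long values_slide(long nprocs, long m)
{
    assert(nprocs >= 1 && m >= 0);
    return 4 * nprocs * m;
}

/* ASSUMPTION: the two edge processes have one neighbour only, so they
   send one row and receive one row; interior processes do two of each. */
long values_with_edges(long nprocs, long m)
{
    assert(nprocs >= 1 && m >= 0);
    if (nprocs == 1) return 0;  /* a single process has nobody to talk to */
    return (4 * (nprocs - 2) + 2 * 2) * m;
}
```

For 4 processes and rows of 512 elements, the slide's formula gives 8192 values per iteration and the edge-aware count gives 6144; the gap is always $4 \cdot M$, so it becomes negligible as the number of processes grows.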

Results

[Plots: time (s) versus Number of processes (50 to 550), one panel per Matrix Size: 512, 2048, 8192]

Results

Comparison between AMD and Intel machines.

[Plots: time (s) versus Number of processes (60 to 260), one panel per Matrix Size: 512, 2048, 8192; series: AMD, Intel]

