Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.
Chapter 5
The Sieve of Eratosthenes
Chapter Objectives
Analysis of block allocation schemes
Function MPI_Bcast
Performance enhancements
Outline
Sequential algorithm
Sources of parallelism
Data decomposition options
Parallel algorithm development, analysis
MPI program
Benchmarking
Optimizations
Sequential Algorithm
[Figure: the integers 2 through 61 laid out in a grid. Successive passes mark the multiples of 2, then 3, then 5, then 7, leaving the primes unmarked.]
Complexity: $\Theta(n \ln \ln n)$
Pseudocode
1. Create list of unmarked natural numbers 2, 3, …, n
2. k ← 2
3. Repeat
   (a) Mark all multiples of k between $k^2$ and n
   (b) k ← smallest unmarked number > k
   until $k^2 > n$
4. The unmarked numbers are primes
Sources of Parallelism
Domain decomposition
  Divide data into pieces
  Associate computational steps with data
One primitive task per array element
Making 3(a) Parallel
Mark all multiples of k between $k^2$ and n
for all j where $k^2 \le j \le n$ do
   if j mod k = 0 then
      mark j (it is not a prime)
   endif
endfor
Making 3(b) Parallel
Find smallest unmarked number > k
Min-reduction (to find smallest unmarked number > k)
Broadcast (to get result to all tasks)
Agglomeration Goals
Consolidate tasks
Reduce communication cost
Balance computations among processes
Data Decomposition Options
Interleaved (cyclic)
  Easy to determine "owner" of each index
  Leads to load imbalance for this problem
Block
  Balances loads
  More complicated to determine owner if n is not a multiple of p
Block Decomposition Options
Want to balance workload when n is not a multiple of p
Each process gets either $\lceil n/p \rceil$ or $\lfloor n/p \rfloor$ elements
Seek simple expressions
  Find low, high indices given an owner
  Find owner given an index
Method #1
Let r = n mod p
If r = 0, all blocks have same size
Else
  First r blocks have size $\lceil n/p \rceil$
  Remaining p − r blocks have size $\lfloor n/p \rfloor$
Examples
17 elements divided among 7 processes
17 elements divided among 5 processes
17 elements divided among 3 processes
Method #1 Calculations
First element controlled by process i: $i \lfloor n/p \rfloor + \min(i, r)$
Last element controlled by process i: $(i+1) \lfloor n/p \rfloor + \min(i+1, r) - 1$
Process controlling element j: $\max(\lfloor j / (\lfloor n/p \rfloor + 1) \rfloor, \lfloor (j - r) / \lfloor n/p \rfloor \rfloor)$
Method #2
Scatters larger blocks among processes
First element controlled by process i: $\lfloor in/p \rfloor$
Last element controlled by process i: $\lfloor (i+1)n/p \rfloor - 1$
Process controlling element j: $\lfloor (p(j+1) - 1)/n \rfloor$
Examples
17 elements divided among 7 processes
17 elements divided among 5 processes
17 elements divided among 3 processes
Comparing Methods
Operations    Method 1    Method 2
Low index     4           2
High index    6           4
Owner         7           4
Assuming no operations for "floor" function
Our choice: Method 2
Pop Quiz
Illustrate how block decomposition method #2 would divide 13 elements among 5 processes.
$\lfloor 13(0)/5 \rfloor = 0$
$\lfloor 13(1)/5 \rfloor = 2$
$\lfloor 13(2)/5 \rfloor = 5$
$\lfloor 13(3)/5 \rfloor = 7$
$\lfloor 13(4)/5 \rfloor = 10$
Block Decomposition Macros
#define BLOCK_LOW(id,p,n) ((id)*(n)/(p))
#define BLOCK_HIGH(id,p,n) \
        (BLOCK_LOW((id)+1,p,n)-1)
#define BLOCK_SIZE(id,p,n) \
        (BLOCK_LOW((id)+1,p,n)-BLOCK_LOW(id,p,n))
#define BLOCK_OWNER(index,p,n) \
        (((p)*((index)+1)-1)/(n))
Local vs. Global Indices
Process    Local indices (L)    Global indices (G)
0          0 1                  0 1
1          0 1 2                2 3 4
2          0 1                  5 6
3          0 1 2                7 8 9
4          0 1 2                10 11 12
Looping over Elements
Sequential program
for (i = 0; i < n; i++) {
   …
}
Parallel program
size = BLOCK_SIZE (id,p,n);
for (i = 0; i < size; i++) {
   gi = i + BLOCK_LOW(id,p,n);
}
Index i on this process takes the place of the sequential program's index gi
Decomposition Affects Implementation
Largest prime used to sieve is $\sqrt{n}$
First process has $\lfloor n/p \rfloor$ elements
It has all sieving primes if $p < \sqrt{n}$
First process always broadcasts next sieving prime
No reduction step needed
Fast Marking
Block decomposition allows same marking as sequential algorithm:
  j, j + k, j + 2k, j + 3k, …
instead of
  for all j in block
    if j mod k = 0 then mark j (it is not a prime)
Parallel Algorithm Development
1. Create list of unmarked natural numbers 2, 3, …, n (each process creates its share of list)
2. k ← 2 (each process does this)
3. Repeat
   (a) Mark all multiples of k between $k^2$ and n (each process marks its share of list)
   (b) k ← smallest unmarked number > k (process 0 only)
   (c) Process 0 broadcasts k to rest of processes
   until $k^2 > n$
4. The unmarked numbers are primes
5. Reduction to determine number of primes
Function MPI_Bcast
int MPI_Bcast (
void *buffer, /* Addr of 1st element */
int count, /* # elements to broadcast */
MPI_Datatype datatype, /* Type of elements */
int root, /* ID of root process */
MPI_Comm comm) /* Communicator */
MPI_Bcast (&k, 1, MPI_INT, 0, MPI_COMM_WORLD);
Task/Channel Graph
Analysis
$\chi$ is time needed to mark a cell
Sequential execution time: $\chi n \ln \ln n$
Number of broadcasts: $\sqrt{n} / \ln \sqrt{n}$
Broadcast time: $\lambda \lceil \log p \rceil$
Expected execution time:
$\chi (n \ln \ln n)/p + (\sqrt{n} / \ln \sqrt{n}) \lambda \lceil \log p \rceil$
Code (1/4)
#include <mpi.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>   /* atoi, malloc, exit */
#include "MyMPI.h"
#define MIN(a,b) ((a)<(b)?(a):(b))

int main (int argc, char *argv[])
{
   ...
   MPI_Init (&argc, &argv);
   MPI_Barrier(MPI_COMM_WORLD);
   elapsed_time = -MPI_Wtime();
   MPI_Comm_rank (MPI_COMM_WORLD, &id);
   MPI_Comm_size (MPI_COMM_WORLD, &p);
   if (argc != 2) {
      if (!id) printf ("Command line: %s <m>\n", argv[0]);
      MPI_Finalize();
      exit (1);
   }
Code (2/4)
   n = atoi(argv[1]);
   low_value = 2 + BLOCK_LOW(id,p,n-1);
   high_value = 2 + BLOCK_HIGH(id,p,n-1);
   size = BLOCK_SIZE(id,p,n-1);
   proc0_size = (n-1)/p;
   if ((2 + proc0_size) < (int) sqrt((double) n)) {
      if (!id) printf ("Too many processes\n");
      MPI_Finalize();
      exit (1);
   }

   marked = (char *) malloc (size);
   if (marked == NULL) {
      printf ("Cannot allocate enough memory\n");
      MPI_Finalize();
      exit (1);
   }
Code (3/4)
   for (i = 0; i < size; i++) marked[i] = 0;
   if (!id) index = 0;
   prime = 2;
   do {
      /* Local index of the first multiple of prime in this block */
      if (prime * prime > low_value)
         first = prime * prime - low_value;
      else {
         if (!(low_value % prime)) first = 0;
         else first = prime - (low_value % prime);
      }
      /* Fast marking: stride through the block by prime */
      for (i = first; i < size; i += prime) marked[i] = 1;
      /* Process 0 finds the next sieving prime */
      if (!id) {
         while (marked[++index]);
         prime = index + 2;
      }
      MPI_Bcast (&prime, 1, MPI_INT, 0, MPI_COMM_WORLD);
   } while (prime * prime <= n);
Code (4/4)
   count = 0;
   for (i = 0; i < size; i++)
      if (!marked[i]) count++;
   MPI_Reduce (&count, &global_count, 1, MPI_INT, MPI_SUM,
      0, MPI_COMM_WORLD);
   elapsed_time += MPI_Wtime();
   if (!id) {
      printf ("%d primes are less than or equal to %d\n",
         global_count, n);
      printf ("Total elapsed time: %10.6f\n", elapsed_time);
   }
   MPI_Finalize ();
   return 0;
}
Benchmarking
Execute sequential algorithm
Determine $\chi$ = 85.47 nanosec
Execute series of broadcasts
Determine $\lambda$ = 250 $\mu$sec
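
As a worked example, the measured constants can be substituted into the expected execution time from the analysis. Two assumptions are made here: the problem size is taken to be $n = 10^8$ (not stated in this section, but it reproduces the single-processor prediction of 24.900 sec), and the logarithm in the broadcast term is taken base 2. The case shown is p = 8.

```latex
\begin{align*}
T(n,p) &= \chi \,\frac{n \ln \ln n}{p}
          + \frac{\sqrt{n}}{\ln \sqrt{n}} \,\lambda \,\lceil \log_2 p \rceil \\[4pt]
\chi\, n \ln \ln n &= (85.47 \times 10^{-9})(10^{8}) \ln(18.42)
          \approx 8.547 \times 2.9135 \approx 24.90 \text{ sec} \\[4pt]
\frac{\sqrt{n}}{\ln \sqrt{n}} &= \frac{10^{4}}{9.210} \approx 1086 \text{ broadcasts} \\[4pt]
T(10^{8}, 8) &\approx \frac{24.90}{8} + 1086 \times (250 \times 10^{-6}) \times 3
          \approx 3.113 + 0.814 \approx 3.93 \text{ sec}
\end{align*}
```

Under these assumptions the communication term accounts for roughly a fifth of the predicted time at 8 processors.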
Execution Times (sec)
Processors    Predicted    Actual (sec)
1             24.900       24.900
2             12.721       13.011
3             8.843        9.039
4             6.768        7.055
5             5.794        5.993
6             4.964        5.159
7             4.371        4.687
8             3.927        4.222
Improvements
Delete even integers
  Cuts number of computations in half
  Frees storage for larger values of n
Each process finds own sieving primes
  Replicating computation of primes to $\sqrt{n}$
  Eliminates broadcast step
Reorganize loops
  Increases cache hit rate
Reorganize Loops
[Figure: two loop organizations compared by cache hit rate, one labeled lower and the other higher.]
Comparing 4 Versions
Procs    Sieve 1    Sieve 2    Sieve 3    Sieve 4
1        24.900     12.237     12.466     2.543
2        12.721     6.609      6.378      1.330
3        8.843      5.019      4.272      0.901
4        6.768      4.072      3.201      0.679
5        5.794      3.652      2.559      0.543
6        4.964      3.270      2.127      0.456
7        4.371      3.059      1.820      0.391
8        3.927      2.856      1.585      0.342
Callouts on the slide: 10-fold improvement; 7-fold improvement
Summary
Sieve of Eratosthenes: parallel design uses domain decomposition
Compared two block distributions
  Chose one with simpler formulas
Introduced MPI_Bcast
Optimizations reveal importance of maximizing single-processor performance