Numerical reproducibility for exascale HPC

Chemseddine CHOHRA

Université de Perpignan Via Domitia (UPVD)

16 June 2014

Chemseddine CHOHRA (Université de Perpignan Via Domitia (UPVD))Numerical reproducibility for exascale HPC 16 Juin 2014 1 / 67

Introduction and problematic

Limited machine precision.
  Floating-point numbers are used as approximations.
  x → X = fl(x) if x ∉ F, or X = x if x ∈ F.
  In general, X + Y ≠ X ⊕ Y = fl(X + Y).

Non-associativity of addition.
  In general, A ⊕ (B ⊕ C) ≠ (A ⊕ B) ⊕ C.
  For instance, with M = 2^53: (-M ⊕ M) ⊕ 1 ≠ -M ⊕ (M ⊕ 1).
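
The M = 2^53 example can be checked directly. The following is a minimal sketch, assuming IEEE 754 binary64 arithmetic with round-to-nearest (which is what Python floats provide on common platforms); the variable names are illustrative only.

```python
# Floating-point addition is not associative: a small demonstration.
M = 2.0 ** 53            # above 2^53, consecutive doubles are 2 apart

left = (-M + M) + 1.0    # -M + M is exactly 0, so adding 1 gives 1
right = -M + (M + 1.0)   # M + 1 is not representable and rounds back to M

print(left, right)
```

The two groupings differ by 1, although both evaluate the same mathematical expression.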

Figure 1.1 : No reproducibility of summation

Non-reproducibility of summation on parallel systems.
  A problem for debugging.
  A problem for validating results.

Guarantee reproducibility for BLAS.
  Level 1: max, min, scal, axpy, norm, asum, dot.
  dot can be transformed into a summation problem.
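
To illustrate why a parallel sum may not be reproducible, here is a small sketch that imitates a reduction with a varying number of chunks. The helper `chunked_sum` and the data are hypothetical and only model the order of operations; no real threads are involved. `math.fsum` from the Python standard library serves as a correctly rounded reference.

```python
import math

def chunked_sum(values, num_chunks):
    """Sum each chunk left to right, then add the partial sums in order."""
    size = math.ceil(len(values) / num_chunks)
    partials = []
    for start in range(0, len(values), size):
        partial = 0.0
        for v in values[start:start + size]:
            partial += v
        partials.append(partial)
    total = 0.0
    for p in partials:
        total += p
    return total

M = 2.0 ** 53
data = [M, 1.0, 1.0, -M]            # the exact sum is 2

one_chunk = chunked_sum(data, 1)    # same order as a sequential loop
two_chunks = chunked_sum(data, 2)   # same order as a 2-way reduction
exact = math.fsum(data)             # correctly rounded reference

print(one_chunk, two_chunks, exact)
```

The same data gives a different result for each partitioning, and neither matches the exact sum. An exact summation algorithm removes this dependence on the partitioning.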

Outline

1 Introduction and problematic

2 Solution

3 Optimization

4 Parallelism

5 Conclusion and Work in progress

Solution

Ensure reproducibility.
  1 Static scheduling and deterministic reduction.
  2 Demmel and Nguyen's solutions (2013).

Use an exact summation algorithm (always reproducible).

Exact summation

How to compute an RTN (rounded-to-nearest) sum faster?
  We have known how to compute one since the 1970s.

Several algorithms have been proposed.
  FastSum (2006).
  AccSum (2008).
  FastAccSum (2008).
  iFastSum (2009).
  HybridSum (2009).
  OnlineExact (2010).

Error-free transformation of two floats

Algorithms to compute rounding errors.

TwoSum: requires 6 flops.

FastTwoSum: requires 3 flops, but the summands must be ordered by magnitude.

TwoSum(A, B) = (S, E) such that A ⊕ B = S and A + B = S + E.

TwoSum and FastTwoSum
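
The slide's figure did not survive extraction. As a stand-in, here is a minimal sketch of the two classical error-free transformations, Knuth's TwoSum and Dekker's FastTwoSum, assuming binary64 round-to-nearest arithmetic and no overflow. The function names are illustrative, and `fractions.Fraction` is used only to verify exactness.

```python
from fractions import Fraction

def two_sum(a, b):
    """Knuth's TwoSum: 6 floating-point operations, no ordering required."""
    s = a + b
    b_virtual = s - a
    a_virtual = s - b_virtual
    e = (a - a_virtual) + (b - b_virtual)
    return s, e

def fast_two_sum(a, b):
    """Dekker's FastTwoSum: 3 operations, valid when |a| >= |b|."""
    s = a + b
    e = b - (s - a)
    return s, e

a, b = 2.0 ** 53, 1.0
s, e = two_sum(b, a)          # any order of the arguments works
fs, fe = fast_two_sum(a, b)   # larger magnitude must come first

# Exact rational arithmetic confirms a + b == s + e with no rounding error.
assert Fraction(s) + Fraction(e) == Fraction(a) + Fraction(b)
```

Here S is the rounded sum and E is the part that rounding discarded, so the pair (S, E) represents A + B exactly.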

Exact summation

Given a vector of n floating-point numbers, we present some exact summation algorithms.

AccSum

Iterative algorithm.

Based on vector error free transformation.

Adapts automatically to the condition number of the sum.

Extracts and sums the high-order parts in each iteration.

Requires 4n flops per iteration.

ExtractScalar

Figure 2.1 : ExtractScalar
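
The figure itself is missing from this transcript. The sketch below shows the usual form of ExtractScalar as used by AccSum, assuming binary64 round-to-nearest arithmetic, sigma a power of two, and |p| much smaller than sigma. The names `extract_scalar`, `sigma`, `p`, and `low` are illustrative.

```python
def extract_scalar(sigma, p):
    """Split p into a high part q (a multiple of ulp(sigma)) and a low part."""
    q = (sigma + p) - sigma   # rounding in sigma + p discards the low bits of p
    low = p - q               # the discarded bits, recovered exactly
    return q, low

sigma = 2.0 ** 60             # assumed: a power of two much larger than |p|
p = 1234567.891
q, low = extract_scalar(sigma, p)

print(q, low)
```

Because every high part q is a multiple of ulp(sigma), which is 2^8 here, the high parts of many summands can be added without rounding error, and the low parts are passed on to the next iteration.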

n       cond     Iterations
10^3    10^8     2
10^3    10^24    3
10^6    10^24    4

Table 2.1: Number of iterations of the AccSum algorithm

FastAccSum

An improvement on AccSum.

FastAccSum requires only 3n flops per iteration, so it is theoretically 25% faster.

FastAccSum VS AccSum

Table 2.2 : Ratio of computing times AccSum / FastAccSum

iFastSum

Pure distillation algorithm.

Deletes zeros at the end of each iteration to reduce the size of the vector.
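
The distillation idea can be sketched as follows. This is a simplified illustration, not iFastSum itself: iFastSum adds further checks to obtain the correctly rounded result, which are omitted here. The function `distill` and its `max_passes` parameter are hypothetical, and binary64 round-to-nearest arithmetic is assumed.

```python
import math

def two_sum(a, b):
    """Error-free transformation: a + b == s + e exactly."""
    s = a + b
    b_virtual = s - a
    e = (a - (s - b_virtual)) + (b - b_virtual)
    return s, e

def distill(values, max_passes=64):
    """Apply TwoSum passes along the vector; each pass keeps the exact sum."""
    x = [v for v in values if v != 0.0]
    for _ in range(max_passes):
        y = list(x)
        for i in range(len(y) - 1):
            s, e = two_sum(y[i], y[i + 1])
            y[i], y[i + 1] = e, s            # error stays behind, sum moves on
        y = [v for v in y if v != 0.0]       # delete zeros to shrink the vector
        if y == x:                           # nothing changed: stop
            break
        x = y
    return x

M = 2.0 ** 53
data = [M, 1.0, 1.0, -M]
distilled = distill(data)

print(distilled, math.fsum(data))
```

On this input the vector shrinks from four entries to one, and that single entry is the exact sum.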

HybridSum

Splits the summands so that standard floating-point numbers can be used as high-precision accumulators.

Accumulate the summands with the same exponent in an appropriate accumulator.

Use iFastSum to sum the intermediate accumulators.

OnlineExact

Uses the same idea as HybridSum, but with two floating-point numbers per accumulator instead of splitting the summands.

Solution Compare algorithms

Hardware

Two sockets.

Xeon E5 (2.2 GHz, 8 cores).

Cache:
  L1: 32 KB.
  L2: 256 KB.
  L3: 20 MB, shared.

Maximum memory bandwidth: 51.2 GB/s.

Turbo Boost and multithreading are turned off.

Compiler

ICC -O3 -axCORE-AVX-I -fp-model double -fp-model strict -funroll-all-loops

-axCORE-AVX-I : To indicate instruction set.

-fp-model double : Rounds intermediate results to 53-bit precision.

-fp-model strict : Disable contractions.

-funroll-all-loops : Unroll loops.

Compare algorithms for cond = 10^8

Compare algorithms for cond = 10^32

Compare algorithms for different condition numbers

Runtime of AccSum for entries with different condition numbers

Runtime of FastAccSum for entries with different condition numbers

Runtime of HybridSum for entries with different condition numbers

Runtime of OnlineExact for entries with different condition numbers

Optimization

HybridSum

ALGORITHM HybridSum.
INPUT : A, an array of floating point summands.
OUTPUT : S, the correctly rounded sum of A.
BEGIN.
  1 Declare an intermediate array C.
  2 FOREACH element of A as a do.
      1 split(a, ah, al).
      2 i = exponent(ah).
      3 C[i] += ah.
      4 i = exponent(al).
      5 C[i] += al.
    END FOREACH.
  3 RETURN iFastSum(C).
END.

Optimized version: 8 elements are processed per loop iteration, data is prefetched for the following iterations, and the second exponent extraction is replaced by i = i - 27.

ALGORITHM HybridSum.
INPUT : A, an array of floating point summands.
OUTPUT : S, the correctly rounded sum of A.
BEGIN.
  1 Declare an intermediate array C.
  2 FOREACH 8 elements of A as a do.
      1 prefetch data for the next loops.
      2 split(a, ah, al).
      3 i = exponent(ah).
      4 C[i] += ah.
      5 i = i - 27.
      6 C[i] += al.
    END FOREACH.
  3 RETURN iFastSum(C).
END.
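
The HybridSum pseudocode can be sketched in Python as follows. This follows the unoptimized variant (no unrolling, no prefetching, exponent(al) instead of i - 27) and rests on several assumptions: binary64 round-to-nearest arithmetic, no overflow or subnormal inputs, and a moderate number of summands so that each accumulator stays exact. The split uses the classical Veltkamp/Dekker constant 2^27 + 1, a dictionary stands in for the array C, and `math.fsum` stands in for iFastSum. All function names are illustrative.

```python
import math
import random

SPLIT_FACTOR = 2.0 ** 27 + 1.0   # Veltkamp/Dekker splitting constant for doubles

def split(a):
    """Split a into high and low halves with a == high + low exactly."""
    c = SPLIT_FACTOR * a
    high = c - (c - a)
    return high, a - high

def exponent(x):
    """Binary exponent of x as reported by math.frexp."""
    return math.frexp(x)[1]

def hybrid_sum(values):
    acc = {}                                   # intermediate array C, keyed by exponent
    for a in values:
        for part in split(a):
            if part != 0.0:
                i = exponent(part)
                acc[i] = acc.get(i, 0.0) + part   # exact for moderately sized inputs
    return math.fsum(acc.values())             # stand-in for iFastSum(C)

random.seed(0)
data = [random.uniform(-1.0, 1.0) * 10.0 ** random.randint(-10, 10)
        for _ in range(1000)]
result = hybrid_sum(data)

print(result == math.fsum(data))
```

Since the half-width parts leave spare bits in each accumulator, the additions into C incur no rounding error, and only the final summation of the few accumulators needs an exact algorithm.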

Progress in optimization of HybridSum

OnlineExact

ALGORITHM OnlineExact.
INPUT : A, an array of floating point summands.
OUTPUT : S, the correctly rounded sum of A.
BEGIN.
  1 Declare two intermediate arrays C1, C2.
  2 FOREACH element of A as a do.
      1 i = exponent(a).
      2 (C1[i], a) = 2Sum(C1[i], a).
      3 C2[i] += a.
    END FOREACH.
  3 RETURN iFastSum(C1 ∪ C2).
END.

Optimized version: 8 elements are processed per loop iteration, data is prefetched for the following iterations, and the two arrays C1 and C2 are interleaved in a single array C.

ALGORITHM OnlineExact.
INPUT : A, an array of floating point summands.
OUTPUT : S, the correctly rounded sum of A.
BEGIN.
  1 Declare an intermediate array C.
  2 FOREACH 8 elements of A as a do.
      1 prefetch data for the next loops.
      2 i = exponent(a).
      3 (C[2*i], a) = 2Sum(C[2*i], a).
      4 C[2*i+1] += a.
    END FOREACH.
  3 RETURN iFastSum(C).
END.
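
A Python sketch of the OnlineExact pseudocode, under the same assumptions as before: binary64 round-to-nearest arithmetic, no overflow, and a moderate number of summands. Two dictionaries stand in for the accumulator pairs (C[2*i] and C[2*i+1] in the optimized version), and `math.fsum` stands in for iFastSum. The names are illustrative.

```python
import math
import random

def two_sum(a, b):
    """Error-free transformation: a + b == s + e exactly."""
    s = a + b
    b_virtual = s - a
    e = (a - (s - b_virtual)) + (b - b_virtual)
    return s, e

def online_exact(values):
    c1 = {}   # first accumulator per exponent
    c2 = {}   # second accumulator per exponent, collects the rounding errors
    for a in values:
        i = math.frexp(a)[1]                     # i = exponent(a)
        c1[i], err = two_sum(c1.get(i, 0.0), a)  # (C1[i], a) = 2Sum(C1[i], a)
        c2[i] = c2.get(i, 0.0) + err             # C2[i] += a
    return math.fsum(list(c1.values()) + list(c2.values()))  # stand-in for iFastSum

random.seed(0)
data = [random.uniform(-1.0, 1.0) * 10.0 ** random.randint(-10, 10)
        for _ in range(1000)]
result = online_exact(data)

print(result == math.fsum(data))
```

Compared with HybridSum, no summand is split; the rounding error of each accumulation is captured by TwoSum and stored in the second accumulator.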

Progress in optimization of OnlineExact

Parallelism

Architecture

OpenMP

latency

bandwidth

MPI

Hybrid

Parallelism Implementation and tests

Parallel HybridSum

Parallel OnlineExact

HybridSum

Figure 4.1 : Scaling of HybridSum

OnlineExact

Figure 4.2 : Scaling of OnlineExact

Scaling

Algorithm              1 core      2 cores     4 cores    8 cores    16 cores
HybridSum (cycles)     249855884   141459028   72300156   37207844   19066140
OnlineExact (cycles)   259167668   156856764   91386036   46004832   23156420
Hybrid / seq           1           0.5661      0.2893     0.1489     0.0763
Online / seq           1           0.6052      0.3526     0.1775     0.0893
Online / Hybrid        1.0372      1.1088      1.2639     1.2364     1.2145

Table 4.1: HybridSum vs OnlineExact
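
The cycle counts of Table 4.1 can be turned into speedups and parallel efficiencies with a few lines. The sketch below only restates arithmetic on the table's values; parallel efficiency (speedup divided by the number of cores) is a standard measure added here for illustration and is not reported in the slides.

```python
cores = [1, 2, 4, 8, 16]
cycles = {
    "HybridSum":   [249855884, 141459028, 72300156, 37207844, 19066140],
    "OnlineExact": [259167668, 156856764, 91386036, 46004832, 23156420],
}

def speedup(counts):
    """Sequential cycle count divided by the count on p cores."""
    return [counts[0] / c for c in counts]

def efficiency(counts):
    """Speedup divided by the number of cores used."""
    return [s / p for s, p in zip(speedup(counts), cores)]

for name, counts in cycles.items():
    print(name,
          [round(s, 2) for s in speedup(counts)],
          [round(e, 2) for e in efficiency(counts)])
```

The "/ seq" rows of the table are the reciprocals of these speedups.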

HybridSum weak scaling
Data size = 2^20 * number of cores

OnlineExact weak scaling
Data size = 2^20 * number of cores

Compare to other algorithms

Optimized sum.
  dasum: optimized by Intel in the MKL library.

Reproducible solutions.
  ReprodSum: guarantees reproducibility of results (based on "AccSum").
  FastReprodSum: faster than ReprodSum but requires a directed rounding mode (based on "FastAccSum").

ReprodSum

Figure 4.3 : How does it work ?

Sequential results

1 FastReprodSum is 2.2 times slower than dasum.

2 ReprodSum is 2.8 times slower than dasum.

3 Hybrid and Online are 4 times slower than dasum.

4 Hybrid and Online are 2 times slower than FastReprodSum.

5 Hybrid and Online are 1.5 times slower than ReprodSum.

4 cores parallel results

1 The same ratios, except for OnlineExact.

2 Poor scaling of OnlineExact.

16 cores parallel results

1 HybridSum is as fast as ReprodSum.

2 This is due to the bandwidth limit.

Scaling

Algorithm        1 core   2 cores   4 cores   8 cores   16 cores
HybridSum        1        1.72      3.46      6.72      13.11
OnlineExact      1        1.66      2.84      5.63      11.20
FastReprodSum    1        1.92      3.46      6.51      8.68
ReprodSum        1        1.95      2.84      5.63      9.43
dasum            1        1.89      3.47      5.21      7.80

Table 4.2: Scaling of the algorithms relative to the number of cores

Scaling of ReprodSum

Scaling of FastReprodSum

Scaling of dasum

Conclusion and Work in progress

Conclusion

More precision requires more computing time.

The fastest algorithms are neither precise nor reproducible.

We are trying to develop a reproducible BLAS that guarantees the best precision / performance ratio.

Work in progress

Tests on machines with more sockets and cores.

Generalize to dot.

Auto tuning.
