Mini Symposium
Adaptive Algorithms for Scientific Computing
• 9h45 Adaptive algorithms - Theory and applications. Jean-Louis Roch & al., AHA Team, INRIA-CNRS Grenoble, France
• 10h15 Hybrids in exact linear algebra. Dave Saunders, U. Delaware, USA
• 10h45 Adaptive programming with hierarchical multiprocessor tasks. Thomas Rauber, Gudula Rünger, U. Bayreuth, Germany
• 11h15 Cache-oblivious algorithms. Michael Bender, Stony Brook U., USA
• Adaptive, hybrid, oblivious : what do these terms mean ?
• Taxonomy of autonomic computing [Ganek & Corbi 2003] :
– Self-configuring / self-healing / self-optimising / self-protecting
• Objective: towards an analysis based on the algorithm performance
Adaptive algorithms : Theory and applications
Van Dat Cung, Jean-Guillaume Dumas, Thierry Gautier, Guillaume Huard, Bruno Raffin, Jean-Louis Roch, Denis Trystram
IMAG-INRIA Workgroup on “Adaptive and Hybrid Algorithms” Grenoble, France
Contents
I. Some criteria to analyze adaptive algorithms
II. Work-stealing and adaptive parallel algorithms
III. Adaptive parallel prefix computation
Why adaptive algorithms and how?
E.g. :
⎡ 7 3 6 ⎤
⎢ 0 1 8 ⎥
⎣ 0 0 5 ⎦
Input data vary. Resource availability is versatile.

Adaptation to improve performance :
• Scheduling : partitioning, load-balancing, work-stealing
• Measures on resources
• Measures on data
• Calibration : tuning parameters (block size / cache), choice of instructions, … ; priority managing
• Choices in the algorithm : sequential / parallel(s), approximated / exact, in-memory / out-of-core, …
An algorithm is « hybrid » iff there is a choice, at a high level, between at least two algorithms, each of which could solve the same problem.
Modeling a hybrid algorithm
• Several algorithms to solve the same problem f :
  – E.g. algo_f1, algo_f2(block size), …, algo_fk
  – each algo_fi being recursive
Adaptation : choose algo_fj for each call to f

algo_fi ( n, … ) {
  …
  f ( n-1, … ) ;
  …
  f ( n/2, … ) ;
  …
}
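As a runnable sketch of this modeling (Python; the stand-in problem, the names algo_f1/algo_f2, and the THRESHOLD parameter are illustrative assumptions, not from the talk):

```python
# Illustrative hybrid: f() picks, at each call, one of two algorithms
# solving the same problem (here: summing a list).
# THRESHOLD is an assumed tuning parameter (e.g. a block size).

THRESHOLD = 32

def f(xs):
    # adaptation: choose algo_fj for each call to f
    if len(xs) <= THRESHOLD:
        return algo_f1(xs)   # simple sequential variant
    return algo_f2(xs)       # divide-and-conquer variant

def algo_f1(xs):
    # direct sequential loop
    total = 0
    for x in xs:
        total += x
    return total

def algo_f2(xs):
    # recursive halving; each half goes back through f(),
    # so the choice is re-made at every recursive call
    mid = len(xs) // 2
    return f(xs[:mid]) + f(xs[mid:])
```

Because each recursive call re-enters f(), the number of choice points grows with the input, which is what makes such a hybrid "baroque" rather than "simple".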
E.g. “practical” hybrids :
• Atlas, Goto, FFPack
• FFTW
• cache-oblivious B-tree
• any parallel program with scheduling support : Cilk, Athapascan/Kaapi, Nesl, TLib, …
• How to manage the overhead due to choices ?
• Classification 1/2 :
– Simple hybrid iff O(1) choices [eg block size in Atlas, …]
– Baroque hybrid iff an unbounded number of choices
[eg recursive splitting factors in FFTW]
• choices are either dynamic or pre-computed based on input properties.
• Choices may or may not be based on architecture parameters.
• Classification 2/2 : a hybrid is
– Oblivious : control flow depends neither on static properties of the resources nor on the input
  [eg cache-oblivious algorithms [Bender]]
– Tuned : strategic choices are based on static parameters [eg block size w.r.t. cache, granularity, …]
  • Engineered tuned or self-tuned
    [eg ATLAS and GOTO libraries, FFTW, …] [eg LinBox/FFLAS [Saunders&al]]
– Adaptive : self-configuration of the algorithm, dynamic
  • Based on input properties or resource circumstances discovered at run-time
    [eg idle processors, data properties, …] [eg TLib [Rauber&Rünger]]
Examples
• BLAS libraries
  – Atlas : simple tuned (self-tuned)
  – Goto : simple engineered (engineered tuned)
  – LinBox / FFLAS : simple self-tuned, adaptive [Saunders&al]
• FFTW
  – Halving factor : baroque tuned
  – Stopping criterion : simple tuned
• Parallel algorithms and scheduling
  – Choice of the parallel degree : eg TLib [Rauber&Rünger]
  – Work-stealing schedule : baroque hybrid
Adaptive algorithms : Theory and applications
Van Dat Cung, Jean-Guillaume Dumas, Thierry Gautier,Guillaume Huard, Bruno Raffin, Jean-Louis Roch, Denis Trystram
INRIA-CNRS Project on“Adaptive and Hybrid Algorithms” Grenoble, France
Contents
I. Some criteria to analyze adaptive algorithms
II. Work-stealing and adaptive parallel algorithms
III. Adaptive parallel prefix computation
Work-stealing (1/2)

« Work » W1 = #total operations performed
« Depth » W∞ = #ops on a critical path (parallel time on ∞ resources)

• Work-stealing = “greedy” schedule, but distributed and randomized
• Each processor manages locally the tasks it creates
• When idle, a processor steals the oldest ready task on a remote, non-idle, victim processor (randomly chosen)
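The owner/thief discipline described above can be sketched as follows (Python; a toy single-threaded model with illustrative names, not Cilk or Kaapi code):

```python
from collections import deque

# Toy model of the work-stealing discipline: each worker owns a deque
# of ready tasks; the owner pushes and pops at the back (depth-first,
# like sequential calls), while an idle worker steals the *oldest*
# ready task from the front of a randomly chosen victim's deque.

class Worker:
    def __init__(self):
        self.tasks = deque()

    def push(self, task):
        self.tasks.append(task)          # local work: LIFO end

    def pop_local(self):
        # owner resumes its most recently created task
        return self.tasks.pop() if self.tasks else None

    def steal_from(self, victim):
        # thief takes the oldest ready task of a non-idle victim
        return victim.tasks.popleft() if victim.tasks else None
```

Taking the oldest task tends to hand the thief a large subtree of work, which keeps steals rare.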
Work-stealing (2/2)

« Work » W1 = #total operations performed
« Depth » W∞ = #ops on a critical path (parallel time on ∞ resources)

• Interests :
  -> suited to heterogeneous architectures, with a slight modification [Bender-Rabin 02]
  -> with good probability, near-optimal schedule on p processors with average speed Πave :
       Tp < W1/(p.Πave) + O( W∞/Πave )
NB : #succeeded steals = #task migrations = O( p.W∞ ) [Blumofe 98, Narlikar 01, Bender 02]
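As a worked instance of this bound, with assumed values (not measurements from the talk):

```latex
% Assumed values: W_1 = 10^6 ops, W_\infty = 10^3 ops,
% p = 4 processors of average speed \Pi_{ave} = 1 op / time unit.
T_p \;<\; \frac{W_1}{p\,\Pi_{ave}} + O\!\left(\frac{W_\infty}{\Pi_{ave}}\right)
    \;=\; \frac{10^6}{4} + O\!\left(10^3\right)
    \;\approx\; 2.5\cdot 10^5 ,
% i.e. close to the perfect speed-up W_1/p, since W_\infty \ll W_1/p.
```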
• Implementation : work-first principle [Cilk, Kaapi]
• Local parallelism is implemented by a sequential function call
• Restrictions to ensure the validity of the default sequential schedule :
  - series-parallel / Cilk
  - reference order / Kaapi
Work-stealing and adaptability
• Work-stealing ensures the allocation of processors to tasks transparently to the application, with provable performance
• Supports the addition of new resources
• Supports the resilience of resources and fault-tolerance (crash faults, network, …)
  • Checkpoint/restart mechanisms with provable performance [Porch, Kaapi, …]
• “Baroque hybrid” adaptation : there is an implicit dynamic choice between two algorithms :
  • a sequential (local) algorithm : depth-first (default choice)
  • a parallel algorithm : breadth-first
  • The choice is performed at runtime, depending on resource idleness
• Well suited to applications where a fine grain parallel algorithm is also a good sequential algorithm [Cilk]:
• Parallel Divide&Conquer computations • Tree searching, Branch&X …
-> suited when both sequential and parallel algorithms perform (almost) the same number of operations
• Solution : mix both a sequential and a parallel algorithm

• Basic technique :
  • Parallel algorithm until a certain « grain » ; then use the sequential one
  • Problem : W∞ increases, and so do the number of migrations … and the inefficiency ;o(

• Work-preserving speed-up [Bini-Pan 94] = cascading [Jaja 92] :
  careful interplay of both algorithms to build one with both W∞ small and W1 = O( Wseq )
  • Divide the sequential algorithm into blocks
  • Each block is computed with the (non-optimal) parallel algorithm
  • Drawback : sequential at coarse grain and parallel at fine grain ;o(

• Adaptive granularity : the dual approach :
  • Parallelism is extracted at run-time from any sequential task
But often parallelism has a cost !
Self-adaptive grain algorithm
Based on the work-first principle :
Always execute a sequential algorithm, to reduce the parallelism overhead
=> use the parallel algorithm only if a processor becomes idle, by extracting parallelism from a sequential computation

Hypothesis : two algorithms :
- 1 sequential : SeqCompute
- 1 parallel : LastPartComputation : at any time, it is possible to extract parallelism from the remaining computations of the sequential algorithm

– Examples :
  - iterated product [Vernizzi 05] - gzip / compression [Kerfali 04]
  - MPEG-4 / H264 [Bernard 06] - prefix computation [Traore 06]
[Diagram: a running SeqCompute task; Extract_par splits off its LastPartComputation, which becomes a new SeqCompute on another processor.]
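A minimal single-threaded simulation of this coupling (Python; the interval-sum computation and the steal_at parameter are illustrative assumptions, not the runtime's actual mechanism):

```python
# Single-threaded simulation of the SeqCompute / LastPartComputation
# coupling on an interval sum (a real runtime would run the thief in
# parallel and trigger the steal asynchronously).
#
# steal_at: assumed simulated step at which an idle processor asks
# for work; the sequential loop then hands over the *last part* of
# its remaining interval and keeps going on the first part.

def seq_compute(xs, lo, hi, steal_at=None):
    total, stolen = 0, None
    i = lo
    while i < hi:
        if steal_at is not None and i == steal_at and hi - i > 1:
            mid = (i + hi) // 2
            stolen, hi = (mid, hi), mid   # LastPartComputation handed out
            steal_at = None
        total += xs[i]                     # SeqCompute keeps working
        i += 1
    return total, stolen
```

The thief then calls seq_compute on the stolen interval itself, so further extraction remains possible down to the finest grain.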
Adaptive algorithms : Theory and applications
Van Dat Cung, Jean-Guillaume Dumas, Thierry Gautier,Guillaume Huard, Bruno Raffin, Jean-Louis Roch, Denis Trystram
INRIA-CNRS Project on“Adaptive and Hybrid Algorithms” Grenoble, France
Contents
I. Some criteria to analyze adaptive algorithms
II. Work-stealing and adaptive parallel algorithms
III. Adaptive parallel prefix computation
Prefix computation : an example where parallelism always costs

P1 = a0*a1, P2 = a0*a1*a2, …, Pn = a0*a1*…*an

• Sequential algorithm : P[0] = a[0]; for (i = 1; i <= n; i++) P[ i ] = P[ i-1 ] * a[ i ] ;   W1 = W∞ = n
• Parallel algorithm [Ladner-Fischer]:
  [Circuit: multiply adjacent pairs a0*a1, a2*a3, …; compute the prefix of size n/2; then one more '*' per remaining position to obtain P1 … Pn.]
  W∞ = 2·log n but W1 = 2·n
  Twice as expensive as the sequential algorithm …
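Both variants can be sketched and their multiplication counts compared (Python; illustrative code, with n restricted to powers of two in the parallel version for simplicity):

```python
# Sequential prefix: n-1 multiplications (W1 ~ n), critical path ~ n.
def prefix_seq(a):
    p = a[:]
    for i in range(1, len(p)):
        p[i] = p[i - 1] * p[i]
    return p

# Ladner-Fischer-style parallel prefix (n assumed a power of two):
# multiply adjacent pairs, recurse on the n/2 pair-products, then one
# extra '*' per even position.  count[0] tallies multiplications and
# comes out near 2n, i.e. about twice the sequential work W1.
def prefix_par(a, count):
    n = len(a)
    if n == 1:
        return a[:]
    pairs = [a[i] * a[i + 1] for i in range(0, n, 2)]
    count[0] += n // 2
    sub = prefix_par(pairs, count)        # prefixes of the pairs
    out = [None] * n
    out[0] = a[0]
    for i in range(1, n):
        if i % 2 == 1:
            out[i] = sub[i // 2]          # already a complete prefix
        else:
            out[i] = sub[i // 2 - 1] * a[i]
            count[0] += 1
    return out
```

The recursion depth is log n and each level does O(1) parallel steps, which is where W∞ = 2·log n comes from.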
Adaptive prefix computation

– Any (parallel) prefix performs at least W1 ≥ 2.n - W∞ ops
– Strict lower bound on p identical processors : Tp ≥ 2n/(p+1),
  reached by a block algorithm + pipeline [Nicolau&al. 2000]

Application of the adaptive scheme :
– One process performs the main “sequential” computation
– The other, work-stealer, processes compute parallel « segmented » prefixes
– Near-optimal performance on processors with changing speeds :
  Tp < 2n/((p+1).Πave) + O( log n / Πave ),  close to the lower bound
Scheme of the proof
• Dynamic coupling of two algorithms that complete simultaneously :
  – Sequential : (optimal) number of operations S
  – Parallel : performs X operations
    • dynamic splitting always possible down to the finest grain, BUT locally sequential
    • scheduled by work-stealing on p-1 processors
  – Critical path small (log X)
  – Each non-constant-time task can be split (variable speeds)
• Analysis :
  • The algorithmic scheme ensures Ts = Tp + O(log X)
    => enables to bound the whole number X of operations performed, and the overhead of parallelism = (S+X) - #ops_optimal
  • Comparison to the lower bound on the number of operations.
Adaptive Prefix on 3 processors

[Animation over a1 … a12: the main process runs the sequential prefix from the left; on a steal request, work-stealer 1 takes the remaining right part and computes the local products Πi = a5*…*ai; on a second steal request, work-stealer 2 computes Πi = a9*…*ai; the main process then preempts each work-stealer in turn to combine its partial products into the final prefixes P1 … P12.]
Implicit critical path on the sequential process
Adaptive prefix : some experiments
Single user context : Adaptive is equivalent to :
- sequential on 1 proc
- optimal parallel-2 proc. on 2 processors
- …
- optimal parallel-8 proc. on 8 processors

Multi-user context : Adaptive is the fastest, with a 15% benefit over a static-grain algorithm

[Two plots: time (s) versus #processors for the prefix of 10,000 elements on an 8-processor SMP (IA64 / Linux), with and without external load; curves: Parallel and Adaptive.]
Joint work with Daouda Traore
The Prefix race : sequential / parallel fixed / adaptive

Race between 9 algorithms (44 processes) on an octo-SMP

[Bar chart: execution times (0-25 seconds) of Adaptive 8 proc., Parallel 8, 7, 6, 5, 4, 3 and 2 proc., and Sequential.]
On each of the 10 executions, adaptive completes first
With * = double sum ( r[i] = r[i-1] + x[i] )

Single user, processors with variable speeds
Remark, for n = 4,096,000 doubles :
- “pure” sequential : 0.20 s
- minimal ”grain” = 100 doubles : 0.26 s on 1 proc and 0.175 s on 2 procs (close to the lower bound)

Finest “grain” limited to 1 page = 16384 bytes = 2048 doubles
E.g. Triangular system solving A.x = b, with A lower triangular

• Sequential algorithm : T1 = n²/2 ; T∞ = n (fine grain)

1/ x1 = b1 / a11
2/ For k = 2..n : bk = bk - ak1.x1
→ reduces the system of dimension n to a system of dimension n-1
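A sketch of this sequential elimination (Python; standard forward substitution, assuming the system is A.x = b with A lower triangular):

```python
# Forward substitution for A.x = b with A lower triangular:
# about n^2/2 multiply-adds (T1) and a critical path of length ~ n
# (Tinf), since each x[k] depends on the previously updated right-hand
# side.
def solve_lower_triangular(A, b):
    n = len(b)
    x = [0.0] * n
    bk = list(b)
    for k in range(n):
        x[k] = bk[k] / A[k][k]
        # eliminate x[k]: what remains is a system of dimension n-k-1
        for i in range(k + 1, n):
            bk[i] -= A[i][k] * x[k]
    return x
```

The long dependence chain through bk is what keeps T∞ at n for the fine-grain sequential algorithm.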
E.g. Triangular system solving A.x = b

• Sequential algorithm : T1 = n²/2 ; T∞ = n (fine grain)
• Using parallel matrix inversion : T1 = n³ ; T∞ = log² n (fine grain)

  with A = [ A11 0 ; A21 A22 ],  A⁻¹ = [ A11⁻¹ 0 ; S A22⁻¹ ],  S = -A22⁻¹.A21.A11⁻¹,  and x = A⁻¹.b
• Self-adaptive granularity algorithm : T1 = n² ; T∞ = n.log n

[Diagram: the solve on A.x = b couples a self-adaptive sequential algorithm with ExtractPar: self-adaptive scalar products and a self-adaptive matrix inversion, with the choice of the extracted block height h.]
Conclusion
Adaptive : what choices, and how to choose ?
Illustration : adaptive parallel prefix based on work-stealing
- self-tuned baroque hybrid : O(p log n) choices
- achieves near-optimal performance, processor-oblivious
Generic adaptive scheme to implement parallel algorithms with provable performance
Questions ?
Some examples (1/2)
• Adaptive algorithms used empirically and theoretically :
– Atlas [2001] : dense linear algebra library
  • Instruction set and instruction schedule
  • Self-calibration of the block size at installation on the machine
– FFTW (1998, …) : FFT(n) <= p FFT(q) and q FFT(p)
  • For any n, for any recursive call FFT(n) : pre-compute the best value for p
  • Pre-computation of the optimal splitting for the vector size n on the machine
– Cache-oblivious B-trees :
  • Block recursive splitting to minimize #page faults
  • Self-adaptation to the memory hierarchy
– Work-stealing (Cilk (1998, …), (2000, …)) : recursive parallelism
  • Choice between a sequential depth-first schedule and a breadth-first schedule
  • « Work-first principle » : optimize the local sequential execution and put the overhead on the rare steals from idle processors
  • Implicitly adaptive
Some examples (2/2)
– Moldable tasks : bi-criteria scheduling with performance guarantees [Trystram&al 2004]
  • Alternating recursive combination of approximations for each criterion
  • Self-adaptation with guaranteed performance for each criterion
– « Cache-oblivious » algorithms [Bender&al 2004]
  • Recursive block splitting that minimizes page faults
  • Self-adaptation to the memory hierarchy (B-tree)
– « Processor-oblivious » algorithms [Roch&al 2005]
  • Recursive combination of 2 algorithms, sequential and parallel
  • Self-adaptation to the idleness of the resources
Best case : the parallel algorithm is efficient
W∞ is small and W1 = Wseq
The parallel algorithm is an optimal sequential one. Examples : parallel D&C algorithms

Implementation : work-first principle
- no overhead for the local execution of tasks
Examples :
- Cilk : THE protocol
- Kaapi : compare&swap only
Experimentation : knary benchmark

SMP architecture : Origin 3800 (32 procs), Cilk / Athapascan
Distributed architecture : iCluster, Athapascan

#procs  Speed-up
8       7.83
16      15.6
32      30.9
64      59.2
100     90.1

Ts = 2397 s, T1 = 2435 s
In « theory » : fine granularity, maximal parallelism. Drawback : overhead of task management.
In « practice » : coarse granularity, splitting into p = #resources. Drawback : heterogeneous, dynamic architectures : Πi(t) = speed of processor i at time t.

How to choose/adapt the granularity ?

[Task graph over data a, b with tasks F(2,a), G(a,b), H(a), H(b), O(b,7) : a high potential degree of parallelism.]
How to obtain an efficient fine-grain algorithm ?
• Hypothesis for the efficiency of work-stealing :
  • the parallel algorithm is « work-optimal »
  • T∞ is very small (recursive parallelism)
• Problem :
  • Fine-grain (T∞ small) parallel algorithms may involve a large overhead with respect to an efficient sequential algorithm :
    • overhead due to parallelism creation and synchronization
    • but also arithmetic overhead
Self-adaptive grain algorithms
• Recursive computations– Local sequential computation
• Special case: – recursive extraction of parallelism when a resource becomes idle– But local execution of a sequential algorithm
• Hypothesis : two algorithms :
  - 1 sequential : SeqCompute
  - 1 parallel : LastPartComputation => at any time, it is possible to extract parallelism from the remaining computations of the sequential algorithm
• Examples :
  - iterated product [Vernizzi] - gzip / compression [Kerfali]
  - MPEG-4 / H264 [Bernard …] - prefix computation [Traore]
Adaptive Prefix versus optimal on identical processors
Illustration: adaptive parallel prefix
• Adaptive parallel computing on non-uniform and shared resources
• Example of adaptive prefix computation
• Sequential algorithm : P[0] = a[0]; for (i = 1; i <= n; i++) P[ i ] = P[ i-1 ] * a[ i ] ;   W1 = n
• Parallel algorithm [Ladner-Fischer]:
Indeed parallelism often costs ...
eg : Prefix computation P1 = a0*a1, P2 = a0*a1*a2, …, Pn = a0*a1*…*an

[Ladner-Fischer circuit: multiply adjacent pairs a0*a1, a2*a3, …; compute the prefix of size n/2; then one more '*' per remaining position to obtain P1 … Pn.]

W∞ = 2·log n but W1 = 2·n
Twice as expensive as the sequential algorithm