08 - (1) Parallel algorithm structure design space

Organization by Tasks
  (1.3) Task Parallelism
  (1.4) Recursive Splitting
Organization by Data
  (1.1) Geometric Decomposition
  (1.2) Recursive Data
Organization by Data Flow
  (1.5) Pipeline
(1.3) Task parallelism

Context:
• The problem can be naturally decomposed into a collection of tasks.
• The tasks are (mostly) independent.
• A task is typically computation-bound; data access may be irregular (unlike, e.g., data parallelism).
Example: Ray tracing

• 3D computer graphics
• Rendering of shadows and reflections in modeled images
• Simple geometric shapes (e.g. spheres, see Juggler)
• Vertex figures
• Triangulated shapes

“Juggler” (Amiga 1986): 4096 colors, 320x200 pixels, sequential rendering, ~1h per image.
Source: [4]
Image model
Source: [8, 9]
Ray tracing principles

[Figure: the camera casts one view ray through each image pixel into the scene; at a hit point on a scene object, a shadow ray is traced towards the light source.]

• Each pixel corresponds to one view ray.
• Simulate rays of light reflecting within a mathematically defined scene.
• The computations of pixels/ray traces are independent.
Parallel Programming, Summer 2010. Source: [5]
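Because pixel computations are independent, ray tracing parallelizes trivially over pixels. A minimal sketch in Python; trace_pixel is a hypothetical stand-in for the real view-ray computation:

```python
from concurrent.futures import ThreadPoolExecutor

WIDTH, HEIGHT = 4, 3

def trace_pixel(xy):
    # Hypothetical stand-in for casting a view ray and shading the hit point.
    x, y = xy
    return (x * 31 + y * 17) % 256

# One independent task per pixel.
pixels = [(x, y) for y in range(HEIGHT) for x in range(WIDTH)]

# A pool of workers processes the pixel tasks; map preserves pixel order.
with ThreadPoolExecutor(max_workers=4) as pool:
    image = list(pool.map(trace_pixel, pixels))
```

Any pool-style scheduler can replace the executor; the point is that no pixel task reads another pixel's result.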
Ray tracing for movies (e.g. Pixar)
• 100s of lights
• 1,000s of textures
• 10,000s of scene objects
• 100,000,000s of polygons
• Shaders with 10,000 lines of code (computation of shadow rays)
• Resolution: 2048x1536 pixels

Challenge: the scene and auxiliary structures (textures) should fit in memory.
Source: [3]

An ever-growing appetite for computing resources:

                      Toy Story (1995)   Cars (2005)
Image resolution:     1536 x 922         2048 x 1536
Rendering algorithm:  simple, scanline   complex [3]
Time per frame:       2h                 15h (!)

Next: ray tracing animations. The Juggler could be rendered in real time at 30 fps today (in 1986: 1 frame/h).
Source: [3, 5]

Ray tracing also has important applications in medical imaging, e.g., the analysis and correction of PET and CT images:
• Simulation of how radiation propagates through the body and hits a camera model
• Distinction of different types of tissue based on their propagation of radiation at different energy levels

Source: [1, http://www.siemens.com/press/de/pressebilder/?press=/de/pp_cc/2007/10_oct/sosep200729_32_1465320.htm]
Scalability

Almost perfect (strong and weak) scaling on today’s architectures.

Example: animation of 256x256 pixels (Quake 4 game) on an Intel quad core:
• 1 core: 4.4 fps
• 2 cores: 8.6 fps (1.96x speedup)
• 4 cores: 16.9 fps (3.84x speedup)
Source: [5]
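The speedups follow directly from the frame rates; a quick check using the numbers above:

```python
def speedup(fps_parallel, fps_serial):
    # Speedup = parallel throughput relative to the 1-core baseline.
    return fps_parallel / fps_serial

def efficiency(s, n_cores):
    # Efficiency = speedup per core; 1.0 would be perfect scaling.
    return s / n_cores

s2 = speedup(8.6, 4.4)    # ~1.95x on 2 cores
s4 = speedup(16.9, 4.4)   # ~3.84x on 4 cores
e2, e4 = efficiency(s2, 2), efficiency(s4, 4)
```

Efficiencies above 0.95 on both 2 and 4 cores are what the slide means by almost perfect strong scaling.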
Other Examples

Molecular dynamics: simulation of the interaction between atoms in a chemical reaction.
Source: [1, http://www.almaden.ibm.com/st/computational_science/MSA/]

The computation proceeds in phases; each phase consists of a number of parallel tasks:
1) For each atom: find and compute the forces that affect its movement (access to neighbor atoms).
2) For each atom: update position and velocity.
3) For each atom: update the neighbor list.
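The phase structure can be sketched with worker threads separated by barriers. This is a simplified illustration: the partitioning, the constant "force", and the update rule are placeholders, not a real MD force field:

```python
import threading

N_ATOMS, N_WORKERS = 8, 4
positions = [float(i) for i in range(N_ATOMS)]
velocities = [0.0] * N_ATOMS
forces = [0.0] * N_ATOMS
barrier = threading.Barrier(N_WORKERS)

def worker(wid):
    my_atoms = range(wid, N_ATOMS, N_WORKERS)   # static partition of the atoms
    # Phase 1: compute forces (placeholder: constant unit force per atom).
    for i in my_atoms:
        forces[i] = 1.0
    barrier.wait()      # all forces must be ready before any position changes
    # Phase 2: update velocity and position.
    for i in my_atoms:
        velocities[i] += forces[i]
        positions[i] += velocities[i]
    barrier.wait()
    # Phase 3: update neighbor lists (omitted in this sketch).

threads = [threading.Thread(target=worker, args=(w,)) for w in range(N_WORKERS)]
for t in threads: t.start()
for t in threads: t.join()
```

The barrier between phases is exactly the explicit synchronization the pattern calls for: within a phase the per-atom tasks are independent.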
Forces
• Task scheduling is the assignment of tasks to units of execution (UEs)
  – typically: #tasks >> #UEs
  – goal: load balancing

Example: tasks a–f with different computational requirements should be mapped to 4 UEs.
Source: [1]

async { a(); }
async { b(); }
async { c(); }
async { d(); }
async { e(); }
async { f(); }
Challenges:
• The schedule is typically determined at run time.
• The execution times of the tasks are not known in advance.

[Figure: the same tasks a–f mapped to 4 UEs, once unbalanced, once balanced.]
Forces (cont.)
• Tasks may have dependences:
  – limited choice for the scheduler: e.g. a dependence a → e rules out any schedule that runs e before a, and so changes which schedule is best
  – management of task order (task queue)

async { a(); async { e(); } }   // e is spawned only after a has run
async { b(); }
async { c(); }
async { d(); }
async { f(); }
Consequences
• The task decomposition should make sure data is efficiently accessible.
  – Ray tracing: the part of the scene and the textures required to compute one ray trace should fit into the main memory of one UE.
• Several generic scheduling algorithms are known and commonly used in practice:
  – Thread-pool scheduler
  – Work-stealing scheduler
Schedulers by example:
• Thread pool (current implementation of X10)
• Work stealing
Thread-pool scheduler

A global task queue (typically a doubly-linked list) is shared by a thread pool (here: 2 UEs).

Each thread ...
• ... repeatedly takes an unscheduled task from the head
• ... inserts new tasks at the tail
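A minimal thread-pool scheduler along these lines can be sketched in Python. This is an illustrative sketch, not the X10 implementation: a shared FIFO queue, workers take from the head, and a running task may append newly spawned tasks at the tail (async_, a, b, c are illustrative names):

```python
import threading
from collections import deque

task_queue = deque()     # global task queue: take at head, insert at tail
cv = threading.Condition()
outstanding = 0          # tasks that are queued or still running
executed = []

def async_(task):
    """Insert a new task at the tail of the global queue."""
    global outstanding
    with cv:
        task_queue.append(task)
        outstanding += 1
        cv.notify()

def worker():
    global outstanding
    while True:
        with cv:
            while not task_queue:
                if outstanding == 0:
                    return              # nothing queued, nothing running: done
                cv.wait(0.01)           # a running task may still spawn work
            task = task_queue.popleft() # take from the head
        task()                          # may call async_ to spawn more tasks
        with cv:
            outstanding -= 1
            cv.notify_all()

def a(): executed.append("a"); async_(b); async_(c)
def b(): executed.append("b")
def c(): executed.append("c")

async_(a)
pool = [threading.Thread(target=worker) for _ in range(2)]
for t in pool: t.start()
for t in pool: t.join()
```

The `outstanding` counter distinguishes "queue momentarily empty" from "all work finished", which is why the workers do not exit while a task that might spawn children is still running.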
The example program:

async { //task-0
  a();
  async { //task-1
    b();
    async { //task-2
      c();
    }
    d();
  }
  async { //task-3
    e();
  }
  async { //task-4
    f();
  }
  g();
}

[Animation: task-0 enters the global queue and is taken by one of the two threads. While it runs (a(), ..., g()), it inserts the tasks it spawns (1–4) at the tail; task-1 in turn spawns task-2. Both threads repeatedly take the task at the head of the shared queue until the queue is empty and execution is done.]
Work-stealing scheduler

Each thread (here: 2 UEs) has its own task deque (double-ended queue).

Each thread ...
• ... repeatedly takes a task from the head of its own deque
  – if the local deque is empty, the thread steals from the tail of another deque
• ... inserts new tasks at the head of its own deque
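The owner/thief protocol can be sketched as follows. This is an illustrative sketch (spawn, a, b, c are made-up names): per-worker deques, the owner pushes and pops at the head, an idle thread steals from the tail. A single coarse lock keeps the sketch simple; real work-stealing deques are lock-free:

```python
import threading
from collections import deque

N_WORKERS = 2
deques = [deque() for _ in range(N_WORKERS)]   # one deque per worker
lock = threading.Lock()
outstanding = 0                                # tasks queued or running
executed = []

def spawn(wid, task):
    """The owner inserts a new task at the head of its own deque."""
    global outstanding
    with lock:
        deques[wid].appendleft(task)
        outstanding += 1

def worker(wid):
    global outstanding
    while True:
        task = None
        with lock:
            if deques[wid]:
                task = deques[wid].popleft()       # owner takes from the head
            else:
                for victim in range(N_WORKERS):    # local deque empty:
                    if victim != wid and deques[victim]:
                        task = deques[victim].pop()  # steal from a tail
                        break
            if task is None and outstanding == 0:
                return          # all deques empty and nothing running: done
        if task is not None:
            task(wid)           # the task may spawn further tasks locally
            with lock:
                outstanding -= 1

def a(wid): executed.append("a"); spawn(wid, b); spawn(wid, c)
def b(wid): executed.append("b")
def c(wid): executed.append("c")

spawn(0, a)
pool = [threading.Thread(target=worker, args=(w,)) for w in range(N_WORKERS)]
for t in pool: t.start()
for t in pool: t.join()
```

The head/tail asymmetry is the point: the owner works on the newest (cache-warm) task, while thieves take the oldest task, which tends to represent the largest remaining chunk of work.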
The same example program under work stealing:

[Animation: the thread that runs task-0 pushes the tasks it spawns onto the head of its own deque (deque rule: the owner accesses the head, other threads access the tail). The second thread, whose deque is empty, steals the oldest task from the tail of the first thread’s deque and runs it, pushing any tasks it spawns onto its own deque. Owners keep taking from the head and thieves from the tail until both deques are empty and execution is done.]
08 - 45
Benefits of the work-stealing scheduler: Data locality:
- of accesses to the (local) task-queue- of data accesses by the tasks (task usually
accesses some data initialized by parent task)
Work-stealing was developed for the programming langage CILK
- continuation-passing style - provable efficient in space and communication
[2]
Another Example
Source: [4]

• Google computes an index that maps search terms to URLs:
  – continuation → {http://en.wikipedia.org/wiki/Continuation, ..., http://en.wikipedia.org/wiki/Continuation-passing_style, ...}
  – passing → {http://books.google.de/books?isbn=0486437132..., http://en.wikipedia.org/wiki/Continuation-passing_style, ...}
  – style → {http://www.style.com/, ..., http://en.wikipedia.org/wiki/Continuation-passing_style, ...}
• A search performs lookups and combines the results.
How is the index computed?

• Input: large amounts of raw data (petabytes)
  – web pages
  – logs
  – web page metadata
  – ...
• The input data cannot fit on a single machine.
Map-Reduce

An architecture for analyzing large amounts of raw data:
• By J. Dean and S. Ghemawat, OSDI, 2004
• Used by Google for daily data analysis and index creation
• Today: various frameworks and implementations (C++, Java, ...)
08 -
1) Partition and distribute input data(100s or 1000s of parts)
2) Map tasks operate independently on each partition and compute partial results
3) Reduce tasks combine and aggregate the output of the map operations to a final result.
50
Map-Reduce principles
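The three steps can be sketched in a few lines of Python on the word-count example used on the following slides (map_fn and reduce_fn are illustrative names, and the shuffle here is a plain in-memory grouping, not a distributed one):

```python
from collections import defaultdict

texts = {
    "everett.txt": "Education is a better safeguard of liberty than a standing army.",
    "aristotle.txt": "The roots of education are bitter, but the fruit is sweet.",
    "hugo.txt": "He who opens a school door, closes a prison.",
}

def map_fn(name, contents):
    """Emit an (word, 1) pair for every word of one document."""
    out, word = [], ""
    for ch in contents:
        if ch.isalnum():
            word += ch.lower()
        elif word:
            out.append((word, 1))
            word = ""
    if word:                      # flush a trailing word
        out.append((word, 1))
    return out

def reduce_fn(key, values):
    """Combine all intermediate values for one key."""
    return sum(values)

# (2) map phase: one independent task per partition (here: per document)
intermediate = [p for name, text in texts.items() for p in map_fn(name, text)]

# shuffle: group intermediate pairs by key
groups = defaultdict(list)
for k, v in intermediate:
    groups[k].append(v)

# (3) reduce phase: one task per key (or per group of keys)
counts = {k: reduce_fn(k, vs) for k, vs in groups.items()}
```

Every map call and every reduce call is independent, which is what lets a framework distribute them across thousands of machines.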
Word counting, map phase (3 tasks):
Source: [4]

everett.txt: “Education is a better safeguard of liberty than a standing army.”
  → (education,1) (is,1) (a,1) (better,1) (safeguard,1) (of,1) (liberty,1) (than,1) (a,1) (standing,1) (army,1)

aristotle.txt: “The roots of education are bitter, but the fruit is sweet.”
  → (the,1) (roots,1) (of,1) (education,1) (are,1) (bitter,1) (but,1) (the,1) (fruit,1) (is,1) (sweet,1)

hugo.txt: “He who opens a school door, closes a prison.”
  → (he,1) (who,1) (opens,1) (a,1) (school,1) (door,1) (closes,1) (a,1) (prison,1)
Partition the intermediate pairs by key across the reduce tasks:
Source: [4]

  (a,1) (are,1) (army,1) (a,1) (better,1) (bitter,1) (a,1) (but,1) (a,1)
  (he,1) (closes,1) (door,1) (education,1) (fruit,1)
  (of,1) (is,1) (education,1) (liberty,1) (opens,1) (of,1) (prison,1)
  (safeguard,1) (sweet,1) (than,1) (the,1) (roots,1) (who,1) (the,1) (school,1) (standing,1)
Sort and group by key:
Source: [4]

  a,[1,1,1,1]  are,[1]  army,[1]  better,[1]  bitter,[1]  but,[1]
  closes,[1]  door,[1]  education,[1,1]  fruit,[1]  he,[1]
  is,[1,1]  liberty,[1]  opens,[1]  of,[1,1]  prison,[1]
  safeguard,[1]  school,[1]  standing,[1]  sweet,[1]  than,[1]  the,[1,1]  roots,[1]  who,[1]
Reduce (4 tasks):
Source: [4]

  a,4  are,1  army,1  better,1  bitter,1  but,1
  closes,1  door,1  education,2  fruit,1  he,1
  is,2  liberty,1  opens,1  of,2  prison,1
  safeguard,1  school,1  standing,1  sweet,1  than,1  the,2  roots,1  who,1
Programming model
Source: [3]

Input & output: each a set of key/value pairs.
The programmer specifies two functions:

map(InKey, InValue): List[Pair[OutKey, IntermediateValue]]
• Processes an input key/value pair
• Produces a set of intermediate pairs

reduce(OutKey, List[IntermediateValue]): List[OutValue]
• Combines all intermediate values for a particular key
• Produces a set of merged output values (usually just one)
Notes:
• Inspired by functional programming languages (e.g. LISP).
• From the programmer’s perspective, the model is sequential.
Word Counting: map function

map(InKey, InValue): List[Pair[OutKey, IntermediateValue]]

/* @param inkey   document name
 * @param invalue document contents
 * @param result  collects the intermediate result */
def map (inkey: String, invalue: String,
         result: List[Pair[String, Int]]) {
  val chars = invalue.chars();
  var sb: StringBuilder = new StringBuilder();
  for (c in chars.region) {
    val tmp = chars(c);
    if (tmp.isLetterOrDigit())
      sb.add(tmp.toLowerCase());
    else {
      // emit result
      result.add(Pair[String, Int](sb.result(), 1));
      sb = new StringBuilder();
    }
  }
}
Word Counting: reduce function

reduce(OutKey, List[IntermediateValue]): List[OutValue]

/* @param outkey word
 * @param intermediate_values occurrences of the word in all texts
 * @return sum of occurrences */
def reduce (outkey: String,
            intermediate_values: List[Int]): List[Int] {
  var acc: Int = 0;
  val ret = new List[Int]();
  for (i in intermediate_values)
    acc += i;
  ret.add(acc);
  return ret;
}
Word Counting: main function

public static def main(Array[String]) {
  // (1) initialize
  val mr = new MapReduceImpl[String, String, Int, String, Int](3, 3);
  val m = (ik: String, iv: String, result: List[Pair[String, Int]]) => {
    mr.map(ik, iv, result);
  };
  val r = (ok: String, iv: List[Int]): List[Int] => {
    return mr.reduce(ok, iv);
  };
  mr.registerReduceFun(r);
  mr.registerMapFun(m);
  mr.setInput(input);

  // (2) run it
  mr.run();

  // (3) print result
  val result <: List[Pair[String, Int]] = mr.getResult();
  for (p in result)
    Console.OUT.println(p.first + ", " + p.second);
}
Generic MapReduce

public interface MapReduce[InKey, InVal, IntermediateVal, OutKey, OutVal] {

  property def num_map_tasks(): Int;
  property def num_reduce_tasks(): Int;

  def setInput(i: Array[Pair[InKey, InVal]]);
  def registerMapFun(m: (InKey, InVal, List[Pair[OutKey, IntermediateVal]]) => void);
  def registerReduceFun(r: (OutKey, List[IntermediateVal]) => List[OutVal]);
  def run();
  def getResult(): List[Pair[OutKey, List[OutVal]]];
}
Programming model
Source: [3]

The programmer specifies two functions:
map(InKey, InValue): List[Pair[OutKey, IntermediateValue]]
reduce(OutKey, List[IntermediateValue]): List[OutValue]

Parallelization, load balancing and fault tolerance are handled by the programming framework.
More examples of MapReduce programs:
Source: [3]
• distributed grep
• distributed sort
• web link-graph reversal
• term vector per host
• web access log stats
• inverted index construction
• document clustering
• machine learning
• ...
Task granularity and pipelining (Google, 2004)
Source: [3]

• Often 200,000 map and 5,000 reduce tasks on 2,000 machines
• Fine-grained tasks: many more map tasks than machines
  – minimizes the time for fault recovery (a single task is small)
  – better dynamic load balancing
  – shuffling (partition, sort/group) can be pipelined with the execution of map tasks
“Shuffling”: the reduce tasks read, sort and group the intermediate data.
Source: [3]
(1) Parallel algorithm structure design space

Organization by Tasks
  (1.3) Task Parallelism
  (1.4) Recursive Splitting
Organization by Data
  (1.1) Geometric Decomposition
  (1.2) Recursive Data
Organization by Data Flow
  (1.5) Pipeline
(1.4) Recursive splitting

Context:
• The solution to a problem is naturally described through a recursive algorithm
  – also called “divide and conquer”
  – the solution of a large problem can be synthesized from the solutions of smaller problems

Context (cont.):
• Each recursion step can be regarded as a task
  – tasks at different levels of the invocation hierarchy are dependent
  – tasks at the same level of the hierarchy are (mostly) independent
Example: Merge sort
Source: [4]

Divide until 4 sublists remain, sort each sequentially, then merge:

  [13 7 5 8 2 1 13 9]
  divide → [13 7 5 8]  [2 1 13 9]
  divide → [13 7]  [5 8]  [2 1]  [13 9]
  4 × sequential sort → [7 13]  [5 8]  [1 2]  [9 13]
  merge → [5 7 8 13]  [1 2 9 13]
  merge → [1 2 5 7 8 9 13 13]
Example: Merge sort (cont.)

The same tree, annotated: the four leaf-level sorts ([13 7]→[7 13], [5 8]→[5 8], [2 1]→[1 2], [13 9]→[9 13]) are the sort tasks; the combining steps above them are the merge tasks.
Task graph

[Figure: the sort tasks [13 7]→[7 13], [5 8]→[5 8], [2 1]→[1 2], [13 9]→[9 13] feed two merge tasks producing [5 7 8 13] and [1 2 9 13], which feed a final merge producing [1 2 5 7 8 9 13 13].]

The tasks form a partial order due to data dependences (input/output).
Forces
• Tasks can have different sizes
  – load balancing
  – simple merge sort: the last “large” merge task is sequential
• Tasks can have dependences
  – due to input/output relations
  – manage dependences with explicit synchronization (typically barriers)

Forces (cont.)
• Task granularity
  – merge sort: a threshold determines the input size below which sorting is done sequentially (e.g. with quicksort)
• Scheduling (the mapping of tasks to UEs) should support data locality
  – work stealing is typically good
  – for merge sort, better schemes exist
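Recursive splitting with a sequential-sort threshold can be sketched as follows. This is an illustrative sketch using Python futures; THRESHOLD and the input are examples, and for simplicity only one half of each split is submitted to the pool (submitting both halves of deep recursions to a bounded pool can deadlock):

```python
from concurrent.futures import ThreadPoolExecutor

THRESHOLD = 4   # below this size, sort sequentially (the granularity knob)

def merge(left, right):
    """Merge two sorted lists into one sorted list."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]

def merge_sort(xs, pool):
    if len(xs) <= THRESHOLD:
        return sorted(xs)           # sequential leaf task
    mid = len(xs) // 2
    left = pool.submit(merge_sort, xs[:mid], pool)   # independent subtask
    right = merge_sort(xs[mid:], pool)               # computed by the caller
    return merge(left.result(), right)               # dependence: wait, then merge

data = [13, 7, 5, 8, 2, 1, 13, 9]   # the example from the slides
with ThreadPoolExecutor(max_workers=4) as pool:
    result = merge_sort(data, pool)
```

The two halves of each split are the independent same-level tasks; `left.result()` is the explicit synchronization on the input/output dependence before the merge task runs.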
Scheduling for data locality

Number the tasks of the merge-sort task graph 1–7: sort tasks 1–4 at the leaves, merge tasks 5 and 6 above them, final merge task 7.

Examples of possible task schedules on a single UE (constraint: the partial order):
1) 1, 2, 3, 4, 5, 6, 7
2) 1, 2, 5, 3, 4, 6, 7

Assumptions:
• Sort tasks perform sequential in-place sorting
• A cache line is two values wide
• The cache has a capacity of two lines
• The cache is fully associative
Schedule 1): 1, 2, 3, 4, 5, 6, 7

[Animation: the cache holds only the two most recently touched lines, so by the time the merge tasks 5, 6 and 7 run, the lines written by the earlier sort tasks have been evicted and must be fetched again.]

Result: 2 hits, 10 misses (simplified).
Schedule 2): 1, 2, 5, 3, 4, 6, 7

[Animation: running merge task 5 immediately after sort tasks 1 and 2 finds their output still in the cache and hits; likewise merge task 6 after sort tasks 3 and 4.]

Result: 6 hits, 6 misses (simplified).

Schedule 2) has better data locality than 1)!
Task scheduling on multiple UEs:

• Parallel breadth first (PBF): {1, 2, 3, 4, 5, 6, 7}
• Parallel depth first (PDF): {1, 2, 5, 3, 4, 6, 7}
• The choice of schedule is especially relevant for performance on CMPs (chip multiprocessors)
• For merge sort, PDF is preferable. Details in [10]
Example: Traveling salesman problem (TSP)

Input: a weighted, fully connected graph.
Output: the shortest cycle through all nodes of the graph. Without loss of generality, the tour starts and ends in A.

[Figure: the six possible tours of the 4-node graph {A, B, C, D}: A,B,C,D,A; A,C,B,D,A; A,D,B,C,A; A,B,D,C,A; A,C,D,B,A; A,D,C,B,A]
• TSP is an optimization problem.
• Naive strategy: exhaustive search
  – one task per tour
  – for N nodes, there are (N-1)! different tours
  – the number of tours grows exponentially
• Theory says: the problem is NP-complete.
A better approach: phrase the problem as a recursion.

Problem formalization:
• N = {0, ..., n-1} nodes
• dist(i, j) = <direct distance from i to j>
• L(i, A) = <length of the shortest path from i to 0 through the nodes A ⊆ N>

Recursive procedure
• Recursion:
  – L(i, ∅) = dist(i, 0)
  – L(i, A) = min_{j∈A} ( dist(i, j) + L(j, A \ {j}) )   ← recursive call
• Result: L(0, {1, ..., n-1})
Naive computation of the recursion

[Task tree: L(0, {1,...,n-1}) at the root, with children L(1, {2,...,n-1}), L(2, {1,3,...,n-1}), ..., L(n-1, {1,...,n-2}), down to leaves of the form L(i, ∅). Each node is a task.]

• n-1 hierarchy levels
• Tasks at the same level can be computed independently.
• But subproblems such as L(3, {4,...,n-1}) appear in several subtrees: the same work is done several times (in parallel)!
Dynamic programming

• Key idea: store intermediate results and reuse them in other parts of the computation.
• For TSP: maintain a table for L(i, A), i ∈ {0,...,n-1}, A ∈ P({1,...,n-1}) (P: powerset)
  – Problem: the size of that table is n · 2^n

• Recursion:
  – L(i, ∅) = dist(i, 0)
  – L(i, A) = min_{j∈A} ( dist(i, j) + L(j, A \ {j}) )   ← recursive call or table lookup
• Result: L(0, {1, ..., n-1})
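The recursion with memoized table lookups fits in a few lines of Python; the asymmetric distance matrix is an illustrative example, and lru_cache plays the role of the L(i, A) table:

```python
from functools import lru_cache

# Illustrative asymmetric distance matrix for the nodes {0, 1, 2, 3}.
dist = [[0, 2, 9, 10],
        [1, 0, 6, 4],
        [15, 7, 0, 8],
        [6, 3, 12, 0]]
n = len(dist)

@lru_cache(maxsize=None)        # the L(i, A) table: at most n * 2^n entries
def L(i, A):
    if not A:                   # L(i, emptyset) = dist(i, 0)
        return dist[i][0]
    # L(i, A) = min over j in A of  dist(i, j) + L(j, A \ {j})
    return min(dist[i][j] + L(j, A - {j}) for j in A)

# Result: L(0, {1, ..., n-1}), the length of the shortest tour.
shortest = L(0, frozenset(range(1, n)))
```

Representing the node set A as a frozenset makes it hashable, so each subproblem is computed once and afterwards answered by a table lookup, exactly as the slide describes.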
Dynamic programming (cont.)

[Task tree as before, but each repeated subproblem such as L(3, {4,...,n-1}) is computed once; the other occurrences become table lookups.]

• Compute the task tree bottom up.
• Enumerate the possible tasks at each hierarchy level; tasks at the same level can be done concurrently.
• Barrier when ascending the hierarchy to the next level.
Sources

[1] Timothy G. Mattson, Beverly A. Sanders, Berna L. Massingill: Patterns for Parallel Programming, Addison-Wesley, 2005.
[2] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, Yuli Zhou: Cilk: An Efficient Multithreaded Runtime System, PPoPP, 1995.
[3] Jeffrey Dean and Sanjay Ghemawat: MapReduce: Simplified Data Processing on Large Clusters, OSDI, 2004. http://labs.google.com/papers/mapreduce.html
[4] Per H. Christensen, Julian Fong, David M. Laur, Dana Batali (Pixar Animation Studios): Ray Tracing for the Movie ‘Cars’, Ray Tracing Symposium, 2006. http://www.sci.utah.edu/~wald/RT06/papers/raytracing06per.pdf, http://graphics.pixar.com/library/RayTracingCars/paper.pdf
[5] Ernie Wright: Amiga Juggler animation. http://home.comcast.net/~erniew/juggler.html
[6] Jeff Atwood: Real-Time Raytracing. Blog entry, March 10, 2008.
[7] Nate Nystrom: X10, work stealing. Lecture CSE 3302, April 2010.
[8] William Welch and Andrew Witkin: Free-Form Shape Design Using Triangulated Surfaces. Computer Graphics (SIGGRAPH), 1994.
[9] Wolfram MathWorld: http://mathworld.wolfram.com/VertexFigure.html
[10] S. Chen, P. B. Gibbons, M. Kozuch, V. Liaskovitis, A. Ailamaki, G. E. Blelloch, B. Falsafi, L. Fix, N. Hardavellas, T. C. Mowry, C. Wilkerson: Scheduling Threads for Constructive Cache Sharing on CMPs. Symp. on Parallel Algorithms and Architectures (SPAA), 2007.
08 - 98
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.
• You are free:– to Share — to copy, distribute and transmit the work – to Remix — to adapt the work
• Under the following conditions:– Attribution. You must attribute the work to “The Art of
Multiprocessor Programming” (but not in any way that suggests that the authors endorse you or your use of the work).
– Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under the same, similar or a compatible license.
• For any reuse or distribution, you must make clear to others the license terms of this work. The best way to do this is with a link to– http://creativecommons.org/licenses/by-sa/3.0/.
• Any of the above conditions can be waived if you get permission from the copyright holder.
• Nothing in this license impairs or restricts the author's moral rights.