Page 1

Multi-core Computing
Lecture 2

MADALGO Summer School 2012
Algorithms for Modern Parallel and Distributed Models

Phillip B. Gibbons
Intel Labs Pittsburgh

August 21, 2012

Page 2

Multi-core Computing Lectures: Progress-to-date on Key Open Questions

• How to formally model multi-core hierarchies?
• What is the Algorithm Designer’s model?
• What runtime task scheduler should be used?
• What are the new algorithmic techniques?
• How do the algorithms perform in practice?

Page 3

Lecture 1 Summary
• Multi-cores: today, future trends, challenges
• Computations & Schedulers
  – Modeling computations in the work-depth framework
  – Schedulers: Work Stealing & PDF
• Cache miss analysis on a 2-level parallel hierarchy
  – Private caches OR shared cache
• Low-depth, cache-oblivious parallel algorithms
  – Sorting & graph algorithms

Page 4

Lecture 2 Outline
• Modeling the Multicore Hierarchy
  – PMH model
• Algorithm Designer’s model exposing the Hierarchy
  – Multi-BSP model
• Quest for a Simplified Hierarchy Abstraction
• Algorithm Designer’s model abstracting the Hierarchy
  – Parallel Cache-Oblivious (PCO) model
• Space-Bounded Schedulers
  – Revisit PCO model

Page 5

[Diagram: 32-core Intel Xeon 7500 multi-core. 4 sockets; each socket holds 8 cores (2 HW threads per core) with 32KB L1 and 256KB L2 caches per core, plus a 24MB shared L3 cache per socket; up to 1 TB main memory.]

Page 6

[Diagram: 48-core AMD Opteron 6100. 4 sockets; each socket holds 12 cores (P) with 64KB L1 and 512KB L2 caches per core, plus a 12MB shared L3 cache per socket; up to 0.5 TB main memory.]

Page 7

How to Model the Hierarchy (?)

[Diagram: a tree of caches]

The “Tree of Caches” abstraction captures existing multi-core hierarchies.

Parallel Memory Hierarchy (PMH) model [Alpern, Carter, Ferrante ‘93]

Page 8

(Symmetric) PMH Model

Each level-i cache in the tree is characterized by four parameters: capacity M_i, block size B_i, miss cost C_i, and fanout f_i.
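As a concrete illustration, here is a minimal sketch of a machine description for a symmetric PMH. This is our own illustration, not part of the model’s formal definition; the type names and the processor_count helper are assumptions of the sketch.

```cpp
#include <cstddef>
#include <vector>

// One level of a symmetric Parallel Memory Hierarchy (PMH).
// Fields mirror the four per-level parameters M_i, B_i, C_i, f_i.
struct PMHLevel {
    std::size_t capacity;    // M_i: cache capacity in bytes
    std::size_t block_size;  // B_i: cache-line / block size in bytes
    double      miss_cost;   // C_i: cost of a miss at this level
    unsigned    fanout;      // f_i: number of level-(i-1) children
};

// A symmetric PMH is a uniform tree of caches: levels[0] is closest
// to the processors, levels.back() is the root (main memory).
using PMHMachine = std::vector<PMHLevel>;

// Total number of processors = product of the fanouts down the tree.
unsigned long long processor_count(const PMHMachine& m) {
    unsigned long long p = 1;
    for (const PMHLevel& level : m) p *= level.fanout;
    return p;
}
```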

Page 9

PMH captures:
• the PEM model [Arge, Goodrich, Nelson, Sitchinava ‘08]: a p-processor machine with private caches
• the Shared Cache model discussed in Lecture 1
• the Multicore Cache model [Blelloch et al. ‘08]

[Diagram: an h=3 hierarchy: three CPUs, each with a private L1 cache, above a shared L2 cache and Memory]

Page 10

Lecture 2 Outline
• Modeling the Multicore Hierarchy
  – PMH model
• Algorithm Designer’s model exposing the Hierarchy
  – Multi-BSP model
• Quest for a Simplified Hierarchy Abstraction
• Algorithm Designer’s model abstracting the Hierarchy
  – Parallel Cache-Oblivious (PCO) model
• Space-Bounded Schedulers
  – Revisit PCO model

Page 11

How to Design Algorithms (?)

Design to the Tree-of-Caches abstraction:
• Multi-BSP Model [Valiant ’08]
  – 4 parameters/level: cache size, fanout, latency/sync cost, transfer bandwidth
  – Bulk-Synchronous

[Diagram: a tree of caches]

Page 12

Bridging Models

[Diagram: Multi-BSP (p1, L1, g1, m1, …) bridges Hardware and Software; an arrow between them reads “can efficiently simulate on”.]

(Slides from Les Valiant)

Page 13

Multi-BSP: Level j component

[Diagram: a level-j component consists of p_j level-(j-1) components together with a level-j memory of size m_j. The level-(j-1) components connect to the level-j memory at data rate g_{j-1}; the component connects to the level above at data rate g_j; L_j is the synchronization cost.]

Page 14

Multi-BSP: Level 1 component

[Diagram: a level-1 component consists of p_1 processors (level 0 = processor) together with a level-1 memory of size m_1. The processors connect to the level-1 memory at data rate g_0 = 1; the component connects to the level above at data rate g_1; the synchronization cost is L_1 = 0.]

Page 15

Multi-BSP

Like BSP except:
1. Not 1 level, but a d-level tree
2. Has memory (cache) size m as a further parameter at each level

i.e., a machine H has 4d+1 parameters: d, plus (p_i, g_i, L_i, m_i) at each level. E.g., d = 3: (p1, g1, L1, m1), (p2, g2, L2, m2), (p3, g3, L3, m3).

Page 16

Optimal Multi-BSP Algorithms

A Multi-BSP algorithm A* is optimal with respect to algorithm A if:
(i) Comp(A*) = Comp(A) + low-order terms,
(ii) Comm(A*) = O(Comm(A)), and
(iii) Synch(A*) = O(Synch(A)),
where Comm(A) and Synch(A) are optimal among Multi-BSP implementations, Comp is the total computational cost, and the O() constants are independent of the model parameters.

[Valiant ‘08] presents optimal algorithms for Matrix Multiply, FFT, Sorting, etc. (simple variants of known algorithms and lower bounds).

Page 17

Lecture 2 Outline
• Modeling the Multicore Hierarchy
  – PMH model
• Algorithm Designer’s model exposing the Hierarchy
  – Multi-BSP model
• Quest for a Simplified Hierarchy Abstraction
• Algorithm Designer’s model abstracting the Hierarchy
  – Parallel Cache-Oblivious (PCO) model
• Space-Bounded Schedulers
  – Revisit PCO model

Page 18

How to Design Algorithms (?)

Design to the Tree-of-Caches abstraction:
• Multi-BSP Model
  – 4 parameters/level: cache size, fanout, latency/sync cost, transfer bandwidth
  – Bulk-Synchronous

Our Goal: Be Hierarchy-Savvy
• ~ Simplicity of the Cache-Oblivious Model
  – Handles dynamic, irregular parallelism
  – Co-design with smart thread schedulers

Page 19

Abstract Hierarchy: Simplified View

What yields good hierarchy performance?
• Spatial locality: use what’s brought in
  – Popular sizes: cache lines 64B; pages 4KB
• Temporal locality: reuse it
• Constructive sharing: don’t step on others’ toes

How might one simplify the view?
• Approach 1: Design to a 2- or 3-level hierarchy (?)
• Approach 2: Design to a sequential hierarchy (?)
• Approach 3: Do both (??)

Page 20

Sequential Hierarchies: Simplified View
• External Memory Model: see [Vitter ‘01]
  – Simple model
  – Minimize I/Os
  – Only 2 levels
  – Only 1 “cache”

[Diagram: External Memory Model: a Main Memory of size M above an External Memory, with transfers in blocks of size B]

Can be a good choice if the bottleneck is the last level.

Page 21

Sequential Hierarchies: Simplified View
• Cache-Oblivious Model [Frigo et al. ’99]
  – Twist on the EM model: M & B are unknown to the algorithm; still a simple model
  – Key algorithm goal: good performance for any M & B, which guarantees good cache performance at all levels of the hierarchy
  – Single CPU only (all caches shared)
  – Encourages hierarchical locality

[Diagram: Ideal Cache Model: a Main Memory of size M above an External Memory, with transfers in blocks of size B]

Page 22

Example Paradigms Achieving the Key Goal

• Scan: e.g., computing the sum of N items
  – N/B misses, for any B (optimal)

• Divide-and-Conquer: e.g., matrix multiply C = A*B
  – Divide: recursively compute the quadrant products A11*B11, …, A22*B22
  – Conquer: compute the 4 quadrant sums:
      C11 = A11*B11 + A12*B21,  C12 = A11*B12 + A12*B22,
      C21 = A21*B11 + A22*B21,  C22 = A21*B12 + A22*B22
  – Uses a recursive Z-order layout
  – O(N²/B + N³/(B·√M)) misses (optimal)
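For intuition, here is a minimal sequential sketch of the divide-and-conquer multiply. This is our own code, not from the lecture: it uses plain row-major storage with a leading dimension ld, whereas the optimal cache-oblivious bound above additionally relies on the recursive Z-order layout the slide mentions.

```cpp
#include <cstddef>

// Recursive divide-and-conquer matrix multiply, C += A*B, for n x n
// matrices with n a power of two, stored row-major with leading
// dimension ld. (Sketch only: the optimal cache-oblivious bound also
// needs a recursive Z-order layout; the 8 recursive products can be
// forked as two parallel rounds of 4 in the low-depth version.)
void mm_rec(const double* A, const double* B, double* C,
            std::size_t n, std::size_t ld) {
    if (n == 1) {                 // base case: scalar multiply-add
        C[0] += A[0] * B[0];
        return;
    }
    const std::size_t h = n / 2;
    auto idx = [ld](std::size_t i, std::size_t j) { return i * ld + j; };
    const double* A11 = A;               const double* A12 = A + idx(0, h);
    const double* A21 = A + idx(h, 0);   const double* A22 = A + idx(h, h);
    const double* B11 = B;               const double* B12 = B + idx(0, h);
    const double* B21 = B + idx(h, 0);   const double* B22 = B + idx(h, h);
    double* C11 = C;                     double* C12 = C + idx(0, h);
    double* C21 = C + idx(h, 0);         double* C22 = C + idx(h, h);
    // The 4 quadrant sums from the slide, as 8 recursive products:
    mm_rec(A11, B11, C11, h, ld); mm_rec(A12, B21, C11, h, ld);
    mm_rec(A11, B12, C12, h, ld); mm_rec(A12, B22, C12, h, ld);
    mm_rec(A21, B11, C21, h, ld); mm_rec(A22, B21, C21, h, ld);
    mm_rec(A21, B12, C22, h, ld); mm_rec(A22, B22, C22, h, ld);
}
```

Once a recursive subproblem’s matrices fit in a size-M cache, that subproblem incurs only O(M/B) misses; counting the subproblems at that scale is what yields the O(N³/(B·√M)) term.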

Page 23

Multicore Hierarchies: Key Challenge

• The theory underlying the Ideal Cache Model falls apart once parallelism is introduced: good performance for any M & B on 2 levels DOES NOT imply good performance at all levels of the hierarchy.

Key reason: caches are not fully shared.

[Diagram: CPU1, CPU2, and CPU3, each with a private L1 cache, above a shared L2 cache. What’s good for CPU1 is often bad for CPU2 & CPU3, e.g., all want to write B at ≈ the same time.]

Page 24

Multicore Hierarchies’ Key New Dimension: Scheduling

The scheduling of parallel threads has a LARGE impact on cache performance (again because caches are not fully shared).

Recall our problem scenario: all CPUs want to write B at ≈ the same time. We can mitigate (but not solve) this if we can schedule the writes to be far apart in time.

[Diagram: CPU1, CPU2, and CPU3, each with a private L1 cache, above a shared L2 cache]

Page 25

Constructive Sharing

Destructive sharing: threads compete for the limited on-chip cache, “flooding” the off-chip pins.

Constructive sharing: threads share a largely overlapping working set.

[Diagram: two configurations, each with three processors (P) with private L1 caches connected by an interconnect to a shared L2 cache; in the destructive case, traffic floods the off-chip pins.]

Page 26

Recall: Low-Span + Cache-Oblivious

• Guarantees on the scheduler’s cache performance depend on the computation’s depth D
  – E.g., work stealing on a single level of private caches:
    Thrm: For any computation w/ fork-join parallelism, O(M P D / B) more misses on P cores than on 1 core

• Approach: design parallel algorithms with
  – low span, and
  – good performance in the Cache-Oblivious Model

Thrm: For any computation w/ fork-join parallelism, for each level i, only O(M_i P D / B_i) more misses than on 1 core, for a hierarchy of private caches

But: no such guarantees for a general tree-of-caches.
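As a rough worked instance of the single-level bound (illustrative numbers, not from the lecture): with P = 8 cores, private caches of size M = 32KB, line size B = 64B, and computation depth D, work stealing incurs at most O(M P D / B) = O((32768/64) · 8 · D) = O(4096 D) extra misses relative to the 1-core execution. Since D is polylogarithmic for low-depth algorithms, this overhead is small compared to the total number of misses.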

Page 27

Lecture 2 Outline
• Modeling the Multicore Hierarchy
  – PMH model
• Algorithm Designer’s model exposing the Hierarchy
  – Multi-BSP model
• Quest for a Simplified Hierarchy Abstraction
• Algorithm Designer’s model abstracting the Hierarchy
  – Parallel Cache-Oblivious (PCO) model
• Space-Bounded Schedulers
  – Revisit PCO model

Page 28

Handling the Tree-of-Caches

To obtain guarantees for a general tree-of-caches:
• we define a Parallel Cache-Oblivious (PCO) Model, and
• a corresponding Space-Bounded Scheduler.

Page 29

A Problem with Using the CO Model

Scenario: P subtasks, each reading the same M/B blocks (a_1, a_2, …, a_M) in the same order.

• Misses in the CO model: M/B misses.
• Any greedy parallel schedule on a shared cache of size M_p = M: all processors suffer all the misses in parallel, i.e., P · M/B misses.

The carry-forward rule is too optimistic.

[Diagrams: a single CPU with a size-M cache above Memory, vs. P CPUs sharing a size-M_p cache above Memory]

Page 30

Parallel Cache-Oblivious (PCO) Model [Blelloch, Fineman, G, Simhadri ‘11]

• Differs from the cache-oblivious model in how cache state is carried forward.

Case 1: the task fits in the cache (size M, block size B). All forked subtasks start with the same cache state; at the join, the states are merged and carried forward. Cache state is carried forward according to some sequential order.

[Diagram: Memory above a cache (M, B) serving P processors; a task forks three subtasks]

Page 31

Parallel Cache-Oblivious Model (2)

• Differs from the cache-oblivious model in how cache state is carried forward.

Case 2: the task does not fit in the cache. All forked subtasks start with an empty cache state, and the cache is set to empty at the join.

[Diagram: Memory above a cache (M, B) serving P processors; a task forks three subtasks]
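To make the two cases concrete, here is a toy sketch of the carry-forward rule for a single cache level. This is our own simplification for intuition only; the Task type, the block-set representation of cache state, and the fits-in-cache test by working-set size are assumptions of this sketch, not the paper’s formal definitions.

```cpp
#include <cstddef>
#include <set>
#include <vector>

// A cache state is modeled as the set of resident block ids.
using CacheState = std::set<long>;

// Illustrative task: a working-set size plus forked subtasks.
struct Task {
    std::size_t space;           // working-set size of this task
    std::vector<Task> children;  // subtasks forked in parallel
};

// Propagate cache state across a fork-join, PCO style, for one
// cache level of size M.
CacheState run_task(const Task& t, const CacheState& in, std::size_t M) {
    if (t.space <= M) {
        // Case 1: task fits in cache. Every forked subtask starts
        // from the same incoming state; at the join, the states are
        // merged and carried forward.
        CacheState merged = in;
        for (const Task& c : t.children) {
            CacheState out = run_task(c, in, M);
            merged.insert(out.begin(), out.end());
        }
        return merged;
    }
    // Case 2: task does not fit. Subtasks start with an empty state,
    // and the cache is set to empty again at the join.
    for (const Task& c : t.children) run_task(c, CacheState{}, M);
    return CacheState{};
}
```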

Page 32

PCO Cache Complexity Q*

• Bounds assume M = Ω(B²)
• All algorithms are work optimal
• Q* bounds match both the CO bounds and the best sequential algorithm bounds

See [Blelloch, Fineman, G, Simhadri ‘11] for details.

Page 33

Lecture 2 Outline
• Modeling the Multicore Hierarchy
  – PMH model
• Algorithm Designer’s model exposing the Hierarchy
  – Multi-BSP model
• Quest for a Simplified Hierarchy Abstraction
• Algorithm Designer’s model abstracting the Hierarchy
  – Parallel Cache-Oblivious (PCO) model
• Space-Bounded Schedulers
  – Revisit PCO model

Page 34

Space-Bounded Scheduler [Chowdhury, Silvestri, Blakeley, Ramachandran ‘10]

Key ideas:
• Schedules a dynamically unfolding parallel computation on a tree-of-caches hierarchy
• The computation exposes lots of parallelism
• Assumes the space use (working-set sizes) of tasks is known (or can be suitably estimated)
• Assigns a task to a cache C that fits the task’s working set, reserving the space in C; then recurses on the subtasks, using the CPUs and caches that share C (below C in the tree)
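To make the assignment rule concrete, here is a toy sketch of its core logic. This is our own illustration under stated assumptions, not the paper’s algorithm: a real space-bounded scheduler also manages task queues, anchoring, and the dynamically unfolding computation, and all names here are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// A cache in the tree-of-caches, with the space currently reserved
// by tasks pinned to it.
struct Cache {
    std::size_t capacity;         // size of this cache
    std::size_t reserved = 0;     // space reserved by pinned tasks
    std::vector<Cache*> children; // caches sharing this cache
};

struct SBTask {
    std::size_t space;            // (estimated) working-set size
    std::vector<SBTask> subtasks; // tasks it forks
};

// Pin task t to a cache in the subtree rooted at c that fits t's
// working set, preferring the deepest (smallest) such cache, then
// run t's subtasks using only the resources below that cache.
bool schedule(Cache& c, SBTask& t) {
    for (Cache* child : c.children)      // try to push the task down
        if (schedule(*child, t)) return true;
    if (t.space > c.capacity - c.reserved) return false; // no fit here
    c.reserved += t.space;               // reserve t's space in c
    for (SBTask& s : t.subtasks) schedule(c, s); // recurse below c
    c.reserved -= t.space;               // release when t completes
    return true;
}
```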

Page 35

Space-Bounded Scheduler

Advantages over the work-stealing (WS) scheduler:
• Avoids cache overloading for shared caches
• Exploits cache affinity for private caches

Page 36

Problem with the WS Scheduler: Cache Overloading

[Diagram: hierarchy (focus on one cache): a 10MB shared cache above two CPUs. Computation: parallel subtasks sharing read data, with 8MB working sets. Work stealing runs two 8MB subtasks on CPU 1 and CPU 2 at the same time, together exceeding the 10MB cache.]

An overloaded cache introduces more cache (capacity) misses.

Page 37

Space-Bounded Scheduler Avoids Cache Overloading

[Diagram: the same hierarchy and computation. The space-bounded scheduler runs the 8MB subtasks one after another in time, so the 10MB cache is never overloaded.]

Does not overload the cache, so fewer cache misses.

Page 38

Problem with the WS Scheduler (2): Ignoring Cache Affinity

[Diagram: shared memory above four CPUs, each with a 1MB cache. Computation: parallel tasks reading the same data, with 4MB and 5MB working sets. The WS scheduler schedules any available task whenever a processor is idle, so every CPU experiences all the cache misses and runs slowly.]

Page 39

Problem with the WS Scheduler (2): Ignoring Cache Affinity (cont.)

[Diagram, animation step: the schedule unfolds over time, each idle CPU picking up whichever 5MB task is available; again, every CPU experiences all the cache misses and runs slowly.]

Page 40

Space-Bounded Scheduler Exploits Cache Affinity

[Diagram: the same hierarchy and computation. The space-bounded scheduler pins each 5MB task to a cache, exploiting the affinity among its subtasks; the subtasks run on the CPUs below that cache.]

Page 41

Analysis Approach

Goal: algorithm analysis should remain lightweight and agnostic of the machine specifics.

Analyze for a single cache level, using the PCO model:
• Unroll the algorithm into tasks that fit in the size-M cache (space ≤ M)
• Analyze each such task separately, starting from an empty cache

Cache complexity Q*(M) = total # of misses, summed across all tasks.

[Diagram: an infinite-size main memory above a size-M cache]

Page 42

Analytical Bounds

Guarantees provided by our Space-Bounded Scheduler [Blelloch, Fineman, G, Simhadri ‘11]:

• Cache costs: optimal ∑_levels Q*(M_i) × C_i, where C_i is the miss cost for level-i caches
• Running time: for “sufficiently balanced” computations, optimal O(∑_levels Q*(M_i) × C_i / P) time on P cores

Our theorem on running time also allows arbitrary imbalance, with the performance depending on an imbalance penalty.
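For intuition, here is a small sketch of evaluating the cache-cost bound for a concrete machine. The Level struct and the qstar argument are our own illustrative names; the per-level miss function Q*(M) comes from a PCO analysis of the algorithm.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// One cache level: capacity M_i and per-miss cost C_i.
struct Level { std::size_t M; double C; };

// Evaluate sum over levels of Q*(M_i) * C_i, the scheduler's
// cache-cost bound, given a per-level miss-count function qstar(M).
double cache_cost_bound(const std::vector<Level>& levels,
                        const std::function<double(std::size_t)>& qstar) {
    double total = 0.0;
    for (const Level& lv : levels) total += qstar(lv.M) * lv.C;
    return total;
}
```

For example, for a scan of N items, Q*(M) is roughly N/B at every level (independent of M), so the bound is about (N/B) · ∑_i C_i, with B the block size.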

Page 43

Motivation for Imbalance Penalty

In a tree-of-caches:
• Each subtree has a given amount of compute & cache resources
• To avoid cache misses from migrating tasks, we would like to assign/pin each task to a subtree
• But a given program task may not match both
  – e.g., it may need a large cache but few processors
• We extend PCO with a cost metric that charges for such space-parallelism imbalance
  – An attribute of the algorithm, not of the hierarchy
  – Requires a minor additional assumption on the hierarchy

Page 44

Multi-core Computing Lectures: Progress-to-date on Key Open Questions

• How to formally model multi-core hierarchies?
• What is the Algorithm Designer’s model?
• What runtime task scheduler should be used?
• What are the new algorithmic techniques?
• How do the algorithms perform in practice?

NEXT UP
Lecture #3: Extensions

Page 45

References

[Alpern, Carter, Ferrante ‘93] B. Alpern, L. Carter, and J. Ferrante. Modeling parallel computers as memory hierarchies. Programming Models for Massively Parallel Computers, 1993.

[Arge, Goodrich, Nelson, Sitchinava ‘08] L. Arge, M. T. Goodrich, M. Nelson, and N. Sitchinava. Fundamental parallel algorithms for private-cache chip multiprocessors. ACM SPAA, 2008.

[Blelloch et al. ‘08] G. E. Blelloch, R. A. Chowdhury, P. B. Gibbons, V. Ramachandran, S. Chen, and M. Kozuch. Provably good multicore cache performance for divide-and-conquer algorithms. ACM-SIAM SODA, 2008.

[Blelloch, Fineman, G, Simhadri ‘11] G. E. Blelloch, J. T. Fineman, P. B. Gibbons, and H. V. Simhadri. Scheduling irregular parallel computations on hierarchical caches. ACM SPAA, 2011.

[Chowdhury, Silvestri, Blakeley, Ramachandran ‘10] R. A. Chowdhury, F. Silvestri, B. Blakeley, and V. Ramachandran. Oblivious algorithms for multicores and network of processors. IPDPS, 2010.

[Frigo et al. ’99] M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran. Cache-oblivious algorithms. IEEE FOCS, 1999.

[Valiant ‘08] L. G. Valiant. A bridging model for multi-core computing. ESA, 2008.

[Vitter ‘01] J. S. Vitter. External memory algorithms and data structures. ACM Computing Surveys 33:2, 2001.

