1
EECE571R -- Harnessing Massively
Parallel Processors
http://www.ece.ubc.ca/~matei/EECE571/
Lecture 1: Introduction
Matei Ripeanu
matei at ece dot ubc dot ca
Acknowledgement: some slides borrowed from presentations by Jim Demmel, Horst Simon, Mark D. Hill, David Patterson, Saman Amarasinghe, Richard Brunner, Luddy Harrison, Jack Dongarra
2
Outline
• Why powerful computers must be parallel processors
• Science problems require powerful computers
• Why writing (fast) parallel programs is hard
• Why hybrid architectures (e.g., GPU based platforms)?
• Structure of this course
(Slide annotations on the outline: other problems too; including your laptops and handhelds, all of them; but things are improving)
3
Units of Measure
• High Performance Computing (HPC) units are:
- Flop: floating point operation
- Flop/s: floating point operations per second
- Bytes: size of data (a double-precision floating point number is 8 bytes)
• Typical sizes are millions, billions, trillions, …
Mega  Mflop/s = 10^6 flop/sec   Mbyte = 2^20 = 1,048,576 ≈ 10^6 bytes
Giga  Gflop/s = 10^9 flop/sec   Gbyte = 2^30 ≈ 10^9 bytes
Tera  Tflop/s = 10^12 flop/sec  Tbyte = 2^40 ≈ 10^12 bytes
Peta  Pflop/s = 10^15 flop/sec  Pbyte = 2^50 ≈ 10^15 bytes
Exa   Eflop/s = 10^18 flop/sec  Ebyte = 2^60 ≈ 10^18 bytes
Zetta Zflop/s = 10^21 flop/sec  Zbyte = 2^70 ≈ 10^21 bytes
Yotta Yflop/s = 10^24 flop/sec  Ybyte = 2^80 ≈ 10^24 bytes
• Current fastest (public) machine ≈ 2.6 Pflop/s
- Up-to-date list at www.top500.org
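A quick sketch (my own illustration, not from the slides) of the decimal vs. binary prefixes above: flop rates conventionally use powers of 10, memory sizes powers of 2, and the two drift apart as the prefixes grow.

```python
# Sketch: decimal prefixes for flop rates, binary prefixes for memory sizes.
decimal = {"M": 10**6, "G": 10**9, "T": 10**12,
           "P": 10**15, "E": 10**18, "Z": 10**21, "Y": 10**24}
binary = {"M": 2**20, "G": 2**30, "T": 2**40,
          "P": 2**50, "E": 2**60, "Z": 2**70, "Y": 2**80}

print(binary["M"])                 # 1048576
print(binary["T"] / decimal["T"])  # ~1.0995: the gap widens with each prefix
```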
Performance evolution – Top500 List
5
Why powerful computers are parallel
(circa 1991-2006; all computers, from 2007)
6
Tunnel Vision by Experts
• “I think there is a world market for maybe five computers.”
- Thomas Watson, chairman of IBM, 1943.
• “There is no reason for any individual to have a computer in their home”
- Ken Olson, president and founder of Digital Equipment Corporation, 1977.
• “640K [of memory] ought to be enough for anybody.”
- Bill Gates, chairman of Microsoft, 1981.
• “On several recent occasions, I have been asked whether parallel computing will soon be relegated to the trash heap reserved for promising technologies that never quite make it.”
- Ken Kennedy, CRPC Director, 1994
Slide source: Warfield et al.
7
Technology Trends: Microprocessor Capacity
2X transistors/Chip Every 1.5 years
Called “Moore’s Law”
Moore’s Law
Microprocessors have become smaller, denser, and more powerful.
Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months.
Slide source: Jack Dongarra
8
Microprocessor Transistors per Chip

[Chart: transistors per chip, 1970-2005, log scale from 1,000 to 100,000,000; data points include the i4004, i8080, i8086, i80286, i80386, R2000/R3000, R10000, and Pentium]

• Growth in transistors per chip
• Increase in clock rate

[Chart: clock rate in MHz, 1970-2000, log scale from 0.1 to 1000]
9
Impact of Device Shrinkage
• What happens when the feature size (transistor size) shrinks by a factor of x?
• Clock rate goes up by ~x, because wires are shorter and gate capacitance is smaller
- actually less than x, because of power consumption
• Transistors per unit area go up by x^2
• Die size also tends to increase
- typically by another factor of ~x
• Raw computing power of the chip goes up by ~x^4!
- typically, x^3 of this is devoted to either on-chip
- parallelism: hidden parallelism such as ILP
- locality: caches
• So most programs run significantly faster, without changing them
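The shrink arithmetic above can be sketched in a few lines (an idealized, illustrative scaling model, not measured data):

```python
# Sketch of the device-shrink arithmetic (idealized scaling, illustrative).
def shrink_gains(x):
    """Raw compute gain when feature size shrinks by a factor of x."""
    clock = x          # shorter wires, smaller gate capacitance
    density = x ** 2   # transistors per unit area
    die = x            # die size also tends to grow by ~x
    return clock * density * die  # ~x**4

print(shrink_gains(2))  # 16: halving feature size gives ~16x raw compute
```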
10
But there are limiting forces
• Cost
- Moore’s 2nd law (Rock’s law): fabrication costs go up
• Yield
- The percentage of the chips that are usable
- E.g., the Cell processor (PS3) is sold with 7 out of 8 cores “on” to improve yield
Manufacturing costs and yield problems limit use of density
11
Power Density Limits Serial Performance
12
Revolution is Happening Now
• Chip density is continuing to increase ~2x every 2 years
- Clock speed is not!
- Number of processor cores may double instead
• There is little or no more hidden parallelism (ILP) to be found
• Parallelism must be exposed to and managed by software
Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)
13
Parallelism in 2011?
• These arguments are no longer theoretical
• All major processor vendors are producing multicore chips- Every machine will soon be a parallel machine
- To keep doubling performance, parallelism must double
• Which commercial applications can use this parallelism?- Do they have to be rewritten from scratch?
• Will all programmers have to be parallel programmers?- New software model needed
- Try to hide complexity from most programmers – eventually
- In the meantime, need to understand it
• Computer industry betting on this change- But does not have all the answers
14
More Limits: How fast can a serial computer be?
• Consider a 1 Tflop/s sequential machine:
- Data must travel some distance, r, to get from memory to processor.
- To get 1 data element per cycle, this means 10^12 trips per second at the speed of light, c = 3×10^8 m/s. Thus r < c/10^12 = 0.3 mm.
• Now put 1 Tbyte of storage in a 0.3 mm × 0.3 mm area:
- Each bit occupies about 1 square Angstrom, the size of a small atom.
• No choice but parallelism
r = 0.3 mm
1 Tflop/s, 1 Tbyte sequential machine
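The speed-of-light argument above checks out as a back-of-the-envelope calculation (illustrative only):

```python
# Back-of-the-envelope check of the 1 Tflop/s sequential-machine argument.
c = 3e8                        # speed of light, m/s
cycles = 1e12                  # 1 Tflop/s: one data element per cycle
r = c / cycles                 # farthest memory can be: 3e-4 m = 0.3 mm

bits = 8e12                    # 1 Tbyte = 8 x 10^12 bits
area_per_bit = (r * r) / bits  # square metres per bit
print(area_per_bit / 1e-20)    # ~1.1 square Angstroms per bit (atom-sized)
```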
Performance Development

[Chart (Top500): aggregate (SUM), fastest (N=1), and 500th-ranked (N=500) system performance, 1996-2020 (projected), log scale from 100 Mflop/s to 1 Eflop/s; fastest machine at 2.6 Pflop/s in 2010]

Concurrency Levels

[Chart: minimum, average, and maximum processor counts of Top500 systems, log scale from 1 to 1,000,000; a notebook computer marked for comparison]
Moore’s Law reinterpreted
• Number of cores per chip will double every two years
• Clock speed will not increase (possibly decrease)
• Need to deal with systems with a large number of concurrent threads - Millions for HPC systems
- Thousands for servers
- Hundreds for workstations and notebooks
• Need to deal with inter-chip parallelism as well as intra-chip parallelism
18
Outline
• Why powerful computers must be parallel processors
• Science problems require powerful computers
• Why writing (fast) parallel programs is hard
• Why hybrid architectures (e.g., GPU based platforms)?
• Structure of the course
19
Computational Science
Nature, March 23, 2006
“An important development in sciences is occurring at the intersection of computer science and the sciences that has the potential to have a profound impact on science. It is a leap from the application of computing … to the integration of computer science concepts, tools, and theorems into the very fabric of science.” -Science 2020 Report, March 2006
20
Drivers for Change
• Continued exponential increase in computational power: simulation is becoming the third pillar of science, complementing theory and experiment
• Continued exponential increase in experimental data: techniques and technology in data analysis, visualization, analytics, networking, and collaboration tools are becoming essential in all data-rich scientific applications
21
Simulation: The Third Pillar of Science
• Traditional scientific and engineering method:
(1) Do theory or paper design
(2) Perform experiments or build system
• Limitations:
- Too difficult: build large wind tunnels
- Too expensive: build a throw-away passenger jet
- Too slow: wait for climate or galactic evolution
- Too dangerous: weapons, drug design, climate experimentation
• Computational science and engineering paradigm:
(3) Use high performance computer systems to simulate and analyze the phenomenon
- Based on known physical laws and efficient numerical methods
- Analyze simulation results with computational tools and methods beyond what is used traditionally for experimental data analysis
[Diagram: Simulation, Theory, and Experiment as the three pillars]
22
What Supercomputers Do
One example: simulation replacing an experiment that is too dangerous
23
Global Climate Modeling Problem
• Problem is to compute:
“weather” = f(latitude, longitude, elevation, time)
= (temperature, pressure, humidity, wind velocity)
• Approach:
- Discretize the domain, e.g., a measurement point every 10 km
- Devise an algorithm to predict weather at time t+Δt given time t
• Uses:
- Predict major events, e.g., El Niño
- Use in setting air emissions standards
- Evaluate global warming scenarios
Source: http://www.epm.ornl.gov/chammp/chammp.html
24
Global Climate Modeling Computation
• One piece is modeling the fluid flow in the atmosphere
- Solve the Navier-Stokes equations
- Roughly 100 flops per grid point with a 1-minute timestep
• Computational requirements:
- To match real time, need 5×10^11 flops in 60 seconds ≈ 8 Gflop/s
- Weather prediction (7 days in 24 hours): 56 Gflop/s
- Climate prediction (50 years in 30 days): 4.8 Tflop/s
- To use in policy negotiations (50 years in 12 hours): 288 Tflop/s
• To double the grid resolution / time resolution, computation is 8x to 16x
• State-of-the-art models require integration of atmosphere, clouds, ocean, sea-ice, and land models, plus possibly carbon cycle, geochemistry and more
• Current models are coarser than this
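The requirements above can be reproduced roughly (illustrative; the slide rounds the base rate to 8 Gflop/s, so its quoted figures differ slightly from the exact values computed here):

```python
# Rough reproduction of the climate-modeling arithmetic from the slide.
work = 5e11 / 60                  # flops per second of real time (~8.3 Gflop/s)
weather = work * 7                # 7 simulated days per day of wall clock
climate = work * 50 * 365 / 30    # 50 simulated years in 30 days
policy = work * 50 * 365 / 0.5    # 50 simulated years in 12 hours

print(f"{work/1e9:.1f} Gflop/s base, {climate/1e12:.1f} Tflop/s climate, "
      f"{policy/1e12:.0f} Tflop/s policy")
```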
25
High Resolution Climate Modeling on NERSC-3 – P. Duffy,
et al., LLNL
26
Which commercial applications require parallelism?
The 13 motifs (“dwarfs”), with the application classes that use them (Embed, SPEC, DB, Games, ML, HPC):
1. Finite State Machine
2. Combinational
3. Graph Traversal
4. Structured Grid
5. Dense Matrix
6. Sparse Matrix
7. Spectral (FFT)
8. Dynamic Programming
9. N-Body
10. MapReduce
11. Backtrack / Branch & Bound
12. Graphical Models
13. Unstructured Grid
• Claim: a parallel architecture, language, compiler, … must do at least these well to run future parallel apps well
• Note: MapReduce is embarrassingly parallel; FSM embarrassingly sequential?
Analyzed in detail in the “Berkeley View” report: www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html
27
Outline
• Why powerful computers must be parallel processors
• Science problems require powerful computers
• Why writing (fast) parallel programs is hard
• Principles of parallel computing performance
• Structure of the course
28
Principles of Parallel Computing
• Finding enough parallelism (Amdahl’s Law)
• Granularity
• Locality
• Load balance
• Coordination and synchronization
• Performance modeling
All of these things make parallel programming even harder than sequential programming.
29
“Automatic” Parallelism in Modern Machines
• Bit-level parallelism
- within floating point operations, etc.
• Instruction-level parallelism (ILP)
- multiple instructions execute per clock cycle
• Memory system parallelism
- overlap of memory operations with computation
• OS parallelism
- multiple jobs run in parallel on commodity SMPs
There are limits to all of these: for very high performance, the user must identify, schedule, and coordinate parallel tasks
Amdahl’s Law
• Simple software assumption
- Fraction F of execution time is perfectly parallelizable
- No overhead for
– Scheduling
– Synchronization
– Communication, etc.
- Fraction 1 – F is completely serial
• Time on 1 core = (1 – F) / 1 + F / 1 = 1
• Time on N cores = (1 – F) / 1 + F / N
Finding Enough Parallelism
• Implications:
- Attack the common case when parallelizing: when F is small, optimizations will have little effect.
- Even if the parallel part speeds up perfectly, performance is limited by the sequential part.
- As N approaches infinity, speedup is bounded by 1/(1 – F).
- The aspects you ignore will limit speedup
• Discussion:- Can you ever obtain super-linear speedups?
Amdahl’s Speedup = 1 / ( (1 – F) + F/N )
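The formula above can be sketched as a function (a minimal illustration):

```python
# Amdahl's Law: speedup on N cores with perfectly parallel fraction F.
def amdahl_speedup(F, N):
    """F: perfectly parallelizable fraction; N: number of cores."""
    return 1.0 / ((1 - F) + F / N)

print(amdahl_speedup(0.9, 16))     # 6.4
print(amdahl_speedup(0.5, 10**9))  # ~2.0: bounded by 1/(1 - F)
```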
32
Overhead of Parallelism
• Given enough parallel work, this is the biggest barrier to getting desired speedup
• Parallelism overheads include:
- cost of starting a thread or process
- cost of communicating shared data
- cost of synchronizing
- extra (redundant) computation
• Each of these can be in the range of milliseconds (=millions of flops) on some systems
• Tradeoff: Algorithm needs sufficiently large units of work to run fast in parallel (i.e. large granularity), but not so large that there is not enough parallel work
33
Locality and Parallelism
• Large memories are slow, fast memories are small
• Storage hierarchies are large and fast on average
• Parallel processors, collectively, have large, fast caches
- we call the slow accesses to “remote” data “communication”
• Algorithm should do most work on local data
[Diagram: conventional storage hierarchy: each processor with its own cache, L2 cache, L3 cache, and memory, with potential interconnects between the nodes]
34
Processor-DRAM Gap (latency)

[Chart, 1980-2000, log performance scale from 1 to 1000: CPU performance (“Moore’s Law”) improving ~60%/yr vs. DRAM latency improving ~7%/yr; the processor-memory performance gap grows ~50% per year]
Goal: find algorithms that minimize communication, not necessarily arithmetic
35
Load Imbalance
• Load imbalance is the time that some processors in the system are idle due to
- insufficient parallelism (during that phase)
- unequal size tasks
• Examples of the latter- adapting to “interesting parts of a domain”
- tree-structured computations
- fundamentally unstructured problems
• Algorithm needs to balance load
- Sometimes the workload can be determined and divided up evenly before starting: “static load balancing”
- Sometimes the workload changes dynamically and must be rebalanced dynamically: “dynamic load balancing”
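The two balancing strategies above can be sketched as follows (my own illustration with hypothetical unit-cost tasks; `tasks`, `worker`, and the worker count are all made up for the example):

```python
from queue import Queue, Empty
from threading import Thread

tasks = list(range(100))  # hypothetical unit-cost tasks
NWORKERS = 4

# Static load balancing: partition up front; works when task costs are
# known and roughly equal.
static_chunks = [tasks[i::NWORKERS] for i in range(NWORKERS)]

# Dynamic load balancing: workers pull from a shared queue, so slow or
# unequal tasks do not leave other workers idle.
q = Queue()
for t in tasks:
    q.put(t)

done = []
def worker():
    while True:
        try:
            t = q.get_nowait()
        except Empty:
            return
        done.append(t)  # stand-in for real work on task t

threads = [Thread(target=worker) for _ in range(NWORKERS)]
for th in threads:
    th.start()
for th in threads:
    th.join()
print(len(done))  # 100: every task processed exactly once
```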
36
Building Parallel Software
• Two types of programmers → two layers
• Efficiency layer (10% of today’s programmers)
- Expert programmers build libraries implementing motifs, “frameworks”, OS, …
- Highest fraction of peak performance possible
• Productivity layer (90% of today’s programmers)
- Domain experts / naïve programmers productively build parallel applications by composing frameworks & libraries
- Hide as many details of the machine and parallelism as possible
- Willing to sacrifice some performance for productive programming
• You may want to work at either level
- Goal of this course: understand enough of the efficiency layer to use parallelism effectively
37
Improving Real Performance

[Chart: Teraflops (0.1 to 1,000, log scale) vs. year (1996-2004): Peak Performance rising exponentially, Real Performance below it, with a widening Performance Gap]

• Peak performance grows exponentially, a la Moore’s Law
• But efficiency (the performance relative to the hardware peak) has declined
- was 40-50% on the vector supercomputers of the 1990s
- can be as little as 5-10% on parallel supercomputers of today
• Close the gap through:
- mathematical methods and algorithms that achieve high performance on a single processor and scale to thousands of processors
- more efficient programming models and tools for massively parallel supercomputers
38
Outline
• Why powerful computers must be parallel processors
• Science problems require powerful computers
• Why writing (fast) parallel programs is hard
• Why hybrid architectures (e.g., GPU based platforms)?
• Structure of the course
Hybrid architectures in Top 500 sites
Amdahl’s Law [1967]
Amdahl’s Speedup = 1 / ( (1 – F) + F/N )
Designing a multicore platform
• Designers must confront single-core design options
- Instruction fetch, wakeup, select
- Execution unit configuration & operand bypass
- Load/store queue(s) & data cache
- Checkpoint, log, runahead, commit
• As well as additional degrees of freedom
- How many cores? How big is each?
- Shared caches: how many levels? how many banks?
- On-chip interconnect: bus, switched?
Want Simple Multicore Hardware Model
To Complement Amdahl’s Simple Software Model
(1) Chip hardware roughly partitioned into:
- (I) multiple cores (with L1 caches)
- (II) the rest (L2/L3 cache banks, interconnect, pads, etc.)
- Assume: changing core size/number does NOT change the rest
(2) Resources for multiple cores bounded:
- Bound of N resources per chip for cores
- Due to area, power, cost ($$$), or multiple factors
- Bound = power? (but the pictures here use area)
Simple Multicore Hardware Model, (cont)
(3) Architects can improve single-core performance using more of the bounded resource
• A Simple Base Core- Consumes 1 Base Core Equivalent (BCE) resources
- Provides performance normalized to 1
• An Enhanced Core (in same process generation)
- Consumes R × BCEs
- Performance as a function of R: Perf(R)
• What does function Perf(R) look like?
More on Enhanced Cores
• (Performance Perf(R) from consuming R BCEs of resources)
• If Perf(R) > R: always enhance the core
- it cost-effectively speeds up both sequential & parallel work
• Therefore, the equations assume Perf(R) < R
• Graphs assume Perf(R) = square root of R
- 2x performance for 4 BCEs, 3x for 9 BCEs, etc.
- Why? Models diminishing returns with “no coefficients”
• Two options symmetric / asymmetric multicore chips
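The slides' core-performance model can be written down directly (a modeling assumption from the slides, not measured data):

```python
import math

# Perf(R) = sqrt(R): a core built from R BCEs performs as sqrt(R),
# modeling diminishing returns from enhancing a single core.
def perf(R):
    return math.sqrt(R)

print(perf(4), perf(9))  # 2.0 3.0: Perf(R) < R for R > 1
```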
Q1: How Many (Symmetric) Cores per Chip?
• Each Chip Bounded to N BCEs (for all cores)
• Each Core consumes R BCEs
• Assume Symmetric Multicore = All Cores Identical
• Therefore, N/R Cores per Chip — (N/R)*R = N
• For an N = 16 BCE Chip:
Sixteen 1-BCE cores Four 4-BCE cores One 16-BCE core
Performance of Symmetric Multicore Chips
• Serial Fraction 1-F uses 1 core at rate Perf(R)
• Serial time = (1 – F) / Perf(R)
• Parallel fraction F uses N/R cores at rate Perf(R) each
• Parallel time = F / (Perf(R) * (N/R)) = F*R / (Perf(R)*N)
• Therefore, w.r.t. one base core:
• Implications?
Symmetric Speedup = 1 / ( (1 – F)/Perf(R) + F*R/(Perf(R)*N) )
Enhanced Cores speed Serial & Parallel
Symmetric Multicore Chip, N = 16 BCEs
F=0.5, Opt. Speedup S = 4 = 1/(0.5/4 + 0.5*16/(4*16))
Need more parallelism to have multicore optimal!
(16 cores) (8 cores) (2 cores) (1 core)
F=0.5, R=16, Cores=1, Speedup=4
(4 cores)
Symmetric Multicore Chip, N = 16 BCEs
At F=0.9, Multicore optimal, but speedup limited
Need to obtain even more parallelism!
F=0.5, R=16, Cores=1, Speedup=4
F=0.9, R=2, Cores=8, Speedup=6.7
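The symmetric-chip numbers above can be reproduced with a minimal sketch of the speedup formula, using the slides' Perf(R) = sqrt(R) model:

```python
import math

def perf(R):  # slides' model: Perf(R) = sqrt(R)
    return math.sqrt(R)

def symmetric_speedup(F, N, R):
    """N-BCE budget, N/R identical cores of R BCEs each."""
    return 1.0 / ((1 - F) / perf(R) + F * R / (perf(R) * N))

print(symmetric_speedup(0.5, 16, 16))           # 4.0: one big 16-BCE core
print(round(symmetric_speedup(0.9, 16, 2), 1))  # 6.7: eight 2-BCE cores
```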
Symmetric Multicore Chip, N = 16 BCEs
F matters: Amdahl’s Law applies to multicore chips
Researchers should target parallelism F first
F→1, R=1, Cores=16, Speedup→16
Symmetric Multicore Chip, N = 16 BCEs
As Moore’s Law enables N to go from 16 to 256 BCEs,
More core enhancements? More cores? Or both?
Recall F=0.9, R=2, Cores=8, Speedup=6.7
Symmetric Multicore Chip, N = 256 BCEs
As Moore’s Law increases N, often need enhanced core designs
Researcher should target single-core performance too
F=0.9, R=28 (vs. 2), Cores=9 (vs. 8), Speedup=26.7 (vs. 6.7): CORE ENHANCEMENTS!
F→1, R=1 (vs. 1), Cores=256 (vs. 16), Speedup=204 (vs. 16): MORE CORES!
F=0.99, R=3 (vs. 1), Cores=85 (vs. 16), Speedup=80 (vs. 13.9): CORE ENHANCEMENTS & MORE CORES!
Outline
• Symmetric Multicore Chips
• Asymmetric Multicore Chips
Asymmetric (Heterogeneous) Multicore Chips
• Symmetric Multicore Required All Cores Equal
• Why Not Enhance Some (But Not All) Cores?
• For Amdahl’s Simple Software Assumptions- One Enhanced Core
- Others are Base Cores
• How does this affect our hardware model?
How Many Cores per Asymmetric Chip?
• Each Chip Bounded to N BCEs (for all cores)
• One R-BCE Core leaves N-R BCEs
• Use N-R BCEs for N-R Base Cores
• Therefore, 1 + N - R Cores per Chip
• For an N = 16 BCE Chip:
Symmetric: four 4-BCE cores. Asymmetric: one 4-BCE core & twelve 1-BCE base cores
Performance of Asymmetric Multicore Chips
• Serial Fraction 1-F same, so time = (1 – F) / Perf(R)
• Parallel Fraction F- One core at rate Perf(R)
- N-R cores at rate 1
- Parallel time = F / (Perf(R) + N - R)
• Therefore, w.r.t. one base core:
Asymmetric Speedup = 1 / ( (1 – F)/Perf(R) + F/(Perf(R) + N – R) )
Asymmetric Multicore Chip, N = 256 BCEs
Number of Cores = 1 (Enhanced) + 256 – R (Base)
How do Asymmetric & Symmetric speedups compare?
(256 cores) (253 cores) (241 cores) (193 cores) (1 core)
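To answer the comparison question above, here is a small sketch under the slides' Perf(R) = sqrt(R) model (the particular R values scanned are my own choice, not from the slides):

```python
import math

def perf(R):  # slides' model: Perf(R) = sqrt(R)
    return math.sqrt(R)

def symmetric_speedup(F, N, R):
    return 1.0 / ((1 - F) / perf(R) + F * R / (perf(R) * N))

def asymmetric_speedup(F, N, R):
    # One R-BCE enhanced core plus N - R base cores.
    return 1.0 / ((1 - F) / perf(R) + F / (perf(R) + N - R))

# With the same N-BCE budget, the asymmetric chip wins: the big core
# absorbs the serial fraction while the base cores still help in parallel.
F, N = 0.9, 256
best_sym = max(symmetric_speedup(F, N, R) for R in (1, 4, 16, 64, 256))
best_asym = max(asymmetric_speedup(F, N, R) for R in range(1, N + 1))
print(best_asym > best_sym)  # True
```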
57
Outline
• Why powerful computers must be parallel processors
• Science problems require powerful computers
• Why writing (fast) parallel programs is hard
• Why hybrid architectures (e.g., GPU based platforms)?
• Structure of the course
Course Mechanics
Course page: http://www.ece.ubc.ca/~matei/EECE571/
Office hours: after each class or by appointment (email me)
Email: matei @ ece.ubc.ca
Office: KAIS 4033
EECE 571R: Course Goals
• Primary
- Gain understanding of fundamental issues that affect the design of parallel applications for massively multicore processors
- Survey the main current research themes
• Secondary- By studying a set of outstanding papers, build knowledge of how to do
& present research
- Learn how to read papers & evaluate ideas
What I’ll Assume You Know
• You are familiar with
- C programming; programming for traditional multicore machines: threading, synchronization
- Traditional CPU architectures: caching, pipelines, …
• If there are things that don’t make sense, ask!
This class
• Weekly schedule (tentative)
• Note: Dates are tentative- But always on Mondays 5-8pm, KAIS 4018
Class structure:
• Lectures – 1h- Technology
- Research topic
• Paper discussion – 1h
• Projects – 1h
Administrivia: Grading
• Paper reviews: 25%
• Class participation: 15%
• Discussion leading: 10%
• Project: 50%
Administrivia: Paper Reviewing (1)
• Goals:
- Think critically about what you read
- Expand your knowledge beyond the papers that are assigned
- Get used to writing paper reviews
• Reviews due by midnight the day before the class
• Have an eye on the writing style / Be professional in your writing- Clarity
- Beware of traps: learn to use them in writing and detect them in reading
- Detect (and stay away from) trivial claims.
Administrivia: Paper Reviewing (2)
Follow the format provided when relevant.
• Summarize the main contribution of the paper
• Critique the main contribution: - Significance:
- Rate the significance of the paper on a scale of 5 (breakthrough), 4 (significant contribution), 3 (modest contribution), 2 (incremental contribution), 1 (no contribution or negative contribution).
- More importantly: Explain your rating in a sentence or two.
- Rate how convincing the methodology is.
- Do the claims and conclusions follow from the experiments?
- Are the assumptions realistic?
- Are the experiments well designed?
- Are there different experiments that would be more convincing?
- Are there other alternatives the authors should have considered?
- (And, of course, is the paper free of methodological errors?)
- What is the most important limitation of the approach?
Administrivia: Paper Reviewing (3)
• What are the two strongest and/or most interesting ideas in the paper?
• What are the two most striking weaknesses in the paper?
• Name two questions that you would like to ask the authors.
• Detail an interesting extension to the work not mentioned in the future work section.
• Optional comments on the paper that you’d like to see discussed in class.
Administrivia: Discussion Leading
• Come prepared! - Prepare a 3-5 minute summary of the paper
- Prepare discussion outline
- Prepare questions:
- “What if”s
- Unclear aspects of the solution proposed
- …
- Similar ideas in different contexts
- Initiate short brainstorming sessions
• Leaders do NOT need to submit paper reviews
• Main goals: - Keep discussion flowing
- Keep discussion relevant
- Engage everybody (I’ll have an eye on this, too)
Administrivia: Projects
• It’s just a class, not your PhD.
• Aim high!
- Goal: with one or two extra months of work, your project should be ready to be submitted to a decent conference / publication
- It is doable!
• Combine with your research if relevant to this course
• Get informal approval from all instructors if you overlap final projects:
- Don’t sell the same piece of work twice
- You can get more than twice as many results with less than twice as much work
Administrivia: Project timeline (tentative)
• 3rd week: 3-slide idea presentation: “What is the (research) question you aim to answer?”
• 5th week: 2-page project proposal
--------------------- 1-week --- spring reading period
• 8th week: 4-page Midterm project due- Have a clear image of what’s possible/doable
- Report preliminary results
- Includes related work
• Final week [see schedule] In-class project presentation- Presentation
- Demo, if appropriate
- 8-page write-up
Next Class (Mon, 17/01)
• Note room change: KAIS 4018
• Discussion of - Project ideas
- Papers
To do:
• Subscribe to mailing list
• Volunteers needed to lead discussions in next week’s class
69
What you should get out of the course
In depth understanding of:
• When is parallel computing useful?
• Overview of programming models (software) and tools for massively parallel processors
• Some important parallel applications and their algorithms
• Performance analysis and tuning
• Exposure to various open research questions
Questions?
71
Extra slides
Open research questions (Dongarra)
Open research questions (Hwu)
• Accelerate computations that have no good parallel structure today
- E.g., graph analysis
• Data distributions cause catastrophic load imbalances in parallel algorithms
- scale-free graphs, MRI spiral scan, astronomy
• Computations with no data reuse
- E.g., matrix-vector multiplication
• Algorithm optimizations that are hard to do today
- how do I extract locality and regularity of access?
74
Applications
• Applications:
1. What are the apps?
2. What are the kernels of apps?
• Hardware:
3. What are the HW building blocks?
4. How to connect them?
• Programming Model / Systems Software:
5. How to describe apps and kernels?
6. How to program the HW?
• Evaluation:
7. How to measure success?
(Inspired by a view of the Golden Gate Bridge from Berkeley)
CS267 focus is here
Par Lab Research Overview: easy to write correct programs that run efficiently on manycore

[Diagram: Par Lab stack]
- Applications: Personal Health, Image Retrieval, Hearing/Music, Speech, Parallel Browser
- Motifs/Dwarfs
- Productivity layer: Composition & Coordination Language (C&CL), Parallel Libraries, Parallel Frameworks, Sketching, C&CL Compiler/Interpreter
- Efficiency layer: Efficiency Languages, Type Systems, Efficiency Language Compilers, Autotuners, Legacy Code, Schedulers, Communication & Synchronization Primitives
- Correctness: Static Verification, Dynamic Checking, Debugging with Replay, Directed Testing
- OS/Arch: Legacy OS, Multicore/GPGPU, OS Libraries & Services, Hypervisor, RAMP Manycore