The eXplicit MultiThreading (XMT) Easy-To-Program Parallel Computer
Uzi Vishkin www.umiacs.umd.edu/users/vishkin/XMT
Students: just remember to take ENEE459P: Parallel Algorithms, fall’10
- What is a parallel algorithm?
- Why should I care?
Taste of a Parallel Algorithm Example: Exchange Problem
2 bins, A and B. Exchange the contents of A and B. Ex.: A=2, B=5 → A=5, B=2.
Algorithm (serial or parallel): X:=A; A:=B; B:=X. 3 Ops. 3 Steps. Space 1.
Array Exchange Problem
2n bins: A[1..n], B[1..n]. Exchange A(i) and B(i) for i=1..n.
Serial Alg: For i=1 to n do /*serial exchange through eye-of-a-needle*/ X:=A(i); A(i):=B(i); B(i):=X. 3n Ops. 3n Steps. Space 1.
Parallel Alg: For i=1 to n pardo /*2-bin exchange in parallel*/ X(i):=A(i); A(i):=B(i); B(i):=X(i). 3n Ops. 3 Steps. Space n.
(A sketch of both versions follows.)
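For concreteness, here is a minimal sketch of both versions, with the parallel loop in XMTC-style syntax (spawn and $ denote the parallel loop and the virtual-thread id, as in the XMTC tutorial listed at the end of this document; treat the exact syntax as an assumption, not a spec):

    /* Serial: one exchange at a time through the eye of the needle. */
    void exchange_serial(int A[], int B[], int n) {
        for (int i = 0; i < n; i++) {
            int x = A[i];          /* a single temporary: space 1 */
            A[i] = B[i];
            B[i] = x;
        }                          /* 3n ops, 3n steps */
    }

    /* Parallel (XMTC-style): all n exchanges at once. */
    void exchange_parallel(int A[], int B[], int n) {
        spawn(0, n - 1) {          /* one virtual thread per index $ */
            int x = A[$];          /* a private temporary per thread: space n */
            A[$] = B[$];
            B[$] = x;
        }                          /* 3n ops, 3 steps */
    }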
Discussion
- Parallelism tends to require some extra space.
- The parallel algorithm is clearly faster than the serial one.
- What is "simpler" and "more natural": serial or parallel? In a small sample of people: serial, but only among those who majored in CS.
Eye-of-a-needle: a metaphor for the von Neumann mental and operational bottleneck. Reflects the extreme scarcity of hardware then. Less acute now.
Commodity computer systems
Chapter 1, 1946→2003: Serial. 5KHz→4GHz.
Chapter 2, 2004 onward: Parallel. #"cores": ~d^(y-2003), i.e., growing exponentially with the year y. Apple 2004: 1 core; 2013: >100 cores. Windows 7 scales to 256 cores… How to use the other 255? Did I mention ENEE459P?
BIG NEWS: Clock frequency growth is flat. If you want your program to run significantly faster … you're going to have to parallelize it. Parallelism is the only game in town. #Transistors/chip, 1980→2011: 29K→30B!
Programmer's IQ? Flat..
40 years of parallel computing: the world has yet to see a successful general-purpose parallel computer that is easy to program and gives good speedups.
[Chart: historic SPECint 2000 performance by year, leveling off in recent years, with a "?" over the future trend. Is performance at a plateau? Source: published SPECint data; Intel Platform 2015, March '05.]
Students: make yourselves ready for the job market. Serial computing is <1% of computing power. Will serial computing be taught for … history majors?
Welcome to the 2010 Impasse
All vendors are committed to multi-cores. Yet their architecture, and how to program them for single-program completion time, remain unclear.
The software spiral (HW improvements → SW improvements → HW improvements) was the growth engine for IT (A. Grove, Intel). Alas, it is now broken!
SW vendors avoid investing in long-term SW development, since they may bet on the wrong horse. The impasse is bad for business.
Parallel programming education: does a CS&E degree mean being trained for a 50-year career dominated by parallelism by programming yesterday's serial computers?
ENEE459P will teach: (i) the common denominator, and (ii) the main approaches.
Serial Abstraction & A Parallel Counterpart
• The rudimentary abstraction that made serial computing simple: any single instruction available for execution in a serial program executes immediately.
Abstracts away the different execution times of different operations (e.g., the memory hierarchy). Used by programmers to conceptualize serial computing, and supported by hardware and compilers. The program provides the instruction to be executed next (inductively).
• The rudimentary abstraction for making parallel computing simple: indefinitely many instructions, which are available for concurrent execution, execute immediately; dubbed Immediate Concurrent Execution (ICE).
Step-by-step (inductive) explication of the instructions available next for concurrent execution. The number of processors is not even mentioned. Falls back on the serial abstraction if there is 1 instruction per step.
[Figure: #ops per time step, for two executions. Serial execution, based on the serial abstraction: one op per step, so Time = Work (Work = total #ops). Parallel execution, based on the parallel abstraction: at each step, all the ops you could do in parallel assuming unlimited hardware, so Time << Work.]
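To make the work/depth contrast concrete, here is a small runnable C example (the array and its size are made up for illustration): summing n numbers in a balanced-tree pattern. All pair-sums within a round are mutually independent, so under ICE each round is one parallel step: Work = n-1 additions, Time = log2(n) rounds rather than n-1 serial steps.

    #include <stdio.h>

    #define N 8   /* assume N is a power of two */

    int main(void) {
        int a[N] = {3, 1, 4, 1, 5, 9, 2, 6};
        /* log2(N) rounds; within a round every pair-sum is
           independent, so the inner loop is a "pardo" under ICE. */
        for (int stride = 1; stride < N; stride *= 2) {
            for (int i = 0; i + stride < N; i += 2 * stride) {
                a[i] += a[i + stride];   /* all of these at once */
            }
        }
        printf("sum = %d\n", a[0]);      /* prints sum = 31 */
        return 0;
    }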
Explicit Multi-Threading (XMT)
1979 onward: THEORY. Figure out how to think algorithmically in parallel. Outcome in a nutshell: the above abstraction.
1997 onward: XMT@UMD. Derive specs for the architecture; design and build.
UV: Using Simple Abstraction to Guide the Reinvention of Computing for Parallelism,
http://www.umiacs.umd.edu/users/vishkin/XMT/cacm2010.pdf, to appear in CACM
Not just talking
Algorithms: PRAM parallel algorithmic theory. "Natural selection". A latent, though not widespread, knowledge base.
"Work-depth". SV82 conjectured: the rest (the full PRAM algorithm) is just a matter of skill.
Lots of evidence that "work-depth" works. Used as the framework in the main PRAM algorithms texts: JaJa92, KKT01.
Programming & Workflow
Rudimentary yet stable compiler.
PRAM-On-Chip HW Prototypes
• 64-core, 75MHz FPGA prototype of the XMT (Explicit Multi-Threaded) architecture [SPAA98..CF08]
• 128-core interconnection network, IBM 90nm: 9mm×5mm, 400 MHz [HotI07]
• FPGA design → ASIC, IBM 90nm: 10mm×10mm, 150 MHz
• Architecture scales to 1000+ cores on-chip.
Participants
Grad students: Aydin Balkan, PhD; George Caragea; James Edwards; David Ellison; Mike Horak, MS; Fuat Keceli; Beliz Saybasili; Alex Tzannes; Xingzhi Wen, PhD.
• Industry design experts (pro bono).
• Rajeev Barua, compiler. Co-advisor of 2 CS grad students. 2008 NSF grant.
• Gang Qu, VLSI and power. Co-advisor.
• Steve Nowick, Columbia U., asynchronous computing. Co-advisor. 2008 NSF team grant.
• Ron Tzur, Purdue U., K12 education. Co-advisor. 2008 NSF seed funding.
K12: Montgomery Blair Magnet HS, MD; Thomas Jefferson HS, VA; Baltimore (inner city) Ingenuity Project Middle School 2009 Summer Camp; Montgomery County Public Schools.
• Marc Olano, UMBC, computer graphics. Co-advisor.
• Tali Moreshet, Swarthmore College, power. Co-advisor.
• Marty Peckerar, microelectronics.
• Igor Smolyaninov, electro-optics.
• Funding: NSF, NSA (2008 deployed XMT computer), NIH.
• Industry partner: Intel.
Started from core CS. Built the HW+compiler foundation. Ready for ~10 timely CS PhD theses, ~2 in Education, and ~10 in ECE.
More on ENEE459P, fall 2010
• Parallel algorithmic thinking (PAT) based on first principles. More challenging to self-study.
• Mainstream computing → parallelism: chaotic. Hence, pluralism is valuable.
• ENEE459P: jointly taught by 2 instructors, via video conferencing with U. Illinois.
• CS@Illinois: top 5. Parallel@Illinois: #1.
• A joint course on a timely topic: an extremely rare opportunity.
• More than "2 for the price of one": 2 separate courses, each with 1 instructor, would lack the interaction.
• Advanced by Google, Intel and Microsoft, the introduction of parallelism into the curriculum dominated the recent flagship Computer Science Education Conference. Several speakers, including a keynote by the Director of Education at Intel, reported that:
(1) in job interviews, employers now expect an intelligent discussion of parallelism; and
(2) international competition recognizes this: 85% of the people who have been trained in parallel programming are outside the U.S.
[Chart: membership in the Intel Academic Community, 85% outside the USA; on implementing parallel computing into the CS curriculum. Source: M. Wrinn, Intel.]
The Pain of Parallel Programming
• Parallel programming is currently too difficult:
- To many users, programming existing parallel computers is "as intimidating and time consuming as programming in assembly language" [NSF Blue-Ribbon Panel on Cyberinfrastructure].
- AMD/Intel: "Need a PhD in CS to program today's multicores".
• The real problem: parallel architectures were built using the following "methodology": build first, figure out how to program later.
[J. Hennessy: “Many of the early ideas were motivated by observations of what was easy to implement in the hardware rather than what was easy to use”]
Input: (i) all world airports; (ii) for each, all its non-stop flights.
Find: the smallest number of flights from DCA to every other airport.
Basic (actually parallel) algorithm:
Step i: For all airports requiring i-1 flights
   For all their outgoing flights
      Mark (concurrently!) all "yet unvisited" airports as requiring i flights (note the nesting)
Serial: forces an eye-of-a-needle queue; one must also prove that it computes the same result as the parallel version. O(T) time; T = total # of flights.
Parallel: parallel data structures. Inherent serialization: S. Gain relative to serial: (first cut) ~T/S! Decisive also relative to coarse-grained parallelism.
Note: (i) "Concurrently", as in natural BFS, is the only change to the serial algorithm. (ii) No "decomposition"/"partitioning". Speed-up wrt a GPU of the same silicon area, on a highly parallel input: 5.4X! (iii) But a SMALL CONFIGURATION on a 20-way parallel input: 109X wrt the same GPU. A runnable sketch follows.
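A minimal, runnable C sketch of the level-synchronous idea (the 6-airport network and array sizes are made-up illustrations; comments mark the loops that would be "pardo" under ICE, and where XMT's prefix-sum would hand out unique slots in the next frontier):

    #include <stdio.h>

    #define N   6     /* hypothetical airports; 0 stands for DCA */
    #define INF (-1)

    int deg[N]    = {2, 2, 2, 1, 1, 0};                 /* out-degrees */
    int adj[N][2] = {{1,2}, {0,3}, {0,4}, {1}, {2}, {0}};

    int main(void) {
        int level[N], frontier[N], next[N];
        int fsize = 1, nsize;

        for (int v = 0; v < N; v++) level[v] = INF;
        level[0] = 0;                /* source airport */
        frontier[0] = 0;

        for (int i = 1; fsize > 0; i++) {
            nsize = 0;
            /* The nested loops below are independent across iterations:
               under ICE they form one concurrent step ("pardo").
               Marking "yet unvisited" airports concurrently is the only
               change relative to the serial queue-based BFS. */
            for (int f = 0; f < fsize; f++) {
                int u = frontier[f];
                for (int e = 0; e < deg[u]; e++) {
                    int v = adj[u][e];
                    if (level[v] == INF) {
                        level[v] = i;
                        next[nsize++] = v;  /* on XMT, a prefix-sum would
                                               allocate this slot atomically */
                    }
                }
            }
            for (int f = 0; f < nsize; f++) frontier[f] = next[f];
            fsize = nsize;
        }
        for (int v = 0; v < N; v++)
            printf("airport %d: %d flights\n", v, level[v]);
        return 0;
    }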
Mental effort of PRAM-like programming
1. Sometimes easier than serial.
2. Considerably easier than for any parallel computer currently sold. Understanding it falls within the common denominator of the other approaches.
2nd Example of PRAM-like Algorithm
Back to the education crisis
The CTO of NVIDIA and the official leader of multi-cores at Intel: teach parallelism as early as you can. Reason: we don't only under-teach, we mis-teach, since students acquire bad habits. The current situation is unacceptable; a sort of malpractice.
Some possibilities
1. Teach as a major elective.
2. Teach all CS&E undergrads.
3. Teach CS&E freshmen and invite all Engineering, Math, and Science students; this sends the message "CS&E is where the action is".
Need
A general-purpose parallel computer framework ["successor to the Pentium for the multi-core era"] that:
(i) is easy to program;
(ii) gives good performance with any amount of parallelism provided by the algorithm; namely, up- and down-scalability, including backwards compatibility on serial code;
(iii) supports application programming (VHDL/Verilog, OpenGL, MATLAB) and performance programming; and
(iv) fits current chip technology and scales with it (in particular: strong speed-ups for single-task completion time).
Main Point of talk: PRAM-On-Chip@UMD is addressing (i)-(iv).
The PRAM Rollercoaster Ride
Late 1970s: theory work began.
UP: Won the battle of ideas on parallel algorithmic thinking. No silver or bronze: the model of choice in all theory/algorithms communities. 1988-90: big chapters in standard algorithms textbooks.
DOWN: FCRC'93: "PRAM is not feasible". ['93 onward, despair: no good alternative! Where did vendors expect good-enough alternatives to come from in 2008?] A device changed it all:
UP: Highlights: eXplicit Multi-Threaded (XMT) FPGA-prototype computer (not a simulator), SPAA'07, CF'08; 90nm ASIC tape-outs: interconnection network, HotI'07, XMT. # on-chip transistors
How come? A crash "course" on parallel computing
How much processors-to-memories bandwidth?
- Enough: ideal programming model (PRAM)
- Limited: programming difficulties
How does it work
• "Work-depth" algorithms methodology (source: SV82). State all the ops you can do in parallel. Repeat. Minimize total #operations and #rounds. The rest is skill.
• Program single-program multiple-data (SPMD). Short (not OS) threads. Independence of order semantics (IOS). XMTC: C plus 3 commands: Spawn+Join, Prefix-Sum (see the sketch after this list). Unique: first parallelism, then decomposition.
• Programming methodology: algorithms → effective programs. Extend the SV82 work-depth framework from PRAM to XMTC. Or established APIs (VHDL/Verilog, OpenGL, MATLAB): a "win-win proposition".
• Compiler: minimize the length of the sequence of round-trips to memory; take advantage of architecture enhancements (e.g., prefetch). [Ideally: given an XMTC program, the compiler provides the decomposition: "teach the compiler".]
• Architecture: dynamically load-balance concurrent threads over processors. "The OS of the language". (Prefix-sum to registers and to memory.)
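To illustrate the three XMTC commands, here is a hedged sketch of array compaction, written to follow the spawn/$/ps syntax described in the XMTC tutorial listed at the end of this document (the psBaseReg declaration and exact spellings are assumptions drawn from that tutorial, not a verbatim API):

    /* Compact the nonzero elements of A[0..n-1] into a prefix of B.
       One virtual thread per element; prefix-sum hands each nonzero
       a unique slot in B, in no particular order (IOS). */
    psBaseReg base;                /* prefix-sum base register (assumed type) */

    void compact(int A[], int B[], int n) {
        base = 0;
        spawn(0, n - 1) {          /* Spawn: virtual thread ids $ = 0..n-1 */
            if (A[$] != 0) {
                int e = 1;
                ps(e, base);       /* Prefix-Sum: e receives the old value of
                                      base; base is atomically incremented */
                B[e] = A[$];
            }
        }                          /* implicit Join */
    }

Threads never wait on one another; the independence of order semantics (IOS) is what lets the hardware load-balance the virtual threads freely.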
PERFORMANCE PROGRAMMING & ITS PRODUCTIVITY
[Workflow diagram. Starting from a basic algorithm (sometimes informal), numbered routes to execution include:
(1) add data structures (for the serial algorithm) → serial program (C) → standard computer;
(2) serial program (C) → decomposition → assignment → orchestration → mapping [parallel programming, Culler-Singh] → parallel program → parallel computer;
(4) add parallel data structures (for the PRAM-like algorithm) → parallel program (XMT-C) → XMT computer (or simulator). Low overheads!]
• 4 easier than 2. • Problems with 3. • 4 competitive with 1: cost-effectiveness; natural.
APPLICATION PROGRAMMING & ITS PRODUCTIVITY
[Workflow diagram. Application programmer's interfaces (APIs: OpenGL, VHDL/Verilog, Matlab) feed a compiler that produces a serial program (C) or a parallel program (XMT-C); the serial program runs on a standard computer or goes through decomposition → assignment → orchestration → mapping [parallel programming, Culler-Singh] to a parallel computer; the XMT-C program runs on the XMT architecture (simulator). "Automatic?" annotations on these compiler routes: Yes, Yes, Maybe.]
Naming Contest for New Computer
"Paraleap", chosen out of ~6000 submissions.
A single (hard-working) person (X. Wen) completed the synthesizable Verilog description AND the new FPGA-based XMT computer in slightly more than two years, with no prior design experience. Attests to the basic simplicity of the XMT architecture → faster time to market, lower implementation cost.
Experience with High-School Students, Fall'07
Gave a 1-day parallel algorithms tutorial to 12 HS students. Some (two 10th-graders) managed 8 programming assignments, including 5 of the 6 in the grad course. Only help: 1 office hour/week by an undergrad TA. No school credit. Part of a computer club, after 8 periods/day.
May-June '08: 23 HS students, taught by a self-taught HS teacher, Alexandria, VA.
Spring'08: course for non-major freshmen (UMD Honors): how will programmers have to think by the time you graduate?
Spring'08: course for seniors.
NEW: Software release
Allows you to use your own computer for programming in an XMT environment and experimenting with it, including:
(i) a cycle-accurate simulator of the XMT machine;
(ii) a compiler from XMTC to that machine.
Also provided: extensive material for teaching or self-studying parallelism, including
(i) a tutorial + manual for XMTC (150 pages);
(ii) class notes on parallel algorithms (100 pages);
(iii) a video recording of the 9/15/07 HS tutorial (300 minutes).
Next Major Objective
An industry-grade chip and a production-quality compiler. Requires 10X in funding.