
The eXplicit MultiThreading (XMT) Easy-To-Program Parallel Computer

Page 1: The eXplicit MultiThreading (XMT)  Easy-To-Program Parallel Computer

The eXplicit MultiThreading (XMT) Easy-To-Program Parallel Computer

Uzi Vishkin www.umiacs.umd.edu/users/vishkin/XMT

Students: just remember to take ENEE459P: Parallel Algorithms, fall’10

- What is a parallel algorithm?
- Why should I care?

Page 2: The eXplicit MultiThreading (XMT)  Easy-To-Program Parallel Computer

Taste of a Parallel Algorithm Example: Exchange Problem

2 Bins A and B. Exchange contents of A and B. Ex. A=2, B=5 → A=5, B=2.
Algorithm (serial or parallel): X:=A; A:=B; B:=X. 3 Ops. 3 Steps. Space 1.

Array Exchange Problem
2n bins A[1..n], B[1..n]. Exchange A(i) and B(i), i=1..n.
Serial Alg: For i=1 to n do /* serial exchange through eye-of-a-needle */
  X:=A(i); A(i):=B(i); B(i):=X
3n Ops. 3n Steps. Space 1.
Parallel Alg: For i=1 to n pardo /* 2-bin exchange in parallel */
  X(i):=A(i); A(i):=B(i); B(i):=X(i)
3n Ops. 3 Steps. Space n.

Discussion
- Parallelism tends to require some extra space
- Par Alg clearly faster than Serial Alg.
- What is “simpler” and “more natural”: serial or parallel? Small sample of people: serial, but only if you .. majored in CS

Eye-of-a-needle: metaphor for the von-Neumann mental & operational bottleneck. Reflects extreme scarcity of HW. Less acute now.
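
The two array-exchange algorithms can be sketched in Python as below. This is only an illustration of their structure, not XMT code: the function names and the use of a thread pool are assumptions made here, and standard CPython threads will not deliver the 3-step running time of a real pardo.

```python
from concurrent.futures import ThreadPoolExecutor


def serial_exchange(a, b):
    """Swap a[i] and b[i] one index at a time through a single temporary x."""
    for i in range(len(a)):
        x = a[i]       # X := A(i)
        a[i] = b[i]    # A(i) := B(i)
        b[i] = x       # B(i) := X


def parallel_exchange(a, b, workers=4):
    """Swap a[i] and b[i] with one logical thread per index.

    Each index gets its own temporary x[i], which is the extra
    space (n instead of 1) noted in the discussion.
    """
    n = len(a)
    x = [None] * n

    def swap(i):
        x[i] = a[i]    # X(i) := A(i)
        a[i] = b[i]    # A(i) := B(i)
        b[i] = x[i]    # B(i) := X(i)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list(...) forces all swaps to finish before returning
        list(pool.map(swap, range(n)))
```

No thread touches an index owned by another thread, so the parallel version needs no locking; the price is the n-element temporary array.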

Page 3: The eXplicit MultiThreading (XMT)  Easy-To-Program Parallel Computer

Commodity computer systems
Chapter 1 1946→2003: Serial. 5KHz→4GHz.

Chapter 2 2004--: Parallel. #“cores”: ~d^(y-2003). Apple 2004: 1 core. 2013: >100 cores. Windows 7: scales to 256 cores… How to use the other 255? Did I mention ENEE459P?

BIG NEWS
Clock frequency growth: flat. If you want your program to run significantly faster … you’re going to have to parallelize it. Parallelism: only game in town.
#Transistors/chip 1980→2011: 29K→30B!

Programmer’s IQ? Flat..
40 years of parallel computing. The world is yet to see a successful general-purpose parallel computer: Easy to program & good speedups.

Intel Platform 2015, March05

Page 4: The eXplicit MultiThreading (XMT)  Easy-To-Program Parallel Computer

Is performance at a plateau?

[Figure: Historic SPECint 2000 performance by year. Source: published SPECInt data]

Students: Make yourself ready for the job market. Serial computing <1% of computing power. Will serial computing be taught for … history majors?

Page 5: The eXplicit MultiThreading (XMT)  Easy-To-Program Parallel Computer

Welcome to the 2010 Impasse
All vendors committed to multi-cores. Yet, their architecture and how to program them for single program completion time not clear.

The software spiral (HW improvements → SW imp → HW imp): growth engine for IT (A. Grove, Intel); Alas, now broken!

SW vendors avoid investment in long-term SW development since they may bet on the wrong horse. Impasse bad for business.

Parallel programming education: Does CS&E degree mean: being trained for a 50yr career dominated by parallelism by programming yesterday’s serial computers?

ENEE459P Teach: (i) common denominator, and (ii) main approaches.

Page 6: The eXplicit MultiThreading (XMT)  Easy-To-Program Parallel Computer

Serial Abstraction & A Parallel Counterpart
• Rudimentary abstraction that made serial computing simple: that any single instruction available for execution in a serial program executes immediately

Abstracts away different execution time for different operations (e.g., memory hierarchy). Used by programmers to conceptualize serial computing and supported by hardware and compilers. The program provides the instruction to be executed next (inductively)

• Rudimentary abstraction for making parallel computing simple: that indefinitely many instructions, which are available for concurrent execution, execute immediately, dubbed Immediate Concurrent Execution (ICE)

Step-by-step (inductive) explication of the instructions available next for concurrent execution. # processors not even mentioned. Falls back on the serial abstraction if 1 instruction/step.

What could I do in parallel at each step assuming unlimited hardware

[Figure: two plots of #ops against time. Serial Execution, Based on Serial Abstraction: Time = Work. Parallel Execution, Based on Parallel Abstraction: Time << Work. Work = total #ops]
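
A small worked example may help with Work versus Time. The sketch below uses summation, which is not an example from these slides; it is chosen here only because its rounds are easy to count. Each round lists all additions that could execute concurrently under the ICE abstraction, and the function (a hypothetical name) reports total work and number of rounds.

```python
def tree_sum(values):
    """Sum by pairwise rounds; return (total, work, depth).

    work  = total number of additions (total #ops)
    depth = number of rounds, i.e. time under the parallel abstraction
    """
    level = list(values)
    work = 0
    depth = 0
    while len(level) > 1:
        nxt = []
        # Every addition in this round is independent of the others,
        # so with unlimited hardware the round takes one time step.
        for i in range(0, len(level) - 1, 2):
            nxt.append(level[i] + level[i + 1])
            work += 1
        if len(level) % 2 == 1:
            nxt.append(level[-1])  # odd element is carried to the next round
        level = nxt
        depth += 1
    return (level[0] if level else 0), work, depth
```

For 8 inputs this gives work 7 and depth 3: a serial execution needs 7 steps (Time = Work), while the parallel abstraction needs 3 (Time << Work as n grows).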

Page 7: The eXplicit MultiThreading (XMT)  Easy-To-Program Parallel Computer

Explicit Multi-threading (XMT)
1979- : THEORY figure out how to think algorithmically in parallel. Outcome in a nutshell: above abstraction

1997- : XMT@UMD: derive specs for architecture; design and build

UV: Using Simple Abstraction to Guide the Reinvention of Computing for Parallelism, http://www.umiacs.umd.edu/users/vishkin/XMT/cacm2010.pdf, to appear in CACM

Page 8: The eXplicit MultiThreading (XMT)  Easy-To-Program Parallel Computer

Not just talking

Algorithms
PRAM parallel algorithmic theory. “Natural selection”. Latent, though not widespread, knowledgebase

“Work-depth”. SV82 conjectured: The rest (full PRAM algorithm) just a matter of skill.

Lots of evidence that “work-depth” works. Used as framework in main PRAM algorithms texts: JaJa92, KKT01

Programming & Workflow
Rudimentary yet stable compiler

PRAM-On-Chip HW Prototypes
64-core, 75MHz FPGA of XMT (Explicit Multi-Threaded) architecture, SPAA98..CF08

128-core intercon. network, IBM 90nm: 9mmX5mm, 400 MHz [HotI07]

• FPGA design → ASIC
• IBM 90nm: 10mmX10mm
• 150 MHz

Architecture scales to 1000+ cores on-chip

Page 9: The eXplicit MultiThreading (XMT)  Easy-To-Program Parallel Computer

Participants
Grad students: Aydin Balkan, PhD, George Caragea, James Edwards, David Ellison, Mike Horak, MS, Fuat Keceli, Beliz Saybasili, Alex Tzannes, Xingzhi Wen, PhD

• Industry design experts (pro-bono).
• Rajeev Barua, Compiler. Co-advisor of 2 CS grad students. 2008 NSF grant.
• Gang Qu, VLSI and Power. Co-advisor.
• Steve Nowick, Columbia U., Asynch computing. Co-advisor. 2008 NSF team grant.
• Ron Tzur, Purdue U., K12 Education. Co-advisor. 2008 NSF seed funding
  K12: Montgomery Blair Magnet HS, MD, Thomas Jefferson HS, VA, Baltimore (inner city) Ingenuity Project Middle School 2009 Summer Camp, Montgomery County Public Schools
• Marc Olano, UMBC, Computer graphics. Co-advisor.
• Tali Moreshet, Swarthmore College, Power. Co-advisor.
• Marty Peckerar, Microelectronics
• Igor Smolyaninov, Electro-optics
• Funding: NSF, NSA 2008 deployed XMT computer, NIH
• Industry partner: Intel

Started from core CS. Built HW+Compiler foundation. Ready for ~10 timely CS PhD theses, ~2 Education, and ~10 ECE.

Page 10: The eXplicit MultiThreading (XMT)  Easy-To-Program Parallel Computer

More on ENEE459P, fall 2010
• Parallel algorithmic thinking (PAT) based on first principles. More challenging to self-study
• Mainstream computing → parallelism: chaotic. Hence: Pluralism valuable.
• ENEE459: jointly taught by 2 instructors, video conferencing, U. Illinois
• CS@Illinois: top 5. Parallel@Illinois: #1.
• Joint course on timely topic: extremely rare opportunity.
• More than “2 for the price of one”. 2 courses, each with 1 instructor, would lack the interaction.
• Advanced by Google, Intel and Microsoft, the introduction of parallelism into the curriculum dominated the recent flagship Computer Science Education Conference. Several speakers, including a Keynote by the Director of Education at Intel, reported that:

(1) In job interviews, employers now expect an intelligent discussion of parallelism; and

(2) International competition recognizes that: 85% of the people that have been trained in parallel programming are outside the U.S.

Page 11: The eXplicit MultiThreading (XMT)  Easy-To-Program Parallel Computer

Membership in Intel Academic Community

85% outside USA

Implementing parallel computing into CS curriculum

Source: M. Wrinn, Intel

Page 12: The eXplicit MultiThreading (XMT)  Easy-To-Program Parallel Computer

The Pain of Parallel Programming
• Parallel programming is currently too difficult:
- To many users programming existing parallel computers is “as intimidating and time consuming as programming in assembly language” [NSF Blue-Ribbon Panel on Cyberinfrastructure].

- AMD/Intel: “Need PhD in CS to program today’s multicores”.

• The real problem: Parallel architectures built using the following “methodology”: build-first figure-out-how-to-program-later.

[J. Hennessy: “Many of the early ideas were motivated by observations of what was easy to implement in the hardware rather than what was easy to use”]

Page 13: The eXplicit MultiThreading (XMT)  Easy-To-Program Parallel Computer

2nd Example of PRAM-like Algorithm

Input: (i) All world airports. (ii) For each, all its non-stop flights.
Find: smallest number of flights from DCA to every other airport.

Basic (actually parallel) algorithm
Step i: For all airports requiring i-1 flights
  For all its outgoing flights
    Mark (concurrently!) all “yet unvisited” airports as requiring i flights (note nesting)

Serial: forces eye-of-a-needle queue; need to prove that still the same as the parallel version.
O(T) time; T: total # of flights

Parallel: parallel data-structures. Inherent serialization: S.
Gain relative to serial: (first cut) ~T/S! Decisive also relative to coarse-grained parallelism.

Note:
(i) “Concurrently” as in natural BFS: only change to serial algorithm
(ii) No “decomposition”/“partition”. Speed-up wrt GPU: same-silicon area for highly parallel input 5.4X!
(iii) But, SMALL CONFIG on 20-way parallel input: 109X wrt same GPU

Mental effort of PRAM-like programming
1. sometimes easier than serial
2. considerably easier than for any parallel computer currently sold. Understanding falls within the common denominator of other approaches.
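
The level-by-level airport algorithm on this slide can be sketched as follows. This is a serial Python rendering of the step structure, written for illustration; the function name, the dictionary-of-lists input format, and the airport codes other than DCA are assumptions made here. The two nested loops inside each step are the part a PRAM-like machine would run concurrently.

```python
def bfs_levels(flights, source):
    """Return {airport: smallest number of flights from source}.

    flights maps each airport to the list of airports it has
    non-stop flights to.
    """
    dist = {source: 0}
    frontier = [source]          # airports requiring i-1 flights
    i = 0
    while frontier:
        i += 1
        next_frontier = []
        # Step i. In the parallel algorithm both loops run concurrently.
        for airport in frontier:
            for dest in flights.get(airport, ()):
                if dest not in dist:         # "yet unvisited"
                    dist[dest] = i           # mark as requiring i flights
                    next_frontier.append(dest)
        frontier = next_frontier
    return dist


# Usage with a made-up four-airport network:
example = {
    "DCA": ["ORD", "ATL"],
    "ORD": ["SFO"],
    "ATL": ["SFO", "DCA"],
    "SFO": [],
}
print(bfs_levels(example, "DCA"))
```

The number of iterations of the outer while loop is the number of steps, which corresponds to the inherent serialization S on the slide; the total number of inner-loop iterations is T.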

Page 18: The eXplicit MultiThreading (XMT)  Easy-To-Program Parallel Computer

Back to the education crisis
CTO of NVidia and the official leader of multi-cores at Intel: teach parallelism as early as you can.
Reason: we don’t only underteach. We misteach, since students acquire bad habits. Current situation is unacceptable. Sort of malpractice.

Some possibilities
1. Teach as a major elective.
2. Teach all CS&E undergrads.
3. Teach CS&E Freshmen and invite all Eng, Math, and Science; sends message “CS&E is where the action is”.

Page 19: The eXplicit MultiThreading (XMT)  Easy-To-Program Parallel Computer

Need
A general-purpose parallel computer framework [“successor to the Pentium for the multi-core era”] that:
(i) is easy to program;
(ii) gives good performance with any amount of parallelism provided by the algorithm; namely, up- and down-scalability including backwards compatibility on serial code;
(iii) supports application programming (VHDL/Verilog, OpenGL, MATLAB) and performance programming; and
(iv) fits current chip technology and scales with it.
(in particular: strong speed-ups for single-task completion time)

Main Point of talk: PRAM-On-Chip@UMD is addressing (i)-(iv).

Page 20: The eXplicit MultiThreading (XMT)  Easy-To-Program Parallel Computer

The PRAM Rollercoaster ride

Late 1970’s Theory work began
UP Won the battle of ideas on parallel algorithmic thinking. No silver or bronze! Model of choice in all theory/algorithms communities. 1988-90: Big chapters in standard algorithms textbooks.
DOWN FCRC’93: “PRAM is not feasible”. [‘93+ despair no good alternative! Where do vendors expect good enough alternatives to come from in 2008?]; Device changed it all:
UP Highlights: eXplicit-multi-threaded (XMT) FPGA-prototype computer (not simulator), SPAA’07, CF’08; 90nm ASIC tape-outs: int. network, HotI’07, XMT. # on-chip transistors

How come? crash “course” on parallel computing
How much processors-to-memories bandwidth?
Enough: Ideal Programming Model (PRAM)
Limited: Programming difficulties

Page 21: The eXplicit MultiThreading (XMT)  Easy-To-Program Parallel Computer

How does it work
“Work-depth” Algs Methodology (source SV82) State all ops you can do in parallel. Repeat. Minimize: Total #operations, #rounds. The rest is skill.

• Program single-program multiple-data (SPMD). Short (not OS) threads. Independence of order semantics (IOS). XMTC: C plus 3 commands: Spawn+Join, Prefix-Sum. Unique: First parallelism. Then decomposition

Programming methodology Algorithms → effective programs. Extend the SV82 Work-Depth framework from PRAM to XMTC

Or Established APIs (VHDL/Verilog, OpenGL, MATLAB) “win-win proposition”

Compiler minimize length of sequence of round-trips to memory; take advantage of architecture enhancements (e.g., prefetch). [ideally: given XMTC program, compiler provides decomposition: “teach the compiler”]

Architecture Dynamically load-balance concurrent threads over processors. “OS of the language”. (Prefix-sum to registers & to memory.)
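
To give a feel for how Spawn+Join and Prefix-Sum combine, here is a Python emulation of array compaction (copying the nonzero elements of an array into a dense output). It is not XMTC and does not use any XMT tool: the class `PsBase`, the function `compact`, and the thread-pool mapping are all assumptions made for illustration. The prefix-sum is modeled as an atomic fetch-and-add on a shared base, and output order is arbitrary, in the spirit of independence of order semantics.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class PsBase:
    """Shared counter with atomic fetch-and-add, standing in for a prefix-sum base."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def ps(self, increment):
        """Atomically add increment and return the value held before the add."""
        with self._lock:
            old = self._value
            self._value += increment
            return old

    @property
    def value(self):
        return self._value


def compact(a, workers=4):
    """Copy the nonzero elements of a into a dense list, in arbitrary order."""
    base = PsBase(0)
    b = [None] * len(a)

    def thread_body(i):            # plays the role of one short spawned thread
        if a[i] != 0:
            slot = base.ps(1)      # every caller receives a distinct slot
            b[slot] = a[i]

    with ThreadPoolExecutor(max_workers=workers) as pool:   # "spawn"
        list(pool.map(thread_body, range(len(a))))           # waits for all threads: "join"
    return b[:base.value]
```

Because each thread obtains its slot from the prefix-sum rather than from its own index, no thread needs to know how many other threads found a nonzero element, and the result is correct for any interleaving.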

Page 22: The eXplicit MultiThreading (XMT)  Easy-To-Program Parallel Computer

PERFORMANCE PROGRAMMING & ITS PRODUCTIVITY

[Figure: workflow diagram with four numbered paths (1-4) starting from “Basic Algorithm (sometimes informal)”. Labels: Add data-structures (for serial algorithm); Serial program (C); Standard Computer; Decomposition, Assignment, Orchestration, Mapping; Parallel Programming (Culler-Singh); Parallel computer; Add parallel data-structures (for PRAM-like algorithm); Parallel program (XMT-C); XMT Computer (or Simulator). Annotation: Low overheads!]

• 4 easier than 2
• Problems with 3
• 4 competitive with 1: cost-effectiveness; natural

Page 23: The eXplicit MultiThreading (XMT)  Easy-To-Program Parallel Computer

APPLICATION PROGRAMMING & ITS PRODUCTIVITY

[Figure: workflow diagram starting from “Application programmer’s interfaces (APIs) (OpenGL, VHDL/Verilog, Matlab)” and a compiler. Labels: Serial program (C); Standard Computer; Decomposition, Assignment, Orchestration, Mapping; Parallel Programming (Culler-Singh); Parallel computer; Parallel program (XMT-C); XMT architecture (Simulator). Annotation: Automatic? Yes, Yes, Maybe]

Page 24: The eXplicit MultiThreading (XMT)  Easy-To-Program Parallel Computer

Naming Contest for New Computer

Paraleap
chosen out of ~6000 submissions

Single (hard working) person (X. Wen) completed synthesizable Verilog description AND the new FPGA-based XMT computer in slightly more than two years. No prior design experience. Attests to: basic simplicity of the XMT architecture → faster time to market, lower implementation cost.

Page 25: The eXplicit MultiThreading (XMT)  Easy-To-Program Parallel Computer

Experience with High School Students, Fall’07

Gave 1-day parallel algorithms tutorial to 12 HS students. Some (2 10th graders) managed 8 programming assignments, including 5 of the 6 in the grad course. Only help: 1 office hour/week by undergrad TA. No school credit. Part of a computer club after 8 periods/day.

May-June 08: 23 HS students, by self-taught HS teacher, Alexandria, VA

Spring’08: Course to non-major Freshmen (UMD Honor). How will programmers have to think by the time you graduate.

Spring’08: Course to seniors.

Page 26: The eXplicit MultiThreading (XMT)  Easy-To-Program Parallel Computer

NEW: Software release
Lets you use your own computer for programming on an XMT environment and experimenting with it, including:
(i) Cycle-accurate simulator of the XMT machine
(ii) Compiler from XMTC to that machine

Also provided: extensive material for teaching or self-studying parallelism, including
(i) Tutorial + manual for XMTC (150 pages)
(ii) Classnotes on parallel algorithms (100 pages)
(iii) Video recording of 9/15/07 HS tutorial (300 minutes)

Next Major Objective
Industry-grade chip and production quality compiler. Requires 10X in funding.

