Performance Potential of an Easy-to- Program PRAM-On-Chip Prototype Versus State-of-the-Art...

Performance Potential of an Easy-to-Program PRAM-On-Chip Prototype Versus

State-of-the-Art ProcessorGeorge C. Caragea – University of MarylandA. Beliz Saybasili – LCB Branch, NHLBL, NIH

Xingzhi Wen – NVIDIA CorporationUzi Vishkin – University of Maryland

www.umiacs.umd.edu/users/vishkin/XMT

Hardware prototypes of PRAM-On-Chip

Objective of current paper Meaningful comparison of1. Our FPGA design, with2. State-of-the-Art (Intel) Processor

64-core, 75MHz FPGA prototype[SPAA’07, Computing Frontiers’08]Original explicit multi-threaded (XMT) architecture [SPAA98] (Cray started to use “XMT” ~7 years later)

Interconnection Network for 128-core. 9mmX5mm, IBM90nm process. 400 MHz prototype [HotInterconnects’07]

Same design as 64-core FPGA. 10mmX10mm, IBM90nm process. 150 MHz prototype

The design scales to 1000+ cores on-chip

XMT: A PRAM-On-Chip Vision• Manycores are coming. But 40yrs of parallel computing: • Never a successful general-purpose parallel computer (easy to

program, good speedups, up&down scalable). IF you could program it great speedups. XMT: Fix the IF

• XMT: Designed from the ground up to address that for on-chip parallelism

• Unlike matching current HW (Some other SPAA papers)• Tested HW & SW prototypes • Builds on PRAM algorithmics. Only really successful parallel

algorithmic theory. Latent, though not widespread, knowledgebase

• This paper: ~10X relative to Intel Core 2 DuoIf there is time: Really serious about ease of programming

Objective for programmer’s model• Emerging: not sure, but the analysis should be work-depth.

But, why not design for your analysis? (like serial)

• XMT: Design for work-depth. Unique among manycores.- 1 operation now. Any #ops next time unit.- Competitive on nesting. (To be published.)- No need to program for locality.

What could I do in parallel at each step assuming

unlimited hardware

# ops

.. ..time

#ops

.. ..

.... ..

time

Time = Work Work = total #ops

Time << Work

Serial Paradigm Natural (Parallel) Paradigm

Programmer’s Model: Engineering Workflow• Arbitrary CRCW Work-depth algorithm. Reason about correctness & complexity in synchronous model • SPMD reduced synchrony

– Threads advance at own speed, not lockstep– Main construct: spawn-join block. Note: can start any number of processes at once– Prefix-sum (ps). Independence of order semantics (IOS).

– Establish correctness & complexity by relating to WD analyses.– Circumvents “The problem with threads”, e.g., [Lee].

• Tune (compiler or expert programmer): (i) Length of sequence of round trips to memory, (ii) QRQW, (iii) WD. [VCL07]

• Trial&error contrast: similar startwhile insufficient inter-thread bandwidth do{rethink algorithm to take better advantage of cache}

spawn join spawn join

XMT Architecture Overview• One serial core – master

thread control unit (MTCU)• Parallel cores (TCUS)

grouped in clusters• Global memory space

evenly partitioned in cache banks using hashing

• No local caches at TCU– Avoids expensive cache

coherence hardware

Cluster 1Cluster 1 Cluster 2Cluster 2 Cluster CCluster C

DRAM Channel 1

DRAM Channel 1

DRAM Channel D

DRAM Channel D

MTCUMTCUHardware Scheduler/Prefix-Sum UnitHardware Scheduler/Prefix-Sum Unit

Parallel Interconnection Network

Memory Bank 1

Memory Bank 1

Memory Bank 2

Memory Bank 2

Memory Bank M

Memory Bank M

Shared Memory(L1 Cache)

…

…

…

Paraleap: XMT PRAM-on-chip silicon• Built FPGA prototype• Announced in SPAA’07• Built using 3 FPGA chips

– 2 Virtex-4 LX200 – 1 Virtex-4 FX100

Clock rate 75 MHz No. TCUs 64

DRAM size 1GB Clusters 8

DRAM channels 1 Cache modules 8

Mem. data rate 0.6GB/s Shared cache 256KB

With no prior design experience, X. Wen completed synthesizable Verilog description AND the new FPGA-based XMT computer in slightly more than two years. X. Wen is one person.. basic simplicity of the XMT architecture simple faster time to market, lower implementation cost.

Benchmarks

• Sparse Matrix – Vector Multiplication (SpMV)– Matrix stored in Compact Sparse Row (CSR) format– Serial version: iterate through rows– Parallel version: one thread per row

• 1-D FFT– Fixed-point arithmetic implementation– Serial version: Radix-2 Cooley-Tukey Algorithm– Parallel version: Parallelized each stage of serial algorithm

• Quicksort– Serial version: standard textbook implementation– Parallel version: two phases

• Phase 1: For large sub-arrays, parallelize partitioning operation using atomic prefix-sum

• Phase 2: Process all partitions in parallel using serial partitioning algorithm

Experimental Platforms

XMT Paraleap FPGA Intel Core 2 Duo

Processor 1 MTCU, 64 TCUs Dual Core, E6300

Clock 75 MHz 1.86GHz

Cache 256KB shared L1 cache 2x64KB L1, 2x2MB L2

DRAM 1GB DDR2 2GB DDR2

Data rate 0.6GB/s 6.4GB/s

Compiler XMTCC (GCC 4.0.2-based) GCC 4.1.0

Intel C++ Professional Compiler (ICC) v11

Compiler Optim.

-O3, data prefetch, read-only buffers

-O3, SSE3 SIMD, data prefetching, auto-parallelization

For meaningful comparison: compare cycle count

Input Datasetssmall large

Program N Footprint N Footprint

SpMV 22K 200KB 4M 33MB

FFT 8K 192KB 4M 96MB

Quicksort 100K 781KB 20M 153MB

• Large dataset represents realistic input sizes– Recommended by Intel engineer for comparison– Gives Intel Core 2 advantage because of larger cache

• Small dataset– Fits in both Paraleap and Intel Core 2 cache– Provides most fair comparison for current XMT generation

Clock-Cycle Speedup

Core 2 – ICC Core 2 - GCC

Program small large small large

SpMV 6.7 3.3 6.3 3.26

FFT 9.51 2.51 8.76 2.71

Quicksort(*) 13.07 7.75 13.89 8.18

• Paraleap outperforms Intel Core 2 on all benchmarks• Lower speed-ups for Large dataset because of smaller cache size

– Will not be an issue for future implementations of XMT• Silicon area of 64-TCU XMT roughly the same as one core of Intel Core 2 Duo• No reason for clock frequency of XMT to fall behind

• Computed as:

speedup = #ClockCycles for Core 2 / #ClockCylces for Paraleap

Conclusion• XMT provides viable answer to biggest challenges for the

field– Ease of programming– Scalability (up&down)

• Preliminary evaluation shows good result of XMT architecture versus state-of-the art Intel Core 2 platform

• ICPP’08 paper compares with GPUs.

Software releaseAllows to use your own computer for programming on an XMT environment and experimenting with it, including:(i)Cycle-accurate simulator of the XMT machine(ii)Compiler from XMTC to that machine

Also provided, extensive material for teaching or self-studying parallelism, including(i)Tutorial + manual for XMTC (150 pages)(ii)Classnotes on parallel algorithms (100 pages)(iii)Video recording of 9/15/07 HS tutorial (300 minutes)(iv) Video recording of grad Parallel Algorithms lectures (30+hours)www.umiacs.umd.edu/users/vishkin/XMT/sw-release.html

Next Major ObjectiveIndustry-grade chip. Requires 10X in funding.

Ease of Programming• Benchmark: can any CS major program your manycore?

- cannot really avoid it. Teachability demonstrated so far:

- To freshman class with 11 non-CS students. Some prog. assignments: merge-sort, integer-sort & samples-sort.

Other teachers:- Magnet HS teacher. Downloaded simulator, assignments, class notes, from XMT page. Self-taught. Recommends: Teach XMT first. Easiest to set up (simulator), program, analyze: ability to anticipate performance (as in serial). Can do not just for embarrassingly parallel. Teaches also OpenMP, MPI, CUDA. Lookup keynote at CS4HS’09@CMU + interview with teacher.- High school & Middle School (some 10 year olds) students from underrepresented groups by HS Math teacher.

Date post:	20-Dec-2015
Category:	Documents
View:	213 times
Download:	0 times

Performance Potential of an Easy-to- Program PRAM-On-Chip Prototype Versus State-of-the-Art...

Documents