DUSD(Labs) Breaking the Memory Wall for Scalable Microprocessor Platforms
Wen-mei Hwu
with John W. Sias, Erik M. Nystrom, Hong-seok Kim, Chien-wei Li, Hillery C. Hunter, Shane Ryoo, Sain-Zee Ueng, James W. Player, Ian M. Steiner, Chris I. Rodrigues, Robert E. Kidd, Dan R. Burke, Nacho Navarro, Steve S. Lumetta
University of Illinois at Urbana-Champaign
Transcript
Page 1: DUSD(Labs) Breaking the Memory Wall for Scalable Microprocessor Platforms Wen-mei Hwu with John W. Sias, Erik M. Nystrom, Hong-seok Kim, Chien-wei Li,

DUSD(Labs)

Breaking the Memory Wall for Scalable Microprocessor Platforms

Wen-mei Hwu

with

John W. Sias, Erik M. Nystrom, Hong-seok Kim, Chien-wei Li, Hillery C. Hunter, Shane Ryoo, Sain-Zee Ueng, James W. Player, Ian M. Steiner, Chris I. Rodrigues, Robert E. Kidd, Dan R. Burke, Nacho Navarro, Steve S. Lumetta

University of Illinois at Urbana-Champaign

Page 2:

PACT Keynote, October 1, 2004

Semiconductor computing platform challenges

[Diagram: the platform's building blocks (DSP/ASIP, Intelligent RAM, Microprocessors, Reconfigurability, accelerators) surrounded by competing pressures: performance, billion transistors, power, cost, security, feature set, reliability, memory latency/bandwidth, power constraints, O/S limitations, S/W inertia, wire load, process variation, leakage, fab cost.]

Page 3:

ASIC/ASIP economics

Optimistically, ASIC/ASSP revenues growing 10–20% / year
Engineering portion of budget is supposed to be trimmed every year (but never is)
Chip development costs rising faster than increased revenues and decreased engineering costs can make up the difference
Implies 40% fewer IC designs (doing more applications) - every process generation!!

Number of IC Designs ≤ (Total ASIC/ASSP Revenues × Engineering Costs) / Per-chip Development Cost
(Revenues: +10–20%; Engineering costs: trimmed 5–20%; Per-chip development cost: +30–100%; Designs: ~40% fewer.)
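The slide's inequality can be checked with back-of-envelope arithmetic. A minimal sketch, assuming mid-range values for the slide's rates (revenues +15% per generation, engineering share trimmed ~12.5%, per-chip development cost +65%); the function name and the specific numbers are illustrative, not from the talk:

```c
/* Upper bound on sustainable IC designs per generation:
   designs <= (revenues * engineering share) / per-chip development cost */
double max_designs(double revenues, double eng_share, double per_chip_cost) {
    return revenues * eng_share / per_chip_cost;
}
```

Starting from a normalized max_designs(100.0, 1.0, 1.0) = 100 designs, one generation later max_designs(115.0, 0.875, 1.65) is roughly 61, i.e. about 40% fewer designs, matching the slide's claim.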

Page 4:

ASIPs: non-traditional programmable platforms

Level of concurrency must be comparable to ASICs
ASIPs will be on-chip, high-performance multi-processors

[Diagram (Intel IXP-style): XScale core, hash engine, scratch-pad SRAM, sixteen microengines, QDR SRAM and RDRAM channels, PCI, CSRs, RFIFO/TFIFO, SPI-4 / CSIX interfaces.]

Page 5:

Example embedded ASSP implementations

Intel IXP1200 Network Processor
Philips Nexperia (Viper): MIPS + VLIW cores

Page 6:

What about the general purpose world?

Clock frequency increase of computing engines is slowing down
  Power budget hinders higher clock frequency
  Device variation limits deeper pipelining
  Most future perf. improvement will come from concurrency and specialization

Size increase of single-thread computing engines is slowing down
  Power budget limits the number of transistors activated by each instruction
  Need finer-grained units for defect containment
  Wire delay is becoming a primary limiter in large, monolithic designs

The approach to covering all applications with a primarily single execution model is showing limitations

Page 7:

Impact of Transistor Variations

At 130nm, device variation produces a ~30% spread in frequency and a ~5X spread in leakage power.

[Chart: normalized frequency (0.9–1.4) vs. normalized leakage Isb (1–5).]

Source: Shekhar Borkar, Intel

Page 8:

Metal Interconnects

Interconnect RC Delay

[Charts: relative line capacitance (with low-K ILD) and relative line resistance vs. process node (500–32nm); RC delay of a 1mm copper interconnect vs. clock period (350–65nm), with RC delay overtaking the clock period; RC delay under 0.7x scaling.]

Source: Shekhar Borkar, Intel

Page 9:

Measured SPECint2000 Performance on real hardware with same fabrication technology

[Bar chart: SPECint2000 scores (0–1200) for GEOMEAN, 164.gzip, 175.vpr, 176.gcc, 181.mcf, 186.crafty, 197.parser, 252.eon, 253.perlbmk, 254.gap, 255.vortex, 256.bzip2, 300.twolf. Compilers: Intel ecc 7.1 20030909, IMPACT 20031010, gcc 3.2. Machines: Pentium4 2GHz (Willamette); McKinley 1GHz/3MB, 2P HP zx6000, 8GB RAM, linux 2.4.21-imp-smp.]

Date: October 2003

Page 10: DUSD(Labs) Breaking the Memory Wall for Scalable Microprocessor Platforms Wen-mei Hwu with John W. Sias, Erik M. Nystrom, Hong-seok Kim, Chien-wei Li,

PACT Keynote, October 1, 2004

General processor coresGeneral processor cores Very low power compute Very low power compute

and memory structuresand memory structures O/S provides lightweight O/S provides lightweight

access to custom featuresaccess to custom features

Acceleration logicAcceleration logic Application specific logicApplication specific logic High-bandwidth, distributed High-bandwidth, distributed

storage (RAM, registers)storage (RAM, registers) To developer, behave like To developer, behave like

software componentssoftware components

Memory systemMemory system Data delivery to processor Data delivery to processor O/S and virtual memory O/S and virtual memory

issuesissues Intelligent memory Intelligent memory

controllerscontrollers

Application processorsApplication processors Lightweight compute enginesLightweight compute engines High-bandwidth, distributed High-bandwidth, distributed

storage (RAM, registers)storage (RAM, registers) High-bandwidth, scalable High-bandwidth, scalable

interconnectinterconnect

Convergence of future computing platformsConvergence of future computing platforms

Page 11:

Breaking the memory wall with distributed memory and data movement

[Diagram: a GPP, a memory transfer module (MTM), and accelerators (ACC) with local memories, all attached to main memory.]

The general-purpose processor orchestrates activity. The memory transfer module schedules system-wide bulk data movement. Accelerated activities and their associated private data are localized for bandwidth and power efficiency. Accelerators can use scheduled, streaming communication, or can operate on locally-buffered data pushed to them in advance.

Page 12:

Parallelization with deep analysis: Deconstructing von Neumann [IWLS2004]

Memory dataflow that enables
  Extraction of independent memory access streams
  Conversion of implicit flows through memory into explicit communication
Applicability to the mass software base requires pointer analysis, control flow analysis, array dependence analysis

[Diagram: a CPU+DRAM version of a G.724 decoder fragment and its deconstruction onto PEs with distributed data. Call sequence: Weight_Ai(Az, F_ga3, Ap3); Weight_Ai(Az, F_g4, Ap4); Residu(Ap3, &syn_subfr[i], ...); Copy(Ap3, h, 11); Set_zero(&h[11], 11); Syn_filt(Ap4, h, h, 22, &h); the correlation kernel below (Corr0/Corr1); preemphasis(res2, temp2, 40); Syn_filt(Ap4, res2, &syn_p..., 40, mem_syn_pst, 1); agc(&syn[i_subfr], &syn..., 29491, 40). Data: res2, m_syn, F_g3, F_g4, Az_4, synth, syn, Ap3, Ap4, h, tmp, tmp1, tmp2.]

The correlation kernel:

  tmp = h[0] * h[0];
  for (i = 1; i < 22; i++)
      tmp = tmp + h[i] * h[i];
  tmp1 = tmp >> 8;
  tmp = h[0] * h[1];
  for (i = 1; i < 21; i++)
      tmp = tmp + h[i] * h[i+1];
  tmp2 = tmp >> 8;
  if (tmp2 <= 0) tmp2 = 0;
  else tmp2 = tmp2 * MU;
  tmp2 = tmp2 / tmp1;
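The correlation kernel above can be packaged as a runnable function. A sketch under assumptions: the slide does not define MU, so it is made a parameter here, and the function name is invented for illustration:

```c
/* Runnable transcription of the slide's correlation kernel
   (Corr0/Corr1 in the deconstructed version). */
long corr_ratio(const int h[22], long mu) {
    long tmp, tmp1, tmp2;
    int i;

    tmp = (long)h[0] * h[0];          /* Corr0: energy of h */
    for (i = 1; i < 22; i++)
        tmp += (long)h[i] * h[i];
    tmp1 = tmp >> 8;

    tmp = (long)h[0] * h[1];          /* Corr1: lag-1 autocorrelation */
    for (i = 1; i < 21; i++)
        tmp += (long)h[i] * h[i + 1];
    tmp2 = tmp >> 8;

    if (tmp2 <= 0) tmp2 = 0;
    else tmp2 = tmp2 * mu;
    return tmp2 / tmp1;               /* the slide's tmp2 = tmp2 / tmp1 */
}
```

The two reduction loops have no cross-iteration memory dependence, which is what lets the deconstructed version map Corr0 and Corr1 onto separate PEs.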

Page 13:

Memory bottleneck example (G.724 Decoder Post-filter, C code)

Problem:
  Production/consumption occur with different patterns across 3 kernels
  Anti-dependence in the preemphasis function (loop reversal not applicable)
  Consumer must wait until producer finishes

Goal: convert memory access to inter-cluster communication

[Diagram: Residu produces res[0:39] into memory; preemphasis reads and rewrites it in [39:0] order; Syn_filt then consumes [0:39]; routing everything through memory serializes the three kernels in time.]
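The anti-dependence the slide points to can be seen in a minimal in-place preemphasis loop. A sketch under assumptions: the real G.724 code uses saturating Q15 arithmetic and fixed constants, while here the coefficient is a plain Q15 parameter and the function name is invented:

```c
/* In-place first-order preemphasis: res[i] -= mu * res[i-1] (old value).
   The loop must run high-to-low so each step reads the not-yet-updated
   res[i-1]; reversing it to run low-to-high would read already-overwritten
   values. That anti-dependence is why loop reversal alone cannot match
   the producer's 0..39 order. */
void preemphasis_inplace(int *res, int n, int mu_q15) {
    for (int i = n - 1; i >= 1; i--)
        res[i] -= (int)(((long)mu_q15 * res[i - 1]) >> 15);
}
```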

Page 14:

Breaking the memory bottleneck

Remove anti-dependence by array renaming
Apply loop reversal to match producer/consumer I/O
Convert array access to inter-component communication

Enablers: interprocedural pointer analysis + array dependence test + array access pattern summary + interprocedural memory data flow

[Diagram: with res renamed to res2, Residu, preemphasis, and Syn_filt stream values directly to one another and overlap in time.]
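The first two transformations above can be sketched together (hypothetical function, same simplified Q15 arithmetic as an in-place version would use): renaming the output array removes the anti-dependence, after which the loop direction is free and can be reversed to match the producer:

```c
/* Renamed preemphasis: output goes to res2, and res is only read,
   so iterations are independent and the loop can run 0..n-1 in the
   producer's order, letting each element stream to the consumer as
   soon as it is produced. */
void preemphasis_renamed(const int *res, int *res2, int n, int mu_q15) {
    res2[0] = res[0];
    for (int i = 1; i < n; i++)
        res2[i] = res[i] - (int)(((long)mu_q15 * res[i - 1]) >> 15);
}
```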

Page 15:

A prototyping experience with the Xilinx ML300

Full system environment
  Linux running on PowerPC
  Lean system with custom Linux (Nacho Navarro, UIUC/UPC)
  Virtex 2 Pro FPGA logic treated as software components

Removing the memory bottleneck
  Random memory access converted to dataflow
  Memory objects assigned to distributed Block RAM

SW / HW communication
  PLB vs. OCM interface

[Diagram: FPGA containing a PPC core with PLB and OCM buses, a PLB-BRAM interface, BRAMs, and DDR memory.]

Page 16:

Initial results from our ML300 testbed

Case study: GSM vocoder
  Main filter in FPGA
  Rest in software running under Linux with customized support
  Straightforward software/accelerator communications pattern
  Fits in available resources on Xilinx ML300 V2P7
  Performance compared to all-software execution, with communication overhead

[Chart: cycle counts (0–16000) for the all-software, naïve hardware, and optimized hardware implementations; projected filter latency improves roughly 8x (naïve) and 32x (optimized).]

[Diagram: the filter datapath as a multiply-add tree fed by a shift register, scheduled over a few cycles.]

Page 17: DUSD(Labs) Breaking the Memory Wall for Scalable Microprocessor Platforms Wen-mei Hwu with John W. Sias, Erik M. Nystrom, Hong-seok Kim, Chien-wei Li,

PACT Keynote, October 1, 2004

Applicationsand

Systems Software

Applicationsand

Systems Software

Grand challengeGrand challenge

Moving the mass-market software base to heterogeneous Moving the mass-market software base to heterogeneous

computing architecturescomputing architecturesEmbedded computing platforms in the near termEmbedded computing platforms in the near term

General purpose computing platforms in the long runGeneral purpose computing platforms in the long run

PlatformsPlatforms

Programming models

Restructuringcompilers

Communications andstorage management

Accelerator architectures

OS support

Page 18:

Slicing through software layers

[Diagram: a "Total System Customizer" consumes the layered design paradigm (Application, Middleware, O/S, Device Drivers, Circuits) and emits an automated custom implementation: integrated application software, a composite low-layer, and custom circuits. An intermediate experimental step exposes each custom feature to the application through a user-level driver.]

Automation delivers productivity and efficiency.

Page 19:

Taking the first step: pointer analysis

To what can this variable point? (points-to)
Can these two variables point to the same thing? (alias)
Fundamental to unraveling communications through memory: programmers like modularity and pointers!

Pointer analysis is abstract execution
  Model all possible executions of the program
  Has to include important facets, or the result won't be useful
  Has to ignore irrelevant details, or the result won't be timely
  Unrealizable dataflow = artifacts of "corners cut" in the model

Typically, emphasis has been on timeliness, not resolution, because expensive algorithms cause unstable analysis time; for typical alias uses, that may be OK...

...but we have new applications that can benefit from higher accuracy
  Data flow unraveling for logic synthesis and heterogeneous systems

Page 20:

How to be fast, safe and accurate?

An efficient, accurate, and safe pointer analysis based on the following two key ideas:

Efficient analysis of a large program necessitates that only relevant details are forwarded to a higher-level component (increased abstraction, as in a CEO / VP / manager / worker hierarchy).

The algorithm can locally cut its losses (like a bulkhead) ... to avoid a global explosion in problem size.

Page 21:

One facet: context sensitivity

Context sensitivity avoids unrealizable data flow by distinguishing the proper calling context.

Example:

  int g;

  iris() {
      int a;
      jade(&g, 1);    /* call site jade1 */
      jade(&a, 3);    /* call site jade2 */
  }

  jade(int *p, int q) {
      int r;
      *p := q;
      r := g + 5;
      x := z;
  }

What assignments do a and g receive?
  CI: a and g each receive 1 and 3
  CS: g receives only 1 and a receives only 3

Desired results: g := 1, a := 3, r := 6

Typical reactions to CS costs
  Forget it, live with lots of unrealizable dataflow
  Combine it with a "cheapener" like the lossy compression of a Steensgaard analysis

We want to do better, but we may sometimes need to mix CS and CI to keep the analysis fast.
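For concreteness, here is an executable C rendering of the slide's pseudocode (return values are added so the effect is observable, and the irrelevant x := z statement is dropped; jade1/jade2 are just the two call sites of jade):

```c
int g;

/* The indirect store *p = q writes to a different variable depending
   on the call site; r always reads the global g. */
int jade(int *p, int q) {
    *p = q;
    return g + 5;           /* the slide's r := g + 5 */
}

int iris(void) {
    int a;
    int r1 = jade(&g, 1);   /* g := 1, so r1 = 6 */
    int r2 = jade(&a, 3);   /* a := 3; g untouched, so r2 = 6 */
    return r1 + r2 + a;     /* 6 + 6 + 3 */
}
```

Running iris() produces exactly the slide's desired results (g = 1, a = 3, r = 6 at both sites); a context-insensitive analysis would additionally report the never-occurring g := 3, a := 1, and r := 8.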

Page 22:

Context Insensitive (CI)

Collecting all the assignments in the program and solving them simultaneously yields a context-insensitive solution:

  p := &g;  p := &a;  q := 1;  q := 3;  *p := q;  r := g + 5;

Unfortunately, this leads to three spurious results: a := 1, g := 3, and r := 8 (alongside the correct g := 1, a := 3, r := 6).

Page 23:

Context Sensitive (CS): Naïve process

Each call site in iris gets its own clone of jade:

  jade1:  p1 := &g;  q1 := 1;  *p1 := q1;  r1 := g + 5;  x1 := z1
  jade2:  p2 := &a2; q2 := 3;  *p2 := q2;  r2 := g + 5;  x2 := z2

Retention of the original side-effect assignments still leads to spurious results (g := 3, a := 1, r := 8 in addition to the correct g := 1, a := 3, r := 6), and the excess statements are unnecessary and costly.

Page 24:

CS: "Accurate and Efficient" approach

A compact summary of jade (*p := q; r := 6) is used at each call site:

  jade1:  p1 := &g1; q1 := 1;  *p1 := q1
  jade2:  p2 := &a2; q2 := 3;  *p2 := q2

The summary accounts for all side-effects; the summarized assignment is DELETEd from the body to prevent contamination. Now only the correct result is derived: g := 1, a := 3.

Page 25:

Analyzing large, complex programs [SAS2004]

Analysis times in seconds (INACCURATE context-insensitive vs. previous and new context-sensitive):

  Benchmark   Context-Insensitive   Prev Context-Sensitive   New Context-Sensitive
  espresso    2                     9                         1
  li          1                     1332                      1
  ijpeg       2                     85                        1
  perl        4                     408                       11
  gcc         52                    HOURS                     124
  perlbmk     155                   MONTHS                    198
  gap         62                    3350                      117
  vortex      5                     136                       3
  twolf       1                     2                         1

Originally, the problem size exploded as more contexts were encountered; the new algorithm contains the problem size with each additional context. This results in an efficient analysis process without loss of accuracy.

[Charts: problem size vs. call graph depth (main() to leaves) for 008.espresso, 099.go, 130.li, 124.m88ksim, 134.perl, 175.vpr, 176.gcc, 254.gap, 255.vortex. Naïve exhaustive inlining grows to ~10^12; the new compaction algorithm stays near 10^4.]

Page 26:

Example application and current challenges [PASTE2004]

Improved efficiency increases the scope over which unique, heap-allocated objects can be discovered.

Example: improved analysis algorithms provide more accurate call graphs (a multitude of distinct objects) instead of a blurred view (a few, highly-connected objects) for use by program transformation tools.

[Chart: discovered objects vs. observed connectivity for 132.ijpeg, where many distinct, lightly-connected objects is BETTER and a few highly-connected objects is WORSE, shown at analysis scopes 0-3.]

Page 27:

From benchmarks to broad application code base

The long-term trend is for all code to go through a compiler and be managed by a runtime system
  Microsoft code base to go through Phoenix – OpenIMPACT participation
  Open source code base to go through GCC/OpenIMPACT under Gelato

The compiler and runtime will perform deep analysis to give tools visibility into software
  Parallelizers, debuggers, verifiers, models, validation, instrumentation, configuration, memory managers, runtime, etc.

[Stack diagram: Applications / Operating systems / Libraries / Compiler / Runtime and Tools / Hardware]

Page 28:

Global memory dataflow analysis

Integrates analyses to deconstruct the memory "black box"
  Interprocedural pointer analysis: allow programmers to use the language and modularity without losing transformability
  Array access pattern analysis: figure out communication among loops that communicate through arrays
  Control and data flow analyses: enhance resolution by understanding program structure
  Heap analysis: extends the analysis to a much wider software base

SSA-based inductor detection and dependence test have been integrated into the IMPACT environment.

Page 29:

Example on deriving memory data flow

  main(...) {
      int A[100];
      foo(A, 64);      /* foo writes A[0:63], stride 1 */
      bar(A+1, 64);    /* bar reads A[1:64], stride 1 */
  }

  foo(int *s, int L) {
      int *p = s, i;
      for (i = 0; i < L; i++) {
          *p = ...;                /* loop body: write *p */
          p++;
      }
  }   /* summary for the whole loop: write from *(s) to *(s+L) with stride 1 */

  bar(int *t, int M) {
      int *q = t, i;
      for (i = 0; i < M; i++) {
          ... = *q;                /* loop body: read *q */
          q++;
      }
  }   /* summary: read from *(t) to *(t+M) with stride 1 */

Pointer relation analysis restates p and q in terms of s and t; parameter mapping carries the per-procedure summaries across the call; data flow analysis then determines that A[64] is not from foo.
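A runnable version of the example (the elided loop bodies are filled with illustrative values so the analysis conclusion can be checked directly):

```c
/* foo writes A[0:63] stride 1 (illustrative values: the index i). */
void foo(int *s, int L) {
    int *p = s;
    for (int i = 0; i < L; i++) { *p = i; p++; }
}

/* bar reads A[1:64] stride 1 and sums what it sees. */
long bar(int *t, int M) {
    int *q = t;
    long sum = 0;
    for (int i = 0; i < M; i++) { sum += *q; q++; }
    return sum;
}
```

With int A[100] = {0}, after foo(A, 64) the element A[64] still holds its initial value: of the 64 elements bar(A+1, 64) reads, the last one did not come from foo, which is exactly what the memory data flow analysis on the slide derives.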

Page 30:

Conclusions and outlook

Heterogeneous multiprocessor systems will be the model for both general purpose and embedded computing platforms in the future
  Both are motivated by powerful trends
  Shorter term adoption for embedded systems
  Longer term for general purpose systems

Programming models and parallelization of traditional programs will channel software to these new platforms
  Feasibility of deep pointer analysis demonstrated
  Many need to participate in solving this grand challenge problem