The Raw Architecture
Signal Processing on a Scalable Composable Computation Fabric
David Wentzlaff, Michael Taylor, Jason Kim, Jason Miller, Fae Ghodrat, Ben Greenwald, Paul Johnson, Walter Lee, Albert Ma, Nathan Shnidman, Henry Hoffmann, Arvind Saraf, Volker Strumpen, Matt Frank, Saman Amarasinghe, and Anant Agarwal
http://www.cag.lcs.mit.edu/raw
MIT Laboratory For Computer Science
Outline
Motivation
Architecture
Raw Prototype
Networks
Signal Processing Applications
Status
Wire Delay and Tiled Architectures
Problem: The number of gates we can reach in one cycle is staying constant, but our chips are getting bigger.
Solutions:
1. Hide wire delay latency in the micro-architecture (clustering/hidden communication stalls)
2. Expose the communication at the instruction set level and allow software to exploit locality
Fact 1: The number of transistors is growing
Fact 2: Wires are not getting proportionally faster
Wire Delay and Tiled Architectures
2. Expose the communication at the instruction set level and allow software to exploit locality
Make a tile as big as the distance a signal can travel in one clock cycle, and expose longer communication to the programmer
What Are We Building? The Raw Prototype
16 Replicated Tiles (Processors)
What is in a tile?
8-stage pipelined MIPS-like 32-bit processor
Pipelined Floating Point Unit
32KB Data Cache
32KB Instruction Memory
Interconnect Routers
Raw’s Networking Resources
2 Dynamic Networks
  Fire and Forget
  Header encodes destination
  2-stage router pipeline
2 Static Networks
  Software configurable crossbar
  Interlocked and Flow Controlled
  5-stage static router pipeline
  3-cycle nearest-neighbor ALU-to-ALU communication latency
  No header overhead, but requires knowledge of communication patterns at compile time
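As a rough illustration of the trade-off between the two network types, the sketch below models a dynamic message that carries a destination header and a static message that carries none. It assumes the 16 tiles form a 4x4 mesh, that the dynamic network routes along X first and then Y, and that the header is one word long; the slide states none of these details, and the names `xy_route` and `words_on_wire` are hypothetical.

```python
MESH_WIDTH = 4  # assumption: the 16 tiles are arranged as a 4x4 mesh


def xy_route(src, dst, width=MESH_WIDTH):
    """Return the tile ids visited from src to dst, moving along X, then Y.

    Dimension-ordered routing is an assumption made for illustration only.
    """
    x, y = src % width, src // width
    dst_x, dst_y = dst % width, dst // width
    path = [src]
    while x != dst_x:
        x += 1 if dst_x > x else -1
        path.append(y * width + x)
    while y != dst_y:
        y += 1 if dst_y > y else -1
        path.append(y * width + x)
    return path


def words_on_wire(payload_words, dynamic, header_words=1):
    """Words sent for one message; only dynamic messages carry a header."""
    return payload_words + (header_words if dynamic else 0)


if __name__ == "__main__":
    print(xy_route(0, 15))
    print(words_on_wire(4, dynamic=True), words_on_wire(4, dynamic=False))
```

The static network avoids the header word because the route is fixed by the compiler rather than decoded from the message, which is why it needs the communication pattern at compile time.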
Memory Mapped Communication is Not a First Class Citizen
[Figure: tile processor pipeline; stage labels as extracted: IF RFDA TL M1 M2 F P E U TV F4 WB]
To other tiles, through a memory system that happens to go over a network.
Raw’s First Class Register-Mapped Communication
[Figure: the same tile processor pipeline, with registers r24, r25, r26, and r27 mapped to the Network Input FIFOs and the Network Output FIFOs]
Ex: add r26, r25, r24
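The example instruction can be mimicked with a small model in which reading a network-mapped register pops a word from an input FIFO and writing one pushes a word onto an output FIFO. This is a minimal sketch, not the Raw hardware: the `Tile` class, its method names, and the choice to raise an error on an empty FIFO (where a real pipeline would presumably stall) are all assumptions.

```python
from collections import deque

NETWORK_REGS = (24, 25, 26, 27)  # register numbers shown in the figure


class Tile:
    """Toy model of a tile whose r24-r27 are mapped to network FIFOs."""

    def __init__(self):
        self.regs = [0] * 32
        self.net_in = {r: deque() for r in NETWORK_REGS}
        self.net_out = {r: deque() for r in NETWORK_REGS}

    def read(self, reg):
        # Reading a network register consumes the next incoming word.
        if reg in self.net_in:
            if not self.net_in[reg]:
                raise RuntimeError("input FIFO empty (hardware would stall)")
            return self.net_in[reg].popleft()
        return self.regs[reg]

    def write(self, reg, value):
        # Writing a network register sends the word to a neighbor.
        if reg in self.net_out:
            self.net_out[reg].append(value)
        else:
            self.regs[reg] = value

    def add(self, rd, rs, rt):
        """add rd, rs, rt with 32-bit wraparound."""
        self.write(rd, (self.read(rs) + self.read(rt)) & 0xFFFFFFFF)


if __name__ == "__main__":
    tile = Tile()
    tile.net_in[25].append(3)
    tile.net_in[24].append(4)
    tile.add(26, 25, 24)  # Ex: add r26, r25, r24
    print(list(tile.net_out[26]))
```

In this model a single `add` receives two operands from the network and sends the sum back out, with no separate load or store instructions.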
Signal Processing Applications
Problem: Increase performance of Signal Processing in a scalable fashion
Solution: Exploit parallelism in Signal Processing Applications at all levels
Types of Parallelism in Signal Processing
DSP Filter Style
Fine Grain Dataflow
Instruction Level Parallelism
Data Parallel
Thread Level Parallelism (MPI)
[Figure labels: Current Architectures; Raw]
Instruction Level Parallelism
RawCC
Maps dataflow graphs across tiles
ILP across Multiprocessor
Heavily latency sensitive
Single-cycle reconfigurable communication
Fine Grain Dataflow
Ex: Pipelined FIR Filter
[Figure: four-tap pipelined FIR filter; inputs x_n, x_{n-1}, x_{n-2}, x_{n-3}; weights W0, W1, W2, W3]
Computation: mul, add
Input Operands: x_i, l
Output Operands: k
Cycle count by interface class:
Class        First  Second
Compute      2      2
Communicate  0      3
Overall      2      5
Fine Grain Dataflow
First Class Interface:
  mul $r3, W_x, NET_IN_1
  add NET_OUT1, NET_IN_2, $r3
Second Class Interface:
  ld $r4, NET_IN_1_ADDR
  ld $r5, NET_IN_2_ADDR
  mul $r3, W_x, $r4
  add $r6, $r5, $r3
  st NET_OUT_1_ADDR, $r6
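The cycle counts for the two interfaces can be reproduced by counting instructions, and the per-tile work can be written as a single multiply-accumulate step. The sketch below assumes one cycle per instruction and treats `ld` and `st` as communication; the function names are hypothetical, and the reading that each tile adds its weighted sample to a partial sum arriving from its neighbor is an interpretation of the two listings, not something the slide states.

```python
FIRST_CLASS = [
    "mul $r3, W_x, NET_IN_1",
    "add NET_OUT1, NET_IN_2, $r3",
]
SECOND_CLASS = [
    "ld $r4, NET_IN_1_ADDR",
    "ld $r5, NET_IN_2_ADDR",
    "mul $r3, W_x, $r4",
    "add $r6, $r5, $r3",
    "st NET_OUT_1_ADDR, $r6",
]


def count_cycles(program):
    """Classify each instruction, assuming one cycle per instruction."""
    counts = {"compute": 0, "communicate": 0}
    for instruction in program:
        opcode = instruction.split()[0]
        kind = "communicate" if opcode in ("ld", "st") else "compute"
        counts[kind] += 1
    counts["overall"] = counts["compute"] + counts["communicate"]
    return counts


def fir_tile(weight, sample, partial_sum):
    """Work done by one tile: a multiply followed by an add."""
    product = weight * sample          # mul $r3, W_x, NET_IN_1
    return partial_sum + product       # add NET_OUT1, NET_IN_2, $r3


def fir_output(weights, window):
    """Chain the tiles: each passes its partial sum to the next."""
    partial_sum = 0
    for weight, sample in zip(weights, window):
        partial_sum = fir_tile(weight, sample, partial_sum)
    return partial_sum


if __name__ == "__main__":
    print(count_cycles(FIRST_CLASS))
    print(count_cycles(SECOND_CLASS))
    print(fir_output([1, 2, 3, 4], [1, 1, 1, 1]))
```

Under these assumptions the memory-mapped version spends three of its five cycles moving data, while the register-mapped version spends none.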
DSP Filter Style
[Figure: DSP filter pipeline mapped across tiles, with off-chip input and output; block labels: Down-Sample, FFT, Frequency Domain Filter, FFT-1]
Raw is Composable
Mix and match types of parallelism
[Figure: tiles running different workloads side by side; labels: 4-way Threaded Java Application, 2-way RawCC Application, httpd, White balance (two tiles), Aliasing filter, mem, Zzz]
Raw Status
Stats
IBM SA-27E 0.15 µm 6 Layer Copper
18.2 mm x 18.2 mm die
0.122 Billion Transistors
2048KB SRAM On-chip
1657 Pin CCGA Package
1080 HSTL Signal I/O Operating at Core Speed
225 MHz
~25 Watts
The Raw Performance
16 OPS/FLOPS per cycle (@ 225 MHz = 3.6 GFLOPS)
230 Gb/s of on-chip “bisection bandwidth”
201 Gb/s of off-chip I/O bandwidth
115 Gb/s of on-chip memory bandwidth
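The peak figures above follow from simple arithmetic on the tile count and clock rate. The sketch below reproduces them under stated assumptions: every channel is 32 bits wide and moves one word per cycle, each tile makes one cache access per cycle, the bisection of a 4x4 mesh cuts 4 rows of links on each of the 4 networks in both directions, and there are 14 full-duplex off-chip ports. The slides give only the results, so these channel counts are assumptions chosen because they match the quoted numbers.

```python
TILES = 16
CLOCK_MHZ = 225
WORD_BITS = 32  # assumption: every channel is one 32-bit word wide


def gbps(channels):
    """Bandwidth in Gb/s for channels that each move one word per cycle."""
    return channels * WORD_BITS * CLOCK_MHZ / 1000


# One operation per tile per cycle (stated on the slide).
peak_gflops = TILES * CLOCK_MHZ / 1000

# Assumption: one 32-bit data cache access per tile per cycle.
memory_gbps = gbps(TILES)

# Assumption: 4 rows x 4 networks x 2 directions cross the mesh bisection.
bisection_gbps = gbps(4 * 4 * 2)

# Assumption: 14 off-chip ports, each carrying traffic in 2 directions.
io_gbps = gbps(14 * 2)

if __name__ == "__main__":
    print(peak_gflops, bisection_gbps, io_gbps, memory_gbps)
```

Truncating each result to a whole number gives the 230, 201, and 115 Gb/s quoted on the slide.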
Raw Status
Working:
Cycle Accurate Software Simulator
RTL Simulation
Emulation System
RawCC ILP Compiler
Current:
Verification
Backend Completion
Tapeout December 2001
Chips Back Summer 2002
Summary
Raw’s First Class communication facilitates exploitation of new forms of parallelism in Signal Processing applications
Extra Slides