Evaluating the Imagine Stream Processor
Jung Ho Ahn, William J. Dally, Brucek Khailany, Ujval J. Kapasi, and Abhishek Das
ISCA 2004
Motivation
• Provide efficiency of an ASIC
• Provide flexibility of a programmable processor
• Simplify special-purpose processor design
• Lower special-purpose processor design cost
• Provide better applicability
• Target media applications
Stream Architecture
Development Board
• PowerPC, 150 MHz
• 2 x Imagine, 200 MHz
• FPGA Bridge, 66 MHz
• 256 MB of SDRAM per Imagine, 100 MHz
Applications
Mapping
Execution on a Single Stream
[Figure: SRF, Input Stream, Kernel 1 (Iteration 1 … Iteration n), Output Stream]
Execution of Multiple Kernels
[Figure: SRF; Stream 1, Stream 2, Stream 3, Stream 4; Kernel 1, Kernel 2, Kernel 3 (processing)]
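One way to read the Execution of Multiple Kernels figure is as a chain in which each kernel consumes the stream produced by the previous one, with the intermediate streams staying in the SRF. The sketch below illustrates only that reading. The three kernel bodies and the `run_pipeline` helper are hypothetical and are not taken from the slides or from the Imagine tool chain.

```python
# Assumption: Kernel k reads Stream k and writes Stream k+1.
# The kernel bodies below are placeholders chosen for illustration.

def kernel1(stream):
    return [x + 1 for x in stream]

def kernel2(stream):
    return [x * 2 for x in stream]

def kernel3(stream):
    return [x - 3 for x in stream]

def run_pipeline(kernels, input_stream):
    """Apply kernels in order; return every stream, including the input."""
    streams = [input_stream]
    for kernel in kernels:
        # Each kernel consumes the most recently produced stream.
        streams.append(kernel(streams[-1]))
    return streams

streams = run_pipeline([kernel1, kernel2, kernel3], [1, 2, 3])
for i, s in enumerate(streams, start=1):
    print(f"Stream {i}: {s}")
```

In this model only Stream 1 and Stream 4 would need to cross the memory interface; Streams 2 and 3 are producer-consumer intermediates.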
Application Performance
GOPS: 18%
GFLOPS: 60%
Sources of Overhead
Stream Length Effects
Access Pattern Effects
Energy Efficiency
Energy consumption per FLOP (normalized to a 0.13 µm, 1.2 V process):
• Imagine @ 200 MHz: 277 pJ/FLOP
• TI C67x DSP @ 225 MHz: 889 pJ/FLOP (3.2x more)
• Intel Pentium M @ 1200 MHz: 3600 pJ/FLOP (13x more)
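The "x more" factors follow directly from the listed pJ/FLOP values. The short check below recomputes them. The dictionary keys and the `ratio_to_imagine` helper are names introduced here for illustration.

```python
# Energy per FLOP in picojoules, as listed on the slide
# (normalized to a 0.13 um, 1.2 V process).
ENERGY_PJ_PER_FLOP = {
    "Imagine": 277,
    "TI C67x DSP": 889,
    "Intel Pentium M": 3600,
}

def ratio_to_imagine(name):
    """Return how many times more energy per FLOP `name` uses than Imagine."""
    return ENERGY_PJ_PER_FLOP[name] / ENERGY_PJ_PER_FLOP["Imagine"]

for name, pj in ENERGY_PJ_PER_FLOP.items():
    print(f"{name}: {pj} pJ/FLOP, {ratio_to_imagine(name):.1f}x Imagine")
```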
Memory Bandwidth Requirement
Host Processor Bandwidth Requirement
Programming Model
Compiler Optimizations: Stream Ordering
Compiler Optimizations: SRF Overlapping and Packing
Compiler Optimizations: Strip-mining
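The slide gives only the title, so the following is a generic sketch of strip-mining rather than the Imagine compiler's implementation. The idea is that a stream too long to fit in on-chip storage is split into shorter strips, and the kernel runs once per strip. Here `strip_len` stands in for the capacity limit, and `scale_kernel` is a hypothetical kernel.

```python
def strip_mine(stream, strip_len):
    """Yield consecutive strips of at most strip_len elements."""
    for start in range(0, len(stream), strip_len):
        yield stream[start:start + strip_len]

def run_kernel_strip_mined(kernel, stream, strip_len):
    """Run kernel on each strip in turn and concatenate the outputs."""
    out = []
    for strip in strip_mine(stream, strip_len):
        out.extend(kernel(strip))
    return out

def scale_kernel(strip):
    # Hypothetical element-wise kernel used only for illustration.
    return [2 * x for x in strip]

data = list(range(10))
result = run_kernel_strip_mined(scale_kernel, data, strip_len=4)
print(result)
```

For an element-wise kernel the strip-mined result matches running the kernel over the whole stream at once; only the working-set size changes.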
Compiler Optimizations: Loop Unrolling and Software Pipelining
Conclusions
• Provides performance close to that of an ASIC, with flexibility via programming
• Can sustain between 16% and 60% of the peak arithmetic performance
• Exposed 2-level register file allows compiler to exploit locality
• Broader applicability
• Requires considerable programming effort
• Limited to media applications with regular control flow
Collab Questions
• How does the performance compare to other processors? (Dan, Marko, Jason, Prateeksha, Chris)
• What is the compiler efficiency? (Mario, Liang)
• How were the design decisions motivated? (Jing, Marisabel)
• How does the programming model compare to that of GPUs? (Greg)
Kernels