May 23-24, 2000 Scottsdale, AZ Kickoff_may_2000.ppt
Morphable Computer Architectures for Highly Energy Aware Systems:
PACC Program Review: Nov. 1-3; Annapolis, MD
Peter M. Kogge: CSE Dept., University of Notre Dame; [email protected]
Kanad Ghose: CS Dept., SUNY-Binghamton; [email protected]
Nikzad “Benny” Toomarian: Center for Integrated Space Microsystems (CISM), Jet Propulsion Lab; [email protected]
Nov. 1-3, 2000 Annapolis, MD Oct_2000_review.ppt
Outline
Quad Chart
“Gear-Shifting” Simplified
The Morph Program
The Morph Architecture
Test Bed & Benchmarks
MORPH: Dynamic Low Energy Architectures
[Schedule chart: tasks Profiles, Baseline, Morphable Node, Data Placement, Adaptive Algorithms, Run-time, Demo & Eval; timeline marks 5/00, 11/00, 5/01, 11/01, 5/02]
New Ideas
• Morphable microarchitecture to allow dynamic changes in energy expended per cycle
• Energy efficient morphable memory hierarchies
• Energy efficient ISA extensions to process data more energy efficiently
• Adaptive algorithms to select best configuration
• Energy aware run-time which can reconfigure system
MORPH Adds An “Energy Gear” to Dynamically Configurable Embedded Systems
IMPACT
• Focus on energy, not just power, management
• Develops suite of widely applicable energy-reducing architectural techniques
• Adds extra technology-independent degrees of freedom to dynamic energy control
• Provides an overall inherently more energy efficient embedded computing system
• Designed for transfer to real missions
What is “Gear-Shifting” all about?
Definitions:
• IPC = Instructions per Cycle
• EPC = Energy per Cycle
• C = Cycles per Second
• Performance = “Instructions/second” = IPC x C
• Power = “Energy/second” = EPC x C
• M = performance required during some mode (instructions/second)
Real world: performance needs change very dramatically
Observations on Conventional Designs:
• Conventional designs fix IPC at some IPCmax to meet peak need
• In such designs $EPC = K \cdot IPC^{a}$, where “a” can range to almost 4
• Assume arbitrary clock selection (up to a maximum clock Cmax)
• Ignore Vdd changes for now
Power @ M: $Power(M) = K \cdot IPC_{max}^{a} \cdot (M / IPC_{max}) = K \cdot M \cdot IPC_{max}^{a-1}$
Dependent on clock only through M
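As a rough numerical companion to the conventional-design expression on this slide, the Python sketch below evaluates Power = EPC x C for a design whose IPC is fixed at IPCmax. The function name and the default values of K and a are illustrative assumptions, not figures from the Morph program.

```python
def conventional_power(m, ipc_max, k=1.0, a=3.0):
    """Power of a fixed-IPC design delivering m instructions/second.

    EPC = K * IPCmax**a is fixed by the design, and the clock needed to
    deliver m is C = m / IPCmax, so Power = EPC * C = K * m * IPCmax**(a - 1).
    K and a are placeholder values chosen only for illustration.
    """
    clock = m / ipc_max          # cycles/second needed to deliver m
    epc = k * ipc_max ** a       # energy per cycle at the fixed IPC
    return epc * clock
```

For example, `conventional_power(8.0, 4.0)` multiplies an EPC of 64 by a clock of 2, in arbitrary units. Halving m halves the power, but the IPCmax^(a-1) factor is paid at every performance level.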
Some Simplified Gear Equations
Assume IPC smoothly changeable from IPCmin to IPCmax
Let R = (IPCmax/IPCmin) = “dynamic ratio” of performance range
Let g be a gear setting, ranging from 0 to 1 to change IPC
$IPC(g) = IPC_{min} + (IPC_{max} - IPC_{min})\,g = IPC_{max}\,[1/R + (1 - 1/R)\,g]$
$EPC(g) = K \cdot \{IPC_{max}\,[1/R + (1 - 1/R)\,g]\}^{a}$
$Power(g, C) = K \cdot \{IPC_{max}\,[1/R + (1 - 1/R)\,g]\}^{a} \cdot C$
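The three gear equations can be written directly as functions. This is a minimal sketch: the function names and the default values of K and a are assumptions made for illustration, and the slide's premise that IPC varies smoothly with g is taken at face value.

```python
def ipc(g, ipc_max, r):
    """IPC(g) = IPCmax * [1/R + (1 - 1/R) * g] for a gear setting g in [0, 1]."""
    if not 0.0 <= g <= 1.0:
        raise ValueError("gear setting g must lie in [0, 1]")
    return ipc_max * (1.0 / r + (1.0 - 1.0 / r) * g)


def epc(g, ipc_max, r, k=1.0, a=3.0):
    """EPC(g) = K * IPC(g)**a (K and a are illustrative placeholders)."""
    return k * ipc(g, ipc_max, r) ** a


def power(g, c, ipc_max, r, k=1.0, a=3.0):
    """Power(g, C) = EPC(g) * C."""
    return epc(g, ipc_max, r, k, a) * c
```

With IPCmax = 4 and R = 4, `ipc(0, 4, 4)` gives IPCmin = 1 and `ipc(1, 4, 4)` gives IPCmax = 4, matching the two ends of the gear range.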
GEARS Large R: OUR CHALLENGE
A Gear-Shifting Strategy
To minimize power as we vary performance requirement M:
Use most efficient IPCmin as long as possible (until the clock reaches its maximum): g = 0
Then smoothly vary g while using Cmax
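[Plot: gear setting g (0 to 1) vs. performance requirement; x-axis marks 0, Imin x Cmax, Imax x Cmax]
[Plot: clock C (0 to Cmax) vs. performance requirement; x-axis marks 0, Imin x Cmax, Imax x Cmax]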
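The two-phase strategy on this slide can be sketched as a small selection routine. The function name and argument names are hypothetical, and the sketch assumes IPCmax > IPCmin and that both g and the clock can be set to any value in their ranges.

```python
def choose_gear_and_clock(m, ipc_min, ipc_max, c_max):
    """Return (g, clock) meeting performance requirement m (instructions/second).

    Up to IPCmin * Cmax: stay in the lowest gear (g = 0) and scale the clock.
    Beyond that: run at Cmax and raise g just enough to deliver m.
    """
    if m < 0 or m > ipc_max * c_max:
        raise ValueError("requirement is outside the achievable range")
    if m <= ipc_min * c_max:
        return 0.0, m / ipc_min
    needed_ipc = m / c_max
    g = (needed_ipc - ipc_min) / (ipc_max - ipc_min)
    return g, c_max
```

With IPCmin = 1, IPCmax = 4 and Cmax = 100, a requirement of 50 stays in gear 0 at half clock, while a requirement of 250 runs at Cmax with g = 0.5.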
The Result
[Plot: power savings factor, i.e. ratio of power under optimal gear change to conventional fixed-IPC power, vs. performance requirement M; x-axis marks 0, Imin x Cmax, Imax x Cmax; y-axis marks 0, $(1/R)^{a-1}$, 1]
• Potentially huge for large R
• Huge savings if applications spend most time here (region marked on the plot)
• And we can still use all the other tricks to lower peak power!
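The ratio plotted on this slide can be derived from the conventional-design power expression and the gear-shifting strategy given earlier in the deck. The derivation below is a sketch that uses only symbols already defined there; the labels $P_{conv}$ and $P_{gear}$ are introduced here for convenience.

```latex
\begin{align*}
% Conventional design: IPC fixed at IPC_max, clock C = M / IPC_max
P_{conv}(M) &= K \, IPC_{max}^{a} \cdot \frac{M}{IPC_{max}} = K \, M \, IPC_{max}^{a-1} \\[4pt]
% Region 1: M <= IPC_min * C_max, so g = 0 and C = M / IPC_min
P_{gear}(M) &= K \, IPC_{min}^{a} \cdot \frac{M}{IPC_{min}} = K \, M \, IPC_{min}^{a-1},
&\frac{P_{gear}}{P_{conv}} &= \left(\frac{IPC_{min}}{IPC_{max}}\right)^{a-1} = \left(\frac{1}{R}\right)^{a-1} \\[4pt]
% Region 2: IPC_min * C_max < M <= IPC_max * C_max, so C = C_max and IPC = M / C_max
P_{gear}(M) &= K \left(\frac{M}{C_{max}}\right)^{a} C_{max},
&\frac{P_{gear}}{P_{conv}} &= \left(\frac{M}{IPC_{max} \, C_{max}}\right)^{a-1}
\end{align*}
```

The ratio is flat at $(1/R)^{a-1}$ up to $M = IPC_{min} C_{max}$ and then rises to 1 at $M = IPC_{max} C_{max}$. As a purely illustrative case, R = 4 and a = 3 would give a ratio of 1/16 in the flat region.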
The Morph Program
• Develop a microarchitecture with a large dynamic R
  - “Multi-cluster” superscalar CPU
  - Intelligent placement of data within mixed memory type hierarchy
  - Inherently low energy caches
  - Low energy ISA extensions
• Define & use a realistic embedded benchmark suite
  - Drawn from deep-space processing needs, initially rovers
  - Include other DARPA benchmarks such as from DIS
  - Baseline on variety of systems
• Develop real-time algorithms for reconfiguration
• Demonstrate potential gains via simulation
  - Simplescalar + energy models
• Technology transfer to potential future JPL missions
The Team
SUNY-BINGHAMTON (Kanad Ghose)
• Morphable Caches, RFs
• Dynamic Bit Slicing
• Energy Eff VLIW archs
• Supporting compiler techniques
UNIVERSITY OF NOTRE DAME (Peter Kogge, Vincent Freeh, Jay Brockman)
• Morphable multi-cluster architecture
• “At the sense amps” ISA extension
• Runtime with hooks for dynamic morphing control
JET PROPULSION LABORATORY (Nikzad Toomarian, Mohammed Mojarradi, Savio Chau)
• Scenarios & benchmarks
• Baseline characterizations
• Runtime adaptation algorithms
Energy Aware Data Placement
Overall Goals:
• Architectures with variable IPC, EPC
• Tools & S/W to manage morphing
• Realistic demonstrations
Starting A Solution: Multi Cluster Architecture
[Figure: three pipeline organizations]
(a) Simple Pipeline: Fetch, Decode, Register File, Data Cache
(b) Classical Superscalar: Fetch, Decode, Rename, Issue Window, Register File, Bypass, Data Cache, memory disambiguation
(c) New Multi Cluster: Fetch, Decode, Rename and steering; each cluster has its own Issue Window, Register File, Bypass, Data Cache, RAW, RAB, memory disambiguation
Problem: single large centralized register files with many ports
Solution: multiple smaller register files with few ports
Issue Width (IW): $EPC/IPC \sim (IW)^{k}$, k as high as 1.9
w Clusters, each of issue width IW/w: $w\,(IW/w)^{k} \ll (IW)^{k}$
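To get a feel for the size of the inequality on this slide, the sketch below compares w x (IW/w)^k against (IW)^k. It models only the scaling law quoted on the slide: any energy cost of steering or inter-cluster communication is ignored, which is a simplifying assumption, and the function name is hypothetical.

```python
def clustered_to_monolithic_ratio(issue_width, clusters, k=1.9):
    """Ratio of w * (IW / w)**k to IW**k, which simplifies to w**(1 - k).

    Assumes the issue width divides evenly among the clusters and ignores
    inter-cluster overheads (an assumption made for this sketch).
    """
    if clusters < 1 or issue_width % clusters != 0:
        raise ValueError("issue width must divide evenly among the clusters")
    clustered = clusters * (issue_width / clusters) ** k
    monolithic = issue_width ** k
    return clustered / monolithic
```

With IW = 8, w = 4 and k = 1.9, the ratio is 4^(-0.9), or roughly 0.29, under these assumptions.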
Target Morph Configuration
[Figure: multi-cluster pipeline (Fetch, Decode, Rename and steering; per cluster: Issue Window, Register File, Bypass, Data Cache, RAW, RAB, memory disambiguation) with memory types EEPROM, FLASH, DRAM, SRAM, annotated with the features below]
• Dynamic issue width
• Dynamic ALU width
• Low energy caches
• Energy-aware data placement
• Dynamic data path width
• Alternative ISA features
• Selective substrate bias
• Embedded + external memory
• Variable multi-cluster microarchitecture
Evaluation Methodology
[Figure: scatter of design points (+) for the PACC Benchmarks; x-axis IPC: Instructions per Cycle; y-axis EPC: Energy per Cycle; labeled regions: “Energy Efficient Family” and “Today’s Performance Only Design Point”]
Multi-Cluster vs Conventional Results
[Plot: multi-cluster vs. conventional design points; point labels 1x4, 1x6, 1x8, 2x4, 2x6, 4x2, 4x4, Conventional]
• Up to 1/2 the energy at same IPC, or 20% better IPC at same energy
• Morph: dynamically change the cluster size & ride the EPC/IPC savings
On-chip Caches: Addressing Dynamic & Static Leakage
• On-chip caches dissipate 25% to 45% of total energy
  - Likely to increase because of leakage
• Added line buffers (4 to 16) reduce dynamic energy dissipation by 40% to 65+%, with no penalty in access time and with a 4% to 6% area penalty
• Dynamic activation of recently-accessed L2 cache areas reduces the dynamic dissipation component by 40% to 80%
  - Only selected areas of L2 in active mode, rest in standby
  - Size of bit-cell groups controlled is critical
  - Additional L2 area penalty of approx. 8%
  - Heuristics for controlling transitions between active & standby modes
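The idea behind line buffers is that an access whose cache line is already held in a small buffer need not touch the main cache array. The toy model below counts how many accesses in an address trace are filtered this way. The LRU replacement policy, the 32-byte line size, and the function name are assumptions made for this sketch; it does not model energy and does not reproduce the percentages quoted on the slide.

```python
from collections import OrderedDict


def count_array_accesses(addresses, line_size=32, num_buffers=4):
    """Return (buffer_hits, array_accesses) for a trace of byte addresses.

    A small LRU set of line buffers sits in front of the cache array; an
    access whose line is already buffered does not touch the array.
    """
    buffers = OrderedDict()              # line number -> None, oldest first
    hits = 0
    array_accesses = 0
    for addr in addresses:
        line = addr // line_size
        if line in buffers:
            hits += 1
            buffers.move_to_end(line)    # mark as most recently used
        else:
            array_accesses += 1
            buffers[line] = None
            if len(buffers) > num_buffers:
                buffers.popitem(last=False)   # evict least recently used
    return hits, array_accesses
```

A sequential sweep such as `count_array_accesses(range(0, 128, 4))` touches four distinct lines, so only the first access to each line reaches the array.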
Addressing Dynamic & Static Dissipations in Caches
Exploiting Bit-Slice Inactivity in Datapaths
Expectation: Higher-order data bits likely to be insignificant at least some of the time
Opportunity: exploit byte slice inactivity over transfer paths, within storage devices (register files, caches) & function units
[Charts: “FOR SPECfp95 DP” and “FOR INTEGERS FROM SPECfp95”]
[Figure: a circuit to provide read-enables in RFs to avoid energy dissipation on access]
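One way to make byte-slice inactivity concrete is to count how many low-order bytes of a signed value carry information; the remaining high-order bytes only repeat the sign extension. The sketch below does this for two's-complement integers. The function name and the 4-byte default width are assumptions for illustration, and the sketch is not a description of the circuit shown on the slide.

```python
def significant_bytes(value, width_bytes=4):
    """Number of low-order bytes needed to represent a signed integer.

    Higher-order bytes that merely repeat the sign extension carry no
    information, so a datapath could leave those byte slices inactive.
    """
    full_bits = 8 * width_bytes
    if not -(1 << (full_bits - 1)) <= value <= (1 << (full_bits - 1)) - 1:
        raise ValueError("value does not fit in the given width")
    for n in range(1, width_bytes + 1):
        bits = 8 * n
        if -(1 << (bits - 1)) <= value <= (1 << (bits - 1)) - 1:
            return n
    return width_bytes
```

Small values such as loop counters need one byte here, so three of the four byte slices of a 32-bit datapath could in principle stay idle for them.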
Deep Space: The Ultimate Power-Constrained Embedded System
• Limited energy/power sources
  - Renewable variable power: Solar cells
  - Constant power: RPGs
  - Fixed energy: batteries
• Multiple operational modes, all compute/energy constrained
  - Cruise
  - Communication: compression vs transmission
  - Data gathering vs analysis
  - Movement: collision avoidance
• Today: “Pre-canned” power management by serialized operations
Morph Initial Focus: Rovers
Pathfinder Sojourner
Function                                              Time and Calculation   Energy Required
motor heating: 1 motor at a time                      7.51W x 1hr            7.51W-hr
motor heating: 2 motors at a time                     11.26W x 0.5hr         5.63W-hr
driving (extreme terrain @ -80degC)                   13.85W x 0.5hr         6.92W-hr
hazard detection                                      7.33W x 0.25hr         1.83W-hr
imaging (3 images @ 2 min/image)                      4.5W x 0.1hr           0.45W-hr
image compression (compress 3 images @ 6 min/image)   3.7W x 0.3hr           1.2W-hr
6Mbit communication @ 50min/sol                       6.27W x 0.8hr          5.2W-hr
42, 10 sec health checks during day                   6.27W x 0.1hr          0.63W-hr
remainder of 7 hr daytime CPU operation               3.7W x 4hr             15.0W-hr
WEB heating (as needed)                               50W-hr                 50W-hr
Total (as listed)                                                            95W-hr
vs peak 15 W-hr Solar Cells + 150 W-hr non-rechargeable battery
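The per-function energies on this slide are power multiplied by duration. The sketch below recomputes them from the power and time figures transcribed from the slide; the variable and function names are hypothetical. Because the slide's durations are rounded, the recomputed products come out slightly below some listed entries and below the 95W-hr figure.

```python
# (function, power in W, duration in hr), transcribed from the slide
BUDGET = [
    ("motor heating: 1 motor at a time", 7.51, 1.0),
    ("motor heating: 2 motors at a time", 11.26, 0.5),
    ("driving (extreme terrain @ -80degC)", 13.85, 0.5),
    ("hazard detection", 7.33, 0.25),
    ("imaging (3 images @ 2 min/image)", 4.5, 0.1),
    ("image compression (3 images @ 6 min/image)", 3.7, 0.3),
    ("6Mbit communication @ 50min/sol", 6.27, 0.8),
    ("42, 10 sec health checks during day", 6.27, 0.1),
    ("remainder of 7 hr daytime CPU operation", 3.7, 4.0),
]
WEB_HEATING_WH = 50.0   # listed on the slide directly as an energy


def energy_wh(power_w, hours):
    """Energy in watt-hours for a constant power draw over a duration."""
    return power_w * hours


def total_energy_wh():
    """Sum of all recomputed entries plus the WEB heating allowance."""
    return sum(energy_wh(p, h) for _, p, h in BUDGET) + WEB_HEATING_WH
```

Calling `total_energy_wh()` gives the daily requirement implied by the rounded figures, which can then be set against the solar and battery capacities quoted on the slide.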
Effects on application code:
• Many actions sequential, not simultaneous
• No dynamic scheduling, no autonomy
• Not even CPU-clock management
• Nowhere near enough CPU performance
• Designed to limit worst case power
• Dump excess power into heaters
Athena/Mars ’03 Rovers: Rover Configuration
[Figure labels: Pancam/Mini-TES; Mini-Corer; Instrument Arm Cluster: Raman Spectrometer, Alpha-Proton-X-Ray Spectrometer (APXS), Mössbauer Spectrometer, Microscopic Imager]
• 3 Hrs/day of solar @ 50 W
• 5 amp hr 16V batteries
• More complex communication
• More complex on-board eqpt
• Still statically scheduled
MUSES-CN Asteroid NanoRover
Solar powered @ 1 watt, including RF telecommunications system for communications to lander or small-body orbiter for relay to Earth.
Clock-adjustable CPU speed
To run a command:
• Determine available solar power.
• Minimum required power = device + CPU power
• If available power < minimum required:
  - if parameter enables re-orienting, re-orient to maximize solar power
  - if still not enough and parameter enables waiting, wait up to parameter limit for solar power
  - if still not enough, abort command
• Set CPU speed to maximum allowable based on (power available) - (minimum needed for devices)
• Perform command: during command execution, if power drops significantly (or load shed indication?...):
  - CPU speed is reduced to minimum required
  - Operate motors one-at-a-time
• Return CPU speed to parameter-specified idle
Still “sequential” operation
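The admission part of the command procedure on this slide can be sketched as follows. All names are hypothetical: `available_power`, `reorient` and `wait_for_sun` stand in for rover services that the slide does not specify, and the in-command monitoring step (reducing CPU speed, operating motors one at a time) is left out.

```python
from dataclasses import dataclass


@dataclass
class Params:
    """Per-command parameters (hypothetical names)."""
    allow_reorient: bool = True
    allow_wait: bool = True
    max_wait_steps: int = 3


def plan_command(available_power, device_power, min_cpu_power, params,
                 reorient=None, wait_for_sun=None):
    """Return the power budget left for the CPU, or None if the command aborts.

    available_power: callable returning the current solar power in watts.
    reorient / wait_for_sun: optional callables standing in for rover actions.
    """
    minimum = device_power + min_cpu_power
    if available_power() < minimum and params.allow_reorient and reorient:
        reorient()                       # try to maximize solar power
    if available_power() < minimum and params.allow_wait and wait_for_sun:
        for _ in range(params.max_wait_steps):
            wait_for_sun()
            if available_power() >= minimum:
                break
    power = available_power()
    if power < minimum:
        return None                      # still not enough: abort command
    # CPU speed would be set to the maximum this budget allows.
    return power - device_power
```

The returned budget corresponds to the slide's (power available) - (minimum needed for devices). Mapping that budget to an actual clock setting depends on the CPU's power model, which is not given here.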
Some Morph Test Beds
PACC-Blue
• 400MHz PPC 7400
• Enhanced superscalar + Altivec
• Linux
PACC-Gold
• 400MHz PPC 750
• Linux
JPL PPC-SBC
• 200 MHz 750
• VxWorks
[Diagram labels: Oscilloscope, Logic Analyzer, PowerPC 750, NT Box, Ethernet]
• Different PowerPC configurations
• Microarchitecture
• Clock rates
• ISA extensions
• Run rover/PACC application code
• Measure time/power
• Use as input to Simplescalar simulation
The NASA X2000 Avionics System
[Diagram labels: high-rate input (camera); high-speed bus (e.g. IEEE 1394); communication module (CDMA); bus power controller; symmetric multiprocessor modules; altimeter subnet; reconfigurable hardware blocks; low-speed bus (e.g. I2C)]
microcontroller-directed subnet
- power regulations & control
- analog telemetry sensors
- safety inhibits
- valve & pyro drive
• Design for 10-20X reduction in power, at 10-20X performance increase
• With long-term survivability & technology scaling
• Application-specific adaptive configuration to match run-time power supply constraints
X2000 FD Testbed with Power Awareness
[Diagram: five cPCI bus (6U chassis) units, each holding two PPC 750 (Synergy) boards, two 1394a I/F (Saderta) boards, a Dual I2C I/F (JPL) board, and two or three empty slots, each with a hard drive; also shown: SUN E3500 Workstation (35 GB HD), two SUN Ultra 10 Workstations, GPIB, FPGA Rapid Prototype, PCI Bus analyzer, Terminal Server]