© M. Pedram Nov 1997
Massoud Pedram
University of Southern California
Department of EE-Systems, Los Angeles, CA 90089
Outline
• Motivation and Objectives
• Power Estimation Methodology
• Example Analysis/Estimation Tools
• Power Optimization Flow
• Example Minimization Techniques
• Summary
Jan 1998
Opportunities for Power Savings
Savings by design level:
• System (50-90%): HW/SW Co-design, Custom ISA, Algorithm Design, Communication Synthesis
• Behavioral (30-60%): Scheduling, Allocation, Pipelining, Behavioral Transformations
• RT-Level (25-40%): Clock Gating, Precomputation, Operand Isolation, State Assignment, Retiming
• Logic (10-25%): Logic Restructuring, Technology Mapping, Pin Ordering & Phase Assignment
• Physical (5-10%): Fanout Optimization, Buffering, Transistor Sizing, Placement, Partitioning, Clock Tree Design, Glitch Elimination
Realistic Estimation Expectations
Estimation techniques by design level:
• System: Instruction-Level Models, IP Core Models, Program Complexity, Program Simulation
• Architectural: Entropic Bounds, Architectural Simulation, I/O and Memory Accesses
• RT-Level: RT-Level Macromodels, HDL Simulation, Quick Synthesis
• Gate: Probabilistic Simulation, Gate-Level Simulation, Sampling and Compaction, ASIC Library Models
• Transistor: Parasitic Extraction, Accurate Timing Analysis, Circuit-Level Simulation
Error scale (in slide order): 5-10%, 10-30%, 30-50%, 40-70%, 70-90%
Speed scale (in slide order): Day, Hours, Hour, Minutes, Minute
Example Applications
• Portable Electronics (PC, PDA, Wireless)
• Ultra-Low-Power Circuits (Pacemaker)
• Space Missions (Miniaturized Satellites)
• IC Cost (Packaging and Cooling)
• Reliability (Electromigration, Latch-up)
• Signal Integrity (Switching Noise, DC Voltage Drop)
• Thermal Design
Digital Camera Circuit
[Figure: block diagram of the digital camera circuit. Blocks: CCD preprocessor, DRAM, ROM, CompactFlash, single-cycle multiplier accumulator, pixel coprocessor, Mini-RISC MIPS microprocessor, memory controller, 10-channel DMA controller, JPEG codec, PC/ATA interface, general-purpose I/O, NTSC/PAL encoder, triple video D/A, on-screen display controller, LCD controller, high-speed serial I/O. External connections: camera lens, charge-coupled device, A/D, TV, camera's LCD display, PC interface.]
Source: LSI Logic
A Power Estimation Methodology
[Figure: estimation flow.
Inputs: HDL Description, Cell Library, CAD Database, Design Planning, Interconnect Estimates, Input Vector Sequences.
Steps: Decompose & Classify Blocks; Classify Modules.
Module classes: Control Logic, Interface Logic, Cores w/o Macromodel, Composite Cores, Memory, Cores with Macromodel, Analog Modules.
Estimation methods: Quick Synthesis, Sampling/Compaction, Model Reduction, Library Models, Analytic Models, Behavioral Simulation.]
Quick Synthesis
RTL Source Code → Quick RTL Synthesis → RTL Netlist → Quick Logic Synthesis → Mapped Netlist
Cell Characterization
[Figure: characterization flow. Labels: SPICE Simulation, Event-Driven Simulation, Gate-Level Netlist, SPICE Parameters, Transistor-Level Netlist, Gate-Level Power/Delay Characterization Data, RT-Level Power/Delay Characterization Data.]
Circuit Model Reduction
• Datapath Block → Predictor Block
• FSM → Approximate SSM Model
• Analog Block → Sensitivity-Based Simplified Models
• Random Logic → k-LUT Structure
Dynamic (Simulative) Techniques
[Figure: simulation flow. Labels: Input Sequence, Statistical Sampling, Sampled Sequence, Probabilistic Compaction, Compacted Sequence, RT-Level or Gate-Level Netlist, Library Data, Event or Cycle-Based Simulation, Report Power.]
Simple Random Sampling (SRS)
• Exhaustive: 0% error, very time consuming
• Random Sampling: < 5% error, > 1000X speedup
Monte Carlo Simulation
• Efficiency depends on population characteristics and sampling procedure
• “Difficult” distributions
Flow: Input vectors → Generate one sample of k units → Do circuit-level simulation → Calculate mean power & confidence interval → Converged? (NO: generate another sample; YES: Report Power)
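The Monte Carlo loop above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function name `monte_carlo_power`, its parameters, the normal-approximation stopping rule (z = 1.96, 5% relative error), and the toy population are all hypothetical, and the `simulate` callback merely stands in for a real circuit-level simulator.

```python
import math
import random
import statistics

def monte_carlo_power(population, simulate, k=30, rel_error=0.05,
                      z=1.96, min_samples=10, max_samples=1000, seed=0):
    """Repeat: draw one sample of k units, simulate it, update mean and CI."""
    rng = random.Random(seed)
    sample_means = []
    while True:
        # Generate one sample of k units (drawn with replacement).
        units = [rng.choice(population) for _ in range(k)]
        # "Simulate" each unit and record the sample's mean power.
        sample_means.append(statistics.fmean([simulate(u) for u in units]))
        n = len(sample_means)
        if n < min_samples:
            continue
        # Mean power and confidence-interval half-width (normal approximation).
        mean = statistics.fmean(sample_means)
        half_width = z * statistics.stdev(sample_means) / math.sqrt(n)
        # Converged when the half-width is within rel_error of the mean.
        if half_width <= rel_error * abs(mean) or n >= max_samples:
            return mean, half_width, n

# Toy population: each unit is a made-up per-vector power value, so the
# stand-in "simulator" is just the identity function.
toy_rng = random.Random(1)
population = [toy_rng.uniform(1.0, 3.0) for _ in range(10_000)]
mean_power, half_width, n_samples = monte_carlo_power(population, simulate=lambda p: p)
print(mean_power, half_width, n_samples)
```

In a real flow the `simulate` callback would be the expensive step, which is why the loop stops as soon as the confidence interval is tight enough.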
SRS Results
[Figure: bar chart comparing Biased Sequence and Random Sequence for circuits C432, C880, C1355, C1908, C2670, C3540, C5315, C6288, C7552; axis range 0 to 8000.]
Stratified Random Sampling (StRS)
Estimate the average weight, assuming that gender and age of individuals are readily available.
Lower sample variance leads to faster convergence.
[Figure: stratification followed by sampling.]
Application to Power Estimation
Age & Gender → Zero-delay power estimate
Weight → PowerMill power estimate
Flow: Input vectors (population) → Zero-Delay Simulation of the Entire Population → Population Stratification → Random Sampling and PowerMill Simulation → Convergent? (NO: sample again; YES: Report Power)
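A minimal sketch of the stratification idea follows, assuming a cheap estimate is available for every unit and an expensive one only for sampled units. The function `stratified_estimate`, the equal-size strata, and the synthetic data are hypothetical choices for illustration; the sketch makes a single sampling pass and omits the convergence loop shown in the flow.

```python
import random
import statistics

def stratified_estimate(cheap, expensive, n_strata=4, per_stratum=25, seed=0):
    """Estimate the mean of expensive(i) over all units, stratifying on cheap[i]."""
    rng = random.Random(seed)
    # Sort unit indices by the cheap estimate and cut into equal-size strata.
    order = sorted(range(len(cheap)), key=lambda i: cheap[i])
    size = len(order) // n_strata
    estimate = 0.0
    for s in range(n_strata):
        start = s * size
        end = len(order) if s == n_strata - 1 else start + size
        stratum = order[start:end]
        # Run the expensive evaluation only on a random sample of the stratum.
        picked = rng.sample(stratum, min(per_stratum, len(stratum)))
        stratum_mean = statistics.fmean([expensive(i) for i in picked])
        # Weight each stratum mean by the stratum's share of the population.
        estimate += stratum_mean * len(stratum) / len(order)
    return estimate

# Synthetic data: the "accurate" power is correlated with the zero-delay value.
toy = random.Random(1)
zero_delay = [toy.uniform(0.0, 10.0) for _ in range(2000)]
accurate = [1.2 * p + toy.gauss(0.0, 0.3) for p in zero_delay]
estimate = stratified_estimate(zero_delay, lambda i: accurate[i])
true_mean = statistics.fmean(accurate)
print(estimate, true_mean)
```

Because units within a stratum have similar cheap estimates, their expensive values vary less, which is the source of the lower sample variance.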
Regression Estimation
Estimate the average height, assuming that weight of individuals is readily available.
• Simple Random Estimator: H1
• Regression Estimator: H2 = H1 + Slope·(W2-W1)
H1: Avg. Height of Samples
W1: Avg. Weight of Samples
W2: Avg. Weight of Population
[Figure: plot of height versus weight showing the sample and the regression slope.]
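The regression estimator H2 = H1 + Slope·(W2-W1) can be worked through on a small example. In this sketch the slope is assumed to come from an ordinary least-squares fit on the sample; the function name `regression_estimate` and the toy numbers are hypothetical.

```python
import statistics

def regression_estimate(sample_w, sample_h, population_mean_w):
    """Return H2 = H1 + slope * (W2 - W1), with the slope fitted on the sample."""
    w1 = statistics.fmean(sample_w)   # W1: average weight of the samples
    h1 = statistics.fmean(sample_h)   # H1: average height of the samples
    # Least-squares slope of height against weight within the sample.
    sww = sum((w - w1) ** 2 for w in sample_w)
    swh = sum((w - w1) * (h - h1) for w, h in zip(sample_w, sample_h))
    slope = swh / sww
    return h1 + slope * (population_mean_w - w1)

# Toy data lying exactly on h = 2*w + 1; the population's average weight is 5.
sample_w = [1.0, 2.0, 3.0]
sample_h = [3.0, 5.0, 7.0]
h2 = regression_estimate(sample_w, sample_h, population_mean_w=5.0)
print(h2)  # 11.0
```

Here H1 = 5, W1 = 2 and the slope is 2, so the correction term 2·(5-2) moves the simple random estimate from 5 to 11, the value of the line at the population's average weight.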
StRS Results
[Figure: bar chart of Speedup (axis range 0 to 12) for circuits C432, C880, C1355, C1908, C2670, C3540, C5315, C6288, C7552.]
Sampling on FSMs
Method 1: Functional simulation + sampling on the combinational circuit
[Figure: the Input Seq. drives a Logic and FFs machine; the Input Seq. + State Lines drive the Logic block.]
Method 2: Do machine warm-up for every sampling unit
[Figure: Warm-up vectors + sampling unit drive a Logic and FFs machine.]
Warm-up sequence length can be quite large.
Probabilistic Compaction (PC)
Sequence compaction: Generate a new, but shorter, sequence which exhibits nearly the same spatio-temporal correlations as the initial sequence.
[Figure: Initial Seq. → Probabilistic Model → Sequence Generation → New Seq. → Target Circuit (in, out).]
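As a rough illustration of sequence compaction, the sketch below builds a flat first-order Markov model from an input vector sequence and generates a shorter sequence from it. This is a deliberately simplified stand-in, not the Dynamic Markov Tree method presented in the talk; the function `compact_sequence` and its restart rule are assumptions, and the toy input is the sequence S1 from the Dynamic Markov Trees slide.

```python
import random
from collections import Counter, defaultdict

def compact_sequence(vectors, length, seed=0):
    """Generate `length` vectors from a first-order Markov model of `vectors`."""
    rng = random.Random(seed)
    # Count how often each vector is followed by each other vector.
    transitions = defaultdict(Counter)
    for cur, nxt in zip(vectors, vectors[1:]):
        transitions[cur][nxt] += 1
    state = vectors[0]
    out = [state]
    while len(out) < length:
        counts = transitions.get(state)
        if not counts:
            # No observed successor: restart from the first vector
            # (a simplification that adds a transition not in the input).
            state = vectors[0]
        else:
            successors, weights = zip(*counts.items())
            # Pick the next vector in proportion to the observed counts.
            state = rng.choices(successors, weights=weights)[0]
        out.append(state)
    return out

s1 = ["0000", "0001", "1001", "1100", "1001", "1100", "1001", "1100"]
compacted = compact_sequence(s1, length=4)
print(compacted)  # ['0000', '0001', '1001', '1100']
```

Every vector in S1 has exactly one observed successor, so the generated sequence is deterministic here; with richer inputs the generator reproduces the pairwise transition statistics only on average.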
PC by Example
Initial sequence: 011010100111110101110000011011100010000110101110101
Avg. bit activity: 23/16
Compacted sequence: 001010011000010001
Avg. bit activity: 8/5
Stochastic State Machine Model
[Figure: state transition graph with states 0, 1, 3, 4, 5, 6, 7 and edge probabilities 1, 1/2, 1/3, 2/3, 1/4.]
Dynamic Markov Trees
S1 = (0000, 0001, 1001, 1100, 1001, 1100, 1001, 1100)
[Figure: Dynamic Markov Trees DMT0 and DMT1 built from S1, annotated with node and edge counts; the drawing distinguishes an upper subtree and lower subtrees.]
Comparison with SRS
[Figure: bar chart comparing SRS and Hierarchical for circuits C432, C880, C1355, C1908, C3540, C6288; axis range 0% to 8%; L = 4,000, Compaction Ratio 10.]
Compaction for FSMs
[Figure: the Input Sequence drives the Target Circuit (Logic and FFs).]
z_n = out(x_n, s_n)
s_{n+1} = next(x_n, s_n)
Markov Chain: p(x_n s_n)
Higher Order DMTs
A lag-k Markov chain which correctly models the input sequence also models the joint k-step conditional probabilities of the primary inputs and state lines.
• DMT0 (vi): p(vi)
• DMT1 (vj): p(vj|vi)
• DMT2 (vk): p(vk|vjvi)
High Order DMT Results
[Figure: bar chart comparing Order 1 and Order 2 for circuits bbara, dk17, mc, planet, shiftreg, s1196, s1423, s5378, s820, s9234; axis range 0% to 60%; Compaction Ratio 10.]
Instruction Level Compaction
[Figure: Original Program → Architectural Simulation → Characteristic Profile → Program Synthesis → New Program → RTL Simulation (using the CPU’s RTL Model) → Power Estimation.]
Characteristic Profile
• Instruction mix
• Average instruction size (if applicable)
• Clocks per instruction
• Branch prediction miss rate
• Instruction cache miss rate
• Data cache read/write miss rate
• Pipeline stall rate
• Speculative execution and register renaming are not considered
• Target micro-processor: A super-scalar pipelined CPU with branch prediction (Intel’s Pentium chip)
Program Synthesis Procedure
1. Block Allocation
2. Instruction Allocation: add op1, op2; xor op1, op2; mul op1, op2; mov op1, mem1; branch
3. Memory Allocation (memory space): mov op1, mem1
4. Operand Allocation / Instruction Scheduling (reg1, reg4; reorder the sequence): add op1, op2
Results: Compression Ratio
[Figure: bar chart of compression ratio (logarithmic axis, 1 to 100000) for programs VORTEX, PERL, IJPEG, LISP, COMP, GCC, M88, GO.]
Results: Accuracy
[Figure: bar chart of Energy per Instruction (nano Joule, axis range 0 to 150) comparing Synthesized and Original programs for VORTEX, PERL, IJPEG, LISP, COMP, GCC, M88, GO.]
Power Macromodeling
[Figure: macromodeling flow. Labels: Module, Analytic Model Reduction, Macromodel Equation Form, Model Variables, Training Set, Low-Level Simulation, Regression Analysis, Regression Coefficients, Macromodels.]
Dual Bit Type Model
Consider a data path block:
Pwr = C0 + C1·S1 + C2·S2 + C3·S3 + C4·S4
S1 (S2): avg. switching activity of the LSB (MSB) region of operand 1
S3 (S4): avg. switching activity of the LSB (MSB) region of operand 2
[Figure: operand word showing the sign bit, MSB region, and LSB region, feeding the Module.]
Input/Output Data Model
Consider a data path block:
Pwr = C0 + C1·S1 + C2·S2 + C3·S3
S1, S2: avg. switching activity of operands 1 and 2
S3: avg. switching activity of output
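To show how the coefficients of Pwr = C0 + C1·S1 + C2·S2 + C3·S3 might be obtained, here is a small least-squares fit in plain Python. It is a sketch under assumptions: the helper names `solve` and `fit_macromodel`, the use of normal equations, and the training data (generated from made-up coefficients rather than from low-level simulation) are all hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_macromodel(activities, powers):
    """Least-squares fit of Pwr = C0 + C1*S1 + C2*S2 + C3*S3 via normal equations."""
    rows = [[1.0] + list(s) for s in activities]  # leading 1.0 models C0
    k = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    atb = [sum(r[i] * p for r, p in zip(rows, powers)) for i in range(k)]
    return solve(ata, atb)

# Made-up training set: (S1, S2, S3) activities and powers from known coefficients.
true_c = [0.5, 2.0, 1.5, 3.0]
activities = [(0.1, 0.2, 0.3), (0.4, 0.1, 0.5), (0.3, 0.6, 0.2),
              (0.7, 0.3, 0.4), (0.2, 0.5, 0.6), (0.5, 0.4, 0.1)]
powers = [true_c[0] + sum(c * s for c, s in zip(true_c[1:], act)) for act in activities]
coeffs = fit_macromodel(activities, powers)
print(coeffs)
```

With noise-free training data the fit recovers the generating coefficients up to rounding; with simulated power values the residual gives a first indication of macromodel error.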
Bitwise Data Model
Consider a random logic block:
Pwr = C0 + ∑_{i ∈ inputs} Ci·Si
Si: avg. switching activity of input signal i
More parameters lead to a higher degree of accuracy, but increase the computational overhead.
Cycle-Accurate Macro-Models
[Figure: uses of cycle-accurate macro-models. Labels: Average Power, Peak Power, Power Distribution, DC Voltage Drop, Signal Integrity, Switching Noise.]
Statistical Macromodel Construction
[Figure: construction flow. Labels: Circuit, Database, Equation Form, Initial Variable Set, Training Set (input bit vectors), Low-level Simulation, Variable Selection based on Sensitivity Analysis, Least Square Fit, Model Validation? (No: repeat; Yes: DONE).]
[Figure: bar chart of Error in Cycle Power (%) and Error in Average Power (%) for C1355, C1908, C2670, C3540, C432, C5315, C6288, C7552, C880, Mul16, Adder16; axis range 0 to 20.]
Power Optimization Flow
• HW/SW Co-Design: Partitioning and Mapping, Communication Synthesis
• Datapath Synthesis: Multiple Voltage Scheduling, Resource Allocation/Binding
• Controller Synthesis: State Assignment
• Interface Design: I/O Encoding
• Design Planning: Floorplanning & Global Routing, Power/Delay Budgeting
• Logic Synthesis: Logic Minimization & Restructuring, Technology Mapping
• Physical Design: Placement & Routing, Resizing & Rewiring
System and Behavioral Synthesis
• Resource Allocation and Sharing
• Interconnect Synthesis
• Communication Synthesis
• Selective Shut-off of System Modules
• Multiple Supply Voltage Scheduling
• HW-SW Partitioning and Mapping
RT-Level and Logic Synthesis
• Logic Restructuring and Minimization
• Technology Mapping
• Retiming
• Precomputation or Bypass Logic
• State Assignment and Bus Encoding
• Gated Clocking
Physical Design
• Transistor Re-ordering under a DSM Delay/Power Model
• Gate Sizing under a DSM Delay/Power Model
• Bounded-Skew Gated Clock Routing
• Floorplanning with Integrated Power Plane Design
• Fanout Optimization under a DSM Delay/Power Model
• P/G Network Design for Ground Bounce Control
Summary
• CAD flows and tools can reduce power dissipation in VLSI circuits and systems by a factor of 5 - 8 X over the next three years
• Process and voltage scaling can provide another factor of 8 - 12 X
• Commercial tools for gate-level and circuit-level power estimation and optimization exist
• High level power analysis and estimation tools are a key enabling technology
• Early Design planning and system-level design and power optimization tools are needed