Hot Chips 16 August 24, 2004
OptimoDE: Programmable Accelerator Engines Through Retargetable Customization
Nathan Clark, Hongtao Zhong, Kevin Fan, Scott Mahlke
CCCP Research Group, University of Michigan
http://cccp.eecs.umich.edu
Krisztián Flautner, Koen Van Nieuwenhove
ARM Limited
OptimoDE Overview
• OptimoDE
  – A configurable VLIW-styled Data Engine architecture
  – Targeted at intensive data processing
• Characteristics
  – Very wide performance envelope
    • Power / area / speed tradeoff
    • Exploiting parallelism in applications
  – Unlimited data path configuration options
  – User extensible through ISA customization
• Semi-automatic design system
  – User-in-the-loop design, retargetable compiler toolchain
OptimoDE in a System On Chip
[Figure: SoC block diagram. Labeled blocks: SDRAM, SoC, AHB Bus Matrix (with master M and slave S ports), ARM CPU, DMA Controller, Interrupt Controller, SRAM, I/O, SDRAM Controller, Memory Control, and the Data Engine with DATA 1, DATA 2, MEM CTRL, FIFO, switch, Memory, Data, M1, M2.]
OptimoDE Architecture Model
• Functional Units
  – ALU, ACU, Multipliers
  – Custom
• Memory
  – RAM (asynch / synch)
  – ROM
• I/O ports
  – addressable
  – handshake protocol
• Registers
  – Register files
• Interconnect
  – Direct connection
  – Shared bus
• Controller
• All layers required
• Intra-layer configuration
[Figure: layered architecture model. Labeled layers: Controller, Interconnect, Function units, Memories, I/O ports, Registers (regs … regs).]
Design Toolchain
[Figure: design toolchain flow around DesignDE and DEvelop, from "A Definition" to "A Evaluation". Labeled elements: Librarian; OptimoDE Library and User Library; files A.tmp, A.inc, A.inc.c1, A.inc.c2, A.inc.c3; actions save, set target, load, run / profile, instantiate, create_resource; ISS; LIFETIME; LOAD; a C fragment "#include … main(){ … = dct(); }".]
Compiler Toolchain
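[Figure: example dataflow graph of multiply and add operations (*1, +2, +3, *4, *5, +6, +7, *8, *9, +10) between INPUT and OUTPUT, drawn against a scale labeled 0 to 5.]
• Inputs to DEvelop: C source, architecture description
• Output: micro-code
• Stages: check, map, compile (with analysis feedback)
  – Syntax checks
  – Dataflow analysis
  – Match architecture and dataflow graph
  – Optimize code and register use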
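The "match architecture and dataflow graph" step can be illustrated with a generic list scheduler. This is a minimal sketch of a textbook technique, not a description of how DEvelop works: the function name `list_schedule`, the unit kinds, the single-cycle latency, and the alphabetical priority order are all assumptions made for illustration.

```python
def list_schedule(ops, deps, units):
    """Assign each operation a cycle, respecting dependences and unit counts.

    ops   -- dict: operation name -> kind of unit it needs (e.g. "mul", "add")
    deps  -- dict: operation name -> list of operations whose results it uses
    units -- dict: unit kind -> number of such units available per cycle
    Assumption: every operation takes exactly one cycle.
    """
    cycle_of = {}
    cycle = 0
    while len(cycle_of) < len(ops):
        free = dict(units)          # units still unused in this cycle
        placed = 0
        for op in sorted(ops):      # fixed order; real schedulers use priorities
            if op in cycle_of:
                continue
            # An operand is usable only if it was produced in an earlier cycle.
            ready = all(p in cycle_of and cycle_of[p] < cycle
                        for p in deps.get(op, []))
            if ready and free.get(ops[op], 0) > 0:
                free[ops[op]] -= 1
                cycle_of[op] = cycle
                placed += 1
        if placed == 0 and not any(c == cycle - 1 for c in cycle_of.values()):
            # Nothing placed now and nothing finished last cycle: no progress possible.
            raise ValueError("cannot schedule: cyclic dependence or missing unit")
        cycle += 1
    return cycle_of


# Hypothetical four-operation graph on a datapath with one multiplier and one adder.
ops = {"m1": "mul", "a2": "add", "m4": "mul", "a3": "add"}
deps = {"a2": ["m1"], "a3": ["a2", "m4"]}
schedule = list_schedule(ops, deps, {"mul": 1, "add": 1})
```

Adding a second multiplier to `units` would let `m1` and `m4` issue in the same cycle, which is the kind of datapath tradeoff the design flow is meant to expose.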
32-point DCT Microarchitecture
[Figure: datapath blocks: inport, outport, acu_1, acu_2, acu_3, rom, ram_1, ram_2, alu_1, alu_2, imm_1, imm_2, control.]
• 2 Custom FUs, 2 RAM, 1 ROM, 3 ACU, 2 I/O ports
• Designer responsible for creating custom units manually
Retargetable Customization
• Prototype 2 technologies in OptimoDE
  – Automated ISA customization
  – Retargetable customization to an “application-area”
• Customizing for 1 application
  – Programmability: nominally programmable
  – Critical problem: cannot sustain performance across similar applications
  – How well does a custom ISA generalize?
    • 5 encryption algorithms, create custom design for each
    • Average loss >80% versus native [MICRO, 2003]
  – Proactive generalization creates a retargetable design
Creating Custom Instructions
• Candidate discovery
  – Identify customization opportunities
    • Examine program DFG
    • Partition DFG at:
      – Memory operations
      – Unprofitable edges
    • Enumerate candidate subgraphs within each partition
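The discovery steps above can be sketched as follows. This is an illustrative sketch only: the opcode names in `MEMORY_OPS`, the `cut_edges` parameter standing in for "unprofitable edges", and the `max_size` bound are assumptions, and the slides do not say how the real tool enumerates subgraphs.

```python
MEMORY_OPS = {"load", "store"}  # assumed opcode names for memory operations


def enumerate_candidates(ops, edges, cut_edges=(), max_size=3):
    """Return connected subgraphs (frozensets of node ids) of a dataflow graph.

    ops       -- dict: node id -> opcode string
    edges     -- iterable of (producer, consumer) node id pairs
    cut_edges -- edges judged unprofitable; candidates never span them
    max_size  -- largest candidate, in nodes (keeps enumeration bounded)
    """
    cut = {frozenset(e) for e in cut_edges}
    # Partition step: memory operations are dropped, which splits the graph.
    compute = {n for n, op in ops.items() if op not in MEMORY_OPS}
    adj = {n: set() for n in compute}
    for u, v in edges:
        if u in compute and v in compute and frozenset((u, v)) not in cut:
            adj[u].add(v)
            adj[v].add(u)

    found = set()
    frontier = {frozenset([n]) for n in compute}  # start from single nodes
    while frontier:
        found |= frontier
        grown = set()
        for sub in frontier:
            if len(sub) >= max_size:
                continue
            for n in sub:
                for m in adj[n] - sub:        # grow by one adjacent node
                    grown.add(sub | {m})
        frontier = grown - found
    return found


# Hypothetical DFG: load -> shr -> or -> add -> store
ops = {1: "load", 2: "shr", 3: "or", 4: "add", 5: "store"}
edges = [(1, 2), (2, 3), (3, 4), (4, 5)]
candidates = enumerate_candidates(ops, edges)
```

Growing each candidate one neighbor at a time guarantees that every result is connected; the size bound matters because the number of connected subgraphs can grow exponentially with partition size.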
Grouping and Selection
• Group candidate subgraphs with same structure
[Figure: candidate subgraphs in the DFG labeled Group 1 to Group 4.]
• Estimate performance and cost for each group
  – Cost: 0.5 adders, gain: 1,000 cycles
  – Cost: 1 adder, gain: 2,500 cycles
  – Cost: 2 adders, gain: 10,000 cycles
  – Cost: 1 adder, gain: 1,500 cycles
• Greedily select groups to implement in hardware subject to budget
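A greedy selection under a cost budget might look like the sketch below. The ranking by gain per unit cost is an assumption (the slide says only "greedily"), and the labels `a` to `d` are hypothetical because the slide text does not tie each cost/gain pair to a group number. The figures themselves are the four pairs listed above.

```python
def greedy_select(groups, budget):
    """Pick groups in decreasing gain-per-cost order while they fit the budget.

    groups -- dict: label -> (cost in adders, gain in cycles)
    budget -- total cost allowed, in adders
    Returns (chosen labels in selection order, unused budget).
    """
    ranked = sorted(groups.items(),
                    key=lambda item: item[1][1] / item[1][0],  # gain / cost
                    reverse=True)
    chosen, remaining = [], budget
    for label, (cost, gain) in ranked:
        if cost <= remaining:
            chosen.append(label)
            remaining -= cost
    return chosen, remaining


# Cost/gain pairs from the slide; the labels are placeholders.
groups = {
    "a": (0.5, 1_000),
    "b": (1, 2_500),
    "c": (2, 10_000),
    "d": (1, 1_500),
}
chosen, left = greedy_select(groups, budget=3)
total_gain = sum(groups[g][1] for g in chosen)
```

With a budget of 3 adders this picks `c` and then `b`, using the whole budget for an estimated gain of 12,500 cycles. Greedy selection is fast but not guaranteed optimal, since the underlying problem is a knapsack.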
Proactively Generalize Groups
• Cost-effectively extend group functionality to enable reuse
• Wildcard – multiple functionality at nodes
• Subsumed – configurable interconnect to bypass nodes
[Figure: a subgraph with Input 1, Input 2, constants 0xFF and 0x8, and nodes >>, |, + feeding Output, shown next to its generalized form, whose nodes are annotated with the alternatives 0x8, 0x4; |, &; and +, -.]
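The two generalizations can be illustrated on a simplified model. In this sketch a custom function unit is flattened to a chain of nodes, each holding the set of operations it can perform; that chain model, the names `BASE` and `GENERALIZED`, and the `matches` function are assumptions for illustration, and the real subgraphs are not necessarily linear.

```python
# Each node is the set of operations that hardware node can perform.
BASE = [{">>"}, {"|"}, {"+"}]                 # original custom unit
GENERALIZED = [{">>"}, {"|", "&"}, {"+", "-"}]  # wildcard nodes added


def matches(pattern, ops, allow_bypass=False):
    """Can the unit described by `pattern` execute the operation chain `ops`?

    Without bypass, every node must be used, in order (wildcard matching).
    With bypass, nodes may be skipped, so `ops` only has to match a
    subsequence of the pattern (the "subsumed" case).
    """
    if not allow_bypass:
        return (len(pattern) == len(ops)
                and all(op in node for node, op in zip(pattern, ops)))
    i = 0
    for node in pattern:
        # Use this node for the next pending operation if it can perform it;
        # otherwise route around the node.
        if i < len(ops) and ops[i] in node:
            i += 1
    return i == len(ops)


same_app = matches(BASE, [">>", "|", "+"])
other_app_base = matches(BASE, [">>", "&", "-"])
other_app_wild = matches(GENERALIZED, [">>", "&", "-"])
shorter_chain = matches(GENERALIZED, [">>", "+"], allow_bypass=True)
```

The base unit serves only the application it was built for, while the wildcard version also accepts the `&`/`-` variant, and bypassing lets a shorter chain reuse the same hardware.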
Native Speedups
[Figure: bar chart of speedup (y-axis, 0 to 9) for 3Des, AES, Blowfish, Md5, Rc4, SHA. Series: ARM 926EJ (200 MHz); OptimoDE (333 MHz, 5 issue: 3 ALU, 1 Mem, 1 Brn); OptimoDE + 1 CFU (333 MHz).]
Importance of Generalization
[Figure: bar chart of speedup (y-axis, 0 to 7), Base versus Generalized, for every pairing of 3des, Blowfish, Rc4, AES, and Sha, from 3des-3des through Sha-Sha.]
Key: application run – application designed for
Designing for a Domain
[Figure: fraction of native speedup attained (y-axis, 0 to 1) against the set of applications designed for (x-axis): Rc4; Rc4, Sha; Blowfish, Rc4, Sha; 3des, Blowfish, Rc4, Sha; 3des, AES, Blowfish, Rc4, Sha. Series: 3des, AES, Blowfish, Rc4, Sha, Md5, Mean.]
Case Study - Md5
[Figure: speedup (y-axis, 0 to 10) against CFU cost budget relative to a 32-bit ripple-carry adder (x-axis, 0 to 10).]
OptimoDE Design for this Point
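[Figure: two dataflow graphs. One takes Input 1 to Input 4 and contains three + nodes, << with constant 0x5, >> with constant 0x1B, and |, producing Output. The other takes Input 1 to Input 3 and contains ^, &, ^, producing Output. Datapath blocks: ALU 1, ALU 2, CFU, SRAM, ACU, register files (RF … RF), Control Memory.]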
Die Area Breakdown
[Figure: datapath blocks: ALU 1, ALU 2, CFU, SRAM, ACU, register files (RF … RF), Control Memory.]
• Die impact is artificially large because of naïve implementation
• OptimoDE = 5.5 mm² in 0.13 µm; ARM 926EJ = 5.0 mm² in 0.13 µm
• Area breakdown
  – Baseline design: 78%
  – Control memory: 11%
  – Reg. file and interconnect: 9%
  – CFU: 2%
Conclusions
• OptimoDE
  – Configurable VLIW-style data engine architecture
  – Automated tools for implementing embedded signal and data processing solutions
• Automatic retargetable customization
  – Customized design combined with cost-effective generalization
  – Performance programmability: performance stability across a family of similar applications