Swankoski MAPLD 2005 / B103
Dynamic High-Performance Multi-Mode Architectures for AES Encryption
Eric Swankoski, Naval Research Lab
Vijay Narayanan, Penn State University
Background & Motivation
Bandwidth and throughput capabilities of modern optical networks are skyrocketing
Protecting transmitted data is becoming more and more critical
Current encryption architectures generally aren’t capable of keeping up with high-speed environments
Single-event upset (SEU) effects are rarely, if ever, considered
Plan of Attack: FPGA Encryption
Algorithm: Advanced Encryption Standard (AES)
– Supports multiple key lengths
– Supports multiple encryption modes
– Supports multiple levels of pipelining
Target Architecture: Xilinx FPGAs
– Can be adapted to ASIC devices
– Virtex-II, Virtex-4
Target Performance: 60+ gigabits per second
– Requires both inner-round and outer-round pipelining
The AES Algorithm
10 rounds of encryption for 128-bit blocks with a 128-bit key
Four basic operations:
– SubBytes: 8-bit substitution (16 parallel operations per round)
– ShiftRows: Byte reordering and rotation (4 parallel operations per round)
– MixColumns: Polynomial multiplication (4 parallel operations per round)
– AddRoundKey: Simple 128-bit XOR
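The three non-substitution operations listed above can be sketched in a few lines of Python. This is only an illustrative software model, not the FPGA design from the talk; the function names and the column-major 16-byte state layout are assumptions chosen for the example.

```python
def xtime(a):
    """Multiply a byte by x (i.e. 2) in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    a <<= 1
    if a & 0x100:
        a ^= 0x11B
    return a & 0xFF


def shift_rows(state):
    """Rotate row r of the 4x4 state left by r bytes.

    Assumed layout: state[r + 4*c] holds the byte in row r, column c.
    """
    return [state[r + 4 * ((c + r) % 4)] for c in range(4) for r in range(4)]


def mix_single_column(col):
    """Multiply one 4-byte column by the fixed AES polynomial (coefficients 2, 3, 1, 1)."""
    a0, a1, a2, a3 = col
    return [
        xtime(a0) ^ (xtime(a1) ^ a1) ^ a2 ^ a3,
        a0 ^ xtime(a1) ^ (xtime(a2) ^ a2) ^ a3,
        a0 ^ a1 ^ xtime(a2) ^ (xtime(a3) ^ a3),
        (xtime(a0) ^ a0) ^ a1 ^ a2 ^ xtime(a3),
    ]


def mix_columns(state):
    """Apply mix_single_column to each of the four columns independently."""
    out = []
    for c in range(4):
        out.extend(mix_single_column(state[4 * c:4 * c + 4]))
    return out


def add_round_key(state, round_key):
    """XOR the 16-byte state with a 16-byte round key."""
    return [s ^ k for s, k in zip(state, round_key)]
```

Each of the four columns (and each of the four rows) is processed independently, which is the parallelism the slide counts per round.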
Optimizing for Performance
Exploit all possible parallelism
Alternative byte substitution methods
– 1 cycle for a lookup-based substitution
– 5 cycles for a mathematical transformation
Utilize pipelining
– Outer-Round: 1 cycle per round
– Inner-Round: 4 cycles per round (lookup-based byte substitution) or 8 cycles per round (pipelined byte substitution)
Combinatorial Byte Substitution
Actual mathematical transformation
Conventional implementation cannot be pipelined
– Simple (atomic) 8x8 lookup table
Smaller than lookup table
Faster than lookup table
– Utilizes five-stage pipeline
All internal operands are four bits wide
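To illustrate what "actual mathematical transformation" means, the sketch below computes an S-box entry from its definition: a multiplicative inverse in GF(2^8) followed by the AES affine transform. It works directly on 8-bit values for clarity. The slide's remark that internal operands are four bits wide suggests a composite-field decomposition in hardware, which is not reproduced here; the function names are illustrative.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def gf_inv(a):
    """Multiplicative inverse computed as a^254; by AES convention 0 maps to 0."""
    if a == 0:
        return 0
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


def rotl8(x, n):
    """Rotate an 8-bit value left by n bits."""
    return ((x << n) | (x >> (8 - n))) & 0xFF


def sbox(x):
    """AES SubBytes for one byte: field inversion followed by the affine transform."""
    inv = gf_inv(x)
    return inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63
```

Evaluating `sbox` for all 256 inputs yields the same table a ROM-based (atomic) implementation would store, so the two approaches are interchangeable from the rest of the round's point of view.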
Encryption Round Diagram
Atomic S-Box:
– 40 Pipeline Stages
Combinatorial S-Box:
– 76 Pipeline Stages
– Needs a constant stream to be effective
Parallel Key Scheduling
– No performance penalty
Offline Key Scheduling
– Precomputed keys can be stored in registers
Counter (CTR) Mode
Effectively converts AES into a stream cipher
High security – similar to CBC
Supports inner-round and outer-round pipelining
No error propagation – errors are completely isolated
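A minimal software sketch of counter mode shows why it pipelines well and why errors stay isolated: every keystream block depends only on the key and a counter value, never on other data blocks. The block function below is a hash-based stand-in, NOT AES, used only so the example runs with the standard library; the names `toy_block_encrypt` and `ctr_crypt` and the 8-byte nonce plus 8-byte counter layout are assumptions for illustration.

```python
import hashlib

BLOCK = 16  # bytes per block, matching the 128-bit AES block size


def toy_block_encrypt(key, block):
    """Stand-in for the AES block function (NOT AES): a truncated SHA-256 of key + block."""
    return hashlib.sha256(key + block).digest()[:BLOCK]


def ctr_crypt(key, nonce, data):
    """Encrypt or decrypt data in counter mode (the operation is its own inverse).

    nonce is assumed to be 8 bytes; the remaining 8 bytes of each counter
    block hold the block index.
    """
    out = bytearray()
    for i in range(0, len(data), BLOCK):
        counter_block = nonce + (i // BLOCK).to_bytes(8, "big")
        keystream = toy_block_encrypt(key, counter_block)
        chunk = data[i:i + BLOCK]
        out.extend(p ^ k for p, k in zip(chunk, keystream))
    return bytes(out)
```

Because the counter blocks are known in advance, a hardware pipeline can accept a new one every clock, and a flipped ciphertext bit changes only the corresponding plaintext bit after decryption.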
Cipher Block Chaining (CBC) Mode
Most secure – no patterns are observed
Cannot be pipelined
100% downstream corruption resulting from data loss or single-event upsets (SEUs) during encryption
– Errors are isolated during decryption
Electronic Codebook (ECB) Mode
Supports full pipelining
No error propagation – errors are completely isolated
Least secure – identical input gives identical output
– Patterns observable in video and image data
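The "identical input gives identical output" weakness is easy to see in a small model. The sketch below covers the encryption side only and uses a hash-based stand-in for the block function (NOT AES, and not invertible), so it demonstrates the pattern leak rather than a usable cipher; all names are illustrative.

```python
import hashlib

BLOCK = 16  # bytes per block


def toy_block_encrypt(key, block):
    """Stand-in for the AES block function (NOT AES): a truncated SHA-256 of key + block."""
    return hashlib.sha256(key + block).digest()[:BLOCK]


def ecb_encrypt(key, data):
    """Encrypt each 16-byte block independently, as ECB mode does."""
    if len(data) % BLOCK != 0:
        raise ValueError("data length must be a multiple of the block size")
    return b"".join(
        toy_block_encrypt(key, data[i:i + BLOCK]) for i in range(0, len(data), BLOCK)
    )


def count_distinct_blocks(data):
    """Count how many different 16-byte blocks appear in data."""
    return len({data[i:i + BLOCK] for i in range(0, len(data), BLOCK)})
```

A run of identical plaintext blocks, such as a flat region of an image, produces a run of identical ciphertext blocks, which is what makes the structure visible.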
Staggered CBC Mode
Pipelined with Output Feedback
Each encrypted block n depends on itself and the block (n – x), where x is the latency of the pipeline
Maintains security while mitigating some error propagation problems
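The chaining rule above can be modeled as x interleaved CBC chains, with block n XORed against ciphertext block n – x before encryption. The sketch below is an encryption-side illustration only. It assumes one initialization vector per chain (the slides do not say how the first x blocks are seeded) and uses a hash-based stand-in for the block function, NOT AES; all names are illustrative.

```python
import hashlib

BLOCK = 16  # bytes per block


def toy_block_encrypt(key, block):
    """Stand-in for the AES block function (NOT AES): a truncated SHA-256 of key + block."""
    return hashlib.sha256(key + block).digest()[:BLOCK]


def xor_bytes(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def staggered_cbc_encrypt(key, ivs, blocks):
    """Chain block n with ciphertext block n - x, where x = len(ivs).

    ivs holds one assumed IV per chain; blocks is a list of 16-byte plaintext blocks.
    """
    x = len(ivs)
    out = []
    for n, plain in enumerate(blocks):
        previous = out[n - x] if n >= x else ivs[n]
        out.append(toy_block_encrypt(key, xor_bytes(plain, previous)))
    return out
```

In this model a disturbance entering chain k affects only ciphertext blocks k, k + x, k + 2x, and so on, leaving the other x – 1 chains untouched, which is one way to read "mitigating some error propagation problems".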
More Challenges
Error-Tolerant Encryption
Maintaining High Security
Maintaining High Performance
Error-Tolerant Encryption
Are errors acceptable?
– Possibly, but better to assume not
How do the multiple modes of encryption deal with upsets?
Is there a benefit to triple modular redundancy (TMR)?
– Is it what we expect?
Error-Tolerant Encryption
CTR and ECB encryption isolate errors
– Transmission integrity largely preserved even without SEU mitigation
TMR can ensure 100% transmission integrity
– TMR REQUIRED for CBC encryption
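The core of TMR is a bitwise majority vote across three copies of the same computation. The sketch below models that voter on Python integers standing in for 128-bit words; the function name is illustrative, and it says nothing about how the voters are placed in the actual design.

```python
def majority_vote(a, b, c):
    """Return the bitwise majority of three redundant copies of a value.

    Each output bit takes the value held by at least two of the three inputs,
    so an upset confined to one copy is masked.
    """
    return (a & b) | (a & c) | (b & c)
```

For example, if one copy of a 128-bit block has a single flipped bit, `majority_vote(good, upset, good)` still returns `good`.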
Error-Tolerant Encryption
Image 1: Error-Free Plaintext Image
– Before Encryption / After Decryption
– CTR, ECB, or CBC with mitigation
Image 2: Decrypted Plaintext Image
– One corrupted block
– CTR or ECB without mitigation
Image 3: Decrypted Plaintext Image
– One block corrupted during encryption
– CBC without mitigation
Maintaining High Security
How do the multiple modes of encryption affect security?
Is physical protection of the key necessary?
– Depends on the environment
How is throughput affected by increased security?
– Hopefully, not at all…
Maintaining High Security
ECB-encrypted image has observable patterns
CTR/CBC/SCBC encryption looks like random noise
Maintaining High Security
Physical Key Protection
– Not required in aerospace applications
Power Analysis / Soft Attacks
– Countermeasures not mode specific
Throughput Effects
– ECB & CTR far outperform CBC
– Why is CBC an official mode?
System-Level Diagram
Supports ECB, CTR, CBC, and SCBC modes
Supports two types of TMR
– System: triplicates all control, key hardware, and mode logic
– Encryption: triplicates only encryption and key scheduling hardware
Performance Results – Virtex-4
Key Scheduling
– Offline uses precomputed and stored keys (compile or design time)
– Online uses dynamically computed keys (run time)
Significant performance improvement for combinatorial byte substitution in pipelined mode
Virtex-II Pro performs better with ROM implementation (56.42 & 60.35 Gbps)
Better CBC performance achieved through other architectures
Byte Substitution   Key Scheduling   Area    Frequency   Throughput (CTR, ECB, SCBC)   Throughput (CBC)
ROM                 Online           3588    339.5 MHz   43.5 Gbps                     1.088 Gbps
ROM                 Offline          2827    446.8 MHz   57.2 Gbps                     1.430 Gbps
Combinatorial       Online           13651   519.2 MHz   66.5 Gbps                     700.0 Mbps
Combinatorial       Offline          10912   519.2 MHz   66.5 Gbps                     700.0 Mbps
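The pipelined throughput figures are consistent with one 128-bit block completing per clock cycle. The sketch below checks that arithmetic. For CBC it assumes one block per full pass through the 40-stage atomic S-box pipeline, which approximately reproduces the two ROM rows; the combinatorial CBC rows are not modeled here. Function names are illustrative.

```python
BLOCK_BITS = 128  # AES block size in bits


def pipelined_gbps(freq_mhz):
    """Throughput in Gbps, assuming one 128-bit block completes every clock cycle."""
    return freq_mhz * 1e6 * BLOCK_BITS / 1e9


def feedback_gbps(freq_mhz, latency_cycles):
    """Throughput in Gbps, assuming one block completes every latency_cycles clocks."""
    return pipelined_gbps(freq_mhz) / latency_cycles


# Clock frequencies (MHz) taken from the Virtex-4 results table.
rom_online = pipelined_gbps(339.5)
rom_offline = pipelined_gbps(446.8)
combinatorial = pipelined_gbps(519.2)

# Assumed 40-cycle latency for the ROM-based (atomic S-box) design in CBC mode.
rom_offline_cbc = feedback_gbps(446.8, 40)
```

Under these assumptions the gap between pipelined and CBC throughput is simply the pipeline depth, which is why the feedback mode cannot benefit from deeper pipelining.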
Lessons Learned
Don’t try to over-optimize FPGA code
– Returns diminish quickly
– Sometimes less is more
Know your synthesis tool
– Now why did it do THAT?
Check your system’s memory
– RAM does fail at inopportune times…
– ESPECIALLY if it has a lifetime warranty
Lessons Learned
Over-optimization
– In a highly pipelined FPGA design, routing plays a MAJOR role in the clock frequency: 70%-80% of the total delay
– What would work in an ASIC (or in theory, or on paper…) might actually make things worse
– Manual floorplanning and P&R might help, but usually provide minimal (if any) improvement
– Moral? Try reducing the pipeline depth as well as increasing it; it just might help!