Download - Kris Gaj ECE 545 Research and teaching interests: Digital System … · 2010-09-14 · for ASICs DIGITAL SYSTEMS DESIGN Concentration advisors: Kris Gaj, Jens-Peter Kaps, Ken Hintz


Course web page:

ECE 545

Digital System Design with VHDL

ECE web page → Courses → Course web pages → ECE 545

http://ece.gmu.edu/coursewebpages/ECE/ECE545/F10/

Kris Gaj

Office hours: Monday, 7:30-8:30 PM, Wednesday, 6:00-7:00 PM, and by appointment

Research and teaching interests: •  reconfigurable computing •  computer arithmetic •  cryptography •  network security

Contact: The Engineering Building, room 3225

[email protected]

ECE 545

Part of:

MS in Electrical Engineering

MS in Computer Engineering

Digital Systems Design Microprocessor and Embedded Systems

Strongly suggested for two concentration areas:

Elective

Elective course in the remaining concentration areas

One of five core courses (must be passed with B or better)

Design level:

algorithmic, register-transfer, gate, transistor, layout, devices

Courses:

ECE 545 Digital System Design with VHDL

ECE 645 Computer Arithmetic

ECE 586 Digital Integrated Circuits

ECE 680 Physical VLSI Design

ECE 681 VLSI Design for ASICs

ECE 682 VLSI Test Concepts

ECE 684 MOS Device Electronics

ECE 584 Semiconductor Device Fundamentals

DIGITAL SYSTEMS DESIGN

Concentration advisors: Kris Gaj, Jens-Peter Kaps, Ken Hintz

1. ECE 545 Digital System Design with VHDL – K. Gaj, project, FPGA design with VHDL, Aldec/Mentor Graphics, Xilinx/Altera

2. ECE 645 Computer Arithmetic – K. Gaj, project, FPGA design with VHDL or Verilog, Aldec/Mentor Graphics, Xilinx/Altera

3. ECE 681 VLSI Design for ASICs – N. Klimavicz, project/lab, back-end ASIC design with Synopsys tools

4. ECE 586 Digital Integrated Circuits – D. Ioannou, R. Mulpuri

5. ECE 682 VLSI Test Concepts – T. Storey

Grading Scheme

•  Homework - 10%

•  Project - 40%

•  Midterm Exam - 20%

•  Final Exam - 30%


Midterm exam 1

  2 hours 30 minutes

  in class

  design-oriented

  open-books, open-notes

  practice exams will be available on the web

Tentative date: Monday, November 1st

Final exam

  2 hours 45 minutes

  in class

  design-oriented

  open-books, open-notes

  practice exams will be available on the web

Date: Monday, December 20, 7:30-10:15pm

Project

  individual

  semester-long

  related to the research project conducted by Cryptographic Engineering Research Group (CERG) at GMU

  supporting NIST (National Institute of Standards and Technology) in the evaluation of candidates for a new cryptographic standard


Background

Hash Function

A hash function h maps a message m of arbitrary length to a hash value h(m) of fixed length.

It is computationally infeasible to find two different messages m and m’ such that h(m)=h(m’)
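The fixed-length property can be illustrated with a short sketch. It assumes Python's standard hashlib module and uses SHA-256 purely as an example; the helper name hash_hex is hypothetical and not part of the course material.

```python
import hashlib

def hash_hex(message: bytes) -> str:
    """Return the SHA-256 hash value of a message as a hex string."""
    return hashlib.sha256(message).hexdigest()

# Messages of very different lengths...
short_hash = hash_hex(b"abc")
long_hash = hash_hex(b"a" * 1_000_000)

# ...produce hash values of the same fixed length (256 bits = 64 hex digits).
print(len(short_hash), len(long_hash))
print(short_hash)
```

The two inputs differ in length by a factor of more than 300,000, yet both outputs are 256 bits long.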


Main Application: Digital Signature

Signature

DIGITAL HANDWRITTEN

A6E3891F2939E38C745B 25289896CA345BEF5349 245CBA653448E349EA47

Main Goals: •  unique identification •  proof of agreement to the contents of the document

Typical Digital Signature Scheme

Alice: Message → Hash function → Hash value → Public key cipher (Alice’s private key) → Signature

Bob: Message → Hash function → Hash value 1

Bob: Signature → Public key cipher (Alice’s public key) → Hash value 2

Hash value 1 = Hash value 2? yes / no
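The hash-then-sign flow can be sketched as follows. This is a toy illustration only: it assumes textbook RSA with tiny, insecure parameters (P, Q, E, D are illustrative values chosen here, not from the slides), and it reduces the SHA-256 hash value modulo N, which no real scheme would do.

```python
import hashlib

# Toy RSA parameters (textbook-sized and insecure; for illustration only).
P, Q = 61, 53
N = P * Q      # modulus, 3233
E = 17         # Alice's public exponent
D = 2753       # Alice's private exponent; (E * D) % ((P - 1) * (Q - 1)) == 1

def hash_to_int(message: bytes) -> int:
    """Hash the message and reduce the hash value to the toy modulus."""
    digest = hashlib.sha256(message).digest()
    return int.from_bytes(digest, "big") % N

def sign(message: bytes) -> int:
    """Alice: hash the message, then apply the public key cipher with her private key."""
    return pow(hash_to_int(message), D, N)

def verify(message: bytes, signature: int) -> bool:
    """Bob: compare the hash of the message with the value recovered from the signature."""
    hash_value_1 = hash_to_int(message)
    hash_value_2 = pow(signature, E, N)   # uses Alice's public key
    return hash_value_1 == hash_value_2

message = b"I agree to the contents of this document"
signature = sign(message)
print(verify(message, signature))
```

Because the signature is a function of the hash value, it depends on the entire message and can be stored and transmitted separately from it.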

Handwritten and Digital Signatures: Common Features

Both the handwritten signature and the digital signature are:

1. Unique

2. Impossible to be forged

3. Impossible to be denied by the author

4. Easy to verify by an independent judge

5. Easy to generate

Handwritten and Digital Signatures: Differences

Handwritten signature:

6. Associated physically with the document

7. Almost identical for all documents

8. Usually at the last page

Digital signature:

6. Can be stored and transmitted independently of the document

7. Function of the document

8. Covers the entire document

Hash function algorithms

Customized (dedicated):

•  MD2, Rivest, 1988

•  MD4, Rivest, 1990

•  MD5, Rivest, 1990

•  SHA-0, NSA, 1992

•  SHA-1, NSA, 1995

•  RIPEMD, European RACE Integrity Primitives Evaluation Project, 1992

•  RIPEMD-160

•  SHA-256, SHA-384, SHA-512, NSA, 2000

Based on block ciphers:

•  MDC-2, MDC-4 (IBM; Brachtl, Meyer, Schilling, 1988)

Based on modular arithmetic:

•  MASH-1, 1988-1996

Attacks against dedicated hash functions known by 2004

•  MD2: partially broken

•  MD4: broken, H. Dobbertin, 1995 (one hour on PC, 20 free bytes at the start of the message)

•  MD5: partially broken, collisions for the compression function, Dobbertin, 1996 (10 hours on PC)

•  SHA-0: weakness discovered, 1995 NSA, 1998 France

•  SHA-1: no attack listed

•  RIPEMD: reduced round version broken, Dobbertin 1995

•  RIPEMD-160: no attack listed

•  SHA-256, SHA-384, SHA-512: no attack listed

What was discovered in 2004-2005?

•  MD4: broken; Wang, Feng, Lai, Yu, Crypto 2004 (manually, without using a computer)

•  MD5: broken; Wang, Feng, Lai, Yu, Crypto 2004 (1 hr on a PC)

•  SHA-0: attack with 2^40 operations, Crypto 2004

•  SHA-1: attack with 2^63 operations; Wang, Yin, Yu, Aug 2005

•  RIPEMD: broken; Wang, Feng, Lai, Yu, Crypto 2004 (manually, without using a computer)

•  RIPEMD-160: no attack listed

•  SHA-256, SHA-384, SHA-512: no attack listed

2^63 operations (Schneier, 2005)

In hardware:

Machine similar to the one used to break DES:

Cost = $50,000-$70,000, Time: 18 days, or Cost = $0.9-$1.26M, Time: 24 hours

In software:

Computer network similar to distributed.net used to break DES (~331,252 computers):

Cost = ~$0, Time: 7 months
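The two hardware estimates are consistent with a simple cost-time trade-off. The sketch below assumes that the attack parallelizes perfectly, so that cost scales inversely with the time allowed; that assumption, and the function name cost_for_time, are illustrative and not taken from the slides.

```python
def cost_for_time(base_cost: float, base_days: float, target_days: float) -> float:
    """Scale the machine cost, assuming cost * time stays constant (ideal parallelism)."""
    return base_cost * base_days / target_days

# Starting point from the slide: $50,000-$70,000 for 18 days.
low = cost_for_time(50_000, 18, 1)    # cost to finish in 24 hours, lower bound
high = cost_for_time(70_000, 18, 1)   # cost to finish in 24 hours, upper bound
print(low, high)  # 900000.0 1260000.0
```

Under this assumption, building 18 times as many machines shortens the attack from 18 days to 24 hours, which reproduces the $0.9-$1.26M range quoted above.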

Cryptographic Standards

So how have cryptographic standards been created so far?

National Security Agency (also known as “No Such Agency” or “Never Say Anything”)

Created in 1952 by President Truman

Goals:

•  designing strong ciphers (to protect U.S. communications)

•  breaking ciphers (to listen to non-U.S. communications)

Budget and number of employees kept secret

Largest employer of mathematicians in the world

Largest purchaser of computer hardware

NSA-developed Cryptographic Standards

[Timeline figure, 1970-2010. Block ciphers: DES (Data Encryption Standard), Triple DES. Hash functions: SHA-0, SHA-1 (Secure Hash Algorithm), SHA-2. Years marked on the timeline: 1977, 1993, 1995, 1999, 2003, 2005.]

Cryptographic Standard Contests

[Timeline figure, 1996-2012]

•  AES: IX.1997 to X.2000; 15 block ciphers → 1 winner

•  NESSIE: I.2000 to XII.2002

•  CRYPTREC

•  eSTREAM: XI.2004 to V.2008; 34 stream ciphers → 4 SW + 4 HW winners

•  SHA-3: X.2007 to XII.2012; 51 hash functions → 1 winner

SHA-3 Contest - NIST Evaluation Criteria

•  Security

•  Software Efficiency

•  Hardware Efficiency (FPGAs, ASICs)

•  Simplicity

•  Flexibility

•  Licensing

Software or hardware?

[Comparison of SOFTWARE and HARDWARE implementations with respect to:]

•  security of data during transmission

•  flexibility (new cryptoalgorithms, protection against new attacks)

•  speed

•  random key generation

•  access control to keys

•  tamper resistance

•  low cost

•  resistance to side-channel attacks

Memory

Power consumption

Primary efficiency indicators

Software Hardware

Speed Memory Speed Area

Efficiency parameters

Latency: time to encrypt/decrypt a single block of data (Mi → Ci)

Throughput = Speed: number of bits encrypted/decrypted in a unit of time (Mi, Mi+1, Mi+2 → Ci, Ci+1, Ci+2)

Throughput = (Block_size · Number_of_blocks_processed_simultaneously) / Latency
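The throughput formula can be evaluated with a few lines of code. The numbers in this sketch (a 128-bit block, 10 clock cycles per block, a 100 MHz clock) are hypothetical values chosen only to make the arithmetic easy to follow.

```python
def throughput_bits_per_s(block_size_bits: int,
                          blocks_in_parallel: int,
                          latency_s: float) -> float:
    """Throughput = Block_size * Number_of_blocks_processed_simultaneously / Latency."""
    return block_size_bits * blocks_in_parallel / latency_s

# Hypothetical circuit: one 128-bit block at a time, 10 cycles at 100 MHz.
clock_hz = 100e6
latency = 10 / clock_hz                       # 100 ns per block
single = throughput_bits_per_s(128, 1, latency)
pipelined = throughput_bits_per_s(128, 4, latency)  # 4 blocks in flight, same latency
print(single / 1e6, pipelined / 1e6)          # in Mbit/s
```

Processing several blocks simultaneously raises throughput without reducing the latency of any single block.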

Advanced Encryption Standard (AES) Contest 1997-2001

June 1998: 15 candidates from USA, Canada, Belgium, France, Germany, Norway, UK, Israel, Korea, Japan, Australia, Costa Rica

Round 1

August 1999: 5 final candidates: Mars, RC6, Rijndael, Serpent, Twofish

Round 2

October 2000: 1 winner: Rijndael (Belgium)

Security Software efficiency

Flexibility

Security Hardware efficiency

[Bar chart: Speed of the final AES candidates in Xilinx FPGAs; Speed [Mbit/s], axis 0-500; candidates: Serpent, Rijndael, Twofish, RC6, Mars. K. Gaj, P. Chodowiec, AES3, April 2000]

[Bar chart: Survey filled by 167 participants of the Third AES Conference, April 2000; axis: # votes, 0-100]


[Bar chart: Results of the NSA group, ASICs; Speed [Mbit/s], axis 0-700; series: NSA ASIC and GMU FPGA; AES3, April 2000. Bar labels in the chart: 606, 414, 202, 105, 103, 57, 431, 177, 143, 61]

[Bar chart: Efficiency in software: NIST-specified platform (200 MHz Pentium Pro, Borland C++); Speed [Mbits/s], axis 0-30, for 128-bit, 192-bit, and 256-bit keys]

[Chart: NIST Report: Security; Security (Adequate to High) vs. Complexity (Simple to Complex) for Rijndael, MARS, Serpent, Twofish, RC6. AES Final Report, October 2000]

NIST SHA-3 Contest - Timeline

•  Oct. 2008: 51 candidates, start of Round 1

•  July 2009: 14 candidates, start of Round 2

•  End of 2010: 5-6 candidates, start of Round 3

•  Mid 2012: 1-2 candidates selected

GMU Team Goals

•  Fair and comprehensive methodology for evaluation of hardware performance in FPGAs

•  High-speed fully autonomous implementations of all 14 SHA-3 candidates & SHA-2 256-bit & 512-bit variants, optimized for the maximum throughput to area ratio

•  Open-source benchmarking tool supporting optimization of tool options and efficient generation of results for multiple FPGA families

Primary Designers of GMU Codes

Ekawat Homsirikamol, a.k.a. “Ice”

Marcin Rogawski

Developed optimized VHDL implementations of 14 Round 2 SHA-3 candidates + SHA-2 in two variants each (256 & 512-bit output), for some functions using several alternative architectures

Methodology

Comprehensive Evaluation

•  two major vendors: Altera and Xilinx (~90% of the market)

•  multiple high-performance and low-cost families

Technology | Altera low-cost | Altera high-performance | Xilinx low-cost | Xilinx high-performance
90 nm      | Cyclone II      | Stratix II              | Spartan 3       | Virtex 4
65 nm      | Cyclone III     | Stratix III             |                 | Virtex 5

Uniform Evaluation

•  Language: VHDL

•  Tools: FPGA vendor tools

•  Interface

•  Performance Metrics

•  Design Methodology

•  Benchmarking

Why Does the Interface Matter?

•  Pin limit

Total number of I/O ports ≤ Total number of I/O pins of an FPGA

•  Support for the maximum throughput

Time to load the next message block ≤ Time to process the previous block
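Both conditions can be checked numerically before committing to a bus width. The sketch below assumes that loading and processing run on the same clock and that one w-bit word is loaded per cycle; the function names and all numbers are hypothetical.

```python
import math

def load_cycles(block_bits: int, w: int) -> int:
    """Clock cycles needed to load one message block over a w-bit input bus."""
    return math.ceil(block_bits / w)

def meets_pin_limit(io_ports: int, fpga_io_pins: int) -> bool:
    """Pin limit: total number of I/O ports must not exceed the FPGA I/O pins."""
    return io_ports <= fpga_io_pins

def supports_max_throughput(block_bits: int, w: int, processing_cycles: int) -> bool:
    """The next block must be loaded no slower than the previous block is processed."""
    return load_cycles(block_bits, w) <= processing_cycles

# Hypothetical function: 1024-bit message block processed in 24 clock cycles.
narrow_ok = supports_max_throughput(1024, 32, 24)   # 32 load cycles > 24
wide_ok = supports_max_throughput(1024, 64, 24)     # 16 load cycles <= 24
print(narrow_ok, wide_ok)
```

In this example the 32-bit bus would starve the core, so either a wider bus or a faster input clock would be needed.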


Interface: Two possible solutions

Length of the message communicated at the beginning

+ easy to implement passive source circuit

− area overhead for the counter of message bits

Dedicated end of message port

− more intelligent source circuit required

+ no need for internal message bit counter

msg_bitlen

zero_word

message end_of_msg SHA core

SHA Core: Interface & Typical Configuration

•  SHA core is an active component; surrounding FIFOs are passive and widely available

•  Input interface is separate from an output interface

•  Processing a current block, reading the next block, and storing a result for the previous message can be all done in parallel

[Block diagram: Input FIFO → SHA core → Output FIFO, all driven by clk and rst. Data buses ext_idata, idata, odata, and ext_odata are w bits wide. The Input FIFO signals fifoin_empty and fifoin_read connect to the SHA core ports src_ready and src_read; the Output FIFO signals fifoout_full and fifoout_write connect to dst_ready and dst_write. The external sides of the FIFOs use fifoin_full, fifoin_write, fifoout_empty, and fifoout_read. Each FIFO has ports din, dout, full, empty, write, read.]

SHA Core: Interface & Typical Configuration

[Block diagram: same configuration as on the previous slide, except that the external sides of the Input FIFO and Output FIFO are clocked by a separate io_clk, while the SHA core uses clk]

•  Some functions may require a faster input/output clock in order to load input data at a faster rate
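The division of roles between the active core and the passive FIFOs can be modeled at the word level. This is a behavioral sketch only: the Fifo class, run_core, and the stand-in process function are hypothetical, the core simply forwards one transformed word per cycle instead of hashing, and a single clock is assumed.

```python
from collections import deque

class Fifo:
    """Passive FIFO: exposes empty/full flags and reacts to read/write requests."""
    def __init__(self, depth: int):
        self.depth = depth
        self.q = deque()

    @property
    def empty(self) -> bool:
        return len(self.q) == 0

    @property
    def full(self) -> bool:
        return len(self.q) >= self.depth

    def write(self, word):
        assert not self.full, "write to a full FIFO"
        self.q.append(word)

    def read(self):
        assert not self.empty, "read from an empty FIFO"
        return self.q.popleft()

def run_core(fifo_in: Fifo, fifo_out: Fifo, process, max_cycles: int) -> int:
    """Active core: each cycle it moves one word if the source has data
    (input FIFO not empty) and the destination has room (output FIFO not full)."""
    moved = 0
    for _ in range(max_cycles):
        if not fifo_in.empty and not fifo_out.full:
            fifo_out.write(process(fifo_in.read()))
            moved += 1
    return moved

fifo_in, fifo_out = Fifo(depth=4), Fifo(depth=2)
for word in (1, 2, 3):
    fifo_in.write(word)
moved = run_core(fifo_in, fifo_out, lambda w: w + 1, max_cycles=10)
print(moved, list(fifo_out.q), list(fifo_in.q))
```

The core stalls once the output FIFO is full, which is the behavior the handshake signals are meant to enforce.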

Performance Metrics

Primary:

1. Throughput (single long message)

2. Area

3. Throughput / Area

Secondary:

4. Hash Time for Short Messages (up to 1000 bits)

Performance Metrics - Area

Through the synthesis and implementation options, we force the resource utilization vectors to contain Area as the only non-zero component (all other components equal to 0).

Choice of Optimization Target

Primary Optimization Target: Throughput to Area Ratio

Features:

•  practical: good balance between speed and cost

•  very reliable guide through the entire design process, facilitating
–  the choice of high-level architecture
–  implementation of basic components
–  choice of tool options

•  leads to high-speed, close-to-maximum-throughput designs

Our Design Flow

[Flow diagram: Specification and Interface → Datapath (Block diagram) and Controller (ASM Chart) → VHDL Code → Max. Clock Freq. and Resource Utilization → Throughput, Area, Throughput/Area, Hash Time for Short Messages. Additional inputs: Formulas for Throughput & Hash time, Controller Template, Library of Basic Components]

Basic Operations of 14 SHA-3 Candidates

NTT – Number Theoretic Transform, GF MUL – Galois Field multiplication, MUL – integer multiplication, mADDn – multioperand addition with n operands

ATHENa – Automated Tool for Hardware EvaluatioN

Benchmarking open-source tool, written in Perl, aimed at an AUTOMATED generation of OPTIMIZED results for MULTIPLE FPGA platforms

Under development at George Mason University.

http://cryptography.gmu.edu/athena

Basic Dataflow of ATHENa

[Dataflow diagram with steps numbered 0-6, connecting the Designer (HDL + FPGA Tools), the ATHENa Server, and the User. Labels: Interfaces + Testbenches; HDL + scripts + configuration files; FPGA Synthesis and Implementation; Result Summary + Database Entries; Database Entries; Download scripts and configuration files; Database query; Ranking of designs]

[Diagram: ATHENa inputs and outputs]

Inputs: synthesizable source files, configuration files, testbench, constraint files

Outputs: result summary (user-friendly), database entries (machine-friendly)

ATHENa Major Features (1)

•  synthesis, implementation, and timing analysis in batch mode

•  support for devices and tools of multiple FPGA vendors

•  generation of results for multiple families of FPGAs of a given vendor

•  automated choice of a best-matching device within a given family

ATHENa Major Features (2)

•  automated verification of designs through simulation in batch mode

•  support for multi-core processing

•  automated extraction and tabulation of results

•  several optimization strategies aimed at finding
–  optimum options of tools
–  best target clock frequency
–  best starting point of placement

Generation of Results Facilitated by ATHENa

•  batch mode of FPGA tools

•  ease of extraction and tabulation of results: Excel, CSV (available), LaTeX (coming soon)

•  optimized choice of tool options

Relative Improvement of Results from Using ATHENa

Virtex 5, 256-bit Variants of Hash Functions

[Bar chart: ratios of results obtained using ATHENa suggested options vs. default options of FPGA tools; series: Area, Thr, Thr/Area; axis 0-2.5. Algorithms, left to right: Groestl, SHAvite-3, Luffa, Keccak, Hamsi, ECHO, Skein, Fugue, SHA-2, BMW, CubeHash, BLAKE, Shabal, SIMD, JH]

Results

[Bar chart: Throughput [Mbit/s], Virtex 5, 256-bit variants of algorithms; axis 0-16000. Algorithms, left to right: ECHO, Keccak, Groestl, Luffa, BMW, JH, CubeHash, Fugue, SHAvite-3, BLAKE, Skein, Hamsi, Shabal, SIMD, SHA-2]

[Bar chart: Throughput [Mbit/s], Virtex 5, 512-bit variants of algorithms; axis 0-14000. Algorithms, left to right: Groestl, BMW, Luffa, Keccak, ECHO, SIMD, JH, SHAvite-3, BLAKE, CubeHash, Skein, Shabal, SHA-2, Hamsi, Fugue]

Normalization & Compression of Results

•  Absolute result: e.g., throughput in Mbits/s, area in CLB slices

•  Normalized result:

normalized_result = result_for_SHA-3_candidate / result_for_SHA-2

•  Overall normalized result: geometric mean of normalized results for all investigated FPGA families
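Normalization and compression of results can be sketched as follows. The throughput figures in this example are made up for illustration and do not correspond to any measured design; the function names are likewise hypothetical.

```python
from statistics import geometric_mean

def normalized_result(candidate_result: float, sha2_result: float) -> float:
    """Result for a SHA-3 candidate divided by the result for SHA-2 on the same family."""
    return candidate_result / sha2_result

def overall_normalized_result(candidate_by_family: dict, sha2_by_family: dict) -> float:
    """Geometric mean of the normalized results over all investigated FPGA families."""
    ratios = [normalized_result(candidate_by_family[f], sha2_by_family[f])
              for f in candidate_by_family]
    return geometric_mean(ratios)

# Made-up throughputs in Mbit/s for two families (illustrative values only).
candidate = {"family_A": 4000.0, "family_B": 8000.0}
sha2 = {"family_A": 2000.0, "family_B": 1000.0}
overall = overall_normalized_result(candidate, sha2)   # geometric mean of 2.0 and 8.0
print(overall)
```

The geometric mean is used rather than the arithmetic mean, so a single family with an unusually large ratio does not dominate the overall figure.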


Normalized Throughput & Overall Normalized Throughput

[Bar chart: Overall Normalized Throughput: 256-bit variants of algorithms, normalized to SHA-256, averaged over 7 FPGA families; axis 0-8. Algorithms, left to right: Keccak, ECHO, Luffa, BMW, Groestl, JH, CubeHash, Fugue, SHAvite-3, BLAKE, Hamsi, Skein, Shabal, SIMD]

[Bar chart: Overall Normalized Throughput: 512-bit variants of algorithms, normalized to SHA-512, averaged over 7 FPGA families; axis 0-4. Algorithms, left to right: Groestl, Luffa, BMW, ECHO, Keccak, JH, SIMD, CubeHash, SHAvite-3, BLAKE, Skein, Shabal, Hamsi, Fugue]

[Bar chart: Area [CLB slices], Virtex 5, 256-bit variants of algorithms; axis 0-10000. Algorithms, left to right: SHA-2, CubeHash, Hamsi, Fugue, JH, SHAvite-3, Luffa, Keccak, Shabal, Skein, Groestl, BLAKE, BMW, ECHO, SIMD]

[Bar chart: Area [CLB slices], Virtex 5, 512-bit variants of algorithms; axis 0-18000. Algorithms, left to right: SHA-2, CubeHash, Fugue, JH, Keccak, Shabal, Skein, SHAvite-3, Luffa, Hamsi, Groestl, BLAKE, ECHO, BMW, SIMD]

[Bar chart: Overall Normalized Area: 256-bit variants of algorithms, normalized to SHA-256, averaged over 7 FPGA families; axis 0-30. Algorithms, left to right: CubeHash, Hamsi, BLAKE, Luffa, Shabal, JH, Keccak, SHAvite-3, Skein, Fugue, Groestl, BMW, SIMD, ECHO]

[Bar chart: Overall Normalized Area: 512-bit variants of algorithms, normalized to SHA-512, averaged over 7 FPGA families; axis 0-30. Algorithms, left to right: CubeHash, Fugue, Keccak, Shabal, JH, Skein, BLAKE, Hamsi, Luffa, SHAvite-3, Groestl, BMW, ECHO, SIMD]

[Bar chart: Overall Normalized Throughput/Area: 256-bit variants, normalized to SHA-256, averaged over 7 FPGA families; axis 0-2. Algorithms, left to right: Keccak, Luffa, CubeHash, Groestl, JH, Hamsi, BLAKE, Fugue, SHAvite-3, Shabal, Skein, BMW, ECHO, SIMD]

[Bar chart: Overall Normalized Throughput/Area: 512-bit variants, normalized to SHA-512, averaged over 7 FPGA families; axis 0-1.4. Algorithms, left to right: Keccak, CubeHash, Luffa, JH, Groestl, Shabal, BLAKE, Skein, SHAvite-3, Fugue, Hamsi, BMW, ECHO, SIMD]

[Scatter plot: Throughput vs. Area, normalized to results for SHA-256 and averaged over 7 FPGA families, 256-bit variants; regions of the plot marked best and worst]

[Scatter plot: Throughput vs. Area, normalized to results for SHA-512 and averaged over 7 FPGA families, 512-bit variants; regions of the plot marked best and worst]

[Chart: Execution Time for Short Messages up to 1000 bits, Virtex 5, 256-bit variants of algorithms]

[Chart: Execution Time for Short Messages up to 1000 bits, Virtex 5, 512-bit variants of algorithms]

[Summary table: Thr/Area, Thr, Area, and Short msg. for 256-bit variants and 512-bit variants of BLAKE, BMW, CubeHash, ECHO, Fugue, Groestl, Hamsi, JH, Keccak, Luffa, Shabal, SHAvite-3, SIMD, Skein; cell contents not recoverable]

Summary of Results

•  Throughput/Area & Throughput most crucial for high-speed implementations

•  Area cannot be easily traded for Throughput

Best performers so far: 1-2. Keccak & Luffa, 3. Groestl

Worst performers so far: 14. SIMD, 13. ECHO, 12. BMW

More About our Designs & Tools

•  Cryptology e-Print Archive - 2010/445 (100+ pages)
–  Detailed hierarchical block diagrams
–  Corresponding formulas for execution time and throughput

•  FPL 2010 paper
–  ATHENa features
–  Case studies

•  ATHENa web site
–  Most recent results
–  Comparisons with results from other groups
–  Optimum options of tools

Comparison with Other Groups

Comparison with Best Results Reported by Other Groups

Virtex 5, 256-bit variants of algorithms

Algorithm         | OTHER GROUPS: Area | Thr   | Thr/Area | Source           | GMU: Area | Thr   | Thr/Area
BLAKE             | 1660               | 2676  | 1.61     | Kobayashi et al. | 1871      | 2854  | 1.53
CubeHash          | 590                | 2960  | 5.02     | Kobayashi et al. | 707       | 3445  | 4.87
ECHO              | 9333               | 14860 | 1.59     | Lu et al.        | 5445      | 13875 | 2.55
Groestl           | 1722               | 10276 | 5.97     | Gauravaram et al. | 1884     | 8677  | 4.61
Hamsi             | 718                | 1680  | 2.34     | Kobayashi et al. | 946       | 2646  | 2.80
Keccak            | 1412               | 6900  | 4.89     | Bertoni et al.   | 1229      | 10807 | 8.79
Luffa             | 1048               | 6343  | 6.05     | Kobayashi et al. | 1154      | 8008  | 6.94
Shabal            | 153                | 2051  | 13.41    | Detrey et al.    | 1266      | 2624  | 2.07
Skein (estimated) | 1632               | 3535  | 2.17     | Tillich          | 1463      | 2812  | 1.92
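The Thr/Area column is simply throughput divided by area. As a quick check, the sketch below recomputes the ratio for three rows of the comparison; the figures are copied from the slide, while the data layout and function name are illustrative.

```python
# (algorithm, area [CLB slices], throughput [Mbit/s], reported Thr/Area), from the slide.
other_groups = [
    ("BLAKE", 1660, 2676, 1.61),
    ("Keccak", 1412, 6900, 4.89),
    ("Luffa", 1048, 6343, 6.05),
]
gmu = [
    ("BLAKE", 1871, 2854, 1.53),
    ("Keccak", 1229, 10807, 8.79),
    ("Luffa", 1154, 8008, 6.94),
]

def throughput_to_area(area: int, throughput: int) -> float:
    """Throughput to area ratio in (Mbit/s) per CLB slice."""
    return throughput / area

for name, area, thr, reported in other_groups + gmu:
    ratio = throughput_to_area(area, thr)
    print(f"{name:8s} {ratio:6.2f} (reported {reported})")
```

A design with a larger area can still rank higher on this metric if its throughput grows faster than its area, as the Keccak rows show.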


Best Overall Reported Results as of Aug. 6, 2010

Virtex 5, 256-bit variants of algorithms

BEST REPORTED RESULTS

Algorithm | Area | Thr   | Thr/Area | Source
BLAKE     | 1660 | 2676  | 1.61     | Kobayashi et al.
BMW       | 4400 | 5577  | 1.27     | GMU
CubeHash  | 590  | 2960  | 5.02     | Kobayashi et al.
ECHO      | 5445 | 13875 | 2.55     | GMU
Fugue     | 956  | 3151  | 3.30     | GMU
Groestl   | 1722 | 10276 | 5.97     | Gauravaram et al.
Hamsi     | 946  | 2646  | 2.80     | GMU
JH        | 1108 | 3955  | 3.57     | GMU
Keccak    | 1229 | 10807 | 8.79     | GMU
Luffa     | 1154 | 8008  | 6.94     | GMU
Shabal    | 153  | 2051  | 13.41    | Detrey et al.
SHAvite-3 | 1130 | 2887  | 2.55     | GMU
SIMD      | 9288 | 2326  | 0.25     | GMU
Skein     | 1632 | 3535  | 2.17     | Tillich et al.

[Scatter plot: Throughput vs. Area: Best reported results, Virtex 5, 256-bit variants of algorithms; regions of the plot marked best and worst]

Your Project

Analysis of Alternative Architectures - Unrolled

[Diagram: basic architecture, round logic used r times; unrolled architecture, r/2 times]

Analysis of Alternative Architectures - Folded

[Diagram: Basic, r times; Folded Vertically-2x (fv2), 2⋅r times; Folded Horizontally-2x (fh2), 2⋅r times]
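The iteration counts in these diagrams follow a simple pattern, sketched below. The model only counts iterations per message block; it deliberately ignores how unrolling and folding change the clock period and the area, which must be obtained from implementation results. The function name and the example value r = 10 are hypothetical.

```python
from fractions import Fraction

def iterations_per_block(r: int, unroll: int = 1, fold: int = 1) -> Fraction:
    """Iterations needed per block for a function with r rounds.

    unroll = k: k rounds are computed per iteration (r/k iterations).
    fold = k: each round takes k iterations (k*r iterations).
    """
    return Fraction(r * fold, unroll)

r = 10  # hypothetical number of rounds
basic = iterations_per_block(r)               # r times
unrolled_x2 = iterations_per_block(r, unroll=2)  # r/2 times
folded_x2 = iterations_per_block(r, fold=2)      # 2*r times (fv2 or fh2)
print(basic, unrolled_x2, folded_x2)
```

Whether a given variant improves throughput to area ratio depends on how much the clock period and area change, which is exactly what the project asks you to measure.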


[Scatter plot: Preliminary results for CubeHash, Groestl, Keccak & Luffa in Virtex 5; Normalized Throughput (axis 0-8) vs. Normalized Area (axis 0-7); points for each algorithm are labeled by architecture, with labels including x1, x2, x4, fv2, fv3, fv4]

Your Project

•  14 SHA-3 candidates left in the contest

•  Given:
–  specification of the function
–  reference implementation in C
–  interface
–  testbench and test vectors
–  GMU implementation of the basic version including block diagrams, ASM charts, short description, formulas for execution time & throughput, source codes, results for Xilinx and Altera FPGAs

Your Project

Develop:
–  Block diagram
–  ASM chart
–  Formulas for execution time & throughput
–  Synthesizable code in VHDL
–  Results for multiple families of FPGAs from Xilinx and Altera

for at least one architecture from each of the following three classes of architectures:
–  Unrolled architecture
–  Folded architecture
–  Architecture based on the use of embedded FPGA resources (BRAMs, multipliers, DSP units, etc.)

[256 bit only, 512-bit only, or both]

What is an FPGA?

[Diagram of an FPGA: Configurable Logic Blocks, I/O Blocks, Block RAMs & Embedded Multipliers (columns labeled “Block RAMs and MULs”)]

RAM Blocks and Multipliers in Xilinx FPGAs

The Design Warrior’s Guide to FPGAs Devices, Tools, and Flows. ISBN 0750676043

Copyright © 2004 Mentor Graphics Corp. (www.mentor.com)


Using Embedded FPGA Resources

[Two examples of resource utilization vectors]

Basic design: (1536, 0, 0) → Your design: (768, 2, 4)

Basic design: (3010, 0, 0) → Your design: (1505, 32 kbit, 4)

Block RAM

[Diagram: Spartan-3 Dual-Port Block RAM with Port A and Port B]

•  Most efficient memory implementation

•  Dedicated blocks of memory

•  Ideal for most memory requirements

•  4 to 104 memory blocks

•  18 kbits = 18,432 bits per block (16 k without parity bits)

•  Use multiple blocks for larger memories

•  Builds both single and true dual-port RAMs

•  Synchronous write and read (different from distributed RAM)

Block RAM can have various configurations (port aspect ratios)

•  16k x 1 (addresses 0 to 16,383)

•  8k x 2 (addresses 0 to 8,191)

•  4k x 4 (addresses 0 to 4,095)

•  2k x (8+1) (addresses 0 to 2047)

•  1024 x (16+2) (addresses 0 to 1023)
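All of these configurations describe the same physical block, which a few lines of arithmetic confirm. The sketch assumes that the "+1" and "+2" terms are parity bits, as suggested by the "16 k without parity bits" note; the helper blocks_needed and the 100,000-bit example are hypothetical.

```python
import math

# (depth, data bits per word, parity bits per word) for each configuration.
configurations = {
    "16k x 1": (16 * 1024, 1, 0),
    "8k x 2": (8 * 1024, 2, 0),
    "4k x 4": (4 * 1024, 4, 0),
    "2k x (8+1)": (2 * 1024, 8, 1),
    "1024 x (16+2)": (1024, 16, 2),
}

DATA_BITS_PER_BLOCK = 16 * 1024      # 16 k without parity bits
TOTAL_BITS_PER_BLOCK = 18_432        # 18 kbits including parity

def blocks_needed(memory_bits: int) -> int:
    """Number of block RAMs needed for a memory, counting data bits only."""
    return math.ceil(memory_bits / DATA_BITS_PER_BLOCK)

for name, (depth, data, parity) in configurations.items():
    print(name, depth * data, depth * (data + parity))

print(blocks_needed(100_000))
```

Only the two widest configurations reach the full 18,432 bits, because the narrower ones have no parity bits to use.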


Dual-Port Bus Flexibility

[Diagram: dual-port block RAM. Port A In: 1K-Bit Depth; Port A Out: 18-Bit Width; Port B In: 1k-Bit Depth; Port B Out: 18-Bit Width. Port A signals: WEA, ENA, RSTA, ADDRA[9:0], CLKA, DIA[17:0], DOA[17:0]. Port B signals: WEB, ENB, RSTB, ADDRB[9:0], CLKB, DIB[17:0], DOB[17:0].]

Embedded Multipliers in Spartan 3

18x18 bit signed multipliers with optional input/output registers


Multiplier-Accumulator - MAC


Xilinx XtremeDSP

•  Starting with Virtex 4 family, Xilinx introduced DSP48 block for high-speed DSP on FPGAs

•  Essentially a multiply-accumulate core with many other features

•  Now also Spartan-3A and Virtex 5 have DSP blocks

DSP48 Slice: Virtex 4

Simplified Form of DSP48

Xilinx FPGA Devices

Technology | Low-cost  | High-performance
120/150 nm |           | Virtex 2, 2 Pro
90 nm      | Spartan 3 | Virtex 4
65 nm      |           | Virtex 5
45 nm      | Spartan 6 |
40 nm      |           | Virtex 6

Altera FPGA Devices

Technology | Low-cost    | Mid-range | High-performance
130 nm     | Cyclone     |           | Stratix
90 nm      | Cyclone II  |           | Stratix II
65 nm      | Cyclone III | Arria I   | Stratix III
40 nm      | Cyclone IV  | Arria II  | Stratix IV

All Projects - Organization

•  Projects divided into phases

•  Deliverables for each phase submitted through Blackboard at selected checkpoints and evaluated by the instructor and/or TA

•  Feedback provided to students on a best effort basis

•  Final report and codes submitted using Blackboard at the end of the semester

Honor Code Rules

•  All students are expected to write and debug their codes individually

•  Students are encouraged to help and support each other in all problems related to
–  operation of the CAD tools,
–  basic understanding of the problem.

Course Objectives

•  At the end of this course you should be able to:
–  Code in VHDL for synthesis
–  Decompose a digital system into a controller (FSM) and datapath, and code accordingly
–  Write VHDL testbenches
–  Synthesize and implement digital systems on FPGAs
–  Effectively code digital systems for cryptography, signal processing, and microprocessor applications

•  This knowledge will come about through homework, exams, and an extensive project
–  The project in particular will help you know VHDL and the FPGA design flow from beginning to end

Additional Skills Learned in the Project

•  Reading & understanding specification of a complex algorithm

•  Design of new hardware architectures based on existing architectures (datapath & controller)

•  Reading, understanding, and modifying existing VHDL code

•  Using embedded resources of modern FPGAs

•  Characterizing performance of your codes for multiple FPGA families

Project Task 1

•  Read the following chapters from the GMU technical report published at http://eprint.iacr.org/2010/445
–  Chapter 1 Introduction & Motivation
–  Chapter 2 Methodology
–  Chapter 3 Comprehensive Designs of SHA-3 Candidates: 3.1, 3.2 + subsection concerning your algorithm
–  Chapter 4 Design Summary and Results

•  Download and get familiar with the package of a hash function assigned to you: http://csrc.nist.gov/groups/ST/hash/sha-3/Round2/submissions_rnd2.html

•  Read carefully the specification of your algorithm

Project Task 1 – cont.

In one week: Meeting with the instructor devoted to fully understanding the GMU report, specification, block diagrams, interface, and timing formulas.

In two weeks: Draft block diagrams of the
–  selected unrolled architecture
–  selected folded architecture.

Corresponding timing formulas for execution time & throughput.