MISTY1 Block Cipher Undergrad Team U8 – JK FlipFlop Clark Cianfarini and Garrett Smith.

MISTY1 Block Cipher

Undergrad Team U8 – JK FlipFlop

Clark Cianfarini and Garrett Smith

What is MISTY1?

• Cryptographic block cipher
• Developed by Mitsubishi Electric
• Created in 1995
• Developed primarily for encryption on mobile phones and other mobile devices
• Stands for: Mitsubishi Improved Security TechnologY

Technical Specs

• Feistel network
• 64-bit block size
• 128-bit key
• Rounds in multiples of 4 (4, 8, 12, 16, …)
• RFC 2994

Picture from:

http://web.archive.org/web/20000823133547/http://www.mitsubishi.com/ghp_japan/misty/misty_e_b.pdf
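
To make the Feistel idea concrete, here is a minimal sketch of a generic balanced Feistel network on a 64-bit block split into two 32-bit halves. It is not MISTY1: the round function toy_f, the subkey array and the names toy_encrypt/toy_decrypt are placeholders invented for illustration, and MISTY1's additional FL layers are left out. The point it shows is that decryption reuses the same round function with the subkeys in reverse order, so the round function never has to be invertible.

```c
#include <stdint.h>

#define TOY_ROUNDS 8  /* a multiple of 4, matching the round-count rule above */

/* Placeholder round function (NOT MISTY1's FO). Any deterministic
   function works here because a Feistel network never inverts it. */
static uint32_t toy_f(uint32_t half, uint32_t subkey)
{
    uint32_t x = half ^ subkey;
    return ((x << 7) | (x >> 25)) + 0x9e3779b9u;
}

/* Encrypt: each round maps (L, R) to (R, L ^ F(R, k[i])). */
void toy_encrypt(uint32_t *l, uint32_t *r, const uint32_t k[TOY_ROUNDS])
{
    for (int i = 0; i < TOY_ROUNDS; i++) {
        uint32_t t = *r;
        *r = *l ^ toy_f(*r, k[i]);
        *l = t;
    }
}

/* Decrypt: undo the rounds one by one, subkeys in reverse order. */
void toy_decrypt(uint32_t *l, uint32_t *r, const uint32_t k[TOY_ROUNDS])
{
    for (int i = TOY_ROUNDS - 1; i >= 0; i--) {
        uint32_t t = *l;
        *l = *r ^ toy_f(*l, k[i]);
        *r = t;
    }
}
```

Calling toy_encrypt and then toy_decrypt with the same subkeys returns the two halves to their original values, whatever toy_f computes.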

Our Original Implementation

• 8 rounds, the standard
• 128-bit key and 64-bit data as hexadecimal inputs (command line arguments)
• Encrypt and decrypt functionality both implemented (as well as performing both consecutively for benchmarking)

Original (Unoptimized) Design

• Designed for code size and clarity
• Written in C
• Only standard libraries used
• Inefficiencies in: loops, multiplies and divides, function calls, parameter passing
• Usage: ./misty <e|d|b> <K> <M> [I]
    • 'e' to encrypt, 'd' to decrypt, 'b' to test both
    • K is a required 32-digit hex string (128 bits)
    • M is a required 16-digit hex string (64 bits)
    • I is an optional number of iterations for benchmarking
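
A 128-bit key is 16 bytes (32 hex digits) and a 64-bit block is 8 bytes (16 hex digits). The sketch below shows one way such arguments could be decoded and length-checked. The helper names hex_digit and parse_hex are hypothetical; the original program's own parse_hex_arg and xtoi appear in the profile, but their signatures are not shown in these slides.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Convert one hex digit to its value, or -1 if it is not a hex digit. */
static int hex_digit(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Hypothetical helper: decode exactly 2*nbytes hex digits from s into out.
   Returns 0 on success, -1 on a wrong length or a non-hex character. */
int parse_hex(const char *s, uint8_t *out, size_t nbytes)
{
    if (strlen(s) != 2 * nbytes)
        return -1;
    for (size_t i = 0; i < nbytes; i++) {
        int hi = hex_digit(s[2 * i]);
        int lo = hex_digit(s[2 * i + 1]);
        if (hi < 0 || lo < 0)
            return -1;
        out[i] = (uint8_t)((hi << 4) | lo);
    }
    return 0;
}
```

With this helper, the key argument would be decoded with nbytes = 16 and the data argument with nbytes = 8; anything shorter, longer or non-hexadecimal is rejected.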

Original Design GPROF Profile

  %    cumulative    self                  self     total
 time    seconds    seconds      calls   us/call  us/call  name
45.57      10.65      10.65  560000000      0.02     0.02  fi
19.80      15.28       4.63  160000000      0.03     0.09  fo
 7.63      17.06       1.78  100000000      0.02     0.02  fl
 6.94      18.69       1.62  100000000      0.02     0.02  flinv
 5.25      19.91       1.23   10000000      0.12     0.27  key_schedule
 3.13      20.65       0.73   10000000      0.07     1.01  decrypt_block
 3.06      21.36       0.72   10000000      0.07     1.03  encrypt_block
 2.44      21.93       0.57   20000000      0.03     0.03  unpack_data
 1.54      22.29       0.36   50000000      0.01     0.04  decrypt_round_even
 1.33      22.60       0.31   40000000      0.01     0.13  encrypt_round_even
 1.03      22.85       0.24   40000000      0.01     0.18  decrypt_round_odd
 0.96      23.07       0.23                                __gmon_start__
 0.86      23.27       0.20   40000000      0.01     0.09  encrypt_round_odd
 0.34      23.35       0.08   10000000      0.01     0.04  encrypt_final
 0.21      23.40       0.05                                main
 0.00      23.40       0.00         48      0.00     0.00  xtoi
 0.00      23.40       0.00          4      0.00     0.00  print_hex_data
 0.00      23.40       0.00          2      0.00     0.00  parse_hex_arg

• 80% of the time spent in FO/FI/FL/FLINV
• Compiled with gcc-4.3.4
• Benchmarked on 64-bit Core2 @ 2.4 GHz, linux-2.6.33

Unoptimized Execution Time

• gcc misty_slow.c -o slow
• time ./slow b 00112233445566778899aabbccddeeff 0123456789abcdef 10000000

    real 0m23.093s
    user 0m22.886s
    sys  0m0.031s

• 10 million iterations, 2.31 µs per iteration (about 1.15 µs each for one encryption and one decryption)

Revised Software Design

• Designed for optimal performance
• Loops unrolled (rounds, d0/d1 pack)
• Pow-2 mul, div, mod → shift, and
• Functions inlined
• Reduced parameter passing (key)
• Compiler optimization levels enabled
• Compiler architecture-specific options enabled
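
The power-of-two rewrite listed above can be illustrated with a few paired functions. This is only a sketch with made-up function names, not code from the project. The equivalences hold for unsigned operands; for negative signed values, division rounds toward zero while an arithmetic shift does not, so the rewrite is not safe there.

```c
#include <stdint.h>

/* Each _slow/_fast pair returns the same value for every unsigned x. */

/* Multiply by 2^3: shift left by 3. */
uint32_t mul8_slow(uint32_t x) { return x * 8; }
uint32_t mul8_fast(uint32_t x) { return x << 3; }

/* Divide by 2^2: shift right by 2. */
uint32_t div4_slow(uint32_t x) { return x / 4; }
uint32_t div4_fast(uint32_t x) { return x >> 2; }

/* Remainder modulo 2^1: mask with 2^1 - 1. */
uint32_t mod2_slow(uint32_t x) { return x % 2; }
uint32_t mod2_fast(uint32_t x) { return x & 1; }
```

An optimizing compiler typically performs these substitutions itself once optimization is enabled, which may be part of why the manual version of this step yields a modest gain compared with turning on -O1.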

Rounds: Before Unrolling

    for (i = 0; i < NUM_ROUNDS; i++)
    {
        if (i == (NUM_ROUNDS - 1))
            encrypt_final(i, &d0, &d1, ek);
        else if ((i % 2) == 0)
            encrypt_round_even(i, &d0, &d1, ek);
        else
            encrypt_round_odd(i, &d0, &d1, ek);
    }

Rounds: After Unrolling

    // round 0
    d0 = fl(d0, 0);
    d1 = fl(d1, 1);
    d1 = d1 ^ fo(d0, 0);

    // round 1
    d0 = d0 ^ fo(d1, 1);

    // round 2
    d0 = fl(d0, 2);
    d1 = fl(d1, 3);
    d1 = d1 ^ fo(d0, 2);

    // round 3
    d0 = d0 ^ fo(d1, 3);

    // round 4
    d0 = fl(d0, 4);
    d1 = fl(d1, 5);
    d1 = d1 ^ fo(d0, 4);

    // round 5
    d0 = d0 ^ fo(d1, 5);

    // round 6
    d0 = fl(d0, 6);
    d1 = fl(d1, 7);
    d1 = d1 ^ fo(d0, 6);

    // round 7
    d0 = d0 ^ fo(d1, 7);

    // finalize
    d0 = fl(d0, 8);
    d1 = fl(d1, 9);
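
The sketch below checks that an unrolled schedule (rounds 0 through 7 in order, then the two final FL calls) computes the same result as a looped one. It makes several assumptions. toy_fl and toy_fo are deterministic stand-ins, not MISTY1's FL and FO. The looped form is inferred from the unrolled listing rather than taken from the original encrypt_round_even, encrypt_round_odd and encrypt_final, whose bodies are not shown. The final FL pair is applied after the loop instead of inside a special last-round call.

```c
#include <stdint.h>

#define NUM_ROUNDS 8

/* Stand-ins for FL and FO (NOT the real MISTY1 functions). */
static uint32_t toy_fl(uint32_t d, int k)
{
    return d ^ (0x01010101u * (uint32_t)(k + 1));
}

static uint32_t toy_fo(uint32_t d, int k)
{
    return ((d << 5) | (d >> 27)) + (uint32_t)k;
}

/* Looped schedule: even rounds apply FL to both halves and then FO,
   odd rounds apply FO only, and two FL calls follow the last round. */
void rounds_looped(uint32_t *d0, uint32_t *d1)
{
    for (int i = 0; i < NUM_ROUNDS; i++) {
        if ((i & 1) == 0) {
            *d0 = toy_fl(*d0, i);
            *d1 = toy_fl(*d1, i + 1);
            *d1 ^= toy_fo(*d0, i);
        } else {
            *d0 ^= toy_fo(*d1, i);
        }
    }
    *d0 = toy_fl(*d0, NUM_ROUNDS);
    *d1 = toy_fl(*d1, NUM_ROUNDS + 1);
}

/* The same schedule with the loop counter and branches removed. */
void rounds_unrolled(uint32_t *d0, uint32_t *d1)
{
    uint32_t a = *d0, b = *d1;
    a = toy_fl(a, 0); b = toy_fl(b, 1); b ^= toy_fo(a, 0);  /* round 0 */
    a ^= toy_fo(b, 1);                                      /* round 1 */
    a = toy_fl(a, 2); b = toy_fl(b, 3); b ^= toy_fo(a, 2);  /* round 2 */
    a ^= toy_fo(b, 3);                                      /* round 3 */
    a = toy_fl(a, 4); b = toy_fl(b, 5); b ^= toy_fo(a, 4);  /* round 4 */
    a ^= toy_fo(b, 5);                                      /* round 5 */
    a = toy_fl(a, 6); b = toy_fl(b, 7); b ^= toy_fo(a, 6);  /* round 6 */
    a ^= toy_fo(b, 7);                                      /* round 7 */
    a = toy_fl(a, 8); b = toy_fl(b, 9);                     /* finalize */
    *d0 = a;
    *d1 = b;
}
```

Running both functions on the same pair of halves and comparing the outputs is a cheap regression check to keep around while unrolling by hand.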

Execution Time and Speedup

Description          Time        Speedup
Slow / Initial       0m23.093s   1.00000
Unroll Rounds        0m21.573s   1.07046
Unroll D0/D1 Init    0m20.750s   1.11292
Shift and AND        0m18.978s   1.21683
Unroll Packing       0m18.135s   1.27339
Make EK Global       0m17.902s   1.28997
Inline FO/FI/FL      0m15.921s   1.45047
Enable O1            0m4.308s    5.36049
Architecture Flags   0m4.128s    5.59423
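
The speedup column is the baseline time divided by the time of each variant, and the per-iteration figures quoted elsewhere in the slides are the total time divided by the iteration count. A small sketch of that arithmetic follows; the function names are made up for illustration.

```c
/* Speedup relative to the baseline run: baseline time / optimized time. */
double speedup(double baseline_s, double optimized_s)
{
    return baseline_s / optimized_s;
}

/* Average time per iteration in nanoseconds. */
double ns_per_iteration(double total_s, double iterations)
{
    return total_s * 1e9 / iterations;
}
```

For example, speedup(23.093, 4.128) gives about 5.594, matching the last table row, and ns_per_iteration(4.128, 1e7) gives about 412.8 ns.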

Building and Testing the Optimized Implementation

• gcc misty_fast.c -o fast
• gcc misty_fast.c -o fast -O1
• gcc misty_fast.c -o fast -O2
• gcc misty_fast.c -o fast -O3
• gcc misty_fast.c -o fast -O3 -march=core2
• Fastest execution time:

    real 0m4.128s
    user 0m4.117s
    sys  0m0.007s

• 10 million iterations, 413 ns per iteration


Final Design GPROF Profile

  %    cumulative    self                  self     total
 time    seconds    seconds      calls   ns/call  ns/call  name
42.99       2.26       2.26   10000000    226.15   226.15  decrypt_block
41.57       4.45       2.19   10000000    218.65   218.65  encrypt_block
15.41       5.26       0.81                                main
 0.00       5.26       0.00          4      0.00     0.00  print_hex_data
 0.00       5.26       0.00          2      0.00     0.00  parse_hex_arg

• Most function calls inlined, only decrypt_block and encrypt_block remain

What was Learned?

• The original implementation may not have been all that bad (~1.5x speedup from the manual optimizations)
• Larger benefit from instruction-level optimization by the compiler (gcc)
• Profile first, then optimize in places where it actually matters
• Bit-wise AND has lower precedence than modulus (and than addition), so replacing % with & can change how an expression groups:
    x % y + z → (x % y) + z
    x & y + z → x & (y + z)
• All optimizations add up to a significant amount of savings
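
The precedence pitfall from the list above can be shown with three small functions. The names are hypothetical. The middle function is written with explicit parentheses to show how the compiler groups the unparenthesized rewrite `x & 7 + 1`.

```c
#include <stdint.h>

/* Original form: % binds tighter than +, so this is (x % 8) + 1. */
uint32_t next_mod(uint32_t x)
{
    return x % 8 + 1;
}

/* What a naive rewrite to `x & 7 + 1` actually computes:
   + binds tighter than &, so the mask becomes 7 + 1 = 8. */
uint32_t next_and_naive(uint32_t x)
{
    return x & (7 + 1);
}

/* Correct rewrite: parenthesize the mask before adding. */
uint32_t next_and_fixed(uint32_t x)
{
    return (x & 7) + 1;
}
```

For x = 5, next_mod and next_and_fixed both return 6, while next_and_naive returns 0.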

Future Work

• Use of SSE vector instructions for parallel operations
• Data types such as uint8_t/uint16_t converted to natural integer size for better memory alignment and access performance
• Use of a union to replace packing and unpacking of data from array to D0/D1
• Written directly in optimized assembly
• Dedicated hardware implementation (ASIC/FPGA) for MISTY1 (originally designed to be implemented in hardware)
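
As a rough sketch of the union idea from the list above: a union lets the same 8 bytes be read as two 32-bit halves without shift-based packing. The type and function names are hypothetical, and the sketch assumes the halves are meant to be packed most significant byte first. Under that assumption the union view matches the shift-based result directly only on a big-endian host; on a little-endian host such as the Core2 used for benchmarking, a byte swap is still needed.

```c
#include <stdint.h>

/* One 64-bit block viewed either as 8 bytes or as two 32-bit halves.
   half[0] overlays bytes[0..3] and half[1] overlays bytes[4..7]. */
typedef union {
    uint8_t  bytes[8];
    uint32_t half[2];
} block_t;

/* Shift-based packing of four bytes, most significant byte first
   (assumed byte order for this sketch). */
uint32_t pack_be(const uint8_t *b)
{
    return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
           ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
}

/* Reverse the byte order of a 32-bit word. */
static uint32_t swap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
           ((v << 8) & 0x00ff0000u) | (v << 24);
}

/* Returns 1 when the least significant byte is stored first. */
static int host_is_little_endian(void)
{
    uint16_t probe = 1;
    return *(const uint8_t *)&probe == 1;
}

/* Union-based read of half i, byte-swapped on little-endian hosts
   so that the result matches pack_be. */
uint32_t read_half(const block_t *blk, int i)
{
    uint32_t v = blk->half[i];
    return host_is_little_endian() ? swap32(v) : v;
}
```

Whether this beats the shift-based version depends on the compiler and target, so it would need to be profiled like the other changes.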

Questions?

Date post:	14-Dec-2015