MISTY1 Block Cipher
Undergrad Team U8 – JK FlipFlop
Clark Cianfarini and Garrett Smith
What is MISTY1?
• Cryptographic block cipher
• Developed by Mitsubishi Electric
• Created in 1995
• Developed primarily for encryption on mobile phones and other mobile devices
• Stands for: Mitsubishi Improved Security TechnologY
Technical Specs
• Feistel network
• 64-bit block size
• 128-bit key
• Rounds in multiples of 4 (4, 8, 12, 16, …)
• RFC 2994
Picture from:
http://web.archive.org/web/20000823133547/http://www.mitsubishi.com/ghp_japan/misty/misty_e_b.pdf
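To illustrate the Feistel property listed above, here is a minimal toy sketch in C. It is not MISTY1: the round function toy_f, the constant TOY_ROUNDS, and the subkey array are hypothetical names invented for this example, and the real cipher also interleaves FL layers between rounds. The sketch only shows why a Feistel network can be decrypted by running the same round function with the subkeys in reverse order.

```c
#include <stdint.h>

#define TOY_ROUNDS 8  /* hypothetical; chosen to mirror the 8-round default */

/* Hypothetical round function. This is NOT MISTY1's FO; any function works,
 * because a Feistel network never needs to invert it. */
static uint32_t toy_f(uint32_t half, uint32_t subkey)
{
    uint32_t x = half ^ subkey;
    x = (x << 7) | (x >> 25);            /* rotate left by 7 bits */
    return x * 0x9E3779B1u + subkey;     /* unsigned arithmetic wraps mod 2^32 */
}

/* Encrypt: each round maps (L, R) to (R, L ^ F(R, k)). */
uint64_t toy_feistel_encrypt(uint64_t block, const uint32_t subkeys[TOY_ROUNDS])
{
    uint32_t left = (uint32_t)(block >> 32);
    uint32_t right = (uint32_t)block;
    for (int i = 0; i < TOY_ROUNDS; i++) {
        uint32_t tmp = left ^ toy_f(right, subkeys[i]);
        left = right;
        right = tmp;
    }
    return ((uint64_t)left << 32) | right;
}

/* Decrypt: undo the rounds in reverse order, (L', R') to (R' ^ F(L', k), L'). */
uint64_t toy_feistel_decrypt(uint64_t block, const uint32_t subkeys[TOY_ROUNDS])
{
    uint32_t left = (uint32_t)(block >> 32);
    uint32_t right = (uint32_t)block;
    for (int i = TOY_ROUNDS - 1; i >= 0; i--) {
        uint32_t tmp = right ^ toy_f(left, subkeys[i]);
        right = left;
        left = tmp;
    }
    return ((uint64_t)left << 32) | right;
}
```

Calling toy_feistel_decrypt on the output of toy_feistel_encrypt with the same subkeys returns the original 64-bit block.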
Our Original Implementation
• 8 rounds (the standard)
• 128-bit key and 64-bit data as hexadecimal inputs (command-line arguments)
• Encrypt and decrypt functionality both implemented (as well as performing both consecutively for benchmarking)
Original (Unoptimized) Design
• Designed for code size and clarity
• Written in C
• Only standard libraries used
• Inefficiencies in: loops, multiplies and divides, function calls, parameter passing
• Usage: ./misty <e|d|b> <K> <M> [I]
  • 'e' to encrypt, 'd' to decrypt, 'b' to test both
  • K is a required 32-digit hex string (128 bits)
  • M is a required 16-digit hex string (64 bits)
  • I is an optional number of iterations for benchmarking
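As a rough sketch of how the K and M arguments could be turned into bytes, the C code below parses a fixed-length hex string. The function names hex_digit_value and parse_hex_bytes are hypothetical and are not taken from the team's source; the actual program may parse its arguments differently.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Return the value of one hex digit, or -1 if c is not a hex digit. */
static int hex_digit_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Parse exactly 2 * nbytes hex digits from text into out.
 * Returns 0 on success, -1 on a wrong length or a non-hex character. */
int parse_hex_bytes(const char *text, uint8_t *out, size_t nbytes)
{
    if (text == NULL || strlen(text) != 2 * nbytes)
        return -1;
    for (size_t i = 0; i < nbytes; i++) {
        int hi = hex_digit_value(text[2 * i]);
        int lo = hex_digit_value(text[2 * i + 1]);
        if (hi < 0 || lo < 0)
            return -1;
        out[i] = (uint8_t)((hi << 4) | lo);   /* two digits form one byte */
    }
    return 0;
}
```

With this sketch, a 128-bit key would be parsed with nbytes = 16 (32 hex digits) and a 64-bit block with nbytes = 8 (16 hex digits).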
Original Design GPROF Profile
     %  cumulative      self                  self     total
  time     seconds   seconds       calls   us/call   us/call  name
 45.57       10.65     10.65   560000000      0.02      0.02  fi
 19.80       15.28      4.63   160000000      0.03      0.09  fo
  7.63       17.06      1.78   100000000      0.02      0.02  fl
  6.94       18.69      1.62   100000000      0.02      0.02  flinv
  5.25       19.91      1.23    10000000      0.12      0.27  key_schedule
  3.13       20.65      0.73    10000000      0.07      1.01  decrypt_block
  3.06       21.36      0.72    10000000      0.07      1.03  encrypt_block
  2.44       21.93      0.57    20000000      0.03      0.03  unpack_data
  1.54       22.29      0.36    50000000      0.01      0.04  decrypt_round_even
  1.33       22.60      0.31    40000000      0.01      0.13  encrypt_round_even
  1.03       22.85      0.24    40000000      0.01      0.18  decrypt_round_odd
  0.96       23.07      0.23                                  __gmon_start__
  0.86       23.27      0.20    40000000      0.01      0.09  encrypt_round_odd
  0.34       23.35      0.08    10000000      0.01      0.04  encrypt_final
  0.21       23.40      0.05                                  main
  0.00       23.40      0.00          48      0.00      0.00  xtoi
  0.00       23.40      0.00           4      0.00      0.00  print_hex_data
  0.00       23.40      0.00           2      0.00      0.00  parse_hex_arg
• 80% of the time spent in FO/FI/FL/FLINV
• Compiled with gcc-4.3.4
• Benchmarked on 64-bit Core2 @ 2.4 GHz, linux-2.6.33
Unoptimized Execution Time
    gcc misty_slow.c -o slow
    time ./slow b 00112233445566778899aabbccddeeff 0123456789abcdef 10000000
    real 0m23.093s
    user 0m22.886s
    sys  0m0.031s
• 10 million iterations, 2.31 µs per iteration (~1.15 µs each for one encryption and one decryption)
Revised Software Design
• Designed for optimal performance
• Loops unrolled (rounds, d0/d1 pack)
• Power-of-two multiply, divide, and modulus replaced with shift and AND
• Functions inlined
• Reduced parameter passing (key)
• Compiler optimization levels enabled
• Compiler architecture-specific options enabled
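The power-of-two replacement listed above can be sketched as follows. The function names are hypothetical and the constant 8 is only an example; the equivalences hold for unsigned operands, while for negative signed values / and % do not match >> and &. An optimizing compiler will typically make this substitution itself when the divisor is a constant.

```c
#include <stdint.h>

/* Each _slow/_fast pair returns the same value for every unsigned input. */

uint32_t mul8_slow(uint32_t x) { return x * 8; }
uint32_t mul8_fast(uint32_t x) { return x << 3; }   /* 8 == 1 << 3 */

uint32_t div8_slow(uint32_t x) { return x / 8; }
uint32_t div8_fast(uint32_t x) { return x >> 3; }

uint32_t mod8_slow(uint32_t x) { return x % 8; }
uint32_t mod8_fast(uint32_t x) { return x & 7u; }   /* mask is 8 - 1 */
```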
Rounds: Before Unrolling
    for (i = 0; i < NUM_ROUNDS; i++)
    {
        if (i == (NUM_ROUNDS - 1))
            encrypt_final(i, &d0, &d1, ek);
        else if ((i % 2) == 0)
            encrypt_round_even(i, &d0, &d1, ek);
        else
            encrypt_round_odd(i, &d0, &d1, ek);
    }
Rounds: After Unrolling
    // round 0
    d0 = fl(d0, 0);
    d1 = fl(d1, 1);
    d1 = d1 ^ fo(d0, 0);
    // round 1
    d0 = d0 ^ fo(d1, 1);
    // round 2
    d0 = fl(d0, 2);
    d1 = fl(d1, 3);
    d1 = d1 ^ fo(d0, 2);
    // round 3
    d0 = d0 ^ fo(d1, 3);
    // round 4
    d0 = fl(d0, 4);
    d1 = fl(d1, 5);
    d1 = d1 ^ fo(d0, 4);
    // round 5
    d0 = d0 ^ fo(d1, 5);
    // round 6
    d0 = fl(d0, 6);
    d1 = fl(d1, 7);
    d1 = d1 ^ fo(d0, 6);
    // round 7
    d0 = d0 ^ fo(d1, 7);
    // finalize
    d0 = fl(d0, 8);
    d1 = fl(d1, 9);
Execution Time and Speedup
Description          Time         Speedup
Slow / Initial       0m23.093s    1.00000
Unroll Rounds        0m21.573s    1.07046
Unroll D0/D1 Init    0m20.750s    1.11292
Shift and AND        0m18.978s    1.21683
Unroll Packing       0m18.135s    1.27339
Make EK Global       0m17.902s    1.28997
Inline FO/FI/FL      0m15.921s    1.45047
Enable O1            0m4.308s     5.36049
Enable O2            0m4.276s     5.40061
Enable O3            0m4.155s     5.55654
Architecture Flags   0m4.128s     5.59423
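The Speedup column is the baseline time divided by the time after each step, and the per-iteration figure is the total time divided by the iteration count. A minimal sketch of that arithmetic, with hypothetical function names:

```c
/* Speedup relative to the baseline: values above 1.0 mean faster. */
double speedup(double baseline_seconds, double optimized_seconds)
{
    return baseline_seconds / optimized_seconds;
}

/* Average time per iteration in microseconds. */
double us_per_iteration(double total_seconds, double iterations)
{
    return total_seconds * 1e6 / iterations;
}
```

For example, speedup(23.093, 15.921) is about 1.45 and speedup(23.093, 4.128) is about 5.59, matching the table, while us_per_iteration(23.093, 1e7) is about 2.31.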
Building and Testing the Optimized Implementation
    gcc misty_fast.c -o fast
    gcc misty_fast.c -o fast -O1
    gcc misty_fast.c -o fast -O2
    gcc misty_fast.c -o fast -O3
    gcc misty_fast.c -o fast -O3 -march=core2
• Fastest execution time:
    real 0m4.128s
    user 0m4.117s
    sys  0m0.007s
• 10 million iterations, 413 ns per iteration
Final Design GPROF Profile
     %  cumulative      self                  self     total
  time     seconds   seconds       calls   ns/call   ns/call  name
 42.99        2.26      2.26    10000000    226.15    226.15  decrypt_block
 41.57        4.45      2.19    10000000    218.65    218.65  encrypt_block
 15.41        5.26      0.81                                  main
  0.00        5.26      0.00           4      0.00      0.00  print_hex_data
  0.00        5.26      0.00           2      0.00      0.00  parse_hex_arg
• Most function calls inlined; only decrypt_block and encrypt_block remain
What was Learned?
• The original implementation may not have been all that bad (only ~1.5x speedup from manual optimizations)
• Larger benefit from the compiler's (gcc) instruction-level optimization
• Profile first, then optimize in places where it actually matters
• Bit-wise AND has lower precedence than modulus (and than addition), so replacing % with & needs parentheses:
  • x % y + z → (x % y) + z
  • x & y + z → x & (y + z)
• All optimizations add up to a significant amount of savings
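The precedence pitfall above can be reproduced with a short C sketch. The function names and the constants 8 and 7 are hypothetical; the middle function writes out explicitly how C groups the unparenthesized expression x & 7 + z.

```c
#include <stdint.h>

/* Original form: % binds tighter than +, so this is (x % 8) + z. */
uint32_t offset_mod(uint32_t x, uint32_t z)
{
    return x % 8 + z;
}

/* Naive replacement: C parses "x & 7 + z" as shown here, because +
 * binds tighter than &. The result no longer matches offset_mod. */
uint32_t offset_and_wrong(uint32_t x, uint32_t z)
{
    return x & (7 + z);
}

/* Correct replacement: parenthesize the mask before adding. */
uint32_t offset_and_right(uint32_t x, uint32_t z)
{
    return (x & 7) + z;
}
```

With x = 13 and z = 1, offset_mod and offset_and_right both return 6, while offset_and_wrong returns 8.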
Future Work
• Use of SSE vector instructions for parallel operations
• Data types such as uint8_t/uint16_t converted to natural integer size for better memory alignment and access performance
• Use of a union to replace packing and unpacking of data from array to D0/D1
• Written directly in optimized assembly
• Dedicated hardware implementation (ASIC/FPGA) for MISTY1 (originally designed to be implemented in hardware)
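The union idea from the list above might look like the following sketch. The type block64 and the function names are hypothetical, and the shift-based version assumes big-endian packing purely for illustration. The union version copies bytes without any shifts, but the resulting word values depend on the host's byte order, so on a little-endian machine it would not produce the same D0/D1 as the shift version without a byte swap.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical overlay of a 64-bit block as bytes or as two 32-bit halves. */
typedef union {
    uint8_t  bytes[8];
    uint32_t words[2];
} block64;

/* Portable packing with shifts (big-endian order assumed for illustration). */
void pack_with_shifts(const uint8_t in[8], uint32_t *d0, uint32_t *d1)
{
    *d0 = ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
          ((uint32_t)in[2] << 8)  |  (uint32_t)in[3];
    *d1 = ((uint32_t)in[4] << 24) | ((uint32_t)in[5] << 16) |
          ((uint32_t)in[6] << 8)  |  (uint32_t)in[7];
}

/* Union-based packing: no shifts, but word values follow host byte order. */
void pack_with_union(const uint8_t in[8], uint32_t *d0, uint32_t *d1)
{
    block64 b;
    memcpy(b.bytes, in, sizeof b.bytes);
    *d0 = b.words[0];
    *d1 = b.words[1];
}

/* Inverse of pack_with_union on the same host. */
void unpack_with_union(uint32_t d0, uint32_t d1, uint8_t out[8])
{
    block64 b;
    b.words[0] = d0;
    b.words[1] = d1;
    memcpy(out, b.bytes, sizeof b.bytes);
}
```

The design trade-off is speed against portability: the union avoids per-byte work, while the shift version gives the same D0/D1 on every platform.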
Questions?