Exploiting parallelism opportunities in non-parallel architectures to improve NLFSR software implementations

Pedro Malagón

Juan-Mariano de Goyeneche

José M. Moya

Context

• Remote Keyless Entry Systems (RKE)

– Short messages

– Both sides of the communication know the state

– Knowing a previous state/message provides no information about the next state/message (ideally)

Global goal

• Automatic generation of different implementations of the same encryption algorithm

• Random execution of implementations in order to introduce variability that increases resistance against Side-Channel Attacks

LFSR (I)

• Linear Feedback Shift Registers

• Implementation

– Very simple in Hardware

– One-bit at a time in Software
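
To make the "one bit at a time" point concrete, here is a minimal sketch of a software LFSR step in C. The 16-bit width and the feedback polynomial x^16 + x^14 + x^13 + x^11 + 1 are assumptions picked for illustration (a commonly cited maximal-length choice), not something taken from the slides; the function names are hypothetical.

```c
#include <stdint.h>

/* One step of a 16-bit Fibonacci LFSR (shifts right, new bit enters at bit 15).
 * Assumed feedback polynomial: x^16 + x^14 + x^13 + x^11 + 1. */
static uint16_t lfsr16_step(uint16_t s)
{
    /* XOR of the tapped bits gives the single new bit for this step. */
    uint16_t fb = ((s >> 0) ^ (s >> 2) ^ (s >> 3) ^ (s >> 5)) & 1u;
    return (uint16_t)((s >> 1) | (fb << 15));
}

/* Count how many steps it takes for the state to return to the seed. */
static uint32_t lfsr16_period(uint16_t seed)
{
    uint16_t s = seed;
    uint32_t steps = 0;
    do {
        s = lfsr16_step(s);
        steps++;
    } while (s != seed);
    return steps;
}
```

Each call produces exactly one new bit, which is why a hardware shift register is trivial while a software loop pays several instructions per bit.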

LFSR (II)

• Pros:

– Pseudo-random sequence

– Long period: n bits → up to 2^n - 1

– Simple implementation

• Cons:

– Berlekamp-Massey algorithm

• Observing 2n consecutive output bits gives complete information about the LFSR
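
The weakness can be illustrated with a short sketch of the Berlekamp-Massey algorithm over GF(2), which returns the length of the shortest LFSR that generates a given bit sequence. The helper `lfsr16_output`, the 16-bit register, its taps, and the `BM_MAX` limit are assumptions made for this example only.

```c
#include <stdint.h>
#include <string.h>

#define BM_MAX 64  /* longest sequence this sketch handles (assumption) */

/* Berlekamp-Massey over GF(2): linear complexity of s[0..n-1], n <= BM_MAX. */
static int berlekamp_massey(const uint8_t *s, int n)
{
    uint8_t c[BM_MAX + 1] = {0}, b[BM_MAX + 1] = {0}, t[BM_MAX + 1];
    int L = 0, m = -1;
    c[0] = b[0] = 1;
    for (int i = 0; i < n; i++) {
        /* Discrepancy between s[i] and the current LFSR's prediction. */
        uint8_t d = s[i];
        for (int j = 1; j <= L; j++)
            d ^= (uint8_t)(c[j] & s[i - j]);
        if (d) {
            memcpy(t, c, sizeof c);
            /* c(x) += b(x) * x^(i - m) */
            for (int j = 0; j + (i - m) <= n; j++)
                c[j + (i - m)] ^= b[j];
            if (2 * L <= i) {
                L = i + 1 - L;
                m = i;
                memcpy(b, t, sizeof b);
            }
        }
    }
    return L;
}

/* Hypothetical 16-bit LFSR used as the "victim": emits n output bits. */
static void lfsr16_output(uint16_t s, uint8_t *out, int n)
{
    for (int i = 0; i < n; i++) {
        out[i] = (uint8_t)(s & 1u);
        uint16_t fb = ((s >> 0) ^ (s >> 2) ^ (s >> 3) ^ (s >> 5)) & 1u;
        s = (uint16_t)((s >> 1) | (fb << 15));
    }
}
```

Feeding 32 output bits (2n for n = 16) of the example register into `berlekamp_massey` yields a linear complexity of 16, that is, the register length is recovered from the output alone.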

NLFSR (I)

• Add nonlinearity to improve security

• Non-Linear Feedback Shift Registers

NLFSR (II)

• Implementation

– Focus on the NLF

– 2^n-bit LUT

– Run-time computed: ANF

– Automatic detection of the c_i values

\[
f : \{0,1\}^n \to \{0,1\}, \qquad
f(x_0, \dots, x_{n-1}) = \sum_{i=0}^{2^n - 1} c_i \, x_0^{i_0} x_1^{i_1} \cdots x_{n-1}^{i_{n-1}}
\]
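
As a sketch of how the c_i values could be detected automatically, the binary Moebius transform converts a truth table (LUT) into ANF coefficients. The example below assumes n <= 5 so that both the LUT and the coefficient vector fit in one 32-bit word; the function names are hypothetical, and the constant used in the usage note is the publicly documented KeeLoq NLF truth table.

```c
#include <stdint.h>

/* Evaluate an ANF with n <= 5 inputs.
 * Bit i of coef is c_i; monomial i contains x_k iff bit k of i is set.
 * Bit k of x is the value of input x_k. */
static uint32_t anf_eval(uint32_t coef, unsigned n, uint32_t x)
{
    uint32_t r = 0;
    for (uint32_t i = 0; i < ((uint32_t)1 << n); i++)
        if (((coef >> i) & 1u) && ((x & i) == i))
            r ^= 1u;   /* monomial i is present and all its inputs are 1 */
    return r;
}

/* Binary Moebius transform: truth table (bit x = f(x)) -> ANF coefficients. */
static uint32_t lut_to_anf(uint32_t lut, unsigned n)
{
    uint32_t a = lut;
    for (unsigned k = 0; k < n; k++)
        for (uint32_t i = 0; i < ((uint32_t)1 << n); i++)
            if (i & ((uint32_t)1 << k))
                /* fold the entry with bit k cleared into entry i */
                a ^= ((a >> (i ^ ((uint32_t)1 << k))) & 1u) << i;
    return a;
}
```

For example, `lut_to_anf(0x3A5C742E, 5)` gives the coefficient word 0x143A124E, which has 12 bits set, so the ANF of that function has 12 monomials.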

Concrete goal

• Goal: different implementations, potentially generated automatically

• Two completely different implementations:

– ANF based and LUT based

• ANF drawbacks

– Too many run-time operations (boolean)

• Optimization of ANF based implementations

Round processing

• Feedback inputs of upcoming rounds can already be available

• Available processing capabilities

– min (j - i, n) n-bit ALU, j-bit data, i bit

– Similar to MMX in AES implementations
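
The idea can be sketched on a toy register. Everything here is an assumption for illustration: a hypothetical 32-bit NLFSR whose feedback is f = s0 ^ s3 ^ (s5 & s9) ^ s16, so the highest feedback input is bit 16 and the inputs of the next 32 - 16 = 16 rounds are already present in the state. Under that reading of the bound above, a 16-bit-wide operation can compute 16 feedback bits at once. This is not the authors' generated code.

```c
#include <stdint.h>

/* Hypothetical toy NLFSR: 32-bit state, shifts right, new bit enters at bit 31.
 * Feedback in ANF: f = s0 ^ s3 ^ (s5 & s9) ^ s16. */

/* One round, one bit at a time. */
static uint32_t toy_step(uint32_t s)
{
    uint32_t fb = (s ^ (s >> 3) ^ ((s >> 5) & (s >> 9)) ^ (s >> 16)) & 1u;
    return (s >> 1) | (fb << 31);
}

/* Sixteen rounds at once: bit t of f is the feedback bit of round t,
 * because every input of round t (bits t, t+3, t+5, t+9, t+16) is still
 * inside the original state for t <= 15. */
static uint32_t toy_step16(uint32_t s)
{
    uint32_t f = (s ^ (s >> 3) ^ ((s >> 5) & (s >> 9)) ^ (s >> 16)) & 0xFFFFu;
    return (s >> 16) | (f << 16);
}

/* Reference: run the bit-serial version for a number of rounds. */
static uint32_t toy_run_serial(uint32_t s, unsigned rounds)
{
    while (rounds--)
        s = toy_step(s);
    return s;
}
```

The same Boolean operations are executed once per 16 rounds instead of once per round; the masking with 0xFFFF keeps only the bits whose inputs were all valid.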

[Figure: round i+1]

LLVM Passes

• ANF implementation

• DAG building

• CFG generation

• Masking meta → valid bits

• Instruction scheduling (maximize bits)

• Loop instruction motion → Nested loops

– Power of two step

Test case

• KeeLoq in MSP430 (16-bit)

• Inputs: d0, d1, d9, d16, d20, d26, d31, k0

• Data: 32-bits
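
For reference, a bit-serial sketch of the KeeLoq round using the inputs listed above, with the NLF written both as a LUT lookup and as an ANF expression. The 528-round count, the 64-bit key consumed one bit per round, and the NLF constant 0x3A5C742E come from public descriptions of KeeLoq, not from the slides; the function names are hypothetical and this is not one of the five measured implementations.

```c
#include <stdint.h>

#define KEELOQ_NLF 0x3A5C742Eu   /* 5-input NLF as a 32-bit truth table */

static uint32_t bit(uint32_t x, unsigned n) { return (x >> n) & 1u; }

/* LUT-based NLF: index the truth table with the five inputs (a is the MSB). */
static uint32_t nlf_lut(uint32_t a, uint32_t b, uint32_t c, uint32_t d, uint32_t e)
{
    return (KEELOQ_NLF >> ((a << 4) | (b << 3) | (c << 2) | (d << 1) | e)) & 1u;
}

/* ANF-based NLF: the same function using only AND and XOR. */
static uint32_t nlf_anf(uint32_t a, uint32_t b, uint32_t c, uint32_t d, uint32_t e)
{
    return d ^ e ^ (a & c) ^ (a & e) ^ (b & c) ^ (b & e) ^ (c & d) ^ (d & e)
         ^ (a & d & e) ^ (a & c & e) ^ (a & b & d) ^ (a & b & c);
}

/* Encryption: one new bit per round, shifted in at bit 31. */
static uint32_t keeloq_encrypt(uint32_t x, uint64_t key)
{
    for (unsigned r = 0; r < 528; r++) {
        uint32_t k0 = (uint32_t)((key >> (r & 63u)) & 1u);
        uint32_t fb = bit(x, 0) ^ bit(x, 16) ^ k0
                    ^ nlf_anf(bit(x, 31), bit(x, 26), bit(x, 20), bit(x, 9), bit(x, 1));
        x = (x >> 1) | (fb << 31);
    }
    return x;
}

/* Decryption: undo the rounds in reverse order, shifting left. */
static uint32_t keeloq_decrypt(uint32_t x, uint64_t key)
{
    for (unsigned r = 0; r < 528; r++) {
        uint32_t k = (uint32_t)((key >> ((527u - r) & 63u)) & 1u);
        uint32_t lsb = bit(x, 31) ^ bit(x, 15) ^ k
                     ^ nlf_lut(bit(x, 30), bit(x, 25), bit(x, 19), bit(x, 8), bit(x, 0));
        x = (x << 1) | lsb;
    }
    return x;
}
```

Using `nlf_anf` in one direction and `nlf_lut` in the other is deliberate here: a successful round trip also checks that the two NLF forms agree.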

Experimental

• Compare 5 implementations

– 3 LUT based

    – tb041: official PIC implementation

    – nlf_tb041: mask calculation

    – gen_tb041: official generic Microchip implementation

– 2 ANF based

    – bin_ops: one bit at a time

    – par_bin_ops: applying optimizer

par_bin_ops

• Implementation

[Flow diagram with blocks: Setup, 16-round processing (< 33), output]

Cycles (16 rounds)

Instructions (16 rounds)

Memory (16 rounds)

Conclusions

• Worst case

– Cycles improvement: 2.45×

– Code size grows by 2.27×

• Automatically generated

Thank you

Thank you for coming

Any questions?