Exploiting parallelism opportunities in
non-parallel architectures to improve
NLFSR software implementations
Pedro Malagón
Juan-Mariano de Goyeneche
José M. Moya
1 / 20
Context
• Remote Keyless Entry Systems (RKE)
– Small communications
– Two sides of communication know state
– Knowing previous state/message provides no
information of next state/message (ideally)
2
Global goal
3
• Automatic generation of different
implementations of the same encryption
algorithm
• Random execution of implementations in
order to introduce variability that increases
resistance against Side-Channel Attacks
LFSR (I)
• Linear Feedback Shift Registers
• Implementation
– Very simple in Hardware
– One-bit at a time in Software
4
LFSR (II)
• Pros:
– Pseudo-random sequence
– Long period: n-bits → 2n
– Simple implementation
• Cons:
– Berlekamp-Massey algorithm
• Observing 2n gives complete information of LFSR
5
NLFSR (I)
• Add non linearity to improve security
• Non-Linear Feedback Shift Registers
6
NLFSR (II)
• Implementation
– Focus on the NLF
– bit LUT
– Run-time computed: ANF
– Automatically detection of ci values
7
{ } { }( ) ∑
−
= −−−••••=
→12
0 11010110,,
1,01,0n
n
i
in
iiin
n
xxxcxxf KK
Concrete goal
8
• Goal: different implementations potentially automatic
• Two completley different implementations:
– ANF based and LUT based
• ANF drawbacks
– Too many run-time operations (boolean)
• Optimization of ANF based implementations
Round processing
• Feedback inputs can be available
• Available processing capabilities
– min (j - i, n) n-bit ALU, j-bit data, i bit
– Similar to MMX in AES implementations
9
round i+1
round i+1
LLVM Passes
10
• ANF implementation
• DAG building
• CFG generation
• Masking meta → valid bits
• Instruction scheduling (maximize bits)
• Loop instruction motion → Nested loops
– Power of two step
Test case
11
• KeeLoq in MSP430 (16-bit)
• Inputs: d0, d1, d9, d16, d20, d26, d31, k0
• Data: 32-bits
Experimental
12
• Compare 5 implementations
– 3 LUT based
– tb041: official PIC implementation
– nlf_tb041: mask calculation
– gen_tb041: official generic Microchip
– 2 ANF based
– bin_ops: one bit at a time
– par_bin_ops: applying optimizer
16-round processing
< 33
Setup
output
par_bin_ops
13
• Implementation
Cycles (16 rounds)
14
Instructions (16 rounds)
15
Memory (16 rounds)
16
Conclusions
17
• Worst case
– Cycles improvement: 2.45
– Code size grows in 2.27
• Automatically generated
Thank you
18
Thank you for coming
Any questions?