NetTM: Faster and Easier Synchronization for Soft Multicores via Transactional Memory

Martin Labrecque

Prof. Greg Steffan

University of Toronto

FPGA, February 27th 2011


Processors in FPGAs

FPGAs in Telecommunications:

– Present in most high-end routers

– More than 40% of FPGA market

Deep packet inspection requires: software + CPUs

Our goal: implement those cores directly in the FPGA

[Figure: an FPGA containing one or more soft processors (datapath with PC, instruction memory, register array, ALU and data memory), a DDR controller and an Ethernet MAC]


NetThreads: Our Base System

[Figure: NetThreads block diagram on the NetFPGA. Labeled blocks: packet input, Input Buffer, Input mem., two 4-thread processors each with an instruction cache (I$), Synch. Unit, Instr., Data, Data Cache, Off-chip DDR2, Output mem., Output Buffer, packet output]

8 threads? Write 1 program, run on all threads!

Released online: netfpga+netthreads
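The "write 1 program, run on all threads" model can be illustrated in ordinary software. The following is a minimal sketch, not NetThreads code: the names `packet_loop` and `run`, the use of Python queues in place of the input and output buffers, and the upper-casing "processing" step are all assumptions made for illustration.

```python
import queue
import threading

N_THREADS = 8  # 2 processors x 4 threads, as on the slide


def packet_loop(inputs, outputs):
    """The single program that every thread runs (hypothetical)."""
    while True:
        packet = inputs.get()
        if packet is None:           # sentinel: no more packets
            return
        outputs.put(packet.upper())  # placeholder for real packet processing


def run(packets):
    """Feed packets to N_THREADS identical workers and collect the results."""
    inputs, outputs = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=packet_loop, args=(inputs, outputs))
               for _ in range(N_THREADS)]
    for t in threads:
        t.start()
    for p in packets:
        inputs.put(p)
    for _ in threads:                # one sentinel per worker
        inputs.put(None)
    for t in threads:
        t.join()
    # Completion order is not deterministic, so sort before returning.
    return sorted(outputs.get() for _ in range(len(packets)))
```

Every worker executes the same function; which thread handles which packet is decided only by arrival order at the shared input queue.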


Parallelizing Stateful Applications

Ideal scenario: packets are data-independent and are processed in parallel.

[Figure: Packet1 to Packet4 processed concurrently by Thread1 to Thread4 along a TIME axis]

Reality: programmers need to insert locks in case there is a dependence.

[Figure: the same four threads along a TIME axis, three of them marked "wait"]

Experimental result: synchronizing packet-processing threads with fine/medium-grained global locks is overly conservative 80-90% of the time [ANCS'10].

Transactional memory: data-independent packets are processed in parallel.

[Figure: the four threads along a TIME axis under transactional memory, with one "x" mark]
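To make "overly conservative" concrete, the sketch below counts how many pairs of packets truly conflict on shared state. It is an illustration only: the function `conflict_fraction`, the read/write-set representation and the four-packet workload are hypothetical and are not taken from the [ANCS'10] measurements.

```python
from itertools import combinations


def conflicts(a, b):
    """Two packets conflict if one writes a key the other reads or writes."""
    (reads_a, writes_a), (reads_b, writes_b) = a, b
    return bool(writes_a & (reads_b | writes_b)) or bool(writes_b & reads_a)


def conflict_fraction(packets):
    """Fraction of packet pairs whose state accesses really conflict."""
    pairs = list(combinations(packets, 2))
    return sum(conflicts(a, b) for a, b in pairs) / len(pairs)


# Hypothetical workload: each packet reads and updates one per-flow record.
packets = [
    ({"flow:1"}, {"flow:1"}),
    ({"flow:2"}, {"flow:2"}),
    ({"flow:3"}, {"flow:3"}),
    ({"flow:1"}, {"flow:1"}),  # same flow as the first packet
]
fraction = conflict_fraction(packets)
```

A single global lock would serialize all four packets, although only the first and last touch the same record. Transactional memory lets the independent ones proceed and pays a penalty only when a real conflict occurs.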


NetTM: extending NetThreads for TM

[Figure: the NetThreads block diagram (packet input, Input Buffer, Input mem., two 4-thread processors each with an I$, Synch. Unit, Instr., Data, Data Cache, Off-chip DDR2, Output mem., Output Buffer, packet output) extended with Conflict Detection at the Synch. Unit and an Undo Log]

Conflict Detection: application-specific Bloom filters [ARC10]

- 1K words speculative writes buffered per thread
- 4-LUT: +21%
- 16K BRAMs: +25%
- Preserved 125 MHz
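The two TM additions named on this slide, Bloom-filter conflict detection and an undo log, can be sketched in software as follows. This is a minimal illustration under stated assumptions, not NetTM's hardware design: the class names `Signature` and `Transaction`, the 64-bit filter size, the SHA-256 hashing and the in-place-update policy are all chosen for the example.

```python
import hashlib


class Signature:
    """Fixed-size Bloom filter summarizing a set of addresses (hypothetical)."""

    def __init__(self, bits=64, hashes=2):
        self.bits, self.hashes, self.word = bits, hashes, 0

    def _positions(self, addr):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{addr}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.bits

    def add(self, addr):
        for p in self._positions(addr):
            self.word |= 1 << p

    def may_contain(self, addr):
        # May report false positives, never false negatives.
        return all((self.word >> p) & 1 for p in self._positions(addr))


class Transaction:
    """Tracks accesses in signatures and old values in an undo log."""

    def __init__(self, memory):
        self.memory = memory
        self.reads, self.writes = Signature(), Signature()
        self.undo_log = []  # (address, previous value), in store order

    def load(self, addr):
        self.reads.add(addr)
        return self.memory[addr]

    def store(self, addr, value):
        self.writes.add(addr)
        self.undo_log.append((addr, self.memory[addr]))
        self.memory[addr] = value  # assumption: update in place, log old value

    def conflicts_with_store(self, addr):
        """Would another thread's store to addr conflict with this one?"""
        return self.reads.may_contain(addr) or self.writes.may_contain(addr)

    def abort(self):
        # Undo in reverse order so the oldest logged value wins.
        for addr, old in reversed(self.undo_log):
            self.memory[addr] = old
        self.undo_log.clear()


memory = {0x10: 1, 0x20: 2}
tx = Transaction(memory)
tx.store(0x10, 99)
tx.store(0x10, 100)
detected = tx.conflicts_with_store(0x10)
tx.abort()
```

Because a Bloom filter can return false positives, a design like this may abort a transaction that did not truly conflict, but it never misses a real conflict.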


Transactional Memory

● 1st HTM implementation tightly integrated with soft processors

● Supports conventional locks and TM without code modification!

● Can extract optimistic parallelism across packets

● Improves benchmark throughput: +6%, +54%, +57%

● Coarse critical sections and deadlock avoidance simplify programming

● Processor and conflict detection integration works well on FPGA

Future work: scale to more cores on newer FPGA/NetFPGA!

NetTM and NetThreads available online: netfpga+netthreads

martinL@eecg.utoronto.ca
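The claim that the same code can run under conventional locks or TM can be pictured with a software analogy. In the sketch below the application calls one `critical()` entry point, and the backend either takes a mutex or runs the body optimistically and retries on conflict. Everything here is an assumption made for illustration (the `SynchUnit` class, its `mode` argument, and the whole-state version check, which is far coarser than per-address conflict detection); it does not describe NetTM's actual interface.

```python
import threading


class SynchUnit:
    """Hypothetical synchronization layer with one critical-section API."""

    def __init__(self, mode, state):
        assert mode in ("locks", "tm")
        self.mode, self.state, self.version = mode, state, 0
        self._mutex = threading.Lock()

    def critical(self, body):
        if self.mode == "locks":
            with self._mutex:            # pessimistic: always serialize
                body(self.state)
            return
        while True:                      # "tm": optimistic execution
            with self._mutex:
                start, snapshot = self.version, dict(self.state)
            body(snapshot)               # speculative work on a private copy
            with self._mutex:            # commit-time validation
                if self.version == start:
                    self.state.clear()
                    self.state.update(snapshot)
                    self.version += 1
                    return
            # Another thread committed first: discard the copy and retry.


def count_packet(state):
    """Application code: identical in both modes."""
    state["packets"] = state.get("packets", 0) + 1


def run(mode, n_threads=4, per_thread=100):
    unit = SynchUnit(mode, {})

    def worker():
        for _ in range(per_thread):
            unit.critical(count_packet)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return unit.state["packets"]
```

Calling `run("locks")` and `run("tm")` exercises the same `count_packet` body under both backends; only the synchronization layer changes.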