Hardware Multithreading


1

Hardware Multithreading

COMP25212

2

Increasing CPU Performance

• By increasing clock frequency – pipelining

• By increasing Instructions per Clock – superscalar

• Minimizing memory access impact – caches

• Maximizing pipeline utilization – branch prediction

• Maximizing pipeline utilization – forwarding

• Maximizing instruction issue – dynamic scheduling

3

Increasing Parallelism

• The amount of parallelism that we can exploit is limited by the programs
– Some areas exhibit great parallelism
– Others are essentially sequential

• In the latter case, where can we find additional independent instructions?
– In a different program!

4

Software Multithreading - Revision

• Modern Operating Systems allow several processes/threads to run concurrently

• Transparent to the user – all of them appear to be running at the same time

• BUT, actually, they are scheduled (and interleaved) by the OS

5

OS Thread Switching - Revision

[Diagram: Thread T0 executes while Thread T1 waits; on a switch the Operating System saves T0's state into PCB0 and loads T1's state from PCB1, then T1 executes; the reverse happens on the next switch]

COMP25111 – Lect. 5

6

Process Control Block (PCB) - Revision

• Process ID
• Process State
• PC
• Stack Pointer
• General Registers
• Memory Management Info
• Open File List, with positions
• Network Connections
• CPU time used
• Parent Process ID

PCBs store information about the state of ‘alive’ processes handled by the OS

7

OS Process States - Revision

[Diagram: process state machine – New → Ready (waiting for a CPU); Ready → Running on a CPU (Dispatched); Running → Ready (Pre-empted); Running → Blocked waiting for event (Wait, e.g. I/O); Blocked → Ready (Event occurs); Running → Terminated]

COMP25111 – Lect. 5

8

Hardware Multithreading

• Allow multiple threads to share a single processor

• Requires replicating the independent state of each thread

• Virtual memory can be used to share memory among threads
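The replicated per-thread state can be pictured as a small context block. A hypothetical sketch (field names and sizes are illustrative, not from the slides):

```python
from dataclasses import dataclass, field

@dataclass
class HardwareThreadContext:
    """Illustrative per-thread state a multithreaded core must replicate."""
    pc: int = 0                        # program counter
    registers: list = field(default_factory=lambda: [0] * 32)  # architectural regs
    asid: int = 0                      # address-space ID selecting this thread's VA mapping
    ready: bool = True                 # cleared, e.g., while waiting on a cache miss
```

Caches, execution units and translation hardware stay shared; only this small block is duplicated per hardware thread.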

9

CPU Support for Multithreading

[Diagram: five-stage pipeline – Fetch, Decode, Exec, Mem and Write logic – with shared Inst and Data caches and address translation; per-thread state is replicated: PCA/PCB, RegA/RegB, and VA MappingA/VA MappingB]

10

Hardware Multithreading Issues

• How HW MT is presented to the OS
– Normally present each hardware thread as a virtual processor (Linux, UNIX, Windows)
– Requires multiprocessor support from the OS

• Needs to share or replicate resources
– Registers – normally replicated
– Caches – normally shared

• Each thread will use a fraction of the cache
• Cache thrashing issues – harm performance

11

Example of Thrashing - Revision

Memory accesses to cache line 0x13D (all six addresses share the same index in a direct-mapped cache):

Thread A     Thread B     Hit?   Action taken   Tag after
    :            :                              Invalid
0x075A13D0                MISS   Load 0x075A    0x075A
             0x018313D4   MISS   Load 0x0183    0x0183
0x075A13D4                MISS   Load 0x075A    0x075A
             0x018313D8   MISS   Load 0x0183    0x0183
0x075A13D8                MISS   Load 0x075A    0x075A
             0x018313DC   MISS   Load 0x0183    0x0183
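The "same index, different tag" conflict in the table can be checked with a few lines of Python. The cache geometry assumed here (16-byte lines, 4096 lines, direct-mapped) is an assumption chosen to be consistent with the addresses shown, not stated on the slide:

```python
LINE_BYTES = 16   # assumed line size: the low hex digit is the byte offset
NUM_LINES = 4096  # assumed direct-mapped geometry: index = next 3 hex digits

def index_and_tag(addr):
    """Split an address into (cache index, tag) for this assumed geometry."""
    line = addr // LINE_BYTES
    return line % NUM_LINES, line // NUM_LINES

# Thread A's and Thread B's accesses map to the same line, 0x13D, with
# different tags -- so each access evicts the other thread's line.
assert index_and_tag(0x075A13D0) == (0x13D, 0x075A)
assert index_and_tag(0x018313D4) == (0x13D, 0x0183)
```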

12

Hardware Multithreading

• Different ways to exploit this new source of parallelism

• When & how to switch threads?
– Coarse-grain Multithreading
– Fine-grain Multithreading
– Simultaneous Multithreading

13

Coarse-Grain Multithreading

14

Coarse-Grain Multithreading

• Issue instructions from a single thread
• Operate like a simple pipeline

• Switch thread on “expensive” operation:
– E.g. I-cache miss
– E.g. D-cache miss

15

Switch Threads on I-cache miss

         1    2    3        4    5    6    7
Inst a   IF   ID   EX       MEM  WB
Inst b        IF   ID       EX   MEM  WB
Inst c             IF-MISS  -    -    -    -
Inst X                      IF   ID   EX   MEM
Inst Y                           IF   ID   EX
Inst Z                                IF   ID

• Remove Inst c and switch to ‘grey’ thread
• ‘Grey’ thread will continue its execution until there is another I-cache or D-cache miss

16

Switch Threads on D-cache miss

         1    2    3    4       5     6     7
Inst a   IF   ID   EX   M-MISS  MISS  MISS
Inst b        IF   ID   EX      -     -     -    (abort)
Inst c             IF   ID      -     -     -    (abort)
Inst d                  IF      -     -     -    (abort)
Inst X                          IF    ID    EX
Inst Y                                IF    ID

• Remove Inst a and switch to ‘grey’ thread
– Remove issued instructions from ‘white’ thread (abort those in the miss shadow)
– Roll back ‘white’ PC to point to Inst a

17

Coarse Grain Multithreading

• Good to compensate for infrequent, but expensive pipeline disruptions

• Minimal pipeline changes
– Need to abort all the instructions in the “shadow” of a D-cache miss – overhead
– Swap instruction streams

• Data/control hazards are not solved

18

Fine-Grain Multithreading

19

Fine-Grain Multithreading

• Interleave the execution of several threads

• Usually using Round Robin among all the ready hardware threads

• Requires instantaneous thread switching
– Complex hardware
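The round-robin policy can be sketched in a few lines of Python – a software model of the hardware selection logic, not an implementation of it:

```python
def round_robin(threads):
    """Interleave instructions from the given threads, one per cycle,
    in fixed round-robin order, skipping threads with nothing to issue."""
    queues = {t: list(insts) for t, insts in threads.items()}
    schedule = []                  # one (thread, instruction) pair per cycle
    while any(queues.values()):
        for t in queues:           # fixed round-robin order over threads
            if queues[t]:
                schedule.append((t, queues[t].pop(0)))
    return schedule

# 'white' and 'grey' alternate; 'white' gets every slot once 'grey' drains.
sched = round_robin({"white": ["a", "b", "c"], "grey": ["M", "N"]})
assert sched == [("white", "a"), ("grey", "M"),
                 ("white", "b"), ("grey", "N"),
                 ("white", "c")]
```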

20

Fine-Grain Multithreading

• Multithreading helps alleviate fine-grain dependencies (e.g. the need for forwarding)

         1    2    3    4    5    6    7
Inst a   IF   ID   EX   MEM  WB
Inst M        IF   ID   EX   MEM  WB
Inst b             IF   ID   EX   MEM  WB
Inst N                  IF   ID   EX   MEM
Inst c                       IF   ID   EX
Inst P                            IF   ID

21

I-cache misses in Fine Grain Multithreading

• An I-cache miss is overcome transparently

         1    2    3        4    5    6    7
Inst a   IF   ID   EX       MEM  WB
Inst M        IF   ID       EX   MEM  WB
Inst b             IF-MISS  -    -    -    -
Inst N                      IF   ID   EX   MEM
Inst P                           IF   ID   EX
Inst Q                                IF   ID

Inst b is removed and ‘white’ is marked as not ‘ready’.
‘White’ thread is not ready, so ‘grey’ is executed.

22

D-cache misses in Fine Grain Multithreading

• Mark the thread as not ‘ready’ and issue only from the other thread(s)

         1    2    3    4       5     6     7
Inst a   IF   ID   EX   M-MISS  MISS  MISS  WB
Inst M        IF   ID   EX      MEM   WB
Inst b             IF   ID      -     -     -
Inst N                  IF      ID    EX    MEM
Inst P                          IF    ID    EX
Inst Q                                IF    ID

‘White’ marked as not ‘ready’. Remove Inst b. Update PC.
‘White’ thread is not ready, so ‘grey’ is executed.

23

Fine Grain Multithreading in Out-of-order Processors

• In an out-of-order processor we may continue issuing instructions from both threads
– Unless the O-o-O algorithm stalls one of the threads

         1    2    3    4       5     6     7
Inst a   IF   RO   EX   M-MISS  MISS  MISS  WB
Inst M        IF   RO   EX      MEM   WB
Inst b             IF   RO      (RO)  (RO)  EX
Inst N                  IF      RO    EX    MEM
Inst c                          IF    (RO)  (RO)
Inst P                                IF    RO

24

Fine Grain Multithreading

• Utilization of pipeline resources is increased, i.e. better overall performance

• The impact of short stalls is alleviated by executing instructions from other threads

• Single-thread execution is slowed
• Requires an instantaneous thread-switching mechanism
– Expensive in terms of hardware

25

Simultaneous Multi-Threading

26

Simultaneous Multi-Threading

• The main idea is to exploit instruction-level parallelism and thread-level parallelism at the same time

• In a superscalar processor, issue instructions from different threads in the same cycle

• Instructions from different threads can be using the same stage of the pipeline
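Issue-slot sharing can be modelled with a small Python sketch: each cycle, up to `width` slots are filled from whichever threads still have instructions, so slots one thread cannot use go to another. This is an illustrative model, not how real issue logic is built:

```python
def smt_issue(threads, width=2):
    """Fill up to `width` issue slots per cycle from any ready thread."""
    queues = {t: list(v) for t, v in threads.items()}
    cycles = []                         # one list of filled slots per cycle
    while any(queues.values()):
        slots, order, i = [], list(queues), 0
        while len(slots) < width and any(queues.values()):
            t = order[i % len(order)]   # rotate over the threads
            i += 1
            if queues[t]:
                slots.append((t, queues[t].pop(0)))
        cycles.append(slots)
    return cycles

# Cycle 1 mixes both threads in its two slots; cycle 2 issues twice from
# 'white' because 'grey' has drained -- no slot is left idle.
cycles = smt_issue({"white": ["a", "b", "c"], "grey": ["M"]})
assert cycles == [[("white", "a"), ("grey", "M")],
                  [("white", "b"), ("white", "c")]]
```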

27

Simultaneous Multi-Threading

         1    2    3    4    5    6    7    8    9
Inst a   IF   ID   EX   MEM  WB
Inst b   IF   ID   EX   MEM  WB
Inst M        IF   ID   EX   MEM  WB
Inst N        IF   ID   EX   MEM  WB
Inst c             IF   ID   EX   MEM  WB
Inst P             IF   ID   EX   MEM  WB
Inst Q                  IF   ID   EX   MEM  WB
Inst d                  IF   ID   EX   MEM  WB
Inst e                       IF   ID   EX   MEM  WB
Inst R                       IF   ID   EX   MEM  WB

Two instructions issue per cycle: sometimes from the same thread (a and b), sometimes from different threads (c and P).

28

SMT issues

• Asymmetric pipeline stall (from superscalar)
– One part of the pipeline stalls – we want the other pipeline to continue

• Overtaking – want non-stalled threads to make progress

• Existing implementations on O-o-O, register-renamed architectures (similar to Tomasulo)
– e.g. Intel Hyper-Threading

29

SMT: Glimpse into the Future

• Scout threads
– A thread to prefetch memory – reduces cache-miss overhead

• Speculative threads
– Allow a thread to execute speculatively way past branch/jump/call/miss/etc.
– Needs revised O-o-O logic
– Needs extra memory support

30

Simultaneous Multi Threading

• Extracts the most parallelism from instructions and threads

• Implemented only in out-of-order processors because they are the only ones able to exploit that much parallelism

• Has a significant hardware overhead

31

Example

Consider we want to execute 2 programs with 100 instructions each. The first program suffers an I-cache miss at instruction #30, and the second program another at instruction #70. Assume that:

+ There is enough parallelism to execute all instructions independently (no hazards)
+ Switching threads can be done instantaneously
+ A cache miss requires 20 cycles to get the instruction to the cache
+ The two programs would not interfere with each other’s cache lines

Calculate the execution time observed by each of the programs (cycles elapsed between the execution of the first and the last instruction of that application) and the total time to execute the workload:

a) Sequentially (no multithreading)
b) With coarse-grain multithreading
c) With fine-grain multithreading
d) With 2-way simultaneous multithreading
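The exercise is left for the reader, but part (b) can be checked against a simple model. The sketch below assumes one instruction completes per cycle, ignores pipeline fill/drain, switches thread the moment a miss is detected, and resumes a thread once its cache line has arrived. It is an illustration of the coarse-grain policy under those assumptions, not the official solution:

```python
def coarse_grain(n, miss_at, penalty=20):
    """Model coarse-grain MT for two threads of n instructions each.
    miss_at[t] is the (1-based) instruction of thread t that misses.
    Returns the cycle at which each thread executes its last instruction."""
    done = [0, 0]            # instructions completed per thread
    ready_at = [1, 1]        # earliest cycle each thread may run again
    served = [False, False]  # has each thread's single miss occurred yet?
    last = [0, 0]            # cycle of each thread's final instruction
    cur, cycle = 0, 1
    while min(done) < n:
        runnable = [t for t in (cur, 1 - cur)
                    if done[t] < n and ready_at[t] <= cycle]
        if not runnable:                 # both threads waiting on memory
            cycle += 1
            continue
        cur = runnable[0]                # prefer the currently running thread
        if not served[cur] and done[cur] + 1 == miss_at[cur]:
            served[cur] = True
            ready_at[cur] = cycle + penalty  # line arrives `penalty` cycles on
            continue                         # instantaneous switch, same cycle
        done[cur] += 1
        last[cur] = cycle
        cycle += 1
    return last

# Thread 0 misses at instruction #30, thread 1 at #70:
assert coarse_grain(100, [30, 70]) == [169, 200]
```

Under this model thread 0 finishes at cycle 169 and thread 1 at cycle 200, so the workload takes 200 cycles in total: both misses are fully hidden by the other thread's execution.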

32

Summary of Hardware Multithreading

33

Benefits of Hardware Multithreading

• Multithreading techniques improve the utilisation of processor resources and, hence, the overall performance

• If the different threads are accessing the same input data they may be using the same regions of memory
– Cache efficiency improves in these cases

34

Disadvantages of Hardware Multithreading

• The single-thread performance may be degraded when compared with a single-threaded CPU
– Multiple threads interfere with each other

• Shared caches mean that, effectively, each thread uses a fraction of the whole cache
– Thrashing may exacerbate this issue

• Thread scheduling at hardware level adds high complexity to processor design
– Thread state, managing priorities, OS-level information, …

35

Multithreading Summary

• A cost-effective way of finding additional parallelism for the CPU pipeline

• Available in x86, Itanium, Power and SPARC

• Present each additional hardware thread as an additional virtual CPU to the Operating System

• Operating Systems Beware!!! (why?)

36

Comparison of Multithreading Techniques – 4-way superscalar

[Diagram: issue-slot occupancy over time for Threads A, B, C and D on a 4-way superscalar, compared under coarse-grain multithreading, fine-grain multithreading and SMT]