Post on 30-Dec-2015
1
Hardware Multithreading
COMP25212
2
Increasing CPU Performance
• By increasing clock frequency – pipelining
• By increasing Instructions per Clock – superscalar
• Minimizing memory access impact – caches
• Maximizing pipeline utilization – branch prediction
• Maximizing pipeline utilization – forwarding
• Maximizing instruction issue – dynamic scheduling
3
Increasing Parallelism
• The amount of parallelism that we can exploit is limited by the program
– Some areas exhibit great parallelism
– Some others are essentially sequential
• In the latter case, where can we find additional independent instructions?
– In a different program!
4
Software Multithreading - Revision
• Modern Operating Systems allow several processes/threads to run concurrently
• Transparent to the user – all of them appear to be running at the same time
• BUT, actually, they are scheduled (and interleaved) by the OS
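A minimal sketch of this in Python: the two workers below appear to run at the same time, but the OS (and interpreter) decides when each one actually executes; the programmer never schedules them explicitly.

```python
import threading

# Two workers whose scheduling is entirely up to the OS.
results = {}

def worker(name, n):
    total = 0
    for i in range(n):
        total += i          # the OS may preempt us anywhere in this loop
    results[name] = total

t0 = threading.Thread(target=worker, args=("T0", 1000))
t1 = threading.Thread(target=worker, args=("T1", 2000))
t0.start(); t1.start()      # both appear to run "at the same time"
t0.join(); t1.join()        # wait until the OS has run both to completion
print(results)              # {'T0': 499500, 'T1': 1999000}
```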
5
OS Thread Switching - Revision
[Diagram: Threads T0 and T1 alternate between Exec and Wait. On each switch the Operating System saves the outgoing thread's state into its PCB and loads the incoming thread's state from the other PCB (e.g. save state into PCB0, load state from PCB1).]
COMP25111 – Lect. 5
6
Process Control Block (PCB) - Revision
• Process ID
• Process State
• PC
• Stack Pointer
• General Registers
• Memory Management Info
• Open File List, with positions
• Network Connections
• CPU time used
• Parent Process ID
PCBs store information about the state of ‘alive’ processes handled by the OS
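As a sketch, the PCB fields listed above could be gathered into a record like the one below. The field names are illustrative only; a real kernel structure (e.g. Linux's task_struct) holds far more.

```python
from dataclasses import dataclass, field

# Illustrative PCB record covering the fields on this slide.
@dataclass
class PCB:
    pid: int
    state: str              # 'new' | 'ready' | 'running' | 'blocked' | 'terminated'
    pc: int                 # saved program counter
    sp: int                 # saved stack pointer
    registers: dict = field(default_factory=dict)   # saved general registers
    mm_info: dict = field(default_factory=dict)     # memory-management info
    open_files: list = field(default_factory=list)  # open file list, with positions
    net_connections: list = field(default_factory=list)
    cpu_time_used: int = 0
    parent_pid: int = 0
```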
7
OS Process States - Revision
[Diagram: process state machine. New → Ready (waiting for a CPU); Ready → Running on a CPU (Dispatched); Running → Blocked waiting for event (Wait, e.g. I/O); Blocked → Ready (Event occurs); Running → Ready (Pre-empted); Running → Terminated.]
COMP25111 – Lect. 5
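The state diagram can be written down as a transition table, one entry per arrow; the event names 'admit' and 'exit' are assumed labels for the unlabelled transitions.

```python
# OS process states as a transition table: (state, event) -> next state.
TRANSITIONS = {
    ("new",     "admit"):      "ready",
    ("ready",   "dispatched"): "running",
    ("running", "wait"):       "blocked",   # e.g. waiting for I/O
    ("blocked", "event"):      "ready",     # the awaited event occurs
    ("running", "pre-empted"): "ready",
    ("running", "exit"):       "terminated",
}

def step(state, event):
    """Follow one arrow of the state diagram; KeyError on an illegal move."""
    return TRANSITIONS[(state, event)]

state = step("new", "admit")          # 'ready'
state = step(state, "dispatched")     # 'running'
state = step(state, "wait")           # 'blocked'
state = step(state, "event")          # back to 'ready'
```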
8
Hardware Multithreading
• Allow multiple threads to share a single processor
• Requires replicating the independent state of each thread
• Virtual memory can be used to share memory among threads
9
CPU Support for Multithreading
[Diagram: a five-stage pipeline (Fetch Logic, Decode Logic, Exec Logic, Mem Logic, Write Logic) shared by two hardware threads. Replicated per thread: program counters PCA/PCB, register files RegA/RegB, and VA MappingA/VA MappingB in the address translation. Shared: the instruction cache and the data cache.]
10
Hardware Multithreading Issues
• How HW MT is presented to the OS
– Normally present each hardware thread as a virtual processor (Linux, UNIX, Windows)
– Requires multiprocessor support from the OS
• Needs to share or replicate resources
– Registers – normally replicated
– Caches – normally shared
• Each thread will use a fraction of the cache
• Cache thrashing issues – may harm performance
11
Example of Thrashing - Revision

Direct-mapped cache; both threads access addresses that map to the same index (line 0x13D):

Thread A access     Thread B access     Action taken   Line 0x13D tag
      :                   :                            Invalid
0x075A13D0 (MISS)                       Load 0x075A    0x075A
                    0x018313D4 (MISS)   Load 0x0183    0x0183
0x075A13D4 (MISS)                       Load 0x075A    0x075A
                    0x018313D8 (MISS)   Load 0x0183    0x0183
0x075A13D8 (MISS)                       Load 0x075A    0x075A
                    0x018313DC (MISS)   Load 0x0183    0x0183

Every access misses: each thread keeps evicting the other's line.
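The ping-pong above can be reproduced with a few lines of Python. The cache geometry below (16-byte lines, 1024 lines) is an assumption chosen so that these addresses share index 0x13D; the tag values therefore differ numerically from the slide, but the behaviour is the same.

```python
# Sketch of direct-mapped cache thrashing: two threads' accesses map to the
# same line and keep evicting each other. Geometry is assumed, not from the slide.
LINE_SIZE, N_LINES = 16, 1024           # 4 offset bits, 10 index bits

lines = {}                               # index -> tag currently stored

def access(addr):
    index = (addr // LINE_SIZE) % N_LINES
    tag = addr // (LINE_SIZE * N_LINES)
    if lines.get(index) == tag:
        return "HIT"
    lines[index] = tag                   # evict whatever was there before
    return "MISS"

# Thread A and Thread B alternate, as in the table above.
trace = [0x075A13D0, 0x018313D4, 0x075A13D4,
         0x018313D8, 0x075A13D8, 0x018313DC]
print([access(a) for a in trace])        # every single access misses
```

Run either thread's trace alone and the second and third accesses hit; interleaved, the hit rate drops to zero.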
12
Hardware Multithreading
• Different ways to exploit this new source of parallelism
• When & how to switch threads?
– Coarse-grain Multithreading
– Fine-grain Multithreading
– Simultaneous Multithreading
13
Coarse-Grain Multithreading
14
Coarse-Grain Multithreading
• Issue instructions from a single thread
• Operate like a simple pipeline
• Switch thread on an “expensive” operation:
– E.g. an I-cache miss
– E.g. a D-cache miss
15
Switch Threads on I-cache miss

          1    2    3       4    5    6    7
Inst a    IF   ID   EX      MEM  WB
Inst b         IF   ID      EX   MEM  WB
Inst c              IF-MISS -    -    -    -
Inst X                      IF   ID   EX   MEM
Inst Y                           IF   ID   EX
Inst Z                                IF   ID

• Remove Inst c and switch to the ‘grey’ thread
• The ‘grey’ thread will continue its execution until there is another I-cache or D-cache miss
16
Switch Threads on D-cache miss

          1    2    3    4       5     6     7
Inst a    IF   ID   EX   M-MISS  MISS  MISS  … WB when the miss resolves
Inst b         IF   ID   EX      -     -     -   (aborted)
Inst c              IF   ID      -     -     -   (aborted)
Inst d                   IF      -     -     -   (aborted)
Inst X                           IF    ID    EX
Inst Y                                 IF    ID

• Remove Inst a and switch to the ‘grey’ thread
– Remove the already-issued instructions from the ‘white’ thread
– Roll back the ‘white’ PC to point to Inst a
17
Coarse Grain Multithreading
• Good to compensate for infrequent, but expensive, pipeline disruptions
• Minimal pipeline changes
– Need to abort all the instructions in the “shadow” of a D-cache miss – overhead
– Swap instruction streams
• Data/control hazards are not solved
18
Fine-Grain Multithreading
19
Fine-Grain Multithreading
• Interleave the execution of several threads
• Usually switching by round-robin among all the ready hardware threads
• Requires instantaneous thread switching
– Complex hardware
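The per-cycle selection can be sketched as a round-robin picker over the ready threads; the function name and representation are illustrative, and real hardware does this combinationally rather than with a loop.

```python
# Round-robin selection among the *ready* hardware threads, as fine-grain
# multithreading performs every cycle.
def next_thread(ready, last):
    """Pick the next ready thread after `last`, wrapping around.
    `ready` is one boolean per hardware thread."""
    n = len(ready)
    for i in range(1, n + 1):
        cand = (last + i) % n
        if ready[cand]:
            return cand
    return None                          # no thread ready: pipeline bubble

# Thread 1 is stalled (e.g. on a D-cache miss), so it is skipped:
assert next_thread([True, False, True], last=0) == 2
assert next_thread([True, False, True], last=2) == 0
assert next_thread([False, False, False], last=0) is None
```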
20
Fine-Grain Multithreading
• Multithreading helps alleviate fine-grain dependencies (e.g. forwarding?)
          1    2    3    4    5    6    7
Inst a    IF   ID   EX   MEM  WB
Inst M         IF   ID   EX   MEM  WB
Inst b              IF   ID   EX   MEM  WB
Inst N                   IF   ID   EX   MEM
Inst c                        IF   ID   EX
Inst P                             IF   ID
21
I-cache misses in Fine Grain Multithreading
• An I-cache miss is overcome transparently
          1    2    3       4    5    6    7
Inst a    IF   ID   EX      MEM  WB
Inst M         IF   ID      EX   MEM  WB
Inst b              IF-MISS -    -    -    -
Inst N                      IF   ID   EX   MEM
Inst P                           IF   ID   EX
Inst Q                                IF   ID

• Inst b is removed and the ‘white’ thread is marked as not ‘ready’
• The ‘white’ thread is not ready, so ‘grey’ instructions are executed
22
D-cache misses in Fine Grain Multithreading
• Mark the thread as not ‘ready’ and issue only from the other thread(s)
          1    2    3    4       5     6     7
Inst a    IF   ID   EX   M-MISS  MISS  MISS  WB
Inst M         IF   ID   EX      MEM   WB
Inst b              IF   ID      -     -     -
Inst N                   IF      ID    EX    MEM
Inst P                           IF    ID    EX
Inst Q                                 IF    ID

• ‘White’ is marked as not ‘ready’; remove Inst b and update the PC
• The ‘white’ thread is not ready, so ‘grey’ instructions are executed
23
Fine-Grain Multithreading in Out-of-order Processors

• In an out-of-order processor we may continue issuing instructions from both threads
– Unless the O-o-O algorithm stalls one of the threads

          1    2    3    4       5     6     7
Inst a    IF   RO   EX   M-MISS  MISS  MISS  WB
Inst M         IF   RO   EX      MEM   WB
Inst b              IF   RO      (RO)  (RO)  EX
Inst N                   IF      RO    EX    MEM
Inst c                           IF    (RO)  (RO)
Inst P                                 IF    RO

(RO: read-operands/issue stage; bracketed (RO) marks an instruction waiting in the issue queue)
24
Fine Grain Multithreading
• Utilization of pipeline resources is increased, i.e. better overall performance
• The impact of short stalls is alleviated by executing instructions from other threads
• Single-thread execution is slowed down
• Requires an instantaneous thread-switching mechanism
– Expensive in terms of hardware
25
Simultaneous Multi-Threading
26
Simultaneous Multi-Threading
• The main idea is to exploit instruction-level parallelism and thread-level parallelism at the same time
• In a superscalar processor, issue instructions from different threads in the same cycle
• Instructions from different threads can be using the same stage of the pipeline
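Per-cycle SMT issue can be sketched as filling a fixed number of issue slots from whichever threads are ready, mixing threads within one cycle. The queue-based model below is illustrative, not how real issue logic is built.

```python
# Sketch of SMT issue: each cycle, fill up to WIDTH issue slots from the
# ready threads' pending-instruction queues.
WIDTH = 2                                # 2-way simultaneous multithreading

def issue_cycle(queues, ready):
    """queues: per-thread lists of pending instructions (mutated in place).
    Returns the (thread, instruction) pairs issued this cycle."""
    issued, tid, n = [], 0, len(queues)
    while len(issued) < WIDTH and any(ready[t] and queues[t] for t in range(n)):
        if ready[tid] and queues[tid]:
            issued.append((tid, queues[tid].pop(0)))
        tid = (tid + 1) % n              # rotate over the threads
    return issued

q = [["a", "b", "c"], ["M", "N"]]
print(issue_cycle(q, [True, True]))      # [(0, 'a'), (1, 'M')]  mixed threads
print(issue_cycle(q, [True, False]))     # [(0, 'b'), (0, 'c')]  thread 1 stalled
```

Note that when one thread stalls, the other can take all the issue slots; that is exactly the flexibility SMT adds over fine-grain multithreading.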
27
Simultaneous Multi-Threading
          1    2    3    4    5    6    7    8    9    10
Inst a    IF   ID   EX   MEM  WB
Inst b    IF   ID   EX   MEM  WB
Inst M         IF   ID   EX   MEM  WB
Inst N         IF   ID   EX   MEM  WB
Inst c              IF   ID   EX   MEM  WB
Inst P              IF   ID   EX   MEM  WB
Inst Q                   IF   ID   EX   MEM  WB
Inst d                   IF   ID   EX   MEM  WB
Inst e                        IF   ID   EX   MEM  WB
Inst R                        IF   ID   EX   MEM  WB

Two instructions issue per cycle; in a given cycle they may come from the same thread (e.g. a & b) or from different threads (e.g. Q & d).
28
SMT issues
• Asymmetric pipeline stall (from superscalar)
– One part of the pipeline stalls – we want the other part to continue
• Overtaking – we want non-stalled threads to make progress
• Existing implementations are on O-o-O, register-renamed architectures (similar to Tomasulo)
– e.g. Intel Hyper-Threading
29
SMT: Glimpse into the Future
• Scout threads
– A thread to prefetch memory – reduces cache-miss overhead
• Speculative threads
– Allow a thread to execute speculatively way past branches/jumps/calls/misses/etc.
– Needs revised O-o-O logic
– Needs extra memory support
30
Simultaneous Multi Threading
• Extracts the most parallelism from instructions and threads
• Implemented only on top of out-of-order processors, because they are the only ones able to exploit that much parallelism
• Has a significant hardware overhead
31
Example

Consider that we want to execute 2 programs with 100 instructions each. The first program suffers an I-cache miss at instruction #30, and the second program another at instruction #70. Assume that:
+ There is enough parallelism to execute all instructions independently (no hazards)
+ Switching threads can be done instantaneously
+ A cache miss requires 20 cycles to get the instruction into the cache
+ The two programs do not interfere with each other’s cache lines

Calculate the execution time observed by each of the programs (cycles elapsed between the execution of the first and the last instruction of that program) and the total time to execute the workload:
a) Sequentially (no multithreading)
b) With coarse-grain multithreading
c) With fine-grain multithreading
d) With 2-way simultaneous multithreading
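Parts (a) and (b) can be checked with a minimal cycle-level sketch under the stated assumptions: one instruction issues per cycle, pipeline fill/drain is ignored, switching is free, and a miss is discovered in fetch and then costs 20 cycles. Fine-grain and SMT need extra assumptions about how issue slots are shared, so they are left as the exercise intends.

```python
# Cycle-level sketch of the exercise workload for the sequential and
# coarse-grain cases. All modelling choices beyond the slide are assumptions.
MISS_LATENCY = 20

def make_thread(n_insts, miss_at):
    return {"n": n_insts, "miss_at": miss_at, "next": 1,
            "missed": False, "ready_at": 1, "first": None, "last": None}

def try_issue(t, cycle):
    """Issue one instruction at `cycle`, or discover the thread's miss."""
    if t["next"] == t["miss_at"] and not t["missed"]:
        t["missed"] = True                      # miss discovered in fetch
        t["ready_at"] = cycle + MISS_LATENCY
        return False
    if t["first"] is None:
        t["first"] = cycle
    t["last"] = cycle
    t["next"] += 1
    return True

def done(t):
    return t["next"] > t["n"]

def run(policy):
    a, b = make_thread(100, 30), make_thread(100, 70)
    threads, current, cycle = [a, b], 0, 0
    while not (done(a) and done(b)):
        cycle += 1
        if policy == "sequential":              # finish A, then run B
            t = a if not done(a) else b
            if cycle >= t["ready_at"]:
                try_issue(t, cycle)
        elif policy == "coarse":                # switch thread on a miss
            t = threads[current]
            if done(t) or cycle < t["ready_at"]:
                current ^= 1
                t = threads[current]
            if not done(t) and cycle >= t["ready_at"]:
                if not try_issue(t, cycle):
                    current ^= 1                # miss: switch from next cycle
    observed = lambda t: t["last"] - t["first"] + 1
    return cycle, observed(a), observed(b)

print(run("sequential"))   # (240, 120, 120): both 20-cycle misses are exposed
print(run("coarse"))       # (202, 171, 172): each miss is hidden by the other thread
```

Under this model, coarse-grain multithreading hides both misses behind the other thread's execution, losing only the two miss-discovery cycles, but each program's individually observed time grows because the threads share the pipeline.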
32
Summary of Hardware Multithreading
33
Benefits of Hardware Multithreading
• Multithreading techniques improve the utilisation of processor resources and, hence, the overall performance
• If the different threads are accessing the same input data they may be using the same regions of memory – Cache efficiency improves in these cases
34
Disadvantages of Hardware Multithreading
• Single-thread performance may be degraded compared with a single-threaded CPU
– Multiple threads interfering with each other
• Shared caches mean that, effectively, each thread uses only a fraction of the whole cache
– Thrashing may exacerbate this issue
• Thread scheduling at hardware level adds high complexity to processor design– Thread state, managing priorities, OS-level information, …
35
Multithreading Summary
• A cost-effective way of finding additional parallelism for the CPU pipeline
• Available in x86, Itanium, POWER and SPARC
• Presents each additional hardware thread as an additional virtual CPU to the Operating System
• Operating Systems Beware!!! (why?)
36
Comparison of Multithreading Techniques – 4-way superscalar
[Figure: comparison of multithreading techniques on a 4-way superscalar, showing which thread (A, B, C or D) occupies each of the 4 issue slots over time. Coarse-grain runs one thread until it switches; fine-grain interleaves threads cycle by cycle; SMT mixes instructions from several threads within the same cycle, filling the most issue slots.]