Multithreading Peer Instruction Lecture Materials for Computer Architecture by Dr. Leo Porter is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
Transcript
Page 1:

Multithreading

Peer Instruction Lecture Materials for Computer Architecture by Dr. Leo Porter is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

Page 2:

Fine-Grain Multithreading

• Fine-grain multithreading performs a switch between threads EVERY cycle. What is the primary goal of such an approach? (A sketch follows the choices below.)

Selection “Best” argument

A Service each thread equally

B Hide instruction latencies

C Reduce replicated resources

D Exploit idle resources

E None of the above
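
A minimal simulation sketch (editor-added; the thread count, instruction latency, and round-robin policy are all hypothetical) of what switching every cycle buys: while one thread waits on a multi-cycle instruction, the issue slot is filled by the other threads, so instruction latencies are hidden.

```c
#include <stdio.h>

/* Hypothetical toy model: 4 threads, each executing instructions whose
 * results are not ready for LATENCY cycles.  Fine-grain multithreading
 * rotates to a different thread EVERY cycle, so by the time a thread is
 * scheduled again its previous result is ready and no cycle is wasted. */
#define NTHREADS 4
#define LATENCY  3   /* cycles until an instruction's result is usable */
#define CYCLES   20

int main(void) {
    int ready_at[NTHREADS] = {0};  /* cycle when each thread can issue again */
    int issued = 0, idle = 0;

    for (int cycle = 0; cycle < CYCLES; cycle++) {
        int t = cycle % NTHREADS;  /* round-robin: a new thread every cycle */
        if (cycle >= ready_at[t]) {
            issued++;
            ready_at[t] = cycle + LATENCY;  /* its next instruction must wait */
        } else {
            idle++;                /* slot wasted: thread still stalled */
        }
    }
    printf("issued %d, idle %d of %d cycles\n", issued, idle, CYCLES);
    return 0;
}
```

With 4 thread slots and a 3-cycle latency every cycle issues an instruction; a single thread on the same machine would issue only once every 3 cycles.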

Page 3:

Fine-Grain Multithreading

• Fine-grain multithreading performs a switch between threads EVERY cycle. What is the primary drawback of such an approach? (A small worked example follows the choices below.)

Selection “Best” argument

A Poor scalability (benefits for 8 threads exceed benefits for 64 threads).

B Extra hardware

C Need for large number of threads

D A and B

E B and C
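
A tiny editor-added calculation of the flip side, assuming a strict barrel-style design (a hypothetical simplification) that rotates through all hardware thread slots whether or not they hold runnable work: a lone thread only issues once every N cycles, so the approach needs many threads to pay off.

```c
#include <stdio.h>

/* Hypothetical strict "barrel" design: the core rotates through all
 * hardware thread slots every cycle, so one thread issues at most once
 * every N cycles regardless of how many slots actually have work. */
int main(void) {
    int insts = 1000;                       /* instructions in one thread */
    for (int nslots = 1; nslots <= 64; nslots *= 8) {
        printf("%2d thread slots: lone thread needs ~%5d cycles\n",
               nslots, insts * nslots);
    }
    return 0;
}
```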

Page 4:

Coarse-Grain Multithreading

• Coarse-grain multithreading performs a switch between threads whenever one thread encounters a high-latency event. What is the primary goal of such an approach? (A sketch follows the choices below.)

Selection Best argument

A Service each thread equally

B Hide instruction latencies

C Reduce replicated resources

D Exploit idle resources

E None of the above
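
A companion sketch (editor-added; the miss rate and miss penalty are made-up numbers) of the coarse-grain policy: a thread runs until it takes a long-latency cache miss, and only then does the core switch, so cycles that would otherwise be idle during the miss are spent running another thread.

```c
#include <stdio.h>

/* Hypothetical toy model: coarse-grain multithreading keeps running one
 * thread and only switches when that thread hits a long-latency event
 * (modeled here as a cache miss every MISS_EVERY instructions that
 * stalls the thread for MISS_PENALTY cycles). */
#define NTHREADS     2
#define MISS_EVERY   5     /* every 5th instruction misses the cache   */
#define MISS_PENALTY 20    /* cycles the missing thread is unavailable */
#define CYCLES       100

int main(void) {
    int stalled_until[NTHREADS] = {0};
    int insts[NTHREADS] = {0};
    int cur = 0, busy = 0, idle = 0;

    for (int cycle = 0; cycle < CYCLES; cycle++) {
        if (cycle < stalled_until[cur]) {
            /* Current thread is waiting on memory: switch to the other
             * thread if it is ready, otherwise the cycle is lost. */
            int next = (cur + 1) % NTHREADS;
            if (cycle >= stalled_until[next]) cur = next;
            else { idle++; continue; }
        }
        busy++;
        insts[cur]++;
        if (insts[cur] % MISS_EVERY == 0)        /* long-latency event */
            stalled_until[cur] = cycle + MISS_PENALTY;
    }
    printf("busy %d, idle %d of %d cycles\n", busy, idle, CYCLES);
    return 0;
}
```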

Page 5:

Coarse-Grain Multithreading

• Coarse-grain multithreading performs a switch between threads whenever one thread encounters a high-latency event. What is the primary drawback of such an approach? (A small worked example follows the choices below.)

Selection Best argument

A Poor scalability (benefits for 8 threads exceed benefits for 64 threads)

B Context switch times are slow

C Extra hardware

D A and B

E B and C
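
An editor-added back-of-the-envelope check (all latencies hypothetical) of why slow switches are the catch: a coarse-grain switch drains and refills the pipeline, so it only pays off when the stall being hidden is longer than that overhead, and short stalls are cheaper to simply sit through.

```c
#include <stdio.h>

/* Hypothetical numbers: a coarse-grain switch must drain and refill the
 * pipeline, so it is only worthwhile for stalls longer than that cost. */
int main(void) {
    int switch_cost = 14;    /* cycles to drain + refill the pipeline    */
    int l1_miss     = 12;    /* stall on an L1 miss (hypothetical)       */
    int llc_miss    = 200;   /* stall on a last-level cache miss         */

    printf("L1 miss : stall %3d vs switch %d -> %s\n", l1_miss, switch_cost,
           l1_miss > switch_cost ? "switch threads" : "just wait");
    printf("LLC miss: stall %3d vs switch %d -> %s\n", llc_miss, switch_cost,
           llc_miss > switch_cost ? "switch threads" : "just wait");
    return 0;
}
```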

Page 6:

Context Switch

• What happens on a context switch? (A minimal sketch of the saved state follows this list.)
– Transfer of register state

– Transfer of PC

– Draining of the pipeline

• Additionally:
– Warm up caches

– Warm up branch predictors
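
A minimal sketch (editor-added; the structure layout and the 32-register machine are assumptions) of the state the switch actually transfers, and of what it cannot transfer:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical sketch of the architectural state a context switch
 * transfers: the register file and the PC.  The pipeline is drained
 * first so no in-flight instruction belongs to the outgoing thread.
 * Caches and branch predictors are NOT part of this state: the incoming
 * thread re-warms them, and that warm-up is part of the switch cost. */
typedef struct {
    uint64_t regs[32];   /* integer register file snapshot */
    uint64_t pc;         /* where the thread resumes       */
} thread_context;

/* Save the outgoing thread's registers/PC, then load the incoming one's. */
void context_switch(thread_context *out, const thread_context *in,
                    uint64_t live_regs[32], uint64_t *live_pc) {
    memcpy(out->regs, live_regs, sizeof out->regs);
    out->pc = *live_pc;
    memcpy(live_regs, in->regs, sizeof out->regs);
    *live_pc = in->pc;
}

int main(void) {
    uint64_t live_regs[32] = {0};
    uint64_t live_pc = 0x1000;
    thread_context saved = {{0}, 0};
    thread_context incoming = {{0}, 0x2000};
    context_switch(&saved, &incoming, live_regs, &live_pc);
    printf("old thread saved at pc=0x%llx, resuming at pc=0x%llx\n",
           (unsigned long long)saved.pc, (unsigned long long)live_pc);
    return 0;
}
```

The copies themselves are cheap; the real price is the drained pipeline plus the cold caches and branch predictor that the incoming thread inherits.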

Page 7:

Multithreading

[Figure: issue-slot diagrams showing instruction issue over time across the machine's issue width for Coarse Grain, Fine Grain, and SMT]

Page 8:

Simultaneous Multithreading

1. More functional units

2. Larger instruction queue

3. Larger reorder buffer

4. Means to differentiate between threads in the instruction queue, register rename, and reorder buffer

5. Ability to fetch from multiple programs

Selection Required Resources

A 1, 2, 3, 4, 5

B 1, 3, 5

C 1, 4, 5

D 4, 5

E None of the above

Given a modern out-of-order processor with register renaming, instruction queue, reorder buffer, etc. – what is REQUIRED to perform simultaneous multithreading?

The point is that if you can just fetch from multiple streams, the processor is usually over-provisioned anyway (a sketch of the thread-tagging and multi-stream fetch follows).
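
An editor's sketch of what options 4 and 5 amount to in hardware terms (the field names and the round-robin fetch policy are hypothetical): shared queues just tag each entry with a thread ID, and the front end alternates which program it fetches from.

```c
#include <stdint.h>
#include <stdio.h>

enum { NUM_SMT_THREADS = 2 };

/* Requirement 4 (sketch): shared out-of-order structures only need each
 * entry tagged with the hardware thread it belongs to; the ALUs, the
 * instruction queue, and the reorder buffer themselves stay shared. */
typedef struct {
    uint8_t  thread_id;      /* which hardware thread owns this entry */
    uint8_t  dest_arch_reg;  /* architectural destination register    */
    uint16_t dest_phys_reg;  /* physical register after renaming      */
    uint8_t  completed;      /* ready to retire?                      */
} rob_entry;

/* Requirement 5 (sketch): fetch from multiple programs.  A simple policy
 * just alternates the fetch PC between threads each cycle. */
typedef struct {
    uint64_t fetch_pc[NUM_SMT_THREADS];
    int      next_thread;
} smt_fetch_state;

uint64_t fetch_next(smt_fetch_state *f, int *thread_out) {
    int t = f->next_thread;
    f->next_thread = (t + 1) % NUM_SMT_THREADS;
    *thread_out = t;
    uint64_t pc = f->fetch_pc[t];
    f->fetch_pc[t] += 4;     /* advance past the fetched instruction
                                (branches ignored in this sketch) */
    return pc;
}

int main(void) {
    smt_fetch_state f = { {0x1000, 0x8000}, 0 };
    for (int cycle = 0; cycle < 4; cycle++) {
        int t;
        uint64_t pc = fetch_next(&f, &t);
        rob_entry e = { (uint8_t)t, 1, 42, 0 };   /* entry tagged by thread */
        printf("cycle %d: thread %d fetched 0x%llx (ROB tag %u)\n",
               cycle, t, (unsigned long long)pc, e.thread_id);
    }
    return 0;
}
```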

Page 9:

Modern OOO Processor

[Figure: block diagram of a modern out-of-order core – Fetch, Decode, Register Rename, Instruction Queue, Reorder Buffer, three integer ALUs, two FP ALUs, Load Queue, Store Queue, and the L1 cache]

Draw in just the added ability to fetch more instructions (from multiple streams).

Page 10:

SMT vs. early multi-core

• The argument was between a single aggressive SMT out-of-order processor and a number of simpler processors.

• At the time, the advantage of the simpler processors was a higher clock rate.

• The disadvantages of the simpler processors were a lack of functional units, in-order execution, smaller caches, etc.

Page 11:

SMT vs. MP

Page 12:

SMT vs. early CMP

• SMT – 4 issue, 4 int ALU, 4 FP ALU

• CMP – 2 cores each 2-issue, 2 int ALU, 2 FP ALUs

• Say you have 4 threads

• Say you have 2 threads – one floating-point-intensive and the other integer-intensive

• Say you have 1 thread

Point out that single-thread performance drives benchmark tests – no one buys a processor that does worse! (A back-of-the-envelope comparison of these scenarios follows.)
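
A back-of-the-envelope sketch (editor-added; the per-thread ILP numbers and the issue-slot model are purely illustrative) of how the three scenarios above play out for the 4-issue SMT core versus the two 2-issue CMP cores:

```c
#include <stdio.h>

/* Illustrative (not measured) peak-issue comparison between
 *  - SMT : one 4-issue core whose resources are shared by all threads, and
 *  - CMP : two 2-issue cores, each running at most one thread.
 * Each thread is modeled only by its inherent ILP (independent
 * instructions it could issue per cycle); all numbers are hypothetical. */

static int min_int(int a, int b) { return a < b ? a : b; }

/* Threads share one wide core: total issue capped by the shared width. */
static int smt_issue(const int ilp[], int nthreads, int width) {
    int total = 0;
    for (int i = 0; i < nthreads; i++) total += ilp[i];
    return min_int(total, width);
}

/* One thread per core: each capped by its core's width; extra cores idle. */
static int cmp_issue(const int ilp[], int nthreads, int ncores, int width) {
    int total = 0, running = min_int(nthreads, ncores);
    for (int i = 0; i < running; i++) total += min_int(ilp[i], width);
    return total;
}

int main(void) {
    int four[] = {2, 2, 2, 2};  /* enough threads: both designs stay busy */
    int two[]  = {3, 1};        /* FP-heavy thread + int-heavy thread     */
    int one[]  = {3};           /* a single thread                        */

    printf("4 threads: SMT %d  CMP %d insts/cycle\n",
           smt_issue(four, 4, 4), cmp_issue(four, 4, 2, 2));
    printf("2 threads: SMT %d  CMP %d insts/cycle\n",
           smt_issue(two, 2, 4), cmp_issue(two, 2, 2, 2));
    printf("1 thread : SMT %d  CMP %d insts/cycle\n",
           smt_issue(one, 1, 4), cmp_issue(one, 1, 2, 2));
    return 0;
}
```

Under these assumptions both designs stay busy with four threads, the SMT core wins when one thread wants more than two issue slots of a resource, and with a single thread the CMP leaves a whole core idle – exactly the single-thread benchmark concern above.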

Page 13:

Multi-core recently

• Instruction queues were taking up about 20% of core area for a 4-issue machine; how complex would they be for 8-issue?

• Simpler hardware does not necessarily mean a faster clock rate.

• Tons of die space was available.

• Larger caches weren’t helping performance that much

• Why not just replicate a single advanced processor (core)?

Page 14:

SMT vs. CMP - Revised

• SMT – 4 issue, 4 int ALU, 4 FP ALU

• CMP – 2 cores each 4-issue, 4 int ALU, 4 FP ALUs

• Say you have 4 threads

• Say you have 2 threads – one floating-point-intensive and the other integer-intensive.

• Say you have 1 thread….

Page 15:

Multi-core Today

• 4-8 cores per chip. “Multi-core Era”

• Throughput scales well with the number of cores.

• Each core is frequently SMT as well (for more throughput)

• Great when you have 4-8 threads (most of us have a fair number at any given time)

• What to do when we get 128 cores (“Many core era”)??

Page 16:

Multithreading Key Points

• Simultaneous Multithreading
– Inexpensive addition to increase throughput for multiple threads

– Enables good throughput for multiple threads

– Does not impact single thread performance

• Single Chip Multiprocessors
– ILP wall / memory wall / power wall all point to multi-core

– Enables excellent throughput for multiple threads

• Where do we find all these threads?
– The “field of dreams” argument: build the cores and the threads will come

