Lecture 11 Multithreaded Architectures

Page 1: Lecture 11 Multithreaded Architectures

Lecture 11

Multithreaded Architectures

Graduate Computer Architecture

Fall 2005

Shih-Hao Hung

Dept. of Computer Science and Information Engineering

National Taiwan University

Page 2: Lecture 11 Multithreaded Architectures

Concept

• Data Access Latency
  – Cache misses (L1, L2)
  – Memory latency (remote, local)
  – Often unpredictable

• Multithreading (MT)
  – Tolerate or mask long and often unpredictable latency operations by switching to another context, which is able to do useful work.

Page 3: Lecture 11 Multithreaded Architectures

Why Multithreading Today?

• ILP is exhausted; TLP is in.
• Large performance gap between memory and processor.
• Too many transistors on chip.
• More existing MT applications today.
• Multiprocessors on a single chip.
• Long network latency, too.

Page 4: Lecture 11 Multithreaded Architectures

Classical Problem, '60s & '70s

• I/O latency prompted multitasking
• IBM mainframes
• Multitasking
• I/O processors
• Caches within disk controllers

Page 5: Lecture 11 Multithreaded Architectures

Requirements of Multithreading

• Storage is needed to hold each context's PC, registers, status word, etc.

• Coordination to match an event with a saved context

• A way to switch contexts

• Long latency operations must use resources not in use

Page 6: Lecture 11 Multithreaded Architectures

Processor Utilization vs. Latency

R = the run length to a long latency event

L = the amount of latency
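
The plot for this slide is not reproduced in the transcript. As a rough sketch of the relationship between R, L, and utilization, the following assumes a simple model that the slide does not state: N identical contexts, zero context-switch cost, and a fixed latency L after every run of length R. The names `utilization` and `contexts_to_saturate` are hypothetical.

```python
import math


def utilization(run_length, latency, contexts=1):
    """Fraction of cycles doing useful work under the assumed model.

    Each context runs for `run_length` cycles, then waits `latency` cycles.
    With zero switch cost, N contexts can overlap their waits, so the
    processor is busy N*R out of every R+L cycles until it saturates at 1.
    """
    busy = contexts * run_length
    period = run_length + latency
    return min(1.0, busy / period)


def contexts_to_saturate(run_length, latency):
    """Smallest number of contexts that hides the latency completely."""
    return math.ceil((run_length + latency) / run_length)


if __name__ == "__main__":
    for n in (1, 2, 5):
        print(n, utilization(run_length=10, latency=40, contexts=n))
```

Under these assumptions a single context with R = 10 and L = 40 keeps the processor busy one cycle in five, and five contexts are enough to hide the latency. A nonzero switch cost would lower the saturation point below 1.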

Page 7: Lecture 11 Multithreaded Architectures

Problem of the '80s

• Problem was revisited due to the advent of graphics workstations
  – Xerox Alto, TI Explorer
  – Concurrent processes are interleaved to allow the workstations to be more responsive.
  – These processes could drive or monitor display, input, file system, network, user processing
  – Process switch was slow, so the subsystems were microprogrammed to support multiple contexts

Page 8: Lecture 11 Multithreaded Architectures

Scalable Multiprocessor ('90s)

• Dance hall – a shared interconnect with memory on one side and processors on the other.

• Or processors may have local memory

Page 9: Lecture 11 Multithreaded Architectures

How do the processors communicate?

• Shared Memory
  – Potential long latency on every load
  – Cache coherency becomes an issue
  – Examples include NYU's Ultracomputer, IBM's RP3, BBN's Butterfly, MIT's Alewife, and later Stanford's Dash.
  – Synchronization occurs through shared variables, locks, flags, and semaphores.

• Message Passing
  – The programmer deals with latency explicitly: send fewer, larger messages, and send each one early enough that it reaches the receiver by the time the receiver expects it.
  – Examples include Intel's iPSC and Paragon, Caltech's Cosmic Cube, and Thinking Machines' CM-5
  – Synchronization occurs through send and receive
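
The two synchronization styles can be contrasted with a small software analogy. This is only a sketch using Python threads on one machine, not a model of any of the multiprocessors named above; the function names are hypothetical. The first version synchronizes through a shared variable guarded by a lock; the second synchronizes only through send (`put`) and receive (`get`), with each worker sending a single, larger message.

```python
import queue
import threading


def shared_memory_sum(values, workers=4):
    """Workers update one shared total, synchronizing with a lock."""
    total = 0
    lock = threading.Lock()

    def work(chunk):
        nonlocal total
        partial = sum(chunk)
        with lock:                    # synchronization via a shared lock
            total += partial

    threads = [threading.Thread(target=work, args=(values[i::workers],))
               for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total


def message_passing_sum(values, workers=4):
    """Workers share nothing; each sends one message with its partial sum."""
    mailbox = queue.Queue()

    def work(chunk):
        mailbox.put(sum(chunk))       # send: one message per worker

    threads = [threading.Thread(target=work, args=(values[i::workers],))
               for i in range(workers)]
    for t in threads:
        t.start()
    # receive: get() blocks until a message has arrived
    total = sum(mailbox.get() for _ in range(workers))
    for t in threads:
        t.join()
    return total
```

Both functions return the same result; the difference is where the programmer has to think about latency and ordering.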

Page 10: Lecture 11 Multithreaded Architectures

Cycle-by-Cycle Interleaved Multithreading

• Denelcor HEP1 (1982), HEP2

• Horizon, which was never built

• Tera, MTA

Page 11: Lecture 11 Multithreaded Architectures

Cycle-by-Cycle Interleaved Multithreading

• Features
  – An instruction from a different context is launched at each clock cycle
  – No interlocks or bypasses thanks to a non-blocking pipeline

• Optimizations:
  – Leaving context state in the processor (PC, register #, status)
  – Assigning tags to remote requests and then matching them on completion
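
The features and optimizations on this slide can be illustrated with a toy simulator. This is a minimal sketch under stated assumptions, not a description of the HEP or Tera hardware: programs are lists of `"alu"` and `"load"` operations, every load has the same fixed latency, and the function name `run_interleaved` is hypothetical.

```python
from collections import deque


def run_interleaved(programs, mem_latency):
    """Issue one instruction per cycle, each from a different ready context.

    Returns the issue trace: the context id that issued in each cycle,
    or None for a cycle in which no context was ready.
    """
    pcs = [0] * len(programs)   # per-context PC, kept in the processor
    ready = deque(i for i, prog in enumerate(programs) if prog)
    pending = {}                # tag -> (cycle the reply arrives, context id)
    next_tag = 0
    trace = []
    cycle = 0
    while ready or pending:
        # Match completed remote requests to their saved contexts by tag.
        for tag in [t for t, (done, _) in pending.items() if done <= cycle]:
            _, ctx = pending.pop(tag)
            if pcs[ctx] < len(programs[ctx]):
                ready.append(ctx)
        if ready:
            ctx = ready.popleft()
            op = programs[ctx][pcs[ctx]]
            pcs[ctx] += 1
            trace.append(ctx)
            if op == "load":
                # The context waits for its reply instead of blocking the pipeline.
                pending[next_tag] = (cycle + mem_latency, ctx)
                next_tag += 1
            elif pcs[ctx] < len(programs[ctx]):
                ready.append(ctx)   # back of the queue: round-robin order
        else:
            trace.append(None)      # no ready context this cycle
        cycle += 1
    return trace
```

With two ALU-only contexts the trace alternates between them; with a single context that loads, the idle cycles in the trace show the latency that additional contexts would otherwise hide.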

Page 12: Lecture 11 Multithreaded Architectures

Challenges with this approach

• I-Cache:
  – Instruction bandwidth
  – I-Cache misses: Since instructions are fetched from many different contexts, instruction locality is degraded and the I-cache miss rate rises.

• Register file access time:
  – Register file access time increases because the register file must grow significantly to accommodate many separate contexts.
  – In fact, the HEP and Tera use SRAM to implement the register file, which means longer access times.

• Single thread performance
  – Single thread performance is significantly degraded, since the processor is forced to switch to a new thread even if none is available.

• Very high bandwidth network, which is fast and wide

• Retries on load empty or store full
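
The last bullet refers to memory words tagged with a full/empty state, where a synchronizing load of an empty word or a store to a full word does not complete and has to be retried. A minimal software sketch of that behavior follows; the class `SyncWord` and its method names are hypothetical, and real hardware performs the retry in the memory system rather than in a caller's loop.

```python
class SyncWord:
    """One memory word with a full/empty bit (illustrative only)."""

    def __init__(self):
        self.full = False
        self.value = None

    def try_store(self, value):
        """Store succeeds only if the word is empty; otherwise retry later."""
        if self.full:
            return False
        self.value, self.full = value, True
        return True

    def try_load(self):
        """Load succeeds only if the word is full, and leaves it empty."""
        if not self.full:
            return False, None
        value, self.value, self.full = self.value, None, False
        return True, value
```

A producer that gets `False` from `try_store` and a consumer that gets `(False, None)` from `try_load` would each retry, which is the traffic the slide lists as a challenge.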

Page 13: Lecture 11 Multithreaded Architectures

Improving Single Thread Performance

• Do more operations per instruction (VLIW)

• Allow multiple instructions to issue into the pipeline from each context.
  – This could lead to pipeline hazards, so other safe instructions could be interleaved into the execution.
  – For Horizon & Tera, the compiler detects such data dependencies and the hardware enforces them by switching to another context if detected.

• Switch on load

• Switch on miss
  – Switching on load or miss will increase the context switch time.

Page 14: Lecture 11 Multithreaded Architectures

Simultaneous Multithreading (SMT)

• Tullsen et al. (U. of Washington), ISCA '95

• A way to utilize the pipeline with increased parallelism from multiple threads.

Page 15: Lecture 11 Multithreaded Architectures

Simultaneous Multithreading

Page 16: Lecture 11 Multithreaded Architectures

SMT Architecture

• Straightforward extension to conventional superscalar design.
  – multiple program counters and some mechanism by which the fetch unit selects one each cycle,
  – a separate return stack for each thread for predicting subroutine return destinations,
  – per-thread instruction retirement, instruction queue flush, and trap mechanisms,
  – a thread id with each branch target buffer entry to avoid predicting phantom branches, and
  – a larger register file, to support logical registers for all threads plus additional registers for register renaming.

• The size of the register file affects the pipeline and the scheduling of load-dependent instructions.
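
Two of the items above lend themselves to a short sketch: per-thread program counters with a fetch unit that selects one each cycle, and the sizing of the enlarged register file. The slide only says "some mechanism" for fetch selection, so the round-robin policy below is an assumption, as are the 4-byte instruction size and the names `SmtFrontEnd` and `physical_registers`.

```python
class SmtFrontEnd:
    """Per-thread PCs with round-robin fetch selection (assumed policy)."""

    def __init__(self, start_pcs):
        self.pcs = list(start_pcs)   # one program counter per hardware thread
        self.last = -1               # thread selected in the previous cycle

    def select(self, stalled=()):
        """Return the next non-stalled thread id, or None if all are stalled."""
        n = len(self.pcs)
        for step in range(1, n + 1):
            tid = (self.last + step) % n
            if tid not in stalled:
                self.last = tid
                return tid
        return None

    def fetch(self, stalled=()):
        """Fetch one instruction: return (thread id, PC) and advance that PC."""
        tid = self.select(stalled)
        if tid is None:
            return None
        pc = self.pcs[tid]
        self.pcs[tid] += 4           # assumes fixed 4-byte instructions
        return tid, pc


def physical_registers(threads, logical_per_thread, rename_regs):
    """Register file size: logical registers for all threads plus renaming."""
    return threads * logical_per_thread + rename_regs
```

For instance, under these assumptions two threads with 32 logical registers each and 100 renaming registers need `physical_registers(2, 32, 100)` = 164 physical registers; the numbers are illustrative and not taken from the slides.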

Page 17: Lecture 11 Multithreaded Architectures

SMT Performance

Tullsen '96

Page 18: Lecture 11 Multithreaded Architectures

Commercial Machines w/ MT Support

• Intel Hyper-Threading (HT)
  – Dual threads
  – Pentium 4, Xeon

• Sun CoolThreads
  – UltraSPARC T1
  – 4 threads per core

• IBM
  – POWER5

Page 19: Lecture 11 Multithreaded Architectures

IBM Power5

http://www.research.ibm.com/journal/rd/494/mathis.pdf

Page 20: Lecture 11 Multithreaded Architectures

IBM Power5

http://www.research.ibm.com/journal/rd/494/mathis.pdf

Page 21: Lecture 11 Multithreaded Architectures

SMT Summary

• Pros:
  – Increased throughput w/o adding much cost
  – Fast response for multitasking environment

• Cons:
  – Slower single processor performance

Page 22: Lecture 11 Multithreaded Architectures

Multicore

• Multiple processor cores on a chip
  – Chip multiprocessor (CMP)
  – Sun's Chip Multithreading (CMT)
    • UltraSPARC T1 (Niagara)
  – Intel's Pentium D
  – AMD dual-core Opteron

• Also a way to utilize TLP, but
  – 2 cores, 2X cost
  – No good for single thread performance

• Can be used together with SMT

Page 23: Lecture 11 Multithreaded Architectures

Chip Multithreading (CMT)

Page 24: Lecture 11 Multithreaded Architectures

Sun UltraSPARC T1 Processor

http://www.sun.com/servers/wp.jsp?tab=3&group=CoolThreads%20servers

Page 25: Lecture 11 Multithreaded Architectures

8 Cores vs 2 Cores

• Are 8 cores too aggressive?
  – Good for server applications, given
    • Lots of threads
    • Scalable operating environment
    • Large memory space (64-bit)
  – Good for power efficiency
    • Simple pipeline design for each core
  – Good for availability
  – Not intended for PCs, gaming, etc.

Page 26: Lecture 11 Multithreaded Architectures

SPECWeb 2005

IBM X346: 3GHz Xeon

T2000: 8 core 1.0GHz T1 Processor

Page 27: Lecture 11 Multithreaded Architectures

Sun Fire T2000 Server

Page 28: Lecture 11 Multithreaded Architectures

Server Pricing

• UltraSPARC
  – Sun Fire T1000 Server
    • 6 core 1.0GHz T1 Processor
    • 2GB memory, 1x 80GB disk
    • List price: $5,745
  – Sun Fire T2000 Server
    • 8 core 1.0GHz T1 Processor
    • 8GB DDR2 memory, 2x 73GB disk
    • List price: $13,395

• X86
  – Sun Fire X2100 Server
    • Dual core AMD Opteron 175
    • 2GB memory, 1x 80GB disk
    • List price: $2,295
  – Sun Fire X4200 Server
    • 2x Dual core AMD Opteron 275
    • 4GB memory, 2x 73GB disk
    • List price: $7,595
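
One way to read these list prices is per core and per hardware thread. The sketch below uses only the prices and core counts on this slide plus the 4 threads per core given earlier for the UltraSPARC T1; treating each Opteron core as one hardware thread is an assumption, and the comparison ignores the differences in memory and disk between configurations.

```python
# List prices and core counts from the slide; threads_per_core for the
# Opteron systems is an assumption (one hardware thread per core).
SERVERS = {
    "Sun Fire T1000": {"price": 5745, "cores": 6, "threads_per_core": 4},
    "Sun Fire T2000": {"price": 13395, "cores": 8, "threads_per_core": 4},
    "Sun Fire X2100": {"price": 2295, "cores": 2, "threads_per_core": 1},
    "Sun Fire X4200": {"price": 7595, "cores": 4, "threads_per_core": 1},
}


def price_per_core(name):
    """List price divided by the number of cores."""
    server = SERVERS[name]
    return server["price"] / server["cores"]


def price_per_thread(name):
    """List price divided by the number of hardware threads."""
    server = SERVERS[name]
    return server["price"] / (server["cores"] * server["threads_per_core"])


if __name__ == "__main__":
    for name in SERVERS:
        print(f"{name}: ${price_per_core(name):,.2f}/core, "
              f"${price_per_thread(name):,.2f}/thread")
```

Per core the T1000 comes to $957.50 and the X2100 to $1,147.50; per hardware thread the T1 systems look much cheaper, although a T1 thread and an Opteron core are not equivalent units of performance.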

