Page 1: Pedro Trancoso - cs.ucy.ac.cy

1

Multiprocessors and Thread-Level Parallelism

Pedro Trancoso

H&P Chapter 4

2

Historical Perspective

[Diagram: one shared MEMORY serving four CPUs (centralized) vs. four CPUs each with a local MEM (distributed)]


3

Historical Perspective (continued)

[Diagram: four memory modules, each shared by a cluster of CPUs]

4

Flynn's Taxonomy

•  SISD (Single Instruction, Single Data)
•  MISD (Multiple Instruction, Single Data)
   –  ???: multiple processors on a single data stream
•  SIMD (Single Instruction, Multiple Data)
   –  Examples: Illiac-IV, CM-2
   –  Simple programming model, low overhead, flexibility, all custom integrated circuits
   –  Multimedia extensions: Intel SSE2, AMD 3DNow!, IBM AltiVec, Sun VIS
•  MIMD (Multiple Instruction, Multiple Data)
   –  Examples: Sun Enterprise 5000, Cray T3D, SGI Origin
   –  Flexible; uses off-the-shelf microprocessors


5

Amdahl's Law

•  Amdahl's Law (FracX: fraction of the original execution time that is enhanced):

   Speedup = 1 / [FracX/SpeedupX + (1 - FracX)]

•  Any sequential portion limits the parallel speedup:
   –  Speedup <= 1 / (1 - FracX)
•  Example: what fraction can remain sequential if we want an 80x speedup from 100 processors? Assume that at any instant either 1 processor or all 100 are in use.

   80 = 1 / [FracX/100 + (1 - FracX)]
   0.8*FracX + 80*(1 - FracX) = 1
   80 - 79.2*FracX = 1
   FracX = 79/79.2 = 0.9975

   –  Only 0.25% of the execution can be sequential!

6

Parallelism within the Processor

•  SIMD instructions
   –  SSE2 (Intel), 3DNow! (AMD), AltiVec (PowerPC)
•  Superscalar
   –  Intel Pentium 4, AMD Athlon, IBM PowerPC, MIPS R10K, Sun UltraSPARC
•  Multithreaded architectures
   –  Threads: POSIX Pthreads, Sun Solaris LWP
   –  Architectures: HEP, Tera, Alewife
•  Simultaneous multithreading
   –  Hyper-Threading (Intel)


7

VLIW Processors

•  Very Long Instruction Word
•  Simple architecture; the scheduling burden falls on the compiler
•  Example: Itanium 2 (Intel)

8

Multithreading

•  Thread-Level Parallelism
•  Multithreading
   –  Fine-grained multithreading
   –  Coarse-grained multithreading
   –  Simultaneous multithreading (SMT)


9

Multithreading (2)

[Diagram: issue slots over time for Superscalar, Coarse MT, Fine MT, and SMT]

10

IBM Power5

•  SMT added in going from POWER4 to POWER5:
   –  Increased associativity of the L1 instruction cache and the instruction address translation buffers
   –  Added per-thread load and store queues
   –  Increased the sizes of the L2 and L3 caches
   –  Added separate instruction prefetch and buffering
   –  Increased the number of virtual registers from 152 to 240
   –  Increased the sizes of several issue queues
•  SMT speedup on SPEC for POWER5 (2 threads per core) ranges from 0.89 to 1.41


11

Chip Multiprocessor (CMP)

•  A multiprocessor on a single chip

•  Research: Piranha, FlexRAM

•  Commercial: IBM Power4, Sun MAJC, AMD Athlon X2, Intel Pentium D

12

Shared-Memory Architectures

•  Centralized shared memory
   –  UMA: Uniform Memory Access
•  Distributed shared memory
   –  NUMA: Non-Uniform Memory Access
   –  COMA: Cache-Only Memory Access
•  Software distributed shared memory


13

Centralized Shared Memory

•  Small-scale bus-based multiprocessors
   –  The interconnection network (the bus) is the bottleneck; limited to 8-16 processors (e.g., Dell PowerEdge 2600)
•  Larger-scale crossbar multiprocessors
   –  Sun Fire 15K (up to 106 UltraSPARC III processors)

14

Distributed Shared Memory

•  Scales well
•  Non-uniform access times
•  Memory mapping schemes: first-touch
•  Example: SGI Origin 3000


15

Distributed Shared Memory – the Memory Directory

•  Memory state
•  Memory location

16

Data Accesses

[Diagram: four CPUs, each with a local memory module; data items J, K, and L reside in different modules, so access latency depends on placement]


17

One Solution … COMA

[Diagram: four CPUs whose local memories together act as caches backing a single virtual memory]

18

COMA

• Kendall Square Research (1986-1994): KSR1

• Research: SICS Simple COMA, Illinois I-ACOMA


19

Software Distributed Shared-Memory

•  The compiler analyzes the code and inserts extra instructions

•  Examples: Shasta, Cashmere, DZOOM

20

Cache Coherency

•  Is the value of A 20 or 15? (After one CPU writes, other caches may still hold a stale copy.)
•  Solution: cache coherence protocols!


21

HW Coherency Solutions

•  Snooping solution (snoopy bus):
   –  Send all requests for data to all processors
   –  Processors snoop to see if they have a copy and respond accordingly
   –  Requires broadcast, since caching information is kept at the processors
   –  Works well with a bus (a natural broadcast medium)
   –  Dominates for small-scale machines (most of the market)
•  Directory-based schemes (discussed later):
   –  Keep track of what is being shared in one (logically) centralized place
   –  Distributed memory => distributed directory for scalability (avoids bottlenecks)
   –  Send point-to-point requests to processors over the network
   –  Scales better than snooping
   –  Actually existed BEFORE snooping-based schemes

22

Basic Snoopy Protocols

•  Write-invalidate protocol:
   –  Multiple readers, single writer
   –  Write to shared data: an invalidate is sent to all caches, which snoop and invalidate any copies
   –  Read miss:
      •  Write-through: memory is always up to date
      •  Write-back: snoop the caches to find the most recent copy
•  Write-broadcast protocol (typically write-through):
   –  Write to shared data: broadcast on the bus; processors snoop and update any copies
   –  Read miss: memory is always up to date
•  Write serialization: the bus serializes requests!
   –  The bus is a single point of arbitration


23

Snoopy Protocol Example

•  Invalidation protocol, write-back caches
•  Each block of memory is in one state:
   –  Clean in all caches and up to date in memory (Shared)
   –  OR dirty in exactly one cache (Exclusive)
   –  OR not in any cache
•  Each cache block is in one state (the cache tracks these):
   –  Shared: the block can be read
   –  OR Exclusive: this cache has the only copy; it is writeable and dirty
   –  OR Invalid: the block contains no data
•  Read misses cause all caches to snoop the bus
•  Writes to a clean line are treated as misses

24

Snooping Cache Variations

Basic protocol: Exclusive, Shared, Invalid

Berkeley protocol: Owned Exclusive, Owned Shared, Shared, Invalid
   –  The owner can update via a bus invalidate operation
   –  The owner must write the block back when it is replaced in the cache

Illinois protocol: Private Dirty, Private Clean, Shared, Invalid
   –  If a read is sourced from memory: Private Clean; if sourced from another cache: Shared
   –  Can write in the cache if the block is held Private Clean or Private Dirty

MESI protocol: Modified (private, != memory), Exclusive (private, = memory), Shared (shared, = memory), Invalid

•  Intel Pentium: MESI
•  AMD Athlon: MOESI


25

The MSI Protocol

[Diagram: MSI state-transition diagram]

26

Example

P1: Write 10 to A1
P1: Read A1
P2: Read A1
P2: Write 20 to A1
P2: Write 40 to A2

[Diagram: Inv/Shr/Exc state-transition diagrams for P1 and P2]


27

Example (continued)

[Diagram: the same access trace, with the P1/P2 Inv/Shr/Exc state diagrams advanced one step]

28

Example (continued)

[Diagram: next step; P2's read of A1 forces P1 to write the block back (Write Back)]


29

Example (continued)

[Diagram: next step of the P1/P2 state diagrams]

30

Example (continued)

[Diagram: final step of the P1/P2 state diagrams]


31

The MESI Protocol

[Diagram: MESI state-transition diagram]

32

Synchronization

•  Why synchronize? We need to know when it is safe for different processes to use shared data.
•  Issues for synchronization:
   –  An uninterruptible instruction to fetch and update memory (an atomic operation)
   –  User-level synchronization operations built on this primitive
   –  For large-scale MPs, synchronization can be a bottleneck; techniques exist to reduce the contention and latency of synchronization


33

Uninterruptible Instruction to Fetch and Update Memory

•  Atomic exchange: interchange a value in a register with a value in memory
   –  0 => the synchronization variable is free
   –  1 => the synchronization variable is locked and unavailable
   –  Set the register to 1 and swap
   –  The new value in the register determines success in getting the lock:
      •  0 if you succeeded in setting the lock (you were first)
      •  1 if another processor had already claimed access
   –  The key is that the exchange operation is indivisible
•  Test-and-set: tests a value and sets it if the value passes the test
•  Fetch-and-increment: returns the value of a memory location and atomically increments it
   –  0 => the synchronization variable is free

34

Uninterruptible Instruction to Fetch and Update Memory (2)

•  Hard to have a read & write in one instruction: use two instead
•  Load linked (or load locked) + store conditional
   –  Load linked returns the initial value
   –  Store conditional returns 1 if it succeeds (no other store to the same memory location since the preceding load) and 0 otherwise
•  Example: atomic swap with LL & SC:

   try:    mov  R3,R4     ; move exchange value
           ll   R2,0(R1)  ; load linked
           sc   R3,0(R1)  ; store conditional
           beqz R3,try    ; branch if store fails (R3 = 0)
           mov  R4,R2     ; put loaded value in R4

•  Example: fetch & increment with LL & SC:

   try:    ll   R2,0(R1)  ; load linked
           addi R2,R2,#1  ; increment (OK if reg-reg)
           sc   R2,0(R1)  ; store conditional
           beqz R2,try    ; branch if store fails (R2 = 0)


35

User-Level Synchronization

•  Spin locks: the processor continuously tries to acquire the lock, spinning around a loop:

           li   R2,#1
   lockit: exch R2,0(R1)  ; atomic exchange
           bnez R2,lockit ; already locked?

•  What about an MP with cache coherency?
   –  We want to spin on a cached copy, to avoid the full memory latency
   –  Likely to get cache hits for such variables
•  Problem: the exchange includes a write, which invalidates all other copies; this generates considerable bus traffic
•  Solution: start by simply repeatedly reading the variable; when it changes, then try the exchange ("test and test&set"):

   try:    li   R2,#1
   lockit: lw   R3,0(R1)  ; load the variable
           bnez R3,lockit ; not free => spin
           exch R2,0(R1)  ; atomic exchange
           bnez R2,try    ; already locked?

36

Memory Consistency Models

•  What is consistency? When must a processor see a new value? For example:

   P1: A = 0;              P2: B = 0;
       .....                   .....
       A = 1;                  B = 1;
   L1: if (B == 0) ...     L2: if (A == 0) ...

•  Is it impossible for both if statements L1 & L2 to be true?
   –  What if the write invalidate is delayed and the processor continues?


37

Memory Consistency Models (2)

•  Sequential consistency: the result of any execution is the same as if the accesses of each processor were kept in order and the accesses among different processors were interleaved => the assignments complete before the ifs above
   –  SC: delay all memory accesses until all invalidates are done

38

Memory Consistency Models (3)

•  Schemes offer faster execution than sequential consistency
•  Not really an issue for most programs, because they are synchronized
   –  A program is synchronized if all accesses to shared data are ordered by synchronization operations:

      write (x)
      ...
      release (s) {unlock}
      ...
      acquire (s) {lock}
      ...
      read (x)

•  Only programs willing to be nondeterministic are left unsynchronized: a "data race", where the outcome is a function of processor speed
•  Relaxed models: several relaxed models of memory consistency exist, since most programs are synchronized; each is characterized by its attitude towards RAR, WAR, RAW, and WAW orderings to different addresses


39

Message-Passing Architectures

•  Scalable:
   –  No shared-memory bottleneck!
   –  Large problems: bioinformatics, environmental, meteorological, engineering, physics, etc.
•  Harder to program:
   –  Message-passing libraries: PVM, MPI
•  Example:
   –  First on the TOP500 Supercomputer Sites list: the Earth Simulator, 40 TFLOPS (a 3.06 GHz Pentium 4 delivers roughly 6 GFLOPS)

40

Massively Parallel Processors (MPP)

•  39% of the TOP500
•  Custom-designed interconnection network
•  Large number of processing elements
   –  Custom-designed:
      •  NEC Earth Simulator: 640 nodes, each an 8-way vector SMP, crossbar, 40 TFLOPS, US$350 million
      •  Cray T3E
   –  Commodity:
      •  ASCI White (IBM SP2): 8192 IBM RS6000 processors, 6 TB RAM, 160 TB disk, 12.3 TFLOPS


41

Clusters

•  Processing nodes and network are commodity-off-the-shelf (COTS) components
•  Networks: Ethernet, Fast Ethernet, Gigabit Ethernet, Quadrics, Myrinet, InfiniBand, …
•  TOP500:
   –  20% are clusters
   –  3rd place: "Big Mac", 1100 dual-G5 Apples, 17 TFLOPS, US$5.2 million (Virginia Tech)
   –  5th place: 2304 Intel Xeon 2.4 GHz, 4.6 TB RAM, 138 TB disk, 11.2 TFLOPS
•  Examples:
   –  Networks of Workstations (NOW)
   –  Beowulf: Linux clusters
   –  Apple Xserve, Sun LX50, IBM eServer xSeries 335

42

•  Cluster technology:
   –  More than 10,000 processors
   –  More than 3 billion pages; 150 million searches/day
   –  Heterogeneous: Intel Celeron + Pentium III
   –  Network: 100 Mbps and 1 Gbps Ethernet


43

Constellations

•  An “extension” of the cluster: large systems serve as the computing nodes
•  Example:
   –  ASCI Blue Mountain: 48 nodes, each an SGI Origin 3000 SMP with 128 processors, HiPPI network, 3.1 TFLOPS

44

Grid

•  Large-scale applications
•  Diversity of resources
•  Distributed data
•  Cooperation between users
•  Source: Introduction to Grid Computing, The Globus Project™, Argonne National Laboratory and USC Information Sciences Institute, http://www.globus.org/


45

Online Access to Scientific Instruments

[Diagram: DOE X-ray grand challenge (ANL, USC/ISI, NIST, U.Chicago). The Advanced Photon Source feeds real-time collection and tomographic reconstruction, with archival storage, wide-area dissemination, and desktop & VR clients with shared controls]

46

Data Grids for High Energy Physics (image courtesy Harvey Newman, Caltech)

[Diagram: the LHC tiered data grid. The online system feeds the CERN Computer Centre (Tier 0) at ~100 MB/s, alongside an offline processor farm of ~20 TIPS and a physics data cache (~PB/s). Tier 1 regional centres (FermiLab ~4 TIPS; France, Italy, and Germany regional centres; Caltech ~1 TIPS) connect at ~622 Mbit/s, or by air freight (deprecated). Tier 2 centres of ~1 TIPS each connect at ~622 Mbit/s; institutes (~0.25 TIPS) and physicist workstations (Tier 4) connect at ~1 MB/s]

There is a “bunch crossing” every 25 nsecs and 100 “triggers” per second; each triggered event is ~1 MByte in size.

Physicists work on analysis “channels”. Each institute will have ~10 physicists working on one or more channels; data for these channels should be cached by the institute server.

1 TIPS is approximately 25,000 SpecInt95 equivalents.


47

Home Computers Evaluate AIDS Drugs

•  Community =
   –  1000s of home computer users
   –  A philanthropic computing vendor (Entropia)
   –  A research group (Scripps)
•  Common goal = advance AIDS research

