www.csiro.au
Multiprocessor OS COMP9242 – Advanced Operating Systems
Ihor Kuz | [email protected] S2/2015 Week 12
Overview
• Multiprocessor OS • How does it work? • Scalability (Review)
• Multiprocessor Hardware • Contemporary systems (Intel, AMD, ARM, Oracle/Sun) • Experimental and Future systems (Intel, MS, Polaris)
• OS Design for Multiprocessors • Guidelines • Design approaches – Divide and Conquer (Disco, Tessellation) – Reduce Sharing (K42, Corey, Linux, FlexSC, scalable commutativity) – No Sharing (Barrelfish, fos)
COMP9242 S2/2015 W12 2
Multiprocessor OS
COMP9242 S2/2015 W12 3
Uniprocessor OS
COMP9242 S2/2015 W12 4
[Figure: a single CPU alternating between the OS and applications (App1, App2); memory holds OS data (run queue, FS structs, process control blocks) and application data (App1-App4).]
Multiprocessor OS
COMP9242 S2/2015 W12 5
[Figure: four CPUs, each running the OS and an application (App1, App3, App4, App4), all sharing one memory that holds OS data (run queue, FS structs, process control blocks) and application data (App1-App4).]
Multiprocessor OS
• Key design challenges: • Correctness of (shared) data structures
• Scalability
COMP9242 S2/2015 W12 6
[Figure: the same four-CPU system sharing OS data and application data in memory, with the key design challenges called out: correctness of (shared) data structures, and scalability.]
Correctness of Shared Data
• Concurrency control • Locks • Semaphores • Transactions • Lock-free data structures
• We know how to do this: • In the application • In the OS
COMP9242 S2/2015 W12 7
Scalability: speedup as more processors are added
COMP9242 S2/2015 W12 8
[Graph: ideal scalability, with speedup S(N) = T1 / TN growing linearly with the number of processors n.]
Scalability: speedup as more processors are added
COMP9242 S2/2015 W12 9
[Graph: real scalability, where speedup S(N) = T1 / TN flattens out as processors are added.]
Scalability and Serialisation Remember Amdahl's law • Serial (non-parallel) portion: when the application is not running on all cores • Serialisation prevents scalability (worked example below)
COMP9242 S2/2015 W12 10 From http://en.wikipedia.org/wiki/File:AmdahlsLaw.svg
T1 = 1 = (1 − P) + P
TN = (1 − P) + P/N
S(N) = T1 / TN = 1 / ((1 − P) + P/N)
S(∞) → 1 / (1 − P)
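As a quick illustration (not from the original slides; the 95% parallel fraction is an assumed value), plugging numbers into Amdahl's law shows how fast the serial portion starts to dominate:

#include <stdio.h>

/* Hedged illustration of Amdahl's law with an assumed parallel fraction P.
 * S(N) = 1 / ((1 - P) + P/N); the speedup is capped at 1/(1 - P). */
int main(void)
{
    double P = 0.95;                       /* assumed: 95% of the work is parallel */
    int cores[] = { 1, 4, 16, 64, 1024 };

    for (int i = 0; i < 5; i++) {
        double S = 1.0 / ((1.0 - P) + P / cores[i]);
        printf("N = %4d cores -> speedup %.1f\n", cores[i], S);
    }
    /* Even with 1024 cores the speedup stays below 1/(1 - P) = 20. */
    return 0;
}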
Serialisation
Where does serialisation show up? • Application (e.g. access shared app data) • OS (e.g. performing syscall for app) How much time is spent in the OS?
Sources of Serialisation: • Locking (explicit serialisation, spinlock sketch below) – Waiting for a lock → stalls self – Lock implementation: – Atomic operations lock the bus → stalls everyone – Cache coherence traffic loads the bus → slows down others
Memory access (implicit) • Relatively high latency to memory → stalls self
Cache (implicit) • Processor stalled while cache line is fetched or invalidated • Affected by latency of interconnect • Performance depends on data size (cache lines) and contention (number of cores)
COMP9242 S2/2015 W12 11
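A minimal sketch of the locking cost described above (assuming C11 atomics; illustrative, not code from the lecture). The atomic exchange is the expensive part: every failed attempt generates coherence traffic, which the test-and-test-and-set pattern limits by spinning on a plain read:

#include <stdatomic.h>

typedef struct { atomic_int locked; } spinlock_t;

static void lock(spinlock_t *l)
{
    for (;;) {
        /* Spin on a plain load: the line stays shared in our cache. */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed))
            ;
        /* Only then attempt the (expensive) atomic read-modify-write. */
        if (!atomic_exchange_explicit(&l->locked, 1, memory_order_acquire))
            return;
    }
}

static void unlock(spinlock_t *l)
{
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}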
More Cache-related Serialisation
False sharing • Unrelated data structs share the same cache line • Accessed from different processors → cache coherence traffic and delay (padding sketch below)
Cache line bouncing • Shared R/W on many processors • E.g.: bouncing due to locks: each processor spinning on a lock brings it into its own cache
→ Cache coherence traffic and delay Cache misses • Potentially direct memory access → stalls self • When does a cache miss occur? – Application accesses data for the first time, application runs on a new core – Cached memory has been evicted – Cache footprint too big, another app ran, OS ran
COMP9242 S2/2015 W12 12
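A small sketch of false sharing and the padding fix (illustrative only; the 64-byte line size is an assumption about the target CPU):

/* Two per-CPU counters updated by different cores.
 * Unpadded: both counters share one cache line, so every increment on one
 * core invalidates the line in the other core's cache (false sharing). */
struct counters_bad {
    long cpu0_count;
    long cpu1_count;        /* same cache line as cpu0_count */
};

/* Padded: each counter gets its own cache line, so updates stay local. */
#define CACHE_LINE 64
struct counters_good {
    long cpu0_count;
    char pad0[CACHE_LINE - sizeof(long)];
    long cpu1_count;
    char pad1[CACHE_LINE - sizeof(long)];
};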
Multiprocessor Hardware
COMP9242 S2/2015 W12 13
Multi-What?
• Multiprocessor, SMP • >1 separate processors, connected by off-chip bus
• Multicore • >1 processing cores in a single processor, connected by on-chip bus
• Multithread, SMT • >1 hardware threads in a single core
• Multicore + Multiprocessor • >1 multicore processors • >1 multicore dies in a package (multi-chip module)
COMP9242 S2/2015 W12 14
Interesting Properties of Multiprocessors
• Scale and Structure • How many cores and processors are there • What kinds of cores and processors are there • How are they organised
• Interconnect • How are the cores and processors connected
• Memory Locality and Caches • Where is the memory • What is the cache architecture
• Interprocessor Communication • How do cores and processors send messages to each other
COMP9242 S2/2015 W12 15
Contemporary Multiprocessor Hardware • Intel: • Nehalem, Westmere: 10 core, QPI • Sandy Bridge, Ivy Bridge: – 5 core, ring bus, integrated GPU, L3, IO
• Haswell (Broadwell): – 18 core, ring bus, transactional memory, slices (EP)
• AMD: • K10 (Opteron: Barcelona, Magny Cours) – 12 core, Hypertransport
• Bulldozer, Piledriver, Steamroller (Opteron, FX) – 16 core, Clustered Multithread: module with 2 integer cores
• Oracle (Sun) UltraSPARC T1, T2, T3, T4, T5 (Niagara) • 16 cores, 8 threads/core (2 simultaneous), crossbar, 8 sockets
• ARM Cortex A9, A15 MPCore, big.LITTLE • 4-8 cores, big.LITTLE: A7 + A15
COMP9242 S2/2015 W12 16
Scale and Structure • ARM Cortex A9 MPCore
COMP9242 S2/2015 W12 17 From http://www.arm.com/images/Cortex-A9-MP-core_Big.gif
Scale and Structure
• ARM big.LITTLE
COMP9242 S2/2015 W12 18 From http://www.arm.com/images/Fig_1_Cortex-A15_CCI_Cortex-A7_System.jpg
Scale and Structure
• Intel Nehalem
COMP9242 S2/2015 W12 19 From www.dawnofthered.net/wp-content/uploads/2011/02/Nehalem-EX-architecture-detailed.jpg
Interconnect
• AMD Barcelona
COMP9242 S2/2015 W12 20 From www.sigops.org/sosp/sosp09/slides/baumann-slides-sosp09.pdf
Memory Locality and Caches
COMP9242 S2/2015 W12 21 From www.systems.ethz.ch/education/past-courses/fall-2010/aos/lectures/wk10-multicore.pdf
Interprocessor Communication • Oracle SPARC T2
COMP9242 S2/2015 W12 22
[Figure (from Sun/Oracle): UltraSPARC throughput-computing timeline and block diagrams. Timeline 2004-2008: UltraSPARC IIIi (1x) → UltraSPARC T1 (eight cores, 32 threads, 14x) → UltraSPARC T2 (eight cores, 64 threads, 35x) → "Victoria Falls" (16 cores, 128 threads over two sockets, 65x). T2 block diagram: eight cores C0-C7, each with FPU and SPU, eight L2 cache banks, a full crossbar, four memory controllers (MCUs) to FB-DIMM memory, a network interface unit (2x 10 Gigabit Ethernet), PCIe, and a buffer switch core; roughly 8 cores at 2.0 GHz under 95-100 W.]
Interprocessor Communication
COMP9242 S2/2015 W12 23 From http://www.anandtech.com/show/8423/intel-xeon-e5-version-3-up-to-18-haswell-ep-cores-/4
Interprocessor Communication/Structure/Memory
COMP9242 S2/2015 W12 24 From http://www.anandtech.com/show/8423/intel-xeon-e5-version-3-up-to-18-haswell-ep-cores-/4
Experimental/Future Multiprocessor Hardware
• Microsoft Beehive • Ring bus, no cache coherence
• Tilera Tile64, Tile-Gx • 100 cores, mesh network
• Intel Polaris • 80 cores, mesh network
• Intel SCC • 48 cores, mesh network, no cache coherency
• Intel MIC (Many Integrated Core) (Knights Corner - Xeon Phi) • 60+ cores, ring bus
COMP9242 S2/2015 W12 25
Scale and Structure • Tilera Tile64 (newest: EZchip's TILE-Gx), Intel Polaris
COMP9242 S2/2015 W12 26
[Figure (from www.tilera.com/products/processors/TILE64): TILE64 block diagram. An 8x8 mesh of tiles; each tile contains a processor (register file, three pipelines P0-P2), L1 instruction and data caches with I-TLB/D-TLB, an L2 cache, a 2D DMA engine, and a switch connecting the five mesh networks (MDN, TDN, UDN, IDN, STN). Around the edge: four DDR2 controllers, two PCIe and one XAUI MAC/PHY with SerDes, two GbE interfaces, flexible I/O, and UART/HPI/I2C/JTAG/SPI.]
Cache and Memory
• Intel SCC
COMP9242 S2/2015 W12 27 From techresearch.intel.com/spaw2/uploads/files/SCC_Platform_Overview.pdf
Interprocessor Communica$on
• Beehive
COMP9242 S2/2015 W12 28 From projects.csail.mit.edu/beehive/BeehiveV5.pdf
Interprocessor Communication • Intel MIC (Many Integrated Core) (Knights Corner/Landing - Xeon Phi)
COMP9242 S2/2015 W12 29 From http://semiaccurate.com/2012/08/28/intel-details-knights-corner-architecture-at-long-last/
Summary • Scalability • 100+ cores • Amdahl’s law really kicks in
• Heterogeneity • Heterogeneous cores, memory, etc. • Properties of similar systems may vary wildly (e.g. interconnect topology and latencies between different AMD platforms)
• NUMA • Also variable latencies due to topology and cache coherence
• Cache coherence may not be possible • Can’t use it for locking • Shared data structures require explicit work
• Computer is a distributed system • Message passing • Consistency and Synchronisation • Fault tolerance
COMP9242 S2/2015 W12 30
OS Design for Multiprocessors
COMP9242 S2/2015 W12 31
Optimisation for Scalability
• Reduce amount of code in critical sections • Increases concurrency • Fine grained locking – Lock data not code – Tradeoff: more concurrency but more locking (and locking causes serialisation)
• Lock-free data structures (sketch below) • Avoid expensive memory access • Avoid uncached memory • Access cheap (close) memory
COMP9242 S2/2015 W12 32
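A minimal sketch of a lock-free data structure (a Treiber-style stack push, assuming C11 atomics; pop is omitted because it needs ABA protection). No thread ever holds a lock; a failed compare-and-swap simply retries:

#include <stdatomic.h>
#include <stddef.h>

struct node { int value; struct node *next; };
static _Atomic(struct node *) top = NULL;

static void push(struct node *n)
{
    struct node *old = atomic_load(&top);
    do {
        n->next = old;                           /* link to the current top */
    } while (!atomic_compare_exchange_weak(&top, &old, n));
}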
Optimisation for Scalability
• Reduce false sharing • Pad data structures to cache lines
• Reduce cache line bouncing • Reduce sharing • E.g.: MCS locks use local data (sketch below)
• Reduce cache misses • Affinity scheduling: run process on the core where it last ran • Avoid cache pollution
COMP9242 S2/2015 W12 33
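A sketch of the MCS queue lock mentioned above (following the classic Mellor-Crummey/Scott design, assuming C11 atomics; not code from the slides). Each waiter spins on a flag in its own queue node, so the spin traffic stays in that core's cache instead of bouncing a shared lock line:

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct mcs_node {
    struct mcs_node *_Atomic next;
    atomic_bool locked;
} mcs_node_t;

typedef struct {
    mcs_node_t *_Atomic tail;
} mcs_lock_t;

static void mcs_lock(mcs_lock_t *l, mcs_node_t *me)
{
    me->next = NULL;
    atomic_store(&me->locked, true);
    /* Append our node to the queue; the previous tail is our predecessor. */
    mcs_node_t *prev = atomic_exchange(&l->tail, me);
    if (prev == NULL)
        return;                                  /* lock was free */
    atomic_store(&prev->next, me);
    while (atomic_load(&me->locked))
        ;                                        /* spin on our own node only */
}

static void mcs_unlock(mcs_lock_t *l, mcs_node_t *me)
{
    mcs_node_t *succ = atomic_load(&me->next);
    if (succ == NULL) {
        mcs_node_t *expected = me;
        /* No known successor: try to swing the tail back to empty. */
        if (atomic_compare_exchange_strong(&l->tail, &expected, (mcs_node_t *)NULL))
            return;
        while ((succ = atomic_load(&me->next)) == NULL)
            ;                                    /* successor is still linking in */
    }
    atomic_store(&succ->locked, false);          /* hand the lock to the successor */
}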
OS Design Guidelines for Modern (and Future) Multiprocessors • Avoid shared data • Performance issues arise less from lock contention than from poor data locality
• Explicit communication • Regain control over communication costs (and predictability) • Sometimes it's the only option
• Tradeoff: parallelism vs synchronisation • Synchronisation introduces serialisation • Make concurrent threads independent: reduce critical sections & cache misses
• Allocate for locality • E.g. provide memory local to a core
• Schedule for locality • With cached data • With local memory
• Tradeoff: uniprocessor performance vs scalability
COMP9242 S2/2015 W12 34
Design approaches
• Divide and conquer • Divide the multiprocessor into smaller bits, use them as normal • Using virtualisation • Using exokernel
• Reduced sharing • Brute force & Heroic Effort – Find problems in existing OS and fix them – E.g. Linux rearchitecting: BKL → fine grained locking
• By design – Avoid shared data as much as possible
• No sharing • Computer is a distributed system – Do extra work to share!
COMP9242 S2/2015 W12 35
Divide and Conquer
Disco • Scalability is too hard!
• Context: • ca. 1995, large ccNUMA multiprocessors appearing • Scaling OSes requires extensive modifications
• Idea: • Implement a scalable VMM • Run multiple OS instances
• VMM has most of the features of a scalable OS: • NUMA aware allocator • Page replication, remapping, etc.
• VMM substantially simpler/cheaper to implement • Modern incarnations of this • Virtual servers (Amazon, etc.) • Research (Cerberus)
COMP9242 S2/2015 W12 36 Running commodity OSes on scalable multiprocessors [Bugnion et al., 1997] http://www-flash.stanford.edu/Disco/
Disco Architecture
COMP9242 S2/2015 W12 37
Disco Performance
COMP9242 S2/2015 W12 38
Space-Time Partitioning
Tessellation • Space-time partitioning • 2-level scheduling
• Context: • 2009-... highly parallel multicore systems • Berkeley Par Lab
COMP9242 S2/2015 W12 39 Tessellation: Space-Time Partitioning in a Manycore Client OS [Liu et al., 2010] http://tessellation.cs.berkeley.edu/
Tessellation
COMP9242 S2/2015 W12 40
Reduce Sharing K42 • Context: • 1997-2006: OS for ccNUMA systems • IBM, U Toronto (Tornado, Hurricane)
• Goals: • High locality • Scalability
• Object Oriented • Fine grained objects
• Clustered (Distributed) Objects • Data locality
• Deferred deletion (RCU) • Avoid locking
• NUMA aware memory allocator • Memory locality
COMP9242 S2/2015 W12 41 Clustered Objects, Ph.D. thesis [Appavoo, 2005] http://www.research.ibm.com/K42/
K42: Fine-grained objects
COMP9242 S2/2015 W12 42
K42: Clustered objects • Globally valid object reference
• Resolves to • Processor local representative
• Sharing, locking strategy local to each object
• Transparency • Eases complexity • Controlled introduction of locality
• Shared counter: • inc, dec: local access • val: communication (sketch below)
• Fast path: • Access mostly local structures
COMP9242 S2/2015 W12 43
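A rough sketch of the clustered-counter idea (illustrative only; the CPU count, padding, and names are assumptions, not K42 code). inc and dec touch only the local representative; val must visit every representative, which is why it is the expensive, rarely used operation:

#define NCPU       16
#define CACHE_LINE 64

struct rep { long count; char pad[CACHE_LINE - sizeof(long)]; };
static struct rep reps[NCPU];            /* one representative per CPU */

static void inc(int cpu) { reps[cpu].count++; }   /* local, no sharing */
static void dec(int cpu) { reps[cpu].count--; }

static long val(void)                    /* rare, expensive operation */
{
    long sum = 0;
    for (int i = 0; i < NCPU; i++)
        sum += reps[i].count;            /* touches every CPU's cache line */
    return sum;
}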
K42 Performance
COMP9242 S2/2015 W12 44
[Graph: K42 performance compared with Linux 2.4.19.]
Corey • Context • 2008, high-end multicore servers, MIT
• Goals: • Application control of OS sharing
• OS • Exokernel-like, higher-level services as libraries • By default only single core access to OS data structures • Calls to control how data structures are shared
• Address Ranges • Control private per core and shared address spaces
• Kernel Cores • Dedicate cores to run specific kernel functions
• Shares • Lookup tables for kernel objects allow control over which object identifiers are visible to other cores.
COMP9242 S2/2015 W12 45 Corey: An Operating System for Many Cores [Boyd-Wickizer et al., 2008]
http://pdos.csail.mit.edu/corey
Linux Brute Force Scalability
• Context • 2010, high-end multicore servers, MIT
• Goals: • Scaling commodity OS
• Linux scalability (2010 – scale Linux to 48 cores)
COMP9242 S2/2015 W12 46 An Analysis of Linux Scalability to Many Cores [Boyd-Wickizer et al., 2010]
Linux Brute Force Scalability • Apply lessons from parallel computing and past research • sloppy counters (sketch below), • per-core data structs, • fine-grained locks, lock free, • cache lines • 3002 lines of code changed
• Conclusion: • no scalability reason to give up on traditional operating system organizations just yet.
COMP9242 S2/2015 W12 47
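A rough sketch of the sloppy-counter idea from that work (the threshold, layout, and names here are assumptions): each core mostly updates a private, cache-line-padded count and only touches the shared central counter when its local balance drifts past a threshold:

#include <stdatomic.h>

#define NCORES     48
#define THRESHOLD  8
#define CACHE_LINE 64

struct sloppy {
    atomic_long central;                              /* shared, rarely touched */
    struct { long local; char pad[CACHE_LINE - sizeof(long)]; } per[NCORES];
};

static void sloppy_get(struct sloppy *c, int core)
{
    if (++c->per[core].local > THRESHOLD) {           /* spill to central */
        atomic_fetch_add(&c->central, c->per[core].local);
        c->per[core].local = 0;
    }
}

static void sloppy_put(struct sloppy *c, int core)
{
    if (--c->per[core].local < -THRESHOLD) {
        atomic_fetch_add(&c->central, c->per[core].local);
        c->per[core].local = 0;
    }
}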
Scalability of the API
• Context • 2013, previous multicore projects at MIT
• Goals • How to know if a system is really scalable?
• Workload-based evaluation • Run workload, plot scalability, fix problems • Did we miss any non-scalable workload? • Did we find all bottlenecks?
• Is there something fundamental that makes a system non-scalable? • The interface might be a fundamental bottleneck
COMP9242 S2/2015 W12 48 The Scalable Commutativity Rule: Designing Scalable Software for Multicore Processors [Clements et al., 2013]
Scalable Commutativity Rule • The Rule • Whenever interface operations commute, they can be implemented in a way that scales.
• Commutative operations: • Cannot distinguish order of operations from results • Example: – creat(): – Requires that the lowest available FD be returned – Not commutative: can tell which one was run first (sketch below)
• Why are commutative operations scalable? • results independent of order ⇒ communication is unnecessary • without communication, no conflicts
• Informs software design process • Design: design guideline for scalable interfaces • Implementation: clear target • Test: workload-independent testing
COMP9242 S2/2015 W12 49
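A hypothetical sketch of the point above (names and numbers are invented): if the interface only promises *some* unused descriptor rather than the lowest one, concurrent open() calls commute, and each core can hand out descriptors without communicating:

/* Hypothetical per-core descriptor allocator, legal only under a commutative
 * spec ("return any unused fd"), not under POSIX's "lowest available fd". */
#define NCORES       48
#define FDS_PER_CORE 1024

static int next_fd[NCORES];              /* per-core cursor, never shared */

static int alloc_fd(int core)
{
    /* No locks, no shared cache lines: the result depends only on this core. */
    return core * FDS_PER_CORE + next_fd[core]++;
}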
Commuter: An Automated Scalability Testing Tool
COMP9242 S2/2015 W12 50
[Figure: Commuter conflict heatmaps for Linux and the sv6 research kernel.]
FlexSC • Context: • 2010, commodity multicores • U Toronto
• Goal: • Reduce context switch overhead of system calls
• Syscall context switch: • Usual mode switch overhead • But: cache and TLB pollution!
COMP9242 S2/2015 W12 51 FlexSC: Flexible System Call Scheduling with Exception-Less System Calls [Soares and Stumm., 2010]
FlexSC
• Asynchronous system calls • Batch system calls • Run them on dedicated cores (sketch below)
• FlexSC-Threads • M on N • M >> N
COMP9242 S2/2015 W12 52
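A conceptual sketch of exception-less system calls (the structure and field names are assumptions for illustration, not FlexSC's actual ABI): user threads post requests into a shared syscall page and keep running; a dedicated syscall core picks up the batched entries, so the application core never pays the mode-switch and cache cost of a trap:

#include <stdatomic.h>

enum entry_state { FREE = 0, SUBMITTED, DONE };

struct syscall_entry {
    atomic_int state;          /* FREE -> SUBMITTED (app) -> DONE (kernel) */
    int        number;         /* system call number */
    long       args[6];
    long       ret;
};

/* Page shared between the application core and a dedicated syscall core. */
static struct syscall_entry syscall_page[64];

/* Application side: post the request and run another user-level thread
 * (FlexSC-Threads multiplexes M user threads on N kernel-visible threads). */
static void post_syscall(struct syscall_entry *e, int nr, long arg0)
{
    e->number  = nr;
    e->args[0] = arg0;
    atomic_store_explicit(&e->state, SUBMITTED, memory_order_release);
    /* ... scheduler switches user threads until e->state == DONE ... */
}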
FlexSC Results
COMP9242 S2/2015 W12 53
[Graph: Apache throughput with FlexSC: system call batching and redirection to dedicated syscall cores.]
No sharing
• Multikernel • Barrelfish • fos: factored operating system
COMP9242 S2/2015 W12 54 The Multikernel: A new OS architecture for scalable multicore systems [Baumann et al., 2009] http://www.barrelfish.org/
Barrelfish
• Context: • 2007, large multicore machines appearing • 100s of cores on the horizon • NUMA (cc and non-cc) • ETH Zurich and Microsoft
• Goals: • Scale to many cores • Support and manage heterogeneous hardware
• Approach: • Structure OS as distributed system
• Design principles: • Interprocessor communication is explicit • OS structure hardware neutral • State is replicated
• Microkernel • Similar to seL4: capabilities
COMP9242 S2/2015 W12 55 The Multikernel: A new OS architecture for scalable multicore systems [Baumann et al., 2009] http://www.barrelfish.org/
Barrelfish
COMP9242 S2/2015 W12 56
Barrelfish: Replication
• Kernel + Monitor: • Only memory shared for message channels
• Monitor: • Collectively coordinate system-wide state
• System-wide state: • Memory allocation tables • Address space mappings • Capability lists
• What state is replicated in Barrelfish • Capability lists
• Consistency and Coordination • Retype: two-phase commit to globally execute operation in order • Page (re/un)mapping: one-phase commit to synchronise TLBs
COMP9242 S2/2015 W12 57
Barrelfish: Communication • Different mechanisms: • Intra-core – Kernel endpoints
• Inter-core – URPC
• URPC • Uses cache coherence + polling • Shared buffer (sketch below) – Sender writes a cache line – Receiver polls on the cache line – (last word written last, so no partial message is seen)
• Polling? – Cache only changes when sender writes, so poll is cheap
– Switch to block and IPI if wait is too long.
COMP9242 S2/2015 W12 58
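A sketch of the URPC channel described above (message layout and names are assumptions; the real Barrelfish channel differs in detail). The payload words are written first and the sequence word last, so the receiver, polling on its locally cached copy of the line, never sees a partial message:

#include <stdatomic.h>
#include <stdint.h>

#define CACHE_LINE 64

struct urpc_msg {
    uint64_t payload[7];
    _Atomic uint64_t seq;          /* written last: message-complete flag */
} __attribute__((aligned(CACHE_LINE)));

static void urpc_send(struct urpc_msg *ch, const uint64_t *words, uint64_t seq)
{
    for (int i = 0; i < 7; i++)
        ch->payload[i] = words[i];
    atomic_store_explicit(&ch->seq, seq, memory_order_release);
}

static void urpc_recv(struct urpc_msg *ch, uint64_t *words, uint64_t seq)
{
    /* Cheap poll: the line only changes when the sender writes it. */
    while (atomic_load_explicit(&ch->seq, memory_order_acquire) != seq)
        ;
    for (int i = 0; i < 7; i++)
        words[i] = ch->payload[i];
}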
Barrelfish: Results • Message passing vs caching
COMP9242 S2/2015 W12 59
[Graph: update latency (cycles × 1000) vs number of cores, comparing shared-memory updates of 1, 2, 4 and 8 cache lines (SHM1-SHM8) with message passing to a server core (MSG1, MSG8).]
Barrelfish: Results • Broadcast vs Multicast
COMP9242 S2/2015 W12 60
[Graph: latency (cycles × 1000) vs number of cores (2-32) for broadcast, unicast, multicast, and NUMA-aware multicast.]
Barrelfish: Results • TLB shootdown
COMP9242 S2/2015 W12 61
[Graph: TLB shootdown latency (cycles × 1000) vs number of cores (2-32) for Windows, Linux, and Barrelfish.]
Summary
COMP9242 S2/2015 W12 62
Summary • Trends in multicore • Scale (100+ cores) • NUMA • No cache coherence • Distributed system • Heterogeneity
• OS design guidelines • Avoid shared data • Explicit communication • Locality
• Approaches to multicore OS • Partition the machine (Disco, Tessellation) • Reduce sharing (K42, Corey, Linux, FlexSC, scalable commutativity) • No sharing (Barrelfish, fos)
COMP9242 S2/2015 W12 63