www.csiro.au
Multiprocessor OS COMP9242 – Advanced Operating Systems
Ihor Kuz | [email protected] S2/2015 Week 12
Overview
• Multiprocessor OS • How does it work? • Scalability (Review)
• Multiprocessor Hardware • Contemporary systems (Intel, AMD, ARM, Oracle/Sun) • Experimental and Future systems (Intel, MS, Polaris)
• OS Design for Multiprocessors • Guidelines • Design approaches – Divide and Conquer (Disco, Tessellation) – Reduce Sharing (K42, Corey, Linux, FlexSC, scalable commutativity) – No Sharing (Barrelfish, fos)
COMP9242 S2/2015 W12 2
Multiprocessor OS
COMP9242 S2/2015 W12 3
Uniprocessor OS
COMP9242 S2/2015 W12 4
[Figure: a single CPU alternating between the OS and applications (App1, App2); memory holds OS data (run queue, FS structs, process control blocks) and application data (App1-App4).]
Multiprocessor OS
COMP9242 S2/2015 W12 5
[Figure: four CPUs, each running the OS and an application (App1, App3, App4, App4), all sharing one memory that holds OS data (run queue, FS structs, process control blocks) and application data (App1-App4).]
Multiprocessor OS
• Key design challenges: • Correctness of (shared) data structures
• Scalability
COMP9242 S2/2015 W12 6
[Figure: the same four-CPU system sharing OS data and application data in memory, with the key design challenges called out: correctness of (shared) data structures, and scalability.]
Correctness of Shared Data
• Concurrency control • Locks • Semaphores • Transactions • Lock-free data structures
• We know how to do this: • In the application • In the OS
COMP9242 S2/2015 W12 7
Scalability: speedup as more processors are added
COMP9242 S2/2015 W12 8
[Graph: ideal scalability, with speedup S(N) = T1 / TN growing linearly with the number of processors n.]
Scalability: speedup as more processors are added
COMP9242 S2/2015 W12 9
[Graph: real scalability, where speedup S(N) = T1 / TN flattens out as processors are added.]
Scalability and Serialisation Remember Amdahl's law • Serial (non-parallel) portion: when the application is not running on all cores • Serialisation prevents scalability (worked example below)
COMP9242 S2/2015 W12 10 From http://en.wikipedia.org/wiki/File:AmdahlsLaw.svg
T1 = 1 = (1 − P) + P
TN = (1 − P) + P/N
S(N) = T1 / TN = 1 / ((1 − P) + P/N)
S(∞) → 1 / (1 − P)
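As a quick illustration (not from the original slides; the 95% parallel fraction is an assumed value), plugging numbers into Amdahl's law shows how fast the serial portion starts to dominate:

#include <stdio.h>

/* Hedged illustration of Amdahl's law with an assumed parallel fraction P.
 * S(N) = 1 / ((1 - P) + P/N); the speedup is capped at 1/(1 - P). */
int main(void)
{
    double P = 0.95;                       /* assumed: 95% of the work is parallel */
    int cores[] = { 1, 4, 16, 64, 1024 };

    for (int i = 0; i < 5; i++) {
        double S = 1.0 / ((1.0 - P) + P / cores[i]);
        printf("N = %4d cores -> speedup %.1f\n", cores[i], S);
    }
    /* Even with 1024 cores the speedup stays below 1/(1 - P) = 20. */
    return 0;
}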
Serialisation
Where does serialisation show up? • Application (e.g. access shared app data) • OS (e.g. performing syscall for app) How much time is spent in the OS?
Sources of Serialisation: • Locking (explicit serialisation, spinlock sketch below) – Waiting for a lock → stalls self – Lock implementation: – Atomic operations lock the bus → stalls everyone – Cache coherence traffic loads the bus → slows down others
Memory access (implicit) • Relatively high latency to memory → stalls self
Cache (implicit) • Processor stalled while cache line is fetched or invalidated • Affected by latency of interconnect • Performance depends on data size (cache lines) and contention (number of cores)
COMP9242 S2/2015 W12 11
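A minimal sketch of the locking cost described above (assuming C11 atomics; illustrative, not code from the lecture). The atomic exchange is the expensive part: every failed attempt generates coherence traffic, which the test-and-test-and-set pattern limits by spinning on a plain read:

#include <stdatomic.h>

typedef struct { atomic_int locked; } spinlock_t;

static void lock(spinlock_t *l)
{
    for (;;) {
        /* Spin on a plain load: the line stays shared in our cache. */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed))
            ;
        /* Only then attempt the (expensive) atomic read-modify-write. */
        if (!atomic_exchange_explicit(&l->locked, 1, memory_order_acquire))
            return;
    }
}

static void unlock(spinlock_t *l)
{
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}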
More Cache-related Serialisation
False sharing • Unrelated data structs share the same cache line • Accessed from different processors → cache coherence traffic and delay (padding sketch below)
Cache line bouncing • Shared R/W on many processors • E.g.: bouncing due to locks: each processor spinning on a lock brings it into its own cache
→ Cache coherence traffic and delay Cache misses • Potentially direct memory access → stalls self • When does a cache miss occur? – Application accesses data for the first time, application runs on a new core – Cached memory has been evicted – Cache footprint too big, another app ran, OS ran
COMP9242 S2/2015 W12 12
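A small sketch of false sharing and the padding fix (illustrative only; the 64-byte line size is an assumption about the target CPU):

/* Two per-CPU counters updated by different cores.
 * Unpadded: both counters share one cache line, so every increment on one
 * core invalidates the line in the other core's cache (false sharing). */
struct counters_bad {
    long cpu0_count;
    long cpu1_count;        /* same cache line as cpu0_count */
};

/* Padded: each counter gets its own cache line, so updates stay local. */
#define CACHE_LINE 64
struct counters_good {
    long cpu0_count;
    char pad0[CACHE_LINE - sizeof(long)];
    long cpu1_count;
    char pad1[CACHE_LINE - sizeof(long)];
};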
Multiprocessor Hardware
COMP9242 S2/2015 W12 13
Multi-What?
• Multiprocessor, SMP • >1 separate processors, connected by off-chip bus
• Multicore • >1 processing cores in a single processor, connected by on-chip bus
• Multithread, SMT • >1 hardware threads in a single core
• Multicore + Multiprocessor • >1 multicore processors • >1 multicore dies in a package (multi-chip module)
COMP9242 S2/2015 W12 14
Interesting Properties of Multiprocessors
• Scale and Structure • How many cores and processors are there • What kinds of cores and processors are there • How are they organised
• Interconnect • How are the cores and processors connected
• Memory Locality and Caches • Where is the memory • What is the cache architecture
• Interprocessor Communication • How do cores and processors send messages to each other
COMP9242 S2/2015 W12 15
Contemporary Multiprocessor Hardware • Intel: • Nehalem, Westmere: 10 core, QPI • Sandy Bridge, Ivy Bridge: – 5 core, ring bus, integrated GPU, L3, IO
• Haswell (Broadwell): – 18 core, ring bus, transactional memory, slices (EP)
• AMD: • K10 (Opteron: Barcelona, Magny Cours) – 12 core, Hypertransport
• Bulldozer, Piledriver, Steamroller (Opteron, FX) – 16 core, Clustered Multithread: module with 2 integer cores
• Oracle (Sun) UltraSPARC T1, T2, T3, T4, T5 (Niagara) • 16 cores, 8 threads/core (2 simultaneous), crossbar, 8 sockets
• ARM Cortex A9, A15 MPCore, big.LITTLE • 4-8 cores, big.LITTLE: A7 + A15
COMP9242 S2/2015 W12 16
Scale and Structure • ARM Cortex A9 MPCore
COMP9242 S2/2015 W12 17 From http://www.arm.com/images/Cortex-A9-MP-core_Big.gif
Scale and Structure
• ARM big.LITTLE
COMP9242 S2/2015 W12 18 From http://www.arm.com/images/Fig_1_Cortex-A15_CCI_Cortex-A7_System.jpg
Scale and Structure
• Intel Nehalem
COMP9242 S2/2015 W12 19 From www.dawnofthered.net/wp-content/uploads/2011/02/Nehalem-EX-architecture-detailed.jpg
Interconnect
• AMD Barcelona
COMP9242 S2/2015 W12 20 From www.sigops.org/sosp/sosp09/slides/baumann-slides-sosp09.pdf
Memory Locality and Caches
COMP9242 S2/2015 W12 21 From www.systems.ethz.ch/education/past-courses/fall-2010/aos/lectures/wk10-multicore.pdf
Interprocessor Communication • Oracle SPARC T2
COMP9242 S2/2015 W12 22
[Figure (from Sun/Oracle): UltraSPARC throughput-computing timeline and block diagrams. Timeline 2004-2008: UltraSPARC IIIi (1x) → UltraSPARC T1 (eight cores, 32 threads, 14x) → UltraSPARC T2 (eight cores, 64 threads, 35x) → "Victoria Falls" (16 cores, 128 threads over two sockets, 65x). T2 block diagram: eight cores C0-C7, each with FPU and SPU, eight L2 cache banks, a full crossbar, four memory controllers (MCUs) to FB-DIMM memory, a network interface unit (2x 10 Gigabit Ethernet), PCIe, and a buffer switch core; roughly 8 cores at 2.0 GHz under 95-100 W.]
Interprocessor Communication
COMP9242 S2/2015 W12 23 From http://www.anandtech.com/show/8423/intel-xeon-e5-version-3-up-to-18-haswell-ep-cores-/4
Interprocessor Communication/Structure/Memory
COMP9242 S2/2015 W12 24 From http://www.anandtech.com/show/8423/intel-xeon-e5-version-3-up-to-18-haswell-ep-cores-/4
Experimental/Future Multiprocessor Hardware
• Microsoft Beehive • Ring bus, no cache coherence
• Tilera Tile64, Tile-Gx • 100 cores, mesh network
• Intel Polaris • 80 cores, mesh network
• Intel SCC • 48 cores, mesh network, no cache coherency
• Intel MIC (Many Integrated Core) (Knights Corner - Xeon Phi) • 60+ cores, ring bus
COMP9242 S2/2015 W12 25
Scale and Structure • Tilera Tile64 (newest: EZchip's TILE-Gx), Intel Polaris
COMP9242 S2/2015 W12 26
[Figure (from www.tilera.com/products/processors/TILE64): TILE64 block diagram. An 8x8 mesh of tiles; each tile contains a processor (register file, three pipelines P0-P2), L1 instruction and data caches with I-TLB/D-TLB, an L2 cache, a 2D DMA engine, and a switch connecting the five mesh networks (MDN, TDN, UDN, IDN, STN). Around the edge: four DDR2 controllers, two PCIe and one XAUI MAC/PHY with SerDes, two GbE interfaces, flexible I/O, and UART/HPI/I2C/JTAG/SPI.]
Cache and Memory
• Intel SCC
COMP9242 S2/2015 W12 27 From techresearch.intel.com/spaw2/uploads/files/SCC_Platform_Overview.pdf
Interprocessor Communica$on
• Beehive
COMP9242 S2/2015 W12 28 From projects.csail.mit.edu/beehive/BeehiveV5.pdf
Interprocessor Communication • Intel MIC (Many Integrated Core) (Knights Corner/Landing - Xeon Phi)
COMP9242 S2/2015 W12 29 From http://semiaccurate.com/2012/08/28/intel-details-knights-corner-architecture-at-long-last/
Summary • Scalability • 100+ cores • Amdahl’s law really kicks in
• Heterogeneity • Heterogeneous cores, memory, etc. • Properties of similar systems may vary wildly (e.g. interconnect topology and latencies between different AMD platforms)
• NUMA • Also variable latencies due to topology and cache coherence
• Cache coherence may not be possible • Can’t use it for locking • Shared data structures require explicit work
• Computer is a distributed system • Message passing • Consistency and Synchronisation • Fault tolerance
COMP9242 S2/2015 W12 30
OS Design for Multiprocessors
COMP9242 S2/2015 W12 31
Optimisation for Scalability
• Reduce amount of code in critical sections • Increases concurrency • Fine grained locking – Lock data not code – Tradeoff: more concurrency but more locking (and locking causes serialisation)
• Lock-free data structures (sketch below) • Avoid expensive memory access • Avoid uncached memory • Access cheap (close) memory
COMP9242 S2/2015 W12 32
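A minimal sketch of a lock-free data structure (a Treiber-style stack push, assuming C11 atomics; pop is omitted because it needs ABA protection). No thread ever holds a lock; a failed compare-and-swap simply retries:

#include <stdatomic.h>
#include <stddef.h>

struct node { int value; struct node *next; };
static _Atomic(struct node *) top = NULL;

static void push(struct node *n)
{
    struct node *old = atomic_load(&top);
    do {
        n->next = old;                           /* link to the current top */
    } while (!atomic_compare_exchange_weak(&top, &old, n));
}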
Optimisation for Scalability
• Reduce false sharing • Pad data structures to cache lines
• Reduce cache line bouncing • Reduce sharing • E.g.: MCS locks use local data (sketch below)
• Reduce cache misses • Affinity scheduling: run process on the core where it last ran • Avoid cache pollution
COMP9242 S2/2015 W12 33
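A sketch of the MCS queue lock mentioned above (following the classic Mellor-Crummey/Scott design, assuming C11 atomics; not code from the slides). Each waiter spins on a flag in its own queue node, so the spin traffic stays in that core's cache instead of bouncing a shared lock line:

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct mcs_node {
    struct mcs_node *_Atomic next;
    atomic_bool locked;
} mcs_node_t;

typedef struct {
    mcs_node_t *_Atomic tail;
} mcs_lock_t;

static void mcs_lock(mcs_lock_t *l, mcs_node_t *me)
{
    me->next = NULL;
    atomic_store(&me->locked, true);
    /* Append our node to the queue; the previous tail is our predecessor. */
    mcs_node_t *prev = atomic_exchange(&l->tail, me);
    if (prev == NULL)
        return;                                  /* lock was free */
    atomic_store(&prev->next, me);
    while (atomic_load(&me->locked))
        ;                                        /* spin on our own node only */
}

static void mcs_unlock(mcs_lock_t *l, mcs_node_t *me)
{
    mcs_node_t *succ = atomic_load(&me->next);
    if (succ == NULL) {
        mcs_node_t *expected = me;
        /* No known successor: try to swing the tail back to empty. */
        if (atomic_compare_exchange_strong(&l->tail, &expected, (mcs_node_t *)NULL))
            return;
        while ((succ = atomic_load(&me->next)) == NULL)
            ;                                    /* successor is still linking in */
    }
    atomic_store(&succ->locked, false);          /* hand the lock to the successor */
}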
OS Design Guidelines for Modern (and Future) Multiprocessors • Avoid shared data • Performance issues arise less from lock contention than from poor data locality
• Explicit communication • Regain control over communication costs (and predictability) • Sometimes it's the only option
• Tradeoff: parallelism vs synchronisation • Synchronisation introduces serialisation • Make concurrent threads independent: reduce critical sections & cache misses
• Allocate for locality • E.g. provide memory local to a core
• Schedule for locality • With cached data • With local memory
• Tradeoff: uniprocessor performance vs scalability
COMP9242 S2/2015 W12 34
Design approaches
• Divide and conquer • Divide the multiprocessor into smaller bits, use them as normal • Using virtualisation • Using exokernel
• Reduced sharing • Brute force & Heroic Effort – Find problems in existing OS and fix them – E.g. Linux rearchitecting: BKL → fine grained locking
• By design – Avoid shared data as much as possible
• No sharing • Computer is a distributed system – Do extra work to share!
COMP9242 S2/2015 W12 35
Divide and Conquer
Disco • Scalability is too hard!
• Context: • ca. 1995, large ccNUMA multiprocessors appearing • Scaling OSes requires extensive modifications
• Idea: • Implement a scalable VMM • Run multiple OS instances
• VMM has most of the features of a scalable OS: • NUMA aware allocator • Page replication, remapping, etc.
• VMM substantially simpler/cheaper to implement • Modern incarnations of this • Virtual servers (Amazon, etc.) • Research (Cerberus)
COMP9242 S2/2015 W12 36 Running commodity OSes on scalable multiprocessors [Bugnion et al., 1997] http://www-flash.stanford.edu/Disco/
Disco Architecture
COMP9242 S2/2015 W12 37
Disco Performance
COMP9242 S2/2015 W12 38
Space-Time Partitioning
Tessellation • Space-time partitioning • 2-level scheduling
• Context: • 2009-... highly parallel multicore systems • Berkeley Par Lab
COMP9242 S2/2015 W12 39 Tessellation: Space-Time Partitioning in a Manycore Client OS [Liu et al., 2010] http://tessellation.cs.berkeley.edu/
Tessellation
COMP9242 S2/2015 W12 40
Reduce Sharing K42 • Context: • 1997-2006: OS for ccNUMA systems • IBM, U Toronto (Tornado, Hurricane)
• Goals: • High locality • Scalability
• Object Oriented • Fine grained objects
• Clustered (Distributed) Objects • Data locality
• Deferred deletion (RCU) • Avoid locking
• NUMA aware memory allocator • Memory locality
COMP9242 S2/2015 W12 41 Clustered Objects, Ph.D. thesis [Appavoo, 2005] http://www.research.ibm.com/K42/
K42: Fine-grained objects
COMP9242 S2/2015 W12 42
K42: Clustered objects • Globally valid object reference
• Resolves to • Processor local representative
• Sharing, locking strategy local to each object
• Transparency • Eases complexity • Controlled introduction of locality
• Shared counter: • inc, dec: local access • val: communication (sketch below)
• Fast path: • Access mostly local structures
COMP9242 S2/2015 W12 43
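A rough sketch of the clustered-counter idea (illustrative only; the CPU count, padding, and names are assumptions, not K42 code). inc and dec touch only the local representative; val must visit every representative, which is why it is the expensive, rarely used operation:

#define NCPU       16
#define CACHE_LINE 64

struct rep { long count; char pad[CACHE_LINE - sizeof(long)]; };
static struct rep reps[NCPU];            /* one representative per CPU */

static void inc(int cpu) { reps[cpu].count++; }   /* local, no sharing */
static void dec(int cpu) { reps[cpu].count--; }

static long val(void)                    /* rare, expensive operation */
{
    long sum = 0;
    for (int i = 0; i < NCPU; i++)
        sum += reps[i].count;            /* touches every CPU's cache line */
    return sum;
}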
K42 Performance
COMP9242 S2/2015 W12 44
[Graph: K42 performance compared with Linux 2.4.19.]
Corey • Context • 2008, high-end multicore servers, MIT
• Goals: • Application control of OS sharing
• OS • Exokernel-like, higher-level services as libraries • By default only single core access to OS data structures • Calls to control how data structures are shared
• Address Ranges • Control private per core and shared address spaces
• Kernel Cores • Dedicate cores to run specific kernel functions
• Shares • Lookup tables for kernel objects allow control over which object identifiers are visible to other cores.
COMP9242 S2/2015 W12 45 Corey: An Operating System for Many Cores [Boyd-Wickizer et al., 2008]
http://pdos.csail.mit.edu/corey
Linux Brute Force Scalability
• Context • 2010, high-end multicore servers, MIT
• Goals: • Scaling commodity OS
• Linux scalability (2010 – scale Linux to 48 cores)
COMP9242 S2/2015 W12 46 An Analysis of Linux Scalability to Many Cores [Boyd-Wickizer et al., 2010]
Linux Brute Force Scalability • Apply lessons from parallel computing and past research • sloppy counters (sketch below), • per-core data structs, • fine-grained locks, lock free, • cache lines • 3002 lines of code changed
• Conclusion: • no scalability reason to give up on traditional operating system organizations just yet.
COMP9242 S2/2015 W12 47
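A rough sketch of the sloppy-counter idea from that work (the threshold, layout, and names here are assumptions): each core mostly updates a private, cache-line-padded count and only touches the shared central counter when its local balance drifts past a threshold:

#include <stdatomic.h>

#define NCORES     48
#define THRESHOLD  8
#define CACHE_LINE 64

struct sloppy {
    atomic_long central;                              /* shared, rarely touched */
    struct { long local; char pad[CACHE_LINE - sizeof(long)]; } per[NCORES];
};

static void sloppy_get(struct sloppy *c, int core)
{
    if (++c->per[core].local > THRESHOLD) {           /* spill to central */
        atomic_fetch_add(&c->central, c->per[core].local);
        c->per[core].local = 0;
    }
}

static void sloppy_put(struct sloppy *c, int core)
{
    if (--c->per[core].local < -THRESHOLD) {
        atomic_fetch_add(&c->central, c->per[core].local);
        c->per[core].local = 0;
    }
}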
Scalability of the API
• Context • 2013, previous multicore projects at MIT
• Goals • How to know if a system is really scalable?
• Workload-based evaluation • Run workload, plot scalability, fix problems • Did we miss any non-scalable workload? • Did we find all bottlenecks?
• Is there something fundamental that makes a system non-scalable? • The interface might be a fundamental bottleneck
COMP9242 S2/2015 W12 48 The Scalable Commutativity Rule: Designing Scalable Software for Multicore Processors [Clements et al., 2013]
Scalable Commutativity Rule • The Rule • Whenever interface operations commute, they can be implemented in a way that scales.
• Commutative operations: • Cannot distinguish order of operations from results • Example: – creat(): – Requires that the lowest available FD be returned – Not commutative: can tell which one was run first (sketch below)
• Why are commutative operations scalable? • results independent of order ⇒ communication is unnecessary • without communication, no conflicts
• Informs software design process • Design: design guideline for scalable interfaces • Implementation: clear target • Test: workload-independent testing
COMP9242 S2/2015 W12 49
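A hypothetical sketch of the point above (names and numbers are invented): if the interface only promises *some* unused descriptor rather than the lowest one, concurrent open() calls commute, and each core can hand out descriptors without communicating:

/* Hypothetical per-core descriptor allocator, legal only under a commutative
 * spec ("return any unused fd"), not under POSIX's "lowest available fd". */
#define NCORES       48
#define FDS_PER_CORE 1024

static int next_fd[NCORES];              /* per-core cursor, never shared */

static int alloc_fd(int core)
{
    /* No locks, no shared cache lines: the result depends only on this core. */
    return core * FDS_PER_CORE + next_fd[core]++;
}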
Commuter: An Automated Scalability Testing Tool
COMP9242 S2/2015 W12 50
[Figure: Commuter conflict heatmaps for Linux and the sv6 research kernel.]
FlexSC • Context: • 2010, commodity multicores • U Toronto
• Goal: • Reduce context switch overhead of system calls
• Syscall context switch: • Usual mode switch overhead • But: cache and TLB pollution!
COMP9242 S2/2015 W12 51 FlexSC: Flexible System Call Scheduling with Exception-Less System Calls [Soares and Stumm., 2010]
FlexSC
• Asynchronous system calls • Batch system calls • Run them on dedicated cores (sketch below)
• FlexSC-Threads • M on N • M >> N
COMP9242 S2/2015 W12 52
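A conceptual sketch of exception-less system calls (the structure and field names are assumptions for illustration, not FlexSC's actual ABI): user threads post requests into a shared syscall page and keep running; a dedicated syscall core picks up the batched entries, so the application core never pays the mode-switch and cache cost of a trap:

#include <stdatomic.h>

enum entry_state { FREE = 0, SUBMITTED, DONE };

struct syscall_entry {
    atomic_int state;          /* FREE -> SUBMITTED (app) -> DONE (kernel) */
    int        number;         /* system call number */
    long       args[6];
    long       ret;
};

/* Page shared between the application core and a dedicated syscall core. */
static struct syscall_entry syscall_page[64];

/* Application side: post the request and run another user-level thread
 * (FlexSC-Threads multiplexes M user threads on N kernel-visible threads). */
static void post_syscall(struct syscall_entry *e, int nr, long arg0)
{
    e->number  = nr;
    e->args[0] = arg0;
    atomic_store_explicit(&e->state, SUBMITTED, memory_order_release);
    /* ... scheduler switches user threads until e->state == DONE ... */
}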
FlexSC Results
COMP9242 S2/2015 W12 53
[Graph: Apache throughput with FlexSC: system call batching and redirection to dedicated syscall cores.]
No sharing
• Multikernel • Barrelfish • fos: factored operating system
COMP9242 S2/2015 W12 54 The Multikernel: A new OS architecture for scalable multicore systems [Baumann et al., 2009] http://www.barrelfish.org/
Barrelfish
• Context: • 2007, large multicore machines appearing • 100s of cores on the horizon • NUMA (cc and non-cc) • ETH Zurich and Microsoft
• Goals: • Scale to many cores • Support and manage heterogeneous hardware
• Approach: • Structure OS as distributed system
• Design principles: • Interprocessor communication is explicit • OS structure hardware neutral • State is replicated
• Microkernel • Similar to seL4: capabilities
COMP9242 S2/2015 W12 55 The Multikernel: A new OS architecture for scalable multicore systems [Baumann et al., 2009] http://www.barrelfish.org/
Barrelfish
COMP9242 S2/2015 W12 56
Barrelfish: Replication
• Kernel + Monitor: • Only memory shared for message channels
• Monitor: • Collectively coordinate system-wide state
• System-wide state: • Memory allocation tables • Address space mappings • Capability lists
• What state is replicated in Barrelfish • Capability lists
• Consistency and Coordination • Retype: two-phase commit to globally execute operation in order • Page (re/un)mapping: one-phase commit to synchronise TLBs
COMP9242 S2/2015 W12 57
Barrelfish: Communication • Different mechanisms: • Intra-core – Kernel endpoints
• Inter-core – URPC
• URPC • Uses cache coherence + polling • Shared buffer (sketch below) – Sender writes a cache line – Receiver polls on the cache line – (last word written last, so no partial message is seen)
• Polling? – Cache only changes when sender writes, so poll is cheap
– Switch to block and IPI if wait is too long.
COMP9242 S2/2015 W12 58
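A sketch of the URPC channel described above (message layout and names are assumptions; the real Barrelfish channel differs in detail). The payload words are written first and the sequence word last, so the receiver, polling on its locally cached copy of the line, never sees a partial message:

#include <stdatomic.h>
#include <stdint.h>

#define CACHE_LINE 64

struct urpc_msg {
    uint64_t payload[7];
    _Atomic uint64_t seq;          /* written last: message-complete flag */
} __attribute__((aligned(CACHE_LINE)));

static void urpc_send(struct urpc_msg *ch, const uint64_t *words, uint64_t seq)
{
    for (int i = 0; i < 7; i++)
        ch->payload[i] = words[i];
    atomic_store_explicit(&ch->seq, seq, memory_order_release);
}

static void urpc_recv(struct urpc_msg *ch, uint64_t *words, uint64_t seq)
{
    /* Cheap poll: the line only changes when the sender writes it. */
    while (atomic_load_explicit(&ch->seq, memory_order_acquire) != seq)
        ;
    for (int i = 0; i < 7; i++)
        words[i] = ch->payload[i];
}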
Barrelfish: Results • Message passing vs caching
COMP9242 S2/2015 W12 59
[Graph: update latency (cycles × 1000) vs number of cores, comparing shared-memory updates of 1, 2, 4 and 8 cache lines (SHM1-SHM8) with message passing to a server core (MSG1, MSG8).]
Barrelfish: Results • Broadcast vs Multicast
COMP9242 S2/2015 W12 60
[Graph: latency (cycles × 1000) vs number of cores (2-32) for broadcast, unicast, multicast, and NUMA-aware multicast.]
Barrelfish: Results • TLB shootdown
COMP9242 S2/2015 W12 61
[Graph: TLB shootdown latency (cycles × 1000) vs number of cores (2-32) for Windows, Linux, and Barrelfish.]
Summary
COMP9242 S2/2015 W12 62
Summary • Trends in multicore • Scale (100+ cores) • NUMA • No cache coherence • Distributed system • Heterogeneity
• OS design guidelines • Avoid shared data • Explicit communication • Locality
• Approaches to multicore OS • Partition the machine (Disco, Tessellation) • Reduce sharing (K42, Corey, Linux, FlexSC, scalable commutativity) • No sharing (Barrelfish, fos)
COMP9242 S2/2015 W12 63