
www.csiro.au

Multiprocessor OS
COMP9242 – Advanced Operating Systems

Ihor Kuz | [email protected] | S2/2015 Week 12

Overview  

• Multiprocessor OS
  • How does it work?
  • Scalability (review)

• Multiprocessor Hardware
  • Contemporary systems (Intel, AMD, ARM, Oracle/Sun)
  • Experimental and future systems (Intel, MS, Polaris)

• OS Design for Multiprocessors
  • Guidelines
  • Design approaches
    – Divide and Conquer (Disco, Tessellation)
    – Reduce Sharing (K42, Corey, Linux, FlexSC, scalable commutativity)
    – No Sharing (Barrelfish, fos)

COMP9242 S2/2015 W12 2

Multiprocessor OS

COMP9242 S2/2015 W12 3

Uniprocessor  OS  

COMP9242 S2/2015 W12 4

[Figure: a single CPU runs the OS and one application at a time (App1); memory holds OS data (run queue, FS structs, process control blocks) and application data for App1–App4.]

Page 2: Mul$processor+OS+cs9242/15/lectures/12-multiproc-4up.pdf2 Mul$processor+OS+ 5 COMP9242 S2/2015 W12 CPU App1 OS CPU App3 OS CPU App4 OS CPU App4 OS Memory OS data Application data App1

2

Multiprocessor OS

COMP9242 S2/2015 W12 5

[Figure: four CPUs, each running the OS and an application (App1, App3, App4, App4); a single shared memory holds OS data (run queue, FS structs, process control blocks) and application data for App1–App4.]

Multiprocessor OS

• Key design challenges:
  • Correctness of (shared) data structures
  • Scalability

COMP9242 S2/2015 W12 6


Correctness  of  Shared  Data  

• Concurrency control: locks, semaphores, transactions, lock-free data structures (see the sketch below)

• We know how to do this: in the application and in the OS

COMP9242 S2/2015 W12 7
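As a minimal sketch of the concurrency-control options above (not from the slides; pthreads and C11 atomics assumed), the same shared counter protected by a lock and, alternatively, updated lock-free:

```c
/* Shared counter: lock-based vs lock-free variants (C11 + pthreads). */
#include <pthread.h>
#include <stdatomic.h>

static long counter;                                     /* protected by lock */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void inc_locked(void) {
    pthread_mutex_lock(&lock);    /* mutual exclusion: correct but serialises */
    counter++;
    pthread_mutex_unlock(&lock);
}

static atomic_long counter_lf;                           /* lock-free variant */

void inc_lockfree(void) {
    /* single atomic read-modify-write; no lock, but the cache line still
     * bounces between cores under contention (see the later slides) */
    atomic_fetch_add_explicit(&counter_lf, 1, memory_order_relaxed);
}
```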

Scalability: speedup as more processors are added

COMP9242 S2/2015 W12 8

[Plot: speedup S vs number of processors n, showing the ideal linear curve.]

S(N) = T_1 / T_N

Page 3: Mul$processor+OS+cs9242/15/lectures/12-multiproc-4up.pdf2 Mul$processor+OS+ 5 COMP9242 S2/2015 W12 CPU App1 OS CPU App3 OS CPU App4 OS CPU App4 OS Memory OS data Application data App1

3

Scalability: speedup as more processors are added

COMP9242 S2/2015 W12 9

[Plot: speedup vs number of processors, labelled "Reality": the measured curve falls well below the ideal.]

S(N) = T_1 / T_N

Scalability and Serialisation: remember Amdahl's law
• Serial (non-parallel) portion: when the application is not running on all cores
• Serialisation prevents scalability

COMP9242 S2/2015 W12 10 From http://en.wikipedia.org/wiki/File:AmdahlsLaw.svg

T_1 = 1 = (1 − P) + P

T_N = (1 − P) + P/N

S(N) = T_1 / T_N = 1 / ((1 − P) + P/N)

S(∞) → 1 / (1 − P)
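A quick numerical check of the formula above (a sketch, not from the slides; the parallel fraction P = 0.95 is an assumed example): the speedup saturates at 20x no matter how many cores are added.

```c
/* amdahl.c - evaluate Amdahl's law S(N) = 1 / ((1-P) + P/N) */
#include <stdio.h>

static double amdahl(double P, double N) {
    return 1.0 / ((1.0 - P) + P / N);
}

int main(void) {
    double P = 0.95;                 /* assumed parallel fraction */
    int cores[] = { 1, 2, 4, 8, 16, 64, 256, 1024 };
    for (unsigned i = 0; i < sizeof cores / sizeof cores[0]; i++)
        printf("N = %4d  ->  S = %6.2f\n", cores[i], amdahl(P, cores[i]));
    printf("N -> inf  ->  S = %6.2f\n", 1.0 / (1.0 - P));   /* 20x ceiling */
    return 0;
}
```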

Serialisation

Where does serialisation show up?
• Application (e.g. accessing shared application data)
• OS (e.g. performing a syscall for the application). How much time is spent in the OS?

Sources of serialisation:
• Locking (explicit serialisation)
  – Waiting for a lock ⇒ stalls self
  – Lock implementation:
    – Atomic operations lock the bus ⇒ stall everyone
    – Cache coherence traffic loads the bus ⇒ slows down others
• Memory access (implicit)
  • Relatively high latency to memory ⇒ stalls self
• Cache (implicit)
  • Processor stalled while a cache line is fetched or invalidated
  • Affected by the latency of the interconnect
  • Performance depends on data size (cache lines) and contention (number of cores)

COMP9242 S2/2015 W12 11

More Cache-related Serialisation

False sharing
• Unrelated data structs share the same cache line
• Accessed from different processors ⇒ cache coherence traffic and delay (see the padding sketch after this slide)

Cache line bouncing
• Data shared read/write by many processors
• E.g. bouncing due to locks: each processor spinning on a lock brings it into its own cache
⇒ cache coherence traffic and delay

Cache misses
• Potentially a direct memory access ⇒ stalls self
• When does a cache miss occur?
  – Application accesses data for the first time, or the application runs on a new core
  – Cached memory has been evicted
  – Cache footprint too big, another app ran, the OS ran

COMP9242 S2/2015 W12 12
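A common way to see (and fix) the false sharing described above, sketched in C11; the 64-byte line size is an assumption (query the CPU in real code):

```c
/* Two per-thread counters. Unpadded, they share a cache line and the line
 * bounces between cores on every increment; padded/aligned to an (assumed)
 * 64-byte line, each counter stays local to its writer's cache. */
#include <stdalign.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumed line size */

struct counters_bad {
    uint64_t a;         /* written by core 0 */
    uint64_t b;         /* written by core 1 -> same line as a: false sharing */
};

struct counters_good {
    alignas(CACHE_LINE) uint64_t a;   /* each counter gets its own line */
    alignas(CACHE_LINE) uint64_t b;
};
```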

Page 4: Mul$processor+OS+cs9242/15/lectures/12-multiproc-4up.pdf2 Mul$processor+OS+ 5 COMP9242 S2/2015 W12 CPU App1 OS CPU App3 OS CPU App4 OS CPU App4 OS Memory OS data Application data App1

4

Multiprocessor Hardware

COMP9242 S2/2015 W12 13

Multi-What?

• Multiprocessor, SMP: more than one separate processor, connected by an off-chip bus

• Multicore: more than one processing core in a single processor, connected by an on-chip bus

• Multithread, SMT: more than one hardware thread in a single core

• Multicore + multiprocessor: more than one multicore processor, or more than one multicore die in a package (multi-chip module)

COMP9242 S2/2015 W12 14

Interesting Properties of Multiprocessors

• Scale and structure: how many cores and processors are there, what kinds are they, and how are they organised?

• Interconnect: how are the cores and processors connected?

• Memory locality and caches: where is the memory, and what is the cache architecture?

• Interprocessor communication: how do cores and processors send messages to each other?

COMP9242 S2/2015 W12 15

Contemporary Multiprocessor Hardware
• Intel:
  • Nehalem, Westmere: 10 cores, QPI
  • Sandy Bridge, Ivy Bridge: 5 cores, ring bus, integrated GPU, L3, IO
  • Haswell (Broadwell): 18 cores, ring bus, transactional memory, slices (EP)

• AMD:
  • K10 (Opteron: Barcelona, Magny Cours): 12 cores, HyperTransport
  • Bulldozer, Piledriver, Steamroller (Opteron, FX): 16 cores, Clustered Multithreading: modules with 2 integer cores each

• Oracle (Sun) UltraSPARC T1, T2, T3, T4, T5 (Niagara): 16 cores, 8 threads per core (2 simultaneous), crossbar, 8 sockets

• ARM Cortex A9, A15 MPCore, big.LITTLE: 4–8 cores; big.LITTLE pairs A7 + A15

COMP9242 S2/2015 W12 16

Page 5: Mul$processor+OS+cs9242/15/lectures/12-multiproc-4up.pdf2 Mul$processor+OS+ 5 COMP9242 S2/2015 W12 CPU App1 OS CPU App3 OS CPU App4 OS CPU App4 OS Memory OS data Application data App1

5

Scale  and  Structure  •  ARM  Cortex  A9  MPCore  

COMP9242 S2/2015 W12 17 From http://www.arm.com/images/Cortex-A9-MP-core_Big.gif

Scale  and  Structure  

•  ARM  big.LITTLE  

COMP9242 S2/2015 W12 18 From http://www.arm.com/images/Fig_1_Cortex-A15_CCI_Cortex-A7_System.jpg

Scale  and  Structure  

•  Intel  Nehalem  

COMP9242 S2/2015 W12 19 From www.dawnofthered.net/wp-content/uploads/2011/02/Nehalem-EX-architecture-detailed.jpg

Interconnect  

•  AMD  Barcelona  

COMP9242 S2/2015 W12 20 From www.sigops.org/sosp/sosp09/slides/baumann-slides-sosp09.pdf

Page 6: Mul$processor+OS+cs9242/15/lectures/12-multiproc-4up.pdf2 Mul$processor+OS+ 5 COMP9242 S2/2015 W12 CPU App1 OS CPU App3 OS CPU App4 OS CPU App4 OS Memory OS data Application data App1

6

Memory  Locality  and  Caches  

 

COMP9242 S2/2015 W12 21 From www.systems.ethz.ch/education/past-courses/fall-2010/aos/lectures/wk10-multicore.pdf

Interprocessor Communication: Oracle SPARC T2

COMP9242 S2/2015 W12 22

[Figure (from Sun/Oracle): Niagara roadmap 2004–2008: UltraSPARC IIIi (1x), UltraSPARC T1 (eight cores, 32 threads, 14x), UltraSPARC T2 (eight cores, 64 threads, 35x), "Victoria Falls" (16 cores, 128 threads over two sockets, 65x). The T2 block diagram shows eight cores (C0–C7, x8 @ 2.0 GHz) each with FPU and SPU, eight L2$ banks behind a full crossbar, four MCUs to FB-DIMM memory, a system interface/buffer switch core, PCIe, an NIU with 2x 10 Gigabit Ethernet, and power below 95 W.]

Interprocessor Communication

COMP9242 S2/2015 W12 23 From http://www.anandtech.com/show/8423/intel-xeon-e5-version-3-up-to-18-haswell-ep-cores-/4

Interprocessor Communication / Structure / Memory

COMP9242 S2/2015 W12 24 From http://www.anandtech.com/show/8423/intel-xeon-e5-version-3-up-to-18-haswell-ep-cores-/4

Page 7: Mul$processor+OS+cs9242/15/lectures/12-multiproc-4up.pdf2 Mul$processor+OS+ 5 COMP9242 S2/2015 W12 CPU App1 OS CPU App3 OS CPU App4 OS CPU App4 OS Memory OS data Application data App1

7

Experimental/Future Multiprocessor Hardware

• Microsoft Beehive: ring bus, no cache coherence

• Tilera Tile64, Tile-Gx: 100 cores, mesh network

• Intel Polaris: 80 cores, mesh network

• Intel SCC: 48 cores, mesh network, no cache coherence

• Intel MIC (Many Integrated Core, Knights Corner – Xeon Phi): 60+ cores, ring bus

COMP9242 S2/2015 W12 25

Scale and Structure: Tilera Tile64 (newest: EZchip TILE-Gx), Intel Polaris

COMP9242 S2/2015 W12 26

[Figure (from www.tilera.com/products/processors/TILE64): the Tile64 floorplan: a mesh of tiles, each containing a processor (register file, pipelines P0–P2), L1I/L1D caches with I-TLB/D-TLB, an L2 cache, 2D DMA, and a switch onto the on-chip networks (MDN, TDN, UDN, IDN, STN); around the edge sit four DDR2 controllers, PCIe 0/1, XAUI, GbE, flexible I/O, and UART/HPI/I2C/JTAG/SPI, connected via MAC/PHY and SerDes blocks.]

Cache  and  Memory  

•  Intel  SCC  

COMP9242 S2/2015 W12 27 From techresearch.intel.com/spaw2/uploads/files/SCC_Platform_Overview.pdf

Interprocessor Communication

•  Beehive  

COMP9242 S2/2015 W12 28 From projects.csail.mit.edu/beehive/BeehiveV5.pdf

Page 8: Mul$processor+OS+cs9242/15/lectures/12-multiproc-4up.pdf2 Mul$processor+OS+ 5 COMP9242 S2/2015 W12 CPU App1 OS CPU App3 OS CPU App4 OS CPU App4 OS Memory OS data Application data App1

8

Interprocessor Communication: Intel MIC (Many Integrated Core, Knights Corner/Landing – Xeon Phi)

COMP9242 S2/2015 W12 29 From http://semiaccurate.com/2012/08/28/intel-details-knights-corner-architecture-at-long-last/

Summary
• Scalability: 100+ cores; Amdahl's law really kicks in

• Heterogeneity: heterogeneous cores, memory, etc.; properties of similar systems may vary wildly (e.g. interconnect topology and latencies between different AMD platforms)

• NUMA: also variable latencies due to topology and cache coherence

• Cache coherence may not be possible: can't use it for locking; shared data structures require explicit work

• The computer is a distributed system: message passing, consistency and synchronisation, fault tolerance

COMP9242 S2/2015 W12 30

OS Design for Multiprocessors

COMP9242 S2/2015 W12 31

Optimisation for Scalability

• Reduce the amount of code in critical sections
  • Increases concurrency
  • Fine-grained locking: lock data, not code (see the hash-table sketch after this slide)
    – Tradeoff: more concurrency but more locking (and locking causes serialisation)
  • Lock-free data structures

• Avoid expensive memory access
  • Avoid uncached memory
  • Access cheap (close) memory

COMP9242 S2/2015 W12 32
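As an illustration of "lock data, not code" (a sketch, not from the slides; the table size and hash are arbitrary): per-bucket locks let inserts into different buckets proceed concurrently instead of serialising every insert behind one global lock.

```c
/* Fine-grained locking sketch: one lock per hash bucket (C11 + pthreads). */
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 256   /* illustrative size */

struct node { char *key; void *val; struct node *next; };

struct bucket {
    pthread_mutex_t lock;          /* protects only this bucket's chain */
    struct node *head;
} table[NBUCKETS];

void ht_init(void) {
    for (int i = 0; i < NBUCKETS; i++) {
        pthread_mutex_init(&table[i].lock, NULL);
        table[i].head = NULL;
    }
}

static unsigned hash(const char *k) {
    unsigned h = 5381;
    while (*k) h = h * 33 + (unsigned char)*k++;
    return h % NBUCKETS;
}

void ht_insert(const char *key, void *val) {
    struct bucket *b = &table[hash(key)];
    struct node *n = malloc(sizeof *n);
    n->key = strdup(key);
    n->val = val;
    pthread_mutex_lock(&b->lock);  /* contention only within one bucket */
    n->next = b->head;
    b->head = n;
    pthread_mutex_unlock(&b->lock);
}
```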

Page 9: Mul$processor+OS+cs9242/15/lectures/12-multiproc-4up.pdf2 Mul$processor+OS+ 5 COMP9242 S2/2015 W12 CPU App1 OS CPU App3 OS CPU App4 OS CPU App4 OS Memory OS data Application data App1

9

Optimisation for Scalability

• Reduce false sharing: pad data structures to cache lines

• Reduce cache line bouncing: reduce sharing; e.g. MCS locks spin on local data (sketched after this slide)

• Reduce cache misses
  • Affinity scheduling: run a process on the core where it last ran
  • Avoid cache pollution

COMP9242 S2/2015 W12 33
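A compact MCS queue lock in C11 atomics, as a sketch of why such locks reduce cache-line bouncing (the memory orderings and naming are my own, not from the slides): each waiter spins on a flag in its own node, so only the hand-off writes cross cores.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct mcs_node {
    _Atomic(struct mcs_node *) next;
    atomic_bool locked;                 /* each waiter spins on its OWN node */
};

typedef _Atomic(struct mcs_node *) mcs_lock_t;   /* tail of the waiter queue */

void mcs_acquire(mcs_lock_t *lock, struct mcs_node *me) {
    atomic_store_explicit(&me->next, NULL, memory_order_relaxed);
    atomic_store_explicit(&me->locked, true, memory_order_relaxed);
    /* enqueue ourselves at the tail */
    struct mcs_node *prev = atomic_exchange_explicit(lock, me, memory_order_acq_rel);
    if (prev) {
        /* queue was non-empty: link in, then spin on a line nobody else writes */
        atomic_store_explicit(&prev->next, me, memory_order_release);
        while (atomic_load_explicit(&me->locked, memory_order_acquire))
            ;
    }
}

void mcs_release(mcs_lock_t *lock, struct mcs_node *me) {
    struct mcs_node *succ = atomic_load_explicit(&me->next, memory_order_acquire);
    if (!succ) {
        /* no known successor: try to swing the tail back to empty */
        struct mcs_node *expected = me;
        if (atomic_compare_exchange_strong_explicit(lock, &expected, NULL,
                memory_order_acq_rel, memory_order_acquire))
            return;
        /* a successor is enqueueing; wait for it to link itself */
        while (!(succ = atomic_load_explicit(&me->next, memory_order_acquire)))
            ;
    }
    atomic_store_explicit(&succ->locked, false, memory_order_release);  /* hand off */
}
```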

OS Design Guidelines for Modern (and Future) Multiprocessors

• Avoid shared data: performance issues arise less from lock contention than from data locality

• Explicit communication: regain control over communication costs (and predictability); sometimes it is the only option

• Tradeoff: parallelism vs synchronisation: synchronisation introduces serialisation, so make concurrent threads independent (reduce critical sections and cache misses)

• Allocate for locality: e.g. provide memory local to a core

• Schedule for locality: with cached data, with local memory

• Tradeoff: uniprocessor performance vs scalability

COMP9242 S2/2015 W12 34

Design approaches

• Divide and conquer: divide the multiprocessor into smaller bits and use them as normal, via virtualisation or an exokernel

• Reduced sharing
  • Brute force and heroic effort: find problems in an existing OS and fix them (e.g. Linux re-architecting: BKL -> fine-grained locking)
  • By design: avoid shared data as much as possible

• No sharing: the computer is a distributed system; do extra work to share!

COMP9242 S2/2015 W12 35

Divide and Conquer

Disco: scalability is too hard!

• Context: ca. 1995, large ccNUMA multiprocessors appearing; scaling OSes requires extensive modifications

• Idea: implement a scalable VMM and run multiple OS instances

• The VMM has most of the features of a scalable OS: NUMA-aware allocator, page replication, remapping, etc.

• The VMM is substantially simpler/cheaper to implement

• Modern incarnations of this: virtual servers (Amazon, etc.), research (Cerberus)

COMP9242 S2/2015 W12 36 Running commodity OSes on scalable multiprocessors [Bugnion et al., 1997] http://www-flash.stanford.edu/Disco/

Page 10: Mul$processor+OS+cs9242/15/lectures/12-multiproc-4up.pdf2 Mul$processor+OS+ 5 COMP9242 S2/2015 W12 CPU App1 OS CPU App3 OS CPU App4 OS CPU App4 OS Memory OS data Application data App1

10

Disco  Architecture  

COMP9242 S2/2015 W12 37

Disco  Performance  

COMP9242 S2/2015 W12 38

Space-Time Partitioning

Tessellation
• Space-time partitioning
• Two-level scheduling

• Context: 2009 onwards, highly parallel multicore systems; Berkeley Par Lab

COMP9242 S2/2015 W12 39 Tessellation: Space-Time Partitioning in a Manycore Client OS [Liu et al., 2010] http://tessellation.cs.berkeley.edu/

Tessellation

COMP9242 S2/2015 W12 40

Page 11: Mul$processor+OS+cs9242/15/lectures/12-multiproc-4up.pdf2 Mul$processor+OS+ 5 COMP9242 S2/2015 W12 CPU App1 OS CPU App3 OS CPU App4 OS CPU App4 OS Memory OS data Application data App1

11

Reduce Sharing: K42
• Context: 1997–2006, an OS for ccNUMA systems; IBM, U Toronto (Tornado, Hurricane)

• Goals: high locality, scalability

• Object-oriented: fine-grained objects

• Clustered (distributed) objects: data locality

• Deferred deletion (RCU): avoid locking

• NUMA-aware memory allocator: memory locality

COMP9242 S2/2015 W12 41 Clustered Objects, Ph.D. thesis [Appavoo, 2005] http://www.research.ibm.com/K42/

K42: Fine-grained objects

COMP9242 S2/2015 W12 42

K42: Clustered objects
• Globally valid object reference

• Resolves to a processor-local representative

• Sharing and locking strategy is local to each object

• Transparency: eases complexity, allows controlled introduction of locality

• Shared counter: inc and dec are local accesses; val requires communication (see the sketch after this slide)

• Fast path: access mostly local structures

COMP9242 S2/2015 W12 43
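A minimal sketch of the clustered shared counter idea (per-core representatives in plain C; the core count, padding, and names are assumptions, not K42's actual object system): increments touch only the local representative, while reading the value has to visit, i.e. communicate with, all of them.

```c
/* Per-core counter representatives (C11 atomics). */
#include <stdatomic.h>
#include <stdalign.h>

#define NCORES     8     /* assumed core count */
#define CACHE_LINE 64    /* assumed line size */

static struct {
    alignas(CACHE_LINE) atomic_long count;   /* one line per core: no bouncing */
} rep[NCORES];

void counter_inc(int core) {                 /* fast path: local representative only */
    atomic_fetch_add_explicit(&rep[core].count, 1, memory_order_relaxed);
}

void counter_dec(int core) {
    atomic_fetch_sub_explicit(&rep[core].count, 1, memory_order_relaxed);
}

long counter_val(void) {                     /* slow path: touches every representative */
    long total = 0;
    for (int c = 0; c < NCORES; c++)
        total += atomic_load_explicit(&rep[c].count, memory_order_relaxed);
    return total;
}
```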

K42  Performance  

COMP9242 S2/2015 W12 44

[Performance graphs; the comparison baseline is Linux 2.4.19.]

Page 12: Mul$processor+OS+cs9242/15/lectures/12-multiproc-4up.pdf2 Mul$processor+OS+ 5 COMP9242 S2/2015 W12 CPU App1 OS CPU App3 OS CPU App4 OS CPU App4 OS Memory OS data Application data App1

12

Corey
• Context: 2008, high-end multicore servers, MIT

• Goals: application control of OS sharing

• OS: exokernel-like, with higher-level services as libraries; by default only single-core access to OS data structures; calls to control how data structures are shared

• Address ranges: control private per-core and shared address spaces

• Kernel cores: dedicate cores to running specific kernel functions

• Shares: lookup tables for kernel objects allow control over which object identifiers are visible to other cores

 COMP9242 S2/2015 W12 45 Corey: An Operating System for Many Cores [Boyd-Wickizer et al., 2008]

http://pdos.csail.mit.edu/corey

Linux Brute Force Scalability

• Context: 2010, high-end multicore servers, MIT

• Goals: scaling a commodity OS

• Linux scalability (2010: scale Linux to 48 cores)

COMP9242 S2/2015 W12 46 An Analysis of Linux Scalability to Many Cores [Boyd-Wickizer et al., 2010]

Linux Brute Force Scalability
• Apply lessons from parallel computing and past research (a sloppy-counter sketch follows this slide):
  • sloppy counters
  • per-core data structures
  • fine-grained locks, lock-free structures
  • cache-line awareness
  • 3002 lines of code changed

• Conclusion: there is no scalability reason to give up on traditional operating system organizations just yet.

COMP9242 S2/2015 W12 47
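A generic per-core "sloppy" counter in the spirit of the paper (a sketch of the general technique, not the paper's exact data structure; core count and threshold are assumptions): each core accumulates updates locally and only touches the shared total when its local delta crosses a threshold.

```c
/* Sloppy counter sketch (C11): per-core deltas flushed to a shared total
 * only occasionally, so most updates stay core-local. */
#include <stdatomic.h>
#include <stdalign.h>

#define NCORES    8      /* assumed core count */
#define THRESHOLD 64     /* flush when the local delta gets this large */

static atomic_long global_total;

static struct {
    alignas(64) long delta;          /* only ever touched by one core */
} local[NCORES];

void sloppy_add(int core, long n) {
    local[core].delta += n;          /* fast path: no shared cache line */
    if (local[core].delta >= THRESHOLD || local[core].delta <= -THRESHOLD) {
        atomic_fetch_add_explicit(&global_total, local[core].delta,
                                  memory_order_relaxed);
        local[core].delta = 0;       /* slow path: rare global update */
    }
}

/* Reads see a slightly stale ("sloppy") value, which is often acceptable,
 * e.g. for reference counts that only matter when they reach zero. */
long sloppy_read(void) {
    return atomic_load_explicit(&global_total, memory_order_relaxed);
}
```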

Scalability  of  the  API  

• Context: 2013, previous multicore projects at MIT

• Goals: how do we know whether a system is really scalable?

• Workload-based evaluation: run a workload, plot scalability, fix problems
  • Did we miss any non-scalable workloads?
  • Did we find all the bottlenecks?

• Is there something fundamental that makes a system non-scalable? The interface might be a fundamental bottleneck

COMP9242 S2/2015 W12 48 The Scalable Commutativity Rule: Designing Scalable Software for Multicore Processors [Clements et al., 2013]

Page 13: Mul$processor+OS+cs9242/15/lectures/12-multiproc-4up.pdf2 Mul$processor+OS+ 5 COMP9242 S2/2015 W12 CPU App1 OS CPU App3 OS CPU App4 OS CPU App4 OS Memory OS data Application data App1

13

Scalable Commutativity Rule
• The Rule: whenever interface operations commute, they can be implemented in a way that scales.

• Commutative operations: the order of operations cannot be distinguished from their results
  – Example: creat() requires that the lowest available FD be returned, so it is not commutative: you can tell which call ran first (see the sketch after this slide)

• Why are commutative operations scalable? Results are independent of order ⇒ communication is unnecessary, and without communication there are no conflicts

• This informs the software design process
  • Design: a design guideline for scalable interfaces
  • Implementation: a clear target
  • Test: workload-independent testing

COMP9242 S2/2015 W12 49
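A tiny illustration of the creat()/lowest-FD example above (a sketch; the exact FD numbers assume only the standard descriptors 0–2 are open): because POSIX open() must return the lowest free descriptor, the results reveal which call happened first, so the two calls do not commute.

```c
#include <fcntl.h>
#include <stdio.h>

int main(void) {
    /* With only fds 0-2 open, the lowest-FD rule gives the first open() fd 3
     * and the second fd 4: the results expose the order of the two calls,
     * so they are not commutative (and a scalable implementation is ruled out). */
    int a = open("a.tmp", O_CREAT | O_RDWR, 0600);
    int b = open("b.tmp", O_CREAT | O_RDWR, 0600);
    printf("a=%d b=%d\n", a, b);   /* e.g. a=3 b=4; swapping the calls swaps them */
    return 0;
}
```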

Commuter: An Automated Scalability Testing Tool

COMP9242 S2/2015 W12 50

(sv6)

FlexSC
• Context: 2010, commodity multicores, U Toronto

• Goal: reduce the context-switch overhead of system calls

• Syscall context switch: the usual mode-switch overhead, but also cache and TLB pollution!

COMP9242 S2/2015 W12 51 FlexSC: Flexible System Call Scheduling with Exception-Less System Calls [Soares and Stumm., 2010]

FlexSC  

• Asynchronous system calls: batch system calls and run them on dedicated cores (a sketch of the shared syscall page follows this slide)

• FlexSC-Threads: M user threads on N kernel-visible threads, with M >> N

COMP9242 S2/2015 W12 52
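A sketch of the exception-less, batched syscall idea described above (field names, states, and layout are assumptions for illustration, not FlexSC's actual ABI): the application fills entries in a shared page and keeps running, while a kernel thread on a dedicated core polls for submitted entries, executes them, and writes back results.

```c
/* Exception-less syscall entries shared between an application core and a
 * dedicated syscall core (illustrative sketch). */
#include <stdatomic.h>
#include <stdint.h>

enum { FREE, BUSY, SUBMITTED, DONE };

struct syscall_entry {
    _Atomic int status;      /* FREE -> BUSY -> SUBMITTED -> DONE */
    int         number;      /* syscall number */
    uint64_t    args[6];
    int64_t     ret;
};

#define NENTRIES 64
static struct syscall_entry page[NENTRIES];   /* mapped shared with the kernel */

/* Application side: post a call without trapping into the kernel. */
int submit(int nr, const uint64_t args[6]) {
    for (int i = 0; i < NENTRIES; i++) {
        int expect = FREE;
        if (atomic_compare_exchange_strong(&page[i].status, &expect, BUSY)) {
            page[i].number = nr;
            for (int j = 0; j < 6; j++) page[i].args[j] = args[j];
            atomic_store(&page[i].status, SUBMITTED);  /* visible to syscall core */
            return i;
        }
    }
    return -1;   /* page full: the user-level scheduler can run another thread */
}
```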

Page 14: Mul$processor+OS+cs9242/15/lectures/12-multiproc-4up.pdf2 Mul$processor+OS+ 5 COMP9242 S2/2015 W12 CPU App1 OS CPU App3 OS CPU App4 OS CPU App4 OS Memory OS data Application data App1

14

FlexSC  Results  

COMP9242 S2/2015 W12 53

Apache FlexSC: batching, sys call core redirect

No Sharing

• Multikernel: Barrelfish
• fos: factored operating system

COMP9242 S2/2015 W12 54 The Multikernel: A new OS architecture for scalable multicore systems [Baumann et al., 2009] http://www.barrelfish.org/

Barrelfish  

• Context: 2007, large multicore machines appearing, 100s of cores on the horizon, NUMA (cc and non-cc); ETH Zurich and Microsoft

• Goals: scale to many cores; support and manage heterogeneous hardware

• Approach: structure the OS as a distributed system

• Design principles: interprocessor communication is explicit; the OS structure is hardware-neutral; state is replicated

• Microkernel: similar to seL4, with capabilities

COMP9242 S2/2015 W12 55 The Multikernel: A new OS architecture for scalable multicore systems [Baumann et al., 2009] http://www.barrelfish.org/

Barrelfish  

COMP9242 S2/2015 W12 56

Page 15: Mul$processor+OS+cs9242/15/lectures/12-multiproc-4up.pdf2 Mul$processor+OS+ 5 COMP9242 S2/2015 W12 CPU App1 OS CPU App3 OS CPU App4 OS CPU App4 OS Memory OS data Application data App1

15

Barrelfish: Replication

• Kernel + monitor: memory is shared only for message channels

• Monitor: collectively coordinates system-wide state

• System-wide state: memory allocation tables, address space mappings, capability lists

• What state is replicated in Barrelfish? Capability lists

• Consistency and coordination:
  • Retype: two-phase commit to execute the operation globally in order
  • Page (re/un)mapping: one-phase commit to synchronise TLBs

COMP9242 S2/2015 W12 57

Barrelfish: Communication
• Different mechanisms:
  • Intra-core: kernel endpoints
  • Inter-core: URPC

• URPC (a sketch follows this slide)
  • Uses cache coherence + polling
  • Shared buffer
    – The sender writes a cache line
    – The receiver polls on the cache line (on the last word, so it never sees a partial message)
  • Polling?
    – The cache line only changes when the sender writes, so polling is cheap
    – Switch to blocking and an IPI if the wait is too long

COMP9242 S2/2015 W12 58
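A minimal sketch of the URPC-style channel described above (the 64-byte line size, message layout, and names are assumptions, not Barrelfish's actual implementation): the sender fills one cache line and writes the sequence word last; the receiver spins on that word, which stays in its local cache until the sender's write invalidates it.

```c
/* One-way URPC-style channel over a shared, cache-coherent buffer (C11). */
#include <stdatomic.h>
#include <stdint.h>
#include <stdalign.h>

#define CACHE_LINE 64

struct urpc_msg {
    alignas(CACHE_LINE) uint64_t payload[7];  /* 56 bytes of payload */
    _Atomic uint64_t seq;                     /* last word: written LAST by sender */
};

/* Sender core: fill the payload, then publish by writing the sequence word. */
void urpc_send(struct urpc_msg *m, const uint64_t data[7], uint64_t seq)
{
    for (int i = 0; i < 7; i++)
        m->payload[i] = data[i];
    atomic_store_explicit(&m->seq, seq, memory_order_release);
}

/* Receiver core: poll on seq; the line only changes when the sender writes,
 * so the spin stays in the local cache until a new message arrives. */
void urpc_recv(struct urpc_msg *m, uint64_t expect_seq, uint64_t out[7])
{
    while (atomic_load_explicit(&m->seq, memory_order_acquire) != expect_seq)
        ;   /* could fall back to blocking + IPI after spinning too long */
    for (int i = 0; i < 7; i++)
        out[i] = m->payload[i];
}
```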

Barrelfish: Results – message passing vs caching

COMP9242 S2/2015 W12 59

[Graph: latency (cycles × 1000, 0–12) vs number of cores (2–16) for shared-memory updates of 1, 2, 4 and 8 cache lines (SHM1–SHM8) versus message passing (MSG1, MSG8) and a server approach.]

Barrelfish: Results – broadcast vs multicast

COMP9242 S2/2015 W12 60

[Graph: latency (cycles × 1000, 0–14) vs number of cores (2–32) for broadcast, unicast, multicast, and NUMA-aware multicast.]

Page 16: Mul$processor+OS+cs9242/15/lectures/12-multiproc-4up.pdf2 Mul$processor+OS+ 5 COMP9242 S2/2015 W12 CPU App1 OS CPU App3 OS CPU App4 OS CPU App4 OS Memory OS data Application data App1

16

Barrelfish: Results – TLB shootdown

COMP9242 S2/2015 W12 61

[Graph: TLB shootdown latency (cycles × 1000, 0–60) vs number of cores (2–32) for Windows, Linux, and Barrelfish.]

Summary  

COMP9242 S2/2015 W12 62

Summary
• Trends in multicore: scale (100+ cores), NUMA, no cache coherence, the machine as a distributed system, heterogeneity

• OS design guidelines: avoid shared data, explicit communication, locality

• Approaches to a multicore OS:
  • Partition the machine (Disco, Tessellation)
  • Reduce sharing (K42, Corey, Linux, FlexSC, scalable commutativity)
  • No sharing (Barrelfish, fos)

COMP9242 S2/2015 W12 63

