Page 1: The Linux Scheduler: a Decade of Wasted Cores

1

The Linux Scheduler: a Decade of Wasted Cores

Authored by:
1. Jean-Pierre Lozi (Université Nice Sophia Antipolis)
2. Baptiste Lepers (École Polytechnique Fédérale de Lausanne)
3. Justin Funston (University of British Columbia)
4. Fabien Gaud (Coho Data)
5. Vivien Quéma (Grenoble Institute of Technology)
6. Alexandra Fedorova (University of British Columbia)

EuroSys Conference (18 – 21 April 2016) paper
Paper: http://www.ece.ubc.ca/~sasha/papers/eurosys16-final29.pdf
Reference slides: http://www.i3s.unice.fr/~jplozi/wastedcores/files/extended_talk.pdf
Reference summary: https://blog.acolyer.org/2016/04/26/the-linux-scheduler-a-decade-of-wasted-cores/

Papers We Love #20 (30 May 2016) By: Yeo Kheng Meng ([email protected])

Page 2: The Linux Scheduler: a Decade of Wasted Cores

2

This presentation is best viewed with the animations enabled

Page 3: The Linux Scheduler: a Decade of Wasted Cores

3

Some history
• Everybody wants ↑ CPU performance
• Before 2004:
  • ↓ transistor size
  • ↓ power of each transistor (Dennard Scaling)
  • ↑ CPU frequency -> ↑ CPU performance

• ~2005-2007 to present:
  • End of Dennard Scaling
  • Increased use of multicores to ↑ CPU performance

• But did Linux properly take advantage of these cores?

Page 4: The Linux Scheduler: a Decade of Wasted Cores

4

Objective/Invariant of a Linux scheduler

• Balance load evenly across CPU cores to maximise resource use
• No idle CPU cores while other cores have waiting threads

Page 5: The Linux Scheduler: a Decade of Wasted Cores

5

Test setup
• AMD Bulldozer Opteron 6272 (Socket G34) + 512GB RAM

• 8 NUMA nodes × 8 cores (64 cores in total)

• NUMA: Non-Uniform Memory Access
  • Cores have faster access to the local memory of their own node than to foreign memory on other nodes
  • Each Opteron NUMA node has faster access to its last-level (L3) cache
  • Total RAM split into 64GB chunks, one per node

• Linux kernel up to 4.3
• TPC-H benchmark
  • From the Transaction Processing Performance Council
  • TPC-H: complex database queries and data modification

http://developer.amd.com/resources/documentation-articles/articles-whitepapers/introduction-to-magny-cours/

Page 6: The Linux Scheduler: a Decade of Wasted Cores

6

What is the problem?

• Idle cores exist despite other cores being overloaded
• Performance bugs in the Linux kernel

Page 7: The Linux Scheduler: a Decade of Wasted Cores

7

What are the bugs?
1. Group Imbalance (Mar 2011)
2. Scheduling Group Construction (Nov 2013)
3. Overload-on-Wakeup (Dec 2009)
4. Missing Scheduling Domains (Feb 2015)

Page 8: The Linux Scheduler: a Decade of Wasted Cores

8

First some concepts
• Thread weight
  • Higher priority -> higher weight
  • Decided by Linux

• Timeslice
  • Time allocated for each thread to run on the CPU within a certain time interval (see the sketch below)
  • CPU cycles divided in proportion to each thread's weight

• Runtime
  • Cumulative time a thread has spent on the CPU
  • Once runtime > timeslice, the thread is preempted

• Runqueue
  • Queue of threads waiting to be executed by the CPU
  • Queue sorted by runtime
  • Implemented as a red-black tree

• Completely Fair Scheduler (CFS)
  • Linux's scheduler, based on Weighted Fair Queuing
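To make the weight-to-timeslice relationship concrete, here is a minimal userspace sketch (plain C, not kernel code). The thread names, weights and the 1-second interval are the values used in the example on the following slides:

```c
/* Minimal sketch: timeslice assignment in proportion to thread weight.
 * timeslice = (weight / total_weight) * interval */
#include <stdio.h>

struct thread { const char *name; int weight; };

int main(void) {
    struct thread threads[] = {
        {"A", 10}, {"B", 20}, {"C", 40}, {"D", 50}, {"E", 80},
    };
    const int n = sizeof(threads) / sizeof(threads[0]);
    const double interval = 1.0;          /* scheduling period in seconds */

    int total_weight = 0;
    for (int i = 0; i < n; i++)
        total_weight += threads[i].weight;

    for (int i = 0; i < n; i++)
        printf("Thread %s: timeslice = %.2f s\n", threads[i].name,
               threads[i].weight * interval / total_weight);
    return 0;
}
```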

Page 9: The Linux Scheduler: a Decade of Wasted Cores

9

Completely Fair Scheduler (Single-Core)

Thread weights and timeslices (total weight = 200, time interval = 1 second):
Thread A: weight 10, timeslice = 10 / 200 * 1 = 0.05 s
Thread B: weight 20, timeslice = 20 / 200 * 1 = 0.10 s
Thread C: weight 40, timeslice = 40 / 200 * 1 = 0.20 s
Thread D: weight 50, timeslice = 50 / 200 * 1 = 0.25 s
Thread E: weight 80, timeslice = 80 / 200 * 1 = 0.40 s

Runqueue (sorted by runtime): A (0), B (0), C (0), D (0), E (0)
CPU core: Thread A, at the head of the runqueue, is dispatched. Time elapsed: 0 s

Page 10: The Linux Scheduler: a Decade of Wasted Cores

10

Completely Fair Scheduler (Single-Core)

(Weight/timeslice table as above: total weight = 200, time interval = 1 second)

Runqueue (sorted by runtime): B (0), C (0), D (0), E (0)
CPU core: Thread A is preempted after its 0.05 s timeslice; Thread B starts running. Time elapsed: 0.05 s

Page 11: The Linux Scheduler: a Decade of Wasted Cores

11

Completely Fair Scheduler (Single-Core)

(Weight/timeslice table as above: total weight = 200, time interval = 1 second)

Runqueue (sorted by runtime): C (0), D (0), E (0), A (0.05)
CPU core: Thread B is preempted after its 0.10 s timeslice; Thread C starts running. Time elapsed: 0.15 s

Page 12: The Linux Scheduler: a Decade of Wasted Cores

12

Completely Fair Scheduler (Single-Core)

(Weight/timeslice table as above: total weight = 200, time interval = 1 second)

Runqueue (sorted by runtime): D (0), E (0), A (0.05), B (0.10)
CPU core: Thread C is preempted after its 0.20 s timeslice; Thread D starts running. Time elapsed: 0.35 s

Page 13: The Linux Scheduler: a Decade of Wasted Cores

13

Completely Fair Scheduler (Single-Core)

(Weight/timeslice table as above: total weight = 200, time interval = 1 second)

Runqueue (sorted by runtime): E (0), A (0.05), B (0.10), C (0.20)
CPU core: Thread D is preempted after its 0.25 s timeslice; Thread E starts running. Time elapsed: 0.60 s

Page 14: The Linux Scheduler: a Decade of Wasted Cores

14

Completely Fair Scheduler (Single-Core)

(Weight/timeslice table as above: total weight = 200, time interval = 1 second)

Runqueue (sorted by runtime): A (0.05), B (0.10), C (0.20), D (0.25)
CPU core: Thread E runs its 0.40 s timeslice. Time elapsed: 1.00 s

Page 15: The Linux Scheduler: a Decade of Wasted Cores

15

Completely Fair Scheduler (Single-Core)

(Weight/timeslice table as above: total weight = 200, time interval = 1 second)

Runqueue (sorted by runtime): A (0.05), B (0.10), C (0.20), D (0.25), E (0.40)
CPU core: idle; every thread has used its timeslice for this interval. Time elapsed: 1.00 s

Page 16: The Linux Scheduler: a Decade of Wasted Cores

16

Completely Fair Scheduler (Single-Core)

A new thread F (weight 50) joins, so timeslices are recalculated over the new total weight of 250 (time interval: 1 second):
Thread A: weight 10, timeslice = 10 / 250 * 1 = 0.04 s
Thread B: weight 20, timeslice = 20 / 250 * 1 = 0.08 s
Thread C: weight 40, timeslice = 40 / 250 * 1 = 0.16 s
Thread D: weight 50, timeslice = 50 / 250 * 1 = 0.20 s
Thread E: weight 80, timeslice = 80 / 250 * 1 = 0.32 s
Thread F: weight 50, timeslice = 50 / 250 * 1 = 0.20 s

Runqueue (sorted by runtime): F (0), A (0.05), B (0.10), C (0.20), D (0.25), E (0.40)
CPU core: Thread F, with the lowest runtime, is dispatched. Time elapsed: 0 s

Page 17: The Linux Scheduler: a Decade of Wasted Cores

17

Completely Fair Scheduler (Single-Core)

(Weight/timeslice table as above: total weight = 250, time interval = 1 second)

Runqueue (sorted by runtime): A (0.05), B (0.10), C (0.20), D (0.25), E (0.40)
CPU core: Thread F is preempted after its 0.20 s timeslice; Thread A starts running. Time elapsed: 0.20 s

Page 18: The Linux Scheduler: a Decade of Wasted Cores

18

Completely Fair Scheduler (Single-Core)

(Weight/timeslice table as above: total weight = 250, time interval = 1 second)

Runqueue (sorted by runtime): B (0.10), C (0.20), F (0.20), D (0.25), E (0.40)
CPU core: Thread A is preempted after its 0.04 s timeslice; Thread B starts running. Time elapsed: 0.24 s
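The stepping on the preceding slides can be reproduced with a small userspace simulation. This is only a sketch of the pick-the-lowest-runtime idea; the real CFS uses weighted virtual runtime and a red-black tree rather than a linear scan, and resetting the elapsed-time counter when Thread F joins simply mirrors the slides:

```c
/* Sketch (not the kernel's CFS): always dispatch the runnable thread with the
 * lowest runtime, let it run for its weight-proportional timeslice, then
 * charge the timeslice to its runtime. */
#include <stdio.h>

#define MAX_THREADS 8

struct thread { const char *name; int weight; double runtime; };

static struct thread threads[MAX_THREADS] = {
    {"A", 10, 0}, {"B", 20, 0}, {"C", 40, 0}, {"D", 50, 0}, {"E", 80, 0},
};
static int nthreads = 5;

static int total_weight(void) {
    int sum = 0;
    for (int i = 0; i < nthreads; i++)
        sum += threads[i].weight;
    return sum;
}

/* Dispatch once: pick the lowest-runtime thread, charge it one timeslice. */
static void dispatch(double interval, double *elapsed) {
    int min = 0;
    for (int i = 1; i < nthreads; i++)
        if (threads[i].runtime < threads[min].runtime)
            min = i;
    double slice = (double)threads[min].weight / total_weight() * interval;
    threads[min].runtime += slice;
    *elapsed += slice;
    printf("run %s for %.2f s (elapsed %.2f s)\n",
           threads[min].name, slice, *elapsed);
}

int main(void) {
    double elapsed = 0.0;

    for (int i = 0; i < 5; i++)           /* first interval: A, B, C, D, E */
        dispatch(1.0, &elapsed);

    threads[nthreads++] = (struct thread){"F", 50, 0};  /* F joins */
    elapsed = 0.0;                        /* new 1-second interval */
    for (int i = 0; i < 2; i++)           /* next dispatches: F, then A */
        dispatch(1.0, &elapsed);
    return 0;
}
```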

Page 19: The Linux Scheduler: a Decade of Wasted Cores

19

What about multi-cores? (Global runqueue)

[Diagram: CPU Cores 0-3 all sharing one global runqueue]

Problems (sketched below):
• Context switching requires access to the runqueue
• Only one core can access/manipulate the runqueue at any one time
• Other cores must wait to get new threads
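A minimal sketch of the contention problem, assuming nothing about the kernel's actual data structures: four worker threads stand in for cores, and every one of them must take the same mutex to pull work off the single global runqueue:

```c
/* Illustrative sketch: a single lock-protected global runqueue serialises
 * all "cores". Compile with -pthread. */
#include <pthread.h>
#include <stdio.h>

#define NCORES 4
#define NTASKS 16

static int runqueue[NTASKS];             /* the single, global runqueue */
static int head = 0;                     /* next task to hand out */
static pthread_mutex_t rq_lock = PTHREAD_MUTEX_INITIALIZER;

static void *core(void *arg) {
    long id = (long)arg;
    for (;;) {
        pthread_mutex_lock(&rq_lock);    /* every core contends for this lock */
        if (head == NTASKS) {
            pthread_mutex_unlock(&rq_lock);
            return NULL;
        }
        int task = runqueue[head++];
        pthread_mutex_unlock(&rq_lock);
        printf("core %ld runs task %d\n", id, task);
    }
}

int main(void) {
    pthread_t cores[NCORES];
    for (int i = 0; i < NTASKS; i++)
        runqueue[i] = i;
    for (long i = 0; i < NCORES; i++)
        pthread_create(&cores[i], NULL, core, (void *)i);
    for (int i = 0; i < NCORES; i++)
        pthread_join(cores[i], NULL);
    return 0;
}
```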

Page 20: The Linux Scheduler: a Decade of Wasted Cores

20

What about multi-cores? (Per-core runqueue)

[Diagram: CPU Cores 0-3, each with its own runqueue (Core 0 Runqueue … Core 3 Runqueue)]

Scheduler objectives in a multi-runqueue system:
1. Runqueues have to be load-balanced periodically (every 4 ms) or when threads are added/awoken
2. No runqueue should have a high proportion of high-priority threads
3. Should not balance on every change in a queue
   • Load balancing is computationally heavy
   • Moving threads across runqueues causes cache misses
   • DO IT LESS OFTEN, DO IT BETTER
4. No idle cores should be allowed -> emergency load balancing when a core goes idle (both triggers are sketched below)
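The two balancing triggers in the list above can be illustrated with a toy sketch; the 4 ms period comes from the slide, while the queue lengths and the one-thread-at-a-time stealing policy are illustrative assumptions of mine:

```c
/* Toy sketch of the two balancing triggers: a periodic balance every 4 ms,
 * and an emergency balance as soon as a core goes idle. */
#include <stdio.h>

#define NCORES 4

static int rq_len[NCORES] = {6, 1, 3, 5};   /* runnable threads per core */

/* Steal one thread from the busiest runqueue if it is noticeably busier. */
static void steal_from_busiest(int target) {
    int busiest = target;
    for (int c = 0; c < NCORES; c++)
        if (rq_len[c] > rq_len[busiest])
            busiest = c;
    if (busiest != target && rq_len[busiest] > rq_len[target] + 1) {
        rq_len[busiest]--;
        rq_len[target]++;
        printf("core %d stole a thread from core %d\n", target, busiest);
    }
}

/* Trigger 1: periodic load balancing every 4 ms. */
static void scheduler_tick(int ms) {
    if (ms % 4 == 0)
        for (int c = 0; c < NCORES; c++)
            steal_from_busiest(c);
}

/* Trigger 2: emergency balancing the moment a core runs out of work. */
static void core_went_idle(int core) {
    printf("core %d went idle, emergency balance\n", core);
    steal_from_busiest(core);
}

int main(void) {
    for (int ms = 0; ms < 12; ms++)
        scheduler_tick(ms);
    rq_len[1] = 0;              /* pretend core 1 drained its runqueue */
    core_went_idle(1);
    return 0;
}
```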

Page 21: The Linux Scheduler: a Decade of Wasted Cores

21

Naïve runqueue load-balancing algorithms
• Balance runqueues by the same number of threads?
  • Ignores thread priority; some threads are more important than others

• Balance runqueues by thread weights?
  • Some high-priority threads can sleep a lot
  • Scenario: one sleepy high-priority thread alone in a queue
  • -> Waste of CPU resources

Example of balancing by weight (W = weight, % = CPU usage):
Core 0 Runqueue: Thread A (W=80, 25%)                                       Total weight = 80
Core 1 Runqueue: Thread B (W=25, 60%), Thread C (W=25, 40%),
                 Thread D (W=10, 50%), Thread E (W=20, 50%)                 Total weight = 80

Page 22: The Linux Scheduler: a Decade of Wasted Cores

22

Slightly improved load-balancing algorithm
• Concept of "load": load = weight × average CPU utilisation
• Balance runqueues by total load (sketch below)

Before balancing:
Core 0 Runqueue: Thread A (W=80, 25%) load 20                               Total load = 20
Core 1 Runqueue: Thread B (W=25, 60%) load 15, Thread C (W=25, 40%) load 10,
                 Thread D (W=10, 50%) load 5, Thread E (W=20, 50%) load 10  Total load = 40

After balancing by load (Thread E moves to Core 0's runqueue):
Core 0 Runqueue: Thread A (W=80, 25%) load 20, Thread E (W=20, 50%) load 10 Total load = 30
Core 1 Runqueue: Thread B (W=25, 60%) load 15, Thread C (W=25, 40%) load 10,
                 Thread D (W=10, 50%) load 5                                Total load = 30
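A small sketch of the load metric and of a single balancing step, using the example threads above (load = weight × average CPU utilisation is inferred from the slide's numbers):

```c
/* Sketch of the "load" metric and one balancing step between two runqueues. */
#include <stdio.h>

struct thread { const char *name; int weight; double cpu_use; };

static double load(const struct thread *t) {
    return t->weight * t->cpu_use;       /* e.g. A: 80 * 0.25 = 20 */
}

static double total_load(const struct thread *q, int n) {
    double sum = 0;
    for (int i = 0; i < n; i++)
        sum += load(&q[i]);
    return sum;
}

int main(void) {
    struct thread core0[8] = { {"A", 80, 0.25} };
    struct thread core1[8] = { {"B", 25, 0.60}, {"C", 25, 0.40},
                               {"D", 10, 0.50}, {"E", 20, 0.50} };
    int n0 = 1, n1 = 4;

    printf("before: core0 = %.0f, core1 = %.0f\n",
           total_load(core0, n0), total_load(core1, n1));

    /* One balancing step: move the last thread of the heavier queue across.
     * (Real balancing picks threads more carefully; this is just the idea.) */
    if (total_load(core1, n1) > total_load(core0, n0))
        core0[n0++] = core1[--n1];

    printf("after:  core0 = %.0f, core1 = %.0f\n",
           total_load(core0, n0), total_load(core1, n1));
    return 0;
}
```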

Page 23: The Linux Scheduler: a Decade of Wasted Cores

23

Pseudocode of the load-balancing algorithm

• A scheduling group (SG) is a subset of a scheduling domain (SD)
• An SG comprises one or more CPU cores

(Numbers refer to the line numbers of Algorithm 1 in the paper.)
1.  For each SD, from the lowest hierarchy level to the highest:
2.    Select the designated core in the SD to run the algorithm (the first idle core, or core 0)
11.   Compute the average load of each SG in the SD
13.   Select the SG with the highest average load
15.   If the load of that SG > the load of the current SG, balance load by stealing work from the other SG

A simplified sketch of this loop follows below.
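Here is a simplified sketch of the hierarchical loop, with assumptions of mine standing in for the kernel's real sched_domain/sched_group structures; the comments point back to the algorithm's line numbers:

```c
/* Simplified sketch: each scheduling domain (SD) is a list of scheduling
 * groups (SG); the designated core compares its own group's average load
 * with the busiest group's and decides whether to steal work. */
#include <stdio.h>

#define MAX_CORES 8

struct sched_group { int cores[MAX_CORES]; int ncores; };
struct sched_domain { struct sched_group groups[4]; int ngroups; };

static double core_load[MAX_CORES] = {30, 0, 10, 10, 5, 5, 0, 0};

static double avg_load(const struct sched_group *sg) {
    double sum = 0;
    for (int i = 0; i < sg->ncores; i++)
        sum += core_load[sg->cores[i]];
    return sum / sg->ncores;
}

/* One balancing pass, run by the designated core (line 2), walking its
 * domains bottom-up (line 1). */
static void load_balance(int this_core, struct sched_domain *sds, int nsd) {
    for (int d = 0; d < nsd; d++) {
        struct sched_domain *sd = &sds[d];
        struct sched_group *local = NULL, *busiest = NULL;
        for (int g = 0; g < sd->ngroups; g++) {          /* lines 11-13 */
            struct sched_group *sg = &sd->groups[g];
            for (int i = 0; i < sg->ncores; i++)
                if (sg->cores[i] == this_core)
                    local = sg;
            if (!busiest || avg_load(sg) > avg_load(busiest))
                busiest = sg;
        }
        if (local && busiest != local &&
            avg_load(busiest) > avg_load(local))         /* line 15 */
            printf("core %d: steal work from the busiest group in domain %d\n",
                   this_core, d);
    }
}

int main(void) {
    /* Two-level hierarchy for cores 0-3: core pairs, then the whole node. */
    struct sched_domain sds[2] = {
        { .groups = { {{0}, 1}, {{1}, 1} }, .ngroups = 2 },
        { .groups = { {{0, 1}, 2}, {{2, 3}, 2} }, .ngroups = 2 },
    };
    load_balance(1, sds, 2);   /* core 1 is idle, so it runs the algorithm */
    return 0;
}
```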

Page 24: The Linux Scheduler: a Decade of Wasted Cores

24

Node 0 Node 4 Node 5 Node 1

Node 6 Node 2 Node 3 Node 7

Load balancing hierarchical order

Scheduling domain hierarchy:
1. 2 cores
2. 1 node
3. Directly-connected nodes
4. All nodes

Page 25: The Linux Scheduler: a Decade of Wasted Cores

25

Node 0 Node 4 Node 5 Node 1

Node 6 Node 2 Node 3 Node 7

Load balancing hierarchical order (Level 1)
(Scheduling domain hierarchy: 1. 2 cores, 2. 1 node, 3. directly-connected nodes, 4. all nodes)

Balancing between pairs of cores. E.g. Core 0 balances with Core 1, Core 2 with Core 3, …, Core 62 with Core 63

Scheduling domains: CPU pairs
Number of SDs: 32
Scheduling groups: CPU cores

Page 26: The Linux Scheduler: a Decade of Wasted Cores

26

Node 0 Node 4 Node 5 Node 1

Node 6 Node 2 Node 3 Node 7

Load balancing hierarchical order (Level 2)
(Scheduling domain hierarchy: 1. 2 cores, 2. 1 node, 3. directly-connected nodes, 4. all nodes)

The first pair of every node balances with the other cores in the same node

Scheduling domains: NUMA nodes
Number of SDs: 8
Scheduling groups: CPU pairs

Page 27: The Linux Scheduler: a Decade of Wasted Cores

27

Node 0 Node 4 Node 5 Node 1

Node 6 Node 2 Node 3 Node 7

Load balancing hierarchical order (Level 3)
(Scheduling domain hierarchy: 1. 2 cores, 2. 1 node, 3. directly-connected nodes, 4. all nodes)

Node 0 balances with the nodes one hop away, stealing threads from the heaviest node. Nodes in the current domain: {0, 1, 2, 4, 6}

Scheduling domains: directly-connected nodes
Number of SDs: 8
Scheduling groups: NUMA nodes

Page 28: The Linux Scheduler: a Decade of Wasted Cores

28

Node 0 Node 4 Node 5 Node 1

Node 6 Node 2 Node 3 Node 7

Load balancing hierarchical order (Level 3)
(Scheduling domain hierarchy: 1. 2 cores, 2. 1 node, 3. directly-connected nodes, 4. all nodes)

Node 1 balances with the nodes one hop away, stealing threads from the heaviest node. Nodes in the current domain: {1, 0, 3, 4, 5, 7}

Scheduling domains: directly-connected nodes
Number of SDs: 8
Scheduling groups: NUMA nodes

Page 29: The Linux Scheduler: a Decade of Wasted Cores

29

Node 0 Node 4 Node 5 Node 1

Node 6 Node 2 Node 3 Node 7

Load balancing hierarchical order (Level 3)
(Scheduling domain hierarchy: 1. 2 cores, 2. 1 node, 3. directly-connected nodes, 4. all nodes)

Node 2 balances with the nodes one hop away, stealing threads from the heaviest node. Nodes in the current domain: {2, 0, 3, 4, 5, 6, 7}

Scheduling domains: directly-connected nodes
Number of SDs: 8
Scheduling groups: NUMA nodes

Page 30: The Linux Scheduler: a Decade of Wasted Cores

30

Node 0 Node 4 Node 5 Node 1

Node 6 Node 2 Node 3 Node 7

Load balancing hierarchical order (Level 3)
(Scheduling domain hierarchy: 1. 2 cores, 2. 1 node, 3. directly-connected nodes, 4. all nodes)

Node 3 balances with the nodes one hop away, stealing threads from the heaviest node. Nodes in the current domain: {3, 1, 2, 4, 5, 7}

Scheduling domains: directly-connected nodes
Number of SDs: 8
Scheduling groups: NUMA nodes

Page 31: The Linux Scheduler: a Decade of Wasted Cores

31

Node 0 Node 4 Node 5 Node 1

Node 6 Node 2 Node 3 Node 7

Load balancing hierarchical order (Level 3)
(Scheduling domain hierarchy: 1. 2 cores, 2. 1 node, 3. directly-connected nodes, 4. all nodes)

Node 4 balances with the nodes one hop away, stealing threads from the heaviest node. Nodes in the current domain: {4, 0, 1, 2, 3, 5, 6}

Scheduling domains: directly-connected nodes
Number of SDs: 8
Scheduling groups: NUMA nodes

Page 32: The Linux Scheduler: a Decade of Wasted Cores

32

Node 0 Node 4 Node 5 Node 1

Node 6 Node 2 Node 3 Node 7

Load balancing hierarchical order (Level 3)
(Scheduling domain hierarchy: 1. 2 cores, 2. 1 node, 3. directly-connected nodes, 4. all nodes)

Node 5 balances with the nodes one hop away, stealing threads from the heaviest node. Nodes in the current domain: {5, 1, 2, 3, 4, 7}

Scheduling domains: directly-connected nodes
Number of SDs: 8
Scheduling groups: NUMA nodes

Page 33: The Linux Scheduler: a Decade of Wasted Cores

33

Node 0 Node 4 Node 5 Node 1

Node 6 Node 2 Node 3 Node 7

Load balancing hierarchical order (Level 3)
(Scheduling domain hierarchy: 1. 2 cores, 2. 1 node, 3. directly-connected nodes, 4. all nodes)

Node 6 balances with the nodes one hop away, stealing threads from the heaviest node. Nodes in the current domain: {6, 0, 2, 4, 7}

Scheduling domains: directly-connected nodes
Number of SDs: 8
Scheduling groups: NUMA nodes

Page 34: The Linux Scheduler: a Decade of Wasted Cores

34

Node 0 Node 4 Node 5 Node 1

Node 6 Node 2 Node 3 Node 7

Load balancing hierarchical order (Level 3)
(Scheduling domain hierarchy: 1. 2 cores, 2. 1 node, 3. directly-connected nodes, 4. all nodes)

Node 7 balances with the nodes one hop away, stealing threads from the heaviest node. Nodes in the current domain: {7, 1, 2, 3, 5, 6}

Scheduling domains: directly-connected nodes
Number of SDs: 8
Scheduling groups: NUMA nodes

Page 35: The Linux Scheduler: a Decade of Wasted Cores

35

Node 0 Node 4 Node 5 Node 1

Node 6 Node 2 Node 3 Node 7

Load balancing hierarchical order (Level 4)
(Scheduling domain hierarchy: 1. 2 cores, 2. 1 node, 3. directly-connected nodes, 4. all nodes)

The first scheduling group is constructed by picking the first core/node (Node 0) and its directly-connected nodes. The second scheduling group is constructed by picking the first node not covered by the first group (Node 3) and its directly-connected nodes.

Scheduling domains: all nodes
Number of SDs: 1
Scheduling groups: sets of directly-connected nodes

Page 36: The Linux Scheduler: a Decade of Wasted Cores

36

Bug 1: Group Imbalance
• When a core tries to steal work from another SG, it compares the average load of that SG with its own, instead of looking at each core
• Load is only transferred if the average load of the target SG > that of the current SG
• Averages don't account for spread

Page 37: The Linux Scheduler: a Decade of Wasted Cores

37

Bug 1 Example Scenario

NUMA node 0 (average load = 500 -> "balanced"):
  CPU Core 0 runqueue: (empty)                                   Total = 0
  CPU Core 1 runqueue: A (load 1000)                             Total = 1000
NUMA node 1 (average load = 500 -> "balanced"):
  CPU Core 2 runqueue: B (125), C (125), D (125), E (125)        Total = 500
  CPU Core 3 runqueue: F (125), G (125), H (125), I (125)        Total = 500
(A thread is running on every core.)

The loads of the individual runqueues are unbalanced -> averages do not tell the true story (see the sketch below)
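The failure of the average, and the minimum-load comparison introduced on the next slides, can be seen in a few lines of C using the example's load values:

```c
/* Sketch of why the average hides the imbalance above, and why comparing the
 * minimum-loaded core of each group exposes it. */
#include <stdio.h>

static double average(const double *loads, int n) {
    double sum = 0;
    for (int i = 0; i < n; i++)
        sum += loads[i];
    return sum / n;
}

static double minimum(const double *loads, int n) {
    double min = loads[0];
    for (int i = 1; i < n; i++)
        if (loads[i] < min)
            min = loads[i];
    return min;
}

int main(void) {
    double node0[] = {0, 1000};     /* Core 0, Core 1 */
    double node1[] = {500, 500};    /* Core 2, Core 3 */

    printf("averages: %.0f vs %.0f -> looks balanced, nothing is stolen\n",
           average(node0, 2), average(node1, 2));
    printf("minimums: %.0f vs %.0f -> node 1 should give work to node 0\n",
           minimum(node0, 2), minimum(node1, 2));
    return 0;
}
```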

Page 38: The Linux Scheduler: a Decade of Wasted Cores

38

Bug 1 Solution: Compare minimum loads

NUMA node 0 (minimum load = 0):
  CPU Core 0 runqueue: (empty)
  CPU Core 1 runqueue: A (load 1000)
NUMA node 1 (minimum load = 500):
  CPU Core 2 runqueue: B (125), C (125), D (125), E (125)
  CPU Core 3 runqueue: F (125), G (125), H (125), I (125)
(A thread is running on every core.)

Page 39: The Linux Scheduler: a Decade of Wasted Cores

39

Bug 1 Solution: Compare minimum loads

After balancing (Core 0 steals threads D and E from Core 2):

NUMA node 0 (minimum load = 250):
  CPU Core 0 runqueue: D (125), E (125)
  CPU Core 1 runqueue: A (load 1000)
NUMA node 1 (minimum load = 250):
  CPU Core 2 runqueue: B (125), C (125)
  CPU Core 3 runqueue: F (125), G (125), H (125), I (125)
(A thread is running on every core.)

Page 40: The Linux Scheduler: a Decade of Wasted Cores

40

Bug 1 Solution: Compare minimum loads

Further balancing (Core 2 steals thread I from Core 3):

NUMA node 0:
  CPU Core 0 runqueue: D (125), E (125)
  CPU Core 1 runqueue: A (load 1000)
NUMA node 1:
  CPU Core 2 runqueue: B (125), C (125), I (125)
  CPU Core 3 runqueue: F (125), G (125), H (125)
(A thread is running on every core.)

Page 41: The Linux Scheduler: a Decade of Wasted Cores

41

Bug 1: Actual Scenario
• 1 lighter-load "make" process with 64 threads
• 2 heavier-load "R" processes of 1 thread each

• The heavier R threads run on cores in nodes 0 and 4, skewing up those nodes' average loads
• The other cores in nodes 0 and 4 are thus underloaded
• The other nodes are overloaded

Page 42: The Linux Scheduler: a Decade of Wasted Cores

42

Bug 1 Solution Results

• Speed of the "make" process increased by 13%
• No impact on the R threads

[Figure: CPU core usage, before vs after the fix]

Page 43: The Linux Scheduler: a Decade of Wasted Cores

43

Bug 2: Scheduling Group Construction
• Occurs with core pinning: running programs on a certain subset of cores
• No load balancing when threads are pinned on nodes 2 hops apart

Page 44: The Linux Scheduler: a Decade of Wasted Cores

44

Node 0 Node 4 Node 5 Node 1

Node 6 Node 2 Node 3 Node 7

Bug 2: Actual scenario

An application is pinned on nodes 1 and 2.

(Scheduling domain hierarchy: 1. 2 cores, 2. 1 node, 3. directly-connected nodes, 4. all nodes)

Page 45: The Linux Scheduler: a Decade of Wasted Cores

45

Node 0 Node 4 Node 5 Node 1

Node 6 Node 2 Node 3 Node 7

(Scheduling domain hierarchy: 1. 2 cores, 2. 1 node, 3. directly-connected nodes, 4. all nodes)

Bug 2: Actual scenario

1. The app is started and spawns multiple threads on the first core (Core 16) of Node 2

Page 46: The Linux Scheduler: a Decade of Wasted Cores

46

Node 0 Node 4 Node 5 Node 1

Node 6 Node 2 Node 3 Node 7

Bug 2: Actual scenario

2. Load is balanced across the first core pair

(Scheduling domain hierarchy: 1. 2 cores, 2. 1 node, 3. directly-connected nodes, 4. all nodes)
Scheduling domain: the Core 16-17 pair
Scheduling groups: Cores {16}, {17}

Page 47: The Linux Scheduler: a Decade of Wasted Cores

47

Node 0 Node 4 Node 5 Node 1

Node 6 Node 2 Node 3 Node 7

Bug 2: Actual scenario

3. Load is balanced across the entire node

(Scheduling domain hierarchy: 1. 2 cores, 2. 1 node, 3. directly-connected nodes, 4. all nodes)
Scheduling domain: Node 2
Scheduling groups: Cores {16, 17}, {18, 19}, {20, 21}, {22, 23}

Page 48: The Linux Scheduler: a Decade of Wasted Cores

48

Node 0 Node 4 Node 5 Node 1

Node 6 Node 2 Node 3 Node 7

Bug 2: Actual scenario

4. Load is balanced across nodes one hop away, but cannot be transferred due to core pinning. Load has not reached Node 1 yet.

(Scheduling domain hierarchy: 1. 2 cores, 2. 1 node, 3. directly-connected nodes, 4. all nodes)
Scheduling domain: nodes directly connected to Node 2
Scheduling groups: Nodes {2}, {0}, {3}, {4}, {5}, {6}

Page 49: The Linux Scheduler: a Decade of Wasted Cores

49

Node 0 Node 4 Node 5 Node 1

Node 6 Node 2 Node 3 Node 7

Bug 2: Actual scenario
(Scheduling domain hierarchy: 1. 2 cores, 2. 1 node, 3. directly-connected nodes, 4. all nodes)

5. Node 2's threads cannot be stolen by Node 1, as both nodes sit in the same scheduling groups, which therefore have the same average loads.
Cause: the scheduling groups at this level are constructed from the perspective of Core/Node 0.

Scheduling domains: all nodes in the machine
Scheduling groups: {0, 1, 2, 4, 6}, {1, 2, 3, 4, 5, 7}

Page 50: The Linux Scheduler: a Decade of Wasted Cores

50

Node 0 Node 4 Node 5 Node 1

Node 6 Node 2 Node 3 Node 7

Bug 2: Solution
(Scheduling domain hierarchy: 1. 2 cores, 2. 1 node, 3. directly-connected nodes, 4. all nodes)

Construct the SG from the perspective of "leader" Node 2 for Level 4.
(Scheduling Domain: {2, 0, 3, 4, 5, 6, 7})

Page 51: The Linux Scheduler: a Decade of Wasted Cores

51

Node 0 Node 4 Node 5 Node 1

Node 6 Node 2 Node 3 Node 7

Bug 2: Solution
(Scheduling domain hierarchy: 1. 2 cores, 2. 1 node, 3. directly-connected nodes, 4. all nodes)

Construct the other SG from the perspective of the other "leader", Node 1, which is not in the previous SG.
(Scheduling Domain: {1, 0, 3, 4, 5, 7})

Page 52: The Linux Scheduler: a Decade of Wasted Cores

52

Node 0 Node 4 Node 5 Node 1

Node 6 Node 2 Node 3 Node 7

Bug 2: Solution
(Scheduling domain hierarchy: 1. 2 cores, 2. 1 node, 3. directly-connected nodes, 4. all nodes)

(Scheduling Domains: {1, 0, 3, 4, 5, 7}, {2, 0, 3, 4, 5, 6, 7})

Nodes 1 and 2 are now in different scheduling groups, so Node 1 can now steal load from Node 2. (A sketch of this perspective-based group construction follows below.)
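Here is a sketch of the perspective-based group construction (illustrative code, not the kernel's implementation); the one-hop neighbour sets are the ones listed on the Level 3 slides:

```c
/* Build the level-4 scheduling groups from the perspective of the node doing
 * the balancing ("leader") instead of always from Node 0. */
#include <stdio.h>
#include <stdbool.h>

#define NNODES 8

/* neighbours[n][m] = true if node m is one hop away from node n */
static const bool neighbours[NNODES][NNODES] = {
    [0] = { [1]=1, [2]=1, [4]=1, [6]=1 },
    [1] = { [0]=1, [3]=1, [4]=1, [5]=1, [7]=1 },
    [2] = { [0]=1, [3]=1, [4]=1, [5]=1, [6]=1, [7]=1 },
    [3] = { [1]=1, [2]=1, [4]=1, [5]=1, [7]=1 },
    [4] = { [0]=1, [1]=1, [2]=1, [3]=1, [5]=1, [6]=1 },
    [5] = { [1]=1, [2]=1, [3]=1, [4]=1, [7]=1 },
    [6] = { [0]=1, [2]=1, [4]=1, [7]=1 },
    [7] = { [1]=1, [2]=1, [3]=1, [5]=1, [6]=1 },
};

/* Print the groups of the all-nodes domain as seen from `leader`:
 * group 1 = leader + its neighbours; group 2 = the first uncovered node
 * + its neighbours; repeat until every node is covered. */
static void build_groups(int leader) {
    bool covered[NNODES] = { false };
    printf("groups from the perspective of node %d:\n", leader);
    for (int start = leader; ; ) {
        printf("  {%d", start);
        covered[start] = true;
        for (int m = 0; m < NNODES; m++)
            if (neighbours[start][m]) {
                printf(", %d", m);
                covered[m] = true;
            }
        printf("}\n");
        int next = -1;
        for (int m = 0; m < NNODES; m++)
            if (!covered[m]) { next = m; break; }
        if (next < 0)
            break;
        start = next;
    }
}

int main(void) {
    build_groups(0);   /* old behaviour: everyone used node 0's groups */
    build_groups(2);   /* fixed behaviour: node 2 builds its own groups */
    return 0;
}
```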

Page 53: The Linux Scheduler: a Decade of Wasted Cores

53

Node 0 Node 4 Node 5 Node 1

Node 6 Node 2 Node 3 Node 7

New Level 4 balancing situation: Node leader 0 vs Node leader 3
(Scheduling domain hierarchy: 1. 2 cores, 2. 1 node, 3. directly-connected nodes, 4. all nodes)

The first scheduling group is constructed by picking Node 0 and its directly-connected nodes. The second scheduling group is constructed by picking the first node not covered by the first group (Node 3) and its directly-connected nodes.

Scheduling domains: all nodes
Number of SDs: 8
Scheduling groups: sets of directly-connected nodes

Page 54: The Linux Scheduler: a Decade of Wasted Cores

54

Node 0 Node 4 Node 5 Node 1

Node 6 Node 2 Node 3 Node 7

New Level 4 balancing situation: Node leader 1 vs Node leader 2
(Scheduling domain hierarchy: 1. 2 cores, 2. 1 node, 3. directly-connected nodes, 4. all nodes)

The first scheduling group is constructed by picking Node 1 and its directly-connected nodes. The second scheduling group is constructed by picking the first node not covered by the first group (Node 2) and its directly-connected nodes.

Scheduling domains: all nodes
Number of SDs: 8
Scheduling groups: sets of directly-connected nodes

Page 55: The Linux Scheduler: a Decade of Wasted Cores

55

Node 0 Node 4 Node 5 Node 1

Node 6 Node 2 Node 3 Node 7

New Level 4 balancing situation: Node leader 2 vs Node leader 1
(Scheduling domain hierarchy: 1. 2 cores, 2. 1 node, 3. directly-connected nodes, 4. all nodes)

The first scheduling group is constructed by picking Node 2 and its directly-connected nodes. The second scheduling group is constructed by picking the first node not covered by the first group (Node 1) and its directly-connected nodes.

Scheduling domains: all nodes
Number of SDs: 8
Scheduling groups: sets of directly-connected nodes

Page 56: The Linux Scheduler: a Decade of Wasted Cores

56

Node 0 Node 4 Node 5 Node 1

Node 6 Node 2 Node 3 Node 7

New Level 4 balancing situation: Node leader 3 vs Node leader 0
(Scheduling domain hierarchy: 1. 2 cores, 2. 1 node, 3. directly-connected nodes, 4. all nodes)

The first scheduling group is constructed by picking Node 3 and its directly-connected nodes. The second scheduling group is constructed by picking the first node not covered by the first group (Node 0) and its directly-connected nodes.

Scheduling domains: all nodes
Number of SDs: 8
Scheduling groups: sets of directly-connected nodes

Page 57: The Linux Scheduler: a Decade of Wasted Cores

57

Node 0 Node 4 Node 5 Node 1

Node 6 Node 2 Node 3 Node 7

New Level 4 balancing situation: Node leader 4 vs Node leader 7
(Scheduling domain hierarchy: 1. 2 cores, 2. 1 node, 3. directly-connected nodes, 4. all nodes)

The first scheduling group is constructed by picking Node 4 and its directly-connected nodes. The second scheduling group is constructed by picking the first node not covered by the first group (Node 7) and its directly-connected nodes.

Scheduling domains: all nodes
Number of SDs: 8
Scheduling groups: sets of directly-connected nodes

Page 58: The Linux Scheduler: a Decade of Wasted Cores

58

Node 0 Node 4 Node 5 Node 1

Node 6 Node 2 Node 3 Node 7

New Level 4 balancing situation: Node leader 5 vs Node leader 0
(Scheduling domain hierarchy: 1. 2 cores, 2. 1 node, 3. directly-connected nodes, 4. all nodes)

The first scheduling group is constructed by picking Node 5 and its directly-connected nodes. The second scheduling group is constructed by picking the first node not covered by the first group (Node 0) and its directly-connected nodes.

Scheduling domains: all nodes
Number of SDs: 8
Scheduling groups: sets of directly-connected nodes

Page 59: The Linux Scheduler: a Decade of Wasted Cores

59

Node 0 Node 4 Node 5 Node 1

Node 6 Node 2 Node 3 Node 7

New Level 4 balancing situation: Node leader 6 vs Node leader 1
(Scheduling domain hierarchy: 1. 2 cores, 2. 1 node, 3. directly-connected nodes, 4. all nodes)

The first scheduling group is constructed by picking Node 6 and its directly-connected nodes. The second scheduling group is constructed by picking the first node not covered by the first group (Node 1) and its directly-connected nodes.

Scheduling domains: all nodes
Number of SDs: 8
Scheduling groups: sets of directly-connected nodes

Page 60: The Linux Scheduler: a Decade of Wasted Cores

60

Node 0 Node 4 Node 5 Node 1

Node 6 Node 2 Node 3 Node 7

New Level 4 balancing situation: Node leader 7 vs Node leader 0
(Scheduling domain hierarchy: 1. 2 cores, 2. 1 node, 3. directly-connected nodes, 4. all nodes)

The first scheduling group is constructed by picking Node 7 and its directly-connected nodes. The second scheduling group is constructed by picking the first node not covered by the first group (Node 0) and its directly-connected nodes.

Scheduling domains: all nodes
Number of SDs: 8
Scheduling groups: sets of directly-connected nodes

Page 61: The Linux Scheduler: a Decade of Wasted Cores

61

Bug 2: Solution and Results
• Construct scheduling groups from the perspective of each core

Page 62: The Linux Scheduler: a Decade of Wasted Cores

62

Bug 3: Overload-on-Wakeup
• Scenario
  1. Thread A is running on a core in Node X
  2. Thread A goes to sleep on that core
  3. Node X gets busy
  4. The sleeping Thread A wakes up
  5. The scheduler only wakes it up on a core in Node X, even if other nodes are idle

• Rationale: maximise cache reuse

[Diagram: the core in Node X runs Thread A and several heavy threads; Thread A sleeps and later wakes up on that same overloaded core, while the other cores run only some light threads]

Page 63: The Linux Scheduler: a Decade of Wasted Cores

63

Bug 3: Actual scenario
• 64 worker threads of TPC-H + threads from other processes
• A woken thread stays on an overloaded core despite the existence of idle cores

Page 64: The Linux Scheduler: a Decade of Wasted Cores

64

Bug 3: Solution and results
• Wake the thread up on the core that has been idle the longest (see the sketch below)
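A minimal sketch of the placement rule, with made-up core count and idle times; real wakeup placement involves many more considerations:

```c
/* Sketch of the fix: at wakeup, place the thread on the core that has been
 * idle the longest, instead of insisting on a core in its old node. */
#include <stdio.h>

#define NCORES 8

/* How long each core has been idle, in ms (0 = busy). Example values. */
static const double idle_ms[NCORES] = {0, 0, 3, 0, 12, 0, 7, 0};

/* Return the core that has been idle the longest, or -1 if none is idle. */
static int pick_wakeup_core(void) {
    int best = -1;
    for (int c = 0; c < NCORES; c++)
        if (idle_ms[c] > 0 && (best < 0 || idle_ms[c] > idle_ms[best]))
            best = c;
    return best;
}

int main(void) {
    int core = pick_wakeup_core();
    if (core >= 0)
        printf("wake the thread on core %d (idle for %.0f ms)\n",
               core, idle_ms[core]);
    else
        printf("no idle core: wake the thread on its previous core\n");
    return 0;
}
```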

Page 65: The Linux Scheduler: a Decade of Wasted Cores

65

Bug 4: Missing scheduling domains
• Regression from a refactoring process

Issue: When a core is disabled and then re-enabled using the /proc interface, load balancing between any NUMA nodes is no longer performed.

Bug: The bug is due to an incorrect update of a global variable representing the number of scheduling domains (sched_domains) in the machine.

Cause: When a core is disabled, this variable is set to the number of domains inside a NUMA node. As a consequence, the main scheduling loop (line 1 of Algorithm 1) exits earlier than expected.
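A toy model of this failure mode and of the fix described on the next slide; the level names and counts are illustrative assumptions of mine, not the kernel's actual variables:

```c
/* Toy model: the balancing loop iterates over sched-domain levels up to a
 * global count. If disabling a core leaves that count at the number of levels
 * inside one NUMA node, the loop never reaches the NUMA levels, so no
 * balancing across nodes happens. */
#include <stdio.h>

static const char *domain_level[] = {
    "core pair", "NUMA node", "directly-connected nodes", "all nodes",
};
#define ALL_LEVELS 4
#define LEVELS_INSIDE_ONE_NODE 2   /* core pair + node */

static int nr_domain_levels = ALL_LEVELS;   /* the global the bug clobbers */

static void load_balance(void) {
    for (int d = 0; d < nr_domain_levels; d++)   /* "line 1 of Algorithm 1" */
        printf("  balancing at level: %s\n", domain_level[d]);
}

int main(void) {
    printf("before disabling a core:\n");
    load_balance();

    /* Buggy hotplug path: the global is rebuilt from a single node's view. */
    nr_domain_levels = LEVELS_INSIDE_ONE_NODE;
    printf("after disable/re-enable (buggy): NUMA levels are skipped\n");
    load_balance();

    /* Fix: regenerate the scheduling domains when the core is re-enabled. */
    nr_domain_levels = ALL_LEVELS;
    printf("after the fix: domains regenerated\n");
    load_balance();
    return 0;
}
```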

Page 66: The Linux Scheduler: a Decade of Wasted Cores

66

Bug 4: Actual scenario

[Figure: load-balancing attempts by Core 0 over time]
• The vertical blue lines represent the cores considered by Core 0 for each (failed) load-balancing call.
• There is one load-balancing call every 4 ms.
• We can see that Core 0 only considers its sibling core and cores on the same node for load balancing, even though the cores of Node 1 are overloaded.

Page 67: The Linux Scheduler: a Decade of Wasted Cores

67

Bug 4: Solution and Results
• Fix the regression: regenerate the scheduling domains when a core is re-enabled

Page 68: The Linux Scheduler: a Decade of Wasted Cores

68

Lessons learned and possible solutions
• Issues:
  • Performance bugs are hard to detect; these bugs lasted for years!
  • Visualisation tools are important for identifying such issues
  • Scheduling designs/assumptions must adapt to hardware changes
  • Newer scheduling algorithms/optimisations keep coming out of research

• Possible long-term solution:
  • Increased modularity of the scheduler instead of a monolithic design

• These bugs over the years led to the decade of wasted cores!

