Scheduling on Heterogeneous Multicore Processors Using Architectural Signatures

SYNAR Systems Networking and Architecture Group


Daniel Shelepov and Alexandra Fedorova

School of Computing Science, Simon Fraser University, Vancouver, Canada

Architectural Signatures in a Nutshell

Task: to schedule jobs appropriately given a variety of different cores available

Caveats:
• Scheduler doesn’t know job behaviour a priori
• Scalability: hundreds of cores potentially available

Our approach:
• Analyze job performance offline
• Describe findings in a job’s architectural signature
• Scheduler uses signatures to make intelligent core assignment decisions


Talk Outline

• Background
• Methodology
• Results
• Summary and Future Work


Background: Heterogeneous CPUs

Heterogeneous CPUs = several types of cores:

• Simple vs. complex: cache size, issue width, presence of advanced features, power consumption
• Specialized (possibly). Example: many FPUs

• Expose a common ISA
• May contain 100s or 1000s of cores (“manycore”)
• Bottom line: better efficiency = saved power

[Figure: now, homogeneous multicore CPUs; future, heterogeneous multi- and manycore CPUs. Core types shown: complex, simple, specialized.]


Background: Heterogeneous Scheduling

Scheduler needs to be aware of:

• underlying core features
• job performance on various cores

Otherwise, no informed scheduling decision can be made => no benefit from heterogeneity



Architectural Signature Approach

A signature is provided along with the job binary.

Signatures:
• are constructed offline
• are μarch.-independent
• provide guidance for selecting appropriate cores





Constructing Signatures

OFFLINE PROFILING: Collect microarchitecture-independent profiling data. Examples: instruction mix, memory access patterns

OFFLINE ANALYSIS: Generate performance-predicting metrics that a scheduler is able to use. Examples: optimal cache size, inherent ILP, clock speed sensitivity

PREDICTION MODEL: Create a model for generating meaningful performance-predicting metrics from collected profiling data

SCHEDULING: Interpret performance-predicting metrics and schedule


Case Study: Clock Speed Sensitivity

Frequency changes affect different jobs differently.

Clock speed sensitivity is the means to capture these differences.

[Figure: Completion time at different clock speeds. Normalized completion time (y-axis, 0 to 1.75) versus core frequency (x-axis: 3GHz, 2.67GHz, 2.33GHz, 2GHz) for swim and eon.]


Offline Profiling

• We use MICA, a custom toolkit for Pin by Hoste and Eeckhout [2] (http://trappist.elis.ugent.be/~kehoste/MICA/).
• MICA gathers a variety of μarch.-independent metrics.
• For clock speed sensitivity, we want reuse distance data.


Offline Analysis

• Reuse distances are used to estimate abstract L2 cache miss rates.
• L2 cache miss rates are used to estimate clock speed elasticity, a metric that puts a number on sensitivity. This requires a prediction model for elasticity as a function of cache miss rate (see next slide).
• Elasticity values are placed into the architectural signature.
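
As a rough illustration of the first step, the sketch below estimates a miss rate from a reuse distance histogram by assuming a fully associative LRU cache, where an access misses when its reuse distance is at least the cache capacity in blocks. The histogram layout, the function name `misses_per_kilo_inst`, and the treatment of cold accesses are assumptions made for this example; they are not MICA's actual output format or necessarily the estimator used in this work.

```python
import math

def misses_per_kilo_inst(reuse_hist, cache_blocks, instructions):
    """Estimate misses per 1000 instructions for an idealized LRU cache.

    reuse_hist maps a reuse distance (number of distinct blocks touched
    between two accesses to the same block) to an access count. The key
    math.inf stands for first-time (cold) accesses. Hypothetical layout.
    """
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    # Under fully associative LRU, an access hits only if fewer than
    # cache_blocks distinct blocks were touched since the previous use.
    misses = sum(count for dist, count in reuse_hist.items()
                 if dist >= cache_blocks)
    return 1000.0 * misses / instructions

# Made-up histogram, for illustration only.
example_hist = {0: 500, 8: 300, 64: 150, 4096: 40, math.inf: 10}
```

Calling `misses_per_kilo_inst(example_hist, cache_blocks, instructions)` with several values of `cache_blocks` shows how the same profile yields different miss rates for different cache sizes.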


Prediction Model

• The graph maps SPEC CPU benchmarks by estimated L2 miss rate and clock speed elasticity.
• We build a linear model and then use it to predict elasticity during offline analysis.
• Constructed once, it can be used for all future analysis, unless a better model is proposed.

[Figure: clock frequency elasticity (y-axis, -1.1 to -0.1, annotated "more sensitive" and "less sensitive") versus L2 miss rate per 1000 instructions (x-axis, 0 to 20).]
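
The linear model relating estimated L2 miss rate to clock speed elasticity can be fitted with ordinary least squares. The sketch below is illustrative only: the training points are made-up placeholders rather than the measured SPEC CPU values behind the graph, and the names `fit_line` and `predict_elasticity` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def predict_elasticity(model, miss_rate):
    """Apply the fitted model to an estimated L2 miss rate."""
    slope, intercept = model
    return slope * miss_rate + intercept

# Placeholder training data: (L2 misses per 1000 inst., elasticity).
miss_rates = [0.0, 5.0, 10.0, 20.0]
elasticities = [-1.0, -0.8, -0.6, -0.2]
model = fit_line(miss_rates, elasticities)
```

Once fitted, `predict_elasticity(model, miss_rate)` turns an estimated miss rate into the elasticity value that goes into a signature.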


Scheduling

• Recall: the architectural signature contains elasticity values
• Elasticity is straightforward to interpret
• Using elasticity, the scheduler categorizes jobs as highly sensitive, moderately sensitive or insensitive
• Finally, we’re ready to schedule
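
One way the categorization and core assignment could look is sketched below. The thresholds, the function names, and the greedy policy (the most sensitive jobs go to the fastest cores) are assumptions for illustration; the slides do not specify them.

```python
def sensitivity_category(elasticity, high=-0.7, moderate=-0.4):
    """Map an elasticity value to a coarse category.

    The thresholds are hypothetical; elasticity near -1.0 means completion
    time scales almost inversely with clock speed.
    """
    if elasticity <= high:
        return "highly sensitive"
    if elasticity <= moderate:
        return "moderately sensitive"
    return "insensitive"

def assign_jobs(job_elasticity, core_ghz):
    """Greedy one-job-per-core assignment.

    job_elasticity: job name -> elasticity from its signature.
    core_ghz: core id -> clock speed in GHz.
    Returns job name -> core id.
    """
    if len(job_elasticity) > len(core_ghz):
        raise ValueError("more jobs than cores")
    # Most sensitive (most negative elasticity) jobs first.
    jobs = sorted(job_elasticity, key=job_elasticity.get)
    # Fastest cores first.
    cores = sorted(core_ghz, key=core_ghz.get, reverse=True)
    return dict(zip(jobs, cores))

# Two hypothetical jobs on one slow and one fast core.
assignment = assign_jobs({"job_a": -0.95, "job_b": -0.3}, {0: 2.0, 1: 3.0})
```

Here `job_a`, the more sensitive job, lands on the 3GHz core and `job_b` on the 2GHz core.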


Clock Speed Sensitivity Data Flow

MICA reuse distance data

abstract L2 cache miss rates

clock speed elasticity values

clock speed sensitivity category




Evaluating Clock Speed Sensitive Scheduling

Completion times with our clock speed aware prototype normalized to completion times with the default Linux 2.6.18 scheduler

[Figure: three bar charts of relative change in completion time (y-axis, 0.75 to 1.2), each ending with the geometric mean.
Benchmark sets shown: (wupwise, mgrid, apsi, facerec); (gcc, gap, eon, fma3d, mcf, equake, wupwise, lucas); (eon, crafty, mcf, equake).
Panel captions:
• Highly heterogeneous workload. Two 2GHz cores, two 3GHz cores.
• Balanced workload. One each of 2GHz, 2.33GHz, 2.67GHz and 3GHz cores.
• Uniform workload. Two 2GHz cores, two 3GHz cores.]
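
The normalization and the geometric mean used in these charts follow standard definitions, sketched below. The function names are hypothetical, and any timing values passed in are supplied by the reader; none are taken from the experiments.

```python
import math

def normalized_times(prototype_secs, baseline_secs):
    """Per-benchmark completion time relative to the baseline scheduler.

    Values below 1.0 mean the prototype finished faster than the baseline.
    """
    return {name: prototype_secs[name] / baseline_secs[name]
            for name in baseline_secs}

def geometric_mean(values):
    """Geometric mean of positive ratios, computed in log space."""
    values = list(values)
    if not values or any(v <= 0 for v in values):
        raise ValueError("need a non-empty list of positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))
```

The geometric mean is the usual choice for averaging ratios, because it treats a halving and a doubling of completion time as cancelling out.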




Summary

• A framework for developing microarchitecture-independent architectural signatures to assist heterogeneity-aware scheduling
• Proof of concept: clock speed aware scheduling
• Results: tangible benefits even on mildly heterogeneous platforms (up to 4% average throughput increase on a multicore system with 2GHz and 3GHz cores)


Future Work

• Extend our framework to include other core characteristics (cache size, issue width, ...)
• Develop and analyze a heterogeneity-aware scheduler in a real operating system (Sun Solaris)
• Compare that scheduler with other heterogeneity-aware schedulers


References

[1] M. Becchi and P. Crowley. Dynamic Thread Assignment on Heterogeneous Multiprocessor Architectures. In Proceedings of the Conference on Computing Frontiers, 2006.
[2] K. Hoste and L. Eeckhout. Microarchitecture-Independent Workload Characterization. IEEE Micro Hot Tutorials, 27(3):63-72, 2007.
[3] R. Kumar, D. M. Tullsen, P. Ranganathan, N. Jouppi, and K. Farkas. Single-ISA Heterogeneous Multicore Architectures for Multithreaded Workload Performance. In Proceedings of the 31st Annual International Symposium on Computer Architecture, 2004.


Appendix A: Existing Approaches

• Algorithms by Becchi [1] and Kumar [3]
• These rely on performance monitoring to determine optimal assignment.
• Potential drawbacks: don’t scale well to many types of cores; limited applicability to short-lived threads



Appendix B: Input Sets and Performance

• Varying input sets can drastically affect performance (e.g. ref vs. test input in SPEC CPU2000)
• One architectural signature can cover at most one input
• Difficult problem that we are not currently tackling
• There are smart ways to create parameterized approximations that account for data input size:
Y. Zhong, S. G. Dropsho and C. Ding. Miss rate prediction across all program inputs. In Proceedings of Parallel Architectures and Compilation Techniques, 2003.


Appendix C: Elasticity

• We need two measurements of completion time at two different frequencies
• Then we calculate clock speed elasticity of completion time as follows (E = elasticity, T = completion time, F = clock speed):

$$E_{T,F} = \frac{T_2 - T_1}{F_2 - F_1} \cdot \frac{F_1 + F_2}{T_1 + T_2}$$

• The larger the magnitude, the more sensitive the completion time is to clock speed
• In this case, -1.0 is considered very elastic (sensitive), because it means that an increase in frequency by a factor of X will decrease the completion time by the same factor.
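
The elasticity calculation in this appendix can be sketched as follows. The code assumes the midpoint (arc) form of elasticity, which is consistent with the statement that -1.0 corresponds to completion time scaling inversely with frequency; the function name and argument order are hypothetical.

```python
def clock_speed_elasticity(t1, f1, t2, f2):
    """Arc (midpoint) elasticity of completion time w.r.t. clock speed.

    t1, t2: completion times measured at frequencies f1, f2.
    Assumes the midpoint form; units cancel, so any consistent units work.
    """
    if f1 == f2:
        raise ValueError("the two frequencies must differ")
    relative_time_change = (t2 - t1) / (t1 + t2)
    relative_freq_change = (f2 - f1) / (f1 + f2)
    return relative_time_change / relative_freq_change
```

For a job whose completion time is inversely proportional to frequency (for instance 30 time units at 2GHz and 20 at 3GHz), the result is -1.0; for a job whose completion time does not change with frequency, it is 0.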


Appendix D: Different Cache Sizes

• L2 miss rates (and elasticity) depend heavily on cache size => it has to be taken into account
• Solution: calculate miss rates and elasticity for common cache configurations; the scheduler picks the appropriate one
• Reasonable approach, because cache size aware scheduling takes precedence over clock speed aware scheduling
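
A signature that carries one elasticity value per cache configuration might be laid out as in the sketch below. The class name, the field names, and the nearest-size fallback are assumptions for illustration, not the signature format used in this work.

```python
from dataclasses import dataclass, field

@dataclass
class ArchitecturalSignature:
    """Hypothetical signature layout: elasticity per L2 cache size (KB)."""
    job: str
    elasticity_by_l2_kb: dict = field(default_factory=dict)

    def elasticity_for(self, core_l2_kb):
        """Return the elasticity computed for the closest listed cache size."""
        if not self.elasticity_by_l2_kb:
            raise ValueError("signature holds no elasticity values")
        closest = min(self.elasticity_by_l2_kb,
                      key=lambda size: abs(size - core_l2_kb))
        return self.elasticity_by_l2_kb[closest]

# Made-up values for a hypothetical job.
sig = ArchitecturalSignature("job_a", {512: -0.4, 1024: -0.6, 4096: -0.9})
```

The scheduler would call `sig.elasticity_for(core_l2_kb)` with the L2 size of the core it is considering, so the same signature serves cores with different caches.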
