Seminar “Parallel Computing”, Summer term 2008
Seminar paper, Parallel Computing (703525)
Optimisation: Operating System Scheduling on Multi-core Architectures
Course instructor: T. Fahringer
Author: Thomas Zangerl
Abstract

As multi-core architectures begin to emerge in every area of computing, operating system scheduling that takes the peculiarities of such architectures into account will become mandatory. Due to architectural differences from traditional multi-processors, such as shared caches, shared memory controllers and smaller cache sizes available per computational unit, it does not suffice to simply schedule tasks on multi-core processors in the same way as on SMP systems. Furthermore, current research motivates architectural changes in CPU design, such as multi-core processors with asymmetric core performance and so-called many-core architectures that integrate up to 100 cores in one package. Such architectures will exhibit fundamentally different behaviour with regard to shared-resource utilization and the performance of non-parallelizable code compared to current CPUs. It will be the responsibility of the operating system to spare the programmer as much platform-specific knowledge as possible and to optimize overall performance by employing intelligent and configurable scheduling mechanisms.
Contents

Abstract
1. Introduction
   1.1 Why use multi-core processors at all?
   1.2 What's so different about multi-core scheduling?
2. OS process scheduling: state of the art
   2.1 Linux scheduler
      2.1.1 The Completely Fair Scheduler
      2.1.2 Scheduling domains
   2.2 Windows scheduler
   2.3 Solaris scheduler
3. Ongoing research on the topic
   3.1 Cache-fairness
   3.2 Balancing core assignment
   3.3 Performance asymmetry
   3.4 Scheduling on many-core architectures
4. Conclusion
5. References
1. Introduction

1.1 Why use multi-core processors at all?

In the last few years, multi-core CPUs have become a standard component in nearly all sorts of computers: not only servers and high-end workstations, but also desktop and laptop PCs for consumers, and even game consoles nowadays usually come with CPUs with more than one core.

This development is not surprising: already in 1965, Gordon Moore predicted that the number of transistors that can be cost-effectively built onto integrated circuits was going to double every year ([1]). In 1975, Moore corrected that assumption to a period of two years; nowadays this period is frequently assumed to be 18 months.

Moore's projection has more or less been accurate up to today, and consumers have gotten used to the constant speedup of computer hardware: buyers expect a new computer to show a significant speedup compared to a two-year-old model (even though an increase in transistor density does not always lead to an equal increase in computing speed). For chip manufacturers, however, it has become increasingly difficult to keep up with Moore's law. In order to implement the exponential increase of integrated circuits, the transistor structures have to become steadily smaller. On the one hand, the extra transistors were used for the integration of more and more specialized instruction sets on CISC chips. On the other hand, smaller transistor sizes led to higher clock rates of the CPUs, because due to physical factors the gates in the transistors could perform faster state switches.

However, since electronic activity always produces heat as an unwanted by-product, the more transistors are packed together in a small CPU die area, the higher the resulting heat dissipation per unit area becomes ([2]). With the higher switching frequency, the electronic activity was performed in smaller intervals, and hence more and more heat dissipation emerged. The cooling of the processor components became a crucial factor in design considerations, and it became clear that the increasing clock frequency could no longer serve as the primary source of processor speedup.

Hence, there had to be a paradigm shift in order to still make applications run faster. On the one hand, the amazing abundance of transistors on processor chips was used to increase the cache sizes. This alone, however, would not result in an adequate performance gain, since it only helps memory-intensive applications to a certain degree. In order to effectively counteract the heat problem while making use of the small structures and the high number of transistors on a chip, the notion of multi-core processors for consumer PCs was introduced. Since CMOS technology had met its limits for a further increase of the CPU clock frequency, and the number of transistors that could be integrated on a single die allowed for it, the idea emerged that multiple processing cores could be placed on a single processor die.

In 2006, Intel released the Core microprocessor, a die package with two processor cores with their own level 1 caches and a shared level 2 cache ([3]). Also in 2006, AMD, the second major CPU manufacturer for the consumer market, released the Athlon X2, a processor with an architecture quite similar to the Core platform, but additionally featuring the concept of also sharing a CPU-integrated memory controller among the cores ([4]).
Both architectures have been improved and sold with a range of consumer desktop and laptop computers, but also servers and workstations, up to today; therefore the presence of multi-core processors in a large number of today's PCs can be assumed.

1.2 What's so different about multi-core scheduling?

One could assume that the scheduling process on such multi-core processors would not differ much from conventional scheduling: intuitively, the run-queue would just have to be replaced by n run-queues, where n is the number of cores, and processes would simply be scheduled to the currently shortest run-queue (with some additional process-priority treatment, maybe); a minimal sketch of this idea is shown below.
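To make the naïve idea concrete, here is a minimal C sketch of such shortest-queue placement (all names are invented for illustration; no real kernel works exactly this way):

```c
#define NCORES 4

/* Invented per-core run-queue; only the length matters here. */
struct runqueue {
    unsigned length;            /* number of runnable tasks */
};

static struct runqueue rq[NCORES];

/* Naive placement: always pick the currently shortest run-queue.
 * Note what is missing: no notion of where the task last ran, so
 * its L1 working set is discarded on every migration. */
static int pick_core_naive(void)
{
    int best = 0;
    for (int i = 1; i < NCORES; i++)
        if (rq[i].length < rq[best].length)
            best = i;
    return best;
}
```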
While that might seem reasonable, there are some properties of current multi-core architectures that speak strongly against such a naïve approach. First, in many multi-core architectures, each core manages its own level 1 cache (Figure 1). By naïvely rescheduling interrupted processes to a shorter queue belonging to another core (task migration), parts of the process's cached working set may be unnecessarily lost, and the overall performance may degrade. This effect becomes even worse if the underlying architecture is not a multi-core but a NUMA system, where memory access can become very costly if the process is scheduled on the wrong node.

Figure 1: Typical multi-core architecture

A second important point is that the performance of the different cores in a multi-core system may be asymmetric ([5]). This effect can emerge for several reasons:
- Design considerations. Many slow cores can be used to increase throughput in parallel computation, while a few faster cores contribute to the efficient processing of costly tasks which cannot be parallelized ([6]). Even algorithms that are parallelizable contain parts that have to be executed sequentially and that will benefit from the higher speed of the fastest core. Hence, performance asymmetry has been shown to be a very efficient approach in multi-core architectures ([7]).

- Transistor failures. Some parts of the CPU may get damaged over time and become automatically disabled by the CPU. Since such components may fail in certain cores independently of the other cores, performance asymmetry may arise even in symmetric cores over time ([5]).

- Power-saving policies. Different cores may switch to different P- or C-power-states at different times in order to save power. At different P-states, otherwise equal cores run at different clock frequencies. If an OS scheduler manages to take this into account for processes not in need of all system resources, the system can remain more energy-efficient over the execution time while giving away little or no performance ([8]).

Hence, performance asymmetry, the fact that various CPU components can be shared among cores, and non-uniform access to computation resources such as memory mandate the design of efficient multi-core scheduling mechanisms or scheduling frameworks at the operating system level.

Multi-core processors have gone mainstream, and while there may be the demand that they are used efficiently in terms of performance, the currently fashionable term "Green IT" also motivates the energy-efficient use of the CPU cores.

Section 2 will explore how far current operating systems have evolved in support of the new architectures.
2. OS process scheduling: state of the art

2.1 Linux scheduler

2.1.1 The Completely Fair Scheduler

The Linux scheduler in versions prior to 2.6.23 performed its tasks in complexity O(1) by basically just using per-CPU run-queues and priority arrays ([9]). Kernel version 2.6.23, released on October 9, 2007, introduced the so-called Completely Fair Scheduler (CFS). The change of scheduler was mainly motivated by the failure of the old scheduler to correctly predict whether applications are interactive (I/O-bound) or CPU-intensive ([10]). The new scheduler has therefore completely abandoned the notion of different kinds of processes and treats them all equally. A red-black tree is used as the data structure that orders the tasks according to their right to use the processor for a predefined interval until the next context switch. The process positioned at the leftmost node in the tree is the one most entitled to use the processor at that time. The position of a process in the tree depends only on the wait-time of the process in the run-queue (including the time the process is actually waiting for events) and on the process priority ([11]). This concept is fairly simple, but it works with all kinds of processes, especially interactive ones, since they get a boost simply by having their I/O waiting time accounted for. However, the total scheduling complexity increases to O(log n), where n is the number of processes in the run-queue, since at every context switch the process has to be reinserted into the red-black tree.

The scheduling algorithm itself has not been designed in special consideration of multi-core architectures. When Ingo Molnar, the designer of the scheduler, was asked what the implications on HT/SMP/NUMA architectures would be, he answered that there would inevitably be some effect, and that if it is negative, he will fix it. He admits that the fairness approach can result in increased cache-coldness for processes in some situations ([12]). However, the red-black trees of CFS are managed per run-queue ([13]), which assists in cooperation with the Linux load-balancer.
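To illustrate the pick-the-leftmost principle, here is a hedged C sketch (not the kernel's implementation: CFS keeps tasks in a red-black tree, while this example uses a sorted list with invented names):

```c
/* Illustrative CFS-like run-queue: the kernel uses a red-black tree
 * for O(log n) insertion; a sorted linked list gives the same
 * pick-the-leftmost semantics for a small example. */
struct task {
    long key;            /* ordering key derived from wait-time and
                          * priority; smaller = more entitled to run */
    struct task *next;
};

/* Insert a task so that the list stays sorted; the head plays the
 * role of the leftmost tree node. */
static void enqueue(struct task **rq, struct task *t)
{
    while (*rq && (*rq)->key <= t->key)
        rq = &(*rq)->next;
    t->next = *rq;
    *rq = t;
}

/* Pick the "leftmost" task; after its timeslice it is re-inserted
 * with an updated key, which is where the O(log n) cost arises. */
static struct task *pick_next(struct task **rq)
{
    struct task *t = *rq;
    if (t)
        *rq = t->next;
    return t;
}
```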
2.1.2 Scheduling domains

Linux load-balancing takes care of different cache models and computing architectures, but at the moment not necessarily of performance asymmetry. The underlying model of the Linux load balancer is the concept of scheduling domains, which was introduced in kernel version 2.6.7 due to the unsatisfying performance of Linux scheduling on SMP and NUMA systems in prior versions ([14]).

Basically, scheduling domains are hierarchical sets of computation units on which scheduling is possible; the scheduling-domain architecture is constructed based on the actual hardware resources of a computing element ([9]). Scheduling domains contain lists of scheduling groups that share common properties.
For example, the way scheduling should be done on two logical processors of an HT system and on two physical processors of an SMP system is different: the HT cores share a common cache and memory hierarchy, so task migration is not a problem if one of the logical cores becomes idle. However, in an SMP or multi-core system, in which the cache or parts of the cache are administered by each core separately, migrating tasks with a large working set may become problematic. This applies even more to NUMA machines, where different CPUs may be closer to or more remote from the memory the process is using. Therefore, all these architectures have to be treated differently.

The scheduling-domain concept introduces scheduling domains, logical unions of computing resources that share common properties and that can reasonably be treated equally, as well as CPU groups within these domains. Those groups contain the hardware-addressable computing resources that are part of the domain, among which the balancer can try to even out the domain load.

Scheduling domains are hierarchically nested: there is a top-level domain containing all other domains of the physical system Linux is running on. Depending on the actual architecture, the sub-domains represent NUMA node groups, physical CPU groups, multi-core groups or SMT groups in a respective hierarchical nesting. This structure is built automatically based on the actual topology of the system, and for reasons of efficiency each CPU keeps a copy of every domain it belongs to. For example, a logical SMT processor that at the same time is a core in a physical multi-core processor on a NUMA node with multiple (SMP) processors would administer a total of four sched_domain structures, one for each level of parallel computing it is involved in ([15]); a sketch of such a chain follows below.

Figure 2: Example hierarchy in the Linux scheduling domains
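As a rough illustration of that per-CPU chain of domains, consider the following hedged C sketch (the fields and balance intervals are invented; the kernel's struct sched_domain is far richer):

```c
#include <stdio.h>

/* Illustrative stand-in for the sched_domain hierarchy: each CPU
 * holds a chain of domains, one per topology level, each with its
 * own (invented) balancing interval. */
enum level { SD_SMT, SD_MC, SD_CPU, SD_NUMA };

struct sched_domain_sketch {
    enum level level;
    unsigned balance_interval_ms;        /* often at SMT, rarely at NUMA */
    struct sched_domain_sketch *parent;  /* next-higher topology level */
};

int main(void)
{
    struct sched_domain_sketch numa = { SD_NUMA, 1000, 0 };
    struct sched_domain_sketch cpu  = { SD_CPU,   200, &numa };
    struct sched_domain_sketch mc   = { SD_MC,     50, &cpu  };
    struct sched_domain_sketch smt  = { SD_SMT,    10, &mc   };

    /* Walk the chain that one logical CPU would keep a copy of. */
    for (struct sched_domain_sketch *d = &smt; d; d = d->parent)
        printf("level %d: balance every %u ms\n",
               d->level, d->balance_interval_ms);
    return 0;
}
```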
Load-balancing takes place at the scheduling-domain level, between the different groups. Each domain level is sensitive to the constraints set by its properties regarding load balancing. For example, load balancing happens very often between logical simultaneous-multithreading cores, but very rarely on the NUMA level, where remote memory access is costly.

The scheduling domain for multi-core processors was added with kernel version 2.6.17 ([16]) and especially considers the shared last-level cache that multi-core architectures frequently possess. Hence, on an SMP machine with two multi-core packages, two tasks will be scheduled on different packages if both packages are currently idle, in order to make use of the overall larger cache.

In recent Linux kernels, the multi-core scheduling domain also offers support for energy-efficient scheduling, which can be used if e.g. the powersave governor is set in the cpufreq tool. Saving energy can be achieved by changing the P- and C-states of the cores in the different packages. However, P-state transitions are made by adjusting the voltage and the frequency of a core, and since there is only one voltage regulator per socket on the mainboard, the P-state depends on the busiest core. So, as long as any core in a package is busy, the P-state will be relatively low, which corresponds to a high frequency and voltage. While the P-states thus remain relatively fixed, the C-states can be manipulated: adjusting the C-states means turning off parts of the registers, blocking interrupts to the processor, etc. ([17]), and can be done on each core independently. However, the shared cache features its own C-state regulator and will always stay in the lowest C-state that any of the cores has. Therefore, energy-efficiency is often limited to adjusting the C-state of a non-busy core while leaving the other C-states and the package's P-state low.

With the powersave governor turned on, Linux scheduling within the multi-core domain will attempt to schedule multiple tasks on one physical package as long as that is feasible; this way, other multi-core packages are allowed to transition into higher P- and C-states. The author of [9] claims that generally the performance impact will be relatively low and that the performance-loss/power-saving trade-off will be rewarding if the energy-efficient scheduling approach is used. A sketch of such a consolidation policy follows below.
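One possible reading of this consolidation policy in code (purely an assumption-laden sketch, not the kernel's implementation; names, structure and the capacity threshold are invented):

```c
#define NPKG   2
#define NCORES 2               /* cores per package */

struct package {
    unsigned nrunning[NCORES]; /* runnable tasks per core */
};

static unsigned pkg_load(const struct package *p)
{
    unsigned sum = 0;
    for (int i = 0; i < NCORES; i++)
        sum += p->nrunning[i];
    return sum;
}

/* Pick a package for a waking task. With powersave semantics we
 * prefer a busy-but-not-full package, so that idle packages can
 * stay in deep C-states; otherwise we spread for performance. */
static int pick_package(struct package pkg[NPKG], int powersave)
{
    int best = 0;
    for (int i = 1; i < NPKG; i++) {
        unsigned li = pkg_load(&pkg[i]), lb = pkg_load(&pkg[best]);
        if (powersave) {
            if (li > 0 && li < NCORES && (lb == 0 || lb >= NCORES))
                best = i;      /* consolidate onto the busy package */
        } else if (li < lb) {
            best = i;          /* performance: spread across packages */
        }
    }
    return best;
}
```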
2.2 Windows scheduler

In Windows, scheduling is conducted on threads. The scheduler is priority-based, with priorities ranging from 0 to 31. Timeslices are allocated to threads in a round-robin fashion; these timeslices are assigned to the highest-priority threads first, and only if no thread of a given priority is ready to run at a certain time may lower-priority threads receive the timeslice. However, if higher-priority threads become ready to run, the lower-priority threads are preempted.

In addition to the base priority of a thread, Windows dynamically changes the priorities of low-prioritized threads in order to ensure the perceived responsiveness of the operating system. For example, the thread associated with the foreground window on the Windows desktop receives a priority boost. After such a boost, the thread priority gradually decays back to the base priority ([21]).

Scheduling on SMP systems is basically the same, except that Windows keeps the notion of a thread's processor affinity and of an ideal processor for a thread. The ideal processor is, for example, the processor with the highest cache-locality for a certain thread. However, if the ideal processor is not idle at the time of lookup, the thread may just run on another processor. In [21] and other sources, no explicit information is given on scheduling mechanisms specific to multi-core architectures, though.
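The boost-and-decay behaviour described above can be pictured with a short hedged C sketch (the boost amounts here are invented; [21] documents the real values and triggers):

```c
/* Invented numbers; [21] documents Windows' actual boost values. */
struct win_thread {
    int base_priority;     /* static base, 0..31 */
    int current_priority;  /* base plus any dynamic boost */
};

/* Apply a dynamic boost, e.g. when the thread's window moves to
 * the foreground; clamp to the maximum priority. */
static void boost(struct win_thread *t, int amount)
{
    t->current_priority = t->base_priority + amount;
    if (t->current_priority > 31)
        t->current_priority = 31;
}

/* At each quantum end, a boosted priority decays by one level
 * until the base priority is reached again. */
static void quantum_end(struct win_thread *t)
{
    if (t->current_priority > t->base_priority)
        t->current_priority--;
}
```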
2.3 Solaris scheduler

In Solaris scheduler terminology, processes are called kernel- or user-mode threads, depending on the space in which they run. User threads don't exist only in user space: whenever a user thread is created, a so-called lightweight process is set up that connects the user thread to a kernel thread. These kernel threads are the objects of scheduling.

Solaris 10 offers a number of scheduling classes, which may co-exist on one system. The scheduling classes provide an adaptive model for the specific types of applications which are intended to run on top of the operating system. [18] mentions Solaris 10 scheduling classes for:

- Standard (timeshare) threads, whose priority may be adapted by the scheduler.
- Interactive applications (the currently active window in the graphical user interface).
- Fair sharing of system resources (instead of priority-based sharing).
- Fixed-priority threads, whose priority does not vary over the scheduling time.
- System (kernel) threads.
- Real-time threads, with a fixed priority and time share. Threads in this scheduling class may preempt system threads.

The scheduling classes for timeshare, fixed priority and fair sharing are not recommended for simultaneous use on a system, while other combinations of scheduling classes on a set of CPUs are possible.

The timeshare and interactive schedulers are quite similar to the old Linux scheduler (before CFS) in their attempt to identify I/O-bound processes and provide them with a priority boost. Threads have a fixed time quantum they may use once they get the context, and they receive a new priority based on whether they fully consume their time quantum and on their waiting time for the context. Fair-share scheduling uses a fixed time quantum (share) allocated to processes ([19]) as a base for scheduling. Different processes (actually collections of processes or, in Solaris 10 terminology, projects) compete for quanta on a computing resource, and their position in that competition depends on how large their assigned share value is in relation to the total number of shares on the computing resource (a small worked example follows at the end of this subsection).

Solaris explicitly deals with the scenario that parts of the processor's resources may be shared, as is likely with typical multi-core processors. There is a kernel abstraction called processor group (pg_t) that is built according to the actual system topology and represents logical CPUs that share some physical properties (like caches or a common socket). These groupings can be investigated by the dispatcher, e.g. in order to maintain logical CPU affinity for the purpose of cache-hotness where that is reasonable. Quite similarly to the concept of Linux's scheduling domains, Solaris 10 tries to achieve load balancing on multiple levels simultaneously (for example if there are physical CPUs with multiple cores and SMT) ([20]).
3. Ongoing research on the topic

Research on multi-core scheduling deals with a number of different topics, many of which are orthogonal (e.g. maximizing fairness and throughput). The purpose of this section is to present an interesting selection of different approaches to multi-core scheduling. Sections 3.1 and 3.2 summarize proposals for improving fairness and load balancing on current multi-core architectures, while sections 3.3 and 3.4 concentrate on approaches for scheduling on promising new computing architectures, such as multi-core processors with asymmetric performance and many-core CPUs.

3.1 Cache-fairness

Several studies (e.g. [22], [23]) suggest that operating system schedulers deal insufficiently with threads that allocate large parts of the shared level 2 cache and thus slow down threads running on the other core that uses the same cache. The situation is unsatisfactory for several reasons: first, it can lead to unpredictable execution times and throughput, and second, scheduling priorities may lose their effectiveness because of threads running on cores with aggressive co-runners (i.e. threads running on another core in the same package).

Figure 3 shows such a scenario: thread B uses the larger part of the shared cache and thus possibly degrades the cycles per instruction that thread A achieves during its CPU time share. L2 cache misses are more costly than L1 cache misses, because the latency to memory is larger than to the next cache level; however, it is mostly the L2 cache that is shared among different cores. The authors of [22] try to mitigate the above-mentioned effects by introducing a cache-fairness-aware scheduler for the Solaris 10 operating system.

Figure 3: Unfair cache utilization by thread B

In their scheduling algorithm, the threads on a system are grouped into a best-effort class and a cache-fair class. Best-effort threads are penalized for the sake of performance stability of cache-fair threads if necessary, but not vice versa; however, care is taken that this does not result in inadequate discrimination of best-effort threads. Fairness is enforced by allocating longer time shares to cache-fair threads that suffer from cache-intensive co-runners, at the expense of those co-runners if they are best-effort threads. Figure 4 illustrates that process.
Figure 4: Restoring fairness by adjusting timeshares

In order to compute the quantum that a thread is entitled to, the algorithm uses an analytical model to estimate a few reference values that would hold if the thread had run under fair circumstances, namely the fair L2 cache miss rate, the fair CPI rate and the fair number of instructions (during a quantum). All those values are based on the assumption of fair circumstances, so the difference to the actual values is computed, and a quantum extension is calculated which should serve as compensation; a sketch of this correction step follows below.
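The compensation step might be condensed as follows (a hedged sketch: the variable names and the simple linear correction are invented, while the analytical model in [22] is more involved):

```c
/* Illustrative quantum compensation in the spirit of [22]:
 * fair_instructions comes from an analytical model of how many
 * instructions the thread would retire per quantum under fair cache
 * sharing; actual_instructions comes from hardware counters. */
static unsigned long compensate_quantum(unsigned long base_quantum,
                                        double fair_instructions,
                                        double actual_instructions)
{
    if (actual_instructions >= fair_instructions)
        return base_quantum;           /* no compensation needed */
    /* Extend the quantum by the shortfall ratio; the extra time is
     * taken from best-effort co-runners. */
    double ratio = fair_instructions / actual_instructions;
    return (unsigned long)(base_quantum * ratio);
}
```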
Those calculations are done once for new threads of the cache-fair class: their L2 cache miss rates and other data are measured with different co-runners. Subsequently, the dispatcher periodically selects best-effort threads from which it takes parts of their quanta and assigns them as compensation to the cache-fair threads. New cache-fair threads are not explicitly scheduled with different co-runners in the analysis phase; instead, whenever new combinations of cache-fair threads with co-runners occur, analytical input is mined from the CPU's hardware counters.

The authors of [22] state that according to their measurements the penalties on best-effort threads are low, while the algorithm actually enforces priority better than standard scheduling and improves fairness. These claims are supported by experimental data gathered using the SPEC CPU2000 suite on an UltraSPARC T1 processor. The experiments measure the time it takes a thread to complete a specific benchmark while a cache-intensive second thread is executed in the same package. The execution times under these circumstances are compared with the times of threads running with co-runners with low cache requirements. This comparison shows differences of up to 37% in execution time between the two scenarios on a system with a standard scheduler; with the cache-fair scheduler, the variability decreases to 7%. At the same time, however, measurements of the execution times of threads in the best-effort scheduling class reveal a slowdown of up to 8% for some threads (while some even experience a speed-up).

3.2 Balancing core assignment
Fedorova et al. ([27]) argue that during operating system scheduling, several aspects have to be considered besides optimal performance. It is shown that scheduling tasks on the cores in an imbalanced way results in jittery performance (i.e. unpredictability of a task's completion time) and, as a consequence, insufficient priority enforcement.

It could be part of the operating system scheduler's job to ensure that jobs are evenly assigned to cores. In the approach described in [27], this is done by using a self-tuning scheduling algorithm based on a per-core benefit function. This benefit function is based on three input components:
1) The normalized core preference of a thread, which is based on the instructions per cycle that a thread j can achieve on a certain core i ($IPC_{j,i}$), normalized by $\max_k(IPC_{j,k})$ (where k is an arbitrary CPU/core).

2) The cache affinity, a value which is 1 if the thread j was scheduled on core i within a tuneable time period and 0 otherwise.

3) The average cache investment of a thread on a core, which is determined by inspecting the hardware cache-miss counter from time to time.

This benefit function can then be used to determine whether it would be feasible to migrate threads from core i to core j. For each core, a benefit value $B_{i,0}$ is computed that represents the case of no thread migration taking place. For each thread k on a core, an updated benefit value $B_{i,k}$ is computed for the hypothetical scenario that the thread is migrated away from core i; of course, this benefit will increase, since fewer threads are then executed on the core. But the thread that was taken away in the hypothetical scenario has to be migrated to another core, which influences the benefit value of the target core. Therefore, the updated benefit value of any other system core j to which the thread in question would be migrated also has to be computed; it is called $B_{j,k+}$. The hypothetical migration of thread k from core i to core j becomes reality if

$$B_{i,k} + B_{j,k+} > B_{i,0} + B_{j,0} + a \cdot D_{CAB} + b \cdot D_{RTF}.$$

$D_{CAB}$ represents a system-wide balance constraint, while $D_{RTF}$ ensures per-job response-time fairness (i.e. the slowdown that results for the thread in question from the thread migration does not exceed some maximum value); a sketch of this migration test follows below.
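In code, the migration test could be sketched like this (hedged: the benefit computation below is a toy stand-in for the three-component function of [27], and all weights and values are invented):

```c
/* Toy stand-in for the per-core benefit function of [27]: fewer
 * threads on a core means higher benefit. The real function also
 * folds in core preference, cache affinity and cache investment. */
static double benefit(double nthreads)
{
    return 1.0 / (nthreads + 1.0);
}

/* Migration test for moving one thread from core i to core j:
 * B_{i,k} + B_{j,k+} > B_{i,0} + B_{j,0} + a*D_CAB + b*D_RTF. */
static int should_migrate(double load_i, double load_j,
                          double d_cab, double d_rtf)
{
    const double a = 1.0, b = 1.0;       /* invented weights */
    double status_quo = benefit(load_i) + benefit(load_j);
    double migrated   = benefit(load_i - 1.0) + benefit(load_j + 1.0);
    return migrated > status_quo + a * d_cab + b * d_rtf;
}
```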
These two constants, together with the criteria included in the benefit function itself (most notably cache affinity), should help to ensure that the self-tuning fulfils the three goals of optimal performance, core-assignment balance and response-time fairness. However, the authors have not yet actually implemented the scheduling modification in Linux, and hence the statements on its effectiveness remain somewhat theoretical.

3.3 Performance asymmetry

It has been advocated that building multi-core chips with asymmetric performance of the different cores can have advantages for the overall processing speed of a CPU. For example, it can prove beneficial if one fast core can be used to speed up parts that can hardly be parallelized, while multiple slower cores come into play when parallel code parts are executed. By keeping the cores for parallel execution slower than the core(s) for serial execution, die area and cost can be saved ([24]) while power consumption may be reduced.

[25] closely analyzes the impact of performance asymmetry on the average speedup of an application as the number of cores increases. The paper concludes that "[a]symmetric multicore chips can offer maximum speedups that are much greater than symmetric multicore chips (and never worse)". Hence, performance asymmetry at the processor core level seems to be a promising approach for future multi-core architectures. [26] suggests that asymmetric multiprocessing CPUs do exceptionally well for moderately parallelized applications and do not scale much worse with highly parallel programs (see Figure 5).
Figure 5: Comparison of speedup with SMP and AMP for highly parallel programs (left), moderately parallel programs (middle) and highly sequential programs (right). (Image source: http://www.intel.com/technology/magazine/research/power-efficiency-0206.htm)

Apart from explicit design, performance asymmetry can also occur in initially symmetric multi-core processors, through either power-saving mechanisms (increasing the C-state) or transistor failures that lead to the disabling of parts of a core's components. Problems arise from the fact that the programmer usually assumes symmetric performance of the processor cores and designs her programs accordingly. Therefore, the operating system scheduler should support processor performance asymmetry, which is currently not the case for most schedulers; however, it would be imaginable to see this as a Linux scheduling domain in the future.

[5] describes AMPS, an approach to what a performance-asymmetry-aware scheduler for Linux could look like. Basically, the scheduler consists of three components: an asymmetry-specific load balancer, a faster-core-first scheduler, and a migration mechanism specifically for NUMA machines that will not be covered in detail here. The scheduling mechanism tries to achieve better performance, fairness (with respect to the thread priorities) and repeatability of performance results.

In order to conduct load balancing, the core performance is assessed in a first step. AMPS measures core performance at boot time using benchmarks, and sets the performance quantifier of the slowest core to 1 and that of faster cores to a number higher than 1. The scaled load of a core is then the run-queue length of the core divided by its performance quantifier. If this scaled load lies between a minimum and a maximum threshold, the system is considered load-balanced; otherwise threads are migrated. Through this scaling factor, faster cores receive more workload than slower ones.

Besides the load balancing, cores with lower scaled load are preferred in thread placement: whenever new threads are scheduled, the new scaled load that would apply if the thread were scheduled to one specific core is computed, and the thread is then scheduled to the core with the least new scaled load; in case of a tie, faster cores are preferred. Additionally, the load balancer may migrate threads even if the source core of the migration becomes idle by doing so; AMPS only migrates threads if the new scaled load of the destination core does not exceed the scaled load on the source core. This way, the benefits of the available fast cores can be fully exploited without overburdening them. A sketch of the scaled-load placement rule follows below.
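A hedged C sketch of AMPS-style scaled load and placement (invented structures; [5] specifies the real algorithm and thresholds):

```c
#define NCORES 8

struct core {
    unsigned rq_len;   /* run-queue length */
    double   perf;     /* performance quantifier: 1.0 = slowest core */
};

/* Scaled load as defined in the text: queue length over performance. */
static double scaled_load(const struct core *c)
{
    return c->rq_len / c->perf;
}

/* Place a new thread on the core with the least scaled load after
 * placement; on a tie, prefer the faster core. */
static int place_thread(struct core cores[NCORES])
{
    int best = 0;
    double best_load = (cores[0].rq_len + 1) / cores[0].perf;
    for (int i = 1; i < NCORES; i++) {
        double l = (cores[i].rq_len + 1) / cores[i].perf;
        if (l < best_load ||
            (l == best_load && cores[i].perf > cores[best].perf)) {
            best = i;
            best_load = l;
        }
    }
    return best;
}
```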
It can be expected that frequent core migration results in performance loss through cache misses (e.g. in the L1 cache). However, the experimental results in [5] reveal no excessive performance loss from the fact that task migration among cores occurs more often than with standard schedulers.

Figure 6: Speedup of the AMPS scheduler compared to the Linux scheduler on an AMP with two fast and six slow cores. (Picture taken from [5])

Instead, performance on SMP systems measurably improves (see Figure 6; the standard scheduler speed would be 1, and the median speedup is 1.16), while fairness and repeatability are preserved better than with standard Linux (there is less deviation in the execution times of different processes).

3.4 Scheduling on many-core architectures
Current research points at the potential of future CPU architectures that consist of a multitude of different cores (tens to hundreds) in order to prolong the time the industry can keep up with Moore's prediction. Such CPU architectures are going to require novel approaches to thread scheduling, since the available shared memory per core is potentially smaller, while main-memory access time increases and single-threaded performance decreases. Even more so than with current CMP chips, schedulers that treat such large-scale CMP architectures just like SMP systems are going to fail with respect to performance.

[28] identifies three major challenges in scheduling on many-core architectures: the comparatively small cache sizes, which render memory access costly; the fact that non-specialized programmers will have to write code for such chips; and the wide range of application scenarios that have to be considered. The latter two challenges result from the projected general-purpose use of future many-core architectures.

In order to deal with these challenges, an OS-independent experimental runtime environment called McRT is presented in [28].
The runtime environment was built from scratch and is independent of operating system kernels; hence the performed operations occur at user level. The connection to the underlying operating system is established using so-called host adapters, while programming environments like pthreads or OpenMP can invoke the scheduling environment via client adaptors. For programmability, McRT provides high-level transactions (instead of locks), and the heterogeneity is alleviated by giving the user a choice among different runtime policies (which influence the scheduling behaviour).

The overall target of the scheduler is to maximize resource utilization and to provide the user with flexible scheduler configurability. Basically, the scheduler is configured using three parameters P, Q and T, which respectively denote the number of (logical) processors, task queues and threading abstractions. P and Q change the scheduling behaviour from strict affinity to work stealing, while T can be used to specify different behaviour for different threads; this way, the concept of scheduler domains can be realized.

It is notable that the scheduling system does not use pre-emption; instead, threads are expected to yield at certain times. This design choice is motivated by the authors' belief that pre-emption stems from the need for time-sharing expensive and fast resources, a need which will become obsolete with many-core architectures. The runtime actively supports constructs such as barriers, which are often needed in parallel programming. The barriers are designed to avoid busy waiting: for example, a thread yields once it has reached the barrier, but it won't be re-scheduled until all other threads have reached the barrier. With a pre-emptive scheduling mechanism, the thread would receive the context from time to time just to check whether the other threads have reached the barrier; with the integrated barrier support based on the co-operative scheduling approach used in McRT, this won't happen. A sketch of such a co-operative barrier follows below.
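The co-operative barrier idea can be sketched as follows (a hedged illustration; McRT's actual API is not public in the sources used here, so all names are invented):

```c
#define NTHREADS 4

struct coop_barrier {
    int waiting;              /* threads that have arrived so far */
    int blocked[NTHREADS];    /* 1 = do not schedule this thread */
};

/* Called by thread `tid` from inside a single co-operative scheduler
 * loop. No busy waiting: a blocked thread is simply skipped by the
 * scheduler instead of repeatedly polling the barrier. */
static void barrier_arrive(struct coop_barrier *b, int tid)
{
    if (++b->waiting == NTHREADS) {
        /* last arrival: release everyone and reset the barrier */
        for (int i = 0; i < NTHREADS; i++)
            b->blocked[i] = 0;
        b->waiting = 0;
    } else {
        b->blocked[tid] = 1;  /* yield; scheduler skips us from now on */
    }
}

/* Scheduler predicate: only unblocked threads are runnable. */
static int runnable(const struct coop_barrier *b, int tid)
{
    return !b->blocked[tid];
}
```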
The client-side adaptor, e.g. the one for OpenMP, promises to directly translate many OpenMP constructs to the McRT API.

[28] also contains a large section on experimental results from benchmarks of typical desktop and scientific applications, such as the XviD MPEG-4 encoder, singular value decomposition (SVD) and neural networks (SOM). The results were gathered on a cycle-accurate many-core simulator with 32 KByte of L1 cache shared among four simultaneously multithreaded cores which form one physical core; 32 such physical cores share 2 MByte of L2 cache and 4 MByte of off-chip L3 cache. The simulator provides a cost-free MWait instruction that allows a thread to tell the processor that it is blocked and which resource it is waiting for; only when the resource becomes available will the CPU execute the thread again. Hence, threads that are waiting for barriers and locks don't consume system resources. It is important to keep in mind that such a mechanism does not exist on current physical processors when viewing the experimental speedup results for McRT.
Figure 7: Speedup of XviD encoding in the McRT framework, compared to single-core performance. (Picture from [28])

The experiments reveal that XviD encoding scales very well on McRT (Figure 7; 1080p and 768p denote different video resolutions, and the curve labelled "linear" models the ideal speedup). However, the encoding process was explicitly tweaked for the many-core scenario: under the condition that only very little fast memory exists per logical core, parallelization wasn't conducted at frame level; instead, single frames were partitioned into parts which were encoded in parallel using OpenMP.

The scalability of SVD and SOM is quite similar to that of XviD; more figures and evaluations of different scheduling strategies can be found in [28].
4. Conclusion

The probability is high that processor architectures will undergo extensive changes in order to keep up with Moore's law in the future. AMPs and many-core CPUs are just two proposals for innovative new architectures that may help in prolonging the time horizon within which Moore's law can stay valid. Operating system schedulers are going to have to adapt to the changing underlying architectures.

The scheduler domains of Linux and Solaris add some urgently needed OS flexibility, because computing architectures that exhibit different behaviours with regard to memory access or single-threaded performance can be integrated quite easily into the load-balancing hierarchy; however, it must be argued that in the future probably more will have to be done in the scheduling algorithms themselves. But scheduler domains at least provide the required flexibility at the level of threads.

Sadly, it has to be said that current Windows schedulers won't scale with the number of cores or with performance asymmetry at any rate. Basically, the Windows scheduler treats multi-core architectures like SMP systems and hence cannot give proper care to the peculiarities of such CPUs, like the shared L2 cache (and possibly the varying or simply bad single-threaded performance that is going to be a characteristic of emerging future architectures). Ty Carlson, director of technical strategy at Microsoft, even mentioned at a panel discussion that current Windows releases (including Vista) were designed to run on "1, 2, maybe 4 processors" but wouldn't scale beyond that (see http://www.news.com/8301-10784_3-9722524-7.html). He seems to be perfectly right when he says that future versions of Windows would have to be fundamentally redesigned.

Current research shows the road such a redesign could follow. The approach described in section 3.3 seems to perform quite well for multi-core processors with asymmetric performance. Its advantage is that the load-balancing algorithm can be (and was) implemented as a modification to current operating system kernels and hence can be made available quickly once asymmetric architectures gain widespread adoption. Experimental evaluation of the scheduling algorithm reveals promising results, also with regard to fairness.

Section 3.4 presents a way in which futuristic schedulers on upcoming many-core architectures could operate. The runtime environment McRT makes use of interesting techniques, and the authors of the paper manage to explain intelligibly why pre-emptive scheduling may become obsolete on many-core architectures. However, their implementation is realized in user space and burdens the programmer/user with a multitude of configuration options and programming decisions that are required in order for the framework to guarantee optimal performance. [29] introduces an easier-to-use thread scheduling mechanism based on the efforts of McRT, but experimental assessments which could testify to its performance, although planned, haven't been conducted yet. It will be interesting to keep an eye on the further development of scheduling approaches for many-core architectures, since they might gain fundamentally in importance in the future.

Achieving fairness and repeatability on today's available multi-core architectures are the major design goals of the scheduling techniques detailed in sections 3.1 and 3.2. The first approach is justified by a number of experimental results which show that priorities are actually enforced much better than with conventional schedulers; however, it remains to be seen whether the complexity of the scheduling approach and the amount of overhead potentially introduced by it justify that improvement. Maybe it would be advantageous to consider implementing such mechanisms already at the hardware level, if possible.

The algorithm mentioned in section 3.2 hasn't been implemented yet, so it is an open question whether such a rather complicated load-balancing algorithm would be feasible in practice. From the description, one can gather that it takes a lot of computation and thread migration to ensure the load balance, and it would be interesting to see the overhead that these computations and the cache misses imposed by the mechanism put on the system. Without any experimental data, those figures are hard to assess.
5. References

[1] G. E. Moore: Cramming more components onto integrated circuits, Electronics, Volume 38, Number 8, 1965.
[2] Why Parallel Processing: http://www.tc.cornell.edu/Services/Education/Topics/Parallel/Concepts/2.+Why+Parallel+Processing.htm
[3] O. Wechsler: Inside Intel Core Microarchitecture, Intel Technology Whitepaper, http://download.intel.com/technology/architecture/new_architecture_06.pdf
[4] Key Architectural Features: AMD Athlon X2 Dual-Core Processors, http://www.amd.com/us-en/Processors/ProductInformation/0,,30_118_9485_13041%5E13043,00.html
[5] Li et al.: Efficient Operating System Scheduling for Performance-Asymmetric Multi-Core Architectures, in: International Conference on High Performance Computing, Networking, Storage, and Analysis, 2007.
[6] Balakrishnan et al.: The Impact of Performance Asymmetry in Emerging Multicore Architectures, in: Proceedings of the 32nd Annual International Symposium on Computer Architecture, pages 506-517, June 2005.
[7] M. Annavaram, E. Grochowski and J. Shen: Mitigating Amdahl's Law through EPI Throttling, in: Proceedings of the 32nd Annual International Symposium on Computer Architecture, pages 298-309, June 2005.
[8] V. Pallipadi, S. B. Siddha: Processor Power Management Features and Process Scheduler: Do We Need to Tie Them Together?, in: LinuxConf Europe 2007.
[9] S. B. Siddha: Multi-core and Linux Kernel, http://oss.intel.com/pdf/mclinux.pdf
[10] http://kernelnewbies.org/Linux_2_6_23
[11] http://lwn.net/Articles/230574/
[12] J. Andrews: Linux: The Completely Fair Scheduler, http://kerneltrap.org/node/8059
[13] A. Kumar: Multiprocessing with the Completely Fair Scheduler, http://www.ibm.com/developerworks/linux/library/l-cfs/index.html
[14] Kernel 2.6.7 Changelog: http://www.kernel.org/pub/linux/kernel/v2.6/ChangeLog-2.6.7
[15] Scheduling domains: http://lwn.net/Articles/80911
[16] Kernel 2.6.17 Changelog: http://www.kernel.org/pub/linux/kernel/v2.6/ChangeLog-2.6.17
[17] T. Kidd: C-states, C-states and even more C-states, http://softwareblogs.intel.com/2008/03/27/update-c-states-c-states-and-even-more-c-states/
[18] Solaris 10 Process Scheduling: http://www.princeton.edu/~unix/Solaris/troubleshoot/schedule.html
[19] Solaris manpage FSS(7): http://docs.sun.com/app/docs/doc/816-5177/fss-7?l=de&a=view&q=FSS
[20] Eric Saxe: CMT and Solaris Performance, http://blogs.sun.com/esaxe/entry/cmt_performance_enhancements_in_solaris
[21] MSDN section on Windows scheduling: http://msdn.microsoft.com/en-us/library/ms685096%28VS.85%29.aspx
[22] A. Fedorova, M. Seltzer and M. D. Smith: Cache-Fair Thread Scheduling for Multicore Processors, Technical Report TR-17-06, Harvard University, October 2006.
[23] S. Kim, D. Chandra and Y. Solihin: Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture, in: Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 2004.
[24] D. Menasce and V. Almeida: Cost-Performance Analysis of Heterogeneity in Supercomputer Architectures, in: Proceedings of the 4th International Conference on Supercomputing, June 1990.
[25] M. D. Hill and M. R. Marty: Amdahl's Law in the Multicore Era, in: IEEE Computer, 2008.
[26] B. Crepps: Improving Multi-Core Architecture Power Efficiency through EPI Throttling and Asymmetric Multiprocessing, Intel Technology Magazine, http://www.intel.com/technology/magazine/research/power-efficiency-0206.htm
[27] A. Fedorova, D. Vengerov and D. Doucette: Operating System Scheduling on Heterogeneous Core Systems, to appear in: Proceedings of the First Workshop on Operating System Support for Heterogeneous Multicore Architectures, 2007.
[28] B. Saha et al.: Enabling Scalability and Performance in a Large Scale CMP Environment, in: Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems, 2007.
[29] M. Rajagopalan, B. T. Lewis and T. A. Anderson: Thread Scheduling for Multi-Core Platforms, in: Proceedings of the Eleventh Workshop on Hot Topics in Operating Systems, 2007.