
Power Optimized Many-Cores with User Centric Notion of Parallelism

Simon Holmbacka* Sébastien Lafond† Johan Lilius†

*Turku Centre for Computer Science - TUCS
†Department of Information Technologies, Åbo Akademi University

Joukahaisenkatu 3-5 20520, Turku, Finland

[email protected]

ABSTRACT
DVFS (voltage and frequency scaling) and DPM (sleep states) are two commonly used methods to minimize the power dissipation of modern microprocessors. While they work well inside their own domains, usually no global coordination is done to optimize the utilization of both methods in many-core systems. This paper proposes a unified power optimizer for DVFS and DPM in many-core systems. The optimizer is a model-based plug-in system which is driven by a) a power model of both DVFS and DPM and b) a user centric notion of parallelism to express application performance and scalability. We show that a unification of DVFS and DPM is possible and how the techniques should be coordinated with different levels of parallelism in the applications. We evaluate the optimizer in both hot and cold ambient temperature to show how different power saving techniques should be used under different external influences.

1. INTRODUCTION
DVFS is a power saving technique which relies on dynamically adjusting voltage and frequency levels in a microprocessor according to the workload to minimize dynamic power dissipation. This technique has been in use for decades and was an important part of the power management in the latest single core processors. Due to the advancements in multi-core systems with shrinking manufacturing technologies, the static power dissipation has become equally important as, or even more important than [7], the dynamic power dissipation because of increased leakage currents. Therefore, the less mature and more diverse sleep state-based power management strategy (DPM) is currently being introduced into multi/many-core devices to minimize the static power. DPM is used to shut down cores by placing them into a sleep state. While a core is sleeping, its power can be physically cut and the static power is significantly reduced.

Naturally, no workload can be placed on a sleeping core, which influences the mapping problem, especially for parallel applications. The question becomes whether a parallel application should be scheduled on few cores with high clock frequency or on many cores with low frequency, and which combination gives the most power/energy efficient solution. Embedded in this decision is also the extent of inherent parallelism within the application itself, i.e. how far the application scales. This information largely determines the number of usable active cores for parallel applications and is an important optimization parameter. Based on the work in [4], we acknowledge that full fairness in scheduling or full load consolidation is seldom the general solution to regulate the power optimally. Hence, a more intelligent and dynamic mechanism should be implemented to recognize the different types of applications, their scalability and external influences, and to optimize accordingly.

In related work, a three-step mechanism was used in [2], which first selected the number of active cores required, then selected an optimal clock frequency for the active cores, and finally performed task assignment. With a similar approach, the authors in [10] chose to calculate the minimum frequency and the maximum sleep time allowed in a real-time scheduler to minimize energy consumption. HyPowMan, presented in [1], used a set of policy experts to optimize according to either DVFS or DPM depending on processor state. In contrast to these works, our power optimizer performs optimization for both DVFS and DPM in a single run in order to determine the optimal combination of both methods at runtime.

We present in this paper a model-based power optimizer for many-core systems. The optimizer uses flexible input models which are programmer definable and can be exchanged even during runtime. We promote energy efficient programming by allowing the programmer to explicitly define scalability parameters in the program, and demonstrate how power and performance can be modeled as a practical non-linear optimization problem with this information. We evaluate different use cases with a many-core simulation framework capable of real-time power tracing of individual cores and QoS tracing of the applications.

2. SYSTEM IDENTIFICATION
The key issue for model-based control systems is to identify the system as a mathematical expression used for control decisions. The model should be as accurate as possible with respect to the real system, but also remain simple in order not to introduce unnecessary computational overhead. Our plug-in based optimizer has the flexibility to exchange system models during runtime. Exchangeable models are useful when external conditions affect the optimization results. For example, ambient temperature can affect the relation between static and dynamic power [4], and thus the optimization result. We demonstrate this flexibility by identifying two power models for our system: Room temperature (+20 °C) and Freezer (−20 °C). In this paper, the system identification is done for an Exynos 4412 microprocessor based on a quad-core ARM Cortex-A9.

2.1 Power model
We identified both power models of the Exynos chip by increasing the frequency and the number of active cores step-wise while fully loading the system. As workload we ran the stress benchmark under Linux on four threads during all tests, which stressed all active cores on the CPU to their maximum performance. The dissipated power was measured for each point and is shown in Figure 1 (hot case to the left and cold case to the right).

Figure 1: Power as function of #cores and frequency (fully loaded). Hot temperature to the left and cold to the right

As seen in the figures, the power dissipation of the chip peaked much higher in hot ambient temperature, especially for high clock frequencies and many cores. The frequency-to-power relation is clearly not linear, especially for the hot case. This is consistent with the relationship between dynamic power $P_d$, frequency and voltage for microprocessors, $P_d = C \cdot f \cdot V^2$. The second factor is the static power $P_s$, which is an effect of leakage currents in the transistors and is present as long as the core power source is enabled. The leakage currents increase as the temperature increases because of a lower threshold voltage in the transistors, which leads to higher total power dissipation.

We denote the control variables for DVFS and DPM as q and c respectively. The goal is to define a surface as close as possible to the data values in Figure 1. The third degree polynomial

$$P(q, c) = p_1 + p_2 q + p_3 c + p_4 q^2 + p_5 q c + p_6 q^3 + p_7 q^2 c \tag{1}$$

was used for power model identification, where $p_x$ are the coefficients defining the surface. We used the Levenberg-Marquardt algorithm [6] for multi-dimensional curve fitting to find the coefficients which minimize the error (difference) between the model and the real data. Table 1 shows the results for the hot case and the cold case, and Figure 2 illustrates the surface of the hot case with the given parameters, where DVFS and DPM utilization is given in the range [1, 8], with 1 being minimum utilization and 8 maximum. To verify our model we calculated the error between the real data and the derived model. The values presented in Table 2 show a small average error for both cases. The hot case, however, showed a higher maximum error than the cold case because its surface is more difficult to fit with a third degree polynomial.

Table 1: Coefficients for power models

        p1     p2      p3      p4       p5       p6      p7
Hot     2.34   0.058   0.598   -0.025   -0.161   0.010   0.012
Cold    2.29   0.061   0.302   -0.019   -0.057   0.006   0.004

Figure 2: Surface of the hot use case derived from Eq. 1. Dots are real data measurements
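As an illustration, the fitted surface of Eq. 1 can be evaluated directly from the coefficients in Table 1. The following is a minimal sketch; the function name `power` and the dictionary layout are illustrative choices, not part of the original optimizer.

```python
# Coefficients p1..p7 copied from Table 1.
COEFFS = {
    "hot": (2.34, 0.058, 0.598, -0.025, -0.161, 0.010, 0.012),
    "cold": (2.29, 0.061, 0.302, -0.019, -0.057, 0.006, 0.004),
}


def power(q, c, case="hot"):
    """Evaluate the third degree polynomial of Eq. 1.

    q is the DVFS utilization level and c the DPM utilization level,
    both expected in the range [1, 8] used for the fit.
    """
    p1, p2, p3, p4, p5, p6, p7 = COEFFS[case]
    return (p1 + p2 * q + p3 * c + p4 * q**2 + p5 * q * c
            + p6 * q**3 + p7 * q**2 * c)


if __name__ == "__main__":
    # Print the modeled power along the diagonal q == c for both cases.
    for case in ("hot", "cold"):
        print(case, [round(power(level, level, case), 3) for level in (1, 4, 8)])
```

Since the polynomial is only a fit to measurements inside [1, 8], values outside that range should not be trusted.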

Table 2: Differences between real data and model (hot and cold)

        Max diff   Avg. diff
Hot     10.2%      0.6%
Cold    2.4%       0.03%

Figure 3: Verification of power model with real data (circles) and model (line). Left: hot case, right: cold case

2.2 Performance model
In order to determine which power saving technique to use, the optimizer requires knowledge of how much each technique affects the applications. For example, a sequential program would not gain any performance from an increased number of cores, while a parallel application might save more energy by increasing the number of cores instead of increasing the clock frequency. Like the power model, the performance model plug-in is flexible and can be exchanged during runtime. We modeled DVFS performance as a linear function of the clock frequency q:

$$\mathrm{Perf}(App_n, q) = K_q \cdot q \tag{2}$$

since increasing the clock frequency by 2x usually increases the performance of an application by roughly the same factor.

In contrast to the rather simple relation between performance and clock frequency, modeling the performance as a function of the number of cores is more difficult, since the result depends highly on the inherent parallelism and scalability of the program. To assist the optimizer, we added the notion of expressing parallelism at compile time as an extension to our QoS language [5]. The programmer is allowed to enter the parallelism of a program in the range [0, 1], where 0 is a completely sequential program and 1 is an ideal parallel program. In case the exact number is not known, the programmer can approximate a value to help the optimization algorithm find at least a nearly optimal result.

Our example model for DPM performance uses Amdahl’s law

$$S(N) = \frac{1}{(1 - P) + \frac{P}{N}} \tag{3}$$

where P is the parallel proportion (scalability parameter) of the application and N is the number of processing units. The final performance model for DPM is rewritten as

$$\mathrm{Perf}(App_n, c) = K_c \cdot \frac{1}{(1 - P) + \frac{P}{c}} \tag{4}$$

where $K_c$ is a constant and c is the number of cores. This models a large performance gain per added core while the number of cores is low, and a diminishing gain as the number of cores increases.
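The two performance models can be sketched in a few lines to make the diminishing returns of Eq. 4 visible. This is a minimal sketch; the function names are illustrative, and the constants $K_q$ and $K_c$ are set to 1 as an assumption since the text gives no values for them.

```python
def amdahl_speedup(p, n):
    """Eq. 3: speedup on n processing units for a parallel proportion p in [0, 1]."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / ((1.0 - p) + p / n)


def perf_dvfs(q, k_q=1.0):
    """Eq. 2: performance grows linearly with the clock frequency level q.

    k_q = 1.0 is an assumed placeholder value.
    """
    return k_q * q


def perf_dpm(c, p, k_c=1.0):
    """Eq. 4: performance as a function of the number of active cores c.

    k_c = 1.0 is an assumed placeholder value.
    """
    return k_c * amdahl_speedup(p, c)


if __name__ == "__main__":
    # Same P values as the three applications used in the evaluation.
    for p in (0.1, 0.5, 0.9):
        gains = [round(perf_dpm(c, p), 2) for c in (1, 2, 4, 8, 12)]
        print(f"P={p}: {gains}")
```

For a poorly parallel program the curve flattens after very few cores, which is why additional active cores mostly add static power in that case.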

3. POWER OPTIMIZATION
With the derived models, we evaluated different optimization methods to determine which configuration gives the best actuator combination. The power model for the quad-core ARM described in Section 2.1 was used as our general reference model for constructing a many-core simulation environment. Note that the constructed mathematical model is agnostic to the number of cores and the number of frequency steps; it is the relation (pattern of the surface) between the actuators which is considered in our simulations.

We expressed the optimization problem as follows:

$$\text{Minimize } \{P(q, c)\} \quad \text{subject to:} \quad \forall n \in \text{Applications}: \; -E_n + q + c > Q_n \tag{5}$$

where $E_n$ is the difference (error value) between the QoS setpoint and the actual performance, (q, c) are the power saving techniques and $Q_n$ is a user defined lower QoS limit [5]. The optimization rule states that the power P is minimized while still providing sufficient performance to keep the QoS limit. This is achieved by regulating (q, c) to a level sufficiently high that all errors $E_n$ are eliminated for each application n.

Since P(q, c) given in Eq. 1 is clearly non-linear, we decided to use the fmincon non-linear optimization solver for the problem. Issues with such problems are, firstly, the inability to ensure a global optimum and, secondly, a high complexity with respect to the number of control variables. Fortunately, the optimization is executed frequently, and a guaranteed global optimum is not a strict requirement for each iteration as long as the solution is sufficiently close to the optimum. In this case we use merely two control variables, DVFS (q) and DPM (c), which means that the execution overhead is low.
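Because only two control variables are involved, the shape of the problem can be illustrated with an exhaustive search over discrete levels. The sketch below is not the fmincon/SQP approach used in this paper: it swaps in a brute-force grid search, and its feasibility test is a hypothetical stand-in for the QoS constraint of Eq. 5 that simply multiplies Eq. 2 and Eq. 4 with $K_q = K_c = 1$. The function names and the required performance value are assumptions made for illustration.

```python
from itertools import product

# Coefficients p1..p7 from Table 1.
HOT = (2.34, 0.058, 0.598, -0.025, -0.161, 0.010, 0.012)
COLD = (2.29, 0.061, 0.302, -0.019, -0.057, 0.006, 0.004)
LEVELS = range(1, 9)  # utilization levels 1..8 for both q and c


def power(q, c, coeffs):
    """Eq. 1: modeled power for DVFS level q and DPM level c."""
    p1, p2, p3, p4, p5, p6, p7 = coeffs
    return (p1 + p2 * q + p3 * c + p4 * q**2 + p5 * q * c
            + p6 * q**3 + p7 * q**2 * c)


def performance(q, c, parallelism):
    """HYPOTHETICAL combined model: Eq. 2 times Eq. 4 with K_q = K_c = 1."""
    return q / ((1.0 - parallelism) + parallelism / c)


def cheapest_setting(required_perf, parallelism, coeffs):
    """Return (power, q, c) with minimal modeled power, or None if infeasible."""
    feasible = [
        (power(q, c, coeffs), q, c)
        for q, c in product(LEVELS, LEVELS)
        if performance(q, c, parallelism) >= required_perf
    ]
    return min(feasible) if feasible else None


if __name__ == "__main__":
    # Compare the chosen (q, c) for different scalability and temperature models.
    for name, coeffs in (("hot", HOT), ("cold", COLD)):
        for parallelism in (0.1, 0.5, 0.9):
            print(name, parallelism, cheapest_setting(4.0, parallelism, coeffs))
```

A grid search of this kind scales poorly with more actuators or finer levels, which is one reason a gradient-based solver is attractive for the real optimizer.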

Our chosen baseline method implemented the SQP [3] solver with only the plain objective function and side constraints given in Eq. 5. The baseline was compared to the following methods:
1) SQP [3] with Gradient
2) Interior Point [9] with Gradient
3) Interior Point [9] with Gradient and Hessian

The gradient function $g = \left[ \frac{\partial f}{\partial q} \;\; \frac{\partial f}{\partial c} \right]$, where f is the objective function, approximates the search direction with a first order system and should result in a faster solution. We also provided the analytical partial derivatives of the side constraints to the solver for a more accurate solution. Interior point based methods also allow a user defined Hessian matrix, which approximates the search function with a second order system

$$H = \frac{\partial^2 f}{\partial A^2} = \begin{bmatrix} \frac{\partial^2 f}{\partial c^2} & \frac{\partial^2 f}{\partial c \, \partial q} \\ \frac{\partial^2 f}{\partial c \, \partial q} & \frac{\partial^2 f}{\partial q^2} \end{bmatrix}$$

to further increase the accuracy of the solver.

We measured the execution time for finding 70 solutions for all algorithm configurations in the Matlab environment. Table 3 shows the results. The plain algorithms used only the cost function and the side constraints given in Eq. 5. The SQP with Gradient input had the shortest execution time, and the Interior Point with Hessian input was clearly the most expensive algorithm.

Table 3: Average execution times for 70 solutions

SQP        SQP+Grad.   Int.P      Int.P+Grad.   Int.P+Hess.
16.44 ms   13.97 ms    37.08 ms   29.88 ms      287.27 ms

Secondly, we ran several applications with different performance requirements to trace the QoS and energy consumption for a 200 sec run. Table 4 shows the average values for all tests for both temperatures.

Table 4: Energy and QoS for different algorithms

          SQP    SQP,Grad.   Int.P   Int.P,Grad.   Int.P,Hess.
E [J]     658    662         675     669           656
QoS [%]   88.9   89.1        88.4    88.3          87.7

Finally, we chose SQP with gradient as our candidate for the final evaluation, since its execution time was the shortest and its energy and QoS results for the test applications were good overall.
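For a polynomial objective such as Eq. 1, the gradient and Hessian handed to a solver can be written out by hand. The sketch below derives them for the hot-case coefficients and checks the gradient against central differences. It is an illustration under the assumption that the objective f is exactly P(q, c) from Eq. 1; the function names are hypothetical and no solver is involved.

```python
# Hot-case coefficients p1..p7 from Table 1.
HOT = (2.34, 0.058, 0.598, -0.025, -0.161, 0.010, 0.012)


def power(q, c, p=HOT):
    """Eq. 1, used here as the objective function f."""
    p1, p2, p3, p4, p5, p6, p7 = p
    return (p1 + p2 * q + p3 * c + p4 * q**2 + p5 * q * c
            + p6 * q**3 + p7 * q**2 * c)


def gradient(q, c, p=HOT):
    """Analytical (df/dq, df/dc) of Eq. 1."""
    _, p2, p3, p4, p5, p6, p7 = p
    df_dq = p2 + 2 * p4 * q + p5 * c + 3 * p6 * q**2 + 2 * p7 * q * c
    df_dc = p3 + p5 * q + p7 * q**2
    return df_dq, df_dc


def hessian(q, c, p=HOT):
    """Analytical second derivatives, rows and columns ordered (c, q)."""
    _, _, _, p4, p5, p6, p7 = p
    d2f_dq2 = 2 * p4 + 6 * p6 * q + 2 * p7 * c
    d2f_dcdq = p5 + 2 * p7 * q
    d2f_dc2 = 0.0  # Eq. 1 is linear in c
    return ((d2f_dc2, d2f_dcdq), (d2f_dcdq, d2f_dq2))


def numeric_gradient(q, c, p=HOT, h=1e-6):
    """Central-difference approximation, used only as a sanity check."""
    df_dq = (power(q + h, c, p) - power(q - h, c, p)) / (2 * h)
    df_dc = (power(q, c + h, p) - power(q, c - h, p)) / (2 * h)
    return df_dq, df_dc


if __name__ == "__main__":
    print(gradient(3.0, 5.0), numeric_gradient(3.0, 5.0))
    print(hessian(3.0, 5.0))
```

Supplying such closed-form derivatives spares the solver from estimating them numerically, which is consistent with the shorter execution times of the gradient variants in Table 3.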

4. EVALUATION
The evaluation platform was the simulator mentioned earlier, configured with 12 cores. At time=7k samples, the platform was set to switch from a hot to a cold environment and the power model was then exchanged. Three applications were simulated: A1) parallel application (P=0.9), A2) partially parallel application (P=0.5) and A3) poorly parallel application (P=0.1). The performance and QoS setpoint was constant for each application, which means that all applications should carry out the same amount of work.

Figure 4: Actuator usage as function of time for different types of applications

Figure 4 shows the actuation decisions made by the underlying optimizer for each simulated application. The upper part shows A1, in which a larger part of DVFS is used for the first 7k samples, after which the actuators are inverted due to the model exchange. The hot part of A2 uses DVFS to an even larger extent since the scalability of A2 is much lower than that of A1. The almost completely sequential application A3 naturally uses DVFS to the largest extent during the whole simulation, being almost unable to use DPM at all.

For the sake of illustration, we show in Figure 5 the power dissipation response of A3 (left) and A1 (right) for the same simulation. A3 initially uses all resources during a short initialization phase, after which it enables only a few cores running at high frequency. As the temperature drops, the power dissipation decreases even though the clock frequency remains the same. A1, on the other hand, uses more cores because of its high parallelism, and dissipates less total power. As the chip is placed in the cool environment, DPM usage for A1 is increased even more.

Figure 5: Power dissipation response for A3 (left) and A1 (right). Temperature condition changed at time=7k samples

We then evaluated the optimizer with a simulated MPEG decoder in a 200 sec run. The first use case used a high definition image output and the second used a low definition image output. For each case, we simulated MPEG decoders with different levels of parallelism P. The energy consumption of both cases was compared to the actuation policy in the default Linux Completely Fair Scheduler (CFS) [8], in which: 1) applications are always scheduled on all cores (as far as the application scales), 2) cores with no tasks are activated but idle, and 3) DVFS utilization scales linearly according to the workload.

Figure 6 shows non-linear energy curves, and Table 5 shows the QoS for both cases compared to the CFS policy. In the High image quality case, the energy consumption is highest around P = 0.8 and decreases in both directions. This is because a lower scalability prohibits the system from using a sufficient number of cores, which results in poor QoS as seen in Table 5. On the other end, increasing the scalability allows the system to activate more cores and hence reduce the clock frequency. The static power $P_s$ added by activating cores is significantly less than the dynamic power $P_d$ saved when decreasing the clock frequency. In the Low case, the resource requirements are much lower and a decreased clock frequency can take place already at P = 0.7, as seen in Figure 6. In all cases the energy consumption reaches a plateau at certain points. At such a point, further improving the scalability of the application is hardly worthwhile from an energy point of view, since mapping the application onto more cores will only result in a static power increase roughly proportional to the related dynamic power decrease.

Figure 6: Energy consumption for High and Low performance compared with the standard Linux CFS policy (Rings are data)

The optimized case showed overall lower energy consumption than the default CFS case, which is due to two reasons: 1) low scalability forces mapping onto only a few cores, and in the CFS case no cores can be shut down, so they dissipate waste power while idling; 2) with very high scalability, the applications in the CFS case are scheduled on too many cores, which leads to an increase in $P_s$ that is larger than the total $P_d$ savings.

Table 5: QoS (in %) for High and Low use case compared with the standard Linux CFS policy. P is the scalability parameter.

P           0.94   0.93   0.92   0.91   0.9    0.85   0.8    0.7    0.5
QoS H       90.1   96.7   97.8   97.8   99.4   97.3   92.7   34.8   6.7
QoS-CFS H   89.7   96.7   96.2   99.1   99.1   97.6   92.9   20.3   5.0
QoS L       95.5   96.5   95.8   92.4   92.3   92.6   93.9   95.9   73.7
QoS-CFS L   92.4   97.5   98.7   95.1   93.4   88.2   90.6   96.1   77.5

5. CONCLUSIONS
We have presented a model-based approach to optimize DPM and DVFS utilization in many-core systems. Optimization decisions are made based on a power model for the underlying platform and its power saving features, and a performance model which allows a user centric notion of parallelism to steer the optimization decisions in the desired direction. We demonstrated how energy can be saved by using the optimal combination of DPM and DVFS to maximize power proportionality, both by shutting down unused resources and by preventing unnecessary parallelism in multi-threaded programs, which leads to static power waste.

6. REFERENCES
[1] K. Bhatti, C. Belleudy, and M. Auguin. Power management in real time embedded systems through online and adaptive interplay of DPM and DVFS policies. In EUC, 2010 IEEE/IFIP 8th International Conference on, pages 184–191, 2010.

[2] M. Ghasemazar, E. Pakbaznia, and M. Pedram. Minimizing energy consumption of a chip multiprocessor through simultaneous core consolidation and DVFS. In Circuits and Systems (ISCAS), Proceedings of 2010 IEEE International Symposium on, pages 49–52, 2010.

[3] P. E. Gill, W. Murray, and M. A. Saunders. SNOPT: An SQP algorithm for large-scale constrained optimization. SIAM Journal on Optimization, 12:979–1006, 1997.

[4] F. Hallis, S. Holmbacka, W. Lund, R. Slotte, S. Lafond, and J. Lilius. Thermal influence on the energy efficiency of workload consolidation in many-core architectures. In The 24th Tyrrhenian International Workshop on Digital Communications, 2013.

[5] S. Holmbacka, D. Agren, S. Lafond, and J. Lilius. QoS manager for energy efficient many-core operating systems. In Proceedings of the 21st PDP Conference. IEEE Computer Society, 2013.

[6] K. Iondry. Iterative Methods for Optimization. Society for Industrial and Applied Mathematics, 1999.

[7] R. Jejurikar, C. Pereira, and R. Gupta. Leakage aware dynamic voltage scaling for real-time embedded systems. In Proceedings of the 41st annual Design Automation Conference, DAC '04, pages 275–280, New York, NY, USA, 2004. ACM.

[8] M. T. Jones. Inside the Linux scheduler. Jun 2006.

[9] N. Karmarkar. A new polynomial-time algorithm for linear programming. In Proceedings of the 16th ACM symposium on Theory of computing, STOC '84. ACM, 1984.

[10] M. Marinoni, M. Bambagini, and Prosperi. Platform-aware bandwidth-oriented energy management algorithm for real-time embedded systems. In ETFA, 2011 IEEE 16th Conference on, pages 1–8, 2011.
