www.eubra-bigsea.eu| [email protected] |@bigsea_eubr 0
D3.5 EUBra-BIGSEA QoS infrastructure services final version
Author(s)
Jussara Almeida (UFMG), Danilo Ardagna (POLIMI), Igor Ataíde
(UFCG), Enrico Barbierato (POLIMI), Ignacio Blanquer (UPV),
Sergio López (UPV), Túlio Braga (UFMG), Andrey Brito (UFCG),
Ana Paula Couto (UFMG), Athanasia Evangelinou (POLIMI), Iury
Ferreira (UFCG), Armstrong Goes (UFCG), Marco Gribaudo
(POLIMI), Raffaela Mirandola (POLIMI), Fábio Morais (UFCG)
Status Final
Version v1.1
Date 04/10/2017
Dissemination Level
X PU: Public
PP: Restricted to other programme participants (including the Commission)
RE: Restricted to a group specified by the consortium (including the Commission)
CO: Confidential, only for members of the consortium (including the Commission)
Abstract: Europe - Brazil Collaboration of BIG Data Scientific Research through Cloud-Centric Applications (EUBra-BIGSEA) is a medium-scale research project funded by the European Commission under the Cooperation Programme, and the Ministry of Science and Technology (MCT) of Brazil in the frame of the third European-Brazilian coordinated call. The document has been produced with the co-funding of the European Commission and the MCT.
This document describes the final version of the infrastructure services that enable the EUBra-BIGSEA ecosystem. These services include: (i) tools for predicting big data application performance, (ii) mechanisms for horizontal and vertical elasticity, and (iii) approaches based on pro-activity and on optimization to perform horizontal and vertical scaling of running big data applications, as well as load balancing of the infrastructure.
Document identifier: EUBra-BIGSEA-WP3-D3.5
Deliverable lead POLIMI
Related work package WP3
Author(s) Danilo Ardagna (POLIMI), Ignacio Blanquer (UPV), Andrey Brito (UFCG)
Contributor(s) Jussara Almeida (UFMG), Igor Ataíde (UFCG), Enrico Barbierato (POLIMI), Ignacio
Blanquer (UPV), Túlio Braga (UFMG), Ana Paula Couto (UFMG), Athanasia
Evangelinou (POLIMI), Iury Ferreira (UFCG), Armstrong Goes (UFCG), Marco
Gribaudo (POLIMI), Raffaela Mirandola (POLIMI), Fábio Morais (UFCG)
Due date 30/09/2017
Actual submission date
Reviewed by Sandro Fiore (CMCC) and Dorgival Guedes (UFMG)
Approved by PMB
Start date of Project 01/01/2016
Duration 24 months
Keywords Quality of Service, Cloud services, Monitoring
EUBra-BIGSEA is funded by the European Commission under the
Cooperation Programme, Horizon 2020 grant agreement No 690116.
This project is the result of the 3rd BR-EU Coordinated Call in Information Technologies.
Versioning and contribution history
Version | Date | Authors | Notes
0.1 | 12/07/2017 | Andrey Brito (UFCG), Danilo Ardagna (POLIMI), Ignacio Blanquer (UPV) | TOC, initial version
0.2 | 29/08/2017 | Athanasia Evangelinou (POLIMI) | Initial version of Sections 2 and 4
0.3 | 29/08/2017 | Athanasia Evangelinou (POLIMI), Enrico Barbierato (POLIMI) | Initial version of Section 11; added Appendixes A and B
0.4 | 11/09/2017 | Danilo Ardagna (POLIMI), Athanasia Evangelinou (POLIMI), Raffaela Mirandola (POLIMI), Marco Gribaudo (POLIMI), Túlio Braga (UFMG), Jussara Almeida (UFMG) | Added Sections 3 and 10; Section 11 completed
0.5 | 13/09/2017 | Danilo Ardagna (POLIMI), Enrico Barbierato (POLIMI) | Initial version of Section 12
0.6 | 14/09/2017 | Danilo Ardagna (POLIMI), Athanasia Evangelinou (POLIMI), Túlio Braga (UFMG), Ignacio Blanquer (UPV) | Added Section 8; Appendix B completed
0.7 | 16/09/2017 | Danilo Ardagna (POLIMI), Andrey Brito (UFCG), Ignacio Blanquer (UPV) | Started consolidation; Section 12 completed
0.8 | 17/09/2017 | Athanasia Evangelinou (POLIMI), Enrico Barbierato (POLIMI), Igor Ataíde (UFCG), Fábio Morais (UFCG), Armstrong Goes (UFCG), Iury Ferreira (UFCG) | Section 10 completed; added Sections 5, 6, 7, 9 and Appendixes C, D, E, F, G
0.9 | 26/09/2017 | Danilo Ardagna (POLIMI), Andrey Brito (UFCG), Ignacio Blanquer (UPV) | Added Sections 1 and 13; finalised consolidation
1.0 | 27/09/2017 | Danilo Ardagna (POLIMI), Andrey Brito (UFCG), Ignacio Blanquer (UPV) | Deliverable sent to reviewers
1.1 | 04/10/2017 | Danilo Ardagna (POLIMI), Andrey Brito (UFCG), Ignacio Blanquer (UPV) | Review comments addressed; final version
Copyright notice: This work is licensed under the Creative Commons CC-BY 4.0 license. To view a copy of this license, visit
https://creativecommons.org/licenses/by/4.0.
Disclaimer: The content of the document herein is the sole responsibility of the publishers and it does not necessarily represent
the views expressed by the European Commission or its services.
While the information contained in the document is believed to be accurate, the author(s) or any other participant in the
EUBra-BIGSEA Consortium make no warranty of any kind with regard to this material including, but not limited to the implied
warranties of merchantability and fitness for a particular purpose.
Neither the EUBra-BIGSEA Consortium nor any of its members, their officers, employees or agents shall be responsible or liable
in negligence or otherwise howsoever in respect of any inaccuracy or omission herein.
Without derogating from the generality of the foregoing neither the EUBra-BIGSEA Consortium nor any of its members, their
officers, employees or agents shall be liable for any direct or indirect or consequential loss or damage caused by or arising from
any information advice or inaccuracy or omission herein.
Table of Contents
Table of Contents 5
1. Introduction 12
1.1. Scope of the Document 12
1.2. Target Audience 12
1.3. Structure 12
2. WP3 Architecture summary 13
3. OPT_JR implementation 16
3.1. Problem definition 16
3.2. Algorithms 19
4. Hybrid Machine Learning Module Implementation 24
4.1. Configuration and Deployment 24
4.2. Configuration and dataset preparation 25
4.3. Description of scripts and functions 26
4.3.1. Extrapolation on few and many cores and Interpolation capabilities 30
4.4. Package Information and Code Structure 31
5. Pro-active policies implementation 33
5.1. Vertical Scaling Pro-active Policies 33
5.1.1. Contextualization 33
5.2. Pro-active Policies for OpenNebula 34
5.2.1. Spark-Mesos Plugin 34
5.2.2. OpenNebula Actuator 36
5.3. OpenStack Opportunistic Plugin 41
5.3.1. Opportunistic Service 41
5.3.2. Opportunism and Load Balancer 42
6. Load balancing implementation 43
6.1. Architecture 43
6.2. Heuristics 44
6.2.1. Balance Instances 44
6.2.2. CPU-Cap Aware 45
6.2.3. Sysbench Performance+CPU Cap 47
6.3. Integration with Opportunistic Services 48
7. Pro-active policies validation 50
7.1. Vertical Scaling Pro-active Policies 50
7.1.1. Infrastructure 50
7.1.2. Experiment design 52
Controller configurations 52
Application deadline 53
Metrics collection method 53
7.1.3. Analysis 53
7.2. Vertical Scaling Pro-active Policies 57
7.2.1. Infrastructure and experiment design 57
Actuation configurations 58
7.2.2. Analysis 58
8. Frameworks experiments validation 61
8.1. Use Cases 61
8.2. System Architecture 61
8.2.1. Job submit and execution 63
8.2.2. Monitoring the application 63
8.2.3. Decision making 64
8.2.4. Elasticity 65
8.3. Results of the experiments 65
8.3.1. Chronos Application 65
8.3.2. Marathon Application 67
9. Load balancing validation 70
9.1. Infrastructure 70
9.2. Experiment Design 70
9.3. Analysis 71
10. Performance Prediction service validation 85
10.1. Experimental Settings 85
10.2. dagSim Validation 87
10.3. Lundstrom Validation 91
11. HML validation 100
11.1. Experiments setting 100
12. Optimization service validation 106
12.1. OPT_IC validation on BULMA application 106
12.2. OPT_JR validation 108
13. Conclusions 127
GLOSSARY 128
Annex 130
Appendix A - Hybrid Machine learning module usage manual 130
A.1 Deployment and Installation instructions 130
A.2 User Manual 131
Appendix B - Performance Prediction and Optimization service enhancements implemented in the new release 136
Appendix B.1 Performance prediction and Optimization (OPT_IC) service new release 136
WS_2P Enhancement: Prediction of the residual execution time of an application 136
Appendix B.2 WS_OPT_IC Enhancement: Revise the optimal number of cores and VMs while an application is running 140
Appendix B.3 Performance and optimization service tutorial 144
Technical tutorial 145
Database Creation 145
Creation Database Container (Bash Script) 145
Creation Docker Container Services 148
Useful Commands 150
Operational guide 150
Configuration file 150
Data Log files 152
Spark log parser 153
dagSim 156
OPT_IC 157
OPT_JR 160
Appendix C - Controller Service Guide 163
Installation and configuration 163
Creating a new controller plugin 163
Creating a new actuator plugin 166
Appendix D - Load Balancer Guide 168
Installation 168
Configuration 169
Heuristics Configuration 171
BalanceInstancesOS heuristic configuration 172
CPUCapAware heuristic configuration 172
SysbenchPerfCPUCap heuristic configuration 173
Optimizer Configuration 173
Execution 174
Creating a new Heuristic 175
Creating a new Optimizer Plugin 175
Appendix E - Monitoring-Hosts Daemon 177
Installation 177
Configuration 178
Execution 179
Appendix F - EUBra-BIGSEA Applications Submission Guide 181
Appendix G - Load Balancer Results 187
Scenario 2 - Initial Cap 25% 187
Scenario 3 - Initial Cap 50% 196
Executive Summary
This document describes the final release of the infrastructure services of the EUBra-BIGSEA platform.
This deliverable puts together results from previous deliverables, i.e., D3.1 QoS Monitoring System
Architecture, D3.2 Big Data Application Performance models and run-time Optimization Policies, D3.3
EUBra-BIGSEA QoS infrastructure services initial version. Moreover, this document is an update of the
deliverable D3.4 EUBra-BIGSEA QoS infrastructure services intermediate version. For the sake of the
reader’s convenience, the context of the tools developed within the workpackage and their interrelationship are summarized and updated, while the overlap between the two documents has been kept to a minimum.
In its final version, the EUBra-BIGSEA platform enables users to provide data processing applications that
will be profiled and modelled in a pre-production phase and, then, be deployed for either asynchronous
execution (e.g., driven by external triggers such as the availability of new data) or for periodic execution.
In both cases, the platform will be able to estimate the initial amount of resources needed to run the
application within the specified deadlines and will be able to trigger adaptation of the running
infrastructure to match the resources needed in order to satisfy the deadlines.
This document targets developers both internal and external to the project. Internally, the document
will help partners from connected workpackages to fully use the infrastructure services and leverage
them to provide higher-level abstractions. Externally, the open-source tools available in the EUBra-
BIGSEA Github repositories enable developers outside of the consortium to configure their own clusters
and execute applications. In addition, this document guides developers in customization activities such
as the usage of the service platform, including pro-active policies and optimization services, with cloud
infrastructures different to the ones considered by the EUBra-BIGSEA partners.
Finally, the main advances in this released version of the platform can be summarized as follows: (i) new
release of the hybrid machine learning module, performance prediction service and optimization
services, (ii) pro-active policies module enhancement, (iii) final implementation of the load balancing
module, (iv) extended validation of all workpackage components, including vertical elasticity, (v) initial
validation of the contextualization service, monitoring, pro-active policies and optimization service on
the BULMA case study application developed within WP7, (vi) integration of the core architectural
components.
1. Introduction
1.1. Scope of the Document
This document describes the final version of the infrastructure services that enable the EUBra-BIGSEA ecosystem. These services include: (i) tools for predicting Big Data application performance, (ii) mechanisms for horizontal and vertical elasticity, and (iii) approaches based on pro-activity and on optimization to perform horizontal and vertical scaling of running big data applications, as well as load balancing of the infrastructure. The results of the initial validation performed by considering the execution of the case study applications developed within WP7 are also reported.
1.2. Target Audience
This document has two main targets. One target is internal: the document serves as documentation of the WP3 activities on performance prediction, pro-active and optimization-based policies, service contextualization, and mechanisms for scalability. With this target in mind, this document describes
the software architecture and the mutual interrelationship of WP3 services. Therefore, it aids the EUBra-
BIGSEA team of technical experts involved in the development of the application layers within WP4 and
WP5. The second target is the general cloud community: a full description of all the open-source
components is provided, as well as a reference for the installation of the cloud services that enable the
different features developed in the project.
1.3. Structure
Section 2 summarises and revises the software architecture of EUBra-BIGSEA. Section 3 describes the final implementation of the optimization module, which minimises the tardiness of soft-deadline applications under heavy load. Section 4 provides an update, with respect to deliverable D3.2, of the hybrid machine learning tools developed for predicting the execution time of big data applications. Sections 5 and 6 describe
the implementation of the pro-active policies module and load balancing, respectively. Sections 7 to 12
are devoted to the initial validation of all the platform services, from the pro-active policies tested on
OpenNebula to the optimization-based policy module tested also on the BULMA application developed
in the context of WP7. Conclusions are drawn in Section 13.
Additionally, the deliverable includes a glossary and several annexes describing tool tutorials and
installation guidelines and summarising the main changes implemented in the platform components
with respect to the release reported in deliverable D3.4.
2. WP3 Architecture summary
EUBra-BIGSEA has developed mechanisms and policies to deal with resource provision, reduction of Big
Data application execution costs and Quality of Service (QoS) guarantees. EUBra-BIGSEA proposes
innovative and intelligent modules, which address challenges in exploiting the best use of infrastructure
resources and minimizing costs by monitoring and allocating resources dynamically to meet deadlines.
Figure 1 shows the overall architecture and the execution workflow, which enables a flexible interaction
among the different components.
Figure 1 - EUBra-BIGSEA platform architecture
The reference architecture consists of a monitoring system, based on Monasca, which gathers
information related to the metrics (as batch queue length, network load, CPU utilization, application
performance, application progress, container resources) and two software layers implementing dynamic
adaptation policies. The first adaptation policies module is responsible for specifying and enacting high-
level pro-active rules to trigger infrastructure adaptation and load balancing while the other
implements advanced adaptation policies based on performance modelling and optimization methods
to automatically adapt the system configuration, provide QoS guarantees and reduce resource usage
costs. The main components of the EUBra-BIGSEA architecture include: (i) the Broker API to submit job
requests and to specify additional information concerning application configuration and deadlines, (ii)
the Performance Prediction Service, which - given the current configuration - estimates application
execution time, (iii) the Optimization Service, which - given a deadline - estimates, at deployment time,
the initial configuration that minimizes cost (OPT_IC component), or minimizes tardiness of soft-
deadline applications under heavy loads (OPT_JR component), (iv) the Pro-active Policies module, which
provides the bridge between the users' job submission and the execution platforms, and adds or
removes resources based on triggers associated with thresholds in the monitoring infrastructure, and
finally (v) the contextualization and configuration service, which deploys applications starting from
Tosca blueprints and actuates the Pro-active Policies module decisions. The overall architecture ensures
at any point in time the availability of the appropriate number of resources to deal with the existing
workload (additional details are available in deliverable D3.4).
Initially, users submit an application with a specific deadline through the broker API, which triggers the
execution of the Optimization Service. In turn, the Machine Learning sub-component is used by the
optimization policies module to estimate a Big Data application execution time, and identify an initial
solution in terms of the number of VMs. The latter is refined by a local search heuristic algorithm, which
is based on the performance prediction service in order to identify the final solution to be deployed. The
initial deployment is obtained by relying on the configuration and contextualization service.
During the application execution, the QoS monitoring system is responsible for periodically gathering information and verifying whether there is enough capacity for its execution. In particular, the pro-active policies module queries the performance prediction service for an estimate of the application's finishing time given the current resource allocation. In case of violation or over-provisioning, the
pro-active policies module enacts high-level rules and triggers the infrastructure adaptation by
horizontally or vertically scaling the resource containers currently deployed and balancing the load
among the physical servers.
There are four execution plugins for the Broker, each suited to a specific application and infrastructure context. Two of these were presented previously: OpenStack Generic, which starts virtual machines on OpenStack, executes all the required tasks for the job and provides log-based monitoring; and Spark-Sahara, which uses Sahara (Data Processing as a Service for OpenStack) to deploy clusters with predefined configurations and executes Spark jobs. Two new plugins, Spark-Mesos and Marathon-Chronos, were implemented to synchronize the efforts among the WP3 teams; they are detailed in Section 5.
3. OPT_JR implementation
As discussed in EUBra-BIGSEA deliverable D3.4, as part of task T3.5, EUBra-BIGSEA implements the OPT_IC optimization component, capable of calculating a configuration, in terms of allocated VMs, that is minimal from a cost perspective and fulfils a given deadline.
When a new application is submitted, two cases are possible: either the system capacity can accommodate the request and the execution starts immediately according to the number of VMs identified by OPT_IC, or the new application needs additional resources not currently available in the cluster. In the latter case, the goal is to update the cluster resource allocation so as to minimize the (weighted) sum of the tardiness of soft-deadline applications (as a result, the submitted application can be executed if and only if the soft-deadline applications can lend resources).
This section is structured as follows: in Section 3.1, the definition of the minimum tardiness optimization
problem is recalled from deliverable D3.2. In Section 3.2, the algorithms used to compute the new VMs
assignment for the soft-deadline applications are discussed. The validation of the OPT_JR component is
reported in Section 12.2.
3.1. Problem definition
The OPT_JR optimization module considers the set of entities denoted in Table 1.
The goal is to minimize the weighted tardiness of soft-deadline applications. If we denote by E[Ri] the average execution time for application i, the tardiness is defined as E[Ri] − Di if E[Ri] > Di, and 0 otherwise. The overall objective is given by:

min Σ_{i ∈ Ad} wi (E[Ri] − Di)    (1)
Note that OPT_JR is used when the deadline Di cannot be met when the system is under heavy load; as a
result, for every application the tardiness is strictly positive.
Table 1 - Description of the terms used
Entity Description
Ad The set of indexes of soft deadline applications
Machine learning model application i profile
Di The application deadline
Mi Total RAM memory available at the YARN Node Manager for application i
mi Container size (in terms of RAM) for application i
Vi Total number of vCPUs available at the YARN Node Manager for application i
vi Container size (in terms of vCPUs) for application i
The initial number of cores estimated for application i
wi Application’s weight
N Available spare capacity for soft-deadline applications (total number of cores)
As discussed in Section 7 of EUBra-BIGSEA deliverable D3.2, relying on ML models, big data applications
average execution time can be approximated by:
E[Ri] ≈ Xci / ni + X0i    (2)
where ni denotes the number of cores allocated to application i. If N denotes the total number of cores available, OPT_JR minimizes Equation (1) under the constraint:

Σ_{i ∈ Ad} ni ≤ N
By exploiting the Karush-Kuhn-Tucker conditions, at the global optimum the number of cores minimizing the tardiness of every application is given by:

ni = N √(wi Xci) / Σ_{j ∈ Ad} √(wj Xcj)    (3)
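The closed-form split of equation (3) can be sketched numerically as follows. This is an illustrative helper, not OPT_JR code; wi and the hyperbola coefficients Xci follow the notation of Table 1 and the ML model:

```python
import math

def initial_allocation(w, chi_c, N):
    """Closed-form KKT split of N cores among soft-deadline applications.

    w[i]     : weight of application i
    chi_c[i] : hyperbola coefficient in the ML model E[Ri] ~ chi_c[i]/n_i + chi_0[i]
    N        : total number of cores available for soft-deadline applications
    Returns the (fractional) number of cores n_i per application.
    """
    roots = [math.sqrt(wi * ci) for wi, ci in zip(w, chi_c)]
    total = sum(roots)
    return [N * r / total for r in roots]

# Two applications with equal weights: cores split proportionally to sqrt(chi_c).
print(initial_allocation([1.0, 1.0], [400.0, 100.0], 30))  # -> [20.0, 10.0]
```

In practice the fractional values are then rounded to multiples of the container size, as done in steps 11-12 of the algorithm.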
3.2. Algorithms
Since (2) and (3) are coarse approximations, we only exploit them to quickly obtain an initial estimate for
the optimal solution. Afterwards, in the second step of our optimization heuristic, we take this
configuration as the starting point of a design space exploration via a search technique that evaluates
application performance through more accurate performance methods and relies on dagSim or the
Lundstrom tool to identify the final solution. The solution implemented is described in Figures 2 and 3. During the first step, for each application, (i) an initial solution is computed from Equation (3), and then (ii) a bound is calculated (step 7).
The bound is the minimum number of cores such that E[Ri]<Di and then the application tardiness
becomes 0. The procedure to compute the bound value is a simple line search, which considers as
starting point the number of cores identified by the OPT_IC tool at deployment time. The OPT_JR
optimization module retrieves this value from a DB table called OPTMIZER_CONFIGURATION_TABLE (see
deliverable D3.4). The process of calculating the bounds considers two cases: either the execution time
returned by the performance predictor adopted (i.e., Lundstrom or dagSim) for the current number of
cores exceeds the requested deadline or falls below it. Accordingly, the number of cores in the given configuration is increased or decreased by a fixed step (currently equal to the value of the Vi parameter) and the predictor is invoked again. The process iterates, reducing the distance from the deadline, until it stops. The predictor component must also consider the
so-called rescaling issue, i.e., provided the current stage in execution, the tool rescales the residual
application execution time, provided by the performance prediction tool, considering the ratio between
the elapsed time of the application and the current stage end time (estimated again by the performance
prediction tool).
The rescaled time remaining for the application is computed according to the following formula:

RescaledTimeRemaining = EstimatedTimeRemaining × (ElapsedTime / EstimatedStageEndTime)
where the EstimatedTimeRemaining and EstimatedStageEndTime are obtained by the performance
prediction tool and ElapsedTime is obtained considering the current time and the application start time
stored in the RUNNING_APPLICATION_TABLE (see Appendix B.1 and EUBra-BIGSEA Deliverable D3.4 for
further details).
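The bound computation and the rescaling step described above can be sketched as follows. This is a minimal illustration with a synthetic predictor standing in for the dagSim/Lundstrom invocation; function names and the rescaling formula as coded here are assumptions read from the prose, not the actual OPT_JR implementation:

```python
def compute_bound(predict, n_start, deadline, step, n_max=512):
    """Line search for the minimum number of cores n with predict(n) < deadline.

    predict : callable returning the estimated execution time for n cores
              (stands in for a dagSim or Lundstrom predictor call)
    n_start : starting point, e.g. the OPT_IC deployment-time allocation
    step    : search granularity (the text uses the Vi parameter)
    """
    n = n_start
    if predict(n) > deadline:
        # Deadline violated: add cores until the prediction meets it.
        while predict(n) > deadline and n < n_max:
            n += step
    else:
        # Deadline already met: remove cores while it still holds.
        while n - step > 0 and predict(n - step) <= deadline:
            n -= step
    return n

def rescale_remaining(est_remaining, elapsed, est_stage_end):
    # Assumed rescaling formula: scale the predicted residual time by the
    # ratio between elapsed time and the estimated current stage end time.
    return est_remaining * (elapsed / est_stage_end)

# Synthetic hyperbolic predictor E[R](n) = 1200/n + 10 with deadline 100.
print(compute_bound(lambda n: 1200 / n + 10, n_start=4, deadline=100, step=2))  # -> 14
```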
www.eubra-bigsea.eu| [email protected] |@bigsea_eubr 22
The algorithm continues by considering the initial solutions calculated through Equations (2) and (3). They are adjusted (steps 11-12) to round the number of cores to a multiple of those available within the YARN container/VM.
The local search process takes place within the second stage of the algorithm (line 16 onward), which loops over a general iteration index. At each iteration, a different algorithm, called Approx_Algorithm
(line 18, the algorithm is described later in this document) is invoked to generate a list L2 of candidate
applications for which the total Objective Function (OF) can be improved by moving the same number of
cores from one application to another. The local search neighborhood is defined by this simple move.
The list L2 computed by the approximated algorithm at step 19 includes the set of moves that improve
the OF evaluated by using the ML models and formula (2). The list L1 (introduced at step 17) will store,
at each iteration, the set of moves leading to an improvement of the OF evaluated by relying on either
dagSim or Lundstrom predictors.
The earlier use of an approximation function to estimate application execution time mitigates the
performance issues deriving from the actual evaluation of the cores shift by using a more accurate
predictor (Lundstrom or dagSim). By exploiting the approximation-based algorithm (based on ML
reported in Figure 3) that explores exhaustively the neighborhood of size O(N2), where N is the number
of soft-deadline applications, the overall algorithm execution time decreases to a few milliseconds. As a result,
the effect of the cores shift is precisely calculated only on those promising candidates identified earlier.
Note that, to improve also the estimate provided by the ML models, at every call of the performance predictor the parameters Xci and X0i are updated by fitting the hyperbola of equation (2) with the last two values (ni, E[Ri]) returned by the performance predictor.1
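This refit has a direct closed form: two points determine the hyperbola of equation (2) exactly. A small illustrative sketch (the helper name is hypothetical):

```python
def refit_hyperbola(p1, p2):
    """Fit E[R](n) = chi_c/n + chi_0 exactly through two (cores, time) points.

    p1, p2 : (n, E[R]) pairs returned by the performance predictor
             (dagSim or Lundstrom) at two different core counts.
    """
    (n1, t1), (n2, t2) = p1, p2
    chi_c = (t1 - t2) / (1.0 / n1 - 1.0 / n2)  # slope in the 1/n variable
    chi_0 = t1 - chi_c / n1                    # intercept
    return chi_c, chi_0

# Two points sampled from E[R](n) = 1200/n + 10 are recovered exactly.
print(refit_hyperbola((10, 130.0), (20, 70.0)))  # -> (1200.0, 10.0)
```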
At every iteration, the best move minimizing the total objective function is committed (step 42) and the search continues; if no improving move exists, the search stops in a (local) optimum. Each member of the lists L1 and
L2 includes the number of cores of two different applications after the core shift and the expected value
of the delta of the Objective Function.
By cycling on the approximated list L2, the delta of the objective function is now evaluated for every
candidate solution by invoking the predictor component2 with the new core assignment. If the new
1 Performance predictor execution values are also cached in the DB tables (see also EUBra-BIGSEA deliverable
D3.4) to accelerate the overall procedure execution.
2 (dagSim or Lundstrom, lines 26-27).
cores configuration implies an improvement of the objective function (line 31), then the configuration is
stored in the sorted list L1 (line 33). Once the entries in the list L2 have been analysed, the process
retrieves from the sorted list L1 the configuration for which the objective is minimum; it performs the
move and the current solution is updated accordingly (line 42). The optimization process ends either
because the maximum number of iterations MAX has been reached or the last iteration has not
generated any improvement (line 38).
The approximation-based algorithm generating list L2 is reported in Figure 3. Its purpose is to improve overall performance: the objective function of the different attempts to add (or remove) cores is not actually calculated by a predictor component but only approximated (lines 8-9) through the machine learning model.
All the candidate configurations for which the objective function effectively improves (as per line 13) are stored in a sorted list (line 14) and returned to the main algorithm for actual evaluation (line 17).
Figure 3 - The Approximation-based Algorithm
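The core-shift neighborhood explored by the approximation-based algorithm can be sketched compactly. This is a simplified illustration of the control flow, not the implemented code: the accurate predictor is stubbed out and the move evaluation uses only the ML approximation of equations (1) and (2):

```python
def approx_of(alloc, w, chi_c, chi_0, D):
    """Approximate objective (1): weighted tardiness under the ML model (2)."""
    return sum(wi * max(ci / n + zi - di, 0.0)
               for n, wi, ci, zi, di in zip(alloc, w, chi_c, chi_0, D))

def best_move(alloc, w, chi_c, chi_0, D, step=2):
    """One iteration over the 'shift step cores from i to j' neighborhood.

    Returns the best improving allocation, or None when the current
    allocation is a local optimum for this neighborhood.
    """
    best = approx_of(alloc, w, chi_c, chi_0, D)
    best_alloc = None
    for i in range(len(alloc)):          # donor application
        for j in range(len(alloc)):      # receiver application
            if i == j or alloc[i] <= step:
                continue
            cand = list(alloc)
            cand[i] -= step
            cand[j] += step
            val = approx_of(cand, w, chi_c, chi_0, D)
            if val < best:
                best, best_alloc = val, cand
    return best_alloc

# Shifting 2 cores to the heavier application lowers the weighted tardiness.
print(best_move([10, 10], w=[1, 1], chi_c=[2000, 500], chi_0=[0, 0], D=[50, 50]))  # -> [12, 8]
```

In the full algorithm, the improving candidates found this way (list L2) are then re-evaluated with dagSim or Lundstrom before a move is committed.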
4. Hybrid Machine Learning Module Implementation
This section focuses on the description of the Hybrid Machine Learning (HML) prototype, which is used
for the performance prediction of big data cloud applications. The machine learning model is used by
the optimization components (see D3.4, Section 6) in order to define the minimum cost infrastructure
configuration which fulfils the deadline constraints. In particular, this component is based on the hybrid
machine learning approach initially published in Ataie et al.3 and described in EUBra-BIGSEA Deliverable
D3.2, Section 5.5, which combines analytic and machine learning models and is able to identify accurate models while reducing the number of experiments needed for big data application profiling.
The HML module replaces the fluid simulation version initially foreseen for the dagSim simulator: providing a closed formula for evaluating big data application execution time allowed us to develop analytical optimization models, which are used to identify an initial promising solution in the implementation of the OPT_IC and OPT_JR components.
The HML prototype is implemented to support the design-time phase. Indeed, the accuracy of the final machine learning model needs to be assessed off-line under the supervision of a system administrator, and therefore cannot be fully automated and integrated into the run-time platform.
The main target of this section is the WP3 team implementing the prototype for the performance prediction and optimization services. In particular, the outcomes of the Hybrid Machine Learning tool will be used as an input to the latter. Moreover, this section describes extensively the software architecture as well as all the open-source components.
This section consists of four parts: the supported platforms, together with the configuration and the dataset preparation, are described in Sections 4.1 and 4.2, respectively. The overall architecture, the scripts, the functions and the extrapolation capabilities of the approach are described in Section 4.3. The package information is illustrated in Section 4.4, while the deployment and installation guidelines are included in Appendix A.
4.1. Configuration and Deployment
The Hybrid Machine Learning prototype can be deployed on any machine with GNU Octave installed. The following sections describe the requirements and instructions in detail.
The Hybrid Machine Learning prototype for QoS prediction is available as open source under the Apache
2.0 license on GitHub (https://github.com/eubr-bigsea/HybridMachineLearning). It was implemented in
GNU Octave and consists of a set of scripts (m files) and functions that predict the execution time of Spark applications with good accuracy4. The core of our prototype is the use of the Support
3 E. Ataie, E. Gianniti, D. Ardagna, and A. Movaghar, “A combined analytical modeling machine learning approach for performance prediction of MapReduce jobs in Hadoop clusters,” in Proc. of SYNASC, 2016.
4 The HML module has also been tested with MapReduce and Tez applications.
Vector Regression technique to estimate the execution time of each application. The implementation depends on the LIBSVM5 open-source library, released under the BSD license. This library is used to train a model on a training data set and then to predict values on a testing data set.
4.2. Configuration and dataset preparation
The first step is to collect and process the Spark logs and to generate the performance values from an analytical model (e.g., the approximated formula or dagSim), which are used as input for the training of the Hybrid algorithm at the core of the implemented tool.
In order to process the logs of the executed applications and obtain the summary csv files needed as input for the next steps, we run the set of scripts available on GitHub (https://github.com/deib-polimi/Spark-Log-Parser, see also D3.4 Section 6). An example of the processed logs generated by the Spark Log Parser is shown in Figure 4. In particular, the generated summary csv contains a set of features characterizing the Spark jobs obtained during the execution of a query: the completion time of the jobs, the number of stages and of tasks per stage, the max/min and average task times, the shuffle time, and the bytes sent. Moreover, the number of users, the application data set size, and the number of containers are included.
Figure 4 - Example of processed data logs in csv format for Spark jobs
5 https://www.csie.ntu.edu.tw/~cjlin/libsvm
After the successful download of the set of scripts, the execution process can be started by navigating to
the bootstrap/octave directory.
Initially, the estimated response times of the Spark application for a single-user scenario must be computed and the initial set of synthetic data samples (the Knowledge Base, KB) must be generated. To obtain this information, dagSim, the JMT simulators, or the approximated formula should be executed. The formula used for generating the dataset is described in Equation (4).
T ≈ Σ_{i=1..k} ⌈ni / nc⌉ · ti      (4)
where ni is the number of tasks and ti is the average execution time of each task associated with stage i. Moreover, nc is the number of cores available during the job execution and k is the total number of stages. Such an approximation has been proven accurate in scheduling theory for systems in which the number of tasks is greater than the number of available resources. The rationale of this formula is the following: the term ni/nc equals the number of waves required to execute the tasks of stage i sequentially. During the first ni/nc − 1 waves, the tasks statistically keep the nc cores busy, while during the very last wave the final tasks (fewer than ni) complete the execution of stage i. The approximation formula has been implemented as a function and is one of the components of the prototype.
4.3. Description of scripts and functions
The ML model produced by our prototype, based on Equation 4 and characterized by the two parameters chi_0 and chi_c, is then used by the optimization service component to predict big data application execution time.
In the following, a detailed step-by-step description of the execution of our code is presented. The first step is to obtain the analytical data based on the approximated formula by executing the “retrieve_response_times_with_formula” script, which implements Equation 4. Alternatively, when dagSim or another AM simulation process is used, the data is stored as a vector in a variable named “query_analytical_data” and is ready to be used by the script responsible for the next step of the process. The “retrieve_response_times_with_formula” script calls three functions, named “read_data”, "clear_outliers" and “compute_waves_prediction”, to read the data from the summary csv files, clean them, and finally compute the approximated formula introduced in Equation 4.
Then, the identification of the optimal thresholds for the proposed Hybrid approach follows. In particular, an optimal combination of the iteration and stop thresholds is needed (see Figure 5). At this point, we should clarify that the available operational data samples in the KB are partitioned into four disjoint sets, called training set, cross validation set CV1, cross validation set CV2, and test set.
The pseudo-code of our proposed hybrid algorithm, which is an improved and refined version of our
previous work (D3.2), is shown in Figure 5. A synthetic data set, used to form an initial KB, is generated
at line 2 based on the AM. The KB is then used to select and train an initial ML model at line 3. Since the
real data samples might be noisy, an iterative procedure is adopted for merging real data from the
operational system into the KB, which is implemented at lines 4–15. Adding a new configuration in the
KB means that many operational data points are included which subsequently are split into the training,
test and cross-validation sets respectively. The operational data for all available configurations are
gathered and then merged into KB at lines 5–8. Then the updated KB is shuffled and partitioned at lines
10 and 11 as stated before. Using these sets, line 12 is dedicated to the selection of an ML model among
the alternatives and to retraining it. Then, some error metrics are evaluated at line 13. At lines 14 and 15, two conditions are checked: both consider the Mean Absolute Percentage Errors (MAPEs) on the training set and on the cross validation set CV2 (shown as tr-Error and CV2-Error in Figure 5) and check whether they are lower than the corresponding thresholds (itrThr and stopThr, respectively).
The error on the training set determines if the model fits well on its training set itself. On the other
hand, the error on the CV2 set determines if the model has generalization capability. If the values of
errors for the first condition are not small enough, the algorithm jumps to line 9 to reshuffle, to split the
KB into train, CV1, and CV2 sets, and to choose a different model.
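The iterative procedure just described can be summarized in Python-style pseudocode (our own condensed rendering of Figure 5; the line-number comments refer to the figure, and all helper names are placeholders for the corresponding Octave routines):

```
hybrid_learning(AM, operational_data, itrThr, stopThr):
    KB    <- generate_synthetic_samples(AM)          # line 2: initial KB from the AM
    model <- select_and_train(KB)                    # line 3: initial ML model
    for each configuration in operational_data:      # lines 4-15
        merge configuration samples into KB          # lines 5-8
        repeat:
            shuffle(KB)                              # line 10
            train, CV1, CV2, test <- partition(KB)   # line 11
            model <- select_and_train(train, CV1)    # line 12: model selection + retraining
            trError  <- MAPE(model, train)           # line 13
            cv2Error <- MAPE(model, CV2)
        until trError < itrThr and cv2Error < stopThr  # lines 14-15
    return model
```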
Figure 5 - Hybrid Machine Learning Algorithm
Figure 6 - Components of Hybrid Machine Learning Tool
Moreover, since the goal is to identify models capable of achieving generalization, the operational data
obtained with the largest configurations are included only in the CV2. While the training set is used to
train different alternative models, the CV1 set is exploited for SVR model selection, while CV2 is used as
a stopping criterion and to determine the best values for the thresholds of the iterative algorithm. The
error on the training set determines whether the model fits its training set well: if this error is small enough, the model avoids underfitting (high bias). The error on the CV2 set, instead, determines whether the model generalizes: if this error is sufficiently small, the model avoids overfitting (high variance).
The test set has been used for evaluating the accuracy of the selected model and is restricted to specific configurations (see Section 11). The main script for obtaining the desired thresholds (itrThr and stopThr) is “compare_prediction_errors”, which generates the state space of threshold combinations for the Hybrid approach; its results are stored in a folder named HyOptimumFinding, whose path can be set in the code. Before executing the script, the user should set a number of parameters, shown in Table 2. The first one is called “configuration.runs” and specifies the numbers of cores to be examined. In our experiments (see Section 11), data samples from 20 to 120 cores were gathered.
Another important parameter is configuration.missing_runs, which lists the core configurations that are not considered for the KB during the machine learning process. Regarding the threshold parameters, the user should set the range over which the algorithm explores in order to find the optimum pair, while the maximum number of iterations and the number of seeds should also be provided. For every threshold combination, the algorithm runs with 50 different seed values to generate different results. Notice that, for the extrapolation and interpolation processes, the seed values are different from the ones used to identify the optimal threshold combination, in order to demonstrate the robustness of the algorithm’s accuracy (see Table 2).
Table 2 - Parameters of compare_prediction_errors

Name of parameter            Description
configuration.runs           Numbers of examined cores
configuration.missing_runs   Missing core configurations used for validation
outer_thresholds             Thresholds minimizing the MAPE on the training set
inner_thresholds             Thresholds minimizing the MAPE on the cross validation set
max_inner_iterations         Maximum number of iterations of the algorithm
seeds                        Seeds for the random number generator
Two main functions are invoked by the previous scripts: “clear_outliers” and “model_selection_with_thresholds”. The first excludes rows of the data arrays where the value in some column is more than three standard deviations away from the mean, while the latter performs model selection on the training set according to the performance on the cross validation set. The general model is defined via the search spans of C and epsilon, whose optimal values are returned in the end. Additionally, the latter function applies proper scaling in order to obtain accurate computations.
The average error for each combination of thresholds for the Hybrid approach is calculated by the “computeAvgsOfHybridForOptimumFinding” function, which finally returns the combination that minimizes the error.
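The three-sigma rule applied by clear_outliers can be illustrated as follows (our own minimal Python sketch of the rule described above, not the Octave source):

```python
from statistics import mean, pstdev

def clear_outliers(rows):
    """Drop rows where any column value lies more than three standard
    deviations away from that column's mean (the rule described for
    the clear_outliers function)."""
    cols = list(zip(*rows))
    means = [mean(c) for c in cols]
    stds = [pstdev(c) or 1.0 for c in cols]  # guard zero-variance columns
    def keep(row):
        return all(abs(v - m) <= 3 * s
                   for v, m, s in zip(row, means, stds))
    return [row for row in rows if keep(row)]

# A single extreme value among twenty regular samples is filtered out
data = [[1.0]] * 20 + [[50.0]]
print(len(clear_outliers(data)))  # 20
```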
4.3.1. Extrapolation on few and many cores and Interpolation capabilities
After finding the optimal combination, the extrapolation and interpolation capabilities of the algorithm in the upper region of the configuration set are investigated by calculating the relative error. As before, the same script (compare_prediction_errors) is used; however, in this case the values of the optimal thresholds are provided as input. Regarding the extrapolation on many cores, the process starts by progressively removing the configurations with the largest capacity from the training set and the cross validation sets CV1 and CV2 and moving them to the test set. In other words, considering that the different core configurations are configuration.runs = [20 30 40 48 60 72 80 90 100 108 120], in the first scenario the training and CV1 sets included data coming from the real system only for configurations from 20 to 100 cores, while the CV2 and test sets included experimental data for 108 and 120 cores, respectively. In the second scenario, real data for the training and CV1 sets from 20 to 90 cores were considered, the CV2 set included operational data for 100 cores, while the test set included experimental results in the range [108, 120] cores, and so on.
Concerning the extrapolation on few cores, in the lower region of the configuration set, the calculation of the error on response time prediction starts by including only the 20-core point in the test set; then, moving gradually from the left side of the configuration axis towards the right, other points are added to the test set. However, the quality of the results obtained from this process is poor; we therefore recommend applying the approach only for right extrapolation.
For generating the output plots of right and left extrapolation, the "computeExtrapolationFromRight" and "computeExtrapolationFromLeft" scripts are executed and the graphs are stored in the folder indicated by the user.
According to our experimental process (the related results are listed in Section 11), three different scenarios are considered to assess the interpolation capability of the algorithm, using the “compare_prediction_errors” script.
For generating the output plots of interpolation, the "computeInterpolation" script is executed and the results are stored in the folder indicated by the user.
4.4. Package Information and Code Structure
The file structure of the delivered code is shown in Figure 7. The Hybrid Machine Learning tool is distributed as a set of Octave scripts containing the source code and dependencies. There are five core components, two of which have a more complex structure, since a set of functions is invoked when they are executed. The most important outcomes of the prototype are the Machine Learning model, which is used as an input to the optimization service, and the extrapolation and interpolation components, which prove the effectiveness of the approach.
Figure 7 - Structure of Hybrid Machine Learning code
5. Pro-active policies implementation
The pro-active policies system comprises three coupled modules that deal with the three main interactions between the infrastructure and the tasks: the broker, the monitor, and the controller. The main details about the architecture and the initial implementation were already described in D3.4. In this section, we discuss the recent updates, with the new implementations and changes made to the pro-active policies.
5.1. Vertical Scaling Pro-active Policies
The controller service comprises three plugins: the Controller plugin, the Actuator plugin, and the Metric
Source plugin. The Controller and the Actuator plugins implement the Controller and Actuator
components, respectively. The Metric Source plugin is responsible for getting application metrics from a
metric source, like Monasca, and returning them to the Controller.
The Controller plugin receives one or more error values associated with one or more metrics, from the
metric source. The error values are the difference between the expected and the measured values of
the metrics (e.g., the difference between the expected and the observed progress). Based on these
metrics, the controller decides when and how many resources should be allocated/deallocated from the
infrastructure to run the application properly.
5.1.1. Contextualization
Currently, there are three controller plugins available. The reference controller plugin is based on fixed-step actuations when there is a progress error, as detailed in the previous deliverable (see EUBra-BIGSEA D3.4). The two new controllers are:
1. Proportional Controller: changes the number of allocated resources based on the error values it
receives from the metric source and a proportional factor (also known as the proportional gain)
received as a parameter. The actuation size is proportional to the error value received and to
the proportional factor. The higher the error absolute value, the higher the number of resources
added or removed;
2. Derivative-proportional Controller: changes the number of allocated resources based on the
current error value, the difference between the current value and the last received and two
parameters: the proportional and derivative factors. The actuation size is proportional to the
error value and the factors received. The number of resources to add or remove is calculated as
shown in the expressions below:
cp = -fp * error
cd = -fd * (error - lasterror)
size = cp + cd
where cp is the proportional component, cd is the derivative component, and size is the amount of resources; fp is the proportional factor parameter and fd is the derivative factor parameter. The proportional component is purely reactive, while the derivative one contains basic predictive logic.
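The expressions above can be sketched in a few lines of Python (illustrative only; the function name and the example gains are ours, not taken from the plugin source):

```python
def pd_actuation_size(error, last_error, fp, fd):
    """Derivative-proportional actuation: combine a reactive term,
    proportional to the current error, with a predictive term,
    proportional to the error's rate of change."""
    cp = -fp * error                 # proportional component
    cd = -fd * (error - last_error)  # derivative component
    return cp + cd

# Application 10 points behind the expected progress (error = -10) and
# falling further behind since the last round (last_error = -6):
print(pd_actuation_size(-10, -6, fp=0.5, fd=0.25))  # 6.0 -> add resources
```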
The Actuation plugins act on the infrastructure based on action requests from the Controller plugins.
Currently, the actuation plugins available act on KVM virtual machines, getting and changing the amount
of allocated resources using libvirt. For that, they use SSH to access the infrastructure’s compute nodes.
The actuation plugins available are:
1. KVM CPU cap plugin: the plugin changes the number of CPU resources allocated to the virtual
machines by controlling the CPU cap at the hypervisor layer;
2. KVM I/O plugin: the plugin changes the number of both CPU and disk I/O resources allocated to
the virtual machines. It requires two additional parameters, used as reference to the actuation:
iops_reference and bs_reference. The iops_reference parameter is the maximum disk
throughput, in operations per second, that can be allocated to a virtual machine. The
bs_reference parameter is the maximum disk throughput, in bytes per second, that can be
allocated to a virtual machine. These values must be added to the controller configuration file
(controller.cfg), in the actuation section.
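As an illustration, the actuation section of controller.cfg might then contain entries like the following (the section name and the values are our assumptions; only the two parameter names come from the text above):

```ini
[actuation]
; max disk throughput per VM, in operations per second (illustrative value)
iops_reference = 2000
; max disk throughput per VM, in bytes per second (illustrative value)
bs_reference = 104857600
```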
The actuation plugins implement the following methods:
1. adjust_resources(self, vm_data): modifies the environment properties, like allocated resources, using as parameter a dictionary that maps virtual machine IDs to cap values. Typically used to adjust the CPU cap of the VMs allocated to the application;
2. get_allocated_resources(self, vm_id): returns the cap of the given virtual machine;
3. get_allocated_resources_to_cluster(self, vms_ids): returns the caps of the given cluster's virtual machines.
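The contract above can be sketched as a small abstract base class (our own illustrative rendering, not the project's source; the toy FakeActuator only shows how a controller might exercise the interface):

```python
from abc import ABC, abstractmethod

class ActuatorPlugin(ABC):
    """Sketch of the actuation plugin contract described above."""

    @abstractmethod
    def adjust_resources(self, vm_data):
        """vm_data maps virtual machine IDs to cap values; apply each cap."""

    @abstractmethod
    def get_allocated_resources(self, vm_id):
        """Return the cap of the given virtual machine."""

    @abstractmethod
    def get_allocated_resources_to_cluster(self, vms_ids):
        """Return the caps of the given cluster's virtual machines."""

class FakeActuator(ActuatorPlugin):
    """Toy in-memory implementation, useful for testing controllers."""
    def __init__(self):
        self.caps = {}
    def adjust_resources(self, vm_data):
        self.caps.update(vm_data)
    def get_allocated_resources(self, vm_id):
        return self.caps.get(vm_id)
    def get_allocated_resources_to_cluster(self, vms_ids):
        return {vm: self.caps.get(vm) for vm in vms_ids}

act = FakeActuator()
act.adjust_resources({"vm-1": 50, "vm-2": 70})
print(act.get_allocated_resources("vm-1"))  # 50
```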
5.2. Pro-active Policies for OpenNebula
Another improvement in the pro-active component was the implementation of plugins for running Spark
over Mesos in OpenNebula clouds. The following subsections detail the Spark-Mesos plugin, responsible
for interpreting a job description file and submitting it to a Spark cluster that uses Mesos as resource
manager, and the actuation plugin, which interacts with the cloud management platform when
actuation is needed.
5.2.1. Spark-Mesos Plugin
The Spark-Mesos plugin uses a cluster preconfigured with Mesosphere to execute and manage Spark jobs. This plugin has basically the same functionality as the Spark-Sahara plugin, but with a different cluster provider. A great advantage here is the possibility of using the current approaches for vertical and horizontal scaling (EC3, IM, and CLUES) combined.
This implementation works for both OpenNebula and OpenStack. A parameter in the request body, called vm_provider, must specify which virtual machine provider is being used; it is needed to make the broker able to find and actuate on the hosts. The plugin is also capable of running Spark binaries in JAR or Python formats; when the execution class is specified, the binary is assumed to be in the JAR format.
The plugin flow for an execution in an OpenNebula Infrastructure consists of the following steps:
1. The Mesos cluster master is accessed via SSH and all the needed steps to run a Spark application
are executed;
2. The binary and the input files provided in the request body are downloaded;
3. The Spark application is submitted, pointing to the Mesos cluster (using flag --master
mesos://ip:port);
4. The list of all the virtual machines running this Spark job (master and slaves) is retrieved;
5. Depending on the virtual machine provider:
1. OpenStack: the OpenStack client is used to discover the virtual machines ids on Nova,
these ids are the same ids for KVM;
2. OpenNebula: the ovm-client is used to map the IPs of the virtual machines and to
discover the KVM ids for each, which are not the same in OpenNebula;
6. The controller service is started (the same for Spark-Sahara); if the broker does not run in the
same network as the virtual machine provider, adjustments need to be made to tunnel the
access; this is the case, for example, when the worker machines are in a remote cluster,
accessible via VPN.
7. The monitoring service is started (the same for Spark-Sahara);
8. When the execution is finished, the files are removed and the services are stopped.
To execute a Spark job using this template, it is necessary to provide some specific information about the application, the job binary location, the output, and the infrastructure, besides the information regarding the controller and the monitor. It is important to highlight that, in this case, the application must organize its parameters in this exact order: [input files] [other arguments]. The configuration file below is an example with all this information.
[manager]
ip = 10.30.1.69
port = 1514
plugin = spark-mesos
bigsea_username = username
bigsea_password = password

[plugin]
binary_url = http://owncloud.lsd.ufcg.edu.br/as6sgb21xb/SparkPi.jar
# in this case SparkPi has no input files, but you can include input files
# in the same order as the parameters in the execution
input_files = http://storage.example/b21xb/input1.csv http://storage.example/as6sgxb/input2.csv
execution_parameters = 1000
execution_class = org.apache.spark.examples.SparkPi
vm_provider = opennebula

[scaler]
starting_cap = 50
scaler_plugin = progress-error
actuator = kvm-upv
metric_source = monasca
application_type = os_generic
check_interval = 10
trigger_down = 10
trigger_up = 10
min_cap = 20
max_cap = 100
actuation_size = 15
metric_rounding = 2
Figure 8 - Configuration file to execute the spark-mesos plugin
5.2.2. OpenNebula Actuator
There are only a few differences between actuating on VMs provided by OpenNebula and on those provided by OpenStack using the KVM CLI; they are basically related to how we can discover where (on which compute node) the VM is hosted.
First of all, unlike OpenStack, the VM ID that OpenNebula uses to identify each VM is not the same that KVM uses. As the controller receives only the VM ID to execute the virsh command and set the CPU cap, the broker is responsible for mapping the slave IPs from Mesos to the VM IDs in OpenNebula, so that the actuator can identify the respective ID in KVM. The code below shows how it works:
# List the VM information on OpenNebula
list_vms_one = 'ovm list --user %s --password %s --endpoint %s' % \
    (api.one_username, api.one_password, api.one_url)
stdin, stdout, stderr = conn.exec_command(list_vms_one)

# Get the IPs of each Spark executor from the Mesos API. This method
# blocks until the application is available on the Mesos API
executors = self._get_executors_ip()
vms_ips = executors[0]

# Extract the slave IDs from the output of the command listing OpenNebula VMs
vms_ids = self._extract_vms_ids(stdout.read())

# This loop maps the VM IPs to the OpenNebula IDs
executors_vms_ids = []
for ip in vms_ips:
    for id in vms_ids:
        vm_info_one = 'ovm show %s --user %s --password %s --endpoint %s' % \
            (id, api.one_username, api.one_password, api.one_url)
        stdin, stdout, stderr = conn.exec_command(vm_info_one)
        if ip in stdout.read():
            executors_vms_ids.append(id)
            break
Figure 9 - How the broker discovers the slave IDs from Mesos and OpenNebula and calls the actuator service
On the other side, the actuator needs to look for the host where the VM is allocated and execute the commands to set the resource allocation. The following code snippet shows how it works:
# This method receives as argument a map {vm_id: CPU cap}
def adjust_resources(self, vm_data):
    instances_locations = {}
    # Discover the vm_id -> compute node map
    for instance in vm_data.keys():
        # Access compute nodes to discover the VM locations
        instances_locations[instance] = self._find_host(instance)
    for instance in vm_data.keys():
        # Access a compute node and change the cap
        self._change_vcpu_quota(instances_locations[instance],
                                instance, int(vm_data[instance]))

# This method executes the adjustment actions
def _change_vcpu_quota(self, host, vm_id, cap):
    # SSH into the actual host
    self.conn.exec_command("ssh %s" % host)
    # List all the VMs to map the OpenNebula ID to the KVM ID
    stdin, stdout, stderr = self.conn.exec_command("virsh list")
    vm_list = stdout.read().split("\n")
    # Discover the KVM ID using the OpenNebula ID
    virsh_id = self._extract_id(vm_list, vm_id)
    # Set the CPU cap
    self.conn.exec_command("virsh schedinfo %s --set vcpu_quota=%s > /dev/null"
                           % (virsh_id, cap))
Figure 10 - How the actuator maps the KVM id to execute the virsh commands
5.3. OpenStack Opportunistic Plugin
In the context of cloud computing, many resources are idle most of the time. This happens because users allocate more resources than they need, typically even when elasticity mechanisms are used (e.g., thresholds for scaling are usually set to 60% or 70%). As a consequence, cloud providers have started offering these resources with a lower QoS to improve the infrastructure utilization.
Thus, we developed the concept of opportunistic resources in the EUBra-BIGSEA framework as a different approach to pro-activity. In this context, the broker workflow was modified to add a new decision layer. After the application submission, even when the user specifies a cluster size, the broker service requests from the optimizer service an initial cluster configuration size to execute the application. After that, considering the opportunistic scenario, the broker may redefine the cluster size with opportunistic instances, whose number is determined by calculating the availability of resources in the infrastructure.
Opportunism can be used with Spark applications simply by adding the opportunistic flag (opportunistic=True) to the submit request made to the broker. This is supported only by the Sahara plugin, because it is the only one in which we can control instance groups. This limitation is due to the fact that OpenStack does not properly implement opportunistic instances; to overcome it, we use a separate node group template (the subset of services that will run on the instances belonging to the group) to specify opportunistic instances in Sahara, so that we can preempt them when needed. Please refer to the Sahara docs6 to learn more about node group templates.
5.3.1. Opportunistic Service
The HSOptimizer7 is the service responsible for gathering information from the hosts of the cloud and
calculating the number of available resources it can use to spawn opportunistic instances to help speed
up the application processing time.
6 https://sahara.readthedocs.io/en/latest/devref/quickstart.html#create-node-group-templates
7 https://github.com/bigsea-ufcg/bigsea-hsoptimizer
It has two entry points on its API:
@hso_api.route('/optimizer/get_cluster_size', methods=['GET'])
def get_cluster_size()
@hso_api.route('/optimizer/get_preemtible_instances', methods=['GET'])
def get_preemtible_instances()
Figure 11 - Routes of the opportunistic service
The get_cluster_size() method is responsible for calculating the cluster size with respect to the resource availability, while the get_preemtible_instances() method gives a per-cluster priority list of instances to be deleted.
To compute the cluster size, the optimizer will take the following steps:
1. Request from Monasca the resource utilization in the last hour of the hosts on the partition of
the cloud that uses opportunism;
2. Use a load predictor to compute the availability of resources for the next 30 minutes;
3. Return the infrastructure availability.
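The three steps can be condensed into a toy sizing function (purely illustrative; the names, units, and the min-based decision rule are our assumptions, not the HSOptimizer source):

```python
def get_cluster_size(requested_size, predicted_free_vcpus, vcpus_per_instance):
    """Sketch of the opportunistic sizing decision: cap the requested
    cluster size by the number of opportunistic instances that the
    predicted free capacity for the next window can sustain."""
    sustainable = predicted_free_vcpus // vcpus_per_instance
    return min(requested_size, sustainable)

# 10 instances requested, but the forecast leaves room for only 7
print(get_cluster_size(10, predicted_free_vcpus=14, vcpus_per_instance=2))  # 7
```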
The predictor8 is implemented in R and uses time series models from the R forecast package to calculate four different estimates, returning the most conservative one.
5.3.2. Opportunism and Load Balancer
One important part of the opportunism feature is the pre-emptability of instances. This means that,
since they have lower QoS, they are the best candidates to be migrated or deleted when needed. The LB
service, detailed below, is responsible for monitoring the cloud state and making the decision to migrate
or kill instances. To perform these potentially disruptive actions, the load balancer uses the optimizer
APIs to query the cluster priority list, selecting opportunistic instances as the best candidates to act upon.
8 https://github.com/bigsea-ufcg/bigsea-hsoptimizer/blob/master/horizontal_scaling_optimizer/service/horizontal_scaling/resources/predictor.R
6. Load balancing implementation
The Load Balancer is a service that runs in the cloud infrastructure. Each Load Balancer is responsible for managing a specific subset of hosts where VMs are running EUBra-BIGSEA applications, periodically reallocating VMs that are on overloaded hosts based on a given heuristic. In addition, the Load Balancer can be configured to use the HSOptimizer (see Section 5.3) or the OPT_JR (see Section 3) service, which help select VMs that can be destroyed while minimizing SLA or QoS violations for the applications. The current version works on OpenStack infrastructures and requires that the infrastructure support live migrations (i.e., the hosts have shared storage for instance disks9).
6.1. Architecture
The load balancer requires access to the Monasca and Nova (OpenStack Compute) services with admin privileges. The higher privilege is necessary so it can discover all VM information and perform live migrations. The load balancer also requires SSH access to the hosts so it can retrieve CPU capacity limitation (i.e., the CPU cap) information through the KVM hypervisor. Accessing the CPU capacity information is necessary to understand the actual allocation of the machines. Cloud management platforms typically allow CPU capacity limitation10, but do not consider it during scheduling11.
The configuration with the HSOptimizer or the OPT_JR services is optional. The architecture of the load balancer is shown in Figure 12. Information about how to install, configure and use the load balancer is available in the documentation12 and in Appendix E.
9 https://docs.openstack.org/admin-guide/compute-configuring-migrations.html
10 https://wiki.openstack.org/wiki/InstanceResourceQuota
11 https://docs.openstack.org/ocata/config-reference/compute/schedulers.html
12 https://github.com/bigsea-ufcg/bigsea-loadbalancer#bigsea-wp3---loadbalancer
Figure 12 - Load Balancer architecture
6.2. Heuristics
The heuristics are responsible for periodically verifying which hosts are overloaded and taking actions to reallocate VMs from these hosts to others, making them less overloaded than before when possible. By performing these rebalancing operations pro-actively, the performance of the virtual machines, and of the applications running on them, is preserved.
Different heuristics may consider different aspects. Nevertheless, two parameters are common: cpu_ratio and wait_rounds. The cpu_ratio parameter in the configuration file represents the ratio of the number of CPUs that each host can use, which the heuristics take into consideration when determining whether a host is overloaded; the wait_rounds parameter represents the number of executions of the load balancer that each migrated instance must wait before any heuristic can choose it to be migrated again. The heuristics are implemented as plugins, enabling developers to easily add their own.
6.2.1. Balance Instances
This heuristic aims to balance the overloaded hosts by migrating the virtual machines that have the highest numbers of vCPUs. A host is considered overloaded when its total number of allocated vCPUs divided by its total number of CPUs is greater than the provided cpu_ratio. The heuristic pseudo-algorithm is detailed below. More information can be found in the component documentation13.
1. Select all hosts that are overloaded.
2. If no host is overloaded, or if all hosts are overloaded, nothing can be done.
3. For each overloaded host:
3.1 While selected_host is overloaded and still has instances that could be migrated:
3.1.1 Select the instance with the highest number of vCPUs that was not migrated recently.
3.1.2 Look for a host that is less loaded and can receive this instance without exceeding the cpu_ratio.
4. Migrate all selected instances.
5. Update the number of wait rounds for all migrated instances.
Figure 13 - Balance Instances algorithm
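The steps above can be sketched in Python as follows. This is an illustrative reimplementation, not the plugin's actual code; the host and VM data structures are assumptions.

```python
# Illustrative sketch of the Balance Instances heuristic. Hosts are dicts
# with "name", "cpus" (physical CPUs) and "vms"; VMs are dicts with "name",
# "vcpus" and "wait" (remaining wait rounds after a migration).

def overloaded(host, cpu_ratio):
    """A host is overloaded when allocated vCPUs / physical CPUs > cpu_ratio."""
    return sum(vm["vcpus"] for vm in host["vms"]) / host["cpus"] > cpu_ratio

def balance_instances(hosts, cpu_ratio, wait_rounds):
    """Return a list of (vm_name, source_host, target_host) migrations."""
    migrations = []
    over = [h for h in hosts if overloaded(h, cpu_ratio)]
    if not over or len(over) == len(hosts):
        return migrations  # nothing can be done
    for host in over:
        # candidates: not migrated recently, largest vCPU count first
        candidates = sorted(
            (vm for vm in host["vms"] if vm.get("wait", 0) == 0),
            key=lambda vm: vm["vcpus"], reverse=True)
        for vm in candidates:
            if not overloaded(host, cpu_ratio):
                break
            # try the least loaded hosts first
            for target in sorted(
                    hosts,
                    key=lambda h: sum(v["vcpus"] for v in h["vms"]) / h["cpus"]):
                if target is host:
                    continue
                # accept only if the target stays at or below cpu_ratio
                load = sum(v["vcpus"] for v in target["vms"]) + vm["vcpus"]
                if load / target["cpus"] <= cpu_ratio:
                    host["vms"].remove(vm)
                    target["vms"].append(vm)
                    vm["wait"] = wait_rounds
                    migrations.append((vm["name"], host["name"], target["name"]))
                    break
    return migrations
```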
6.2.2. CPU-Cap Aware
This heuristic aims to balance the overloaded hosts by migrating the virtual machines that are responsible for the increased CPU consumption on those hosts. A host is considered overloaded when its total consumption or its total used capacity, divided by its number of CPUs, is greater than the provided cpu_ratio. More information is available in the documentation14. The total consumption and total used capacity of each host are given by the equations below. These equations consider, for each VM i, the number of allocated vCPUs (vcpus), the CPU CAP for the set of vCPUs in the VM and the average utilization of these vCPUs (%CPU).
1. Total Consumption: the sum of the consumption of each virtual machine on the host.
2. Total Used Capacity: the sum of the used capacity of each virtual machine on the host.
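The equations themselves were rendered as images in the original document, so the sketch below gives one plausible reading of the two metrics, interpreting CAP and %CPU as percentages; the exact formulas are in the component documentation.

```python
# Hedged reconstruction of the host-load metrics. Each VM is a dict with
# "vcpus" (allocated vCPUs), "cap" (CPU CAP, percent) and "pcpu" (average
# vCPU utilization, percent). Treat the formulas as an interpretation of
# the variables the text lists, not the project's exact equations.

def total_used_capacity(vms):
    """Sum of each VM's capped capacity: vcpus * (CAP / 100)."""
    return sum(vm["vcpus"] * vm["cap"] / 100.0 for vm in vms)

def total_consumption(vms):
    """Sum of each VM's actual consumption: vcpus * (CAP/100) * (%CPU/100)."""
    return sum(vm["vcpus"] * (vm["cap"] / 100.0) * (vm["pcpu"] / 100.0)
               for vm in vms)

def is_overloaded(vms, host_cpus, cpu_ratio):
    """Overloaded when either metric, divided by the host CPUs, exceeds cpu_ratio."""
    return (total_consumption(vms) / host_cpus > cpu_ratio or
            total_used_capacity(vms) / host_cpus > cpu_ratio)
```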
13 https://github.com/bigsea-ufcg/bigsea-loadbalancer/blob/master/loadbalancer/service/heuristic/doc/balance.md
14 https://github.com/bigsea-ufcg/bigsea-loadbalancer/blob/master/loadbalancer/service/heuristic/doc/cpu_capacity.md
1. Select all hosts that are overloaded.
2. If no host is overloaded, or if all are equally overloaded, nothing can be done.
3. For each overloaded host:
3.1 While selected_host is overloaded and still has virtual machines that can be migrated:
3.1.1 Select the virtual machine with the highest utilization that was not migrated recently.
3.1.2 Look for a host that is less loaded and can receive this virtual machine without exceeding the cpu_ratio.
4. Migrate all the selected virtual machines.
5. Update the number of wait rounds for all migrated virtual machines.
Figure 14 - CPU Cap Aware algorithm
6.2.3. Sysbench Performance+CPU Cap
This heuristic aims to balance the overloaded hosts considering not only allocation and usage, but also the actual performance of the physical machines. Such a heuristic is especially useful in heterogeneous infrastructures. The decision of which instances to migrate takes into consideration whether the available hosts have better or worse performance than the overloaded ones. When the performance of the available hosts is better, the heuristic rebalances by selecting for migration the virtual machines with the highest CPU consumption on their hosts. When the performance of the available hosts is worse than that of the overloaded one, the selected virtual machines are the ones with the lowest CPU consumption.
A host is considered overloaded when its total consumption or its total used capacity, divided by its number of CPUs, is greater than the provided cpu_ratio. More information is available in the documentation15. To measure the performance of the hosts, we created a daemon16 that runs on each host of the infrastructure and publishes its results into Monasca, so the heuristic can use them to make its decisions. The daemon executes the Sysbench17 benchmark; information about how to use the tool is provided in Appendix F.
15 https://github.com/bigsea-ufcg/bigsea-loadbalancer/blob/master/loadbalancer/service/heuristic/doc/performance.md
16 https://github.com/bigsea-ufcg/monitoring-hosts
17 http://manpages.ubuntu.com/manpages/xenial/man1/sysbench.1.html
The total consumption and total used capacity for each host are given by the equations below. These
equations consider, for each VM i, the number of allocated vCPUs (vcpus), the CPU CAP for the set of
vCPUs in the VM and the average utilization of these vCPUs (%CPU).
1. Total Consumption: the sum of the consumption of each virtual machine on the host.
2. Total Used Capacity: the sum of the used capacity of each virtual machine on the host.
1. Select all hosts that are overloaded.
2. If no host is overloaded, or if all overloaded hosts are equally loaded, nothing can be done.
3. For each overloaded host:
3.1. While selected_host is overloaded:
3.1.1. While the list of hosts with better performance is not empty:
3.1.1.1. Search for the instance with the highest utilization that was not migrated recently.
3.1.1.2. If selected_host is not overloaded: break.
3.1.2. If selected_host is not overloaded: break.
3.1.3. While the list of hosts with worse performance is not empty:
3.1.3.1. Search for the instance with the lowest utilization that was not migrated recently.
3.1.3.2. If selected_host is not overloaded: break.
3.1.4. If selected_host is still overloaded: break, and issue a warning that nothing can be done.
4. Migrate all the selected instances.
5. Update the number of wait rounds for all migrated instances.
Figure 15 - Sysbench Performance CPU Cap algorithm
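The performance-aware selection rule can be sketched as follows. This is an illustrative simplification: host "performance" stands for the Sysbench score published to Monasca by the monitoring daemon, and the data structures are assumptions.

```python
# Hypothetical sketch of the Sysbench-aware selection rule: move CPU-hungry
# VMs to faster hosts, and lightly loaded VMs to slower hosts.

def pick_vm(source, targets, migrated):
    """Choose (vm_name, target_host_name) from `source`, or None.

    `source` and `targets` are dicts with "perf" (Sysbench score), "name"
    and, for the source, "vms" (each with "name" and "util", percent CPU).
    `migrated` is the set of VM names migrated recently.
    """
    movable = [vm for vm in source["vms"] if vm["name"] not in migrated]
    if not movable:
        return None
    better = [h for h in targets if h["perf"] > source["perf"]]
    worse = [h for h in targets if h["perf"] <= source["perf"]]
    if better:
        # faster hosts available: move the most CPU-hungry VM
        vm = max(movable, key=lambda v: v["util"])
        return vm["name"], better[0]["name"]
    if worse:
        # only slower hosts: move the least CPU-hungry VM
        vm = min(movable, key=lambda v: v["util"])
        return vm["name"], worse[0]["name"]
    return None
```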
6.3. Integration with Opportunistic Services
The integration between the Optimizer Services and the Load Balancer is implemented through plugins, which enables developers to write new plugins to integrate with other opportunistic services. In this scenario, if possible, extra VM resources are allocated to the application at the beginning of its execution in a "best effort" fashion. If, during the application execution, the extra resources are required by other applications, the Load Balancing service acts on the infrastructure to release resources, for example by destroying preemptable VMs provided by the Opportunistic service (HSOptimizer). For that, the opportunistic service provides a list of opportunistic instances that can be deleted without affecting the application (e.g., the master node of a Spark cluster cannot be killed). The Load Balancer contacts the HSOptimizer service when, at the end of its decision, there are still overloaded hosts; it then destroys a small number of preemptable VMs and makes a new decision.
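This interaction can be sketched as below. The `client` object stands in for the opportunistic service's REST API; the method names used here are assumptions for illustration, not the real interface.

```python
# Hedged sketch of the Load Balancer / HSOptimizer interaction described
# above. `client` wraps the opportunistic service; its method names are
# hypothetical.

def release_opportunistic(client, overloaded_hosts, how_many=1):
    """Destroy up to `how_many` preemptable VMs reported as safe to delete.

    The opportunistic service only lists instances whose loss does not
    affect the application (e.g. it never lists a Spark master node).
    """
    deletable = client.list_preemptible(overloaded_hosts)
    victims = deletable[:how_many]
    for instance_id in victims:
        client.delete_instance(instance_id)
    return victims
```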
7. Pro-active policies validation
In this section we detail the experiments performed to validate the new approaches for the pro-active policies described in Section 5.
7.1. Vertical Scaling Pro-active Policies
The following subsections summarize the environment and the experiments performed for the
validation of the Controller component. First, we introduce the infrastructure where the experiments
were performed, followed by the description of the experiment, and the analysis of the results.
7.1.1. Infrastructure
The experiments were performed in the production cloud of the Distributed Systems Laboratory (LSD) at the Federal University of Campina Grande (UFCG), which uses the Newton version of OpenStack. Within this environment, we selected a set of three dedicated servers to which we had full access: two Dell PowerEdge R410 and one Dell PowerEdge R420. Together these hosts had a total of 52 CPUs and 82.36 GB of RAM. The servers were connected through a Gigabit network and used shared storage provided via Ceph18. The configuration of each server can be seen in Table 3.
Table 3 - Configuration of each server
Servers      | Model               | Processor                  | CPUs | Memory (GB)
c4-compute11 | Dell PowerEdge R410 | Intel Xeon X5675 3.07GHz   | 24   | 31.4
c4-compute12 | Dell PowerEdge R410 | Intel Xeon X5675 3.07GHz   | 24   | 31.4
c4-compute22 | Dell PowerEdge R420 | Intel Xeon E5-2407 2.20GHz | 4    | 19.56
18 http://ceph.com/
The experiment made use of a Spark cluster, whose configuration is described in Table 4.
Table 4 - Spark Cluster Configuration
Cluster configuration
       | Amount | vCPUs | Memory (GB) | Disk Size (GB)
Master | 1      | 2     | 4           | 80
Worker | 3      | 2     | 4           | 80
7.1.2. Experiment design
The experiment’s objective is to compare the performance of the Controller component plugins during an execution of EMaaS. Since the Controller’s main objective is to adjust the amount of resources available to the application so that the deadline is not violated, we chose “Application execution time” and “CPU usage” as the metrics to measure the Controller’s performance. We have one scenario, which consists in the execution of the following workflow with nine controller configurations, each repeated ten times:
1. Select the Controller configuration to use;
2. Use the Broker API to start the EMaaS execution and the controller execution;
3. Wait until the EMaaS execution ends;
4. Collect CPU usage;
5. Collect the application execution time.
Figure 16 - Driver Comparison Algorithm for the experiment
Controller configurations
We used the progress error, proportional and derivative-proportional plugins. For each plugin, we
planned three different configurations. The used configurations are described in Table 5.
Table 5 - Controller configurations used in the comparison experiment
Plugin                  | Aggressive               | Regular                  | Conservative
Progress Error          | Trigger down = 0,        | Trigger down = 10,       | Trigger down = 20,
                        | Trigger up = 0,          | Trigger up = -10,        | Trigger up = -20,
                        | Actuation size = 100     | Actuation size = 100     | Actuation size = 100
Proportional            | Aggressive factor = 2    | Aggressive factor = 1    | Aggressive factor = 0.5
Derivative-proportional | Proportional factor = 2, | Proportional factor = 1, | Proportional factor = 0.5,
                        | Derivative factor = 2    | Derivative factor = 1    | Derivative factor = 0.5
This configuration of the progress-error controller forces it to use the full speed of a vCPU when performing a vertical scaling actuation. We used the KVM CPU cap plugin as the actuator plugin and 50% as the starting cap.
Application deadline
In order to obtain a reasonable deadline value, we executed EMaaS with no scaling and a CPU cap of 50%, and collected the execution times of eight such executions. The chosen deadline (380 seconds) is the mean of the collected values.
Metrics collection method
CPU usage was collected by accessing the servers via SSH and using virt-top19 to obtain CPU usage information from the hypervisor. The application execution time was obtained from the Broker, which queries the Spark API to get the value.
7.1.3. Analysis
Figure 17 depicts the values of the application execution time and CPU usage for all the controller configurations used. The “CPU Usage” metric was calculated by summing the collected CPU usage values of the workers used in the execution.
19 https://linux.die.net/man/1/virt-top
Figure 17 - CPU Usage and application time for each Controller configuration
By grouping the results by Controller plugin (progress-error, proportional and proportional-derivative), it is clear that the conservative configurations have the worst results on both evaluation metrics. This can be explained by the fact that EMaaS normally requires a resource boost at the end of the execution; the conservative configurations are likely to fail to add resources quickly enough.
The results also show that, for EMaaS, the aggressive configurations perform better. However, regarding execution time, the difference between the aggressive and the regular configurations is not as large as the difference between either of them and the conservative configurations. Moving from a regular configuration to a purely aggressive one does not yield a performance improvement as large as the one resulting from moving from a purely conservative configuration to a regular one. This suggests that there is an optimum aggressiveness level.
The Progress error plugin has the best results regarding “Application time” and “CPU Usage”. This result can likewise be explained by EMaaS's resource requirements. Since Actuation size = 100 in all configurations, this plugin can give the largest possible resource boost to EMaaS and remove almost all the resources when necessary, which results in the best execution times and resource usage. However, this plugin’s behaviour seems too extreme and may generate excessive disturbance in the environment.
The disadvantage of the progress-error plugin is the oscillation in resource usage, which can be seen in Figures 18 and 19. Note that the spikes in the CPU consumed under the Progress-Error configuration are higher; this could bring instability to the performance of the cloud and also make the load balancer inefficient. In Figure 19, we can see that the CPU CAP oscillates more abruptly in the other two cases than with the Proportional-Derivative controller.
Figure 18 - Examples of compute node CPU Usage for three controller configurations
Figure 19 - Examples of CPU cap over time for three controller configurations
7.2. Vertical Scaling Actuators
Another improvement in the pro-active policies module regards the actuation for vertical scaling. The current version supports multiple resource dimensions, namely controlling the CPU CAP and I/O limits (operations per second and bandwidth). The following subsections illustrate the impact of using these actuators.
7.2.1. Infrastructure and experiment design
The infrastructure is the same as described in Section 7.1.1. The experiment’s objective is to evaluate the effectiveness of the I/O actuator in controlling the amount of allocated resources and the application execution time. We have two scenarios, each consisting in the execution of the following workflow, repeated ten times. We used an I/O-bound application in the executions.
1. Select the Actuator configuration to use;
2. Use the Broker API to start the application execution and the controller execution, using 50% as the cap value;
3. Wait until the application execution ends;
4. Collect the application execution time;
5. Use the Broker API to start the application execution and the controller execution, using 100% as the cap value;
6. Wait until the application execution ends;
7. Collect the application execution time.
Actuation configurations
We used two Actuator plugins: KVM CPU cap and KVM I/O.
Scenario | Actuator
1        | KVM CPU cap
2        | KVM I/O
7.2.2. Analysis
Figure 20 depicts the values of the application execution time for all scenarios.
Figure 20 - Values of execution time for both scenarios
The results are also summarized below:
Table 6 - Benchmark execution time for the different scenarios.
Plugin      | Cap (%) | Median (s)
KVM I/O     | 50      | 78.43286
KVM I/O     | 100     | 42.39330
KVM CPU cap | 50      | 86.37899
KVM CPU cap | 100     | 80.60844
Typically, virtual machines are configured with resource limitations to minimize interference with other machines. For example, in OpenStack this is done through definitions in the template for the machine configuration, the flavor20. Thus, in our case, all virtual machines start with an initial limit for both CPU and I/O. As seen in the figure, using the KVM CPU cap as the actuation plugin causes an impact on the execution: increasing the CPU cap from 50 to 100 yields a small performance boost (86 to 81 seconds). However, the performance boost resulting from the KVM I/O actuation, which in fact actuates on both CPU and I/O, is much larger (78 seconds to 42 seconds). Therefore, in production clouds, where the resources of the cloud compute nodes are designed to be correlated (i.e., the I/O bandwidth of a VM is related to its CPU capacity), the default plugin for the actuation should be KVM I/O.
20 https://wiki.openstack.org/wiki/InstanceResourceQuota
8. Frameworks experiments validation
This section summarizes the vertical elasticity for Mesos frameworks implemented in the frame of EUBra-BIGSEA. First, we introduce the supported use cases, then we present the architecture, and we end with the results of the experiments performed.
8.1. Use Cases
Details of the system architecture are provided in deliverable D3.4. Here we present a comprehensive and updated summary.
The architecture addresses two scenarios:
1. The first scenario deals with a job that involves several iterations, each with a known execution cost. A Quality of Service is defined and all the iterations must be performed (sequentially) within a given deadline. The objective is to adapt the CPU share for each iteration before it starts, in order to meet the deadline. The implementation uses Chronos. The user submits a Chronos application that iterates multiple times through independent executions, which cannot take place concurrently due to flow dependencies, external data dependencies, or the like. The user wants to guarantee that the individual executions are completed within a given deadline. Each iteration is started with a specific resource allocation, and the system learns from each past case to readjust the allocation of resources for the new case. Moreover, each execution may take longer or shorter than the previous one, and the system varies the resource allocation accordingly.
2. The second scenario deals with a job whose progress cannot be evaluated but which is expected to consume a known amount of CPU time. The objective is to guarantee that this share of CPU time is allocated to the application in a given timeframe. This use case is implemented through Marathon. The user submits a long-living Marathon job with the guarantee that a minimum share of CPU time is allocated to it. The system can dynamically adjust the amount of resources to guarantee that the CPU time ratio is allocated. This scenario assumes that other tasks have been scheduled on the same resources with different requirements and allocation periods, so temporarily slowing down a task may lead to a lower overall QoS violation ratio.
8.2. System Architecture
The system architecture is composed of three main components: Launcher, Supervisor, and Executor.
1. The Launcher receives the job specification in JavaScript Object Notation (JSON) format. It then generates and assigns to the application a universally unique identifier (UUID). According to the scenario’s specification, the Launcher submits the job with a modified job specification. Finally, the Launcher sends to the Supervisor, through a REST request, the information relevant for monitoring and scaling the application.
2. The Supervisor is a REST server which receives the information sent by the Launcher and by the containers. According to the scenario’s specification, it monitors the application in order to, if necessary, scale the application to meet the agreed quality of service. Furthermore, the Supervisor sends metrics to Monasca for visualization.
3. All the services are executed using Docker containers. In the second scenario it is necessary to manage a Docker container “manually”, since the system must perform checkpointing. For this reason, a new component is required: the Executor. This component manages the Docker container and its checkpoints, supports monitoring by the Monasca Agent (which is installed on all nodes of the infrastructure), and delivers information to the Supervisor.
Figure 21 shows how the first scenario is implemented in this architecture. The Launcher and the Supervisor transform the user’s job, updating the information about the remaining (non-scheduled) iterations, the deadline and other information necessary for following the job (job name, job UUID, framework id). Then, the Supervisor publishes and consumes progress information in the Monasca Server, using it to trigger the reallocation of resources for the remaining iterations.
Figure 21 - Execution of a Chronos job.
Figure 22 shows how the second scenario is implemented in this architecture. The main differences are: 1) there are no iterations, and progress is defined according to the consumption of CPU within a given deadline; 2) the application is restarted when the allocation of resources changes, so automatic checkpointing is needed; 3) the application should be instrumented to automatically restart from a checkpoint image if one exists.
Figure 22 - Execution of a Marathon job.
8.2.1. Job submission and execution
The user submits the job specification in JSON format to the component named Launcher, which generates and assigns a UUID to the job. The Launcher then uses the job specification to create a new one with the modifications necessary for execution and monitoring. These modifications depend on the scenarios previously described.
In the case of the second scenario, the Docker container must be explicitly handled: the Launcher sets up the execution, while the container itself is run by the Executor. To achieve this, the Launcher creates a new job specification in JSON format that runs the Executor (with the job specification parameters) on a Mesos native container, i.e., the job runs in a Docker container managed by the Executor, which in turn runs in a Mesos native container managed by the Marathon executor. No monitoring modifications are needed, because the Executor and the Monasca Agent handle monitoring.
The Launcher sends, by means of the initTask REST API, a request with the application information to the Supervisor. The information comprises the job name, the UUID, the framework name, the duration of one iteration, the application deadline and the maximum over-progress percentage (the fraction of the time interval, from the start of the application until the completion time, that the user considers acceptable). In the case of the first scenario, the number of desired iterations is sent too.
When the new job specification is created, the Launcher sends it to the appropriate framework, which will execute it.
8.2.2. Monitoring the application
Once the application is launched, it must be monitored to collect its performance, in order to adjust its assigned resources.
Monitoring in the first scenario is performed by the Supervisor, which receives a REST request (embedded in the tasks by the Launcher) from the containers. The request body is a JSON object containing the start and finish dates of the execution, plus the UUID and name of the application. When a request arrives, the Supervisor processes it, sends the statistics to Monasca and starts the decision making in order to meet the agreed QoS.
Monitoring in the second scenario is more complex because, in this situation, monitoring is required during the execution. The system uses the Monasca Agent to collect the performance metrics of each container, but a modification of the Docker plugin of the Monasca Agent is needed to allow the monitoring of containers restored from checkpoints. The metrics collected by the Monasca Agent are sent to Monasca and are consulted periodically by the Supervisor for decision making in order to meet the agreed QoS. Furthermore, the Executor sends the same information about the application as in the per-iteration requests of the first scenario. It should be noted that this information is only used to produce statistics, which are sent to Monasca.
8.2.3. Decision making
Once the data collection is completed, the next step consists in determining the performance state of
the application, i.e., the task progress.
There are three states for the application: under-progress, on-time, and over-progress.
In the case of the first scenario, the decision is taken after each iteration of the application is completed. The task progress state is computed using a prediction of the termination time percentage, given the iterations achieved, the application deadline and the given maximum over-progress percentage.
The prediction of the termination time percentage is calculated using the expression p_prediction = (t_prediction - t_start) / t_deadline, where t_prediction is the estimated job end time in timestamp format (computed as shown next), t_start is the start time of the job in timestamp format and t_deadline is the interval for executing the application with the agreed QoS.
The predicted termination time in timestamp format is calculated with the equation t_prediction = t_current + rem_iter · (d_job + d_deployment), together with the available interval of time from the start of the application until the deadline. In this expression, t_current is the current time in timestamp format, rem_iter is the number of remaining iterations, d_job is the given duration of the job and d_deployment is the mean duration of a Chronos deployment in the infrastructure.
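The two expressions above can be written as a short sketch (variable names follow the text; this is illustrative only, not the Supervisor's actual code).

```python
# Minimal sketch of the scenario-1 prediction, using the variables defined
# in the text. All times are in the same unit (e.g. Unix timestamps).

def predict_termination(t_current, rem_iter, d_job, d_deployment):
    """Estimated job end time: t_current + rem_iter * (d_job + d_deployment)."""
    return t_current + rem_iter * (d_job + d_deployment)

def progress_percentage(t_prediction, t_start, t_deadline):
    """Predicted fraction of the deadline interval that will be used."""
    return (t_prediction - t_start) / t_deadline
```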
A progress threshold is defined. If the progress is within the interval between 100% - threshold and 100% + threshold of the estimated progress, the job is in the on-time state. If the progress is below that interval, the state is under-progress. Finally, if it is above the interval, the job enters the over-progress state.
In the case of the second scenario, the decision making is done periodically. The task progress state at a given time t is computed using the task progress ratio (as described below) and a given over-progress percentage threshold. The state (under-progress, on-time, over-progress) of the task is defined in the same way as in the first scenario.
The task progress ratio of the execution at a given time t is computed using the equation progress_ratio(t) = cputime_current(t) / cputime_desired(t). For this, the Supervisor uses the consumed CPU time, cputime_current(t), and the committed CPU time at the given time, cputime_desired(t), both expressed in seconds.
The CPU time consumed at a given time, cputime_current(t), is the result of the transformation of two queries to Monasca, named container.cpu.user_time and container.cpu.system_time. These values correspond to the total clock ticks consumed by the container in the node where it is running. Before these values can be used, they must be transformed into seconds by dividing them by the clock ticks per second of the node's operating system.
The desired CPU time for a given execution time, cputime_desired(t), is calculated using the expression cputime_desired(t) = d_job · (t_current - t_start) / t_deadline, where t_current is the current time in timestamp format, t_start is the start time of the execution in timestamp format, t_deadline is the interval for executing the application with the agreed QoS and d_job is the estimated duration of the job.
For this, Supervisor uses the current time and the start time, duration and deadline of the application.
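The scenario-2 progress computation can be sketched as follows, using the variables defined above (the desired CPU time follows the linear behaviour the text describes); treat it as illustrative, not the Supervisor's actual code.

```python
# Minimal sketch of the scenario-2 progress ratio. Monasca reports CPU time
# in clock ticks, so the values are first converted to seconds.

def cputime_current(user_ticks, system_ticks, ticks_per_second):
    """Consumed CPU time in seconds, from the two Monasca container metrics."""
    return (user_ticks + system_ticks) / ticks_per_second

def cputime_desired(t_current, t_start, t_deadline, d_job):
    """CPU seconds the job should have consumed by t_current (linear QoS)."""
    return d_job * (t_current - t_start) / t_deadline

def progress_ratio(current_seconds, desired_seconds):
    """Ratio > 1 means over-progress; < 1 means under-progress."""
    return current_seconds / desired_seconds
```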
8.2.4. Elasticity
If the application state is over-progress or under-progress, a new allocation of resources is computed. In the case of the first scenario, the Supervisor sends the new resource allocation to the Chronos scheduler before the next application execution, so the executions not yet planned run with the new resource allocation. Elasticity in the second scenario is more complex, as resources change while the application is running. When the Supervisor sends the new resource allocation to the Marathon scheduler, the Marathon executor sends a termination signal to the Mesos container before removing it. This signal is captured by the Executor, which creates a checkpoint of the running container and stores it in a location accessible by all nodes of the infrastructure. As soon as the new container with the new resource allocation is running, the Executor restores the previously created checkpoint to resume the execution at the point where the signal was captured.
8.3. Results of the experiments
8.3.1. Chronos Application
In this scenario, we analyse the behaviour of the system considering that the execution time grows linearly (all the iterations require the same computing time). However, the real execution time varies depending on the actual allocation of physical resources. Even if the Docker containers are limited according to the Mesos framework allocation, they will consume the maximum of the free resources. The CPU limitation of a Mesos framework only applies if a job runs on a fully loaded node; if CPU is free, the job will take all the cycles it is able to consume. This is a desired effect, as resources are not wasted, and it gives us the capability of speeding jobs up and down.
The results are shown in Figure 23. This figure shows the difference between the time predicted using the equations of the previous section and the actual termination time of the iterations. This value can be used to identify three regions that relate to the three possible states of a job:
1. If the value is in the red area, the performance is below the QoS, so the resources allocated to the application must be increased.
2. If the value is in the yellow area, the performance is higher than the committed QoS, so the resources allocated to the application can be reduced.
3. If the value is in the green region, the performance is slightly higher than expected, so no variation in the resource allocation is performed. The white line shows the estimated evolution.
Figure 24 shows the allocated CPU and the duration of each iteration for one of the jobs. The tests were performed on an infrastructure of four nodes with 2 CPUs each. The experiment aimed at executing six jobs with 10 iterations each. Although the total expected CPU time for a single job was initially shorter than the deadline, there were not enough resources to accommodate the total execution time, forcing the system to reallocate resources. It can be clearly seen that four of the six submitted jobs ended earlier than the deadline, one was very close, and the one missing the deadline ended with a delay of 10%. It is important to note that each iteration's execution time was different.
Figure 23 - Difference between the estimated time and the actual progress per iteration.
Figure 24 - Dynamic assignment of CPU and cost per iteration. Higher CPU assignments lead to shorter execution time, with
some background noise due to other load of the node.
8.3.2. Marathon Application
This study shows the behaviour of the framework elasticity for a Marathon long-running job. In this case, a single job was executed in a node, starting with an allocation of 1 CPU and letting the elasticity system alter the allocation of resources. The system was loaded with other tasks to force the resource limitation of the framework to act.
As can be seen, the CPU allocation is modified twice. First (Figure 25, timestamp 2:33:30), the system detects that the application is progressing much quicker than expected, so it reduces the allocation of resources. Then, the progress slows down, entering the under-progress state close to timestamp 2:36. CPU is increased, and the progress speeds up. The next check detects that the application is in the safe zone (close to 100% of the expected performance), so no changes are applied. Finally, it finishes earlier than expected.
Figure 25 - Progress with respect to the estimated deadline (above) and CPU allocation (below) along the time.
Figure 26 - The same behaviour as Figure 25, with respect to the expected timeline.
Figure 27 - Evolution of the application CPU consumption and the linear behaviour used to define the QoS.
9. Load balancing validation
This section summarizes the experiments performed to validate the Load Balancer tool developed
within EUBra-BIGSEA. First, we introduce the infrastructure where the experiments were performed,
followed by the description of the experiment and the analysis of the results obtained.
9.1. Infrastructure
The experiments were performed in the Production Cloud of the Distributed Systems Laboratory (LSD)
at the Federal University of Campina Grande (UFCG), which runs the Newton release of OpenStack. The
environment consisted of two dedicated servers to which we had full access: one Dell PowerEdge R410
and one Dell PowerEdge R420, providing a total of 28 CPUs and 50.96 GB of RAM. The servers are
connected through a Gigabit network and use shared storage provided via Ceph21. The configuration of
each server can be seen in Table 7 below.
Table 7 - Configuration of each server
Server | Model | Processor | CPUs | Memory (GB)
c4-compute12 | Dell PowerEdge R410 | Intel Xeon X5675 3.07GHz | 24 | 31.4
c4-compute22 | Dell PowerEdge R420 | Intel Xeon E5-2407 2.20GHz | 4 | 19.56
9.2. Experiment Design
The objective of the experiment is to investigate whether the use of the heuristics can improve the
performance of overloaded servers. We consider three scenarios; each scenario consists of the
execution of the three heuristics with five repetitions. The following workflow applies to each execution.
1. Select the initial cap for virtual machines [25%, 50%, 75%]
2. First execution of the Load Balancer
3. Interfere with the cloud to break the balance:
   1. Create a new machine with 2 vcpus in c4-compute12 and set the CAP to the initial value.
   2. Remove two virtual machines with 1 vcpu in c4-compute22.
4. Second execution of the Load Balancer
5. Change the cap for all virtual machines to 100%
6. Third execution of the Load Balancer

21 http://ceph.com/
Figure 28: Workflow for heuristic executions
Table 8 describes the initial configuration of each server before the beginning of each execution: the
number of virtual machines, the number of allocated CPUs and the total number of CPUs. Table 9
describes the experimental design, with the variables and the corresponding levels.
Table 8 - Initial configuration for each server before each execution for the scenarios.
Server | Number of VMs | Number of allocated CPUs | Total number of CPUs
c4-compute12 | 3 | 6 | 24
c4-compute22 | 7 | 8 | 4
Table 9 - Variables and corresponding levels used.
Variable | Levels
Heuristics | BalanceInstancesOS, CPUCapAware, SysbenchPerfCPUCap
Initial CAP | 25%, 50%, 75%
cpu_ratio | 0.5
wait_rounds | 1
9.3. Analysis
Below we analyze the results of one scenario, with the initial CAP set to 75%; the results of the
experiments with initial CAP set to 25% and 50% can be found in Appendix G. In this scenario all virtual
machines start with an initial CAP of 75%. Table 10 depicts the times at which the Load Balancer
executed migrations for each heuristic in all five executions.
Table 10 - Time when migrations started.
Heuristic | Execution 1 | Execution 2 | Execution 3 | Execution 4 | Execution 5
BalanceInstancesOS | 12:16, 12:21 | 12:38, 12:44 | 13:04, 13:10 | 13:42, 13:47 | 14:03, 14:10
CPUCapAware | 11:57, 12:58 | 12:19, 12:30 | 12:42, 12:52 | 13:01, 13:13 | 13:26, 13:36
SysbenchPerfCPUCap | 12:16, 12:28 | 12:42, 12:53 | 13:14, 13:26 | 13:54, 14:06 | 14:16, 14:28
In the following, we look at the five experiments for each heuristic. We monitor the following metrics:
total_consumption, which considers the allocated vCPUs, the CAP of these vCPUs and the actual
percentage of usage; total_cap, which considers only the allocation (i.e., the vCPUs allocated and their
CAP, not the actual usage); the CPU usage of the physical machine hosting the virtual machines; and the
Sysbench execution time.
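As a rough sketch, the two allocation metrics can be computed as follows. The dictionary fields and the convention of expressing CAP and usage as fractions are assumptions for illustration only:

```python
def total_cap(vms):
    # Allocation-only metric: vCPUs weighted by their CAP.
    return sum(vm["vcpus"] * vm["cap"] for vm in vms)

def total_consumption(vms):
    # Allocation further weighted by the observed CPU usage.
    return sum(vm["vcpus"] * vm["cap"] * vm["usage"] for vm in vms)

vms = [{"vcpus": 2, "cap": 0.75, "usage": 0.5},
       {"vcpus": 1, "cap": 1.0, "usage": 0.9}]
# total_cap(vms) == 2.5; total_consumption(vms) is approximately 1.65
```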
For the BalanceInstancesOS heuristic the migrations occurred in the first and second executions of the
Load Balancer; for the other heuristics the migrations occurred in the first and third executions. The
BalanceInstancesOS heuristic actuates in the second round because, even though the new machine has
not yet used many resources, this heuristic monitors the allocated resources. Conversely, because the
two other heuristics consider used resources, they actuate only in the third round, when the CAP is raised.
Analyzing the total_consumption (Figures 29 to 31) and total_cap (Figures 32 to 34) graphs we can see
that after each execution the Load Balancer enforced the cpu_ratio value previously configured.
Figure 29 - Heuristic BalanceInstancesOS - Total Consumption per host.
In Figure 29 we present the graph of total_consumption over time for the BalanceInstancesOS heuristic.
In each execution we can see that, in the beginning, the total_consumption is above the cpu_ratio limit
(dashed line). This heuristic uses the ratio of allocated vCPUs over the total number of CPUs to
determine whether a host is overloaded. This value is always higher than the total_consumption, as it
assumes that allocated CPUs are 100% used; therefore, if total_consumption is above the limit, the
allocated resources will also be. When the machine is overloaded, the Load Balancer takes action and
triggers the migration of virtual machines. This already takes place in the first execution round (see the
times in Table 10), from c4-compute22 to c4-compute12. At the end, the total_consumption stays equal
to or lower than the cpu_ratio.
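The allocation-based overload test described above can be sketched as below. The function name is illustrative, but the cpu_ratio of 0.5 matches Table 9 and the numbers reproduce the initial state of Table 8:

```python
def is_overloaded_by_allocation(allocated_vcpus, host_cpus, cpu_ratio=0.5):
    # Pessimistic check: treats every allocated vCPU as 100% busy.
    return allocated_vcpus / host_cpus > cpu_ratio

# Initial state from Table 8:
assert is_overloaded_by_allocation(8, 4)        # c4-compute22: 8/4 = 2.0 > 0.5
assert not is_overloaded_by_allocation(6, 24)   # c4-compute12: 6/24 = 0.25
```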
Figure 30 - Heuristic CPUCapAware - Total Consumption per host.
Figure 30 depicts total_consumption over time for the CPUCapAware heuristic. In each execution we
can see that, in the beginning, the total_consumption is above the cpu_ratio limit (dashed line). The
Load Balancer takes action and triggers the migration of virtual machines in the first execution (see the
times in Table 10), from c4-compute22 to c4-compute12; after this, the total_consumption stays equal
to or lower than the cpu_ratio.
Figure 31 - Heuristic SysbenchCPUCap - Total Consumption per host.
Similarly, Figure 31 depicts total_consumption over time for the SysbenchPerfCPUCap heuristic. In each
execution we can see that, in the beginning, the total_consumption is above the cpu_ratio limit (dashed
line), except for execution #2, in which the virtual machines' utilization was not high enough.
Nevertheless, even in execution #2 the total CAP exceeds the target values (see Figure 34). The
Load Balancer takes action and triggers the migration of virtual machines in the first execution (see the
times in Table 10), from c4-compute22 to c4-compute12; after this, the total_consumption stays equal
to or lower than the cpu_ratio. For this figure, it is also important to notice that the low utilization of
host c4-compute22 is due to the fact that the migrated instance is the one consuming the lowest
amount of resources, as the target host has lower performance.
Figure 32 - Heuristic BalanceInstancesOS - Total Cap per host.
Figure 33 - Heuristic CPUCapAware - Total Cap per host.
Figure 34 - Heuristic SysbenchCPUCap - Total Cap per host.
Figures 32 through 34 present the graphs of total_cap over time for all heuristics. The graph for the
BalanceInstancesOS heuristic (Figure 32) does not increase after the virtual machines' CAP is changed to
100%, because the migration occurred during the second execution, when another virtual machine was
created in c4-compute12 and the Load Balancer migrated it to c4-compute22. For the CPUCapAware
(Figure 33) and SysbenchPerfCPUCap (Figure 34) heuristics we see that, approximately in the middle of
each execution, the total_cap for host c4-compute22 increases above the cpu_ratio limit. This is caused
by the creation of a virtual machine in this host and by the change of the CAP of all virtual machines in
all hosts to 100%. The Load Balancer triggers the migration in the third execution to decrease the
total_cap, by migrating virtual machines from c4-compute12 to c4-compute22.
The CPU utilization of the hosts was also collected; it is depicted in Figures 35 through 37.
The common pattern visible in the graphs is a decrease in usage in overloaded hosts when the
Load Balancer starts executing, and an increase in utilization in the host receiving the migrated virtual
machines. In addition, the utilization peak right after the first execution is caused by the removal and
creation of virtual machines in the hosts. Also, note that there are CPU peaks due to the execution of
Sysbench to collect the performance metric. For c4-compute22 this impact is much more noticeable
because it has only 4 cores, in contrast to the 24 cores of c4-compute12.
Figure 35 - Heuristic BalanceInstancesOS - Host %CPU during executions.
Figure 36 - Heuristic CPUCapAware - Host %CPU during executions.
Figure 37 - Heuristic SysbenchPerfCPUCap - Host %CPU during executions.
10. Performance Prediction service validation
In this section we present the results of a set of experiments we performed to explore and validate the
dagSim simulator and the Lundstrom predictor on the TPC-DS industry-standard benchmark22 for data
warehouse systems (which considers an SQL-like workload), as well as on some reference Machine
Learning (ML) benchmarks, namely Support Vector Machine (SVM), Logistic Regression and K-Means.
These experiments were performed on the Microsoft Azure Cloud. Overall, our tests include sequential
workloads (obtained from the SQL queries' execution plans) and iterative workloads, which characterize
ML algorithms and are becoming more and more popular in the Spark community23. Moreover, our
performance prediction tools were validated considering the K-Means benchmark over COMPSs running
on the new Mare Nostrum 4 at the Barcelona Supercomputing Center24.
10.1. Experimental Settings
Concerning the TPC-DS benchmark experimental campaign, the Microsoft Azure HDInsight PaaS
offering based on the Spark 1.6.2 release and Ubuntu 14.04 was considered. Multiple experiments were
conducted on two different deployments, testing two different types of VMs, namely A3 and D12v2.
A3 VMs include 4 cores, 7 GB of RAM and a 250 GB local disk. Regarding the second deployment, D12v2
nodes include 4 cores, 28 GB of RAM and a 200 GB local SSD. In the A3 case, the workers' configuration
ranged from 6 up to 48 cores, while in the D12v2 case the number of cores was varied between 12 and
52. Each Spark executor had 2 cores and 2 GB of RAM. Two dedicated master nodes over D12v2 were
used, and the Spark driver had 4 GB allocated.
We considered two TPC-DS queries, named Q26 and Q52, running on a 500 GB data set; their Directed
Acyclic Graphs (DAGs) are presented in Figure 38. Queries were run at least 10 times for every
configuration. As aforementioned, we also ran experiments with the SVM, Logistic Regression and
K-Means benchmarks. For each benchmark, three data set sizes were considered, namely 8 GB, 48 GB
and 96 GB. A Spark 2.1.0 deployment was used, configured with two dedicated master nodes over
D12v2 VMs. For this set of experiments, we considered two different configurations, one with three
worker nodes and the other with six worker nodes, both over D4v2 VMs. The D4v2 instances have 8
cores, 28 GB of RAM and a 400 GB local SSD. Each executor had 4 cores and 10 GB of memory, while the
driver had 8 GB. The number of executors varied between 6 and 12. For each given configuration, the
experiments were repeated 10 times.
Regarding the COMPSs experiments, we ran the K-Means algorithm over different infrastructure
configurations, from 1 to 4 nodes, each composed of 48 cores. The experiments comprise two different
scenarios: one in which the number of fragments equals the number of cores (thus, as the number of
cores increases so does the total number of fragments), and the other with the number of fragments
fixed and equal to the maximum number of cores (i.e., 192). Since the number of fragments is reflected
in the number of parallel tasks, the latter scenario is the one whose characteristics are most similar to
Spark's behaviour, where the number of tasks is usually constant across configurations (unless a
different setup is adopted by the system administrator).

22 http://www.tpc.org/tpcds
23 https://databricks.com/blog/2016/09/27/spark-survey-2016-released.html
24 https://www.bsc.es/marenostrum/marenostrum/technical-information
The accuracy of the performance prediction tools is assessed through the percentage error, evaluated
as the relative difference between the average execution time of the real experiments on the same
configuration and the execution time predicted by our tools, with respect to the average time of the
real experiments:

% Error = 100 × (avg(T_real) − T_predicted) / avg(T_real) (5)
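A minimal implementation of this error measure, using the first K-Means row of Table 11 as a worked example:

```python
def prediction_error(real_times_ms, predicted_ms):
    """Equation (5): relative difference between the mean measured
    execution time and the predicted one, as a percentage."""
    mean_real = sum(real_times_ms) / len(real_times_ms)
    return 100.0 * (mean_real - predicted_ms) / mean_real

# K-Means on 24 cores, 8 GB (Table 11): real 82,644.6 ms, dagSim 75,553 ms
err = prediction_error([82644.6], 75553)   # ~8.58%
```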
Figure 38 - Directed Acyclic Graph for queries Q26(a) and Q52(b)
10.2. dagSim Validation
The aim of this section is to evaluate the accuracy achieved by our dagSim simulator. The results
obtained from the experiments on Azure for TPC-DS and ML workloads are presented in the following
tables.
For the TPC-DS experiments executed on Azure A3 and D12v2 VM types, the prediction errors for Q26
and Q52 vary from 0.02% to 8.53% for the A3 deployment, and from 0.16% to 19.01% for D12v2. The
maximum error values for these deployments correspond to the 48-core configuration with the 500 GB
dataset (A3 Microsoft Azure) and the 28-core configuration with the 500 GB dataset (D12v2 Microsoft
Azure), respectively. Detailed results are reported in EUBra-BIGSEA Deliverable D3.4.
Regarding K-Means (see Table 11), the prediction error for dagSim ranges from -29.80% to 8.58%.
Moreover, we observe that the maximum absolute error value is obtained for the largest configuration,
which consists of the 96 GB data set and 48 cores.
Table 11 - K-Means experimental results on Azure and dagSim prediction
#Cores | Data set size (GB) | Real [ms] | dagSim prediction [ms] | % Error
24 | 8 | 82,644.6 | 75,553 | 8.58%
24 | 48 | 326,593 | 364,556 | -11.62%
24 | 96 | 846,885.8 | 788,371 | 6.91%
48 | 8 | 75,151.6 | 70,276 | 6.49%
48 | 48 | 179,737.6 | 219,195 | -21.95%
48 | 96 | 574,868.4 | 746,151 | -29.80%
In Table 12 the prediction errors for Logistic Regression are presented. According to the experimental
results, prediction errors range from 0.34% to 6.00%, and the maximum prediction error is reported for
the 8 GB dataset on 48 cores.
www.eubra-bigsea.eu| [email protected] |@bigsea_eubr 89
Table 12 - Logistic Regression experimental results on Azure and dagSim prediction
#Cores | Data set size (GB) | Real [ms] | dagSim [ms] | % Error
24 | 8 | 164,574.6 | 156,070 | 5.17%
24 | 48 | 669,401 | 671,693.49 | -0.34%
24 | 96 | 1,418,837.4 | 1,404,886 | 0.98%
48 | 8 | 166,517.8 | 156,526.7 | 6.00%
48 | 48 | 368,236.6 | 362,927 | 1.44%
48 | 96 | 1,200,678.2 | 1,193,930 | 0.56%
Concerning the SVM benchmark, the prediction error (Table 13) is also very small and similar to the one
obtained for Logistic Regression. In this case, it ranges from 0.08% to 6.16%, while the maximum
prediction error is reported for the 8 GB dataset on 48 cores.
Table 13 - SVM experimental results on Azure and dagSim prediction
#Cores | Data set size (GB) | Real [ms] | dagSim [ms] | % Error
24 | 8 | 176,201.83 | 167,686 | 4.83%
24 | 48 | 623,714.2 | 621,391 | 0.37%
24 | 96 | 1,353,471 | 1,323,928 | 2.18%
48 | 8 | 174,811 | 164,043 | 6.16%
48 | 48 | 357,922 | 352,166 | 1.61%
48 | 96 | 635,922 | 635,421 | 0.08%
Finally, Table 14 includes the experimental results obtained from the execution of the K-Means
application on COMPSs for different numbers of cores, considering the two different scenarios. The
prediction errors are generally very low; the maximum error (17%) is obtained in the case where
3 nodes and 192 fragments (equal to the maximum number of cores) are used. Note that we obtained
significantly better results than those reported in EUBra-BIGSEA Deliverable D3.4 (characterized by up
to -66% error). We argue that the data of the previous runs were affected by large variability due to
unusually high resource contention experienced on the BSC infrastructure at the end of February 2017;
the high workload arose because Mare Nostrum was about to be switched off for 5 months to upgrade
the system to the new version 4.
Overall, the results for both the TPC-DS and ML benchmarks and for COMPSs demonstrate that the
dagSim predictor has good accuracy and is a suitable tool for Big Data application performance prediction.
Table 14 - COMPSs application and dagSim prediction
#Nodes | #Cores | #Fragments | Real [ms] | dagSim [ms] | % Error
1 | 48 | 48 | 880,533 | 874,458 | 1%
2 | 96 | 96 | 591,215 | 582,889 | 1%
3 | 144 | 144 | 372,232 | 363,861 | 2%
4 | 192 | 192 | 268,058 | 259,922 | 3%
1 | 48 | 192 | 1,011,834 | 1,057,884 | -4%
2 | 96 | 192 | 646,748 | 699,206 | -8%
3 | 144 | 192 | 610,146 | 523,223 | 17%
10.3. Lundstrom Validation
We now turn to the evaluation of the accuracy of the Lundstrom analytical model. In the following we
present and discuss the results obtained in the various setup configurations considered.
The TPC-DS results reported in EUBra-BIGSEA Deliverable D3.4 did not include the Lundstrom
evaluation. In this deliverable, we therefore also present the results for the Lundstrom model over the
experiments executed on Azure A3 and D12v2 VM types, with the number of cores per executor set to
2. The accuracy error ranges from 4.43% to 23.70%. The maximum error occurs in the scenario with 48
cores processing the 500 GB dataset (D12v2 Microsoft Azure), for both Q26 and Q52. Table 15 depicts
the results.
Table 15 - TPC-DS Q26 and Q52 on Spark over D12v2 Azure VMs
Nodes | #Cores | Query | Real [ms] | Lundstrom [ms] | % Error
3 | 12 | Q26 | 722,191.54 | 690,216.98 | 4.43%
3 | 12 | Q52 | 719,922.93 | 660,772.88 | 8.22%
4 | 16 | Q26 | 582,916.50 | 543,854.76 | 6.70%
4 | 16 | Q52 | 562,701.25 | 517,314.43 | 8.07%
5 | 20 | Q26 | 515,921.32 | 469,010.19 | 9.09%
5 | 20 | Q52 | 471,807.51 | 412,718.74 | 12.52%
6 | 24 | Q26 | 447,640.05 | 398,265.11 | 11.03%
6 | 24 | Q52 | 417,689.57 | 358,323.66 | 14.21%
7 | 28 | Q26 | 415,717.33 | 367,184.59 | 11.67%
7 | 28 | Q52 | 364,080.10 | 304,676.52 | 16.32%
8 | 32 | Q26 | 366,135.08 | 316,546.96 | 13.54%
8 | 32 | Q52 | 324,692.89 | 264,963.70 | 18.40%
9 | 36 | Q26 | 306,080.43 | 256,057.86 | 16.34%
9 | 36 | Q52 | 306,760.14 | 246,952.58 | 19.50%
10 | 40 | Q26 | 287,453.29 | 236,815.25 | 17.62%
10 | 40 | Q52 | 275,196.48 | 215,247.32 | 21.78%
11 | 44 | Q26 | 259,665.43 | 209,579.07 | 19.29%
11 | 44 | Q52 | 258,779.14 | 200,155.59 | 22.65%
12 | 48 | Q26 | 248,604.10 | 197,181.57 | 20.69%
12 | 48 | Q52 | 249,954.80 | 190,722.71 | 23.70%
13 | 52 | Q26 | 220,247.86 | 181,354.17 | 17.66%
13 | 52 | Q52 | 226,047.19 | 179,279.19 | 20.69%
Prediction errors for both queries vary from 0.85% to 15.92% for the A3 deployment. The maximum
error occurs in this deployment in the 48-core configuration processing the 500 GB dataset, for both
Q26 and Q52 (see Table 16).
Table 16 - TPC-DS Q26 and Q52 on Spark over A3 Azure VMs
Nodes | #Cores | Query | Real [ms] | Lundstrom [ms] | % Error
4 | 10 | Q26 | 1,778,802.07 | 1,763,649.39 | 0.85%
4 | 10 | Q52 | 1,327,441.57 | 1,276,419.90 | 3.84%
4 | 12 | Q26 | 1,690,553.40 | 1,674,504.77 | 0.95%
4 | 12 | Q52 | 1,124,490.00 | 1,072,515.57 | 4.62%
5 | 14 | Q26 | 1,439,273.57 | 1,413,964.87 | 1.76%
5 | 14 | Q52 | 976,408.72 | 924,048.86 | 5.36%
5 | 16 | Q26 | 1,271,578.5 | 1,243,254.23 | 2.23%
5 | 16 | Q52 | 884,947.43 | 832,631.01 | 5.91%
6 | 18 | Q26 | 1,127,209.05 | 1,099,812.68 | 2.43%
6 | 18 | Q52 | 816,461.67 | 764,813.25 | 6.33%
6 | 20 | Q26 | 1,093,593.50 | 1,064,210.94 | 2.69%
6 | 20 | Q52 | 738,760.00 | 687,252.22 | 6.97%
7 | 22 | Q26 | 809,747.91 | 764,252.94 | 5.62%
7 | 22 | Q52 | 667,874.21 | 613,641.05 | 8.12%
7 | 24 | Q26 | 911,222.20 | 874,843.44 | 3.99%
7 | 24 | Q52 | 620,125.14 | 566,192.81 | 8.70%
8 | 26 | Q26 | 720,810.43 | 682,170.15 | 5.36%
8 | 26 | Q52 | 572,076.00 | 512,687.86 | 10.38%
8 | 28 | Q26 | - | - | -
8 | 28 | Q52 | 538,226.91 | 478,838.17 | 11.03%
9 | 30 | Q26 | 581,430.71 | 531,497.96 | 8.59%
9 | 30 | Q52 | 620,208.00 | 572,866.98 | 7.63%
9 | 32 | Q26 | 629,799.50 | 588,982.65 | 6.48%
9 | 32 | Q52 | 492,334.53 | 432,698.46 | 12.11%
10 | 34 | Q26 | 625,767.95 | 584,327.56 | 6.62%
10 | 34 | Q52 | 462,200.71 | 402,496.46 | 12.92%
10 | 36 | Q26 | 577,367.50 | 532,404.92 | 7.79%
10 | 36 | Q52 | 442,087.13 | 382,594.58 | 13.46%
11 | 38 | Q26 | 546,599.32 | 503,100.31 | 7.96%
11 | 38 | Q52 | 438,667.54 | 380,176.33 | 13.33%
11 | 40 | Q26 | 530,579.83 | 487,333.68 | 8.15%
11 | 40 | Q52 | 418,379.05 | 359,130.77 | 14.16%
12 | 42 | Q26 | 488,396.22 | 447,348.19 | 8.40%
12 | 42 | Q52 | 392,136.82 | 334,238.91 | 14.76%
12 | 44 | Q26 | 470,695.00 | 427,621.31 | 9.15%
12 | 44 | Q52 | 383,732.22 | 325,851.76 | 15.08%
13 | 46 | Q26 | 446,374.92 | 405,558.27 | 9.14%
13 | 46 | Q52 | 378,361.94 | 318,829.68 | 15.73%
13 | 48 | Q26 | 430,512.00 | 387,504.98 | 9.99%
13 | 48 | Q52 | 362,790.14 | 305,020.66 | 15.92%
Regarding the ML benchmarks, Table 17 shows the prediction errors of the Lundstrom model for the
SVM benchmark. Errors vary from 1.27% to 10.28%, while the maximum prediction error is reported
for the setup with the 8 GB dataset and 48 cores.
Table 17 - SVM experiments on Spark on Azure
Nodes | #Cores | Dataset size (GB) | Real [ms] | Lundstrom [ms] | % Error
3 | 24 | 8 | 190,905.83 | 171,651 | 10.09%
3 | 24 | 48 | 356,721.20 | 339,246 | 4.90%
3 | 24 | 96 | 1,366,994.80 | 1,349,569 | 1.27%
6 | 48 | 8 | 189,684.00 | 170,184 | 10.28%
6 | 48 | 48 | 372,529.20 | 353,248 | 5.18%
6 | 48 | 96 | 650,509.40 | 631,144 | 2.98%
Similarly to what was observed for SVM, the prediction errors for Logistic Regression range from 1.24%
to 11.55%, and the maximum error value was also obtained for 8GB data set and 48 cores.
Table 18 - Logistic Regression experiments on Spark on Azure
Nodes | #Cores | Dataset size (GB) | Real [ms] | Lundstrom [ms] | % Error
3 | 24 | 8 | 178,899.80 | 159,459 | 10.87%
3 | 24 | 48 | 684,430.71 | 664,363 | 2.93%
3 | 24 | 96 | 1,431,853.00 | 1,414,072 | 1.24%
6 | 48 | 8 | 181,977.80 | 160,961 | 11.55%
6 | 48 | 48 | 382,951.80 | 362,505 | 5.34%
6 | 48 | 96 | 1,219,151.00 | 1,192,617 | 2.18%
Regarding K-Means, the prediction error for Lundstrom ranges from 1.87% to 18.10%. Once again, the
maximum error was obtained for the smallest data set (8 GB) in the largest configuration (48 cores).
We further looked into the response times measured for individual runs of each benchmark on each
configuration and observed that the setup with the largest errors for all three benchmarks (8GB on 48
cores) coincides with the scenario with the highest variance across multiple runs. The large number of
cores used on a relatively small dataset, which might occasionally cause resource underutilization, may
explain the slightly worse performance of the model in this setup.
Table 19 - K-Means experiments on Spark on Azure
Nodes | #Cores | Dataset size (GB) | Real [ms] | Lundstrom [ms] | % Error
3 | 24 | 8 | 99,023.75 | 81,877 | 17.32%
3 | 24 | 48 | 342,195.47 | 325,131 | 4.99%
3 | 24 | 96 | 862,058.80 | 845,920 | 1.87%
6 | 48 | 8 | 90,299.38 | 73,958 | 18.10%
6 | 48 | 48 | 195,026.89 | 178,804 | 8.32%
6 | 48 | 96 | 594,309.60 | 572,919 | 3.60%
Regarding the experiments on COMPSs, the prediction errors for Lundstrom are quite small, ranging
from 0.30% to 1.15%. The maximum error was obtained for the scenario with the largest numbers of
nodes, cores and fragments (4, 192 and 192, respectively). Moreover, we found that the prediction
error tends to increase as the number of nodes/cores increases.
Table 20 - K-Means experiments on COMPSs on Mare Nostrum
Nodes | #Cores | Fragments | Real [ms] | Lundstrom [ms] | % Error
1 | 48 | 48 | 880,533 | 877,493 | 0.35%
2 | 96 | 96 | 591,215 | 588,138 | 0.52%
3 | 144 | 144 | 372,232 | 369,151 | 0.83%
4 | 192 | 192 | 268,058 | 264,979 | 1.15%
1 | 48 | 192 | 1,011,834 | 1,008,828 | 0.30%
2 | 96 | 192 | 646,748 | 643,657 | 0.48%
3 | 144 | 192 | 610,146 | 604,478 | 0.93%
Overall, across the considered set of experiments, which cover different platforms and configurations,
both dagSim and Lundstrom have been shown to provide very good prediction accuracy: dagSim
achieved a 6.17% average percentage error across all scenarios, while the Lundstrom tool obtained a
9.03% average error. We note that, for analytical queueing models (such as Lundstrom), errors of up to
30% in response time predictions are usually expected25. Thus, both tools are suitable for predicting the
performance of Big Data applications at run-time. Moreover, both have also achieved the K3.5 KPI
(median error below 40%).
25 Edward D. Lazowska, John Zahorjan, G. Scott Graham, and Kenneth C. Sevcik. Quantitative System
Performance: Computer System Analysis Using Queueing Network Models. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1984.
11. HML validation
This section contains the follow-up to the initial experiments included in D3.2, which are extended and
applied to the Spark technology in order to validate the effectiveness of our hybrid algorithm. A detailed
description of the experimental process is given in Section 11.1; in particular, the examined queries, the
dataset, the tested technologies and the configuration of the experiments are presented. Moreover,
HML is compared with two alternative approaches, and the extrapolation and interpolation capabilities
of the algorithm are examined.
11.1. Experiments setting
The datasets used for running the experiments have been generated using the TPC-DS benchmark. In
D3.2, two sets of experiments had been conducted for testing MapReduce and Tez jobs; in particular,
the validation was performed on ad-hoc Hive queries running over MapReduce and Tez. In order to
extend our work and examine the effectiveness of the hybrid algorithm, we executed Q40 (see
Figure 39) from the official TPC-DS benchmark on Spark and analysed the outcomes. The experimental
results are used to feed the synthetic data set which forms the initial knowledge base (KB).
Figure 39 - TPC-DS Q40
The data set size has been set to 250 GB. The experiments were executed at CINECA, the Italian
supercomputing center. PICO, the Big Data cluster available at CINECA, is composed of 74 nodes, each
featuring two Intel Xeon 10-core E5-2670 v2 @ 2.5 GHz processors and 128 GB of RAM. Out of the 74
nodes, 66 are available for computation. The storage consists of 4 PB of high-throughput disks based on
the GSS technology. For the Hadoop experiments we deployed the Hadoop 2.5.1 release and Apache
Tez 0.6.2, while for Spark applications version 1.6.0 was considered. The cluster is shared among
different users; resources are managed by the Portable Batch System (PBS), which allows submitting
jobs and checking their progress, configuring the computational requirements at a fine-grained level.
For all submissions it is possible to request a number of nodes and to define how many CPUs and what
amount of memory are needed. The YARN Capacity Scheduler is set up to provide one container per
core. Since the cluster is shared among different users, the performance of single jobs depends on the
overall system load, even though PBS tries to split the resources. Due to this, performance can vary
widely according to the total usage of the cluster and network contention.
In particular, storage is not handled directly by PBS, which leads to an even greater impact on
performance. Overall, query execution required about 20,000 CPU hours on the PICO cluster. For each
configuration, execution runs deviating more than three standard deviations from the mean have been
considered outliers and discarded. In SVR training, weighting is used to make the ML model give more
relevance and trust to real samples than to synthetic ones. Therefore, the weight of real data is
assumed to be five times the weight of analytical data in all experiments involving the hybrid
approach.
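The outlier rule and the real-vs-synthetic weighting can be sketched as follows; the weight vector is meant to be passed to an SVR trainer (e.g. scikit-learn's `SVR.fit(..., sample_weight=...)`), and the concrete numbers are illustrative:

```python
import statistics

def drop_outliers(times):
    """Discard runs farther than three standard deviations from the
    mean, as done for the PICO executions."""
    mu = statistics.mean(times)
    sd = statistics.pstdev(times)
    return [t for t in times if abs(t - mu) <= 3 * sd]

def sample_weights(n_real, n_synthetic, real_weight=5.0):
    """Real operational samples weigh five times more than the
    analytical (synthetic) ones, as in the hybrid approach."""
    return [real_weight] * n_real + [1.0] * n_synthetic
```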
For validating the effectiveness of the proposed hybrid approach in a comparative manner, two
solutions that lack the analytical model (AM) are considered:
1. Basic Machine Learning (BML): relies on SVR for the computation of the regression function. In
this case, the algorithm is fed with the same operational data used by our hybrid ML at the
final iteration.
2. Iterative Machine Learning (IML): in this approach, operational data are added iteratively and
the initial KB is empty, lacking the AM information. In other terms, we considered the general
structure of the HML Algorithm (Figure 5), except for lines 2 and 3, which correspond to the AM
involvement.
To compare our hybrid approach quantitatively with the BML and IML baseline methods, three
performance measures are defined:
1. The MAPE of the response time: this measure focuses on the prediction accuracy. It is defined
as the percentage of relative error of the response time predicted by the learned model with
respect to the expected value of the response times acquired from the operational system.
MAPE is used to evaluate both extrapolation and interpolation capabilities.
2. The number of iterations: the number of iterations of the external loop of the HML Algorithm
(see Figure 5), which equals the number of real data samples fed into the ML model.
3. The cost: this measure focuses on the cloud expenses for model construction and is defined as:

Cost = P × Σ_{i ∈ AC} NC_i × RTime_i × Ne_i

where AC is the set of configurations used for model selection and training, NC_i and RTime_i are the
number of cores and the execution time associated with each configuration i, Ne_i is the number of
operational data points used for training the model, and P is the price of using one core per time unit.
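Two of the measures can be sketched in a few lines. The cost formula below is our reconstruction from the variable description above (P times cores times execution time times number of samples, summed over AC) and should be read as an assumption rather than the exact expression used in the deliverable:

```python
def mape(predicted, measured):
    """Mean absolute percentage error of predicted response times
    with respect to the measured ones (measure 1)."""
    return 100.0 * sum(abs(p - m) / m
                       for p, m in zip(predicted, measured)) / len(measured)

def training_cost(configs, price_per_core_time_unit):
    """Measure 3 (reconstructed): configs is an iterable of
    (NC_i, RTime_i, Ne_i) tuples over the configuration set AC."""
    return price_per_core_time_unit * sum(nc * rt * ne
                                          for nc, rt, ne in configs)

# Illustrative values only:
# mape([90, 110], [100, 100]) is approximately 10.0
# training_cost([(20, 1.0, 3), (40, 0.5, 2)], 0.1) is approximately 10.0
```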
The DAG structure of Q40 is shown in Figure 40.
Figure 40 - Q40 DAG
The profiling phase was conducted by extracting the number of tasks and the average task durations
from around ten runs with the same configuration.
The four key steps of the proposed hybrid procedure, the comparison with the IML approach and the
outcomes are presented below. In order to generate the data for the analytical model based on the
approximation formula, a script named retrieve_response_times_with_formula is used, while the
compare_prediction_errors script is executed to find the optimal thresholds which minimize the mean
relative error. Finally, the computeExtrapolationFromRight, computeExtrapolationFromLeft and
computeInterpolation scripts are applied in order to investigate and draw conclusions on the
extrapolation and interpolation capabilities of the hybrid algorithm.
1) Data from the analytical model: In Figure 41 (a) we compare the average execution times for every
configuration obtained across the experiments to the ones obtained from Equation (4). The average
relative error of the values obtained from the approximation formula is around 34%.
2) Finding the optimal thresholds: The optimal combination of (itrThr, stopThr) was determined only
for the IML approach, by applying the approximation formula to the Q40 data; this combination was
equal to (25, 15). For every threshold combination, the algorithm was run with 50 different seed values
to generate different results. Conversely, in the case of the hybrid approach, we used the same
thresholds (34, 23) obtained for the R1 query (see EUBra-BIGSEA Deliverable D3.2).
Figure 41 - (a) Comparison of formula-based approximation and the mean values of real data, (b) right extrapolation and (c)
cost for Q40
3) Extrapolation capabilities on many cores: Right extrapolation capability analysis results are reported
in Figure 41 (b) and (c). Figure 41 (b) shows that the proposed approach outperforms IML, always
providing a lower MAPE on the test set. Remarkably, as we move to the left side, where more points are
missing, the MAPE of IML increases dramatically, demonstrating the advantage of our proposed
approach. In particular, when only the configuration with 120 cores is missing, the error of the hybrid
approach is approximately 7%, compared to 15% for IML. However, when 5 points are missing from the
configuration set, the error of IML shoots up to 30%, while the hybrid algorithm shows only a small
increase (11%).
4) Interpolation capability: Concerning interpolation, we considered three different scenarios when
applying the hybrid algorithm to Q40: (i) three points (20, 72 and 120 cores) are missing, (ii) five points
(20, 48, 72, 100 and 120 cores) are missing, and finally (iii) seven points (20, 40, 48, 72, 80, 100 and 120
cores) are missing. Concerning the relative error, we observe from Figure 42 that the IML approach
gives better accuracy than our approach.
Figure 42 - Interpolation for Q40
Although the hybrid error for three missing points is high (17%), in the case of seven missing points the
hybrid algorithm gets closer to IML performance. However, the number of iterations of the hybrid
approach is smaller than that of IML, and better costs can be achieved, as reported in Figure 43.
Figure 43 - Cost of interpolation analysis results
12. Optimization service validation
This chapter describes the validation of the OPT_IC and OPT_JR optimization tools, considering both a
case study application developed in WP7, named BULMA (Section 12.1), and a set of seven test cases,
which include a number of applications from the TPC-DS benchmark (Section 12.2).
12.1. OPT_IC validation on BULMA application
The WS_OPT_IC service and the underlying OPT_IC optimization tool have been validated considering
the BULMA WP7 case study application developed within task T5.7. The problem faced by BULMA is the
following. The task of identifying bus trajectories from sequences of noisy geospatial-temporal data
sources is known as a map-matching problem. It consists of performing the linkage between the bus
GPS trajectories and their corresponding road segments (i.e., predefined trajectories or shapes) on a
digital map. In this sense, BULMA is a novel unsupervised technique capable of matching a bus
trajectory with the "correct" shape, considering the cases in which there are multiple shapes for the
same route (a common case in many Brazilian cities, e.g., Curitiba and São Paulo). Furthermore, BULMA
is able to detect bus trajectory deviations and mark them in its output. The goal of BULMA is to provide
high-quality integrated geospatial-temporal training data to support the predictive machine learning
algorithms of Intelligent Transport Systems (ITS) applications and services.
There are two types of files used by BULMA: shape and GPS files. The first consists of data for the same
route and corresponds to a rich/detailed set of georeferenced points describing the trajectory that a bus
should follow, i.e., the predefined bus trajectory. It is extracted from the General Transit Feed
Specification (GTFS) data, an input file (produced by an operator or authority) containing details of
transit supply. GPS files, in turn, contain all the GPS (Global Positioning System) trajectories of a city's
bus fleet during a certain period of time. In our case, each GPS file corresponds to all the GPS
trajectories of the city bus fleet during a day. Thus, BULMA matches a bus trajectory with the "correct"
shape, considering the cases in which there are multiple shapes for the same route.
As an example, Figure 44 includes two shapes describing the trajectories (with the same direction) that
a bus could follow on route 022. Most state-of-the-art techniques are able to detect whether or not a
bus is performing route 022. However, to the best of our knowledge, none of them is able to indicate
whether a bus (assigned to route 022) will start its trajectory from point A or B. Note that if a
classification error occurs during the generation of the bus trajectory history (e.g., a bus becomes
associated with the shape depicted on the right of the figure instead of the correct left one), a
predictive algorithm (of WP7) will erroneously estimate an arrival time at point C, since the bus is
programmed to finish its trip at point A (according to the shape depicted on the left). For all the
aforementioned reasons, BULMA solves this problem with good accuracy.
Figure 44 - Two different shapes describing the trajectories for route 022 that a bus could follow
BULMA has been implemented in Spark, and the experiments were performed by running the
application on the Microsoft Azure HDInsight platform with Spark 2.1.0. Workers were deployed on
D4 v2 VMs (with 8 cores and 28GB of memory) and every executor was configured to run with 8 cores
and 16GB of RAM. To check the performance of our tools, a set of configurations with different
numbers of nodes in the cluster was tested. In particular, we performed runs with 1 to 6 nodes,
generating 5 days of GPS training data.
To validate OPT_IC, we aimed at assessing the quality of the optimal solution it produces. Given the
BULMA initial deployment to optimize and a deadline to meet, we focus on the response time
measured on a real cluster provisioned according to the number of VMs determined by the
optimization procedure, quantifying the relative gap as a metric of the optimizer accuracy. Formally:
gap = (D - TReal) / D (5)
where D is the deadline and TReal the execution time measured on the real cluster, so that possible
deadline misses yield a negative result.
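As an illustrative sketch (assuming the reconstruction of Equation (5) above; the function name is ours), the gap and miss detection can be computed as:

```python
# Illustrative sketch of the relative-gap metric: gap = (D - TReal) / D,
# so a deadline miss (TReal > D) yields a negative value.

def relative_gap(deadline, measured_time):
    return (deadline - measured_time) / deadline

# e.g., deadline 6,000 s, measured 4,800 s: 20% slack
print(relative_gap(6000, 4800))  # 0.2
# a miss: measured 6,600 s yields a negative gap
print(relative_gap(6000, 6600))  # -0.1
```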
We considered 56 cases, varying the deadline between 5,500 sec and 11,000 sec in steps of 100 sec,
and ran OPT_IC to determine the optimal resource assignment meeting the QoS constraints on D4 v2
instances. Figure 45 plots the data we obtained.
Figure 45 - OPT_IC percentage error as a function of the BULMA deadline
We experienced a deadline miss in only 5 out of 56 cases (8.9%). Moreover, the relative error is always
below 35%, with a worst-case result of 35.12% and an average absolute error of 18.13%. From the plot,
we observe some abnormal behaviour (rapid jumps), which is related to changes in the deployment
configuration (number of required VMs) suggested by OPT_IC. In particular, the gap in terms of required
VMs is at most 1, and the accuracy increases when the deadline is tight, i.e., for larger clusters. Since the
initial deployment can also be updated by the pro-active policies module, overall we can conclude that
the OPT_IC tool is effective in identifying the minimum cost initial deployment, while also guaranteeing
that deadlines are met.
12.2. OPT_JR validation
In the following, the experimentation of the OPT_JR tool on a set of seven test cases is reported. All the
experiments considered Spark TPC-DS benchmark executions on the IBM Power8 (P8) cluster available
at Politecnico di Milano with a 1TB data set size.
Each test case includes a set of applications characterized by: i) an application identifier (i.e., a TPC-DS
query), ii) a weight, iii)-iv) two ML parameters describing the application profile, v) the total RAM
available at the YARN Node Manager, vi) the container size (in terms of RAM) for the application (i.e.,
the memory devoted to Spark executors), vii) the total number of vCPUs available at the YARN Node
Manager, viii) the container size (in terms of vCPUs), ix) the deadline, and finally x) a label describing
the current stage. The test details are reported in Tables 21-27. The capacity N available for each test is
reported in Table 28.
Test 1 is the base test. The following cases present variations in terms of weights, deadlines and
number of applications. For example, in Test 2 the deadlines of the applications have been increased,
while the weights are kept the same and the capacity N is increased. In Test 3 we consider a pool of 10
applications. The aim of Tests 4 to 7 is to investigate the impact of the weights on the final results: they
consider the same set of applications with the same parameters as Test 1, but the weight of Q26 is
progressively increased from 2 to 5.
Table 21 - Test1 parameters
app_id w chi_0 chi_c M m V v Deadline [ms] stage_id
Q26 1 18,906.97517 12,945,621.49 28 8 4 2 200,000 J1S1
Q52 1 9,474.291259 10,019,343.23 28 8 4 2 100,000 J3S3
Q40 1 38,056.87096 20,224,206.15 56 18 4 2 1,000,000 J1S1
Q55 1 10,137.70356 9,932,642.77 56 18 4 2 100,000 J1S1
Table 22 - Test2 parameters
app_id w chi_0 chi_c M m V v Deadline [ms] stage_id
Q26 1 18,906.97517 12,945,621.49 28 8 4 2 400,000 J1S1
Q52 1 9,474.291259 10,019,343.23 28 8 4 2 200,000 J3S3
Q40 1 38,056.87096 20,224,206.15 56 18 4 2 1,000,000 J1S1
Q55 1 10,137.70356 9,932,642.77 56 18 4 2 200,000 J1S1
Table 23 - Test3 parameters
app_id w chi_0 chi_c M m V v Deadline [ms] stage_id
Q26 1 18,906.97517 12,945,621.49 28 8 4 2 200,000 J1S1
Q52 1 9,474.291259 10,019,343.23 28 8 4 2 100,000 J3S3
Q40 1 38,056.87096 20,224,206.15 56 18 4 2 100,000 J1S1
Q55 1 10,137.70356 9,932,642.77 56 18 4 2 100,000 J1S1
Q26 1 18,906.97517 12,945,621.49 28 8 4 2 400,000 J1S1
Q52 1 9,474.291259 10,019,343.23 28 8 4 2 200,000 J3S3
Q40 1 38,056.87096 20,224,206.15 56 18 4 2 500,000 J1S1
Q55 1 10,137.70356 9,932,642.77 56 18 4 2 200,000 J1S1
Q40 1 38,056.87096 20,224,206.15 56 18 4 2 600,000 J1S1
Q55 1 10,137.70356 9,932,642.77 56 18 4 2 500,000 J1S1
Table 24 - Test4 parameters
app_id w chi_0 chi_c M m V v Deadline [ms] stage_id
Q26 2 18,906.97517 12,945,621.49 28 8 4 2 200,000 J1S1
Q52 1 9,474.291259 10,019,343.23 28 8 4 2 100,000 J3S3
Q40 1 38,056.87096 20,224,206.15 56 18 4 2 1,000,000 J1S1
Q55 1 10,137.70356 9,932,642.77 56 18 4 2 100,000 J1S1
Table 25 - Test5 parameters
app_id w chi_0 chi_c M m V v Deadline [ms] stage_id
Q26 3 18,906.97517 12,945,621.49 28 8 4 2 200,000 J1S1
Q52 1 9,474.291259 10,019,343.23 28 8 4 2 100,000 J3S3
Q40 1 38,056.87096 20,224,206.15 56 18 4 2 1,000,000 J1S1
Q55 1 10,137.70356 9,932,642.77 56 18 4 2 100,000 J1S1
Table 26 - Test6 parameters
app_id w chi_0 chi_c M m V v Deadline [ms] stage_id
Q26 4 18,906.97517 12,945,621.49 28 8 4 2 200,000 J1S1
Q52 1 9,474.291259 10,019,343.23 28 8 4 2 100,000 J3S3
Q40 1 38,056.87096 20,224,206.15 56 18 4 2 1,000,000 J1S1
Q55 1 10,137.70356 9,932,642.77 56 18 4 2 100,000 J1S1
Table 27 - Test7 parameters
app_id w chi_0 chi_c M m V v Deadline [ms] stage_id
Q26 5 18,906.97517 12,945,621.49 28 8 4 2 200,000 J1S1
Q52 1 9,474.291259 10,019,343.23 28 8 4 2 100,000 J3S3
Q40 1 38,056.87096 20,224,206.15 56 18 4 2 1,000,000 J1S1
Q55 1 10,137.70356 9,932,642.77 56 18 4 2 100,000 J1S1
Table 28 - Residual capacity for the soft-deadline applications considered in the test cases
Test N
Test1 150
Test2 200
Test3 220
Test4 150
Test5 150
Test6 150
Test7 150
For each test case, the following metrics have been considered:
1. The number of iterations (the maximum value MAX_IT has been fixed to 15) to identify the
final (local) optimal solution;
2. The size of the candidate lists and how this varies with the number of iterations;
3. The value of the total objective function and how this varies with the number of iterations;
4. The gap between the initial number of cores νi identified by the KKT conditions and the final
value identified at the local optimum. We also report the number of cores identified by OPT_IC
to determine the initial minimum cost deployment;
5. The number of iterations necessary to determine the bound;
6. The total execution time of the OPT_JR tool required to identify the final solution.
Table 29 - Test1 results
App_id Bound #iterations OPT_IC #cores Final solution #cores
Q26 32 2 76 48
Q55 30 1 115 40
Q40 42 1 22 24
Q52 28 1 113 36
Table 30 - Test2 results
App_id Bound #iterations OPT_IC #cores Final solution #cores
Q26 44 1 36 36
Q55 41 1 53 52
Q40 58 1 24 3
Q52 52 1 54 52
Table 31 - Test3 results
App_id Bound #iterations OPT_IC #cores Final solution #cores
Q26 17 2 76 32
Q55 15 1 113 28
Q40 22 1 22 20
Q55 15 1 115 20
Q26 17 1 36 20
Q52 15 1 54 20
Q40 22 1 46 24
Q55 15 1 53 16
Q40 22 1 38 24
Table 32 - Test4 results
App_id Bound #iterations OPT_IC #cores Final solution #cores
Q26 41 1 76 52
Q55 27 1 115 36
Q40 39 1 22 24
Q52 25 2 113 36
Table 33 - Test5 results
App_id Bound #iterations OPT_IC #cores Final solution #cores
Q26 47 2 76 64
Q55 25 1 115 32
Q40 36 1 22 20
Q52 24 1 113 32
Table 34 - Test6 results
App_id Bound #iterations OPT_IC #cores Final solution #cores
Q26 52 2 76 64
Q55 24 1 115 32
Q40 34 1 22 20
Q52 22 1 113 32
Table 35 - Test7 results
App_id Bound #iterations OPT_IC #cores Final solution #cores
Q26 55 2 76 68
Q55 23 1 115 28
Q40 33 1 22 20
Q52 21 1 113 32
Table 36 - Total elapsed time for tests 1 -7
Test Use of cached predictor results Total elapsed time [s]
Test1 No 6862.381712
Test1 Yes 0.233007
Test2 No 3850.414621
Test2 Yes 0.139280
Test3 No 46466.197959
Test3 Yes 2.412670
Test4 No 9856.271436
Test4 Yes 0.337701
Test5 No 10553.975149
Test5 Yes 0.366470
Test6 No 8543.538833
Test6 Yes 0.287102
Test7 No 9448.558896
Test7 Yes 0.341946
Figure 46 - Total OF vs Iterations number (Test3)
Figure 47 - Total OF vs Iterations number (All tests with exception of Test3)
Figure 48 - Candidates List size vs Iterations number
Tables 21 to 27 summarize the detailed parameters considered in the seven test cases. Figure 48 shows
the size of the L1 and L2 lists versus the number of iterations over the considered scenarios. Figures 46
and 47 represent the behaviour of the total objective function with respect to the iterations (because
of the different magnitudes, test case 3 is represented separately in Figure 46).
OPT_JR was run on a VirtualBox VM based on Ubuntu 14.04.1 LTS server, running on an Intel Xeon
Nehalem dual-socket quad-core system with 32 GB of RAM. OPT_JR exploited a single core.
Two approaches were used when running the tests. In the first, all the calculations to determine the
bounds and the solutions of the local search algorithm were performed on demand, resulting in longer
execution times (not acceptable in cases such as Test 3, where the number of applications is higher
than in the other scenarios). To mitigate the issue, OPT_JR can cache the output of earlier dagSim or
Lundstrom calls in a DB table (called PREDICTOR_CACHE_TABLE, Table 37), resulting in a faster process
execution.
Table 37 - PREDICTOR_CACHE_TABLE
Column name Description Type Is Primary
application_id The application identifier varchar(100) Y
dataset_size The dataset size double Y
num_cores The number of cores int(11) Y
val The output from the predictor double N
predictor The predictor for which the value was memorized char Y
stage Current DAG stage varchar(10) Y
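The cache-before-predict logic described above can be sketched as follows. This is an illustrative sketch using SQLite and a placeholder predictor callback; the actual OPT_JR implementation and DB engine may differ, and only the schema names come from Table 37.

```python
# Illustrative sketch: memoize expensive dagSim/Lundstrom invocations in
# PREDICTOR_CACHE_TABLE, keyed on the primary-key columns of Table 37.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE PREDICTOR_CACHE_TABLE (
        application_id VARCHAR(100),
        dataset_size DOUBLE,
        num_cores INT,
        val DOUBLE,
        predictor CHAR,
        stage VARCHAR(10),
        PRIMARY KEY (application_id, dataset_size, num_cores, predictor, stage)
    )
""")

def predict(app_id, dataset_size, num_cores, predictor, stage, run_predictor):
    """Return a cached prediction if present; otherwise run the (expensive)
    predictor and store its result for later lookups."""
    row = conn.execute(
        "SELECT val FROM PREDICTOR_CACHE_TABLE WHERE application_id=? AND "
        "dataset_size=? AND num_cores=? AND predictor=? AND stage=?",
        (app_id, dataset_size, num_cores, predictor, stage)).fetchone()
    if row is not None:
        return row[0]
    val = run_predictor()  # stands in for a dagSim/Lundstrom call
    conn.execute("INSERT INTO PREDICTOR_CACHE_TABLE VALUES (?,?,?,?,?,?)",
                 (app_id, dataset_size, num_cores, val, predictor, stage))
    return val

calls = []
slow = lambda: calls.append(1) or 123.0  # records each invocation
predict("Q26", 500, 12, "D", "J0S0", slow)  # cache miss: predictor runs
predict("Q26", 500, 12, "D", "J0S0", slow)  # cache hit: no second call
print(len(calls))  # 1
```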
However, other options are available to improve the general performance of OPT_JR. Since clusters
need to be managed hierarchically, it is also possible to run dagSim/Lundstrom in a multicore fashion,
thereby improving performance in both the bound evaluation step and the neighborhood exploration.
Another approach to reducing the execution time consists in limiting the number of iterations, at the
cost of penalizing the precision of the solution provided. Future OPT_JR releases will take into account
the above architecture enhancements.
By comparing the parameters and the results of Test 1 and Test 2, it is possible to see that increasing
the deadline of an application tends to decrease the execution time of the overall process (specifically,
doubling the value of the deadline reduces the execution time to approximately half of the earlier
value). This could be expected, since tighter deadlines (in Test 1) make the problem more difficult to
solve.
The curve reporting the total objective function (see Figures 46 and 47) decreases progressively at each
iteration, confirming that overall the local search is able to improve the initial heuristic assignment by
selecting the best core configuration switch at each iteration. As a whole, the total OF value improved
by 24.08% on average, ranging between 20.57% and 31.23% for test cases 1 and 7, respectively.
Moreover, the number of iterations needed to reach the best configurations is 12, lower than the
maximum of 15 fixed in the experiments, demonstrating that the optimization algorithm identifies the
final local optimum solution in a few iterations.
On the other hand, Figure 48 shows the size of the candidate lists L1 and L2 with respect to the number
of iterations. In this case, a non-monotonic behaviour is observed.
Finally, as per Tables 32 to 35, the values of νi and the total number of cores assigned to application 1
(Q26) in the final solution tend to grow as its weight is increased by one unit at a time (Tests 4, 5, 6 and
7), which can also be expected (higher priority applications get more resources).
13. Conclusions
In this deliverable, we reported the advances in the implementation of the final version of the EUBra-
BIGSEA infrastructure services achieved between M19 and M21. These services include: (i) tools for
predicting big data application performance, (ii) mechanisms for horizontal and vertical elasticity, and
(iii) approaches based on pro-activity and on optimization to perform horizontal and vertical scaling of
running big data applications, as well as load balancing of the infrastructure. The services have been
validated considering industry benchmark applications and also the current version of a subset of the
applications developed within the case study in WP7.
GLOSSARY
Acronym or Name Definition
AM Analytical Model
BSD Berkeley Software Distribution
CPU CAP CPU capacity limiting. Provides a fine-grained control of the CPU resources used by the
virtual CPU, limiting the percentage of physical CPU used by the virtual CPUs
CV1/CV2 Cross Validation Sets
dagSim A discrete event simulator estimating the execution time of a DAG-based model
EMaaS Entity Matching-as-a-Service targets the problem of identifying records that refer to the
same entity in the real world
HML Hybrid Machine Learning
HML prototype A set of Octave scripts calculating, via Machine Learning algorithms, the optimal values
used by the OPT_IC tool
HSOptimizer Horizontal Scaling Optimizer is an opportunistic service that can provide more VMs
than required if there are idle resources
IOPS Input-Output operations per second. Along with CPU speed, it is the most important
metric for the performance of a computation resource
JMT Java Modelling Tools
KB Knowledge Base
Log Repository A collection of nested folders, positioned and named according to a set of rules, including
the input files for the Performance Predictors
Lundstrom A Lundstrom predictor solving a DAG-based model
MAPEs Mean Absolute Percentage Errors
ML Machine Learning
OPT_IC The tool that provides the initial configuration (optimal number of Virtual Machines). It
calls and executes dagSim
OPT_JR The tool that provides the optimal reconfiguration
Spark Log Parser A tool used to parse the original log files and prepare them to be properly processed by
the Performance Predictors
WS_OPT_IC Web Service invoking OPT_IC tool
WS_OPT_JR Web Service invoking OPT_JR tool
Annex
Appendix A - Hybrid Machine learning module usage manual
This appendix is organized as follows. The first sub-appendix presents the operational guidelines to
install and launch the Hybrid Machine Learning tools, and is followed by an in-depth user manual. Each
step is commented thoroughly with the support of a table showing the parameters and the
dependencies among the scripts.
A.1 Deployment and Installation instructions
Three basic steps are required to launch the tool: a) download the collection of scripts and functions,
b) install the LIBSVM library, and c) prepare the input log files from Spark applications and the
performance values from an analytical model (e.g., the approximation formula or dagSim).
For the deployment of the system the user needs to perform the following steps:
1. Install GNU Octave
2. Install LIBSVM
3. Download the collection of Octave scripts and functions
4. Prepare the input files for MR, Tez and Spark jobs
Prerequisites:
1. The Hybrid Machine Learning Tool consists of a set of scripts implemented in GNU Octave and
can be executed on all operating systems that support Octave.
2. Installation of the LIBSVM library
3. Input files: scripts for processing logs obtained from the execution of queries via MR, DAG-based
Tez and Spark jobs on Hadoop and Spark clusters. In the case of Spark logs, the user should
download the Spark-log-Parser (see D3.4, Appendix E1.1), which is needed to obtain the
appropriate csv files after processing the data logs; these files are used as input to the Hybrid
Machine Learning tool.
A.2 User Manual
1. Preparation of summary csv log files by executing the Spark-log-Parser
2. Open Octave GNU
3. Navigate to the directory where the retrieve_response_times_with_formula script is located. Open
the m file and fill in the required information regarding the values of the parameters described in
Table 38. In case of using dagSim or an alternative AM, the relevant data should be provided as a
csv file, which will be used as an input to the compare_prediction_errors script (the user should fill
in the path of this csv file in the script).
4. Navigate to the main script, compare_prediction_errors, used to identify the optimal thresholds
which minimize the error. Open the m file and fill in the requested information; more details are
shown in Table 39. Subsequently, execute the script from the Octave command window.
5. When the execution of compare_prediction_errors is completed, the average error for each
combination of thresholds for the Hybrid approach is calculated by the function
"computeAvgsOfHybridForOptimumFinding", which finally returns the combination that minimizes
the error.
6. To investigate the right and left extrapolation, the optimal thresholds are inserted as an input in
the same script as before and the execution starts. The process is repeated, and in each iteration
one of the available core configurations is moved from the training and cross-validation sets CV1
and CV2 to the test set.
7. To generate the output plots of the right and left extrapolations, scripts
"computeExtrapolationFromRight" and "computeExtrapolationFromLeft" are executed and the
graphs are stored in the folder indicated by the user.
8. In the same way the interpolation is calculated by using “compare_prediction_errors” script
while the plots are created after executing "computeInterpolation". Some representative plots
of both extrapolation and interpolation process are presented in Figure 49.
Table 38 - Parameters of the retrieve_response_times_with_formula script
Parameters Description Example
configurations Number of examined cores configurations = [20 30 40 48 60 72 80 90 100 108 120]
base_path The path where the processed logs are located base_path = "/path/to/processed/data"
task_idx Task IDs of the operational data used by the formula (to retrieve task_idx from the csv files: head -n 2 10.csv | tail -n 1 | tr , '\n' | grep -n nTask | cut -d : -f 1 | xargs echo) task_idx = [4 11 18 25 32 39 46 53 60 67];
filename Filename which includes the results after script execution filename = [base_path, "/estimated_response_times.csv"];
Table 39 - Parameters of compare_prediction_errors
Parameters Description Example
base_directory The directory where the csv files of data are located base_directory = /path/to/csv/with/processed/logs
hybrid_csv The csv file where the results with errors are stored hybrid_csv = "/path/to/csv/file/with/the/errors/hybrid.csv"
configuration.runs Number of examined cores configuration.runs = [20 30 40 48 60 72 80 90 100 108 120]
configuration.missing_runs Number of missing cores configuration.missing_runs = [20 30]
outer_thresholds Minimize the MAPE on the training set outer_thresholds = 25:40 (a range, in order to find the optimum threshold) or outer_thresholds = 25 (after finding the optimum threshold)
inner_thresholds Minimize the MAPE on the cross validation set inner_thresholds = 15:30 (a range, in order to find the optimum threshold) or inner_thresholds = 25 (after finding the optimum threshold)
max_inner_iterations Maximum number of iterations of the algorithm max_inner_iterations = 10;
seeds Seed for the random number generator seeds = 101:150;
Figure 49 - (a) Left extrapolation, (b) right extrapolation and (c) interpolation for the R1 query
Appendix B - Performance Prediction and Optimization service enhancements implemented in the new
release
This section is organized in three sub-appendices. The first and the second describe the enhancements
implemented in the OPT_IC and WS_Perf modules, which allow revising the optimal number of cores
and VMs for an application and estimating its residual time while the application is running. Finally, the
third provides a technical and an operational guide for the user, in order to support the deployment
and the efficient usage of the tools.
Appendix B.1 Performance prediction and Optimization (OPT_IC) service new release
This appendix summarizes the main changes implemented in the WS_2P and WS_OPT_IC services
between M17 and M21 (for additional details see deliverable D3.4).
WS_2P Enhancement: Prediction of the residual execution time of an application
With its new version, the performance prediction service WS_2P can also be used to predict the
remaining time of a running job. In doing so, we can query the web service every time a stage finishes
to gain some insight into the remaining time. In case dagSim is adopted as the underlying performance
prediction tool, the WS REST call requires the following parameters:
REST API REQUEST
Type: GET
Format: Nodes/Cores/RamG/Dataset/Application_Session_id/Stage_id
For example:
curl http://localhost:8080/bigsea/rest/ws/dagsim/6/2/8G/500/application_1483347394756_1/J2S2
Note that this service has access to the RUNNING_APPLICATION_TABLE (see EUBra-BIGSEA Deliverable
D3.4) to retrieve the submission time of the job, hence differently from the other services, it is required
to send the application_session_id rather than the application_id.
An example of output is:
5043 6045
The first value is the predicted average remaining time computed by dagSim. This value is computed as
the difference between the expected duration of the entire job and the expected end time of the stage.
The second is a rescaled estimate of the remaining time that also considers the time already spent by
the application. The rescaled value is computed according to the following formula:
RescaledTimeRemaining = EstimatedTimeRemaining · ElapsedTime / EstimatedStageEndTime
The rationale behind this formula is the following: if an application spent 10s (ElapsedTime) performing
work that should have been done in 8s (EstimatedStageEndTime), then the application is conservatively
assumed to be delayed by a factor of 10/8. The assumption is that this delay factor will remain constant
for the remaining part of the application, i.e., the application will experience the same resource
contention, which will slow it down by the same amount with respect to the conditions experienced
during application profiling. Note that the second value depends on the time at which we call the web
service, hence it may differ from the example above.
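As an illustrative sketch (assuming the multiplicative rescaling reconstructed above; the function name is ours, not the service's), the rescaling reads:

```python
# Illustrative sketch of the rescaling: the predicted remaining time is
# inflated by the observed delay factor ElapsedTime / EstimatedStageEndTime.

def rescaled_remaining(estimated_remaining, elapsed, estimated_stage_end):
    return estimated_remaining * elapsed / estimated_stage_end

# The 10 s vs 8 s example from the text: a job running 10/8 slower than
# profiled has its remaining-time estimate inflated by the same factor.
print(rescaled_remaining(4000, 10, 8))  # 5000.0
```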
To speed up the service, we introduced a lookup table named DAGSIM_STAGE_LOOKUP_TABLE that
stores previously computed values for EstimatedStageEndTime and EstimatedTimeRemaining for each
stage.
A row in this table is shown next:
application_id num_cores stage dataset_size ramGB stage_end_time remaining_time
Q26 12 J0S0 500 8G 1767 624019
The primary key of this table includes columns (application_id, num_cores, stage, dataset_size).
Note that the total number of cores used is computed as the product of the number of nodes and cores
per node.
This lookup table is checked by WS_2P before starting dagSim: if a corresponding entry already exists,
then the rescaled remaining time is computed using these values; otherwise, we start dagSim and store
the computed values for all the stages in this lookup table [26].
In case there is no perfect correspondence with a data folder (as in the previous release), dagSim will
use the folder representing the closest match in terms of number of nodes and cores. The following
table shows some examples:
[26] The EstimatedStageEndTime is calculated as follows:
- First, we retrieve the number of cores used by the running job from the
RUNNING_APPLICATION_TABLE
- Then we check whether a previously computed value is present in DAGSIM_STAGE_LOOKUP_TABLE
- If not, dagSim is called and its result stored in DAGSIM_STAGE_LOOKUP_TABLE
Content of log repository Input log folder best match
● 6_2_8G_500
● 10_2_8G_500
● 16_2_8G_500
● 22_2_8G_500
7 / 2 / 8G / 500 6_2_8G_500
● 6_2_8G_500
● 10_2_8G_500
● 16_2_8G_500
● 22_2_8G_500
14 / 2 / 8G / 500 16_2_8G_500
● 6_2_8G_500
● 10_2_8G_500
● 16_2_8G_500
● 22_2_8G_500
25 / 2 / 8G / 500 22_2_8G_500
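The closest-match rule illustrated in the examples above can be sketched as follows (an illustrative sketch, assuming the match is decided by the nearest node count among folders that share cores, RAM and dataset size; the function name is ours):

```python
# Illustrative sketch: among the available log folders, pick the one whose
# node count is nearest to the requested configuration.

def best_match(available_nodes, requested_nodes):
    return min(available_nodes, key=lambda n: abs(n - requested_nodes))

folders = [6, 10, 16, 22]       # node counts of 6_2_8G_500, 10_2_8G_500, ...
print(best_match(folders, 7))   # 6
print(best_match(folders, 14))  # 16
print(best_match(folders, 25))  # 22
```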
In case the Lundstrom tool is used, the WS REST call requires the following parameters:
REST API REQUEST
Type: GET
Format: Nodes/Cores/RamG/Dataset/Application_Session_id/Stage_id
For example:
curl http://localhost:8080/bigsea/rest/ws/lundstrom/6/2/8G/500/application_1483347394756_1/J2S2
As for dagSim, this service has access to the RUNNING_APPLICATION_TABLE (see EUBra-BIGSEA
Deliverable D3.4) to retrieve the submission time of the job; hence, it is required to send the
application_session_id rather than the application_id.
An example of output is:
5043 6045
Lundstrom's output follows the same pattern as dagSim, providing a unified API. The first value is the
predicted average time remaining computed by Lundstrom, using the same principles as dagSim, while
the second is a rescaled estimate of the remaining time that also considers the time already spent by the
application, as discussed for dagSim.
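As a minimal illustration of how a client might consume this endpoint, the sketch below builds the URL from the curl example above and splits the two whitespace-separated values. The helper names are assumptions for illustration, not part of the actual service:

```python
from urllib.request import urlopen

def parse_lundstrom_output(body):
    """Split the two whitespace-separated predictions, e.g. '5043 6045'."""
    avg_remaining, rescaled_remaining = (int(v) for v in body.split())
    return avg_remaining, rescaled_remaining

def lundstrom_remaining(host, nodes, cores, ram, dataset, session_id, stage_id):
    # URL layout taken from the curl example above
    url = (f"http://{host}/bigsea/rest/ws/lundstrom/"
           f"{nodes}/{cores}/{ram}/{dataset}/{session_id}/{stage_id}")
    return parse_lundstrom_output(urlopen(url).read().decode())

# parse_lundstrom_output("5043 6045") -> (5043, 6045)
```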
Appendix B.2 WS_OPT_IC Enhancement: Revise the optimal number of cores and VMs while an
application is running
With its new version, the optimization service WS_OPT_IC can also consider the elapsed time of an
application and compute the increase (or decrease) of computational resources to match the deadline
while the application is running. This is done by calling the underlying resource optimization tool with a
rescaled deadline.
The WS REST call requires the following parameters:
REST API REQUEST
Type: GET
Format: Application_Session_Id/Dataset/Deadline/Stage_id
For example:
curl http://localhost:8080/bigsea/rest/ws/resopt/application_1483347394756_1/500/500000/J2S2
Note that this service also has access to the RUNNING_APPLICATION_TABLE (see EUBra-BIGSEA
Deliverable D3.4) to retrieve the submission time of the application and the current number of cores on
which it is running; hence, as for the WS_2P service, the application_session_id rather than the
application_id must be sent.
An example of output is:
56 11 153000
The first two values are the number of cores and the number of VMs that should be used to satisfy the
deadline.
The third value is the rescaled deadline calculated by the web service. Note that the first and second
values should be (almost) the same as those returned by a call to OPT_IC with the rescaled
deadline.
The rescaled deadline is computed according to the following formula:
RescaledDeadline = Deadline × (EstimatedStageEndTime / ElapsedTime)
The rationale behind this formula is the following. OPT_IC computes the optimal configuration
assuming that the application has not yet started. WS_OPT_IC computes the new deadline
considering the current elapsed time of the application and the time at which the current stage is going
to finish. So, if an application spent 10s (ElapsedTime) to perform work that should have been done in 8s
(EstimatedStageEndTime), then the new configuration will be computed by decreasing the initial
deadline by a factor 8/10. In this way, OPT_IC will determine a configuration with a capacity possibly
larger than the initial one, speeding up the application execution to recover the accumulated delay.
The assumption is that this delay factor will remain constant for the remaining part of the
application, i.e., the application will experience the same resource contention, which will slow it down
by the same amount with respect to the conditions experienced during application profiling.
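The rescaling rule can be summarised in a few lines of Python. This is an illustration of the rationale above (including the "Deadline too strict" guard for the case in which the elapsed time already exceeds the deadline), not the exact service code:

```python
def rescale_deadline(deadline, elapsed_time, estimated_stage_end_time):
    """Shrink the initial deadline by EstimatedStageEndTime / ElapsedTime."""
    if elapsed_time >= deadline:
        # the application cannot meet the deadline anymore
        raise ValueError("Deadline too strict")
    return deadline * estimated_stage_end_time / elapsed_time

# The example from the text: 10 s spent on work expected to take 8 s
# shrinks the deadline by a factor 8/10.
print(rescale_deadline(100.0, 10.0, 8.0))  # -> 80.0
```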
The EstimatedStageEndTime is calculated as follows:
- First, we read the number of cores used by the running job from the
RUNNING_APPLICATION_TABLE;
- Then we check whether a previously computed value is present in DAGSIM_STAGE_LOOKUP_TABLE;
- If not, dagSim is called and its result stored in
DAGSIM_STAGE_LOOKUP_TABLE.
Since the call to OPT_IC may take a long time, we implemented two methods to mitigate this problem:
1. Caching the previously computed values of num_cores and num_vm in the lookup table
OPTIMIZER_CONFIGURATION_TABLE (see deliverable D3.4).
2. Interpolation of previously computed values when no exact match is found in the lookup table.
The interpolation picks the two points closest to the desired deadline and linearly interpolates
between them to obtain a first-order approximation of num_cores_opt and num_vm_opt.
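Method 2 can be sketched as follows. This is an illustrative Python fragment (names and data layout are assumptions): it returns a cached configuration on an exact deadline match and otherwise linearly interpolates between the two closest cached points.

```python
def interpolate_config(cache, deadline):
    """cache: list of (deadline, num_cores, num_vm) tuples, len >= 2."""
    for d, cores, vms in cache:
        if d == deadline:  # exact match in the lookup table
            return cores, vms
    # pick the two cached points closest to the desired deadline
    (d0, c0, v0), (d1, c1, v1) = sorted(cache, key=lambda e: abs(e[0] - deadline))[:2]
    t = (deadline - d0) / (d1 - d0)  # first-order approximation
    return round(c0 + t * (c1 - c0)), round(v0 + t * (v1 - v0))

# Illustrative cached entries (deadline, num_cores_opt, num_vm_opt):
cache = [(100000, 88, 22), (1000000, 8, 2)]
print(interpolate_config(cache, 550000))  # -> (48, 12)
```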
The full interaction of the WS_OPT_IC and OPT_IC is described in the following diagram:
Figure 50 - Sequence diagram of the WS_OPT_IC service
Note that when the WS_OPT_IC service is called with a deadline that cannot be fulfilled, the
output will be “Deadline too strict”. There are two cases in which the deadline is considered too strict:
1. The elapsed time is greater than or equal to the deadline.
2. The call to OPT_IC with the rescaled deadline fails, returning null values for the number of cores
and the number of VMs. Since the actual call to OPT_IC is asynchronous, this condition is checked
against previously computed values in OPTIMIZER_CONFIGURATION_TABLE to determine
the smallest feasible deadline to complete the job.
If there is no exact correspondence with a data folder (as in the previous release),
OPT_IC will retrieve the folder representing the closest match in terms of number of nodes and cores.
The following table shows some examples:
The following examples assume the log repository contains the folders 6_2_8G_500, 10_2_8G_500,
16_2_8G_500 and 22_2_8G_500:

Input (nodes / cores / RAM / dataset)    Log folder best match
7 / 2 / 8G / 500                         6_2_8G_500
14 / 2 / 8G / 500                        16_2_8G_500
25 / 2 / 8G / 500                        22_2_8G_500
Appendix B.3 Performance and optimization service tutorial
This appendix is a tutorial on the performance prediction and optimization services. It consists
of a technical and an operational section. The technical section first presents the scripts that create the
working environment via a docker file, whose usage is described step by step.
In the operational guide, the general configuration file (wsi_config.xml) is described. The structure of the
log data files is presented next. The remaining subsections describe the usage of the
software tools, specifically the discrete event simulator (dagSim) and the optimization tools (OPT_IC and
OPT_JR).
Technical tutorial
The content of this tutorial is hosted on Github.
Open your terminal and clone the repository with the command:
git clone https://github.com/eubr-bigsea/wsi.git
Database Creation
The repository contains a MySQL script to create the table schema automatically.
Installing a MySQL server is out of the scope of this guide; however, the repository also provides a bash
script to generate a suitable Docker container.
Creation Database Container (Bash Script)
The repository contains a bash script which generates a MySQL Docker container from the library
image `mysql` provided by Docker Hub.
The script automates the following steps:
1. Create a MySQL Docker container (named mysql_bigsea by default). The server instance will listen
on the default port 3306.
2. Create the database.
3. Create the database schema (all needed tables).
4. Create a user (to avoid using root).
5. Grant that user all privileges on the database.
6. Generate sample data in the tables.
The script can be retrieved from https://github.com/eubr-bigsea/wsi.git
From the root directory of the repository, move into the database scripts directory:
cd Database/
Make the bash script executable:
chmod u+x startNewDockerContainer.sh
Before launching the script, it is highly recommended to configure the passwords. To do so, just
open the script with your favourite editor, for example:
vim startNewDockerContainer.sh
The main lines are the following:
MYSQL_ROOT_PASSWORD=4dm1n
MYSQL_USER_PASSWORD=b1g534
If you change a password, remember that you will have to modify the configuration file of the
services accordingly later.
Once the script has been modified, you can execute it:
sudo ./startNewDockerContainer.sh
The output should be something like:
6cd3e2d06d414a2504560ac411c4fd05a46508e3ff630e323d99c0e73ca59a76
Wait for connection... (this operation will take about 15 seconds)
mysql: [Warning] Using a password on the command line interface can be insecure.
mysql: [Warning] Using a password on the command line interface can be insecure.
mysql: [Warning] Using a password on the command line interface can be insecure.
mysql: [Warning] Using a password on the command line interface can be insecure.
Done.
You can check that the Docker container is running with the command:
sudo docker ps
And the output should show the row:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
6cd3e2d06d41 mysql:latest "docker-entrypoint..." About a minute ago Up About a minute
0.0.0.0:3306->3306/tcp mysql_bigsea
The container will expose the default port 3306; the database will therefore be reachable from outside
the container. However, if you need to connect two Docker containers, the procedure is slightly more
involved. Docker automatically assigns an internal IP address to the container, which you will need. To
obtain the IP address, you can execute the command:
sudo docker exec -it mysql_bigsea ip addr
The command shows the network interfaces inside the database container. The output should be something
like:
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default
qlen 1
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
374: eth0@if375: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether 02:42:ac:11:00:03 brd ff:ff:ff:ff:ff:ff
inet 172.17.0.3/16 scope global eth0
valid_lft forever preferred_lft forever
The eth0 interface shows the proper internal IP address, in this example: inet 172.17.0.3.
Creation of the Services Docker Container
Once the database has been built, you are ready to build the final Docker image, which bundles all the
EUBra-BIGSEA services into a single container.
Go to https://github.com/eubr-bigsea/wsi .
From the root of the repository, move into the docker directory:
cd WSI/docker/
Before building the Docker image, you need to properly configure the infrastructure. This process is
quite easy: just edit the configuration file (wsi_config.xml) with your favourite editor, for example:
vim wsi_config.xml
The important properties to check are the following:
1. The listening port of the database:
<entry key="AppsPropDB_port">3306</entry>
<entry key="OptDB_port">3306</entry>
2. The username for the database:
<entry key="OptDB_user">bigsea</entry>
<entry key="AppsPropDB_user">bigsea</entry>
3. The database IP:
<entry key="AppsPropDB_IP">172.17.0.1</entry>
<entry key="OptDB_IP">172.17.0.1</entry>
4. The database name:
<entry key="AppsPropDB_dbName">bigsea</entry>
<entry key="OptDB_dbName">bigsea</entry>
5. The user password:
<entry key="OptDB_pass">b1g534</entry>
<entry key="AppsPropDB_pass">b1g534</entry>
As you can see, these lines concern the database configuration. You need to set them in accordance
with your database configuration.
You can skip all other configuration properties.
If your database is inside a Docker container, you need to configure the container network properly
so that the two containers can communicate with each other. To obtain the internal IP address
of a Docker container, refer to the end of the Creation Database Container subsection.
After you have configured all the database properties, you can build the Docker image. From the same
directory, launch the command:
sudo docker build --no-cache --force-rm -t wsi .
This will build a docker image tagged as wsi. Be patient, the building process may take several minutes.
Once the images have been built, the Web Service can be launched in a docker container.
Just launch the command:
sudo docker run --name wsi_service -d -p 8080:8080 wsi
Useful Commands
1. Display the configuration inside the container:
sudo docker exec -it wsi_service cat wsi_config.xml
2. Restart the container:
sudo docker restart wsi_service
3. Modify the configuration inside the container (with Nano) and restart it:
sudo docker exec -it wsi_service nano wsi_config.xml
sudo docker restart wsi_service
Operational guide
Configuration file
wsi_config.xml is a configuration file located in the home directory. Each line includes the name of a
variable and its value. For example:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
<comment/>
<entry key="RESOPT_HOME">/home/work/OPT_IC</entry>
<entry key="RESULTS_HOME">/home/work/TPCDS500-D_processed_logs</entry>
<entry key="AppsPropDB_port">3306</entry>
<entry key="OptDB_port">3306</entry>
<entry key="DAGSIM_HOME">/home/work/Dagsim</entry>
<entry key="OptDB_user">root</entry>
<entry key="AppsPropDB_user">root</entry>
<entry key="AppsPropDB_IP">localhost</entry>
<entry key="AppsPropDB_dbName">150test</entry>
<entry key="OptDB_tablename">OPT_SESSIONS_RESULTS</entry>
<entry key="OptDB_IP">localhost</entry>
<entry key="UPLOAD_HOME">/home/work/Uploaded</entry>
<entry key="OptDB_pass">test</entry>
<entry key="AppsPropDB_pass">test</entry>
<entry key="OptDB_dbName">150test</entry>
</properties>
This file contains information about the location of tools such as dagSim and OPT_IC, the data log files
and the DB credentials.
Data Log files
The folder containing the data log files is expected to comply with the following format:
root_folder_name (this can be any name you like)
nNodes_nCores_Memory_dataset
application_id
logs
For example:
/home/work/TPCDS500-D_processed_logs
├── 10_2_8G_500
│ ├── Q26
│ │ ├── logs
etc.
An already processed set of data located under /home/work/TPCDS500-D_processed_logs
is installed in the Docker image. If the user wants to use a different set of log files, the new data
structure must comply with the format above. The wsi_config.xml
file must also be set up accordingly (RESULTS_HOME entry).
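To check that a custom log folder complies with the required layout, a small script such as the following sketch could be used. The regular expression for folder names such as 10_2_8G_500 and the helper name are illustrative assumptions:

```python
import os
import re

FOLDER_RE = re.compile(r"^\d+_\d+_\d+G_\d+$")  # e.g. 10_2_8G_500

def check_layout(root):
    """Return a list of layout problems found under the given root folder."""
    problems = []
    for cfg in os.listdir(root):
        if not FOLDER_RE.match(cfg):
            problems.append(f"unexpected folder name: {cfg}")
            continue
        # each configuration folder contains application_id/logs subfolders
        for app in os.listdir(os.path.join(root, cfg)):
            if not os.path.isdir(os.path.join(root, cfg, app, "logs")):
                problems.append(f"missing logs folder under {cfg}/{app}")
    return problems
```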
Spark log parser
The Spark log parser is a script called process_logs.sh, located under /home/work/SparkLogParser-master.
This tool requires (at least in one of the ways it can be used) the presence of the discrete event
simulator dagSim. The Spark log parser can be configured using a file called config.txt, such as:
#APP_REGEX='(app(lication)?[-_][0-9]+[-_][0-9]+)|(\w+-\w+-\w+-\w+-\w+-[0-9]+)'
APP_REGEX='(eventLogs-app(lication)?[-_][0-9]+[-_][0-9]+)|(\w+-\w+-\w+-\w+-\w+-[0-9]+)'
## DAGSIM parameters
## These apply only to process_logs.sh
# DagSim executable
DAGSIM=/home/work/Dagsim/dagsim.sh
# Number of users accessing the system
DAGSIM_USERS=1
# Distribution of the think time for the users.
# This element is a distribution with the same
# format as the task running times
DAGSIM_UTHINKTIMEDISTR_TYPE="exp"
DAGSIM_UTHINKTIMEDISTR_PARAMS="{rate = 0.001}"
# Total number of jobs to simulate
DAGSIM_MAXJOBS=1000
# Coefficient for the Confidence Intervals
# 99% 2.576
# 98% 2.326
# 95% 1.96
# 90% 1.645
DAGSIM_CONFINTCOEFF=1.96
While most of the above parameters refer to dagSim, the APP_REGEX variable:
APP_REGEX='(eventLogs-app(lication)?[-_][0-9]+[-_][0-9]+)|(\w+-\w+-\w+-\w+-\w+-[0-9]+)'
allows processing raw log files with different prefixes (in this case, everything starting with
“eventLogs-app”).
To parse the raw data log files, you can use the following:
./process_logs.sh -p <data_log_foldername>
For example:
./process_logs.sh -p /home/work/TPCDS500-D_processed_logs
This step creates several subfolders in every logs folder.
The second step simulates the content of the log files defined so far by using dagSim. This is achieved
with the command:
./process_logs.sh -s <data_log_foldername>
For example:
./process_logs.sh -s /home/work/TPCDS500-D_processed_logs
The content of the initial folder will now be:
├── 10_2_8G_500
│ ├── Q26
│ │ ├── logs
│ │ │ ├── application_1483347394756_0167
│ │ │ ├── application_1483347394756_0167_csv
│ │ │ │ ├── app_1.csv
│ │ │ │ ├── application_1483347394756_0167.lua
│ │ │ │ ├── application_1483347394756_0167.lua.template
│ │ │ │ ├── ConfigApp_1.txt
│ │ │ │ ├── ConfigApp_1.txt.bak
│ │ │ │ ├── J0S0.txt
│ │ │ │ ├── J1S1.txt
│ │ │ │ ├── J2S2.txt
│ │ │ │ ├── J3S3.txt
│ │ │ │ ├── J3S4.txt
│ │ │ │ ├── J3S5.txt
│ │ │ │ ├── J3S6.txt
│ │ │ │ ├── jobs_1.csv
│ │ │ │ ├── stages_1.csv
│ │ │ │ └── tasks_1.csv
etc.
dagSim
The discrete event simulator dagSim can be invoked by a Web Service. For example:
curl localhost:8080/bigsea/rest/ws/dagsim/5/2/8G/500/Q26
If a sub-folder called 5_2_8G_500 does not exist, the service will consider the best match
among the existing sub-folders.
The logs sub-folder contains several sub-folders; in any case, dagSim always considers the first one
(with respect to the corresponding Lua configuration file).
OPT_IC
The OPT_IC optimizer can be invoked via a Web Service, producing as output the optimal number of cores
and VMs. Both values are then stored in OPTIMIZER_CONFIGURATION_TABLE.
The Web Service can be used in two scenarios: either for a known application or for a new one. To verify
whether the application already exists in the table, use the SQL statement:
select * from APPLICATION_PROFILE_TABLE;
The output produced by the OPT_IC Web Service is stored in
OPTIMIZER_CONFIGURATION_TABLE, whose content can be inspected using the SQL statement:
select * from OPTIMIZER_CONFIGURATION_TABLE;
This query may return, for example, the following result:
mysql> SELECT * FROM OPTIMIZER_CONFIGURATION_TABLE;
+----------------+--------------+----------+---------------+------------+
| application_id | dataset_size | deadline | num_cores_opt | num_vm_opt |
+----------------+--------------+----------+---------------+------------+
| Q26 | 500 | 100000 | 88 | 22 |
| Q26 | 500 | 1000000 | 8 | 2 |
| Q52 | 500 | 200000 | 42 | 11 |
| Q52 | 500 | 800000 | 11 | 3 |
+----------------+--------------+----------+---------------+------------+
If the APPLICATION_PROFILE_TABLE includes a row for the considered application, the user can invoke
OPT_IC by calling the Web Service as follows:
curl localhost:8080/bigsea/rest/ws/resopt/application_id/dataset/deadline
For example:
curl localhost:8080/bigsea/rest/ws/resopt/Q52/500/800000
If an identical configuration (application_id, dataset, deadline) exists in the DB, the Web Service
retrieves the output values without performing further calculations; otherwise, a linear extrapolation
based on the two closest configurations is performed. Finally, the optimizer is asynchronously invoked,
storing the actual output values in the DB for future reference.
Note that if the application is used for the first time, a preliminary procedure performed by the shell
script initialize.sh must be followed. The script automatically implements the following steps:
1. Verify that APPLICATION_PROFILE_TABLE includes the profile for the application session; if not,
the script exits with an error;
2. Add two rows to OPTIMIZER_CONFIGURATION_TABLE (with mock values for dataset_size,
deadline, num_cores_opt and num_vm_opt);
3. Execute the Web Service twice, invoking OPT_IC as shown earlier (the application_id must be the
one considered; the dataset is the one used in the log data files, while the choice of the deadline
is left to the user): these steps store in OPTIMIZER_CONFIGURATION_TABLE two rows
containing the number of cores and VMs for a specific configuration;
4. At this point, OPTIMIZER_CONFIGURATION_TABLE should contain 4 rows for the new
application (2 mock rows and 2 real rows);
5. Before ending, the script removes the 2 mock rows created at step 2.
The script is invoked as follows:
./initialize.sh <application_id>
For example:
./initialize.sh query52
OPT_IC can be invoked by using a shell script as follows:
./script.sh <application_id> <dataset_size> <deadline>
For example:
./script.sh query52 1000 200000
OPT_JR
OPT_JR is a tool that determines the best application configurations in terms of number of cores,
minimising the tardiness of soft-deadline applications under heavy load conditions. OPT_JR applies a
local search algorithm to explore each application’s neighborhood and find the best core configuration.
The process is iterative and is performed until the applications’ configuration cannot be refined any further.
The output of the algorithm is stored in OPT_SESSION_RESULTS_TABLE.
OPT_JR uses a performance predictor that can be either dagSim or Lundstrom. Performance is greatly
improved by initially estimating a list of candidate applications and then
applying the actual predictor only to them. Results from the predictor are cached in
PREDICTOR_CACHE_TABLE and reused when necessary.
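The local-search idea can be illustrated with the following toy Python sketch: a stand-in predictor replaces dagSim/Lundstrom, and one core at a time is moved between applications, keeping a move only when it reduces the total tardiness. Everything here is an illustrative assumption, not the OPT_JR implementation:

```python
def predict(app, cores):
    # Toy stand-in for the performance predictor: execution time
    # inversely proportional to the assigned cores.
    return app["work"] / cores

def total_tardiness(assignment, apps):
    # Sum of deadline violations over all applications.
    return sum(max(0.0, predict(a, assignment[a["id"]]) - a["deadline"])
               for a in apps)

def local_search(apps, assignment, iterations=10):
    best = total_tardiness(assignment, apps)
    for _ in range(iterations):
        improved = False
        for src in apps:
            for dst in apps:
                if src is dst or assignment[src["id"]] <= 1:
                    continue
                # try moving one core from src to dst
                assignment[src["id"]] -= 1
                assignment[dst["id"]] += 1
                cost = total_tardiness(assignment, apps)
                if cost < best:
                    best, improved = cost, True
                else:  # revert the move
                    assignment[src["id"]] += 1
                    assignment[dst["id"]] -= 1
        if not improved:  # neighbourhood exhausted: stop
            break
    return assignment, best
```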
OPT_JR can be invoked as follows:
./OPT_JR -f=<filename> -n=<N> -k=<number> -d=<y|n> -i=<ITERATIONS> -s=<PREDICTOR> -g=<Y|N>
Where:
1. <filename> is a file including one or more application profiles in CSV format, stored in a folder
whose path is specified in the global configuration file wsi_config.xml (“UPLOAD_HOME”);
2. <N> is the maximum number of cores available;
3. -d specifies the debug option. When enabled, the tool works in verbose mode,
printing a series of messages to be used in a debug session;
4. <ITERATIONS> specifies the maximum number of iterations for the local search algorithm;
5. <PREDICTOR> is either “dagSim” or “lundstrom”;
6. -g specifies whether the global objective function is calculated at each iteration, which slows down
performance. This option can be used for debugging purposes.
For example:
./OPT_JR -f="Test1.csv" -n=220 -k=0 -d=y -i=10 -s=dagSim -g=Y
OPT_JR requires the global configuration file wsi_config.xml and a performance predictor (dagSim or
Lundstrom). The web service which provides a REST API to OPT_JR is called WS_OPT_JR.
The following script shows an example of how the WS_OPT_JR web service can be invoked:
#!/bin/bash
WSI_IP="localhost"
WSI_PORT="8080"
APP_IDS=("application_1483347394756_0" \
"application_1483347394756_1")
NUM_CORES=$(( (RANDOM % 1001) + 1 ))
# Open new session
echo "Open a new session..."
SID=$(curl "http://${WSI_IP}:${WSI_PORT}/WSI/session/new")
echo "Session ID ${SID}"
# Set the number of calls
NUMCALLS=${#APP_IDS[@]}
echo "Setting ${NUMCALLS} calls and ${NUM_CORES} cores"
curl -X POST \
  "http://${WSI_IP}:${WSI_PORT}/WSI/session/setcalls?SID=${SID}&ncalls=${NUMCALLS}&ncores=${NUM_CORES}"
echo ""
# Set app param for each app
for app in "${APP_IDS[@]}"; do
  echo "Setting application: ${app}"
  curl -X POST -H "Content-Type: text/plain" -d "${app} 3.14 3.14" \
    "http://${WSI_IP}:${WSI_PORT}/WSI/session/setparams?SID=${SID}"
  echo ""
done
Appendix C - Controller Service Guide
Installation and configuration
On the virtual machine where you want to install the controller, follow the steps below:
1. Install git and clone the EUBra-BIGSEA Controller repository
$ sudo apt-get install git
$ git clone https://github.com/bigsea-ufcg/bigsea-controller
2. Access the EUBra-BIGSEA Controller folder and run the setup script
$ cd bigsea-controller/
$ ./setup.sh # You must run this command as superuser to install some requirements
3. Write the configuration file (controller.cfg) with the required information, following
controller.cfg.template.
4. Start the service.
$ ./run.sh
Creating a new controller plugin
1. Write a new python class
It must implement the methods __init__, start_application_scaling, stop_application_scaling and status.
1. __init__(self, application_id, parameters)
Creates a new controller which scales the given application using the given parameters.
application_id: id of the application to scale.
parameters: a dictionary containing the scaling parameters.
Example:
{
    "instances": [instance_id_0, instance_id_1, ..., instance_id_n],
    "check_interval": 5,
    "trigger_down": 10,
    "trigger_up": 10,
    "min_cap": 10,
    "max_cap": 100,
    "actuation_size": 25,
    "metric_rounding": 2,
    "actuator": "basic",
    "metric_source": "monasca",
    "application_type": "os_generic"
}
2. start_application_scaling(self)
Starts scaling for an application.
This method is used as a run method by a thread.
3. stop_application_scaling(self)
Stops scaling of an application.
4. status(self)
Returns information on the status of the application scaling, normally as a string.
Example:
class MyController:
    def __init__(self, application_id, parameters):
        # set things up
        pass

    def start_application_scaling(self):
        # scaling logic
        pass

    def stop_application_scaling(self):
        # stop logic
        pass

    def status(self):
        # status logic
        return status_string
2. Add the class to the controller project, in the plugins directory.
3. Edit Controller_Builder
Add a new condition to get_controller and instantiate the plugin under the new condition.
Example:
...
elif name == "mycontroller":
# prepare parameters
return MyController(application_id, parameters)
...
The string used in the condition must be passed in the request to start scaling as the value of the “plugin”
parameter.
The example controller plugins contained in the repository are composed of two classes: the
Controller and the Alarm. The Alarm contains the logic used to adjust the amount of resources allocated
to applications. The Controller dictates the pace of the scaling process: it controls when the Alarm is
called to check the application state and when it is necessary to wait.
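The Controller/Alarm split described above can be sketched as follows; the class internals are illustrative assumptions that mirror the plugin interface, not the repository code:

```python
import threading

class Alarm:
    def check_application_state(self):
        # decision logic: read metrics and trigger the actuator if needed
        pass

class PacedController:
    def __init__(self, application_id, parameters):
        self.application_id = application_id
        self.check_interval = parameters["check_interval"]
        self.alarm = Alarm()
        self._stop = threading.Event()

    def start_application_scaling(self):
        # run method of the controller thread: check, then wait
        while not self._stop.is_set():
            self.alarm.check_application_state()
            self._stop.wait(self.check_interval)

    def stop_application_scaling(self):
        self._stop.set()

    def status(self):
        return "stopped" if self._stop.is_set() else "running"
```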
Creating a new actuator plugin
1. Write a new python class
It must extend Actuator and implement the methods prepare_environment, adjust_resources,
get_allocated_resources and get_allocated_resources_to_cluster. See the example below:
class MyActuator(Actuator):
    def __init__(self, params):
        # set things up
        pass

    def prepare_environment(self, vm_data):
        # prepare environment logic
        pass

    def adjust_resources(self, vm_data):
        # adjust resources logic
        pass

    def get_allocated_resources(self, vm_id):
        # get allocated resources logic
        pass

    def get_allocated_resources_to_cluster(self, vms_ids):
        # get allocated resources to cluster logic
        pass
2. Add the class to the controller project, in the plugins directory.
3. Edit Actuator_Builder
Add a new condition to get_actuator and instantiate the plugin under the new condition. See the example
below:
...
elif name == "myactuator":
# prepare parameters
return MyActuator(params)
...
The string used in the condition must be passed in the request to start scaling, using the Controller’s
REST API, as the value of the “actuator” parameter.
Appendix D - Load Balancer Guide
This appendix will help you deploy the EUBra-BIGSEA WP3 Load Balancer service on a machine with a
fresh Linux installation, and also provides information on how to configure the different heuristics.
Please make sure that the machine on which you are installing the service has access to the KVM
hypervisors of the infrastructure hosts and to the following OpenStack services: Keystone, Nova, and
Monasca. The current version only supports OpenStack infrastructures and assumes shared storage for
the live migration process.
Installation
You can install the Load Balancer on any physical or virtual machine; the minimal configuration is
described below.
Minimal configuration:
1. OS: Ubuntu 14.04
2. CPU: 1 core
3. Memory: 2GB of RAM
4. Disk: there are no disk requirements.
1. Update and upgrade your machine
$ sudo apt-get update && sudo apt-get upgrade
2. Install the necessary packages (python dependencies, git, and pip)
$ sudo apt-get install python-setuptools python-dev build-essential
$ sudo easy_install pip
$ sudo apt-get install git
$ sudo apt-get install libssl-dev
3. Clone the Load Balancer repository
$ git clone https://github.com/bigsea-ufcg/bigsea-loadbalancer.git
4. Access the Load Balancer folder and run the install script.
$ cd bigsea-loadbalancer/
# Some requirements require sudo
$ sudo ./install.sh
Configuration
The configuration file for the Load Balancer is currently composed of the following sections: monitoring,
heuristic, infrastructure, openstack, and optimizer. Below we describe the template for the
configuration file, with an explanation of each section and its attributes.
# Section to configure information used to access the Monasca.
[monitoring]
# Username that will be used to authenticate
username=<@username>
# Password that will be used to authenticate
password=<@password>
# The project name that will be used to authenticate
project_name=<@project_name>
# The authentication URL that Monasca uses.
auth_url=<@auth_url>
# Monasca api version
monasca_api_version=v2_0
# Section to configure all heuristic information.
[heuristic]
# The filename for the module that is located in /loadbalancer/service/heuristic/
# without .py extension
module=<module_name>
# The class name that is inside the given module, this class should implement BasicHeuristic
class=<class_name>
#Number of seconds before executing the heuristic again
period=<value>
# A float value that represents the ratio of the number of CPUs in the hosts (overcommit factor)
cpu_ratio=0.5
# An integer that represents the number of rounds that an instance needs to wait before being migrated again
# Each round represents an execution of the loadbalancer
wait_rounds=1
# All information about the infrastructure to which the Load Balancer will have access
[infrastructure]
# The user that has access to each host
user=<username>
#List of full hostnames of servers that the loadbalancer will manage (separated by comma).
#e.g compute1.mylab.edu.br
hosts=<host>,<host2>
#The key used to access the hosts
key=<key_path>
#The type of IaaS provider on your infrastructure e.g OpenStack, OpenNebula
provider=OpenStack
# Section to configure OpenStack credentials used for Keystone and Nova
[openstack]
username=<@username>
password=<@password>
user_domain_name=<@user_domain_name>
project_name=<@project_name>
project_domain_name=<@project_domain_name>
auth_url=<@auth_url>
# Section to configure Optimizer services
[optimizer]
# The filename for the module that is located in /loadbalancer/service/heuristic/
# without .py extension
module=<optimizer_module_name>
# The class name that is inside the given module, this class should implement BaseOptimizer
class=<optimizer_class_name>
# The url to make the request to the optimizer service
request_url=<http://url/...>
# The type of the request: GET, POST, etc...
request_type=<type>
# Parameters to be used with the request url
request_params=<params>
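As a minimal illustration, the sections above can be consumed with Python's standard configparser; the helper below is an assumption for illustration, not part of the Load Balancer code:

```python
from configparser import ConfigParser

def load_heuristic_settings(path):
    """Read the [heuristic] section of a Load Balancer configuration file."""
    cfg = ConfigParser()
    cfg.read(path)
    return {
        "module": cfg.get("heuristic", "module"),
        "class": cfg.get("heuristic", "class"),
        "period": cfg.getint("heuristic", "period"),
        "cpu_ratio": cfg.getfloat("heuristic", "cpu_ratio"),
        "wait_rounds": cfg.getint("heuristic", "wait_rounds"),
    }
```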
Heuristics Configuration
Below we give the configuration of the heuristic section for each heuristic available in
our repository; examples27 can also be found in the repository.
27 https://github.com/bigsea-ufcg/bigsea-loadbalancer/tree/master/examples
BalanceInstancesOS heuristic configuration
[heuristic]
# The filename for the module that is located in /loadbalancer/service/heuristic/
# without .py extension
module=instances
# The class name that is inside the given module, this class should implement BasicHeuristic
class=BalanceInstancesOS
#Number of seconds before executing the heuristic again
period=600
# A float value that represents the ratio of the number of CPUs in the hosts (overcommit factor)
cpu_ratio=1
# An integer that represents the number of rounds that an instance needs to wait before being migrated again
# Each round represents an execution of the loadbalancer
wait_rounds=1
CPUCapAware heuristic configuration
[heuristic]
# The filename for the module that is located in /loadbalancer/service/heuristic/
# without .py extension
module=cpu_capacity
# The class name inside the given module; this class should implement BasicHeuristic
class=CPUCapAware
#Number of seconds before executing the heuristic again
period=600
# A float value that represents the ratio of number of CPUs in the hosts.
cpu_ratio=1
# An integer that represents the number of rounds that an instance needs to wait before being migrated again
# Each round represents an execution of the loadbalancer
wait_rounds=1
SysbenchPerfCPUCap heuristic configuration
[heuristic]
# The filename for the module that is located in /loadbalancer/service/heuristic/
# without .py extension
module=benchmark_performance
# The class name inside the given module; this class should implement BasicHeuristic
class=SysbenchPerfCPUCap
#Number of seconds before executing the heuristic again
period=600
# A float value that represents the ratio of number of CPUs in the hosts.
cpu_ratio=1
# An integer that represents the number of rounds that an instance needs to wait before being migrated again
# Each round represents an execution of the loadbalancer
wait_rounds=1
Optimizer Configuration
The only available plugin is HSOptimizer, which uses the HSOptimizer28 service. The optimizer
section of the configuration file is shown below. The request_params option should receive a
dictionary with username, project_id, password and auth_ip to work properly.
28
https://github.com/bigsea-ufcg/bigsea-hsoptimizer
[optimizer]
# The filename of the module that is located in /loadbalancer/service/optimizer/
# without the .py extension
module=hsoptimizer
# The class name inside the given module; this class should implement BaseOptimizer
class=HSOptimizer
# The url to make the request to the optimizer service
request_url=http://IP:1517/optimizer/get_preemtible_instances
# The type of the request: GET, POST, etc...
request_type=GET
# Parameters to be used with the request url
request_params={'username': 'myuser', 'project_id': '98sab49kdo1280zlapq', 'domain': 'mydomain', 'password': 'mypassword', 'auth_ip': 'http://mycloud.domain.com'}
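As an illustration of how a plugin might consume this section, the sketch below parses the raw option strings into usable values; request_params is written as a Python dict literal in the file, so it can be parsed safely with ast.literal_eval. The function name and parsing approach are our assumptions for illustration, not the project's actual code.

```python
import ast

def parse_optimizer_section(section):
    # section: a dict-like mapping of the raw [optimizer] option strings,
    # e.g. as returned by ConfigParser for that section.
    params = ast.literal_eval(section['request_params'])  # dict literal -> dict
    url = section['request_url']
    method = section['request_type'].upper()
    return url, method, params
```

A real plugin would then hand these values to an HTTP client, e.g. requests.request(method, url, params=params).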
Execution
To execute the Load Balancer you need to access the directory where you have cloned the repository
and set the PYTHONPATH environment variable.
$ cd bigsea-loadbalancer/
$ export PYTHONPATH=":"`pwd`
Now you can run using a default configuration file name or by giving the configuration file name.
$ python loadbalancer/cli/main.py
$ python loadbalancer/cli/main.py -conf load_balancer.cfg
Creating a new Heuristic
To create a new heuristic you just need to follow the steps below:
1. Create a python module file in loadbalancer/service/heuristic directory
2. In the module file, create a class that inherits BaseHeuristic class from
loadbalancer/service/heuristic/base.py
from loadbalancer.service.heuristic.base import BaseHeuristic


class MyNewHeuristic(BaseHeuristic):

    def __init__(self, **kwargs):
        # initialize heuristic-specific state here
        ...

    def collect_information(self):
        # gather metrics about the hosts and instances
        return ...

    def decision(self):
        # decide which instances to migrate and to which hosts
        ...
3. You must override collect_information and decision methods in your class
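To make the skeleton above concrete, here is a toy, self-contained heuristic that moves one instance from the most to the least loaded host. The BaseHeuristic stand-in and the data layout are hypothetical simplifications for illustration only; the real base class lives in loadbalancer/service/heuristic/base.py and receives live infrastructure data.

```python
class BaseHeuristic(object):
    # Hypothetical stand-in so the example runs on its own.
    def __init__(self, **kwargs):
        pass


class FewestInstancesHeuristic(BaseHeuristic):

    def __init__(self, **kwargs):
        BaseHeuristic.__init__(self, **kwargs)
        # instances_per_host: {hostname: [instance ids]}; normally this would
        # be collected from the infrastructure rather than passed in.
        self.instances_per_host = kwargs.get('instances_per_host', {})

    def collect_information(self):
        return self.instances_per_host

    def decision(self):
        info = self.collect_information()
        busiest = max(info, key=lambda h: len(info[h]))
        idlest = min(info, key=lambda h: len(info[h]))
        if len(info[busiest]) - len(info[idlest]) <= 1:
            return []  # already balanced, nothing to migrate
        # migrate one instance from the busiest to the idlest host
        return [(info[busiest][0], busiest, idlest)]
```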
Creating a new Optimizer Plugin
To create a new optimizer plugin you just need to follow the steps below:
1. Create a python module file in the loadbalancer/service/optimizer directory
2. In the module file, create a class that inherits the BaseOptimizer class from
loadbalancer/service/optimizer/base.py
from loadbalancer.service.optimizer.base import BaseOptimizer


class MyOptimizer(BaseOptimizer):

    def __init__(self, **kwargs):
        # initialize plugin-specific state here
        ...

    def request_instances(self):
        # query the optimizer service for candidate instances
        return ...

    def decision(self, **kwargs):
        # decide what to do with the returned instances
        return ...
3. You must override request_instances and decision methods in your class
Appendix E - Monitoring-Hosts Daemon
This appendix will help you to deploy the monitoring-host daemon in the hosts of your infrastructure.
That daemon is responsible for executing the Sysbench benchmark and publishing the results into
Monasca, so the Load Balancer can use the metrics when the selected heuristic is SysbenchPerfCPUCap.
Installation
You can install the monitoring-hosts daemon in any physical or virtual machine; the minimal
configuration is described below.
Minimal configuration:
1. OS: Ubuntu 14.04
2. CPU: 1 core
3. Memory: 2GB of RAM
4. Disk: there are no disk requirements
1. Update and upgrade your machine
$ sudo apt-get update && sudo apt-get upgrade
2. Install the necessary packages (Python dependencies, git, pip, and sysbench)
$ sudo apt-get install python-setuptools python-dev build-essential
$ sudo easy_install pip
$ sudo apt-get install git
$ sudo apt-get install sysbench
3. Clone the monitoring-hosts repository
$ git clone https://github.com/bigsea-ufcg/monitoring-hosts.git
4. Access the monitoring-hosts folder and install the requirements
$ cd monitoring-hosts/
# Some requirements require sudo
$ sudo pip install -r requirements.txt --no-cache-dir
Configuration
The configuration file for the monitoring-hosts daemon defines which benchmarks it will run; below
we show the default configuration. The file only needs a DEFAULT section containing the necessary
information for the daemon to run. The backend parameter defines whether the benchmark results
will be written to files in output_dir or published into Monasca.
[DEFAULT]
type=CPU
name=sysbench
# Full path to access the sysbench.json file that is in the sample directory
parameters=/path/sysbench.json
output_dir=/tmp
backend=
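For reference, the DEFAULT section can be read with Python's standard configparser module; the sketch below is a minimal illustration using the field names from the sample above (the function name is ours, and a real deployment would read the file from disk with parser.read(path) instead of a string):

```python
from configparser import ConfigParser

def load_daemon_config(text):
    parser = ConfigParser()
    parser.read_string(text)
    defaults = parser.defaults()
    # backend left empty means results are written to output_dir, not Monasca
    return {
        'type': defaults.get('type'),
        'name': defaults.get('name'),
        'parameters': defaults.get('parameters'),
        'output_dir': defaults.get('output_dir'),
        'backend': defaults.get('backend', ''),
    }
```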
To use Monasca as backend set the backend parameter and add a new section called monasca with all
necessary authentication parameters.
[DEFAULT]
type=CPU
name=sysbench
# Full path to access the sysbench.json file that is in the sample directory
parameters=/path/sysbench.json
output_dir=/tmp
backend=OS_MONASCA
[monasca]
username=<@username>
password=<@password>
project_name=<@project_name>
auth_url=<@auth_url>
monasca_api_version=2_0
It is possible to configure two parameters for the execution of sysbench with the CPU test,
number_of_threads and max_prime; those parameters are in the sysbench.json file located in the
sample directory of the repository.
{
"number_of_threads": ["8", "4", "2", "1"],
"max_prime": "25000"
}
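A minimal sketch of what the daemon does with these values: run one sysbench CPU test per thread count and extract the reported timing. The command flags assume the classic sysbench 0.4 CLI shipped with Ubuntu 14.04 (newer sysbench releases use --threads and a positional test name), and the output line format may vary between versions.

```python
import re
import subprocess

def sysbench_cpu_command(threads, max_prime):
    # One benchmark invocation per entry in number_of_threads
    return ['sysbench', '--test=cpu', '--num-threads=%s' % threads,
            '--cpu-max-prime=%s' % max_prime, 'run']

def parse_total_time(output):
    # The CPU test reports a line such as "    total time:    10.0042s"
    match = re.search(r'total time:\s*([\d.]+)s', output)
    return float(match.group(1)) if match else None

def run_cpu_test(threads, max_prime):
    # Requires sysbench to be installed on the host
    out = subprocess.check_output(sysbench_cpu_command(threads, max_prime))
    return parse_total_time(out.decode('utf-8', 'replace'))
```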
Execution
The daemon offers a CLI that you can use to start, stop or restart the execution. Below we show the
usage of each option.
usage: python monitoring.py [-h] {start,restart,stop} ...

Monitoring Host Daemon

positional arguments:
  {start,restart,stop}  Operation with the monitoring host daemon. Accepts any
                        of these values: start, stop, restart
    start               Starts python monitoring.py daemon
    restart             Restarts python monitoring.py daemon
    stop                Stops python monitoring.py daemon

optional arguments:
  -h, --help            show this help message and exit
The start option can receive two arguments, time_interval and configuration; the only required
argument is configuration, since time_interval has zero as its default value.
ubuntu@sysbench:~/monitoring-hosts$ python monitoring/run.py start -h
usage: python monitoring.py start [-h] [-time TIME_INTERVAL] -conf
CONFIGURATION
optional arguments:
-h, --help show this help message and exit
-time TIME_INTERVAL, --time_interval TIME_INTERVAL
Number of seconds to wait before running the Monitoring
Daemon again.(Integer)
-conf CONFIGURATION, --configuration CONFIGURATION
Filename with all benchmark information
$ python monitoring/run.py start -conf my.cfg
$ python monitoring/run.py start -conf my.cfg -time 120
The stop option requires the configuration argument. Use this option to stop the execution of the
daemon.
usage: python monitoring.py stop [-h] -conf CONFIGURATION
Stops the daemon if it is currently running.
optional arguments:
-h, --help show this help message and exit
-conf CONFIGURATION, --configuration CONFIGURATION
Filename with all benchmark information
$ python monitoring/run.py stop -conf my.cfg
The restart option can receive two arguments, time_interval and configuration; the only required
argument is configuration, since time_interval has zero as its default value. You can use it to restart
the daemon with a new configuration file and a new time interval.
usage: python monitoring.py restart [-h] [-time TIME_INTERVAL] -conf
CONFIGURATION
optional arguments:
-h, --help show this help message and exit
-time TIME_INTERVAL, --time_interval TIME_INTERVAL
Number of seconds to wait before running the Monitoring
Daemon again. (Integer)
-conf CONFIGURATION, --configuration CONFIGURATION
Filename with all benchmark information
$ python monitoring/run.py restart -conf my.cfg
$ python monitoring/run.py restart -conf my.cfg -time 120
Appendix F - EUBra-BIGSEA Applications Submission Guide
This appendix will guide you on how to execute some EUBra-BIGSEA applications (BTR and BULMA) in a
WP3 infrastructure configured with the following components: Broker API, Monitor, Controller,
Authorization, and Optimizer.
The submissions of WP4 applications must follow the configuration file described below:
[manager]
ip = # Broker IP
port = # Broker port
plugin = # To WP4 submissions, the plugin is sahara
cluster_size = # Cluster size
flavor_id = # Flavor ID
image_id = # Image ID
bigsea_username = # AAAaS username
bigsea_password = # AAAaS password

[plugin]
opportunistic = # Details in section “Integration with Opportunistic Services”
args = # Details in section 4.1.1 and 4.1.2 of this appendix
job_binary_url = # Job binary url
dependencies = # Dependencies to run app
main_class = # Main class of a jar application like EMaaS
job_template_name = # Template name of job (via sahara submission)
job_binary_name = # Binary name (via sahara submission)
plugin_app = # Plugin of application (via sahara submission)
openstack_plugin = # Openstack plugin (via sahara submission)
plugin_version = # Plugin version (via sahara submission)
job_type = # Job type (via sahara submission)
expected_time = # Expected time (QoS metric)
collect_period = # Collect period (QoS metric)
master_ng = # Master node group
slave_ng = # Slave node group
opportunistic_slave_ng = # Opportunistic slave node group
net_id = # Network ID

[scaler]
starting_cap = # Starting CAP
scaler_plugin = # Scaler plugin
actuator = # Actuator type
metric_source = # Metric source
application_type = # Application type
check_interval = # Check interval
trigger_down = # Trigger down
trigger_up = # Trigger up
min_cap = # Minimum CAP
max_cap = # Maximum CAP
actuation_size = # Actuation size
metric_rounding = # Metric rounding
BTR
The BTR (Best Trip Recommender) is a multi-modal journey recommendation system which predicts
future trip characteristics, such as duration and crowdedness, and uses the predictions to rank the trips
for the user.
The BTR execution is divided into two parts: pre-processing and training. The parts must be executed
sequentially for a successful execution. Each part requires changes in specific fields of the configuration
file.
The first change is to edit the parameter job_binary_url in the plugin section of the configuration
file. We only consider the usage of HDFS paths to submit these applications. The BTR pre-processing29
and training30 application files can be found in the BTR repository31.
After this, you need to edit the field args of the configuration file in the format described below:
1. Pre-processing: <btr-input-folder> <btr-pre-processing-output-folder> <btr-route-stop-output-folder>
2. Training: <btr-pre-processing-output-folder> <train-info-output-file> <duration-model-output-folder> <pipeline-output-folder>
Last, fill the field dependencies with com.databricks:spark-csv_2.10:1.5.0.
Example of BTR pre-processing specific fields:
…
[plugin]
...
29
https://github.com/analytics-ufcg/best-trip-recommender-spark/blob/master/utils/preprocessing_subsequent_stops.py
30 https://github.com/analytics-ufcg/best-trip-recommender-spark/blob/master/src/jobs/train_btr_2.0.py
31 https://github.com/analytics-ufcg/best-trip-recommender-spark
args = hdfs://url/BTR/input/ hdfs://url/BTR/output/preprocessing/ hdfs://url/BTR/output/routestops/
dependencies = com.databricks:spark-csv_2.10:1.5.0
...
job_binary_url = hdfs://url/BTR/preprocessing_subsequent_stops.py
...
Example of BTR training specific fields:
…
[plugin]
...
args = hdfs://url/BTR/output/preprocessing/ hdfs://url/BTR/output/ hdfs://url/BTR/output/duration/ hdfs://url/BTR/output/pipeline/
dependencies = com.databricks:spark-csv_2.10:1.5.0
...
job_binary_url = hdfs://url/BTR/train_btr_2.0.py
...
BULMA
BULMA, as discussed in Section 12.1, is an unsupervised technique capable of matching a bus trajectory
with the "correct" shape, considering the cases in which there are multiple shapes for the same route.
First, you need a generated jar of the BULMA project on a remote HDFS or in a Swift container. You can
find the files needed to generate a jar of BULMA in the EMaaS repository32. Here there is no limitation on
the path being in Swift or HDFS; both work. After you have a BULMA jar file in a Swift or HDFS path, the
next step is to edit the field job_binary_url with the path to the application file.
32
https://github.com/eubr-bigsea/EMaaS
Afterwards, you need to edit the field args of the configuration file in the format described below:
1. <shape-file> <gps-file-or-folder> <output-folder> <partitioning-number>
Now, fill the field main_class with BULMAversion20.RunBULMAv2. If you are using Swift paths,
you must also set the fields job_template_name, job_binary_name, openstack_plugin,
plugin_version, and job_type.
Example of BULMA specific fields:
…
[plugin]
...
args = hdfs://url/BULMA/input/shapes.csv hdfs://url/BULMA/input/gps/ hdfs://url/BULMA/output/ 100
main_class = BULMAversion20.RunBULMAv2
job_template_name = EMaaS
job_binary_name = EMaaS
job_binary_url = hdfs://url/BULMA/EMaaSv2.jar
openstack_plugin = spark
plugin_version = 2.1.0
job_type = Spark
...
Submission
With the configuration file of the application you want to run filled in correctly, you just need to run
the client33.
33
https://github.com/bigsea-ufcg/bigsea-manager/blob/master/client/sahara/client.py
Example of execution:
$ python client.py <config-file-of-application-you-want-to-run>
Note: you can also request the status of the execution through the Broker API.
Appendix G - Load Balancer Results
In this Appendix we present the analyses of the other two scenarios (initial cap 25% and 50%). The first
scenario, with initial CPU CAP set to 75%, was detailed in Section 9.
Scenario 2 - Initial Cap 25%
In this scenario all virtual machines start with an initial cap of 25%. Table 40 depicts the times when
the Load Balancer executed migrations for each heuristic in all five executions; for
BalanceInstancesOS the migrations occurred in the first and second execution of the Load Balancer,
while for the other heuristics the migrations occurred only in the third execution. Analyzing the
total_consumption (Figures 51 through 53) and total_cap (Figures 54 through 56) graphs, we can see
that after each execution the Load Balancer could respect the cpu_ratio value used.
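The cpu_ratio check mentioned above can be sketched as a simple per-host comparison of the summed instance caps against the allowed overcommit (a minimal sketch of our reading of the heuristics' behavior, not the project's exact code):

```python
def hosts_over_ratio(used_cap, host_cores, cpu_ratio):
    # used_cap: host -> sum of the CPU caps of its instances (in cores)
    # host_cores: host -> number of physical cores on that host
    # A host violates the constraint when its allocated cap exceeds
    # cpu_ratio * cores; such hosts are candidates to migrate instances away.
    return [host for host, cap in used_cap.items()
            if cap > cpu_ratio * host_cores[host]]
```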
The CPU utilization of the hosts was collected; the data is shown in Figures 57 through 59. The
common pattern in the graphs is a decrease in usage on the overloaded hosts when the Load
Balancer starts the execution and an increase in utilization on the host that receives the migrated
virtual machines; the peak of utilization right after the first execution is caused by the removal and
creation of virtual machines on the hosts.
Table 40 - Time when migrations started.
Heuristics          Execution 1     Execution 2     Execution 3     Execution 4     Execution 5
BalanceInstancesOS  14:25, 14:31    14:46, 14:52    16:54, 17:01    17:59, 18:04    18:19, 18:24
CPUCapAware         19:04           19:34           19:57           20:18           20:39
SysbenchPerfCPUCap  21:59           22:24           22:45           23:06           23:28
Figure 51 - Heuristic BalanceInstancesOS - Total Consumption per host.
Figure 52 - Heuristic CPUCapAware - Total Consumption per host.
Figure 53 - Heuristic SysbenchPerfCPUCap - Total Consumption per host.
Figure 54 - Heuristic BalanceInstancesOS - Total Cap per host.
Figure 55 - Heuristic CPUCapAware - Total Cap per host.
Figure 56 - Heuristic SysbenchPerfCPUCap - Total Cap per host.
Figure 57- Heuristic BalanceInstancesOS - Host %CPU during executions.
Figure 58 - Heuristic CPUCapAware - Host %CPU during executions.
Figure 59 - Heuristic SysbenchPerfCPUCap - Host %CPU during executions.
Scenario 3 - Initial Cap 50%
In this scenario all virtual machines start with an initial cap of 50%. Table 41 shows the times when
the Load Balancer executed migrations for each heuristic in all five executions; for
BalanceInstancesOS the migrations occurred in the first and second execution of the Load Balancer,
while for the other heuristics the migrations occurred in the first and third execution. Analyzing the
total_consumption (Figures 60 through 62) and total_cap (Figures 63 through 65) graphs, we can see
that after each execution the Load Balancer could respect the cpu_ratio value used.
The CPU utilization of the hosts was collected; the data is shown in Figures 66 through 68. The
common pattern in the graphs is a decrease in usage on the overloaded hosts when the Load
Balancer starts the execution and an increase in utilization on the host that receives the migrated
virtual machines; the peak of utilization right after the first execution is caused by the removal and
creation of virtual machines on the hosts.
Table 41 - Time when migrations started.
Heuristics          Execution 1     Execution 2     Execution 3     Execution 4     Execution 5
BalanceInstancesOS  19:27, 19:33    19:52, 19:58    20:13, 20:19    20:36, 20:44    21:11, 21:17
CPUCapAware         20:57           21:19           21:40           22:16           22:37
SysbenchPerfCPUCap  21:13           21:33           21:57           22:21           22:43
Figure 60 - Heuristic BalanceInstancesOS - Total Consumption per host.
Figure 61 - Heuristic CPUCapAware - Total Consumption per host.
Figure 62 - Heuristic SysbenchPerfCPUCap - Total Consumption per host.
Figure 63 - Heuristic BalanceInstancesOS - Total Cap per host.
Figure 64 - Heuristic CPUCapAware - Total Cap per host.
Figure 65 - Heuristic SysbenchPerfCPUCap - Total Cap per host.
Figure 66 - Heuristic BalanceInstancesOS - Host %CPU during executions.
Figure 67 - Heuristic CPUCapAware - Host %CPU during executions.
Figure 68 - Heuristic SysbenchPerfCPUCap - Host %CPU during executions.