DEGREE PROJECT IN COMPUTER ENGINEERING, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2016
Increasing the Throughput of a Node.js Application Running on the Heroku Cloud App Platform
NIKLAS ANDERSSON
ALEKSANDR CHERNOV
KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF INFORMATION AND COMMUNICATION TECHNOLOGY
Abstract
The purpose of this thesis was to investigate whether utilization of the Node.js
Cluster module within a web application in an environment with limited resources
(the Heroku Cloud App Platform) could lead to an increase in throughput of the
application and, in the case of an increase, how substantial it was.
This has been done by load testing an example application with and without the module. In both scenarios, the traffic sent to the application varied from 10 requests/second to 100 requests/second. For the tests conducted on the application utilizing the module, the number of worker processes used within the application varied between 1 and 16.
Furthermore, the tests were first conducted in a local environment in order to
establish any increases in throughput in a stable environment, and, in case there
were notable differences in throughput of the application, the same tests were
conducted on the Heroku Cloud App Platform. Each test was also aimed towards
testing one of two different types of tasks performed by the application: I/O or CPU
bound.
From the test results, it could be derived that utilization of the Cluster module did not lead to any increase in throughput when the application was doing I/O bound tasks, in either environment. However, when doing CPU bound tasks, it led to a ≥20% increase when the traffic sent to the application in the local environment was 10 requests/second or higher. The same increase could only be seen when the traffic sent to the application was 50 requests/second or higher in the Heroku environment.
The conclusion was, thus, that utilization of the module would be useful for the company at which this thesis was conducted, in case an application deployed on Heroku was exposed to higher traffic.
Keywords
Throughput, Node.js, Heroku, Performance, Increasing
Abstract (in Swedish, translated)
The purpose of this degree project was to investigate whether use of the Node.js Cluster module in a web application in an environment with limited resources (the Heroku cloud app platform) could lead to an increase in the application's throughput and, if an increase occurred, how large it was.
This was done by load testing an example application with and without the module. In both scenarios, the traffic sent to the application varied between 10 and 100 requests/second. For the tests conducted on the application using the module, the number of worker processes varied between 1 and 16.
Furthermore, the tests were first conducted in the local environment with the goal of establishing any possible throughput increase in a stable environment; if there were notable differences in the application's throughput, the same tests would also be carried out on the Heroku cloud app platform. Each test also aimed to test one of two types of tasks performed by the application: I/O or CPU bound.
From the test results it could be concluded that the Cluster module did not lead to any increase in throughput when the application performed I/O bound tasks, in either environment. When the application instead performed CPU bound tasks, the module led to an increase of ≥20% when the traffic was 10 requests/second or higher. In the Heroku environment, the same increase could be seen only when the traffic reached 50 requests/second or higher.
The conclusion was therefore that use of the module would be useful for the company at which the work was carried out, if an application deployed on Heroku was exposed to what was considered higher traffic.
Keywords
Throughput, Node.js, Heroku, Performance, Increase
Table of Contents
Abstract (in English)
Abstract (in Swedish)
Table of Contents
1 Introduction ………… 5
  1.1 Background ………… 5
    1.1.1 Increasing Throughput ………… 6
    1.1.2 Node.js ………… 6
    1.1.3 The Heroku Cloud App Platform ………… 6
    1.1.4 Web Applications ………… 7
  1.2 Problem ………… 7
  1.3 Research Questions ………… 7
  1.4 Purpose ………… 8
  1.5 Delimitations ………… 8
  1.6 Disposition ………… 9
2 Theoretical Background ………… 10
  2.1 The Company Platform ………… 10
  2.2 Heroku Dyno ………… 11
  2.3 I/O vs. CPU bound ………… 12
  2.4 The Inner Workings of Node.js ………… 13
  2.5 Increasing Throughput in Node.js Using the Cluster Module ………… 14
  2.6 Related Work ………… 15
3 Research Process ………… 17
  3.1 Research Methodology ………… 17
  3.2 Process Overview ………… 18
    3.2.1 Problem Definition ………… 18
    3.2.2 Data Collection ………… 19
    3.2.3 Design & Implementation ………… 20
    3.2.4 Defining the Testing Environments ………… 20
    3.2.5 Creating Test Plan ………… 20
    3.2.6 Results and Analysis ………… 20
    3.2.7 Evaluation ………… 21
  3.3 Hypotheses ………… 21
4 Analysis: How to Increase Throughput ………… 22
  4.1 Our Approach ………… 22
    4.1.1 Different Implementations of the Cluster Module ………… 22
    4.1.2 Clustering Method Chosen When Creating the Application Template ………… 23
  4.2 The Application Template ………… 23
    4.2.1 CPU Usage ………… 25
    4.2.2 Workload ………… 26
    4.2.3 Memory Usage ………… 27
  4.3 Test Application ………… 27
5 Analysis: Benchmarking the Test Application ………… 28
  5.1 Testing Environment ………… 28
    5.1.1 Local Environment ………… 29
    5.1.2 Heroku Environment ………… 29
    5.1.3 The Test Application's Memory Usage ………… 29
  5.2 Testing Tools ………… 30
    5.2.1 Apache JMeter ………… 31
    5.2.2 Heroku Metrics ………… 32
  5.3 Creating the Test Plan ………… 33
  5.4 Local Tests ………… 33
    5.4.1 I/O Bound ………… 34
    5.4.2 CPU Bound ………… 35
  5.5 Heroku Tests ………… 36
    5.5.1 Throughput Rates ………… 37
    5.5.2 Memory Usage ………… 39
    5.5.3 Median Response Times ………… 40
    5.5.4 Analysis of Heroku Test Results ………… 41
6 Discussion ………… 43
  6.1 Our Methodology and Consequences of the Study ………… 43
  6.2 Discussion and Conclusions ………… 44
    6.2.1 Recommendations Concerning the Application Template ………… 45
  6.3 Ethics ………… 46
  6.4 Sustainability ………… 46
  6.5 Future Work ………… 47
References
Appendix 1 Heroku Dyno CPU Information ………… 52
Appendix 2 The Test Application ………… 58
Appendix 3 The Application Template ………… 60
Appendix 4 The Local Server CPU Specifications ………… 61
Appendix 5 Results from I/O Bound Tests in Local Environment ………… 63
Appendix 6 Results from CPU Bound Tests in Local Environment ………… 65
Appendix 7 Results from CPU Bound Tests on Heroku ………… 67
1 Introduction
Today, virtually every company with a presence on the Internet collects data concerning their customers in some form [1]. With a large collection of customer profiles it is possible to gather information on the customer's geographical area, what products the customer has viewed, what devices the customer is using, etc. With this data, customer communication can be improved, marketing can be optimized (through a more well-targeted informational flow), and all customer information can be stored in one single virtual space.
Data can come from different sources: web analytics tools, login processes, email, etc. It may also be necessary to collect data from different physical nodes; the data might be located in different data warehouses, and can even be administered by different third-party companies.
For a large company the collected data may grow very large, and there might be many daily transactions. It is therefore important that these transactions are consistent, that data is preserved, and that the application can handle as much traffic as possible. One way of making sure the application can do this is to ensure that it can handle as many requests per time unit as possible. This enables the application to serve more clients, thus lowering the risk of a client not receiving the requested data.
1.1 Background
Innometrics, the company at which the project took place, is active within the area just described. Their product helps other companies personalize their marketing strategies by collecting data from a customer's different data warehouses and creating a customer profile out of this data.
They were in need of increasing the throughput of Node.js applications used for communication between their system and other systems. These applications were deployed on an external cloud platform (the Heroku Cloud App Platform or Amazon Web Services), and were thus restricted by each platform's individual specifications.
1.1.1 Increasing Throughput
Throughput is a measurement used for describing the number of requests per time
unit handled by any given web service or application. One of the ways of increasing
throughput is by making the application more concurrent, that is – to make it
process more requests simultaneously [2].
This can be achieved by adding extra hardware resources or by maximizing
utilization of the available resources.
1.1.2 Node.js
Node.js is a runtime environment based on the programming language JavaScript – a programming language best known as the scripting language for web pages. A runtime environment deals with a variety of issues, such as the layout and allocation of storage locations for the objects specified in the source code, and the mechanisms used by the target program to access and pass variables [3].
Node.js ships with a collection of modules, which basically encapsulate related code,
as in Java or any other programming language with a set of standard libraries. Also,
new modules can be installed, managed and published through the Node Package
Manager to provide further functionality. A more detailed specification of Node.js is
given in chapter 2.
1.1.3 The Heroku Cloud App Platform
In order to describe what cloud computing is, Eric Griffith states in his article [4]: “In the simplest terms, cloud computing means storing and accessing data and programs over the Internet instead of your computer’s hard drive”.
Heroku belongs to a type of cloud computing known as Platform as a Service (PaaS) [5]. This type of service removes the need for organizations to manage the underlying infrastructure (usually hardware and operating systems) and allows users to focus on the deployment and management of their applications [6].
The Heroku platform allows users to deploy and execute applications isolated from one another. It provides functionality such as a database management system and application monitoring. The platform's execution environment also enables the user to write applications in several different programming languages, such as Node.js, Ruby, Java and PHP.
1.1.4 Web Applications
An application is a stored set of instructions that directs a computer to do some specific task [7].
Web applications are distributed client-server applications in which a web browser provides the user interface [8]. The client browser and the server side exchange protocol messages represented as HTTP requests and responses. In the case of cloud computing, web applications no longer reside on a single server; instead, they reside on a cloud platform.
1.2 Problem
During periods of high traffic towards a web application, it is essential that the system can handle the increased demand for service. Exposing an inefficient web application to high traffic can cause individual requests not to receive their corresponding responses. It can also lead to response times – the total time from when a user makes a request until they receive a response – being longer than desired.
In order to fulfill the need of service to as many clients as possible, it is important
that the web application can provide a large throughput.
1.3 Research Questions
The main questions of this thesis are the following:
How can the throughput of a Node.js application, running on the Heroku
Platform, be increased by taking advantage of the available system resources?
In case of an increased throughput, how substantial will it be?
1.4 Purpose
The application’s performance is limited by the cloud platform where the application
is installed. The purpose of this thesis is to show how to increase throughput of the
company’s applications running on the Heroku Platform.
The intention is to develop a generic application template in Node.js that can be used when creating new applications within the company's Node.js application platform. Applications that utilize this template should be installable on the Heroku Cloud App Platform. The template should increase the throughput of each individual application, and thus increase the performance of the system as a whole.
Although the application was primarily aimed at the Heroku platform, there should
be a possibility to migrate it to other existing cloud platforms. Therefore, the solution
should be as general as possible.
Furthermore, we are to implement functionality that takes full advantage of the
available system resources in order to increase the number of requests handled by
the application per time unit. This is to be done without adding any additional
hardware resources.
Best practices for increasing the throughput of a Node.js application deployed on the Heroku Cloud App Platform, without adding hardware resources, will be investigated and evaluated. Hopefully, this will lead to an increase in throughput of each individual application on Innometrics' Node.js application platform.
1.5 Delimitations
This thesis will focus solely on increasing the throughput by taking advantage of the available system resources. Also, it is only concerned with the
increase of throughput in an application running on the Heroku Platform – not on an
arbitrary cloud platform. Furthermore, we were limited to using only the free
account level of Heroku (specifications of machines on this level are given in chapter
2).
1.6 Disposition
The thesis is outlined as follows. Firstly, a theoretical background is presented, giving a brief insight into the specific technologies needed to understand the approach to the problem and the thesis results. Node.js, the Heroku environment, and increasing throughput in Node.js specifically are discussed here in more detail.
After that, the research process is treated. The chapter starts by describing our information-gathering process, and continues with a review of existing literature, a description of our research methodology, and the requirements specification.
The next chapter describes how the template for the applications is created. This
chapter is then followed by a chapter devoted to the tests. Here, the testing
environment is described and the results from the tests are evaluated.
Lastly, in the Discussion chapter, we reflect on our methods and results, future work, and the topics of ethics and sustainability within the area.
2 Theoretical Background
This chapter will give a deepened insight into the more theoretical parts of the problem area that are essential in order to understand the problem and its solution. It will describe the Innometrics system, the Heroku dyno, how Node.js works in more detail, how to increase throughput within the runtime environment using the Cluster module, and related work done within the area.
2.1 The Company Platform
As mentioned in section 1.1, the customer's (the company buying Innometrics' product) data warehouse or their system for tracking and managing existing or potential customers (Customer Relationship Management system, or CRM system) [9] is connected to the company platform.
With the data retrieved from the customer’s data warehouse or CRM system,
Innometrics initially puts together a profile for each of the customer’s clients (the
visitors to the customer’s website), which is then stored in Innometrics’ own data
warehouse. The Innometrics system will then continually add data to this profile
containing information on any website interaction that the client in question has
made towards the customer’s website. The website interaction to listen for is
specified by the customer through the Innometrics system.
All client interaction that has been specified to listen for is logged in an event stream in the form of data objects known as events. An event is, in turn, a collection
of data containing information on an action that has been taken by a client on the
customer website. For example, as a client clicks a banner or a link, an event could be
generated containing information on which banner or link that was clicked, the time
when the click was made, etc.
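To make the description concrete, an event for a banner click might look roughly as follows; the field names are illustrative assumptions for this sketch, not Innometrics' actual schema:

```javascript
// A hypothetical event object for a banner click (all field names assumed):
const event = {
  definitionId: 'banner-click',          // which interaction was listened for
  createdAt: '2016-05-12T10:31:00Z',     // when the click was made
  data: {
    bannerId: 'spring-sale',             // which banner or link was clicked
    href: 'https://example.com/sale'     // the link target
  }
};

console.log(event.definitionId);
```

An object of this shape would be appended to the client's profile before the profile is sent onwards to the application.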
In order to enable Innometrics to retrieve resources from third-party sources, Node.js applications (deployed on a cloud platform) are used. Each application has been set to listen for one or several events. In case any of these events is triggered by a visitor on the customer website (e.g. by the client clicking on a link), the Innometrics system sends a request containing the client's profile (with the event added to it) to the application.
An example of this type of communication is shown in figure 2.1.1. As the client visits
a website, an event is generated by the Innometrics system containing information
on the IP of the client. A request containing the client profile is then sent by the
Innometrics system to the application. The application extracts the IP address contained in the event data of the profile just received with the request, and sends it onwards to an IP-lookup service, retrieving further information on the IP address in question. Then, as the response is received from the lookup service, the application saves this data to Innometrics' own data warehouse.
Figure 2.1.1: A flow chart describing an example case of communication between different actors as an
event is triggered on a customer website.
2.2 Heroku Dyno
Each application on the Heroku platform is running on a dyno. Each dyno is a
lightweight Linux container that runs a single command provided by the user. A
dyno can run any command available in its environment like restart, stop, scale, etc.
According to Heroku’s official documentation [10], containerization is a virtualization technology that allows multiple isolated operating system containers to run on a shared host. All dynos are isolated from one another for security purposes.
Dynos on the free account level are limited to 512 MB of RAM [11]. Concerning the CPU specifications, Heroku has (for unknown reasons) decided not to reveal these to the user, but by accessing the application's shell environment it was clear that the dyno resided on a machine that had access to one physical unit consisting of 4 cores with 8 hardware threads (see Appendix 1). However, it seemed [10] that the dyno's access to these resources varied depending on the number of other dynos currently active on the shared host.
A hardware thread is one of two execution threads per core that execute simultaneously in order to hide latencies when retrieving data from memory caches on the CPU, and is implemented by Intel Hyper-Threading Technology [12].
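The shell-based inspection mentioned above can be sketched as follows; this assumes the Heroku CLI is installed and an application is deployed (the exact commands used for Appendix 1 are not given in the thesis):

```shell
# From a developer machine, open a one-off shell on a dyno
# (requires the Heroku CLI and a deployed app):
#
#   heroku run bash
#
# Inside the dyno, the CPU information can then be read from /proc:
nproc                                 # number of logical CPUs visible
grep -c ^processor /proc/cpuinfo      # processor count from the raw CPU info
grep -m1 'model name' /proc/cpuinfo   # the CPU model string
```

Note that the counts reported inside a container reflect what the host exposes to it, which is one reason the dyno's effective CPU share can vary.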
2.3 I/O vs. CPU bound
Tasks performed by an application or a system can be I/O or CPU bound.
An I/O bound (I/O is shorthand for Input/Output) task performs operations associated with I/O communication. Examples of I/O communication are HTTP requests, database operations, and disk reads and writes [13].
CPU bound tasks are mainly performed by the CPU. In this case the CPU spends its
time mostly on computing. Examples of these types of tasks are calculating a hash,
searching for an item, and performing mathematical calculations.
Figure 2.3.1: A CPU (a) vs. I/O bound (b) application
An application can also be either CPU or I/O bound. In the case of a CPU bound
application, a majority of the tasks done within the application are CPU bound. In
the case of an I/O bound application it is the other way around – a majority of the
tasks are I/O bound. Both types of applications are depicted in figure 2.3.1. Here it
can be seen how the CPU bound application (application ‘a’) spends more time doing
calculations, and less time handling I/O. It can also be seen, in application ‘b’, how
an I/O bound application spends its time doing the opposite – more time waiting for
I/O, and less time doing calculations [14].
2.4 The Inner Workings of Node.js
One of the main strengths of Node.js is its method for treating I/O calls. This is largely because I/O calls are handled by background threads, while the main thread of the application, known as the event loop, can treat and process any other requests sent to the application. Figure 2.4.1 gives a detailed overview of the inner workings of Node.js.
Figure 2.4.1: A Node.js instance with its event loop and thread pool
The Node.js runtime runs on a single core [15] and contains an event queue, which stores a list of events, each consisting of a name describing the event and a callback function [16] (a function to be run after the initial function has finished its execution). An example of an event is when an HTTP request is sent to the server. This request is placed in the event queue. The event loop starts by picking up an event containing an I/O call that is to be executed from the queue, and then delegates the job to the operating system via an internal thread pool [17]. The thread that receives the job then executes the function associated with the event without blocking the event loop, while the event loop continues treating the next event in the queue.
After the thread in the internal thread pool has finished its execution, the callback function is again placed in the event queue. The callback function is later retrieved from the queue and processed by the event loop. If another event occurs, a new event is placed in the event queue, and the procedure is repeated. This way the event loop can handle all incoming requests asynchronously, in a non-blocking way.
However, Node.js is not as good at handling CPU intensive tasks [18]. When Node.js performs a CPU intensive task, all other requests are held up, because the event loop runs on a single thread and the CPU is occupied with working on this thread. One of the strategies to handle this problem is to use the Cluster module [13].
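A minimal sketch of this blocking behavior (with arbitrary durations): a timer due after 10 ms cannot fire until a synchronous busy loop has returned control to the event loop.

```javascript
// Demonstrates that a synchronous CPU-bound loop blocks the event loop:
// the timer below is due after 10 ms, but its callback cannot run until
// the busy loop has released the (single) main thread.
function busyWait(ms) {
  const end = Date.now() + ms;
  while (Date.now() < end) {
    // occupy the CPU; nothing else can run on the event loop meanwhile
  }
}

const scheduled = Date.now();
setTimeout(() => {
  // fires only after busyWait has finished, i.e. after roughly 100 ms
  console.log('timer fired after ' + (Date.now() - scheduled) + ' ms');
}, 10);

busyWait(100); // runs synchronously, holding the event loop
```

In a web server, the held-up callback would instead be another client's request, which is exactly the throughput problem the Cluster module addresses.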
2.5 Increasing Throughput in Node.js Using the Cluster Module
In order to improve Node.js's ability to treat CPU intensive tasks, worker processes can be forked. That is, the main process of the application is duplicated into new processes referred to as worker processes [19]. The main process is then referred to as the master process. This functionality is provided by the Cluster module, which is part of the standard library in Node.js [15].
When forking new processes, all new connections are first received by the master process and then handed over to an available worker. Which worker gets the connection is decided through a round-robin approach – which essentially means that the next available worker gets it [20].
Best practice is to bind each worker to its own logical CPU core, which lets the application process each request using more of the CPU's capacity, thus increasing its effectiveness and throughput [21].
This essentially means that each Node.js instance (figure 2.4.1) is replicated into its own server instance, where each instance – known as a worker process – listens on the same socket. Here, the master process works as a load balancer, receiving all incoming connections and distributing them among the worker processes [15]. The resulting architecture of the application when implementing the Cluster module is depicted in figure 2.5.1.
Figure 2.5.1: The desired application architecture for this thesis, with each worker representing the
Node.js instance depicted in figure 2.4.1
2.6 Related Work
The Node.js platform is still rather new and evolving rapidly. Because of that, it is not easy to find articles that are still up to date. Some relevant articles are reviewed in this section.
The article “Optimizing Node.js Application Concurrency”, provided on Heroku's official website, explains how to regulate the number of worker processes [22]. It also recommends creating worker processes and binding each of them to its own logical CPU core, thus making the application take full advantage of the available system resources. One interesting point they make is that each app has unique memory, CPU and I/O requirements, so no single solution fits every app. However, they do not provide any benchmark results.
Rowan Manning describes in his blog how to implement the Cluster module [23]. He also states that creating multiple processes for a Node.js application can dramatically improve the amount of load the application can handle. He provides some simple benchmarking to illustrate the improvement. The app is installed on a local machine without involving the Heroku platform, and the benchmarked function performs CPU bound tasks.
Neil Kandalgaonkar argues that “Node.js can be a great choice for computation heavy services” [24]. He clarifies that it can be suitable for some occasional CPU bound tasks – not too many, nor too heavy, however. The Heroku platform is mentioned in the article as well, but because the application tested in the article was too big (~200 MB), it was not possible to perform thorough tests on that platform. He names the Cluster module as one of the possible solutions.
3 Research Process
This chapter will describe our research process. It will provide a description of the
methodology used in solving the problem, give an overview of the overall process,
and lay down the hypotheses for this thesis.
3.1 Research Methodology
Since the thesis consisted of two separate research questions, two different research
strategies were used.
In order to answer the first research question, how to increase the throughput of the
application, practices on how to create a server in Node.js were investigated. The
solution was determined through a combination of quantitative and qualitative methods, where a form of applied research based on existing theories and research [25] was used to create a test application, which could then be evaluated by answering the second research question. If the results from the second question showed a substantial increase (≳20%), the answer to the first question would be considered positive. If not, the first question would need to be re-evaluated based on another existing theory.
When answering the second research question, how substantial the increase in
throughput would be, two different methods were followed. Experimental research was conducted, forming a foundation for this thesis, by comparing different test results with one changing variable per test. In our case, these variables were represented by 1) the load sent to the application during the test, 2) the number of workers used by the application in the test, and 3) the environment the test was conducted in (local or Heroku).
The hypotheses could also be predefined for the outcome of the comparison, and
thereby, a method of the analytical kind was also used [25]. Thus, the methodology used for answering the second research question was a combination of two research
methods: experimental and analytical.
3.2 Process Overview
The methods listed in this section are described in order to give an understanding of how this thesis was structured to achieve the goal and answer the research questions defined in chapter 1. The overall research process is illustrated in figure 3.2.1, and is described in detail below.
Figure 3.2.1: The research process
3.2.1 Problem Definition
This was the phase where the problem was defined from the requirement specification received from the company.
3.2.2 Data Collection
Data collection concerned two different types of data: primary data and secondary data.
Primary data is most generally described as data collected directly from the information source, and is most often retrieved through interviews, observations and discussions with members of the company [26].
Secondary data, in turn, is typically gathered by persons not involved in the current research. The sources of this kind of data can be technical and statistical records, newspaper articles, etc. [26].
The primary data that the qualitative part of this thesis relies on mainly consists of a task overview given by Innometrics' supervisor of this thesis, and of informal interviews with employees of the company.
The overview given by the company consisted of recommendations on what modules
to use for the thesis – partly modules used by the company daily when designing
applications for the platform, and partly modules that could contribute to this thesis.
Recommendations on what tools might be used when performing the tests were also given.
The informal interviews with employees yielded recommendations on how to set up the remote environment, and information on the average traffic that the Innometrics system is exposed to.
This primary data was then complemented by document studies in the form of
company documentation on the platform, technical reports, and articles on the
subject. Such materials can give a better and deeper understanding of the subject.
The primary data that the quantitative part of this thesis relies on mainly consists of
test results obtained from tests conducted in order to answer the second research
question on how substantial the increase in throughput was (in case there was an
increase).
3.2.3 Design & Implementation
In this phase, a test application is to be designed and implemented based on a known method for increasing throughput in Node.js. The initial design of the test application is the result of the primary data obtained through the qualitative methods just described, and it defines the architecture and functionality of the application.
3.2.4 Defining the Testing Environments
In this phase, the specifications for the machines of both the local and the Heroku testing environments were laid down.
3.2.5 Creating Test Plan
During this phase, focus lay on creating a test plan that included tests for both the local and the Heroku testing environment. The test plan was to be designed to test the application's throughput in the case of both I/O and CPU bound tasks, different rates of traffic sent to the application, and different numbers of worker processes for each traffic rate and type of task.
We had been informed of the structure of the requests being sent to the application, and by reusing that structure we only needed to adapt the request's body to contain data relevant for the test application. The body data relevant for this thesis was simply a string, used to determine which function to call (I/O or CPU bound).
3.2.6 Results and Analysis
This phase consisted of two iterations: one for local tests, and one for Heroku tests.
In both iterations, the test application was benchmarked, and the results of the
benchmark were then analyzed. The results were presented in the form of tables and
graphs. Increases in throughput were expressed in terms of a percentage increase
between each test.
The analysis consisted of a type of formative evaluation, where concentration lay on
examining and changing processes as they occur. The last iteration was evaluated:
if it had provided positive results, the process would continue to a final evaluation
of the solution; if it had provided negative results, a new iteration would be
initiated.
3.2.7 Evaluation
The evaluation of the solution to this thesis was to have a summative approach,
providing an overall description of the application’s performance increases. Whether
the objectives of the thesis had been fulfilled was to be described, along with the
future direction of the product. Here, a secondary analysis was also to be given,
reexamining existing data to address new questions or apply methods not previously
employed.
3.3 Hypotheses
Our hypotheses were that the results would show that the test application would have
performance increases in the areas where Node.js is usually weak. In other words:
when benchmarking one and the same Node.js application with and without our
application template, a performance increase, in the form of a higher throughput,
should be apparent when doing CPU heavy tasks, such as calculating a hash or doing
other arithmetic calculations. When doing I/O bound tasks, however, the throughput
should remain unchanged.
4 Analysis: How to Increase Throughput
This chapter will provide the answer to the first research question of this thesis: how
can the throughput of a Node.js application, running on the Heroku platform, be
increased by taking advantage of the available system resources? The answer was
obtained by using the qualitative methods described in section 3.2.2. The chapter
will also provide a description of the test application used in this thesis.
The application was to consist partly of the throughput increasing template that was
to be the product of this thesis, and partly of functionality for testing two different
aspects of the test application in its current environment – its capabilities of
fulfilling I/O and CPU bound tasks.
4.1 Our approach
When analyzing the data retrieved during the data collection phase, we found that
there were few ways to increase the throughput of a Node.js application.
The main method for increasing throughput in Node.js is to create multiple
processes for the application, thus utilizing more of the available system resources.
This is known as clustering the application, and is most commonly implemented
using the Cluster module described in section 2.5.
4.1.1 Different Implementations of the Cluster Module
There are several alternatives for implementing worker processes in Node.js [27].
One of these is to simply use the standard Cluster module, which is included in the
Node.js standard library and provides the most basic mechanisms for implementing
worker processes. More about the implementation of the Cluster module for this
thesis can be found in section 4.2.
There is also the alternative of using the Throng module [28], which Heroku uses in
its own example on how to cluster. This module is also implemented on top of the
Cluster module. It is advertised as “a simple worker manager for clustered Node.js
apps”: it hides large parts of the master/worker logic when clustering the
application, in order to make things easier for the developer. Instead, the developer
mainly has to focus on setting the number of workers, configuring the master
process, etc.
Another alternative is PM2, a program which is also implemented on top of the
Cluster module. It is similar to the Throng module in hiding large parts of the
master/worker logic from the developer, but does so to an even larger extent. It also
provides the application with some additional functionality, such as real-time
process management [29] (e.g. adding workers), basic system monitoring, log
aggregation, etc. [30].
Lastly, there is the alternative of using the StrongLoop Cluster Management Tool
[27], which is also based on the Cluster module and provides essentially the same
functionality as PM2, with some smaller differences (such as profiling).
4.1.2 Clustering Method Chosen When Creating the Application
Template
When it came to this thesis, it was found that the standard Cluster module was the
most appropriate way to implement clustering in the application.
When looking at the alternatives, they either tended to hide large parts of the cluster
related code from the developer (Throng, PM2, and StrongLoop), or to offer
functionality not relevant for this thesis – which might have led to a larger memory
usage (higher memory allocation) for the processes. They are also all built on top of
the Cluster module, and it also seemed easier to adapt the standard Cluster module
to different cloud platforms compared to the other alternatives [31].
While the other alternatives, with their additional functionality, might be useful in a
live scenario, they were not appropriate for this study, where the goal was to evaluate
the effects of clustering at the most basic level.
4.2 The Application Template
When creating the template, we relied on the official Node.js documentation, and the
description of the Cluster module in particular, on how to create and cluster a web
application. This led to a template realizing the server model described in section
2.5 (and depicted in figure 2.5.1).
When developing the template, it was important to keep the master process as light
as possible, by keeping the allocated memory for the process at a minimum and not
including any server related code, or any other code that was not relevant to its task
of managing the workers. The reason for this was to optimize memory usage on the
cloud platform: since it was the workers that did the request handling, it was
important that they had as much memory available as possible.
In order to change the number of worker processes dynamically for each application
instance, an environment variable that could be set via the command line was used.
On Heroku, this variable held the number of workers appropriate for the number of
dynos used for the application.
According to the official Node.js Cluster documentation, the default strategy when
creating worker processes in an application is to use the worker processes as
request handlers (receiving and treating requests), and to use the master process for
creating workers and handing sockets to them through interprocess communication
(IPC) – a mechanism for sharing data among multiple processes [4].
Figure 4.2.1: The master related code of the template
This works in the following way: as the application instance is started, the master
creates a number of workers equal to the value of the environment variable
mentioned earlier (see figure 4.2.1, rows 25–27). Rows 29–31 show how a new
worker is generated by the master in case a worker dies (the process somehow
shuts down).
Figure 4.2.2: The worker related code of the template
In figure 4.2.2, the worker related code of the template is shown. The worker starts
by instantiating the Express framework (row 34) – a Node.js framework for creating
web applications, providing the process with the necessary server functionality.
Rows 39–41 show how each worker process listens to the same port.
The code for handling each request sent to the application is shown on rows 44–51.
As a request is sent to the application, the request is treated and a response is
generated in the callback of this method.
The complete template of this thesis can be found in Appendix 3.
4.2.1 CPU Usage
As mentioned earlier, a Node.js application has a single-threaded event loop,
utilizing only one of the available CPU cores. To increase throughput using only the
available system resources, the application should specify how many worker
processes are to be created.
As mentioned in section 2.5, best practice in determining how many worker
processes should be created for a particular application is to base the number on the
number of cores available to the system. That way, each process is bound to a single
logical core. The desired CPU usage can be seen in figure 4.2.1.1.
Figure 4.2.1.1: Regular vs. desired Node.js CPU usage
Using the Cluster module, this can easily be implemented on a physical machine,
where the exact specifications of the machine are known. When it comes to a cloud
platform, however, not much information is revealed about the container
specifications. A single Heroku dyno shares access to system resources with other
dynos, and the performance of a single dyno can vary depending on the total
load on the underlying machine. Therefore, according to Heroku’s article
“Optimizing Node.js Application Concurrency” [21], clustering more than one worker
on a standard single dyno may hurt, rather than help, performance. This was one of
the things to be considered when performing the tests.
4.2.2 Workload
Analyzing the information received from observations and recommendations, it was
clear that the application should be able to handle different amounts of simultaneous
requests. The customers that use the application are of different kinds – they can be
either large or small companies. Thus, the application should take those differences
into consideration, i.e. it should be able to handle both larger and smaller numbers
of clients. Therefore, finding the right balance of workers is very important.
4.2.3 Memory Usage
Applications can differ in memory usage. Some applications in need of larger
memory allocation (≳200 Mb for a single application) might suffer from
implementing worker processes on a single Heroku dyno (due to exceeding the
memory limit). Exceeding the memory limit could lead to the application not
performing desirably, with requests timing out (not receiving responses). Therefore,
when clustering an application the memory usage of each process has to be kept in
mind – the application’s overall memory usage must not exceed the dyno’s memory
limit.
4.3 Test Application
A template was created, and by that the first research question was partly answered.
The next step was to verify whether the template would give the desired increase in
throughput on the Heroku platform or not. In order to do that, a test application was
to be created and the second research question answered.
The application had to provide means for testing its capabilities of doing different
tasks. From discussions with people at the company, it was discovered that the
system sometimes calculates hashes when creating new profiles. Therefore, the test
application needed to provide the ability to run two different types of tasks: CPU and
I/O bound. The CPU bound function calculated a hash, while the I/O bound task
simulated an I/O call by doing a timeout of 300 ms where the application simply was
waiting without blocking the event loop. The function to run was parsed from the
HTTP request that the application received. In each of the two functions, an
appropriate response was generated. The same test application (see Appendix 2) was
used in both local and Heroku tests.
5 Analysis: Benchmarking the Test Application
This section will focus on describing the testing environment, the test plan, and the
results obtained from the tests, in a local environment and on Heroku. It will provide
an analysis of the results, in order to answer the second research question of this
thesis on how substantial the increase in throughput of the application can be when
clustering functionality has been added.
5.1 Testing Environment
Analyzing the information gathered during the data collection phase when
attempting to answer the first research question, we came to the conclusion that it
was necessary to define both a local and a Heroku testing environment.
Heroku themselves state [21] that an application might suffer from being clustered
when running on a free account. The tests were thus first conducted on the test
application locally, with the goal of acquiring the expected results in a stable
environment. Tests were then conducted on the same application, but instead
installed on Heroku, with the expectation of obtaining similar results.
In both testing environments we used six different versions of our application – one
without clustering functionality, and five with clustering functionality, each with a
given number of workers available to the application instance (1, 2, 4, 8 or 16). The
version without clustering functionality was needed in order to confirm that the
added functionality would not affect the performance of the application.
For both environments the same machine was used as client. The specifications of
the client machine were:
MacBook Air (13-inch, Mid 2013)
CPU: 1.7 GHz Intel Core i7
Memory: 8 GB 1600 MHz DDR3
OS: Mac OS X El Capitan, Version 10.11.4
100/10 Mbit/s Ethernet Connection
5.1.1 Local Environment
The local testing environment consisted of two machines: one client (with the
specifications given above), and one server with the following specifications:
MacBook Pro (13-inch, Mid 2012)
CPU: 2.5 GHz Intel Core i5
Memory: 4 GB 1600 MHz DDR3
OS: Mac OS X El Capitan, Version 10.11.5
100/10 Mbit/s Ethernet Connection
Through a terminal command in Mac OS X, the specifications for the Intel Core i5
CPU could be retrieved (see Appendix 4). Here, it could be seen that the CPU had
access to 2 cores and 4 hardware threads. This later became a determining factor
when deciding on the most appropriate number of worker processes to use when
running a local server.
5.1.2 Heroku Environment
Summarizing the specifications given for the Heroku dyno in section 2.2:
CPU: a varying share, depending on how many other dynos are currently
active on the shared host
Memory: 512 Mb
Due to having significantly less memory available in the Heroku environment
compared to the local one, and due to the fact that a dyno had a varying share of the
CPU, it was necessary to establish the results locally first. We thought that if the
expected results from the hypotheses (i.e. getting a throughput increase only for CPU
bound tasks) could be obtained in a local environment, it would be worth testing on
Heroku as well. If not, the expected results would definitely not be obtained on the
lower performing machines that we had in our Heroku environment.
5.1.3 The Test Application’s Memory Usage
By monitoring the application’s memory usage locally through the Mac OS X
terminal command “top”, we could see that it used 20 Mb without any requests being
sent to it. When requesting the application to run CPU bound tasks, its memory
usage could climb up to 85 Mb, but averaged around 65 Mb. When requesting the
application to run I/O bound tasks, its memory usage could climb up to around
80 Mb, but averaged around 60 Mb.
Since we, on Heroku, had a memory quota of 512 Mb, we had now been given an
equation for calculating the appropriate number of workers for the application. By
having 512 Mb in total memory available, and the application having a max memory
usage of around 85 Mb when it was performing CPU bound tasks (the most memory
demanding task), the most appropriate number of workers would be around 512 Mb
/ 85 Mb ≈ 6 workers. Considering that the master process also would need some
memory allocated, the appropriate number of workers would most likely be slightly
below 6.
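The estimate above can be expressed as a one-line calculation. This is only a sketch of the back-of-the-envelope reasoning in the text; the 512 Mb and 85 Mb figures come from the measurements in this section, and the function name is an assumption.

```javascript
// Worker count estimate: total dyno memory divided by the peak memory
// observed per worker, rounded down.
function maxWorkers(dynoMemoryMb, peakWorkerMb) {
  return Math.floor(dynoMemoryMb / peakWorkerMb);
}

console.log(maxWorkers(512, 85)); // 6, matching the estimate in the text
```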
Among our different versions of the application, we could thereby predict that the
one having 4 workers would produce the best results by giving an increased
throughput, while not exceeding the memory limit of the dyno (and still leave a
margin to it). The application utilizing 4 workers would have a memory quota of 512
Mb / 4 workers = 128 Mb available for each worker (minus master process memory
usage). This meant that when the application was exposed to high traffic ordering it
to perform CPU bound tasks, each worker would still have a memory quota of
128 − 85 = 43 Mb available, which should be considered a good margin, without
leaving a significant amount of unused memory on the dyno.
Summarizing, it is important that the memory usage of the application’s processes
does not exceed the available memory of the Heroku dyno, and that it, ideally, lies
a good margin below this value – but not too far below, because then a large amount
of memory would go unused. The problem, concerning the Heroku environment,
had thus become memory related as well (not only CPU related).
5.2 Testing Tools
This section will describe testing tools used when running tests locally and on
Heroku.
5.2.1 Apache JMeter
JMeter is a Java application designed to load test functional behavior and measure
performance. It provides means for simulating a heavy load on a server, groups of
servers, or network, to test its strength or to analyze the performance under different
load types. It has the ability to load and performance test many different
server/protocol types: HTTP/HTTPS, FTP, TCP etc.
Figure 5.2.1.1: Example properties of a thread group
With each testing plan, the user creates a thread group, specifying a thread number,
a ramp-up period, and a loop count. The thread number specifies how many threads
are to be started at the beginning of each ramp-up period (specified in seconds),
and the loop count specifies how many times this procedure should be repeated. In
figure 5.2.1.1, there is an example of the properties that can be set for a thread group.
Here, 10 threads are being initiated each second, and this is looped 320 times.
Figure 5.2.1.2: An example of the properties of an HTTP Request Sampler
Within each thread group, in turn, there are several elements that can be included.
For example, in our case it was relevant to include an HTTP Request Sampler – an
object that contains information on an HTTP request that is to be sent with each
thread in the thread group. Figure 5.2.1.2 shows an example of properties set for an
HTTP Request Sampler that sends requests to port number 8887 on IP 192.168.1.104.
The body data can also be set; this is, however, something we could not show
without risking infringement of company policy.
There is also the possibility of generating aggregated reports. This type of report
forms the basis for presenting the results of the tests performed in the local
environment.
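How JMeter was launched is not stated in the thesis; a typical non-GUI invocation of a test plan like the one described above would be the following (the file names are hypothetical):

```shell
# -n runs JMeter in non-GUI mode, -t points to the test plan (.jmx),
# -l writes the sample results used for the aggregated report.
jmeter -n -t testplan.jmx -l results.jtl
```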
5.2.2 Heroku Metrics
When running the tests on Heroku, we used JMeter for sending the requests, but not
for measuring the application’s performance. This was due to JMeter having a
different measurement of throughput, based on the number of samples divided by
the total time of the test. This meant that the time for the request to travel from the
client to the server, and for the response to travel back to the client, was included in
the measurement as well. This was an
acceptable measurement in the local environment, since the distance between client
and host was small. Now, however, with the application deployed on an external
host, we had to take into consideration that there might be a significant distance
between the client and the host. Therefore, it was decided that the just mentioned
transport times were something that should not be a part of the application’s
performance evaluation.
In order to measure the application’s performance as fairly as possible, it was
important to take the measurements as close to the application as possible. This
could be done by relying
on the metrics tool which Heroku has made available for developers. The tool
consisted of a collection of graphs including the same units of measurement for the
application as those retrieved from the JMeter reports used in the previous section –
namely throughput, average and median response times, and error rates.
5.3 Creating the Test Plan
The testing procedures of the application followed a pattern where, for each type of
task (I/O or CPU bound), the number of requests sent to the application was
gradually increased – in order to evaluate how well the application performed
different tasks at different traffic rates.
The load rate for each test varied between 10 and 100 requests per second. Rates of
10–25 requests per second were to simulate low traffic, 25–50 medium traffic, and
50–100 high traffic. For all tests, 15000 samples were sent to the application.
Figure 5.3.1: Example of six thread groups each containing one HTTP Request Sampler
In order to test each application sequentially, there was one thread group (see figure
5.3.1) for each number of workers available to each application. Within each thread
group, we had specified an HTTP Request Sampler (described in section 5.2.1)
sending requests to that application's specific endpoint. As mentioned earlier,
regardless of whether the application was running locally or remotely on Heroku,
we set up the same samplers in each of the thread groups – only changing the names
of the thread groups and the host URL for the samplers.
5.4 Local Tests
This section describes the evaluation of the application’s capabilities in performing
I/O and CPU bound tasks in the local environment. Focus was laid on differences in
throughput between tests, but average and median response times will also be noted
and analyzed.
5.4.1 I/O Bound
Starting with the I/O bound tests and sending in 10 requests per second (see figure
5.4.1.1), it was noted that the results were similar between the thread groups – no
matter the number of worker processes used. Both the average and median response
times are close to the same for each of the thread groups. The throughput (the
number of requests handled by the application per second) is also similar between
the thread groups.
Label # Samples Average (ms) Median (ms) Error % Throughput (rps)
10/s With Cluster, 2 workers 15000 317 308 0.00% 30.8
10/s With Cluster, 4 workers 15000 308 309 0.00% 31.7
10/s With Cluster, 16 workers 15000 307 308 0.00% 31.8
10/s Without Cluster 15000 307 307 0.00% 31.9
10/s With Cluster, 8 workers 15000 307 308 0.00% 31.9
10/s With Cluster, 1 workers 15000 307 307 0.00% 31.9
Label # Samples Average (ms) Median (ms) Error % Throughput (rps)
100/s With Cluster, 16 workers 15000 306 305 0.00% 309.3
100/s With Cluster, 4 workers 15000 305 305 0.00% 310.3
100/s Without Cluster 15000 305 305 0.00% 310.8
100/s With Cluster, 8 workers 15000 305 305 0.00% 310.9
100/s With Cluster, 2 workers 15000 306 305 0.00% 311.5
100/s With Cluster, 1 workers 15000 306 305 0.00% 312.0
Figure 5.4.1.1: Results from local I/O bound tests at 10 and 100 requests per second (sorted by
throughput)
Looking at the results from the test where 10 requests were sent per second, there is
barely a difference between the version without clustering and the ones utilizing it.
The results from the other test (100 requests per second) were the same. The
difference in throughput between the thread group without clustering and the
highest performing thread group with clustering was ~0.4%, which is not a
significant difference (and does not pass the bar of 20%).
To summarize, when the application was performing I/O bound tasks in a local
environment, the test results did not show a significant difference in terms of
throughput, and none of the results passed the bar of a 20% increase in throughput.
In accordance with the hypotheses (described in section 3.3) and the research
process of this thesis (section 3.2.6) – only moving on to Heroku with tests that
showed an increase in the local environment – no increase in throughput could be
established for the application performing I/O bound tasks.
All of the results obtained from testing the application’s capabilities of performing
I/O bound tasks in the local environment can be found in Appendix 5.
5.4.2 CPU Bound
As can be seen in figure 5.4.2.1, when simulating low traffic (10 requests per second)
the results obtained showed a significant difference between the thread groups.
When comparing the thread group without clustering to the highest performing one
with clustering (4 workers), there was a difference of ~24.4%. A decrease of 25% in
average and median response times between the two thread groups could also be
noted.
Label # Samples Average (ms) Median (ms) Error % Throughput (rps)
10/s With Cluster, 1 workers 15000 6 6 0.00% 787.5
10/s Without Cluster 15000 4 4 0.00% 827.6
10/s With Cluster, 16 workers 15000 4 3 0.00% 892.9
10/s With Cluster, 8 workers 15000 3 3 0.00% 972.3
10/s With Cluster, 2 workers 15000 3 3 0.00% 994.6
10/s With Cluster, 4 workers 15000 3 3 0.00% 1029.9
Label # Samples Average (ms) Median (ms) Error % Throughput (rps)
50/s With Cluster, 1 workers 15000 51 53 0.00% 815.5
50/s Without Cluster 15000 37 39 0.00% 954.5
50/s With Cluster, 2 workers 15000 25 27 0.00% 1211.9
50/s With Cluster, 16 workers 15000 15 15 0.00% 1269.7
50/s With Cluster, 4 workers 15000 16 15 0.00% 1320.3
50/s With Cluster, 8 workers 15000 13 13 0.00% 1355.9
Figure 5.4.2.1: The results of the CPU bound tests at 10 and 50 requests per second (sorted by
throughput)
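The percentage differences quoted in this chapter can be reproduced as the relative change between the non-clustered baseline and the clustered throughput. A sketch, using the 10 rps CPU bound figures from the table above:

```javascript
// Relative throughput increase, in percent, of the clustered application
// over the non-clustered baseline.
function pctIncrease(baseRps, clusteredRps) {
  return ((clusteredRps - baseRps) / baseRps) * 100;
}

// 10 rps, CPU bound: 827.6 rps without clustering vs 1029.9 rps with
// 4 workers gives the ~24.4% quoted in the text.
console.log(pctIncrease(827.6, 1029.9).toFixed(1)); // "24.4"
```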
Moving on to the test where requests were being sent to the application at 50
requests per second, the results obtained showed an increase in throughput of
~42.1% when comparing the application without clustering to the highest performing
clustered application (8 workers). In this case, the decrease in average response time
between the two thread groups was ~64.9%, and in terms of median response time
the decrease was ~61.5%.
Label # Samples Average (ms) Median (ms) Error % Throughput (rps)
100/s With Cluster, 1 workers 15000 107 110 0.00% 793.6
100/s Without Cluster 15000 82 86 0.00% 926.8
100/s With Cluster, 16 workers 15000 19 18 0.00% 1151.8
100/s With Cluster, 2 workers 15000 63 67 0.00% 1208.1
100/s With Cluster, 4 workers 15000 49 52 0.00% 1278.3
100/s With Cluster, 8 workers 15000 39 44 0.00% 1326.5
Figure 5.4.2.2: The results of the CPU bound tests at 100 requests per second (sorted by throughput)
Looking at the test where requests were being sent in at 100 requests per second (see
figure 5.4.2.2), a difference of ~43.1% in terms of increased throughput could be
noted between the non-clustered and the best performing clustered application
(8 workers). In this case, a decrease of ~52.4% in average and ~48.8% in median
response times could also be seen. Lastly, it was noted that the application with
1 worker performed ~4.8–14.4% lower than the application not implementing
clustering.
In conclusion, the results obtained from testing the application’s capabilities of
performing CPU bound tasks locally showed increases in throughput higher than the
bar of 20%. Because the results showed this increase, CPU bound tests were to be
conducted in the Heroku environment as well. When utilizing only 1 worker,
however, the throughput was ~4.8–14.4% lower compared to when not utilizing
clustering at all. The full results of the CPU bound tests can be seen in Appendix 6.
5.5 Heroku Tests
This section presents the evaluation results of the same application used in the local
tests, but deployed on the Heroku platform.
Worth taking note of is that it was here decided not to test the I/O bound function on
Heroku, as the results obtained from analyzing the local tests indicated that the
Cluster module did not contribute to any increase in throughput for this type of task.
Here, the test results are based on the output of the Heroku metrics. Each bar in the
diagram represents the performance of the application during a given minute in
time. The vertical line apparent in most of the diagrams (e.g. figure 5.5.1.1)
represents a specific minute, chosen by us for analysis. This specific minute
corresponds to the highest throughput value obtained during each test. All test
results in this section
are based on results from when the application was performing CPU bound tasks on
Heroku.
5.5.1 Throughput Rates
The results obtained from simulating low traffic for an application deployed on the
Heroku platform did not show any significant difference. In figure 5.5.1.1 it can be
seen that the throughput is about the same (~10000 requests/min) for an
application without clustering and for the highest performing application with
clustering (4 workers).
Figure 5.5.1.1: A comparison in throughput between tests at 10 rps without clustering (upper) and with
4 (lower) workers
Continuing with the test results obtained from the tests running at 50 requests per
second (see figure 5.5.1.2), a difference of ~20.6% in throughput could be noted
when comparing the application without clustering to the best performing
application with clustering (4 workers). This passes the bar of 20% set as the
prerequisite for being considered a positive result.
Figure 5.5.1.2: A comparison in throughput between tests at 50 rps without clustering (upper) and 4
(lower) workers
When looking at the results obtained from the tests running at 100 requests per
second (see figure 5.5.1.3), a significant difference in throughput could be seen. Here,
there is a difference of ~54.9%, which also passes the bar of 20%.
Figure 5.5.1.3: A comparison in throughput between tests at 100 rps without clustering (upper) and 4
(lower) workers
Lastly, looking at the results from each test (10–100 rps), it was shown that the
lowered throughput for the application utilizing 1 worker compared to the one
without clustering persisted. Comparing the two, the performance was lowered by
~1.5–5.3% when sending 10–100 requests/second to the application (see
Appendix 7).
5.5.2 Memory Usage
Looking at the memory usage when comparing the application without clustering
to the one with the best performance (4 workers) during high traffic, it can be seen
(in figure 5.5.2.1) that a significant amount of the memory available to the
non-clustered application is unused, namely 407 Mb. The reason for looking at high
traffic in particular is that it is the worst case scenario for the application (when
the highest amount of memory is allocated).
Figure 5.5.2.1: Memory footprint of the application without clustering (upper) and with 4 workers
(below) running at 100 rps
Looking at the memory usage of the second application (4 workers), it can be seen
(also in figure 5.5.2.1) that more of the available memory is being used by the
application, leaving 221 Mb unused. The vertical line represents the same time as in
figure 5.5.1.3.
Additionally, the application in this case (100 rps) exceeds the memory limit when
having 8 or 16 workers, leading to requests queueing up (see Appendix 7, figures 23
and 24). These queued requests further add to the memory usage of the application,
and can eventually lead to the application crashing (at 1 Gb memory usage [10]),
thus losing data.
5.5.3 Median Response Times
When looking at the tests, the median response times varied between being roughly
the same during low traffic (10 rps) and a decrease of ~94% during high traffic
(100 rps), when comparing the application without clustering to the one having
4 workers (see figure 5.5.3.1). The vertical line in the figure marks the same time as
in figures 5.5.1.3 and 5.5.2.1.
Figure 5.5.3.1: Response times for application without clustering (upper) and with 4 workers when
sending at 100 rps
5.5.4 Analysis of Heroku Test Results
When performing CPU bound tasks in the Heroku environment, an increase in
throughput exceeding the set bar of 20% could be seen when the traffic sent to the
application was of medium to high rate (50–100 requests per second).
It could also be seen how the memory usage, when using 8 workers at 100
requests/second, exceeded the limit of the dyno. Thus, requests started queueing up,
which could eventually have led to the application crashing. At the same traffic,
when running the application without clustering, a large amount of the dyno's
available memory was left unused, meaning the full capacity of the dyno was not
utilized. The memory usage of the applications at high traffic thus spoke for using
4 workers within the application.
Additionally, in the Heroku tests, as in the local tests, there was some decrease in
throughput when comparing the non-clustered application with the one utilizing 1
worker. This means that implementing the Cluster module and instantiating only one
worker process would most likely lead to a small decrease in throughput rather than
an increase. This is most likely due to the overhead of the master process having to
create the worker process at the start of the test, and to the redundancy of a master
process performing IPC when there is only one worker.
In conclusion, throughput increases could be seen in the Heroku environment as
well. The answer to the second research question of this thesis, on how substantial
the increase in throughput could be, is thus: when sending medium to high traffic
(50–100 rps) to an application performing CPU bound tasks, the increase in
throughput varies between ~20.6% (when sending 50 rps) and ~54.9% (when
sending 100 rps). When doing I/O bound tasks, on the other hand, there are no
throughput increases. The obtained answer confirmed the hypotheses presented in
section 3.3.
6 Discussion
This chapter presents our discussion and conclusions on this study, reflecting our
interpretations of the results and of problem areas related to the thesis. Finally, the
future direction of the template is discussed.
6.1 Our Methodology and Consequences of the Study
The purpose of the thesis was to create a template for Node.js applications deployed
on the Heroku cloud platform. The problem definition described in section 1.2 led to
two research questions (see section 1.3), thus dividing the research process into two
parts.
When answering the first question, the applied research method was chosen. This
method belongs to both qualitative and quantitative research. The choice was
determined through early investigation, where it was found that there were few
techniques that could be applied in order to design and implement the template.
This was because we were bound to a particular situation and a very specific
implementation environment, which left only one solution to the problem.
The problem with this research method was that it relied heavily on the outcome of
the second question. Negative results would have required further investigation and
would have made the research process iterative, where a new technique for
increasing throughput would have had to be evaluated.
The data collection phase of the first question included collection of primary and
secondary data. The primary data consisted mainly of informal interviews and our
observations at Innometrics. This data was then complemented by document
studies related to our research field. However, we had trouble finding academic
literature on the subject. This is probably partly due to the narrowness of the
problem area, and partly due to the technologies being discussed being rather new.
That is why we were careful when analyzing the quality of the available resources;
otherwise, problems could have arisen during the later phases of the research
process.
The implementation part of the study focused on the development of a Node.js
template for Innometrics. This is the part of the research process that heavily relied
on the result of the second research question.
Before answering the second question, the right assumptions needed to be made. In
our case, this meant focusing on CPU bound functions, which is why the tests gave
the desired result. The only problem was that the Heroku environment, where the
tests were conducted, was very unpredictable in terms of available system resources.
In conclusion, the research process made it possible to identify the right problem
areas and choose the right methods. As there was no previous research in this field
and the problem was of a practical character, our thesis could serve as a good
foundation for the future development of the applications at Innometrics, as well as
for all developers experimenting with Node.js and the Heroku platform.
6.2 Discussion and Conclusions
The only solution found when trying to answer the first research question, on how to
increase throughput for the applications within Innometrics’ system, was to utilize
the Cluster module, which is part of the standard library in Node.js (there were,
however, alternatives for how to implement the module).
When faced with the problem of testing the application on Heroku, it was found that
the problem was no longer only about utilizing more of the CPU’s capacity, but
about optimizing the memory usage as well. This meant we could not focus solely on
finding the number of workers giving the highest throughput; we now also had to
consider the memory usage of the application, and make sure that the application,
with its processes, stayed well below the memory limit of the dyno it was installed
on. Otherwise, the memory limit might be exceeded, which could eventually lead
to the application crashing, thus losing data.
Analyzing the obtained test results showed an increase of ~20–55% when
utilizing the Cluster module in a test application performing CPU bound tasks
during medium to high traffic (50–100 requests per second). However, through tests
performed in a local, higher-performing environment, we came to the conclusion
that, when doing I/O bound tasks, a clustered application would not show any
increases in throughput in the Heroku environment, regardless of the traffic sent to
it. This was in accordance with the hypotheses for this thesis (presented in section
3.3).
It was also found that the highest increases in throughput, when doing CPU bound
tasks in the Heroku environment, could be seen when using 4 workers in the
application. This number of workers was also shown to stay well within the memory
limit of the dyno (thus minimizing the risk of exceeding it during sudden increases in
traffic).
Regarding the test results, it was also noted that utilizing only one worker led to a
throughput decrease of ~1.5–5.3% (depending on the traffic sent to the application),
which means that an application should only utilize the application template if it
needs, and has the memory available, to instantiate more than one worker.
The results obtained from answering the second research question (on how
substantial the increase in throughput was) confirmed that the answer to the first
research question, on how to increase throughput in a Node.js application, was to
utilize the Cluster module.
6.2.1 Recommendations Concerning the Application Template
Our recommendation when utilizing this template in an application installed on a
Heroku free account is to implement it mainly in applications that are exposed to
traffic of around 50 requests per second and higher. It is also recommended to check
the memory usage of the application before implementing this template, to make
sure that it is able to utilize more than one worker, in order to obtain an increase in
throughput without exceeding the memory limit of the dyno.
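The memory check recommended above could be sketched as a simple sizing rule. The concrete numbers below (a 512 MB dyno limit, a measured per-worker footprint, and a safety headroom) are illustrative assumptions, not values prescribed by the thesis.

```javascript
// Hypothetical sizing helper: cap the desired worker count so that the
// workers' combined footprint stays under the dyno's memory limit,
// leaving headroom for traffic spikes. All parameters are in MB.
function safeWorkerCount(dynoLimitMb, perWorkerMb, desiredWorkers, headroomMb) {
  const budget = dynoLimitMb - headroomMb;           // memory left for workers
  const maxByMemory = Math.floor(budget / perWorkerMb);
  return Math.max(1, Math.min(desiredWorkers, maxByMemory));
}

// e.g. on a 512 MB dyno with ~70 MB per worker and 100 MB headroom,
// at most floor(412 / 70) = 5 workers fit, so 4 desired workers are allowed.
```

If such a check only allows one worker, the template offers no benefit: as noted earlier, a single clustered worker slightly decreases throughput, so the non-clustered version should then be kept.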
6.3 Ethics
A problem encountered concerning the ethical aspects of the project was that we
needed to avoid revealing company secrets about Innometrics, its customers, or the
customers’ clients (the Innometrics profiles). Therefore, the report is kept in as
general terms as possible.
Also, by increasing the throughput of an application, more data can be collected per
minute. This means that consumer behaviour can be tracked in more detail, which
will lead to better communication between customer and company. At the same
time, for the individual person, this project might not be beneficial: it contributes
further to the monitoring of individuals, which is a controversial subject today. A
more detailed customer profile might lead to a person feeling pursued. This is,
however, not up to us, but to the company using the template (in this case
Innometrics).
6.4 Sustainability
When speaking about sustainability, there are four different general areas to discuss:
environmental, human, economic, and social sustainability [32].
Concerning environmental sustainability, by increasing the throughput of the
applications, more clients can be handled per minute. This can lead to faster wear
on the machines running the system, since the machine in question will now be
handling more clients simultaneously, thus doing more work per minute. It can
also lead to increased energy consumption of the machine, due to utilizing more of
the machine’s capacity. Both of these aspects further damage our environment.
It can, however, also lead to less wear on the machine, by completing demanding
work sessions faster than before and thereby giving the machine time for recovery
(e.g. lowering machine temperatures) before strenuous periods.
Lastly, concerning economic sustainability, by increasing the throughput of the
application, it will not crash as often. This will lead to an increase in profits for the
companies in need of the service, as well as for the company providing it. Also, by
only using the free account on Heroku, expenses are saved for Innometrics.
6.5 Future Work
Despite the fact that the application template can already be implemented, there
are still some areas that can be further investigated.
When it comes to the implementation of the Cluster module, there are different
possibilities. It could be worth evaluating some of them (PM2, StrongLoop). They
will probably not give much better performance, if any at all, but they can provide
more convenient ways of managing the worker processes, along with other features
that can facilitate the maintenance of the application in the future. For example,
there would be no need to undeploy the application when making changes.
In order to make the application template as generic as possible, other cloud
platforms should also be considered. Similar tests could identify the most suitable
platform.
In the case of Heroku, there are other types of accounts providing more system
resources. Detailed comparison tests could give a clear picture of differences in
performance between different types of accounts.
It could also be interesting to do further research on finding the optimal number of
workers for an application. In this case, 4 workers had, as mentioned, a good margin
to the memory limit, but this might still not have been the most optimized memory
usage of the application. Thus, testing 5–6 workers for this particular application
might have shown even better results.
References
[1] “IAB Internet Advertising Revenue Report”, 2012, p. 16. [Online]. Available: http://www.iab.net/media/file/IAB_Internet_Advertising_Revenue_Report_FY_2012_rev.pdf [Accessed: 1 March 2016]
[2] B. Cantrill, J. Bonwick, “Real-world Concurrency”, ACM Queue, Volume 6, Issue 5, September 2008. [Online]. Available: http://queue.acm.org/detail.cfm?id=1454462 [Accessed: 25 May 2016]
[3] A.V. Aho, M.S. Lam, R. Sethi, J.D. Ullman, Compilers: Principles, Techniques, and Tools, p. 247. 2006, Pearson
[4] S.D. Burd, Systems Architecture, Fifth Edition, p. 45. 2004, Course Technology, a division of Thomson Learning
[5] E. Griffith, “What Is Cloud Computing?”, PCMag UK, 3 May 2016. [Online]. Available: http://uk.pcmag.com/networkingcommunicationssoftwareproducts/16824/feature/whatiscloudcomputing [Accessed: 24 May 2016]
[6] Heroku, “The Heroku Platform as a Service & Data Services”. [Online]. Available: https://www.heroku.com/platform [Accessed: 24 May 2016]
[7] B. Butler, “PaaS Primer: What is platform as a service and why does it matter?”, Network World, 11 February 2013. [Online]. Available: http://www.networkworld.com/article/2163430/cloudcomputing/paasprimerwhatisplatformasaserviceandwhydoesitmatter.html [Accessed: 25 May 2016]
[8] L.R. Rewatkar, U.A. Lanjewar, “Implementation of Cloud Computing on Web Application”, International Journal of Computer Applications, Volume 2, No. 8, June 2010, pp. 28-31. [Online]. Available: http://www.ijcaonline.org/volume2/number8/pxc387964.pdf [Accessed: 24 May 2016]
[9] S. Lynn, “What Is CRM?”, PCMag UK, 18 August 2011. [Online]. Available: http://uk.pcmag.com/software/9038/feature/whatiscrm [Accessed: 24 May 2016]
[10] Heroku Dev Center, “Dynos and the Dyno Manager”. [Online]. Available: https://devcenter.heroku.com/articles/dynos [Accessed: 24 May 2016]
[11] Heroku Dev Center, “Dyno Types”. [Online]. Available: https://devcenter.heroku.com/articles/dynotypes [Accessed: 24 May 2016]
[12] S. Saini, H. Jin, R. Hood, D. Barker, P. Mehrotra, R. Biswas, “The Impact of Hyper-Threading on Processor Resource Utilization in Production Applications”, NASA Advanced Supercomputing Division, 2011. [Online]. Available: https://www.nas.nasa.gov/assets/pdf/papers/saini_s_impact_hyper_threading_2011.pdf [Accessed: 25 May 2016]
[13] E. Swenson-Healey, “The JavaScript Event Loop: Explained”, Carbon Five, 27 October 2013. [Online]. Available: http://blog.carbonfive.com/2013/10/27/thejavascripteventloopexplained/ [Accessed: 25 May 2016]
[14] A.S. Tanenbaum, Modern Operating Systems, Third Edition, p. 145. 2009, Pearson
[15] “Cluster”, Node.js v6.2.1 Documentation. [Online]. Available: https://nodejs.org/api/cluster.html
[16] A. Burgess, “Using Node's Event Module”, Envato Tuts+, 3 December 2013. [Online]. Available: http://code.tutsplus.com/tutorials/usingnodeseventmodulenet35941 [Accessed: 26 May 2016]
[17] D. Khan, “How to Track Down CPU Issues in Node.js”, about:performance, 14 January 2016. [Online]. Available: http://apmblog.dynatrace.com/2016/01/14/howtotrackdowncpuissuesinnodejs/ [Accessed: 25 May 2016]
[18] M. Ridwan, “The Top 10 Most Common Mistakes That Node.js Developers Make”, Toptal. [Online]. Available: https://www.toptal.com/nodejs/top10commonnodejsdevelopermistakes [Accessed: 25 May 2016]
[19] Linux Documentation. [Online]. Available: http://linux.die.net/man/2/fork [Accessed: 24 May 2016]
[20] B. Noordhuis, “What’s New in Node.js v0.12: Cluster Round-Robin Load Balancing”, StrongLoop, 19 November 2013. [Online]. Available: https://strongloop.com/strongblog/whatsnewinnodejsv012clusterroundrobinloadbalancing/ [Accessed: 25 May 2016]
[21] A. Gorbatchev, “How-to Cluster Node.js in Production with Strong Cluster Control”, StrongLoop, 22 April 2015. [Online]. Available: https://strongloop.com/strongblog/productionnodejsstrongclustercontrol/ [Accessed: 25 May 2016]
[22] “Optimizing Node.js Application Concurrency”, Heroku Dev Center. [Online]. Available: https://devcenter.heroku.com/articles/nodeconcurrency [Updated: 24 September 2015]
[23] R. Manning, “Node.js Cluster and Express”, 10 January 2013. [Online]. Available: http://rowanmanning.com/posts/nodeclusterandexpress/ [Accessed: 28 May 2016]
[24] N. Kandalgaonkar, “Why you should use Node.js for CPU-bound tasks”, 30 April 2013. [Online]. Available: http://neilk.net/blog/2013/04/30/whyyoushouldusenodejsforCPUboundtasks/ [Accessed: 25 May 2016]
[25] A. Håkansson, “Portal of Research Methods and Methodologies for Research Projects and Degree Projects”, 2013. [Online]. Available: https://www.kth.se/social/files/55563be1f276547328cea897/Research%20Methods%20%20Methodologies(1).pdf
[26] “Qualitative and Quantitative Research Techniques for Humanitarian Needs Assessment: An Introductory Brief”, ACAPS, May 2012. [Online]. Available: http://www.acaps.org/sites/acaps/files/resources/files/qualitative_and_quantitative_research_techniques_for_humanitarian_needs_assessmentan_introductory_brief_may_2012.pdf
[27] S. Kar, “Node.js Performance Tip of the Week: Scaling with Proxies and Clusters”, StrongLoop, 22 April 2015. [Online]. Available: https://strongloop.com/strongblog/nodejsperformancescalingproxiesclusters/ [Accessed: 16 May 2016]
[28] “throng”, npm. [Online]. Available: https://www.npmjs.com/package/throng
[29] J. Shkurti, “Node.js clustering made easy with PM2”, Keymetrics, 26 March 2015. [Online]. Available: https://keymetrics.io/2015/03/26/pm2clusteringmadeeasy/ [Accessed: 5 June 2016]
[30] PM2. [Online]. Available: http://pm2.keymetrics.io/ [Accessed: 5 June 2016]
[31] “Using PM2 in Cloud Providers”, PM2 Documentation. [Online]. Available: http://pm2.keymetrics.io/docs/usage/usepm2withcloudproviders/ [Accessed: 5 June 2016]
[32] R. Goodland, “Sustainability: Human, Social, Economic and Environmental”, Baltic University Programme, a regional university network. [Online]. Available: http://www.balticuniv.uu.se/index.php/component/docman/doc_download/435sustainabilityhumansocialeconomicandenvironmental [Accessed: 5 June 2016]
Appendix 1 Heroku Dyno CPU Information
Results from running the “cat /proc/cpuinfo”command in the shell environment of each test application: processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Intel(R) Xeon(R) CPU E52670 v2 @ 2.50GHz stepping : 4 microcode : 0x428 cpu MHz : 2500.090 cache size : 25600 KB physical id : 0 siblings : 8 core id : 0 cpu cores : 4 apicid : 0 initial apicid : 0 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm xsaveopt fsgsbase smep erms bogomips : 5000.18 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 1 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Intel(R) Xeon(R) CPU E52670 v2 @ 2.50GHz stepping : 4 microcode : 0x428 cpu MHz : 2500.090 cache size : 25600 KB physical id : 0 siblings : 8 core id : 1
cpu cores : 4 apicid : 2 initial apicid : 2 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm xsaveopt fsgsbase smep erms bogomips : 5000.18 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 2 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Intel(R) Xeon(R) CPU E52670 v2 @ 2.50GHz stepping : 4 microcode : 0x428 cpu MHz : 2500.090 cache size : 25600 KB physical id : 0 siblings : 8 core id : 2 cpu cores : 4 apicid : 4 initial apicid : 4 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm xsaveopt fsgsbase smep erms bogomips : 5000.18 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management:
processor : 3 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Intel(R) Xeon(R) CPU E52670 v2 @ 2.50GHz stepping : 4 microcode : 0x428 cpu MHz : 2500.090 cache size : 25600 KB physical id : 0 siblings : 8 core id : 3 cpu cores : 4 apicid : 6 initial apicid : 6 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm xsaveopt fsgsbase smep erms bogomips : 5000.18 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 4 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Intel(R) Xeon(R) CPU E52670 v2 @ 2.50GHz stepping : 4 microcode : 0x428 cpu MHz : 2500.090 cache size : 25600 KB physical id : 0 siblings : 8 core id : 0 cpu cores : 4 apicid : 1 initial apicid : 1 fpu : yes fpu_exception : yes cpuid level : 13
wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm xsaveopt fsgsbase smep erms bogomips : 5000.18 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 5 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Intel(R) Xeon(R) CPU E52670 v2 @ 2.50GHz stepping : 4 microcode : 0x428 cpu MHz : 2500.090 cache size : 25600 KB physical id : 0 siblings : 8 core id : 1 cpu cores : 4 apicid : 3 initial apicid : 3 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm xsaveopt fsgsbase smep erms bogomips : 5000.18 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 6 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Intel(R) Xeon(R) CPU E52670 v2 @ 2.50GHz stepping : 4
microcode : 0x428 cpu MHz : 2500.090 cache size : 25600 KB physical id : 0 siblings : 8 core id : 2 cpu cores : 4 apicid : 5 initial apicid : 5 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm xsaveopt fsgsbase smep erms bogomips : 5000.18 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 7 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Intel(R) Xeon(R) CPU E52670 v2 @ 2.50GHz stepping : 4 microcode : 0x428 cpu MHz : 2500.090 cache size : 25600 KB physical id : 0 siblings : 8 core id : 3 cpu cores : 4 apicid : 7 initial apicid : 7 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm xsaveopt fsgsbase smep erms
bogomips : 5000.18 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management:
Appendix 2 The Test Application
Figure 1: First half of the test application
Figure 2: Second half of the test application
Appendix 3 The Application Template
Figure 1: Part one of the application template
Figure 2: Part two of the application template
Appendix 4 The Local Server CPU Specifications machdep.cpu.max_basic: 13 machdep.cpu.max_ext: 2147483656 machdep.cpu.vendor: GenuineIntel machdep.cpu.brand_string: Intel(R) Core(TM) i53210M CPU @ 2.50GHz machdep.cpu.family: 6 machdep.cpu.model: 58 machdep.cpu.extmodel: 3 machdep.cpu.extfamily: 0 machdep.cpu.stepping: 9 machdep.cpu.feature_bits: 9203919201183202303 machdep.cpu.leaf7_feature_bits: 641 machdep.cpu.extfeature_bits: 4967106816 machdep.cpu.signature: 198313 machdep.cpu.brand: 0 machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX EST TM2 SSSE3 CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC POPCNT AES PCID XSAVE OSXSAVE TSCTMR AVX1.0 RDRAND F16C machdep.cpu.leaf7_features: SMEP ERMS RDWRFSGS machdep.cpu.extfeatures: SYSCALL XD EM64T LAHF RDTSCP TSCI machdep.cpu.logical_per_package: 16 machdep.cpu.cores_per_package: 8 machdep.cpu.microcode_version: 21 machdep.cpu.processor_flag: 4 machdep.cpu.mwait.linesize_min: 64 machdep.cpu.mwait.linesize_max: 64 machdep.cpu.mwait.extensions: 3 machdep.cpu.mwait.sub_Cstates: 135456 machdep.cpu.thermal.sensor: 1 machdep.cpu.thermal.dynamic_acceleration: 1 machdep.cpu.thermal.invariant_APIC_timer: 1 machdep.cpu.thermal.thresholds: 2 machdep.cpu.thermal.ACNT_MCNT: 1 machdep.cpu.thermal.core_power_limits: 1 machdep.cpu.thermal.fine_grain_clock_mod: 1 machdep.cpu.thermal.package_thermal_intr: 1 machdep.cpu.thermal.hardware_feedback: 0 machdep.cpu.thermal.energy_policy: 0 machdep.cpu.xsave.extended_state: 7 832 832 0 machdep.cpu.xsave.extended_state1: 1 0 0 0 machdep.cpu.arch_perf.version: 3 machdep.cpu.arch_perf.number: 4 machdep.cpu.arch_perf.width: 48 machdep.cpu.arch_perf.events_number: 7
machdep.cpu.arch_perf.events: 0 machdep.cpu.arch_perf.fixed_number: 3 machdep.cpu.arch_perf.fixed_width: 48 machdep.cpu.cache.linesize: 64 machdep.cpu.cache.L2_associativity: 8 machdep.cpu.cache.size: 256 machdep.cpu.tlb.inst.small: 64 machdep.cpu.tlb.inst.large: 8 machdep.cpu.tlb.data.small: 64 machdep.cpu.tlb.data.large: 32 machdep.cpu.tlb.shared: 512 machdep.cpu.address_bits.physical: 36 machdep.cpu.address_bits.virtual: 48 machdep.cpu.core_count: 2 machdep.cpu.thread_count: 4 machdep.cpu.tsc_ccc.numerator: 0 machdep.cpu.tsc_ccc.denominator: 0
Appendix 5 Results From I/O Bound Tests in Local Environment
Average and Median measured in ms, and Throughput in requests per second.
Label Samples Average Median Error % Throughput
10/s With Cluster, 2 workers 15000 317 308 0,00% 30,8
10/s With Cluster, 4 workers 15000 308 309 0,00% 31,7
10/s With Cluster, 16 workers 15000 307 308 0,00% 31,8
10/s Without Cluster 15000 307 307 0,00% 31,9
10/s With Cluster, 8 workers 15000 307 308 0,00% 31,9
10/s With Cluster, 1 workers 15000 307 307 0,00% 31,9
Label Samples Average Median Error % Throughput
25/s With Cluster, 2 workers 15000 305 305 0,00% 79,1
25/s With Cluster, 16 workers 15000 306 306 0,00% 79,2
25/s With Cluster, 4 workers 15000 306 307 0,00% 79,3
25/s With Cluster, 8 workers 15000 306 306 0,00% 79,4
25/s With Cluster, 1 workers 15000 305 305 0,00% 79,4
25/s Without Cluster 15000 305 305 0,00% 79,6
Label Samples Average Median Error % Throughput
50/s With Cluster, 4 workers 15000 305 305 0,00% 157
50/s With Cluster, 2 workers 15000 305 305 0,00% 157,4
50/s With Cluster, 16 workers 15000 305 305 0,00% 157,7
50/s With Cluster, 8 workers 15000 305 305 0,00% 157,7
50/s Without Cluster 15000 305 305 0,00% 158,2
50/s With Cluster, 1 workers 15000 305 305 0,00% 158,4
Label Samples Average Median Error % Throughput
75/s With Cluster, 4 workers 15000 305 305 0,00% 234,5
75/s With Cluster, 8 workers 15000 305 305 0,00% 234,6
75/s Without Cluster 15000 306 305 0,00% 234,6
75/s With Cluster, 2 workers 15000 305 305 0,00% 234,8
75/s With Cluster, 16 workers 15000 306 306 0,00% 234,9
75/s With Cluster, 1 workers 15000 306 305 0,00% 235,6
Label Samples Average Median Error % Throughput
100/s With Cluster, 16 workers 15000 306 305 0,00% 309,3
100/s With Cluster, 4 workers 15000 305 305 0,00% 310,3
100/s Without Cluster 15000 305 305 0,00% 310,8
100/s With Cluster, 8 workers 15000 305 305 0,00% 310,9
100/s With Cluster, 2 workers 15000 306 305 0,00% 311,5
100/s With Cluster, 1 workers 15000 306 305 0,00% 312
Figure 1: Results from I/O bound tests in local environment
Appendix 6 Results From CPU Bound Tests in Local Environment
Average and Median measured in ms, and Throughput in requests per second.
Label # Samples Average Median Error % Throughput
10/s With Cluster, 1 workers 15000 6 6 0,00% 787,5
10/s Without Cluster 15000 4 4 0,00% 827,6
10/s With Cluster, 16 workers 15000 4 3 0,00% 892,9
10/s With Cluster, 8 workers 15000 3 3 0,00% 972,3
10/s With Cluster, 2 workers 15000 3 3 0,00% 994,6
10/s With Cluster, 4 workers 15000 3 3 0,00% 1029,9
Label # Samples Average Median Error % Throughput
25/s With Cluster, 1 workers 15000 21 23 0,00% 814,8
25/s With Cluster, 2 workers 15000 10 10 0,00% 1158,6
25/s With Cluster, 4 workers 15000 6 5 0,00% 1247,6
25/s With Cluster, 8 workers 15000 5 5 0,00% 1277,8
25/s With Cluster, 16 workers 15000 7 4 0,00% 1188,4
25/s Without Cluster 15000 14 15 0,00% 933,4
Label # Samples Average Median Error % Throughput
50/s With Cluster, 1 workers 15000 51 53 0,00% 815,5
50/s Without Cluster 15000 37 39 0,00% 954,5
50/s With Cluster, 2 workers 15000 25 27 0,00% 1211,9
50/s With Cluster, 16 workers 15000 15 15 0,00% 1269,7
50/s With Cluster, 4 workers 15000 16 15 0,00% 1320,3
50/s With Cluster, 8 workers 15000 13 13 0,00% 1355,9
Label # Samples Average Median Error % Throughput
75/s With Cluster, 1 workers 15000 77 80 0,00% 821,3
75/s With Cluster, 2 workers 15000 47 49 0,00% 1197,1
75/s With Cluster, 4 workers 15000 34 33 0,00% 1348
75/s With Cluster, 8 workers 15000 29 32 0,00% 1345,1
75/s With Cluster, 16 workers 15000 40 27 0,00% 1101,4
75/s Without Cluster 15000 60 63 0,00% 949,8
Label # Samples Average Median Error % Throughput
100/s With Cluster, 1 workers 15000 107 110 0,00% 793,6
100/s Without Cluster 15000 82 86 0,00% 926,8
100/s With Cluster, 16 workers 15000 19 18 0,00% 1151,8
100/s With Cluster, 2 workers 15000 63 67 0,00% 1208,1
100/s With Cluster, 4 workers 15000 49 52 0,00% 1278,3
100/s With Cluster, 8 workers 15000 39 44 0,00% 1326,5
Figure 1: Results from CPU bound tests in local environment
Appendix 7 Results From CPU Bound Tests on Heroku
To see the results of the test corresponding to each figure description, look at the
vertical line (the one next to the timestamp).
10 rps:
Figure 1: 10 rps, without clustering (vertical line missing, see timestamp for time of test)
Figure 2: 10 rps, 1 worker
Figure 3: 10 rps, 2 workers
Figure 4: 10 rps, 4 workers
Figure 5: 10 rps, 8 workers
Figure 6: 10 rps, 16 workers
25 rps:
Figure 7: 25 rps, without clustering
Figure 8: 25 rps, 1 worker
Figure 9: 25 rps, 2 workers
Figure 10: 25 rps, 4 workers
Figure 11: 25 rps, 8 workers
Figure 12: 25 rps, 16 workers
50 rps:
Figure 13: 50 rps, without clustering
Figure 14: 50 rps, 1 worker
Figure 15: 50 rps, 2 workers
Figure 16: 50 rps, 4 workers
Figure 17: 50 rps, 8 workers
Figure 18: 50 rps, 16 workers
100 rps:
Figure 19: 100 rps, without clustering
Figure 20: 100 rps, 1 worker
Figure 21: 100 rps, 2 workers
Figure 22: 100 rps, 4 workers
Figure 23: 100 rps, 8 workers
Figure 24: 100 rps, 16 workers
TRITA-ICT-EX-2016:69
www.kth.se