
Bachelor Informatica

Benchmarking Akka

Dennis Kroeb

June 15, 2020

Supervisor(s): Ana-Lucia Varbanescu

Signed:

Informatica — Universiteit van Amsterdam


Abstract

In modern-day computing, concurrent programming is essential in high-performance systems. The Akka platform provides a programming model to meet this need. Systematic performance analysis studies for Akka do not exist; therefore, this thesis proposes such a study. To this end, the performance of the Akka actor model is assessed by first comparing it to Java in a microbenchmarking experiment, which illustrates the overhead and the different threading models in Akka. Furthermore, to compare Akka with other models, we ported two applications from the Computer Language Benchmarks Game (CLBG) to Akka and compared their performance against the original CLBG models, using the CPU metrics, compressed code size, and sampled memory usage. Akka performed similarly to Java. Based on this analysis, we conclude that Akka can reach performance similar to Java in non-blocking concurrent applications, but Akka generally has a larger code size.


Contents

1 Introduction
  1.1 Research question and approach
  1.2 Ethical aspects

2 Background and related work
  2.1 Benchmarking and CLBG
  2.2 The Akka platform
      2.2.1 The Akka actor model
      2.2.2 Java, Akka and multithreading
  2.3 Related work

3 Akka vs Java: a microbenchmark
  3.1 Experiments setup
  3.2 The Counter program
  3.3 Scalability and overhead
  3.4 Measuring different phases
  3.5 Initialisation of the Akka actor model
  3.6 ForkJoinPools and ThreadPoolExecutors

4 CLBG for Akka
  4.1 Selecting relevant CLBG programs
  4.2 Program 1: Binary trees
      4.2.1 Porting to Akka from pseudocode
      4.2.2 CLBG results - Binary Trees
  4.3 Program 2: Reverse Complement
      4.3.1 Porting to Akka from Java
      4.3.2 CLBG results - Reverse Complement

5 Conclusion and future work
  5.1 Main findings
  5.2 Future work

Acronyms


CHAPTER 1

Introduction

The world we live in values (real-time) connectivity more with each day, and the demand for high-performance distributed applications keeps rising. Proper design and implementation are important to ensure good performance, and choosing a suitable programming model is important for these applications. Concurrency plays an important role in these distributed applications.

The Akka platform claims to perform well in distributed applications, by providing native concurrency through an actor-based model. The platform incorporates the Akka toolkit, which is a collection of modules built by Lightbend, a company specialised in real-time cloud-based services [1]. The toolkit offers support for two common programming languages: Scala and Java. In this project, we only evaluate the Java implementations of Akka, because of our familiarity with the language.

The problem is that there is currently no systematic performance analysis of Akka compared to other models. Performance analysis of Akka can aid in choosing the right model when designing new applications. This thesis aims to fill that gap.

1.1 Research question and approach

To provide a systematic performance analysis of Akka compared to other models, we answer the following research question in this thesis:

How does the Akka actor model perform compared to other models?

This thesis focuses on assessing the performance of the Akka actor model, which is the core module of Akka. Akka uses the actor model programming principle, and our performance analysis gives insight into the performance impact of this abstraction compared to other programming principles, like regular Object-oriented programming (OOP) in Java, for example.

To answer our research question we propose two different types of comparison, specifically addressing two sub-questions:

SQ1: How does Akka compare against Java for a basic application?

SQ2: How does Akka compare against the multiple models in the Computer Language Benchmarks Game?

To answer SQ1, we microbenchmark Java multithreading versus Akka actors through a simple synthetic program (Chapter 3). For the second sub-question (SQ2), we benchmark Akka versus many other models, using the Computer Language Benchmarks Game (CLBG). We further report on porting existing input programs included in CLBG to Akka, because CLBG does not thus far support Akka natively. This porting process and the CLBG results can be found in Chapter 4. In Chapter 5, we conclude our report and discuss our findings.


If a new model like Akka does not perform much better than other models on existing programs, rewriting an existing program might not be worth the effort. It should not be forgotten that there is (currently) no such thing as a 'best' model, because all programming models have different use cases and features, which vary in their relevance with respect to a given program. A fitting analogy: it is very hard to be an excellent sprinter and an excellent marathon runner at the same time.

1.2 Ethical aspects

This project stems from an application where a distributed system is used to monitor illegal activities (e.g. poaching) in national parks [2]. Because this surveillance software uses Akka, a proper performance analysis could help catch poachers if our findings contribute to a better code base.

Furthermore, our work can enable users to make informed choices about the tools they use to program their applications, which is beneficial to the efficient use of (computational) resources. This could in turn lead to lower power consumption, which is better for the environment and also offers financial benefits, as long as the functionality is not compromised by the reduced power consumption.

The work in this thesis is open-source, which offers transparency and reproducibility. This allows others to perform additional research based on our findings. This project does not touch on controversial topics (like artificial intelligence, for example), and we see this work as ethically responsible.


CHAPTER 2

Background and related work

In this chapter, we aim to provide the basic terms and notions required to understand the research done in this thesis. Thus, we discuss the Akka platform, benchmarking, CLBG, and we briefly present related work.

2.1 Benchmarking and CLBG

In the context of this thesis, we define a benchmark as one program that measures execution of an input program to quantify performance; a benchmarking suite is a set of such programs, typically representative of real-life applications, whose combined performance measurements give a better understanding of the performance across different types of applications. The main challenges for any benchmark are the selection of the applications and the selection of the representative metrics. Both selections depend on the goal of the benchmark. In this work we focus on using the Computer Language Benchmarks Game (CLBG) as our benchmark suite.

CLBG is a benchmark suite that tests many programming models using various implementations of algorithms (programs). The suite is well-documented [3] and open-source [4]. The benchmark results are posted on the CLBG website, but the suite also provides the option to display your own benchmark results on a (local) webpage (Figure 2.1).

To assess the performance of a model, CLBG uses the following metrics: execution time, memory usage, compressed code size, total CPU time over all threads, and individual thread usage. An overview of these metrics is given in Table 2.1, and example measurements can be seen in Figure 2.1. Compressed code size is measured using the GZIP tool [5]. CPU information is gathered using the GTOP library [6]. Memory is measured by GTOP as well, and is sampled every 200 ms. Programs that run for less than 1 second may therefore not have accurate memory measurements [3].

Metric                 Unit   Tool       Details
Execution time         Sec.   GTOP 2.0   Measures the whole execution time, from start to finish
Total CPU time         Sec.   GTOP 2.0   Total CPU non-idle time for all cores combined
CPU load per core      %      GTOP 2.0   Amount of non-idle CPU work performed per core with respect to the total time
Average peak memory    MB     GTOP 2.0   Peak RAM usage (sampled every 200ms)
Compressed code size   B      GZIP 1.6   Compressed using minimal GZIP compression

Table 2.1: Metrics used by the CLBG suite

CLBG is transparent - due to its clear specifications and metrics - and offers a wide range of programs implemented using many models. This enables users to get performance insights while not overfitting to certain models, a mistake made by other Akka benchmarks [7][8].


Figure 2.1: The CLBG WebUI for displaying benchmark results. The number behind some source programs identifies the implementation of the algorithm for that model, as some models have multiple implementations [4].

CLBG obtains its measurements about models in a generic way and does not require much special action for a new model to be supported, assuming the implementation of the input program is valid with respect to the program's specifications. Because of this, CLBG is useful for benchmarking Akka as a new model for the suite. While input programs for CLBG can be extended to support other models, one should be careful to follow the specifications listed by the suite, to ensure a fair comparison between models.

2.2 The Akka platform

The Akka platform is a toolkit which provides concurrency and distributivity in message-driven applications using the actor model programming principle [1].

2.2.1 The Akka actor model

The Akka actor model is the core of the Akka platform, which incorporates the actor model programming principle. The actor model allows concurrent programming with the advantage of enforcing encapsulation without locks, boosting concurrent performance [9]. This is very beneficial in Object-oriented programming (OOP), because encapsulation plays a major role here. The actor model uses a sender/receiver type model, where the receiver performs an action based on a message. Akka does this in such a way that the internals of the receiver are not compromised, removing the need for locks [1].
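To make this concrete, the sketch below shows what a minimal actor can look like with Akka Typed's Java API (the 2.6 javadsl); the class and message names are illustrative and not taken from the thesis code. The actor's internal counter is only ever modified while a message is being handled, so no locks are required.

import akka.actor.typed.Behavior;
import akka.actor.typed.javadsl.AbstractBehavior;
import akka.actor.typed.javadsl.ActorContext;
import akka.actor.typed.javadsl.Behaviors;
import akka.actor.typed.javadsl.Receive;

// A counter actor: its state (count) is only touched while processing one message at a time.
public class CounterActor extends AbstractBehavior<CounterActor.Command> {

    public interface Command {}
    public static final class Increment implements Command {}

    private long count = 0;

    public static Behavior<Command> create() {
        return Behaviors.setup(CounterActor::new);
    }

    private CounterActor(ActorContext<Command> context) {
        super(context);
    }

    @Override
    public Receive<Command> createReceive() {
        return newReceiveBuilder()
                .onMessage(Increment.class, this::onIncrement)
                .build();
    }

    private Behavior<Command> onIncrement(Increment msg) {
        count++;   // safe without locks: messages are processed sequentially
        return this;
    }
}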


Figure 2.2: How Akka actors process messages, and how locks are avoided [1].

The following step-by-step description demonstrates how the actor model works (from the Akka documentation), displayed in Figure 2.2 [1]:

1. Actor2 receives a message from Actor1 and puts it in its queue.

2. If Actor2 was not scheduled for execution, it is marked as ready to execute (by the dispatcher).

3. The dispatcher starts the execution of Actor2.

4. Actor2 also receives a message from Actor3, which is then queued.

5. Actor2 processes the message from the front of the queue. The internal state of Actor2 is modified. Actor2 is now able to send messages to other actors.

6. Once the queue is empty, or other actors are awaiting execution, Actor2 is unscheduled.

The dispatcher in the Akka actor model is a scheduler which allows actors that are waiting in memory to work on a thread. It is possible to have multiple dispatchers in a program, and they are highly configurable [10].

2.2.2 Java, Akka and multithreading

Multi-threading is the execution of programs making use of multiple threads. In general, we distinguish between logical (hardware) threads and virtual (software) threads [11]. Running multiple virtual threads on the same processor is achieved by giving each thread a time slice in which it may work on one hardware thread; when the time has elapsed, the next thread can start or continue its work. A context switch occurs during this transition [11]. A context switch encapsulates all the (time-consuming) operations required for a logical thread to switch state and work on a different task. Having too many virtual threads running on a processor with too few hardware threads often leads to very bad performance, even worse than sequential execution, because of the overhead introduced by excessive context switches.

In Java, threads can be spawned for a given task to be executed on demand [12]. It is also possible to use a thread-pool: this method does not create a new thread for each new task, but instead re-uses threads efficiently. The benefit of a thread-pool is that you can have many more tasks than threads, reducing the overhead of thread creation [13].
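As an illustration (not code from the thesis), the snippet below contrasts spawning one thread per task with submitting the same tasks to a fixed-size pool through the standard java.util.concurrent API:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ThreadVsPool {
    public static void main(String[] args) {
        Runnable task = () -> System.out.println(Thread.currentThread().getName());

        // One new thread per task: acceptable for a few tasks, wasteful for many.
        for (int i = 0; i < 4; i++) {
            new Thread(task).start();
        }

        // A thread-pool re-uses a fixed number of threads for any number of tasks.
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < 1000; i++) {
            pool.submit(task);
        }
        pool.shutdown();
    }
}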

While concurrency and parallelism through multithreading can offer great performance benefits, it is not always easy to apply multithreading properly. When programming concurrently, tasks should be non-blocking: in concurrent applications, blocking operations are detrimental to performance because the execution of a thread can be extensively delayed by the execution of other threads. This negative performance impact applies to Akka too [1].

Efficiently re-using threads can be done in Java by using the ThreadPoolExecutor (TPE) class [14][15]. Another option in Java is the ForkJoinPool (FJP) approach using the ForkJoinPool class [16]. This gives options to manually fork and join in Java code, using threads which get their tasks from a double-ended queue (a deque). A deque is a combination of a stack and a queue, where items can be inserted and removed at both the head and the tail [17].

The benefit of using a deque is that the distribution of tasks becomes faster, because tasks can be taken from both ends of the queue, whereas in a TPE all threads take their tasks from a single (regular) queue.

11

If tasks are small, having a deque can then increase performance and, as such, using an FJP is preferable [16]. This performance increase partially comes from the work-stealing algorithm, which acts as a load balancer that lets threads 'steal' tasks assigned to a different thread from its deque if the stealing thread has little or no work left to do. While stealing of work happens infrequently, it still improves performance in certain cases [16].
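A minimal fork/join sketch, again purely illustrative: a RecursiveTask splits a summation until the chunks are small enough to compute directly, and the pool's work-stealing deques balance the resulting sub-tasks over its threads.

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Sums the range [from, to) by recursively forking sub-tasks.
class RangeSum extends RecursiveTask<Long> {
    private final long from, to;
    RangeSum(long from, long to) { this.from = from; this.to = to; }

    @Override
    protected Long compute() {
        if (to - from <= 10_000) {            // small enough: compute directly
            long sum = 0;
            for (long i = from; i < to; i++) sum += i;
            return sum;
        }
        long mid = (from + to) / 2;
        RangeSum left = new RangeSum(from, mid);
        left.fork();                           // pushed on this thread's deque, may be stolen
        RangeSum right = new RangeSum(mid, to);
        return right.compute() + left.join();
    }

    public static void main(String[] args) {
        long total = new ForkJoinPool().invoke(new RangeSum(0, 1_000_000));
        System.out.println(total);
    }
}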

Because Akka is built on the JVM, Akka uses similar multithreading techniques as Java in the Akka dispatchers. This means that each dispatcher can be configured to use either a TPE or an FJP, with the latter being the default option [10]. The performance analysis of TPE vs FJP still holds: when Akka actors have small tasks to perform, the ForkJoin dispatchers should perform better.
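As a rough sketch of how this choice could be made explicit (the dispatcher name and pool size below are made up; the configuration keys follow the Akka documentation), a dispatcher can be defined in configuration and then selected when spawning actors:

import com.typesafe.config.Config;
import com.typesafe.config.ConfigFactory;

public class DispatcherConfig {
    // HOCON fragment selecting a ThreadPoolExecutor-backed dispatcher;
    // replacing the executor line with "fork-join-executor" restores the default style.
    static final Config CONFIG = ConfigFactory.parseString(
            "my-dispatcher {\n" +
            "  type = Dispatcher\n" +
            "  executor = \"thread-pool-executor\"\n" +
            "  thread-pool-executor.fixed-pool-size = 4\n" +
            "}\n")
            .withFallback(ConfigFactory.load());
}

An ActorSystem created with this Config could then place actors on the custom dispatcher by spawning them with DispatcherSelector.fromConfig("my-dispatcher").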

2.3 Related work

In this section we summarize relevant related work on benchmarking in general, comparativestudies on programming models, and CLBG.

According to [18], when designing a benchmark, a few important steps should be taken into account. The first step is deciding what exactly you want to measure. This means finding out which performance criteria are important and defining the metrics of the benchmark accordingly. The next step is defining procedures for how the benchmark should be run - for example, whether testing should be done under a (too) high workload (stress testing), how many repetitions of execution there should be, and how distinct measurements should be aggregated (e.g., by taking the average value). The third step is finding or creating input programs that are representative of what the benchmark is trying to measure. Programs for benchmarking Akka in Java should make good use of Akka, for example, because otherwise the benchmark is just quantifying Java's performance instead. Creating programs specifically for the benchmark forces designers to properly think about what the program should do, making it easier to analyse the measurements later on. Good benchmark programs should be simple, representative of the metrics, (trans)portable, system independent, and preferably written in higher-level languages [18]. Finally, the benchmark should be run on the programs and the results should then be validated and analysed. Validation involves checking whether the benchmark results are reproducible. The benchmark should also be run on different systems to check for transportability (compatibility). Properly documenting the benchmark also contributes to its quality. The CLBG suite meets these requirements, making it a suitable suite for use in this thesis.

In the context of programming languages/models, benchmarking is often related to deciding which programming model to use for a given type of problem. A comparative study has been done using Rosetta Code programs [19], which focuses on comparing models in terms of their performance. The authors state that it is very important to precisely know the features of models to be able to compare them properly. Scripting, procedural, object-oriented and functional programming languages all have their pros and cons. The article used the Rosetta Code repository [20] to compare these paradigms using a data set large enough to provide high statistical significance. The article also covers differences in features of languages, like conciseness, performance and failure-proneness. The latter applies to weakly-typed languages, for example. Moreover, functional and scripting languages are more concise than procedural and object-oriented languages.

The authors signal a significant problem in comparisons of programming models: they can suffer from overfitting to specific problems, while not providing proper statistical significance or generalisability [19]. The results of the article are well-founded and statistically significant, making it a useful source for model comparison and thus for this thesis. The CLBG suite provides many programs covering multiple problems in many models [4], avoiding the overfitting pitfall mentioned by the article.

A lot of work comparing specific pairs of programming models also exists. For example, research that compares Go and Java and specifically analyses their concurrency performance, as well as the compile time of both models, is presented in [21]. The former metric is relevant for this thesis, since Akka focuses on concurrency as well. The work focuses on matrix multiplication using both models, and compares the execution time of this application. The authors conclude that Go is faster when using concurrency, but is outperformed by Java as the problem size increases in the sequential implementation of the matrix multiplication. The article clearly describes the implementations of the used programs. The results are based on the mean of three simple benchmarks run for each experiment, and no statistical analysis is performed on these results. The performance comparison only focuses on sequential and concurrent matrix multiplication, and does not use any other programs to benchmark. This work, therefore, does not provide a thorough performance analysis of Go and Java, because it overfits to these specific cases [21][19].

Another performance comparison between Java and Go was done in a PhD thesis project [22]. This work focuses on the parallel performance of the two models, and concludes that Go performs matrix multiplication faster than Java, but that the speedup was relatively higher for a lower number of threads; the author claims that the observed differences in execution time were not a direct consequence of parallelism, but a result of differences in the models.

This work [22] also focuses only on implementations of matrix multiplication. Again, we see that performance analysis research is fit to one single problem, and not applied to a broader spectrum of problems where different features related to parallelism are tested properly. The project referenced the concurrency performance comparison article [21], and stated that Go (version 1.2) compiles three times faster than Java (version 1.7.0_45). While both studies use matrix multiplication to measure execution time, one cannot generalise this compile time difference between the models, which is bad for the project's integrity. Nevertheless, these sources still contain relevant information for performance analysis, if we combine it with a broad set of programs from the CLBG suite to get an idea of the performance of the Akka actor model.

As for CLBG, it is referenced in existing research as a benchmark for an adaptation of C which allows dynamic code execution on the Java Virtual Machine (JVM) [23]. This adaptation is called TruffleC, and uses an efficient implementation of the C interpreter to dynamically compile code, while most traditional compilers produce static code. CLBG was used as a benchmark to test the performance of TruffleC versus static C compilers [23].

The results of the CLBG benchmark in this study are not explained in sufficient detail. The study does not use CLBG to its full potential, because not all metrics are shown in the results. This study is relevant to this thesis nonetheless, because the Akka platform is also built on the JVM. In this thesis, the CLBG metrics will be put to better use when benchmarking Akka.

To summarise, correct benchmarking requires proper selection of applications and metrics, including a thorough understanding of the application behaviour and the meaning of the metrics. The benchmark should also be transparent about the environment in which it runs, and the parameters used should be reported. Programs that are benchmarked should be simple, to aid in the analysis of the results. Well-selected simple programs lower the amount of reasoning required to understand why certain behaviour is observed. Knowledge of the input programs used and the models in which they are implemented is also important to properly analyse results. Previous studies show that performance analysis should not be generalised, because overfitting to a specific case is common. Another previous study used CLBG, but did not use all the metrics available in CLBG. Not using all of the provided metrics makes it harder to give a broader comparison of a model's performance.

Based on this related work analysis, and to the best of our knowledge, we assess that the work presented in this thesis is the first that attempts to extend CLBG with Akka, while using and reporting all metrics of the benchmark for a subset of its applications ported to Akka.


CHAPTER 3

Akka vs Java: a microbenchmark

In this chapter we study how Akka and Java compare to each other, using a simple synthetic input program. This analysis will provide an answer to our first research sub-question (SQ1, see Section 1.1).

3.1 Experiments setup

The experiments done in this thesis were performed on an HP Omen 15 (ax010nd) laptop (2016). This model has an Intel i5 6300HQ CPU, with a base clock of 2.3 GHz and the ability to boost to 3.2 GHz. The CPU has 4 cores and 4 hardware threads. The laptop has 8 GB of DDR4 RAM operating at 2700 MHz. The OS is Ubuntu 16.04 LTS, installed on an M.2 NVMe SSD (256 GB). The laptop was connected to the power grid at all times during experimentation, and any unnecessary user applications were closed before running experiments. For model version details, see Table 4.2. All experiments were automated using bash, and all plots were created using matplotlib in Python [24].

3.2 The Counter program

To get an idea of how Akka actors perform against native Java multithreading, we propose a comparison using a synthetic input program and measuring execution time. This program performs integer addition using the unary increment operator (++), where each thread (or actor) performs an equal share of the total iterations. The thread count and the total iterations can be given as arguments to the input program.

For our first experiment, the timer is started when the threads or actors are being created. The timer stops when the result of the program is computed and all threads or actors are fully done. Results are aggregated and displayed using bash and Python. The core of the Java implementation is presented in Algorithm 1, and the Akka implementation can be seen in Algorithms 2, 3 and 4.


Algorithm 1 Java multi-threaded counter. The time is measured with System.nanoTime(). Threads are stored so they can be joined later on when all work is done. Threads are created and then immediately started.

 1: startTime = System.nanoTime();
 2: for (int i = 0; i < totalThreads; i++) do
 3:     Thread t = new Thread(() -> {
 4:         NonAkkaCounter counter = new NonAkkaCounter();
 5:         for (int j = 0; j < iterLimit; j++) do
 6:             counter.count++;
 7:         end for
 8:     });
 9:     t.start();
10:     threadArray.add(t);      ▷ Store the created threads to be able to check for completion
11: end for
12: EndTimerWhenDone();          ▷ Joins the threads from the threadArray and stops the timer

Algorithm 2 Akka code snippet of the main (parent) actor which creates counter actors and tells them to start counting. The time is measured with System.nanoTime().

1: startTime = System.nanoTime();
2: for (int i = 0; i < totalActors; i++) do
3:     final ActorRef<Counter.Command> counterActor =
4:         getContext().spawn(Counter.create(), "subCounter" + i);
5:     counterActor.tell(new Counter.Loop(iterLimit, getContext().getSelf()));
6: end for

Algorithm 3 Akka code snippet of a child counter actor which performs the counting. It tells the main actor it is done when the for-loop is done. See Alg. 4 for how this is handled by the main actor.

1: procedure private Behavior<Command> onLoop(Loop loopInfo)
2:     for (int i = 0; i < loopInfo.limit; i++) do
3:         count++;
4:     end for
5:     loopInfo.parentActor.tell(new AkkaMainActor.CounterFinished(count));
6:     return Behaviors.same();
7: end procedure

Algorithm 4 Akka main actor which stops the timer if all children are done.

1: procedure private Behavior<Main> onCounterFinished(CounterFinished info)
2:     actorsFinished++;
3:     result.addAndGet(info.count);      ▷ Add the subcount to the atomic integer result
4:     if (actorsFinished ≥ totalActors) then
5:         EndTimer();                    ▷ Stops the timer
6:         return Behaviors.stopped();
7:     end if
8:     return Behaviors.same();
9: end procedure


Figure 3.1: The performance of the counter program, using different amounts of threads (Java) and actors (Akka): (a) counting to 1024; (b) counting to INTMAX. The execution time is the average of 10 runs. Note that the x-axis is in log scale. The initialisation of the Akka ActorSystem is not measured.

3.3 Scalability and overhead

The counter program performs two additions per iteration: one for the for-loop counter and one for the increment of the local counter variable. Each thread or actor computes a part of the total result, and these intermediate results are added together when the threads or actors finish computing. The time is measured by the JVM from within the code. The initialisation of the program and its variables is not relevant for this experiment, because we are interested in the overhead and scalability of our concurrent counter program. We do not include these (near-constant) initialisation costs in the measurements for our first experiment.

Experiment 1a: low count, strong scaling

In the first experiment, we only counted to 1024 (a very low number) using a varying number of threads, up to 1024. In this way, high overhead is expected for both implementations of the counter program. Moreover, as the number of threads/actors increases, the workload per thread decreases to the point where each thread performs a single iteration of the counter loop. As the problem size is fixed, and we only increase the number of threads/actors, this is a strong scalability test for the two models.

The results of Experiment 1a are presented in Figure 3.1a. We make the following observations:

• The Akka actors are faster until the number of actors grows beyond 512.

• The overhead for both Java and Akka outweighs the speedup of additional threads/actors when we have too many counters. Akka performs much faster before this point.

• The fastest time was observed when using a single actor. This behaviour confirms our expectation: the problem size was so small that any additional threads/actors caused extra overhead, and no visible performance gain.


Experiment 1b: high count, strong scaling

The experiment was repeated for a much larger problem size (i.e., equal to the maximum 32-bit signed integer, INTMAX). The results are displayed in Figure 3.1b. We make the following observations:

• For a lower thread/actor count, Akka is faster in this experiment as well. However, the Java threading implementation outperforms Akka sooner than in Figure 3.1a (i.e., at 48 threads).

• Excessively large numbers of threads lead to the same performance behaviour as that observed in Experiment 1a (Figure 3.1a): too many threads/actors cause too much overhead, and slow the execution down.

3.4 Measuring different phases

As seen in Figures 3.1a and 3.1b, Akka is faster for a lower number of actors. In Figure 3.1a, the execution time initially does not change a lot when the number of threads is changed. To get an idea of what causes Java to be slower than Akka in these cases, the previous tests were extended to measure thread/actor creation and assignment of work, in addition to the time to completion which we already measured.

Experiment 2: Initialisation cost

The previous experiments did not measure the creation of the Akka actor system, in which the main actor that spawns the child counter actors is initialised. Since these actors use the same (default) dispatcher, it is possible that the threads on which active Akka actors are processed are created outside of our measured time frame, explaining the difference between Java and Akka in terms of overhead.

The thread/actor creation time is measured once all threads/actors have been created; time is measured from the moment both implementations are started (in the main functions), but we also measure from the original starting point. The latter is meant to capture the overhead of creating threads, among other initialisation factors in Akka. Once initialisation is done, all threads/actors are told to start working, and when this is done the next time interval is measured. When the threads/actors are all done working, the last interval is measured.

Algorithm 5 Code snippet of the Java counter program with the measurement of the thread creation and starting.

 1: startTime = System.nanoTime();
 2: for (int i = 0; i < totalThreads; i++) do
 3:     Thread t = new Thread(() -> {
 4:         NonAkkaCounter counter = new NonAkkaCounter();
 5:         for (int j = 0; j < iterLimit; j++) do
 6:             counter.count++;
 7:         end for
 8:     });
 9:     threadArray.add(t);                       ▷ Store the created threads to be able to check for completion
10: end for
11: timeInterval(System.nanoTime(), "INIT");      ▷ Measures the initialisation time and resets startTime
12: for (int i = 0; i < totalThreads; i++) do
13:     threadArray.get(i).start();
14: end for
15: timeInterval(System.nanoTime(), "WORK");      ▷ Measures the time for starting all tasks and resets startTime
16: EndTimerWhenDone("DONE");                     ▷ Joins the threads from the threadArray and stops the time


In Algorithm 5, we see how these intervals were measured. Note that the code in this algorithm is different from the code in Algorithm 1, because the creation and starting of the threads have been split into different loops. Creating and starting a thread right away while new threads are still being created results in an early start of these threads, and if the spawned threads exceed the logical threads of the run-time environment, the scheduling overhead of creation and starting could interfere with one another. Since the program only finishes when all threads are done, starting the first thread before the other threads have been created can then increase the execution time due to the mentioned overhead.

The results of this experiment can be seen in Figure 3.2, for the low-count and high-count versions of the experiment, respectively.

We make the following observations:

• Java Thread initialisation seems constant for all threads, which is a significant penalty to the performance of the Java implementation. In Figure 3.2a, we can see that the work after thread creation takes far less time than the thread creation itself. This overhead does not scale with the workload size, because the line is around 45 milliseconds in both Figures 3.2a and 3.2b.

• In Figures 3.2c and 3.2d we can see that the overhead of starting up an ActorSystem and some actors takes up quite some time, and that this scales with the number of actors that need to be created. This scaling can be seen more clearly in Figures 3.2a and 3.2b. The initialisation of the ActorSystem itself takes roughly 840 ms, and as more actors are created this overhead increases slightly.


Figure 3.2: Counting to 1024 and INTMAX using different amounts of threads (Java) and actors (Akka). Elapsed time is displayed for the three phases of the counter program: (a) 1024 and (b) INTMAX, with initialisation measured from the start of creating threads/child actors; (c) 1024 and (d) INTMAX, with full initialisation measured. Average of 10 runs. Notice that the x-axis is in log scale. Code in Alg. 5.


3.5 Initialisation of the Akka actor model

The Akka actor initialisation overhead mainly comes from the ActorSystem, which also handles the creation of system actors and the multithreading, either via a TPE or an FJP. The time it takes for an actor to receive its first message is negligible compared to the creation of the ActorSystem or the construction of an actor. This means that once an actor has finished executing its constructor method, it is ready to process messages.

Experiment 3: Akka initialisation

To learn more about the initialisation of Akka, we timed the creation of the ActorSystem, the construction of the actors themselves, and the time to send the first message in the case of our counter program. The results are in Table 3.1, and they explain the large overhead visible in Figures 3.2c and 3.2d: the creation of the ActorSystem was measured in these experiments. An ActorSystem cannot exist without any actors, so one could argue that the ActorSystem creation is actually the sum of the actor construction and the ActorSystem creation. We note that executing the code in an actor's constructor method never took more than 40 milliseconds. This is not shown in Table 3.1 because it falls under the Actor Construction phase. Isolating actor creation from ActorSystem creation is hard because they are created hand-in-hand.
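A sketch of how the ActorSystem creation phase can be timed from within the JVM (illustrative; the actual instrumentation used for Table 3.1 may differ):

import akka.actor.typed.ActorSystem;
import akka.actor.typed.javadsl.Behaviors;

public class InitTiming {
    public static void main(String[] args) {
        long t0 = System.nanoTime();
        // Creating the ActorSystem also constructs the guardian (root) actor.
        ActorSystem<String> system = ActorSystem.create(Behaviors.<String>ignore(), "timing");
        long creationMs = (System.nanoTime() - t0) / 1_000_000;
        System.out.println("ActorSystem creation took " + creationMs + " ms");
        system.terminate();
    }
}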

                          Mean (n=10) [ms]   Std. Dev. [ms]
Actor Construction        69.15              2.47
ActorSystem Creation      775.18             9.35
Receiving first message   3.33               0.37
Total time                847.66             8.51

Table 3.1: Average elapsed time of Akka initialisation, measured using the counter program. The number of actors does not affect the results, because the spawning of the child actors is not measured here; only the main actor is examined.

3.6 ForkJoinPools and ThreadPoolExecutors

So far, we have observed that Java threading scales differently than Akka actors when the number of threads/actors increases, and this might have to do with the Akka (default) dispatcher versus spawning a lot of threads in the Java implementation.

Akka dispatchers use either a ForkJoinPool (FJP) or a ThreadPoolExecutor (TPE). The former is the default setting, but both options will use the number of threads in the common threadpool, unless Akka is instructed otherwise. As such, using 1024 actors in Akka does not mean 1024 threads are utilised.

The FJP and TPE options are both native to Java. To make a fair, and thus better, comparison between Java and Akka, we should compare the performance where both implementations use an FJP to achieve multithreading. For this to be done correctly, both implementations should use the same number of threads.

We will also compare Java and Akka using the TPE with these pool sizes, because we want to know what the difference is between FJP and TPE. We expect that the FJP implementations will be faster, because this executor is claimed to be fast when the task size is small [16], which is the case for our counter program.

Experiment 4: ForkJoinPool comparison

To enable a fair Akka vs Java comparison when using FJP, we must use FJPs of matching sizes in Akka and Java.

Because the common threadpool in our case (refer to Section 3.1) uses 3 active threads (determined by Java), we first examine the time to completion of our counter experiments using a pool size of 3. We also experiment with pool sizes of 4 and 8, to see the effect of larger pool sizes.
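On the Java side, the pool size can be pinned by constructing the ForkJoinPool explicitly instead of relying on the common pool; a sketch (the empty lambda stands in for the counter work of one task):

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.TimeUnit;

public class FixedPoolCounter {
    public static void main(String[] args) throws InterruptedException {
        int poolSize = 3;                       // match the Akka dispatcher's pool size
        ForkJoinPool pool = new ForkJoinPool(poolSize);
        for (int i = 0; i < 1024; i++) {
            pool.submit(() -> { /* counter work for one task */ });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}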


Figures 3.3a and 3.4a show the results of Experiment 4. We make the following observations:

• Initially, the performance using different pool sizes is roughly the same. However, when the number of tasks grows, differences in elapsed time can be observed between the pool sizes for both Java and Akka, although in Akka the difference is more noticeable.

• For the small problem size (seen in Figure 3.3a) the pool size of 3 performs the best. Also, for this case, using too many threads adds more overhead than speedup in Akka.

• For the large problem size (seen in Figure 3.4a) the pool size of 8 is faster for both Java and Akka.

• The variability of the results (pictured as standard deviation in the graphs) grows larger as the pool size increases, for both the small and large problem sizes. A possible explanation for this is that the machine on which the tests were run has only 4 hardware threads (Section 3.1), and other (system) processes were still running during the test (only necessary user programs were running); these can affect the measured time through unrelated context switches, which show up as extra overhead in this experiment.

Experiment 5: ThreadPoolExecutor comparison

The results for both cases (low- and high-count) are presented in Figures 3.3b and 3.4b, respectively. We make the following observations:

• The performance trends for the TPE are similar to those measured for FJP.

• For TPE the different pool sizes do not affect the elapsed time as much as with FJP, with the exception of p=8 for Java in Figure 3.4a. In this case, the increased number of tasks (and thus a smaller task size) lowered the elapsed time, but the elapsed time increases again after 256 tasks.

• The standard deviation areas are smaller for the TPE implementation in both Java and Akka. Pool sizes seem to affect the performance of a TPE implementation less than an FJP implementation in Java, if the problem size is larger. For the small problem size in Figure 3.3b, a larger pool size caused more overhead.

• In Akka, the poolsize affected the performance less than it did in the FJP implementation.


(a) 1024: ForkJoinPool (b) 1024: ThreadPoolExecutor

Figure 3.3: Counting to 1024 using FJP and TPE implementations in Java and Akka. Average of 10 runs. The figures show the standard deviation as well. The chosen pool sizes are 3, 4 and 8.

(a) INTMAX: ForkJoinPool (b) INTMAX: ThreadPoolExecutor

Figure 3.4: Counting to INTMAX using FJP and TPE implementations in Java and Akka. Average of 10 runs. The figures show the standard deviation as well.


CHAPTER 4

CLBG for Akka

In this chapter, we discuss the porting of the CLBG programs to Akka and analyse their performance in order to answer sub-question SQ2.

4.1 Selecting relevant CLBG programs

Because we cannot port every input program available in CLBG due to time constraints, we need to select programs based on their relevance to Akka. This means we select programs which might show interesting insights into the performance of Akka in relation to other models. We prefer programs where concurrency is used, because the Akka actor model can then be tested better. We can identify these candidates by analysing whether existing implementations (in other models) show concurrent behaviour.

Because our Akka implementations are written in Java, looking at the Java implementations can make it easier to determine whether a program is suited for porting to Akka. Specifically, we looked at the following aspects in the source code:

• The use of multithreading and its configuration (e.g. ThreadPoolExecutors in Java);

• The size of individual tasks (compared to each other);

• The amount of tasks;

• Blocking operations (e.g. reading/writing to a file).

On top of these criteria, we also check the (code) complexity of the program: smaller/simpler programs are easier to port, of course, and if the code is less complex we can make stronger assumptions when analysing the results as to why we observe certain behaviour.

We have identified five interesting candidate programs, listed in Table 4.1; the two underlined programs in the table are the ones we selected for porting. Both selected programs are concurrent, meaning that we can make an Akka port which utilises what the Akka actor model has to offer. The Binary Trees program is more oriented towards memory allocation, and the memory usage of Akka seems important for us to investigate. The program is non-blocking. The Reverse Complement program is blocking, however. This is important to the Akka implementation, and we are curious to see how the performance of Akka relates to other models in a blocking program. These programs have a relatively low code complexity, making them more favourable than other programs. Spectral Norm and Fannkuch-redux are more complex, but their listed properties are similar to those of Binary Trees, while Binary Trees also stresses memory. Choosing Binary Trees saves time and can provide information about Akka's performance in a memory-intensive concurrent program. The Pidigits program is sequential, and we assume that Akka would not perform better than Java in this case.


Program          Description                                                                         Conc.   Task size     Blocking
Binary Trees     (De)allocating many perfect binary trees and checking validity. Memory intensive.   Yes     Balanced      No
Rev. Complement  Constructing the reverse complement of a DNA string by using a look-up table        Yes     Balanced      Yes
                 and a buffered file read.
Spectral Norm    Calculate the Spectral Norm [25][26].                                               Yes     Balanced      No
Pidigits         Determine the digits of Pi, sequentially. Uses the unbounded Spigot algorithm [27]. No      Single task   No
Fannkuch-redux   Fannkuch (pancake) flipping algorithm which works on permutations [28].             Yes     Imbalanced    No

Table 4.1: Overview of our five selected (unordered) input CLBG programs and their features. The underlined programs were chosen to benchmark Akka. Task size refers to the balance among task sizes, where "Balanced" means that each task performs roughly the same amount of work. Whether the program has any blocking operations can be observed in the right-most column.

4.2 Program 1: Binary trees

The Binary Trees program is focused on binary tree creation, thus focusing on memory allocation. The program takes one parameter, N, which is the tree depth. The fixed value N = 21 is used. This means that the program works with very large binary trees, which we expect to take up quite a lot of memory. Memory efficiency of a model is important, and with so many tree nodes, differences in memory efficiency, if any, should become apparent in the results.

The program must consist of a Tree class or structure, which has pointers to its left and right child nodes, and methods that must allow for [29]:

• Tree allocation (constructor)

• Tree deallocation (destructor, implicit in most models)

• Tree traversal with a node count. This is meant to validate the tree's node count. Validation is done by printing the number of nodes. This output should match the reference output provided by CLBG.

The program itself creates the following trees, in order:

1. A stretch tree with depth N + 1, to verify there is enough memory available.

2. A long-lived tree allocated with depth N, which must remain in memory until all the other trees are (de)allocated.

3. Several trees created with varying depth, starting at 4 and increasing to N with a step size of 2. Each of these trees is created multiple times.

4. Finally deallocate the long-lived tree.

4.2.1 Porting to Akka from pseudocode

The Binary Trees program pseudocode can be seen in Algorithm 6. This code is based on the existing implementations and the program specifications. The pseudocode is meant to give a high-level abstraction of the program, making the implementation of the program in Akka (see Algorithm 7) easier to write.

There are important differences when using the Akka actor model as opposed to the more traditional programming principles one would use in Java. In Algorithm 6 we see a MAIN procedure (line 31) which handles most of the program's work by using the TREE class. Data flows directly from the helper function CREATE-TREE (line 15) into the MAIN procedure's variable checkSum (line 39). However, in Algorithm 7, the MAIN procedure (Alg. 7, line 45) is very short, because the MAINACTOR handles what the original MAIN did.


This difference exists in Akka because the data flow between actors should always happen using messages. These messages are typically sent using the TELL command. A non-actor cannot be "told" something by an actor, and therefore we need a MAINACTOR to be able to receive (and send) data.

In the Akka pseudocode in Algorithm 7, the checkSum is sent (line 46) using the TELL command rather than using a return statement (Alg. 6, lines 17 and 19). The MAINACTOR then receives the checkSum (line 20) and orders the output so that it can be printed in the correct order. In Algorithm 6, a similar approach is used to ensure the order of the output, but this is not shown in the pseudocode for the sake of simplicity.

We also need to keep track of how many actors are done, because Akka otherwise does not terminate (by default). This requires an analysis of the conditions for program termination, in a more complicated sense than when waiting for an FJP to finish. Specifically, this means we must explicitly keep count of the tasks started and stopped, as seen in Algorithm 7 in the form of actorsBusy (lines 17 and 22).
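A minimal sketch of this bookkeeping in Akka Typed's functional Java style (names are illustrative); in Akka Typed, stopping the guardian behaviour is what lets the ActorSystem shut down:

import akka.actor.typed.Behavior;
import akka.actor.typed.javadsl.Behaviors;

public class Coordinator {
    public interface Command {}
    public static final class WorkerFinished implements Command {}

    // Guardian behaviour that waits for `remaining` WorkerFinished messages, then stops.
    public static Behavior<Command> waitingFor(int remaining) {
        return Behaviors.receive(Command.class)
                .onMessage(WorkerFinished.class, msg -> {
                    if (remaining == 1) {
                        return Behaviors.stopped();   // last worker reported: terminate
                    }
                    return waitingFor(remaining - 1); // otherwise keep waiting with a new count
                })
                .build();
    }
}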


Algorithm 6 Binary Tree program: Tree class, helper function, and Main program

 1: class Tree
 2:     left_child : TREE
 3:     right_child : TREE
 4:     ▷ Constructor: Allocate the node and set its children
 5:     procedure TREE(TREE left, TREE right)
 6:         left_child = left
 7:         right_child = right
 8:     end procedure
 9:     ▷ Constructor: Allocate the node without children
10:     procedure TREE
11:         left_child = NULL
12:         right_child = NULL
13:     end procedure
14:     ▷ Method: Traverse the tree and count nodes recursively
15:     procedure CHECK-TREE
16:         if left_child == NULL then      ▷ The tree is perfect, no need to check both children
17:             return 1
18:         end if
19:         return 1 + left_child.CHECK-TREE() + right_child.CHECK-TREE()
20:     end procedure
21: end class
22: ▷ Helper function to create trees given depth N, from the bottom up
23: procedure CREATE-TREE(Integer N)
24:     if 0 < N then
25:         return TREE(CREATE-TREE(N-1), CREATE-TREE(N-1))
26:     end if
27:     return TREE()
28: end procedure
29: ▷ Main code of the program. N is the maximum tree depth.
30: ▷ De-allocation must happen after each PRINT, which happens implicitly in most models.
31: procedure Main(Integer N)
32:     stretchTree = CREATE-TREE(N+1)
33:     PRINT(stretchTree.CHECK-TREE())
34:     longLivedTree = CREATE-TREE(N)
35:     for (depth = 4; depth ≤ N; depth += 2) do
36:         checkSum = 0
37:         iters = 1 << (N - depth + 4)
38:         for (i = 1; i ≤ iters; i++) do
39:             checkSum += CREATE-TREE(depth).CHECK-TREE()
40:         end for
41:         ▷ Ensure that least-depth trees are printed first to match the reference output
42:         PRINT(checkSum)
43:     end for
44:     PRINT(longLivedTree.CHECK-TREE())
45: end procedure


4.2.2 CLBG results - Binary Trees

The CLBG suite was extended with an Akka implementation based on the pseudocode in Algorithm 7. The tree depth was chosen to be N = 21, as specified in the program's description in the documentation [29]. The benchmark was configured to run each test 10 times. Unfortunately, the suite does not provide a way to get the standard deviation of the measurements, but based on some preliminary testing we assume the standard deviation is not significant. The tests were run with the same experiment setup as in previous experiments (Section 3.1). While CLBG provides implementations in many models, only a handful were benchmarked, because many models did not build or run correctly (e.g., compilation issues), even after some tweaking.

The chosen models are listed in Table 4.2. Most models in CLBG have multiple implementations, so the implementation ID is suffixed to the program name. The source code for these implementations can be found in the CLBG documentation for the Binary Trees program [29]. The Akka implementation will be referred to as akka-1, but it can of course not be found in the CLBG documentation.

Model    Version     Features
Akka     2.6.4       Concurrency, Actor model, Object-oriented, used within Java (or Scala), JVM
Dart     2.8.2       Concurrency, For user applications, Object-oriented
Go       1.6.2       Concurrency, Go-routines
Python   3.6.8       Concurrency, Scripting language, supports OOP
Java     1.8.0_201   Concurrency, Object-oriented, JVM
JRuby    1.7.22      Concurrency, Java implementation of Ruby, Object-oriented, JVM
Julia    1.4.1       Concurrency, Numerical applications

Table 4.2: Models used in our CLBG experiments. All models support concurrency.

Figure 4.1a presents the total execution time of the Binary Trees program for the tested models (which are colour-coded to make the models more distinctive). The Java and Akka implementations executed the fastest, with java-7 and akka-1 having comparable execution times. In Figure 4.1b we can see that Java and Akka are similar in terms of CPU time as well. Figure 4.1 shows that models are mostly grouped in their CPU time and elapsed time measurements; this grouping is especially visible in Figure 4.1b. jruby-5 has a higher CPU time than jruby-4, but a lower elapsed time than the other JRuby implementations. This is an example of how a lower CPU time is not necessarily better in terms of fast execution. Factors which contribute to this difference are the level of parallelism and CPU idle times. CPU time tells us how much load is put on the CPU, and Figure 4.1 shows us that both Java and Akka have fast execution times while putting relatively little stress on the CPU, which is an indication of their CPU efficiency when executing this program.

Between Java and Akka, akka-1 has a similar elapsed time but a slightly lower CPU time than java-7. Furthermore, java-2, java-3 and java-6 all have a lower CPU time than akka-1, yet their elapsed time is 50% longer than that of akka-1. Akka can be as fast as Java, while having a 9% lower CPU time for this program. The CPU time and CPU load orderings do not match each other directly, because CPU time is not determined solely by CPU load. julia-2 has the least total CPU load but does not have the lowest CPU time, for example. For implementations within the same model, however, implementations with a lower CPU time generally have a lower CPU load than other implementations within that model. In Figure 4.2 we can see that the CPU load of akka-1 is lower than that of java-7, which is more evidence that Akka is more CPU efficient in this case, because java-7 has a similar execution time to akka-1.

For the tested program, Akka seems to be more CPU efficient than Java. If we compare java-7 and akka-1 in terms of memory, Figure 4.3a shows that Akka uses about the same amount of memory as java-7 (about 5% less, to be precise). Other Java implementations use less memory than akka-1, however, but these also had a longer execution time. Similar to CPU time, the models seem to be grouped in their memory usage, which can easily be observed by looking at the colours of the bar chart in Figure 4.3a. The results for Go were left out of this analysis, because the measurements were invalid (CLBG did not properly measure the memory of the child process used by Go).


Algorithm 7 Implementation in Akka (pseudocode) of the Binary Trees program

 1: class MAINACTOR
 2:     minDepth : Integer
 3:     maxDepth : Integer
 4:     actorsBusy : Integer
 5:     longLivedTree : TREE
 6:     OutputArray : Array
 7:     procedure Create(Integer minDepth, Integer maxDepth)
 8:         minDepth = minDepth
 9:         maxDepth = maxDepth
10:         actorsBusy = 0
11:         stretchTree = CREATE-TREE(maxDepth+1)
12:         PRINT(stretchTree.CHECK-TREE())
13:         longLivedTree = CREATE-TREE(maxDepth)
14:         for (depth = minDepth; depth ≤ maxDepth; depth += 2) do
15:             newActor = TreeMakeActor.CREATE(minDepth, maxDepth, GET-SELF())
16:             newActor.TELL(MakeTreeCommand(depth))
17:             actorsBusy++
18:         end for
19:     end procedure
20:     procedure OnStopCommandReceived(Integer depth, Integer checkSum)
21:         OutputArray[(depth - minDepth) / 2] = checkSum
22:         if (--actorsBusy ≤ 0) then
23:             for String s in OutputArray do
24:                 PRINT(s)
25:             end for
26:             PRINT(longLivedTree.CHECK-TREE())
27:             STOP-AKKA()
28:         end if
29:     end procedure
30: end class
31: class TreeMakeActor
32:     minDepth : Integer
33:     maxDepth : Integer
34:     parentActor : MAINACTOR
35:     procedure Create(Integer minDepth, Integer maxDepth, MAINACTOR replyTo)
36:         minDepth = minDepth
37:         maxDepth = maxDepth
38:         parentActor = replyTo
39:     end procedure
40:     procedure OnMakeTreeCommandReceived(Integer depth)
41:         checkSum = 0
42:         iters = 1 << (maxDepth - depth + minDepth)
43:         for (i = 1; i ≤ iters; i++) do
44:             checkSum += CREATE-TREE(depth).CHECK-TREE()
45:         end for
46:         parentActor.TELL(StopCommand(depth, checkSum))
47:     end procedure
48: end class
49: ▷ Creates the MAINACTOR, which automatically starts the creation of trees.
50: procedure Main(Integer N)
51:     MAINACTOR.CREATE(4, N)
52: end procedure


(a) Elapsed time (seconds) (b) Total CPU time (seconds)

Figure 4.1: Binary Trees: CLBG measurements of elapsed time (seconds) and total CPU time (seconds). Averages of 10 runs. The number suffixed to model names indicates the implementation number within CLBG.

Figure 4.2: Binary Trees: CLBG measurements of CPU core load. Average of 10 runs. The number suffixed to model names indicates the implementation number within CLBG.

Finally, we also compare the compressed code size for Binary Trees. The size is measured using the GZIP tool and the results are presented in Figure 4.3b. We observe that Java implementations require less code than Akka, which is expected given that the Akka actor model requires programmers to write more code. This is visible in Algorithm 7, which is more verbose (while still being simplified pseudocode) than Algorithm 6 (notice that Algorithm 6 also includes the TREE class, which Algorithm 7 also uses). The results indicate that Akka is one of the most verbose models among the tested programs, only being surpassed by Dart in two cases. However, dart-1 is much shorter than the other Dart implementations (Figure 4.3b), while dart-1 also performed well in terms of elapsed time, CPU time and memory usage (Figures 4.1 and 4.3a). It would be difficult to make akka-1 much shorter while still having the same functionality.

4.3 Program 2: Reverse Complement

The Reverse Complement CLBG program was selected to be ported and benchmarked due to its use of blocking operations. As blocking is detrimental to performance in concurrent applications, this program should provide insight into Akka's performance when handling blocking operations.


(a) Memory used (MB). Average of 10 runs. (b) Compressed code size (Bytes).

Figure 4.3: Binary Trees: CLBG measurements of memory used (MB) and compressed code size (Bytes). The number suffixed to model names indicates the implementation number within CLBG.

The program uses a mapping to create the reverse complement of a nucleotide (building block of DNA) string, producing the complementary DNA string [30]. The nucleotide mapping is one-to-one, and can be seen in Table 4.3. The input DNA string is read from the standard input stream, and the reverse complement is printed to the standard output stream. The input and output are both in FASTA format, which is a standard format for DNA string files [31]. The read and write operations happen using buffers, meaning only a (small) part of the input is processed at a time rather than reading the input stream as a whole. For our benchmark, we will use the 250 MB input file from CLBG [32].
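As an illustration of such a one-to-one mapping (restricted here to the four canonical bases; the actual CLBG mapping in Table 4.3 also covers the IUPAC ambiguity codes), a byte-indexed lookup table could look like this:

public class ComplementMap {
    // Illustrative complement table: lower- and upper-case bases map to the upper-case complement.
    static final byte[] MAP = new byte[128];
    static {
        String from = "ACGTacgt";
        String to   = "TGCATGCA";
        for (int i = 0; i < from.length(); i++) {
            MAP[from.charAt(i)] = (byte) to.charAt(i);
        }
    }

    public static void main(String[] args) {
        System.out.println((char) MAP['A'] + " " + (char) MAP['g']);   // prints "T C"
    }
}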

4.3.1 Porting to Akka from Java

We created our Akka implementation based on one of the Java implementations available in CLBG, namely java-4 [32]. We ran the benchmark before implementing the program in Akka to see which Java program performed well while having a relatively low code complexity and size. We then chose java-4 because porting it to Akka seemed easier than porting the other implementations, and it scored well in the preliminary benchmark despite its short and simple code. Pseudocode for this Java implementation can be seen in Algorithm 8.

The java-4 code was used to create the Akka implementation (akka-1). The reading of input from Algorithm 8 (line 31) is done in the ReaderActor (Alg. 9, line 1), while computing the reverse complement is done by the ReverseActor (Alg. 9, line 16). Once the ReverseActor receives new valid input, it tells the ReaderActor to start reading more input (concurrently); this is what makes the program concurrent. The buffer size (BUFSIZE) was increased from 82 (in java-4) to 1024 in akka-1. We expect this to reduce the overhead of sending and receiving messages, because a larger buffer means fewer tasks and therefore less frequent IO operations and messages. The buffer size already differed between the implementations in CLBG, and no fixed value is specified.
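To put the effect of the buffer size into perspective, consider a rough estimate (ours, assuming every read fills the buffer completely): with the 250MB input, a buffer of 82 bytes leads to roughly 250 · 10^6 / 82 ≈ 3.0 million read/reverse message exchanges between the ReaderActor and the ReverseActor, whereas a buffer of 1024 bytes reduces this to roughly 250 · 10^6 / 1024 ≈ 244 thousand. This ignores partial reads, but it illustrates why the per-message overhead dominates at small buffer sizes.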


Algorithm 8 The pseudocode for java-4 from CLBG

1: class ReversibleByteArray extends java.io.ByteArrayOutputStream
2:     procedure REVERSE()
3:         ▷ count is the number of valid bytes in the ByteArrayOutputStream buffer [33]
4:         if count > 0 then
5:             INTEGER begin = 0
6:             INTEGER end = count - 1
7:             while (buf[begin++] ≠ newline) do
8:                 pass    ▷ Reversal should happen line-by-line, so we search for the next newline
9:             end while
10:             while (begin ≤ end) do
11:                 if (buf[begin] == newline) then
12:                     begin++
13:                 end if
14:                 if (buf[end] == newline) then
15:                     end−−
16:                 end if
17:                 if (begin ≤ end) then
18:                     BYTE tmp = buf[begin]
19:                     buf[begin++] = COMP-MAP(buf[end])    ▷ Apply the mapping (Table 4.3)
20:                     buf[end−−] = COMP-MAP(tmp)
21:                 end if
22:             end while
23:             WRITE(buf, 0, count)    ▷ Write to Std.Out
24:         end if
25:     end procedure
26: end class
27: procedure MAIN()
28:     BYTE[ ] line = new BYTE[BUFSIZE]
29:     INTEGER read
30:     REVERSIBLEBYTEARRAY buf = new REVERSIBLEBYTEARRAY()
31:     while ((read = READ(line)) > 0) do
32:         INTEGER i = 0
33:         INTEGER last = 0
34:         while i < read do
35:             if (line[i] == '>') then
36:                 buf.WRITE(line, last, i - last)
37:                 buf.REVERSE()
38:                 buf.RESET()
39:                 last = i
40:             end if
41:             i++
42:         end while
43:         buf.WRITE(line, last, read - last)
44:     end while
45:     buf.REVERSE()
46: end procedure


Algorithm 9 Akka pseudocode for the Reverse Complement CLBG program (akka-1). It uses the ReversibleByteArray class from Alg. 8 (line 1).

1: class READERACTOR
2:     reverseActor : REVERSEACTOR
3:     line : BYTE[ ]
4:     procedure CREATE()
5:         line = new BYTE[BUFSIZE]
6:         reverseActor = REVERSEACTOR.CREATE(GET-SELF())
7:     end procedure
8:     procedure OnReadCommandReceived()
9:         INTEGER read = READ(line)    ▷ Stores values in the buffer
10:         reverseActor.TELL(ReverseCommand(read, line))
11:         if (read == -1) then
12:             STOP-AKKA()    ▷ Awaits completion of running tasks
13:         end if
14:     end procedure
15: end class
16: class REVERSEACTOR
17:     readerActor : READERACTOR
18:     line : BYTE[ ]
19:     buf : REVERSIBLEBYTEARRAY
20:     procedure CREATE(READERACTOR reader)
21:         line = new BYTE[BUFSIZE]
22:         buf = new REVERSIBLEBYTEARRAY()
23:         readerActor = reader
24:     end procedure
25:     procedure OnReverseCommandReceived(INTEGER read, BYTE[ ] line)
26:         if (read ≠ -1) then
27:             readerActor.TELL(ReadCommand())    ▷ If there is more to read, start a new read
28:             INTEGER i = 0
29:             INTEGER last = 0
30:             while i < read do
31:                 if (line[i] == '>') then
32:                     buf.WRITE(line, last, i - last)
33:                     buf.REVERSE()
34:                     buf.RESET()
35:                     last = i
36:                 end if
37:                 i++
38:             end while
39:             buf.WRITE(line, last, read - last)
40:         else
41:             buf.REVERSE()
42:         end if
43:     end procedure
44: end class
45: procedure Main(Integer N)
46:     READERACTOR.CREATE().TELL(ReadCommand())
47: end procedure
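To give an impression of how Algorithm 9 maps onto actual Akka code, the sketch below expresses the ReadCommand/ReverseCommand protocol with the Akka Typed Java API. This is a simplified illustration written for this description, not the benchmarked akka-1 source: the ReversibleByteArray processing and the shutdown logic (STOP-AKKA in Alg. 9) are omitted, and all class and message names are our own.

import akka.actor.typed.ActorRef;
import akka.actor.typed.ActorSystem;
import akka.actor.typed.Behavior;
import akka.actor.typed.javadsl.Behaviors;

public class RevCompSketch {
    static final int BUFSIZE = 1024;

    // Message protocol between the two actors.
    interface ReaderMsg {}
    static final class ReadCommand implements ReaderMsg {}

    interface ReverserMsg {}
    static final class ReverseCommand implements ReverserMsg {
        final int read;
        final byte[] line;
        ReverseCommand(int read, byte[] line) { this.read = read; this.line = line; }
    }

    // ReaderActor: reads one buffer from standard input and hands it to the ReverseActor.
    static Behavior<ReaderMsg> readerActor() {
        return Behaviors.setup(ctx -> {
            ActorRef<ReverserMsg> reverser = ctx.spawn(reverseActor(ctx.getSelf()), "reverser");
            return Behaviors.receive(ReaderMsg.class)
                    .onMessage(ReadCommand.class, msg -> {
                        byte[] line = new byte[BUFSIZE];
                        int read = System.in.read(line);
                        reverser.tell(new ReverseCommand(read, line));
                        return Behaviors.same(); // termination handling omitted for brevity
                    })
                    .build();
        });
    }

    // ReverseActor: requests the next buffer before processing the current one,
    // so reading and reversing overlap (the core idea of the akka-1 port).
    static Behavior<ReverserMsg> reverseActor(ActorRef<ReaderMsg> reader) {
        return Behaviors.receive(ReverserMsg.class)
                .onMessage(ReverseCommand.class, msg -> {
                    if (msg.read != -1) {
                        reader.tell(new ReadCommand());
                        // ... feed msg.line into the ReversibleByteArray here ...
                    }
                    return Behaviors.same();
                })
                .build();
    }

    public static void main(String[] args) {
        ActorSystem<ReaderMsg> system = ActorSystem.create(readerActor(), "revcomp");
        system.tell(new ReadCommand()); // kick off the first read
    }
}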


Code   Meaning        Complement
A      A              T
C      C              G
G      G              C
T/U    T              A
M      A, C           K
R      A, G           Y
W      A, T           W
S      C, G           S
Y      C, T           R
K      G, T           M
V      A, C, G        B
H      A, C, T        D
D      A, G, T        H
B      C, G, T        V
N      G, A, T, C     N

Table 4.3: The mapping used to get the Reverse Complement. The Code column is a representation of nucleotides; N, for example, can be seen as a wildcard because it can be any of the four nucleotides. The Complement column denotes the output of the mapping for the given code value [32].
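For concreteness, the lookup behind COMP-MAP in Algorithms 8 and 9 can be implemented as a 256-entry byte table built from Table 4.3. The sketch below is our own illustration (the benchmarked implementations build this table in their own way); it also maps lower-case input to the same complements.

public class ComplementMap {
    static final byte[] COMP = new byte[256];
    static {
        // Codes and their complements, taken row by row from Table 4.3 (U maps like T).
        String codes       = "ACGTUMRWSYKVHDBN";
        String complements = "TGCAAKYWSRMBDHVN";
        for (int i = 0; i < 256; i++) COMP[i] = (byte) i; // e.g. newline maps to itself
        for (int i = 0; i < codes.length(); i++) {
            COMP[codes.charAt(i)] = (byte) complements.charAt(i);
            COMP[Character.toLowerCase(codes.charAt(i))] = (byte) complements.charAt(i);
        }
    }

    // Complement of a single nucleotide code, as used in the reversal loop of Algorithm 8.
    static byte complement(byte nucleotide) {
        return COMP[nucleotide & 0xFF];
    }
}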

4.3.2 CLBG results - Reverse Complement

Figure 4.4a presents the elapsed-time results of running the CLBG benchmark 10 times for all the models selected in Table 4.2. We observe that Akka had a higher execution time than most models, and Java was always faster in this experiment. The akka-1 implementation was based on java-4, which executed faster than akka-1: while akka-1 was able to read input concurrently, the overhead of communication between actors outweighed this benefit.

Preliminary testing of the Akka port during implementation showed that a lower buffer size slowed down Akka significantly, as it resulted in more overhead: communication between the ReaderActor and ReverseActor happened more frequently.

In Figure 4.4b, depicting CPU time, we also see a relatively high CPU time for Akka compared to other models; again, java-4 outperformed akka-1. Furthermore, Figure 4.5 shows that akka-1 has a higher CPU load than most models, although some Java implementations had an even higher CPU load. However, these models (java-3 and java-8) were among the fastest in terms of elapsed time (Figure 4.4a). We can therefore state that the akka-1 implementation is less efficient than the Java implementations for this program: it has a higher CPU time, CPU core load, and execution time than most models, including Java.


(a) Elapsed time (seconds) (b) Total CPU time (seconds)

Figure 4.4: Reverse Complement: CLBG measurements of elapsed time (seconds) and total CPU time (seconds). Averages of 10 runs.

Figure 4.5: Reverse Complement: CLBG measurements of CPU core load. Average of 10 runs.


(a) Memory used (MB). Average of 10 runs. (b) Compressed code size (Bytes).

Figure 4.6: Reverse Complement: CLBG measurements of memory used (MB) and compressed code size (Bytes).

While Akka's CPU performance is poor for this program, the amount of memory used is only slightly higher than Java's, and models like Dart and JRuby use much more memory than both Akka and Java (Figure 4.6a). We do see that implementations within the same model can vary in memory usage; this holds for all models, as there is little grouping per model. For the Binary Trees program, the memory usage was more grouped (Figure 4.3a). Overall, the results indicate that, for this program, memory usage is more implementation-specific than model-dependent, which makes the data in Figure 4.6a less suited for generalised statements about Akka's memory usage.

In terms of code size, the results are not what we expected, because Akka tends to be more verbose than Java. However, Figure 4.6b shows that Akka is more concise than several Java implementations. Nevertheless, it is still more verbose than most implementations, and java-4 is shorter than akka-1. This difference in length is not visible in the pseudocode in Alg. 8 and Alg. 9, because some of Akka's verbosity was abstracted away to keep the pseudocode brief.



CHAPTER 5

Conclusion and future work

Concurrency is important, and it is more easily achievable if a programming model has it embedded. Because Akka has concurrency deeply embedded in its design, we compared Akka against other models to see how it performs.

5.1 Main findings

The main goal of this project was to answer the following research question: How does the Akka actor model perform compared to other models? To this end, we also formulated two sub-questions:

SQ1: How does Akka compare against Java for a basic application?
SQ2: How does Akka compare against the multiple models in the Computer Language Benchmarks Game?

As we have seen in Chapter 3, where we experiment with a simple synthetic program, using more actors or threads creates more overhead than speedup for our tested program. Having many actors in Akka is more detrimental to the execution time of our program than having many individual threads in Java, although Java also suffers when too many threads are used. Using ForkJoinPools (FJPs) and ThreadPoolExecutors (TPEs) reduces the overhead caused by having too many threads in Java. Akka also uses FJPs or TPEs to work concurrently, but there they back the dispatcher that governs which actors get to work on a thread. Having many Akka actors still causes overhead, because each actor needs to be instantiated and kept in memory. While initialisation in Akka is worse than in Java when executing a simple, short program, the distribution of messages to Akka actors is faster than the starting of tasks in Java. Java was always faster in terms of total execution time, because of the initialisation overhead of Akka. Using FJPs or TPEs in Akka resulted in similar performance for our test program.

An answer to SQ2 can be given based on the results in Chapter 4. In our research, we have seen that for the non-blocking Binary Trees program, Akka was more CPU efficient than most other models, while for the blocking Reverse Complement program, Akka had worse CPU performance than most other models. In both programs, Akka did not have significantly higher memory usage than Java. Akka has a larger code size in general, as the actor model makes it quite verbose. However, a more verbose model does not always produce a much larger code size than less verbose models, because the difference depends on how well the model fits the application being implemented.

Concluding with an answer to our research question: Akka has the potential to outperform other models if it is applied to programs where it is useful to have multiple actors, and blocking operations should be avoided to aid Akka's performance. Akka can perform better than other models in terms of CPU and memory performance, at the cost of code size, but this requires that the actor model is applicable to the program. Java was comparable to or better than Akka in our CLBG experiments. If the initialisation of a program is not that important, Akka can distribute work over tasks more efficiently than Java can using multi-threading. However, having too many actors working at the same time can create significantly more overhead than one would see in, for example, a Java ThreadPool.

5.2 Future work

While our work has provided interesting insights into the performance of Akka versus other models, there are many additional experiments which could be done to make the differences between Akka and other models clearer.

First, all the experiments were performed on one machine, and results can vary on other systems (one with more CPU cores, for example). Thus, repeating these experiments on more systems would be useful for generalisation.

Second, our microbenchmark from Chapter 3 could be extended to also investigate the memory usage of Akka compared to Java, because it would be interesting to see the effects of many actors being held in memory while they are waiting to occupy a thread.

Furthermore, additional research should be done using CLBG, because the suite offers many more input programs. An input program which uses more actors, to check the effects on memory, could be Spectral Norm [25], as high-dimensional matrix multiplication is easy to distribute into tasks. New programs could also be created in both Java and Akka to compare just these two models, as they are closely related to each other.

Finally, it would also be very interesting to see how Akka performs in a large-scale distributed application, because it is advertised as having great performance in those types of applications.


Acronyms

CLBG Computer Language Benchmarks Game. 3, 5, 7, 9, 10, 12, 13, 25, 26, 29, 31, 32, 33, 34, 35, 36, 37, 39, 40

FJP ForkJoinPool. 5, 11, 12, 21, 22, 23, 27, 39

INTMAX the maximum value of a 32-bit signed integer, equal to 2147483647. 17, 18, 20, 23

JVM Java Virtual Machine. 12, 13, 17, 29

OOP Object-oriented programming. 7, 10, 29

TPE ThreadPoolExecutor. 5, 11, 12, 21, 22, 23, 39



Bibliography

[1] Lightbend, “How the actor model meets the needs of modern, distributed systems.”https://doc.akka.io/docs/akka/current/typed/guide/actors-intro.html, mar 2020.

[2] H. Koen, J. De Villiers, G. Pavlin, A. De Waal, P. De Oude, and F. Mignet, “A frameworkfor inferring predictive distributions of rhino poaching events through causal modelling,” inFUSION 2014 - 17th International Conference on Information Fusion, Institute of Electricaland Electronics Engineers Inc., 2014.

[3] I. Gouy, “Details about computer language benchmarks game (clbg).”http://benchmarksgame.wildervanck.eu/how-programs-are-measured.html, 2020.

[4] I. Gouy, “Github repository for the computer language benchmarks game.”https://salsa.debian.org/benchmarksgame-team/benchmarksgame/-/tree/master, 2020.

[5] J.-l. Gailly, “Gzip documentation.” https://www.gnu.org/software/gzip/manual/gzip.html,2011.

[6] M. Baulig, “Libgtop documentation.” https://developer.gnome.org/libgtop/stable/, 2020.

[7] A. Hasija, “Performance benchmarking akka actors vs java threads.”https://blog.knoldus.com/performance-benchmarking-akka-actors-vs-java-threads/, Oct2019.

[8] P. Nordwall, “Blog: Yet another akka benchmark.”https://blog.jayway.com/2010/08/10/yet-another-akka-benchmark/, Aug 2010.

[9] S. Boyd-Wickizer, M. F. Kaashoek, R. Morris, and N. Zeldovich, “Non-scalable locks are dangerous,” MIT CSAIL, 2011.

[10] Lightbend, “Akka dispatchers documentation.” https://doc.akka.io/docs/akka/current/typed/dispatchers.html, Apr 2020.

[11] A. Tanenbaum and H. Bos, Modern Operating Systems. Pearson, 2015.

[12] “Oracle java 8 documentation: Threadpoolexecutor.”https://docs.oracle.com/javase/tutorial/essential/concurrency/runthread.html, 2020.

[13] J. Friesen, “Java 101: Understanding Java threads, part 3: Thread scheduling and wait/notify.” https://www.javaworld.com/article/2071214/java-101–understanding-java-threads–part-3–thread-scheduling-and-wait-notify.html, Jul 2002.

[14] “Oracle java 8 documentation: Threadpoolexecutor.”https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ThreadPoolExecutor.html,2020.

[15] Baeldung, “Introduction to thread pools in java.” https://www.baeldung.com/thread-pool-java-and-guava, April 2020.


[16] Baeldung, “Guide to the fork/join framework in java.” https://www.baeldung.com/java-fork-join, Feb 2020.

[17] C. Okasaki, “Purely functional data structures.” https://www.cs.cmu.edu/ rwh/theses/okasaki.pdf,September 1996.

[18] H. Letmanyi, “Tutorial on benchmark construction,” 1979.

[19] S. Nanz and C. A. Furia, “A comparative study of programming languages in rosetta code,”in 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, vol. 1,pp. 778–788, 2015.

[20] A. Sieber, “The rosetta code repository on github.” https://github.com/ad-si/rosettagit.

[21] N. Togashi and V. Klyuev, “Concurrency in go and java: Performance analysis,” in 20144th IEEE International Conference on Information Science and Technology, pp. 213–216,2014.

[22] T. Andersson and C. Brenden, “Parallelism in go and java: A comparison of performanceusing matrix multiplication,” 2018.

[23] M. Grimmer, M. Rigger, R. Schatz, L. Stadler, and H. Mössenböck, “TruffleC: Dynamic execution of C on a Java virtual machine,” in Proceedings of the 2014 International Conference on Principles and Practices of Programming on the Java Platform: Virtual Machines, Languages, and Tools, PPPJ ’14, (New York, NY, USA), p. 17–26, Association for Computing Machinery, 2014.

[24] J. D. Hunter, “Matplotlib: A 2d graphics environment,” Computing in Science & Engineer-ing, vol. 9, no. 3, pp. 90–95, 2007.

[25] I. Gouy, “Spectral norm input program implementation details.” https://benchmarksgame-team.pages.debian.net/benchmarksgame/description/spectralnorm.html#spectralnorm,2020.

[26] E. Weisstein, “Spectral norm. from mathworld–a wolfram web resource.”https://mathworld.wolfram.com/SpectralNorm.html, 2020.

[27] I. Gouy, “Pidigits input program implementation details.” https://benchmarksgame-team.pages.debian.net/benchmarksgame/description/pidigits.html#pidigits, 2020.

[28] I. Gouy, “Fannkuch-redux input program implementation details.”https://benchmarksgame-team.pages.debian.net/benchmarksgame/description/fannkuchredux.html#fannkuchredux,2020.

[29] I. Gouy, “Binary trees input program implementation details.” https://benchmarksgame-team.pages.debian.net/benchmarksgame/description/binarytrees.html#binarytrees, 2020.

[30] R. Bedre, “Reverse complement (dna) explained.” https://reneshbedre.github.io/blog/revcom.html,2020.

[31] Wageningen University Bioinformatics Group, “FASTA file format for DNA applications.” https://www.bioinformatics.nl/tools/crab_fasta.html, 2020.

[32] I. Gouy, “CLBG reverse complement input program implementation details.” https://benchmarksgame-team.pages.debian.net/benchmarksgame/description/revcomp.html#revcomp, 2020.

[33] “Oracle java 7 documentation: Bytearrayoutputstream.”https://docs.oracle.com/javase/7/docs/api/java/io/ByteArrayOutputStream.html, 2020.


