
An Application Framework for High Available Systems in Node.JS

Master of Science Thesis
Stockholm, Sweden 2011

TRITA-ICT-EX-2011:215

Sergio Avalos Contreras

Royal Institute of Technology
School of Information and Communication Technology

An Application Framework for High-Available Systems in Node.JS

Sergio Avalos Contreras
avalos(@)kth.se

September 14, 2011

Supervisor: Claudijo Borovic

Examiner: Prof. Johan Montelius

Abstract

“Node.JS”, an event-oriented framework for writing JavaScript programs on the server side, is emerging as a technology for creating efficient and scalable network applications with high performance and low memory consumption. Yet its characteristic of handling several, even thousands of, connections in a single process can become a vulnerability when creating highly available applications. Thus, research has been conducted to confirm whether this framework is capable of meeting such requirements despite the odds.

During the course of this investigation, a study of failures in Internet services was conducted, showing that the chosen technology is not the most common reason for service disruptions. In addition, a prototype based on Fault Model Enforcement and design patterns for fault-tolerant software was developed to monitor an Instant Messaging service (also written in JavaScript) at system and application level, and to provide redundancy by communicating with other nodes within a cluster whenever it crashes.

The results obtained through a series of fault-injection tests show the functionality of the newly created system and suggest that Node.JS meets the requirements needed to develop a highly available program. Further testing with regard to stability and CPU usage, together with the implementation of better monitoring tools, can improve the robustness of the system.

To Juan and Kenay, and to their small-town but invaluable saying

“Not a step back, not even to catch your breath”

Acknowledgements

My special thanks to...

CONACYT, National Council of Science and Technology,
for their financial aid during these two years while I studied for this master's program

Claudijo Borovic and Niclas Holm, Founders of Wussap,
for giving me the chance of being part of something authentic and original

Johan Montelius, Associate Professor at KTH,
for being patient and accessible

Vasileios Trigonakis, Researcher at EPFL,
for his observations, which made an incredibly valuable contribution to this document

My family and relatives,
for providing a clear perspective on my goals and showing me the way to achieve them

And finally to Tatiana,
because without her encouragement and support, I wouldn’t have completed this project

Contents

1 Introduction

2 Project Description
  2.1 Description of the Company
  2.2 Initial Conditions
  2.3 Requirements
    2.3.1 Functional Requirements
    2.3.2 Non-functional Requirements
  2.4 Fault Tolerance Analysis

3 Theoretical Background
  3.1 Fault-Tolerance
    3.1.1 Basic Concepts
    3.1.2 Fault Classifications
    3.1.3 General Fault Tolerance Procedure
    3.1.4 Redundancy
    3.1.5 Dependability
  3.2 Analysis of Fault Tolerance Systems in Internet Services
  3.3 Fault Model Enforcement
  3.4 Patterns for Fault Tolerant Systems

4 Methodology
  4.1 Node.JS
  4.2 Event-driven programming
  4.3 Node and the ecosystem
  4.4 Database Research
  4.5 JavaScript
    4.5.1 Must-to-know features
    4.5.2 The bad parts
  4.6 Assumptions and Limitations

5 Implementation
  5.1 Wussap Prototype
    5.1.1 System Design
    5.1.2 Publish-Subscriber Model
  5.2 Fault-Tolerant Framework
    5.2.1 System Design
    5.2.2 Description of Components

6 Testings
  6.1 Prototype Functionality Tests
  6.2 Framework Functionality Test
  6.3 Stress Testing

7 Conclusion

8 Discussion and Limitations

References

1 Introduction

Internet services are expected to operate all the time (twenty-four hours a day, seven days a week) with no downtime, to behave normally regardless of the number of users, and to respond as fast as possible so that the interaction feels smooth. This cannot always be achieved, but these are always the main targets when building such services, so that they seem ubiquitous from wherever they are accessed.

Building this kind of system presents a set of challenges, and meeting them depends heavily on the chosen technology. One technology that has lately attracted a lot of attention is Node.JS, for being fast and for its scalability and efficiency. Indeed, a recent study revealed that it was capable of handling 10 thousand simultaneous connections [24], a limit where many other technologies have failed.

Nevertheless, the stability of Node.JS is still questionable due to its immaturity and because it uses a dynamic programming language, JavaScript, running in one single process. Even Joe Armstrong, creator of Erlang, stated that Node.JS was not designed for building fault-tolerant systems [13].

For this reason, this document presents research into the methodologies needed to create a fault-tolerant system focusing on High Availability, an important feature that any Internet service requires and that Node.JS alone cannot provide. The document starts by describing the project, which was carried out at Wussap, a start-up IT company from Stockholm, followed by the preliminary investigations. These comprise a summary of the most common failures found in big Internet companies and a study of design patterns for fault-tolerant software and of Fault Model Enforcement. Then a prototype, an instant messaging browser plugin, is proposed along with a framework that provides the capabilities to operate indefinitely. To verify our findings, a series of tests was performed to check the system behavior against the expected results.


2 Project Description

2.1 Description of the Company

Wussap is a start-up company (part of the company incubator STING, located at KTH) that was named one of the most innovative IT companies of 2010 [17], and surely it is: as they promote it, Wussap takes social co-browsing to the next level by allowing users to visit a website together and chat with any visitor of any particular site. All you need is to install a plugin in your browser and, as soon as it is on, you can continue surfing a website as you did before, with the difference that you will see other visitors and can even chat with them.

In addition, you can create a shared surfing session (called a “Surftrain”) in which the browsers of the invited users are redirected to any website the creator visits, and in which all events that occur along the way, such as opening a picture or scrolling up and down, are shown as well, so that all participants see the same screen.

2.2 Initial Conditions

The software that Wussap offers is an Instant Messaging application very similar to a group chat, except for a few differences. First, the activity takes place on a website, and second, there are two roles: the leader, who decides where everyone's browser goes, and the viewers, who watch and chat with other participants. These sessions will be referred to from now on as Surftrains.

Other functionalities (among several) that this program has are:

• Finding people and chatting with any other visitor of the website the user is on,

• Creating a Surftrain that will start later (i.e. scheduling),

• Subscribing and being notified about other scheduled Surftrains,

• Reviewing the most popular websites visited with the tool,

• Reviewing the most popular past Surftrains, etc.

Currently the system is developed as a web browser plugin (so far only for Internet Explorer and Firefox), and the user is asked to install it in order to use it. Part of the future plan is to remove this dependency on installing it in a particular browser and to attach the service directly to the website.

Part of this project is to develop a much simpler version using Node, WebSockets and other related web technologies. The main objective was to develop some of the functionality that the current version has, followed by a series of tests that analyze the advantages and disadvantages of using this framework.


2.3 Requirements

2.3.1 Functional Requirements

• Wussap allows users to interact via the exchange of text messages when two or more persons are visiting the same website at the same time.

• Wussap allows users to interact via the exchange of text messages when two or more persons are participating in a Surftrain.

• Wussap allows only registered users to interact.

• Wussap persists user information by saving a username and password.

• Wussap notifies other visitors when a new user is on a website.

• Wussap notifies other visitors when a user is no longer visiting a website.

• Wussap allows a user to create a Surftrain.

• Wussap notifies when a new user has joined a Surftrain.

• Wussap notifies when a user is no longer participating in a Surftrain.

• Wussap allows only the creator to change the website of the Surftrain.

• Wussap allows only the creator to conclude the Surftrain.

• Wussap notifies the current location of a Surftrain exclusively to its followers.

• Wussap notifies exclusively its participants when a Surftrain changes website.

• Wussap notifies exclusively its participants when a Surftrain has concluded.

• Wussap keeps track of the created Surftrains, including starting and ending times.

• Wussap provides a list of current Surftrains.
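Several of these requirements (creator-only control, notifications to participants) can be illustrated with Node's built-in `events` module. The sketch below is only an illustration under stated assumptions, not the Wussap implementation: the class name `Surftrain`, its fields and the event names `joined`, `moved` and `concluded` are hypothetical.

```javascript
const { EventEmitter } = require('events');

// Hypothetical model of a Surftrain: one leader (the creator) and a set of followers.
class Surftrain extends EventEmitter {
  constructor(leader, url) {
    super();
    this.leader = leader;
    this.url = url;
    this.followers = new Set();
    this.concluded = false;
  }

  // Any user may join; listeners are notified of the new participant.
  join(user) {
    this.followers.add(user);
    this.emit('joined', user);
  }

  // Only the creator may move the Surftrain to another website.
  changeWebsite(user, url) {
    if (user !== this.leader) {
      throw new Error('only the creator may change the website');
    }
    this.url = url;
    this.emit('moved', url);
  }

  // Only the creator may conclude the Surftrain.
  conclude(user) {
    if (user !== this.leader) {
      throw new Error('only the creator may conclude the Surftrain');
    }
    this.concluded = true;
    this.emit('concluded');
  }
}
```

In a real service, the listeners registered with `on('moved', ...)` would forward the event to the connected participants only, which is how the "exclusively to its participants" requirements could be met.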

2.3.2 Non-functional Requirements

• Wussap is written in Node.js.

• Wussap uses Socket.IO for communication.

• Wussap interface is developed through a set of tests in QUnit.

• Wussap client program is loaded using RequireJS.

• Wussap uses a NoSQL Database system.

• Wussap is used in any Web browser that supports WebSockets.


2.4 Fault Tolerance Analysis

In order to build a fault-tolerant application, it is essential to know the possible threats that the software could suffer from. The key question that any developer should continuously ask throughout development is “What could go wrong?”, and then think about how to be protected from it. The following list describes these threats:

Failure: Incorrect information
Possible faults:
• Invalid User information (see below)
• Invalid Surftrain information (see below)
• Invalid Web Site information (see below)
• Invalid Message information in Surftrain/Website (see below)
• Wussap contains duplicated Surftrains (having the same ID)
• Wussap contains duplicated Websites (having the same URL)
• Surftrain/Website contains duplicated Messages

Failure: Untimely information
Possible faults:
• Wussap displays different current/ongoing Surftrains
• Wussap fails to save/fetch information
• Wussap receives more requests than it can handle
• Response time is too long
• Wussap ignores requests

Failure: Server unavailability
Possible faults:
• Wussap fails to respond (i.e. the process has crashed)
• Wussap’s server has crashed
• Wussap fails to connect with the database
• Wussap runs out of memory
• Wussap closes connection sockets to current clients
• Wussap is being upgraded

Invalid User Information:

• User does not have a unique username

• User does not have a password

Invalid Surftrain Information:

• Surftrain does not contain title

• Surftrain does not contain a URL, or the URL is not formatted according to RFC 1738

• Surftrain contains a URL/station with no registered time

• Surftrain contains a URL/station with a registered time later than the arrival date


• Surftrain does not contain a registered user as a leader

• Surftrain registered an arrival time later than the current time

• Surftrain contains followers that are not registered in the system

Invalid Web Site information:

• Website contains an invalid URL format according to the format RFC 1738

• Website contains users that are not registered in the system

Invalid Message Information:

• Message does not contain content, sender or registered time

• Message contains as sender a non-registered user

• Message contains a registered time later than the current time
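The message rules above can be turned into a small validation routine. The following is a minimal sketch, assuming a hypothetical message shape `{ content, sender, registeredAt }` with the time as a numeric timestamp and a set of registered usernames; these names are illustrative and are not taken from the Wussap code.

```javascript
// Returns the list of faults found in a message; an empty list means the message is valid.
// Assumed (hypothetical) message shape: { content, sender, registeredAt }.
function validateMessage(message, registeredUsers, now) {
  const faults = [];
  if (!message.content) {
    faults.push('missing content');
  }
  if (!message.sender) {
    faults.push('missing sender');
  } else if (!registeredUsers.has(message.sender)) {
    faults.push('sender is not registered');
  }
  if (typeof message.registeredAt !== 'number') {
    faults.push('missing registered time');
  } else if (message.registeredAt > now) {
    faults.push('registered time is in the future');
  }
  return faults;
}
```

The same approach could be applied to the User, Surftrain and Web Site rules, with one check per bullet.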


3 Theoretical Background

3.1 Fault-Tolerance

3.1.1 Basic Concepts

The following terms are listed in [15]:

Failure – a system behavior that deviates from the specified behavior. For example, a server that crashes when it should never crash, or that produces a miscalculation.

Error – the incorrect system behavior from which a failure may occur, either by value or timing. Errors are important because, if detected in time, they can be prevented from turning into failures.

Fault – the defect that is present in the system and can cause an error (colloquially called a “bug” in software). It is normally present due to an incorrect requirement specification, an incorrect design or a coding error.

Figure 1: Relation between fault, error and failure

Although related, a fault does not always turn into an error. For instance, a line of code could be erroneously written but never executed, or a defective piece of hardware in a complex system might never be used by other components.

In the same way, an error does not necessarily turn into a failure every time. The typical example is when a server crashes and is replaced by a backup. Also, in software an exception can be caught by an exception handler and hidden: although the error was present, it was imperceptible to the end user.
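The exception-handler case can be shown in a few lines of JavaScript. This is a minimal sketch; the function name `parsePayload` and the fallback value are hypothetical.

```javascript
// A malformed payload makes JSON.parse throw (the error). Because the exception is
// caught and replaced by a safe default, no failure is visible to the caller.
function parsePayload(raw) {
  try {
    return JSON.parse(raw);
  } catch (err) {
    return { type: 'ignored' };
  }
}
```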

3.1.2 Fault Classifications

By duration

Permanent – a fault that will remain unless it is removed by some external agency. From an engineering point of view, these are easy to diagnose.

Transient – a present fault that will eventually disappear without any apparent intervention or cause. Also considered unpredictable.

By cause

Design faults – due to an incorrect requirement specification or bad design while coding. In practice, even with a carefully designed system, there is the assumption that errors might appear, thus some mechanisms are put in place to protect the system.

Operational faults – faults that occur during the lifetime of the system.

By failed component behavior

Crash faults – the component stops operating completely.
Omission faults – the component refuses to perform its service.
Timing faults – the component does not complete its service on time.
Byzantine faults – the component fails to perform its service in an arbitrary way.


3.1.3 General Fault Tolerance Procedure

There are different ways in which a system can deal with a fault:

Fault prevention – The use of good engineering methods and industry best practices helps to prevent the presence of any potential fault in the system.

Fault removal – Occurs when the system is verified to provide the right result according to the requirements. If it does not, the fault is diagnosed and corrected. This discovery is done statically during analysis or dynamically while the system is executing.

Fault tolerance – As mentioned above, in some cases the presence of a fault does not imply a bad execution, as long as it is found within certain time limits.

In addition, fault treatment procedures can be grouped into four different activities [15]:

Error detection – Identify the root of the failure (i.e. the fault).
Damage confinement – Isolate the failed component to keep it from propagating the error.
Error recovery – Restore the system to a valid state.
Fault treatment – Analyze and verify the fault that caused the error.

The procedure is carried out in this order because the first priority is to restore the system to the state it was in prior to the failure. Although it may seem a reversed method, in practice diagnosing the failure is a lengthy and complex process (especially considering that the error can have multiple root causes) and is therefore left to the end, once the safety of the system has been guaranteed [16].
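The ordering of the four activities can be sketched as follows. This is only an illustration of the sequence, assuming a hypothetical component object with `healthy`, `isolated` and `state` fields; none of these names come from [15] or [16].

```javascript
// Applies the four activities in order and returns the names of the steps performed.
// Assumed (hypothetical) component shape: { healthy, isolated, state }.
function handleError(component, lastValidState) {
  const steps = [];
  if (component.healthy) {
    return steps;                        // nothing detected, nothing to do
  }
  steps.push('error detection');
  component.isolated = true;             // damage confinement: stop the error spreading
  steps.push('damage confinement');
  component.state = lastValidState;      // error recovery: go back to a valid state
  component.healthy = true;
  component.isolated = false;
  steps.push('error recovery');
  steps.push('fault treatment');         // diagnosis comes last, once service is restored
  return steps;
}
```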

3.1.4 Redundancy

Most errors in a system are treated by redundancy, where the failed component is replaced by a non-failed copy to mask the failure from the end user. How quickly the copy can take over is divided into the following three categories [15]:

Cold standby – Includes a non-operational component that remains inactive until it is needed. Although it is a cheap method, it introduces a delay to start up the system, called recovery time. In the case of a large database that has to be created from scratch, this can take a very long time and therefore be very expensive.

Warm standby – Here checkpoints are created at certain intervals, at which the active data is saved. Then, if the main active component crashes, the copy performs a backward recovery and starts from the last checkpoint. Although it is more effective than the previous category because it shortens the recovery process, it adds some overhead to the system when taking these checkpoints.

Hot standby – The replica is fully active, duplicating all information obtained by the primary one. This makes the recovery time minimal, close to instantaneous. There are no checkpoints because the backup process is continuously working, and the overhead added to the system is evidently higher than in either of the previous methods.

Recalling from the introduction of this section, choosing the method depends heavily on the level of dependability that the system requires and, most importantly, on how much the client is willing to pay. For instance, a banking system with thousands of transactions per day cannot afford any failures because this “down time” translates into large amounts of money lost. For our purposes, where an Instant Messaging application is used, a warm standby copy is more than enough because the data guarantee is not high (i.e. it is not critical if some messages are not received), but it is important for the system to be restarted rapidly.
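The warm standby idea can be sketched with an in-memory checkpoint store. This is a simplified illustration under stated assumptions, not the framework described later: the names `CheckpointStore` and `ChatNode` are hypothetical, and a real deployment would keep the checkpoint outside the crashing process (for example in a database).

```javascript
// In-memory stand-in for a checkpoint store shared by the primary and the standby.
class CheckpointStore {
  constructor() {
    this.snapshot = null;
  }
  save(state) {
    this.snapshot = JSON.stringify(state);   // serialize to get an independent copy
  }
  load() {
    return this.snapshot ? JSON.parse(this.snapshot) : { messages: [] };
  }
}

class ChatNode {
  constructor(store, checkpointEvery) {
    this.store = store;
    this.checkpointEvery = checkpointEvery;
    this.state = store.load();               // backward recovery: resume from last checkpoint
  }
  receive(message) {
    this.state.messages.push(message);
    if (this.state.messages.length % this.checkpointEvery === 0) {
      this.store.save(this.state);           // the checkpoint overhead is paid here
    }
  }
}

const store = new CheckpointStore();
const primary = new ChatNode(store, 2);
['a', 'b', 'c'].forEach((m) => primary.receive(m));

// Suppose the primary crashes here; the standby starts from the last checkpoint.
const standby = new ChatNode(store, 2);
console.log(standby.state.messages); // [ 'a', 'b' ]
```

The message `'c'` arrived after the last checkpoint and is lost, which is the trade-off accepted above for an Instant Messaging application.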


3.1.5 Dependability

Dependability in a system is defined as the characteristic of performing the service for which it has been designed. It can be decomposed into four aspects [15]:

Reliability – The probability for a system to work correctly.

Availability – The probability for a system to be up and running at any point in time.

Safety – The ability to avoid catastrophic failures that involve human life or excessive costs.

Security – The ability of a system to prevent unauthorized access.

Reliability is commonly confused with availability and, although they are related, they are two different concepts.

Figure 2: Example of a highly reliable system

Reliability is a measure of the continuous delivery of service in the absence of failure, defined as the “Mean Time Between Failures” (MTBF). For example, for a space shuttle it is extremely important to be completely failure-free from the time it ignites until it reaches its destination. This is not just very expensive but also very challenging, and such highly reliable systems are normally found in life-critical fields such as avionics, military and aerospace, where a failure can turn into a threat to human life.

Figure 3: Example of a highly available system

On the other hand, availability is defined as the probability of a service being up and running at any given instant. It allows system failures under the presumption that the recovery time will be minimal. An example can be seen in a website: the end user does not care whether there have been failures in the site being visited. All that matters is that the site is available whenever the end user wants to browse it.

Evidently, availability is a characteristic strongly dependent on the time it takes for a system to be restored in the presence of a failure (“Mean Time To Restore”, or MTTR):

\[ \%\text{Availability} = \frac{\text{MTBF}}{\text{MTTR} + \text{MTBF}} \]

For our purposes, high availability is the characteristic being studied, and this formula will be addressed in the following sections.
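As a worked illustration of the formula, the function below computes availability from MTBF and MTTR. The numbers are illustrative assumptions, not measurements from this project.

```javascript
// Availability as defined above: MTBF / (MTTR + MTBF).
// Both arguments must use the same time unit (hours here).
function availability(mtbf, mttr) {
  return mtbf / (mttr + mtbf);
}

console.log(availability(99, 1));   // 0.99
console.log(availability(99, 0.1)); // higher, because recovery is faster
```

With the same MTBF, shortening the recovery time from one hour to six minutes raises availability, which is why a short MTTR matters as much as a long MTBF.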


3.2 Analysis of Fault Tolerance Systems in Internet Services

Nowadays, Internet services are expected to run 24/7 and, as a matter of fact, it is taken for granted that these services will be available every time we access them; we do not consider the local time or day to decide whether a service is up and running, we just type the URL and expect to see the website. Thus, considering that this field has been studied for a while, it is important to look at contributions made by others, especially by the big companies.

In [23] there is a study of the common faults found in two big Internet companies (CNET.com and eWeek.com, to be precise). The purpose of that investigation was to find in their reports any information about the causes of failures on their websites. These are the categories of root causes listed:

Software failures: Mainly due to system complexity, inadequate testing and/or poor understanding of system dependencies.

Operator error: Classified into configuration errors, procedural errors and miscellaneous accidents.

Hardware and environmental failures: Due to several reasons such as wear and tear of mechanical parts, loose wiring, etc.

Security violations: Common security violations such as password disclosures, denial of service attacks, worms and viruses, authentication failures, etc.

These failures are presented to the end users as: partial or entire site unavailability, e.g. a 404 File Not Found error; system exceptions and access violations, where an executing process terminates abruptly when a system exception is thrown; incorrect results, where an executing process does not terminate but returns erroneous results; data loss or corruption, where users are unable to access data from a previously functioning computer system; and performance slowdowns.

In Figure 4 it can be observed that most of the failures are due to human errors and application software failures, while hardware errors account for a smaller portion.

Figure 4: Causes of failures (Source: [23])

In addition to this article, [22] supports these results by showing that, in the majority of cases, the failures are caused by humans. In this case, the authors studied three types of Internet services: an online service/Internet portal (Online), a bleeding-edge global content hosting service (Content), and a mature read-mostly Internet service (ReadMostly). For all of them, the architecture is composed of a load balancer, a stateless front-end and a back-end that persists the data (see Figure 5).


Figure 5: Architecture of an Internet Service

The failures were studied individually, with special attention drawn to their cause and location. The former was categorized as node hardware, network hardware, node software or network software, while the latter was categorized as front-end node, back-end node, network or unknown.

Figures 6 and 7 show the registered failures for the Content and Online sites, respectively. As can be observed, not every failure that occurs in the system is visible to the end users; some can be masked. Therefore, a distinction is made between component failures and service failures, the latter being those that are noticeable to the public. In these figures it can be observed that in both scenarios, Online and Content, the errors produced by operators are the hardest to mask. Meanwhile, Figure 7 shows that failures in node hardware are quite numerous, but only a very small portion turn into service failures. This shows that mitigation procedures like redundancy (allowing a backup component to run while the main one is down) work very well for hardware failures.

Figure 6: Number of component failures and resulting service failures for Content (Source: [22])

Finally, Table 1 lists the time to repair for component failures, broken down by component and type of cause, where component refers to node or network and cause to operator error, hardware, software, unknown, or environment. Once again, operator errors account for the major portion of the time in three of the presented scenarios.

Figure 7: Number of component failures and resulting service failures for Online (Source: [22])

From these studies it can be concluded that operator error is the leading cause of failures. This also suggests that neither the platform nor the technology is entirely the source of the unavailability of the system. Thus, to enhance this characteristic, as proposed in [22], the designer should focus on creating better tools for performing online testing, monitoring component failures and sanity checking of configuration files. As cited from [22], “Today, this coordination is handled almost entirely manually, via telephone calls to contacts at the various points”. Evidently, the efforts for creating a better system should be focused on automating these tasks rather than on the system itself.

             Operator  Operator  H/W      H/W      S/W      S/W      Unknown  Unknown
             Node      Net       Node     Net      Node     Net      Node     Net
Online       8.3 (16)  29 (1)    2.5 (5)  0.5 (1)  4.0 (9)  0.8 (1)  2.0 (1)  N/A (0)
Content      1.2 (8)   N/A (0)   N/A (0)  N/A (0)  0.2 (4)  N/A (0)  N/A (0)  1.2 (2)
ReadMostly   0.2 (1)   0.1 (3)   N/A (0)  6.0 (2)  N/A (0)  1.0 (4)  N/A (0)  0.1 (6)

Table 1: Average Time To Repair (TTR) for failures by component and type of cause, in hours. (Source: [22])

3.3 Fault Model Enforcement

The article [21] presents a contemporary model for creating fault-tolerant systems. The authors argue that creating a highly reliable system, where every single fault is prevented, is very hard for an Internet service and in some cases even impossible due to the complexity of the system. It is easy to see why this statement is true. A Web application is composed of many components, which are in turn composed of many subcomponents. To give an example, when considering the transmission of a packet using TCP, there are many reasons why it can fail: poor wiring, a problem in the network interface, a delay in the transmission, etc. Trying to figure out the cause of each problem and to prevent it from happening is not just very time consuming but also exhausting if every detail has to be covered.

For this reason, the authors propose a new methodology for creating a fault-tolerant system, called Fault Model Enforcement. They mention two strategies: first, the failure of any subcomponent produces the failure of the whole component (i.e. the symptoms); second, after a given symptom is observed, the expected fault behavior is forced to happen (hence the word enforcement). In other words, each fault is mapped to a failure, so that the system can be designed in terms of recovery actions. This strategy is applicable to almost every component in the model and makes planning the architecture much easier.

This model can be applied to our purposes because, when creating a highly available system, as mentioned in the previous sections, there are two variables that can be played with: either increase the mean time between failures (MTBF) or reduce the mean time to repair (MTTR). Evidently, the choice made here is the latter, reducing the recovery time.

\[ \%\text{Availability} = \frac{\text{MTBF}}{\text{MTTR} + \text{MTBF}} \]

To illustrate with an example, the following list shows how faults are paired with their mitigation procedures. It is taken from the tests performed in [21], whose objective was to improve the availability of a Web application called PRESS:

• Link down: Reboots the node that was cut off from the main cluster.

• Switch down: Reboots all nodes.

• SCSI timeout: Reboots node with faulty disk.

• Node crash: Nothing. This fault was included in the abstract model.

• Node freeze: Reboots the faulty node.

• . . . etc.
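The pairing of symptoms and recovery actions listed above can be sketched as a lookup table. This is a minimal illustration, not the mechanism used in [21]: the function `enforce`, the table name and the command strings are hypothetical, and rejecting unlisted symptoms with an exception is an assumption of this sketch.

```javascript
// Symptom-to-recovery table based on the list above. The command strings are
// hypothetical labels, not commands taken from [21].
const recoveryTable = new Map([
  ['link down', (node) => [`reboot ${node}`]],
  ['switch down', (node, cluster) => cluster.map((n) => `reboot ${n}`)],
  ['scsi timeout', (node) => [`reboot ${node}`]],
  ['node crash', () => []], // nothing to do: already part of the abstract model
  ['node freeze', (node) => [`reboot ${node}`]],
]);

// Returns the recovery commands for a symptom observed on `node`.
function enforce(symptom, node, cluster) {
  const action = recoveryTable.get(symptom);
  if (!action) {
    throw new Error(`no recovery action defined for "${symptom}"`);
  }
  return action(node, cluster);
}
```

Because every known symptom leads to one of very few actions, the recovery code stays small and easy to test.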

Although this model seems very simplistic, the reported results look quite positive: the performance of the system improved by over 50% compared with the normal run (without Fault Model Enforcement) in all of the tests carried out in [21]. The system also became more robust and stable against transient errors.

Additionally, simplicity in the design of the software is a big advantage not just for the implementation but also for testing, especially considering that, as shown before, complexity is one of the leading causes of visible failures. Finally, the availability requirements for this system are not so strict: the presence of failures is accepted as long as the availability as a whole remains high (by reducing recovery time).

3.4 Patterns for Fault Tolerant Systems

The design of this project is based on the design patterns for fault-tolerant software found in [16]. This decision was made for convenience. Firstly, patterns solve the problem in small pieces rather than trying to solve everything at once, where a single solution could hardly fit. Secondly, it is always advisable to follow industry practices, simply because they have been tested numerous times.

As mentioned by Robert Hanmer in [16], “Software patterns are an effective way to capture proven design information and to communicate this information to the reader”. What he refers to here is that the problems we normally face are hardly unique, nor do they appear for the first time with us. Across different circumstances and scenarios, the essence of the problem remains intact, and what is left for the designer is to apply the solution in the particular context given.

In other words, what is being presented is nothing but the techniques that engineers normally use when building a system with the characteristics discussed here. Each of them is applied under different circumstances and needs, depending on what the objectives are. The decision is made according to the stage of error handling (detection, recovery, mitigation and fault treatment) and the particular requirements of the project. Using this method is convenient in terms of the time saved by not reinventing the wheel, and it also gives the assurance that the best practices established by others are being applied here as well.

For example, a typical question that an engineer deals with is: does the system need to be running as much as possible, or does it need to meet a certain failure rate, where only 1 out of 100 000 transactions may fail? The answer determines whether the system is built to recover very fast or to be robust enough to prevent failures from happening at all. Another question is: is the server stateless or stateful? That is, is information (the values of the variables used during the execution of the program) kept in the server while it is running, or is it discarded right after a request has been dispatched, as normally occurs in a web server? The answer determines whether checkpoints are needed or not.

Another example is found in Figure 8, which shows different ways in which an error can be isolated and prevented from spreading: it may be stopped even before it enters the system (Complete Parameter Checking), it may be detected at system level (System Monitor), its existence may be decided by checking with other servers (Voting), it may be temporary and simply ignored (Riding Over Transients), it may be checked during execution as a background task (Routine Audits), etc.

Figure 8: Design patterns for error detection (Source: [16])

This is only a very brief summary; more information can be found in [16]. The patterns chosen to fulfil our demands are presented in the Implementation section. As in any research project, part of the task was to review each of them in detail, choose the ones most appropriate to a specific problem according to the fault tolerance phase, and combine them so that they work at their best.


4 Methodology

Working with Node and JavaScript definitely deserves special attention. For most programmers, it is a technology that brings new paradigms to the way people code, and it is sometimes not well understood, as noted by Douglas Crockford, a software engineer well known for his contributions to this programming language. For that reason, this section takes a close look at the tools used for the development of this project, the challenges they brought and how these were overcome to get the best out of them.

4.1 Node.JS

Node.JS is an event-based, non-blocking I/O framework for creating scalable network programs that has caught the attention of many developers and companies for its high performance and efficiency at handling thousands of concurrent connections [1]. It is influenced by other systems like Ruby's EventMachine or Python's Twisted, is interpreted by Google's V8 JavaScript engine and runs on the server side.

In contrast to other technologies like Apache that scale by spawning threads, Node handles every request as an event within one single process. Figure 9 shows how nginx, another event-based technology, outperforms Apache, a thread-based server, as the number of connections increases. While the former stabilizes after 1500 connections, the latter drops considerably.

Figure 9: Benchmark test between Apache vs Nginx. (Source: [19])

Moreover, the difference in memory consumption between these two approaches is even more impressive. Figure 10 clearly shows that the number of concurrent connections does not affect Nginx at all; it always uses the same amount of memory (2.5 MB).

Figure 10: Memory consumption test between Apache vs Nginx. (Source: [19])

Below is an example of a lightweight HTTP server written in Node for serving files from disk.


    var http = require('http'),
        url = require('url'),
        path = require('path'),
        fs = require('fs');

    http.createServer(function(request, response) {
        // Map the requested URL path onto the current working directory.
        var uri = url.parse(request.url).pathname;
        var filename = path.join(process.cwd(), uri);

        // Non-blocking read: the callback runs once the file has been loaded.
        fs.readFile(filename, function(err, data) {
            if (err) {
                // A missing or unreadable file is answered with 404.
                response.writeHead(404);
                response.end();
                return;
            }
            response.writeHead(200);
            response.end(data);
        });
    }).listen(8080);

Two important aspects for which Node has attracted a lot of attention can be observed in this example. Firstly, the application does not block on any I/O operation, such as opening a TCP socket or reading a file; and secondly, thanks to the syntax of JavaScript, a simple yet high-performance application is quite easy to understand.

4.2 Event-driven programming

Concurrent programming is a topic that has been studied for a long time, especially nowadays when almost every computer has more than one processor. Commonly, multi-threading is the paradigm used for this type of task. Nevertheless, as mentioned in [1], for many developers multi-threading is “anything but easy”; there are many issues, like liveness or deadlock, that have to be dealt with.

Instead, event-driven programming offers a more efficient alternative that allows much more control over switching between application activities. A possible drawback is that asynchronous callbacks depend strongly on the context, i.e. the values of the variables, in which they are executed. For a novice developer, this concept takes time to learn and, if it is not treated carefully, the code can easily turn into unmaintainable “spaghetti code” that is hard to understand. See the piece of code below:

    async1(function(result1) {
        async2(function(result2) {
            async3(function(result3) {
                // do something with results
            });
        });
    });


Nevertheless, frameworks that handle asynchronous control flow, like “Step”, can help to improve the readability of a program. For example, the code below shows how asynchronous calls can be arranged in a more understandable way.

    var Step = require('step');

    // db and user_id are assumed to be defined elsewhere in the application.
    Step(
        function loadUser() {
            db.getUser(user_id, this);
        },
        function findItems(err, user) {
            if (err) throw err;
            var sql = "SELECT * FROM store WHERE type=?";
            db.query(sql, user.favoriteType, this);
        },
        function done(err, items) {
            if (err) throw err;
            // Do something with items
        }
    );

“Step” is a function that receives as parameters the functions that define the control flow of the program and makes sure that they are executed one after another. Other features, such as executing tasks in parallel and grouping, follow the same syntactic sugar and help to make asynchronous code easy to read and understand. Like Node, this is an open project; it can be found at https://github.com/creationix/step.

4.3 Node and the ecosystem

It would be unfair to speak about Node without mentioning the growing community of developers supporting it. Many libraries have been created around this framework to interact with other services, for example node-mysql for relational databases, or to support web development, like Express. According to [12] there are 1600 modules and the list keeps on growing. The most popular can be found on the Wiki site of Node at http://github.com/joyent/node/wiki/modules and they can be easily installed using the Node Package Manager, npm. As a matter of fact, some of these modules were used for the development of the prototype and framework presented in the following sections.

Module's name   Creator          Description
Forever         Charlie Robbins  A simple CLI tool for ensuring that a given script runs continuously (i.e. forever)
Cradle          Alexis Sellier   A high-level CouchDB client for Node.js
Socket.IO       LearnBoost       Node.JS project that makes WebSockets and real-time possible in all browsers
Step            Tim Caswell      An async control-flow library that makes stepping through logic easy
Express         visionmedia      Sinatra inspired web development framework for node.js, insanely fast and flexible
Nodeunit        Caolan McMahon   Easy unit testing in node.js and the browser, based on the assert module
js-yaml         visionmedia      CommonJS YAML Parser, a fast, elegant and tiny yaml parser for javascript


Just like Node, these modules can be found in the social code repository GitHub (http://www.github.com) or on the website of npm (http://www.npmjs.org), where so far 3000 modules are registered and the number keeps on growing.

4.4 Database Research

To persist the data generated by the prototype application, different “NoSQL” databases were reviewed with horizontal scalability as the main requirement. The idea of using a non-relational database was encouraged by the company, in order to look for alternatives to MySQL, which is the one currently used.

After some research on the Internet, the options were narrowed down to MongoDB, Riak, Cassandra and CouchDB because of their integration with JavaScript, among other features.

MongoDB was discarded for not being “truly” scalable. It uses a master-slave architecture [2], where the data from the “master” is replicated to different “slaves”. Thus, the storage capacity cannot be increased by adding more machines. Moreover, having different machines with the same data might be suitable for read-intensive applications but not for write-intensive ones (such as logging), which is our case.

Riak seemed like an adequate solution since it is partly written in JavaScript and it is fault tolerant; it can be replicated in master-less mode and sharding is done automatically. However, those features come with the “enterprise” version and are not free as in other similar services [4].

Cassandra was found very useful because of the following features. Horizontally scalable: read and write throughput both increase linearly as new machines are added, with no downtime or interruption to applications. Decentralized: every node in the cluster is identical. Fault tolerant: data is automatically replicated to the available nodes, and failed nodes can be replaced without any interruption to the running application. Although it seemed like the best fit for the needs of the prototype, the number of client libraries written for Node is very limited. So far there is just a Thrift protocol implementation in Node.JS (called node-thrift) and, even worse, it is not maintained regularly. Consequently, it was decided to look for another database with more popular and better supported client libraries.

Finally, CouchDB was reviewed. It is a document-based database system with HTTP/REST as its protocol. Being written in Erlang, it is highly distributed and consistent. Moreover, CouchDB is a fault tolerant database system; when failures occur, they happen in a controlled environment, which ensures its availability. It also uses view functions to do computation on documents, which are used for reading and querying data.

In the end, CouchDB was found suitable for our Instant Messaging prototype in terms of simplicity and also because there are many client libraries written for Node.JS. Cradle, the client library designed by Alexis Sellier, is used. It was chosen mainly because of its high rating and the constant maintenance it receives in its public repository on GitHub.

4.5 JavaScript

The JavaScript programming language is referred to by Douglas Crockford as “the most misunderstood language” in his book [8] because of the bad reputation it has among the developer community. It is not hard to see why, since JavaScript is a language of controversies: it has many powerful features along with many weaknesses in its design; it is class-free, but functions can act as constructors; it does not have classical inheritance, but it does have prototypal inheritance.

Yet it has succeeded where Java has failed and has become one of the most popular languages [5], having at least one interpreter in every browser. The reason is that, as previously said, its richness can be used in very convenient ways. Knowing these features brings many benefits during the development of any project.


4.5.1 Must-know features

Giving an introduction or tutorial on this language is out of the scope of this document. Instead, a list of the most important topics that any developer must master to be proficient in JavaScript is shown:

• Deep knowledge of closures

• Deep knowledge of prototypal inheritance

• Knowledge of callback functions and of apply and call

• Clear understanding of how the “this” variable works

• Understanding of timers and asynchronous execution

• Understanding of the Object type and the use of instanceof and typeof

• Understanding of the JSON notation

• Being up to date with the changes and improvements in ECMAScript 5

Being familiar with these will help a lot in understanding how this programming language works and in preventing confusion when trying to use it in a way it is not supposed to be used. For example, it is very easy to be confused by the variable “this” because it is also present in other languages such as Java, but the way it works in each of them is very different. In JavaScript it can be used in different contexts, as opposed to Java, where it can only appear within an object.

4.5.2 The bad parts

In this section, a brief description of some of the design flaws of JavaScript is given, in particular those that affected the project. The intention is not to diminish the language but to emphasize that there are workarounds despite these flaws.

When working with JavaScript, it is very important to be aware of these flaws and especially of how to avoid them. Many of them surfaced during the development of this project. They never became an obstacle, but they did change the way we normally program.

Classical Inheritance

A feature that some novice developers miss is classical inheritance, which is commonly found in many modern programming languages. Nowadays many, but not all, programs are designed under the object-oriented paradigm, and this project was no exception. As will be shown in the following section, Implementation, both the Instant Messaging prototype and the framework were planned according to this methodology before we were well informed about this matter.

Fortunately, contrary to the initial concerns, JavaScript does not lack inheritance at all. It is present, but through a different mechanism: in this environment objects can inherit attributes at runtime or, in other words, dynamically. This means that, even after an instance has been created with certain properties, those can be edited, deleted or even added during the execution of the program. For this reason, it is possible to create an object, called child, which inherits attributes from another one, called parent, while the program is running.

The library used to implement this functionality in this project was written by Douglas Crockford; the source code and more details can be found in [7].
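As a minimal sketch of the dynamic inheritance described above, the snippet below uses the ECMAScript 5 function Object.create instead of the helper from [7]. The objects parent and child and their properties are purely illustrative and are not taken from the project.

```javascript
// Illustrative objects; the names and properties are assumptions.
var parent = {
    name: 'parent',
    greet: function() { return 'hello from ' + this.name; }
};

// child delegates to parent through its prototype chain.
var child = Object.create(parent);
child.name = 'child';
console.log(child.greet());   // inherited method, child's own name

// A property added to parent at runtime is visible through child.
parent.leave = function() { return this.name + ' leaves'; };
console.log(child.leave());

// Own properties can be deleted again while the program is running;
// the lookup then falls back to the value inherited from parent.
delete child.name;
console.log(child.greet());
```

The point of the sketch is that the relationship between the two objects is resolved at lookup time, so later changes to parent are immediately reflected in child.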


Equality and comparisons

One of the most common sources of errors in JavaScript is the comparison operators. The language is weakly typed, and the interpreter coerces the values of the operands to a common type before comparing them, which can lead to unexpected results. For example, the operations "\r\n" == 0 and 0 == "" both return true. In addition, this also impacts performance because of the extra work the interpreter has to do when converting the values.

Therefore, it is highly recommended to use the strict equality operator, represented by three equal signs (“===”). With it, the interpreter returns false if the types of the operands are not the same.
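The difference between the two operators can be checked with a few comparisons. The values below are chosen only for illustration.

```javascript
// Loose equality (==) converts the operands before comparing them.
var loose = ['\r\n' == 0, 0 == '', '1' == 1];
console.log(loose);    // [ true, true, true ]

// Strict equality (===) compares type and value without conversion.
var strict = ['\r\n' === 0, 0 === '', '1' === 1];
console.log(strict);   // [ false, false, false ]
```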

Using this variable

The variable this is not a design flaw at all; it just works differently compared to other languages. At the beginning of the project, this concept was not completely understood and caused many mistakes that were eventually corrected. Thus, a brief explanation is given here.

In contrast to other languages, this can be used not only within the method of an object but also in other scenarios. The five different ways it can be used are presented in [18]:

• In global scope, this refers to the global object.

• In a function, this still refers to the global object.

• When calling a method, this is bound to the object.

• When calling a constructor, this refers to the newly created object.

• When calling the call or apply methods, the value of this inside the called function refers to the first argument passed.

Below is an example of a common mistake, shown first. When the function callback is executed, this refers to the global scope and not to the object. The second listing shows the workaround frequently used in Wussap. Thanks to closures in JavaScript, the variable self gives access to the attributes of the object Surftrain.

Incorrect use of this:

    var a = 1;
    var Surftrain = {
        a: 1,
        join: function() {
            function callback() {
                this.a = 2;   // this is the global object here
            }
            callback();
        }
    };
    Surftrain.join();
    a == 2;            // true when run in global scope (e.g. a browser script)
    Surftrain.a == 1;  // true

Corrected:

    var a = 1;
    var Surftrain = {
        a: 1,
        join: function() {
            var self = this;   // keep a reference to the object
            function callback() {
                self.a = 2;
            }
            callback();
        }
    };
    Surftrain.join();
    a == 1;            // true
    Surftrain.a == 2;  // true


Miscalculations with Floating point

Another problem that arises when working with JavaScript is the use of floating point numbers. The typical example is:

0.1 + 0.2 != 0.3 //true

The actual result of 0.1 + 0.2 is 0.30000000000000004. This is not entirely a problem of the language, because it follows the IEEE 754 standard for this type of number, which behaves differently from the decimal arithmetic taught in school. Therefore, it is very important to take the precision of these arithmetic operations into consideration [8].
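Two common ways of coping with this are sketched below: comparing with a tolerance, and doing the arithmetic on integers. The helper nearlyEqual and the tolerance of 1e-9 are assumptions made for this example; they do not come from the project.

```javascript
// Hypothetical helper: two numbers count as equal when they differ by
// less than a small tolerance. The default of 1e-9 is an arbitrary choice.
function nearlyEqual(a, b, tolerance) {
    tolerance = tolerance === undefined ? 1e-9 : tolerance;
    return Math.abs(a - b) < tolerance;
}

console.log(0.1 + 0.2 === 0.3);             // false
console.log(nearlyEqual(0.1 + 0.2, 0.3));   // true

// Alternative: keep the amounts as integers (e.g. cents) and divide once
// at the end, so that no rounding error accumulates during the addition.
var totalCents = 10 + 20;
console.log(totalCents / 100 === 0.3);      // true
```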

Auto semicolon insertion

JavaScript does not strictly need semicolons to terminate every statement because it has a feature called automatic semicolon insertion that adds them for the programmer. However, it does not always do so in the right way. This is another example shown in Crockford's book:

Works well in JavaScript:

    return {
        ok: true
    };

Silent error!

    return
    {
        ok: true
    }

Although the two pieces of code shown above look very similar, they behave very differently: the first one returns an object and the second one returns undefined. What happens is that automatic semicolon insertion transforms the second one in the following way:

    return;         // semicolon inserted
    {
        ok: true;   // semicolon inserted
    }

Even though this should at least produce a warning because of the unreachable code that follows the return statement, JavaScript simply ignores it. Therefore, it is highly recommended not to rely on this feature and to include semicolons wherever they are intended to be.

Strategies for fault detection and correction

As previously mentioned, design patterns are used to overcome these threats. They are divided according to the phases of the life cycle of a fault: detection, recovery, mitigation and treatment. Additionally, another type of pattern that does not fit in any of these categories is considered; it is called architecture because it influences the design of the whole system and represents the patterns already used by highly available systems.

In order to use them, it is important to be familiar with them and to keep them in mind when designing the system. The book Patterns for Fault Tolerant Software contains many patterns (probably of all types) and, of course, not all of them apply to the needs of this prototype, so just a few were selected according to the threats found.

The following table lists the patterns that will be used, not only in the prototype but also in the framework that will help to protect the system:


Architecture                  Detection                     Recovery                    Treatment
Redundancy                    Complete parameter checking   Fail-over                   Root cause analysis
Recovery block                Riding over transients        Return to reference point   Reproducible Error
Minimize human intervention   System Monitor                Checkpoints                 Software Update
Maintenance Interface         Heartbeat                     What to save                Reintegration
Fault Observer                Acknowledgement               Remote Storage
                              Watchdog                      Error Handler

Fault detector and error handler: Parameter checking is done in the monitored application. Exceptions are caught and tracked depending on the severity of the fault.

System monitor, Heartbeat and Watchdog: The application framework monitors Wussap at system, request and application level, keeping track of any fault that occurs.

Redundancy, Recovery block and Fail-over: An active copy (a.k.a. hot copy) on a remote server takes over when the application has crashed.

Riding over transients: Although all faults are tracked, not all of them are corrected immediately if they are considered temporary and their severity is low (e.g. request overload).

Minimize human intervention and Maintenance interface: The application framework makes autonomous decisions to perform actions and notifies the user about the status of the system.

Remote Storage: Both persistent data (i.e. the database) and in-memory data are spread between the active server and the hot copy to reduce the recovery time.

4.6 Assumptions and Limitations

Finally, this section summarizes the limitations and scope of this project.

• Although it is not a restriction, for the moment just one server of the cluster is assumed to be listening.

• The current implementation of the prototype does not consider any cache mechanism.

• Any matter related to improving application performance has been left out of the project.

• Approaches to make Node scalable were not reviewed either.

• Although scalability has been kept in mind during the development of the project, it was not explored further, given the high performance of Node. The tools for inter-server communication will be implemented, but the logic to coordinate the servers is left for future improvements.

• Connectivity problems between servers are not considered; the connections are assumed to work without any problem.


5 Implementation

5.1 Wussap Prototype

5.1.1 System Design

Now a prototype of a simpler version of Wussap will be presented, which is nothing but an Instant Messaging service as described in the previous section. Figure 11 shows the system design with the main components of this program.

Figure 11: Components of client and server

According to their functionality, the services offered by the server can be divided into three categories:

User Management

Authentication Manager: Handles “Register new user” requests: it receives a username and password and creates a new account. If the username already exists, an error notice is returned. It also handles “Login” requests: it receives the username and password sent by the user and authenticates them against the database. If they are valid, the authentication manager asks the session manager to create a new session for the user. Once the session has been created, the authentication manager returns the list of existing surftrains and the detailed information of the chat places and surftrains that the user has subscribed to (forwarded by the session manager).

Session Manager: Receives the request from the authentication manager and returns the list of existing surftrains and the detailed information of the chat places and surftrains that the user has subscribed to.

Publish/Subscribe (PubSub) Management

Surftrain Manager: Handles the requests “create/stop surftrain”, “join/leave surftrain”, “surftrain go to new station” and “chat in the surftrain” from users, and the request for the list of all existing surftrains from both the session manager and users. In addition, it makes sure that privileged functions such as stopping or leading a surftrain are performed by the creator and no one else.

Chat Manager: Handles the requests “create/join/leave a chat place” from both the Surftrain Manager and users. It handles the request “chat in chat place” and broadcasts the chat content to all subscribers of that chat place. It also handles the request for the chat log and participants of a chat place from the session manager.

I/O Management

Database Manager: Handles connection establishment, connection close and reads/writes to the database from the application.


5.1.2 Publish-Subscriber Model

The abstract Publisher/Subscriber model is described by the diagram shown in Figure 12:

Figure 12: The Pub/Sub model

The controller contains a list of topics, and each topic contains a description text and a list of subscribers. How they interact is described by these events:

• On Subscription: the controller adds a new subscriber to the respective topic, or creates the topic if it does not exist yet

• Unsubscribe: removes a user from the subscriber list of a specific topic

• On Publish: whenever the controller receives a message, it searches for the targeted topic, verifies that the sender is also a subscriber (see below) and finally broadcasts the message to the rest of the subscribers.

In comparison to other pub/sub approaches, our implementation has one restriction: only subscribers are allowed to publish, and no one else.
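The three events and the publishing restriction can be sketched as follows. This is not the code of the prototype: the constructor PubSubController, its method names and the in-memory subscribers are assumptions made for illustration only.

```javascript
// Hypothetical controller; all names here are illustrative assumptions.
function PubSubController() {
    this.topics = Object.create(null);   // topic name -> topic
}

// On Subscription: add the subscriber, creating the topic if needed.
PubSubController.prototype.subscribe = function(topicName, subscriber) {
    var topic = this.topics[topicName];
    if (!topic) {
        topic = this.topics[topicName] = { description: '', subscribers: [] };
    }
    if (topic.subscribers.indexOf(subscriber) === -1) {
        topic.subscribers.push(subscriber);
    }
};

// Unsubscribe: remove the subscriber from the topic, if present.
PubSubController.prototype.unsubscribe = function(topicName, subscriber) {
    var topic = this.topics[topicName];
    if (!topic) { return; }
    var index = topic.subscribers.indexOf(subscriber);
    if (index !== -1) { topic.subscribers.splice(index, 1); }
};

// On Publish: only subscribers may publish; the message is delivered to
// every other subscriber. Returns false when the message is rejected.
PubSubController.prototype.publish = function(topicName, sender, message) {
    var topic = this.topics[topicName];
    if (!topic || topic.subscribers.indexOf(sender) === -1) { return false; }
    topic.subscribers.forEach(function(subscriber) {
        if (subscriber !== sender) { subscriber.receive(topicName, message); }
    });
    return true;
};

// Usage with in-memory subscribers that simply collect what they receive.
function makeUser(name) {
    return {
        name: name,
        inbox: [],
        receive: function(topic, message) { this.inbox.push(message); }
    };
}
var controller = new PubSubController();
var alice = makeUser('alice'), bob = makeUser('bob'), eve = makeUser('eve');
controller.subscribe('lobby', alice);
controller.subscribe('lobby', bob);
console.log(controller.publish('lobby', alice, 'hi'));   // true
console.log(controller.publish('lobby', eve, 'spam'));   // false, not subscribed
console.log(bob.inbox);                                   // [ 'hi' ]
```

In a real deployment the receive function would write to the client connection instead of an array; that part is deliberately left out of the sketch.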

For the Surftrain Manager, the composition of a topic (which is in fact the analogue of a Surftrain) is the same except for two additional attributes called “leader” and “currentStation”.

Figure 13: Topic described for Surftrains

Evidently, two further events must be included:

• Go to station: updates the value of “currentStation” and broadcasts this change to all subscribers

• Stop Surftrain: sends an end notification for the surftrain and deletes this topic from the controller list
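A Surftrain topic with its two additional events might look like the sketch below. The constructor Surftrain, the event names 'station' and 'end' and the station values are hypothetical; the check that only the leader may move or stop the surftrain follows the description of the Surftrain Manager given earlier.

```javascript
// Hypothetical Surftrain topic: a pub/sub topic extended with a leader
// and a current station. All names are illustrative assumptions.
function Surftrain(leader, station) {
    this.leader = leader;
    this.currentStation = station;
    this.subscribers = [leader];
    this.stopped = false;
}

// Deliver an event to every subscriber (here: append to an in-memory list).
Surftrain.prototype.broadcast = function(event, payload) {
    this.subscribers.forEach(function(subscriber) {
        subscriber.events.push({ event: event, payload: payload });
    });
};

// Go to station: only the leader may move the surftrain.
Surftrain.prototype.goToStation = function(user, station) {
    if (user !== this.leader || this.stopped) { return false; }
    this.currentStation = station;
    this.broadcast('station', station);
    return true;
};

// Stop Surftrain: notify the subscribers; the caller is then expected to
// delete the topic from the controller list.
Surftrain.prototype.stop = function(user) {
    if (user !== this.leader || this.stopped) { return false; }
    this.broadcast('end', null);
    this.subscribers = [];
    this.stopped = true;
    return true;
};

var driver = { name: 'driver', events: [] };
var rider = { name: 'rider', events: [] };
var train = new Surftrain(driver, 'station-a');
train.subscribers.push(rider);
console.log(train.goToStation(rider, 'station-b'));    // false, not the leader
console.log(train.goToStation(driver, 'station-b'));   // true
console.log(train.stop(driver));                       // true
```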


5.2 Fault-Tolerant Framework

Once the prototype has been constructed, it is time to build the framework that will externally monitor the resident application (the Wussap application in this case). It was decided to code it as a standalone program, apart from the main application, for the following reasons: first, since the stability of Node.JS is still in question, it is important to divide the components into separate processes to increase robustness and prevent the whole system from crashing if an error occurs; and second, to make it as independent as possible so it can be used for other projects and be improved unobtrusively.

5.2.1 System Design

The architecture of the framework is shown in Figure 14 and is influenced by the method called encapsulated cluster found in [11]. It is called encapsulated because the front end is not exposed to the public but kept in a private network, while a router with an assigned domain name receives all requests coming from the Internet. To reduce the possibility of a single point of failure, the router is deployed along with a backup that replaces the active one in case it crashes.

Figure 14: Diagram of Fault Tolerant Systems

The advantages of this model, compared to another method such as Round Robin DNS [11], are better, fine-grained load balancing and the removal of the problem of clients caching the IP address of a front-end server that may be down, which would occur if the front end were public. No matter which server is down, the router takes care of redirecting incoming requests to the active server.


Possible issues are that the router could become a bottleneck and impact performance. Also, having the router as the only entry point from the outside world turns it into a potential single point of failure and, even though there is a backup ready to replace it, the system is not completely immune, because a catastrophe like a power outage in the area could affect both instances.

Due to the scope of this research, load balancing is not addressed in detail. However, considering that scalability is essential in any Web application, the architecture was chosen so that it will not become an obstacle to further improvements, especially when more servers have to be added and two of them have to be active to handle the incoming load.

Next, let us take a closer look at the internal design of the framework. Figure 15 shows the components involved: application monitor, system monitor, leader elector, maintenance interface and global fault manager. The latter is the component that orchestrates and makes decisions according to the status of the system, while the function of the other components is to monitor their context and report its status. Details of the server connections, the CPU usage and memory consumption thresholds, and the operation mode are passed in a configuration file written in YAML format.

Figure 15: Diagram of Fault Tolerant Systems

The application framework starts by running the Global Fault Manager, which first initializes the Maintenance Interface and then the Leader Elector, passing it a list of servers. The Leader Elector then communicates with the other instances to decide which server will run as the active one. This method is deterministic and is based on an algorithm found in [14]. In summary, it works as follows: all servers exchange heartbeat messages containing their ID and the number of times they have been restarted, also known as the epoch. Once all responses have been received, they select the server with the lowest epoch or the lowest server ID.
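The selection step can be illustrated with a small function. This is a sketch and not the algorithm of [14]: the function electLeader, the field names id and epoch, and in particular the reading that the server ID is used to break ties between equal epochs are assumptions.

```javascript
// Each entry is what a server announces in its heartbeat. Using the
// server ID only as a tie-breaker between equal epochs is an assumption.
function electLeader(servers) {
    return servers.slice().sort(function(a, b) {
        if (a.epoch !== b.epoch) { return a.epoch - b.epoch; }
        return a.id - b.id;
    })[0];
}

var cluster = [
    { id: 3, epoch: 0 },
    { id: 1, epoch: 2 },
    { id: 2, epoch: 0 }
];
console.log(electLeader(cluster).id);   // 2
```

Because every server applies the same ordering to the same set of heartbeats, all of them arrive at the same leader without further negotiation, which is what makes the method deterministic.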

Once the selection is done, if the current server is chosen as the leader, the Global Fault Manager initializes the rest of the components. The Application Monitor runs the target application and restarts it in case it crashes, whereas the System Monitor constantly checks that CPU and memory usage do not exceed certain limits.

All these events are reported to the Global Fault Manager asynchronously via events, using the module of that name (events) that comes bundled with Node.JS. The rest of the components are mainly inspired by different open source projects found on GitHub (http://github.com) and in the Node Package Manager registry (http://npmjs.org/). Here the benefits of working with a technology that is supported by such a big and growing community of developers can be seen. Thanks to their contributions, the development of this framework was much easier.
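The notification mechanism can be sketched with the EventEmitter class of the events module. The event name 'fault' and the fields of the report are assumptions made for this example and are not taken from the framework.

```javascript
var EventEmitter = require('events').EventEmitter;

// Hypothetical component: the event name 'fault' and the payload fields
// are assumptions made for this sketch.
var systemMonitor = new EventEmitter();
var faultLog = [];

// Global Fault Manager side: subscribe and record every report.
systemMonitor.on('fault', function(report) {
    faultLog.push(report);
    console.log('fault from ' + report.source + ': ' + report.reason);
});

// Monitor side: report a fault without knowing who is listening.
systemMonitor.emit('fault', {
    source: 'system-monitor',
    reason: 'memory threshold exceeded'
});
```

The design choice illustrated here is the decoupling: a monitor only emits events, and the decision about what to do with them stays in one place.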


5.2.2 Description of Components

These are the tasks of each of the components shown above (Figure 15):

System Monitor

• Check the memory consumption of the system

• Check the CPU usage

• Inform the Global Fault Manager when a critical point has been reached
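A threshold check of this kind could be written with the os module that ships with Node. The functions sampleSystem and checkThresholds are hypothetical, and using the one-minute load average per core as a stand-in for CPU usage is an assumption of this sketch, not necessarily what the framework measures.

```javascript
var os = require('os');

// Take one sample of system-wide memory use and load. Using the 1-minute
// load average per core as a stand-in for CPU usage is an assumption.
function sampleSystem() {
    return {
        memoryUsed: 1 - os.freemem() / os.totalmem(),
        cpuLoad: os.loadavg()[0] / os.cpus().length
    };
}

// Compare a sample against thresholds (fractions between 0 and 1) and
// return the names of the limits that were exceeded.
function checkThresholds(sample, limits) {
    var exceeded = [];
    if (sample.memoryUsed > limits.memory) { exceeded.push('memory'); }
    if (sample.cpuLoad > limits.cpu) { exceeded.push('cpu'); }
    return exceeded;
}

var limits = { memory: 0.9, cpu: 0.8 };

// Result for the machine this runs on (varies from system to system).
console.log(checkThresholds(sampleSystem(), limits));

// Result for a hand-made sample with high memory use.
console.log(checkThresholds({ memoryUsed: 0.95, cpuLoad: 0.2 }, limits));   // [ 'memory' ]
```

A monitor would call sampleSystem periodically, for instance from setInterval, and report a non-empty result to the Global Fault Manager.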

Application Monitor

• Check that the application process is alive and not idle

• Check the error output of the application

• Restart the application if it is idle or has crashed

• Keep track of the exceptions thrown by the application and of the actions performed on it
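The restart behaviour can be sketched with child_process.spawn. The function monitorApplication, its parameters and the restart limit are assumptions for illustration; the optional spawnFn argument exists only so that the demonstration below can run with a fake child process instead of starting a real one.

```javascript
var childProcess = require('child_process');
var EventEmitter = require('events').EventEmitter;

// Hypothetical monitor: starts a command and restarts it whenever it
// exits, up to maxRestarts times. spawnFn defaults to child_process.spawn.
function monitorApplication(command, args, maxRestarts, spawnFn) {
    var spawn = spawnFn || childProcess.spawn;
    var state = { restarts: 0, errors: [] };

    function start() {
        var child = spawn(command, args);
        // Keep track of whatever the application writes to stderr.
        if (child.stderr) {
            child.stderr.on('data', function(chunk) {
                state.errors.push(String(chunk));
            });
        }
        // Failures to start the process at all are recorded as well.
        child.on('error', function(err) {
            state.errors.push(err.message);
        });
        child.on('exit', function() {
            if (state.restarts < maxRestarts) {
                state.restarts += 1;
                start();
            }
        });
    }

    start();
    return state;
}

// Demonstration with fake child processes, so nothing is really spawned.
// 'app.js' is a placeholder file name.
var spawned = [];
var state = monitorApplication('node', ['app.js'], 2, function() {
    var fake = new EventEmitter();
    spawned.push(fake);
    return fake;
});
spawned[0].emit('exit', 1);   // first crash, restarted
spawned[1].emit('exit', 1);   // second crash, restarted
spawned[2].emit('exit', 1);   // limit reached, not restarted
console.log(state.restarts, spawned.length);   // 2 3
```

With the default spawn function, a call such as monitorApplication('node', ['app.js'], 5) would supervise a real process; the restart count kept in state corresponds to the counter that the Global Fault Manager maintains.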

Leader Elector

• Constantly send heartbeats to the other servers

• Check whether a server is down

• Notify the Global Fault Manager if there are no backup servers

• Notify the Global Fault Manager if the current server becomes the leader
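Detecting a silent server from missing heartbeats can be sketched as follows. The constructor HeartbeatTracker, the server names and the timeout of 3000 ms are assumptions; timestamps are passed in explicitly so that the example does not depend on the clock.

```javascript
// Hypothetical failure detector: a server is considered down when no
// heartbeat arrived within timeoutMs. Names and timeout are assumptions.
function HeartbeatTracker(timeoutMs) {
    this.timeoutMs = timeoutMs;
    this.lastSeen = {};   // server ID -> timestamp of the last heartbeat
}

// Record a heartbeat received from serverId at time now (milliseconds).
HeartbeatTracker.prototype.record = function(serverId, now) {
    this.lastSeen[serverId] = now;
};

// Return the IDs of all servers whose last heartbeat is too old.
HeartbeatTracker.prototype.downServers = function(now) {
    var self = this;
    return Object.keys(this.lastSeen).filter(function(serverId) {
        return now - self.lastSeen[serverId] > self.timeoutMs;
    });
};

var tracker = new HeartbeatTracker(3000);
tracker.record('server-a', 0);
tracker.record('server-b', 0);
tracker.record('server-a', 2500);          // server-b stays silent
console.log(tracker.downServers(4000));    // [ 'server-b' ]
```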

Global Fault Manager

• Receive notifications from other components about any possible fault found in the system.

• Take further actions whenever a fault is presented.

• Keep a count of the number of times the application has been restarted

• Log all events received from other components

Maintenance Interface

• Present all events occurred in the system

• Present the status of the server in use

• Present the status of the other backup servers

• Display all the errors output by the application and the number of times it has been restarted

Configuration Details

The configuration file contains the following information:

• Command/file to execute the application

• Connection details of all other servers

• CPU usage threshold

• Memory consumption threshold

• Port on which the maintenance interface listens (e.g. 8080)

• Operation mode (e.g. debug)


6 Testings

In this section the performed tests are presented. The main idea is simply to test the functionality developed in the prototype and the framework. Additionally, some stress tests were introduced with the intention of finding potential areas for enhancement. For all of these, an AMD Turion 64 dual-core computer with 2.9 GB of memory running Ubuntu 10.10 (Maverick) was used.

6.1 Prototype Functionality Tests

Test-Driven Development was the methodology used during the course of this project. Indeed, the initial plan of developing a client with a graphical user interface was dismissed because it was considered to lack scientific interest. Instead, a series of test cases was coded in which the behavior of the system is characterized.

Test-Driven Development is also very helpful for maintainability because it avoids the repetitive task of manually checking the outputs and verifying that the system still behaves correctly when new changes are applied or when a fault is accidentally introduced due to a bad design.

Fortunately, there are tools in Node to carry out this task, starting with the “assert” module that comes bundled with it. Nevertheless, using the assertion functions alone would be very complicated and time consuming. Therefore, the nodeunit module (again, an open project designed by Caolan McMahon and published in the social repository GitHub) was also used to simplify this task.

Even better, a great tutorial on how to set everything up can be found in [20]. Once again, this confirms how useful it is to work with a technology that is supported by such a large community of developers.

The following tests were coded to develop the Instant Messaging prototype. They are run from a browser (Google Chrome), as indicated by “nodeunit”:

User operations

• Valid registration

• Failed registration - user duplicated

• Failed registration - parameters missing

• Valid authentication

• Failed authentication - credentials mismatch

Surftrain operations

• Valid registration

• Failed registration - invalid parameters

• Valid subscription

• Failed subscription - invalid ID provided

• Valid list retrieval

• Valid operation ‘send a message’

• Failed operation ‘send a message’ - invalid ID provided


• Valid operation ‘go to station’

• Failed operation ‘go to station’ - invalid ID provided

• Valid operation ‘stop surftrain’

• Failed operation ‘stop surftrain’ - invalid ID provided

• Valid unsubscription

• Failed unsubscription - invalid ID provided

Web Chat operations

• Valid subscription

• Failed subscription - invalid URL provided

• Valid operation ‘send a message’

• Failed operation ‘send a message’ - invalid URL provided

• Valid operation ‘stop surftrain’

• Failed operation ‘stop surftrain’ - invalid URL provided

• Failed subscription - invalid ID provided

As can be observed, the aim of these test cases is to cover all the possible scenarios and use cases that were documented prior to the implementation of the prototype.

There are also cases in which messages are not triggered by user actions. Instead, the server pushes messages directly to the client without any prior notice, just like any other publish-subscribe system. These are also known as Notifications and are sent when:

• A message has been received in a Website

• A message has been received in a Surftrain

• A new user joins a Surftrain

• A participant (user subscribed to a Surftrain) leaves

• A new user joins a Website

• A participant (user subscribed to a Website) leaves

• The Surftrain has gone to a new station

• The Surftrain in use has concluded

The last two notifications mentioned above are particularly important because they dictate the behavior of the browser and where it has to be redirected when the user is participating in a Surftrain.

From these tests it was possible to verify that the prototype was developed according to the specifications provided by the company and to confirm the validity of the program.
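The push behavior of the notifications listed above can be sketched with the `EventEmitter` class of Node. The `Surftrain` class, its method names and the notification types are assumptions made for this illustration; the prototype delivers notifications over WebSockets, which is omitted here.

```javascript
const EventEmitter = require('events');

// Hypothetical Surftrain channel: listeners receive notifications that
// they did not explicitly request, as in a publish-subscribe system.
class Surftrain extends EventEmitter {
  constructor(id) {
    super();
    this.id = id;
    this.participants = new Set();
  }
  join(user) {
    this.participants.add(user);
    this.emit('notification', { type: 'joined', user });
  }
  leave(user) {
    // Only notify if the user was actually a participant.
    if (this.participants.delete(user)) {
      this.emit('notification', { type: 'left', user });
    }
  }
  goToStation(url) {
    this.emit('notification', { type: 'station', url });
  }
  stop() {
    this.emit('notification', { type: 'concluded' });
  }
}
```

A client would register a single `notification` listener and, for the `station` and `concluded` types, decide where the browser has to be redirected.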


6.2 Framework Functionality Test

For testing the high-availability framework, a virtual machine (running on “VirtualBox 3.2.8”) was used to simulate 2 different servers that replace each other in case of a disruption in the system. This might not be the most convenient setup because the real circumstances present in a private network (e.g. transmission delay, loss of connectivity, etc.) are missing in this scenario. Unfortunately, the tests could not be done with 2 separate servers, as was desired, due to the lack of resources. Yet, at this early point of development, the goal is to confirm the behavior of the framework.

First, the functionalities of the system monitor were tested to confirm that the following are detected:

• The CPU usage of the hosted application reached a limit

• The memory usage reached a limit

To do so, a fault was injected into the application. It is triggered after a certain timeout and simply runs a useless piece of code that either keeps increasing the size of an array or hangs the application in an infinite loop, affecting the system in the two aspects previously mentioned. As expected, the system monitor successfully registered these changes and alerted the global fault manager in both situations.
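A fault injection of this kind could look roughly like the sketch below. The function names, chunk sizes and delay are assumptions for the example; the code actually injected during the tests is not reproduced in this document.

```javascript
// Memory fault: keep appending to an array that is never released.
const leak = [];
function growMemory(chunks, chunkSize) {
  for (let i = 0; i < chunks; i++) {
    leak.push(new Array(chunkSize).fill(0));
  }
  return leak.length;
}

// CPU fault: block the event loop for good. Deliberately never returns.
function hang() {
  for (;;) {}
}

// Trigger a fault once, after delayMs milliseconds.
// Returns the timer so that the injection can be cancelled.
function injectFault(fault, delayMs) {
  return setTimeout(fault, delayMs);
}
```

For instance, `injectFault(hang, 5000)` would freeze the application five seconds after start-up, while `injectFault(() => growMemory(1000, 100000), 5000)` would raise its memory consumption instead.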

Later, the application monitor and the leader elector were tested. Again, the aim of the following tests was to confirm the functionality of the system and to observe how requests behave from the client perspective in two possible scenarios: when the target application is being restarted (called the server test) and when the server has crashed and is failing over (called the cluster test). All of them followed the same procedure: a client (executed in the browser) sends messages to the server via WebSockets with a delay in between to simulate the frequency (200 ms, 100 ms and 50 ms for our three cases respectively). Then, the acknowledgements were recorded and plotted in a graph where the distribution of the number of acknowledgements received can be observed.
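One possible way of turning the recorded acknowledgements into such a distribution is sketched here. It assumes that the client stores one timestamp in milliseconds per acknowledgement; both function names are hypothetical, and taking the longest gap between acknowledgements as an estimate of the down time is an assumption of this example rather than the documented measurement method.

```javascript
// Count how many acknowledgements arrived in each one-second bucket,
// counting from startMs. Seconds without acknowledgements yield 0.
function acksPerSecond(timestampsMs, startMs) {
  const buckets = [];
  for (const t of timestampsMs) {
    const i = Math.floor((t - startMs) / 1000);
    buckets[i] = (buckets[i] || 0) + 1;
  }
  return Array.from(buckets, (n) => n || 0);
}

// Longest interval between two consecutive acknowledgements.
function longestGap(timestampsMs) {
  let gap = 0;
  for (let i = 1; i < timestampsMs.length; i++) {
    gap = Math.max(gap, timestampsMs[i] - timestampsMs[i - 1]);
  }
  return gap;
}
```

With the timestamps `[0, 200, 400, 3100]`, the first function yields `[3, 0, 0, 1]` and the second one 2700 ms: a burst of acknowledgements, a silent period and a late answer.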

Please be aware that, in this case, an acknowledgement means that the system is up and running (or “alive”), but it does not necessarily mean that it is ready and can recover the state the client previously had. This is because sessions are currently kept in memory, so the server loses this data if it crashes and requires the client to authenticate again. Although the obvious solution would be to persist them, this procedure might affect the performance dramatically. Therefore, better support from the client program to handle this situation must be considered and, for the moment, any enhancement on the server side has been left for future work.

Initially, the server test was tried out, where the hosted application is only restarted by the framework, and this scenario was repeated at different frequencies (5, 10 and 20 requests/second). To force the application to be restarted, an exception was inserted that is triggered by a setTimeout function every 5 seconds after the program starts running.

The results in Figure 16 show an odd distribution in which high peaks appear several times. This is mainly for two reasons:
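The restart behavior exercised by this test can be sketched with the `child_process` module of Node. This is a simplified illustration under stated assumptions: the function names, the `maxRestarts` limit and the `onFailover` callback are hypothetical, and the real application monitor also checks for idleness, which is not shown.

```javascript
const { spawn } = require('child_process');

// Decide what to do after the application exits.
// exitCode is null when the process was killed by a signal.
function nextAction(exitCode, restarts, maxRestarts) {
  if (exitCode === 0) return 'none';                      // clean shutdown
  return restarts < maxRestarts ? 'restart' : 'failover';
}

// Start the application and restart it whenever it crashes, until the
// limit is reached; then hand over through onFailover.
function supervise(command, args, maxRestarts, onFailover) {
  let restarts = 0;
  (function start() {
    const child = spawn(command, args, { stdio: 'inherit' });
    // The application could not be started at all.
    child.on('error', () => onFailover(restarts));
    child.on('exit', (code) => {
      const action = nextAction(code, restarts, maxRestarts);
      if (action === 'restart') {
        restarts += 1;
        start();
      } else if (action === 'failover') {
        onFailover(restarts);
      }
    });
  })();
}
```

A call such as `supervise('node', ['app.js'], 3, notifyManager)`, where `notifyManager` is a placeholder for the notification sent to the Global Fault Manager, would restart a crashing application three times before giving up.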

• First, since “WebSockets” are being used, which run on top of TCP sockets, the transmission of a packet is repeated several times to guarantee its delivery. That is why, after the application is restarted, it responds to the messages that attempted to arrive while it was down. As a matter of fact, no message was lost during the tests.


Figure 16: Server tests at 5, 10 and 20 requests/second

• Second, since the client was run in a browser, a single-threaded environment, the delay used to send the messages was not respected. In other words, if a timeout to send a message fires while the client is “busy” (sending a message or processing one that has arrived), the action is put in a queue and executed later. That is why some of these peaks are higher than others. Although this could be improved by executing the client in another environment, it was decided not to do so since the browser is where the client program will have to run.

Now the cluster test will be presented, which covers the case when one server is unavailable and must be replaced. This normally happens either when the number of times the application has been restarted has reached a limit or simply because the application cannot be started. Via heartbeat messages, the servers within the cluster detect this failure and proceed to choose a new leader, which starts up the application in its own instance.

To execute these tests, the same procedure was used, with the difference that only the frequency of 20 requests/second was used, as it is considered the most critical among the ones used above. Figure 17 shows the results obtained. It displays a behavior similar to the previous graph except for a longer down time (a maximum of 4.89 seconds).

Figure 17: Cluster tests with 1 and 0.1 seconds of Heartbeat


Of course, this result was expected considering that the work involved is also greater: the server has to stop the application from being restarted and shut down the leader elector of the framework in use, the failure has to be detected by the other members of the cluster, the application has to be started on the new server, and so on. What affected this down time the most was the timeout employed when sending a heartbeat message (1 second for the graph on the left in Figure 17), so it was decided to reduce it and observe the results again. As expected, the down time was lower with a heartbeat of 0.1 seconds (a maximum of 2.67 seconds).

Considering that virtual machines were employed for these tests, the delay used to send the heartbeat is not very representative of a real scenario. As a matter of fact, this value depends on the specific circumstances in which the operation takes place, such as the type of network, the distance between servers, congestion, etc. Thus, it is impossible to determine a single value, and guessing it would be hard and error prone as well. A possible solution (proposed in [14] as eventual leader elector) is to set a very low timeout and increase it every time a new leader is selected. Initially, the premise of having a unique “leader” among the servers is not guaranteed while this variable keeps growing. But once it no longer changes (i.e. it has stabilized), a unique leader is eventually determined (hence the name of the algorithm).
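The role that the heartbeat timeout plays in the down time can be illustrated with a small failure detector. The sketch rests on assumptions: the class and function names are hypothetical, time is passed in explicitly so that the logic can be followed step by step, and the election rule (the alphabetically lowest name among the servers believed to be alive) was chosen for the example and is not necessarily the rule used by the leader elector.

```javascript
// Tracks the last heartbeat seen from each peer and reports the peers
// that have been silent for longer than timeoutMs.
class HeartbeatDetector {
  constructor(peers, timeoutMs, now = Date.now()) {
    this.timeoutMs = timeoutMs;
    this.lastSeen = new Map(peers.map((peer) => [peer, now]));
  }
  heartbeat(peer, now = Date.now()) {
    this.lastSeen.set(peer, now);
  }
  suspected(now = Date.now()) {
    return [...this.lastSeen]
      .filter(([, seenAt]) => now - seenAt > this.timeoutMs)
      .map(([peer]) => peer);
  }
}

// Assumed rule: the leader is the lowest name among the live servers.
function electLeader(self, detector, now = Date.now()) {
  const down = new Set(detector.suspected(now));
  const alive = [self, ...detector.lastSeen.keys()].filter((p) => !down.has(p));
  return alive.sort()[0];
}
```

A failure cannot be noticed before `timeoutMs` has elapsed since the last heartbeat, which is why lowering the timeout from 1 second to 0.1 seconds shortens the down time.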
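The idea of an increasing timeout, as described in the paragraph above, can be sketched in a few lines. This follows the description given here rather than the full algorithm of [14]; the class name, the initial value and the step are assumptions, and the very first leader is assumed not to count as a change.

```javascript
// Starts with a low timeout and increases it whenever the leader changes.
class AdaptiveTimeout {
  constructor(initialMs, stepMs) {
    this.timeoutMs = initialMs;
    this.stepMs = stepMs;
    this.leader = null;
  }
  // Record the leader chosen in the latest round and return the timeout
  // to use from now on.
  observe(leader) {
    if (this.leader !== null && leader !== this.leader) {
      this.timeoutMs += this.stepMs;   // leadership changed: be more patient
    }
    this.leader = leader;
    return this.timeoutMs;
  }
}
```

Once the same leader is observed round after round, the timeout stops growing, which corresponds to the stabilized state mentioned above.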

6.3 Stress Testing

In addition to the basic functionalities, it was decided to observe the behavior of the system in critical situations. Therefore, another scenario was tested in which the frequency of the message exchange is considerably high. In the previous section it was shown that the database is hosted on the same server where the application logic is running, and that it is accessed via HTTP requests made by the client “cradle”.

To do so, a JavaScript program was implemented in which 3 clients join a newly created Surftrain and start sending messages at different time intervals. The messages do not differ in size, and they are sent in a round-robin manner to ensure that each client sends the same amount.
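The round-robin scheme of the stress program can be sketched as follows. The function names are hypothetical and the `send` callback stands for whatever operation delivers one message to the Surftrain; the actual test program is not reproduced here.

```javascript
// Delay between two messages for a given frequency,
// e.g. 20 requests/second correspond to 50 ms.
function intervalMs(requestsPerSecond) {
  return 1000 / requestsPerSecond;
}

// Assign each message to a client in round-robin order.
function roundRobin(clients, totalMessages) {
  const schedule = [];
  for (let i = 0; i < totalMessages; i++) {
    schedule.push(clients[i % clients.length]);
  }
  return schedule;
}

// Send one message per tick, following the schedule, then call done.
function run(schedule, delayMs, send, done) {
  let i = 0;
  const timer = setInterval(() => {
    if (i === schedule.length) {
      clearInterval(timer);
      return done();
    }
    send(schedule[i], i);
    i += 1;
  }, delayMs);
}
```

As noted for the browser client earlier in this section, timers in a single-threaded environment are not exact, so the frequency actually achieved can be lower than the nominal one.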

The following shows the results obtained:

Frequency (requests/second)    Messages sent    Messages successfully received
1000                           10,000           8
500                            10,000           8
100                            10,000           98
50                             10,000           135
5                              10,000           3198
2                              2,511            2765

Figure 18: Results from testing message sent in a Surftrain

There is an explanation for the catastrophic scenario shown in Figure 18: due to the version control mechanism built into the database, by default the database does not allow updating a document that is currently in use, and it responds with a “Document update conflict” message [6]. This message was reported by the program as well. As a matter of fact, the prototype itself coped quite well despite the high frequency. It neither crashed nor showed any symptom indicating that its performance was affected.
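The origin of the conflict can be reproduced with a small in-memory stand-in for a revision-checked document store. This is a simplified model written for illustration, not CouchDB itself: revisions are plain integers here, and the `createStore`, `get` and `put` names are assumptions of the example.

```javascript
// Minimal in-memory store in which an update must name the current
// revision of the document, otherwise it is rejected as a conflict.
function createStore() {
  const docs = new Map();
  return {
    get(id) {
      const doc = docs.get(id);
      return doc ? { ...doc } : undefined;     // hand out a copy
    },
    put(id, doc) {
      const current = docs.get(id);
      const currentRev = current ? current.rev : undefined;
      if (doc.rev !== currentRev) {
        return { ok: false, error: 'conflict' };
      }
      const rev = (currentRev || 0) + 1;
      docs.set(id, { ...doc, rev });
      return { ok: true, rev };
    }
  };
}

const store = createStore();
store.put('train', { messages: [] });          // revision 1

// Two clients read the same revision and each append one message.
const a = store.get('train');
const b = store.get('train');
const first = store.put('train', { ...a, messages: [...a.messages, 'hi'] });
const second = store.put('train', { ...b, messages: [...b.messages, 'yo'] });
// first succeeds with revision 2; second is rejected with a conflict.
```

When many clients append to the same document at a high rate, most of them hold an outdated revision by the time their update arrives, which is consistent with the small number of messages stored at the highest frequencies.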

To solve this problem, an extra parameter, batch=ok, can be included in the URL query that is sent to the database when performing an update operation. As described in [6], it works in the following way: “When a PUT ( . . . ) is sent using this option, it is not immediately written to disk. Instead it is stored in memory on a per-user basis for a second”. The evident risk is a weaker guarantee that the data is persisted, since some of it may still be in memory during a server crash.

Frequency (requests/second)    Messages sent    Messages successfully received
1000                           10,000           Application restarted!
500                            10,000           10,000
100                            10,000           10,000
50                             10,000           10,000
5                              10,000           10,000
2                              2,511            10,000

Figure 19: Results from testing message sent in a Surftrain
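For illustration, a request using the batch=ok parameter could be built as in the sketch below. It uses the standard `http` module of Node instead of the “cradle” client used by the prototype; the function names, the database name and the address `http://localhost:5984` are assumptions made for the example.

```javascript
const http = require('http');

// Build the URL of a document, optionally requesting batch mode.
function documentUrl(base, db, id, batch) {
  const path = encodeURIComponent(db) + '/' + encodeURIComponent(id);
  const url = new URL(path, base);
  if (batch) url.searchParams.set('batch', 'ok');
  return url.toString();
}

// Save a document with an HTTP PUT and report the status code.
// In batch mode the database acknowledges before writing to disk.
function saveDocument(url, doc, callback) {
  const body = JSON.stringify(doc);
  const req = http.request(url, {
    method: 'PUT',
    headers: {
      'Content-Type': 'application/json',
      'Content-Length': Buffer.byteLength(body)
    }
  }, (res) => {
    res.resume();                              // discard the response body
    callback(null, res.statusCode);
  });
  req.on('error', callback);
  req.end(body);
}
```

Here `documentUrl('http://localhost:5984', 'surftrains', 'abc', true)` produces `http://localhost:5984/surftrains/abc?batch=ok`, which can then be passed to `saveDocument`.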

The results shown in Figure 19 improved: all sent messages were successfully persisted at all the frequencies tested (except for the first one). However, two more issues were discovered. First, the response time of the system degraded considerably because all messages were saved as they arrived, which eventually turned into a bottleneck. Secondly, the extra memory used by the database with the batch parameter was registered by the system monitor, and the framework restarted the prototype. This latter case occurred only at the highest frequency.

From these tests it can be concluded that it is necessary to keep the database in a different environment, outside of the middleware (also called “business logic”), to handle the load in a better way. Also, since the operation is quite simple (updating a document by adding one message), some support can be added in the application logic so that the response is created without waiting for the acknowledgement of the database.
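The last suggestion, answering the client before the database has acknowledged the write, can be sketched as follows. The function name and the `respond`, `persist` and `onPersistError` callbacks are hypothetical; how a failed write should be reported afterwards is left open.

```javascript
// Acknowledge the message right away and persist it afterwards.
// persist(message, callback) is assumed to follow the usual Node
// convention of calling back with an error as first argument.
function handleMessage(message, respond, persist, onPersistError) {
  respond({ ok: true, id: message.id });       // reply without waiting
  persist(message, (err) => {
    if (err) onPersistError(err, message);     // e.g. log or retry later
  });
}
```

The trade-off is the same as with the batch parameter: the client may receive an acknowledgement for a message that is lost if the server crashes before the write completes.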


7 Conclusion

This document has answered the question of how to build a highly available system in Node. As has been shown, previous studies of other Internet Services indicate that the capability of keeping a cluster running despite failures relies more on the tools used for monitoring and configuring than on the software itself. In addition, it is the design patterns that can help a system become more robust and persistent. Therefore, the objective of this project was to experiment with this new technology and find out whether it has what is required to meet these expectations.

To do so, a study of the possible weaknesses and expected faults was made, together with the mitigation procedures that the system should follow, based on the Fault Model Enforcement presented in [21]. After this, a lightweight replica of the Wussap application was written in JavaScript, along with a framework that provides system monitoring and redundancy. Finally, the prototype was tested by injecting faults and confirming that the expected behavior was performed.

As was presented, building a highly available system in Node does not represent an obstacle if the right design is chosen and if JavaScript is handled properly by taking advantage of its best features. It is also important to say that, when starting to use Node, the amount of help found on the Internet is impressive because there are so many libraries and open projects from which one can benefit. This support was definitely substantial for the development of this project.

Among the issues encountered in this research project, the main ones were due to the change of paradigm that comes with a programming language that has a different model of inheritance and a powerful feature like closures, both of which were new to us. Also, working on a new platform where all the events are asynchronous requires some effort to get used to if one comes from the old school, where programs were entirely synchronous and, of course, easier to read. Nevertheless, the benefits in throughput and performance shown in previous sections are quite high and, as it was shown that high availability can be achieved, this tells us that it is definitely worth the time invested.

An area for improvement is the monitoring tools and the interface for automated tasks, in order to avoid possible failures caused by the operator. These can only be developed as the project goes on and the specific needs of the system are revealed. The idea is to develop these tools with the objective of minimizing the intervention of the human operator in the system. Also, considering that Node is still at an early stage, more testing with regard to stability is required. Statistics on CPU and memory usage, reasons for failures and load handling, obtained from a long-running execution of the prototype and the framework, would be greatly helpful.


8 Discussion and Limitations

Split-Brain Syndrome

One potential issue that the reader may have foreseen is the split-brain syndrome [9]. This term is used in high-availability clusters (like the one presented in this document) when the communication network between the nodes is down. Each node then becomes active (or declares itself leader), thinking that there are no other instances. This leads to several services running at the same time (when there is supposed to be just one) and possibly to data corruption.

Even though this scenario is likely to occur in our implementation, the risk of data corruption due to this malfunction in the network is disregarded for the following reasons: first, the router that receives the external requests may be instructed to redirect them to one single server; and secondly, even if the router were distributing the load, the data in the back end is constantly replicated by the database manager, and CouchDB even contains a feature to handle conflicts between different versions of a document [3].

Furthermore, the time when more than one server is needed to handle the load will surely come. So it is preferable to propose an architecture that is not too restrictive and, even better, allows the system to scale in the easiest way.

Eventual Consistency

Another potential issue is that the replication of the data is not coordinated with the logic of the application at all. That is, the information saved in the back end is distributed among other databases without any notice being received by the front end. If, for any reason, the Global Fault Manager decides to fail over, i.e. to pass on the role of leader, without the data having been replicated beforehand, there is a chance that some data will be temporarily lost.

Nevertheless, we disregarded the solution to this problem found in [10], called global synchronization, in which the nodes constantly make checkpoints and stop the others until a global state of the cluster is reached. As can be seen, this proposal is blocking and may impact the performance of the service if the internal communication is slow. Moreover, one of the design philosophies behind Node is to develop programs that are unobtrusive and non-blocking, so it was desired to keep the same pattern. Besides, the biggest loss that the end user can suffer is that some messages appear in a different order or at different times. Although this affects the quality of the service, it is a disruption that can be afforded in exchange for keeping the system unrestricted.

Unnecessary router?

In the implementation section, a method called encapsulated cluster is used, in which a router receives the incoming requests and then distributes them to the nodes. Considering that, for our purposes, just one node will be running the service, the need for an external device may seem expensive and somewhat useless.

However, exposing the nodes publicly with a Domain Name Server that maps a domain name to several IP addresses is problematic because the Time To Live (TTL) parameter is normally not respected by the client, which could end up pointing to a server that has crashed [11]. In addition, the idea of having several instances running to handle the load can be easily implemented by using this method.


References

[1] Scaling Instant Messaging Communication Services: A Comparison of Blocking and Non-Blocking techniques, The Sixteenth IEEE Symposium on Computers and Communications, May 2011.

[2] 10gen, Inc. mongodb. http://www.mongodb.org, 2011.

[3] J.C. Anderson, J. Lehnardt, and N. Slater. CouchDB: The Definitive Guide. O’Reilly Series. O’Reilly Media, 2009.

[4] Basho Technologies, Inc. Welcome to the riak wiki. http://wiki.basho.com/, 2011.

[5] TIOBE Software BV. Tiobe programming community index for july 2011. http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html, July 2011.

[6] Apache CouchDB. The apache couchdb project. http://couchdb.apache.org/, 2008-2011.

[7] Douglas Crockford. Classical inheritance in javascript. http://www.crockford.com/javascript/inheritance.html.

[8] Douglas Crockford. JavaScript: The Good Parts. O’Reilly Media, Inc., 2008.

[9] S.K.M.N. Deshpande. Distributed Systems. Technical Publications, 2009.

[10] Vijay Dialani, Simon Miles, Luc Moreau, David De Roure, and Michael Luck. Transparent fault tolerance for web services based architectures. In Eighth International Europar Conference (EUROPAR02), Lecture Notes in Computer Science, Padeborn, pages 889–898. Springer-Verlag, 2002.

[11] D. M. Dias, W. Kish, R. Mukherjee, and R. Tewari. A scalable and highly available web server. In Proceedings of the 41st IEEE International Computer Conference, COMPCON ’96, pages 85–, Washington, DC, USA, 1996. IEEE Computer Society.

[12] Klint Finley. Node.js creator ryan dahl’s keynote from nodeconf. http://www.readwriteweb.com/hack/2011/07/nodejs-creator-ryan-dahls-keyn.php, 2011.

[13] Google Groups. Erlang programming: node.js compared to erlang. http://groups.google.com/group/erlang-programming/browse_thread/thread/142aed19df0decd9/a6fbf0414b50c8ee?pli=1, 2010.

[14] Rachid Guerraoui and Luís Rodrigues. Introduction to Reliable Distributed Programming. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.

[15] HA Forum. Providing Open Architecture High Availability Solutions, 2001.

[16] Robert Hanmer. Patterns for Fault Tolerant Software. Wiley Publishing, 2007.

[17] InternetWorld. Arets webbentreprenorer. http://internetworld.idg.se/2.1006/1.316986/arets-webbentreprenorer-claudijo-borovic-och-niclas-holm, 2010.

[18] Zhang Yi Jiang and Ivo Wetzel. Javascript garden. http://bonsaiden.github.com/JavaScript-Garden/, 2011.

[19] Swarma Limited. Benchmark testing nginx vs apache. http://blog.webfaction.com/a-little-holiday-present, 2008.

[20] Caolan McMahon. Unit testing in node.js. http://caolanmcmahon.com/posts/unit_testing_in_node_js, 2010.


[21] Kiran Nagaraja, Ricardo Bianchini, Richard P. Martin, and Thu D. Nguyen. Using fault model enforcement to improve availability. In Proceedings of the Second Workshop on Evaluating and Architecting System dependabilitY (EASY), 2002.

[22] David Oppenheimer, Archana Ganapathi, and David A. Patterson. Why do internet services fail, and what can be done about it? In Proceedings of the 4th conference on USENIX Symposium on Internet Technologies and Systems - Volume 4, USITS’03, pages 1–1, Berkeley, CA, USA, 2003. USENIX Association.

[23] Soila Pertet and Priya Narasimhan. Causes of failure in web applications. Technical report, Carnegie Mellon University, December 2005.

[24] Stefan Tilkov and Steve Vinoski. Node.js: Using javascript to build high-performance network programs. IEEE Internet Computing, 14:80–83, November 2010.
