[Communications in Computer and Information Science] Grid and Distributed Computing Volume 261 ||...

T.-h. Kim et al. (Eds.): GDC 2011, CCIS 261, pp. 455–467, 2011. © Springer-Verlag Berlin Heidelberg 2011

Replication and Checkpoint Schemes for Task-Fault Tolerance in Campus-Wide Mobile Grid*

SookKyong Choi1, JongHyuk Lee1, HeonChang Yu1, and Hwamin Lee2

1 Dept. of Computer Science Education, Korea University, Anam-Dong, Seongbuk-Gu, Seoul, Korea

{csukyong,spurt,yuhc}@korea.ac.kr

2 Dept. of Computer Software Engineering, Soonchunhyang University, 336-745, Asan-si, Korea

[email protected]

Abstract. Mobile grid computing is a computing environment that incorporates mobile devices into an existing grid environment and supports users’ mobility. This environment is not stable, however, so methodologies to cope with the reliability issue are needed. Fault tolerance approaches for task execution in grid computing can be categorized into replication and checkpoint. In this paper, we apply these techniques in a SimGrid simulator to provide fault tolerance for a mobile environment and present the results. The results demonstrate that the best solution for fault tolerance in mobile grid computing depends on the situation of the network. The contribution of this paper is the use of real-life trace data to simulate fault tolerance in mobile grid computing.

Keywords: mobile grid, replication, checkpoint, fault tolerance, reliability.

1 Introduction

Grid computing is a technology that offers high scalability, multitasking, and multitenancy. It provides an environment that effectively executes tasks by sharing and integrating computing resources in heterogeneous computing systems. Traditional grid environments are mostly implemented with resources in physically fixed locations. But this environment is evolving and extending the scope of resources to movable resources, by using mobile devices, to take advantage of the exploding population of mobile gadgets such as laptops, netbooks, and even smartphones. More recently, mobile devices have been equipped with high-performance computing engines at inexpensive prices, attracting more and more users.

Mobile grid computing is a computing environment that incorporates mobile devices into an existing grid environment and supports users’ mobility. In a mobile grid computing environment, a mobile device can act as both a service provider and a service requester.

* This work was supported by National Research Foundation of Korea Grant funded by the Korean Government (2009-0070138)


As a service provider, a mobile device can provide its computing resources as a part of a mobile computing infrastructure. That is, mobile devices can execute tasks in a mobile computing environment, like desktops or servers in traditional wired computing. As a service requester, a mobile device can provide an interface to request services and use resources in mobile grid computing. Unlike wired computing, the mobile computing environment exhibits its own characteristics. Users of mobile devices may freely move in and out of a mobile network, which causes instability in the network connection. Even worse, mobile devices may experience sudden power-offs caused by the user or by a dead battery. We regard these cases, which may result in the loss of task outcomes, as faults. The mobile grid environment is therefore not stable, and it should provide methodologies to cope with the reliability issue.

Fault tolerance approaches for task execution in grid computing can be categorized into replication and checkpoint. Replication is used to improve reliability, fault tolerance, or accessibility using redundant resources. If the same data is duplicated and stored on several storage devices, it is referred to as data replication. If the same task is duplicated and executed several times, it is referred to as computation replication. A computation task can be executed on several resources simultaneously (replicated in space) or executed many times on one resource (replicated in time). Most studies of replication have addressed how to replicate data to improve accessibility and minimize access latencies.

Checkpoint is used to minimize the effort of re-executing a long task from the beginning when the task is interrupted by faults such as a hardware failure, a software failure, or resource unavailability. Checkpoints are taken periodically at specified intervals. Information such as machine status, intermediate results, and logs is stored on non-volatile storage at every checkpoint interval. When a fault occurs, the information stored at the latest checkpoint is retrieved and used to resume the task from that checkpoint. In the checkpoint scheme, it is crucial to determine the optimal checkpoint interval (frequency). Frequent checkpoints degrade system performance because of their overhead, while infrequent checkpoints may not provide fault tolerance. Accordingly, several studies aim at minimizing checkpoint overhead.

Compared to traditional grid computing, special care should be taken in a mobile environment to consider the characteristics of mobile devices, especially when mobile devices are utilized as resources for task processing. In this paper, we therefore apply the replication and checkpoint techniques in a SimGrid simulator and show which methods are suitable for fault tolerance when mobile devices are used as resources. To do this, we consider two incomplete cases caused by faults in a mobile environment. One is the case in which a task cannot be completed due to faults, and the other is the case in which a task is completed but its outcome cannot be returned to the requester due to faults. It is imperative to provide methods to withstand faults in mobile grid computing.

2 Related Works

To withstand faults and provide efficient grid services, it is essential to investigate task replicas that do not affect the processing of other tasks, minimize waste of resources, and improve task execution time. Early studies focused on fault tolerance methods using a fixed number of task duplicates [1,2,3,4]. Recent studies shifted to the dynamic decision of task replicas, that is, the optimal number of task replicas is dynamically determined according to system status [5,6,7].

Kang et al. [1] studied how to meet users' QoS requirements and foster balanced resource consumption by replicating tasks on under-utilized and under-priced resources in a computational service market composed of less reliable desktop PCs. Their view is that under-utilized and cheap resources can be exploited to build a high-quality resource and hence facilitate balanced resource usage. Katsaros et al. [2] suggested installment-based scheduling and task replication to overcome intermittent connectivity in mobile grid. They defined installments as consecutive fragments of a divisible task, and they used mobile devices as resources. However, the study assumed that every mobile device has the same performance and the same network environment. Silva et al. [3] proposed a dynamic replication approach, Workqueue with Replication (WQR). The approach does not fix the number of task replicas; it dynamically replicates tasks to idle resources when there is no other task to schedule. Moreover, replication is not used when system load is high. Dobber et al. [4] proposed a method combining dynamic load balancing (DLB) and job replication (JR) to cope with the unpredictable and dynamic nature of grid computing. This approach is a scheduling scheme that selects DLB or JR by comparing a statistically expected time to a threshold value set in advance. The study used four job replicas in JR.

Limaye et al. [5] proposed a smart failover approach, a job-site recovery scheme that proactively handles failures and uses a backup server to overcome situations in which the execution of a client’s job fails because the primary server fails. This approach supports transparent recovery by storing job states in a local job-manager queue and transferring those states to the backup server. Priya et al. [6] proposed task-level fault tolerance in a grid environment and claimed that their checkpoint technique achieved the optimal load balance across different grid sites. Katsaros et al. [7] considered independent checkpoint activities, proposed a statistical decision-making approach, and defined response time metrics of fault-tolerance performance and effectiveness. Darby III et al. [8] suggested a checkpoint arrangement based on the reliability of consumer and provider to maximize the recovery probability of checkpoint data in mobile grid computing, taking into account host mobility, dynamicity, less reliable wireless links, frequent disconnections, and variations in mobile environments. A mobile host simply sends its checkpoint data to a neighboring mobile host, or saves checkpoint data for a neighboring mobile host.

Several studies compare checkpoint and replication to find the better fault-tolerance solution in grid computing. To the best of our knowledge, however, no such comparison has been reported for mobile grid environments.

Wu et al. [9] compared four different fault-tolerant scheduling strategies based on a genetic algorithm to improve the reliability of a grid system. The strategies are compared in terms of performance metrics such as makespan, average turnaround time, and job failure rate. The study reports that checkpoint provides the best solution and that replication is not suitable due to its overhead. Chtepen et al. [10] introduced heuristics whose parameters are adjusted dynamically based on grid status information and proposed a hybrid scheme based on both checkpoint and replication to improve job throughput against failures. The study reports that dynamically adjusting the checkpoint frequency according to resource stability and remaining job execution time minimizes the checkpoint overhead. It also reports that postponing replication minimizes the replication cost. Moreover, the study reports that the hybrid scheme is the best approach when system information is not provided in advance.

3 Task-Fault Tolerance Approaches in Mobile Grid Environment

3.1 Architecture for Campus-Wide Mobile Grid Computing

In this paper, we assume a campus-wide mobile grid computing environment similar to [11]. In Fig. 1, Mobile_Network indicates the entire campus, which is composed of several Sub_Mobile_Networks. A Sub_Mobile_Network indicates the small network of each building on campus. MNs (Mobile Nodes, i.e., mobile devices) are connected to a Sub_Mobile_Network through an AP. There is one Global_Scheduler in the Mobile_Network, a Local_Scheduler in each Sub_Mobile_Network, and some Intermediate_Schedulers among the Sub_Mobile_Networks. A high-level scheduler controls low-level schedulers and supports load balancing among them. For this environment, we assign Proxies to act as schedulers in the networks, so a proxy receives jobs from MNs and delivers outcomes to MNs.

Because MNs act both as an interface to mobile services and as computational resources for task processing, an MN can both submit tasks and process submitted tasks. A task submitted by a user is transmitted to a Local_Scheduler. The Local_Scheduler selects an MN in its own Sub_Mobile_Network to process the user’s task and allocates the task to the selected MN. If the Local_Scheduler does not have enough resources, namely MNs, to process the task, the task is passed from the Local_Scheduler to an Intermediate_Scheduler or the Global_Scheduler recursively. In the reverse direction, each high-level scheduler in turn selects the low-level scheduler through which the task outcome is returned, so the outcome reaches the user who submitted the task. Because we assume that tasks are divisible and independent in this paper, a scheduler can divide a large task that a single mobile device cannot execute.

Faults include network disconnection and power-off of a mobile device. More generally, we regard every case that causes a task failure in the mobile grid as a fault, but we do not consider Byzantine failures in this paper. When a fault occurs, either a task cannot complete, or the user who submitted the task cannot receive the task outcome even though the task has completed.

3.2 Status of Mobile Device (MN)

The status of a mobile device for processing a job is either ‘Available’ or ‘Not Available’. The ‘Available’ status means that an MN can submit a job, process a job, and receive a job outcome. The ‘Not Available’ status means that the MN cannot be used as a resource to provide grid service because of battery shortage, network disconnection, or other physical faults. We do not distinguish the causes of the ‘Not Available’ status in this paper; we consider only whether the MN is available or not for the time duration needed to process a job.


Fig. 1. Architecture for Campus-wide Mobile Grid Computing

To select a proper MN for processing a job, a probability that a mobile device is available can be calculated by equation (1).

$$\mathit{Availability}_{ij} = \frac{\sum_{k=1}^{n} \mathit{Pers}_{ijk}}{T_{avail} + T_{notavail}} \qquad (1)$$

where $\sum_{k=1}^{n} \mathit{Pers}_{ijk}$ is the User Persistence (Pers), meaning the time duration during which the ith MN stays at the jth Sub_Mobile_Network until the MN moves to another Sub_Mobile_Network or the network link goes down [12]. The index k runs over the n periods in which the MN is in the ‘Available’ status. $T_{avail}$ and $T_{notavail}$ are the time durations during which an MN can and cannot process a job, respectively.
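As a rough illustration of equation (1), the following Python sketch divides the summed persistence periods by the total observed time. The function name, the argument names, and the sample durations are hypothetical and chosen only for this example; they do not come from the trace.

```python
def availability(persistence_periods, t_avail, t_notavail):
    """Equation (1): summed persistence divided by total observed time."""
    total = t_avail + t_notavail
    if total <= 0:
        raise ValueError("observation window must be positive")
    return sum(persistence_periods) / total


# Hypothetical MN: three stays of 30, 45 and 15 minutes in one
# Sub_Mobile_Network, observed over 90 available + 30 unavailable minutes.
print(availability([30, 45, 15], t_avail=90, t_notavail=30))  # 0.75
```

All durations must use the same time unit, since the result is a ratio.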

3.3 Task Scheduling Scheme

We apply three types of task scheduling schemes for fault tolerance in this paper: no replication and no checkpoint, task with Replication, and task with Checkpoint. The first refers to task execution without replication or checkpoint; the second and the third refer to task execution under one of several replication options or checkpoint options, respectively. The basic principle in this paper is that a submitted task is first processed in the Sub_Mobile_Network that includes the submitting MN. During task execution, various situations can arise in which the task outcome cannot be returned to the user.


3.3.1 No Replication and No Checkpoint

When a task is submitted to a Sub_Mobile_Network, the Local_Scheduler in that Sub_Mobile_Network selects an MN among the MNs connected to it and allocates the task to that MN. The MN with the best availability according to equation (1) is selected first. If there is no MN to process the task in the Sub_Mobile_Network, the request for task processing is passed to a high-level scheduler. The high-level scheduler selects an MN for task processing by using information about the MNs from the IS (Information Service). Finally, if there is no MN to process the task in the whole network, the task cannot be processed. If the selected MN fails due to faults, another MN can process the task from the beginning. We call this scheduling scheme the No-No scheme in Section 4.
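The selection rule of the No-No scheme can be sketched as follows. This is a minimal sketch under our own assumptions: schedulers and MNs are modeled as plain dictionaries, the key names (`mns`, `parent`, `available`, `availability`) are hypothetical, and the Information Service is reduced to the list of MNs visible to each scheduler.

```python
def select_mn(scheduler):
    """Pick the available MN with the highest availability, escalating to
    the parent (higher-level) scheduler when no visible MN is free."""
    while scheduler is not None:
        candidates = [mn for mn in scheduler["mns"] if mn["available"]]
        if candidates:
            best = max(candidates, key=lambda mn: mn["availability"])
            return best["name"]
        scheduler = scheduler["parent"]
    return None  # no MN in the whole network: the task cannot be processed


# Hypothetical two-level hierarchy: one Local_Scheduler under a Global_Scheduler.
global_s = {"parent": None,
            "mns": [{"name": "mn-c", "available": True, "availability": 0.6}]}
local_s = {"parent": global_s,
           "mns": [{"name": "mn-a", "available": False, "availability": 0.9},
                   {"name": "mn-b", "available": True, "availability": 0.4}]}
print(select_mn(local_s))  # mn-b
```

In this sketch a failed MN is handled by calling `select_mn` again and restarting the task from the beginning on the MN it returns.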

3.3.2 Replication

An original task submitted to a Sub_Mobile_Network can be replicated by a proxy according to the following replication options.

• option_1 : always replicate

• option_2 : replicate when the task is long

• option_3 : replicate when little of the task remains

• option_4 : replicate when system load in a mobile network is low

• option_5 : replicate when the ratio of replicated tasks in a mobile network is low

The default number of replicas for a job is one, in order to reduce the waste of resources and the replication overhead in the whole Mobile_Network. System load increases with the number of MNs processing a job. If a task is replicated, the original task and its replica are executed on different MNs simultaneously. The remaining size of the task, the system load, and the ratio of replicated tasks change dynamically while the task is executed, so the proxy decides dynamically whether to replicate. When the faster of the original task and the replica completes, the other is canceled. Fig. 2 shows the algorithm for replication.

3.3.3 Checkpoint

An original task can be checkpointed by a proxy according to the following checkpoint options.

• option_1 : always checkpoint

• option_2 : checkpoint when the task is long

• option_3 : checkpoint when little of the task remains

• option_4 : checkpoint when system load in a mobile network is low

• option_5 : checkpoint when the ratio of checkpointed tasks in a mobile network is low


if (option_1_flag is true) {
    // replication process
    create same task.
    select another available MN to put the task.
    dispatch the task on the MN.
}
if (option_2_flag is true) {
    check task size.
    if (task is long) { start replication process of option_1 }
}
if (option_3_flag is true) {
    check the size of remaining task.
    if (the size of remaining task is not much) { start replication process of option_1 }
}
if (option_4_flag is true) {
    check system load.
    if (system load is low) { start replication process of option_1 }
}
if (option_5_flag is true) {
    check the ratio of replicated task in a mobile network.
    if (the ratio of replicated task is low) { start replication process of option_1 }
}
when one task out of the two, original task or replica, is over,
    kill the other task, and set the completion time by own completion time.

Fig. 2. Algorithm for replication
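The replication decision of Fig. 2 can be sketched in Python as follows. This is an illustrative sketch, not the simulator code: the function names and the boolean inputs are hypothetical, and the thresholds behind "long", "not much", and "low" are assumed to have been evaluated by the caller.

```python
def should_replicate(flags, task_is_long, remaining_is_small,
                     load_is_low, replicated_ratio_is_low):
    """Return True when any enabled replication option has its condition met."""
    conditions = {
        "option_1": True,                     # always replicate
        "option_2": task_is_long,
        "option_3": remaining_is_small,
        "option_4": load_is_low,
        "option_5": replicated_ratio_is_low,
    }
    return any(flags.get(name, False) and met
               for name, met in conditions.items())


def completion_time(original_finish, replica_finish=None):
    """First finisher wins; None marks a copy that failed or was never made."""
    finishes = [t for t in (original_finish, replica_finish) if t is not None]
    return min(finishes) if finishes else None


# A short task with only option_2 enabled is not replicated.
print(should_replicate({"option_2": True}, task_is_long=False,
                       remaining_is_small=False, load_is_low=True,
                       replicated_ratio_is_low=True))   # False
print(completion_time(120, 95))                         # 95
```

Taking the minimum of the two finish times mirrors the rule that the slower copy is killed once the faster one completes.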

The default number of MNs used for checkpointing a job is one, as in replication. The remaining size of the task, the system load, and the ratio of checkpointed tasks change dynamically while the task is executed, so the proxy decides dynamically whether to checkpoint. Fig. 3 shows the algorithm for checkpoint.

3.3.4 Example

Assume that option_1_flag is false and option_2_flag is true, but the task is short, in both the replication and the checkpoint algorithm. Then the task is neither replicated nor checkpointed, and it is restarted from the beginning when a fault occurs.

4 Simulation

4.1 Data Analysis

To identify suitable methods for fault tolerance in mobile grid computing, we analyzed a real-life trace: the WLAN trace of the Dartmouth campus [13]. The trace is a syslog produced by APs from September 1, 2005 to October 4, 2006 [14]. We selected the part of the trace for June 6, 2006 as input for the simulation; it includes 987 APs and 3,367 mobile devices. We extracted network information by analyzing the trace. About 80% of all sessions lasted less than 2 hours.
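A session-length statistic of this kind can be computed along the following lines. The sketch assumes that sessions have already been extracted from the syslog as (start, end) pairs; the parsing step is omitted, and the five sample sessions are invented for illustration and are not taken from the Dartmouth trace.

```python
from datetime import datetime, timedelta


def short_session_rate(sessions, limit=timedelta(hours=2)):
    """Fraction of (start, end) sessions that lasted less than `limit`."""
    durations = [end - start for start, end in sessions]
    if not durations:
        return 0.0
    return sum(d < limit for d in durations) / len(durations)


# Hypothetical sessions on one day: four short ones and one long one.
day = datetime(2006, 6, 6)
sessions = [
    (day.replace(hour=9), day.replace(hour=9, minute=20)),
    (day.replace(hour=10), day.replace(hour=11, minute=30)),
    (day.replace(hour=12), day.replace(hour=12, minute=5)),
    (day.replace(hour=13), day.replace(hour=18)),
    (day.replace(hour=19), day.replace(hour=20, minute=59)),
]
print(short_session_rate(sessions))  # 0.8
```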


4.2 Setup for Simulation

In this paper, we used the Java-based SimGrid simulator [15,16]. SimGrid is a toolkit that provides core functionalities for the simulation of distributed applications in heterogeneous distributed environments. To reflect the dynamic characteristics of mobile devices in mobile grid computing, i.e., their unstable and unpredictable behavior, we set up the simulation environment according to the results of the real-life trace analysis. We used the MSG module of SimGrid, which can support a proxy and MN clients.

Configurations for simulation are presented in Table 1. Tasks are computationally intensive. The number of tasks takes three values: 10, 100, and 200. Task length (size) takes two values: short and long. We assume that all MNs (mobile devices) have the same performance in this paper. They can freely move around within a Sub_Mobile_Network or to other Mobile_Networks. The number of MNs that submit, process, and receive jobs takes four values: 10, 100, 1000, and 3300. There are three types of task scheduling schemes: no replication and no checkpoint, task with Replication, and task with Checkpoint.

check the status of MN executing a task.
if (MN fail) {
    if (option_1_flag is true) {
        // checkpoint process
        get the checkpoint information.
        select another available MN to put the task.
        restart the task from the checkpoint.
    }
    if (option_2_flag is true) {
        check task size.
        if (task is long) { start checkpoint process of option_1 }
    }
    if (option_3_flag is true) {
        check the size of remaining task.
        if (the size of remaining task is not much) { start checkpoint process of option_1 }
    }
    if (option_4_flag is true) {
        check system load.
        if (system load is low) { start checkpoint process of option_1 }
    }
    if (option_5_flag is true) {
        check the ratio of checkpoint task in a mobile network.
        if (the ratio of checkpoint task is low) { start checkpoint process of option_1 }
    }
}

Fig. 3. Algorithm for checkpoint
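To make the effect of the recovery step in Fig. 3 concrete, the following Python sketch counts how much work is executed in total when an MN fails, with and without checkpoints. It is a simplified model under our own assumptions: work is measured in abstract units, checkpoints are taken at fixed progress intervals, and checkpoint overhead is ignored. The function and parameter names are hypothetical.

```python
def work_executed(task_length, failures_at, checkpoint_interval=None):
    """Total work units executed until the task finishes.

    failures_at lists the progress values (in work units) at which the
    executing MN fails, in the order in which the failures happen.
    """
    done = 0      # progress that survives a failure
    executed = 0  # all work performed, including work lost to faults
    for fail_progress in failures_at:
        if not done <= fail_progress < task_length:
            raise ValueError("failure point outside the remaining work")
        executed += fail_progress - done
        if checkpoint_interval:
            # resume from the latest checkpoint at or before the failure
            done = (fail_progress // checkpoint_interval) * checkpoint_interval
        else:
            done = 0  # no checkpoint: restart from the beginning
    executed += task_length - done
    return executed


# One failure at 70% of a 100-unit task.
print(work_executed(100, [70]))                          # 170
print(work_executed(100, [70], checkpoint_interval=20))  # 110
```

With a checkpoint every 20 units, only the 10 units executed since the last checkpoint are lost, instead of all 70.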


Table 1. Configurations for simulation

Task

• type of task : computational intensive

• the number of tasks : 10, 100, and 200

• length for task (task size) : short and long

Mobile devices

• the number of mobile devices : 10, 100, 1000, and 3300

• availability : 0 ~ 100%

• network connectivity : 0 or 1 (0: disconnection, 1: connection)

• whole system load : high and low

Methods for Task Fault-tolerance

• no replication and no checkpoint (No-No scheme)

• task with Replication

• task with Checkpoint

4.3 Results of Simulation

Fig. 4 shows the results of the No-No scheme, namely no replication and no checkpoint, which is the baseline for the other scheduling schemes, Replication and Checkpoint. Fig. 4 presents the completion time of tasks, the number of completed tasks, the number of failed tasks, and the number of used hosts. The Y-axis shows the task classification, such as 10/10/L, which is composed of the number of tasks, the number of hosts, and the size of the tasks (Long or Short).

In addition, we analyze the simulation results for four evaluation metrics, namely the average execution time of all tasks, the completion rate of tasks, the number of completed tasks, and the utilization rate of resources, under the five options of each task scheduling scheme.
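The four metrics can be computed from per-task records roughly as follows. The definitions used here are our assumptions for illustration, since the text does not spell them out: the average execution time is taken over completed tasks, the completion rate is completed tasks divided by submitted tasks, and the utilization rate is the number of distinct hosts used divided by the total number of hosts. The record fields are hypothetical.

```python
def evaluate(tasks, total_hosts):
    """Compute the four evaluation metrics from per-task records.

    Each record is a dict with keys "completed" (bool), "time" (execution
    time) and "hosts" (names of the hosts that ran the task).
    """
    completed = [t for t in tasks if t["completed"]]
    used_hosts = {host for t in tasks for host in t["hosts"]}
    return {
        "average_execution_time":
            sum(t["time"] for t in completed) / len(completed)
            if completed else None,
        "completion_rate": len(completed) / len(tasks) if tasks else 0.0,
        "completed_tasks": len(completed),
        "utilization_rate": len(used_hosts) / total_hosts,
    }


# Hypothetical run: three tasks on a network of four hosts.
records = [
    {"completed": True, "time": 10, "hosts": ["mn-a"]},
    {"completed": True, "time": 20, "hosts": ["mn-b", "mn-c"]},
    {"completed": False, "time": 5, "hosts": ["mn-a"]},
]
print(evaluate(records, total_hosts=4))
```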

Fig. 4. No replication and no checkpoint

● Average Execution Time

Fig. 5 and 6 show the average execution time in the replication and checkpoint schemes, compared with the No-No scheme represented by the dotted line. The average execution time is shorter in the replication scheme than in the checkpoint scheme. This is due to checkpoint overhead: unlike replication, the decision to checkpoint is made during processing.


Fig. 5. Average Execution Time in Replication

Fig. 6. Average Execution Time in Checkpoint

● Completion rate of Tasks

Fig. 7 and 8 show the completion rate of tasks in each scheme. In the replication scheme, the completion rate of tasks rises with the number of hosts. This means that if there are plenty of hosts, using replication can increase the completion rate. In the checkpoint scheme, the completion rate is mostly higher than in the No-No scheme. This means that the checkpoint scheme increases the probability of job completion to a certain extent.

Fig. 7. Completion Rate of Tasks in Replication


Fig. 8. Completion Rate of Tasks in Checkpoint

● The number of Completed Tasks

Fig. 9 and 10 show the number of completed tasks in each scheme. The number of completed tasks is higher in the checkpoint scheme than in the No-No and replication schemes. This is because the checkpoint scheme deals with resource faults more actively.

Fig. 9. The number of Completed Tasks in Replication

Fig. 10. The number of Completed Tasks in Checkpoint

● Utilization Rate of Resources

Fig. 11 and 12 show the utilization rate of resources in each scheme. There is little difference between the two schemes. It seems that the extra resource used for replication and the additional resource used for checkpoint amount to the same thing.


Fig. 11. Utilization Rate of Resources in Replication

Fig. 12. Utilization Rate of Resources in Checkpoint

5 Conclusion and Future Work

In this paper, we presented fault-tolerance methods for a mobile computing environment. We applied replication and checkpoint scheduling schemes to real-life trace data. The results demonstrate that the best solution for fault tolerance in mobile grid computing depends on the situation of the whole network. The average execution time is shorter in the replication scheme, and the number of completed tasks is larger in the checkpoint scheme. So if there are plenty of resources in a network and they are comparatively reliable, the replication scheme is better; otherwise the checkpoint scheme is better.

We plan to apply the real-life trace to a mobile cloud computing environment using the CloudSim simulator, and to study the provisioning of reliable services based on SLA and QoS in mobile cloud computing.

References

1. Kang, W., Huang, H.H., Grimshaw, A.: A highly available job execution service in computational service market. In: 8th IEEE/ACM International Conference on Grid Computing, September 19-21, pp. 275–282 (2007)

2. Katsaros, K., Polyzos, G.C.: Evaluation of scheduling policies in a Mobile Grid architecture. In: Proc. International Symposium on Performance Evaluation of Computer and Telecommunication Systems (SPECTS 2008), Edinburgh, UK (June 2008)

3. Silva, D., Cirne, W., Brasileiro, F.: Trading Cycles for Information: Using Replication to Schedule Bag-of-Tasks Applications on Computational Grids. In: Kosch, H., Böszörményi, L., Hellwagner, H. (eds.) Euro-Par 2003. LNCS, vol. 2790, pp. 169–180. Springer, Heidelberg (2003)


4. Dobber, M., van der Mei, R., Koole, G.: Dynamic Load Balancing and Job Replication in a Global-Scale Grid Environment: A Comparison. IEEE Transactions on Parallel and Distributed Systems 20(2), 207–218 (2009)

5. Limaye, K., Leangsuksun, C.B., et al.: Job-Site Level Fault Tolerance for Cluster and Grid environments. In: The 2005 IEEE Cluster Computing, Boston, MA, September 27-30 (2005)

6. Baghavathi Priya, S., Prakash, M., Dhawan, K.K.: Fault Tolerance-Genetic Algorithm for Grid Task Scheduling using Check Point. In: Sixth International Conference on Grid and Cooperative Computing, GCC 2007 (2007)

7. Katsaros, P., Angelis, L., Lazos, C.: Performance and Effectiveness Trade-Off for Checkpointing in Fault-Tolerant Distributed Systems. Concurrency and Computation: Practice and Experience 19(1), 37–63 (2007)

8. Darby III, P.J., Tzeng, N.-F.: Decentralized QoS-Aware Checkpointing Arrangement in Mobile Grid Computing. IEEE Transactions On Mobile Computing 9(8), 1173–1186 (2010)

9. Wu, C.-C., Lai, K.-C., Sun, R.-Y.: GA-Based Job Scheduling Strategies for Fault Tolerant Grid Systems. In: IEEE Asia-Pacific Services Computing Conference (2008)

10. Chtepen, M., Claeys, F.H.A., Dhoedt, B., De Turck, F., Demeester, P., Vanrolleghem, P.A.: Adaptive Task Checkpointing and Replication: Toward Efficient Fault-Tolerant Grids. IEEE Transactions on Parallel and Distributed Systems 20(2), 180–190 (2009)

11. Katsaros, K., Polyzos, G.C.: Optimizing Operation of a Hierarchical Campus-wide Mobile Grid for Intermittent Wireless Connectivity. In: 15th IEEE Workshop on Local & Metropolitan Area Networks, LANMAN 2007, June 10-13, pp. 111–116 (2007)

12. Balazinska, M., Castro, P.: Characterizing Mobility and Network Usage in a Corporate Wireless Local-Area Network. In: Proceedings of the First International Conference on Mobile Systems, Applications, and Services (2003)

13. Henderson, T., Kotz, D.: CRAWDAD trace dartmouth/campus/syslog/05_06 (February 8, 2007), http://crawdad.cs.dartmouth.edu

14. Lee, J.H., Choi, S.J., Suh, T., Yu, H.C.: Mobility-aware Balanced Scheduling Algorithm in Mobile Grid Based on Mobile Agent. The Knowledge Engineering Review (2010) (accepted for publication)

15. Buyya, R., Murshed, M.: GridSim: A Toolkit for the Modeling and Simulation of Distributed Resource Management and Scheduling for Grid Computing. Concurrency and Computation: Practice and Experience 14(13–15), 1175–1220 (2002)

16. Sulistio, A., Cibej, U., Venugopal, S., Robic, B., Buyya, R.: A toolkit for modelling and simulating data Grids: an extension to GridSim. Concurrency and Computation: Practice & Experience 20(13), 1591–1609 (2008)
