Fault Tolerant Parallel Data-Intensive Algorithms
Mucahid Kutlu, Gagan Agrawal (Department of Computer Science and Engineering, The Ohio State University)
Oguz Kurt (Department of Mathematics, The Ohio State University)
HIPC'12 Pune, India
Outline
• Motivation
• Related Work
• Our Goal
• Data Intensive Algorithms
• Our Approach
– Data Distribution
– Fault Tolerant Algorithms
– Recovery
• Experiments
• Conclusion
Why Is Fault Tolerance So Important?
• Typical first year for a new cluster*:
– 1000 individual machine failures
– 1 PDU failure (~500–1000 machines suddenly disappear)
– 20 rack failures (40–80 machines disappear, 1–6 hours to get back)
– Other failures due to overheating, maintenance, …
• If your code runs on 1000 computers for more than a day, there is a high probability of a failure before it finishes.
*Taken from Jeff Dean's talk at Google I/O (http://perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx)
Fault Tolerance So Far
• MPI fault tolerance, which focuses on checkpointing [1], [2], [3]
– High overhead
• MapReduce [4]
– Good fault tolerance for data-intensive applications
• Algorithm-based fault tolerance
– Mostly for scientific computations such as linear algebra routines [5], [6] and iterative computations [7], including conjugate gradient [8]
Our Goal
• Our main goal is to develop an algorithm-based fault-tolerance solution for data-intensive algorithms.
• Target failures:
– We focus on hardware failures.
– Everything on the failed node is lost.
– The failed node is not recovered; processing continues without it.
• In an iterative algorithm, when a failure occurs:
– The system shouldn't restart the failed iteration from the beginning.
– The amount of lost data should be as small as possible.
Data Intensive Algorithms
• We focus on two algorithms: K-Means and Apriori.
• The approach generalizes to any algorithm with the following reduction processing structure.
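As an illustration of this reduction structure, the sketch below shows 1-D k-means as an instance: each data element is folded into a small reduction object (per-centroid sums and counts), and summaries of disjoint portions merge by simple addition. This is a simplified stand-alone sketch, not the paper's implementation; the value of K, the 1-D points, and the function names are illustrative assumptions.

```c
#define K 2  /* number of clusters; 2 clusters and 1-D points keep the sketch small */

/* Reduction object: per-centroid running sum and count. Summaries of
 * disjoint data portions combine by addition, which is the property
 * both k-means and Apriori counting share. */
typedef struct { double sum[K]; long count[K]; } Reduction;

/* Process one portion of the data: each point is accumulated into the
 * reduction object under its nearest centroid. */
void reduce_portion(const double *data, int n,
                    const double centroids[K], Reduction *r) {
    for (int i = 0; i < n; i++) {
        int best = 0;
        double bestd = (data[i] - centroids[0]) * (data[i] - centroids[0]);
        for (int k = 1; k < K; k++) {
            double d = (data[i] - centroids[k]) * (data[i] - centroids[k]);
            if (d < bestd) { bestd = d; best = k; }
        }
        r->sum[best] += data[i];
        r->count[best] += 1;
    }
}

/* Master side: merge a slave's summary into the global reduction object. */
void merge(Reduction *global, const Reduction *local) {
    for (int k = 0; k < K; k++) {
        global->sum[k] += local->sum[k];
        global->count[k] += local->count[k];
    }
}
```

Because `merge` is associative and order-independent, the master can fold in summaries of individual data portions as they arrive, which is what makes the summarization scheme below possible.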
Our Approach
• We use a master–slave approach.
• Replication: divide the data to be replicated into parts, and distribute the parts among different processors.
– The amount of data lost in a failure is smaller.
• Summarization: slaves send results for parts of their data before they have processed all of it.
– Data whose results the master already holds need not be re-processed.
Data Distribution
[Figure: the master assigns primary data blocks D1–D8 to processors P1–P4; under replication, each block's replica is divided into parts and spread across processors P1–P7 with minimal overlap.]
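One way to realize such a distribution is a simple rotation rule; the rule below is an illustrative assumption, not necessarily the exact mapping used on the slide. It spreads the parts of each replica over distinct processors, so a single failure loses at most one part of any replica.

```c
/* Hypothetical placement rule: the replica of processor i's data is
 * split into parts, and part j is placed on processor
 * (i + 1 + j) mod P. For j < P - 1 the parts land on distinct
 * processors, none of which is i itself. */
int replica_owner(int i, int j, int P) {
    return (i + 1 + j) % P;
}
```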
K-means with No Failure
[Figure: processors P1–P7 each hold primary data blocks and replica parts of other processors' blocks. Each slave processes its primary data one portion at a time and sends the summary of each portion to the master, which broadcasts the new centroids once it has received summaries for all portions (1–14). Replicas stay unused while no failure occurs.]
Single Node Failure
Case 1: P1 fails after sending the result of D1.
Recovery: the master notifies P3 to process D2 in this iteration, and to process D1 starting from the next iteration.
[Figure: the same layout of primary data and replicas across P1–P7; P1 has failed, and its blocks D1 and D2 are covered by replicas on the surviving processors.]
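The master-side bookkeeping for this case can be sketched as follows; the data layout and function name are hypothetical, chosen only to illustrate the idea. Blocks of the failed node that already have a summary this iteration are skipped, and only the rest are handed to replica holders now.

```c
/* Given the failed node's primary blocks and the blocks for which the
 * master already received a summary this iteration, collect the
 * blocks that replica holders must still process in this iteration.
 * Returns how many such blocks there are. */
int blocks_to_redo(const int owned[], int n_owned,
                   const int summarized[], int n_sum,
                   int redo[]) {
    int r = 0;
    for (int i = 0; i < n_owned; i++) {
        int done = 0;
        for (int j = 0; j < n_sum; j++)
            if (summarized[j] == owned[i]) { done = 1; break; }
        if (!done)
            redo[r++] = owned[i];  /* no summary yet: must be redone */
    }
    return r;
}
```

In Case 1 this yields exactly one block: P1 failed after sending the result of D1, so only D2 is re-processed in the current iteration.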
Multiple Node Failure
Recovery:
– The master notifies P7 to read the first data block (D1 and D2) from the storage cluster.
– The master notifies P6 to process D5 and D6, P4 to process D3, and P5 to process D4.
– P7 reads D1 and D2.
[Figure: P1, P2, and P3 have failed; every replica part of D1 and D2 resided on the failed nodes, so P7 re-reads D1 and D2 from the storage cluster while the surviving slaves take over the remaining blocks.]
Case 2: P1, P2, and P3 fail; D1 and D2 are lost entirely!
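The storage-cluster fallback reduces to a simple test, sketched below under assumed data structures (this is an illustration, not the paper's code): a block must be re-read from the storage cluster only when its primary owner and every replica holder are among the failed nodes.

```c
/* A block is recoverable from any live holder (primary owner or
 * replica holder). Return 1 only when all of them have failed, i.e.
 * the block must be fetched from the storage cluster. */
int must_reread(const int holders[], int n_holders,
                const int failed[], int n_failed) {
    for (int i = 0; i < n_holders; i++) {
        int down = 0;
        for (int j = 0; j < n_failed; j++)
            if (failed[j] == holders[i]) { down = 1; break; }
        if (!down)
            return 0;  /* a live holder exists: recoverable locally */
    }
    return 1;          /* all holders failed: storage-cluster read */
}
```

This is why distributing replica parts with minimal intersection matters: the more processors share a block's holders, the less likely it is that a single multi-node failure takes all of them down.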
Experiments
• We used the Glenn cluster at OSC for our experiments.
• The nodes have:
– Dual-socket, quad-core 2.5 GHz Opterons
– 24 GB RAM
• Allocated 16 slave nodes with 1 core each.
• Implemented in C using the MPI library.
• Generated datasets:
– K-Means: size 4.87 GB; 20 coordinates per point; maximum of 50 iterations
– Apriori: size 4.79 GB; 10 items; support threshold 1%; maximum rule size 6 (847 rules)
Effect of Summarization
[Charts: varying the number of messages at different failure percentages; varying the number of data blocks with different numbers of failures.]
Effect of Replication & Scalability
[Charts: effect of replication for Apriori; scalability test for Apriori.]
Fault Tolerance in MapReduce
• Data is replicated in the file system.
– We replicate the data across processors, not in the file system.
• If a task fails, it is re-executed.
• Completed map tasks also need to be re-executed, since their results are stored on local disks.
• Because of dynamic scheduling, it can achieve better parallelism after a failure.
Experiments with Hadoop
• Experimental setup:
– Allocated 17 nodes, 1 core each.
– Used one node as the master and the rest as slaves.
– Used the default chunk size.
– No backup nodes were used.
– Replication was set to 3.
• Each test was executed 5 times; we took the average of 3 runs after eliminating the maximum and minimum.
• For our system, we set R = 3, S = 4, and M = 2.
• We measured total time, including I/O operations.
Experiments with Hadoop (Cont'd)
[Charts: a single failure occurring at different percentages of progress; multiple-failure test for Apriori.]
Conclusion
• Summarization is effective for fault tolerance.
– Recovery is faster when the data is divided into more parts.
• Dividing the data into small parts and distributing them with minimal intersection decreases the amount of data to be recovered after multiple failures.
• Our system performs much better than Hadoop.
References
1. J. Hursey, J. M. Squyres, T. I. Mattox, and A. Lumsdaine. The design and implementation of checkpoint/restart process fault tolerance for Open MPI. In IPDPS 2007, pages 1–8, March 2007.
2. G. Bosilca, A. Bouteiller, F. Cappello, S. Djilali, G. Fedak, C. Germain, T. Herault, P. Lemarinier, O. Lodygensky, F. Magniette, V. Neri, and A. Selikhov. MPICH-V: Toward a scalable fault tolerant MPI for volatile nodes. In Supercomputing 2002 (SC '02), page 29, November 2002.
3. C. Coti, T. Herault, P. Lemarinier, L. Pilard, A. Rezmerita, E. Rodriguez, and F. Cappello. Blocking vs. non-blocking coordinated checkpointing for large-scale fault tolerant MPI. In SC '06, 2006.
4. J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, pages 137–150, 2004.
5. J. S. Plank, Y. Kim, and J. J. Dongarra. Algorithm-based diskless checkpointing for fault tolerant matrix operations. In FTCS-25, pages 351–360, June 1995.
6. T. Davies, C. Karlsson, H. Liu, C. Ding, and Z. Chen. High performance Linpack benchmark: A fault tolerant implementation without checkpointing. In ICS '11, pages 162–171, 2011.
7. Z. Chen. Algorithm-based recovery for iterative methods without checkpointing. In HPDC '11, pages 73–84, 2011.
8. Z. Chen and J. Dongarra. A scalable checkpoint encoding algorithm for diskless checkpointing. In HASE 2008, pages 71–79, December 2008.
Thank you for listening.
Any questions?