Introduction Related Work Method Performance Evaluation Conclusion Supplementary
Similarity and Locality Based Indexing for High Performance Data Deduplication
Wen Xia 1, Hong Jiang 2, Dan Feng 1, Yu Hua 1
1 Huazhong University of Science and Technology
2 University of Nebraska-Lincoln
Presented by: Fajar Purnama 152-D8713
August 24, 2016
Presented by: Fajar Purnama Kumamoto University, Computer Science and Electrical Engineering, HICC LAB
IEEE Transactions on Computers, Vol. 64, No. 4, April 2015
Outline
Introduction
Related Work
Method
Performance Evaluation
Conclusion
Supplementary
Data Deduplication
Definition: the process to eliminate duplicate data.
Purpose: to reduce storage usage / to save space.
Implementation: disk-to-disk backup, virtual machine storage, WAN replication, and primary storage.
Data Hashing
- Represents any amount of data with a fixed-size value.
- Very fast indexing, O(1), compared to manual indexing, O(n).
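As a minimal sketch of both points (an illustration only, not taken from the paper), the snippet below hashes inputs of very different sizes with SHA-1 from Python's standard `hashlib` module, then contrasts a hash-table lookup with a manual scan. The names `digest`, `index`, and `linear_find` are hypothetical.

```python
import hashlib

def digest(data: bytes) -> str:
    """Return the SHA-1 digest of data as a 40-character hex string."""
    return hashlib.sha1(data).hexdigest()

# Any amount of data maps to a value of the same fixed size (20 bytes).
small = digest(b"a")
large = digest(b"a" * 1_000_000)

# Hash-table index: expected O(1) lookup by digest.
records = [b"alpha", b"beta", b"gamma"]
index = {digest(r): pos for pos, r in enumerate(records)}

def linear_find(target: bytes) -> int:
    """Manual O(n) lookup: compare against every record in turn."""
    for pos, r in enumerate(records):
        if r == target:
            return pos
    return -1
```

Both lookups return the same position, but the dictionary reaches it without visiting the other records.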
Data Fingerprinting
- When a hash function generates a hash that is unique for practical purposes, the hash serves as a fingerprint.
- Block-based deduplication divides files into chunks and assigns a fingerprint to each chunk.
- Deduplication removes chunks that have the same fingerprint and replaces them with pointers.
source:https://upload.wikimedia.org/wikipedia/commons/0/09/Fingerprint.svg
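The three steps above can be sketched as follows. This is a simplified illustration under stated assumptions: chunks have a tiny fixed size (`CHUNK_SIZE`, chosen only for readability), the chunk store is an in-memory dictionary, and the function names `deduplicate` and `restore` are hypothetical.

```python
import hashlib

CHUNK_SIZE = 4  # assumption: tiny fixed chunk size, only to keep the demo readable

def deduplicate(data: bytes, store: dict) -> list:
    """Split data into fixed-size chunks and store each unique chunk once.

    Returns the file "recipe": a list of fingerprints (pointers) from which
    the original data can be rebuilt.
    """
    recipe = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        fp = hashlib.sha1(chunk).hexdigest()  # fingerprint of the chunk
        if fp not in store:                   # new chunk: keep its bytes
            store[fp] = chunk
        recipe.append(fp)                     # duplicates add a pointer only
    return recipe

def restore(recipe: list, store: dict) -> bytes:
    """Rebuild the original data by following the pointers."""
    return b"".join(store[fp] for fp in recipe)

store = {}
data = b"AAAABBBBAAAACCCCBBBB"
recipe = deduplicate(data, store)
```

Here five chunks are written but only three are stored, and `restore(recipe, store)` returns the original bytes.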
Problem
The number of fingerprints is too large:
- The fingerprint index does not fit into memory (limited performance).
- Lookups have to rely on disk, at a speed of 1-6 MB/sec (too slow).
- For example, a data set of 1 PB needs at least 2.5 TB of SHA-1 fingerprints.
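The 2.5 TB figure can be reproduced with a short calculation. The sketch assumes an average chunk size of 8 KB (not stated on this slide) and binary units; the 20-byte size of a SHA-1 digest is a property of SHA-1 itself.

```python
PB = 2 ** 50                   # 1 PB in bytes (binary units)
TB = 2 ** 40                   # 1 TB in bytes (binary units)
AVG_CHUNK_SIZE = 8 * 2 ** 10   # assumption: 8 KB average chunk size
SHA1_SIZE = 20                 # a SHA-1 digest is 160 bits = 20 bytes

num_chunks = PB // AVG_CHUNK_SIZE     # number of chunks in 1 PB
index_bytes = num_chunks * SHA1_SIZE  # one fingerprint per chunk
index_tb = index_bytes / TB           # size of the fingerprint set in TB
```

Under these assumptions `index_tb` evaluates to 2.5, before counting any per-entry metadata.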
Previous general approaches:
- Locality approach, for example ChunkStash.
- Similarity approach, for example Extreme Binning.
- But neither of them alone suffices for petabyte-scale data.
Objective
Problem Summary:
- Current techniques are too slow for today's large data.
- Real deployments demand a faster process.
- Imagine having only a one-day maintenance window for backup.
This work proposes Similarity-Locality (SiLo):
- Combine the similarity-based and locality-based approaches.
- To reduce Random Access Memory (RAM) usage.
- To increase throughput.
- To keep deduplication accuracy.
Locality
- Normally chunk lookups are done one by one, but some backup streams have high locality: across the first, second, and subsequent backups, there is a very high probability that chunks appear in the same order.
- Locality approach: exploit this locality. In the figure, a lookup of fingerprint 4a prefetches the fingerprints "4a, c7, 9e".
- However, this approach shows low speed on backup streams with weak locality.
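A rough sketch of locality-based prefetching, using the fingerprints from the example above. The container layout, the `disk_index` dictionary standing in for the on-disk index, and the class name `LocalityCache` are all assumptions made for illustration; they are not the data structures of any specific system.

```python
from collections import OrderedDict

# Hypothetical on-disk layout: fingerprints stored in backup-stream order,
# grouped into containers.
containers = {
    0: ["4a", "c7", "9e"],
    1: ["1b", "f3", "d0"],
}
# Stand-in for the slow on-disk index: fingerprint -> container id.
disk_index = {fp: cid for cid, fps in containers.items() for fp in fps}

class LocalityCache:
    def __init__(self, capacity: int = 1):
        self.capacity = capacity     # max containers held in RAM
        self.cached = OrderedDict()  # container id -> set of fingerprints
        self.disk_lookups = 0

    def is_duplicate(self, fp: str) -> bool:
        # Fast path: the fingerprint was already prefetched into RAM.
        for fps in self.cached.values():
            if fp in fps:
                return True
        # Slow path: one disk lookup, then prefetch the whole container.
        self.disk_lookups += 1
        cid = disk_index.get(fp)
        if cid is None:
            return False                     # new chunk
        self.cached[cid] = set(containers[cid])
        if len(self.cached) > self.capacity:
            self.cached.popitem(last=False)  # evict the oldest container
        return True

cache = LocalityCache()
results = [cache.is_duplicate(fp) for fp in ["4a", "c7", "9e"]]
```

With the stream in its original order, three lookups cost a single disk access. If the stream instead alternates between containers (weak locality), each lookup in this toy model falls back to the slow path.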
Similarity
- Instead of lookups per chunk or per group of neighboring chunks (locality), the lookups are per file.
- Similarity approach: in the figure, file V1 is similar to file V2. Each file is represented in the lookup by its minimal fingerprint, 2f, and duplicate chunks are then detected between the two files.
- Although it is much faster than the locality approach, it can sacrifice deduplication accuracy.
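A simplified sketch of file-level similarity detection, in the spirit of the approach described above. Files are given directly as lists of short hypothetical fingerprints, the representative is the minimal fingerprint, and the dictionary `similarity_index` and function `dedup_file` are illustrative names, not an actual API.

```python
# Representative fingerprint -> set of chunk fingerprints seen in that "bin".
similarity_index = {}

def dedup_file(fps: list) -> int:
    """Deduplicate one file against the bin chosen by its representative.

    Returns the number of chunks detected as duplicates.
    """
    rep = min(fps)                           # one index lookup per file
    bin_ = similarity_index.setdefault(rep, set())
    duplicates = sum(1 for fp in fps if fp in bin_)
    bin_.update(fps)                         # remember this file's chunks
    return duplicates

v1 = ["2f", "7c", "a1", "e4"]
v2 = ["2f", "7c", "b9", "e4"]   # similar to v1, same minimal fingerprint
v3 = ["1d", "7c", "a1", "e4"]   # shares 3 chunks with v1, different minimum

dups = [dedup_file(v) for v in (v1, v2, v3)]
```

In this toy run V2 finds its three duplicate chunks with a single lookup, while V3 lands in a different bin and its three duplicates go undetected, which is the accuracy loss mentioned above.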
Motivation
The similarity approach can cover for the locality approach, and vice versa.
Duplicates Eliminated vs. Similarity Degree
Segments and Blocks
Many small files produce many fingerprints, and large files reduce similarity detection, so it is better:
- For small files: combine them into segments to reduce the number of fingerprints.
- For large files: divide them into segments to increase similarity detection.
- Group neighboring segments, in order, into blocks (to preserve locality).
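The grouping rules above can be sketched as follows. Sizes are in arbitrary units, and the function names `make_segments` and `make_blocks` as well as the packing policy (fill each segment to a fixed size) are assumptions for illustration; the actual segmenting rules in SiLo may differ.

```python
def make_segments(files, segment_size):
    """Pack a stream of (name, size) files into segments of segment_size.

    Small files are grouped together; a large file is split across several
    segments. Each segment is a list of (name, offset, length) extents.
    """
    segments, current, room = [], [], segment_size
    for name, size in files:
        offset = 0
        while size > 0:
            take = min(size, room)
            current.append((name, offset, take))
            offset += take
            size -= take
            room -= take
            if room == 0:                    # segment full: start a new one
                segments.append(current)
                current, room = [], segment_size
    if current:
        segments.append(current)
    return segments

def make_blocks(segments, segments_per_block):
    """Group consecutive segments into blocks, preserving stream order."""
    return [segments[i:i + segments_per_block]
            for i in range(0, len(segments), segments_per_block)]

files = [("a.txt", 1), ("b.txt", 2), ("c.txt", 1), ("big.iso", 8)]
segments = make_segments(files, segment_size=4)
blocks = make_blocks(segments, segments_per_block=2)
```

The three small files end up in one segment, the large file is split into two, and the three segments form two blocks in stream order.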
Workflow
SiLo System Architecture
Experimental Setup
This experiment evaluates SiLo in comparison to ChunkStash, which implements the locality-based approach, and Extreme Binning, which implements the similarity-based approach. The hardware configuration includes a quad-core CPU running at 2.4 GHz, 4 GB of RAM, 2 gigabit network interface cards, and two 500 GB 7200 rpm hard disks. The data sets used in the experiment are as follows:
Segment and Block Size to Accuracy
Small segments = high similarity exposure, large blocks = high locality.
Segment and Block Size to RAM
Small segments = more fingerprints, large blocks = more unrelated segments.
Comparison of Duplicates Eliminated by 4 State-of-the-Art Methods
Locality 100%, SiLo ≈ 100%, Similarity ≈ 75%
Comparison of RAM Usage of 4 State-of-the-Art Methods
SiLo Low, Similarity Medium, Locality High
Comparison of Throughput of 4 State-of-the-Art Methods
SiLo Fast, Similarity Medium, Locality Slow
Conclusion
This work presented SiLo, a deduplication system that exploits both similarity and locality in backup streams to achieve:
High Deduplication Accuracy
Lower RAM Usage
Higher Throughput
Thank you
Any comments or questions?
List of Related Work
- Content-Defined Chunking (CDC) adopting Rabin fingerprints, by the Low-Bandwidth Network File System (LBFS).
- Many other chunking studies and incremental file synchronization.
- Other studies consist of fingerprint indexing.
- Sparse Indexing, DDFS, ChunkStash, and other locality approaches.
- Extreme Binning, a similarity approach.
- Load distribution, multithreading, pipelining, parallel computation, etc.
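To make the first item concrete, here is a rough sketch of content-defined chunking. Note the substitution: instead of Rabin fingerprints it uses a simpler Gear-style rolling hash, and the table seed, size limits, and function name `cdc_chunks` are assumptions for illustration. The idea is the same: a chunk boundary is declared wherever the hash of the recent bytes matches a bit pattern, so boundaries follow the content rather than fixed offsets.

```python
import random

_table_rng = random.Random(0)   # fixed seed: reproducible lookup table
GEAR = [_table_rng.getrandbits(64) for _ in range(256)]
MASK64 = (1 << 64) - 1

def cdc_chunks(data: bytes, min_size=16, avg_bits=6, max_size=256):
    """Split data at positions where the top avg_bits of the hash are zero.

    Chunks are at least min_size bytes (except possibly the last one)
    and at most max_size bytes.
    """
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Gear-style update: older bytes are shifted out after 64 steps.
        h = ((h << 1) + GEAR[byte]) & MASK64
        length = i - start + 1
        at_boundary = length >= min_size and (h >> (64 - avg_bits)) == 0
        if at_boundary or length >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])          # trailing partial chunk
    return chunks

_data_rng = random.Random(1)
data = bytes(_data_rng.getrandbits(8) for _ in range(4096))
chunks = cdc_chunks(data)
```

Concatenating `chunks` gives back `data`, and every chunk respects the size limits.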
Similarity and Locality
The left figure shows the distribution of similarity degree, and the right figure shows that not all duplicate data is eliminated by Extreme Binning (a similarity approach).
Deduplication Server Data Structure
- The Similarity Hash Table (SHTable) provides similarity detection for input segments, and the Locality Hash Table (LHTable) serves to quickly index and filter out duplicate chunks. The write buffer and read cache contain the recently accessed blocks to exploit the backup stream locality.
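A toy model of how these structures might interact is sketched below. It is a strong simplification and not the paper's implementation: the class name `SiLoIndex`, the use of the minimal fingerprint as the segment representative, the in-memory dictionary standing in for on-disk blocks, and the omission of the write buffer are all assumptions.

```python
from collections import OrderedDict

class SiLoIndex:
    """Toy model: SHTable in RAM, blocks on "disk", LRU read cache of blocks."""

    def __init__(self, cache_blocks: int = 2):
        self.sh_table = {}               # representative fp -> block id
        self.disk_blocks = {}            # block id -> set of chunk fps
        self.read_cache = OrderedDict()  # block id -> LHTable (set of fps)
        self.cache_blocks = cache_blocks

    def _load_block(self, block_id):
        """Bring a block's fingerprints into the read cache (LRU eviction)."""
        if block_id in self.read_cache:
            self.read_cache.move_to_end(block_id)
            return
        self.read_cache[block_id] = set(self.disk_blocks[block_id])
        if len(self.read_cache) > self.cache_blocks:
            self.read_cache.popitem(last=False)

    def dedup_segment(self, fps, block_id):
        """Return the fingerprints of this segment detected as duplicates."""
        rep = min(fps)                            # similarity detection
        if rep in self.sh_table:
            self._load_block(self.sh_table[rep])  # exploit locality
        dups = [fp for fp in fps
                if any(fp in lh for lh in self.read_cache.values())]
        # Record the segment so that later backups can find it.
        self.sh_table.setdefault(rep, block_id)
        self.disk_blocks.setdefault(block_id, set()).update(fps)
        return dups

idx = SiLoIndex()
# First backup: two neighboring segments stored in block 0.
first = idx.dedup_segment(["2f", "7c", "a1"], block_id=0)
idx.dedup_segment(["3b", "9e", "c7"], block_id=0)
# Second backup: the first segment is still similar, the second is not.
second_s1 = idx.dedup_segment(["2f", "7c", "b4"], block_id=1)
second_s2 = idx.dedup_segment(["0a", "9e", "c7"], block_id=1)
```

In the second backup, the similarity hit on the first segment loads block 0 into the read cache, and the second segment then finds its duplicates there even though its own representative changed.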
Similarity Algorithm Data Structure
Small files are grouped into segments to minimize fingerprints, while large files are divided into segments for more similarity exposure.
Locality-Based Stateless Routing Algorithm
Flowchart of SiLo Deduplication
Throughput due to blocks in read cache
The deduplication throughput increases with the number of blocks in the read cache, but this results in more RAM overhead. Beyond sixteen blocks the throughput increases only slowly, and it even decreases for two backup sets.
Load Distribution of This System