UKSM: Swift Memory Deduplication via Hierarchical and Adaptive
Memory Region Distilling
Nai Xia*, Chen Tian*, Yan Luo+, Hang Liu+, Xiaoliang Wang*
*: Nanjing University  +: University of Massachusetts Lowell
Feb/15/2018
Background
• What is Kernel Samepage Merging (KSM)?
[Figure: page 1 and page 2 are checked for identical content and merged if so; when a later update makes them differ, they are unmerged.]
• Goal: Reduce memory consumption when duplication exists.
• Effectiveness: Real-world applications contain a large amount (~86%) of duplicated memory, Chang et al. [ISPA 2011].
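The merge/unmerge cycle can be sketched as a toy user-space model. This is not the kernel's implementation: `ToyMemory`, `merge_identical` and `write` are hypothetical names, pages are plain `bytes` objects, and copy-on-write is simulated by allocating a new frame.

```python
class ToyMemory:
    """Toy model: virtual pages map to frames; identical frames can be shared."""

    def __init__(self, pages):
        # Start with one private frame per virtual page.
        self.frames = {i: bytes(p) for i, p in enumerate(pages)}
        self.mapping = {i: i for i in range(len(pages))}  # virtual page -> frame id
        self.next_frame = len(pages)

    def frames_in_use(self):
        return len(set(self.mapping.values()))

    def merge_identical(self):
        """Merge: point all pages with identical content at a single frame."""
        first_seen = {}  # content -> frame id that is kept
        for vpage in sorted(self.mapping):
            content = self.frames[self.mapping[vpage]]
            self.mapping[vpage] = first_seen.setdefault(content, self.mapping[vpage])
        # Drop frames that no page references any more.
        used = set(self.mapping.values())
        self.frames = {f: c for f, c in self.frames.items() if f in used}

    def write(self, vpage, data):
        """Unmerge: a write to a shared frame first gets a private copy (COW)."""
        frame = self.mapping[vpage]
        shared = sum(1 for f in self.mapping.values() if f == frame) > 1
        if shared:
            frame = self.next_frame
            self.next_frame += 1
            self.mapping[vpage] = frame
        self.frames[frame] = bytes(data)
```

For example, three pages of which two are identical occupy two frames after `merge_identical()`, and three again once one of the merged pages is written.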
Unique Challenges
• Storage deduplication deals with relatively static content and is concerned only with the duplication ratio.
  • Sparse Indexing [FAST 2009], CAFTL [FAST 2011], El-Shimi et al. [ATC 2012], Cao et al. [Just now]
• Responsiveness:
  • Remove duplications before they exhaust the memory.
• Dynamic nature:
  • Duplication status may change over time.
Accelerate the deduplication of memory, which is dynamic in nature!
Outline
• Observation (Opportunity)
• Overview
• Hierarchical Region Distilling
• Adaptive Partial Hashing
• Evaluation
• Conclusion
Observation I: Pages within the Same Region Present Similar Patterns
[Figure: number of duplicated pages across the KVM memory space and across the Docker memory space.]
• Test: Apache web server and MySQL database serving a WordPress website on Ubuntu 16.04 (kernel version 4.4).
Duplicated pages concentrate by memory region.
*Please refer to our paper for other pattern analyses.
Observation II: Hashing Needs to Be Adaptive
• Various applications need different hashing strengths to differentiate:
  • Image applications contain pages with highly similar contents.
  • Crypto applications contain diverse contents.
We should adjust hashing strength accordingly.
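A minimal sketch of why strength matters, assuming that "hashing strength" means the number of bytes of a page fed to the hash. `partial_hash` and `min_strength_to_distinguish` are hypothetical names, and CRC32 from Python's `zlib` stands in for the hash function UKSM actually uses.

```python
import zlib

PAGE_SIZE = 4096

def partial_hash(page: bytes, strength: int) -> int:
    """Hash only the first `strength` bytes of the page (stand-in hash: CRC32)."""
    return zlib.crc32(page[:strength])

def min_strength_to_distinguish(a: bytes, b: bytes, start: int = 64):
    """Double the strength until the two partial hashes differ.

    Returns None if the hashes still agree at full-page strength.
    """
    strength = start
    while strength <= PAGE_SIZE:
        if partial_hash(a, strength) != partial_hash(b, strength):
            return strength
        strength *= 2
    return None

# Two highly similar pages: they differ in a single byte near the end.
similar_a = bytes(PAGE_SIZE)
similar_b = bytearray(PAGE_SIZE)
similar_b[3000] = 1
similar_b = bytes(similar_b)

# Two diverse pages: they already differ in the first byte.
diverse_a = bytes(PAGE_SIZE)
diverse_b = b"\x01" + bytes(PAGE_SIZE - 1)
```

In this toy setting the similar pair is told apart only when the whole page is hashed, while the diverse pair is told apart at the smallest strength, which is the motivation for adapting the strength to the workload.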
Overview
• Assuming we have 9 memory regions, i.e., R0 – R8.
[Figure: regions R0 to R8, each shaded by similarity from low to high.]
• Hierarchical memory region clustering.
[Figure: regions R0 to R8 clustered into Level 1, Level 2, …, Level N by similarity (low to high).]
• Hierarchical region distilling.
[Figure: the level hierarchy (Level 1, Level 2, …, Level N) with regions R3 and R8 highlighted.]
[Figure: the level hierarchy at round n and at round n + 1, with regions R3 and R8 highlighted.]
• Hierarchical region distilling + Adaptive partial hashing.
[Figure: the level hierarchy at round n and at round n + 1, annotated with the three takeaways.]
• Takeaway 1: Promote/demote regions.
• Takeaway 2: Sampling offset shift.
• Takeaway 3: Hash strength adjustment.
Hierarchical Region Distilling
• Memory region characterization – Signatures:
  • Vcow: promote regions whose COW-broken ratios are lower than this.
  • Vdup: promote regions whose duplication ratios are higher than this.
  • Vlife: regions living longer than this threshold can be effectively scanned.
• Default empirical values:
  • Vcow = 10%, Vdup = 20% and Vlife = 100ms.
Various commercial products adopt UKSM and observe different sweet spots.
* COW: copy on write
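A minimal sketch of how the three thresholds could drive a promote/demote decision. Only the threshold meanings and default values come from the slide. The `Region` fields, the `distill` function, the rule that both promote conditions must hold, the one-level step, and demotion when the conditions fail are all assumptions made for illustration.

```python
from dataclasses import dataclass

V_COW = 0.10      # promote if the COW-broken ratio is lower than this
V_DUP = 0.20      # promote if the duplication ratio is higher than this
V_LIFE_MS = 100   # regions living longer than this can be effectively scanned

@dataclass
class Region:
    name: str
    cow_broken_ratio: float
    dup_ratio: float
    age_ms: float
    level: int = 1

def distill(region: Region, max_level: int) -> int:
    """Move a region one level up or down and return its new level."""
    if region.age_ms <= V_LIFE_MS:
        # Assumption: a short-lived region is left where it is.
        return region.level
    if region.cow_broken_ratio < V_COW and region.dup_ratio > V_DUP:
        region.level = min(region.level + 1, max_level)  # promote
    else:
        region.level = max(region.level - 1, 1)          # demote (assumed rule)
    return region.level
```

With these defaults, a long-lived region with 2% COW-broken pages and 50% duplicates moves up one level, while a region with the same duplication ratio but 40% COW-broken pages moves down.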
[Figure: workflow for a region Ri. Labels: Sample & Hash; Tree_merge; Tree_unmerge; move page from unmerged to merged tree; adjust Vdup; write on merged tree, adjust Vcow.]
*: We adopt the Linux KSM red-black tree design to track 'merged' and 'unmerged' pages.
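The two-tree bookkeeping can be sketched as follows. This is a simplification under stated assumptions: Python dictionaries keyed by a hash replace the red-black trees, CRC32 replaces the real hash, hash collisions are ignored (a real implementation would compare page contents before merging), and `DedupIndex` and `scan` are hypothetical names.

```python
import zlib

class DedupIndex:
    """Toy stand-in for the 'merged' and 'unmerged' trees."""

    def __init__(self):
        self.merged = {}    # hash -> ids of pages sharing one copy
        self.unmerged = {}  # hash -> id of a page seen only once so far

    def scan(self, page_id, content: bytes) -> bool:
        """Look a sampled page up in both structures; return True if it was merged."""
        h = zlib.crc32(content)
        if h in self.merged:
            self.merged[h].append(page_id)
            return True
        if h in self.unmerged:
            # Second sighting: move the earlier page from unmerged to merged.
            other = self.unmerged.pop(h)
            self.merged[h] = [other, page_id]
            return True
        self.unmerged[h] = page_id
        return False
```

Scanning three identical pages and one unique page leaves the three in `merged` under one key and the unique page in `unmerged`.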
Adaptive Partial Hashing
[Figure: hash strength adjustment. Labels: Probe state; Half hashing strength; Strength = Strength ± Delta; Adjust hash strength.]
We optimize SuperFastHash with the following key contributions:
• Minimizing collisions – Optimizing avalanche for SuperFastHash [Hsieh 2004].
• Progressive hashing – Support additivity while adjusting hash strengths.
[Figure: progressive hashing of a sampled page. The 1st half is hashed to H1 (round n), the 2nd half is hashed to H2 (round n+1), and the two are combined into H1,2.]
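A minimal sketch of the additivity idea: the hash of the first half from round n is extended with the second half in round n+1 instead of rehashing from the start. FNV-1a is used here only because its running state makes this easy to show; it is not SuperFastHash and not UKSM's actual hash, and `fnv1a_extend` is a hypothetical helper.

```python
FNV_OFFSET = 0x811C9DC5  # 32-bit FNV offset basis
FNV_PRIME = 0x01000193   # 32-bit FNV prime

def fnv1a_extend(state: int, data: bytes) -> int:
    """Continue a 32-bit FNV-1a hash from `state` over `data`."""
    for byte in data:
        state ^= byte
        state = (state * FNV_PRIME) & 0xFFFFFFFF
    return state

page = bytes(range(256)) * 16          # a 4096-byte sample page
half = len(page) // 2

h1 = fnv1a_extend(FNV_OFFSET, page[:half])   # round n: first half only
h12 = fnv1a_extend(h1, page[half:])          # round n+1: reuse h1, add second half
full = fnv1a_extend(FNV_OFFSET, page)        # hashing the whole page in one pass
```

Because the state is carried forward, `h12` equals `full`, so raising the strength costs only the extra bytes.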
Evaluation
• 6,000 lines of code in the Linux kernel.
• OS: Vanilla kernel 4.4.
• Hardware:
  • Intel® Core™ i7 920 CPU with four 2.67 GHz cores.
  • 12 GB memory.
• For fair comparison:
  • KSM is upgraded to SuperFastHash.
Evaluation Goals
• How efficient is UKSM on different workloads?
• How flexible is UKSM regarding customization?
• What's the responsiveness of UKSM vs KSM?
• How does adaptive partial hashing perform compared to a non-adaptive algorithm?
• What's the performance penalty of UKSM?
Parameter Analysis
[Figure: CPU utilization (%) over time (seconds) for the Full and Quiet levels, and memory saving (MB) over time (seconds) for the Full, Medium, Low and Quiet levels.]
• UKSM allows four levels of scanning strength:
  • Level Full allows up to 95% CPU consumption and can scan the entire memory in 2 seconds.
  • Each lower level halves the CPU budget and potentially doubles the scan time.
Setting: Booting 25 VMs, each with 1 VCPU, 1GB memory.
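The halving rule can be worked out per level. The order Full, Medium, Low, Quiet is taken from the figure legend and is an assumption, as is `scan_budget`, a hypothetical helper. The scan times are what the rule implies if the doubling is fully realized, which the slide only states as "potentially".

```python
LEVELS = ["Full", "Medium", "Low", "Quiet"]  # assumed order, strongest first

def scan_budget(full_cpu_pct: float = 95.0, full_scan_s: float = 2.0):
    """Return {level: (CPU cap in %, implied full-memory scan time in seconds)}."""
    budget = {}
    for i, level in enumerate(LEVELS):
        # Each step down halves the CPU cap and doubles the scan time.
        budget[level] = (full_cpu_pct / 2**i, full_scan_s * 2**i)
    return budget

for level, (cpu, seconds) in scan_budget().items():
    print(f"{level}: up to {cpu}% CPU, about {seconds} s per full scan")
```

Under these assumptions the caps are 95%, 47.5%, 23.75% and 11.875% CPU, with implied scan times of 2, 4, 8 and 16 seconds.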
Responsiveness Analysis
[Figure: memory utilization (MB) and CPU usage (% of one core) over time (seconds) for UKSM and for KSM at 100, 1000 and 2000 pages. Annotations: catching up time; 611, 95, 615.]
UKSM is 8.3×, 12.6× and 11.5× more efficient than KSM at scan speeds of 100, 1000 and 2000 pages, respectively.
Efficiency = memory saving / CPU consumption
Setting: Two processes, each with 4GB memory. One contains identical pages while the other random ones.
Related Work
• Content-based approach:
  • VMware ESX server, IBM active memory deduplication, Red Hat ksmtuned.
  • Majority of them treat every page equally.
• I/O hint based approach:
  • KSM++ [Resolve 2012], XLH [Usenix ATC 2013], CMD [VEE 2014].
  • Cannot track anonymous memory space (no I/O) or require hardware change.
• SmartMD [Usenix ATC '17]:
  • Consider various page sizes; we are orthogonal.
Conclusion
• Memory deduplication faces unique challenges. Our techniques:
  • Hierarchical region distilling.
  • Adaptive partial hashing.
• UKSM saves 12.6x and 5x more memory than KSM on static and dynamic workloads, respectively, in the same time envelope.
• UKSM is an in-production system: https://github.com/dolohow/uksm.
  • It has ~110 (watch, star and fork) after less than one year on GitHub.
Thank You & Questions?
We would like to thank our shepherd Dr. Hong Jiang and anonymous reviewers!