
Hecatonchire kvm forum_2012_benoit_hudzia

Description:
Modularizing and aggregating physical resources in a datacenter depends not only on low-latency networking, but also on software techniques to deliver such capabilities. In this session we will present practical features and results of our work and discuss implementation details. Among these features are high-performance, transparent, and partially fault-tolerant memory aggregation, and reduced live-migration downtime using a post-copy implementation. We will present methods of transparently integrating with the MMU at the OS level without impacting a running VM, and then introduce the Hecatonchire project ( http://www.hecatonchire.com ), which aims at the disaggregation of datacenter resources to enable true utility computing.
Transcript
Page 1: Hecatonchire kvm forum_2012_benoit_hudzia

Benoit Hudzia; Sr. Researcher; SAP Research Belfast. With the contribution of Aidan Shribman, Roei Tell, Steve Walsh, Peter Izsak

November 2012

Memory Aggregation for KVM: Hecatonchire Project

Page 2: Hecatonchire kvm forum_2012_benoit_hudzia

© 2012 SAP AG. All rights reserved. 2

Agenda

• Memory as a Utility
• Raw Performance
• First Use Case: Post-Copy
• Second Use Case: Memory Aggregation
• Lego Cloud
• Summary

Page 3: Hecatonchire kvm forum_2012_benoit_hudzia

Memory as a Utility
How We Liquefied Memory Resources

Page 4: Hecatonchire kvm forum_2012_benoit_hudzia


The Idea: Turning memory into a distributed memory service

Breaks memory from the bounds of the physical box

Transparent deployment with performance at scale and reliability

Page 5: Hecatonchire kvm forum_2012_benoit_hudzia


High-Level Principle

(Diagram: a Memory Demanding Process on the Memory Demander node maps parts of its Virtual Memory Address Space over the network to Memory Sponsor A and Memory Sponsor B.)

Page 6: Hecatonchire kvm forum_2012_benoit_hudzia


How Does It Work (Simplified Version)

(Diagram: Physical Node A and Physical Node B connected by a network fabric, each with an MMU (+ TLB), page table, coherency engine, and RDMA engine.)

On the demanding node, a virtual address misses in the MMU and resolves to a remote PTE (a custom swap entry) rather than a physical address. The coherency engine sends a page request through the RDMA engine to the sponsoring node, whose coherency engine extracts the page, invalidates its own PTE and MMU entry, prepares the page for RDMA transfer, and returns the page response. The demanding node then writes the PTE with the new physical address and updates its MMU.
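The fault-resolution flow above can be sketched as a userspace simulation (this is purely illustrative, not the actual kernel code; the class and function names are our own):

```python
# Illustrative sketch of the remote page-fault flow: a missing PTE holds a
# "remote" marker (standing in for the custom swap entry); the fault handler
# requests the page from the sponsor node and installs a present PTE.

REMOTE = object()  # stands in for the custom swap entry in the real PTE

class Sponsor:
    """Holds the backing copies of pages for a demander."""
    def __init__(self, pages):
        self.pages = dict(pages)  # vpn -> page contents

    def handle_page_request(self, vpn):
        # Extract the page, invalidating the local copy ("page response").
        return self.pages.pop(vpn)

class Demander:
    def __init__(self, sponsor, remote_vpns):
        self.sponsor = sponsor
        self.page_table = {vpn: REMOTE for vpn in remote_vpns}

    def access(self, vpn):
        entry = self.page_table[vpn]
        if entry is REMOTE:                                  # MMU miss, remote PTE
            page = self.sponsor.handle_page_request(vpn)     # page request/response
            self.page_table[vpn] = page                      # PTE write + MMU update
            entry = page
        return entry

sponsor = Sponsor({0: b"alpha", 1: b"beta"})
vm = Demander(sponsor, remote_vpns=[0, 1])
assert vm.access(0) == b"alpha"   # first access: network-bound page fault
assert vm.access(0) == b"alpha"   # second access: resolved locally
assert 0 not in sponsor.pages     # sponsor invalidated its copy
```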

Page 7: Hecatonchire kvm forum_2012_benoit_hudzia


Reducing Effects of Network Bound Page Faults

Full Linux MMU integration (reducing the system-wide effects/cost of page faults): enables transparent page faults, pausing only the requesting thread.

Low-latency RDMA engine and page-transfer protocol (reducing the latency/cost of page faults): implemented fully in kernel mode using OFED verbs; can use the fastest RDMA hardware available (IB, iWARP, RoCE); tested with software RDMA solutions (Soft-iWARP and SoftRoCE), so NO SPECIAL HW is REQUIRED.

Demand pre-paging (pre-fetching) mechanism (reducing the number of page faults): currently a simple fetch of the pages surrounding the one on which the fault occurred.
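The simple surrounding-pages policy can be expressed as follows (an illustrative sketch; the window radius and names are our own, not taken from the implementation):

```python
def prefetch_window(faulting_vpn, radius, resident):
    """On a fault, request the faulting page plus its neighbours,
    skipping pages that are already resident locally."""
    return [v for v in range(faulting_vpn - radius, faulting_vpn + radius + 1)
            if v >= 0 and v not in resident]

# Fault on page 5 with a radius of 2, page 4 already local:
assert prefetch_window(5, 2, resident={4}) == [3, 5, 6, 7]
```

Each prefetched page costs one extra transfer but can eliminate a later network-bound fault, which is why it reduces the *number* of faults rather than their individual latency.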

Page 8: Hecatonchire kvm forum_2012_benoit_hudzia


Transparent Solution

Minimal modification of the kernel (simple, minimal intrusion)
• 4 hooks in the static kernel; virtually no overhead during normal operation

Paging and memory cgroup support (transparent tiered memory)
• Pages are pushed back to their sponsor when paging occurs; local pages can be swapped out normally

KVM-specific support (virtualization friendly)
• Shadow page tables (EPT / NPT)
• KVM asynchronous page fault

Page 9: Hecatonchire kvm forum_2012_benoit_hudzia


Transparent Solution (cont.)

Scalable Active-Active mode (distributed shared memory)
• Shared-nothing with distributed index
• Write-invalidate with distributed index (end of this year)

LibHeca library (ease of integration)
• Simple API bootstrapping and synching all participating nodes

We also support:
• KSM
• Huge pages
• Discontinuous shared memory regions
• Multiple DSM / VM groups on the same physical node
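A bootstrapping flow like the one LibHeca provides could look roughly like this (the `HecaNode` class and method names here are invented for illustration and are NOT the real LibHeca API):

```python
# Hypothetical sketch of "bootstrapping and synching all participating nodes":
# every node learns about its peers and exchanges region metadata before the
# workload starts. Names are illustrative, not the actual library interface.

class HecaNode:
    def __init__(self, node_id, regions):
        self.node_id = node_id
        self.regions = regions        # memory regions this node sponsors/demands
        self.synced_with = set()

    def connect_and_sync(self, peers):
        # In the real library this step would establish RDMA connections and
        # exchange shared-memory-region descriptors with each peer.
        for p in peers:
            self.synced_with.add(p.node_id)
            p.synced_with.add(self.node_id)

demander = HecaNode(0, regions=["r0"])
sponsors = [HecaNode(1, ["r0"]), HecaNode(2, ["r0"])]
demander.connect_and_sync(sponsors)
assert demander.synced_with == {1, 2}
```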

Page 10: Hecatonchire kvm forum_2012_benoit_hudzia

Raw Performance
How fast can we move memory around?

Page 11: Hecatonchire kvm forum_2012_benoit_hudzia


Raw Bandwidth Usage
HW: 4-core i5-2500 CPU @ 3.30 GHz; SoftIwarp 10 GbE; iWARP Chelsio T422 10 GbE; IB ConnectX2 QDR 40 Gbps

(Chart: total Gbit/s for 1 to 7 threads, each performing a sequential, binary-split, or random walk over 1 GB of shared RAM, on SoftIwarp (SIW), iWARP (IW), and InfiniBand (IB); y-axis 0 to 25 Gb/s.)

• Maxing out bandwidth
• Not enough cores to saturate (?)
• No degradation under high load
• Software RDMA has significant overhead

Page 12: Hecatonchire kvm forum_2012_benoit_hudzia


Hard Page Fault Resolution Performance

                       Resolution time   One-way time over the   Resolution time
                       average (μs)      wire, average (μs)      best (μs)
SoftIwarp (10 GbE)     355               150                     74
iWARP (10 GbE)         48                4-6                     28
InfiniBand (40 Gbps)   29                2-4                     16

Page 13: Hecatonchire kvm forum_2012_benoit_hudzia


Average Compounded Page Fault Resolution Time (with Prefetch)

(Chart: average resolution time in microseconds, 1000 to 6000 μs, for 1 to 8 threads; series: iWARP 10 GbE and IB 40 Gbps, each for sequential, binary-split, and random walks, plus average lines Avg IW and Avg IB.)

Page 14: Hecatonchire kvm forum_2012_benoit_hudzia

Post-Copy Live Migration
First Use Case of the Technology

Page 15: Hecatonchire kvm forum_2012_benoit_hudzia


Post Copy – Pre Copy – Hybrid Comparison

(Chart: downtime in seconds, 0 to 4 s, vs. VM RAM of 1, 4, 10, and 14 GB, for Pre-copy (forced after 60 s), Post-copy, Hybrid with a 3-second delay, and Hybrid with a 5-second delay.)

Host: Intel Core i5-2500 CPU @ 3.30 GHz, 4 cores, 16 GB RAM
Network: 10 GbE; NIC: Chelsio T422-CR iWARP
Workload: App Mem Bench (~80% of the VM RAM); dirtying rate: 1 GB/s (256k pages dirtied per second)
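A back-of-the-envelope model (our own illustration, not measured data) shows why pre-copy downtime explodes when the dirtying rate approaches link bandwidth, while post-copy only stops the VM to ship its small execution state:

```python
# Toy pre-copy model: each round resends the pages dirtied during the previous
# round; downtime is the final stop-and-copy of whatever remains dirty.

def precopy_downtime(ram_gb, dirty_gbps, bw_gbps, max_rounds=30):
    remaining = ram_gb
    for _ in range(max_rounds):
        if remaining <= 0.01:                     # dirty set small enough: stop
            break
        t = remaining / bw_gbps                   # time to send current dirty set
        remaining = min(ram_gb, dirty_gbps * t)   # pages re-dirtied meanwhile
    return remaining / bw_gbps                    # final stop-and-copy = downtime

# Low dirtying rate: pre-copy converges to a tiny downtime...
assert precopy_downtime(4, dirty_gbps=0.1, bw_gbps=10) < 0.05
# ...but once dirtying outpaces the link, it never converges and the forced
# stop-and-copy must ship (nearly) all of RAM.
assert precopy_downtime(4, dirty_gbps=20, bw_gbps=10) >= 4 / 10
```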

Page 16: Hecatonchire kvm forum_2012_benoit_hudzia


Post Copy vs Pre copy under load

(Chart: performance degradation in percent, 0 to 100, over time in seconds, for post-copy and pre-copy at dirtying rates of 1, 5, 25, 50, and 100 GB/s.)

Virtual machine: 1 GB RAM, 1 vCPU; workload: App Mem Bench
Hardware: Intel Core i5-2500 CPU @ 3.30 GHz, 4 cores, 16 GB RAM
Network: 10 GbE switch; NIC: Chelsio T422-CR (iWARP)

Page 17: Hecatonchire kvm forum_2012_benoit_hudzia


Post Copy Migration of HANA DB

                                    Baseline   Pre-Copy           Post-Copy
Downtime                            N/A        7.47 s             675 ms
Benchmark performance degradation   0%         Benchmark failed   5%

Virtual machine: 10 GB RAM, 4 vCPU; application: HANA (in-memory database); workload: SAP-H (TPC-H variant)
Hardware: Intel Core i5-2500 CPU @ 3.30 GHz, 4 cores, 16 GB RAM; fabric: 10 GbE switch; NIC: Chelsio T422-CR iWARP

Page 18: Hecatonchire kvm forum_2012_benoit_hudzia

Memory Aggregation
Second Use Case: Scaling Out Memory

Page 19: Hecatonchire kvm forum_2012_benoit_hudzia


Scaling Out Virtual Machine Memory

Business problem
• Heavy swap usage slows execution time for data-intensive applications

Hecatonchire / RRAIM solution
• Applications use memory mobility as a high-performance swap resource
• Completely transparent: no integration required
• Act on results sooner
• High reliability built in
• Enables iteration or additional data to improve results

(Diagram: the application VM swaps to a memory cloud of RAM, backed by compression, deduplication, n-tier storage, and HR-HA.)

Page 20: Hecatonchire kvm forum_2012_benoit_hudzia


Redundant Array of Inexpensive RAM: RRAIM

1. A memory region is backed by two remote nodes. Remote page faults and swap-outs are initiated simultaneously to all relevant nodes.

2. Failure of a node has no immediate effect on the computation node.

3. When a new remote node enters the cluster, it synchronizes with the computation node and the mirror node.
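The three points above can be sketched as a small simulation (illustrative only; sponsors are modelled as plain dictionaries, not real remote nodes):

```python
# Sketch of the RRAIM idea: every swap-out goes to two sponsor nodes, so
# losing one sponsor has no immediate effect on the computation node, and a
# replacement node resynchronizes from a surviving mirror.

class RRAIM:
    def __init__(self, sponsors):
        self.sponsors = sponsors               # list of dicts: vpn -> page

    def swap_out(self, vpn, page):
        for s in self.sponsors:                # initiated to all relevant nodes
            s[vpn] = page

    def fault_in(self, vpn):
        for s in self.sponsors:                # any surviving mirror can answer
            if vpn in s:
                return s[vpn]
        raise KeyError(vpn)

    def add_sponsor(self, new):
        new.update(self.sponsors[0])           # sync from a surviving mirror
        self.sponsors.append(new)

a, b = {}, {}
r = RRAIM([a, b])
r.swap_out(7, b"page7")
del r.sponsors[1]                  # sponsor B fails
assert r.fault_in(7) == b"page7"   # still resolvable from A: no downtime
c = {}
r.add_sponsor(c)                   # replacement node joins and syncs
assert c[7] == b"page7"
```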

Page 21: Hecatonchire kvm forum_2012_benoit_hudzia


Quicksort Benchmark with Memory Constraint

Memory ratio (constrained using cgroup)   DSM overhead   RRAIM overhead
3:4                                       2.08%          5.21%
1:2                                       2.62%          6.15%
1:3                                       3.35%          9.21%
1:4                                       4.15%          8.68%
1:5                                       4.71%          9.28%

(Chart: the same DSM and RRAIM overheads, 0.00% to 10.00%, plotted across the memory ratios for 512 MB, 1 GB, and 2 GB quicksort datasets.)

Page 22: Hecatonchire kvm forum_2012_benoit_hudzia


Scaling out HANA

Virtual machine: 18 GB RAM, 4 vCPU; application: HANA (in-memory database); workload: SAP-H (TPC-H variant)

Memory ratio   DSM overhead   RRAIM overhead
1:2            1%             0.887%
1:3            1.6%           1.548%
2:1:1          0.1%           -
1:1:1          1.5%           -

Hardware:
• Memory host: Intel Core i5-2500 CPU @ 3.30 GHz, 4 cores, 16 GB RAM
• Compute host: Intel Xeon X5650 @ 2.56 GHz, 8 cores, 96 GB RAM
• Fabric: InfiniBand QDR 40 Gbps switch + Mellanox ConnectX2

(Chart: the same DSM and RRAIM overheads, 0.00% to 1.60%, plotted across the memory ratios.)

Page 23: Hecatonchire kvm forum_2012_benoit_hudzia


Transitioning to a Memory Cloud (Ongoing Work)

(Diagram: many physical nodes hosting a variety of VMs form a memory cloud over RRAIM, coordinated by Memory Cloud Management Services (OpenStack). Roles: Compute VM (memory demander), Combination VM (memory sponsor and demander), Memory VM (memory sponsor).)

PoC: Q1-Q2 2013

Page 24: Hecatonchire kvm forum_2012_benoit_hudzia

Lego Cloud
Going Beyond Memory

Page 25: Hecatonchire kvm forum_2012_benoit_hudzia


Virtual Distributed Shared Memory System (Compute Cloud)

Compute aggregation
• Idea: a virtual machine's compute and memory span multiple physical nodes

Challenges
• Coherency protocol
• Granularity (false sharing)

Hecatonchire value proposition
• Optimal price/performance by using commodity hardware
• Operational flexibility: node downtime without downing the cluster
• Seamless deployment within existing clouds

(Diagram, future work: guest VMs spanning Server #1 through Server #n, each server contributing CPUs, memory, and I/O, connected by fast RDMA communication.)

Page 26: Hecatonchire kvm forum_2012_benoit_hudzia


Disaggregation of Datacentre (and Cloud) Resources (Our Aim)

Breaking out the functions of memory, compute, and I/O, and optimizing the delivery of each.

Disaggregation provides three primary benefits:

• Better performance: each function is isolated, limiting the scope of what each box must do, and we can leverage dedicated hardware and software, which increases performance.
• Superior scalability: functions are isolated from each other, so one function can be altered without impacting the others.
• Improved economics: cost-effective deployment of resources through improved provisioning and consolidation of disparate equipment.

Page 27: Hecatonchire kvm forum_2012_benoit_hudzia

Summary

Page 28: Hecatonchire kvm forum_2012_benoit_hudzia


Hecatonchire Project

• Features:
  • Distributed shared memory
  • Memory extension via memory servers
  • HA features
• Future: distributed workload execution

• Uses standard cloud interfaces
• Optimises cloud infrastructure
• Supports COTS hardware

Page 29: Hecatonchire kvm forum_2012_benoit_hudzia


Key takeaways

• The Hecatonchire project aims at disaggregating datacentre resources

• The Hecatonchire project currently delivers memory-cloud capabilities

• Enhancements to be released as open source under the GPLv2 and LGPL licenses by the end of November 2012

• Hosted on GitHub; check www.hecatonchire.com

• Developed by the SAP Research Technology Infrastructure (TI) programme

Page 30: Hecatonchire kvm forum_2012_benoit_hudzia

Thank you

Benoit Hudzia; Sr. Researcher; SAP Research Belfast; [email protected]

Page 31: Hecatonchire kvm forum_2012_benoit_hudzia

Backup Slide

Page 32: Hecatonchire kvm forum_2012_benoit_hudzia


Instant Flash Cloning On-Demand

Business problem
• Burst load / service usage that cannot be satisfied in time

Existing solutions
• Vendors: Amazon / VMware / RightScale
• Start the VM from a disk image; requires the full VM OS startup sequence

Hecatonchire solution
• Go live after cloning the VM state (MBs) and hot memory (<5%)
• Use the post-copy live-migration scheme in the background
• Complete the background transfer and disconnect from the source

Hecatonchire value proposition
• Just-in-time (sub-second) provisioning

Planned prototype: Q4 2012
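The go-live condition above can be sketched as follows (a simulation of the scheme, not the actual implementation; the hot-set selection and names are illustrative):

```python
# Sketch of instant flash cloning: the clone goes live once the small VM
# state and the hot pages (<5% here) are copied; cold pages are fetched on
# demand from the source, post-copy style.

def flash_clone(source_pages, hot_vpns):
    clone_pages = {v: source_pages[v] for v in hot_vpns}   # eager: hot set only
    def access(vpn):
        if vpn not in clone_pages:         # cold page: demand/background fetch
            clone_pages[vpn] = source_pages[vpn]
        return clone_pages[vpn]
    eager_fraction = len(hot_vpns) / len(source_pages)
    return access, eager_fraction

src = {v: "page%d" % v for v in range(100)}
access, frac = flash_clone(src, hot_vpns=[0, 1, 2])
assert frac < 0.05                 # live after copying <5% of memory
assert access(42) == "page42"      # cold page resolved on demand
```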

Page 33: Hecatonchire kvm forum_2012_benoit_hudzia


DRAM Latency Has Remained Constant

CPU clock speed and memory bandwidth increased steadily (at least until 2000)

But memory latency remained constant – so local memory has gotten slower from the CPU perspective

Source: J. Karstens: In-Memory Technology at SAP. DKOM 2010

Page 34: Hecatonchire kvm forum_2012_benoit_hudzia


CPUs Stopped Getting Faster

Moore’s law prevailed until 2005, when core clock speeds hit a practical limit of about 3.4 GHz

Since 2005 you do get more cores, but the “single-threaded free lunch” is over

Effectively, arbitrary sequential algorithms have not gotten faster since

Source: http://www.intel.com/pressroom/kits/quickrefyr.htm

Source: “The Free Lunch Is Over..” by Herb Sutter

Page 35: Hecatonchire kvm forum_2012_benoit_hudzia


While … Interconnect Link Speed has Kept Growing

Panda et al. Supercomputing 2009

Page 36: Hecatonchire kvm forum_2012_benoit_hudzia


Result: Remote Nodes Have Gotten Closer

Accessing DRAM on a remote host via IB interconnects is only 20x slower than local DRAM

And remote DRAM has far better performance than paging in from an SSD or HDD device

Fast interconnects have become a commodity - moving out of the High Performance Computing (HPC) niche

HANA Performance Analysis, Chaim Bendelac, 2011
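The "20x slower" claim above is easier to appreciate with a little average-access-time arithmetic (the absolute numbers below are our own illustrative figures, not from the slide):

```python
# Average memory access time when a fraction of the working set is remote.
# Illustrative latencies: ~100 ns local DRAM, ~2 us remote DRAM over IB
# (the ~20x ratio cited), ~100 us for paging in from an SSD.

LOCAL_NS, REMOTE_NS, SSD_NS = 100, 2_000, 100_000

def avg_access_ns(remote_fraction):
    return (1 - remote_fraction) * LOCAL_NS + remote_fraction * REMOTE_NS

assert avg_access_ns(0.0) == 100
assert avg_access_ns(0.1) == 290   # 10% remote: ~3x local, nowhere near 20x
assert REMOTE_NS < SSD_NS          # remote DRAM still beats paging from SSD
```

So as long as most accesses stay local, aggregated remote memory costs only a modest slowdown while avoiding the far larger penalty of disk-backed swap.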

Page 37: Hecatonchire kvm forum_2012_benoit_hudzia


Post-Copy Live Migration (pre-migration)

(Timeline diagram, pre-migrate phase: the Guest VM runs on Host A only. Phases: Pre-migrate, Reservation, Stop and Copy, 1 Round of Page Pushing, Commit; states: Live on A, Downtime, Degraded on B, Live on B; Total Migration Time spans the whole sequence.)

Page 38: Hecatonchire kvm forum_2012_benoit_hudzia


Post-Copy Live Migration (reservation)

(Same timeline diagram, reservation phase: the Guest VM now exists on both Host A and Host B.)

Page 39: Hecatonchire kvm forum_2012_benoit_hudzia


Post-Copy Live Migration (stop and copy)

(Same timeline diagram, stop-and-copy phase highlighted, with the Guest VM present on both hosts.)

Page 40: Hecatonchire kvm forum_2012_benoit_hudzia


Post-Copy Live Migration (post-copy)

(Same timeline diagram, page-pushing phase: page faults on Host B are resolved by page pushes from Host A.)

Page 41: Hecatonchire kvm forum_2012_benoit_hudzia


Post-Copy Live Migration (commit)

(Same timeline diagram, commit phase: the Guest VM now runs on Host B only.)

