DUDE, THIS ISN’T WHERE I PARKED MY INSTANCE?
Moving instances around your OpenStack cloud for fun and profit.
Stephen Gordon (@xsgordon)
Sr. Technical Product Manager, Red Hat
October 29th, 2015
AGENDA
● What are we moving? *
● Why are we moving instances?
● How are we moving instances?
● What new enhancements do we get in:
  ○ Liberty?
  ○ Mitaka?
* #spoileralert: instances
WHAT ARE WE MOVING?
What is an instance (“server”)?
GUEST CONFIGURATION
● Guest configuration including vCPUs, memory, devices etc.
GUEST STORAGE
● Initial image or volume.
GUEST STATE
● In-memory state.
● On-disk state.
All paths for moving instances involve moving some subset of these elements.
WHY ARE WE MOVING INSTANCES?
Moving instances is an operational tool for use...
WHEN PERFORMING NODE MAINTENANCE
● Adding hardware
● Updating software
● Response to imminent failure
IN REACTION TO NODE FAILURE
● Host lost power
● Host lost connectivity
● Host otherwise went down (e.g. DC fire)
FOR CAPACITY MANAGEMENT
● Consolidate or spread instances to save power or avoid resource contention issues respectively.
HOW ARE WE MOVING INSTANCES?
MECHANISMS FOR MOVING INSTANCES
Let me google that for you!
$ nova help | grep -E '(migrat|evacuat)'
    evacuate              Evacuate server from failed host.
    live-migration        Migrate running server to a new machine.
    migrate               Migrate a server. The new host will be..
    migration-list        Print a list of migrations.
    host-servers-migrate  Migrate all instances of the specified host to...
    host-evacuate         Evacuate all instances from failed host.
    host-evacuate-live    Live migrate all instances of the specified host to...
MECHANISMS FOR MOVING INSTANCES
EVACUATE
Rebuild an instance on a different compute node when the compute node it is currently on is down.
MIGRATE
Rebuild* an instance on a different compute node** while the compute node it is currently on is up.
LIVE-MIGRATION
Move an instance to a different compute node without downtime.
* By rebuild we really mean resize.
** This behavior changes if you turn on resizing to the same host (off by default).
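The choice between the three mechanisms above comes down to two questions: is the source compute node up, and can the instance tolerate downtime? A minimal sketch of that decision follows. The function and parameter names are hypothetical and for illustration only; the returned strings are the nova subcommands listed earlier.

```python
def choose_mechanism(source_host_up: bool, must_stay_running: bool) -> str:
    """Pick the nova subcommand that fits the situation (illustrative only)."""
    if not source_host_up:
        # The host is down, so the instance can only be rebuilt elsewhere.
        return "evacuate"
    if must_stay_running:
        # The host is up and the guest should not notice the move.
        return "live-migration"
    # The host is up and a shutdown is acceptable: cold migration (resize path).
    return "migrate"


if __name__ == "__main__":
    for host_up, keep_running in [(False, False), (True, True), (True, False)]:
        print(host_up, keep_running, "->", choose_mechanism(host_up, keep_running))
```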
HELPERS FOR MOVING INSTANCES
HOST-EVACUATE
Rebuild, on another compute node, all instances that are currently on a compute node that is down.
HOST-SERVERS-MIGRATE
Rebuild*, on another compute node**, all instances that are currently on a compute node that is up.
HOST-EVACUATE-LIVE
Move all instances on a compute node to another compute node without downtime.
* By rebuild we really mean resize.
** This behavior changes if you turn on resizing to the same host (off by default).
EVACUATION
nova evacuate [--password <password>] [--on-shared-storage] <server> [<host>]
● Works when the compute node hosting the instance fails due to a hardware failure or other issue.
● Rebuilds the instance on a new compute node, selected either by the scheduler or, optionally, by the user initiating the evacuation.
  ○ Benefit over and above starting afresh is keeping same UUID, IP etc.
● Requires that Nova recognizes the source compute node is down.
● Requires shared storage to maintain user data on disk (shared storage itself is not mandatory).
● Allows injecting a new admin password (if shared storage is not being used).
$ nova evacuate instance-001
+-----------+--------------+
| Property  | Value        |
+-----------+--------------+
| adminPass | pjaDV46p94Nz |
+-----------+--------------+
$
COLD MIGRATION
nova migrate [--poll] <server>
● Works when compute node hosting instance is up (at least to start with…).
● Rebuilds instance on a new host selected by the scheduler.
  ○ Actually uses the resize path in the code base.
  ○ Shuts down instance.
  ○ Copies disk to the new compute node.
  ○ Starts the instance there and removes it from the source hypervisor.
● Instance’s current host must be operational.
● Like resize, requires a manual confirmation step.
● Unlike evacuation and live migration, doesn’t allow specification of target host to override scheduler.
$ nova migrate instance-001 --poll
Server migrating... 100% complete
Finished
$ nova list
+--------------+--------------+---------------+------------+-------------+ ...
| ID           | Name         | Status        | Task State | Power State | ...
+--------------+--------------+---------------+------------+-------------+ ...
| 5819a2e0-... | instance-001 | VERIFY_RESIZE | -          | Running     | ...
+--------------+--------------+---------------+------------+-------------+ ...
$ nova resize-confirm instance-001
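The manual confirmation step can be automated by polling until the instance reaches VERIFY_RESIZE and then confirming. The sketch below shows only that control flow. `get_status` and `confirm` are hypothetical callables that a real script would back with its own API or CLI calls, and the "RESIZE" value in the demo is a placeholder for whatever status the cloud reports while the move is in progress.

```python
import time
from typing import Callable


def confirm_when_ready(get_status: Callable[[], str],
                       confirm: Callable[[], None],
                       timeout_s: float = 600.0,
                       poll_interval_s: float = 5.0) -> bool:
    """Poll until the server reports VERIFY_RESIZE, then confirm the move.

    Returns True if the confirmation was issued, False on ERROR or timeout.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = get_status()
        if status == "VERIFY_RESIZE":
            confirm()  # equivalent of `nova resize-confirm <server>`
            return True
        if status == "ERROR":
            return False
        time.sleep(poll_interval_s)
    return False


if __name__ == "__main__":
    # Fake status source standing in for a real API call.
    fake_statuses = iter(["RESIZE", "RESIZE", "VERIFY_RESIZE"])
    done = confirm_when_ready(lambda: next(fake_statuses),
                              lambda: print("confirmed"),
                              timeout_s=5, poll_interval_s=0)
    print("confirmation issued:", done)
```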
LIVE MIGRATION
$ nova live-migration [--block-migrate] [--disk-over-commit] <server> [<host>]
● Moves powered on virtual machine to a new compute node without any (noticeable) downtime.
● Two approaches to live migration:
  ○ Using shared storage (including volume-based).
    ■ Requires either /var/lib/nova/instances/ to be on shared storage (e.g. NFS, GlusterFS, Ceph, etc.) across all compute nodes in the migration domain; or
    ■ Volume-backed instances
    ■ Still requires memory state transfer/sync
  ○ Using block migration.
    ■ Direct transfer/sync of not just memory state but also disks from source compute node to destination
LIVE MIGRATION - HOW IT WORKS
1. Scheduler selects destination host, unless the user specified one.
2. Check migration source and destination (disk, RAM, CPU model, mapped volumes).
3. Iterative pre-copy, copying memory pages from the active virtual machine on the source to a new paused instance on the destination.
4. Source instance is paused while the remaining memory pages and CPU state are copied.
5. Destination instance is started, source is cleaned up.
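Steps 3 and 4 can be illustrated with a toy model of iterative pre-copy. This is a simplified sketch, not QEMU's actual algorithm: it assumes a constant dirty rate and constant bandwidth, and every name and number in it is hypothetical. Each pass resends the memory dirtied during the previous pass, and the migration can finish once the remainder can be copied within the allowed downtime.

```python
def simulate_precopy(ram_mb, dirty_mb_per_s, bandwidth_mb_per_s,
                     max_downtime_ms, max_passes=30):
    """Return (converged, passes, remaining_mb) for a toy pre-copy model."""
    remaining_mb = float(ram_mb)  # the first pass has to copy all guest RAM
    passes = 0
    while True:
        # Downtime we would incur if we paused the guest right now.
        stop_and_copy_ms = remaining_mb / bandwidth_mb_per_s * 1000.0
        if stop_and_copy_ms <= max_downtime_ms:
            return True, passes, remaining_mb
        if passes == max_passes:
            return False, passes, remaining_mb
        # Copy while the guest keeps running; whatever it dirties in the
        # meantime must be sent again (it can never exceed total RAM).
        pass_seconds = remaining_mb / bandwidth_mb_per_s
        remaining_mb = min(float(ram_mb), dirty_mb_per_s * pass_seconds)
        passes += 1


if __name__ == "__main__":
    # A mostly idle guest converges quickly; a guest dirtying memory
    # faster than the network can carry it never does.
    print(simulate_precopy(4096, 50, 1000, 400))
    print(simulate_precopy(4096, 1200, 1000, 400))
```

The second call shows why a fixed maximum downtime can leave a busy guest migrating forever: if the dirty rate exceeds the migration bandwidth, the remainder never shrinks.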
LIVE MIGRATION - HOW IT DOESN’T WORK
CPU mode/model compatibility
● Maximum performance is obtained by exposing as many host CPU features to the guest as possible
● Live migration will fail if destination host is not able to expose the same CPU features to guests as the source host
● Performance versus Flexibility trade-off
● Nova provides configuration keys, including libvirt_cpu_mode, for deployers to make the performance versus flexibility trade-off for their environment
  ○ host-passthrough
  ○ host-model
  ○ custom
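The compatibility rule above amounts to a subset check: every CPU feature the guest was given on the source must also be available on the destination. The sketch below illustrates this with hand-written feature sets. The sets are illustrative assumptions, not libvirt's actual model definitions, and the function names are hypothetical.

```python
def missing_on_destination(guest_features, dest_host_features):
    """Features the guest relies on that the destination cannot expose."""
    return sorted(set(guest_features) - set(dest_host_features))


def can_live_migrate(guest_features, dest_host_features):
    """True when the destination exposes every feature the guest uses."""
    return not missing_on_destination(guest_features, dest_host_features)


if __name__ == "__main__":
    # Illustrative flag sets for a newer and an older host (assumptions).
    newer_host = {"sse4.2", "aes", "avx"}
    older_host = {"sse4.2", "aes"}

    # Guest given every feature of the newer host (host-passthrough style):
    print(can_live_migrate(newer_host, older_host))
    print(missing_on_destination(newer_host, older_host))

    # Guest restricted to a common baseline (custom model style):
    baseline = newer_host & older_host
    print(can_live_migrate(baseline, older_host))
```

Pinning guests to a common baseline gives up some features on newer hosts in exchange for being able to migrate in either direction.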
$ virsh cpu-models x86_64
...
SandyBridge
Westmere
Nehalem
...
$ grep 'libvirt_cpu_mode' /etc/nova/nova.conf
libvirt_cpu_mode = custom
libvirt_cpu_model = SandyBridge
Can also use qemu-kvm -cpu help
LIVE MIGRATION - OTHER WAYS TO FAIL
● Incompatible QEMU machine types
● Inconsistent networking configuration
  ○ Source hypervisor must be able to hit destination’s live_migration_uri and vice versa (live_migration_uri = qemu+tcp://%s/system)
● Inconsistent clocks
  ○ Synchronize clocks using ntp or chronyd
● Incompatible VNC listening addresses
● Incompatible or no SSH tunnelling configuration
LIVE MIGRATION - OTHER OPERATOR ISSUES
● Migrations take too long or fail to complete.
● Many common user operations are not supported during migration (e.g. pause).
● Need to use virsh, bypassing Nova, to:
  ○ Control a running migration (e.g. throttle or cancel)
  ○ Monitor a running migration
  ○ Tune migration max downtime
● Certain instance configurations cannot be migrated, namely instances that:
  ○ Use a config drive (e.g. config_drive_format=iso9660) or mix local/remote storage
  ○ Have passed-through devices associated with them (SR-IOV, GPU, etc.)
● Live migration doesn’t correctly account for overcommit when checking destination host validity.
● Tenant admin initiating the migration needs to know whether shared storage is available or block migration is needed.
LIBERTY
LONG RUNNING LIVE MIGRATIONS
I’m gonna let you finish...but...
● Primary factors in determining how long it will take to migrate a guest:
  ○ Amount of guest RAM
  ○ Speed with which guest RAM is being dirtied
  ○ Speed of the migration network
● Previously live migrations in OpenStack ran with a fixed maximum downtime as determined by QEMU.
● As of Liberty:
  ○ The downtime allowable is scaled up exponentially (to a limit) to allow a better chance for completion.
  ○ The number of concurrent outbound live migrations is limited
  ○ The number of concurrent inbound build requests is limited
● QEMU endeavors to estimate when the number of dirty pages is low enough to finalize
LONG RUNNING LIVE MIGRATIONS
New configuration keys to control this behavior...
● Scaling downtime to finalize migration:
  ○ live_migration_downtime - Maximum permitted guest downtime for switchover (minimum 100ms)
  ○ live_migration_downtime_steps - Number of incremental steps to reach max downtime value (minimum 3)
  ○ live_migration_downtime_delay - Time to wait, in seconds, between each step in increase of max downtime (minimum 10s)
● Timeouts:
  ○ live_migration_completion_timeout - Time to wait (in seconds) for migration to complete (default 800 seconds, 0 means no timeout) - is scaled by GB of guest RAM
  ○ live_migration_progress_timeout - Time to wait (in seconds) for migration to make forward progress (default 150 seconds).
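To make the scaling of live_migration_completion_timeout concrete, the sketch below parses a nova.conf-style fragment and computes the effective timeout for a guest. It assumes the keys live in a `[libvirt]` section and that the timeout is simply multiplied by the guest's RAM in GB, as the bullet above describes; the real calculation may differ in detail. The function name is hypothetical, and the downtime values are the ones used in this deck's worked example rather than defaults.

```python
import configparser

# Example fragment; the [libvirt] section name is an assumption.
NOVA_CONF_FRAGMENT = """
[libvirt]
live_migration_downtime = 400
live_migration_downtime_steps = 10
live_migration_downtime_delay = 30
live_migration_completion_timeout = 800
live_migration_progress_timeout = 150
"""


def completion_timeout_seconds(conf_text, guest_ram_gb):
    """Effective completion timeout for a guest, or None when disabled."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    per_gb = parser.getint("libvirt", "live_migration_completion_timeout")
    if per_gb == 0:
        return None  # 0 means no timeout
    return per_gb * guest_ram_gb


if __name__ == "__main__":
    # With 800 seconds per GB, a 3 GB guest gets 2400 seconds to complete.
    print(completion_timeout_seconds(NOVA_CONF_FRAGMENT, 3))
```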
● Concurrent operations:
  ○ max_concurrent_live_migrations - Maximum outbound live migrations to run concurrently, defaults to 1. Do not change unless absolutely sure.
  ○ max_concurrent_builds - Maximum inbound instance builds to run concurrently, defaults to 10.
LONG RUNNING LIVE MIGRATIONS EXAMPLE
400 millisecond max, 10 steps, 30 second delay, 3 GB guest
● Delay between steps is set to 30 * 3 = 90 seconds (seconds of delay * GB of RAM).
  ○ 0 seconds -> set downtime to 37ms
  ○ 90 seconds -> set downtime to 38ms
  ○ 180 seconds -> set downtime to 39ms
  ○ 270 seconds -> set downtime to 42ms
  ○ 360 seconds -> set downtime to 46ms
  ○ 450 seconds -> set downtime to 55ms
  ○ 540 seconds -> set downtime to 70ms
  ○ 630 seconds -> set downtime to 98ms
  ○ 720 seconds -> set downtime to 148ms
  ○ 810 seconds -> set downtime to 238ms
  ○ 900 seconds -> set downtime to 400ms
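The schedule above can be reproduced with a short exponential-step calculation. The formula below follows my understanding of how Nova's libvirt driver derives the steps, so treat it as an assumption rather than a quotation of the code. The function name is hypothetical. The final step may come out at 399 or 400 depending on floating-point rounding.

```python
def downtime_steps(max_downtime_ms, steps, delay_s, guest_ram_gb):
    """Yield (seconds_since_start, allowed_downtime_ms) pairs.

    The delay between steps is scaled by guest RAM in GB, and the allowed
    downtime grows exponentially from a small offset up to the maximum.
    """
    scaled_delay = int(delay_s * guest_ram_gb)
    offset = max_downtime_ms / float(steps + 1)
    base = (max_downtime_ms - offset) ** (1 / float(steps))
    for i in range(steps + 1):
        yield int(scaled_delay * i), int(offset + base ** i)


if __name__ == "__main__":
    # Same inputs as the example: 400 ms max, 10 steps, 30 s delay, 3 GB guest.
    for elapsed_s, downtime_ms in downtime_steps(400, 10, 30, 3):
        print(f"{elapsed_s} seconds -> set downtime to {downtime_ms}ms")
```

Most of the increase is concentrated in the last few steps, so a guest that can converge with a short pause gets one, and only stubborn migrations are granted the full maximum.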
MARK HOST DOWN API CALL
● Liberty provides a mechanism for external tools to report into Nova when a node has failed (“mark host down”/“force down” API call)
● As soon as host has been explicitly marked down evacuation can commence, triggered by the external tool.
● Used to provide “instance high availability” using e.g. Pacemaker.
  ○ http://redhatstackblog.redhat.com/2015/09/24/highly-available-virtual-machines-in-rhel-openstack-platform-7/
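An external monitoring tool would issue the force-down call over the compute REST API. The sketch below only builds the request and does not send it. The path, the microversion header value, and the body fields reflect my recollection of the Liberty-era compute API and should be checked against the API reference before use. The helper name and host name are hypothetical.

```python
import json


def build_force_down_request(compute_host, forced_down=True):
    """Describe the HTTP request that marks a compute service down (or up)."""
    body = {
        "host": compute_host,
        "binary": "nova-compute",
        "forced_down": forced_down,
    }
    return {
        "method": "PUT",
        # Assumed path, relative to the compute endpoint.
        "path": "/os-services/force-down",
        "headers": {
            "Content-Type": "application/json",
            # Assumed minimum microversion for this call.
            "X-OpenStack-Nova-API-Version": "2.11",
        },
        "body": json.dumps(body),
    }


if __name__ == "__main__":
    request = build_force_down_request("compute-1.example.com")
    print(request["method"], request["path"])
    print(request["body"])
```

Once the service is marked down, the tool can go straight to evacuating the host's instances instead of waiting for Nova to notice the failure on its own.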
MITAKA AND BEYOND
CURRENTLY UNDER DISCUSSION
Short Term
● CI coverage
● Improve API documentation
● Support for migrating instances with mixed storage
● Support for pausing (and perhaps cancelling) migrations
● Better resource tracking
● Use Libvirt storage pools instead of SSH for migrate/resize.
  ○ Enabler for other work including migrating suspended instances.
● Correct memory overcommit handling for live migration.
Mid to Long Term
● TLS encryption (work underway in QEMU)
● Auto-convergence - adjusting instance activity to help complete migration
● Post copy migration - start instance at destination and then copy memory over on demand
Q & A
FAQ
● Where can I find the slides?
  ○ http://www.slideshare.net/sgordon2
● Where can I submit anonymised feedback?
  ○ Session Feedback Survey in the official OpenStack Summit App
● Where can I contact you?
  ○ Twitter: @xsgordon
  ○ Email: [email protected]
  ○ IRC: sgordon on irc.freenode.net
● How can I get involved?
  ○ https://etherpad.openstack.org/p/mitaka-live-migration
THANK YOU
plus.google.com/+RedHat
linkedin.com/company/red-hat
youtube.com/user/RedHatVideos
facebook.com/redhatinc
twitter.com/RedHatNews
@xsgordon - Stephen Gordon
RECOMMENDED READING, VIEWING, AND REFERENCES
● Outstanding work items:
  ○ Etherpad: https://etherpad.openstack.org/p/mitaka-live-migration
  ○ Bug list: https://docs.google.com/spreadsheets/d/19MFatOpjePS4JtkVHXCh6Qa8XUf6T2t0Igy1PucZ3Zk/edit#gid=2127877307
● Past presentations:
  ○ Live Migration at HP Public Cloud:
    ■ https://www.openstack.org/summit/vancouver-2015/summit-videos/presentation/live-migration-at-hp-public-cloud
  ○ Intel Dive into VM Live Migration:
    ■ https://www.openstack.org/summit/vancouver-2015/summit-videos/presentation/dive-into-vm-live-migration