Date post: | 15-Jan-2017 |
Category: | Technology |
Upload: | shapeblue |
The Cloud Specialists
When the Cloud is a Rockin': High Availability in Apache CloudStack
shapeblue.com • @ShapeBlue
John Burwell • @john_burwell
VP of Software Engineering
About Me
• VP of Software Engineering @ ShapeBlue
• Member, Apache CloudStack PMC (June 2013)
• Ran operations and designed automated provisioning for analytic/virtualization clouds
• Led architectural design and server-side development of a SaaS physical security platform
There’s No “I” in Team
• Rohit Yadav
• Abhi Prateek
• Murali Reddy
• Boris Stoyanov
Motivation
Currently [sic] KVM HA works by monitoring an NFS based heartbeat file and it can often fail whenever this network share becomes slower, causing the hypervisors to reboot. … This is embarrassing. How can we fix it? Ideas, suggestions? How are other hypervisors doing it?
- Nux, 15 October 2015
CLOUDSTACK-8943
Limitations/Issues
• Limited to hosts and VMs using NFS storage
• Tight coupling between the Agent and HighAvailabilityManager
• False positives which interrupt the operation of healthy resources
Inconsistent behavior prevents operators from trusting KVM HA
Build vs. Buy
Pros
• Integration with the CloudStack control plane and abstractions
• Simpler configuration
• Integrated instrumentation and logging
Cons
• Complex mechanism to implement, test, and maintain
• Foregoing a proven, battle-tested implementation
• Less functionality initially
A robust infrastructure control plane must include the ability to recover and fence resources
HA Resource Management Service
HA Resource Management Service
• Manages per-resource FSM
• Persistence
• Concurrency/Back Pressure
• Common Business Logic
HA Provider (Plugin)
• Resource-specific Business Logic
Resource
Goals
• Loose coupling between resources and HA
• Consolidate orthogonal HA concerns
• Prove the correct operation of the HA Resource Management Service and HA Providers independently
• Leverage CloudStack abstractions
• Develop a model for architectural evolution
To create a trustworthy system, operational correctness must be the prevailing priority
Terms and Concepts
• Health Check: An idempotent check of a resource to directly verify its proper operation
• Activity Check: An idempotent check to observe the side-effects of a resource’s proper operation
• Eligibility: An idempotent determination of a resource’s eligibility for HA management
• Recovery: Take potentially destructive actions to bring a resource back to a healthy state
• Fence: Take potentially destructive actions to prevent an unrecoverable resource from impacting the health of its peers
States
• DISABLED: The resource is part of a partition where HA operations have been disabled, or HA operations have been disabled for the resource itself.
• INITIALIZING: The initial health and eligibility of the resource for HA management is currently being determined.
• AVAILABLE: The resource passed its most recent health check and its containing partition has an HA state of ACTIVE.
• INELIGIBLE: The resource's enclosing partition has an HA state of ACTIVE but its current state does not support HA check and/or recovery operations.
• SUSPECT: The resource failed its most recent health check and is pending an activity check.
• CHECKING: An activity check is currently being performed on the resource.
• RECOVERING: Recovery operations are in progress to bring the resource back to a healthy state.
• DEGRADED: The resource cannot be managed by the control plane but passed its most recent activity check, indicating that the resource is still servicing end-user requests.
• FENCED: The resource is not operating normally and automated attempts to recover it failed. Manual operator intervention is required to recover the resource.
State Model
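The state diagram for this slide is not part of the extracted text. As a rough illustration only, the per-resource FSM could be expressed as a transition table like the one below. The state names come from the States slide; every transition and every event name here is an assumption inferred from the state descriptions, not the published design.

```java
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical sketch of a per-resource HA state machine. Transitions are assumed. */
final class HAStateSketch {
    enum State { DISABLED, INITIALIZING, AVAILABLE, INELIGIBLE, SUSPECT, CHECKING, RECOVERING, DEGRADED, FENCED }

    // Hypothetical event names: the slides do not name the triggers.
    enum Event {
        HA_ENABLED, FOUND_INELIGIBLE, HEALTH_CHECK_PASSED, HEALTH_CHECK_FAILED,
        ACTIVITY_CHECK_STARTED, ACTIVITY_CHECK_PASSED, ACTIVITY_CHECK_FAILED,
        RECOVERY_SUCCEEDED, RECOVERY_FAILED
    }

    private static final Map<State, Map<Event, State>> TRANSITIONS = new EnumMap<>(State.class);

    private static void allow(State from, Event on, State to) {
        TRANSITIONS.computeIfAbsent(from, s -> new EnumMap<>(Event.class)).put(on, to);
    }

    static {
        // Assumed transitions, inferred from the state descriptions.
        allow(State.DISABLED, Event.HA_ENABLED, State.INITIALIZING);
        allow(State.INITIALIZING, Event.HEALTH_CHECK_PASSED, State.AVAILABLE);
        allow(State.INITIALIZING, Event.FOUND_INELIGIBLE, State.INELIGIBLE);
        allow(State.AVAILABLE, Event.HEALTH_CHECK_FAILED, State.SUSPECT);
        allow(State.SUSPECT, Event.ACTIVITY_CHECK_STARTED, State.CHECKING);
        allow(State.CHECKING, Event.ACTIVITY_CHECK_PASSED, State.DEGRADED);
        allow(State.CHECKING, Event.ACTIVITY_CHECK_FAILED, State.RECOVERING);
        allow(State.DEGRADED, Event.HEALTH_CHECK_PASSED, State.AVAILABLE);
        allow(State.RECOVERING, Event.RECOVERY_SUCCEEDED, State.INITIALIZING);
        allow(State.RECOVERING, Event.RECOVERY_FAILED, State.FENCED);
        // FENCED has no automated exit: the slides say an operator must intervene.
    }

    /** Returns the next state, or throws if the event is not allowed in the current state. */
    static State next(State current, Event event) {
        Map<Event, State> row = TRANSITIONS.get(current);
        State target = (row == null) ? null : row.get(event);
        if (target == null) {
            throw new IllegalStateException("No transition from " + current + " on " + event);
        }
        return target;
    }

    public static void main(String[] args) {
        // Walk a host through the assumed failure path.
        Event[] failurePath = {
            Event.HEALTH_CHECK_FAILED, Event.ACTIVITY_CHECK_STARTED,
            Event.ACTIVITY_CHECK_FAILED, Event.RECOVERY_FAILED
        };
        State state = State.AVAILABLE;
        for (Event event : failurePath) {
            State nextState = next(state, event);
            System.out.println(state + " --" + event + "--> " + nextState);
            state = nextState;
        }
    }
}
```

Keeping the transitions in one table means an illegal move fails loudly instead of silently corrupting a resource's state, which fits the stated priority on operational correctness.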
HA Provider Interface
public interface HAProvider<R> extends Adapter {
    // Resource type and subtype this provider handles
    ResourceType resourceType();
    ResourceSubType resourceSubType();
    // Idempotent checks (see Terms and Concepts)
    boolean isEligible(R r);
    boolean isHealthy(R r) throws HACheckerException;
    boolean hasActivity(R r) throws HACheckerException;
    // Potentially destructive operations
    boolean recover(R r) throws HARecoveryException;
    boolean fence(R r) throws HAFenceException;
}
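To make the division of labor concrete, here is a minimal sketch of how a service might call a provider's methods in sequence for one resource. It uses a simplified stand-in interface (`SimpleHAProvider`, hypothetical) that drops `Adapter`, the type methods, and the checked exceptions so the example compiles on its own. The call order and the `Outcome` names are assumptions, not the actual service logic.

```java
/** Hypothetical sketch: one evaluation pass over a resource using a provider. */
final class HAProviderSketch {

    /** Simplified stand-in for HAProvider<R>; not the real CloudStack interface. */
    interface SimpleHAProvider<R> {
        boolean isEligible(R r);
        boolean isHealthy(R r);
        boolean hasActivity(R r);
        boolean recover(R r);
        boolean fence(R r);
    }

    // Assumed outcome labels for this sketch only.
    enum Outcome { INELIGIBLE, HEALTHY, DEGRADED, RECOVERED, FENCED, FENCE_FAILED }

    /** Cheap idempotent checks run first; destructive actions are the last resort. */
    static <R> Outcome evaluate(SimpleHAProvider<R> provider, R resource) {
        if (!provider.isEligible(resource)) return Outcome.INELIGIBLE;
        if (provider.isHealthy(resource)) return Outcome.HEALTHY;
        if (provider.hasActivity(resource)) return Outcome.DEGRADED;
        if (provider.recover(resource)) return Outcome.RECOVERED;
        return provider.fence(resource) ? Outcome.FENCED : Outcome.FENCE_FAILED;
    }

    /** Stub provider with fixed answers, standing in for a real host provider. */
    static final class StubProvider implements SimpleHAProvider<String> {
        private final boolean healthy;
        private final boolean active;
        private final boolean recoverable;

        StubProvider(boolean healthy, boolean active, boolean recoverable) {
            this.healthy = healthy;
            this.active = active;
            this.recoverable = recoverable;
        }

        @Override public boolean isEligible(String host) { return true; }
        @Override public boolean isHealthy(String host) { return healthy; }
        @Override public boolean hasActivity(String host) { return active; }
        @Override public boolean recover(String host) { return recoverable; }
        @Override public boolean fence(String host) { return true; }
    }

    public static void main(String[] args) {
        System.out.println(evaluate(new StubProvider(true, false, false), "host-1"));
        System.out.println(evaluate(new StubProvider(false, true, false), "host-2"));
        System.out.println(evaluate(new StubProvider(false, false, false), "host-3"));
    }
}
```

Because the resource-specific logic sits behind the interface, the sequencing can be tested with stubs like `StubProvider` without a real hypervisor, which is the independent verification the Goals slide asks for.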
KVM Host HA
KVM Host HA Provider
• Health Check: KVM Agent
• Activity Check: Storage Processor
• Recover/Fence: Host, using OOBM
Concurrency Model
• Producer/consumer model
• Size-bounded work queues
• Time-bounded operations
• Fixed-size thread pools
• Idempotent operations are ephemeral
• Non-idempotent operations are managed through AsyncJobManager using a new time-delayed dispatcher
HA operations cannot overwhelm the control plane
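The combination of a bounded queue, a fixed pool, and a time limit can be sketched with the standard `java.util.concurrent` classes. This is an illustrative sketch only: the class name, the sizes, and the choice to treat a rejected or timed-out check as "no result" are assumptions, and it does not cover the AsyncJobManager path for non-idempotent operations.

```java
import java.util.Optional;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical sketch: bounded, time-limited execution of idempotent HA checks. */
final class BoundedHAExecutorSketch {
    private final ThreadPoolExecutor pool;
    private final long timeoutMillis;

    BoundedHAExecutorSketch(int threads, int queueCapacity, long timeoutMillis) {
        // Fixed-size pool with a size-bounded queue; AbortPolicy rejects work when full.
        this.pool = new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity), new ThreadPoolExecutor.AbortPolicy());
        this.timeoutMillis = timeoutMillis;
    }

    /** Returns the check result, or empty if it was rejected, timed out, or failed. */
    Optional<Boolean> runCheck(Callable<Boolean> check) {
        Future<Boolean> future;
        try {
            future = pool.submit(check);
        } catch (RejectedExecutionException e) {
            return Optional.empty(); // back pressure: queue is full, drop the check
        }
        try {
            // The time bound covers queue wait plus execution.
            return Optional.ofNullable(future.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker so the thread is freed
            return Optional.empty();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return Optional.empty();
        } catch (ExecutionException e) {
            return Optional.empty();
        }
    }

    void shutdown() {
        pool.shutdownNow();
    }

    public static void main(String[] args) {
        BoundedHAExecutorSketch executor = new BoundedHAExecutorSketch(2, 4, 200);
        System.out.println("fast check: " + executor.runCheck(() -> true));
        System.out.println("slow check: " + executor.runCheck(() -> {
            Thread.sleep(2000); // simulates a hung storage or agent call
            return true;
        }));
        executor.shutdown();
    }
}
```

Dropping an idempotent check is safe in this sketch because it is ephemeral: the next scheduled pass simply runs it again, so the control plane sheds load instead of queueing without limit.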
Status
• Focused on KVM host HA
• Initial implementation started: validating the design
• Draft specification: functional spec will be published in the next 1-2 weeks
• Robust unit and integration test model to verify both the service and KVM host HA provider
• Delivery of the first version in July 2016 for inclusion in 4.10 (August 2016)
What’s Next
• Support Nested HA Resources
• Instrumentation
• Migrate VM HA to the HA Resource Management Service
Questions? Comments?
#cloudstackworks