Exchange Server 2013 Managed Availability

Scott Schnoll

OFC-B315

Agenda

Understand our approach to monitoring

Understand Managed Availability

Service Management

Service Health Landscape

Exchange Online drove changes in the on-premises product, and changed our approach to monitoring
Scale drives automation
Component-based monitoring does not measure client experience
Client access is proxied by CAS to the Mailbox server that hosts the active database copy

[Diagram: Load Balancer, CAS Array, and a DAG with MBX-A and MBX-B, each holding a copy of DB1]

Exchange 2013 Managed Availability

Cloud Trained

Bring experience from the service to the enterprise

User Focused

Monitor end user experience

Recovery Oriented

Restore end user experience with recovery actions

Cloud Trained

Exchange Engineering team has been operating Exchange Online service for 7 years
Relevant features, experience, and knowledge from service operation are put into the on-premises product

Engineers are on-call for service related issues
Drives accountability for awareness and motivates the team toward auto-healing and recovery
Scale, Auto-Deployment, Optics, High Availability are key tenets

User Focused

If you can’t measure it, you can’t manage it

Availability: Can I access the service?
Latency: How is my experience?
Errors: Am I able to accomplish what I want?

[Diagram: Customer Touch Points, measured by Availability, Latency, and Errors]

Recovery Oriented

OWA send probe
OWA failure monitor
OWA fast recovery responder
OWA verified as healthy

OWA send probe
OWA failure monitor
OWA fast recovery responder
Failover database responder
OWA verified as healthy
MBX1 is failover target

[Diagram: LB and CAS1 in front of servers labeled OWA, DB1, and DB2]

“Stuff breaks, but the experience does not”

Overview

Managed Availability Components

Probe Engine: Check user experience
Notify
Monitor: Evaluate probe data
Respond: Restore service or prevent failure
Escalate: Generate event log

[Diagram labels: Exchange Server 2013 Managed Availability, System Center Operations Manager]

Probe Engine

Probes
  Key goal: measure the user’s perception of the service
  Typically synthetic user transactions (e.g., send a message via OWA)

Checks
  Key goal: measure actual user traffic and become aware when users are experiencing issues
  Typically implemented as performance counters where thresholds can be set to detect spikes

Notify
  Key goal: take action immediately based on a critical event
  Typically conditions that can be detected without a large sample set
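
The idea behind a check, a threshold applied to a performance counter to detect spikes, can be illustrated with a small Python model. This is only a sketch and not Exchange code: the function name `detect_spikes`, the sample values, the threshold, and the "three samples in a row" rule are all assumptions made for illustration.

```python
from typing import List


def detect_spikes(samples: List[float], threshold: float, min_consecutive: int = 3) -> List[int]:
    """Return the start index of each run where the counter stays above threshold.

    A run counts as a spike only once `min_consecutive` samples in a row
    exceed the threshold (an assumed rule, chosen to ignore one-off blips).
    """
    spikes = []
    run_length = 0
    for index, value in enumerate(samples):
        if value > threshold:
            run_length += 1
            if run_length == min_consecutive:
                spikes.append(index - min_consecutive + 1)
        else:
            run_length = 0
    return spikes


# Hypothetical latency counter samples in milliseconds.
latency_ms = [40, 45, 900, 42, 850, 910, 880, 870, 41]
print(detect_spikes(latency_ms, threshold=500))
```

The single high sample at index 2 is ignored, while the sustained run that starts at index 4 is reported once.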

Monitors

Evaluates data collected by probes to determine if action needs to be taken
Depending on the rule, a monitor can initiate a responder or escalate
Defines the time from failure that a responder is executed
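
How a monitor might turn a stream of probe results into a health decision can be sketched as follows. This is a toy model, not the Health Manager's actual logic: the `Monitor` class, the window size, and the failure threshold are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque


@dataclass
class Monitor:
    """Toy monitor: unhealthy once enough recent probe results have failed."""
    window_size: int = 5          # assumed: number of recent samples kept
    failure_threshold: int = 3    # assumed: failures in the window that trip the monitor
    results: Deque[bool] = field(default_factory=deque)

    def record(self, probe_succeeded: bool) -> str:
        """Store one probe result and return the resulting state."""
        self.results.append(probe_succeeded)
        if len(self.results) > self.window_size:
            self.results.popleft()
        failures = sum(1 for ok in self.results if not ok)
        return "Unhealthy" if failures >= self.failure_threshold else "Healthy"


monitor = Monitor()
probe_results = [True, False, False, True, False, True, True, True]
states = [monitor.record(ok) for ok in probe_results]
print(states)
```

Evaluating a window rather than a single sample keeps one failed probe from triggering a responder, and the state returns to Healthy once the failures age out of the window.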

Responders

Executes a response to an alert generated by a monitor

Restart – Terminates and restarts service
Reset AppPool – Cycles IIS application pool
Failover – Initiates a database or server failover
Bugcheck – Initiates a bugcheck of the server
Offline – Takes a protocol on a machine out of service
Online – Places a machine back into service
Escalate – Escalates an issue to an admin

Responder Throttling

Built-in sequencing mechanism controls actions
Throttling ensures the entire service isn’t compromised

All responders can throttle in some fashion
  Some take into account minimum number of servers within a group
  Some take into account time
  Some take into account number of occurrences
  Some may use a combination of the above

When throttling occurs, responder action may be delayed or skipped
  For example, when the Bugcheck Responder is throttled, the action is skipped, not delayed
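
The three throttling inputs named above (servers in a group, time, and number of occurrences) can be combined in a small Python model. This is a sketch under stated assumptions, not the product's throttling code: the `Throttle` class, its limits, and the `handle` function are hypothetical. The one behavior taken from the slide is that a throttled Bugcheck is skipped; treating every other throttled responder as delayed is an assumption.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Throttle:
    """Toy throttle combining group size, time, and occurrence limits."""
    min_healthy_servers: int      # other servers in the group that must stay healthy
    min_seconds_between: float    # minimum spacing between two actions
    max_per_hour: int             # cap on actions within a sliding hour
    history: List[float] = field(default_factory=list)  # times of allowed actions

    def allow(self, now: float, other_healthy_servers: int) -> bool:
        """Return True and record the action only if no limit is hit."""
        recent = [t for t in self.history if now - t < 3600]
        if other_healthy_servers < self.min_healthy_servers:
            return False
        if recent and now - recent[-1] < self.min_seconds_between:
            return False
        if len(recent) >= self.max_per_hour:
            return False
        self.history = recent + [now]
        return True


def handle(responder: str, throttle: Throttle, now: float, other_healthy_servers: int) -> str:
    """Run the responder if allowed; otherwise report how it was throttled."""
    if throttle.allow(now, other_healthy_servers):
        return "executed"
    # Per the slide, a throttled Bugcheck is skipped. Treating all other
    # responders as delayed is an assumption made for this sketch.
    return "skipped" if responder == "Bugcheck" else "delayed"


throttle = Throttle(min_healthy_servers=2, min_seconds_between=600, max_per_hour=2)
outcomes = [
    handle("Restart", throttle, now=0, other_healthy_servers=3),
    handle("Restart", throttle, now=60, other_healthy_servers=3),    # too soon
    handle("Bugcheck", throttle, now=700, other_healthy_servers=1),  # too few servers
    handle("Bugcheck", throttle, now=700, other_healthy_servers=3),
    handle("Restart", throttle, now=1400, other_healthy_servers=3),  # hourly cap reached
]
print(outcomes)
```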

Monitoring Layers

[Diagram labels: PROTOCOL, PROXY; PROACTIVE to REACTIVE; detection times 20s, 5min, 20min]

System Level Checks
Mailbox Self Test (e.g. OWA MST) [detection 5m]
Protocol Self Test (e.g. OWA PST) [detection 20 secs]
Proxy Self Test (e.g. OWA PrST) [detection 20 secs]

End User Experience Level Checks
Customer Touch Point – CTP (e.g. OWA CTP) [detection 20m]

Monitor States

Managed Availability Pipeline

Sampling, Detection, Recovery

[Diagram labels: Probe Definition, Probe, Probe Results (Samples), Notification Item; Monitor Definition, Monitor, Monitor Results (Alerts); Responder Definition, Responder, Responder Results (Responses)]

[Diagram labels: Sequenced Responder Pipeline; Healthy; Named Times 00:00:00, 00:00:10, 00:00:30; Restart, Reset AppPool, Failover, Bugcheck, Offline, Escalate]
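
The sequenced responder pipeline with named times can be pictured as a schedule that fires responders as a failure persists. The Python sketch below is illustrative only: the times mirror the 00:00:00, 00:00:10, and 00:00:30 labels on the slide, but which responder is attached to which time, and the names `PIPELINE`, `responders_due`, and `simulate`, are assumptions.

```python
from typing import List, Tuple

# (seconds since the monitor first saw the failure, responder to run).
# The pairing of responders with times is assumed for illustration.
PIPELINE: List[Tuple[int, str]] = [
    (0, "Restart"),
    (10, "Failover"),
    (30, "Escalate"),
]


def responders_due(seconds_unhealthy: int, already_run: List[str]) -> List[str]:
    """Return responders whose named time has passed and that have not run yet."""
    return [
        name
        for start, name in PIPELINE
        if seconds_unhealthy >= start and name not in already_run
    ]


def simulate(recovered_at: int, horizon: int = 40, step: int = 5) -> List[Tuple[int, str]]:
    """Walk the clock forward and log each responder as it fires.

    `recovered_at` is the second at which the probe passes again; from then
    on the monitor is healthy and no later responder runs.
    """
    log: List[Tuple[int, str]] = []
    run: List[str] = []
    for t in range(0, horizon + 1, step):
        if t >= recovered_at:
            break
        for name in responders_due(t, run):
            run.append(name)
            log.append((t, name))
    return log


print(simulate(recovered_at=15))    # recovery arrives before escalation
print(simulate(recovered_at=1000))  # failure persists through every stage
```

The point of the sequencing is that cheaper recovery actions get a chance to work before more disruptive ones run, and escalation to a person happens only if the automated steps have not restored health.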

Architecture

Managed Availability Architecture

Uses worker process model
  Exchange Health Manager Service (MSExchangeHMHost.exe)
  Exchange Health Manager Worker process (MSExchangeHMWorker.exe)

Uses persistent storage
  Registry used to store runtime data, like bookmarks and local overrides
  Active Directory used to store global overrides
  Crimson Event Channel used to store work item results

Consumes data collected by Exchange Diagnostic Service

Managed Availability Architecture

Leverages multiple HealthMailboxes per database
  These are user accounts, visible in the MESO/Monitoring Mailboxes container
  Can also be viewed using: Get-Mailbox -Monitoring

Server and Service Health

Server and Service Health

Health Sets
Health Groups
Management Tasks and Cmdlets
Event Logging

Health Sets

Health Sets

A health set is a group of monitors, probes and responders for a component that determine whether the component is healthy or unhealthy

View list of health sets:
  Get-HealthReport -Identity <ServerName>

View list of probes, monitors and responders associated with a health set:
  Get-MonitoringItemIdentity -Server <ServerName> -Identity <HealthSetName> | ft Identity,ItemType,Name -auto

Health Sets

Health reported using “worst of” monitors in the health set

View details of Health Set to see what monitors are healthy/unhealthy:
  $Health = Get-HealthReport -Identity EX1 | ? {$_.HealthSet -ilike "<Name>"}
  $Health.Entries | ft Name, AlertValue -auto
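
The "worst of" rollup can be shown in a few lines of Python. This is a sketch rather than the product's implementation: the severity ordering, the treatment of an empty set, and the monitor names are assumptions made for illustration.

```python
from typing import Dict

# Assumed severity order for this sketch; a larger number means worse.
SEVERITY = {"Healthy": 0, "Degraded": 1, "Unhealthy": 2}


def worst_of(monitor_states: Dict[str, str]) -> str:
    """Roll monitor states up into one health set state using worst-of."""
    if not monitor_states:
        return "Healthy"  # assumption: a set with no monitors reports healthy
    return max(monitor_states.values(), key=lambda state: SEVERITY[state])


# Hypothetical monitor names and states for one health set.
example_health_set = {
    "ExampleSelfTestMonitor": "Healthy",
    "ExampleDeepTestMonitor": "Unhealthy",
    "ExampleWorkingSetMonitor": "Healthy",
}
print(worst_of(example_health_set))
```

A single unhealthy monitor is enough to mark the whole health set unhealthy, which is why the per-monitor entries are worth inspecting when a health set reports a problem.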

OWA Health Sets | Monitoring Layers

[Diagram: the monitoring layers (Mailbox Self Test, Protocol Self Test, Proxy Self Test, Customer Touch Point) overlaid with health set labels: Protocol Health Set, Proxy Health Set, CTP Health Set, OWA.Protocol, OWA.Proxy; PROACTIVE to REACTIVE; detection times 20s, 5min, 20min]

Health Groups

Health Groups

Portals in System Center Operations Manager
Customer Touch Points – components that affect real-time user interactions (protocols)
Service Components – components without direct, real-time user interactions (MRS, OABGen)
Server Components – physical resources of the server (disk space, memory, network)
Dependency Availability – server’s ability to use dependencies (AD, DNS, etc.)

System Center Operations Manager

Displays health information related to the Exchange environment
Management Pack Support: SCOM 2007 R2, SCOM 2012

Escalate responder writes event to event log which is consumed by monitor within SCOM
When alert is received by SCOM, it may not be the sum total of problems at a given point in time

Dashboard is broken down into three areas
  Active Alerts
  Organization Health
  Server Health

Viewing Health in System Center

The state of a health group is based on the health of the monitors within the group
Health evaluated by a "worst of" evaluation of the monitors in the group

A health group can have one of six states: Healthy, Degraded, Unhealthy, Repairing, Disabled or Unavailable

Viewing Health in SCOM

Management Tasks and Cmdlets

Management Tasks and Cmdlets

Extract or view system health
  Get-ServerHealth
  Get-HealthReport

View probes, monitors and responders for a health set
  Get-MonitoringItemIdentity

Details about probes, monitors, and responders
  Get-MonitoringItemHelp

Overrides

Admins can alter the thresholds and parameters used by the probes, monitors and responders
  Enables emergency actions
  Enables fine tuning of thresholds specific to the environment

Can be deployed for specific servers or for the entire environment
  Server related overrides are stored in the registry
  Global overrides are stored in Active Directory

Can be set for a specified duration or to apply to a specific version of the server

Are not immediately implemented
  Exchange Health Service reads configuration every 10 minutes
  Global changes depend on Active Directory replication

Wildcards are not supported
  Cannot override entire health set in one task
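
Three of the override behaviors listed above (a fixed duration, exact-match targeting with no wildcards, and pickup on a 10 minute configuration read) can be modeled in Python. This is a sketch under stated assumptions, not how Exchange stores or applies overrides: the `Override` class, the item and property names, and the idea that reads are counted from a known service start time are all hypothetical.

```python
import math
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List


@dataclass
class Override:
    item: str            # full identity of one monitoring item; no wildcards
    property_name: str
    value: str
    created: datetime
    duration: timedelta

    def active(self, now: datetime) -> bool:
        """True while the override is inside its configured duration."""
        return self.created <= now < self.created + self.duration


def effective_settings(defaults: Dict[str, str], overrides: List[Override],
                       item: str, now: datetime) -> Dict[str, str]:
    """Apply every active override whose identity matches `item` exactly."""
    settings = dict(defaults)
    for override in overrides:
        if override.item == item and override.active(now):
            settings[override.property_name] = override.value
    return settings


def picked_up_at(created: datetime, service_start: datetime,
                 interval: timedelta = timedelta(minutes=10)) -> datetime:
    """First configuration read at or after `created`.

    Assumes reads happen every `interval`, counted from `service_start`.
    """
    reads_elapsed = max(0, math.ceil((created - service_start) / interval))
    return service_start + reads_elapsed * interval


created = datetime(2014, 5, 1, 9, 7)
override = Override("ExampleSet\\ExampleMonitor", "ExampleThreshold", "5",
                    created, timedelta(days=7))
defaults = {"ExampleThreshold": "3"}

print(picked_up_at(created, service_start=datetime(2014, 5, 1, 9, 0)))
print(effective_settings(defaults, [override], "ExampleSet\\ExampleMonitor",
                         created + timedelta(days=1)))
print(effective_settings(defaults, [override], "ExampleSet\\ExampleMonitor",
                         created + timedelta(days=8)))
```

Because matching is exact, an identity such as `ExampleSet\*` would change nothing in this model, which mirrors the point that an entire health set cannot be overridden in one task.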

Management Tasks and Cmdlets

Create an override
  Add-ServerMonitoringOverride
  Add-GlobalMonitoringOverride

View overrides
  Get-ServerMonitoringOverride
  Get-GlobalMonitoringOverride

Remove an override
  Remove-ServerMonitoringOverride
  Remove-GlobalMonitoringOverride

Event Logging

Event Logging

Managed Availability makes extensive use of crimson channel event log

Microsoft-Exchange-ActiveMonitoring
  ProbeDefinition
  ProbeResult
  MonitorDefinition
  MonitorResult
  ResponderDefinition
  ResponderResult

Microsoft-Exchange-ManagedAvailability
  Monitoring
  RecoveryActionResults

Definitions

Probe, monitor and responder definitions are initialized and logged when the Health Manager worker process starts

Recovery Actions

Managed Availability logs all recovery actions to the crimson channel Microsoft-Exchange-ManagedAvailability/RecoveryActionResults
Event 500 indicates that a recovery action was started
Event 501 indicates that a recovery action was successful
Event 502 indicates that a recovery action was unsuccessful

Managed Availability – Recovery Actions

Useful properties for Recovery Action event
Id - Action that was taken. Common values are RestartService, RecycleApplicationPool, ComponentOffline, or ServerFailover
State - Whether the action has started (event 500) or finished (event 501/502)
ResourceName - The object that was affected by the action. This will be the name of a service for RestartService actions, or the name of a server for server-level actions
EndTime - The time the action completed
Result - Whether the action succeeded or not
RequestorName - The name of the Responder that took the action
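
Pairing each start event (500) with its finish event (501 or 502) gives a quick view of which recovery actions succeeded, failed, or are still running. The Python sketch below works on already-exported events represented as dictionaries. It is illustrative only: the `EventId` key, the matching key of (Id, ResourceName, RequestorName), and all sample values are assumptions, not the real event schema.

```python
from typing import Dict, List, Tuple

EVENT_MEANING = {500: "started", 501: "succeeded", 502: "failed"}


def summarize(events: List[Dict[str, object]]) -> List[Dict[str, object]]:
    """Pair each start event (500) with its finish event (501/502).

    Events are matched on (Id, ResourceName, RequestorName); that key is an
    assumption for this sketch. Input must be in chronological order.
    """
    open_actions: Dict[Tuple[object, object, object], Dict[str, object]] = {}
    summary: List[Dict[str, object]] = []
    for event in events:
        key = (event["Id"], event["ResourceName"], event["RequestorName"])
        if event["EventId"] == 500:
            entry = {
                "action": event["Id"],
                "resource": event["ResourceName"],
                "requestor": event["RequestorName"],
                "outcome": "in progress",
            }
            open_actions[key] = entry
            summary.append(entry)
        elif event["EventId"] in (501, 502) and key in open_actions:
            open_actions.pop(key)["outcome"] = EVENT_MEANING[event["EventId"]]
    return summary


def make_event(event_id: int, action: str, resource: str, requestor: str) -> Dict[str, object]:
    """Build one hypothetical recovery action event."""
    return {"EventId": event_id, "Id": action,
            "ResourceName": resource, "RequestorName": requestor}


# Hypothetical events; the service, server, and responder names are made up.
events = [
    make_event(500, "RestartService", "ExampleService", "ExampleRestartResponder"),
    make_event(501, "RestartService", "ExampleService", "ExampleRestartResponder"),
    make_event(500, "ServerFailover", "EX1", "ExampleFailoverResponder"),
    make_event(502, "ServerFailover", "EX1", "ExampleFailoverResponder"),
    make_event(500, "RecycleApplicationPool", "ExamplePool", "ExampleResetResponder"),
]

for row in summarize(events):
    print(row["action"], row["resource"], row["outcome"])
```

An action that shows a 500 event with no matching 501 or 502 is reported as in progress, which is often the first thing worth checking when a recovery appears stuck.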

Summary

Exchange 2013 Managed Availability is…

Cloud Trained

Bring experience from the service to the enterprise

User Focused

Monitor end user experience

Recovery Oriented

Restore end user experience with recovery actions

OFC-B318 Microsoft Exchange Server 2013 SP1 High Availability and Site Resilience

OFC-B244 Microsoft Exchange Server 2013 SP1 Tips and Tricks

OFC-B248 Publishing Microsoft Exchange Server: Which TLA Should You Choose?

OFC-B321 Monitoring and Tuning Microsoft Exchange Server 2013 Performance
