Post on 23-Dec-2015
Exchange Server 2013 High Availability
Scott Schnoll
Microsoft Corporation
scott.schnoll@microsoft.com
Twitter: @Schnoll
Blog: http://aka.ms/schnoll
Copyright© Microsoft Corporation
Agenda
• DAG architecture
  • MSExchangeRepl
  • MSExchangeDAGMgmt
  • Cluster
  • Crimson Channel
• Witness Server Placement
• Dynamic Quorum
• DAG Member Maintenance
• Managed Availability
DAG Replication Service
• Introduced in Exchange 2007 RTM
• Microsoft Exchange Replication service | MSExchangeRepl
  • MSExchangeRepl.exe
  • Runs on all Mailbox servers (not just DAG members)
  • Communicates with Active Directory and other DAG members
• Includes 16 components
  • Active Directory lookup
  • Copy status lookup
  • Replay core manager
  • Seed manager
  • Autoreseed manager
  • Disk reclaimer manager
  • Replay RPC server wrapper
  • Remote data provider wrapper
  • VssWriter
  • Active Manager
  • Active Manager RPC server wrapper
  • Failure item manager
  • TPR API manager
  • Support API manager
  • Server locator manager
  • Health state tracker
DAG Management Service
• Introduced in RTM CU2
• Microsoft Exchange DAG Management service | MSExchangeDagMgmt
  • MSExchangeDagMgmt.exe
  • Runs on all Mailbox servers (not just DAG members)
  • Communicates with Active Directory and other DAG members
• Includes 4 components
  • Active Directory lookup
  • Copy status lookup
  • Monitoring
  • Tracer instance
DAG Management Service
• Writes events to same place as Replication service
  • Application event log (source of MSExchangeRepl)
  • HighAvailability crimson channel
• Created for two primary reasons:
  • So the Replication service can have more focused functionality
  • So Managed Availability actions can kill lower-priority activities
• As we refactor more, other functions will move to this service
  • AutoReseed
  • Disk Reclaimer
  • Dynamic replay lag playdown
  • Future AutoDAG copy layout and mobility features
Cluster Service
• Introduced in NT Server Enterprise Edition (1997)
• Cluster Service | ClusSvc
  • Clussvc.exe
• Exchange DAGs use several Cluster components
  • Quorum
  • Membership and Node Management
  • Networks and Heartbeating
  • Cluster Registry
Cluster Service
• Quorum is required in order to mount databases
• Quorum is based on votes, not membership
  • Voting can be rigged
  • Votes can be taken away manually or dynamically
• Exchange manages quorum model, not quorum
  • Exchange management of quorum model based on nodes, not votes
  • Removing votes requires manual configuration of quorum model
  • Exchange will make incorrect quorum model management decisions if votes are manually removed at the cluster level
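The vote arithmetic behind these bullets can be sketched in Python. This is a minimal illustration, not Exchange or cluster code. The function names are hypothetical, and it assumes the commonly documented behaviour that a DAG with an even number of members uses the Node and File Share Majority model while a DAG with an odd number of members uses Node Majority.

```python
def votes_needed(total_votes):
    """Smallest number of votes that forms a strict majority."""
    return total_votes // 2 + 1


def quorum_model(dag_members):
    """Model picked from the number of DAG members (nodes), not from votes."""
    if dag_members % 2 == 0:
        return "Node and File Share Majority"
    return "Node Majority"


def total_votes(dag_members, manually_removed_votes=0):
    """One vote per member, plus the witness vote when the member count is even.

    manually_removed_votes models votes stripped at the cluster level,
    which the model selection above knows nothing about.
    """
    witness_vote = 1 if dag_members % 2 == 0 else 0
    return dag_members + witness_vote - manually_removed_votes


if __name__ == "__main__":
    for members in (2, 3, 4, 5):
        votes = total_votes(members)
        print(members, quorum_model(members), votes, votes_needed(votes))

    # A 3-member DAG with one vote removed by hand: the model is still
    # "Node Majority", but only 2 votes remain and both are needed.
    print(votes_needed(total_votes(3, manually_removed_votes=1)))
```

The last line shows why manual vote removal is risky: the model is still chosen from the node count, so the resulting vote total can end up even and far less tolerant of failure than the model implies.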
Cluster Registry
• Active Manager stores database / server information in the cluster registry for DAG members
• Registry changes are replicated immediately to all DAG members
• Stored information is used as part of BCSS (Best Copy and Server Selection)
Cluster Registry
IsEntryExist?True*ActiveServer?ex2*LastMountedServer?ex2*LastMountedTime?2013-07-15T22:29:39*MountStatus?Mounted*IsAdminDismounted?False*IsAutomaticActionsAllowed?True*
• ActiveServer: name of the server where the database is currently mounted, or is expected to be mounted when mount operations complete
• LastMountedServer: name of the server where the database was last successfully mounted
• LastMountedTime: date and time stamp of the last time the database was mounted
Cluster Registry
IsEntryExist?True*ActiveServer?ex2*LastMountedServer?ex2*LastMountedTime?2013-07-15T22:29:39*MountStatus?Mounted*IsAdminDismounted?False*IsAutomaticActionsAllowed?True*
• MountStatus: the current mount status for the database; possible values are Mounted / Dismounted
• IsAdminDismounted: designates whether the current dismounted status of the database is the result of administrator action; possible values are True / False
• IsAutomaticActionsAllowed: designates whether the database can be automatically activated by AM; possible values are True / False
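When troubleshooting, it can help to split such a value into named fields. The Python sketch below assumes the layout is exactly what the sample value suggests: `Key?Value` pairs, each terminated by `*`. That layout is inferred from the slide, not from a documented format, and `parse_db_state` is a hypothetical helper name.

```python
SAMPLE = (
    "IsEntryExist?True*ActiveServer?ex2*LastMountedServer?ex2*"
    "LastMountedTime?2013-07-15T22:29:39*MountStatus?Mounted*"
    "IsAdminDismounted?False*IsAutomaticActionsAllowed?True*"
)


def parse_db_state(raw):
    """Split 'Key?Value*Key?Value*' text into a dict of strings."""
    fields = {}
    for pair in raw.split("*"):
        if not pair:
            continue  # the trailing '*' leaves an empty final piece
        key, _, value = pair.partition("?")
        fields[key] = value
    return fields


if __name__ == "__main__":
    state = parse_db_state(SAMPLE)
    for key, value in state.items():
        print(f"{key:28} {value}")
```

All values stay as strings here; converting `True` / `False` or the timestamp into richer types is left out so the sketch makes no further assumptions about the format.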
Cluster Registry
• Last Log
  • Entry for each database copy in the DAG (named by the database GUID)
  • Stores the sequence number of the last generated log (in decimal)
Crimson Channel
• Applications and Services logs
  • Area of Windows Server event log used by applications for logging and internal communication
  • These logs store events from a single application or component rather than events that might have system-wide impact
  • This is referred to as an application's crimson channel
• Exchange 2013 has multiple channels
  • ActiveMonitoring
  • HighAvailability
  • MailboxDatabaseFailureItems
  • ManagedAvailability
  • PushNotifications
  • Troubleshooters
Witness Server
• An external voter that adds a tie-breaking vote to a DAG with an even number of members
  • Does not contain a full copy of quorum data
  • Represented by File Share Witness resource
• File share witness cluster resource, directory, and share are automatically created by Exchange when needed and removed by Exchange when not needed
• Uses IsAlive check for availability
  • If witness is not available, cluster core resources are failed and moved to another node
  • If another node does not bring witness resource online, the resource will remain in a Failed state, with restart attempts every 60 minutes
Witness Server Placement
• Basic guidance for placement of witness server in Exchange 2010
  • “We recommend that you use a Hub Transport server running on Microsoft Exchange Server 2010 in the Active Directory site containing the DAG. This allows the witness server and directory to remain under the control of an Exchange administrator.”
  • “If your DAG is extended to multiple datacenters, we recommend deploying the witness server in the datacenter that is considered to be the primary datacenter.”
Witness Server Placement
• Exchange 2013 guidance more complicated due to new options introduced by architectural changes
• Exchange 2013 includes support for configuration options that were not recommended or possible in previous versions of Exchange
  • A third location, such as a third physical datacenter or branch office
Witness Server Placement
• Ultimately, the placement of a DAG’s witness server depends on business requirements and the options available to the organization
Deployment Scenario: Recommendations
• Single DAG deployed in a single datacenter: locate witness server in the same datacenter as DAG members
• Single DAG deployed across two datacenters; no additional locations available: locate witness server in primary datacenter
• Multiple DAGs deployed in a single datacenter: locate witness server in the same datacenter as DAG members. Additional options include:
  • Using the same witness server for multiple DAGs
  • Using a DAG member to act as a witness server for a different DAG
• Multiple DAGs deployed across two datacenters: locate witness server in the same datacenter as DAG members. Additional options include:
  • Using the same witness server for multiple DAGs
  • Using a DAG member to act as a witness server for a different DAG
• Single or multiple DAGs deployed across more than two datacenters: locate the witness server in the datacenter where you want the majority of quorum votes to exist
Witness Server Placement
• If the organization has a third location, then the DAG’s witness server can be deployed there for automatic site resilience
  • The witness server location must have network infrastructure and connectivity that is isolated from network failures that affect the two datacenters with Exchange
• For all DAGs, the availability of the witness server should be on the Exchange administrator’s radar
Witness Server Placement
• Azure is not supported for use as a Witness Server for Exchange DAGs
  • Azure does not support the required underlying network configuration to enable an Azure file server VM to act as a witness server
  • More info at http://aka.ms/DAGAzure
Dynamic Quorum
• Windows Server 2012 Cluster (and later) feature
• Cluster quorum majority is determined by the nodes that are active members of the cluster at a given time
  • This is an important distinction from the cluster quorum in previous versions of Windows Server, where the quorum majority is fixed and based on membership
• Enabled for all clusters by default
Dynamic Quorum
• Cluster dynamically manages the vote assignment to nodes, based on the state of each node
  • When a node shuts down or crashes, the node loses its quorum vote
  • When a node successfully rejoins the cluster, it regains its quorum vote
• By dynamically adjusting the assignment of quorum votes, the cluster can increase or decrease the number of quorum votes that are required to keep running
• This enables the cluster to maintain availability during sequential node failures or shutdowns
Dynamic Quorum
• With dynamic quorum management, it is also possible for a cluster to run on the last surviving cluster node
  • By dynamically adjusting the quorum majority requirement, the cluster can sustain sequential node shutdowns to a single node
  • This is referred to as the “Last Man Standing” scenario
Dynamic Quorum
• Dynamic quorum management does not allow the cluster to sustain a simultaneous failure of a majority of voting members
• To continue running, the cluster must always have a quorum majority at the time of a node shutdown or failure
• If you explicitly remove the vote of a node, the cluster cannot dynamically add or remove that vote
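The difference between sequential and simultaneous failures can be illustrated with a toy model in Python. This is a deliberately simplified sketch, not the cluster's actual algorithm. It ignores the witness, it assumes the cluster keeps the number of voting nodes odd by withholding one vote when the node count is even, and it assumes the best case in which a non-voting node is the one that fails first. All function names are hypothetical.

```python
def majority(votes):
    """Smallest number of votes that forms a strict majority."""
    return votes // 2 + 1


def survives_static(total_nodes, failure_steps):
    """Fixed quorum: the majority is always computed from full membership."""
    alive = total_nodes
    for failed in failure_steps:
        alive -= failed
        if alive < majority(total_nodes):
            return False
    return True


def survives_dynamic(total_nodes, failure_steps):
    """Toy dynamic quorum: votes are recalculated after each survived step."""
    alive = total_nodes
    for failed in failure_steps:
        # Modeling assumption: keep the vote count odd.
        votes = alive if alive % 2 == 1 else alive - 1
        non_voting = alive - votes
        # Best-case assumption: non-voting nodes are the first to fail.
        lost_votes = max(0, failed - non_voting)
        if votes - lost_votes < majority(votes):
            return False
        alive -= failed
    return True


if __name__ == "__main__":
    sequential = [1, 1, 1, 1]   # five nodes shut down one at a time
    simultaneous = [3]          # three of five nodes fail at once
    print(survives_static(5, sequential), survives_dynamic(5, sequential))
    print(survives_static(5, simultaneous), survives_dynamic(5, simultaneous))
```

In this model, five nodes failing one at a time end with a single surviving node under dynamic quorum, while fixed quorum is lost at the third failure. Three nodes failing at once lose quorum in both cases.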
Dynamic Quorum
• Use Get-ClusterNode to verify DynamicWeight common property of Node
  • 0 = does not have quorum vote
  • 1 = has quorum vote
Get-ClusterNode <Name> | ft name, *weight, state
Name DynamicWeight NodeWeight State
---- ------------- ---------- -----
EX1  1             1          Up
• Vote assignment for all cluster nodes can be verified by using the Validate Cluster Quorum test
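If that table output is captured as text, a short script can flag nodes that currently hold no vote. The Python sketch below assumes whitespace-separated columns with no spaces inside values, as in the sample above. The EX2 row is invented for illustration, and both helper names are hypothetical.

```python
SAMPLE = """\
Name DynamicWeight NodeWeight State
---- ------------- ---------- -----
EX1  1             1          Up
EX2  0             1          Down
"""  # the EX2 row is a made-up example, not from the slide


def parse_cluster_nodes(table_text):
    """Turn the table text into one dict per node, keyed by column name."""
    lines = [ln for ln in table_text.splitlines() if ln.strip()]
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        if set(line.strip()) <= {"-", " "}:
            continue  # skip the dashed separator row
        rows.append(dict(zip(header, line.split())))
    return rows


def nodes_without_vote(rows):
    """Names of nodes whose DynamicWeight is 0 (no quorum vote right now)."""
    return [row["Name"] for row in rows if row["DynamicWeight"] == "0"]


if __name__ == "__main__":
    nodes = parse_cluster_nodes(SAMPLE)
    print(nodes_without_vote(nodes))
```

A node with NodeWeight 1 and DynamicWeight 0 has had its vote withdrawn dynamically rather than by an administrator, which is the distinction the two columns exist to show.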
Dynamic Quorum and DAGs
• Dynamic quorum does not change quorum requirements for DAGs
• Dynamic quorum does work with DAGs
  • All internal DAG testing is performed with dynamic quorum enabled
  • Dynamic quorum is enabled in Office 365 for DAG members on Windows Server 2012
• Exchange is not dynamic quorum-aware
Dynamic Quorum and DAGs
• Cluster team guidance on dynamic quorum:
  • “Selecting this option generally increases the availability of the cluster. By default the option is enabled, and it is strongly recommended to not disable this option. This option allows the cluster to continue running in failure scenarios that are not possible when this option is disabled.”
• Exchange team guidance on dynamic quorum:
  • Leave it enabled for majority of DAG members
  • Don’t factor it into availability plans
• The advantage is that, in some cases where 2008 R2 would have lost quorum, 2012 can maintain quorum; this only applies to a few cases, and should not be relied upon when planning a DAG
DAG Member Maintenance
• Basic guidance for DAG member maintenance in Exchange 2010
  • Run StartDagServerMaintenance.ps1 to put DAG member in maintenance mode
  • Perform the maintenance (e.g., install the update rollup)
  • Run StopDagServerMaintenance.ps1 to take DAG member out of maintenance mode and put it back into production
  • Optionally rebalance the DAG by using RedistributeActiveDatabases.ps1
DAG Member Maintenance
• Exchange 2013 guidance more complicated
Go into Maintenance Mode
Set-ServerComponentState <Server> -Component HubTransport -State Draining -Requester Maintenance
Set-ServerComponentState <Server> -Component UMCallRouter -State Draining -Requester Maintenance
Redirect-Message -Server <Server> -Target <FQDNTarget>
Suspend-ClusterNode <Server>
Set-MailboxServer <Server> -DatabaseCopyActivationDisabledAndMoveNow $True
Set-MailboxServer <Server> -DatabaseCopyAutoActivationPolicy Blocked
Set-ServerComponentState <Server> -Component ServerWideOffline -State Inactive -Requester Maintenance
Verify Maintenance Mode
Get-ServerComponentState <Server> | ft Component,State -Autosize
Get-MailboxServer <Server> | ft DatabaseCopy* -Autosize
Get-ClusterNode <Server> | fl
Get-Queue
DAG Member Maintenance
• Exchange 2013 guidance more complicated
Go into Production Mode
Set-ServerComponentState <Server> -Component ServerWideOffline -State Active -Requester Maintenance
Set-ServerComponentState <Server> -Component UMCallRouter -State Active -Requester Maintenance
Resume-ClusterNode <Server>
Set-MailboxServer <Server> -DatabaseCopyActivationDisabledAndMoveNow $False
Set-MailboxServer <Server> -DatabaseCopyAutoActivationPolicy Unrestricted
Set-ServerComponentState <Server> -Component HubTransport -State Active -Requester Maintenance
Verify Production Mode
Get-ServerComponentState <Server> | ft Component,State -Autosize
Get-MailboxServer <Server> | ft DatabaseCopy* -Autosize
Get-ClusterNode <Server> | fl
Get-Queue
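Because the order of these steps matters, one way to avoid typing mistakes is to render the full command list for a given server before pasting it into the Exchange Management Shell. The Python sketch below only produces text and runs nothing. The command strings are taken from the two slides above, while the `render` helper and the server and target names (`EX1`, `EX2.contoso.com`) are hypothetical.

```python
ENTER_MAINTENANCE = [
    "Set-ServerComponentState {server} -Component HubTransport -State Draining -Requester Maintenance",
    "Set-ServerComponentState {server} -Component UMCallRouter -State Draining -Requester Maintenance",
    "Redirect-Message -Server {server} -Target {target}",
    "Suspend-ClusterNode {server}",
    "Set-MailboxServer {server} -DatabaseCopyActivationDisabledAndMoveNow $True",
    "Set-MailboxServer {server} -DatabaseCopyAutoActivationPolicy Blocked",
    "Set-ServerComponentState {server} -Component ServerWideOffline -State Inactive -Requester Maintenance",
]

EXIT_MAINTENANCE = [
    "Set-ServerComponentState {server} -Component ServerWideOffline -State Active -Requester Maintenance",
    "Set-ServerComponentState {server} -Component UMCallRouter -State Active -Requester Maintenance",
    "Resume-ClusterNode {server}",
    "Set-MailboxServer {server} -DatabaseCopyActivationDisabledAndMoveNow $False",
    "Set-MailboxServer {server} -DatabaseCopyAutoActivationPolicy Unrestricted",
    "Set-ServerComponentState {server} -Component HubTransport -State Active -Requester Maintenance",
]


def render(steps, server, target=""):
    """Fill the server (and redirect target) into each command, in order."""
    return [step.format(server=server, target=target) for step in steps]


if __name__ == "__main__":
    # EX1 and EX2.contoso.com are placeholder names for illustration only.
    for line in render(ENTER_MAINTENANCE, "EX1", "EX2.contoso.com"):
        print(line)
    print()
    for line in render(EXIT_MAINTENANCE, "EX1"):
        print(line)
```

Keeping both lists side by side also makes it easy to see that ServerWideOffline is the last component taken down and the first one brought back.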
Exchange 2013 Managed Availability
• Cloud Trained
  • Bringing the learnings from the service to the enterprise
• User Focused
  • Monitoring based on the end user’s experience
• Recovery Oriented
  • Protect the user’s experience through recovery oriented computing
Cloud Trained
• 5+ Years of Directly Operating the Service
  • Since 2007, the Exchange Engineering Team has been operating a cloud version of Exchange
• Learnings Are Put Back Into the Product
  • Engineers are on-call for service related issues
  • Drives the right accountability for awareness (noise/gap ratio) and motivates the team toward auto-healing and recovery
• Scale, Auto-Deployment, Optics, High Availability are key tenets
  • Decentralized complex processing
  • Rollouts don’t require extra configuration
User Focused
• If you can’t measure it, you cannot manage it
• Availability: Can I access the service?
• Latency: How is my experience?
• Errors: Am I able to accomplish what I want?
[Figure: Customer Touch Points (Availability, Errors, Latency)]
Recovery Oriented
• Sequence 1: OWA send, OWA failure, OWA fast recovery, OWA verified as healthy
• Sequence 2: OWA send, OWA failure, OWA fast recovery, Failover server’s databases, OWA verified as healthy, Server becomes “good” failover target (again)
[Figure: LB in front of CAS-1 and CAS-2; a DAG with MBX-1, MBX-2, and MBX-3 running OWA and hosting copies of DB1 and DB2]
“stuff breaks and the Experience does not”
Monitoring Layers
System Level Checks
1. Mailbox Self Test (e.g. OWA MST) [detection 5 min]
2. Protocol Self Test (e.g. OWA PST) [detection 20 secs]
3. Proxy Self Test (e.g. OWA PrST) [detection 20 secs]
End User Experience Level Checks
4. Customer Touch Point, CTP (e.g. OWA CTP) [detection 20 min]
[Figure: layers 1 to 4 across CAS (protocol proxy) and MBX (protocol, store), on a proactive-to-reactive scale of 20s, 5min, 20min]
Probes
• PROBES
  • The key goal is to measure the customer’s perception of the service
  • These are typically synthetic end to end customer transactions
• CHECKS
  • The key goal is to measure actual customer traffic and become aware when they are experiencing issues
  • These are typically implemented as performance counters where thresholds can be set to detect spikes in customer failures
• NOTIFY
  • The key goal is to take action immediately based on a critical event
  • These are typically exceptions or conditions that can be detected without a large sample set
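The idea behind a check (a threshold on observed customer failures) can be sketched as a sliding-window failure rate in Python. This is only an illustration of the concept. The class name, window size, and threshold are hypothetical and are not taken from Managed Availability itself.

```python
from collections import deque


class FailureRateCheck:
    """Tracks recent request outcomes and alerts on a failure-rate spike."""

    def __init__(self, window_size=20, threshold=0.25):
        self.samples = deque(maxlen=window_size)  # oldest sample drops off
        self.threshold = threshold

    def record(self, succeeded):
        """Record one observed customer request (True = success)."""
        self.samples.append(bool(succeeded))

    def failure_rate(self):
        if not self.samples:
            return 0.0
        return self.samples.count(False) / len(self.samples)

    def is_alerting(self):
        """Alert only once the window is full and the rate exceeds the threshold."""
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and self.failure_rate() > self.threshold


if __name__ == "__main__":
    check = FailureRateCheck(window_size=4, threshold=0.25)
    for outcome in (True, True, True, False, False):
        check.record(outcome)
        print(check.failure_rate(), check.is_alerting())
```

Waiting for a full window before alerting reflects the contrast drawn above: a check needs a sample set, whereas a notification acts on a single critical event.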
Monitors
• Monitors query the data collected by the probes and determine if an action needs to occur based on a rule set
• Depending on the rule, a monitor can escalate or initiate a responder
• Monitors can be Healthy, Degraded, Unhealthy, Repairing, Disabled, or Unavailable
• A monitor defines the time from failure at which a responder is executed
[Figure: MONITOR (“state of the world”), ESCALATE (“take human driven action”)]
Responders
• A responder is a “plug-in” that executes a response to an alert generated by a monitor
• There are several types of responders
  • Restart Responder: terminates and restarts service
  • Reset AppPool Responder: cycles IIS application pool
  • Failover Responder: takes a MBX server out of service
  • Bugcheck Responder: initiates a bugcheck of the server
  • Offline Responder: takes a protocol on a machine out of service
  • Online Responder: places a machine back into service
  • Escalate Responder: escalates an issue
  • Specialized Component Responders
• Built-in sequencing mechanism to control recovery actions
[Figure: ESCALATE (“take human driven action”), RECOVER (“restore service or prevent failure”)]
Managed Availability Pipeline
• Sampling: Probe, Probe Definition, Probe Results (Samples), Notification Item
• Detection: Monitor, Monitor Definition, Monitor Results (Alerts)
• Recovery: Responder, Responder Definition, Responder Results (Responses)
Sequenced HA Responder Pipeline Example
• Monitor States: Healthy, T1, T2, T3
• Named Times: 00:00:00, 00:00:10, 00:00:30
• Responders: Restart Responder, Reset AppPool Responder, Failover Responder, Bugcheck Responder, Offline Responder, Escalate Responder
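The sequencing idea (a monitor moves through named states as time since failure grows, and each state can trigger a responder) can be sketched in Python. The state names and the 0 / 10 / 30 second offsets come from the slide. The pairing of each state with a particular responder is an assumption made for illustration, since the extracted slide text does not preserve it, and the function names are hypothetical.

```python
from datetime import timedelta

# (monitor state, time since failure, responder) - the responder pairing
# below is hypothetical; only the state names and offsets are from the slide.
PIPELINE = [
    ("T1", timedelta(seconds=0), "Restart Responder"),
    ("T2", timedelta(seconds=10), "Failover Responder"),
    ("T3", timedelta(seconds=30), "Escalate Responder"),
]


def monitor_state(unhealthy_for):
    """Named state for a monitor; None means no failure has been detected."""
    if unhealthy_for is None:
        return "Healthy"
    state = "Healthy"
    for name, starts_at, _responder in PIPELINE:
        if unhealthy_for >= starts_at:
            state = name
    return state


def responders_due(unhealthy_for):
    """Responders whose start time has been reached, in sequence order."""
    return [resp for _name, starts_at, resp in PIPELINE if unhealthy_for >= starts_at]


if __name__ == "__main__":
    for seconds in (0, 15, 45):
        elapsed = timedelta(seconds=seconds)
        print(seconds, monitor_state(elapsed), responders_due(elapsed))
```

The point of the ordering is that cheaper recovery actions get a chance to work before more disruptive ones, with escalation to a human as the last step.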
Managed Availability
• Exchange Server Health Summary
Get-HealthReport -Identity <ServerName>
Get-HealthReport <ServerName> -RollupGroup
Get-HealthReport <ServerName> -RollupGroup -HealthSet <HealthSetName>
(Get-DatabaseAvailabilityGroup dag1).Servers | Get-HealthReport -RollupGroup
Get-ServerHealth -Identity <ServerName> | ft Server,CurrentHealthSetState,Name,HealthSetName,AlertValue,HealthGroupName -auto
• If a Health Set is Unhealthy, find out why
Get-ServerHealth -Identity <ServerName> -HealthSet <HealthSetName>
For More Information
• Exchange Server Home Page - http://aka.ms/ExHome
• Exchange Team Blog - http://aka.ms/EHLO
• Exchange 2013 Docs - http://aka.ms/E15docs
• Exchange 2013 RelNotes - http://aka.ms/E15RelNotes
• Exchange 2013 Hybrid - http://aka.ms/E15Hybrid
• Exchange 2013 SDK - http://aka.ms/E15SDK
© 2013 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.