
How CCR and SCR provide High Availability in Exchange Server 2007 SP1

Scott Schnoll
scott.schnoll@microsoft.com
Principal Technical Writer
Exchange Server
Microsoft Corporation

Ilse Van Criekinge
Ilse.vancriekinge@GlobalKnowledge.Be
Exchange MVP, Trainer & Consultant
Microsoft Unified Communications
Global Knowledge

Agenda

• Mailbox Server High Availability Options
• CCR and SCR: Better Together
• Why CCR? Why not SCC?
• Continuous Replication Demystified
• Troubleshooting Exchange Clusters and Continuous Replication
• Known Issues

Mailbox Server High Availability Options

• Local Continuous Replication (LCR)
• Single Copy Cluster (SCC)
• Cluster Continuous Replication (CCR)
• Standby Continuous Replication (SCR)

[Diagram: SCR Sources and SCR Targets. Labels: Standalone; Standalone Mailbox Server (w/o LCR); Standby Cluster with Passive Mailbox Role]

CCR and SCR: Better Together

• CCR provides high availability for Mailbox data and services within the datacenter
• SCR replicates data remotely to provide site resilience for the Mailbox data

[Diagram: Datacenter A, Datacenter B]

CCR across 2 Sites

[Diagram: Datacenter A, Datacenter B]

CCR local / SCR to remote Site

[Diagram: Datacenter A, Datacenter B]

CCR/SCR vs SCC/Sync – 2 sites

[Diagram: Datacenter A, Datacenter B; Exchange Disaster Recovery or 3rd Party Failover. Callouts, in slide order:]

• Physical Corruption
• Undetected Physical Corruption
• 1 month later, Undetected Physical Corruption
• On full Storage or Site Failure in Primary Site, corruption is detected, must Recover from Backup
• Log corruption detected immediately on replication at both targets
• Physical Corruption
• Setup /recovercms, play logs forward
• On Site Failure in Primary Site, if corruption not detected and corrected from a test failover, must Recover from Backup

Why CCR? Why not SCC?

Single Point of Failure
  CCR: None when stretched across sites or combined with SCR for site resiliency
  SCC: Data, Storage and Site single points of failure. Potential for massive data loss on single failure:
    • Storage device failures can lose collocated backups
    • Hardware replication can propagate physical errors
    • Storage failure requires activation of remote copy if one exists
    • SCC requires two VSS clones plus a remote copy of data to achieve RPO equal to CCR

Simplicity
  CCR: Simple setup
    • No special storage configuration required
  SCC: Shared storage
    • Storage configuration before and after forming cluster
    • Complex storage stack
      • Driver mgmt
      • Cluster WCL
      • Switches
      • Multipathing
      • Queue depths

Built-in Site Resilience
  CCR: Same technology and redundancy model for intra- and inter-site protection
  SCC: Complex deployment to approach RTO/RPO of 1 CCR cluster

Why CCR? Why not SCC?

Backups
  CCR: Backups off passive copy eliminate/reduce backup window
  SCC: Backups must be off active

TCO
  CCR: Reduced TCO
    • Cheaper hardware
    • No special storage expertise required
    • In-the-box solution
    • Integrated management
    • Single operations team
    • Reduced backup cost
  SCC: Higher TCO
    • Additional products needed to achieve equivalent combined RTO/RPO
    • Separate management tools for HA operations may be required
    • Higher-end servers and storage required
    • Storage expertise needed

Large Mailboxes
  CCR: Great RTO/RPO, Simplicity, No Maintenance Window, Reduced TCO → improved support for larger mailboxes
  SCC: Higher TCO, long recovery times constrain mailbox size

Recovery time by failure

Configurations compared:
  CCR: Stretched CCR or CCR + SCR
  SCC: SCC + SCR/3rd party replication + 2 VSS clones to approach combined RTO/RPO of 1 CCR cluster

Server
  CCR: ~ 2 minutes
  SCC: ~ 2 minutes
Data or LUN
  CCR: ~ 2 minutes
  SCC: 15 min – 1 hour
Full Storage
  CCR: ~ 2 minutes
  SCC: ~ 15 min with synchronous replication; days with VSS clones only
Site
  CCR: ~ 2 minutes for Stretched CCR; 30-60 minutes for CCR + SCR
  SCC: ~ 15 min with synchronous replication; days with VSS clones only

Data loss by failure

Server
  CCR: 0 for mail*; appointment, contact, task, ...
  SCC: 0 (uses same copy of data)
Physical Corrupt DB
  CCR: 0
  SCC: Hours to days if sync repl; point in time if VSS
Physical Corrupt Log
  CCR: 0 (must reseed passive)
  SCC: N/A if log not needed; same as DB if needed
DB LUN dies
  CCR: 0
  SCC: 0 with synchronous replication; point-in-time with VSS clones
LOG LUN dies
  CCR: 0 for mail*; appointment, contact, task, ...
  SCC: 0 with synchronous replication; point-in-time with VSS clones
Full Storage
  CCR: 0 for mail*; appointment, contact, task, ...
  SCC: 0 with synchronous replication; hours to days with VSS clones only
Site
  CCR: Same as Server for Stretched CCR; 1 Log**
  SCC: 0 with synchronous replication; hours to days with VSS clone

* Assumes following best practice guidance for Transport Dumpster
** Assumes replication is keeping up

Physical Corruption

• SCC: no mechanism to detect database corruption on the copy replicated by 3rd party solutions (e.g., backups)
• SCC: no mechanism to detect log corruption on the copy replicated by 3rd party solutions (e.g., log inspection)
• With hardware-based replication, deeper stack can lead to corruption caused by:
  • HBA driver/firmware
  • Multi-path driver
  • Server hardware
  • FC switch firmware
  • Storage controller firmware/OS
  • Target storage controller firmware/OS

Logical Corruption

• Corruptions caused by the application
• Logical corruption replicated by all synchronous and asynchronous replication solutions
• SCR with lag replay can mitigate if detected early

Continuous Replication Demystified

Basic Replication Pipeline

[Diagram: Source Store → Source Log Directory → Log Copier → Inspector Directory → Log Inspector → Replica Log Directory → Log Replayer → Replica DB]

Continuous Replication Basics

• When current log file is closed, it is copied to the replication target by the Replication service
• Replication service
  • at source: creates read-only shares for log directory
  • at target: reads from the shares and pulls a copy of the log file
  • contains a ReplicaInstance for each storage group
  • Configuration discovered from Active Directory (every 30 sec for LCR/CCR, every 3 min for SCR)

Continuous Replication Basics

• Communication is done via logs, registry, cluster database and RPC
  • Logs: replicate database changes and backup status
  • Registry: used in LCR and SCR. Also in CCR for checkpointing the current log generation value for loss calculation
  • Cluster database: cluster res "Exchange Information Store Instance (CMSName)" /priv | findstr /i replay
  • RPCs: Target Replication service RPCs into Store for log truncation coordination

Lost Log Resilience (LLR)

• Designed to minimize need to reseed after lossy failover
• Database changes written to log file prior to database, and the database can be updated as soon as change is logged
• LLR modifies this behavior by delaying updates to the database until 1 or more log generations are created
• Utilizes a new log stream marker called the waypoint
  • Minimum Log Required to prevent database divergence
  • No modifications after the waypoint have been written to the database

Transaction Markers

Initiating FILE DUMP mode...
  Database: priv1.edb
  ...
  State: Dirty Shutdown
  Log Required: 2-10 (0x2-0xA)
  Log Committed: 0-20 (0x0-

• Committed: Log generation 20
• Checkpoint: Log generation 2
• Waypoint: Log generation 10
• What this means:
  • We only need logs 2-10
  • Logs 11-20 can be discarded

[Diagram labels: checkpoint, waypoint]
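The marker arithmetic on this slide can be sketched in a few lines of Python. This is only an illustration of the slide's reasoning: `split_log_generations` and its parameters are hypothetical names, not part of Exchange or eseutil, and the sketch assumes the required range runs from the checkpoint through the waypoint, as in the example above.

```python
from __future__ import annotations


def split_log_generations(checkpoint: int, waypoint: int, committed: int) -> tuple[range, range]:
    """Return (required, discardable) log generation ranges.

    Hypothetical helper mirroring the slide: generations from the checkpoint
    through the waypoint are required; generations after the waypoint, up to
    the last committed one, have not been written to the database and can be
    discarded.
    """
    if not checkpoint <= waypoint <= committed:
        raise ValueError("expected checkpoint <= waypoint <= committed")
    required = range(checkpoint, waypoint + 1)
    discardable = range(waypoint + 1, committed + 1)
    return required, discardable


# Values from the slide: checkpoint 2, waypoint 10, committed 20.
required, discardable = split_log_generations(checkpoint=2, waypoint=10, committed=20)
print(f"Need logs {required[0]}-{required[-1]}")                # Need logs 2-10
print(f"Can discard logs {discardable[0]}-{discardable[-1]}")   # Can discard logs 11-20
```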

1. Healthy CCR
2. NodeA fails and a failover to NodeB occurs
3. Validate database can mount: logs lost < AutoDatabaseMountDial
4. Logs are generated on NodeB (beyond gen21)
5. NodeA recovers and performs a divergence check
6. NodeA performs incremental reseed and copies logs
7. Healthy CCR

Log Roll Activity

• In the absence of user or database activity, ESE now also forces the active log file to close
• [15 (minutes) ÷ LLR Depth value] = Frequency of log roll activity (in minutes)

Maximum number of logs generated each day due to log roll activity

Mailbox server configuration: maximum number of logs generated per day by an idle storage group
  Stand-alone (with or without LCR): 96
  SCC: 96
  CCR: 960
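The log roll formula can be checked with a short Python sketch. The function names are hypothetical. The LLR depth values used below (10 for CCR, 1 for the other configurations) are assumptions; a depth of 10 is the value that reproduces the slide's figure of 960 logs per day for CCR.

```python
MINUTES_PER_DAY = 24 * 60
LOG_ROLL_BASE_MINUTES = 15  # numerator of the slide's formula


def log_roll_interval_minutes(llr_depth: int) -> float:
    """Frequency of log roll activity: 15 minutes divided by the LLR depth."""
    if llr_depth < 1:
        raise ValueError("LLR depth must be at least 1")
    return LOG_ROLL_BASE_MINUTES / llr_depth


def max_idle_logs_per_day(llr_depth: int) -> int:
    """Upper bound on logs an idle storage group generates per day.

    Equivalent to MINUTES_PER_DAY / interval, written with integer
    arithmetic to avoid floating-point rounding.
    """
    if llr_depth < 1:
        raise ValueError("LLR depth must be at least 1")
    return MINUTES_PER_DAY * llr_depth // LOG_ROLL_BASE_MINUTES


# Assumed depths: 10 for CCR, 1 for stand-alone and SCC.
print(log_roll_interval_minutes(10))  # 1.5
print(max_idle_logs_per_day(10))      # 960
print(max_idle_logs_per_day(1))       # 96
```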

When Do I Need A Full Reseed?

• Rarely
• Lost log past current Waypoint
  • Admin accepted large amount of loss by running Restore-StorageGroupCopy
  • Automatic mount while LLR was “not honored”
  • Automatic lossy mount with “stale” loss window calculation
• Log corruption prior to log replay
  • ESE cannot skip over logs
• Database files modified outside of Store or Replication service
  • E.g., Offline defrag, eseutil /r

Transport Dumpster

• Hub Transport servers retain messages that have been delivered to destination mailbox until size or time limit is reached
• Transport Dumpster is per storage group per Hub Transport server for servers in same Active Directory site as the storage group
• Transport Dumpster statistics:
  Get-StorageGroupCopyStatus -DumpsterStatistics
  Output:
  DumpsterServersNotAvailable: {HUB1}
  DumpsterStatistics: {HUB2(2/25/2009 10:20:37 PM; 2 ; 1032KB)}

[Diagram: transport dumpster redelivery for a CCR CMS with Active and Passive nodes hosting SG1 and SG2. HUB1 and HUB2 each hold per-SG dumpster contents (labels: SG1 Msg1; SG2 Msg1; SG1 Msg2; SG2 Msg1,Msg3; SG1 Msg2,Msg4; SG2 Msg4). The storage groups are marked "SG Resubmit Required" for HUB1,HUB2, and "Redeliver SG1,SG2" requests return Retry, timeout, or Success.]

Transport Dumpster

• How much data loss can transport dumpster mitigate?
  • 18 MB dumpster per storage group on 8 Hub Transport servers = 144 MB / storage group
  • [20 MB / 10 hour] x [100 users / SG] = 200 MB message traffic in one hour
  • Putting the above two together gives 60 min x 144 / 200 = 43.2 minutes worth of data
  • In 43.2 minutes, 144+ logs created per SG
• Customize transport dumpster size/time limit
  Set-TransportConfig -MaxDumpsterSizePerStorageGroup 30MB -MaxDumpsterTime 07.00:00:00
• No time window guarantees
  • If there are no message size limits, a single large message (e.g., 15 MB) will purge all other messages for destination storage group(s) on a given Hub Transport server
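The sizing arithmetic above can be reproduced with a small Python sketch. The function and parameter names are hypothetical, and reading "20 MB / 10 hour" as 20 MB of mail per user over a 10-hour day is an assumption. The sketch is not an Exchange sizing tool.

```python
def dumpster_coverage_minutes(
    dumpster_mb_per_sg: float,
    hub_servers: int,
    mb_per_user_per_day: float,
    hours_per_day: float,
    users_per_sg: int,
) -> float:
    """Minutes of message traffic the combined transport dumpster can hold."""
    # Total dumpster space for one storage group across all Hub Transport servers.
    total_dumpster_mb = dumpster_mb_per_sg * hub_servers
    # Message traffic for one storage group, in MB per hour.
    traffic_mb_per_hour = mb_per_user_per_day / hours_per_day * users_per_sg
    return 60 * total_dumpster_mb / traffic_mb_per_hour


# Slide values: 18 MB x 8 servers = 144 MB; 20 MB / 10 h x 100 users = 200 MB/h.
minutes = dumpster_coverage_minutes(18, 8, 20, 10, 100)
print(f"{minutes:.1f} minutes")  # 43.2 minutes
```

Raising MaxDumpsterSizePerStorageGroup to the 30 MB shown in the Set-TransportConfig example would, under the same assumptions, give 60 x 240 / 200 = 72 minutes.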

Transport Dumpster

• When CCR detects a lossy failover:
  • Expands loss window by 12 hours back and 1 hour forward
  • Finds all Hub Transport servers in the local Active Directory site
  • Requests transport dumpster redelivery from all detected servers
    • New servers not added to redelivery list
  • Inaccessible servers: CCR retries same request every 30 seconds until configured MaxDumpsterTime
  • If multiple lossy failovers take place, new loss window is added to previous one
• Restore-StorageGroupCopy on LCR is one time request, no retries
• Redelivery not triggered as part of Setup /recoverCMS
• No other ways to redeliver messages from transport dumpster
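The loss window handling described above can be sketched in Python. All names here are hypothetical. Modeling "added to previous one" as the smallest window that covers both windows is an assumption, since the slide does not say how the Replication service combines them.

```python
from datetime import datetime, timedelta

EXPAND_BACK = timedelta(hours=12)    # from the slide: 12 hours back
EXPAND_FORWARD = timedelta(hours=1)  # from the slide: 1 hour forward

Window = tuple[datetime, datetime]


def expand_loss_window(start: datetime, end: datetime) -> Window:
    """Widen a detected loss window before requesting redelivery."""
    if end < start:
        raise ValueError("window end precedes start")
    return start - EXPAND_BACK, end + EXPAND_FORWARD


def combine_windows(previous: Window, new: Window) -> Window:
    """Assumed model of 'added to previous one': cover both windows."""
    return min(previous[0], new[0]), max(previous[1], new[1])


# Two lossy failovers on the same day (illustrative timestamps).
first = expand_loss_window(datetime(2009, 2, 25, 10, 0), datetime(2009, 2, 25, 10, 5))
second = expand_loss_window(datetime(2009, 2, 25, 14, 0), datetime(2009, 2, 25, 14, 2))
print(combine_windows(first, second))
```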

Redundant Networks

• Use for log shipping and seeding in CCR
  • Enable-ContinuousReplicationHostName
  • Seeding: Update-StorageGroupCopy -DataHostNames:Host1,Host2
  • Get-ClusteredMailboxServerStatus
    OperationalReplicationHostNames:
    FailedReplicationHostNames:
    InUseReplicationHostNames:
• Watch out for misconfigured hosts file

Circular Logging

• One configuration setting with two consumers
  • Store service: requires database to be dismounted and re-mounted to take effect
  • Replication service: picks up new setting dynamically
• In CCR, it’s no big deal to switch between on/off/on
• In some settings, logs are deleted prematurely
  • Example: turn off circular logging, then enable LCR without dismount/mount of database
  • ESE is still doing log truncation with circular logging logic
  • Logs will get truncated before making it to the LCR copy
• To be safe follow this recipe:
  • Suspend, dismount, change setting, mount, resume

Troubleshooting Exchange Clusters and Continuous Replication

Troubleshooting Replication & Failover

• Get-StorageGroupCopyStatus
• Test-ReplicationHealth
• Cluster Log
• Get-ClusteredMailboxServerStatus
• Getscrsources.ps1
• Test-Mailflow
• Application Event Log – Replication events
  • Get-EventLogLevel -id:"MSExchange Repl" | Set-EventLogLevel -Level expert
  • Get-EventLogLevel -id:"MSExchange Cluster" | Set-EventLogLevel -Level expert
• System Event Log – Cluster events
• Active Directory management tools
• Network Monitor

Troubleshooting: Get-StorageGroupCopyStatus

[Screenshot callouts:]

• CopyQueueLength = LastLogCopyNotified – LastLogCopied
• Time stamp on source SG of most recent log
• Time of source's most recent log known to copy
• Time stamp on source SG of last successful log copy
• Must use -DumpsterStatistics option to get these values

Troubleshooting: Test-ReplicationHealth

• ClusterNetwork:
  • Checks connectivity of all network interfaces
  • Checks cluster group is up
  • Warns in multi subnet topologies since not all cluster networks can be up at the same time
• SGCopyQueueLength: Warns at 3 and Errors at 6
• SGReplayQueueLength: Warns at 30 and Errors at 60
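The queue thresholds above can be expressed as a small Python sketch. The names are hypothetical and this is not how Test-ReplicationHealth is implemented. It assumes that "warns at 3" means a value of 3 or more, and it takes the copy queue length to be LastLogCopyNotified minus LastLogCopied, the difference shown on the Get-StorageGroupCopyStatus slide.

```python
# Thresholds from the slide: (warn_at, error_at).
THRESHOLDS = {
    "SGCopyQueueLength": (3, 6),
    "SGReplayQueueLength": (30, 60),
}


def copy_queue_length(last_log_copy_notified: int, last_log_copied: int) -> int:
    """Logs the source has announced that the target has not copied yet."""
    return last_log_copy_notified - last_log_copied


def classify(check: str, queue_length: int) -> str:
    """Return 'Passed', 'Warning' or 'Error' for a queue length.

    Assumes inclusive thresholds (a value equal to the threshold triggers it).
    """
    warn_at, error_at = THRESHOLDS[check]
    if queue_length >= error_at:
        return "Error"
    if queue_length >= warn_at:
        return "Warning"
    return "Passed"


# Illustrative log generation numbers: 105 announced, 100 copied.
backlog = copy_queue_length(last_log_copy_notified=105, last_log_copied=100)
print(backlog, classify("SGCopyQueueLength", backlog))  # 5 Warning
print(classify("SGReplayQueueLength", 12))              # Passed
```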

Troubleshooting: Cluster Log

• Windows Server 2003
  • %windir%\cluster\cluster.log
  • Logs are always appended to this file
• Windows Server 2008
  • Must generate the cluster log file

cluster.exe [[/CLUSTER:]cluster-name] LOG <options>
<options> =
  /G[EN[ERATE]] [/COPY[:"directory"]] [/NODE:"node-name"] [/SPAN[MIN[UTE[S]]]:min]
  /SIZE:logsize-MB
  /LEVEL:logLevel

• If /COPY is not specified, the log is written to %windir%\Cluster\Reports\Cluster.log
• If /NODE is not specified, a log file is generated on every node
• /SIZE must be between 8 and 1024 MB
• /LEVEL must be between 0 and 10

Troubleshooting: Server Failover but Databases Didn’t Mount

• Steps to troubleshoot:
  1. Run Get-StorageGroupCopyStatus
  2. Check the log directories on Active and Passive
  3. Run Restore-StorageGroupCopy and then Mount-Database

Troubleshooting: Log File Corrupted

• Steps to troubleshoot:
  • Run Get-StorageGroupCopyStatus and/or Test-ReplicationHealth
  • Reseed passive copy / SCR target: first run Suspend-StorageGroupCopy
  • Then run Update-StorageGroupCopy on the passive node or SCR target

Troubleshooting: SMB File Share for Replication Missing

• Steps to troubleshoot (SCR/CCR)
  1. Run Test-ReplicationHealth on Passive
  2. Run Get-StorageGroupCopyStatus on Passive
  3. Run Get-ClusteredMailboxServerStatus
  4. Verify share on Active Node
  5. Stop sharing the file share; the Replication service recreates it in 30 seconds
  6. Run Test-ReplicationHealth on Active
  7. Check Application Event Log
  8. Check Active Directory Permissions

Known Issues

• Update Rollup 5 (RU5) for Exchange 2007 SP1 can cause Enable-StorageGroupCopy to fail in an SCR topology that consists of a parent and child domain structure:
  “Standby continuous replication is not supported between computers in different Active Directory domains. The target node is in domain <child domain> which is different from the source domain of <parent domain>”
• Workarounds
  • Uninstall RU5, enable SCR, re-install RU5
  • Use a Management Console running pre-RU5
• Exchange12 bug 152967
• Expected fix in RU7 for Exchange 2007 SP1

Known Issues

• Network shares get deleted and created every 5 minutes by the replication service on a Windows 2008 SCC when SCR is enabled
• Replication service share names intermittently disappear from the cluster, causing replication status to repeatedly switch back and forth between failed and healthy states
  • Test-ReplicationHealth on SCR target may succeed showing all tests passed
  • Get-StorageGroupCopyStatus on SCR target shows status Healthy, Initializing or Failed
  • No events on source, but SCR target will log ESE event 522 in the application event log
• Exchange12 bug 146483
• Expected fix in RU7 for Exchange 2007 SP1

Known Issues

• When running VSS backup, ESE event 522 is logged on the passive node
  • Event is logged on resuming a suspended storage group
  • Event log fills
• Event message details:
  Microsoft.Exchange.Cluster.ReplayService (7012) Log Verifier e0a 31573001: An attempt to open the device name "\\source\share$" containing "\\source\share$\" failed with system error 5 (0x00000005): "Access is denied. ". The operation will fail with error -1032 (0xfffffbf8).
• Workaround
  • If Get-StorageGroupCopyStatus is healthy for storage groups, ignore the event
  • If Test-ReplicationHealth passes all tests, ignore the event

Known Issues

• Reseed fails when you restore 1 full backup and then more than 2 differential backups.

• Restoring to active node can succeed, but CCR no longer works after recovery.

• Workaround
  • Take full backup when restore is finished (note this may not be practical with large databases)

Key Takeaways

• Exchange 2007 includes several Mailbox Server availability configurations

• CCR+SCR provide higher availability at a lower cost than any other solution

• There are a number of cmdlets and tools that can be used for troubleshooting and managing continuous replication

• LLR minimizes need for full reseeds
• Transport Dumpster redelivers all routed mail after failover
• CCR addresses all ranges of failures from disk to full site
• CCR on DAS provides great RTO and RPO

Thank You!

Ilse.VanCriekinge@GlobalKnowledge.Be

Pro-Exchange

• Here this week!
• Visit our usergroup booth
• Core members
  • Ilse Van Criekinge
  • Johan Delimon
  • Tonino Bruno
• http://www.proexchange.be

© 2009 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
