transcript
- Slide 1
- Slide 2
- Slide 3
- Slide 4
- Slide 5
- Slide 6
- Slide 7
- \Web Service(Default Web Site)\Current Connections
- Slide 8
- \MSExchange Active Manager(_total)\Database Mounted
- Slide 9
- Slide 10
- Slide 11
- Slide 12
- Slide 13
- Large Organization
  Configuration: 36 cores / 450 GB RAM per server; higher mailbox density; Exchange 2013 deployed in All-In-One configuration; hardware NLB configured for Least Connections.
  What happened? A policy change required removal of local storage of email; Outlook was now required to run in Online Mode.
  Impact: increase in network traffic; users frequently disconnected during peak periods; ~2 weeks to isolate the problem; ~2 weeks to get remediation changes in place.
- Slide 14
- [Diagram: Exchange 2013 All-in-One servers behind a Virtual IP on the Network Load Balancer; 40k users; Exchange.cohovineyard.com. Numbered connection markers omitted.]
- Slide 15
- [Diagram: Exchange 2013 All-in-One servers behind a Virtual IP on the Network Load Balancer; 40k users; Exchange.cohovineyard.com. A "!" warning marker appears in the diagram. Numbered connection markers omitted.]
- Slide 16
- [Diagram: Exchange 2013 All-in-One servers behind a Virtual IP on the Hardware NLB; 40k users; Exchange.cohovineyard.com. Numbered connection markers omitted.]
- Slide 17
- [Diagram: RPC over HTTP request path]
  /RPC -> http.sys (Port 443) -> IIS -> MSExchangeRpcProxyFrontEndAppPool (W3WP): RpcHttp, HttpProxy (Lookup Active Mailbox Location)
  -> http.sys (Port 444) -> IIS -> MSExchangeRpcProxyAppPool (W3WP): RpcHttp
  -> Port 6001 -> RPC Client Access (M.E.RpcClientAccess) -> Store Worker (M.E.Store.Worker) -> MBxDB
- Slide 18
- [Diagram: IIS request queuing] IIS Connection Manager; Request Router; /RPC:443 W3WP Queue; /RPC:444 W3WP Queue; MSExchangeRpcProxyFrontEndAppPool (W3WP) with System.Web Buffer; Max 65535 Requests; Buffer; Thread; Managed Availability. Numbered connection markers omitted.
- Slide 19
- IIS log: inetpub\logs\LogFiles\W3SVC1\u_exXXXXXX.log
  Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) cs(Referer) sc-status sc-substatus sc-win32-status time-taken
  2014-07-21 07:59:44 192.168.1.1 RPC_IN_DATA /rpc/rpcproxy.dll 8416409b-081e-4fe8-9200-7e54d8874d7c@cohovineyard.com:6001&RequestId=fc60c175-9c77-47d0-b435-ae3d04acea1b 443 COHOVINEYARD\SM_4f3083c2bd6a40d8b 192.168.1.5 MSRPC - 200 0 64 29513
  HTTPERR log: inetpub\logs\LogFiles\W3SVC1\httperrXXXXX.log
  Fields: date time c-ip c-port s-ip s-port cs-version cs-method cs-uri sc-status s-siteid s-reason s-queuename
  2014-07-21 07:59:44 192.168.1.5 16045 192.168.1.1 444 HTTP/1.1 RPC_IN_DATA /rpc/rpcproxy.dll?COHO-EXCH.cohovineyard.com:6001 400 2 Connection_Dropped MSExchangeRpcProxyAppPool
  2014-07-21 07:59:44 192.168.1.5 16045 192.168.1.1 443 HTTP/1.1 RPC_IN_DATA /rpc/rpcproxy.dll?8416409b-081e-4fe8-9200-7e54d8874d7c@COHO-EXCH.cohovineyard.com:6001 - 1 Connection_Dropped_List_Full MSExchangeRpcProxyAppPool
  IIS is indicating that it cannot hand off the connection because the queue is full.
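The HTTPERR field list on this slide is enough to tally drop reasons per application pool queue. The following is a minimal Python sketch, assuming whitespace-separated entries in exactly the thirteen-field order shown on the slide; the function names are hypothetical and the two sample lines are the slide's entries with the extraction spacing removed.

```python
from collections import Counter
from typing import Dict, Iterable, List

# Field order as listed on the slide for the HTTPERR log.
HTTPERR_FIELDS: List[str] = [
    "date", "time", "c-ip", "c-port", "s-ip", "s-port", "cs-version",
    "cs-method", "cs-uri", "sc-status", "s-siteid", "s-reason", "s-queuename",
]


def parse_httperr_line(line: str) -> Dict[str, str]:
    """Split one HTTPERR entry into a field-name -> value mapping."""
    values = line.split()
    if len(values) != len(HTTPERR_FIELDS):
        raise ValueError(f"expected {len(HTTPERR_FIELDS)} fields, got {len(values)}")
    return dict(zip(HTTPERR_FIELDS, values))


def count_reasons(lines: Iterable[str]) -> Counter:
    """Count (queue name, reason) pairs, skipping blank and '#' header lines."""
    counts: Counter = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        entry = parse_httperr_line(line)
        counts[(entry["s-queuename"], entry["s-reason"])] += 1
    return counts


sample_lines = [
    "2014-07-21 07:59:44 192.168.1.5 16045 192.168.1.1 444 HTTP/1.1 RPC_IN_DATA "
    "/rpc/rpcproxy.dll?COHO-EXCH.cohovineyard.com:6001 400 2 "
    "Connection_Dropped MSExchangeRpcProxyAppPool",
    "2014-07-21 07:59:44 192.168.1.5 16045 192.168.1.1 443 HTTP/1.1 RPC_IN_DATA "
    "/rpc/rpcproxy.dll?8416409b-081e-4fe8-9200-7e54d8874d7c@COHO-EXCH.cohovineyard.com:6001 "
    "- 1 Connection_Dropped_List_Full MSExchangeRpcProxyAppPool",
]

reason_counts = count_reasons(sample_lines)
for (queue, reason), total in sorted(reason_counts.items()):
    print(queue, reason, total)
```

A rising count of Connection_Dropped_List_Full for one queue is the signal the slide calls out: the queue is full and IIS cannot hand off the connection.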
- Slide 20
- Log locations, file names, and Perfmon counters by component (in request order):
  IIS: location inetpub\logs\LogFiles\W3SVC1; file names u_exXXXXXX.log, httperrXXXXX.log; counter \Web Service(Default Web Site)\Current Connections
  RpcHttp: location Logging\RpcHttp\W3SVC1; file name RpcHttpXXXXXXXX-X.log; counter \RPC/HTTP Proxy\Current Number of Incoming RPC over HTTP Connections
  HttpProxy: location Logging\HttpProxy\RpcHttp; file name HttpProxyXXXXXXXXXX-X.log; counter \MSExchange HttpProxy\Accepted Connection Count
  IIS: location Inetpub\logs\LogFiles\W3SVC2; file names u_exXXXXXX.log, httperrXXXXX.log; counter \Web Service(Exchange Back End)\Current Connections
  RpcHttp: location Logging\RpcHttp\W3SVC2; file name RpcHttpXXXXXXXX-X.log; counter \RPC/HTTP Proxy\Current Number of Incoming RPC over HTTP Connections
  RPC Client Access: location Logging\RPC Client Access; file name RCA_XXXXXXXXXX-X.log; counter \MSExchange RpcClientAccess\Current Connections
- Slide 21
- Network, CPU, Memory, Storage
- Slide 22
- Network (Requests)
  \Web Service(Default Web Site)\Current Connections
  \MSExchangeIS Store(*)\RPC Average Latency: < 100 ms
  \MSExchangeIS Client Type(*)\RPC Average Latency: < 100 ms
  \MSExchangeIS Store(*)\RPC Operations/sec
  \MSExchangeIS Client Type(*)\RPC Operations/sec
  Overall RPC Average Latency is not impacted.
  CAS Experience
  MoMT: \MSExchange RpcClientAccess\RPC Averaged Latency; \MSExchange RpcClientAccess\RPC Operations/sec
  EAS: \MSExchange ActiveSync\Requests/sec; \MSExchange ActiveSync\Current Requests
  EWS: \MSExchangeWS\Average Response Time; \MSExchangeWS\Requests/sec
  OWA: \MSExchange OWA\Average Response Time; \MSExchange OWA\Average Search Time; \MSExchange OWA\Requests/sec
  POP: \MSExchangePop3(*)\Average LDAP Latency; \MSExchangePop3(*)\Average RPC Latency; \MSExchangePop3(*)\Request Rate
  IMAP: \MSExchangeImap4(*)\Average LDAP Latency; \MSExchangeImap4(*)\Average RPC Latency; \MSExchangeImap4(*)\Request Rate
  Management / Background Ops
  PS: \MSExchangeRemotePowershell\Current Connection Sessions; \MSExchangeRemotePowershell\Current Connected Unique Users
- Slide 23
- Memory (Exchange Process Usage)
  \Memory\% Committed Bytes in Use: < 80%
  \Memory\Available MBytes: > 5% of RAM
  \.NET CLR Memory(*)\% Time in GC: should be below 10% on average
  \.NET CLR Exceptions(*)\# of Exceps Thrown / sec: should be less than 5% of total requests per second (RPS) (\Web Service(_Total)\Connection Attempts/sec * .05)
  \.NET CLR Memory(*)\# Bytes in all Heaps: only 30% bytes committed memory (WorkstationGC to ServerGC)
  \.NET CLR Memory\Allocated Bytes/sec: sustained > 50 MB
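Two of the memory thresholds on this slide are derived rather than fixed: available memory is relative to installed RAM, and the exception rate is relative to connection attempts per second. The following Python sketch shows that arithmetic under stated assumptions: the dictionary keys and the function name are hypothetical, and the sample values are invented for illustration (only the 450 GB RAM figure comes from the earlier configuration slide).

```python
from typing import Dict, List


def memory_findings(sample: Dict[str, float], total_ram_mb: float) -> List[str]:
    """Compare one averaged counter sample against the slide's memory guidance."""
    findings: List[str] = []

    if sample["committed_bytes_in_use_pct"] >= 80:
        findings.append("% Committed Bytes in Use is not below 80%")

    # Available MBytes should stay above 5% of installed RAM.
    min_available_mb = 0.05 * total_ram_mb
    if sample["available_mbytes"] <= min_available_mb:
        findings.append(
            f"Available MBytes {sample['available_mbytes']:.0f} is not above "
            f"{min_available_mb:.0f} (5% of RAM)"
        )

    if sample["time_in_gc_pct"] >= 10:
        findings.append("% Time in GC is not below 10% on average")

    # Exceptions/sec should stay below 5% of Connection Attempts/sec.
    max_exceptions = 0.05 * sample["connection_attempts_per_sec"]
    if sample["exceptions_per_sec"] >= max_exceptions:
        findings.append(
            f"Exceptions/sec {sample['exceptions_per_sec']:.1f} is not below "
            f"{max_exceptions:.1f} (5% of connection attempts/sec)"
        )
    return findings


# Illustrative values only; 450 GB RAM matches the first case study's servers.
total_ram_mb = 450 * 1024
sample = {
    "committed_bytes_in_use_pct": 62.0,
    "available_mbytes": 18000.0,
    "time_in_gc_pct": 4.0,
    "connection_attempts_per_sec": 400.0,
    "exceptions_per_sec": 35.0,
}
results = memory_findings(sample, total_ram_mb)
for finding in results:
    print(finding)
```

With these made-up numbers the available-memory floor is 23040 MB and the exception ceiling is 20 per second, so both of those checks are reported.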
- Slide 24
- Storage (Exchange I/O)
  \MSExchange Active Manager(_total)\Database Mounted: balanced across all MBX servers
  \MSExchange Database ==> Instances(*)\I/O Database Reads (Attached) Average Latency: < 20 ms
  \MSExchange Database ==> Instances(*)\I/O Database Writes (Attached) Average Latency: < 50 ms
  \MSExchange Database ==> Instances(*)\I/O Log Writes Average Latency: < 10 ms
  \MSExchange Database ==> Instances(*)\I/O Database Reads (Recovery) Average Latency: < 200 ms
  \MSExchange Database ==> Instances(*)\I/O Database Writes (Recovery) Average Latency: < read latency for same instance as above
  I/O is acceptable.
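The storage guidance mixes fixed limits with one relative rule (recovery writes should be faster than recovery reads on the same instance). A small Python sketch of that comparison follows; the function name and the sample latencies are hypothetical, and the counter names are shortened to the part after the instance.

```python
from typing import Dict, List

# Fixed limits in milliseconds, taken from the slide.
FIXED_LIMITS_MS: Dict[str, float] = {
    "I/O Database Reads (Attached) Average Latency": 20,
    "I/O Database Writes (Attached) Average Latency": 50,
    "I/O Log Writes Average Latency": 10,
    "I/O Database Reads (Recovery) Average Latency": 200,
}
RECOVERY_READS = "I/O Database Reads (Recovery) Average Latency"
RECOVERY_WRITES = "I/O Database Writes (Recovery) Average Latency"


def storage_findings(sample: Dict[str, float]) -> List[str]:
    """Return the counters of one database instance that miss the guidance."""
    findings: List[str] = []
    for counter, limit in FIXED_LIMITS_MS.items():
        value = sample.get(counter)
        if value is not None and value >= limit:
            findings.append(f"{counter}: {value} ms (should be < {limit} ms)")

    # Relative rule: recovery writes should be below recovery reads.
    reads = sample.get(RECOVERY_READS)
    writes = sample.get(RECOVERY_WRITES)
    if reads is not None and writes is not None and writes >= reads:
        findings.append(
            f"{RECOVERY_WRITES}: {writes} ms (should be < read latency {reads} ms)"
        )
    return findings


# Invented sample for a single instance.
instance_sample = {
    "I/O Database Reads (Attached) Average Latency": 12.0,
    "I/O Database Writes (Attached) Average Latency": 65.0,
    "I/O Log Writes Average Latency": 4.0,
    "I/O Database Reads (Recovery) Average Latency": 150.0,
    "I/O Database Writes (Recovery) Average Latency": 180.0,
}
storage_results = storage_findings(instance_sample)
for finding in storage_results:
    print(finding)
```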
- Slide 25
- CPU (Exchange Processes)
  \Processor(_Total)\% Processor Time: should be less than 75% on average.
  \Processor(_Total)\% Privileged Time (kernel): should be less than 75% on average.
  \Processor(_Total)\% User Time: should be less than 75% on average.
  \Process(*)\% Processor Time
  \System\Processor Queue Length (all instances): shouldn't be greater than 5 per processor.
  w3wp#3 shows high CPU; w3wp#3 is the MSExchangeRpcProxyFrontEndAppPool.
- Slide 26
- Call stack (most recent frame first, thread start last):
  ntdll!ZwWaitForMultipleObjects
  KERNELBASE!WaitForMultipleObjectsEx
  clr!WaitForMultipleObjectsEx_SO_TOLERANT
  clr!Thread::DoAppropriateAptStateWait
  clr!Thread::DoAppropriateWaitWorker
  clr!Thread::DoAppropriateWait
  clr!CLREventBase::WaitEx
  clr!AwareLock::EnterEpilogHelper
  clr!AwareLock::EnterEpilog
  clr!AwareLock::Contention
  clr!JITutil_MonContention
  System_Web_ni!System.Web.BufferAllocator.GetBuffer()
  System_Web_ni!System.Web.Hosting.RecyclableArrayHelper.GetIntPtrArray(Int32)
  System_Web_ni!System.Web.Hosting.IIS7WorkerRequest.FlushCachedResponse(Boolean)
  System_Web_ni!System.Web.HttpResponse.UpdateNativeResponse(Boolean)
  System_Web_ni!System.Web.HttpResponse.Flush(Boolean, Boolean)
  System_Web_ni!System.Web.HttpWriter.WriteFromStream(Byte[], Int32, Int32)
  mscorlib_ni!System.IO.Stream. b__11(System.Object)
  mscorlib_ni!System.Threading.Tasks.Task`1[[System.Boolean, mscorlib]].InnerInvoke()
  mscorlib_ni!System.Threading.Tasks.Task.Execute()
  mscorlib_ni!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
  mscorlib_ni!System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
  mscorlib_ni!System.Threading.Tasks.Task.ExecuteWithThreadLocal(System.Threading.Tasks.Task ByRef)
  mscorlib_ni!System.Threading.Tasks.Task.ExecuteEntry(Boolean)
  mscorlib_ni!System.Threading.ThreadPoolWorkQueue.Dispatch()
  clr!CallDescrWorkerInternal
  clr!CallDescrWorkerWithHandler
  clr!MethodDescCallSite::CallTargetWorker
  clr!MethodDescCallSite::Call_RetBool
  clr!QueueUserWorkItemManagedCallback
  clr!ManagedThreadBase_DispatchInner
  clr!ManagedThreadBase_DispatchMiddle
  clr!ManagedThreadBase_DispatchOuter
  clr!ManagedThreadBase_DispatchInCorrectAD
  clr!Thread::DoADCallBack
  clr!ManagedThreadBase_DispatchInner
  clr!ManagedThreadBase_DispatchMiddle
  clr!ManagedThreadBase_DispatchOuter
  clr!ManagedThreadBase_FullTransitionWithAD
  clr!ManagedThreadBase::ThreadPool
  clr!ManagedPerAppDomainTPCount::DispatchWorkItem
  clr!ThreadpoolMgr::ExecuteWorkRequest
  clr!ThreadpoolMgr::WorkerThreadStart
  clr!Thread::intermediateThreadProc
  kernel32!BaseThreadInitThunk
  ntdll!RtlUserThreadStart
- Slide 27
- Source From:
http://referencesource.microsoft.com/#System.Web/BufferAllocator.cs
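The call stack shows threads waiting on a monitor lock inside System.Web.BufferAllocator.GetBuffer(). The Python sketch below is only an analogue of that general pattern (a shared buffer pool guarded by a single lock), not a translation of BufferAllocator.cs; the class name, method names, and sizes are all hypothetical. It illustrates why every request thread that needs a buffer serializes on the same lock.

```python
import threading
from typing import List


class BufferPool:
    """Hypothetical shared pool of reusable buffers guarded by one lock."""

    def __init__(self, buffer_size: int, max_free: int) -> None:
        self._buffer_size = buffer_size
        self._max_free = max_free
        self._free: List[bytearray] = []
        self._lock = threading.Lock()

    def get_buffer(self) -> bytearray:
        # Every caller takes the same lock, so busy threads queue up here.
        with self._lock:
            if self._free:
                return self._free.pop()
        return bytearray(self._buffer_size)

    def reuse_buffer(self, buf: bytearray) -> None:
        # Returned buffers are kept only up to max_free; extras are dropped.
        with self._lock:
            if len(self._free) < self._max_free:
                self._free.append(buf)

    @property
    def free_count(self) -> int:
        with self._lock:
            return len(self._free)


def worker(pool: BufferPool, rounds: int) -> None:
    """Simulate a request thread that borrows and returns a buffer."""
    for _ in range(rounds):
        buf = pool.get_buffer()
        buf[0] = 1
        pool.reuse_buffer(buf)


shared_pool = BufferPool(buffer_size=4096, max_free=16)
threads = [threading.Thread(target=worker, args=(shared_pool, 1000)) for _ in range(8)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
print("buffers left in pool:", shared_pool.free_count)
```

The more threads run concurrently, the more time each spends waiting for the single lock instead of doing work, which is consistent with the case study's resolution of reducing core counts.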
- Slide 28
- Investigation: ~4 weeks; preferred architecture not followed; customer scaled beyond tested configuration; NLB algorithm not optimized for Exchange load profile.
  Resolution: Least Connection / Slow Start on hardware LB; reduced cores < 20; scalability improvements coming: .NET 4.6 (in preview).
  Failure cycle: large number of connections to server in short timeframe -> RpcProxy FrontEnd AppPool requests backlogged -> Managed Availability probe fails -> Managed Availability restarts service -> network load balancer takes server out of rotation -> network load balancer adds server to rotation.
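The failure cycle on this slide follows from how a least-connections algorithm treats a server that has just rejoined the pool with zero connections. The following Python simulation is a simplified sketch: the server names and connection counts are invented, and the slow-start rule (a fixed cap on how many new connections the rejoined server may take during the window) is an assumption, since real load balancers implement slow start in vendor-specific ways.

```python
from typing import Dict, Optional


def distribute(
    connections: Dict[str, int],
    new_connections: int,
    rejoined: Optional[str] = None,
    slow_start_cap: Optional[int] = None,
) -> Dict[str, int]:
    """Assign new connections by least-connections; return the count each server received."""
    current = dict(connections)
    received = {name: 0 for name in connections}
    for _ in range(new_connections):
        # Hypothetical slow start: skip the rejoined server once it hits its cap.
        candidates = [
            name
            for name in current
            if not (
                name == rejoined
                and slow_start_cap is not None
                and received[name] >= slow_start_cap
            )
        ]
        # Least connections, with the server name as a deterministic tie-breaker.
        target = min(candidates, key=lambda name: (current[name], name))
        current[target] += 1
        received[target] += 1
    return received


# ex4 has just been added back to rotation with no connections.
pool = {"ex1": 10000, "ex2": 10000, "ex3": 10000, "ex4": 0}

burst = distribute(pool, 1000, rejoined="ex4")
ramped = distribute(pool, 1000, rejoined="ex4", slow_start_cap=100)
print("least connections only:", burst)
print("with slow-start cap:   ", ramped)
```

Without the cap, all 1000 new connections land on the rejoined server in a short timeframe, which is the first step of the cycle above; with the cap, the remainder is spread across the other servers.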
- Slide 29
- Large Organization
  Configuration: 16 cores / 92 GB RAM per server; Exchange 2013 deployed in All-In-One configuration; NLB configured for Round Robin.
  What happened? File writes failing, MA probe failures, MDB failovers; encountered bug with anti-virus; failed to deploy recommended fixes prior to migration; exposed new bug.
  Impact: users frequently disconnected during peak periods; ~8 weeks to isolate the problem; ~3 weeks to get fix and configuration changes in place.
- Slide 30
- [Diagram: I/O path] RpcHttp; HttpProxy; RpcHttp; IIS; RPC Client Access; Store Worker; IIS; I/O Manager; File System Driver; Anti-Virus Filter Driver (Is valid file to scan?); Device Driver; Mini-Port Driver; MBxDB.
  Stalled I/O delays client responses (dump showed a 6-minute lock).
  Continued delayed or stalled I/O forces MA to move databases.
- Slide 31
- Managed Availability
  Goals: bring Office 365 capabilities on-premises; monitor based upon end user experience; focus on recovery-oriented computing.
  Components: probes test components and user experience; monitors analyze probe(s) for Pass/Fail; responders take action based upon monitor results.
  When troubleshooting: monitor failures are a signal of a problem; consistent failures can force a bluescreen.
  Monitors: services, performance counters, event logs.
  Probes: OutlookProxyTestProbe, OutlookRpcSelfTestProbe, OutlookRpcCtpProbe.
  Responders: Restart, Reset AppPool, Failover, BugCheck, Offline, Escalate (diagram label: MBX).
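The probe, monitor, and responder relationship can be sketched as a small state machine. This Python model is an illustration only: the responder names come from the slide, but the escalation order, the two-failures-per-step threshold, and the function names are assumptions and do not describe how Managed Availability is actually implemented.

```python
from typing import Iterable, List, Optional

# Responder names from the slide; treating them as an ordered ladder is an assumption.
RESPONDERS: List[str] = [
    "Restart", "Reset AppPool", "Failover", "BugCheck", "Offline", "Escalate",
]


def responder_for(consecutive_failures: int, failures_per_step: int = 2) -> Optional[str]:
    """Pick a responder once enough consecutive probe failures accumulate."""
    if consecutive_failures == 0 or consecutive_failures % failures_per_step != 0:
        return None
    step = consecutive_failures // failures_per_step - 1
    return RESPONDERS[min(step, len(RESPONDERS) - 1)]


def run_monitor(probe_results: Iterable[bool]) -> List[str]:
    """Feed probe pass/fail results through the monitor and collect responder actions."""
    actions: List[str] = []
    consecutive = 0
    for passed in probe_results:
        # A passing probe resets the failure streak.
        consecutive = 0 if passed else consecutive + 1
        action = responder_for(consecutive)
        if action is not None:
            actions.append(action)
    return actions


# True = probe passed, False = probe failed.
history = [True, False, False, False, False, True, False, False]
actions_taken = run_monitor(history)
print(actions_taken)
```

In this model a passing probe resets escalation, so a service that keeps failing shortly after each recovery will be restarted repeatedly, which is the kind of pattern worth looking for when troubleshooting.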
- Slide 32
- Storage: some database I/O latencies, but overall I/O is fairly healthy.
- Slide 33
- CPU: w3wp#11 CPU utilization running hot? The server appears to be busy, but it is uncertain whether this is normal or a bug.
- Slide 34
- Memory: massive growth in the memory footprint of the w3wp#11 process throughout the day. Private Bytes reached 10 GB+ before restarting. W3WP process ID = 62192.
- Slide 35
- AppDomain: used to enable isolation within a process. 3 AppDomains by default; a normal W3WP for Exchange has 3-4 AppDomains. Created as a result of a config change. Exchange leak in W3SVC/1 = MSExchangeRpcProxyFrontEndAppPool.
  Process Explorer: view AppDomains and other .NET stats for running processes.
- Slide 36
- Outlook Anywhere: servicelets are used by Exchange for minor tasks. RPCHTTPServicelet runs every 15 minutes. RPCHTTPServicelet was writing an update to the Default Web Site/Rpc site, from SSL to None, on every run. What was causing this change to be continually updated?
- Slide 37
- [Diagram: AppDomains inside the W3WP process] MSExchangeRPCAppPool; MSExchange Services Host; System AppDomain; Default AppDomain; Front-End AppDomain; Back-End AppDomain; Binaries, Config, Heaps; AppDomain (~125 MB at startup); Connections; RPC Client Access; Store Worker Instance; MBxDB.
  Every 15 min: Set SSLOffloading = true. Multiple Front-End AppDomains are shown, labeled +100 users, +50 users, +60 users, +200 users.
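The figures in this case study allow a rough back-of-the-envelope estimate of how quickly the leak could reach the observed footprint. The Python sketch below assumes that each 15-minute run leaves behind one extra AppDomain and that each leaked AppDomain costs only its ~125 MB startup size; both are simplifying assumptions, since AppDomains holding user connections would be larger and the process has a baseline footprint.

```python
import math
from typing import Tuple


def time_to_reach(
    target_mb: float,
    per_domain_mb: float = 125.0,   # ~125 MB per AppDomain at startup (from the slide)
    interval_min: float = 15.0,     # one config write every 15 minutes (from the slide)
) -> Tuple[int, float]:
    """Return (leaked AppDomains, hours) needed to accumulate target_mb."""
    domains = math.ceil(target_mb / per_domain_mb)
    hours = domains * interval_min / 60
    return domains, hours


# 10 GB of Private Bytes, the level reported before the process restarted.
leaked_domains, hours_needed = time_to_reach(10 * 1024)
print(f"{leaked_domains} leaked AppDomains, about {hours_needed} hours")
```

Under these assumptions the estimate is 82 AppDomains over roughly 20.5 hours, which is in line with growth observed "throughout the day".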
- Slide 38
- Investigation: ~10 weeks of investigation; many iterations of data collected and analyzed (data collection, analysis).
  Deployment guidance missteps: NLB configuration set to Round Robin; most recent CU update + hotfixes.
  Resolution: NLB configuration changed to Slow Start; most recent CU update + hotfixes installed; interim configuration change until KB2925281 hotfix release; final fix in Exchange 2013 Service Pack 1.
- Slide 39
- Slide 40
- Related sessions:
  BRK3131: Exchange Design Concepts and Best Practices
  BRK3197: Exchange Server Preferred Architecture
  BRK3178: Exchange on IaaS: Concerns, Tradeoffs, and Best Practices
  BRK3173: Experts Unplugged: Exchange Server Deployment and Architecture
  BRK3158: Experts Unplugged: Exchange Top Issues
  BRK3129: Deploying Exchange Server 2016
  BRK3102: Experts Unplugged: Exchange Server High Availability and Site Resilience
- Slide 41
- Slide 42