Date post: | 25-May-2015 |
Category: |
Data & Analytics |
Upload: | couchbase |
©2014 Couchbase, Inc. 2
What does a Healthy Cluster look like
What does an Unhealthy Cluster look like
Key System Metrics
Health-Check tools Couchbase provides
Agenda
A Healthy Cluster
Healthy Couchbase Cluster
The ‘Active vBuckets’ count, summed across all servers, should equal 1024.
The ‘Replica vBuckets’ count, summed across all servers, should equal 1024 * <number of replicas configured>.
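The two vBucket checks above are plain arithmetic, so they can be sketched in a few lines. This is only an illustration: the function name and the per-node count lists are hypothetical inputs, and the 1024 total is the figure quoted on this slide.

```python
TOTAL_VBUCKETS = 1024  # figure quoted on the slide


def vbucket_counts_healthy(active_per_node, replica_per_node, num_replicas):
    """Return True when the cluster-wide vBucket totals match the expected values.

    active_per_node / replica_per_node are hypothetical lists holding the
    'Active vBuckets' and 'Replica vBuckets' counts read from each server.
    """
    expected_active = TOTAL_VBUCKETS
    expected_replica = TOTAL_VBUCKETS * num_replicas
    return (sum(active_per_node) == expected_active
            and sum(replica_per_node) == expected_replica)


if __name__ == "__main__":
    # Four nodes with one replica configured.
    print(vbucket_counts_healthy([256] * 4, [256] * 4, num_replicas=1))
```

A mismatch in either total would suggest that some vBuckets are missing or not yet replicated, for example after a node failure and before a rebalance.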
Healthy Couchbase Cluster
‘Cache Miss Ratio’ and ‘Disk Reads per Sec’ across all servers should be 0 (depending on the resident ratio!).
Items in the ‘TAP Queue’ (2.x) or the ‘DCP Queues’ (3.x) should be 0.
An Unhealthy Cluster
Unhealthy Couchbase Cluster
‘Memory used’ equal to the ‘High Water Mark’ means active items are being evicted from RAM.
A ‘Drain rate’ much lower than the ‘Fill rate’ means the disk queue will grow until it hits TAP/DCP back-off.
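As a rough sketch, both warning signs on this slide can be turned into a check over four readings. The function name and the `min_drain_ratio` cut-off are assumptions made for illustration; the slide only says "much lower" and does not give a number.

```python
def unhealthy_signs(mem_used, high_water_mark, fill_rate, drain_rate,
                    min_drain_ratio=0.5):
    """Return a list of warning strings for the two conditions on this slide.

    min_drain_ratio is an assumed cut-off for "much lower": the drain rate
    is flagged when it falls below this fraction of the fill rate.
    """
    signs = []
    if mem_used >= high_water_mark:
        signs.append("memory used has reached the high water mark")
    if fill_rate > 0 and drain_rate < fill_rate * min_drain_ratio:
        signs.append("drain rate is far below fill rate")
    return signs


if __name__ == "__main__":
    for sign in unhealthy_signs(mem_used=90, high_water_mark=85,
                                fill_rate=1000, drain_rate=100):
        print("WARNING:", sign)
```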
Unhealthy Couchbase Cluster
‘TMP Out Of Memory’ errors mean memory usage is at or above 90% of the bucket memory quota.
‘Disk Reads per Sec’ and ‘Cache Miss Ratio’ growing into the hundreds or thousands mean there is no more memory capacity.
Key Metrics
Perry Krug http://blog.couchbase.com/how-many-nodes-part-4-monitoring-sizing
Justin Michaels http://blog.couchbase.com/monitoring-couchbase-cluster
"cache miss ratio"
Goal: As low as possible.
Many customers run with <1%; some run at 10-20%, accepting that performance is not as good because it is acceptable for their application.
Relationship between "memory used" and the "high watermark"
Memory used reaching the high watermark is OK, but never for a sustained period of time.
Sustained memory use at the high watermark is a "red flag" that you need more RAM: add another node.
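The "never for any sustained period" rule can be sketched as a run-length check over a series of memory samples. The helper names and the `max_run` limit are hypothetical; the slide does not say how many consecutive samples count as "sustained", so that value is an assumption to tune per deployment.

```python
def longest_run_at_high_water(mem_used_samples, high_water_mark):
    """Length of the longest consecutive run of samples at or above the mark."""
    longest = current = 0
    for used in mem_used_samples:
        if used >= high_water_mark:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


def sustained_at_high_water(mem_used_samples, high_water_mark, max_run=5):
    """True when memory stayed at the mark for more than max_run samples in a row.

    max_run is an assumed tolerance, not a value taken from the slides.
    """
    return longest_run_at_high_water(mem_used_samples, high_water_mark) > max_run


if __name__ == "__main__":
    samples = [80, 85, 90, 70, 90, 91, 92, 60]
    print(longest_run_at_high_water(samples, high_water_mark=85))
```

Brief touches of the high watermark pass this check; only a long unbroken run is reported.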
Key Metrics
"disk write queue" (TAP or DCP)
Goal: Keep peaks below 1 million items per node; peaks approaching that level are a warning sign.
Peaks and valleys getting higher over time indicate a need for more I/O, and likely more nodes to provide it.
The queue will grow and shrink over time, but it is a good indication of available I/O capacity.
"temp OOM"
Should always be 0 unless it is explicitly expected (as in a bulk-load scenario).
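One way to spot "peaks getting higher over time" is to split the disk write queue samples into fixed windows and compare the peak of each window with the one before. This is a minimal sketch under that assumption; the function names and the window size are hypothetical, and a strictly rising sequence is just one possible definition of the trend.

```python
def window_peaks(queue_samples, window):
    """Peak queue size in each consecutive window of `window` samples."""
    return [max(queue_samples[i:i + window])
            for i in range(0, len(queue_samples), window)]


def peaks_rising(queue_samples, window):
    """True when every window's peak is higher than the previous window's peak."""
    peaks = window_peaks(queue_samples, window)
    return len(peaks) >= 2 and all(b > a for a, b in zip(peaks, peaks[1:]))


if __name__ == "__main__":
    samples = [100, 500, 200, 300, 800, 400, 350, 1200, 600]
    print(window_peaks(samples, window=3))
    print(peaks_rising(samples, window=3))
```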
Key Metrics
"fragmentation" (docs and views)
Goal: Under 2x%
Will grow and shrink
Problematic if constantly higher than the compaction threshold: that indicates compaction is not keeping up, or is not running at all for some reason, and it may lead to out-of-disk-space issues.
Note: Monitor disk usage outside of Couchbase and make sure that it doesn't reach a critical level (90%; it should never get that high).
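Disk usage can be watched from outside Couchbase with nothing more than the Python standard library. The sketch below uses `shutil.disk_usage` and the 90% level mentioned in the note; the function names are hypothetical, and the path to check (the current directory in the example) stands in for wherever the data files live on a given node.

```python
import shutil


def disk_usage_fraction(path):
    """Fraction (0.0 to 1.0) of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def disk_is_critical(path, critical=0.90):
    """True when usage is at or above the critical level (90% per the note)."""
    return disk_usage_fraction(path) >= critical


if __name__ == "__main__":
    # "." is a placeholder; point this at the node's data directory.
    print(f"{disk_usage_fraction('.'):.1%} used, critical={disk_is_critical('.')}")
```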
Key Metrics
"outbound XDCR mutations"
An indication of how many outstanding writes are waiting to be sent to the destination cluster.
This will likely always have some value under load, so it's hard to say what a "good" threshold is, but you should understand and monitor it so that it doesn't go outside your expected range.
"items" in the "TAP queues" section
If this is above 0, some items have not yet been replicated between nodes and are therefore at risk of data loss if that node fails.
This is extremely unlikely during steady state, but during a network slowdown or disruption this queue will grow, and that should be an immediate sign that something is wrong.
Key Metrics
Sample Couchbase Cluster
Justin Michaels | Couchbase
Couchbase Monitoring
Real-time traffic graphs
Per bucket, per node and aggregate statistics
Monitor inter-node traffic
REST API accessible (Extend existing monitoring system)
Couchbase Monitoring WebUI
External systems can access all statistics through Couchbase's REST API.
External systems are in a good position to take into account components that are outside the scope of Couchbase Server, for example:
Whether a network switch that the Couchbase cluster depends on is failing.
Whether the shared storage supporting the cluster is functioning.
Whether routes to nodes in the cluster are healthy.
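An external monitoring system could poll bucket statistics over the REST API roughly as sketched below. Treat the details as assumptions to verify against the documentation for your server version: the sketch assumes the bucket stats endpoint at `/pools/default/buckets/<bucket>/stats` on port 8091, HTTP basic authentication, and a JSON response that holds per-stat sample arrays under `op.samples`. The stat names in the example (`ep_cache_miss_rate`, `ep_tmp_oom_errors`) and all function names are illustrative.

```python
import base64
import json
import urllib.request
from urllib.parse import quote


def stats_url(host, bucket, port=8091):
    """Build the (assumed) bucket stats URL for one node."""
    return f"http://{host}:{port}/pools/default/buckets/{quote(bucket, safe='')}/stats"


def latest_samples(payload, names):
    """Pick the most recent sample for each requested stat, skipping absent or empty ones."""
    samples = payload.get("op", {}).get("samples", {})
    return {name: samples[name][-1] for name in names if samples.get(name)}


def fetch_stats(host, bucket, user, password):
    """Fetch and decode the stats document using HTTP basic authentication."""
    request = urllib.request.Request(stats_url(host, bucket))
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


if __name__ == "__main__":
    # Offline example with a hand-written payload in the assumed shape.
    payload = {"op": {"samples": {"ep_cache_miss_rate": [0, 0, 1.5],
                                  "ep_tmp_oom_errors": [0, 0, 0]}}}
    print(latest_samples(payload, ["ep_cache_miss_rate", "ep_tmp_oom_errors"]))
```

Keeping the parsing in `latest_samples` separate from the network call lets the same logic feed an existing monitoring system and be exercised without a live cluster.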
External Monitoring
Some options (I’m sure there are more)
Health Check Tool
What is CBHealthChecker Tool
… Insert sample syntax used …
Web based report
ALERTING user on issues where immediate action is required.
Input to cluster health.
Important buckets statistics
Important stats on each Node
WARNING indicators to point out issues that need to be addressed before they become problems.
Sample CBHealthChecker
Justin Michaels | Couchbase
Couchbase Admin UI – Cluster Overview
Cluster Overview Page
Cluster RAM Usage
Cluster DISK Usage
Buckets Deployed in Cluster
Servers Deployed in Cluster
Cluster Rebalance Progress Indicator
Couchbase Admin UI – Server Nodes
Server Page
Node List
Active Servers
Expand Individual Servers
Servers Ready for Rebalance
Couchbase Admin UI – Server Nodes
Additional Server Details
Keys transferred, Keys yet to be transferred
Rebalance Progress Indicator in-detail
Memory Utilization on this server
Disk Utilization on this server
Couchbase Admin UI – Monitoring System
Monitoring Stats per Bucket on entire Cluster
120+ Stats collected from entire Cluster
View stats by aggregate
Click the ellipse to view this stat on a per-server basis
Tooltip provides description and stats used for calculating
Couchbase Admin UI – Logging System
Log Event Page
Logs all events occurring on the cluster
With timestamp
Server where event occurred
vBucket Resources
Active State Replica State Pending State Total
Active vBuckets
Replica vBuckets
Disk Queues
Active State Replica State Pending State Total
Tap Queues
Replication Queue Rebalance Queues Client Queues Total
XDCR Stats
The Outgoing XDCR stats section displays information about the XDCR operations that support cross datacenter replication from the current cluster to a destination cluster.
The Incoming XDCR stats section displays information about the XDCR operations that are coming into the current cluster from a remote cluster.
Sample Healthy Report
Categories to jump to
Time periods
Expanding this provides detailed sizing info
Cluster-wide stats analyzed
Sample Un-Healthy Report
User needs to take action
Details about what action