Best Practices: Managing a Healthy Couchbase Server Deployment: Couchbase Connect 2014

Managing a Couchbase Deployment

Justin Michaels | Couchbase


©2014 Couchbase, Inc.

What does a Healthy Cluster look like

What does an Unhealthy Cluster look like

Key System Metrics

Health-Check tools Couchbase provides

Agenda

A Healthy Cluster


Healthy Couchbase Cluster

‘Active vBuckets’ count across all the servers should be equal to “1024”

‘Replica vBuckets’ count across all the servers should be equal to “1024 * <number of replicas configured>”
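
The two vBucket rules above are simple arithmetic. A minimal sketch, assuming the per-server counts have already been collected by some other means (the function name and input shape are illustrative, not part of any Couchbase tool):

```python
TOTAL_VBUCKETS = 1024  # per bucket, as stated on the slide


def vbucket_counts_healthy(active_per_node, replica_per_node, num_replicas):
    """Return True if the summed vBucket counts match the expected totals.

    active_per_node / replica_per_node: lists of counts, one per server.
    num_replicas: number of replicas configured for the bucket.
    """
    expected_active = TOTAL_VBUCKETS
    expected_replica = TOTAL_VBUCKETS * num_replicas
    return (sum(active_per_node) == expected_active
            and sum(replica_per_node) == expected_replica)


# Hypothetical 4-node cluster with 1 replica configured.
print(vbucket_counts_healthy([256, 256, 256, 256], [256, 256, 256, 256], 1))
# One node has lost its share: the active total falls short of 1024.
print(vbucket_counts_healthy([256, 256, 256, 0], [256, 256, 256, 256], 1))
```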


Healthy Couchbase Cluster

‘Cache Miss Ratio’ and ‘Disk Reads per Sec’ across all the servers should be equal to “0” (depending on resident ratio!)

Items in ‘TAP Queues’ (2.x) or ‘DCP Queues’ (3.x) should stay at “0” in steady state

An Unhealthy Cluster


Unhealthy Couchbase Cluster

‘Memory used’ equal to ‘High Water Mark’ means active items are being evicted from RAM

‘Drain rate’ much lower than ‘Fill rate’ means the Disk Queue will grow and hit TAP/DCP back-off


Unhealthy Couchbase Cluster

TMP Out Of Memory (temp OOM) means memory usage is at or above 90% of the bucket memory quota

‘Disk Reads per Sec’ and ‘Cache Miss Ratio’ growing into the hundreds or thousands means there is no more memory capacity
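
The warning signs on the two "Unhealthy" slides above can be expressed as a handful of comparisons. This is only a sketch: the function name is hypothetical, the inputs are assumed to come from whatever stats source you already use, and `drain_ratio_floor` is an assumed cutoff for "much lower", since the slides give no number.

```python
def unhealthy_signals(mem_used, high_water_mark, bucket_quota,
                      drain_rate, fill_rate, drain_ratio_floor=0.5):
    """Return a list of warning strings; an empty list means none fired.

    drain_ratio_floor is an assumption: drain below this fraction of
    fill is treated as "much lower".
    """
    warnings = []
    if mem_used >= high_water_mark:
        warnings.append("memory used at high water mark: items being evicted")
    if mem_used >= 0.90 * bucket_quota:
        warnings.append("memory at or above 90% of quota: expect temp OOM")
    if fill_rate > 0 and drain_rate < drain_ratio_floor * fill_rate:
        warnings.append("drain rate far below fill rate: disk queue growing")
    return warnings


# Hypothetical numbers in arbitrary but consistent units.
print(unhealthy_signals(600, 850, 1000, drain_rate=900, fill_rate=1000))
print(unhealthy_signals(950, 850, 1000, drain_rate=100, fill_rate=1000))
```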

Key Metrics

Perry Krug http://blog.couchbase.com/how-many-nodes-part-4-monitoring-sizing

Justin Michaels http://blog.couchbase.com/monitoring-couchbase-cluster



"cache miss ratio"

Goal: As low as possible

Many customers run with <1%, some are upwards of 10-20% with the understanding that the performance is not as good but that it is okay for their application.

Watch the relationship between "memory used" and the "high watermark".

Memory used reaching the high watermark is OK, but never for any sustained period of time.

Sustained memory use at the high watermark is a "red flag" that you need more RAM: add another node.
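
The cache miss and high watermark guidance above can be sketched in code. Two assumptions are made here: the cache miss ratio is taken as disk fetches divided by get operations, and "sustained" is taken as a run of consecutive samples. Both function names are illustrative.

```python
def cache_miss_percent(disk_fetches, get_ops):
    """Share of reads served from disk, as a percentage (assumed definition)."""
    if get_ops == 0:
        return 0.0
    return 100.0 * disk_fetches / get_ops


def sustained_at_high_watermark(mem_used_samples, high_watermark,
                                min_consecutive=5):
    """True if memory stayed at or above the watermark for a run of samples.

    min_consecutive is an assumed definition of "sustained"; tune it to
    your sampling interval.
    """
    run = 0
    for sample in mem_used_samples:
        run = run + 1 if sample >= high_watermark else 0
        if run >= min_consecutive:
            return True
    return False


# A brief touch of the watermark is tolerated; a long run is the red flag.
print(sustained_at_high_watermark([70, 85, 86, 70, 85, 70], 85))
print(sustained_at_high_watermark([85, 86, 87, 88, 90, 91], 85))
```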

Key Metrics


"disk write queue" (TAP or DCP)

Goal: Peaks approaching 1 million items per node

Will grow and shrink over time, but is a good indication of available IO capacity.

If the peaks and valleys keep getting higher over time, that's an indication of needing more IO, and likely more nodes to provide it.

"temp OOM"

Should always be 0 unless it is explicitly expected (like in a bulk load scenario).
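
"Peaks and valleys getting higher over time" for the disk write queue can be checked mechanically. A rough sketch, assuming the queue size has been sampled at a regular interval; the windowing approach and the function name are choices made for this example, not something from the slides.

```python
def peaks_and_valleys_rising(queue_samples, window):
    """True if every window's max and min exceed the previous window's.

    queue_samples: disk write queue sizes in time order.
    window: samples per window; trailing partial windows are ignored.
    """
    windows = [queue_samples[i:i + window]
               for i in range(0, len(queue_samples) - window + 1, window)]
    if len(windows) < 2:
        return False  # not enough history to call it a trend
    peaks = [max(w) for w in windows]
    valleys = [min(w) for w in windows]

    def rising(values):
        return all(a < b for a, b in zip(values, values[1:]))

    return rising(peaks) and rising(valleys)


# Queue drains back to the same level each cycle: no upward trend.
print(peaks_and_valleys_rising([100, 500, 200, 100, 500, 200], window=3))
# Each cycle peaks higher and drains less: IO capacity is running out.
print(peaks_and_valleys_rising([100, 500, 200, 300, 800, 400, 600, 1200, 700],
                               window=3))
```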

Key Metrics


"fragmentation” (docs and views)

Goal: Under 2x%

Will grow and shrink

Problematic if constantly higher than the compaction threshold: that's an indication that compaction is not keeping up, or is not running at all for some reason, and it may lead to out-of-disk-space issues.

Note: Monitor disk usage outside of Couchbase and make sure that it doesn't reach a critical level (90%; it shouldn't ever get to that).
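
Both checks above, fragmentation against the compaction threshold and disk usage measured outside of Couchbase, are easy to script. A minimal sketch using only the Python standard library; the function names are illustrative, the threshold is whatever your cluster is configured with, and the data path is an assumption to replace with your own.

```python
import shutil

CRITICAL_DISK_PERCENT = 90.0  # the critical level named on the slide


def disk_used_percent(path):
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def compaction_falling_behind(fragmentation_samples, compaction_threshold):
    """True if every sample is above the threshold ("constantly higher")."""
    return (len(fragmentation_samples) > 0
            and all(s > compaction_threshold for s in fragmentation_samples))


# "." stands in for the Couchbase data directory, which varies by install.
used = disk_used_percent(".")
print(f"disk used: {used:.1f}% (critical: {used >= CRITICAL_DISK_PERCENT})")
# Fragmentation that dips back under the threshold is normal.
print(compaction_falling_behind([35, 28, 40, 25], compaction_threshold=30))
```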

Key Metrics


"outbound XDCR mutations"

An indication of how many outstanding writes are waiting to be sent to the destination cluster.

This will likely always have some value in it under load and so it's hard to say what a "good" threshold is, but it's something you should understand and monitor so that it doesn't go out of whatever your expected range is.

"items” in the "TAP queues" section

If this is above 0, it's an indication that some items are not replicated between nodes and therefore are at risk of data loss if that node fails.

It's extremely unlikely for this to happen during steady state but if there is a network slowdown or disruption this queue will grow and should be an immediate sign that something is wrong.
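
The two replication checks above differ in kind: TAP queue items should be zero, while outbound XDCR mutations only need to stay inside a range you define yourself. A small sketch under those assumptions (the function name, input shapes and the expected maximum are all hypothetical):

```python
def replication_alerts(tap_items_by_node, xdcr_outbound, xdcr_expected_max):
    """Return alert strings for intra-cluster and XDCR replication backlogs.

    tap_items_by_node: dict of node name -> items waiting in its TAP queue.
    xdcr_expected_max: your own expected ceiling; the slides give no number.
    """
    alerts = []
    for node, items in sorted(tap_items_by_node.items()):
        if items > 0:
            alerts.append(f"{node}: {items} items not yet replicated")
    if xdcr_outbound > xdcr_expected_max:
        alerts.append(
            f"outbound XDCR mutations {xdcr_outbound} above expected "
            f"{xdcr_expected_max}")
    return alerts


# Hypothetical readings: node2 has a replication backlog.
print(replication_alerts({"node1": 0, "node2": 42}, 100, 5000))
```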

Key Metrics

Sample Couchbase Cluster

Couchbase Monitoring


Real-time traffic graphs

Per bucket, per node and aggregate statistics

Monitor inter-node traffic

REST API accessible (Extend existing monitoring system)

Couchbase Monitoring WebUI


External systems can access all statistics from Couchbase's REST API

External systems are in a good position to take into account components that are outside the scope of Couchbase Server.

Examples of what an external system can check:

A network switch that the Couchbase cluster depends on is failing.

Shared storage supporting the cluster is functioning.

Routes to nodes in the cluster are healthy.
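
As a rough illustration of an external system pulling statistics over REST, the sketch below reads per-bucket stats and keeps the latest sample of each. It assumes the bucket stats resource at `/pools/default/buckets/<bucket>/stats` on the admin port 8091 and a response shaped like `{"op": {"samples": {...}}}`; verify both against the REST API documentation for your server version. The function names and the sample payload are made up for this example.

```python
import base64
import json
import urllib.request
from urllib.parse import quote


def build_stats_url(host, bucket, port=8091):
    """URL of the per-bucket stats resource (assumed path, see above)."""
    return f"http://{host}:{port}/pools/default/buckets/{quote(bucket, safe='')}/stats"


def latest_samples(payload):
    """Map each stat name to its most recent sample.

    Assumes the response shape {"op": {"samples": {name: [values...]}}}.
    """
    samples = payload.get("op", {}).get("samples", {})
    return {name: values[-1] for name, values in samples.items()
            if isinstance(values, list) and values}


def fetch_bucket_stats(host, bucket, user, password):
    """Fetch and parse bucket stats over HTTP basic auth (not called here)."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(
        build_stats_url(host, bucket),
        headers={"Authorization": f"Basic {token}"})
    with urllib.request.urlopen(request, timeout=10) as response:
        return latest_samples(json.load(response))


# Offline example with a hand-written payload; the stat names are illustrative.
sample_payload = {"op": {"samples": {"ep_cache_miss_rate": [0, 0, 1.5],
                                     "ep_tmp_oom_errors": [0, 0, 0]}}}
print(latest_samples(sample_payload))
```

Keeping the parsing separate from the HTTP call means the threshold logic can be tested without a running cluster.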

External Monitoring


Some options (I’m sure there are more)

Health Check Tool


What is the CBHealthChecker Tool

… Insert sample syntax used …

Web based report

ALERTING user on issues where immediate action is required.

Input to cluster health.

Important buckets statistics

Important stats on each Node

WARNING indicators to point out issues that need to be addressed before they become problems.

Sample CBHealthChecker


Couchbase Admin UI – Cluster Overview

Cluster Overview Page

Cluster RAM Usage

Cluster Disk Usage

Buckets Deployed in Cluster

Servers Deployed in Cluster

Cluster Rebalance Progress Indicator


Couchbase Admin UI – Server Nodes

Server Page

Node List

Active Servers

Expand Individual Servers

Servers Ready for Rebalance


Couchbase Admin UI – Server Nodes

Additional Server Details

Keys transferred, Keys yet to be transferred

Rebalance Progress Indicator in-detail

Memory Utilization on this server

Disk Utilization on this server


Couchbase Admin UI – Monitoring System

Monitoring Stats per Bucket on entire Cluster

120+ Stats collected from entire Cluster

View stats by aggregate

Click the ellipse to view this stat on a per-server basis

Tooltip provides description and stats used for calculating


Couchbase Admin UI – Logging System

Log Event Page

Logs all events occurring on the cluster

With timestamp

Server where event occurred


vBucket Resources

Active State Replica State Pending State Total

Active vBuckets

Replica vBuckets


Disk Queues

Active State Replica State Pending State Total


Tap Queues

Replication Queue Rebalance Queues Client Queues Total


XDCR Stats

Outgoing XDCR stats section displays information about the XDCR operations that are supporting cross datacenter replication from the current cluster to a destination cluster.

Incoming XDCR stats section displays information about the XDCR operations that are coming into the current cluster from a remote cluster.


Sample Healthy Report

Categories to jump to

Time periods

Expanding this provides detailed sizing info

Cluster-wide stats analyzed


Sample Un-Healthy Report

User needs to take action

Details about what action

Date post:	25-May-2015