Webinar: Five MMS Monitoring Alerts to Keep Your MongoDB Deployment on Track


Five MMS Monitoring Alerts to Keep Your MongoDB Deployment on Track

Angshuman Bagchi (angshuman@mongodb.com), Technical Services Engineer

Agenda

• What is MMS Monitoring?
• What are Alerts?
• How to pick an Alert?
• Five recommended Alerts
• Wrap up

What is MMS Monitoring?

Who uses MMS?

What are MMS alerts?

Source: http://www.cleanfunnypics.com/no-its-not-empty/#axzz2pqknJJbC

How to pick an Alert?

• Is there an absolute limit to alert on?
• What is normal (baseline)?
• What is worrying (warning)?
• What is a definite problem (critical)?
• Likelihood of false positives?

... there is no magic formula

Five recommended alerts

• Host Recovering (All, but by definition Secondary)
• Replication Lag (Secondary)
• Connections (All mongos, mongod)
• Lock % (Primary, Secondary)
• Replica (Primary, Secondary)

Host Recovering

• General alert triggered if any instance enters RECOVERING mode

• Required for all use-cases
• All Replica Sets should have this
• Sometimes, during maintenance, this may be expected
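
The check behind this alert can be sketched in a few lines. This is an illustration only, not how MMS implements it: the `status` dict mimics the shape of MongoDB's `replSetGetStatus` output (a `members` list with `name` and `stateStr`), the host names are made up, and the `in_maintenance` flag is a hypothetical way to suppress the alert when RECOVERING is expected.

```python
# Sketch only: `status` mimics the shape of replSetGetStatus output.
# The sample hosts below are invented for illustration.

def recovering_members(status, in_maintenance=False):
    """Return names of members in RECOVERING state ([] during maintenance)."""
    if in_maintenance:
        # Assumption: the caller tells us maintenance is under way.
        return []
    return [
        m["name"]
        for m in status.get("members", [])
        if m.get("stateStr") == "RECOVERING"
    ]


sample_status = {
    "set": "rs0",
    "members": [
        {"name": "db1.example.net:27017", "stateStr": "PRIMARY"},
        {"name": "db2.example.net:27017", "stateStr": "SECONDARY"},
        {"name": "db3.example.net:27017", "stateStr": "RECOVERING"},
    ],
}

print(recovering_members(sample_status))
print(recovering_members(sample_status, in_maintenance=True))
```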

Host Recovering

Replication Lag

• No secondary should be behind
• Secondary reads affected
• All Replica Sets should have this
• Only exception is configured slaveDelay

Replication Lag

Absolute Limit? Yes, about 1 or 2s. To prevent false positives, alert on an absolute threshold of > 240s.

Normal Lag is ideally 0s

Worrying < 60s, some false positives

Critical > 240s

False positives Likelihood is low above 240s.
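
As a rough sketch of how these numbers could be applied, the snippet below computes each secondary's lag from a `replSetGetStatus`-shaped dict (lag taken as the primary's `optimeDate` minus the secondary's) and classifies it. The sample data is invented, treating 60s as the warning threshold is an assumed reading of the table above, and a configured slaveDelay is not accounted for.

```python
from datetime import datetime, timedelta

WARNING_LAG_S = 60    # assumption: the "worrying" figure used as a warning threshold
CRITICAL_LAG_S = 240  # from the table: false positives are unlikely above 240s


def secondary_lag_seconds(status):
    """Map each SECONDARY's name to its lag behind the primary, in seconds."""
    members = status["members"]
    primary = next(m for m in members if m["stateStr"] == "PRIMARY")
    return {
        m["name"]: (primary["optimeDate"] - m["optimeDate"]).total_seconds()
        for m in members
        if m["stateStr"] == "SECONDARY"
    }


def classify_lag(lag_s):
    """Classify a lag value against the warning and critical thresholds."""
    if lag_s > CRITICAL_LAG_S:
        return "critical"
    if lag_s > WARNING_LAG_S:
        return "warning"
    return "ok"


# Invented sample data shaped like replSetGetStatus output.
now = datetime(2014, 1, 1, 12, 0, 0)
sample_status = {
    "members": [
        {"name": "a", "stateStr": "PRIMARY", "optimeDate": now},
        {"name": "b", "stateStr": "SECONDARY",
         "optimeDate": now - timedelta(seconds=2)},
        {"name": "c", "stateStr": "SECONDARY",
         "optimeDate": now - timedelta(seconds=300)},
    ]
}

lags = secondary_lag_seconds(sample_status)
for name, lag in lags.items():
    print(name, lag, classify_lag(lag))
```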

Example: replication lag

150,000s of lag ~ almost 2 days of lag!

• Secondaries under-specified vs. primaries
• Access patterns between primary / secondaries
• Insufficient bandwidth
• Foreground index builds on secondaries

“…when you have eliminated the impossible, whatever remains, however improbable, must be the truth…” -- Sherlock Holmes

Sir Arthur Conan Doyle, The Sign of the Four

Example:
• ~1500 ops per minute (opcounters)
• 0.1 MB per object (average object size, local db)

~1500 ops/min / 60 s/min * 0.1 MB/op * 8 bits/byte =~ 20 Mbps required bandwidth
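
The same arithmetic as a small helper, so other opcounter rates and object sizes can be plugged in. The function name and parameters are illustrative, and the estimate ignores protocol overhead.

```python
def required_bandwidth_mbps(ops_per_minute, avg_object_mb):
    """Estimate replication bandwidth in megabits per second."""
    ops_per_second = ops_per_minute / 60
    megabytes_per_second = ops_per_second * avg_object_mb
    return megabytes_per_second * 8  # 8 bits per byte


# The figures from the example above: roughly 20 Mbps.
print(required_bandwidth_mbps(1500, 0.1))
```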

Connections

• Each connection consumes ~1 MB and a file descriptor
• 5000 connections => ~5 GB of RAM
• Stability and predictability are key
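
A quick way to redo this estimate for other connection counts. It uses the approximate 1 MB per connection figure above and treats 1 GB as 1000 MB, as the slide does; the function name is illustrative.

```python
MB_PER_CONNECTION = 1  # approximate per-connection cost from the slide


def connection_memory_gb(connections, mb_per_connection=MB_PER_CONNECTION):
    """Approximate RAM consumed by open connections, in GB (1 GB = 1000 MB)."""
    return connections * mb_per_connection / 1000


print(connection_memory_gb(5000))
```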

Pro-Tip: know thyself

You have to recognize normal to know when it isn’t.

Source: http://www.flickr.com/photos/skippy/6853920/

Connections

Absolute Limit? Yes, but this is too high. We need to alert before that.

Normal TBD based on deployment, number of nodes, connection pool settings, app servers, load, etc. Say, X during peak load.

Worrying 50% increase, so, 1.5X

Critical Double, so 2X
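
The table translates into thresholds derived from a measured peak baseline X. The sketch below is one way to express that; the baseline of 400 connections is an invented example, and the use of >= comparisons is an assumption.

```python
def connection_thresholds(peak_baseline):
    """Warning at 1.5X and critical at 2X of the peak-load baseline X."""
    return {"warning": 1.5 * peak_baseline, "critical": 2 * peak_baseline}


def classify_connections(current, peak_baseline):
    """Classify the current connection count against the derived thresholds."""
    thresholds = connection_thresholds(peak_baseline)
    if current >= thresholds["critical"]:
        return "critical"
    if current >= thresholds["warning"]:
        return "warning"
    return "ok"


# Invented example: a deployment that normally peaks at 400 connections.
for current in (450, 650, 900):
    print(current, classify_connections(current, peak_baseline=400))
```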

Lock %

• Lock contention degrades performance
• High lock % starves replication and reads
• Bounds need to be determined

Lock %

Absolute Limit? Yes. Above 80%: occasional degraded performance. At 90%: major impact regularly.

Normal TBD. Write-heavy loads see higher values. Say X% during peak load.

Worrying Double, so approximately 2X%

Critical TBD. For Prod > 80%
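
A minimal sketch of these bounds, combining a relative warning (double the baseline) with the fixed 80% critical level for production. Capping the warning at the critical level is an assumption added here so the two never invert, and the 15% baseline is an invented example.

```python
PROD_CRITICAL_LOCK_PCT = 80.0  # from the table: for production, critical is > 80%


def lock_thresholds(peak_baseline_pct):
    """Return (warning, critical) lock % thresholds for a peak baseline X%."""
    # Warning at double the baseline. Capping at the critical level is an
    # assumption so that the warning never sits above the critical threshold.
    warning = min(2 * peak_baseline_pct, PROD_CRITICAL_LOCK_PCT)
    return warning, PROD_CRITICAL_LOCK_PCT


def classify_lock(current_pct, peak_baseline_pct):
    """Classify the current lock % against the derived thresholds."""
    warning, critical = lock_thresholds(peak_baseline_pct)
    if current_pct > critical:
        return "critical"
    if current_pct > warning:
        return "warning"
    return "ok"


# Invented example: lock % normally peaks at 15%.
for pct in (20.0, 45.0, 85.0):
    print(pct, classify_lock(pct, peak_baseline_pct=15.0))
```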

Replica

• Represents oplog window
• Depends on
– Rate of operations inserted into oplog
– Size of operations
– Size of oplog capped collection
• Normal maintenance window x 3
• Resizing the oplog is non-trivial
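
The three factors above combine into a rough estimate of the oplog window: oplog size divided by the rate at which bytes are written into it. The sketch below assumes a steady write rate and reads "maintenance window x 3" as "the oplog window should be at least three times the normal maintenance window", which is an interpretation of the slide. All names and sample figures are illustrative.

```python
def oplog_window_hours(oplog_size_mb, ops_per_second, avg_op_size_mb):
    """Estimate how many hours of operations the oplog can hold."""
    mb_written_per_hour = ops_per_second * avg_op_size_mb * 3600
    return oplog_size_mb / mb_written_per_hour


def meets_maintenance_rule(window_hours, maintenance_hours, factor=3):
    """True if the window covers `factor` times the maintenance window."""
    return window_hours >= factor * maintenance_hours


# Invented example: 45,000 MB oplog, 25 ops/s, 0.1 MB per operation.
window = oplog_window_hours(45000, 25, 0.1)
print(window)
print(meets_maintenance_rule(window, maintenance_hours=1))
print(meets_maintenance_rule(window, maintenance_hours=2))
```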

Replica

Absolute Limit? 50% below Normal

Normal TBD. Say X hours during peak

Worrying 25% below Normal

Critical 50% below Normal
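
Unlike the other metrics, this one alerts when the value drops. A sketch of thresholds derived from a normal peak-time window of X hours follows; the 48-hour baseline is an invented example and the <= comparisons are an assumption.

```python
def oplog_window_thresholds(normal_hours):
    """Warning at 25% below normal, critical at 50% below normal."""
    return {"warning": 0.75 * normal_hours, "critical": 0.5 * normal_hours}


def classify_oplog_window(current_hours, normal_hours):
    """Classify the current oplog window. Lower values are worse."""
    thresholds = oplog_window_thresholds(normal_hours)
    if current_hours <= thresholds["critical"]:
        return "critical"
    if current_hours <= thresholds["warning"]:
        return "warning"
    return "ok"


# Invented example: the oplog window is normally 48 hours at peak.
for hours in (40, 30, 20):
    print(hours, classify_oplog_window(hours, normal_hours=48))
```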

Summary

• Use a similar approach for other metrics
• Different audiences for alerts
– Worrying alerts go to the ops team
– Critical alerts go out to a wider audience

• Get started with MMS Monitoring and alerts!

I got alerted … now what?

mms.mongodb.com

angshuman@mongodb.com
