Cassandra Metrics
By: Chris Lohfink
Blackbird
About Me
• Engineer at Blackbird
• Worked with C* since 0.8 (3 years)
• 7 years as a Java/Python developer
• Interests
  o Data Science
  o Hobbyist Electronics
  o Development
About Cassandra
• Fault tolerant to a fault
  o Easy to ignore until it gets bad
• Like all other systems:
  o If there are not many events, no one pays attention to it
  o If there are a lot of events, you need to keep an eye on it
  o When things happen, you need information to diagnose them quickly
Basically...
Lots of Metrics
A lot of data with no context or understanding is not much use
… but you have lots of pretty graphs
Disclaimer
This is not all of the important metrics; in fact, it is missing many critical ones:
• Heap
• OS metrics
• Latencies
• Log messages
An Example for a little background
[Diagram: a client request passing through thread-pool stages. Labels: Client Request; ReadStage (threads, x32); RequestResponse (threads); ReadRepairStage (threads); MessagingService; queue capacities of 2^31-1.]
Cassandra Key Metrics
● Cassandra's internal messaging is based on SEDA (staged event-driven architecture), with many asynchronous elements
● It's easy to overrun the processing capacity of a stage that is not in the request's feedback loop (e.g. ReadRepairStage)
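To make the overrun point concrete, here is a toy sketch in Python (not Cassandra code). It assumes a stage is just a fixed pool of workers draining a queue, and that the producer is never slowed down by the stage. The `Stage` class, `simulate` function, and the tick-based timing are all hypothetical names and simplifications chosen for illustration.

```python
from collections import deque


class Stage:
    """Toy model of a SEDA-style stage: a fixed worker pool draining a queue."""

    def __init__(self, name, workers):
        self.name = name
        self.workers = workers   # tasks the pool can finish per tick (assumption)
        self.queue = deque()     # tasks waiting for a thread
        self.completed = 0

    def submit(self, task):
        # Submitting never waits on the stage, so nothing slows the producer.
        self.queue.append(task)

    def tick(self):
        # Each worker finishes at most one queued task per tick.
        for _ in range(min(self.workers, len(self.queue))):
            self.queue.popleft()
            self.completed += 1

    @property
    def pending(self):
        return len(self.queue)


def simulate(arrivals_per_tick, workers, ticks):
    """Return the pending count after each tick for a constant arrival rate."""
    stage = Stage("ReadRepairStage", workers)
    history = []
    for _ in range(ticks):
        for i in range(arrivals_per_tick):
            stage.submit(i)
        stage.tick()
        history.append(stage.pending)
    return history


if __name__ == "__main__":
    # Three arrivals per tick against two workers: the backlog grows by one each tick.
    print(simulate(arrivals_per_tick=3, workers=2, ticks=5))  # [1, 2, 3, 4, 5]
```

When arrivals match the worker count the pending count stays at zero; once arrivals exceed it, pending grows without bound because nothing pushes back on the producer.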
Access the metrics
● nodetool tpstats

    Pool Name                Active  Pending  Completed  Blocked  All time blocked
    ReadStage                     0        0     113702        0                 0
    RequestResponseStage          0        0          0        0                 0
    MutationStage                 0        0     164503        0                 0
    ...
    InternalResponseStage         0        0          0        0                 0
    HintedHandoff                 0        0          0        0                 0

    Message type        Dropped
    RANGE_SLICE               0
    READ_REPAIR               0
    ...
    REQUEST_RESPONSE          0
    COUNTER_MUTATION          0
● JMX: org.apache.cassandra.request:type=* and org.apache.cassandra.internal:type=*
● Metrics Reporter
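As a rough illustration of consuming the first option programmatically, the sketch below parses text shaped like the tpstats output above. It assumes the two-table layout shown on this slide (other Cassandra versions may print different columns); `parse_tpstats`, `POOL_FIELDS`, and `SAMPLE` are hypothetical names, and the sample rows are a subset of those on the slide.

```python
# Sample text in the layout shown on the slide (a subset of its rows).
SAMPLE = """\
Pool Name                Active  Pending  Completed  Blocked  All time blocked
ReadStage                     0        0     113702        0                 0
RequestResponseStage          0        0          0        0                 0
MutationStage                 0        0     164503        0                 0

Message type        Dropped
RANGE_SLICE               0
READ_REPAIR               0
"""

# Hypothetical field names for the five numeric pool columns.
POOL_FIELDS = ("active", "pending", "completed", "blocked", "all_time_blocked")


def parse_tpstats(text):
    """Split tpstats-style text into (pools, dropped) dictionaries."""
    pools, dropped = {}, {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue  # blank lines and "..." elisions
        name, values = parts[0], parts[1:]
        if not all(v.isdigit() for v in values):
            continue  # header rows contain words, not counts
        numbers = [int(v) for v in values]
        if len(numbers) == len(POOL_FIELDS):
            pools[name] = dict(zip(POOL_FIELDS, numbers))
        elif len(numbers) == 1:
            dropped[name] = numbers[0]
    return pools, dropped


if __name__ == "__main__":
    pools, dropped = parse_tpstats(SAMPLE)
    print(pools["MutationStage"]["completed"])  # 164503
    print(dropped)
```

On a node where nodetool is on the PATH, the same function could presumably be fed the captured output of `nodetool tpstats` instead of `SAMPLE`.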
MBean Attribute | tpstats name | Description
ActiveCount | Active | Number of tasks pulled off the queue with a thread currently processing them.
PendingTasks | Pending | Number of tasks in the queue waiting for a thread.
CompletedTasks | Completed | Number of tasks completed.
CurrentlyBlockedTasks | Blocked | When a pool reaches its core pool size (configurable or set per stage, more below) it will begin queuing until the max size is reached. When this is reached it will block until there is room in the queue.
TotalBlockedTasks | All time blocked | Total number of tasks that have been blocked.
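Since CompletedTasks and TotalBlockedTasks are running totals, they are most useful as differences between two samples. The sketch below shows one way to do that, assuming each sample is a plain dict keyed by the MBean attribute names above. The `summarize` and `is_falling_behind` helpers, the 60-second interval, and the sample numbers are hypothetical.

```python
def summarize(prev, curr, interval_s):
    """Compare two samples of one thread pool taken interval_s seconds apart."""
    return {
        # Throughput: completed tasks per second over the interval.
        "completed_per_s": (curr["CompletedTasks"] - prev["CompletedTasks"]) / interval_s,
        # Positive means the queue grew between the samples.
        "pending_delta": curr["PendingTasks"] - prev["PendingTasks"],
        # Tasks that were blocked at some point during the interval.
        "newly_blocked": curr["TotalBlockedTasks"] - prev["TotalBlockedTasks"],
    }


def is_falling_behind(summary):
    """Heuristic (an assumption): a growing queue or new blocking is a warning sign."""
    return summary["pending_delta"] > 0 or summary["newly_blocked"] > 0


if __name__ == "__main__":
    # Made-up sample values, for illustration only.
    prev = {"CompletedTasks": 1000, "PendingTasks": 0, "TotalBlockedTasks": 0}
    curr = {"CompletedTasks": 1600, "PendingTasks": 40, "TotalBlockedTasks": 0}
    summary = summarize(prev, curr, interval_s=60)
    print(summary["completed_per_s"])  # 10.0
    print(is_falling_behind(summary))  # True
```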
Examples
• Read/Mutation Stage
  o Too many reads/writes, disk failure, poor tuning
• ReplicateOnWrite (CounterMutationStage in 2.1+)
  o High throughput of counter increments
• FlushWriter
  o Writes overrunning disk capabilities, poor tuning
  o Large collections
• GossipStage
  o vnodes + many servers (pre 2.0.3)
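The examples above can be kept next to the metrics as a small lookup, so a backed-up stage immediately suggests where to look. This is only a sketch: the causes are copied from the list above, while `LIKELY_CAUSES`, `hints_for_backlog`, and the `threshold` parameter are hypothetical.

```python
# Stage name -> likely causes, taken from the examples above.
LIKELY_CAUSES = {
    "ReadStage": ["too many reads", "disk failure", "poor tuning"],
    "MutationStage": ["too many writes", "disk failure", "poor tuning"],
    "ReplicateOnWrite": ["high throughput of counter increments"],
    "CounterMutationStage": ["high throughput of counter increments"],
    "FlushWriter": ["writes overrunning disk capabilities", "poor tuning", "large collections"],
    "GossipStage": ["vnodes + many servers (pre 2.0.3)"],
}


def hints_for_backlog(pending_by_stage, threshold=0):
    """Return likely causes for every stage whose pending count exceeds threshold."""
    return {
        stage: LIKELY_CAUSES.get(stage, [])
        for stage, pending in pending_by_stage.items()
        if pending > threshold
    }


if __name__ == "__main__":
    # Made-up pending counts, for illustration only.
    print(hints_for_backlog({"ReadStage": 0, "FlushWriter": 12}))
```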
Questions?