Best Practices for Monitoring Distributed In-Memory Computing
Denis Mekhanikov
July 31, 2019
2019 © GridGain Systems
What communication with GridGain support often looks like
Customer: The cluster is hanging.
GG: Please send logs.
Customer: We don’t have logs.
GG: Did you take thread dumps?
Customer: Nope.
GG: The problem is probably in GC. What is the memory consumption level?
Customer: ...
Why should we monitor?
• Check if everything is fine
• Prevent upcoming issues
• Discover and react to issues that have already happened
• Find the reason for an issue and prevent it from happening again
Dashboarding
Logging
What to monitor?
• Every node in isolation
• Connection between nodes
• System as a whole
Every node is...
• Hardware (hypervisor)
• Operating System
• Virtual machine
• Application
Hardware / Hypervisor / OS
• CPU
• Memory
• Disk
• System logs
• Cloud Provider’s logs
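As a rough illustration of host-level checks for CPU and disk, here is a minimal sketch that uses only the standard Java library. The class name `HostProbe` and its method names are hypothetical, and a real setup would more likely rely on a dedicated agent than on code inside the application.

```java
import java.io.File;
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Hypothetical helper, not part of GridGain or Ignite.
final class HostProbe {
    /** Usable fraction of the partition holding `path`, or NaN if the size is unknown. */
    static double usableDiskFraction(File path) {
        long total = path.getTotalSpace(); // 0 if the path does not name a partition
        return total > 0 ? (double) path.getUsableSpace() / total : Double.NaN;
    }

    /** One-minute system load average divided by CPU count, or -1 if unavailable. */
    static double loadPerCpu() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        double load = os.getSystemLoadAverage(); // negative where the platform does not report it
        return load < 0 ? -1.0 : load / os.getAvailableProcessors();
    }

    public static void main(String[] args) {
        File workDir = new File(".").getAbsoluteFile();
        System.out.printf("usable disk: %.1f%%%n", usableDiskFraction(workDir) * 100);
        System.out.printf("load per CPU: %.2f%n", loadPerCpu());
    }
}
```

Values like these are usually exported to a dashboard and alerted on, rather than printed.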
Network
• Ping monitoring
• Network hardware monitoring
• TCP dumps
JVM
• GC logs
• JMX
• Java Flight Recorder
• Thread Dumps
• Heap Dumps
  ● java -XX:+HeapDumpOnOutOfMemoryError ...
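Thread dumps and heap usage can also be collected from inside the JVM through the standard `java.lang.management` API. The sketch below is one possible way to do it. The class name `JvmSnapshot` is hypothetical, and external tools such as jstack produce equivalent dumps.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper, not part of GridGain or Ignite.
final class JvmSnapshot {
    /** Returns a textual dump of all live threads with full stack traces. */
    static String threadDump() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        StringBuilder sb = new StringBuilder();
        // Ask for lock details only where the JVM supports them.
        ThreadInfo[] infos = threads.dumpAllThreads(
            threads.isObjectMonitorUsageSupported(),
            threads.isSynchronizerUsageSupported());
        for (ThreadInfo info : infos) {
            sb.append('"').append(info.getThreadName()).append("\" ")
              .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace())
                sb.append("    at ").append(frame).append('\n');
            sb.append('\n');
        }
        return sb.toString();
    }

    /** Used heap divided by max heap, or NaN if the max is undefined. */
    static double heapUsedFraction() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long max = heap.getMax(); // -1 when undefined
        return max > 0 ? (double) heap.getUsed() / max : Double.NaN;
    }

    public static void main(String[] args) {
        System.out.printf("heap used: %.1f%%%n", heapUsedFraction() * 100);
        System.out.print(threadDump());
    }
}
```

Taking several dumps a few seconds apart makes it easier to tell a thread that is stuck from one that is merely busy.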
Application
• Logs
• JMX
• Throughput / Latency
• Test queries
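One way to combine test queries with latency and throughput tracking is to run a fixed probe repeatedly and report percentiles. The sketch below assumes the probe is any `Runnable`. The class name `LatencyProbe` and the sleep-based placeholder query are hypothetical; in a real deployment the probe would be a cache read or an SQL query.

```java
import java.util.Arrays;

// Hypothetical helper, not part of GridGain or Ignite.
final class LatencyProbe {
    /** Runs `query` `runs` times and returns the latencies in nanoseconds, sorted ascending. */
    static long[] measure(Runnable query, int runs) {
        long[] nanos = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            query.run();
            nanos[i] = System.nanoTime() - start;
        }
        Arrays.sort(nanos);
        return nanos;
    }

    /** Nearest-rank percentile (0 < p <= 100) of a non-empty ascending array. */
    static long percentile(long[] sorted, double p) {
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** Operations per second implied by the summed latencies. */
    static double throughput(long[] nanos) {
        long total = Arrays.stream(nanos).sum();
        return total > 0 ? nanos.length / (total / 1e9) : Double.NaN;
    }

    public static void main(String[] args) {
        // Placeholder query: sleeps for 1 ms instead of touching a real cluster.
        Runnable testQuery = () -> {
            try { Thread.sleep(1); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        };
        long[] latencies = measure(testQuery, 200);
        System.out.printf("p50=%d us, p99=%d us, throughput=%.0f ops/s%n",
            percentile(latencies, 50) / 1_000, percentile(latencies, 99) / 1_000, throughput(latencies));
    }
}
```

Percentiles are reported instead of the mean because a small number of slow requests can hide behind a healthy average.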
Tools
Metrics
Tools
Logs
Tools
JVM
MAT
Java Flight Recorder
Tools
Network
Tools
Benchmarking
GridGain
[Diagram: OS, JVM, GridGain, Hardware]
GridGain: Cache Metrics
CacheMetricsMXBean
• CacheGets
• AverageGetTime
• AverageTxCommitTime
• ...
CacheGroupMetricsMXBean
• LocalNodeMovingPartitionsCount
• ClusterMovingPartitionsCount
• ClusterOwningPartitionsCount
• ...
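MXBean attributes such as the ones listed above can be read through the standard JMX API. The sketch below queries the platform MBean server of the current JVM for every MBean matching a name pattern and reads one attribute from each. The class name `JmxAttributeReader` is hypothetical, and the demo uses a built-in JVM MBean. The exact `ObjectName` layout of GridGain MBeans is not shown on the slides, so it is best looked up in a JMX browser such as JConsole before choosing a pattern.

```java
import java.lang.management.ManagementFactory;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.management.AttributeNotFoundException;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Hypothetical helper, not part of GridGain or Ignite.
final class JmxAttributeReader {
    /** Reads `attribute` from every local MBean whose name matches `namePattern`. */
    static Map<String, Object> read(String namePattern, String attribute) {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        Map<String, Object> values = new LinkedHashMap<>();
        try {
            for (ObjectName name : server.queryNames(new ObjectName(namePattern), null)) {
                try {
                    values.put(name.getCanonicalName(), server.getAttribute(name, attribute));
                } catch (AttributeNotFoundException e) {
                    // The MBean matches the pattern but does not expose this attribute; skip it.
                }
            }
        } catch (JMException e) {
            throw new IllegalStateException("JMX read failed for " + namePattern, e);
        }
        return values;
    }

    public static void main(String[] args) {
        // Built-in JVM MBean used for the demo. For a cluster node, pass a pattern that
        // matches its cache MBeans and an attribute name such as "CacheGets".
        System.out.println(read("java.lang:type=Threading", "ThreadCount"));
    }
}
```

Reading from a remote node works the same way once an `MBeanServerConnection` has been obtained through a JMX connector.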
GridGain: Cache Metrics
How to enable cache metrics
CacheConfiguration<K, V> cacheCfg = new CacheConfiguration<>("cache");

// Enable metrics.
cacheCfg.setStatisticsEnabled(true);

ignite.createCache(cacheCfg);
GridGain: Discovery and Communication
TcpDiscoverySpiMBean
• MessageWorkerQueueSize
• AvgMessageProcessingTime
• Coordinator
• NodesFailed
• ...
TcpCommunicationSpiMBean
• OutboundMessagesQueueSize
• SentMessagesCount
• ReceivedMessagesCount
• ...
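Attributes such as SentMessagesCount and ReceivedMessagesCount look like cumulative counters. Assuming they only grow between node restarts, a dashboard is usually more readable when it plots the rate of change instead of the raw value. The sketch below converts successive counter samples into a per-second rate. The class name `CounterRate` is hypothetical.

```java
// Hypothetical helper, not part of GridGain or Ignite.
final class CounterRate {
    private long lastValue;
    private long lastNanos;
    private boolean primed;

    /**
     * Feeds one counter sample taken at `nowNanos` (for example System.nanoTime()).
     * Returns events per second since the previous sample, or NaN for the first
     * sample and whenever the counter went backwards (treated as a restart).
     */
    double update(long value, long nowNanos) {
        double rate = Double.NaN;
        if (primed && value >= lastValue && nowNanos > lastNanos)
            rate = (value - lastValue) / ((nowNanos - lastNanos) / 1e9);
        lastValue = value;
        lastNanos = nowNanos;
        primed = true;
        return rate;
    }

    public static void main(String[] args) {
        CounterRate sent = new CounterRate();
        sent.update(1_000, 0L);                             // first sample only primes the state
        System.out.println(sent.update(1_600, 2_000_000_000L)); // 600 messages over 2 s
    }
}
```

Gauges such as OutboundMessagesQueueSize are plotted as they are; a queue size that keeps growing is itself the warning sign.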
GridGain: Data Storage
[Diagram: RAM, Disk, WAL]
GridGain: Data Storage Metrics
Data volume
DataStorageMetricsMXBean
• WalTotalSize
• TotalAllocatedSize
• OffheapUsedSize
• ...
DataRegionMetricsMXBean
• TotalAllocatedPages
• AllocationRate
• PagesFillFactor
• ...
GridGain: Data Storage Metrics
Checkpoints
DataStorageMetricsMXBean
• DirtyPages
• CheckpointTotalTime
• LastCheckpointDuration
• UsedCheckpointBufferSize
• LastCheckpointPagesWriteDuration
• LastCheckpointMarkDuration
• LastCheckpointTotalPagesNumber
• ...
[Diagram: checkpoint marker, RAM, Disk, WAL]
GridGain: Data Storage Metrics
Page replacement
DataRegionMetricsMXBean
• PagesReplaceRate
• PagesReplaceAge
• PagesReplaced
[Diagram: RAM, Disk, R/W]
GridGain: Data Storage Metrics
How to enable data storage metrics
DataStorageConfiguration storageCfg = new DataStorageConfiguration();
DataRegionConfiguration regionCfg = new DataRegionConfiguration();
regionCfg.setName("myDataRegion");

// Enable metrics.
storageCfg.setMetricsEnabled(true); // Metrics for data storage.
regionCfg.setMetricsEnabled(true); // Metrics for a particular data region.

storageCfg.setDataRegionConfigurations(regionCfg);
GridGain: IO metrics
Coming in 2.8
IoStatisticsMetricsMXBean
• CacheGroupLogicalReads
• CacheGroupPhysicalReads
• IndexLogicalReads
• IndexPhysicalReads
• ...
GridGain: WebConsole
GridGain Monitoring
Demo
Checklist for monitoring
• CPU / Memory / Disk / Network
• GC logs
• Application logs
+ Problematic places specific to your setup
Q&A
https://github.com/dmekhanikov/ignite-elk/
https://console.gridgain.com/ [email protected]