Hawkular Metrics
Metric Storage & Alerting
Stefan Negrea
About Me
Co-Creator of Hawkular Metrics
Introduction to Hawkular Metrics
Hawkular Metrics & Alerting
Hawkular Demo
Pre-History
● 2006 JBoss Operations Network 1.0
● 2008 Project RHQ
○ JBoss Operations Network 2.0
○ Metrics stored in Postgres
Pre-History
Pre-History
● 2012 - 2013 RHQ Storage Nodes
○ Cassandra based
○ Store metrics
● 2014 RHQ Metrics
Hawkular
It’s a hawk with a monocular. Hawks are known for their very sharp vision and are very good hunters; they can catch prey at high speed by anticipating its movements.
The goal is to be able to monitor and catch anomalies in fast-paced environments.
All* projects are Apache License 2.0
History
● 2014 Hawkular organization formed
● 2014 Hawkular Alerting started
● 02/2015 RHQ Metrics joins Hawkular org
● 12/2015 Hawkular Metrics integrated in OpenShift Origin v3
● 10/2016 Hawkular Metrics includes Hawkular Alerting
Hawkular Metrics is a storage engine for metric data
metric data = a measurement taken at a specific time
storage engine = store metrics efficiently for their useful lifetime
Hawkular Metrics
● Gauge
○ number
○ varies (not monotonic)
○ rate of change
● Counter
○ integer
○ monotonic (increasing or decreasing)
○ rate of change
○ support for reset
Supported Metrics
Memory usage
(metric1, 4.5, 1493301898245)
(metric1, 5.6, 1493301898246)
(metric1, 1.2, 1493301898247)

Number of visitors
(metric2, 4, 1493301898248)
(metric2, 5, 1493301898249)
(metric2, 9, 1493301898250)
(metric2, 0, 1493301898251)
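Samples like the ones above are sent to the bulk raw-insert endpoint as a JSON array. A minimal sketch of building that body in Python (the `id`/`data`/`timestamp`/`value` field names are assumptions based on the REST API, not verified against a specific release):

```python
import json

# Gauge samples from the slide: (metric id, value, timestamp in ms)
samples = [
    ("metric1", 4.5, 1493301898245),
    ("metric1", 5.6, 1493301898246),
    ("metric1", 1.2, 1493301898247),
]

def to_payload(samples):
    """Group (id, value, timestamp) tuples into a per-metric JSON body."""
    by_id = {}
    for metric_id, value, ts in samples:
        by_id.setdefault(metric_id, []).append({"timestamp": ts, "value": value})
    return [{"id": mid, "data": points} for mid, points in by_id.items()]

payload = to_payload(samples)
print(json.dumps(payload))
```

The resulting array would be POSTed with the tenant header set; here it is only printed.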
● Availability
○ availability of a resource
○ up, down, or unknown
○ can compute interesting stats based on values
● String
○ just that
○ possible uses: logs, events, config
Supported Metrics
Server status
(metric3, UP, 1493301898253)
(metric3, DOWN, 1493301898254)
(metric3, UP, 1493301898255)

Value of configuration key ‘k’
(metric4, “k=v”, 1493301898256)
(metric4, “k=t”, 1493301898257)
(metric4, “k=1”, 1493301898258)
(metric4, “k=4”, 1493301898259)
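The “interesting stats” for availability can be illustrated locally. A sketch in plain Python (not the server-side implementation) computing downtime duration and uptime ratio from the samples above:

```python
# Availability samples from the slide: (timestamp in ms, state)
samples = [(1493301898253, "UP"), (1493301898254, "DOWN"), (1493301898255, "UP")]

# Sum the time spent in DOWN between consecutive samples
down_ms = 0
for (t, state), (t_next, _) in zip(samples, samples[1:]):
    if state == "DOWN":
        down_ms += t_next - t

total_ms = samples[-1][0] - samples[0][0]
uptime_ratio = 1 - down_ms / total_ms
print(down_ms, uptime_ratio)  # 1 ms down out of 2 ms -> ratio 0.5
```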
Management & Support
● Highly available, fault tolerant
● No specialized node roles
● Minimal configuration

Performance & Scalability
● Optimized for writes
● Data compression
● Indexing
Cassandra - Storage
● CQL based
● Partitioning & indexing of data based on usage
● Use built-in compression & TTL
● Use the Datastax driver fully async
● Support for latest C* 3.0.x release
● Keep updating to latest stable
● Use multiple tables for indexing
Cassandra - Storage
● REST API with JSON
● JAX-RS 2.0 (async spec)
● Fully async = JAX-RS 2.0 async + RX Java + async C* driver
● Stateless** server (Metrics, mostly)
● Minimal clustering via Infinispan
● Schema management
● Easy to use
○ packaged distribution with WildFly
○ download and run, only JDK required
App Layer
C* - 4 CPU, 4GB; Hawkular - 4 CPU, 4GB
message sizes:
10 datapoints: 2592 req/sec => 25920 datapoints/sec
100 datapoints: 365 req/sec => 36500 datapoints/sec
5000 datapoints: 7.6 req/sec => 38000 datapoints/sec

C* - 8 CPU, 8GB; Hawkular - 8 CPU, 4GB
message sizes:
10 datapoints: 4655 req/sec => 46550 datapoints/sec
100 datapoints: 604 req/sec => 60400 datapoints/sec
5000 datapoints: 15 req/sec => 75000 datapoints/sec
Performance - Sample
● Multi-tenant○ tenant id required on each request (HAWKULAR-TENANT header)
○ no way to get data from multiple tenants at once
● Can insert data without pre-creating metrics
● Data is compressed using Gorilla compression
○ 2 hour time window
○ further reduces disk footprint
○ LZ4 enabled in Cassandra
○ Load testing:
■ 5000 data points/sec for 5 days = 26GB
■ 83M data points ~ 1GB of disk space
Features
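The load-testing numbers above are self-consistent; a quick arithmetic check in Python:

```python
# 5000 data points/sec sustained for 5 days (load test from the slide)
points = 5000 * 60 * 60 * 24 * 5
gb_on_disk = 26

# ~2.16 billion points in 26GB -> roughly 83M points per GB of disk
points_per_gb = points / gb_on_disk
print(points, round(points_per_gb / 1e6))
```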
● Bulk insertion endpoint for metrics and data
● Tagging support for metrics and single data points
○ key, value; multi-tag support
■ tag1 = d
○ metrics queryable via TQL (tag query language)
■ AND, OR, NOT
■ grouping
■ wildcard matching
■ a1 = 'd' OR ( a1 != 'ab' AND c1 )
Features
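A TQL expression like the one above is passed to the server as a URL query parameter. A minimal sketch, where the endpoint path and the `tags` parameter name are assumptions rather than verified against a specific Hawkular Metrics version:

```python
from urllib.parse import urlencode

# Query gauge metric definitions matching the TQL expression from the slide
base = "http://localhost:8080/hawkular/metrics/gauges"
tql = "a1 = 'd' OR ( a1 != 'ab' AND c1 )"
url = base + "?" + urlencode({"tags": tql})
print(url)
```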
● Endpoint for each metric type
○ /gauges, /availability, /counters, /strings
○ each metric type has almost identical endpoints
● Raw data - /gauges/raw
● Raw data for single metric - /strings/{metric_id}/raw
● Query time aggregation
○ multiple metrics - /availability/stats
○ single metric - /counters/{metric_id}/stats
● Bulk operations - /metrics
** String metrics do not have stats (yet?)
Features - Simple REST API
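A query-time aggregation call is just a GET with a time range and a bucket count. A sketch for a single counter, using a hypothetical local endpoint; the `start`/`end`/`buckets` parameter names are assumptions based on the REST API:

```python
from urllib.parse import urlencode

# Request two buckets of stats for one counter over the sample time range
metric_id = "metric2"
params = urlencode({"start": 1493301898248, "end": 1493301898251, "buckets": 2})
url = "http://localhost:8080/hawkular/metrics/counters/%s/stats?%s" % (metric_id, params)
print(url)
```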
● Query Time Aggregation
○ Combine multiple metrics and get statistical data
○ Gauge and counter: average, median, percentile, sum
○ Availability: ratios for uptime and downtime, downtime duration
○ Time Slicing: first group data, then compute stats
○ Single or multiple metrics
● Rate
○ available for gauges and counters
○ rate of change of the values for the timespan
○ ex: how fast is the number of total requests increasing
Features - Aggregation & Rate
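The same statistics and rate can be sketched locally over the counter samples from earlier. Plain Python for illustration, not the server-side implementation; the per-minute rate scaling is an assumption:

```python
import statistics

# Counter samples: (timestamp in ms, value)
samples = [(1493301898248, 4), (1493301898249, 5), (1493301898250, 9)]
values = [v for _, v in samples]

avg = statistics.mean(values)    # average -> 6
med = statistics.median(values)  # median  -> 5
total = sum(values)              # sum     -> 18

# Rate of change over the timespan: delta value / delta time,
# scaled from per-millisecond to per-minute
(t0, v0), (tn, vn) = samples[0], samples[-1]
rate_per_min = (vn - v0) / (tn - t0) * 60000
print(avg, med, total, rate_per_min)
```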
● Natural fit: collect data and then alert on anomalies
● Two ways to alert on metric data
○ Dedicated API for setting up alerts; incoming data is filtered and processed by the alerting engine
○ Metrics Alerter that queries single or multiple metrics; no need to predefine alert triggers ahead of time
Metrics + Alerting
● Single and group triggers
● Template triggers
● Complex conditions
● Dampening
● Auto-resolve/auto-disable triggers
● Pluggable notifiers
Alerting Features
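As a rough illustration of dampening, here is a toy sketch of one policy (fire only after N consecutive condition matches). This is a hypothetical helper, not the Hawkular Alerting API, which supports richer dampening policies:

```python
def fire_events(values, threshold, consecutive):
    """Yield indexes where an alert would fire: value > threshold
    for `consecutive` evaluations in a row."""
    streak = 0
    for i, v in enumerate(values):
        streak = streak + 1 if v > threshold else 0
        if streak == consecutive:
            yield i
            streak = 0  # reset, so the trigger can fire again later

cpu = [10, 95, 96, 40, 97, 98, 99]
print(list(fire_events(cpu, threshold=90, consecutive=2)))  # fires at index 2 and 5
```

A single spike (index 1 alone, or index 4 alone) never fires; only sustained breaches do, which is the point of dampening.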
● Automatic & persisted aggregation
● Management capabilities for the Cassandra cluster
● Query language
● Performance improvements
○ already have a good baseline, but can do better
○ read/write
Roadmap - 2017
Demo
● Install ccm
○ https://github.com/pcmanus/ccm
● Start a single node C* cluster
○ ccm create -v 3.0.12 -n 1 -s hawkular
● Download, extract and start Hawkular Metrics
○ https://origin-repository.jboss.org/nexus/content/groups/public/org/hawkular/metrics/hawkular-metrics-wildfly-standalone/0.26.1.Final/
○ bin/standalone -b 0.0.0.0
● Download, extract and start Grafana
● Download, install, and configure the Hawkular plugin for Grafana
○ https://grafana.com/plugins/hawkular-datasource/installation
○ https://github.com/hawkular/hawkular-grafana-datasource
○ pick a tenant id of your choice
Demo
● Install the Hawkular Metrics python client via pip
○ pip install hawkular-client
● Install psutil to collect CPU stats
○ pip install psutil
● Create a custom agent (using the python client)
○ make sure you use the same tenant id configured with Grafana
○ pre-create and tag a metric for each CPU
○ collect CPU usage every 10 seconds
○ send the data to Hawkular Metrics
Demo
Demo
#!/usr/bin/env python3
import time

import psutil
from hawkular.metrics import HawkularMetricsClient, MetricType

# Same tenant id as configured in the Grafana datasource
client = HawkularMetricsClient(tenant_id='test')

# Pre-create and tag one gauge metric per CPU
cpu_percent = psutil.cpu_percent(interval=1, percpu=True)
for index, cpu in enumerate(cpu_percent):
    client.create_metric_definition(MetricType.Gauge, 'cpu%s' % index, cpu='cpu%s' % index)

# Collect CPU usage and push it to Hawkular Metrics every 10 seconds
while True:
    cpu_percent = psutil.cpu_percent(interval=1, percpu=True)
    for index, cpu in enumerate(cpu_percent):
        client.push(MetricType.Gauge, 'cpu%s' % index, float(cpu))
    time.sleep(10)
● Web - http://www.hawkular.org/
● Github - https://github.com/hawkular
● Metrics Documentation - http://www.hawkular.org/tags/metrics.html
● Alerting Documentation - http://www.hawkular.org/tags/alerts.html
● Twitter - https://twitter.com/hawkular_org
Resources