Date post: | 16-Apr-2017 |
Category: |
Technology |
Upload: | andrew-dufour |
Andrew [email protected], Engineer, Chef Software, @andrewdufour
Nathan [email protected], Team Manager, Chef Software, @ndcerny
The Art of Monitoring
“There is no instance of a nation benefitting from prolonged warfare.”
― Sun Tzu, The Art of War
Problem Statement
To make effective decisions and to effectively respond to incidents, we must have visibility into our systems.
Simplicity > Perfection
“Everything should be made as simple as possible. But not simpler.”
― Albert Einstein
What should you monitor?
Operating System
• Disk
• CPU
• Memory
• System Logs
Supporting Services
• RabbitMQ
• Solr
• PostgreSQL
• Nginx
Application Services
• Erchef
• Bifrost
• Application Logs
Tools 101
• StatsD – A network daemon that runs on the Node.js platform and listens for statistics, like counters and timers. https://github.com/etsy/statsd
• Grafana – Beautiful dashboards
• TICK Stack – A series of tools that comprise the ‘Influx Data Platform’, including an easily scalable time series database. https://influxdata.com/time-series-platform/
• Sensu – Monitoring that doesn't suck. https://sensuapp.org/
• Splunk – Centralized logging, operational intelligence, big machine data tool. http://www.splunk.com/
Instrumenting our Erlang Based Services - StatsHero
• Example metrics emitted in Statsd format:
  test_hero.upstreamRequests.rdbms:1200|h
  Fields: namespace (test_hero), category (upstreamRequests), metric (rdbms), measurement (1200), metric type (h = histogram)
• Enabling StatsHero in your chef-server.rb:
  estatsd['enabled'] = true
  estatsd['protocol'] = 'statsd'
  estatsd['vip'] = '<statsd server>'
  estatsd['port'] = '<statsd port>'
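To make the field layout of the example metric concrete, here is a minimal Ruby sketch that splits a Statsd-format line into the parts labeled on the slide. The `StatsdMetric` struct and `parse_statsd_line` helper are hypothetical names introduced for illustration; they are not part of StatsHero or the Chef server.

```ruby
# Hypothetical helper: split one Statsd-format line into the fields
# named on the slide (namespace, category, metric, measurement, type).
StatsdMetric = Struct.new(:namespace, :category, :metric, :measurement, :type)

def parse_statsd_line(line)
  # "name:value|type" -> name and the rest
  name, rest = line.strip.split(':', 2)
  raise ArgumentError, "missing ':' in #{line.inspect}" if rest.nil?

  value, type = rest.split('|', 2)
  raise ArgumentError, "missing '|' in #{line.inspect}" if type.nil?

  # Assumes the name has the three dot-separated parts shown in the example.
  namespace, category, metric = name.split('.', 3)
  StatsdMetric.new(namespace, category, metric, Float(value), type)
end

m = parse_statsd_line('test_hero.upstreamRequests.rdbms:1200|h')
puts "#{m.namespace} #{m.category} #{m.metric} #{m.measurement} #{m.type}"
```

A parser like this can be handy when checking by hand that the metrics arriving at your Statsd server look the way you expect.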
Instrumenting our Erlang Based Services
[Diagram: Erchef and Bifrost → Stats Hero → Statsd; Folsom-Graphite → Graphite]
Instrumenting our Erlang Based Services - Folsom Metrics
• Example metrics:
  pooler.chef_depsolver.in_use_count
  pooler.chef_depsolver.free_count
  pooler.sqerl.in_use_count
  pooler.sqerl.free_count
• Enabling folsom metrics in your chef-server.rb:
  folsom_graphite['enabled'] = true
  folsom_graphite['host'] = '<your graphite host>'
  folsom_graphite['port'] = '<your graphite port>'
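One way to use the pooler metrics is to turn the in-use and free counts into a utilization figure for a dashboard or alert. The sketch below assumes that in_use_count plus free_count equals the pool size; that is an assumption, not something the slides state. The `pool_utilization` helper and the sample readings are hypothetical.

```ruby
# Hypothetical helper: fraction of a pool that is currently checked out.
# Assumes in_use_count + free_count equals the total pool size.
def pool_utilization(in_use_count, free_count)
  total = in_use_count + free_count
  return 0.0 if total.zero?

  in_use_count.to_f / total
end

# Made-up sample readings, keyed by the metric names shown above.
sample = {
  'pooler.sqerl.in_use_count' => 15,
  'pooler.sqerl.free_count' => 5
}

util = pool_utilization(sample['pooler.sqerl.in_use_count'],
                        sample['pooler.sqerl.free_count'])
puts format('sqerl pool utilization: %.0f%%', util * 100)
```

A pool that stays close to fully used is a hint that the matching pool size setting may be worth a look.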
Instrumenting our Erlang Based Services
[Diagram: Erchef and Bifrost → Stats Hero → Statsd; Folsom-Graphite → Graphite; Logs → Log Collector]
Instrumenting our Erlang Based Services – Collecting Logs
• Use a full-featured log collector like Splunk to centralize logs.
• All of our services log into a common directory structure: /var/log/opscode/<service name>
• The two most important files within that directory are:
  current
  error
• There are also request logs, which repeat information available elsewhere.
• All services shipped with the omnibus package, not just Erlang services, log here.
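Because every service logs under the same directory layout, the list of files to hand to a log collector can be generated rather than typed out. This is a minimal sketch: the `log_paths` helper is hypothetical, the file names follow the two files named above, and the service names in the usage line are examples that may not match your installation.

```ruby
# Root of the common log directory structure described above.
LOG_ROOT = '/var/log/opscode'

# Hypothetical helper: build the log file paths to forward for each service.
# The default file names are taken from the slide; adjust them if your
# installation names them differently.
def log_paths(services, files = %w[current error])
  services.flat_map do |service|
    files.map { |file| File.join(LOG_ROOT, service, file) }
  end
end

# Example service names (assumptions); substitute the directories you see
# under /var/log/opscode on your own server.
puts log_paths(%w[opscode-erchef oc_bifrost])
```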
Sometimes Ohai tuning is needed (e.g., Centrify)
ALWAYS USE PARTIAL SEARCH! (and look at SafeSearch)
Know what a dependency graph is… and manage it.
Chef-server.rb
• https://docs.chef.io/config_rb_server.html
• https://docs.chef.io/config_rb_server_optional_settings.html
• https://github.com/chef/chef-server/blob/master/omnibus/files/private-chef-cookbooks/private-chef/attributes/default.rb
• How does chef-server.rb work? The Chef server's reconfigure is driven by a cookbook called PrivateChef. PrivateChef is a cookbook just like any other, with some helper libraries that read your chef-server.rb and make sense of it.
• Actually tuning a setting: opscode_erchef['db_pool_size'] = "20"
A quick look at PrivateChef
As you can see, we're creating a new Module called PrivateChef.
The configuration attributes are defined as new Mashes. When you say opscode_erchef['key'] = value, you're really just assigning a value to the Mash created in the PrivateChef module.
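The mechanism described above can be sketched in a few lines of Ruby. This is a simplified stand-in, not the real PrivateChef code: `MiniPrivateChef`, `SETTINGS`, and `from_string` are hypothetical names, and plain Hashes take the place of Mashes so the example runs without any gems. The real cookbook reads the file from disk and may differ in detail.

```ruby
# Simplified stand-in for the idea described above; NOT the real PrivateChef.
# Plain Hashes replace Mashes so the sketch has no gem dependencies.
module MiniPrivateChef
  # One settings container per service, created on first access.
  SETTINGS = Hash.new { |hash, service| hash[service] = {} }

  class << self
    def opscode_erchef
      SETTINGS['opscode_erchef']
    end

    def postgresql
      SETTINGS['postgresql']
    end

    # Evaluate config text with this module as self, so a bare
    # opscode_erchef['key'] = value resolves to the methods above.
    def from_string(config_text)
      instance_eval(config_text)
      SETTINGS
    end
  end
end

config = "opscode_erchef['db_pool_size'] = '20'"
settings = MiniPrivateChef.from_string(config)
puts settings['opscode_erchef']['db_pool_size']
```

The point of the design is that chef-server.rb is ordinary Ruby: each line is a method call followed by a key assignment, which is why a typo in a service name fails loudly instead of being silently ignored.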
Chef Front-end Server
• Bifrost
• Erchef
• Nginx
Nginx
• Enable cookbook cache
• S3 URL Expiry
Bifrost
• Db pooler timeout
• Db pooler queue size
• Db pool size
Authz
• Initial Pool Count
• Max Pool Count
• Max Queue Size
Chef Front-end Server
• Bifrost
• Erchef
• Nginx
Erchef
• Depsolver workers
• Depsolver timeout
• Authz
• Db pooler timeout
• Db pooler queue size
• Db pool size
• Keygen_cache_size
Chef Back-end Server
• RabbitMQ
• PostgreSQL
• Solr
PostgreSQL
• Checkpoint Segments
• Checkpoint completion target
• Log min duration statement
Solr
• Heap size
• New size
RabbitMQ
• Analytics max length
• Dark launch
• Max connections
Helpful Links
Sensu:
• https://sensuapp.org/
• https://github.com/sensu-plugins/sensu-plugins-postgres
• https://github.com/sensu-plugins/sensu-plugins-rabbitmq
• https://github.com/sensu-plugins/sensu-plugins-solr
• https://github.com/sensu-plugins/sensu-plugins-nginx
• https://github.com/sensu-plugins/sensu-plugins-filesystem-checks
Statsd: https://github.com/etsy/statsd
InfluxDB: https://influxdata.com/
Splunk: http://www.splunk.com/
More Useful Tools
• PGBadger - https://github.com/dalibo/pgbadger
• Monitor PostgreSQL: https://wiki.postgresql.org/wiki/Monitoring
• How to Monitor Nginx: https://www.scalyr.com/community/guides/how-to-monitor-nginx-the-essential-guide
• Pgtune - http://pgfoundry.org/projects/pgtune
  pgtune takes the wimpy default postgresql.conf and expands the database server to be as powerful as the hardware it's being deployed on. Be careful about shared resources: Pgtune assumes you have a dedicated Postgres server.
• GCViewer - http://www.tagtraum.com/gcviewer.html
  Helps you analyze your GC activity, so you can make decisions on tuning.
Alternative Tools
• ELK: https://www.elastic.co/webinars/introduction-elk-stack
• Graylog: https://www.graylog.org/
• Loggly: https://www.loggly.com/
• Graphite: https://github.com/graphite-project/
• Datadog: https://www.datadoghq.com/
• So many more…
Special Thanks
• Irving Popovetsky and his blog post on tuning the Chef server for scale: http://irvingpop.github.io/blog/2015/04/20/tuning-the-chef-server-for-scale/
• Mark Harrison, Paul Mooring and the Chef server team. The dashboards are heavily based on their dashboards for hosted Chef.
• Phil Dibowitz and Facebook, for teaching Andrew a lot about tuning the Chef server for a scale that almost none of our other customers hit.
Live Demo
• Link to GitHub: https://github.com/andy-dufour/chef-server-monitoring/