Post on 08-May-2015
description
transcript
Tools bridging the gap between MySQL engineering, ops & DBAs
Shlomi Noach
MySQL DevOps @ Outbrain
Shlomi Noach
MySQL DevOps @ OutbrainShlomi Noach
Percona Live 2014Copyright © 2014, Outbrain
About me
● Engineer, DBA● Working with MySQL since 2000● Formerly consultant, instructor● Author of common_schema, openark-kit, propagator● Write at http://openark.org● Work at the infrastructure team, Outbrain
MySQL DevOps @ OutbrainShlomi Noach
Percona Live 2014Copyright © 2014, Outbrain
About Outbrain
● The leading content discovery platform on the web● Embedded in over 90,000 websites● Serves over 150 million unique US visitors, 15 billion
pages and 100 billion recommendations per month● You may not be familiar with us by name, but have met us
frequently.● We aim to provide with reliable content to our users.
MySQL DevOps @ OutbrainShlomi Noach
Percona Live 2014Copyright © 2014, Outbrain
PAID DISCOVERY
INTERNAL DISCOVERY
MySQL DevOps @ OutbrainShlomi Noach
Percona Live 2014Copyright © 2014, Outbrain
About Outbrain
● Managing total of over 2,000 servers (Hadoop, Cassandra, MySQL, web services, …)
● Processing about 1 Petabyte of information● Over 70 engineers● Doing continuous deployments● Fans and supporters of open source● Have "Ownership" culture: "You build it, you run it!"
○ Must be supported by technology
MySQL DevOps @ OutbrainShlomi Noach
Percona Live 2014Copyright © 2014, Outbrain
What's DevOps?
● Or, DevDbaOps?
?
MySQL DevOps @ OutbrainShlomi Noach
Percona Live 2014Copyright © 2014, Outbrain
What's DevOps?
● Often described as developers doing ops work, or ops doing engineering work
● I see this more as the integration between the groups● Avoiding the scenario where parties have no control of
parts of their domain.○ Tools○ Techniques○ Culture
MySQL DevOps @ OutbrainShlomi Noach
Percona Live 2014Copyright © 2014, Outbrain
What's DevOps?
● With good DevOps, you get:○ Ownership○ Visibility○ Action-ability (word has just been invented and will be used as axiom)
● Allowing engineers own and be responsible for their apps.○ No need for ops telling them something is wrong○ No need to sit with ops to understand what is wrong○ No need to ask ops to deploy changes
● All the while giving ops visibility into engineers actions
MySQL DevOps @ OutbrainShlomi Noach
Percona Live 2014Copyright © 2014, Outbrain
Tribute: automation
● We use chef for automation● Some databags leftovers, changing to attributes● Everything is under version control
○ Allows ops/DBAs easily add/remove packages○ Different treatment for masters○ Different my.cnf settings based on MySQL role○ Different my.cnf settings based on hardware○ Setting up backup servers○ More...
MySQL DevOps @ OutbrainShlomi Noach
Percona Live 2014Copyright © 2014, Outbrain
Automation: One Ring to Rule Them All
● Outbrain's onering is an orchestration solution● Provisioning servers: from operating system through
packages (via chef integration) to application deployment (via glu integration)
○ Allows for a one click "I want a host with MyService tomcat service", or "I want a host with MySQL server"
● Then acting as inventory service○ "give me all MySQL servers in the LA data center"○ "which disks do our OLAP servers use?"
● https://github.com/outbrain/onering
MySQL DevOps @ OutbrainShlomi Noach
Percona Live 2014Copyright © 2014, Outbrain
Onering & pmysql: on-demand semi-automated actions
● pmysql is a parallel MySQL client (originally developed by Domas Mituzas)
● Using onering's API, we can:
curl "https://my.onering.service/api/devices /list/name/where/chef.run_list/mysql/name/olap? format=txt" | pmysql -pmypass "stop slave"
curl "https://my.onering.service/api/devices /list/name/where/chef.run_list/mysql/name/olap? format=txt" | pmysql -pmypass "select @@version" | grep tokudb | awk '{print $1}' | pmysql -pmypass "stop slave"
MySQL DevOps @ OutbrainShlomi Noach
Percona Live 2014Copyright © 2014, Outbrain
Visibility
● A classic developers-ops collision: slow queries○ Ops notice increased I/O, slave lags○ What do they know of the domain of the problem?○ Developers see long response times○ What visibility do they get?
MySQL DevOps @ OutbrainShlomi Noach
Percona Live 2014Copyright © 2014, Outbrain
Box Anemometer
MySQL DevOps @ OutbrainShlomi Noach
Percona Live 2014Copyright © 2014, Outbrain
What makes Anemometer such a good DevOps tool?
● It provides visibility to everyone● The engineer doesn't need to know what slow logs are,
where they are located, how to interpret them.● It promotes ownership in that it gets the drill down per
query/per host/per service● The Permalink. How such a small thing can make all the
difference● Ops can hand over what they think is the "guilty query"
MySQL DevOps @ OutbrainShlomi Noach
Percona Live 2014Copyright © 2014, Outbrain
Anemometer Host
Anemometer @ Outbrain, behind the scenes
MySQL Slowlog
MySQL Slowlog
MySQL Slowlog
Slowlog
Slowlog
Slowlog
logstash
logstash
logstash
pt-query-digest
Webinterface
MySQL DevOps @ OutbrainShlomi Noach
Percona Live 2014Copyright © 2014, Outbrain
Multiple services, multiple MySQL hosts: who makes it slow?
MySQL Slowlog
MySQL Slowlog
MySQL Slowlog
service
service
service
● What is our analysis granulation?
● Are slow logs caused by a query?
● Affected by a loaded MySQL host?
● By a loaded service?
MySQL DevOps @ OutbrainShlomi Noach
Percona Live 2014Copyright © 2014, Outbrain
Anemometer, collecting the slow logs
input { tcp { port => 23306 type => "mysql-slow" mode => "server" }}
filter { dns { reverse => [ "@source_host", "source_host_name" ] action => "replace" }}
output { file { type => "mysql-slow" message_format => "%{@message}" path => "/path/to/slow_logs/logstash/%{@source_host}-mysql-slow.log" }}
MySQL DevOps @ OutbrainShlomi Noach
Percona Live 2014Copyright © 2014, Outbrain
Anemometer, rotating the slow logs
/outbrain/slow_logs/logstash/*.log { daily nocompress size 1 missingok ifempty copytruncate prerotate /bin/bash /var/www/html/anemometer/outbrain/pre_rotate.sh $1 endscript nosharedscripts rotate 100}
● logstash streams logs onto the anemometer machine● We choose not to aggregate them into one; the target file
name indicates the source host name
MySQL DevOps @ OutbrainShlomi Noach
Percona Live 2014Copyright © 2014, Outbrain
Anemometer, processing the slow logs
#!/bin/bash
rotated_slow_log_file=$1rotated_slow_log_file_path=$(dirname $rotated_slow_log_file)rotated_slow_log_file_name=$(basename $rotated_slow_log_file)
hostname=${rotated_slow_log_file_name%%-mysql-slow.log*}
/bin/grep -v "^$" $rotated_slow_log_file | /usr/bin/pt-query-digest --user=... --password=... \ --review u=,p=,h=localhost,D=...,t=global_query_review \ --history u=,p=,h=localhost,D=...,t=global_query_review_history \ --filter=" \$event->{Bytes} = length(\$event->{arg}) and \$event->{hostname}=\"${hostname}\" and \$event->{clustername}=\"${clustername}\"" \ --no-report --group-by-extra=host
● Reading files per mysql-host, adding host & cluster● Secondary grouping by client-host
MySQL DevOps @ OutbrainShlomi Noach
Percona Live 2014Copyright © 2014, Outbrain
Anemomaster: visibility into master DML
● One of those "How did we ever live without it?" tools.● Provides near real time (10 minute granularity) visibility
into queries issued on master.● Got an unexpected burst of INSERTs? Anemomaster
provides a quick and accurate access into the specific "guilty" query.
● And ops take a permalink to the owner.● "Anemomaster" is a nickname. This is Anemometer on
top of binary log analysis instead of slow log, analyzing number of executions instead of total run time.
● Also writing all DMLs to graphite.
MySQL DevOps @ OutbrainShlomi Noach
Percona Live 2014Copyright © 2014, Outbrain
Anemomaster
● Pinpointing count executions of a specific UPDATE query● This query is owned by a known team.
MySQL DevOps @ OutbrainShlomi Noach
Percona Live 2014Copyright © 2014, Outbrain
Anemomaster Host
Anemomaster @ Outbrain, behind the scenes
MySQLMaster
Binarylog
MySQLSlave
pt-query-digestRelaylog
MySQLMaster
Binarylog
MySQLSlave
pt-query-digest
Relaylog
Binarylog
Webinterface
MySQL DevOps @ OutbrainShlomi Noach
Percona Live 2014Copyright © 2014, Outbrain
Anemomaster, processing the binary logs
/usr/bin/mysql -umy_user -pmy_password -e 'flush relay logs\G;'sleep 1
binlog_file=$(ls -tr /path/to/mysql/mysqld-relay-bin.[0-9]* | tail -n 2 | head -n 1)
mysqlbinlog $binlog_file | /usr/bin/pt-query-digest \ --type binlog --order-by Query_time:cnt --group-by fingerprint \ --limit 100 --review h=myhost,D=anemomaster,t=global_query_review \ --history h=myhost,D=anemomaster,t=global_query_review_history \ --filter=" \$event->{Bytes} = length(\$event->{arg}) and \$event->{hostname}=\"$(hostname)\" and \$event->{clustername}=\"${clustername}\" and \$event->{host}=\"n/a\" " \ --no-report
● Actually processing the relay logs on slaves● Assumes SBR, work in progress for RBR
MySQL DevOps @ OutbrainShlomi Noach
Percona Live 2014Copyright © 2014, Outbrain
Anemomaster, writing to graphitequery=" select ... "mysql anemomaster --silent --silent --raw -e "$query" | while IFS=$'\t' read -r -a result_values do fingerprint_cluster=${result_values[0]} ; fingerprint_count=${result_values[1]} ; fingerprint_query=${result_values[2]} ; fingerprint_query=$(echo $fingerprint_query | sed -r -e "s/^(-- .*)]//g") fingerprint_query=$(echo $fingerprint_query | tr '\n' ' ' | tr '\r' ' ' | tr '\t' ' ') fingerprint_query=${fingerprint_query%%(*} fingerprint_query=${fingerprint_query%%,*} fingerprint_query=${fingerprint_query%% set *} fingerprint_query=${fingerprint_query%% SET *} fingerprint_query=${fingerprint_query%% where *} fingerprint_query=${fingerprint_query%% WHERE *} fingerprint_query=${fingerprint_query%% join *} fingerprint_query=${fingerprint_query%% JOIN *} fingerprint_query=${fingerprint_query%% using *} fingerprint_query=${fingerprint_query%% USING *} fingerprint_query=${fingerprint_query%% select *} fingerprint_query=${fingerprint_query%% SELECT *} fingerprint_query=$(echo $fingerprint_query | tr -d "\`") fingerprint_query=$(echo $fingerprint_query | tr -d "*") fingerprint_query=$(echo $fingerprint_query | tr " " "_") fingerprint_query=$(echo $fingerprint_query | tr "." "__") echo "data.mysql.${fingerprint_cluster}.mysql_dml.${fingerprint_query}.count \ ${fingerprint_count} $unixtime" | nc -w 1 graphite 3003 done
MySQL DevOps @ OutbrainShlomi Noach
Percona Live 2014Copyright © 2014, Outbrain
audit_login: a login auditing plugin
● Auditing every single login to our databases○ Keeping track of connects per minute, find problems○ Detecting unused accounts○ Detecting failed connects, taking action○ Detecting naughty scripts executed by developers
(haha, got your IP!)○ And, well, auditing for the record
MySQL DevOps @ OutbrainShlomi Noach
Percona Live 2014Copyright © 2014, Outbrain
audit_login, output
{"ts":"2013-09-11 09:11:47","type":"successful_login","myhost":"gromit03","thread":"74153868","user":"web_user","priv_user":"web_user","host":"web-87.localdomain","ip":"10.0.0.87"}
{"ts":"2013-09-11 09:11:55","type":"failed_login","myhost":"gromit03","thread":"74153869","user":"backup_user","priv_user":"","host":"web-32","ip":"10.0.0.32"}
{"ts":"2013-09-11 09:11:57","type":"failed_login","myhost":"gromit03","thread":"74153870","user":"backup_user","priv_user":"","host":"web-32","ip":"10.0.0.32"}
{"ts":"2013-09-11 09:12:48","type":"successful_login","myhost":"gromit03","thread":"74153871","user":"root","priv_user":"root","host":"localhost","ip":"10.0.0.111"}
{"ts":"2013-09-11 09:13:26","type":"successful_login","myhost":"gromit03","thread":"74153872","user":"web_user","priv_user":"web_user","host":"web-11.localdomain","ip":"10.0.0.11"}
MySQL DevOps @ OutbrainShlomi Noach
Percona Live 2014Copyright © 2014, Outbrain
logstashread, transform, write
Kibana
Searchable via Lucene
audit_login @ Outbrain, behind the scenes
MySQLMaster
auditlog
MySQLMaster
auditlog
MySQLMaster
auditlog
audit meta log
grep-able likemama used to make
MySQL DevOps @ OutbrainShlomi Noach
Percona Live 2014Copyright © 2014, Outbrain
audit_login, logstash
input { file { type => "mysql_audit_login" format => "json" sincedb_path => "/var/cache/logstash/.since_audit_login_log" sincedb_write_interval => 1 path => [ "/path/to/audit_login.log" ] }}
filter { grep { type => "mysql_audit_login" match => [ "user", "monitoring_user" ] negate => true } grep { type => "mysql_audit_login" match => [ "user", "heartbeat_user" ] negate => true }
MySQL DevOps @ OutbrainShlomi Noach
Percona Live 2014Copyright © 2014, Outbrain
audit_login, logstash
output { rabbitmq { host => "my.rmq.host" user => "logstash_user" password => "logstash_password" exchange => "logstash.out" exchange_type => "fanout" type => "mysql_audit_login" }}
output { tcp { type => "mysql_audit_login" mode => "client" host => "my.logstash.aggregator" port => "23307" message_format => "%{timestamp},%{type},%{myhost},%{thread},%{user},%{priv_user},%{host},%{ip}" } }
MySQL DevOps @ OutbrainShlomi Noach
Percona Live 2014Copyright © 2014, Outbrain
audit_login Kibana @ Outbrain
user:webapp AND myhost:east1 AND type:failed_login
MySQL DevOps @ OutbrainShlomi Noach
Percona Live 2014Copyright © 2014, Outbrain
Actionability
● Can developers actually have controlled/automated actions on the database?
● Such that everyone, including DBA/Ops, have visibility into?
● Solving the above gives developers greater ownership over their domain, even within the database server.
MySQL DevOps @ OutbrainShlomi Noach
Percona Live 2014Copyright © 2014, Outbrain
Schema & data deployments
● Who controls the database schema design?○ Ops? Is schema design within their domain?○ DBA? Expert about schema design, but is the DBA an
expert about the business domain?○ Developers? Do they understand indexing?
● With many dozens of engineers, we can't have the DBA be the single mutex for any schema change.
● But the DBA must know what's going on.
MySQL DevOps @ OutbrainShlomi Noach
Percona Live 2014Copyright © 2014, Outbrain
HiveMeta
HiveMeta
MySQL & Hive servers @ Outbrain
Slave
MySQL
Slave
Slave
DWH
Slave
Slave
Meta
Slave Hive
Hive Hive
Hive
Hive
Hive Hive
Hive
Hive Hive
Hive
Hive
Hive Hive
MySQLbuildserver
MySQLbuildserver
MySQLunittests
MySQLdev/sim
MySQLdev/sim
MySQLdev/sim
MySQLdev/sim
MySQLbuildserver
MySQLbuildserver
MySQL DevOps @ OutbrainShlomi Noach
Percona Live 2014Copyright © 2014, Outbrain
Database servers @ Outbrain
● Multiple servers● Multiple roles (OLTP, OLAP, Meta, Hive, others)● Multiple environments (dev, QA, Build, Production)● Multiple types (MySQL, Hive)● Multiple engineers who want to deploy to them all. How
and where does a developer issue a CREATE TABLE?
MySQL DevOps @ OutbrainShlomi Noach
Percona Live 2014Copyright © 2014, Outbrain
Where & how to CREATE TABLE?
● Not on all servers, since the table is irrelevant to some (e.g. relevant to OLTP, not to DWH). Who keeps track?
● Shall the developer work them out one by one? Maybe a shell script?
○ Does the developer know all the credentials on all the servers?
● What if some deployment goes wrong? (Table already there; server cannot be accessed)
○ Who keeps track and retries/fixes?○ Do you know who did what, when & where?
MySQL DevOps @ OutbrainShlomi Noach
Percona Live 2014Copyright © 2014, Outbrain
Are all databases equal?
● We use different schema names on our test servers than we do on production
○ Who keeps record?● We have services which use multiple schemas, all with
exact same structure. Changes must apply on all schemas.○ We've just multiplied the number of deployments for
our CREATE TABLE statement.● Different ports, different credentials, different
FEDERATED/CONNECT targets...
MySQL DevOps @ OutbrainShlomi Noach
Percona Live 2014Copyright © 2014, Outbrain
Existing schema deployment tools
● Some excellent open source solutions. Notable are Liquibase & flywaydb
● However we found them to be unsuitable to our needs○ Both linear○ Multitenancy not easy to achieve○ Mathematically sound, but reality isn't mathematically
sound.○ Require a lot of management to achieve visibility and
ownership● Some Windows-Desktop apps around. Ahem.
MySQL DevOps @ OutbrainShlomi Noach
Percona Live 2014Copyright © 2014, Outbrain
Propagator
● Eventually we developed our own, "multi-everything" solution
● Propagator provides ownership, action-ability and visibility
● Developers specify what they want to execute, and for which database role
● Propagator infers the hosts, the schema/query transformations and awaits your approval.
MySQL DevOps @ OutbrainShlomi Noach
Percona Live 2014Copyright © 2014, Outbrain
Propagator: submit a script for deployment
MySQL DevOps @ OutbrainShlomi Noach
Percona Live 2014Copyright © 2014, Outbrain
Propagator action-ability, visibility & ownership
● Deployments are fully audited. Any failure is accounted for.
● Propagator tells you who did what, when and on which host. Also encourages "why".
● Engineers do most of the work with no intervention by DBA or ops
● DBA has control over deployments. Can retry, restart, selectively skip or issue partial queries...
MySQL DevOps @ OutbrainShlomi Noach
Percona Live 2014Copyright © 2014, Outbrain
Propagator: history visibility
MySQL DevOps @ OutbrainShlomi Noach
Percona Live 2014Copyright © 2014, Outbrain
Propagator: visibility & ownership
● The DBA may review deployments history● Has immediate feedback on anything that went wrong● Can most of the time figure out by herself why that went
wrong and rerun the deployment● Otherwise knows who to contact● Commenting and tagging enhance visibility● Typical scenario: developer is new, unsure what went
wrong (this can be considered as a bug, actually)● Next typical scenario: developer is experienced.
Everything works.
MySQL DevOps @ OutbrainShlomi Noach
Percona Live 2014Copyright © 2014, Outbrain
Propagator: still much TODO
● Propagator has been in production at Outbrain for a few months now, and it gets the job done.
● But still TODO:○ More feedback automation○ Email alerts○ Two-phase approval○ Online schema changes integration○ SVN integration○ Maven integration○ Cassandra
MySQL DevOps @ OutbrainShlomi Noach
Percona Live 2014Copyright © 2014, Outbrain
Data retention
● If disk space runs out, who gets the alert?○ Ops? Sure, they can add some disk space (volume
group free space; spare disks on shelf). But only to up to some point.
● Time for data retention. Ideally, we would store data forever. Reality is not ideal.
○ Who is the owner of retention? If I want to drop a partition, who do I approve this with?
○ Can this be more visible?● Are you doing data retention via shell/Perl scripts? Are
these tested, audited, controlled?
MySQL DevOps @ OutbrainShlomi Noach
Percona Live 2014Copyright © 2014, Outbrain
Data retention automation: Gardien
● An Outbrain internal service automating data retention○ Currently works on Hive/HDFS; MySQL in the works
● Every partitioned table is owned by a person or group● Gardien knows the business demands:
○ Rolls new partitions, knows partition scope○ Drops old partitions, has retention policy
● Has a web interface, controlled by the business/engineers● Provides visibility to all, actionability to owners
MySQL DevOps @ OutbrainShlomi Noach
Percona Live 2014Copyright © 2014, Outbrain
Gardien dashboard
● Create rules (partitions), edit, remove● Visible and audited
MySQL DevOps @ OutbrainShlomi Noach
Percona Live 2014Copyright © 2014, Outbrain
It's night time
Beep
MySQL DevOps @ OutbrainShlomi Noach
Percona Live 2014Copyright © 2014, Outbrain
Being 'nice' to the databaseWork in progress
● Are you happy now that you've made your engineers all-powerful?
● Can you sleep well at night?● No, really. What haunts your dreams?● Darn. It's PagerDuty alert. Beep● Apparently all the slaves are lagging.
● An engineer someone issued too many INSERTs
MySQL DevOps @ OutbrainShlomi Noach
Percona Live 2014Copyright © 2014, Outbrain
Being 'nice' to the databaseWork in progress
● How do you protect your database against malfunctioning/abusing services?
● How do you define/detect/respond to an event where your master is flooded with DMLs, and slaves just can't keep up?
MySQL DevOps @ OutbrainShlomi Noach
Percona Live 2014Copyright © 2014, Outbrain
What's your slaves' serving capacity?Per DC? Per service?
MySQLMaster
SlaveLaggingSlave Slave
LaggingSlave
LaggingSlave
LaggingSlave
MySQL DevOps @ OutbrainShlomi Noach
Percona Live 2014Copyright © 2014, Outbrain
Visibility: serving capacity
● We measure current serving capacity and make this value visible
● Not only to graphite/alerts. Also visible to any of our services.
● Our services can be nice to the database by self-throttling access or postponing tasks.
MySQL DevOps @ OutbrainShlomi Noach
Percona Live 2014Copyright © 2014, Outbrain
Slave
Visibility: serving capacity,Flow
Slave
LaggingSlave
Slave
LaggingSlave
Outbrainservice
Zookeeper
Zookeeper
Zookeeper
Slave
Availabilitydetectorservice
Readsstatus
Writessummarystatus
Consultsstatus,connects to DB
MySQL DevOps @ OutbrainShlomi Noach
Percona Live 2014Copyright © 2014, Outbrain
Forcing services to be 'nice',Work in Progress
● A connection pool proxy● Proxy consults availability status● Throttles connections based on availability
Outbrainservice
Zookeeper
Zookeeper
Zookeeper
MySQLcluster
Consults status,approves/throttlesconnection
Proxy
Attempts to get a connection
MySQL DevOps @ OutbrainShlomi Noach
Percona Live 2014Copyright © 2014, Outbrain
Contributions
● We love open source and heavily rely on open source solutions.
● We try to contribute back in form of patches, bug reports and subscribing for commercial support for open source projects.
● Some code we have open sourced:○ Onering: https://github.com/outbrain/onering○ Graphitus: https://github.com/ezbz/graphitus○ Propagator: https://github.com/outbrain/propagator○ audit_login: https://github.com/outbrain/audit_login
MySQL DevOps @ OutbrainShlomi Noach
Percona Live 2014Copyright © 2014, Outbrain
Thank you!
Questions?