Post on 08-Jan-2017
1
MySQL for large scale
social games
Yoshinori Matsunobu
Principal Infrastructure Architect, Oracle ACE Director at DeNA
Former APAC Lead MySQL Consultant at MySQL/Sun/Oracle
Yoshinori.Matsunobu@gmail.com, Twitter: @matsunobu
http://yoshinorimatsunobu.blogspot.com/
2
Table of contents
• Easier maintenance and automating failover
  • Non-stop master migration
  • Automated master failover
  • New Open Source Software: “MySQL MHA”
• Optimizing MySQL for faster H/W
3
Company Introduction: DeNA
• One of the largest social game providers in Japan
  • Provides both a social game platform and social games themselves
• Subsidiary ngmoco:) in San Francisco
• Japan-localized phone, smartphone, and PC games
• 2-3 billion page views per day
• 25+ million users
• 1000+ MySQL servers, 150+ {master, slaves} pairs
• $1.3B revenue in 2010
4
Games are expanding / shrinking rapidly
• It is very difficult to predict social game workloads
  • Sometimes traffic is unexpectedly high, sometimes much lower than expected
  • Traffic for each social game tends to go down after months / years
• For expanding games
  • Adding slaves
  • Adding more shards
    – It’s possible to add shards without stopping services
  • Scaling up master’s H/W
    – More RAM, HDD->SSD/PCI-E SSD, faster NW, etc
• For shrinking games
  • Decreasing slaves
  • Migrating master to a lower-spec machine
  • Consolidating a few masters/slaves within a single machine
5
Desire for Easier Operations
• We want to move master servers more easily
  • Scaling-up: Increasing RAM, replacing with faster SSD
  • Upgrading MySQL: Results in 10 minutes or more of downtime to fill the buffer pool
  • Scaling-down: Moving unpopular games to lower spec servers
  • Working around power outages: Moving games to a remote datacenter
• If you can allocate maintenance downtime, it’s easy, but we can’t do so many times
  • Announcing to users, coordinating with customer support, etc
  • Longer downtime reduces revenue
  • Operations staff will be exhausted by too much midnight work
• Reducing maintenance time is important to manage hundreds or thousands of MySQL servers
6
Switching master in seconds
• If we can switch a master in less than 3 seconds, it is acceptable in most of our cases
  • Stopping updates on the master
  • Waiting until at least one of the slaves (new master) has synced with the current master
  • Granting writes, allocating virtual IP (etc) to the new master
  • All the remaining slaves start replication from the new master
7
Blocking writes on master
• MySQL provides several commands/solutions to block writes, but not all of them are safe
• FLUSH TABLES WITH READ LOCK
  – Clients will wait forever, unless timeouts are set on the client side
  – Running transactions will be aborted in the end
    “Updating master1 -> updating master2 -> committing master1 -> getting error on committing master2” will result in data inconsistency
  – Flushing all tables sometimes takes a very long time
    Run “FLUSH NO_WRITE_TO_BINLOG TABLES” beforehand
• SET GLOBAL read_only = 1
  – Clients get errors immediately
  – Running transactions will be aborted
• Dropping MySQL user (used from applications)
  – Applications cannot establish new MySQL connections
  – Current sessions are NOT terminated until disconnect
  – Current sessions do not encounter errors
  – Works with non-persistent connections only
8
Trade-off between safety and performance
• What we are now doing at DeNA is..
  • Checking that there are no long-running updates
    – 100 seconds of updates will take 100 seconds on slaves
  • Dropping app user -- starting downtime
  • Waiting for a while (2 seconds maximum) until all active application sessions are disconnected
    – Ignoring replication threads and sessions sleeping 1 second or more (highly likely daemon programs or unused sessions, which can be killed safely)
    – Not killing active sessions immediately
  • Executing FLUSH TABLES WITH READ LOCK when there are no active sessions or 2 seconds have passed
  • Starting slave promotion -- ending downtime
  • At most 1 second is enough to complete the whole process
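The waiting rule above can be sketched in a few lines. This is only an illustration, not the MHA implementation: the `Session` record, `blocking_sessions`, and `ready_to_lock` are hypothetical names, and the rows are assumed to come from SHOW PROCESSLIST (where slave threads appear under the user "system user" and dump threads show the command "Binlog Dump").

```python
from dataclasses import dataclass


@dataclass
class Session:
    user: str      # "User" column of SHOW PROCESSLIST
    command: str   # "Command" column, e.g. "Query", "Sleep", "Binlog Dump"
    time: int      # "Time" column: seconds spent in the current state


def blocking_sessions(sessions, idle_threshold=1):
    """Return the sessions that are still worth waiting for."""
    blocking = []
    for s in sessions:
        # Replication threads are ignored.
        if s.user == "system user" or s.command == "Binlog Dump":
            continue
        # Sessions sleeping for idle_threshold seconds or more are ignored.
        if s.command == "Sleep" and s.time >= idle_threshold:
            continue
        blocking.append(s)
    return blocking


def ready_to_lock(sessions, waited_seconds, max_wait=2.0):
    """True when FLUSH TABLES WITH READ LOCK may be issued."""
    return not blocking_sessions(sessions) or waited_seconds >= max_wait
```

A caller would poll the process list after dropping the application user and issue the lock as soon as `ready_to_lock` returns True.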
9
Our solution
• Developing “MySQL-MHA: Master High Availability manager and tools”
  • http://code.google.com/p/mysql-master-ha
• This is an automated failover tool, but it can also be used for fast online master switch
  • Switching original master to new master gracefully
• We have switched 10+ masters so far, each with 0.5 – 1 second of downtime
From:
host1 (current master)
+--host2 (backup)
+--host3 (slave)
+--host4 (slave)
+--host5 (remote)
To:
host2 (new master)
+--host3 (slave)
+--host4 (slave)
+--host5 (remote)
10
Master Failover: What makes it difficult?
[Figure: the master (holding the writer IP) has id=99 to id=102; slave1 has id=99 and id=100; slave2 has id=99 to id=101; slave3 has id=99 only]
MySQL replication is asynchronous.
It is likely that some (or all) of the slaves have not received all binary log events from the crashed master.
It is also likely that only some slaves have received the latest events.
In this example, id=102 is not replicated to any slave.
slave2 is the latest among the slaves, but slave1 and slave3 have lost some events.
It is necessary to do the following:
- Copy id=102 from master (if possible)
- Apply all differential events, otherwise data inconsistency happens.
Steps shown in the figure:
1. Save binlog events that exist on master only
2. Identify which events are not sent
3. Apply lost events (id=101 to slave1; id=100 and id=101 to slave3; id=102 to all slaves)
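The example can be restated as a small calculation. This is a toy model that treats events as plain ids; real binary log events are identified by file name and position. `missing_events` and `latest_slave` are hypothetical helper names.

```python
def missing_events(master_events, slaves):
    """For each slave, list the master events it has not received, in order."""
    plan = {}
    for name, received in slaves.items():
        seen = set(received)
        plan[name] = [e for e in master_events if e not in seen]
    return plan


def latest_slave(slaves):
    """The slave that received the most events from the master."""
    return max(slaves, key=lambda name: len(slaves[name]))


# Event ids from the slide's example.
master = [99, 100, 101, 102]
slaves = {"slave1": [99, 100], "slave2": [99, 100, 101], "slave3": [99]}
print(latest_slave(slaves))
print(missing_events(master, slaves))
```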
11
Current stable HA solutions and issues
• Pacemaker(Heartbeat) + DRBD (or shared disk)
  • Cost: Additional passive master server (not handling any application traffic)
  • Performance: To make HA really work on DRBD replication environments, innodb-flush-log-at-trx-commit and sync-binlog must be 1. But these kill write performance
  • Otherwise necessary binlog events might be lost on the master. Then slaves can’t continue replication, and data consistency issues happen
• MySQL Cluster
  • MySQL Cluster is really Highly Available, but unfortunately we use InnoDB
• Others
  • Unstable, too complex, too hard to operate/administer, wrong/no documentation
  • Not working with standard MySQL (are you saying we have to migrate all 150+ applications to bleeding edge distributions?)
  • Not working with remote datacenter, etc
12
Our solution: Developing MySQL-MHA
• MySQL Master High Availability manager and tools
  • http://code.google.com/p/mysql-master-ha
• The Manager pings the master to check its availability
• When it detects a master failure, it promotes one of the slaves to the new master and fixes consistency issues between slaves
[Figure: a Manager monitoring a master with slave1, slave2, and slave3]
MySQL-MasterHA-Manager
- masterha_manager
- other helper commands
MySQL-MasterHA-Node
- save_binary_logs
- apply_diff_relay_logs
- purge_relay_logs
13
Internals: steps for recovery
[Figure: log ranges on the dead master, the latest slave, and slave(i), located by each slave's final Relay_Log_File / Relay_Log_Pos and Master_Log_File / Read_Master_Log_Pos]
(i1) Partial transaction
(i2) Differential relay logs from each slave’s read pos to the latest slave’s read pos
(X) Differential binary logs from the latest slave’s read pos to the dead master’s tail of the binary log
• On slave(i),
  • Wait until the SQL thread executes all events
  • Apply i1 -> i2 -> X
    – On the latest slave, i2 is empty
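The ranges (i2) and (X) can be derived from read positions alone. The sketch below is a simplification under loud assumptions: all positions are byte offsets in a single binary log file, the partial transaction (i1) is left out, and `recovery_plan` plus the sample offsets are hypothetical.

```python
def recovery_plan(read_pos, master_tail):
    """Compute the (start, end) log ranges each slave must apply.

    read_pos:    slave name -> Read_Master_Log_Pos
    master_tail: end position of the dead master's binary log
    """
    latest = max(read_pos.values())
    plan = {}
    for name, pos in read_pos.items():
        plan[name] = {
            # i2: relay log events this slave lacks compared to the latest slave
            "i2": (pos, latest) if pos < latest else None,
            # X: binlog events that never reached any slave
            "X": (latest, master_tail) if latest < master_tail else None,
        }
    return plan


plan = recovery_plan({"slave1": 1000, "slave2": 1500, "slave3": 700}, 1800)
for name, ranges in plan.items():
    print(name, ranges)
```

On the latest slave the "i2" entry is None, which matches the note that i2 is empty there.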
14
Advantages of MySQL MHA
• Master failover and slave promotion can be done very quickly
  • Total downtime can be 10-30 seconds
• Master crash does not result in data inconsistency
• No need to modify current MySQL settings
  • We use MHA for 150+ normal MySQL 5.0/5.1/5.5 masters, without modifying anything
• Problems of MHA do not result in MySQL failure
  • You can install/uninstall/upgrade/downgrade/restart MHA without stopping MySQL
• No need to add lots of servers
• No performance penalty
• Works with any storage engine
• Can also be used for failback (fast online master switch)
15
MySQL MHA Project Info
• Project top page
  • http://code.google.com/p/mysql-master-ha/
• Documentation
  • http://code.google.com/p/mysql-master-ha/wiki/TableOfContents?tm=6
• Source tarball and rpm package (stable release)
  • http://code.google.com/p/mysql-master-ha/downloads/list
• The latest source repository (dev release)
  • https://github.com/yoshinorim/MySQL-MasterHA-Manager (Manager source)
  • https://github.com/yoshinorim/MySQL-MasterHA-Node (Per-MySQL server source)
• SkySQL provides commercial support for MHA
16
Table of contents
• Easier maintenance and automating failover
  • Non-stop master migration
  • Automated master failover
  • New Open Source Software: “MySQL MHA”
• Optimizing MySQL for faster H/W
17
Per-server performance is important
• To handle 1 million queries per second..
  • 1000 queries/sec per server : 1000 servers in total
  • 10000 queries/sec per server : 100 servers in total
  • Additional 900 servers will cost $10M initially, $1M every year
• If you can increase per-server throughput, you can reduce the total number of servers, which will decrease TCO
• Sharding is not everything
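The server counts above follow from a ceiling division. A minimal sketch; `servers_needed` is a hypothetical helper name.

```python
def servers_needed(total_qps, per_server_qps):
    """Smallest number of servers whose combined throughput covers total_qps."""
    return -(-total_qps // per_server_qps)  # integer ceiling division


slow = servers_needed(1_000_000, 1_000)    # 1000 queries/sec per server
fast = servers_needed(1_000_000, 10_000)   # 10000 queries/sec per server
print(slow, fast, slow - fast)
```

The difference between the two results is the 900 additional servers mentioned on the slide.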
18
History of MySQL performance improvements
• H/W improvements
  • HDD RAID, Write Cache
  • Large RAM
  • SATA SSD, PCI-Express SSD
  • More CPU cores
  • Faster Network
• S/W improvements
  • Improved algorithms (i/o scheduling, swap control, etc)
  • Much better concurrency
  • Avoiding stalls
  • Improved space efficiency (compression, etc)
19
32bit Linux
• Random disk i/o speed (IOPS) on HDD is very slow
  • 100-200/sec per drive
• Database easily became disk i/o bound, regardless of disk size
• Applications could not handle large data (e.g. 30GB+ per server)
• Lots of database servers were needed
• Per server traffic was not so high because both the number of users and data volume per server were not so high
  • Backup and restore completed in a short time
• MyISAM was widely used because it’s very space efficient and fast
[Figure: updates going to three masters, each with HDD RAID (20GB) and 2GB RAM, plus many slaves]
20
64bit Linux + large RAM + BBWC
• Memory prices went down, and 64bit Linux matured
• It became common to deploy 16GB or more RAM on a single Linux machine
• Memory hit ratio increased, much larger data could be stored
• The number of database servers decreased (consolidated)
• Per server traffic increased (the number of users per server increased)
• “Transaction commit” overheads were greatly reduced thanks to battery backed up write cache
• From database point of view,
  • InnoDB became faster than MyISAM (row level locks, etc)
  • Direct I/O became common
[Figure: a master with HDD RAID (120GB) and 16GB RAM, plus many slaves]
21
Side effect caused by fast server
Master (HDD RAID)
• After 16-32GB RAM became common, we could handle many more users and much more data per server
  • Write traffic per server also increased
• RAID 5/10 with 4-8 HDDs also became common, which improved concurrency a lot
  • On 6 HDD RAID 10, single thread IOPS is around 200, 100 threads IOPS is around 1000-2000
• Good parallelism on both reads and writes on master
• Serious replication delay happened (10+ minutes at peak time)
Slave (HDD RAID)
• On slaves, there is only one writer thread (SQL thread). No parallelism on writes
  • 6 HDD RAID 10 is as slow as a single HDD for writes
• Slaves became a performance bottleneck earlier than the master
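A back-of-the-envelope model shows how such a throughput gap turns into minutes of delay. It is deliberately crude and rests on assumptions not stated in the slides: the master sustains a constant write rate during the peak, the slave applies at a constant lower rate, and the IOPS figures above are reused as write rates purely for illustration.

```python
def replication_delay_seconds(master_rate, slave_rate, peak_seconds):
    """Delay accumulated on a slave by the end of a peak period.

    master_rate:  writes/sec executed on the master (many threads)
    slave_rate:   writes/sec the single SQL thread can apply
    peak_seconds: duration of the peak
    """
    backlog = max(0, master_rate - slave_rate) * peak_seconds
    return backlog / slave_rate


# Illustrative numbers: 1000 writes/s on the master, 200 writes/s on the slave.
print(replication_delay_seconds(1000, 200, 150))
```

Under these assumptions a 150-second peak already leaves the slave 600 seconds behind, which is in line with the 10+ minute delays mentioned above.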
22
Using SATA SSD on slaves
[Figure: master on HDD RAID, slave on SATA SSD]
• Using SSD on master was still risky
• Using SSD on slaves (IOPS: 100+ -> 3000+) was more effective than using it on master (IOPS: 1000+ -> 3000+)
• We mainly deployed SSD on slaves
• The number of slaves could be reduced
• IOPS differences between master (1000+) and slave (100+) have caused serious replication delay
• Is there any way to gain high enough IOPS from a single thread?
• From MySQL point of view:
  • Good concurrency on HDD RAID has been required: InnoDB Plugin
• Read IOPS on SATA SSD is 3000+, which should be enough (15 times better than HDD)
• Just replacing HDD with SSD solved replication delay
• Overall read throughput became much better
23
How about PCI-Express SSD?
• Deploying on both master and slaves?
  • If PCI-E SSD is used on master, replication delay will happen again
    – 10,000 IOPS from single thread, 40,000+ IOPS from 100 threads
  • 10,000 IOPS from 100 threads can be achieved with SATA SSD
  • Parallel SQL threads should be implemented in MySQL
• Deploying on only slaves?
  • If using HDD on master, SATA SSD should be enough to handle workloads
    – PCI-Express SSD is much more expensive than SATA SSD
  • How about running multiple MySQL instances on a single server?
    – Virtualization is not fast
    – Running multiple MySQL instances on a single OS is more reasonable
• Does PCI-E SSD have enough storage capacity to run multiple instances?
  • On HDD environments, typically only 100-200GB of database data can be stored because of slow random IOPS on HDD
  • FusionIO SLC: 320GB Duo + 160GB = 480GB
  • FusionIO MLC: 1280GB Duo + 640GB = 1920GB
  • tachIOn SLC: 800GB x 2 = 1600GB
24
Running multiple slaves on single box
• Running multiple slaves on a single PCI-E slave
  • Master and Backup Server are still HDD based
  • Consolidating multiple slaves
  • Since slave’s SQL thread is single threaded, you can gain better concurrency by running multiple instances
  • The number of instances is mainly restricted by capacity
[Figure: Before / After. Before: each master (M) has its own backup (B) and slaves S1, S2, S3 on separate servers. After: masters and backups stay on their own servers, while slaves of several masters are consolidated on shared boxes (one box holding the S1 instances, another holding the S2 instances).]
25
Our environment
• Machine
  • HP DL360G7 (1U), or Dell R610
• PCI-E SSD
  • FusionIO MLC (640GB Duo + 320GB non-Duo)
  • tachIOn SLC (800GB x 2)
• CPU
  • Two sockets, Nehalem 6-core per socket, HT enabled
    – 24 logical CPU cores are visible
    – Four socket machine is too expensive
• RAM
  • 60GB or more
• Network
  • Broadcom BCM5709, Four ports
  • Using four network cables + bonding mode 4 + link aggregation
    – BONDING_OPTS="miimon=100 mode=4 lacp_rate=1 xmit_hash_policy=1"
• HDD
  • 4-8 SAS RAID1+0
  • For backups, redo logs, relay logs, (optionally) doublewrite buffer
26
Benchmarks on our real workloads
• Consolidating 7 instances on FusionIO (640GB MLC Duo + 320GB MLC)
  • Let half of SELECT queries go to these slaves
  • 6GB innodb_buffer_pool_size
• Peak QPS (total of 7 instances)
  • 61683.7 query/s
  • 37939.1 select/s
  • 7861.1 update/s
  • 1105 insert/s
  • 1843 delete/s
  • 3143.5 begin/s
• CPU Utilization
  • %user 27.3%, %sys 11% (%soft 4%), %iowait 4%
  • cf. SATA SSD: %user 4%, %sys 1%, %iowait 1%
• Buffer pool hit ratio
  • 99.4%
  • SATA SSD (single instance/server): 99.8%
• No replication delay
• No significant (100+ms) response time delay caused by SSD
27
CPU loads
• CPU utilization was high, but should be able to handle more
  • %user 27.3%, %sys 11% (%soft 4%), %iowait 4%
  • Reached storage capacity limit (960GB). Using 1920GB MLC should be fine to handle more instances
• Network became the first bottleneck
  • Recv: 14.6MB/s, Send: 28.7MB/s
  • CentOS5 + bonding is not good for network request handling (only a single CPU core can handle requests) (I got the result below when I tested with normal bond0)
  • We are now using link aggregation + bonding mode 4 with 4 network cables, and the CPU bottleneck went away
22:10:57  CPU  %user  %nice  %sys  %iowait  %irq  %soft  %steal  %idle    intr/s
22:11:57  all  27.13   0.00  6.58     4.06  0.14   3.70    0.00  58.40  56589.95
…
22:11:57   23  30.85   0.00  7.43     0.90  1.65  49.78    0.00   9.38  44031.82
28
Things to consider
• To run multiple MySQL instances on a single server, you need to allocate different IP addresses or port numbers
  • Administration tools are also affected
  • We allocated different (virtual) IP addresses because some of our existing internal tools depend on “port=3306”
  • bind-address=“virtual ip address” in my.cnf
• Creating separate directories and files
  • Socket files, data directories, InnoDB files, binary log files etc should be stored in a different location for each instance
• Storing some files on HDD, others on SSD
  • Binary logs, relay logs, redo logs, error/slow logs, ibdata0 (files where doublewrite buffer is written), backup files on HDD
  • Others on SSD
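One way to keep per-instance settings apart is to generate one option file per instance. The sketch below is an assumption-laden illustration, not DeNA's actual layout: `instance_config`, the `/ssd` and `/hdd` mount points, the instance names, and the 192.0.2.x addresses are all hypothetical. The option names themselves (bind-address, port, socket, datadir, log-bin, relay-log, innodb_log_group_home_dir, log-error) are standard MySQL options.

```python
def instance_config(name, virtual_ip, ssd_root="/ssd", hdd_root="/hdd"):
    """Build a [mysqld] section for one instance on a shared box."""
    ssd = f"{ssd_root}/{name}"   # random-i/o heavy files
    hdd = f"{hdd_root}/{name}"   # sequentially written logs
    return "\n".join([
        "[mysqld]",
        f"bind-address={virtual_ip}",   # one virtual IP per instance
        "port=3306",                    # same port everywhere, as tools expect
        f"socket={ssd}/mysql.sock",
        f"datadir={ssd}/data",
        f"log-bin={hdd}/binlog/mysql-bin",
        f"relay-log={hdd}/relaylog/relay-bin",
        f"innodb_log_group_home_dir={hdd}/redo",
        f"log-error={hdd}/log/error.log",
    ])


for name, ip in [("game1", "192.0.2.11"), ("game2", "192.0.2.12")]:
    print(instance_config(name, ip))
    print()
```

Only the section header and the port are shared between instances; every path and address differs.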
29
Optimizing for Social Game workloads
• The user base can easily grow by millions of users in a few days
  • Database size grows rapidly
    – Especially if PK is “user_id + xxx_id” (e.g. item_id)
    – Growth of GBs per day is not uncommon
• Scaling reads is not difficult
  • Adding slaves or adding caching servers
• Scaling writes is not trivial
  • Sharding, scaling up
• Solutions depend on what kinds of tables we’re using, INSERT/UPDATE/DELETE workloads, etc
30
INSERT-mostly tables
• History tables such as access logs, diary, battle history
  • INSERT and SELECT mostly
  • Secondary index is needed (user_id, etc)
  • Table size becomes huge (easily exceeding 1TB)
  • Locality (most SELECTs go to recent data)
• INSERT performance in general
  • Fast in InnoDB (Thanks to “Insert Buffering”. Much faster than MyISAM)
  • To modify index leaf blocks, they have to be in the buffer pool
  • When index size becomes too large to fit in the buffer pool, disk reads happen
  • In-memory workloads -> disk-bound workloads
    – Suddenly suffering from serious performance slowdown
    – UPDATE/DELETE/SELECT also getting much slower
  • No storage device, however fast, can compete with in-memory workloads
31
INSERT gets slower
[Figure: Time to insert 1 million records (InnoDB, HDD). X axis: existing records (millions), 1 to 145. Y axis: seconds, 0 to 600. Series: sequential order, random order. Annotations: “Index size exceeded buffer pool size”, 10,000 rows/s, 2,000 rows/s.]
• Secondary index size exceeded InnoDB buffer pool size at 73 million records for the random order test
• Insert time gradually increased because the buffer pool hit ratio got worse (more random disk reads were needed)
• For sequential order inserts, insertion time did not change. No random reads/writes
32
INSERT performance difference
• In-memory INSERT throughput
  • 15000+ insert/s from single thread on recent H/W
• Exceeding buffer pool, starting disk reads
  • Degrading to 2000-4000 insert/s on HDD, single thread
  • 6000-8000 insert/s on multi-threaded workloads
• Serious replication delay often happens
• Faster storage does not solve everything
  • At most 5000 insert/s on fastest SSDs such as tachIOn/FusionIO
    – InnoDB actually uses CPU resources quite a lot for disk i/o bound inserts (e.g. calculating checksums, malloc/free)
• It is important to minimize index size so that INSERT can complete in memory
33
Approach to complete INSERT in memory
• Range partition by datetime
  • Available since MySQL 5.1
  • Index size per partition becomes total_index_size / number_of_partitions
  • INT or TIMESTAMP enables hourly based partitions
    – TIMESTAMP does not support partition pruning
  • Old partitions can be dropped by ALTER TABLE .. DROP PARTITION
[Figure: a single big physical table (index) compared with the same table split into Partition 1 to Partition 4]
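Hourly range partitions on an INT column can be generated rather than typed by hand. The sketch below only builds SQL text and assumes a hypothetical `battle_history` table whose `created_at` column stores Unix epoch seconds as INT; the helper names and the partition naming scheme are illustrative. The partitioning column is included in the primary key because MySQL requires it in every unique key of a partitioned table.

```python
from datetime import datetime, timedelta, timezone


def hourly_partitions(start, hours):
    """One 'PARTITION ... VALUES LESS THAN (...)' clause per hour."""
    clauses = []
    for i in range(hours):
        begin = start + timedelta(hours=i)
        upper = begin + timedelta(hours=1)   # exclusive upper bound
        name = begin.strftime("p%Y%m%d%H")
        clauses.append(
            f"PARTITION {name} VALUES LESS THAN ({int(upper.timestamp())})"
        )
    return clauses


def create_table_sql(table, start, hours):
    parts = ",\n  ".join(hourly_partitions(start, hours))
    return (
        f"CREATE TABLE {table} (\n"
        "  id BIGINT NOT NULL,\n"
        "  user_id INT NOT NULL,\n"
        "  created_at INT NOT NULL,\n"
        "  PRIMARY KEY (id, created_at),\n"
        "  KEY idx_user (user_id)\n"
        ") ENGINE=InnoDB\n"
        f"PARTITION BY RANGE (created_at) (\n  {parts}\n)"
    )


def drop_partition_sql(table, partition):
    return f"ALTER TABLE {table} DROP PARTITION {partition}"


start = datetime(2011, 4, 12, 0, tzinfo=timezone.utc)
print(create_table_sql("battle_history", start, 3))
print(drop_partition_sql("battle_history", "p2011041200"))
```

Each secondary index is then maintained per partition, so only the index of the most recent partitions has to stay in the buffer pool for inserts.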
34
Optimizing UPDATE, DELETE, SELECT
• Using SSD is really, really helpful
  • IOPS difference is significant
    – Updates in memory: 15,000/s
    – On HDD: 300/s
    – On SATA SSD: 1,800/s
    – On PCI-E SSD: 4,000/s
  • We have used SATA SSD with RAID0 on slaves
  • Now we are gradually deploying more PCI-E SSD (FusionIO and tachIOn), consolidating 6-10 MySQL instances
• If all data fits in memory and traffic is very high, using NoSQL is helpful
  • We use HandlerSocket on user’s database (pk: user_id)
    – Database size is less than InnoDB buffer pool size
  • Check Oracle’s memcached API project. Should be very easy to use
35
Large-HDD servers and SSD servers
• “History Shard”
  • Putting history data (comments, logs, etc) here
  • Using range partitioning
  • Large enough HDD with RAID 10
    – 900GB (10K RPM) x 8 or 300GB (15K RPM) x 10 HDD
  • Data size tends to be huge, but doesn’t matter so much
• “Application Shard”
  • Middle range SSD (including SATA SSD), or PCI-E SSD
  • Data size matters a lot
36
Our near-future deployments
• By moving history tables, application data size can be decreased significantly (less than 30%), so PCI-E servers can consolidate shards a lot
• Mostly in-memory workloads on HDD servers, so they can consolidate good numbers of shards
• Server crash causes multiple shards failure
  • Automated failover is important
[Figure: PCI-E or SATA/SAS SSD servers (Master, Slave/Backup) hosting Game1_shard1 to Game1_shard4 and Game2_shard1 to Game2_shard2; large HDD servers (Master, Slave/Backup) hosting Game1_history_shard1 to Game1_history_shard4]
37
Summary
• Automated master failover and easier master maintenance are important to manage hundreds of master servers
  • Scaling up, scaling down, version up, etc
  • Using MHA will help a lot
    – Configuring MHA does not require MySQL settings changes
    – Master failover in 10-30 seconds, without passive server
    – Moving master can be done in 0.5-2 seconds of downtime
• Optimizing MySQL for faster H/W
  • Deploying history tables (insert-mostly tables, hundreds of GBs) on HDD
  • Deploying application tables on PCI-E SSD
  • Consolidating multiple MySQL instances on single box