MySQL for Large Scale Social Games

Yoshinori Matsunobu

Principal Infrastructure Architect, Oracle ACE Director at DeNA

Former APAC Lead MySQL Consultant at MySQL/Sun/Oracle

Yoshinori.Matsunobu@gmail.com, Twitter: @matsunobu

http://yoshinorimatsunobu.blogspot.com/

Table of contents

- Easier maintenance and automating failover
  - Non-stop master migration
  - Automated master failover
  - New Open Source Software: “MySQL MHA”
- Optimizing MySQL for faster H/W

Company Introduction: DeNA

- One of the largest social game providers in Japan
  - Provides both a social game platform and the social games themselves
- Subsidiary ngmoco:) in San Francisco
- Games for Japan-localized phones, smartphones, and PCs
- 2-3 billion page views per day
- 25+ million users
- 1000+ MySQL servers, 150+ {master, slaves} pairs
- $1.3B revenue in 2010

Games are expanding / shrinking rapidly

- It is very difficult to predict social game workloads
  - Traffic is sometimes unexpectedly high, sometimes much lower than expected
  - Traffic for each social game tends to go down after months or years
- For expanding games
  - Adding slaves
  - Adding more shards
    – It is possible to add shards without stopping services
  - Scaling up the master’s H/W
    – More RAM, HDD -> SSD/PCI-E SSD, faster NW, etc.
- For shrinking games
  - Decreasing slaves
  - Migrating the master to a lower-spec machine
  - Consolidating a few masters/slaves onto a single machine

Desire for Easier Operations

- We want to move master servers more easily
  - Scaling up: increasing RAM, replacing storage with faster SSD
  - Upgrading MySQL: results in 10 minutes or more of downtime to refill the buffer pool
  - Scaling down: moving unpopular games to lower-spec servers
  - Working around power outages: moving games to a remote datacenter
- If you can allocate maintenance downtime, it is easy, but we cannot do so very often
  - Announcing to users, coordinating with customer support, etc.
  - Longer downtime reduces revenue
  - Operations staff will be exhausted by too much midnight work
- Reducing maintenance time is important when managing hundreds or thousands of MySQL servers

Switching master in seconds

- If we can switch a master in less than 3 seconds, it is acceptable in most of our cases
  - Stopping updates on the master
  - Waiting until at least one of the slaves (the new master) has synced with the current master
  - Granting writes and allocating the virtual IP (etc.) to the new master
  - All remaining slaves start replication from the new master

Blocking writes on master

- MySQL provides several commands/solutions to block writes, but not all of them are safe
- FLUSH TABLES WITH READ LOCK
  – Clients will wait forever unless timeouts are set on the client side
  – Running transactions will be aborted in the end
    “Updating master1 -> updating master2 -> committing master1 -> getting an error on committing master2” will result in data inconsistency
  – Flushing all tables sometimes takes a very long time
    Run “FLUSH NO_WRITE_TO_BINLOG TABLES” beforehand
- SET GLOBAL read_only = 1
  – Clients get errors immediately
  – Running transactions will be aborted
- Dropping the MySQL user (used by applications)
  – Applications cannot establish new MySQL connections
  – Current sessions are NOT terminated until they disconnect
  – Current sessions do not encounter errors
  – Works with non-persistent connections only

Trade-off between safety and performance

- What we are now doing at DeNA is:
  - Checking that there are no long-running updates
    – An update that takes 100 seconds on the master will also take 100 seconds on slaves
  - Dropping the app user -- downtime starts
  - Waiting for a while (2 seconds maximum) until all active application sessions are disconnected
    – Ignoring replication threads and sessions that have been sleeping for 1 second or more (highly likely daemon programs or unused sessions, which can be killed safely)
    – Not killing active sessions immediately
  - Executing FLUSH TABLES WITH READ LOCK when there are no active sessions or 2 seconds have passed
  - Starting slave promotion -- downtime ends
- At most 1 second is enough to do all of these steps
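The wait-then-lock rule described above can be sketched in a few lines. This is a minimal illustration, not MHA's actual code: it assumes the rows come from SHOW PROCESSLIST as dictionaries with `User`, `Command`, and `Time` keys, and the function names `sessions_to_wait_for` and `ready_to_lock` are hypothetical.

```python
from typing import Iterable, List, Mapping

# Assumption: replication threads show up in SHOW PROCESSLIST with one of
# these Command values, or with the user name "system user".
REPLICATION_COMMANDS = {"Binlog Dump", "Connect"}


def sessions_to_wait_for(processlist: Iterable[Mapping],
                         idle_threshold_sec: int = 1) -> List[Mapping]:
    """Return the sessions that still count as active application sessions."""
    active = []
    for row in processlist:
        if row["Command"] in REPLICATION_COMMANDS or row["User"] == "system user":
            continue  # replication threads are ignored
        if row["Command"] == "Sleep" and row["Time"] >= idle_threshold_sec:
            continue  # sleeping 1s or more: likely a daemon or unused session
        active.append(row)
    return active


def ready_to_lock(processlist: Iterable[Mapping], waited_sec: float,
                  max_wait_sec: float = 2.0) -> bool:
    """True once no active session remains or the 2-second limit has passed."""
    return not sessions_to_wait_for(processlist) or waited_sec >= max_wait_sec
```

A caller would poll the process list after dropping the application user and issue FLUSH TABLES WITH READ LOCK as soon as `ready_to_lock` returns true.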

Our solution

- Developing “MySQL-MHA: Master High Availability manager and tools”
  - http://code.google.com/p/mysql-master-ha
- This is an automated failover tool, but it can also be used for fast online master switch
  - Switching the original master to a new master gracefully
- We have switched 10+ masters so far, each with 0.5-1 second of downtime

Before the switch:

host1 (current master)
 +--host2 (backup)
 +--host3 (slave)
 +--host4 (slave)
 +--host5 (remote)

After the switch:

host2 (new master)
 +--host3 (slave)
 +--host4 (slave)
 +--host5 (remote)

Master Failover: What makes it difficult?

[Figure: a crashed master holding events id=100, id=101, and id=102, replicating to slave1, slave2, and slave3, each of which holds only part of those events. The writer IP moves to the new master.]

MySQL replication is asynchronous. It is likely that some (or all) of the slaves have not received all binary log events from the crashed master. It is also likely that only some slaves have received the latest events.

In the example, id=102 is not replicated to any slave. slave2 is the most up to date of the slaves, while slave1 and slave3 are missing some events that slave2 has received.

It is necessary to do the following:

- Copy id=102 from the master (if possible)
- Apply all differential events; otherwise data inconsistency happens

The recovery steps are:

1. Save binlog events that exist on the master only
2. Identify which events were not sent to each slave
3. Apply the lost events, so that every slave ends with id=102
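The three steps above can be illustrated with plain event ids. This is a toy sketch, not how MHA is implemented: it assumes every slave holds a prefix of the master's ordered event list, and `plan_recovery` and the per-slave contents in the usage lines are hypothetical.

```python
from typing import Dict, List, Tuple


def plan_recovery(master_events: List[int],
                  slave_events: Dict[str, List[int]]) -> Tuple[str, Dict[str, List[int]]]:
    """Return the latest slave and, per slave, the events it still has to apply.

    master_events may be shorter than expected (or empty) if the crashed
    master's binary log cannot be read.
    """
    # The latest slave is the one that received the most events.
    latest = max(slave_events, key=lambda name: len(slave_events[name]))
    latest_ids = slave_events[latest]
    # Step 1: events that exist on the master only.
    master_only = master_events[len(latest_ids):]
    plan = {}
    for name, ids in slave_events.items():
        # Step 2: events this slave lacks but the latest slave has.
        from_latest = latest_ids[len(ids):]
        # Step 3: apply them in order, then the master-only events.
        plan[name] = from_latest + master_only
    return latest, plan


latest, plan = plan_recovery(
    [100, 101, 102],
    {"slave1": [100], "slave2": [100, 101], "slave3": [100]},
)
```

With these assumed inputs, `latest` is `slave2`, which only needs id=102, while slave1 and slave3 need id=101 followed by id=102.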

Current stable HA solutions and issues

- Pacemaker (Heartbeat) + DRBD (or shared disk)
  - Cost: an additional passive master server (not handling any application traffic)
  - Performance: to make HA really work in DRBD replication environments, innodb-flush-log-at-trx-commit and sync-binlog must be 1, but these settings kill write performance
  - Otherwise necessary binlog events might be lost on the master. Then slaves cannot continue replication, and data consistency issues happen
- MySQL Cluster
  - MySQL Cluster is really highly available, but unfortunately we use InnoDB
- Others
  - Unstable, too complex, too hard to operate/administer, wrong or missing documentation
  - Not working with standard MySQL (are you saying we have to migrate all 150+ applications to bleeding-edge distributions?)
  - Not working with a remote datacenter, etc.

Our solution: Developing MySQL-MHA

- MySQL Master High Availability manager and tools
  - http://code.google.com/p/mysql-master-ha
- The Manager pings the master to check its availability
- When it detects a master failure, it promotes one of the slaves to the new master and fixes consistency issues between slaves

[Figure: the Manager monitoring a master that replicates to slave1, slave2, and slave3]

MySQL-MasterHA-Manager

- masterha_manager
- other helper commands

MySQL-MasterHA-Node

- save_binary_logs
- apply_diff_relay_logs
- purge_relay_logs

Internals: steps for recovery

[Figure: the relay logs of Slave(i) and of the latest slave next to the dead master's binary log. Positions are identified by the final Relay_Log_File and Relay_Log_Pos, and by Master_Log_File and Read_Master_Log_Pos.]

- (i1) Partial transaction
- (i2) Differential relay logs from each slave’s read position to the latest slave’s read position
- (X) Differential binary logs from the latest slave’s read position to the tail of the dead master’s binary log

- On slave(i):
  - Wait until the SQL thread executes all events
  - Apply i1 -> i2 -> X
    – On the latest slave, i2 is empty
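As a rough sketch of how the i2 and X ranges could be derived from replication coordinates: the code below compares each slave's (Master_Log_File, Read_Master_Log_Pos) pair to find the latest slave and lists what each slave would apply, in order. The `ReadPos` type, the `recovery_steps` function, and the file names in the usage lines are assumptions for illustration, not MHA's internal API.

```python
from typing import Dict, List, NamedTuple, Tuple


class ReadPos(NamedTuple):
    master_log_file: str      # Master_Log_File in SHOW SLAVE STATUS
    read_master_log_pos: int  # Read_Master_Log_Pos in SHOW SLAVE STATUS


def recovery_steps(slaves: Dict[str, ReadPos]) -> Tuple[str, Dict[str, List[str]]]:
    """Pick the latest slave and list, per slave, what to apply and in which order."""
    # Tuples compare by file name first, then by position. This assumes
    # fixed-width binlog suffixes such as mysql-bin.000003.
    latest_name = max(slaves, key=lambda name: slaves[name])
    latest = slaves[latest_name]
    end = f"{latest.master_log_file}:{latest.read_master_log_pos}"
    steps = {}
    for name, pos in slaves.items():
        plan = ["i1: partial transaction"]
        if pos != latest:  # i2 is empty on the latest slave
            start = f"{pos.master_log_file}:{pos.read_master_log_pos}"
            plan.append(f"i2: relay log diff {start} -> {end}")
        plan.append(f"X: master binlog from {end} to its tail")
        steps[name] = plan
    return latest_name, steps


latest_name, steps = recovery_steps({
    "slave1": ReadPos("mysql-bin.000003", 400),
    "slave2": ReadPos("mysql-bin.000003", 900),
    "slave3": ReadPos("mysql-bin.000002", 1200),
})
```

Here slave2 has read furthest, so its plan contains only i1 and X, while slave1 and slave3 also get an i2 step.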

Advantages of MySQL MHA

- Master failover and slave promotion can be done very quickly
  - Total downtime can be 10-30 seconds
- A master crash does not result in data inconsistency
- No need to modify current MySQL settings
  - We use MHA for 150+ normal MySQL 5.0/5.1/5.5 masters, without modifying anything
- Problems in MHA do not result in MySQL failure
  - You can install/uninstall/upgrade/downgrade/restart MHA without stopping MySQL
- No need to add lots of servers
- No performance penalty
- Works with any storage engine
- Can also be used for failback (fast online master switch)

MySQL MHA Project Info

- Project top page
  - http://code.google.com/p/mysql-master-ha/
- Documentation
  - http://code.google.com/p/mysql-master-ha/wiki/TableOfContents?tm=6
- Source tarball and rpm package (stable release)
  - http://code.google.com/p/mysql-master-ha/downloads/list
- The latest source repository (dev release)
  - https://github.com/yoshinorim/MySQL-MasterHA-Manager (Manager source)
  - https://github.com/yoshinorim/MySQL-MasterHA-Node (per-MySQL-server source)
- SkySQL provides commercial support for MHA

Table of contents

- Easier maintenance and automating failover
  - Non-stop master migration
  - Automated master failover
  - New Open Source Software: “MySQL MHA”
- Optimizing MySQL for faster H/W

Per-server performance is important

- To handle 1 million queries per second:
  - 1,000 queries/sec per server: 1,000 servers in total
  - 10,000 queries/sec per server: 100 servers in total
  - The additional 900 servers will cost $10M initially and $1M every year
- If you can increase per-server throughput, you can reduce the total number of servers, which will decrease TCO
- Sharding is not everything
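The server-count arithmetic above is simple enough to write down. The helper name `servers_needed` is hypothetical; the inputs are the figures quoted in the slide.

```python
def servers_needed(total_qps: int, qps_per_server: int) -> int:
    """Smallest number of servers whose combined throughput covers total_qps."""
    # Integer ceiling division, so a partial server rounds up to a whole one.
    return -(-total_qps // qps_per_server)


slow = servers_needed(1_000_000, 1_000)    # 1,000 queries/sec per server
fast = servers_needed(1_000_000, 10_000)   # 10,000 queries/sec per server
extra_servers = slow - fast                # servers avoided by the faster setup
```

With the slide's numbers, `extra_servers` is the 900 servers whose purchase and yearly running cost the faster configuration avoids.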

History of MySQL performance improvements

- H/W improvements
  - HDD RAID, write cache
  - Large RAM
  - SATA SSD, PCI-Express SSD
  - More CPU cores
  - Faster network
- S/W improvements
  - Improved algorithms (i/o scheduling, swap control, etc.)
  - Much better concurrency
  - Avoiding stalls
  - Improved space efficiency (compression, etc.)

32bit Linux

- Random disk i/o speed (IOPS) on HDD is very slow
  - 100-200/sec per drive
- Databases easily became disk i/o bound, regardless of disk size
- Applications could not handle large data (i.e. 30GB+ per server)
- Lots of database servers were needed
- Per-server traffic was not so high because both the number of users and the data volume per server were not so high
  - Backup and restore completed in a short time
- MyISAM was widely used because it is very space efficient and fast

[Figure: updates spread across three servers, each with 2GB RAM and a 20GB HDD RAID, and each with many slaves]

64bit Linux + large RAM + BBWC

- Memory prices went down, and 64bit Linux matured
- It became common to deploy 16GB or more RAM on a single Linux machine
- Memory hit ratio increased, and much larger data could be stored
- The number of database servers decreased (consolidated)
- Per-server traffic increased (the number of users per server increased)
- “Transaction commit” overheads were greatly reduced thanks to battery-backed write cache (BBWC)
- From the database point of view:
  - InnoDB became faster than MyISAM (row-level locks, etc.)
  - Direct I/O became common

[Figure: a master with 16GB RAM and a 120GB HDD RAID, plus many slaves]

Side effect caused by fast server

- After 16-32GB RAM became common, we could run many more users and much more data per server
  - Write traffic per server also increased
- RAID 5/10 arrays of 4-8 HDDs also became common, which improved concurrency a lot
  - On a 6-HDD RAID 10, single-thread IOPS is around 200, and IOPS from 100 threads is around 1000-2000
- Good parallelism on both reads and writes on the master
- On slaves, there is only one writer thread (the SQL thread), so there is no parallelism on writes
  - A 6-HDD RAID 10 is as slow as a single HDD for writes
- Slaves became the performance bottleneck earlier than the master
- Serious replication delay happened (10+ minutes at peak time)

[Figure: master and slaves on HDD RAID]
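To see why a single SQL thread on the slave leads to growing delay, consider a deliberately simplified model: the master commits writes at a constant rate, the slave applies them at a lower constant rate, and lag is how far behind in time the slave's current event is. Real workloads are burstier than this; the function name and the rates in the usage line are illustrative assumptions loosely based on the IOPS figures quoted here.

```python
def replication_lag_sec(master_writes_per_sec: float,
                        slave_applies_per_sec: float,
                        elapsed_sec: float) -> float:
    """Lag of the slave after elapsed_sec of sustained load, in seconds."""
    if slave_applies_per_sec >= master_writes_per_sec:
        return 0.0  # the slave keeps up
    # Number of events the slave has applied so far.
    applied = slave_applies_per_sec * elapsed_sec
    # Time at which the master originally wrote the event now being applied.
    written_at = applied / master_writes_per_sec
    return elapsed_sec - written_at


# Master sustaining 1000 writes/s from many threads, slave applying 200/s
# from one thread, during a 10-minute peak.
lag = replication_lag_sec(1000, 200, 600)
```

Under these assumptions the slave is 480 seconds behind when the peak ends, which is the same order of magnitude as the delay reported in the slide.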

Using SATA SSD on slaves

- Using SSD on the master was still risky
- Using SSD on slaves (IOPS: 100+ -> 3000+) was more effective than using it on the master (IOPS: 1000+ -> 3000+)
- We mainly deployed SSD on slaves
- The number of slaves could be reduced
- IOPS differences between the master (1000+) and slaves (100+) had caused serious replication delay
- Is there any way to gain high enough IOPS from a single thread?
- From the MySQL point of view:
  - Good concurrency on HDD RAID has been required: InnoDB Plugin
- Read IOPS on SATA SSD is 3000+, which should be enough (15 times better than HDD)
- Just replacing HDD with SSD solved the replication delay
- Overall read throughput became much better

[Figure: slaves on SATA SSD]

How about PCI-Express SSD?

- Deploying on both master and slaves?
  - If PCI-E SSD is used on the master, replication delay will happen again
    – 10,000 IOPS from a single thread, 40,000+ IOPS from 100 threads
  - 10,000 IOPS from 100 threads can be achieved with SATA SSD
  - Parallel SQL threads should be implemented in MySQL
- Deploying on slaves only?
  - If the master uses HDD, SATA SSD should be enough to handle the workload
    – PCI-Express SSD is much more expensive than SATA SSD
- How about running multiple MySQL instances on a single server?
  – Virtualization is not fast
  – Running multiple MySQL instances on a single OS is more reasonable
- Does PCI-E SSD have enough storage capacity to run multiple instances?
  - On HDD environments, typically only 100-200GB of database data can be stored because of slow random IOPS on HDD
  - FusionIO SLC: 320GB Duo + 160GB = 480GB
  - FusionIO MLC: 1280GB Duo + 640GB = 1920GB
  - tachIOn SLC: 800GB x 2 = 1600GB

Running multiple slaves on single box

- Running multiple slaves on a single PCI-E SSD slave server
  - Master and backup server are still HDD based
- Consolidating multiple slaves
- Since the slave’s SQL thread is single threaded, you can gain better concurrency by running multiple instances
- The number of instances is mainly restricted by capacity

[Figure: “Before” and “After” layouts with servers labeled M, B, S1, S2, and S3; after consolidation, multiple slave instances share a single box]

Our environment

- Machine
  - HP DL360G7 (1U), or Dell R610
- PCI-E SSD
  - FusionIO MLC (640GB Duo + 320GB non-Duo)
  - tachIOn SLC (800GB x 2)
- CPU
  - Two sockets, Nehalem 6-core per socket, HT enabled
    – 24 logical CPU cores are visible
    – A four-socket machine is too expensive
- RAM
  - 60GB or more
- Network
  - Broadcom BCM5709, four ports
  - Using four network cables + bonding mode 4 + link aggregation
    – BONDING_OPTS="miimon=100 mode=4 lacp_rate=1 xmit_hash_policy=1"
- HDD
  - 4-8 SAS RAID1+0
  - For backups, redo logs, relay logs, (optionally) doublewrite buffer

Benchmarks on our real workloads

- Consolidating 7 instances on FusionIO (640GB MLC Duo + 320GB MLC)
  - Let half of SELECT queries go to these slaves
  - 6GB innodb_buffer_pool_size
- Peak QPS (total of 7 instances)
  - 61683.7 query/s
  - 37939.1 select/s
  - 7861.1 update/s
  - 1105 insert/s
  - 1843 delete/s
  - 3143.5 begin/s
- CPU utilization
  - %user 27.3%, %sys 11% (%soft 4%), %iowait 4%
  - C.f. SATA SSD: %user 4%, %sys 1%, %iowait 1%
- Buffer pool hit ratio
  - 99.4%
  - SATA SSD (single instance/server): 99.8%
- No replication delay
- No significant (100+ms) response time delay caused by SSD

CPU loads

- CPU utilization was high, but the server should be able to handle more
  - %user 27.3%, %sys 11% (%soft 4%), %iowait 4%
  - Reached the storage capacity limit (960GB). Using 1920GB MLC should be fine to handle more instances
- Network became the first bottleneck
  - Recv: 14.6MB/s, Send: 28.7MB/s
  - CentOS5 + bonding is not good at handling network requests (only a single CPU core can handle requests). These results were taken with normal bond0
  - We are now using link aggregation + bonding mode 4 with 4 network cables, and the CPU bottleneck went away

22:10:57  CPU  %user  %nice  %sys  %iowait  %irq  %soft  %steal  %idle    intr/s
22:11:57  all  27.13   0.00  6.58     4.06  0.14   3.70    0.00  58.40  56589.95
…
22:11:57   23  30.85   0.00  7.43     0.90  1.65  49.78    0.00   9.38  44031.82

Things to consider

- To run multiple MySQL instances on a single server, you need to allocate different IP addresses or port numbers
  - Administration tools are also affected
  - We allocated different (virtual) IP addresses because some existing internal tools depend on “port=3306”
  - bind-address=“virtual ip address” in my.cnf
- Creating separate directories and files
  - Socket files, data directories, InnoDB files, binary log files, etc. should be stored in different locations from each other
- Storing some files on HDD, others on SSD
  - Binary logs, relay logs, redo logs, error/slow logs, ibdata0 (files where the doublewrite buffer is written), and backup files on HDD
  - Others on SSD
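One way to keep per-instance settings consistent is to generate each instance's my.cnf from a template. The sketch below is an assumption about how that could look, not DeNA's tooling: the `/ssd` and `/hdd` mount points, the directory layout, the instance names, and the `render_my_cnf` helper are all hypothetical, while the option names are standard mysqld options.

```python
def render_my_cnf(instance: str, virtual_ip: str,
                  ssd_root: str = "/ssd", hdd_root: str = "/hdd") -> str:
    """Render a [mysqld] section for one instance on a shared server."""
    ssd = f"{ssd_root}/{instance}"  # hypothetical per-instance SSD directory
    hdd = f"{hdd_root}/{instance}"  # hypothetical per-instance HDD directory
    options = {
        # Each instance listens on its own virtual IP, so port 3306 is kept.
        "bind-address": virtual_ip,
        "port": 3306,
        "socket": f"{ssd}/mysql.sock",
        # Table data on SSD.
        "datadir": f"{ssd}/data",
        # Sequentially written files on HDD.
        "log-bin": f"{hdd}/binlog/mysql-bin",
        "relay-log": f"{hdd}/relaylog/relay-bin",
        "innodb_log_group_home_dir": f"{hdd}/redo",
        "innodb_data_home_dir": f"{hdd}/ibdata",
        "log-error": f"{hdd}/log/error.log",
        "slow_query_log_file": f"{hdd}/log/slow.log",
    }
    lines = ["[mysqld]"] + [f"{key} = {value}" for key, value in options.items()]
    return "\n".join(lines) + "\n"


cnf_a = render_my_cnf("game1_slave", "192.0.2.11")
cnf_b = render_my_cnf("game2_slave", "192.0.2.12")
```

Because every path is derived from the instance name, two instances never share a socket file, data directory, or log directory.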

Optimizing for Social Game workloads

- Easily gaining millions of users in a few days
  - Database size grows rapidly
    – Especially if the PK is “user_id + xxx_id” (e.g. item_id)
    – Growing by GBs per day is not uncommon
- Scaling reads is not difficult
  - Adding slaves or adding caching servers
- Scaling writes is not trivial
  - Sharding, scaling up
- Solutions depend on what kinds of tables we’re using, INSERT/UPDATE/DELETE workloads, etc.

INSERT-mostly tables

- History tables such as access logs, diary, battle history
  - INSERT and SELECT mostly
  - Secondary index is needed (user_id, etc.)
  - Table size becomes huge (easily exceeding 1TB)
  - Locality (most SELECTs go to recent data)
- INSERT performance in general
  - Fast in InnoDB (thanks to “Insert Buffering”; much faster than MyISAM)
  - To modify index leaf blocks, they have to be in the buffer pool
  - When the index size becomes too large to fit in the buffer pool, disk reads happen
  - In-memory workloads -> disk-bound workloads
    – Suddenly suffering from serious performance slowdown
    – UPDATE/DELETE/SELECT also get much slower
  - Even faster storage devices cannot compete with in-memory workloads

INSERT gets slower

[Figure: time to insert 1 million records (InnoDB, HDD). X axis: existing records (millions), from 1 to 145. Y axis: seconds. Series: sequential order and random order. Annotations: “Index size exceeded buffer pool size”, 10,000 rows/s, 2,000 rows/s.]

- Secondary index size exceeded the InnoDB buffer pool size at 73 million records for the random order test
- Insertion gradually takes more time because the buffer pool hit ratio gets worse (more random disk reads are needed)
- For sequential order inserts, insertion time did not change: no random reads/writes

INSERT performance difference

- In-memory INSERT throughput
  - 15000+ insert/s from a single thread on recent H/W
- Exceeding the buffer pool, starting disk reads
  - Degrading to 2000-4000 insert/s on HDD, single thread
  - 6000-8000 insert/s on multi-threaded workloads
- Serious replication delay often happens
- Faster storage does not solve everything
  - At most 5000 insert/s on the fastest SSDs such as tachIOn/FusionIO
    – InnoDB actually uses quite a lot of CPU resources for disk i/o bound inserts (e.g. calculating checksums, malloc/free)
- It is important to minimize index size so that INSERTs can complete in memory

Approach to complete INSERT in memory

- Range partition by datetime
  - Available since MySQL 5.1
  - Index size per partition becomes total_index_size / number_of_partitions
  - INT or TIMESTAMP enables hourly partitions
    – TIMESTAMP does not support partition pruning
  - Old partitions can be dropped by ALTER TABLE .. DROP PARTITION

[Figure: a single big physical table (index) split into Partition 1, Partition 2, Partition 3, and Partition 4]
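A small generator for the partition maintenance statements might look like the following. It is a sketch under stated assumptions: the table is range-partitioned on an INT column holding Unix epoch seconds, it has no MAXVALUE partition, and the table name `battle_history`, the `pYYYYMMDDHH` naming scheme, and both helper functions are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def partition_name(hour_start: datetime) -> str:
    """Hypothetical naming scheme: p + year, month, day, hour."""
    return hour_start.strftime("p%Y%m%d%H")


def add_hourly_partition_sql(table: str, hour_start: datetime) -> str:
    """Statement adding the partition that holds rows for one hour."""
    # VALUES LESS THAN is exclusive, so the bound is the start of the next hour.
    upper_bound = int((hour_start + timedelta(hours=1)).timestamp())
    return (f"ALTER TABLE {table} ADD PARTITION "
            f"(PARTITION {partition_name(hour_start)} "
            f"VALUES LESS THAN ({upper_bound}))")


def drop_partition_sql(table: str, hour_start: datetime) -> str:
    """Statement dropping an old hourly partition in one metadata operation."""
    return f"ALTER TABLE {table} DROP PARTITION {partition_name(hour_start)}"


hour = datetime(2011, 10, 1, 0, tzinfo=timezone.utc)
add_sql = add_hourly_partition_sql("battle_history", hour)
drop_sql = drop_partition_sql("battle_history", hour - timedelta(days=30))
```

A scheduled job could run the first statement ahead of time for upcoming hours and the second for hours that have aged out, so the index being inserted into stays small enough to fit in the buffer pool.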

Optimizing UPDATE, DELETE, SELECT

- Using SSD is really, really helpful
  - IOPS difference is significant
    – Updates in memory: 15,000/s
    – On HDD: 300/s
    – On SATA SSD: 1,800/s
    – On PCI-E SSD: 4,000/s
  - We have used SATA SSD with RAID0 on slaves
  - Now we are gradually adding PCI-E SSD (FusionIO and tachIOn), consolidating 6-10 MySQL instances
- If all data fits in memory and traffic is very high, using NoSQL is helpful
  - We use HandlerSocket on the users database (pk: user_id)
    – Database size is less than the InnoDB buffer pool size
  - Check Oracle’s memcached API project. It should be very easy to use

Large-HDD servers and SSD servers

- “History Shard”
  - Putting history data (comments, logs, etc.) here
  - Using range partitioning
  - Large enough HDD with RAID 10
    – 900GB (10K RPM) x 8 or 300GB (15K RPM) x 10 HDD
  - Data size tends to be huge, but doesn’t matter so much
- “Application Shard”
  - Middle-range SSD (including SATA SSD), or PCI-E SSD
  - Data size matters a lot

Our near-future deployments

- By moving history tables out, application data size can be decreased significantly (less than 30%), so PCI-E servers can consolidate many shards
- Workloads on HDD servers are mostly in-memory, so they can also consolidate a good number of shards
- A server crash causes failure of multiple shards
  - Automated failover is important

[Figure: PCI-E or SATA/SAS SSD servers (master and slave/backup) hosting Game1_shard1 through Game1_shard4, Game2_shard1, and Game2_shard2; large HDD servers (master and slave/backup) hosting Game1_history_shard1]

Summary

- Automated master failover and easier master maintenance are important for managing hundreds of master servers
  - Scaling up, scaling down, version upgrades, etc.
  - Using MHA will help a lot
    – Configuring MHA does not require MySQL settings changes
    – Master failover in 10-30 seconds, without a passive server
    – Moving a master can be done with 0.5-2 seconds of downtime
- Optimizing MySQL for faster H/W
  - Deploying history tables (insert-mostly tables, hundreds of GBs) on HDD
  - Deploying application tables on PCI-E SSD
  - Consolidating multiple MySQL instances on a single box
