Life After Sharding: Monitoring and Management of a Complex Data Cloud

Life After Sharding: Managing a Complex Data Cloud

Boris Livshutz, AppDynamics

Why are you here?

• You already shard, plan to shard, or need to shard your data

• You’re considering a NoSQL solution for production

Copyright © AppDynamics. All rights reserved.

About AppDynamics

• Distributed application monitoring for enterprise applications

• The data layer is part of any enterprise app, so we monitor it too

• We collect massive amounts of metrics from our customers and store it all in MySQL


About Me

Copyright © 2010 AppDynamics. All rights reserved.

• Two decades of experience building DB kernels, OLAP, and server-side software

• 4 years at AppDynamics scaling our server and helping our largest customers

What is a Data Cloud?

• Distinct set of data distributed across multiple nodes

• Multiple nodes work together to manage data

• Common examples:

• Sharded RDBMS

• NoSQL

• Data nodes can be part of a rented cloud or on-premise


Before: The Monolithic DB

• Monitoring Tools

• Cacti, Nagios, MySQL Enterprise, Enterprise Manager, Foglight

• Both open source and commercial systems

• Alerting: Emails to NOC and DBAs, regarding one database in trouble

• Management

• Query one database: SQL shell, Toad, etc.

• Backup: Hot backup tools for each database

• Schema upgrades: Connect to one database and run upgrade script


Why We Need a Data Cloud

• The limits of vertical scale

• One Dell box: 256 GB RAM, 32 cores, 36 disks in RAID 60

• MySQL wasn’t able to use more than 12-16 cores

• 8 TB of data is hard to back up or copy

• ALTER TABLE is almost impossible on the largest tables

• No more growth option, no 256 core CPU!

• Hardware very expensive ($50K), cannot duplicate in test env

• Replication cannot keep up

• Advantages to horizontal scale

• Commodity hardware, easy to buy and expand

• $4K per box: 8 cores, 48 GB RAM, 5 disks

• MySQL is able to fully leverage the hardware, easier to tune


Choosing a Data Cloud

• Shard existing RDBMS

• Change application logic to be shard-aware (lots of code changes!)

• Use a proxy (Scalebase, DbShards, Spock, HiveDB)

• NoSQL

• You are brave!

• Give up on ACID, decades of stability, etc.

• Gain failover, auto-resharding, etc. out of the box
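To make "shard-aware application logic" concrete, here is a minimal sketch of the kind of routing code an application would need if it sharded without a proxy. The shard count of 7 mirrors the replica-set count mentioned later in this talk; the `shard_for` name, the customer-key format, and the hash-mod scheme are assumptions for illustration, not the method used by AppDynamics or by any of the proxies listed above.

```python
import hashlib

NUM_SHARDS = 7  # assumption: one shard per replica set


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Return a stable shard index for a key (hypothetical hash-mod scheme)."""
    # md5 is used only as a stable, well-spread hash, not for security.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


if __name__ == "__main__":
    # Hypothetical customer keys; each always maps to the same shard.
    for customer in ("acme", "globex", "initech"):
        print(customer, "-> shard", shard_for(customer))
```

Every query path has to call something like `shard_for` before it picks a connection, which is where the "lots of code changes" come from. Plain hash-mod routing also remaps most keys when the shard count changes, so resharding needs separate planning.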


Dev Complete - Now What ??

• Can you just throw it over the wall to Ops?

• Almost no off-the-shelf tools exist to monitor and manage the data cloud

• DIY: the only choice is to do it yourself. Sorry.


What did we do?

• We had one MySQL that kept growing and growing

• Sharded MySQL into 7 replica sets, 2 replicas each

• We couldn’t release it until Ops was ready to keep it up 24x7

• Built our own “glue” to manage and monitor this beast.

• We ate our own dog food

• We partnered and didn’t re-invent the wheel.


Managing the Data Cloud

• ScaleBase

• Central point of management for data cloud

• The only source of truth: keeps track of each replica, location, naming, heartbeat, load
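The "source of truth" idea can be sketched as a small registry that records each replica's name, replica set, location, last heartbeat, and load. This is only an illustration of the data being tracked; the class names, field names, addresses, and the 30-second timeout are assumptions and do not describe ScaleBase's internals.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ReplicaInfo:
    name: str              # hypothetical naming scheme, e.g. "rs1-a"
    replica_set: str
    host: str              # location of the replica
    last_heartbeat: float  # seconds, on whatever clock the caller uses
    load: float            # any load figure, e.g. active connections


class ReplicaRegistry:
    """Single place that knows where every replica is and whether it is alive."""

    def __init__(self, heartbeat_timeout: float = 30.0) -> None:
        self.heartbeat_timeout = heartbeat_timeout
        self._replicas: Dict[str, ReplicaInfo] = {}

    def register(self, info: ReplicaInfo) -> None:
        self._replicas[info.name] = info

    def heartbeat(self, name: str, now: float, load: float) -> None:
        replica = self._replicas[name]
        replica.last_heartbeat = now
        replica.load = load

    def locate(self, name: str) -> str:
        """Resolve a replica name to its host, so callers never hard-code it."""
        return self._replicas[name].host

    def stale(self, now: float) -> List[str]:
        """Names of replicas whose last heartbeat is older than the timeout."""
        return sorted(
            name
            for name, replica in self._replicas.items()
            if now - replica.last_heartbeat > self.heartbeat_timeout
        )
```

Passing `now` in explicitly keeps the sketch deterministic; a real deployment would read a clock instead.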


Instant access to data in the Data Cloud

• Access DB data through the ScaleBase LoadBalancer

• Can set a mode to send both queries and DML to all replicas, to a subset, or to just one

• We send SQL to a specific replica without knowing its location

• The only location we connect to is the ScaleBase LoadBalancer

• Other third-party tools can also connect to the ScaleBase LoadBalancer without knowing about our Data Cloud
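The routing behavior described above (one connection point, with statements sent to all replicas, a subset, or one) can be imitated in a few lines. In this sketch, in-memory SQLite databases stand in for MySQL replicas, and `FanOutRouter`, its `targets` parameter, and the replica names are hypothetical; it is not the ScaleBase API.

```python
import sqlite3
from typing import Dict, Iterable, List, Optional, Tuple


class FanOutRouter:
    """Toy stand-in for a load balancer that hides replica locations."""

    def __init__(self, replicas: Dict[str, sqlite3.Connection]) -> None:
        self._replicas = replicas

    def execute(
        self, sql: str, targets: Optional[Iterable[str]] = None
    ) -> Dict[str, List[Tuple]]:
        """Run sql on the named replicas, or on all of them if targets is None."""
        names = list(self._replicas) if targets is None else list(targets)
        results: Dict[str, List[Tuple]] = {}
        for name in names:
            conn = self._replicas[name]
            rows = conn.execute(sql).fetchall()  # empty list for DDL/DML
            conn.commit()
            results[name] = rows
        return results


if __name__ == "__main__":
    # Assumption: three replicas, each an in-memory SQLite database.
    replicas = {f"rs{i}-a": sqlite3.connect(":memory:") for i in (1, 2, 3)}
    router = FanOutRouter(replicas)
    router.execute("CREATE TABLE metrics (id INTEGER, value REAL)")            # all
    router.execute("INSERT INTO metrics VALUES (1, 0.5)", targets=["rs2-a"])  # one
    print(router.execute("SELECT COUNT(*) FROM metrics"))
```

The caller names a replica but never supplies a host or port, which is the property the slide highlights.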


Measure performance across your data cloud


Measure performance – Replica deep dive


Unified Alerting

• System-wide alerts all come from a single source: ScaleBase

• Alerts go to PagerDuty to reach the right people on duty

• Alerts clearly identify replica set and replica node

• Allows quick resolutions by pinpointing problems in the data cloud

• NOC Response: SQL connection to troubleshoot via Scalebase

• Only need to know the replica and replica set from alert and can immediately investigate with SQL queries

• NOC Response: Use monitoring tool for deep dive investigation into the replica
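As a small illustration of an alert that "clearly identifies replica set and replica node", the sketch below builds a structured alert whose summary line leads with both identifiers. The `ReplicaAlert` fields and the payload layout are assumptions for this example; they are not the PagerDuty or ScaleBase alert format.

```python
from dataclasses import asdict, dataclass
from typing import Dict


@dataclass(frozen=True)
class ReplicaAlert:
    replica_set: str  # e.g. "rs3" (hypothetical name)
    replica: str      # e.g. "rs3-b" (hypothetical name)
    severity: str     # e.g. "critical"
    message: str

    def summary(self) -> str:
        """One line that tells the on-call person exactly where to look."""
        return (
            f"[{self.severity.upper()}] "
            f"{self.replica_set}/{self.replica}: {self.message}"
        )


def to_payload(alert: ReplicaAlert) -> Dict[str, str]:
    """Flatten the alert into a dict that a paging service could receive."""
    payload = asdict(alert)
    payload["summary"] = alert.summary()
    return payload


if __name__ == "__main__":
    alert = ReplicaAlert("rs3", "rs3-b", "critical", "replication lag above threshold")
    print(to_payload(alert)["summary"])
```

With the replica set and replica name in the alert itself, the NOC can go straight to a SQL session or a monitoring deep dive for that node.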


Synchronized maintenance tasks

• Backups

• Synchronized

• Backup is just a “job” in the ScaleBase engine; ScaleBase runs it on every replica

• ScaleBase tracks the status of each job execution on each replica

• Schema upgrades: the upgrade program doesn't need to know where things are in the data cloud

• The upgrader just connects to ScaleBase, and the upgrade SQL is sent to the whole data cloud automatically

• Configuration Changes

• Global changes can be made in SQL by connecting to ScaleBase and executing the same change on ALL replicas

• One SQL statement can be sent to all replicas by ScaleBase; any errors are logged
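The pattern of running one job or one SQL statement on every replica, tracking status per replica, and logging errors can be sketched as follows. In-memory SQLite databases again stand in for MySQL replicas, and `run_on_all`, the logger name, and the status strings are hypothetical; this is not how ScaleBase implements its job engine.

```python
import logging
import sqlite3
from typing import Dict

log = logging.getLogger("datacloud.jobs")  # hypothetical logger name


def run_on_all(replicas: Dict[str, sqlite3.Connection], sql: str) -> Dict[str, str]:
    """Run sql on every replica and return a per-replica status map."""
    status: Dict[str, str] = {}
    for name, conn in replicas.items():
        try:
            conn.execute(sql)
            conn.commit()
            status[name] = "ok"
        except sqlite3.Error as exc:
            # Keep going: one bad replica should not hide the state of the rest.
            conn.rollback()
            log.error("job failed on %s: %s", name, exc)
            status[name] = f"failed: {exc}"
    return status


if __name__ == "__main__":
    replicas = {f"rs{i}-a": sqlite3.connect(":memory:") for i in (1, 2, 3)}
    for conn in replicas.values():
        conn.execute("CREATE TABLE metrics (id INTEGER)")
    # Simulate a replica that already received the upgrade.
    replicas["rs2-a"].execute("ALTER TABLE metrics ADD COLUMN note TEXT")
    print(run_on_all(replicas, "ALTER TABLE metrics ADD COLUMN note TEXT"))
```

Returning a status per replica, rather than stopping at the first error, is what lets an operator see which nodes still need attention after a schema upgrade or configuration change.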


Conclusions

• Lessons Learned

• Development, Test, and Ops need to work together

• Educate more of the team

• Most problems that arise are operational, not code bugs

• The right vendors really make it easier than doing everything yourself

• Future

• Automate failback with hot spare

• Try new technologies like XtraDB Cluster.


Vendors


Questions?

