Scaling up with Aerospike!

Post on 13-Jul-2015


Scaling up (and easing) operations at 1 Million TPS @ <1 ms latency.

LSPE, Jun 14, 2014

Agenda of this talk

● Some types of Big Data
● What are the problems that come with scale?
● What is the solution? (Or: how Aerospike tackles these problems.)

● Anshu Prateek
● Aerospike Devops Lead
● Ex - Yahoo! Search Operations
● http://about.me/anshuprateek
● anshu@aerospike.com

Big Data Type

● Volume – Hadoop – PB / Hrs of jobs
● Variety – ETL – Many data sources, mashup, analyze
● Velocity – Do it fast, do it now!

→ Volume and Variety need Velocity to be useful.

What starts failing at scale?

● Machines / hardware
● Network
● Unplanned load
● Operator error

Velocity in Aerospike

● Latency
– Page SLA 700 ms, Ads SLA 50 ms
– → Data store < 5 ms
– Hybrid DRAM + SSD optimized storage
● Throughput
– Horizontal scalability (linear is desirable)

Prod example:

● 20 Nodes
● 1.6 TB per node
● 50 GB DRAM usage
● 14 Billion objects
● 70k TPS (r+w) per node peak
● 98% of queries < 1 ms
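
The per-node figures in the production example can be rolled up to cluster totals with simple arithmetic. The Python sketch below is only an illustration of that arithmetic, not anything shown in the talk. The function name `cluster_totals` is hypothetical, and the 64 bytes of index memory per object is an assumption about Aerospike's in-memory primary index; the slide does not say whether the 50 GB DRAM figure is per node or whether the 14 billion objects include replicas.

```python
# Figures taken from the production example slide.
NODES = 20
STORAGE_PER_NODE_GB = 1600          # 1.6 TB per node
PEAK_TPS_PER_NODE = 70_000          # reads + writes
TOTAL_OBJECTS = 14_000_000_000

# Assumption (not on the slide): index memory used per object, in bytes.
INDEX_BYTES_PER_OBJECT = 64


def cluster_totals(nodes, storage_per_node_gb, tps_per_node, total_objects,
                   index_bytes_per_object):
    """Roll per-node figures up to cluster level, assuming an even spread."""
    objects_per_node = total_objects // nodes
    return {
        "peak_cluster_tps": nodes * tps_per_node,
        "cluster_storage_gb": nodes * storage_per_node_gb,
        "objects_per_node": objects_per_node,
        # Estimated index memory per node, in decimal GB.
        "index_gb_per_node": objects_per_node * index_bytes_per_object / 1e9,
    }


if __name__ == "__main__":
    totals = cluster_totals(NODES, STORAGE_PER_NODE_GB, PEAK_TPS_PER_NODE,
                            TOTAL_OBJECTS, INDEX_BYTES_PER_OBJECT)
    for name, value in totals.items():
        print(f"{name}: {value}")
```

Under these assumptions the 20 nodes together peak at 1.4 million TPS, which matches the "1 Million TPS" in the talk title, and the estimated index memory per node comes out slightly below the 50 GB DRAM figure.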

Yet another prod graph...

Start scaling with Aerospike..

● Machines / hardware
– Replication / auto-balancing
● Network
– Availability of islands
– Auto balancing with eventual consistency
● Unplanned load
– Have a lot of headroom
● Operator error
– What if the system reduces operational needs?
– Tools
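
To put a number on "headroom": when a node fails and the cluster rebalances, the surviving nodes have to absorb its share of the traffic. The Python sketch below estimates that extra load. It assumes traffic is spread evenly across nodes both before and after the failure, which is a simplification, and the function names are hypothetical.

```python
def load_increase_after_failures(nodes, failed):
    """Fractional increase in per-node load when `failed` nodes drop out.

    Assumes the same total traffic is spread evenly over the survivors.
    """
    if not 0 <= failed < nodes:
        raise ValueError("failed must be between 0 and nodes - 1")
    return nodes / (nodes - failed) - 1.0


def per_node_tps_after_failures(cluster_tps, nodes, failed):
    """Per-node TPS once the surviving nodes share the full cluster load."""
    if not 0 <= failed < nodes:
        raise ValueError("failed must be between 0 and nodes - 1")
    return cluster_tps / (nodes - failed)


if __name__ == "__main__":
    # Figures from the production example: 20 nodes at 70k TPS each at peak.
    nodes, cluster_tps = 20, 20 * 70_000
    for failed in (1, 2, 5):
        extra = load_increase_after_failures(nodes, failed)
        tps = per_node_tps_after_failures(cluster_tps, nodes, failed)
        print(f"{failed} node(s) down: +{extra:.1%} load, {tps:,.0f} TPS per node")
```

The fewer nodes a cluster has, the larger the jump each survivor sees, so small clusters need proportionally more headroom per node.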

Operational Ease

● Reducing initial setup time
– Auto sharding
– Auto cluster discovery
● Configuration
– People don't read documents
    ● RTFM!
– Good default values
– Retain the power to control when needed
    ● Static configs
    ● Dynamic configs

Tools

● Do all nodes have the same config?
– asmonitor -e 'compareconfig'
● What's the cluster status?
– asmonitor -e 'info'
● Oops, this needs to be changed!
– asinfo -v 'set-config:context=service;letschangethis=value'
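
The idea behind a config comparison can be sketched in a few lines of Python. This is not how asmonitor is implemented; it is a minimal illustration that assumes each node's configuration is available as a semicolon-separated `key=value` string. All function names and the sample parameters `param-a` and `param-b` are hypothetical, and `letschangethis` is the slide's own placeholder rather than a real parameter.

```python
from collections import defaultdict


def parse_config(raw):
    """Parse a 'k1=v1;k2=v2' string into a dict."""
    items = (item.split("=", 1) for item in raw.strip().split(";") if "=" in item)
    return {key: value for key, value in items}


def compare_configs(node_configs):
    """Return {param: {node: value}} for params that differ across nodes.

    A parameter missing on some node also counts as a difference.
    """
    seen = defaultdict(dict)
    for node, config in node_configs.items():
        for key, value in config.items():
            seen[key][node] = value
    diffs = {}
    for key, per_node in seen.items():
        missing_somewhere = len(per_node) != len(node_configs)
        if missing_somewhere or len(set(per_node.values())) > 1:
            diffs[key] = per_node
    return diffs


def set_config_command(context, param, value):
    """Build the info command string used on the slide for a dynamic change."""
    return f"set-config:context={context};{param}={value}"


if __name__ == "__main__":
    # Hypothetical per-node config strings.
    raw = {
        "node1": "param-a=10;param-b=true",
        "node2": "param-a=20;param-b=true",
    }
    configs = {node: parse_config(text) for node, text in raw.items()}
    for param, values in compare_configs(configs).items():
        print(param, values)
    print(set_config_command("service", "letschangethis", "value"))
```

Only parameters that disagree are reported, so an operator sees just the drift and can then push a single dynamic change to bring the odd node back in line.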

Tools

● Nagios
● Graphite
● AMC

Capacity Planning

Managing with AMC

Headroom!

● How many TPS can we do ?

● 330 GCE
● 300 x 1TB
● Debian, Cassandra 2.2
● Median Latency – 10.3 ms
● 95% < 23 ms

Aerospike