Post on 21-Jan-2018
Counting Image Views using Redis Cluster
Seandon Mooy, DevOps Engineer
@erulabs
Or… how I stopped map-reducing and learned to love the stream
3 Billion!
Delay!
Failures!
Failures!
Also… I may not be the best zookeeper
Challenges with HBase
Roughly 5% of all requests through Thrift were failing… So many tunables!
Optimized timeouts, added circuit breakers, etc.
Trickle of working requests during outage means circuit breakers are hard to design…
“HBase down == Imgur down”
Downtime == sadtime :(
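Why a trickle of working requests hurts: a breaker that counts consecutive failures resets on every success, so it may never open. A minimal sketch, not the breaker actually used at Imgur; both classes, the window size and the thresholds are hypothetical, and recovery (half-open state) is left out.

```python
from collections import deque


class ConsecutiveFailureBreaker:
    """Opens after `threshold` failures in a row; any success resets the count."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        self.failures = 0

    def record(self, ok: bool) -> None:
        self.failures = 0 if ok else self.failures + 1

    @property
    def is_open(self) -> bool:
        return self.failures >= self.threshold


class FailureRateBreaker:
    """Opens when the failure rate over the last `window` calls is too high."""

    def __init__(self, window: int, max_failure_rate: float) -> None:
        self.results = deque(maxlen=window)
        self.max_failure_rate = max_failure_rate

    def record(self, ok: bool) -> None:
        self.results.append(ok)

    @property
    def is_open(self) -> bool:
        if len(self.results) < self.results.maxlen:
            return False  # not enough samples to judge yet
        failures = self.results.count(False)
        return failures / len(self.results) > self.max_failure_rate


def simulate_trickle_outage():
    """Feed both breakers an outage where 1 request in 5 still succeeds."""
    outage = [False, False, False, False, True] * 20
    naive = ConsecutiveFailureBreaker(threshold=5)
    rate_based = FailureRateBreaker(window=20, max_failure_rate=0.5)
    naive_ever_open = False
    for ok in outage:
        naive.record(ok)
        rate_based.record(ok)
        naive_ever_open = naive_ever_open or naive.is_open
    return naive_ever_open, rate_based.is_open
```

In this simulation the consecutive-failure breaker never sees five failures in a row, while the rate-based one opens once its window fills. Picking the window and rate is exactly the "so many tunables" problem.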
3 Billion!
Solution?
Redis Cluster!
ViewCount V2 - Real time with less complexity!
Fastly
TCP syslog stream
Ingest service
Parses syslog lines, reports metrics via statsd
Redis 3.2 cluster!
HBase Backfill service
Internet
API service
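A rough sketch of what the ingest step could look like: parse each syslog line, INCR one key per image view, and keep counters that could be reported via statsd. The log line layout, the `views:` key prefix and the `FakeRedis` stand-in are all assumptions for illustration; the talk does not show the real Fastly log format or the service code.

```python
import re
from collections import Counter
from typing import Iterable, Optional

# Hypothetical layout: "<pri>timestamp host tag: GET /<image_id>.<ext> <status>"
LINE_RE = re.compile(
    r"GET /(?P<image_id>[A-Za-z0-9]+)\.(?:jpg|png|gif) (?P<status>\d{3})$"
)


def parse_view(line: str) -> Optional[str]:
    """Return the image id for a successful image request, else None."""
    match = LINE_RE.search(line.strip())
    if match is None or match.group("status") != "200":
        return None
    return match.group("image_id")


class FakeRedis:
    """In-memory stand-in exposing only incr/get, so the sketch runs anywhere."""

    def __init__(self) -> None:
        self.data = {}

    def incr(self, key: str) -> int:
        self.data[key] = self.data.get(key, 0) + 1
        return self.data[key]

    def get(self, key: str) -> Optional[int]:
        return self.data.get(key)


def ingest(lines: Iterable[str], client, key_prefix: str = "views:") -> Counter:
    """INCR one counter per parsed view; return stats suitable for statsd."""
    stats = Counter(parsed=0, skipped=0)
    for line in lines:
        image_id = parse_view(line)
        if image_id is None:
            stats["skipped"] += 1
            continue
        client.incr(key_prefix + image_id)
        stats["parsed"] += 1
    return stats
```

Usage, with made-up lines:

```python
client = FakeRedis()
stats = ingest(
    [
        "<134>2017-05-31T10:00:00Z cache-sjc fastly[1]: GET /abc123.jpg 200",
        "<134>2017-05-31T10:00:01Z cache-sjc fastly[1]: GET /zzz999.png 404",
    ],
    client,
)
```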
ViewCount V2 - Results:
Request latency:
  min: 1ms
  max: 16.9ms
  median: 1.6ms
  p95: 2.6ms
  p99: 4.6ms
Codes:
  200: 10000
20 billion commands!
> 400GB in memory!
Things to be aware of:
1. Redis Cluster shard maps - redirections, etc.
Monitor redirections - gracefully restart workers after shard moves
2. AOF can slow down / fail large “redis-trib.rb” operations.
Make sure to disable before / re-enable after!
3. Not all legacy systems support Redis Cluster, and if they do…
They might not support it well (PHP-FPM)!
4. Over memory capacity behavior?
Previously we would hard-crash - now we’d LRU-evict old 1-view images.
Neither is good, but for us, one is much less painful
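For point 1, a small sketch of what a client's shard map involves: Redis Cluster assigns each key to one of 16384 slots using CRC16 (XMODEM variant), honoring `{hash tags}`, and a node that no longer owns a slot replies with a `MOVED <slot> <host>:<port>` error. The `SlotCache` class and its redirection counter are hypothetical helpers for illustration, not part of any Redis client library.

```python
from typing import Tuple


def crc16_xmodem(data: bytes) -> int:
    """CRC-16/XMODEM (poly 0x1021, init 0), the variant Redis Cluster uses."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def key_slot(key: bytes) -> int:
    """Map a key to its cluster slot, hashing only a non-empty {tag} if present."""
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        if end != -1 and end != start + 1:
            key = key[start + 1:end]
    return crc16_xmodem(key) % 16384


def parse_moved(error: str) -> Tuple[int, str, int]:
    """Parse 'MOVED <slot> <host>:<port>' into (slot, host, port)."""
    kind, slot, address = error.split()
    if kind != "MOVED":
        raise ValueError("not a MOVED redirection: " + error)
    host, _, port = address.rpartition(":")
    return int(slot), host, int(port)


class SlotCache:
    """Hypothetical per-worker shard map that also counts redirections."""

    def __init__(self) -> None:
        self.owners = {}
        self.redirections = 0  # a metric worth exporting and alerting on

    def on_moved(self, error: str) -> None:
        slot, host, port = parse_moved(error)
        self.owners[slot] = (host, port)
        self.redirections += 1
```

A spike in `redirections` after a reshard suggests workers are holding a stale map, which is when a graceful restart (or a map refresh) would kick in.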
ViewCount V3?
Approaching the point of minimal gains for man-hours, but what else might be fun?
1. Moving PHP7 off the NodeJS API and directly to Redis Cluster
Downsides: dealing with shard maps is complex in a stateless / process-per-request environment!
2. Using Redis 3's BITFIELD or HSET to save on key storage costs
Downsides: complicates the system. The current simplicity reduces “hit-by-a-bus” issues - keys are just hashes, values are just counts!
3. Dealing with the nature of TCP streams (TCP is not HTTP!)
One connection to rule them all! - Node’s Cluster module helps, but perhaps Rust or Golang?
Downsides: Vertical scaling is non-obvious on EC2
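For idea 2, one common way to cut per-key overhead is to bucket many counters into one Redis hash: part of the image id becomes the key, the rest becomes the field, and HINCRBY / HGET replace INCR / GET. A minimal sketch, assuming a `views:` prefix and a two-character field, both made up here; `FakeHashStore` is an in-memory stand-in so the example runs without a server.

```python
from typing import Optional, Tuple


def split_key(image_id: str, field_len: int = 2) -> Tuple[str, str]:
    """Split an image id into (hash key, field) so nearby ids share one hash."""
    if len(image_id) <= field_len:
        raise ValueError("image id too short to bucket")
    return "views:" + image_id[:-field_len], image_id[-field_len:]


class FakeHashStore:
    """In-memory stand-in exposing only hincrby/hget, for illustration."""

    def __init__(self) -> None:
        self.hashes = {}

    def hincrby(self, key: str, field: str, amount: int = 1) -> int:
        bucket = self.hashes.setdefault(key, {})
        bucket[field] = bucket.get(field, 0) + amount
        return bucket[field]

    def hget(self, key: str, field: str) -> Optional[int]:
        return self.hashes.get(key, {}).get(field)


def record_view(store, image_id: str) -> int:
    """Increment the view count for one image inside its bucket hash."""
    key, field = split_key(image_id)
    return store.hincrby(key, field)


def view_count(store, image_id: str) -> int:
    """Read the view count back; missing images count as zero."""
    key, field = split_key(image_id)
    return store.hget(key, field) or 0
```

The extra `split_key` step is the complexity the downside refers to: a count is no longer readable with a bare GET on the image's key.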
ViewCount V2 - Results:
Redis is:
Faster - Imgur response time decreased ~50ms
Cheaper - EC2 cost reduced by 75%
Simpler - No Java, no MR, no ZK, no third parties, just INCR + GET!
More fun! - I got to talk at RedisConf17!
Acknowledgment
Imgur DevOps Team
Imgur Platform Team