Streaming Way to Webscale: How We Scale Bitly via Streaming

Post on 14-Jul-2015


Streaming Your Way to Web Scale: Scaling Bitly via Stream-Based Processing
All Things Open
October 23, 2014, 4:15pm

85 slides, 23 images, 21 diagrams

Peter Herndon
peter@bit.ly
@tpherndon

backend developer for Bitly

Scaling Bitly via Stream-Based Processing

or

Streaming Your Way to Web Scale

http://www.mongodb-is-web-scale.com/
(NSFW due to language)

Define streaming — public API surface that puts events into message queues that are consumed by web services

Need to be asynchronous

we use Tornado

and Go

we had an older queue, but replaced it with a newer one.

written in Go

command v. event messages

Notification of an event

allows interested listeners to respond as they require
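The difference between a command message and an event message can be sketched in a few lines. This is a minimal in-process illustration, not Bitly's code: the `subscribe` and `publish` helpers, the message fields, and the example link are all hypothetical.

```python
from collections import defaultdict

# Hypothetical in-process event bus; every name here is illustrative.
listeners = defaultdict(list)

def subscribe(event_type, handler):
    """Register a handler that wants to hear about one type of event."""
    listeners[event_type].append(handler)

def publish(message):
    """Notify every listener registered for this message type; return their results."""
    return [handler(message) for handler in listeners[message["type"]]]

# A command names the action the receiver must take.
command = {"type": "increment_spam_counter", "link": "http://example.com/a"}

# An event only states what happened; each listener decides how to react.
event = {"type": "link_flagged_as_spam", "link": "http://example.com/a"}

subscribe("link_flagged_as_spam", lambda e: "count:" + e["link"])
subscribe("link_flagged_as_spam", lambda e: "archive:" + e["link"])

results = publish(event)
```

The publisher of the event does not know or care that two listeners exist. A third listener can be added later without touching the publisher.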

Lambda (λ) architecture: batch + stream processing
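As a rough sketch of the batch + stream idea, assuming a simple per-link counter (the function names and the event shape are invented for illustration): a batch view is recomputed from the archived events, a stream view is updated per event, and reads merge the two.

```python
from collections import Counter

def batch_view(archived_events):
    """Recomputed periodically from the full, archived event log."""
    return Counter(event["link"] for event in archived_events)

class StreamView:
    """Updated incrementally as each new event arrives."""

    def __init__(self):
        self.counts = Counter()

    def handle(self, event):
        self.counts[event["link"]] += 1

def query(link, batch, stream):
    """Serve reads by merging the batch result with the recent stream counts."""
    return batch[link] + stream.counts[link]

# Events already covered by the last batch run.
archived = [{"link": "a"}, {"link": "a"}, {"link": "b"}]
batch = batch_view(archived)

# Events that arrived after the last batch run.
stream = StreamView()
for new_event in [{"link": "a"}, {"link": "c"}]:
    stream.handle(new_event)
```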

Part Two, NSQ

NSQ

http://nsq.io

Look, Ma, real code!

https://github.com/bitly/nsq https://github.com/bitly/go-nsq https://github.com/bitly/pynsq https://github.com/jehiah/nsqauth-contrib

Servers
• nsqd
• nsqlookupd
• nsqadmin
• pynsqauthd

Parts is parts! (part 1)

Clients
• go-nsq
• pynsq

Parts is parts! (part 2)

Utilities
• nsq_tail
• nsq_to_file
• nsq_to_nsq
• nsq_stat

Parts is parts! (part 3)

two basic building blocks of distributed NSQ

nsqd & nsqlookupd

put into context, look at evolution of a website

Basic Web App

[Diagram: Web App, Database]

Basic web app. In Python: Django, Flask, or Tornado in front of Postgres.

Scaling the Mountain (of Load)

[Diagram: three Web Apps, one Database]

First bottleneck: web layer

Cache Rules Everything Around Me

[Diagram: three Web Apps, Cache, Database]

With the web-layer bottleneck removed, the database is next, so add a caching layer.

You Want Me to Replicate You

[Diagram: three Web Apps, Cache, two Databases]

Works for a while, but DB requests still take too long, so replicate

Shards Here, Shards There

[Diagram: three Web Apps, Cache, six Databases]

…and then shard

It’s Off to Work I Go!

[Diagram: Web App, Cache, Database, Queue, Worker]

But individual requests still take too long because each one does too much work, so add a message queue and a worker. In Python, that is typically Celery.
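The queue-and-worker step can be sketched with the standard library alone. This is not Celery and not NSQ, just an illustration of the shape: the web handler enqueues a message and returns at once, and a worker does the slow part later. `handle_request` and the upper-casing "work" are stand-ins.

```python
import queue
import threading

work_queue = queue.Queue()
results = []

def handle_request(link):
    """Web handler: enqueue the slow work and return immediately."""
    work_queue.put({"link": link})
    return "accepted"

def worker():
    """Worker: pull messages off the queue until a None sentinel arrives."""
    while True:
        message = work_queue.get()
        if message is None:
            break
        # Stand-in for the slow work the web request no longer has to do.
        results.append(message["link"].upper())

thread = threading.Thread(target=worker)
thread.start()

responses = [handle_request(link) for link in ["a", "b", "c"]]

work_queue.put(None)  # tell the worker to stop once the queue drains
thread.join()
```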

Message From a Bottle(.py)

[Diagram: Web App, Cache, Database, Queue, Worker]

Web app sends messages to queue

Working the Event

[Diagram: Web App, Cache, Database, Queue, Worker]

Worker pulls message off queue and processes

Write Here, Write Now

[Diagram: Web App, Cache, Database, Queue, Worker]

Worker writes results to database, file system, etc.

Write Here, Write Now (redux)

[Diagram: Web App, Database, Worker, Queue]

Instead, imagine worker writes results

Sending Out an SMS

[Diagram: Web App, Database, Worker, Queue]

and web app writes event messages to queue local to the web service

Listen, listen, LISTEN

[Diagram: Web App, Database, Worker, two Queues]

but worker is listening to a queue running on another server

Workin’ On a Chain(ed) Gang

[Diagram: three chained services, each with Web App, Database, Worker, Queue]

Look it up!

[Diagram: Web App, Database, Worker, two Queues]

Worker finds queue with topic via nsqlookupd

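    if __name__ == "__main__":
        # Parse command-line options and set up logging.
        tornado.options.parse_command_line()
        logatron_client.setup()

        # Subscribe to the topic on this service's own channel;
        # nsqlookupd is asked which nsqd instances host the topic.
        Reader(
            topic=settings.get('nsqd_output_topic'),
            channel='queuereader_spam_metrics',
            validate_method=validate_message,
            message_handler=count_spam_actions,
            lookupd_http_addresses=settings.get('nsq_lookupd')
        )
        run()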

/<service>/queuereader_<service>.py

How do I find things?

Sending Out an SMS

[Diagram: Web App, Database, Worker, Queue with topic 'spam_api']

First time app writes to a TOPIC in the local nsqd

Sending Out an SMS

[Diagram: Web App, Database, Worker, Queue with topic 'spam_api', nsqlookupd]

nsqd creates the topic and registers it with nsqlookupd
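A hedged sketch of what "the app writes to a topic in the local nsqd" can look like over nsqd's HTTP interface. It assumes nsqd is listening on its default HTTP port (4151) and exposes the `/pub` endpoint. The message body is made up, and the helper only builds the request so that it can be inspected without a running nsqd.

```python
import json
import urllib.parse
import urllib.request

def build_publish_request(topic, message, nsqd_http="http://127.0.0.1:4151"):
    """Build (but do not send) an HTTP POST that publishes one message to a topic."""
    url = nsqd_http + "/pub?" + urllib.parse.urlencode({"topic": topic})
    body = json.dumps(message).encode("utf-8")
    return urllib.request.Request(url, data=body, method="POST")

# The message fields below are illustrative, not Bitly's real payload.
request = build_publish_request("spam_api", {"o": "+", "l": "example"})

# Sending it requires a running nsqd on this host:
# urllib.request.urlopen(request, timeout=2)
```

Publishing to the local nsqd keeps the web app's write path on the same machine, which is the arrangement the slides describe.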

Where Am I Again?

[Diagram: Web App, Database, Worker, nsqd with topic 'spam_counter', nsqd with topic 'spam_api', nsqlookupd being asked for topic 'spam_api']

A worker in another service that is looking for a topic asks nsqlookupd, which replies with the address of the nsqd hosting it.
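The lookup step can be sketched as parsing the body nsqlookupd returns for `GET /lookup?topic=spam_api`. The sample response below is illustrative (the hosts are made up), and the helper name is hypothetical. Some older nsqlookupd versions wrap the result in a `data` envelope, so the sketch accepts both shapes.

```python
import json

def producer_addresses(lookup_body):
    """Extract (host, tcp_port) pairs from an nsqlookupd /lookup response body."""
    payload = json.loads(lookup_body)
    # Older nsqlookupd versions wrap the result in a "data" envelope.
    payload = payload.get("data", payload)
    return [
        (producer["broadcast_address"], producer["tcp_port"])
        for producer in payload.get("producers", [])
    ]

# Illustrative response body; hostnames are invented for this example.
sample = json.dumps({
    "channels": ["spam_counter"],
    "producers": [
        {"broadcast_address": "web1.example.com", "tcp_port": 4150, "http_port": 4151},
        {"broadcast_address": "web2.example.com", "tcp_port": 4150, "http_port": 4151},
    ],
})

addresses = producer_addresses(sample)
```

A client library such as pynsq does this polling and connecting itself. The sketch only shows what information comes back.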

Talkin’ ‘Bout Something

[Diagram: Web App, Database, Worker, nsqd with topic 'spam_counter', nsqd with topic 'spam_api' and channel 'spam_counter']

queuereader connects to nsqd, registers a channel

Cross-Town Traffic

[Diagram: nsqd with topic 'spam_api', channel 'spam_counter', three Workers]

Messages on a channel are divided among that channel's subscribers, which allows horizontal scaling.

Channeling the Ghost

[Diagram: nsqd with topic 'spam_api'; channel 'spam_counter' with three Workers; channel 'nsq_to_file' with one Worker]

Each channel receives a full copy of every message on the topic.
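The topic and channel semantics from the last two slides can be modelled in memory. This is a toy, not NSQ: the class names are invented, and strict round-robin is a simplification of how nsqd actually spreads messages across the consumers of a channel.

```python
from itertools import cycle

class Channel:
    """One channel: each message goes to exactly one of its consumers."""

    def __init__(self, consumers):
        self.inboxes = {name: [] for name in consumers}
        self._next_consumer = cycle(consumers)  # simplified distribution

    def deliver(self, message):
        self.inboxes[next(self._next_consumer)].append(message)

class Topic:
    """One topic: every channel receives its own copy of each message."""

    def __init__(self):
        self.channels = {}

    def publish(self, message):
        for channel in self.channels.values():
            channel.deliver(message)

topic = Topic()
topic.channels["spam_counter"] = Channel(["worker1", "worker2", "worker3"])
topic.channels["nsq_to_file"] = Channel(["archiver"])

for n in range(6):
    topic.publish(n)
```

After six publishes, the archiver on 'nsq_to_file' holds all six messages, while the three workers on 'spam_counter' hold two each.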

nsqadmin

nsqadmin

How we manage our message queues


pynsqauthd

It’s made of PEOPLE!

https://github.com/jehiah/nsqauth-contrib

pynsq client

It’s still made of PEOPLE!

https://github.com/bitly/pynsq

• settings.py
• <service>_api.py
• queuereader_<service>.py
• README.md

/<service>

Queuereaders are part of streaming architecture

    if __name__ == "__main__":
        tornado.options.parse_command_line()
        logatron_client.setup()

        Reader(
            topic=settings.get('nsqd_output_topic'),
            channel='queuereader_spam_metrics',
            validate_method=validate_message,
            message_handler=count_spam_actions,
            lookupd_http_addresses=settings.get('nsq_lookupd')
        )
        run()

/<service>/queuereader_<service>.py, 1 of 4


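    def validate_message(message):
        # Accept '+' messages that carry an 'l' field.
        if message.get('o') == '+' and message.get('l'):
            return True
        # Accept '-' messages only if they carry both 'l' and 'bl'.
        if (message.get('o') == '-' and message.get('l')
                and message.get('bl')):
            return True
        return False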

/<service>/queuereader_<service>.py, 2 of 4


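    def count_spam_actions(message, nsq_msg):
        # Pick the statsd key for this operation and label, then count it.
        key_section = statsd_keys[message['o']]
        key = key_section.get(message['l'], key_section['default'])
        statsd.incr(key)

        # Manual removals get a second, more specific counter.
        if key == 'remove_by_manual':
            key_section = statsd_keys['-manual']
            key = key_section.get(message['bl'], key_section['default'])
            statsd.incr(key)

        # Tell nsqd the message was handled successfully.
        return nsq_msg.finish()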

/<service>/queuereader_<service>.py, 3 of 4


    if __name__ == "__main__":
        tornado.options.parse_command_line()
        logatron_client.setup()

        Reader(
            topic=settings.get('nsqd_output_topic'),
            channel='queuereader_spam_metrics',
            validate_method=validate_message,
            message_handler=count_spam_actions,
            lookupd_http_addresses=settings.get('nsq_lookupd')
        )
        run()

/<service>/queuereader_<service>.py, 4 of 4


utilities

• nsq_tail
• nsq_to_file
• to_nsq
• nsq_to_nsq
• nsq_stat

Parts is parts! (part 3, redux)

Features & Guarantees
(aka Trade-Offs)

• Distributed, No SPOF
• Horizontally Scalable
• TLS
• statsd integration
• Easy to Deploy
• Cluster Administration

Messages NOT Durable

Delivered at least once

Un-ordered Delivery

Eventually-Consistent Discovery
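At-least-once, un-ordered delivery means a consumer may see the same message twice and in any order, so handlers need to tolerate both. One common approach, sketched here with hypothetical names, is to remember message IDs and ignore repeats. The unbounded in-memory set is a simplification; a real service would need to expire or persist it.

```python
class IdempotentCounter:
    """Applies each message at most once, regardless of redelivery or order."""

    def __init__(self):
        self.seen_ids = set()
        self.total = 0

    def handle(self, message_id, amount):
        """Return True if the message was applied, False if it was a repeat."""
        if message_id in self.seen_ids:
            return False
        self.seen_ids.add(message_id)
        self.total += amount
        return True

counter = IdempotentCounter()

# "m2" arrives before "m1" and is then delivered a second time.
deliveries = [("m2", 5), ("m1", 1), ("m2", 5)]
applied = [counter.handle(message_id, amount) for message_id, amount in deliveries]
```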

A Thousand Points of Light (well, 58)


Whoa…

8.2 billion decodes per month

Streaming Architecture

• Easy to build new services
• Easy to scale individual components horizontally
• Durable in the face of single component failure
• Distributed

THINGS TO THINK ABOUT
• Monitoring, monitoring, monitoring
• Failure modes: how can things fail? How does your application as a whole handle the failure of individual components?
• Measurement: metrics show the range
• Timeouts: connection timeouts, DNS timeouts. A slow network is the same as a failed service.
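The point about timeouts can be sketched as a wrapper that treats "too slow" and "failed" identically. The helper and the three stand-in services are hypothetical, and running the call in a thread pool is only one way to impose a deadline.

```python
import concurrent.futures
import time

def call_with_timeout(func, timeout_seconds, fallback):
    """Run func, returning fallback if it raises or exceeds the time limit."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(func)
    try:
        return future.result(timeout=timeout_seconds)
    except Exception:
        # A timeout and an outright failure are handled the same way.
        return fallback
    finally:
        # Do not block the caller waiting for a slow call to finish.
        pool.shutdown(wait=False)

def fast_service():
    return "fresh"

def slow_service():
    time.sleep(0.5)  # stand-in for a slow network
    return "fresh"

def broken_service():
    raise ConnectionError("service is down")

fast_result = call_with_timeout(fast_service, 0.05, "cached")
slow_result = call_with_timeout(slow_service, 0.05, "cached")
broken_result = call_with_timeout(broken_service, 0.05, "cached")
```

From the caller's side the slow service and the broken service are indistinguishable: both yield the fallback.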

NSQ

http://nsq.io

Duke of URL

https://github.com/bitly/nsq https://github.com/bitly/go-nsq https://github.com/bitly/pynsq https://github.com/jehiah/nsqauth-contrib

Web Scale - http://www.mongodb-is-web-scale.com
Waterfall - https://www.flickr.com/photos/desatur8/14949285342
Tornado - https://www.flickr.com/photos/indigente/798304
John de Lancie - https://www.flickr.com/photos/cayusa/1394930005
Ben Whishaw - https://www.flickr.com/photos/rossendalewadey/6032496676
Command Key - https://www.flickr.com/photos/klash/3175479797
iPhone6 Event - https://www.flickr.com/photos/notionscapital/15067798867
Wait for iPhone - https://www.flickr.com/photos/josh_gray/662814907
NSQ Logo - http://nsq.io

All other photos by T. Peter Herndon

Photo Credits

Questions?

Peter Herndon
peter@bit.ly
@tpherndon