Date posted: 21-Apr-2017
Sudhir Tonse (@stonse), Danny Yuan (@g9yuayon)
Big Data Pipeline and Analytics Platform
Using NetflixOSS and Other Open Source Software
Data is the most important asset at Netflix
If all the data is easily available to all teams, it can be leveraged in new and exciting ways
~1,000 Device Types
~500 Apps/Web Services
~100 Billion Events/Day
3.2M messages per second at peak time
3GB per second at peak time
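A quick sanity check on these figures: dividing peak bytes per second by peak messages per second gives the average event size, and dividing daily volume by seconds per day gives the average rate (a back-of-the-envelope sketch; it assumes the peak figures apply simultaneously):

```java
public class PipelineMath {
    public static void main(String[] args) {
        double bytesPerSec = 3e9;    // ~3 GB per second at peak
        double msgsPerSec = 3.2e6;   // ~3.2M messages per second at peak
        double eventsPerDay = 100e9; // ~100 billion events per day

        double avgEventBytes = bytesPerSec / msgsPerSec; // ~937 bytes per event
        double avgEventsPerSec = eventsPerDay / 86400;   // ~1.16M events/s on average

        System.out.printf("avg event size: %.0f bytes%n", avgEventBytes);
        System.out.printf("avg rate: %.2fM events/s%n", avgEventsPerSec / 1e6);
    }
}
```

So peak traffic runs at roughly three times the daily average, with events averaging a bit under 1 KB each.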
Dashboard
Types of Events
User Interface Events: Search Event (Matrix using PS3, …), Star Rating Event (HoC: 5 stars, Xbox, US, …)
Infrastructural Events: RPC Call (API -> Billing Service, /bill/.., 200, …), Log Errors (NPE, Movie is null, …)
Other Events
Making Sense of Billions of Events
http://netflix.github.io+
A Humble Beginning
Evolution: Scale!
[Diagram: tens of application instances, each emitting log events]
Something happened: our traffic turned into a hockey stick, and the number of applications exploded. So log traffic exploded too. Simple log scraping wouldn't cut it any more.
We Want to Process App Data in Hadoop
Our Hadoop Ecosystem
@NetflixOSS Big Data Tools
Hadoop as a Service
Pig Scripting on Steroids
Pig Married to Clojure
Map-Reduce for Clojure
S3MPER
S3mper is a library that provides an additional layer of consistency checking on top of Amazon's S3 index through use of a consistent, secondary index.
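The core idea can be sketched in a few lines: a consistent secondary index records which keys should exist, and every eventually-consistent S3 listing is checked against it. The names below are hypothetical, not S3mper's actual API:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch of the S3mper idea (names are hypothetical, not S3mper's API):
// a consistent secondary index records which keys *should* exist, and every
// eventually-consistent listing is checked against it.
public class ConsistentListingCheck {

    /** Returns the keys the index says exist but the listing failed to show. */
    static Set<String> missingKeys(Set<String> indexedKeys, List<String> listedKeys) {
        Set<String> missing = new HashSet<>(indexedKeys);
        missing.removeAll(new HashSet<>(listedKeys));
        return missing;
    }

    public static void main(String[] args) {
        Set<String> index = Set.of("part-0000", "part-0001", "part-0002");
        // Eventually consistent listing: one key is not visible yet
        List<String> listing = List.of("part-0000", "part-0002");

        Set<String> missing = missingKeys(index, listing);
        if (!missing.isEmpty()) {
            // S3mper's approach: retry the listing or fail the job
            // rather than silently processing an incomplete file set.
            System.out.println("inconsistent listing, missing: " + missing);
        }
    }
}
```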
Efficient ETL with Cassandra
Cassandra
Offline Analysis
Evolution: Speed!
We Want to Aggregate, Index, and Query Data in Real Time
Interactive Exploration
For one thing: interactive exploration. Sometimes we want data in real time so we can act quickly; some data is only useful in a small time window, after all. Sometimes we want to run lots of experimental queries just to find the right insights. If we wait too long for a query to come back, we can't iterate fast enough. Either way, we need query results back in seconds.
Let's walk through some use cases.
Client activity event filter: */name = movieStarts
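A path-style filter like the one above selects every client activity event whose name field matches. A minimal sketch (the event representation and helper are hypothetical, for illustration only):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch of a path-style filter such as "*/name = movieStarts":
// select every client activity event whose "name" field matches.
public class EventFilter {

    static List<Map<String, String>> filterByName(List<Map<String, String>> events, String name) {
        return events.stream()
                .filter(e -> name.equals(e.get("name")))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Map<String, String>> events = List.of(
                Map.of("name", "movieStarts", "device", "PS3"),
                Map.of("name", "searchEvent", "device", "Xbox"),
                Map.of("name", "movieStarts", "device", "Silverlight"));

        System.out.println(filterByName(events, "movieStarts").size()); // prints 2
    }
}
```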
Pipeline Challenges
App owners: send and forget
Data scientists: validation, ETL, batch processing
DevOps: stream processing, targeted search
Message Routing
Here is one example: we process more than 150 thousand events per second about user activity. What if we'd like to know, broken down by geography, how many users started playing videos in the past 5 minutes? I submit my query, and in a few seconds...
But this is an aggregated view. What if I want to drill down into the data immediately along different dimensions? In this particular case, to find failed attempts on our Silverlight players that run on PCs and Macs?
We Want to Consume Data Selectively in Different Ways
Message broker
High throughput
Persistent and replicated
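For a high-throughput, persistent, replicated broker of this kind (Kafka is the usual choice in such pipelines), producer settings trade durability against throughput. A sketch of illustrative settings, shown as a plain `Properties` object so nothing beyond the standard library is needed; the actual `KafkaProducer` construction is only indicated in a comment:

```java
import java.util.Properties;

// Sketch of producer settings for a Kafka-style broker (the high-throughput,
// persistent, replicated broker described above). Values are illustrative.
public class BrokerProducerConfig {

    static Properties producerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092,broker2:9092");
        props.put("acks", "1");           // leader-only ack: trade durability for throughput
        props.put("linger.ms", "5");      // small batching window
        props.put("batch.size", "65536"); // aggressive batching for throughput
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        return props;
    }

    public static void main(String[] args) {
        Properties props = producerProps();
        // With the Kafka client library on the classpath one would create:
        // Producer<String, String> producer = new KafkaProducer<>(props);
        System.out.println(props.getProperty("acks"));
    }
}
```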
There Is More
Intelligent Alerts
Note this is different from alerting based on monitoring metrics. Monitoring metrics are great and versatile, but they don't help us catch unexpected errors. When we build an application, we instrument our code diligently, yet it's very likely we miss some critical instrumentation points. There's one thing we always catch, though: logged errors and unhandled exceptions. The alert provides a precise entry point and the right context for people to drill down into the right problems.
Guided Debugging in the Right Context
What We Need
Ad-hoc queries along different dimensions
Quick aggregations and Top-N queries
Time series with flexible filters
Quick access to raw data using boolean queries
Druid
Rapid exploration of high-dimensional data
Fast ingestion and querying
Time series
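The Top-N aggregation mentioned above — the kind of query Druid answers interactively — can be sketched with plain Java streams: group events by one dimension, count, and keep the N largest buckets:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch of a Top-N query over one dimension (e.g. country), the kind of
// aggregation Druid serves interactively over high-dimensional event data.
public class TopN {

    static Map<String, Long> topN(List<String> dimensionValues, int n) {
        return dimensionValues.stream()
                // group by dimension value and count occurrences
                .collect(Collectors.groupingBy(v -> v, Collectors.counting()))
                .entrySet().stream()
                // keep only the n largest buckets, preserving rank order
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(n)
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                        (a, b) -> a, LinkedHashMap::new));
    }

    public static void main(String[] args) {
        List<String> countries = List.of("US", "US", "BR", "US", "BR", "CA");
        System.out.println(topN(countries, 2)); // {US=3, BR=2}
    }
}
```

Druid does the same conceptually, but over pre-aggregated, column-oriented segments so it stays interactive at billions of rows.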
Real-time indexing of event streams
Killer feature: boolean search
Great UI: Kibana
The Old Pipeline
The New Pipeline
There Is More
It's Not All About Counters and Time Series
RequestId | Parent Id | Node Id | Service Name | Status
4965-4a74 | 0         | 123     | Edge Service | 200
4965-4a74 | 123       | 456     | Gateway      | 200
4965-4a74 | 456       | 789     | Service A    | 200
4965-4a74 | e456      | abc     | Service B    | 200
Status: 200
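Rows like these link spans of one request together through parent and node IDs, so the call tree can be reconstructed. A minimal sketch (IDs and the row shape are illustrative):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch: rebuild the call graph from (parentId, nodeId, serviceName) rows
// like the trace table above. IDs here are illustrative.
public class TraceTree {

    record Span(String parentId, String nodeId, String service) {}

    /** Maps each service to the list of services it called directly. */
    static Map<String, List<String>> callGraph(List<Span> spans) {
        // First pass: resolve node id -> service name
        Map<String, String> serviceByNode = new HashMap<>();
        for (Span s : spans) serviceByNode.put(s.nodeId(), s.service());

        // Second pass: attach each span to the service that owns its parent id
        Map<String, List<String>> children = new LinkedHashMap<>();
        for (Span s : spans) {
            String caller = serviceByNode.get(s.parentId());
            if (caller != null) {
                children.computeIfAbsent(caller, k -> new ArrayList<>()).add(s.service());
            }
        }
        return children;
    }

    public static void main(String[] args) {
        List<Span> spans = List.of(
                new Span("0", "123", "Edge Service"),
                new Span("123", "456", "Gateway"),
                new Span("456", "789", "Service A"));
        System.out.println(callGraph(spans));
        // {Edge Service=[Gateway], Gateway=[Service A]}
    }
}
```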
Distributed Tracing
A System that Supports All These
A Data Pipeline To Glue Them All
Make It Simple
Message Producing
Simple and uniform API: messageBus.publish(event)
Consumption Is Simple Too

consumer.observe().subscribe(new Subscriber<Ackable>() {
    @Override
    public void onNext(Ackable ackable) {
        process(ackable.getEntity(MyEventType.class));
        ackable.ack();
    }
});
consumer.pause();
consumer.resume();
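The Ackable handoff can be sketched with the standard library alone — the names mirror the slide, but this implementation is hypothetical, not the actual pipeline code. The consumer processes the entity, then explicitly acks so the bus can mark the message as consumed; unacked messages would be redelivered:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stand-in for the Ackable handoff on the consumer side.
public class AckableDemo {

    static class Ackable<T> {
        private final T entity;
        private boolean acked;
        Ackable(T entity) { this.entity = entity; }
        T getEntity() { return entity; }
        void ack() { acked = true; }
        boolean isAcked() { return acked; }
    }

    // Deliver each buffered message to the subscriber callback in order.
    static <T> void deliver(List<Ackable<T>> messages, Consumer<Ackable<T>> subscriber) {
        for (Ackable<T> m : messages) subscriber.accept(m);
    }

    public static void main(String[] args) {
        List<Ackable<String>> batch = new ArrayList<>();
        batch.add(new Ackable<>("movieStarts"));
        batch.add(new Ackable<>("searchEvent"));

        deliver(batch, ackable -> {
            System.out.println("processing " + ackable.getEntity());
            ackable.ack(); // unacked messages would be redelivered
        });
    }
}
```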
RxJava
Functional reactive programming model
Powerful streaming API
Separation of logic and threading model
Design Decisions
Top priority: app stability and throughput
Asynchronous operations
Aggressive buffering
Drops messages if necessary
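The "aggressive buffering, drop if necessary" decision can be sketched with a bounded queue whose publish path never blocks the application thread. This is a minimal illustration of the stated policy, not the actual library code:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the stated design decision: publishing never blocks the app.
// Events go into a bounded buffer; when it is full, the event is dropped
// and counted rather than slowing the application down.
public class DroppingBuffer {

    private final BlockingQueue<String> buffer;
    private final AtomicLong dropped = new AtomicLong();

    DroppingBuffer(int capacity) { this.buffer = new ArrayBlockingQueue<>(capacity); }

    /** Non-blocking publish: returns false (and counts a drop) when the buffer is full. */
    boolean publish(String event) {
        boolean accepted = buffer.offer(event);
        if (!accepted) dropped.incrementAndGet();
        return accepted;
    }

    long droppedCount() { return dropped.get(); }

    public static void main(String[] args) {
        DroppingBuffer mb = new DroppingBuffer(2);
        mb.publish("e1");
        mb.publish("e2");
        mb.publish("e3"); // buffer full: dropped, app thread never blocks
        System.out.println("dropped: " + mb.droppedCount()); // dropped: 1
    }
}
```

A background thread (omitted here) would drain the buffer and forward events to the broker asynchronously.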
Anything Can Fail
Cloud Resiliency
Fault Tolerance Features
Write and forward with auto-reattached EBS (Amazon's Elastic Block Store)
Disk-backed queue: big-queue
Customized scaling down
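The disk-backed queue idea — events survive a process restart because they are appended to durable storage before being forwarded — can be sketched as below. Real implementations like big-queue use segmented, memory-mapped files; this single-file version is only illustrative:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

// Minimal sketch of a disk-backed queue in the spirit of big-queue:
// events are appended to a file before being consumed, so they survive
// a crash or restart of the forwarding process.
public class DiskBackedQueue {

    private final Path file;

    DiskBackedQueue(Path file) throws IOException {
        this.file = file;
        if (!Files.exists(file)) Files.createFile(file);
    }

    /** Append one event durably before acknowledging it to the producer. */
    void enqueue(String event) throws IOException {
        Files.writeString(file, event + System.lineSeparator(), StandardOpenOption.APPEND);
    }

    /** Drains all persisted events, e.g. to forward them after a restart. */
    List<String> drain() throws IOException {
        List<String> events = Files.readAllLines(file);
        Files.writeString(file, ""); // truncate only after a successful read
        return events;
    }

    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("queue", ".log");
        DiskBackedQueue q = new DiskBackedQueue(p);
        q.enqueue("event-1");
        q.enqueue("event-2");
        System.out.println(q.drain()); // [event-1, event-2]
    }
}
```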
There's More to Do
Contribute to @NetflixOSS
Join us :-)
Summary
http://netflix.github.io+
You can build your own web-scale data pipeline using open source components
Thank You!
Sudhir Tonse
http://www.linkedin.com/in/sudhirtonse
Twitter: @stonse
Danny Yuan
http://www.linkedin.com/pub/danny-yuan/4/374/862
Twitter: @g9yuayon