Data Care, Feeding, and Maintenance

Mercedes Coyle Data Infrastructure Engineer at

@benzobot

• Online Video Syndication platform

• Connect content providers, video publishers, advertising partners

• 2-3 million streams/day

Where does your data come from?

• One-time use analytics, or continual collection and processing?

• How much control do you have over data content and formatting?

• public datasets (gov, twitter) - little control

• application logging - more control

How is your data formatted?

• Universal data format, or Normalize All the Data

• Pre- vs Post- processing

• Mapping data to a schema, even if it doesn’t have one
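As an illustration of mapping differently shaped inputs onto one schema, here is a minimal Python sketch. The three-field schema, the log line layout, and the "created"/"text" keys are all assumptions made up for this example; they are not taken from the talk.

```python
from datetime import datetime, timezone

# Hypothetical target schema: every record ends up with exactly these fields.
SCHEMA_FIELDS = ("timestamp", "source", "event")


def normalize_app_log(line):
    """Map an assumed '<ISO timestamp> <event>' application log line to the schema."""
    ts_text, event = line.strip().split(" ", 1)
    ts = datetime.fromisoformat(ts_text)
    return {
        "timestamp": ts.astimezone(timezone.utc).isoformat(),
        "source": "app_log",
        "event": event,
    }


def normalize_public_record(record):
    """Map an assumed public-dataset record (epoch seconds in 'created') to the schema."""
    ts = datetime.fromtimestamp(record["created"], tz=timezone.utc)
    return {
        "timestamp": ts.isoformat(),
        "source": "public",
        "event": record.get("text", ""),  # the field may be missing, so default to empty
    }


if __name__ == "__main__":
    print(normalize_app_log("2015-06-24T12:00:00+00:00 stream_start"))
    print(normalize_public_record({"created": 0, "text": "hello"}))
```

Application logs get a strict parser because you control their format; the public record parser tolerates a missing field because you do not.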

Storage and Analytics tools

• Hadoop - distributed MapReduce batch processing for large data sets

• Powerful querying tools (SQL-like Hive, Pig)

• Automated processing tasks for data ingestion and processing

• Slow - analyzing large data takes time, so no realtime results

Storage and Analytics tools

• Realtime infrastructure - instantly available analytics and data storage

• Storm, Spark, MongoDB, Logstash & Elasticsearch

• Can create aggregations and analytics jobs on the fly, and get results in seconds

• Quickly detect issues and make informed decisions

• Not always simple to query backwards over time series



• Small datasets? Reach for some more familiar tools

• CSVs can be handy for quick data analysis on a sample set of your data, especially for biz folks

• Don’t forget about command line tools: grep, awk, sort -u, sum
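For a small sample, the same kind of quick look that grep, sort, and awk give you can also be done with the Python standard library. This is only a sketch: the CSV columns (player_type, browser, streams) and the values are invented for illustration.

```python
import csv
import io
from collections import Counter

# Hypothetical sample export; in practice this would be a small file on disk.
SAMPLE_CSV = """player_type,browser,streams
player_a,browser_x,120
player_b,browser_y,80
player_a,browser_y,45
player_b,browser_x,95
"""


def summarize(csv_text):
    """Sum streams per player type and list the distinct browsers seen."""
    streams_by_player = Counter()
    browsers = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        streams_by_player[row["player_type"]] += int(row["streams"])
        browsers.add(row["browser"])  # roughly what `sort -u` would give
    return dict(streams_by_player), sorted(browsers)


if __name__ == "__main__":
    totals, browsers = summarize(SAMPLE_CSV)
    print(totals)
    print(browsers)
```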


You have data - now what?!

• What do you want to learn from your data?

• How quickly do you need results?

• Is your dataset one time use, or will you add to it over time?

• How accurate do your results need to be?

• Where does your data need to end up?

Data Infrastructure

• 75-100 million documents per day

• Lambda Architecture

• Batch processing with Hadoop

• Homegrown Realtime Processing system using RSyslog, Logstash, Elasticsearch and Kibana (currently undergoing rewrite with Storm)
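The core idea of a Lambda Architecture, answering a query by combining a precomputed batch view with a speed-layer view of recent events, can be sketched in a few lines of Python. This is a toy model, not the system described in the talk: the event tuples, the publisher names, and the integer timestamps are assumptions for illustration.

```python
from collections import Counter


def build_views(events, last_batch_run):
    """Split (timestamp, publisher) events into a batch view and a speed view.

    Events up to and including the last batch run are assumed to have been
    processed by the batch layer; anything newer only exists in the speed layer.
    """
    batch_view, speed_view = Counter(), Counter()
    for timestamp, publisher in events:
        if timestamp <= last_batch_run:
            batch_view[publisher] += 1
        else:
            speed_view[publisher] += 1
    return batch_view, speed_view


def query_streams(batch_view, speed_view, publisher):
    """Serving-side merge: batch total plus whatever arrived since the batch ran."""
    return batch_view.get(publisher, 0) + speed_view.get(publisher, 0)


if __name__ == "__main__":
    events = [(1, "pub_a"), (2, "pub_a"), (3, "pub_b"), (5, "pub_a")]
    batch, speed = build_views(events, last_batch_run=3)
    print(query_streams(batch, speed, "pub_a"))
```

The merge is only correct if the two views never overlap in time, which is why the cutoff is applied in one place.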

Data Infrastructure

• Alert Fatigue

• Vanity Metrics

• Alerts and metrics can only be intelligent and actionable if they are relatable

Log All the Data, but don’t monitor All the Data
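One way to keep alerts actionable is to alert on a change relative to a recent baseline instead of on every raw metric. The sketch below is an assumption about what such a rule could look like; the function name, the 30% threshold, and the sample numbers are hypothetical and not from the talk.

```python
from statistics import mean


def should_alert(history, current, max_drop=0.3):
    """Return True when `current` is more than `max_drop` below the recent average.

    `history` is a list of recent values for one metric (for example streams
    per hour). With no history or a non-positive baseline there is nothing
    meaningful to compare against, so no alert is raised.
    """
    if not history:
        return False
    baseline = mean(history)
    if baseline <= 0:
        return False
    relative_drop = (baseline - current) / baseline
    return relative_drop > max_drop


if __name__ == "__main__":
    recent = [100, 110, 90]
    print(should_alert(recent, 95))  # small dip
    print(should_alert(recent, 50))  # large drop
```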

Data Investigation: Rapid Stream Decline

Whoops!

Data Investigation: Rapid Stream Decline

• Our graphs only showed one metric (streams). Why did it decrease so much?

• Two player types, only one was affected.

• System performance metrics and monitoring showed no outages at this time.

Data Investigation: Digging Deeper

• Publishers provided page load data

• Correlated batch summaries of player loads with page load counts

• Cross-checked data in the Speed Layer to rule out batch processing issues
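The correlation step above can be approximated by comparing player loads to page loads per segment and looking for segments where the ratio is unusually low. This is a sketch under assumptions: the per-browser breakdown, the counts, and the 0.5 threshold are invented for illustration, not the actual analysis.

```python
def load_ratios(page_loads, player_loads):
    """Compute player loads per page load for each segment with page traffic."""
    ratios = {}
    for segment, pages in page_loads.items():
        if pages > 0:  # skip segments with no page loads to avoid dividing by zero
            ratios[segment] = player_loads.get(segment, 0) / pages
    return ratios


def suspicious_segments(ratios, threshold=0.5):
    """Return segments whose player-to-page ratio falls below the threshold."""
    return sorted(segment for segment, ratio in ratios.items() if ratio < threshold)


if __name__ == "__main__":
    # Hypothetical counts: publisher-reported page loads vs. our player loads.
    page_loads = {"browser_a": 1000, "browser_b": 1000}
    player_loads = {"browser_a": 950, "browser_b": 200}
    ratios = load_ratios(page_loads, player_loads)
    print(ratios)
    print(suspicious_segments(ratios))
```

A segment where pages keep loading but players do not points at the player rather than at traffic.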



• Further data investigation revealed browser compatibility issues with our players

• Our batch reporting layer visualization highlighted the problem

• Ad-hoc queries in the speed layer allowed quick analysis to determine what caused the issue

Data Investigation: Next Steps

• More intelligent realtime reporting

• Refine our data visualization tools to better represent our metrics

• Better communication with the teams/products we collect data on to inform analytics and dashboards

Resources

• Hortonworks Hadoop Sandbox - http://hortonworks.com/products/hortonworks-sandbox/

• Storm Starter - https://github.com/nathanmarz/storm-starter and http://storm-project.net

• MongoDB Aggregation - http://docs.mongodb.org/manual/core/aggregation-introduction/

• Common Event Expression - http://cee.mitre.org/about/

• Thanks!

• @benzobot

• [email protected]

Questions?

Date post:	24-Jun-2015
Category:	Technology
Upload:	mercedes-coyle