Date post: | 24-Jun-2015 |
Category: | Technology |
Upload: | mercedes-coyle |
Data Care, Feeding, and Maintenance
Mercedes Coyle, Data Infrastructure Engineer (@benzobot)
• Online Video Syndication platform
• Connect content providers, video publishers, advertising partners
• 2-3 million streams/day
Where does your data come from?
• One-time use analytics, or continual collection and processing?
• How much control do you have over data content and formatting?
  • Public datasets (gov, Twitter) - little control
  • Application logging - more control
How is your data formatted?
• Universal data format, or Normalize All the Data
• Pre- vs Post- processing
• Mapping data to a schema, even if it doesn’t have one
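Mapping loosely formatted data onto a schema can be sketched in a few lines of Python. This is only an illustration: the target fields (`timestamp`, `player_type`, `event`), the source keys (`ts`, `player`) and the two input shapes are assumptions made for the example, not formats described in the talk.

```python
import json

# Hypothetical target schema: every record ends up with exactly these keys.
SCHEMA = ("timestamp", "player_type", "event")

def normalize(raw):
    """Map a JSON log line or a pipe-delimited log line onto SCHEMA."""
    raw = raw.strip()
    if raw.startswith("{"):
        # JSON input: pull out the (assumed) source keys, None if absent.
        doc = json.loads(raw)
        values = (doc.get("ts"), doc.get("player"), doc.get("event"))
    else:
        # Delimited input: pad short lines so missing fields become None.
        parts = raw.split("|")
        parts += [None] * (len(SCHEMA) - len(parts))
        values = tuple(parts[:len(SCHEMA)])
    return dict(zip(SCHEMA, values))

records = [
    normalize('{"ts": "2015-06-24T10:00:00Z", "player": "flash", "event": "stream_start"}'),
    normalize("2015-06-24T10:00:01Z|html5|stream_start"),
    normalize("2015-06-24T10:00:02Z|html5"),
]
```

Whether this runs before storage (pre-processing) or at query time (post-processing) is the trade-off the slide points at; the function itself is the same either way.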
Storage and Analytics tools
• Hadoop - distributed MapReduce batch processing for large data sets
• Powerful querying tools (SQL-like Hive, Pig)
• Automated processing tasks for data ingestion and processing
• Slow - analyzing large data takes time, so no realtime results
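The MapReduce model behind Hadoop batch jobs can be illustrated in plain Python. This sketch runs in a single process and does not use any Hadoop API; the record layout and the `player_type` field are assumptions for the example.

```python
from collections import defaultdict

def map_phase(record):
    # Map: emit one (key, 1) pair per stream event.
    yield (record["player_type"], 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Reduce: collapse the values for one key into a single count.
    return key, sum(values)

def run_job(records):
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

# Hypothetical sample of stream events.
sample = [
    {"player_type": "flash"},
    {"player_type": "html5"},
    {"player_type": "flash"},
]
stream_counts = run_job(sample)
```

On a real cluster the map and reduce steps run on many machines over the full data set, which is where both the scale and the latency come from.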
Storage and Analytics tools
• Realtime infrastructure - instantly available analytics and data storage
• Storm, Spark, MongoDB, Logstash & Elasticsearch
• Can create aggregations and analytics jobs on the fly, and get results in seconds
• Quickly detect issues and make informed decisions
• Not always simple to query backwards over time series
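An on-the-fly aggregation can be as simple as counting events per time window as they arrive. The sketch below is an in-memory illustration, not the API of Storm, Spark or Elasticsearch; the 60-second window and the event fields are assumptions.

```python
from collections import Counter

WINDOW_SECONDS = 60  # hypothetical window size

def window_start(epoch_seconds):
    # Round a timestamp down to the start of its window.
    return epoch_seconds - (epoch_seconds % WINDOW_SECONDS)

class StreamCounter:
    """Counts events per (window, player_type) as they are ingested."""

    def __init__(self):
        self.counts = Counter()

    def ingest(self, epoch_seconds, player_type):
        self.counts[(window_start(epoch_seconds), player_type)] += 1

    def query(self, epoch_seconds, player_type):
        # Only windows that are still held in memory can be answered.
        return self.counts[(window_start(epoch_seconds), player_type)]

counter = StreamCounter()
for ts, player in [(120, "flash"), (125, "html5"), (150, "flash"), (185, "flash")]:
    counter.ingest(ts, player)
```

Results are available as soon as an event is ingested, but the counter only knows about the windows it has kept, which hints at why querying backwards over a long time series is harder in this kind of system.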
Storage and Analytics tools
• Small datasets? Reach for some more familiar tools
• CSVs can be handy for quick data analysis on a sample set of your data, especially for biz folks
• Don’t forget about command line tools: grep, awk, sort -u, sum
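For a small sample, the same kind of quick check (unique values, a column total) can also be done with the Python standard library. The CSV content and column names below are made up for illustration.

```python
import csv
import io

# Hypothetical sample; in practice this would be open("sample.csv").
sample_csv = io.StringIO(
    "publisher,player_type,streams\n"
    "pub_a,flash,120\n"
    "pub_b,html5,80\n"
    "pub_a,html5,45\n"
)

rows = list(csv.DictReader(sample_csv))

# Roughly what `sort -u` on the first column would give.
unique_publishers = sorted({row["publisher"] for row in rows})

# Roughly what an awk one-liner summing the third column would give.
total_streams = sum(int(row["streams"]) for row in rows)
```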
You have data - now what?!
• What do you want to learn from your data?
• How quickly do you need results?
• Is your dataset one time use, or will you add to it over time?
• How accurate do your results need to be?
• Where does your data need to end up?
Data Infrastructure
• 75-100 million documents per day
• Lambda Architecture
• Batch processing with Hadoop
• Homegrown Realtime Processing system using RSyslog, Logstash, Elasticsearch and Kibana (currently undergoing rewrite with Storm)
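In a Lambda Architecture, a query is typically answered by combining a precomputed batch view with a realtime view covering data the batch layer has not processed yet. A minimal sketch of that merge, with invented counts and keys (the talk does not show its actual views):

```python
def merged_view(batch_view, speed_view):
    """Combine batch totals with realtime increments, key by key."""
    merged = dict(batch_view)
    for key, count in speed_view.items():
        merged[key] = merged.get(key, 0) + count
    return merged

# Hypothetical stream counts per player type.
batch_view = {"flash": 1000, "html5": 400}  # up to the last batch run
speed_view = {"html5": 25, "hls": 3}        # events since the last batch run

serving = merged_view(batch_view, speed_view)
```

The sketch assumes the speed view only holds events newer than the last batch run; otherwise counts would be double-counted.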
Log All the Data, but don’t monitor All the Data
• Alert Fatigue
• Vanity Metrics
• Alerts and metrics can only be intelligent and actionable if they are relatable
Data Investigation: Rapid Stream Decline
Whoops!
• Our graphs only showed one metric (streams). Why did it decrease so much?
• Two player types, only one was affected.
• System performance metrics and monitoring showed no outages at this time.
Data Investigation: Digging Deeper
• Publishers provided page load data
• Correlated batch summaries of player loads with page load counts
• Cross-checked data in the Speed Layer to rule out batch processing issues
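Correlating player loads with page loads comes down to comparing a ratio across time and across player types. The sketch below uses entirely invented numbers and labels (`day1`, `type_a`, the 0.5 threshold); it shows the shape of the check, not the data or results from this investigation.

```python
# Hypothetical page loads reported by publishers, per day.
page_loads = {"day1": 1000, "day2": 1000}

# Hypothetical batch summaries of player loads, per day and player type.
player_loads = {
    "day1": {"type_a": 900, "type_b": 880},
    "day2": {"type_a": 890, "type_b": 300},
}

def load_ratio(day, player_type):
    # Share of page loads that resulted in a player load.
    return player_loads[day][player_type] / page_loads[day]

def dropped(player_type, threshold=0.5):
    # Flag a player type whose ratio fell below `threshold` of the prior day.
    return load_ratio("day2", player_type) < threshold * load_ratio("day1", player_type)

affected = [p for p in ("type_a", "type_b") if dropped(p)]
```

If page loads stay flat while one player type's ratio collapses, the problem is more likely in that player than in traffic or in the processing pipeline.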
Data Investigation: Digging Deeper
• Further data investigation revealed browser compatibility issues with our players
• Our batch reporting layer visualization highlighted the problem
• Ad-hoc queries in the speed layer allowed quick analysis to determine what caused the issue
Data Investigation: Next Steps
• More intelligent realtime reporting
• Refine our data visualization tools to better represent our metrics
• Better communication with the teams/products we collect data on to inform analytics and dashboards
Resources
• Hortonworks Hadoop Sandbox - http://hortonworks.com/products/hortonworks-sandbox/
• Storm Starter - https://github.com/nathanmarz/storm-starter and storm-project.net
• MongoDB Aggregation - http://docs.mongodb.org/manual/core/aggregation-introduction/
• Common Event Expression - http://cee.mitre.org/about/