Date post: | 08-Jan-2017 |
Category: | Technology |
Upload: | dataconomy-media |
The advantages of real-time monitoring in app development
Zsolt Varnai, Principal Software [email protected]
01 The past
02 Issues
03 Possible solutions
04 The present
05 The future
06 Examples
• Monthly or less frequent releases
• Big release test cycle and bugfixing period
• Analytics data is used occasionally
• Minimal information about what is happening in the production app
The past (1-2 years ago)
• If it worked before release then it will work later as well
=> NOT TRUE
• What can change
• OS update
• New devices
• Server-side behavior (any 3rd party tool + internal servers)
• Higher diversity, lots of use cases on real devices
The past (1-2 years ago)
• Bi-weekly release trains
• Feature flag controls feature visibility
• Checking GA and MP data through the API on a daily basis (with a daily summary)
• Reacting to issues within 1-2 days
• Monitoring app reviews regularly (Slack channel feed)
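As a rough illustration of how a feature flag can control feature visibility, here is a minimal sketch. The `FeatureFlags` class, the flag names, and the idea of overriding shipped defaults with a server payload are assumptions made for this example, not the setup described in the talk.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureFlags:
    """Hypothetical in-app flag store; defaults ship with the release."""
    defaults: dict[str, bool]
    remote: dict[str, bool] = field(default_factory=dict)

    def update_from_server(self, payload: dict[str, bool]) -> None:
        # Accept only flags the app knows about, so an unexpected
        # payload cannot switch on something that was never shipped.
        self.remote = {k: bool(v) for k, v in payload.items() if k in self.defaults}

    def is_enabled(self, name: str) -> bool:
        # A remote value wins over the default; unknown flags are disabled.
        return self.remote.get(name, self.defaults.get(name, False))


flags = FeatureFlags(defaults={"new_checkout": True, "chat": False})
# Simulated server response that hides a misbehaving feature remotely.
flags.update_from_server({"new_checkout": False, "unknown": True})
```

With a store like this, a feature that misbehaves in production can be hidden without waiting for the next release train.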
The past (6 months ago)
• What could possibly go wrong?
• Failing network requests
• Looks OK on the server side, remains unnoticed
• Client fails when it tries to process the response
• 3rd party tools causing crashes
• There is no failure, but the app doesn’t show the relevant content
• Invalid state causing permanent crash/error loops
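The case where a response looks fine on the server but the client cannot process it can be made visible by counting such failures on the client. The sketch below is only an illustration: the `silent_errors` counter stands in for a real metrics client, and the required fields are an assumed schema.

```python
import json
from collections import Counter

# Hypothetical in-memory counter standing in for a real metrics client.
silent_errors = Counter()

REQUIRED_FIELDS = ("id", "title")  # assumed schema, for illustration only


def parse_item(status_code: int, body: str):
    """Return the parsed item, or None after recording why it was unusable."""
    if status_code != 200:
        silent_errors["http_error"] += 1
        return None
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        # The server answered 200, but the client cannot read the payload.
        silent_errors["invalid_json"] += 1
        return None
    if not isinstance(data, dict) or any(f not in data for f in REQUIRED_FIELDS):
        # Valid JSON, but not the content the screen needs.
        silent_errors["missing_field"] += 1
        return None
    return data
```

Each counter key becomes a metric that can be reported, so a failure that never crashes the app still shows up in monitoring.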
Issues
• Collect, process and monitor as much data as possible from various sources
• Analytics data (conversion metrics and other metrics for core functionality)
• Monitor store reviews (manual, but a good source of direct information)
• Low level application logs, visible and silent errors/warnings
Solutions
• Deep instrumentation throughout the application code
• Stream-based real-time metrics from production apps (Kafka)
• Aggregating relevant metrics from the event stream (OpenTSDB, a time series database)
• Alerting on metrics (Bosun)
• Incident management system (VictorOps)
• Dashboards (Grafana)
• Drilling down on detailed events in case of an incident (Elasticsearch)
• Good chance of fixing big issues remotely before new release (feature flag coverage)
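The step from an event stream to alertable metrics can be sketched in a few lines. This is a simplified in-memory stand-in, not the Kafka/OpenTSDB/Bosun setup itself: the event shape (`metric`, `ts`), the 60-second window, and the threshold are all assumptions for illustration.

```python
from collections import defaultdict

BUCKET_SECONDS = 60  # assumed aggregation window


def aggregate(events):
    """Count events per (metric, bucket start time).

    Each event is a dict with 'metric' and 'ts' (Unix seconds); this shape
    is an assumption, not the schema used in the talk.
    """
    series = defaultdict(int)
    for event in events:
        bucket = int(event["ts"]) // BUCKET_SECONDS * BUCKET_SECONDS
        series[(event["metric"], bucket)] += 1
    return dict(series)


def breaches(series, metric, threshold):
    """Return the sorted bucket start times where `metric` exceeds `threshold`."""
    return sorted(b for (m, b), count in series.items() if m == metric and count > threshold)


events = [
    {"metric": "request_failed", "ts": 120},
    {"metric": "request_failed", "ts": 130},
    {"metric": "request_failed", "ts": 179},
    {"metric": "request_failed", "ts": 185},
    {"metric": "screen_view", "ts": 121},
]
series = aggregate(events)
```

Here `breaches(series, "request_failed", 2)` picks out the one-minute window with more than two failed requests; in a real pipeline the raw events would stay available for drilling down once such an alert fires.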
Today
• Smarter alerting capabilities
• General error/crash rates are misleading
• Ability to alert on big changes within a specific dimension (app version, running experiments, different error types/services)
• Proper green flag system to alert relevant people without a dedicated squad to supervise (“You build it, you run it” model)
• Automated staged rollout progression based on real-time metrics
• Automated review analysis
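To illustrate why general error rates can mislead and why alerting per dimension helps, here is a minimal sketch that compares the error rate of each app version against a baseline. The data, the function names, and the thresholds (`factor`, `min_sessions`) are hypothetical values chosen for this example.

```python
def error_rates(counts):
    """Map each dimension value to errors / sessions.

    `counts` maps a dimension value (here an app version) to a tuple
    (errors, sessions); the structure is assumed for this sketch.
    """
    return {key: errors / sessions for key, (errors, sessions) in counts.items() if sessions > 0}


def flag_outliers(counts, baseline_rate, factor=3.0, min_sessions=100):
    """Return dimension values whose error rate exceeds factor * baseline_rate.

    Values with fewer than `min_sessions` sessions are skipped to avoid
    alerting on noise. Both thresholds are arbitrary illustration values.
    """
    rates = error_rates(counts)
    return sorted(
        key for key, rate in rates.items()
        if counts[key][1] >= min_sessions and rate > factor * baseline_rate
    )


counts = {
    "3.1.0": (50, 10_000),  # 0.5% error rate, most of the traffic
    "3.2.0": (40, 500),     # 8% error rate on a small staged rollout
    "3.2.1": (1, 10),       # too few sessions to judge
}
# Overall rate across all versions, dominated by the healthy old version.
overall = sum(e for e, _ in counts.values()) / sum(s for _, s in counts.values())
```

In this made-up data the overall error rate stays below 1%, while `flag_outliers(counts, baseline_rate=0.005)` singles out version 3.2.0. The same per-version signal could feed a decision on whether a staged rollout should progress or pause.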
Future