Date posted: 08-May-2015
Category: Technology
Upload: devopsdays
Alert workflow in Gaming DevOps
Eduardo Saito
Director of Engineering - Server Operations
GREE International
November 2013
Traditional Alert workflow
NOC
Ops
Dev
SME (Network, DBA,…)
Alert workflow – previous
Ops Dev
Critical
Ops: where’s the runbook for this?
Ops: app bug or system issue?
Ops: who’s the developer of this game? Phone #?
Ops: I can’t find the developer… who’s his manager?
Critical
Non-Critical
Alert workflow 2.0
Ops Dev
Critical
Ops: where’s the runbook for this?
Ops: app bug or system issue?
Ops: who’s the developer of this game? Phone #?
Ops: I can’t find the developer… who’s his manager?
Alert Workflow 3.0 - current
Ops
Dev, Project X, Server
Dev, Project Y, Client, Android
Dev, …
Each alert goes directly to the team that can resolve it!
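The idea of sending each alert straight to the owning team can be sketched as a lookup keyed on project and component. This is only an illustration: the team names, the `ROUTES` table, and the fallback to Ops for unknown keys are assumptions, not details from the slides.

```python
from typing import Dict, Tuple

# Hypothetical routing table: (project, component) -> on-call team.
# The keys mirror the slide's examples; the team names are invented.
ROUTES: Dict[Tuple[str, str], str] = {
    ("project-x", "server"): "dev-project-x-server",
    ("project-y", "android"): "dev-project-y-android",
}

# Assumption: anything without an explicit owner falls back to Ops.
DEFAULT_TEAM = "ops"


def route_alert(project: str, component: str) -> str:
    """Return the team that should receive an alert for this project/component."""
    key = (project.lower(), component.lower())
    return ROUTES.get(key, DEFAULT_TEAM)
```

With a table like this, adding a new game or platform means adding one entry rather than teaching Ops whom to call.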
Alerts go to the person who can resolve them
Type          Scope                                 Checked by        Who to page?
ELB           Load balancer health-check            ELB               No one – email alert only
System-level  Check CPU / disk / memory / network   Pingdom / Nagios  Ops team
App-level     Application issues / bugs             Pingdom           Dev and Ops teams
App-level alerts can be triggered by issues in:
• Server-side
• Client-side
  • iOS
  • Android
Dev and Ops are responsible
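The paging rules from the table (ELB: email only, system-level: Ops, app-level: Dev and Ops) could be written down as a small policy map. This is a minimal sketch; the `AlertType` enum, `PAGE_POLICY`, and the function names are hypothetical and do not reflect how Pingdom or Nagios are actually configured.

```python
from enum import Enum
from typing import Dict, List


class AlertType(Enum):
    ELB = "elb"        # load balancer health-check
    SYSTEM = "system"  # CPU / disk / memory / network
    APP = "app"        # application issues / bugs


# Who gets paged for each alert type, following the table above.
# An empty list means nobody is paged and the alert is email only.
PAGE_POLICY: Dict[AlertType, List[str]] = {
    AlertType.ELB: [],
    AlertType.SYSTEM: ["ops"],
    AlertType.APP: ["dev", "ops"],
}


def teams_to_page(alert_type: AlertType) -> List[str]:
    """Return a copy of the list of teams to page for this alert type."""
    return list(PAGE_POLICY[alert_type])


def is_email_only(alert_type: AlertType) -> bool:
    """True when the alert should be emailed without paging anyone."""
    return not PAGE_POLICY[alert_type]
```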
Team       On-call
Ops        8
Dev        32, from 20 games (server-side or client-side, Android or iOS)
Analytics  5
Big display dashboard = quick status
IM Bot = better communication
The Skype bot announces in the game channel that an alert was triggered
Ops and Dev receive the alert and troubleshoot
The Skype bot detects that the issue is resolved and sends an all-clear
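The bot behaviour described here (announce once when an alert fires, announce again when it clears) comes down to tracking state transitions per check. The sketch below shows only that logic; the `AlertNotifier` class, the message wording, and the channel names are assumptions, and delivering the message to Skype is deliberately left out.

```python
from typing import Dict, Optional


class AlertNotifier:
    """Tracks which checks are failing and emits a message only on a state change."""

    def __init__(self) -> None:
        # check name -> channel where the alert was announced
        self._active: Dict[str, str] = {}

    def update(self, check: str, channel: str, is_failing: bool) -> Optional[str]:
        """Return the message to post in the channel, or None if nothing changed."""
        was_failing = check in self._active
        if is_failing and not was_failing:
            self._active[check] = channel
            return f"[{channel}] ALERT: {check} triggered"
        if not is_failing and was_failing:
            del self._active[check]
            return f"[{channel}] ALL CLEAR: {check} resolved"
        return None  # still failing or still healthy: stay quiet
```

Posting only on transitions keeps the game channel readable: repeated failures of the same check do not flood it, and the all-clear is sent exactly once.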