Date post: | 22-Jan-2018 |
Category: | Technology |
Upload: | fwdays |
Beyond the code. Keep your site healthy and users satisfied
Aleksandr Makhomet
Upwork
https://www.facebook.com/amahomet
http://twitter.com/amahomet
What is Upwork.com
• Formerly odesk.com
• Upwork has 12+ million registered freelancers and 5+ million registered clients. Three million jobs are posted annually, worth a total of $1+ billion USD, making it the world's largest freelancer marketplace.
• High load (Alexa rank 420). Microservice architecture
➤
What I’m talking about
User Experience is extremely important
Things that matter:
➔ Low error level
➔ High performance
➔ High site availability (no outages)
➤
Apdex
Apdex (Application Performance Index)
[0 - 1]
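A minimal sketch of how the index is usually computed, assuming the common Apdex convention: requests at or under a target time T count as satisfied, requests up to 4T count as tolerating at half weight, and the rest count as frustrated. The sample timings and the 500 ms target are illustrative assumptions, not figures from the talk.

```python
def apdex(response_times_ms, threshold_ms):
    """Apdex = (satisfied + tolerating / 2) / total samples."""
    if not response_times_ms:
        raise ValueError("need at least one sample")
    satisfied = sum(1 for t in response_times_ms if t <= threshold_ms)
    tolerating = sum(
        1 for t in response_times_ms if threshold_ms < t <= 4 * threshold_ms
    )
    return (satisfied + tolerating / 2) / len(response_times_ms)


# Hypothetical response times in milliseconds.
samples = [120, 300, 450, 800, 2500]
score = apdex(samples, threshold_ms=500)
print(score)  # 3 satisfied, 1 tolerating, 1 frustrated -> 0.7
```

A score of 1 means every request was satisfied; 0 means every request was frustrated.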
➤
Importance of DevOps culture
➤
What breaks production? New Features!
Importance of DevOps culture
DevOps (Developers + Operations) is a culture that emphasizes cooperation between software developers and other information-technology (IT) professionals while automating the process of software delivery. It aims to establish a culture and environment where building, testing, and releasing software can happen rapidly, frequently, and more reliably.
Leads to: faster time to market, lower failure rate of new releases, shortened lead time between fixes, and faster mean time to recovery.
➤
Managing errors on production
Errare humanum est (to err is human)
➤
Managing errors on production
Effective logs are important
➔ Follow PSR-3
➔ Write as many logs as possible
➔ Write full logs (user id, visitor id, stack trace, request details, controller/action, instance info ...)
➔ Write a request ID
➔ Use meaningful log messages
➔ Do not write sensitive data
➤
Logging tools are important (ELK)
Effective tools are important
ELK = Logstash + ElasticSearch + Kibana
➤
➔ Logstash - collects, filters, and ships logs for storage
➔ ElasticSearch - powerful full-text search on top of Apache Lucene
➔ Kibana - UI for searching logs
Write logs in json format
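A minimal sketch of JSON logging with Python's standard `logging` module, since one JSON object per line is easy for a log shipper to parse. The field names (`request_id`, `user_id`, `action`) and the logger name are illustrative assumptions; the talk itself refers to PHP and PSR-3.

```python
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Hypothetical context fields, supplied through `extra=`.
            "request_id": getattr(record, "request_id", None),
            "user_id": getattr(record, "user_id", None),
            "action": getattr(record, "action", None),
        }
        if record.exc_info:
            payload["stack_trace"] = self.formatException(record.exc_info)
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("fwdays-demo")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# One request ID per incoming request ties all of its log lines together.
request_id = str(uuid.uuid4())
logger.info(
    "Job search returned no results",
    extra={"request_id": request_id, "user_id": 42, "action": "search/index"},
)
```

Passwords, tokens, and similar sensitive values should never be placed in `extra`.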
Demo
Logging: ELK Alternatives
➔ Graylog
➔ Loggly
➔ Papertrail
➤
Error level
Monitor your error level
➔ Graphite
➔ Google Analytics
➤
Performance
➔ Measure
➤
➔ Group by controller/action or pageId
➔ Measure in detail
◆ Any external service, database, Memcache, Redis, whatever
◆ Any important component, like navigation
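One way to measure in detail is to wrap each component in a timer and record the duration under a name that combines the page and the component. The sketch below assumes an in-memory store and hypothetical names (`jobs.search`, `database`, `redis`); a real setup would forward the values to a metrics backend instead.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Metric name -> list of durations in milliseconds (in-memory stand-in).
timings = defaultdict(list)


@contextmanager
def measure(page, component):
    """Time a block and record it under '<page>.<component>'."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        timings[f"{page}.{component}"].append(elapsed_ms)


with measure("jobs.search", "total"):
    with measure("jobs.search", "database"):
        time.sleep(0.01)  # stand-in for a real query
    with measure("jobs.search", "redis"):
        time.sleep(0.005)  # stand-in for a cache lookup

for name, values in sorted(timings.items()):
    print(name, [round(v, 1) for v in values])
```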
Performance: Statistic
Mean can lie
10 requests dataset, in ms (2, 3, 5, 6, 6, 7, 9, 9, 26, 37)
Mean = (2+3+5+6+6+7+9+9+26+37) / 10 = 11 ms
➤
Median = (6+7) / 2 = 6.5 ms
90th percentile dataset (2, 3, 5, 6, 6, 7, 9, 9, 26)
Mean_90 = mean (90th percentile dataset) = 73 / 9 ≈ 8.1 ms
Upper_90 = max (90th percentile dataset) = 26 ms
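The figures can be recomputed from the ten-request dataset with a few lines of Python. This is a sketch that assumes the 90th percentile dataset is simply the lowest 90% of the sorted values, so nine values whose sum is 73; the helper name `percentile_subset` is hypothetical.

```python
from statistics import mean, median

timings_ms = [2, 3, 5, 6, 6, 7, 9, 9, 26, 37]


def percentile_subset(values, percent):
    """Return the lowest `percent`% of the values, sorted ascending."""
    ordered = sorted(values)
    keep = max(1, round(len(ordered) * percent / 100))
    return ordered[:keep]


subset_90 = percentile_subset(timings_ms, 90)

print(mean(timings_ms))             # 11
print(median(timings_ms))           # 6.5
print(round(mean(subset_90), 1))    # 73 / 9 -> 8.1
print(max(subset_90))               # 26
```

The two slowest requests pull the mean up to 11 ms even though most requests finish in under 10 ms, which is why the median and percentile figures are worth tracking alongside it.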
Performance: Graphite stack
➤
Performance: Graphite
Graphite collects, stores, and displays time-series data in real time.
➔ Carbon - a high-performance service that listens for time-series data
➔ Whisper - a simple database library for storing time-series data
➔ Graphite-web - Graphite's user interface & API for rendering graphs and dashboards
Metric format:
<metric path> <metric value> <metric timestamp>
fwdays-demo.performance.pages.index 1 5098232342
Data retention:
retentions = 10:6h,60:14d,600:400d
➤
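A short sketch of both pieces: building a metric line in the format above, and reading the retention string as pairs of seconds per point and time kept. The host and port in `send_metric` are assumptions (2003 is Carbon's usual plaintext port), and the parser handles only the s, m, h, and d units.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Build one '<path> <value> <timestamp>' line, newline terminated."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {timestamp}\n"


def send_metric(line, host="localhost", port=2003):
    """Send one line to Carbon over TCP (host and port are assumptions)."""
    with socket.create_connection((host, port), timeout=2) as sock:
        sock.sendall(line.encode("ascii"))


UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def to_seconds(token):
    """'6h' -> 21600; a bare number is taken as seconds."""
    if token.isdigit():
        return int(token)
    return int(token[:-1]) * UNIT_SECONDS[token[-1]]


def parse_retentions(spec):
    """Turn '10:6h,60:14d' into (seconds_per_point, points_kept) pairs."""
    archives = []
    for archive in spec.split(","):
        precision, history = archive.strip().split(":")
        step = to_seconds(precision)
        # A history without a unit is already a number of points.
        points = int(history) if history.isdigit() else to_seconds(history) // step
        archives.append((step, points))
    return archives


print(format_metric("fwdays-demo.performance.pages.index", 1, 5098232342), end="")
print(parse_retentions("10:6h,60:14d,600:400d"))
# [(10, 2160), (60, 20160), (600, 57600)]
```

Read this way, the retention line keeps 10-second points for 6 hours, 1-minute points for 14 days, and 10-minute points for 400 days.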
Performance: Graphite vs StatsD
Graphite works better with StatsD
StatsD is a forwarder to Graphite
➔ Non-blocking UDP protocol
➔ Aggregates data, high performance
➔ Supports 4 useful metric types: counters, timers, gauges, sets
To integrate, write your own simple script or use any popular open-source client
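A "simple script" can be very small, because a StatsD datagram is just a text line of the form `<name>:<value>|<type>`. The sketch below covers the four metric types; the metric names and the host and port (8125 is the usual StatsD port) are assumptions.

```python
import socket

# StatsD type suffixes for the four metric types.
TYPES = {"counter": "c", "timer": "ms", "gauge": "g", "set": "s"}


def statsd_line(name, value, metric_type):
    """Format one StatsD datagram: '<name>:<value>|<type>'."""
    return f"{name}:{value}|{TYPES[metric_type]}"


def send(line, host="127.0.0.1", port=8125):
    """Fire-and-forget over UDP; the caller never waits for a reply."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(line.encode("ascii"), (host, port))
    finally:
        sock.close()


# Hypothetical metric names.
print(statsd_line("fwdays-demo.pages.index.hits", 1, "counter"))
print(statsd_line("fwdays-demo.pages.index.time", 87, "timer"))
print(statsd_line("fwdays-demo.queue.size", 12, "gauge"))
print(statsd_line("fwdays-demo.visitors", "user-42", "set"))
```

Calling `send(statsd_line(...))` from request-handling code adds almost no latency, since UDP does not wait for an acknowledgement.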
➤
Performance: Graphite graphs
You can combine, modify, and filter data to get the graph you need
➤
Demo
Performance: prevent degradation
➤
➔ Make a performance degradation check part of your definition of done
➔ Add performance degradation check to your code review checklist
➔ Use load testing
Performance: Alternatives
➤
Google Analytics
➔ Keeps history for a long time
➔ Segments are great: get performance for different types of users
New Relic
➔ Powerful performance analytics out of the box
➔ Sometimes uses magic
➔ Has a free light account with 1-day data retention
Demo
Performance: Zipkin
Zipkin is a distributed tracing system. It helps gather the timing data needed to troubleshoot latency problems in microservice architectures.
➤
Alerting
Set up at least a simple health check
➤
Application metrics
➔ 5xx / 4xx / 3xx / 2xx rate
➔ Error rate
➔ Response time
➔ Apdex
Server metrics
➔ CPU usage
➔ Load average
➔ Memory usage
➔ Disk space
➔ Disk I/O
➔ Network I/O
Notification channels
➔ Chat
➔ Email
➔ SMS/push
➔ Phone call
Thresholds
➔ Warning
➔ Critical
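A minimal sketch of how two thresholds and several channels can fit together: classify the metric as ok, warning, or critical, then pick louder channels for the more severe level. The threshold values and the routing table are hypothetical, not the speaker's configuration.

```python
def alert_level(value, warning, critical):
    """Classify a metric where higher values are worse."""
    if value >= critical:
        return "critical"
    if value >= warning:
        return "warning"
    return "ok"


# Hypothetical routing: more intrusive channels for more severe levels.
CHANNELS = {
    "ok": [],
    "warning": ["chat", "email"],
    "critical": ["chat", "sms", "phone"],
}

for error_rate in (0.2, 1.5, 7.0):  # percent of requests returning 5xx
    level = alert_level(error_rate, warning=1.0, critical=5.0)
    print(error_rate, level, CHANNELS[level])
```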
Alerting: Best Practices
➤
➔ Avoid setting thresholds too low. Avoid false positives
➔ Adjust your conditions over time
Alerting: Implementations
On top of Graphite
➔ A number of free tools (e.g. Cabot)
New Relic
➔ Advanced in paid version
➔ Basic in free version
CloudWatch (if you are on Amazon)
Zabbix / Nagios / Icinga
➤
Alerting: PagerDuty
➤
PagerDuty is an alarm aggregation and dispatching service for system administrators and support teams. It collects alerts from your monitoring tools, gives you an overall view of all of your monitoring alarms, and alerts an on duty engineer if there’s a problem.
Demo
Incidents
An incident is a critical violation
➤
➔ Create an #incident channel
➔ Define an incident escalation policy
◆ Define a person who can make decisions
◆ Define a duty officer
◆ Enable a moratorium on production changes until resolved
➔ Track metrics
◆ MTTR - Mean time to resolve
◆ MTTD - Mean time to detect
◆ MTTE - Mean time to escalate
◆ MTBF - Mean time between failures
➔ Do postmortems
➔ Have visibility into production changes, especially with microservices
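The tracked metrics are plain averages over an incident log. The sketch below uses two made-up incidents and assumes that detection and resolution are both measured from the start of the incident, and that time between failures runs from one resolution to the next start; teams define these intervals differently.

```python
from datetime import datetime, timedelta

# Hypothetical incident log, ordered by start time.
incidents = [
    {
        "started": datetime(2018, 1, 3, 10, 0),
        "detected": datetime(2018, 1, 3, 10, 5),
        "resolved": datetime(2018, 1, 3, 11, 0),
    },
    {
        "started": datetime(2018, 1, 17, 22, 0),
        "detected": datetime(2018, 1, 17, 22, 15),
        "resolved": datetime(2018, 1, 17, 22, 40),
    },
]


def mean_delta(deltas):
    """Average a sequence of timedelta values."""
    deltas = list(deltas)
    return sum(deltas, timedelta()) / len(deltas)


mttd = mean_delta(i["detected"] - i["started"] for i in incidents)
mttr = mean_delta(i["resolved"] - i["started"] for i in incidents)
mtbf = mean_delta(
    later["started"] - earlier["resolved"]
    for earlier, later in zip(incidents, incidents[1:])
)

print("Mean time to detect:", mttd)            # 0:10:00
print("Mean time to resolve:", mttr)           # 0:50:00
print("Mean time between failures:", mtbf)     # 14 days, 11:00:00
```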
Postmortems
➤
Before
➔ What other parts of the site might have similar issues?
➔ How can we determine the root cause faster?
➔ How can we prevent it in the future?
➔ Lessons learned
During
➔ Do not offend
➔ Do not feel offended
After
➔ Create a document with answers and share it
➔ File issues
Few more tools (Homework)
➔ Prometheus
➔ Sentry
➔ Pinba
➔ Sensu
➔ DataDog
➤
The End
Questions?
Aleksandr Makhomet
https://www.facebook.com/amahomet
http://twitter.com/amahomet
http://fwdays.com
http://ergo.place
Upwork is hiring. If you are looking for a remote senior PHP developer position, ping me.