Date posted: 21-Jan-2018
Category: Technology
Uploaded by: vividcortex
@xaprb
Logistics
● I’m Baron Schwartz: @xaprb or [email protected]
● I will post the slides from this talk
● This is a follow-on to What Should I Monitor And How Should I Do It
○ https://youtu.be/zLjhFrUhqxg
What’s The Goal?
Assumption: you’re building and operating a service.
You want to instrument it so you can build and operate it better.
You want observability.
● In the present
● In the past
● In the future? (Predictability)
Observability is how well an external observer can infer a system’s internal state.
What Should I Observe?
There’s a lot to measure in a complex system. What’s important?
● It’s more important to observe the work than the service itself.
● But it’s important to observe how the service responds to the workload.
Some Convenient Blueprints
Brendan Gregg’s USE Method
● Utilization, Saturation, Errors
● http://www.brendangregg.com/usemethod.html
Tom Wilkie’s RED Method
● Measure request {Rate, Errors, Duration}
● https://www.slideshare.net/weaveworks/interactive-monitoring-for-kubernetes
The SRE Book’s 4 Golden Signals
● Latency, traffic, errors, and saturation
● https://landing.google.com/sre/book/chapters/monitoring-distributed-systems.html
Some Formal Laws
Queueing Theory
● Utilization, arrival rate, throughput, latency
Little’s Law
● Concurrency, latency, throughput
Universal Scalability Law
● Throughput, concurrency
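Little’s Law in particular can be applied with back-of-the-envelope arithmetic: average concurrency = throughput × residence time. A tiny Go sketch (function name is illustrative):

```go
package main

import "fmt"

// Little's Law: average concurrency N = throughput X * residence time R.
// Given any two of the three quantities, the third follows.
func concurrency(throughputPerSec, latencySec float64) float64 {
	return throughputPerSec * latencySec
}

func main() {
	// A service handling 500 requests/second at 40 ms mean latency
	// carries, on average, 500 * 0.040 = 20 requests in flight.
	fmt.Println(concurrency(500, 0.040)) // 20
}
```

This is why measuring any two of throughput, latency, and concurrency lets you sanity-check the third.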
The Zen of Performance
The unifying concept in observing a service is two perspectives on requests.
External (customer’s) view:
● Request (singular), and its latency and success.
Internal (operator’s) view:
● Requests (plural, population), and their latency distribution, rates, and concurrency.
● System resources/components and their throughput, utilization, and backlog.
Much Confusion Comes From One-Sided Views
Many people, when asked if a service is working well, will look at the service for problems.
But you can only answer that question by looking at the service’s work. From that, you may need to examine the service to see why it isn’t working well.
Both are necessary. You need instrumentation that enables both perspectives.
Metrics That Matter
All of the metrics in all of the methods & laws mentioned are important.
● Throughput, concurrency, latency, utilization, backlog/load/saturation, rates
All of them are time-related: either measured at a point in time or over a duration.
● Time is the zeroth performance metric (perfdynamics.com).
Your Service Must Provide These Data
If your service is to be observable, it needs to be possible to observe these things.
● You can provide the data directly, by instrumenting your service.
● An instrumented system (e.g. OS) can implicitly offer a framework.
● Or you can use a framework to build your service (e.g. Coda’s Metrics).
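As one example of direct instrumentation, Go’s standard-library expvar package gives a service self-reported counters nearly for free; a hypothetical sketch (the counter names are mine):

```go
package main

import (
	"expvar"
	"fmt"
	"net/http"
)

// Counters published via the stdlib expvar package. Any process that
// imports expvar and serves HTTP exposes them as JSON at /debug/vars.
var (
	requestsTotal = expvar.NewInt("requests_total")
	errorsTotal   = expvar.NewInt("errors_total")
)

func handler(w http.ResponseWriter, r *http.Request) {
	requestsTotal.Add(1)
	fmt.Fprintln(w, "ok")
}

func main() {
	http.Handle("/work", http.HandlerFunc(handler))
	// In a real service, serving HTTP exposes /debug/vars automatically:
	// http.ListenAndServe(":8080", nil)
	requestsTotal.Add(1) // simulate one handled request
	fmt.Println(requestsTotal.Value()) // 1
}
```

A scraper can then poll /debug/vars at regular intervals to turn these point-in-time values into time series.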
Service and Component Instrumentation
It’s not enough to just instrument your service’s input and output.
● You need internal components and subsystems to be observable too.
● Common examples: buffers, queues, locks, mutexes, persistence.
It’s easy to see that a clear architecture can help.
● Are subsystems loosely coupled and cohesive, with clear boundaries?
● Are they well defined?
● Can you draw an architecture/block diagram of them? (c.f. Brendan Gregg)
Metrics on components rarely help much, beyond the basics.
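One of those basic component metrics that is cheap and genuinely useful is queue depth and utilization, i.e. the backlog/saturation signals from the methods above. A Go sketch using a buffered channel as the internal queue (illustrative names, not from any library):

```go
package main

import "fmt"

// A buffered channel is an internal queue; its depth and utilization are
// exactly the backlog/saturation signals worth emitting as gauges.
func queueStats(q chan int) (depth, capacity int, utilization float64) {
	depth, capacity = len(q), cap(q)
	return depth, capacity, float64(depth) / float64(capacity)
}

func main() {
	jobs := make(chan int, 8)
	jobs <- 1
	jobs <- 2
	d, c, u := queueStats(jobs)
	fmt.Println(d, c, u) // 2 8 0.25
}
```

Sampled periodically, sustained utilization near 1.0 is an early saturation warning before requests start failing.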
The Process List Is Golden
Focus more on requests/work than components. This is a well-trodden path. Every mature request-oriented service has a process table/list.
● UNIX: process table, visible with `ps`
● Apache: ServerStatus
● MySQL: SHOW PROCESSLIST
● PostgreSQL: pg_stat_activity
● MongoDB: db.currentOp()
A process table tracks the existence and state of every process/worker in the system, and tasks/requests that it is executing.
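A minimal in-memory process list might look like this hypothetical Go sketch (real ones, like SHOW PROCESSLIST, track far more state per row):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Entry is one row of a hypothetical process list: what the request is,
// what state it is in, and when it started.
type Entry struct {
	ID      int64
	Request string    // e.g. SQL text or verb+URL
	State   string    // e.g. "working", "waiting on lock"
	Started time.Time
}

// ProcessList tracks every in-flight request.
type ProcessList struct {
	mu     sync.Mutex
	nextID int64
	live   map[int64]*Entry
}

func NewProcessList() *ProcessList {
	return &ProcessList{live: make(map[int64]*Entry)}
}

// Begin registers a new in-flight request and returns its ID.
func (p *ProcessList) Begin(request string) int64 {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.nextID++
	p.live[p.nextID] = &Entry{
		ID: p.nextID, Request: request,
		State: "working", Started: time.Now(),
	}
	return p.nextID
}

// End removes a completed request from the list.
func (p *ProcessList) End(id int64) {
	p.mu.Lock()
	defer p.mu.Unlock()
	delete(p.live, id)
}

// Len reports how many requests are currently in flight.
func (p *ProcessList) Len() int {
	p.mu.Lock()
	defer p.mu.Unlock()
	return len(p.live)
}

func main() {
	pl := NewProcessList()
	id := pl.Begin("GET /api/v1/users")
	fmt.Println(pl.Len()) // 1
	pl.End(id)
	fmt.Println(pl.Len()) // 0
}
```

Note that Len() here is exactly the concurrency term in Little’s Law, measured directly.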
Common Attributes Of Process Tables
Request itself
● E.g. SQL text, commandline+args, verb+url+qparams
● Parent request/stage/span, if possible
State of request
● At a minimum: working or waiting (where? func/module/mutex…)
● Ideally: stages/states of execution (parsing, planning, checking auth…)
Timings
● Timestamp of start; ideally timestamps of state changes too
One Example
At VividCortex, we built github.com/VividCortex/pm for API/service processlists.
● It’s for #golang
● HTTP and web browser interface
● See every request in-flight
● Kill requests
● Check request state and timings
This provides observability “now.” But not historical observability.
Extending Observability To Historical Views
“Current state” observability is the foundation of historical views. The process list can be the foundation of request history and metrics.
For requests:
● Log every state transition/change a request makes.
● Emit metrics on aggregates at these points, or at regular intervals.
○ See previous slides for which metrics to emit!
● Capture traces of requests for distributed tracing.
For components:
● Emit metrics from each component at regular intervals (ditto on prev. slides).
Logging, Metrics, Traces
Peter Bourgon drew a diagram that helps illustrate some concepts.
https://peter.bourgon.org/blog/2017/02/21/metrics-tracing-and-logging.html
Logging
What should you log? I tend to agree with Dave Cheney:
I believe that there are only two things you should log:
1. Things that developers care about when they are developing or debugging software.
2. Things that users care about when using your software.
Obviously these are debug and info levels, respectively.
https://dave.cheney.net/2015/11/05/lets-talk-about-logging
Logging and Traces
I am not a fan of “sampling” the way it’s commonly done.
● It’s a euphemism for “let’s ignore most things.”
● Every request should be measured.
It’s typically implemented in terribly biased ways that cause all kinds of problems (e.g. “slow” query logs ignore fast-but-frequent requests).
● I prefer keeping representative samples of raw data.
● But not ignoring/dropping the rest: at least aggregating it into metrics.
Representative Sampling Is Possible To Do
https://www.vividcortex.com/resources/sampling-a-stream-with-probabilistic-sketch
Observability Culture
Observability is more than a Silicon Valley buzzword. It’s a culture, like DevOps.
How can you build a culture of observability?
● You get what you incentivize: incentivize data and metrics, and you’ll get them.
● Prioritize the end, not the means.
● Understand the difference between culture and the visible artifacts of culture.
Many a company has tried to imitate Netflix or Etsy and gotten different results.
● See McFunley’s talk, for example: http://pushtrain.club/
What Should You Reward?
● Clarity and intentionality; purposefulness
● Empathy
● Shared ownership and responsibility
● Attendance at DevOpsDays
What should you think twice about rewarding?
● Metrics/data/graphs, in a vacuum for their own sake
● Keep in mind Etsy’s “if it moves, graph it” slogan is a means, not an end
○ https://codeascraft.com/2011/02/15/measure-anything-measure-everything/
Parting Thoughts
I’m a fan of defining the problem before working on the solution.
● Clarity of purpose tends to influence decisions for the better.
● An explicit goal of observability and intelligibility tends to improve operability.
● A clear understanding of performance focuses attention on KPIs, not vanity metrics.
Some further thoughts at https://www.vividcortex.com/resources/architecting-highly-monitorable-apps