Monitoring and observability

Post on 15-Jan-2015

description

In this session we’ll treat the need for performance as a foregone conclusion and take a whirlwind tour through the complexity of modern Internet architectures. That complexity leads to evil optimization problems and makes it significantly harder to troubleshoot production issues to a speedy and successful end. Starting with the simple facts that you can’t fix what you can’t see and you can’t improve what you can’t measure, we’ll discuss what needs monitoring and why. We’ll talk about unlikely allies in the fight for time and budget to instrument systems, applications and processes for observability. You’ll leave the session with a better understanding of what it looks like to troubleshoot the storm of a malfunctioning large architecture and some tools and techniques you can use to not be swallowed by the Kraken.

transcript

Monitoring and Observability in Complex Architectures

Tuesday, October 2, 12

Hi! I’m @postwait

I founded @OmniTI and @MessageSystems and @Circonus

I am very active in @TheOfficialACM, participating in @ACMQueue and the practitioners board.

I (regrettably) build complex systems.

Why we are here

We’re here to talk about coping with breakage.

Rule #1

Direct observation of failure leads to quicker rectification.

Rule #2

You cannot correct what you cannot measure.

Solution Approach #1

Debugging failures requires either visibility into the precipitating state...

Precipitating State

Single threaded applications

✓ Easy

Precipitating State

Multi-threaded applications

✓ Challenging

Precipitating State

Distributed applications

here there be dragons

Solution Approach #2

...or direct observation of a (and likely very many) failing transaction.

Direct Observation

Observing something fail... is priceless.

Direct Observation

Observation leads to intelligent questioning.

Direct Observation

Questioning leads to answers... but only through more observation.

and herein lies the rub.

Leaning Towards Scientific Process

In production you don’t have
• repeatability
• control groups
• external verification

... or do you?

What’s monitoring got to do with it?

Monitoring is all about the passive observation of telemetry data.

Monitoring Telemetry

cannot pinpoint problems

can provide evidence of the existence of a problem

Monitoring

Gives you evidence that there is a problem

Monitoring

Gives you evidence that you have fixed a problem (or at least the symptoms)

Monitoring Tactically

If it could be of interest, measure it and expose the measurement.
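
As a minimal sketch of "measure it and expose the measurement", the snippet below times a block of code and emits the result as a plain-text line in the statsd style (`name:value|type`) over UDP. The metric names, the `timed` helper and the default host and port are illustrative assumptions, not something taken from the talk.

```python
import socket
import time
from contextlib import contextmanager


def statsd_line(name, value, kind):
    """Format one metric in the statsd text style: <name>:<value>|<type>."""
    return f"{name}:{value}|{kind}"


def emit(line, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send (host and port are assumed defaults)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


@contextmanager
def timed(name, send=emit):
    """Measure the wall-clock time of a block and expose it as a timer metric."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = round((time.perf_counter() - start) * 1000.0, 3)
        send(statsd_line(name, elapsed_ms, "ms"))


if __name__ == "__main__":
    emit(statsd_line("api.requests", 1, "c"))  # hypothetical counter name
    with timed("api.login"):                   # hypothetical timer name
        time.sleep(0.01)
```

Passing a different `send` callable (for example `print`) is an easy way to see what would be exposed without a collector running.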

Monitoring: embedded

statsd: https://github.com/etsy/statsd

resmon: http://labs.omniti.com/labs/resmon

metrics: https://github.com/codahale/metrics

folsom: https://github.com/boundary/folsom

metrics.js: https://github.com/mikejihbe/metrics

metrics-net: https://github.com/danielcrenna/metrics-net

Monitoring: collection

reconnoiter: http://labs.omniti.com/labs/reconnoiter

graphite: http://graphite.wikidot.com/

OpenTSDB: http://opentsdb.net/

circonus: http://circonus.com/

librato: https://metrics.librato.com/

Monitoring: Bling

visualizing an architecture rollout

Monitoring: Bling

visualizing the impact on service times

average API service time latency

actual API service time latency

http://www.slideshare.net/postwait/atldevops
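
The contrast between an average and the actual latency distribution can be illustrated with a few lines of Python. The sample below is made up for illustration (90 fast requests and 10 slow ones); the `percentile` and `histogram` helpers are hypothetical names, and the percentile uses the simple nearest-rank method.

```python
import statistics
from collections import Counter


def percentile(samples, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(samples)
    rank = max(1, (p * len(ordered) + 99) // 100)  # integer ceiling of p% of n
    return ordered[rank - 1]


def histogram(samples, width):
    """Count samples per bucket of `width` ms, keyed by bucket lower bound."""
    counts = Counter((s // width) * width for s in samples)
    return dict(sorted(counts.items()))


# Made-up data: most requests are fast, a tenth of them are very slow.
latencies_ms = [10] * 90 + [500] * 10

mean_ms = statistics.mean(latencies_ms)      # 59: matches no real request
median_ms = percentile(latencies_ms, 50)     # 10
p99_ms = percentile(latencies_ms, 99)        # 500
buckets = histogram(latencies_ms, 100)       # {0: 90, 500: 10}
```

The mean of 59 ms describes none of the requests, while the histogram shows both populations at once.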

Repeatability is a Pipe Dream

Your production problem is a (hopefully pathological) outcome of circumstance.

A circumstance which often cannot be repeated.

Control Groups

Control groups can compensate for the inability to precisely repeat an experiment.

Control Groups

Most architectures have redundancy.

Control Groups

With the right design, you can turn that redundancy into a debugging environment.

[1] http://omniti.com/surge/2012/sessions/xtreme-deployment

Control Groups: Simple Example

I have 10 web servers
I fix 1
I verify 1 is fixed
I verify 9 are still broken
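
The ten-server example can be sketched as a comparison of error rates between the fixed host and the untouched control group. Host names, request counts and the 1% threshold below are invented for illustration; in practice the numbers would come from whatever telemetry is already collected.

```python
def error_rate(hosts, stats):
    """Aggregate error rate over a group of hosts; stats maps host -> (requests, errors)."""
    requests = sum(stats[h][0] for h in hosts)
    errors = sum(stats[h][1] for h in hosts)
    return errors / requests if requests else 0.0


def fix_is_confirmed(stats, treated, threshold):
    """True only if the treated hosts are healthy AND the control hosts are still broken."""
    control = set(stats) - set(treated)
    treated_rate = error_rate(treated, stats)
    control_rate = error_rate(control, stats)
    return treated_rate < threshold <= control_rate


# Made-up numbers: web01 received the fix, web02..web10 did not.
stats = {"web01": (1000, 1)}
stats.update({f"web{i:02d}": (1000, 50) for i in range(2, 11)})

confirmed = fix_is_confirmed(stats, {"web01"}, threshold=0.01)  # True
```

Requiring the control group to stay broken is the point of the exercise: if all ten servers recover at once, the fix on one of them proves nothing.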

Control Groups: Seems Easy

Web servers tend to be:
• homogeneous
• share-(nothing|little)
• independent

Control Groups: Not So Easy

Most other services aren’t so homogeneous and equal: databases, batch processes (think billing), orchestration middleware, message queues

Observability

Some might claim that seeing telemetry data is observation...

It is doubly indirect at best.

Observability

I want to directly see the errant behaviour

Observability is forgiving

In complex, multi-component architectures, errors can be observed as errant behaviour in many junction points.

Observing the network

tcpdump / snoop
wireshark

Observing the network

Looking at just the arrival of new connections

tcpdump -nnq -tttt -s 384 'tcp port 80 and (tcp[13] & (2|16) == 2)'
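
The filter expression reads byte 13 of the TCP header, which holds the flag bits, and keeps packets where SYN is set and ACK is clear: the first packet of a new connection. A small Python sketch of the same bit test follows; the function name is hypothetical.

```python
# TCP flag bits as they appear in byte 13 of the TCP header.
TCP_FIN = 0x01
TCP_SYN = 0x02   # the "2" in the filter
TCP_RST = 0x04
TCP_PSH = 0x08
TCP_ACK = 0x10   # the "16" in the filter


def is_new_connection(flags_byte):
    """Same test as (tcp[13] & (2|16)) == 2: SYN set, ACK not set."""
    return (flags_byte & (TCP_SYN | TCP_ACK)) == TCP_SYN
```

A client's initial SYN passes the test; the server's SYN+ACK reply and every later ACK do not, so each accepted connection shows up exactly once.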

Observing the network

Looking at just the data arrival and departure times

tcpdump -nnq -tt -s 384 'tcp port 80 and (((ip[2:2] - ((ip[0]&0xf)*4)) - ((tcp[12]&0xf0)/4)) != 0)'

snoop -rq -ta -s 384 'tcp port 80 and (((ip[2:2] - ((ip[0]&0xf)*4)) - ((tcp[12]&0xf0)/4)) != 0)'
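
The arithmetic inside the filter computes the TCP payload length: the IP total length, minus the IP header length, minus the TCP header length. Packets with a non-zero result carry data; bare ACKs and handshake packets do not. The sketch below performs the same calculation on raw bytes, assuming the buffer starts at an IPv4 header; the function name is hypothetical.

```python
import struct


def tcp_payload_length(packet):
    """Payload bytes in an IPv4/TCP packet; `packet` starts at the IP header."""
    total_length = struct.unpack("!H", packet[2:4])[0]         # ip[2:2]
    ip_header_len = (packet[0] & 0x0F) * 4                     # (ip[0] & 0xf) * 4
    tcp_header_len = (packet[ip_header_len + 12] & 0xF0) // 4  # (tcp[12] & 0xf0) / 4
    return total_length - ip_header_len - tcp_header_len
```

Dividing the masked byte by 4 works because the data offset sits in the upper four bits and counts 32-bit words: shifting right by 4 and multiplying by 4 is the same as dividing by 4.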

Observing the network

Finding the difference between a client’s question and a server’s answer (tcpdump | awk filter).

{
  gsub(".[0-9]+(: | >)"," \& ");
  gsub("[:=]"," ");
  EP=sprintf("%s%s", ($4==".80")?$6:$3, ($4==".80")?$7:$4);
  if(S[EP] == "C" && $4 == ".80") { printf("%f %s\n", $1 - L[EP], EP); }
  S[EP]= ($4==".80")?"S":"C";
  L[EP]= $1;
}
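
The same bookkeeping can be written out in Python for readability. This is a sketch of the logic only: it assumes packets have already been parsed into `(timestamp, src_ip, src_port, dst_ip, dst_port)` tuples, which is a hypothetical input format, not tcpdump's output.

```python
def response_latencies(events, server_port=80):
    """For each client endpoint, time from its last packet to the server's first reply."""
    last_sender = {}  # client endpoint -> "C" or "S" (S[EP] in the awk filter)
    last_time = {}    # client endpoint -> timestamp of last packet (L[EP])
    results = []
    for ts, src_ip, src_port, dst_ip, dst_port in events:
        from_server = src_port == server_port
        client = (dst_ip, dst_port) if from_server else (src_ip, src_port)
        # A server packet right after a client packet answers the question.
        if from_server and last_sender.get(client) == "C":
            results.append((client, ts - last_time[client]))
        last_sender[client] = "S" if from_server else "C"
        last_time[client] = ts
    return results
```

Only the first server packet after a client packet produces a measurement; follow-up packets of the same response are ignored because the last sender is already the server.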

Observing user-space

strace[1] / truss
gstack / pstack
gcore + gdb / dbx / mdb[2]

[1] http://www.cli.di.unipi.it/~gadducci/SOL-11/Local/referenceCards/LINUX_System_Call_Quick_Reference.pdf
[2] http://hub.opensolaris.org/bin/download/Community+Group+mdb/tips/mdb-cheatsheet.pdf

System call tracing

Watching sshd is a good way to get familiar.

truss -f -p `pgrep sshd`

System call tracing

An active web server is going to be like a firehose.

truss -f -p `pgrep httpd`
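
One way to cope with the firehose is to reduce it to counts per system call before reading individual lines. The sketch below assumes a simplified line shape (an optional `[pid N]` or `N:` prefix followed by `name(`); real strace and truss output varies by platform and options, so the pattern is an assumption to adapt, not a complete parser.

```python
import re
from collections import Counter

# Assumed line shape: optional "[pid 123] " or "123: " prefix, then "name(".
SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s*|\d+:?\s+)?(\w+)\(")


def syscall_counts(lines):
    """Count system call names in trace output, skipping lines that do not match."""
    counts = Counter()
    for line in lines:
        match = SYSCALL_RE.match(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts
```

Feeding it a saved trace and printing `syscall_counts(lines).most_common(10)` gives a quick picture of where the process spends its calls.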

Observing the system

DTrace

Live production demo or GTFO.

Thank You

Questions?
