Monitoring and observability

Monitoring and Observability in Complex Architectures

Tuesday, October 2, 12

Hi! I’m @postwait

I founded @OmniTI and @MessageSystems and @Circonus

Hi! I’m @postwait

I am very active in @TheOfficialACM, participating in @ACMQueue and the practitioners board.

Hi! I’m @postwait

I (regrettably) build complex systems.

Why we are here

We’re here to talk about coping with breakage

Rule #1

Direct observation of failure leads to quicker rectification.

Rule #2

You cannot correct what you cannot measure.

Solution Approach #1

Debugging failures requires either visibility into the precipitating state

Precipitating State

Single threaded applications

✓ Easy

Precipitating State

Multi-threaded applications

✓ Challenging

Precipitating State

Distributed applications

here there be dragons

Solution Approach #2

or direct observation of a (and likely very many) failing transaction

Direct Observation

Observing something fail... is priceless.

Direct Observation

Observation leads to intelligent questioning.

Direct Observation

Questioning leads to answers... but only through more observation.

and herein lies the rub.

Leaning Towards Scientific Process

In production you don’t have
• repeatability
• control groups
• external verification

... or do you?

What’s monitoring got to do with it?

Monitoring is all about the passive observation of telemetry data.

Monitoring Telemetry

cannot pinpoint problems

can provide evidence of the existence of a problem

Monitoring

Gives you evidence that there is a problem

Monitoring

Gives you evidence that you have fixed a problem (or at least the symptoms)

Monitoring Tactically

If it could be of interest, measure it and expose the measurement

Monitoring: embedded

statsd: https://github.com/etsy/statsd

resmon: http://labs.omniti.com/labs/resmon

metrics: https://github.com/codahale/metrics

folsom: https://github.com/boundary/folsom

metrics.js: https://github.com/mikejihbe/metrics

metrics-net: https://github.com/danielcrenna/metrics-net
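
As a rough illustration of "measure it and expose the measurement", the sketch below formats a measurement in the StatsD plain-text form (name:value|type) and sends it over UDP. The metric names are hypothetical, and the host and port (127.0.0.1:8125, the usual StatsD default) are assumptions about where a collector might be listening.

```python
import socket


def format_metric(name: str, value, metric_type: str) -> str:
    """Render one measurement in the StatsD plain-text form name:value|type."""
    return f"{name}:{value}|{metric_type}"


def emit(line: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget a metric line over UDP; no reply is expected."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


# Hypothetical metric names, chosen for illustration only.
counter_line = format_metric("api.requests", 1, "c")        # counter increment
timer_line = format_metric("api.service_time", 320, "ms")   # timing in milliseconds
```

Calling emit(counter_line) from the request path costs one UDP datagram, which is part of why this style of embedded instrumentation is cheap enough to leave on in production.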

Monitoring: collection

reconnoiter: http://labs.omniti.com/labs/reconnoiter

graphite: http://graphite.wikidot.com/

OpenTSDB: http://opentsdb.net/

circonus: http://circonus.com/

librato: https://metrics.librato.com/

Monitoring: Bling

visualizing an architecture rollout

Monitoring: Bling

visualizing the impact on service times

average API service time latency

actual API service time latency

http://www.slideshare.net/postwait/atldevops
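
The gap between average and actual service time can be illustrated with a small calculation. This is only a sketch on made-up numbers, not data from the talk: a handful of very slow requests barely moves the median, drags the average to a value no request actually experienced, and only shows up plainly in a high percentile.

```python
from statistics import mean


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, -(-p * len(ordered) // 100))  # ceil(p * n / 100) using integer arithmetic
    return ordered[rank - 1]


# Made-up service times in milliseconds: 95 fast requests and 5 very slow ones.
latencies_ms = [10] * 95 + [1000] * 5

average = mean(latencies_ms)        # 59.5, a latency that no request had
p50 = percentile(latencies_ms, 50)  # 10
p99 = percentile(latencies_ms, 99)  # 1000
```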

Repeatability is a Pipe Dream

Your production problem is a (hopefully pathological) outcome of circumstance.

A circumstance which often cannot be repeated.

Control Groups

Control groups can compensate for the inability to precisely repeat an experiment.

Control Groups

Most architectures have redundancy.

Control Groups

With the right design, you can turn that redundancy into a debugging environment.

[1] http://omniti.com/surge/2012/sessions/xtreme-deployment

Control Groups: Simple Example

I have 10 web servers
I fix 1
I verify 1 is fixed
I verify 9 are still broken
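
The four steps above can be sketched as a check over per-host error rates. The host names, the rates, and the threshold are all hypothetical; the point is only that the fix is credible when the treated host recovers and every control host is still failing.

```python
def compare_groups(error_rates, fixed_hosts, threshold):
    """Split hosts into treated and control groups and report surprises.

    Returns (fix_confirmed, treated_still_broken, control_now_healthy).
    """
    treated_still_broken = sorted(
        h for h in fixed_hosts if error_rates[h] > threshold
    )
    control = [h for h in error_rates if h not in fixed_hosts]
    control_now_healthy = sorted(
        h for h in control if error_rates[h] <= threshold
    )
    # The fix is only credible if every treated host recovered AND every
    # control host is still failing; otherwise something else changed too.
    fix_confirmed = not treated_still_broken and not control_now_healthy
    return fix_confirmed, treated_still_broken, control_now_healthy


# Hypothetical error rates (fraction of failed requests) after patching web01 only.
rates = {f"web{i:02d}": 0.20 for i in range(1, 11)}
rates["web01"] = 0.0
result = compare_groups(rates, fixed_hosts={"web01"}, threshold=0.01)
```

If the nine untouched servers had recovered as well, the check would refuse to credit the fix, which is exactly the information a control group is meant to provide.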

Control Groups: Seems Easy

Web servers tend to be:
• homogeneous
• share-(nothing|little)
• independent

Control Groups: Not So Easy

Most other services aren’t so homogeneous and equal: databases, batch processes (think billing), orchestration middleware, message queues

Observability

Some might claim that seeing telemetry data is observation...

It is doubly indirect at best.

Observability

I want to directly see the errant behaviour

Observability is forgiving

In complex, multi-component architectures, errors can be observed as errant behaviour in many junction points.

Observing the network

tcpdump / snoop
wireshark

Looking at just the arrival of new connections

tcpdump -nnq -tttt -s 384 'tcp port 80 and (tcp[13] & (2|16) == 2)'
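
The flag test in that filter reads byte 13 of the TCP header, which holds the flag bits: 2 is SYN and 16 is ACK. Masking with both and comparing against 2 keeps packets with SYN set and ACK clear, that is, the first packet of a new connection. A small sketch of the same bit test (the function name is made up for illustration):

```python
TCP_SYN = 0x02  # bit value 2 in the TCP flags byte
TCP_ACK = 0x10  # bit value 16 in the TCP flags byte


def is_new_connection(flags: int) -> bool:
    """Mirror of the filter test tcp[13] & (2|16) == 2: SYN set, ACK clear."""
    return flags & (TCP_SYN | TCP_ACK) == TCP_SYN


opening_syn = is_new_connection(0x02)   # client's initial SYN
syn_ack = is_new_connection(0x12)       # server's SYN+ACK reply
plain_ack = is_new_connection(0x10)     # ordinary ACK
```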

Looking at just the data arrival and departure times

tcpdump -nnq -tt -s 384 'tcp port 80 and (((ip[2:2] - ((ip[0]&0xf)*4)) - ((tcp[12]&0xf0)/4)) != 0)'

snoop -rq -ta -s 384 'tcp port 80 and (((ip[2:2] - ((ip[0]&0xf)*4)) - ((tcp[12]&0xf0)/4)) != 0)'
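
The arithmetic in that filter is the TCP payload length: the IP total length (ip[2:2]) minus the IP header length ((ip[0]&0xf)*4) minus the TCP header length ((tcp[12]&0xf0)/4). A non-zero result means the packet carries data. The sketch below repeats the calculation on raw bytes, assuming an IPv4 packet; fake_packet is a hypothetical helper that fills in only the three fields the filter reads.

```python
import struct


def tcp_payload_length(packet: bytes) -> int:
    """Payload bytes in an IPv4/TCP packet, computed the way the filter does."""
    total_length = struct.unpack("!H", packet[2:4])[0]            # ip[2:2]
    ip_header_len = (packet[0] & 0x0F) * 4                        # (ip[0]&0xf)*4
    tcp_header_len = (packet[ip_header_len + 12] & 0xF0) // 4     # (tcp[12]&0xf0)/4
    return total_length - ip_header_len - tcp_header_len


def fake_packet(payload_len: int) -> bytes:
    """Build a skeletal packet with only the fields the filter reads filled in."""
    ip = bytearray(20)
    ip[0] = 0x45                                        # IPv4, header length 5 words
    ip[2:4] = struct.pack("!H", 20 + 20 + payload_len)  # total length
    tcp = bytearray(20)
    tcp[12] = 0x50                                      # data offset 5 words
    return bytes(ip + tcp) + bytes(payload_len)


bare_ack_len = tcp_payload_length(fake_packet(0))
data_len = tcp_payload_length(fake_packet(100))
```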

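Finding the difference between a client’s question and a server’s answer (tcpdump | awk filter).

{
  # Put spaces around the port suffix so that it becomes its own field.
  gsub(".[0-9]+(: | >)"," \& ");
  gsub("[:=]"," ");
  # EP is the client endpoint (address plus port), whichever way the packet travels.
  EP=sprintf("%s%s", ($4==".80")?$6:$3, ($4==".80")?$7:$4);
  # A server packet directly after a client packet: print the elapsed time.
  if(S[EP] == "C" && $4 == ".80") { printf("%f %s\n", $1 - L[EP], EP); }
  # Remember who spoke last for this endpoint, and when.
  S[EP]= ($4==".80")?"S":"C";
  L[EP]= $1;
}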
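
The state machine in that awk filter can be restated as a short sketch: per client endpoint, remember who sent the last packet and when, and report the elapsed time whenever the server speaks directly after the client. The tuple layout of the input is an assumption made for illustration; it is not the tcpdump output format.

```python
def response_latencies(packets, server_port=80):
    """Yield (latency, client) each time the server answers right after a client packet.

    packets: iterable of (timestamp, src_ip, src_port, dst_ip, dst_port) tuples.
    """
    last_sender = {}  # client endpoint -> "C" (client) or "S" (server)
    last_time = {}    # client endpoint -> timestamp of the previous packet
    for ts, src_ip, src_port, dst_ip, dst_port in packets:
        from_server = src_port == server_port
        client = (dst_ip, dst_port) if from_server else (src_ip, src_port)
        if from_server and last_sender.get(client) == "C":
            yield ts - last_time[client], client
        last_sender[client] = "S" if from_server else "C"
        last_time[client] = ts


# Hypothetical capture: one request, then two response packets.
sample = [
    (1.0, "10.0.0.1", 5000, "10.0.0.2", 80),
    (1.25, "10.0.0.2", 80, "10.0.0.1", 5000),
    (1.5, "10.0.0.2", 80, "10.0.0.1", 5000),
]
latencies = list(response_latencies(sample))
```

Only the first response packet is reported, because the second one follows a server packet and not a client packet.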

Observing user-space

strace[1] / truss
gstack / pstack
gcore + gdb / dbx / mdb[2]

[1] http://www.cli.di.unipi.it/~gadducci/SOL-11/Local/referenceCards/LINUX_System_Call_Quick_Reference.pdf
[2] http://hub.opensolaris.org/bin/download/Community+Group+mdb/tips/mdb-cheatsheet.pdf

System call tracing

Watching sshd is a good way to get familiar.

truss -f -p `pgrep sshd`

System call tracing

An active web server is going to be like a firehose.

truss -f -p `pgrep httpd`

Observing the system

DTrace

Live production demo or GTFO.

Thank You

Questions?
