Page 1: Monitoring and observability


Monitoring and Observability

in Complex Architectures

Tuesday, October 2, 2012

Page 2: Monitoring and observability

Hi! I’m @postwait

I founded @OmniTI and @MessageSystems and @Circonus


Page 3: Monitoring and observability

Hi! I’m @postwait

I am very active in @TheOfficialACM, participating in @ACMQueue and the practitioners board.


Page 4: Monitoring and observability

Hi! I’m @postwait

I (regrettably) build complex systems.


Page 5: Monitoring and observability

Why we are here

We’re here to talk about coping with breakage


Page 6: Monitoring and observability

Rule #1

Direct observation of failure leads to quicker rectification.


Page 7: Monitoring and observability

Rule #2

You cannot correct what you cannot measure.


Page 8: Monitoring and observability

Solution Approach #1

Debugging failures requires either visibility into the precipitating state


Page 9: Monitoring and observability

Precipitating State

Single threaded applications

✓ Easy


Page 10: Monitoring and observability

Precipitating State

Multi-threaded applications

✓ Challenging


Page 11: Monitoring and observability

Precipitating State

Distributed applications

here there be dragons


Page 12: Monitoring and observability

Solution Approach #2

or direct observation of a (and likely very many) failing transaction


Page 13: Monitoring and observability

Direct Observation

Observing something fail... is priceless.


Page 14: Monitoring and observability

Direct Observation

Observation leads to intelligent questioning.


Page 15: Monitoring and observability

Direct Observation

Questioning leads to answers... but only through more observation.


Page 16: Monitoring and observability

Direct Observation

Questioning leads to answers... but only through more observation.

and herein lies the rub.


Page 17: Monitoring and observability

Leaning Towards Scientific Process

In production you don’t have
• repeatability
• control groups
• external verification


Page 18: Monitoring and observability

Leaning Towards Scientific Process

In production you don’t have
• repeatability
• control groups
• external verification

... or do you?


Page 19: Monitoring and observability

What’s monitoring got to do with it?

Monitoring is all about the passive observation of telemetry data.


Page 20: Monitoring and observability

Monitoring Telemetry

cannot pinpoint problems

can provide evidence of the existence of a problem


Page 21: Monitoring and observability

Monitoring

Gives you evidence that there is a problem


Page 22: Monitoring and observability

Monitoring

Gives you evidence that you have fixed a problem (or at least the symptoms)


Page 23: Monitoring and observability

Monitoring Tactically

If it could be of interest, measure it and expose the measurement


Page 24: Monitoring and observability

Monitoring: embedded

statsd https://github.com/etsy/statsd

resmon http://labs.omniti.com/labs/resmon

metrics https://github.com/codahale/metrics

folsom https://github.com/boundary/folsom

metrics.js https://github.com/mikejihbe/metrics

metrics-net https://github.com/danielcrenna/metrics-net
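
As a rough illustration of what "embedded" means here, the sketch below emits counters and timers from inside application code using the plain-text statsd line format (`name:value|type`) over UDP. The helper names, the metric names, and the default address 127.0.0.1:8125 are assumptions made for this example; this is not the statsd project's own client.

```python
import socket
import time


def format_metric(name, value, kind):
    """Build one statsd line, e.g. 'api.calls:1|c' (counter) or 'api.time:12|ms' (timer)."""
    return f"{name}:{value}|{kind}"


class StatsClient:
    """Fire-and-forget UDP sender (hypothetical helper for this sketch)."""

    def __init__(self, host="127.0.0.1", port=8125):
        self.addr = (host, port)
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def incr(self, name, count=1):
        self._send(format_metric(name, count, "c"))

    def timing(self, name, millis):
        self._send(format_metric(name, int(millis), "ms"))

    def _send(self, line):
        try:
            self.sock.sendto(line.encode("ascii"), self.addr)
        except OSError:
            pass  # monitoring must never break the request path


def timed_call(client, name, func, *args):
    """Run func, then expose how long it took and the fact that it ran."""
    start = time.monotonic()
    try:
        return func(*args)
    finally:
        client.timing(name + ".time", (time.monotonic() - start) * 1000.0)
        client.incr(name + ".calls")
```

Wrapping a call as `timed_call(client, "api.lookup", lookup, key)` is enough to expose both a rate and a latency for it; UDP keeps the cost of measuring close to zero when no collector is listening.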


Page 25: Monitoring and observability

Monitoring: collection

reconnoiter http://labs.omniti.com/labs/reconnoiter

graphite http://graphite.wikidot.com/

OpenTSDB http://opentsdb.net/

circonus http://circonus.com/

librato https://metrics.librato.com/


Page 26: Monitoring and observability

Monitoring: Bling

visualizing an architecture rollout


Page 27: Monitoring and observability

Monitoring: Bling

visualizing the impact on service times


Page 28: Monitoring and observability

average API service time latency


Page 29: Monitoring and observability

actual API service time latency

http://www.slideshare.net/postwait/atldevops
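
The contrast between "average" and "actual" latency can be sketched with a few lines of code. The sample below is fabricated purely for illustration (90 fast requests and 10 slow ones), and the helper names are assumptions; the point is only that a single mean describes none of the requests, while a histogram or percentiles show the shape.

```python
import math
from collections import Counter
from statistics import mean


def histogram(samples_ms, bucket_ms=10):
    """Count samples per fixed-width bucket, keyed by the bucket's lower bound."""
    counts = Counter((int(s) // bucket_ms) * bucket_ms for s in samples_ms)
    return dict(sorted(counts.items()))


def percentile(samples_ms, pct):
    """Nearest-rank percentile; pct is in the range (0, 100]."""
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]


# Fabricated data: most requests are fast, a few are very slow.
samples = [5] * 90 + [500] * 10

print("mean:", mean(samples))
print("p50:", percentile(samples, 50))
print("p99:", percentile(samples, 99))
print("histogram:", histogram(samples))
```

With this made-up sample the mean lands between the two clusters, at a latency that no request actually experienced.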


Page 30: Monitoring and observability

Monitoring: Bling


Page 31: Monitoring and observability

Repeatability is a Pipe Dream

Your production problem is a (hopefully pathological) outcome of circumstance.

A circumstance which often cannot be repeated.


Page 32: Monitoring and observability

Control Groups

Control groups can compensate for the inability to precisely repeat an experiment.


Page 33: Monitoring and observability

Control Groups

Most architectures have redundancy.


Page 34: Monitoring and observability

Control Groups

With the right design, you can turn that redundancy into a debugging environment.

[1] http://omniti.com/surge/2012/sessions/xtreme-deployment


Page 35: Monitoring and observability

Control Groups: Simple Example

I have 10 web servers
I fix 1
I verify 1 is fixed
I verify 9 are still broken
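
The four steps above can be sketched as a check over per-host error rates. The host names, the rates, and the 1% threshold are all hypothetical; the part worth noticing is that the untouched hosts are verified too, since they are the control group.

```python
def evaluate_fix(rates, treated, threshold=0.01):
    """rates maps host -> error rate measured after the change.

    Returns (treated_ok, controls_still_broken). The fix is only credible
    when the treated host is healthy AND the untouched hosts still fail;
    if the controls recovered as well, something else changed.
    """
    treated_ok = rates[treated] <= threshold
    controls = [rate for host, rate in rates.items() if host != treated]
    controls_still_broken = all(rate > threshold for rate in controls)
    return treated_ok, controls_still_broken


# Fabricated numbers: ten hosts failing 20% of requests, one of them fixed.
rates = {f"web{i:02d}": 0.20 for i in range(1, 11)}
rates["web03"] = 0.001

print(evaluate_fix(rates, "web03"))
```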


Page 36: Monitoring and observability

Control Groups: Seems Easy

Web servers tend to be:
• homogeneous
• share-(nothing|little)
• independent


Page 37: Monitoring and observability

Control Groups: Not So Easy

Most other services aren’t so homogeneous and equal: databases, batch processes (think billing), orchestration middleware, message queues


Page 38: Monitoring and observability

Observability

Some might claim that seeing telemetry data is observation...

It is doubly indirect at best.


Page 39: Monitoring and observability

Observability

I want to directly see the errant behaviour


Page 40: Monitoring and observability

Observability is forgiving

In complex, multi-component architectures, errors can be observed as errant behaviour in many junction points.


Page 41: Monitoring and observability

Observing the network

tcpdump / snoop
wireshark


Page 42: Monitoring and observability

Observing the network

Looking at just the arrival of new connections

tcpdump -nnq -tttt -s 384 'tcp port 80 and (tcp[13] & (2|16) == 2)'
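
The filter expression reads byte 13 of the TCP header (the flags byte), masks out SYN (2) and ACK (16), and keeps only packets where SYN is set and ACK is not: the first packet of a handshake. A small sketch of the same test, with constant and function names chosen for this example:

```python
TCP_SYN = 0x02
TCP_ACK = 0x10


def is_new_connection(flags_byte):
    """Same test as the capture filter: (tcp[13] & (2|16)) == 2."""
    return (flags_byte & (TCP_SYN | TCP_ACK)) == TCP_SYN


print(is_new_connection(0x02))  # SYN only: a client opening a connection
print(is_new_connection(0x12))  # SYN+ACK: the server's reply, filtered out
print(is_new_connection(0x10))  # plain ACK: filtered out
```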


Page 43: Monitoring and observability

Observing the network

Looking at just the data arrival and departure times

tcpdump -nnq -tt -s 384 'tcp port 80 and (((ip[2:2] - ((ip[0]&0xf)*4)) - ((tcp[12]&0xf0)/4)) != 0)'

snoop -rq -ta -s 384 'tcp port 80 and (((ip[2:2] - ((ip[0]&0xf)*4)) - ((tcp[12]&0xf0)/4)) != 0)'
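
The arithmetic in that filter is "IP total length, minus IP header length, minus TCP header length"; anything non-zero carries payload, so bare ACKs and handshake packets drop out. A sketch of the same computation over raw bytes, assuming an IPv4 packet that starts at the IP header (no link-layer framing); the function name is made up for this example:

```python
import struct


def tcp_payload_length(packet):
    """packet: bytes starting at the IPv4 header (no link-layer header)."""
    ip_header_len = (packet[0] & 0x0F) * 4              # (ip[0] & 0xf) * 4
    (ip_total_len,) = struct.unpack("!H", packet[2:4])  # ip[2:2]
    tcp_offset_byte = packet[ip_header_len + 12]        # tcp[12]
    tcp_header_len = (tcp_offset_byte & 0xF0) // 4      # (tcp[12] & 0xf0) / 4
    return ip_total_len - ip_header_len - tcp_header_len
```

The `& 0xF0` then `// 4` step works because the data offset sits in the high nibble and counts 32-bit words: shifting right by 4 and multiplying by 4 is the same as dividing the masked byte by 4.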


Page 44: Monitoring and observability

Observing the network

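Finding the difference between a client’s question and a server’s answer (tcpdump | awk filter).

{
  # Pad each port with spaces so that it becomes its own field ($4, $7).
  gsub(".[0-9]+(: | >)"," \& ");
  gsub("[:=]"," ");
  # EP is the client endpoint (address + port), whichever way the packet travels.
  EP=sprintf("%s%s", ($4==".80")?$6:$3, ($4==".80")?$7:$4);

  # A server packet right after a client packet: print the elapsed time.
  if(S[EP] == "C" && $4 == ".80") { printf("%f %s\n", $1 - L[EP], EP); }

  # Remember who spoke last on this endpoint, and when.
  S[EP]= ($4==".80")?"S":"C"; L[EP]= $1;
}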
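
For readers less comfortable with awk, here is a sketch of the same idea: remember who spoke last on each client endpoint, and when a server packet follows a client packet, report the gap. It assumes input lines shaped like `tcpdump -nnq -tt` output (`<epoch> IP <src>.<port> > <dst>.<port>: tcp <len>`); that exact shape, the regular expression, and the function name are assumptions of this example rather than something taken from the slides.

```python
import re

LINE = re.compile(
    r"^(?P<ts>\d+\.\d+) IP (?P<src>\S+)\.(?P<sport>\d+) > "
    r"(?P<dst>\S+)\.(?P<dport>\d+):"
)


def response_latencies(lines, server_port="80"):
    """Yield (seconds, client_endpoint) when a server packet follows a client packet."""
    last_sender = {}  # client endpoint -> "C" or "S"
    last_time = {}    # client endpoint -> timestamp of the previous packet
    for line in lines:
        m = LINE.match(line)
        if not m:
            continue  # skip anything that is not an IP/TCP summary line
        ts = float(m["ts"])
        from_server = m["sport"] == server_port
        if from_server:
            client = f'{m["dst"]}.{m["dport"]}'
        else:
            client = f'{m["src"]}.{m["sport"]}'
        if from_server and last_sender.get(client) == "C":
            yield ts - last_time[client], client
        last_sender[client] = "S" if from_server else "C"
        last_time[client] = ts
```

Fed with something like `for delta, ep in response_latencies(sys.stdin): ...`, this gives one latency per question/answer exchange, measured on the wire rather than reported by the application.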


Page 45: Monitoring and observability

Observing the network


Page 46: Monitoring and observability

Observing the network


Page 47: Monitoring and observability

Observing user-space

strace[1] / truss
gstack / pstack
gcore + gdb / dbx / mdb[2]

[1] http://www.cli.di.unipi.it/~gadducci/SOL-11/Local/referenceCards/LINUX_System_Call_Quick_Reference.pdf
[2] http://hub.opensolaris.org/bin/download/Community+Group+mdb/tips/mdb-cheatsheet.pdf


Page 48: Monitoring and observability

System call tracing

Watching sshd is a good way to get familiar.

truss -f -p `pgrep sshd`


Page 49: Monitoring and observability

System call tracing

An active web server is going to be like a firehose.

truss -f -p `pgrep httpd`
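
One way to cope with the firehose is to save the trace and summarize it afterwards. The sketch below counts system calls by name. It assumes lines of the general form `[prefix] name(args...) = result`, where the optional prefix is a pid as printed by truss (`1234:` or `1234/1:`) or strace (`[pid 1234]`); real output has more variety (signals, unfinished calls), which this simply skips. The regular expression and function name are made up for this example.

```python
import re
from collections import Counter

# Optional pid prefix, optional whitespace, then "name(".
CALL = re.compile(r"^(?:\[pid\s+\d+\]|\d+(?:/\d+)?:?)?\s*([A-Za-z_]\w*)\(")


def syscall_counts(lines):
    """Return a Counter of system call names seen in saved trace output."""
    counts = Counter()
    for line in lines:
        m = CALL.match(line)
        if m:
            counts[m.group(1)] += 1
    return counts
```

Calling `syscall_counts(open("trace.out")).most_common(10)` on a saved trace (the file name is hypothetical) gives a quick view of where the process spends its calls before reading individual lines.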


Page 50: Monitoring and observability

Observing the system

DTrace

Live production demo or GTFO.


Page 51: Monitoring and observability

Thank You

Questions?


