Monitoring and Observability in Complex Architectures

Tuesday, October 2, 12

Hi! I’m @postwait

I founded @OmniTI and @MessageSystems and @Circonus


Hi! I’m @postwait

I am very active in @TheOfficialACM, participating in @ACMQueue and the practitioners board.


Hi! I’m @postwait

I (regrettably) build complex systems.


Why we are here

We’re here to talk about coping with breakage


Rule #1

Direct observation of failure leads to quicker rectification.


Rule #2

You cannot correct what you cannot measure.


Solution Approach #1

Debugging failures requires either visibility into the precipitating state


Precipitating State

Single threaded applications

✓ Easy


Precipitating State

Multi-threaded applications

✓ Challenging


Precipitating State

Distributed applications

here there be dragons


Solution Approach #2

or direct observation of a (and likely very many) failing transaction


Direct Observation

Observing something fail... is priceless.


Direct Observation

Observation leads to intelligent questioning.


Direct Observation

Questioning leads to answers... but only through more observation.

and herein lies the rub.


Leaning Towards Scientific Process

In production you don’t have
• repeatability
• control groups
• external verification

... or do you?


What’s monitoring got to do with it?

Monitoring is all about the passive observation of telemetry data.


Monitoring Telemetry

cannot pinpoint problems

can provide evidence of the existence of a problem


Monitoring

Gives you evidence that there is a problem


Monitoring

Gives you evidence that you have fixed a problem (or at least the symptoms)


Monitoring Tactically

If it could be of interest, measure it and expose the measurement
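
As a rough illustration of "measure it and expose the measurement", the sketch below keeps counters and timings in process and exposes them as a JSON snapshot that a collector could poll. The `Metrics` class and the metric names are hypothetical and are not taken from any library named in this talk.

```python
import json
import threading
import time
from contextlib import contextmanager


class Metrics:
    """Hypothetical in-process registry: counters plus raw timing samples."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counters = {}
        self._timings = {}

    def incr(self, name, by=1):
        """Add to a named counter."""
        with self._lock:
            self._counters[name] = self._counters.get(name, 0) + by

    @contextmanager
    def timed(self, name):
        """Record the wall-clock duration of the enclosed block, in seconds."""
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            with self._lock:
                self._timings.setdefault(name, []).append(elapsed)

    def snapshot(self):
        """Return the current measurements as a JSON string."""
        with self._lock:
            return json.dumps({
                "counters": dict(self._counters),
                "timings": {name: {"count": len(vals), "max_s": max(vals)}
                            for name, vals in self._timings.items()},
            }, sort_keys=True)


metrics = Metrics()
with metrics.timed("checkout.duration"):
    metrics.incr("checkout.requests")
print(metrics.snapshot())
```

The point of the sketch is only that measuring is cheap once a registry exists; how the snapshot is exposed (HTTP endpoint, log line, push to a collector) is a separate choice.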


Monitoring: embedded

statsd: https://github.com/etsy/statsd
resmon: http://labs.omniti.com/labs/resmon
metrics: https://github.com/codahale/metrics
folsom: https://github.com/boundary/folsom
metrics.js: https://github.com/mikejihbe/metrics
metrics-net: https://github.com/danielcrenna/metrics-net
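
To give a feel for how little code embedded instrumentation needs, here is a minimal sketch of emitting metrics in the StatsD text format (`<name>:<value>|<type>`) over UDP. The helper names, the metric names, and the host are assumptions for illustration; port 8125 is the usual StatsD default, but check your own deployment.

```python
import socket


def statsd_line(name, value, kind):
    """Format one StatsD datagram: <name>:<value>|<type>."""
    if kind not in ("c", "ms", "g"):  # counter, timer (milliseconds), gauge
        raise ValueError("unsupported metric type: %r" % kind)
    return "%s:%s|%s" % (name, value, kind)


def send_metric(line, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send; host and port are assumptions in this sketch."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(line.encode("ascii"), (host, port))
    finally:
        sock.close()


print(statsd_line("api.requests", 1, "c"))     # api.requests:1|c
print(statsd_line("api.latency", 320, "ms"))   # api.latency:320|ms
```

UDP is used so that a slow or absent collector never blocks the application being measured.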


Monitoring: collection

reconnoiter: http://labs.omniti.com/labs/reconnoiter
graphite: http://graphite.wikidot.com/
OpenTSDB: http://opentsdb.net/
circonus: http://circonus.com/
librato: https://metrics.librato.com/


Monitoring: Bling

visualizing an architecture rollout


Monitoring: Bling

visualizing the impact on service times


average API service time latency


actual API service time latency

http://www.slideshare.net/postwait/atldevops
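
The captions above contrast average with actual service time latency. A small sketch of why the two can differ follows; the latency samples are invented for illustration and the helper functions are hypothetical. A few very slow requests produce a mean that describes almost none of the requests, while percentiles and a histogram show both populations.

```python
from collections import Counter


def percentile(samples, p):
    """Nearest-rank percentile of a non-empty list; p is an integer in 1..100."""
    ordered = sorted(samples)
    rank = (p * len(ordered) + 99) // 100   # integer ceil(p * n / 100)
    return ordered[rank - 1]


def histogram(samples, bucket_ms):
    """Count samples per fixed-width bucket, keyed by the bucket's lower bound."""
    return Counter((s // bucket_ms) * bucket_ms for s in samples)


# Invented data: 95 fast requests and 5 very slow ones, in milliseconds.
latencies_ms = [10] * 95 + [2000] * 5

print(sum(latencies_ms) / len(latencies_ms))   # 109.5
print(percentile(latencies_ms, 50))            # 10
print(percentile(latencies_ms, 99))            # 2000
print(sorted(histogram(latencies_ms, 500).items()))  # [(0, 95), (2000, 5)]
```

No request in this data set took anywhere near 109.5 ms, which is the sense in which an average can hide the actual behaviour.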




Monitoring: Bling


Repeatability is a Pipe Dream

Your production problem is a (hopefully pathological) outcome of circumstance.

A circumstance which often cannot be repeated.


Control Groups

Control groups can compensate for the inability to precisely repeat an experiment.


Control Groups

Most architectures have redundancy.


Control Groups

With the right design, you can turn that redundancy into a debugging environment.

[1] http://omniti.com/surge/2012/sessions/xtreme-deployment


Control Groups: Simple Example

I have 10 web servers
I fix 1
I verify 1 is fixed
I verify 9 are still broken
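
The four steps above can be sketched as a comparison between the fixed host and the nine untouched hosts acting as the control group. The host names, the per-host counts, and the 1% "healthy" threshold are all invented for illustration.

```python
def pooled_error_rate(hosts):
    """hosts: dict of host name -> (errors, requests). Returns errors/requests."""
    errors = sum(e for e, _ in hosts.values())
    requests = sum(r for _, r in hosts.values())
    return errors / requests if requests else 0.0


def fix_is_verified(treated, control, healthy_below=0.01):
    """True when the fixed hosts look healthy AND the control hosts still fail.

    The second condition matters: if the control group also recovered,
    something other than the fix changed.
    """
    return (pooled_error_rate(treated) < healthy_below
            and pooled_error_rate(control) >= healthy_below)


treated = {"web01": (2, 1000)}                                   # the 1 fixed server
control = {"web%02d" % i: (150, 1000) for i in range(2, 11)}     # the other 9

print(pooled_error_rate(treated))          # 0.002
print(pooled_error_rate(control))          # 0.15
print(fix_is_verified(treated, control))   # True
```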


Control Groups: Seems Easy

Web servers tend to be:
• homogeneous
• share-(nothing|little)
• independent


Control Groups: Not So Easy

Most other services aren’t so homogeneous and equal: databases, batch processes (think billing), orchestration middleware, message queues


Observability

Some might claim that seeing telemetry data is observation...

It is doubly indirect at best.


Observability

I want to directly see the errant behaviour


Observability is forgiving

In complex, multi-component architectures, errors can be observed as errant behaviour in many junction points.


Observing the network

tcpdump / snoop
wireshark



Looking at just the arrival of new connections

tcpdump -nnq -tttt -s 384 'tcp port 80 and (tcp[13] & (2|16) == 2)'
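
The filter expression above tests byte 13 of the TCP header, which holds the flag bits: 2 is SYN and 16 is ACK. Masking with `(2|16)` and comparing to 2 keeps only packets with SYN set and ACK clear, that is, the client's opening packet. A minimal sketch of the same test on raw header bytes (the function name and the test headers are hypothetical):

```python
SYN = 0x02
ACK = 0x10


def is_new_connection(tcp_header: bytes) -> bool:
    """Mirror of the filter: SYN set and ACK clear in byte 13 of the TCP header."""
    flags = tcp_header[13]
    # Parentheses are required in Python, where == binds tighter than &.
    return (flags & (SYN | ACK)) == SYN


def header_with_flags(flags: int) -> bytes:
    """Build a 20-byte TCP header that is all zeros except the flags byte."""
    return bytes(13) + bytes([flags]) + bytes(6)


print(is_new_connection(header_with_flags(0x02)))   # True  (SYN)
print(is_new_connection(header_with_flags(0x12)))   # False (SYN+ACK, server reply)
print(is_new_connection(header_with_flags(0x10)))   # False (plain ACK)
```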



Looking at just the data arrival and departure times

tcpdump -nnq -tt -s 384 'tcp port 80 and (((ip[2:2] - ((ip[0]&0xf)*4)) - ((tcp[12]&0xf0)/4)) != 0)'

snoop -rq -ta -s 384 'tcp port 80 and (((ip[2:2] - ((ip[0]&0xf)*4)) - ((tcp[12]&0xf0)/4)) != 0)'
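
The arithmetic in the filter above computes the TCP payload length: the IP total length, minus the IP header length, minus the TCP header length. Packets where the result is zero (bare ACKs, for example) carry no data and are dropped. A minimal sketch of the same calculation on a raw IPv4 packet, with a hand-built packet as a hypothetical example:

```python
import struct


def tcp_payload_length(packet: bytes) -> int:
    """Payload bytes in an IPv4/TCP packet that starts at the IP header."""
    total_length = struct.unpack("!H", packet[2:4])[0]   # ip[2:2]
    ip_header_len = (packet[0] & 0x0F) * 4               # (ip[0]&0xf)*4
    data_offset_byte = packet[ip_header_len + 12]        # tcp[12]
    tcp_header_len = (data_offset_byte & 0xF0) // 4      # (tcp[12]&0xf0)/4
    return total_length - ip_header_len - tcp_header_len


def build_packet(payload: bytes) -> bytes:
    """Hypothetical minimal packet: 20-byte IP header, 20-byte TCP header."""
    ip = bytes([0x45, 0]) + struct.pack("!H", 40 + len(payload)) + bytes(16)
    tcp = bytes(12) + bytes([0x50]) + bytes(7)
    return ip + tcp + payload


print(tcp_payload_length(build_packet(b"hello")))   # 5
print(tcp_payload_length(build_packet(b"")))        # 0, so the filter drops it
```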



Finding the difference between a client’s question and a server’s answer (tcpdump | awk filter).

{
  gsub(".[0-9]+(: | >)"," \& ");
  gsub("[:=]"," ");
  EP=sprintf("%s%s", ($4==".80")?$6:$3, ($4==".80")?$7:$4);
  if(S[EP] == "C" && $4 == ".80") { printf("%f %s\n", $1 - L[EP], EP); }
  S[EP]= ($4==".80")?"S":"C";
  L[EP]= $1;
}
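
One reading of the awk filter above: it tracks, per client endpoint, who spoke last and when, and prints the elapsed time whenever the server speaks right after the client. The sketch below expresses that logic in Python over packets that are assumed to be already parsed into tuples; the tuple layout, the function name, and the sample addresses are hypothetical.

```python
def response_latencies(packets, server_port=80):
    """Yield (latency_seconds, client_endpoint) for each first server reply.

    packets: iterable of (timestamp, src_ip, src_port, dst_ip, dst_port)
    tuples in capture order.
    """
    last_speaker = {}   # client endpoint -> "C" or "S" (S[EP] in the awk)
    last_seen = {}      # client endpoint -> previous timestamp (L[EP])
    for ts, src_ip, src_port, dst_ip, dst_port in packets:
        from_server = src_port == server_port
        client = (dst_ip, dst_port) if from_server else (src_ip, src_port)
        if from_server and last_speaker.get(client) == "C":
            yield ts - last_seen[client], client
        last_speaker[client] = "S" if from_server else "C"
        last_seen[client] = ts


sample = [
    (1.0, "10.0.0.1", 5000, "10.0.0.2", 80),    # client question
    (1.25, "10.0.0.2", 80, "10.0.0.1", 5000),   # server answer
    (1.5, "10.0.0.2", 80, "10.0.0.1", 5000),    # more of the answer, not reported
]
print(list(response_latencies(sample)))   # [(0.25, ('10.0.0.1', 5000))]
```

Only the first server packet after a client packet is reported, so the figure approximates time to first byte rather than total response time.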






Observing user-space

strace [1] / truss
gstack / pstack
gcore + gdb / dbx / mdb [2]

[1] http://www.cli.di.unipi.it/~gadducci/SOL-11/Local/referenceCards/LINUX_System_Call_Quick_Reference.pdf
[2] http://hub.opensolaris.org/bin/download/Community+Group+mdb/tips/mdb-cheatsheet.pdf



System call tracing

Watching sshd is a good way to get familiar.

truss -f -p `pgrep sshd`


System call tracing

An active web server is going to be like a firehose.

truss -f -p `pgrep httpd`
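
One way to cope with the firehose is to summarize it rather than read it. The sketch below counts system calls by name from saved trace lines. It assumes only that the call name is the identifier immediately before the first "(" on each line, which is a simplifying assumption about the trace format; the sample lines are invented and real truss or strace output may need a stricter parser.

```python
import re
from collections import Counter

# Assumption: the syscall name is the identifier right before the first "(".
SYSCALL = re.compile(r"\b([A-Za-z_][A-Za-z0-9_]*)\(")


def count_syscalls(lines):
    """Return a Counter of syscall names seen in the given trace lines."""
    counts = Counter()
    for line in lines:
        match = SYSCALL.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample_trace = [
    '1234 read(3, "GET / HTTP/1.1", 4096) = 14',
    '1234 write(4, "HTTP/1.1 200 OK", 15) = 15',
    '1235 read(5, "", 4096) = 0',
    'a line with no call on it',
]
print(count_syscalls(sample_trace).most_common())   # [('read', 2), ('write', 1)]
```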


Observing the system

DTrace

Live production demo or GTFO.


Thank You

Questions?


Date post:	15-Jan-2015
Category:	Technology
Upload:	theo-schlossnagle