Monitoring and observability

Date post: 15-Jan-2015
Upload: theo-schlossnagle
Description:
In this session we’ll take the need for performance as a foregone conclusion and take a whirlwind tour through the complexity of modern Internet architectures. The complexities lead to evil optimization problems and significant challenges troubleshooting production issues to a speedy and successful end. Starting with the simple facts that you can’t fix what you can’t see and you can’t improve what you can’t measure, we’ll discuss what needs monitoring and why. We’ll talk about unlikely allies in the fight for time and budget to instrument systems, applications and processes for observability. You’ll leave the session with a better understanding of what it looks like to troubleshoot the storm of a malfunctioning large architecture and some tools and techniques you can use to not be swallowed by the Kraken.
Transcript
Page 1: Monitoring and observability

Monitoring and Observability

in Complex Architectures

Tuesday, October 2, 12

Page 2: Monitoring and observability

Hi! I’m @postwait

I founded @OmniTI and @MessageSystems and @Circonus

Page 3: Monitoring and observability

Hi! I’m @postwait

I am very active in @TheOfficialACM, participating in @ACMQueue and the practitioners board.

Page 4: Monitoring and observability

Hi! I’m @postwait

I (regrettably) build complex systems.

Page 5: Monitoring and observability

Why we are here

We’re here to talk about coping with breakage

Page 6: Monitoring and observability

Rule #1

Direct observation of failure leads to quicker rectification.

Page 7: Monitoring and observability

Rule #2

You cannot correct what you cannot measure.

Page 8: Monitoring and observability

Solution Approach #1

Debugging failures requires either visibility into the precipitating state

Page 9: Monitoring and observability

Precipitating State

Single threaded applications

✓ Easy

Page 10: Monitoring and observability

Precipitating State

Multi-threaded applications

✓ Challenging

Page 11: Monitoring and observability

Precipitating State

Distributed applications

here there be dragons

Page 12: Monitoring and observability

Solution Approach #2

or direct observation of a (and likely very many) failing transaction

Page 13: Monitoring and observability

Direct Observation

Observing something fail... is priceless.

Page 14: Monitoring and observability

Direct Observation

Observation leads to intelligent questioning.

Page 15: Monitoring and observability

Direct Observation

Questioning leads to answers... but only through more observation.

Page 16: Monitoring and observability

Direct Observation

Questioning leads to answers... but only through more observation.

and herein lies the rub.

Page 17: Monitoring and observability

Leaning Towards Scientific Process

In production you don’t have
• repeatability
• control groups
• external verification

Page 18: Monitoring and observability

Leaning Towards Scientific Process

In production you don’t have
• repeatability
• control groups
• external verification

... or do you?

Page 19: Monitoring and observability

What’s monitoring got to do with it?

Monitoring is all about the passive observation of telemetry data.

Page 20: Monitoring and observability

Monitoring Telemetry

cannot pinpoint problems

can provide evidence of the existence of a problem

Page 21: Monitoring and observability

Monitoring

Gives you evidence that there is a problem

Page 22: Monitoring and observability

Monitoring

Gives you evidence that you have fixed a problem (or at least the symptoms)

Page 23: Monitoring and observability

Monitoring Tactically

If it could be of interest, measure it and expose the measurement

Page 24: Monitoring and observability

Monitoring: embedded

statsd https://github.com/etsy/statsd

resmon http://labs.omniti.com/labs/resmon

metrics https://github.com/codahale/metrics

folsom https://github.com/boundary/folsom

metrics.js https://github.com/mikejihbe/metrics

metrics-net https://github.com/danielcrenna/metrics-net
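
As a rough illustration of what "measure it and expose the measurement" can look like with the first tool listed, here is a minimal Python sketch that formats metrics in the statsd plain-text line format (`name:value|type`) and sends them over UDP. The metric names, the helper names, and the localhost address are assumptions made for illustration; 8125 is the port statsd listens on by default.

```python
import socket
import time

def format_metric(name, value, metric_type):
    """Build one statsd line, e.g. 'api.requests:1|c'."""
    return f"{name}:{value}|{metric_type}"

def send_metric(line, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send; instrumentation should never break the app."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.sendto(line.encode("ascii"), (host, port))
    except OSError:
        pass  # losing a measurement is preferable to failing the request

def timed(name, func, *args, **kwargs):
    """Run func, then emit its wall-clock duration as a statsd timer (ms)."""
    start = time.monotonic()
    try:
        return func(*args, **kwargs)
    finally:
        elapsed_ms = int((time.monotonic() - start) * 1000)
        send_metric(format_metric(name, elapsed_ms, "ms"))

if __name__ == "__main__":
    send_metric(format_metric("api.requests", 1, "c"))  # hypothetical counter
    timed("api.lookup_time", sum, range(1000))          # hypothetical timer
```

UDP is used here because the sender does not wait for the collector: if nothing is listening, the datagram is simply dropped and the application carries on.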

Page 25: Monitoring and observability

Monitoring: collection

reconnoiter http://labs.omniti.com/labs/reconnoiter

graphite http://graphite.wikidot.com/

OpenTSDB http://opentsdb.net/

circonus http://circonus.com/

librato https://metrics.librato.com/

Page 26: Monitoring and observability

Monitoring: Bling

visualizing an architecture rollout

Page 27: Monitoring and observability

Monitoring: Bling

visualizing the impact on service times

Page 28: Monitoring and observability

average API service time latency

Page 29: Monitoring and observability

actual API service time latency

http://www.slideshare.net/postwait/atldevops
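
The contrast between average and actual latency can be illustrated numerically. The sketch below uses made-up latencies, assumed purely for illustration and not taken from the talk: most requests are fast and a small group is very slow. The mean lands on a value that no request actually experienced, while percentiles expose both populations.

```python
import statistics

# Synthetic latencies in milliseconds (assumed data, not from the talk):
# 90 fast requests at 10 ms and 10 slow requests at 500 ms.
latencies_ms = [10] * 90 + [500] * 10

def percentile(samples, pct):
    """Nearest-rank percentile: the value at rank ceil(pct% of N)."""
    ordered = sorted(samples)
    rank = max(1, -(-len(ordered) * pct // 100))  # ceiling division
    return ordered[rank - 1]

mean_ms = statistics.mean(latencies_ms)
print(f"mean   = {mean_ms} ms")
print(f"median = {percentile(latencies_ms, 50)} ms")
print(f"p99    = {percentile(latencies_ms, 99)} ms")
```

With this data the mean falls between the two groups, which is why a single averaged line on a graph can hide a slow population entirely.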

Page 30: Monitoring and observability

Monitoring: Bling

Page 31: Monitoring and observability

Repeatability is a Pipe Dream

Your production problem is a (hopefully pathological) outcome of circumstance.

A circumstance which often cannot be repeated.

Page 32: Monitoring and observability

Control Groups

Control groups can compensate for the inability to precisely repeat an experiment.

Page 33: Monitoring and observability

Control Groups

Most architectures have redundancy.

Page 34: Monitoring and observability

Control Groups

With the right design, you can turn that redundancy into a debugging environment.

[1] http://omniti.com/surge/2012/sessions/xtreme-deployment

Page 35: Monitoring and observability

Control Groups: Simple Example

I have 10 web servers
I fix 1
I verify 1 is fixed
I verify 9 are still broken
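
The four steps above amount to comparing a treated host against a control group. A minimal sketch, assuming hypothetical per-host error rates gathered after the fix went out to one server (the host names, the numbers, and the 1% threshold are all invented for illustration):

```python
# Hypothetical error rates (errors / requests) per web server after fixing web01.
error_rates = {f"web{n:02d}": 0.12 for n in range(1, 11)}
error_rates["web01"] = 0.001  # the one server that received the fix

def verify_fix(rates, treated, threshold=0.01):
    """Return (treated_ok, controls_still_broken).

    The fix is only credible when the treated host is healthy AND the
    untouched hosts still show the problem; if the controls recovered too,
    something other than the fix changed.
    """
    treated_ok = rates[treated] < threshold
    controls = [host for host in rates if host != treated]
    controls_still_broken = all(rates[host] >= threshold for host in controls)
    return treated_ok, controls_still_broken

print(verify_fix(error_rates, "web01"))
```

Checking the nine untouched servers matters as much as checking the fixed one: they are what stands in for the repeatability that production does not offer.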

Page 36: Monitoring and observability

Control Groups: Seems Easy

Web servers tend to be:
• homogeneous
• share-(nothing|little)
• independent

Page 37: Monitoring and observability

Control Groups: Not So Easy

Most other services aren’t so homogeneous and equal: databases, batch processes (think billing), orchestration middleware, message queues

Page 38: Monitoring and observability

Observability

Some might claim that seeing telemetry data is observation...

It is doubly indirect at best.

Page 39: Monitoring and observability

Observability

I want to directly see the errant behaviour

Page 40: Monitoring and observability

Observability is forgiving

In complex, multi-component architectures, errors can be observed as errant behaviour in many junction points.

Page 41: Monitoring and observability

Observing the network

tcpdump / snoop
wireshark

Page 42: Monitoring and observability

Observing the network

Looking at just the arrival of new connections

tcpdump -nnq -tttt -s 384 'tcp port 80 and (tcp[13] & (2|16) == 2)'
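
The filter expression above reads byte 13 of the TCP header, which holds the flag bits, and keeps packets where SYN (value 2) is set and ACK (value 16) is clear. That selects a client's initial connection attempt and excludes the server's SYN-ACK reply. The same test, sketched in Python with names that are assumptions for illustration:

```python
TCP_SYN = 0x02  # bit 1 of the TCP flags byte (TCP header offset 13)
TCP_ACK = 0x10  # bit 4 of the TCP flags byte

def is_new_connection(flags_byte):
    """Mirror of 'tcp[13] & (2|16) == 2': SYN set, ACK clear."""
    return (flags_byte & (TCP_SYN | TCP_ACK)) == TCP_SYN

print(is_new_connection(0x02))  # SYN only
print(is_new_connection(0x12))  # SYN+ACK
print(is_new_connection(0x10))  # ACK only
```

Masking with both bits before comparing is what distinguishes this from a plain "SYN is set" test, which would also match the server's half of the handshake.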

Page 43: Monitoring and observability

Observing the network

Looking at just the data arrival and departure times

tcpdump -nnq -tt -s 384 'tcp port 80 and (((ip[2:2] - ((ip[0]&0xf)*4)) - ((tcp[12]&0xf0)/4)) != 0)'

snoop -rq -ta -s 384 'tcp port 80 and (((ip[2:2] - ((ip[0]&0xf)*4)) - ((tcp[12]&0xf0)/4)) != 0)'
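
Both filters above keep only packets that carry TCP payload: the IP total length (`ip[2:2]`) minus the IP header length (`(ip[0]&0xf)*4`) minus the TCP header length (`(tcp[12]&0xf0)/4`) must be non-zero. This drops bare ACKs and handshake packets. A sketch of the same arithmetic in Python on raw IPv4 bytes; the function name and the hand-built sample packet are assumptions for illustration:

```python
import struct

def tcp_payload_length(ip_packet: bytes) -> int:
    """Bytes of TCP payload in a raw IPv4 packet (no link-layer header)."""
    ip_header_len = (ip_packet[0] & 0x0F) * 4           # IHL, counted in 32-bit words
    total_len = struct.unpack("!H", ip_packet[2:4])[0]  # IP total length field
    tcp_header = ip_packet[ip_header_len:]
    tcp_header_len = (tcp_header[12] & 0xF0) // 4       # data offset: high nibble, in words
    return total_len - ip_header_len - tcp_header_len

# Hand-built example: 20-byte IP header, 20-byte TCP header, 5 payload bytes.
ip_header = bytes([0x45, 0]) + struct.pack("!H", 45) + bytes(16)
tcp_header = bytes(12) + bytes([0x50]) + bytes(7)
packet = ip_header + tcp_header + b"hello"
print(tcp_payload_length(packet))
```

The division by 4 works because the data offset sits in the high nibble: masking with 0xf0 leaves the word count multiplied by 16, and dividing by 4 converts that to bytes.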

Page 44: Monitoring and observability

Observing the network

Finding the difference between a client’s question and a server’s answer (tcpdump | awk filter).

{
  gsub(".[0-9]+(: | >)"," & ");
  gsub("[:=]"," ");
  EP=sprintf("%s%s", ($4==".80")?$6:$3, ($4==".80")?$7:$4);
  if(S[EP] == "C" && $4 == ".80") {
    printf("%f %s\n", $1 - L[EP], EP);
  }
  S[EP]= ($4==".80")?"S":"C";
  L[EP]= $1;
}
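
For readers less comfortable with awk, the following Python sketch applies the same idea to `tcpdump -nnq -tt` output: remember the time of each client's last packet and, when the server on port 80 next sends data to that client, report the gap. It assumes IPv4 lines of the form `<epoch> IP <src>.<sport> > <dst>.<dport>: tcp <len>`, which may vary between tcpdump versions; the function and variable names are illustrative and not from the talk.

```python
import re

LINE_RE = re.compile(
    r"^(?P<ts>\d+\.\d+) IP (?P<src>\S+)\.(?P<sport>\d+) > "
    r"(?P<dst>\S+)\.(?P<dport>\d+): "
)

def response_latencies(lines, server_port="80"):
    """Yield (latency_seconds, client_endpoint) for each question/answer pair."""
    last_sender = {}  # client endpoint -> "C" (client spoke last) or "S"
    last_time = {}    # client endpoint -> timestamp of the last packet seen
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip anything that is not in the assumed format
        ts = float(m["ts"])
        from_server = m["sport"] == server_port
        if from_server:
            client = f"{m['dst']}.{m['dport']}"
        else:
            client = f"{m['src']}.{m['sport']}"
        if from_server and last_sender.get(client) == "C":
            yield ts - last_time[client], client
        last_sender[client] = "S" if from_server else "C"
        last_time[client] = ts

# Invented sample lines in the assumed format.
sample = [
    "1349200000.100000 IP 10.0.0.1.54321 > 10.0.0.2.80: tcp 120",
    "1349200000.350000 IP 10.0.0.2.80 > 10.0.0.1.54321: tcp 512",
]
for latency, client in response_latencies(sample):
    print(f"{latency:.6f} {client}")
```

Only the first server packet after a client packet produces a measurement, so the figure approximates time to first byte of the answer rather than time to complete it.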

Page 45: Monitoring and observability

Observing the network

Page 46: Monitoring and observability

Observing the network

Page 47: Monitoring and observability

Observing user-space

strace[1] / truss
gstack / pstack
gcore + gdb / dbx / mdb[2]

[1] http://www.cli.di.unipi.it/~gadducci/SOL-11/Local/referenceCards/LINUX_System_Call_Quick_Reference.pdf
[2] http://hub.opensolaris.org/bin/download/Community+Group+mdb/tips/mdb-cheatsheet.pdf

Page 48: Monitoring and observability

System call tracing

Watching sshd is a good way to get familiar.

truss -f -p `pgrep sshd`

Page 49: Monitoring and observability

System call tracing

An active web server is going to be like a firehose.

truss -f -p `pgrep httpd`
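
One way to cope with the firehose is to aggregate rather than read every line. The sketch below tallies system call names from saved trace output. It assumes each call appears as `name(` somewhere on its line, which holds for typical truss and strace output but is an assumption, not a guarantee; the sample lines are invented for illustration.

```python
import re
from collections import Counter

SYSCALL_RE = re.compile(r"\b([A-Za-z_][A-Za-z0-9_]*)\(")

def count_syscalls(lines):
    """Count the first 'name(' token on each trace line; skip lines without one."""
    counts = Counter()
    for line in lines:
        m = SYSCALL_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts

# Invented sample resembling 'pid: call(args) = result' trace lines.
sample = [
    "1234:\tread(5, 0x08047A30, 4096)\t= 312",
    "1234:\twrite(6, 0x08047A30, 312)\t= 312",
    "1240:\tread(7, 0x08047B10, 4096)\t= 0",
]
for name, n in count_syscalls(sample).most_common():
    print(f"{n:6d} {name}")
```

A per-call tally is usually enough to show whether a process is dominated by reads, writes, polls or something unexpected, which narrows down where to look in the full trace.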

Page 50: Monitoring and observability

Observing the system

DTrace

Live production demo or GTFO.

Page 51: Monitoring and observability

Thank You

Questions?
