How to Properly Blame Things for Causing Latency

Page 1: How to Properly Blame Things for Causing Latency

© 2016 Pivotal!1

An introduction to Distributed Tracing and Zipkin

Adrian Cole, Pivotal @adrianfcole

How to Properly Blame Things for Causing Latency

Page 2: How to Properly Blame Things for Causing Latency

Introduction

introduction

latency analysis

distributed tracing

zipkin

demo

wrapping up

@adrianfcole#zipkin

Page 3: How to Properly Blame Things for Causing Latency

@adrianfcole
• spring cloud at pivotal
• focused on distributed tracing
• helped open zipkin

Page 4: How to Properly Blame Things for Causing Latency

Latency Analysis

introduction

latency analysis

distributed tracing

zipkin

demo

wrapping up

@adrianfcole#zipkin

Page 5: How to Properly Blame Things for Causing Latency

Latency Analysis

Microservice and data pipeline architectures are often a graph of components distributed across a network.

A call graph or data flow can become delayed or fail due to the nature of the operation, components, or edges between them.

We want to understand our current architecture and troubleshoot latency problems, in production.

Page 6: How to Properly Blame Things for Causing Latency

Why is POST /things slow?

POST /things

Page 7: How to Properly Blame Things for Causing Latency

When was the event and how long did it take?

First log statement was at 15:31:29.103 GMT… last… 15:31:30.530

Server Received: 15:31:29.103

POST /things

Server Sent: 15:31:30.530
Duration: 1427 milliseconds
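
The 1427 ms duration is simply the difference between the two server timestamps. A minimal Java sketch of that arithmetic, with the timestamps hard-coded from the log above:

import java.time.Duration;
import java.time.LocalTime;

public class DurationFromLogs {
    public static void main(String[] args) {
        // Timestamps copied from the server log above (GMT)
        LocalTime serverReceived = LocalTime.parse("15:31:29.103");
        LocalTime serverSent = LocalTime.parse("15:31:30.530");

        // 15:31:30.530 - 15:31:29.103 = 1427 milliseconds
        long millis = Duration.between(serverReceived, serverSent).toMillis();
        System.out.println("Duration: " + millis + " milliseconds");
    }
}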

Page 8: How to Properly Blame Things for Causing Latency

wombats:10.2.3.47:8080

Server log says Client IP was 1.2.3.4

This is a shard in the wombats cluster, listening on 10.2.3.47:8080

Server Received: 15:31:29.103

POST /things

Server Sent: 15:31:30.530
Duration: 1427 milliseconds

Where did this happen?

peer.ipv4 1.2.3.4

Page 9: How to Properly Blame Things for Causing Latency

wombats:10.2.3.47:8080

Which event was it?

The http response header had “request-id: abcd-ffe”? Is that what you mean?

Server Received: 15:31:29.103

POST /things

Server Sent: 15:31:30.530
Duration: 1427 milliseconds

peer.ipv4 1.2.3.4
http.request-id abcd-ffe

Page 10: How to Properly Blame Things for Causing Latency

wombats:10.2.3.47:8080

Is it abnormal?

I’ll check other logs for this request id and see what I can find out.

Server Received: 15:31:29.103

POST /things

Server Sent: 15:31:30.530
Duration: 1427 milliseconds

Well, average response time for POST /things in the last 2 days is 100ms

peer.ipv4 1.2.3.4
http.request-id abcd-ffe

Page 11: How to Properly Blame Things for Causing Latency

wombats:10.2.3.47:8080

Achieving understanding

I searched the logs for others in that group… they took about the same time.

Server Received: 15:31:29.103

POST /things

Server Sent: 15:31:30.530
Duration: 1427 milliseconds

Ok, looks like this client is in the experimental group for HD uploads

peer.ipv4 1.2.3.4
http.request-id abcd-ffe
http.request.size 15 MiB
http.url …&features=HD-uploads

Page 12: How to Properly Blame Things for Causing Latency

POST /things

There are often two sides to the story.
Client Sent: 15:31:28.500
Client Received: 15:31:31.000

Duration: 2500 milliseconds

Server Received: 15:31:29.103

POST /things

Server Sent: 15:31:30.530
Duration: 1427 milliseconds

Page 13: How to Properly Blame Things for Causing Latency

and not all operations are on the critical path

Wire Send, Store

Async Store, Wire Send

POST /things

POST /things

Page 14: How to Properly Blame Things for Causing Latency

and not all operations are relevant

Wire Send, Store, Async

Async Store Failed, Wire Send

POST /things

POST /things

KQueueArrayWrapper.kev

UnboundedFuturePool-2

SelectorUtil.select
LockSupport.parkNan
ReferenceQueue.remove

Page 15: How to Properly Blame Things for Causing Latency

Service architecture isn’t this simple anymore

Single-server scenarios aren’t realistic or don’t fully explain latency.

(Image credit: David Vignoni, Gnome-fs-server.svg)

Page 16: How to Properly Blame Things for Causing Latency

Can we make troubleshooting wizard-free?

We no longer need wizards to deploy complex architectures.

We shouldn’t need wizards to troubleshoot them, either!

Page 17: How to Properly Blame Things for Causing Latency

Distributed Tracing

introduction

latency analysis

distributed tracing

zipkin

demo

wrapping up

@adrianfcole#zipkin

Page 18: How to Properly Blame Things for Causing Latency

Distributed Tracing commoditizes knowledge

Distributed tracing systems collect end-to-end latency graphs (traces) in near real-time.

You can compare traces to understand why certain requests take longer than others.

Page 19: How to Properly Blame Things for Causing Latency

Distributed Tracing Vocabulary

A Span is an individual operation that took place. A span contains timestamped events and tags.

A Trace is an end-to-end latency graph, composed of spans.
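
To make the vocabulary concrete, here is a minimal sketch of the data a reported span roughly carries, loosely modelled on Zipkin's span format; field names and types are simplified for illustration:

import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: one span = one operation, plus its timing, events and tags.
public class Span {
    String traceId;       // shared by every span in the same trace
    String parentId;      // links this span into the call graph (null at the root)
    String id;            // identifies this individual operation
    String name;          // the operation, e.g. "post /things"
    long timestampMicros; // when the operation started
    long durationMicros;  // how long it took
    Map<Long, String> events = new LinkedHashMap<>(); // timestamped events, e.g. "Server Received"
    Map<String, String> tags = new LinkedHashMap<>(); // e.g. "peer.ipv4" -> "1.2.3.4"
}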

Page 20: How to Properly Blame Things for Causing Latency

wombats:10.2.3.47:8080

A Span is an individual operation

Server Received

POST /things

Server Sent

Events

Tags

Operation

peer.ipv4 1.2.3.4
http.request-id abcd-ffe
http.request.size 15 MiB
http.url …&features=HD-uploads

Page 21: How to Properly Blame Things for Causing Latency

Tracing is logging important events

Wire Send Store

Async StoreWire Send

POST /things

POST /things

Page 22: How to Properly Blame Things for Causing Latency

Tracers record time, duration and host

Wire Send Store

Async StoreWire Send

POST /things

POST /things

Page 23: How to Properly Blame Things for Causing Latency

Tracers send trace data out of process

Tracers propagate IDs in-band, to tell the receiver there’s a trace in progress

Completed spans are reported out-of-band, to reduce overhead and allow for batching
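
In Zipkin, this in-band state is usually a few B3 headers copied onto each outgoing request. Here is a sketch of what that looks like on a plain HTTP client; the header names are the standard B3 ones, while the URL and ID values are made up for illustration:

import java.net.HttpURLConnection;
import java.net.URL;

public class B3Propagation {
    public static void main(String[] args) throws Exception {
        HttpURLConnection connection =
                (HttpURLConnection) new URL("http://localhost:8081/api").openConnection();

        // In-band: IDs ride along with the request so the next hop can continue the trace
        connection.setRequestProperty("X-B3-TraceId", "463ac35c9f6413ad48485a3953bb6124");
        connection.setRequestProperty("X-B3-SpanId", "a2fb4a1d1a96d312");
        connection.setRequestProperty("X-B3-ParentSpanId", "0020000000000001");
        connection.setRequestProperty("X-B3-Sampled", "1");

        connection.getResponseCode();

        // Out-of-band: the completed span (timing, events, tags) is reported to Zipkin
        // separately, typically batched over HTTP or Kafka, not attached to this request.
    }
}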

Page 24: How to Properly Blame Things for Causing Latency

Tracers usually live in your application

Tracers execute in your production apps! They are written so that they do not log too much and do not cause applications to crash.

- propagate structural data in-band, and the rest out-of-band
- have instrumentation or sampling policy to manage volume (see the sampler sketch below)

- often include opinionated instrumentation of layers such as HTTP
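
As a feel for how a sampling policy keeps volume manageable, here is a minimal sketch of a probabilistic sampler; the rate and decision rule are illustrative rather than any tracer's exact implementation:

import java.util.concurrent.ThreadLocalRandom;

// Illustrative sampler: record roughly `rate` of new traces, drop the rest up front.
public class ProbabilisticSampler {
    private final float rate;

    public ProbabilisticSampler(float rate) {
        this.rate = rate; // e.g. 0.01f keeps about 1% of traces
    }

    public boolean isSampled() {
        // The decision is made once at the root of a trace and then propagated
        // (e.g. via X-B3-Sampled), so every hop keeps or drops the same trace.
        return ThreadLocalRandom.current().nextFloat() < rate;
    }
}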

Page 25: How to Properly Blame Things for Causing Latency

Tracing Systems are Observability Tools

Tracing systems collect, process and present data reported by tracers.

- aggregate spans into trace trees
- provide query and visualization for latency analysis (see the query sketch below)

- have retention policy (usually days)
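
Latency analysis often starts with a query like "show me recent traces for this service". A sketch against Zipkin's v1 HTTP API; the service name "wombats" and the server at localhost:9411 are assumptions for illustration:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class QueryRecentTraces {
    public static void main(String[] args) throws Exception {
        // Ask a local Zipkin server for up to 10 recent traces involving the "wombats" service
        URL url = new URL("http://localhost:9411/api/v1/traces?serviceName=wombats&limit=10");
        try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()))) {
            in.lines().forEach(System.out::println); // each trace is a JSON array of spans
        }
    }
}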

Page 26: How to Properly Blame Things for Causing Latency

Tracing is not just for latency

Some wins unrelated to latency

- Understand your architecture
- Find services that aren’t used

- Reduce time spent on triage

Page 27: How to Properly Blame Things for Causing Latency

Zipkin

introduction

latency analysis

distributed tracing

zipkin

demo

wrapping up

@adrianfcole#zipkin

Page 28: How to Properly Blame Things for Causing Latency

Zipkin is a distributed tracing system

Page 29: How to Properly Blame Things for Causing Latency

Zipkin has pluggable architecture

Tracers report spans via HTTP or Kafka.

Servers collect spans, storing them in MySQL, Cassandra, or Elasticsearch.

Users query for traces via Zipkin’s Web UI or API.

services:
  storage:
    image: openzipkin/zipkin-cassandra:1.6.0
    container_name: cassandra
    ports:
      - 9042:9042
  server:
    image: openzipkin/zipkin:1.6.0
    environment:
      - STORAGE_TYPE=cassandra
      - CASSANDRA_CONTACT_POINTS=cassandra
    ports:
      - 9411:9411
    depends_on:
      - storage

Page 30: How to Properly Blame Things for Causing Latency

Zipkin has starter architecture

Tracing is new for a lot of folks.

For many, the MySQL option is a good start, as it is familiar.

services:
  storage:
    image: openzipkin/zipkin-mysql:1.6.0
    container_name: mysql
    ports:
      - 3306:3306
  server:
    image: openzipkin/zipkin:1.6.0
    environment:
      - STORAGE_TYPE=mysql
      - MYSQL_HOST=mysql
    ports:
      - 9411:9411
    depends_on:
      - storage

Page 31: How to Properly Blame Things for Causing Latency

Zipkin can be as simple as a single file

$ curl -SL 'https://search.maven.org/remote_content?g=io.zipkin.java&a=zipkin-server&v=LATEST&c=exec' > zipkin.jar
$ SELF_TRACING_ENABLED=true java -jar zipkin.jar

[Spring Boot ASCII-art banner]
:: Spring Boot ::        (v1.4.0.RELEASE)

2016-08-01 18:50:07.098 INFO 8526 --- [ main] zipkin.server.ZipkinServer : Starting ZipkinServer on acole with PID 8526 (/Users/acole/oss/sleuth-webmvc-example/zipkin.jar started by acole in /Users/acole/oss/sleuth-webmvc-example) —snip—

$ curl -s localhost:9411/api/v1/services | jq .
[
  "zipkin-server"
]

Page 32: How to Properly Blame Things for Causing Latency

Zipkin lives in GitHub

Zipkin was created by Twitter in 2012. In 2015, OpenZipkin became the primary fork.

OpenZipkin is an org on GitHub. It contains tracers, the OpenAPI spec, service components, and Docker images.

https://github.com/openzipkin

Page 33: How to Properly Blame Things for Causing Latency

Demo

introduction

latency analysis

distributed tracing

zipkin

demo

wrapping up

@adrianfcole#zipkin

Page 34: How to Properly Blame Things for Causing Latency

Two Spring Boot (Java) services collaborate over HTTP.

Zipkin will show how long the whole operation took, as well as how much time was spent in each service.

https://github.com/adriancole/sleuth-webmvc-example

Distributed Tracing across Spring Boot apps
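
The shape of the demo is roughly the following; the port, paths and class names below are placeholders rather than the exact ones in the linked repository:

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.context.annotation.Bean;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.client.RestTemplate;

// Frontend service: serves the user request and calls the backend over HTTP.
// With Spring Cloud Sleuth on the classpath, both sides end up in one Zipkin trace.
@SpringBootApplication
@RestController
public class Frontend {

    @Autowired
    private RestTemplate client;

    @Bean
    RestTemplate restTemplate() {
        return new RestTemplate(); // Sleuth instruments this bean to propagate trace IDs
    }

    @GetMapping("/")
    public String callBackend() {
        // Time spent here and in the backend appear as separate spans in the same trace
        return client.getForObject("http://localhost:9000/api", String.class);
    }

    public static void main(String[] args) {
        SpringApplication.run(Frontend.class, args);
    }
}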

Page 35: How to Properly Blame Things for Causing Latency

Web requests in the demo are served by Spring MVC controllers. Tracing of these is performed automatically by Spring Cloud Sleuth.

Spring Cloud Sleuth reports to Zipkin via HTTP by depending on spring-cloud-sleuth-zipkin.

https://cloud.spring.io/spring-cloud-sleuth/

Spring Cloud Sleuth Java
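
To report to the Zipkin server started earlier, the usual Sleuth 1.x recipe is to add spring-cloud-sleuth-zipkin to the app and, while experimenting, trace every request. A hedged sketch assuming the Sleuth 1.x Sampler API; exact class and property names can differ across versions:

import org.springframework.cloud.sleuth.Sampler;
import org.springframework.cloud.sleuth.sampler.AlwaysSampler;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

// Assumes spring-cloud-sleuth-zipkin is on the classpath, so spans are reported over
// HTTP to http://localhost:9411 (the default) once this sampler says "record it".
@Configuration
public class TracingConfig {

    @Bean
    public Sampler defaultSampler() {
        // Demo-only: record every trace. In production, prefer a percentage-based sampler.
        return new AlwaysSampler();
    }
}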

Page 36: How to Properly Blame Things for Causing Latency

Wrapping Up

introduction

latency analysis

distributed tracing

zipkin

demo

wrapping up

@adrianfcole#zipkin

Page 37: How to Properly Blame Things for Causing Latency

Wrapping up

Start by sending traces directly to a Zipkin server.

Grow into fanciness as you need it: sampling, streaming, etc.

Remember you are not alone!

@adrianfcole#zipkin

gitter.im/spring-cloud/spring-cloud-sleuth

gitter.im/openzipkin/zipkin

