Date post: | 11-Apr-2017 |
Upload: | weaveworks |
Loki: a Zipkin/Prometheus Mashup
@tom_wilkie, CNCFCon Berlin April 2017
Zipkin + Prometheus = Loki
Why did I write my own tracer?
Debugging a latency performance issue with Cortex…
[Diagram: Distributor → Ingester, Ingester, …]
Well, that's my rationalisation…
In reality, this is attempt #2
• Prototype “Weave Tracer” circa 2015
• Concept didn’t require application instrumentation
• Used ptrace to intercept syscalls and infer application behaviour
• Kinda worked, for a very limited definition of “worked”
Prometheus = Greek god. Loki = Norse equivalent?
So what makes Loki different?
Prometheus is to Graphite as
Loki is to Zipkin
Push vs Pull
https://prometheus.io/docs/introduction/faq/#why-do-you-pull-rather-than-push?
https://prometheus.io/blog/2016/07/23/pull-does-not-scale-or-does-it/
Graphite
[Diagram: your jobs pushing to Graphite]
• Jobs must know where monitoring is
• Can overwhelm Graphite with too many samples
Prometheus
[Diagram: Prometheus scraping your jobs]
• Tell Prometheus where jobs are (via service discovery)
• Prometheus can back off when overwhelmed
• Prometheus knows the identity of each job
Zipkin
[Diagram: your jobs pushing to Zipkin]
Loki
[Diagram: Loki scraping your jobs]
[Diagram: Loki scraping spans from the Loki client library inside your job, via http://job/traces]
• Client library keeps pending spans in an in-memory ring buffer.
• /traces HTTP handler grabs all the in-memory spans and serialises them using Thrift.
• Spans will be dropped if not collected frequently enough.
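The client-side behaviour in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the Loki client library itself: the `Span` and `SpanBuffer` types are hypothetical, JSON stands in for Thrift so the sketch has no external dependencies, and it assumes the handler empties the buffer on each scrape.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
)

// Span is a simplified, hypothetical span record (not the Zipkin Thrift type).
type Span struct {
	TraceID string `json:"traceId"`
	Name    string `json:"name"`
}

// SpanBuffer is a fixed-size ring buffer of pending spans.
// When it is full, the oldest span is overwritten (dropped).
type SpanBuffer struct {
	mu    sync.Mutex
	spans []Span
	next  int // slot that the next Add will write to
	count int // number of valid spans currently held
}

func NewSpanBuffer(capacity int) *SpanBuffer {
	return &SpanBuffer{spans: make([]Span, capacity)}
}

// Add stores a span, overwriting the oldest one if the buffer is full.
func (b *SpanBuffer) Add(s Span) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.spans[b.next] = s
	b.next = (b.next + 1) % len(b.spans)
	if b.count < len(b.spans) {
		b.count++
	}
}

// Drain returns all pending spans, oldest first, and empties the buffer.
func (b *SpanBuffer) Drain() []Span {
	b.mu.Lock()
	defer b.mu.Unlock()
	out := make([]Span, 0, b.count)
	start := (b.next - b.count + len(b.spans)) % len(b.spans)
	for i := 0; i < b.count; i++ {
		out = append(out, b.spans[(start+i)%len(b.spans)])
	}
	b.count = 0
	return out
}

// ServeHTTP hands every pending span to the scraper.
// Assumption: JSON is used here in place of Thrift.
func (b *SpanBuffer) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(b.Drain())
}

func main() {
	buf := NewSpanBuffer(2)
	for _, name := range []string{"a", "b", "c"} {
		buf.Add(Span{TraceID: "1", Name: name})
	}
	// Simulate one scrape; span "a" has already been overwritten.
	rec := httptest.NewRecorder()
	buf.ServeHTTP(rec, httptest.NewRequest("GET", "/traces", nil))
	fmt.Print(rec.Body.String())
}
```

In a real service the buffer would be registered with something like `http.Handle("/traces", buf)`; the in-process recorder is only there so the sketch runs to completion.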
• Retrieval library ‘knows’ identity of scraped endpoints, adds that to received spans
• … jobs don’t need to know their own identity
• … can be consistent with identity used in Prometheus
• Naive in-memory storage implementation
• … makes queries slow, as it's just a loop.
• Zipkin-compatible API endpoints
• UI _is_ the Zipkin UI
[Diagram: Loki components: Prometheus retrieval library, in-memory storage, Zipkin API, Zipkin UI]
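A rough sketch of the server side described on this slide: the scraper stamps each span with the identity of the endpoint it came from, and the naive store answers queries by looping over everything it holds. All names here (`Target`, `Label`, `Store`) are hypothetical and chosen for illustration; they are not taken from the Loki source.

```go
package main

import "fmt"

// Span is a simplified, hypothetical span. Job and Instance are
// filled in by the scraper, not by the client that produced the span.
type Span struct {
	TraceID  string
	Name     string
	Job      string
	Instance string
}

// Target is what service discovery tells the scraper about an endpoint.
type Target struct {
	Job      string
	Instance string
}

// Label returns copies of the scraped spans with the target's identity attached.
func Label(t Target, spans []Span) []Span {
	out := make([]Span, len(spans))
	for i, s := range spans {
		s.Job, s.Instance = t.Job, t.Instance
		out[i] = s
	}
	return out
}

// Store is a naive in-memory store: an append-only slice.
// It is not safe for concurrent use; a real server would need locking.
type Store struct {
	spans []Span
}

func (st *Store) Append(spans []Span) {
	st.spans = append(st.spans, spans...)
}

// Trace returns the spans of one trace by scanning every stored span,
// so each query costs O(n) in the total number of spans.
func (st *Store) Trace(id string) []Span {
	var out []Span
	for _, s := range st.spans {
		if s.TraceID == id {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	st := &Store{}
	dist := Target{Job: "distributor", Instance: "10.0.0.1:80"}
	ing := Target{Job: "ingester", Instance: "10.0.0.2:80"}
	st.Append(Label(dist, []Span{{TraceID: "t1", Name: "push"}}))
	st.Append(Label(ing, []Span{{TraceID: "t1", Name: "append"}, {TraceID: "t2", Name: "query"}}))
	for _, s := range st.Trace("t1") {
		fmt.Printf("%s %s (%s @ %s)\n", s.TraceID, s.Name, s.Job, s.Instance)
	}
}
```

Because the labels come from the scraper's view of its targets, the same job and instance values could be reused for metrics and traces.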
It's all open source:
https://github.com/weaveworks-experiments/loki
…and it’s written in Go
This all sounds great! Where’s the catch?
❌ Client library doesn’t actually support multiple scrapers (yet)
❌ Loki query performance sucks (for now)
❌ Loki single-process architecture limits scalability
❌ Can drop spans; gets worse with scrape jitter
… that Cortex performance issue
It was garbage collection…
100ms ➡ 25ms
Demo
TODO
Client Library
• Make it support multiple scrapers
• Move away from Thrift to protos
• More languages
• Useful HTML /traces
Loki Server
• Local storage with BoltDB
• Make queries faster
• Make it distributed, use cloud storage
Why did I write my own tracer?
Because with OpenTracing, I can.
Thank you! Questions?