Page 1: Rethinking metrics: metrics 2.0 @ Lisa 2014

rethinking metrics:

metrics 2.0

Page 2: Rethinking metrics: metrics 2.0 @ Lisa 2014
Page 3: Rethinking metrics: metrics 2.0 @ Lisa 2014
Page 4: Rethinking metrics: metrics 2.0 @ Lisa 2014
Page 5: Rethinking metrics: metrics 2.0 @ Lisa 2014

instagram.com/wrongrob

Page 6: Rethinking metrics: metrics 2.0 @ Lisa 2014

vimeo.com/43800150

Page 7: Rethinking metrics: metrics 2.0 @ Lisa 2014
Page 8: Rethinking metrics: metrics 2.0 @ Lisa 2014
Page 9: Rethinking metrics: metrics 2.0 @ Lisa 2014
Page 10: Rethinking metrics: metrics 2.0 @ Lisa 2014

problems

Metrics 2.0 concepts

implementations, uses & ideas

Page 11: Rethinking metrics: metrics 2.0 @ Lisa 2014

Mostly

graphite

Page 12: Rethinking metrics: metrics 2.0 @ Lisa 2014

terminology

sync

Page 13: Rethinking metrics: metrics 2.0 @ Lisa 2014

(1234567890, 82)

(1234567900, 123)

(1234567910, 109)

(1234567920, 77)

db15.mysql.queries_running

host=db15 mysql.queries_running

Page 14: Rethinking metrics: metrics 2.0 @ Lisa 2014

Problems

Page 15: Rethinking metrics: metrics 2.0 @ Lisa 2014

Vimeo.com page requests/s?

server X disk write?

Page 16: Rethinking metrics: metrics 2.0 @ Lisa 2014

stats.hits.vimeo_com

stats_counts.hits.vimeo_com

stats.*.vimeo_requests

collectd.db.disk.sda1.disk_time.write

Page 17: Rethinking metrics: metrics 2.0 @ Lisa 2014

Terminology? Meaning?

Prefix?

Unit?

Aggregation?

Source?

Understanding metrics

Page 18: Rethinking metrics: metrics 2.0 @ Lisa 2014
Page 19: Rethinking metrics: metrics 2.0 @ Lisa 2014

Unclear, inconsistent terminology, format

tightly coupled

lack information

Page 20: Rethinking metrics: metrics 2.0 @ Lisa 2014

http://litlquest.com/forest-trees/see-forest-trees-2

Page 21: Rethinking metrics: metrics 2.0 @ Lisa 2014

O(S*P*A*C)

S = # Sources

P = # People

A = # Aggregators

C = # Complexity

Page 22: Rethinking metrics: metrics 2.0 @ Lisa 2014
Page 23: Rethinking metrics: metrics 2.0 @ Lisa 2014

Graphs and dashboards are a huge time sink.

Page 24: Rethinking metrics: metrics 2.0 @ Lisa 2014

metrics 2.0

concepts

Page 25: Rethinking metrics: metrics 2.0 @ Lisa 2014

Self-describing

Standardized

Orthogonal dimensions

Page 26: Rethinking metrics: metrics 2.0 @ Lisa 2014

stats.timers.dfs5.proxy-server.object.GET.200.timing.upper_90

Page 27: Rethinking metrics: metrics 2.0 @ Lisa 2014

{

server: dfvimeodfsproxy5,

http_method: GET,

http_code: 200,

unit: ms,

metric_type: gauge,

stat: upper_90,

swift_type: object

}

Page 28: Rethinking metrics: metrics 2.0 @ Lisa 2014

SI + IEC

B Err Warn Conn File Req …

MB/s Err/d Req/h ...

Page 29: Rethinking metrics: metrics 2.0 @ Lisa 2014

allow more characters

unit: Req/s, site: vimeo.com, ...

Page 30: Rethinking metrics: metrics 2.0 @ Lisa 2014

Metadata

meta: {

src: proxy.py:458,

from: diamond

}

Page 31: Rethinking metrics: metrics 2.0 @ Lisa 2014

metrics20.org

Page 32: Rethinking metrics: metrics 2.0 @ Lisa 2014

Immediate understanding

of metrics

Minimize time to graphs,

alerting, troubleshooting

compatibility & flexibility

in tooling

Page 33: Rethinking metrics: metrics 2.0 @ Lisa 2014

Implementations: getting the data

Page 34: Rethinking metrics: metrics 2.0 @ Lisa 2014
Page 35: Rethinking metrics: metrics 2.0 @ Lisa 2014

Source formats

…service=foo instance=host unit=B 123 1234567890

{s}foo.{i}host.{u}B 123 1234567890

<uuid> 125 1234567890 # separate data
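
A minimal sketch of reading the first source format above, assuming a line consists of space-separated key=value tags followed by a value and a Unix timestamp. The function name parse_line is hypothetical and only the visible part of the slide's example line is used; this is not the parser of any tool named in this talk.

```python
def parse_line(line):
    """Split 'k=v k=v ... value timestamp' into (tags, value, timestamp).

    Assumption: the last two fields are the value and the timestamp,
    and everything before them is a key=value tag.
    """
    *tag_parts, value, timestamp = line.split()
    tags = dict(part.split("=", 1) for part in tag_parts)
    return tags, float(value), int(timestamp)


tags, value, ts = parse_line("service=foo instance=host unit=B 123 1234567890")
print(tags, value, ts)
```

Keeping the tags as a plain dictionary makes every dimension equally addressable, which is what the later query examples rely on.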

Page 36: Rethinking metrics: metrics 2.0 @ Lisa 2014

Carbon-tagger

…stats.gauges.host.foo 125 1234567890

service=foo instance=host target_type=gauge unit=B 123 1234567890

Page 37: Rethinking metrics: metrics 2.0 @ Lisa 2014
Page 38: Rethinking metrics: metrics 2.0 @ Lisa 2014

Statsdaemon

unit=B

unit=B

...

unit=ms

unit=ms

...

unit=B/s

unit=ms stat=mean

unit=ms stat=upper_90

...
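
The slide suggests that an aggregator can update tags as it transforms data: counters in unit=B come out as unit=B/s, and timers in unit=ms come out as one series per statistic. A rough sketch of that idea follows, assuming counters are flushed as per-second rates; the function names are hypothetical and this is not statsdaemon's actual code.

```python
def counter_output_tags(tags):
    """Counter flushed as a per-second rate: unit=B becomes unit=B/s."""
    out = dict(tags)
    out["unit"] = tags["unit"] + "/s"
    return out


def timer_output_tags(tags, stats=("mean", "upper_90")):
    """Timer: one output series per statistic; the unit is unchanged."""
    return [dict(tags, stat=s) for s in stats]


print(counter_output_tags({"unit": "B"}))
print(timer_output_tags({"unit": "ms"}))
```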

Page 39: Rethinking metrics: metrics 2.0 @ Lisa 2014

Keep metric

tags in sync with data

Page 40: Rethinking metrics: metrics 2.0 @ Lisa 2014

Implementations

Graphing & dashboarding

Visualization

Alerting

Page 41: Rethinking metrics: metrics 2.0 @ Lisa 2014

Graphing & Dashboarding

Page 42: Rethinking metrics: metrics 2.0 @ Lisa 2014

Graph-Explorer

Page 43: Rethinking metrics: metrics 2.0 @ Lisa 2014
Page 44: Rethinking metrics: metrics 2.0 @ Lisa 2014

Graph-Explorer queries 101

proxy-server swift server:regex unit=ms

(AND)
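
To illustrate how such a query could select metrics, here is a simplified matcher. It assumes three kinds of terms, all ANDed together: key=value for an exact tag match, key:regex for a regular-expression match on a tag value, and a bare word that must occur in some tag key or value. These semantics and the "service" tag in the example are assumptions for illustration, not Graph-Explorer's real implementation.

```python
import re


def matches(query, tags):
    """Return True if every space-separated query term matches (AND)."""
    for term in query.split():
        if "=" in term:
            # key=value: exact match on one tag
            key, val = term.split("=", 1)
            if tags.get(key) != val:
                return False
        elif ":" in term:
            # key:regex: regular-expression search in one tag's value
            key, pattern = term.split(":", 1)
            if key not in tags or not re.search(pattern, tags[key]):
                return False
        else:
            # bare word: substring of any tag key or value
            if not any(term in k or term in v for k, v in tags.items()):
                return False
    return True


metric = {
    "service": "proxy-server",  # hypothetical tag
    "server": "dfvimeodfsproxy5",
    "swift_type": "object",
    "unit": "ms",
}
print(matches("proxy-server swift server:dfs.*proxy unit=ms", metric))
```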

Page 45: Rethinking metrics: metrics 2.0 @ Lisa 2014
Page 46: Rethinking metrics: metrics 2.0 @ Lisa 2014
Page 47: Rethinking metrics: metrics 2.0 @ Lisa 2014
Page 48: Rethinking metrics: metrics 2.0 @ Lisa 2014
Page 49: Rethinking metrics: metrics 2.0 @ Lisa 2014
Page 50: Rethinking metrics: metrics 2.0 @ Lisa 2014
Page 51: Rethinking metrics: metrics 2.0 @ Lisa 2014
Page 52: Rethinking metrics: metrics 2.0 @ Lisa 2014

upper_90 (or stat=upper_90)

from <datetime> to <datetime>

avg over <timespec> (5M, 1h, 3d, ...)

Page 53: Rethinking metrics: metrics 2.0 @ Lisa 2014
Page 54: Rethinking metrics: metrics 2.0 @ Lisa 2014

Compare object put/get

stack …

http_method:(PUT|GET)

swift_type=object

avg by http_code,server

Page 55: Rethinking metrics: metrics 2.0 @ Lisa 2014
Page 56: Rethinking metrics: metrics 2.0 @ Lisa 2014

Comparing servers

http_method:(PUT|GET)

group by unit,target_type

avg by http_code,swift_type,http_method
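
One way to read "avg by" is: ignore the listed tags, then average together all series that are otherwise identical. The sketch below works under that assumption; the function name avg_by and the sample data are made up for illustration.

```python
from collections import defaultdict


def avg_by(series, drop_tags):
    """Average series point-wise after ignoring the tags in drop_tags.

    series is a list of (tags, values) pairs with equally long value lists.
    """
    buckets = defaultdict(list)
    for tags, values in series:
        key = tuple(sorted((k, v) for k, v in tags.items() if k not in drop_tags))
        buckets[key].append(values)
    result = []
    for key, rows in buckets.items():
        averaged = [sum(col) / len(col) for col in zip(*rows)]
        result.append((dict(key), averaged))
    return result


series = [
    ({"server": "a", "http_code": "200", "unit": "ms"}, [10.0, 20.0]),
    ({"server": "a", "http_code": "404", "unit": "ms"}, [30.0, 40.0]),
    ({"server": "b", "http_code": "200", "unit": "ms"}, [5.0, 5.0]),
]
for tags, values in avg_by(series, {"http_code"}):
    print(tags, values)
```

A "group by" could reuse the same bucketing key, but emit one graph per bucket instead of one averaged series.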

Page 57: Rethinking metrics: metrics 2.0 @ Lisa 2014
Page 58: Rethinking metrics: metrics 2.0 @ Lisa 2014

transcode unit=Job/s avg over <time>

from <datetime> to <datetime>

Page 59: Rethinking metrics: metrics 2.0 @ Lisa 2014

Note: data is obfuscated

Page 60: Rethinking metrics: metrics 2.0 @ Lisa 2014

Bucketing

sum by zone:eu-west|us-east|ap-southeast|us-west|sa-east|vimeo-df|vimeo-lv

group by state

Page 61: Rethinking metrics: metrics 2.0 @ Lisa 2014

Note: data is obfuscated

Page 62: Rethinking metrics: metrics 2.0 @ Lisa 2014

Compare job states per region

group by zone

Page 63: Rethinking metrics: metrics 2.0 @ Lisa 2014

Note: data is obfuscated

Page 64: Rethinking metrics: metrics 2.0 @ Lisa 2014

Unit conversion

unit=Mb/s network server:regex sum by server

Page 65: Rethinking metrics: metrics 2.0 @ Lisa 2014
Page 66: Rethinking metrics: metrics 2.0 @ Lisa 2014
Page 67: Rethinking metrics: metrics 2.0 @ Lisa 2014

Integration

Metric unit=B/s Query unit=TB

Page 68: Rethinking metrics: metrics 2.0 @ Lisa 2014
Page 69: Rethinking metrics: metrics 2.0 @ Lisa 2014

Deriving

Metric unit=B Query unit=GB/d
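
Because the unit is known, a rate can be integrated into a total (B/s to B) and a total can be derived into a rate (B to B/s), with a plain scaling step for the prefix and time base. The following is a sketch of that arithmetic, assuming (timestamp, value) points with timestamps in seconds and decimal (SI) giga; the function names are hypothetical.

```python
def integrate(points):
    """Turn (timestamp, per-second rate) points into a running total.

    Each interval is weighted with the rate observed at its start.
    """
    total = 0.0
    out = []
    for (t0, rate), (t1, _) in zip(points, points[1:]):
        total += rate * (t1 - t0)
        out.append((t1, total))
    return out


def derive(points):
    """Turn (timestamp, total) points into per-second rates."""
    return [
        (t1, (v1 - v0) / (t1 - t0))
        for (t0, v0), (t1, v1) in zip(points, points[1:])
    ]


def b_per_s_to_gb_per_d(rate):
    """Scale B/s to GB/d: 86400 seconds per day, 1e9 bytes per GB."""
    return rate * 86400 / 1e9


print(integrate([(0, 100.0), (10, 100.0), (20, 50.0)]))
print(derive([(0, 0.0), (10, 1000.0), (20, 2000.0)]))
print(b_per_s_to_gb_per_d(125000.0))
```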

Page 70: Rethinking metrics: metrics 2.0 @ Lisa 2014
Page 71: Rethinking metrics: metrics 2.0 @ Lisa 2014

Highly extensible

Equal rights for all tags

→ real world use drives spec

Page 72: Rethinking metrics: metrics 2.0 @ Lisa 2014

SI + IEC

B Err Conn File Req

Anything

Err/s Anything/s MAnything/d

Page 73: Rethinking metrics: metrics 2.0 @ Lisa 2014
Page 74: Rethinking metrics: metrics 2.0 @ Lisa 2014
Page 75: Rethinking metrics: metrics 2.0 @ Lisa 2014
Page 76: Rethinking metrics: metrics 2.0 @ Lisa 2014

Minimize time from information need to insights.

Page 77: Rethinking metrics: metrics 2.0 @ Lisa 2014

Future Work

Page 78: Rethinking metrics: metrics 2.0 @ Lisa 2014

Facet-based suggestions

Custom hierarchies

Page 79: Rethinking metrics: metrics 2.0 @ Lisa 2014

Tag insights

Page 80: Rethinking metrics: metrics 2.0 @ Lisa 2014

● Storage aggregation rules

● graphite API functions such as cumulative, summarize and smartSummarize

● consolidateBy & Graph renderers

Page 81: Rethinking metrics: metrics 2.0 @ Lisa 2014
Page 82: Rethinking metrics: metrics 2.0 @ Lisa 2014
Page 83: Rethinking metrics: metrics 2.0 @ Lisa 2014

stat=upper/lower/mean/... (assume avg otherwise)

Page 84: Rethinking metrics: metrics 2.0 @ Lisa 2014

Visualizations

Page 85: Rethinking metrics: metrics 2.0 @ Lisa 2014

From: dygraphs.com

Page 86: Rethinking metrics: metrics 2.0 @ Lisa 2014

bin=10

bin=20

bin=30

bin=40

bin=50

bin=100

Page 87: Rethinking metrics: metrics 2.0 @ Lisa 2014
Page 88: Rethinking metrics: metrics 2.0 @ Lisa 2014

Alerting

Page 89: Rethinking metrics: metrics 2.0 @ Lisa 2014
Page 90: Rethinking metrics: metrics 2.0 @ Lisa 2014

unit=Err/s

Page 91: Rethinking metrics: metrics 2.0 @ Lisa 2014
Page 92: Rethinking metrics: metrics 2.0 @ Lisa 2014
Page 93: Rethinking metrics: metrics 2.0 @ Lisa 2014
Page 94: Rethinking metrics: metrics 2.0 @ Lisa 2014
Page 95: Rethinking metrics: metrics 2.0 @ Lisa 2014

Classifying clusters of

cause & effect

Page 96: Rethinking metrics: metrics 2.0 @ Lisa 2014

Different algos for different

metric categories

Page 97: Rethinking metrics: metrics 2.0 @ Lisa 2014

Alert criticality & routing based

on tags

Page 98: Rethinking metrics: metrics 2.0 @ Lisa 2014

integrating logs & metrics

Page 99: Rethinking metrics: metrics 2.0 @ Lisa 2014
Page 100: Rethinking metrics: metrics 2.0 @ Lisa 2014
Page 101: Rethinking metrics: metrics 2.0 @ Lisa 2014

Algorithms leverage both

logs and metrics

Page 102: Rethinking metrics: metrics 2.0 @ Lisa 2014

Conclusion

structured self-describing standardized

metrics = enabler

Page 103: Rethinking metrics: metrics 2.0 @ Lisa 2014

Conclusion

Concerns? Ideas? Advice?

Ready for early adopters!

Work with me on next-gen telemetry!

Page 104: Rethinking metrics: metrics 2.0 @ Lisa 2014

Seen in this presentation:

metrics20.org

vimeo.github.io/graph-explorer

github.com/vimeo/timeserieswidget

github.com/vimeo/carbon-tagger

github.com/vimeo/statsdaemon

github.com/graphite-ng/carbon-relay-ng

github.com/Dieterbe/anthracite

Page 105: Rethinking metrics: metrics 2.0 @ Lisa 2014

You might also like:

github.com/vimeo/graphite-influxdb

github.com/vimeo/graphite-api-influxdb-docker

github.com/vimeo/whisper-to-influxdb

github.com/Dieterbe/influx-cli

github.com/graphite-ng/graphite-ng

github.com/vimeo/smoketcp

github.com/vimeo/tailgate

Page 106: Rethinking metrics: metrics 2.0 @ Lisa 2014

Stay in touch!

Metrics20 google group

it-telemetry google group

twitter.com/[email protected]@vimeo.com

Lisa labs office hours after lunch

Q & A

Page 107: Rethinking metrics: metrics 2.0 @ Lisa 2014

Bonus round

Page 108: Rethinking metrics: metrics 2.0 @ Lisa 2014

Dashboard definition

queries = [

'cpu usage sum by core',

'mem unit=B !total group by type:swap',

'stack network unit=Mb/s',

'unit=B (free|used) group by =mountpoint'

]

Page 109: Rethinking metrics: metrics 2.0 @ Lisa 2014
Page 110: Rethinking metrics: metrics 2.0 @ Lisa 2014

Catchall plugins

stats.dfvimeocliapp2.twitter.error

{

"n1": "dfvimeocliapp2",

"n2": "twitter",

"n3": "error",

"plugin": "catchall_statsd",

"source": "statsd",

"target_type": "rate",

"unit": "unknown/s"

}
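
The mapping on this slide can be sketched as follows: when no specific plugin recognizes a dotted name, each node after the leading "stats" prefix becomes a positional tag n1, n2, ... and the unit is marked as unknown. The function below reproduces the slide's example under that assumption; it is an illustration, not the actual catchall plugin code.

```python
def catchall_statsd(name):
    """Map an unrecognized 'stats.a.b.c' name onto positional tags."""
    nodes = name.split(".")[1:]  # assumption: drop the leading 'stats' prefix
    tags = {f"n{i}": node for i, node in enumerate(nodes, start=1)}
    tags.update(
        plugin="catchall_statsd",
        source="statsd",
        target_type="rate",
        unit="unknown/s",
    )
    return tags


print(catchall_statsd("stats.dfvimeocliapp2.twitter.error"))
```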

Page 111: Rethinking metrics: metrics 2.0 @ Lisa 2014

Equivalence

servers.host.cpu.total.iowait → "core": "_sum_"

servers.host.cpu.<core-number>.iowait

servers.host.loadavg.15

