Post on 14-Jan-2016

Mean Time to Innocence

Your Dashboards are Green – but your end users are still complaining. Now What?

Phil Stanhope, October 2015

30B Real-Time Steering Decisions per day

6B traceroute and RUM latency measurements per day

That’s over 6 light-years!

13 Hops per traceroute

Traffic covering 80% of ASNs on the internet seen every few minutes

52K ASNs monitored

200M BGP updates per day

No major CDN can deliver 99.9% uptime from the end user’s perspective. But is it their fault?

Real Time Feeds

Cooked Time Series Data – Near Real Time

Pre-cooked across ~1000 dimensions every 5 minutes (Geography, Mobile Networks, Fixed Line Networks, Target Markets, Cities and IP Sets)

Outages & Hijacks

Pairwise Comparisons

Performance Alarms

Some Numbers
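
The "pre-cooked every 5 minutes" idea on this slide can be pictured as a group-by over time windows and dimensions. The following is only a minimal sketch under stated assumptions: the field names (`ts`, `country`, `asn`, `latency_ms`), the choice of the median as the statistic, and the helper names are all hypothetical and are not taken from the actual pipeline.

```python
from collections import defaultdict
from statistics import median

WINDOW_SECONDS = 5 * 60  # the five-minute cadence mentioned on the slide


def window_start(ts: float) -> int:
    """Floor a Unix timestamp to the start of its five-minute window."""
    return int(ts // WINDOW_SECONDS) * WINDOW_SECONDS


def precook(measurements, dimensions):
    """Group latencies by (window, dimension name, dimension value).

    Returns the median latency per group. The median is an assumed
    statistic chosen for illustration only.
    """
    buckets = defaultdict(list)
    for m in measurements:
        w = window_start(m["ts"])
        for dim in dimensions:
            buckets[(w, dim, m[dim])].append(m["latency_ms"])
    return {key: median(vals) for key, vals in buckets.items()}


# Hypothetical sample measurements (documentation-range ASNs).
samples = [
    {"ts": 1000.0, "country": "US", "asn": 64500, "latency_ms": 40.0},
    {"ts": 1100.0, "country": "US", "asn": 64501, "latency_ms": 60.0},
    {"ts": 1250.0, "country": "DE", "asn": 64502, "latency_ms": 30.0},
]
cooked = precook(samples, ["country", "asn"])
```

Each measurement contributes to one bucket per dimension, so adding a dimension multiplies the number of series rather than the amount of raw data read.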

● Major Outages

• Major Impact

• Rare

● Regional Outages and Degradations

• Variable Impact

• Always Happening

“We experienced an Internet connectivity issue with a provider outside of our network which affected traffic from some end-user networks.” AWS

Business Impacting

● Consolidated view across your Internet Infrastructure

● Determine the impact to Cloud, CDN and Hosting Infrastructure globally

● Immediate time to information

What is Internet Intelligence?

Leverage Currently Deployed Dyn Assets

● Global Monitoring Infrastructure

● Custom Cloud Monitoring Infrastructure

● Real User Monitoring data

● Global Routing Infrastructure Monitors

How is it Done?

Global Monitoring Infrastructure

Reachability Markets

What is being Monitored?

Waterfalls & RUM – Where do you start?

Rather than focusing on entire-page RUM and waterfalls, focus on what happens OUTSIDE of your normal span of control as a cloud, content & security consumer:

Monitor the critical content servers (CDNs both public and private)

Monitor the cloud providers, DNS providers & core SaaS providers

Give you the tooling to start answering mean time to innocence questions

Is it a problem you have the ability to address? Not if it’s your cloud provider’s transit or the ISP’s recursive DNS.

Is your CDN provider overloaded? Is there a more generalized congestion problem on the internet?

Are the network paths to your users suboptimal – maybe even hijacked?

Can you see a micro-outage? Can you see patterns with providers?

Did a user come via a proxy gateway? Does the gateway fail to forward websockets?

Let’s Dive in – Some Context

NOTE: This is a fake URL – it won’t work for you. Sorry.

A single web page that shows a combination of real-time and near-real-time forensic data

Intentionally unbranded – what can you do with our datasets?

Covers the internal APIs that we use – they are all becoming public. Talk to me!

A common set of UX controls can be applied to a variety of real-time and batch data:

GeoViews, Sunburst, Matrix & Long-Term Trending

Under the covers: ReactJS, D3, GeoJSON/TopoJSON, jQuery, Go, Varnish, Nginx, WebSockets

Live Demo

Telemetry Data Cooking Pipeline

Users: Cover 80% of the ASNs on the internet every minute

Relays: Globally distributed network. Handling 50K/sec per relay

ORIGINS

DNS RECURSIVES

Probers: Network of 300+ probers performing 10K traces/second AND synthetic DIG & HTTP[S]

Gatherers: Real-time geo annotation, data transformation & filtering. Handling 100K/sec events

Geo annotated real-time API

Cookers: Statistical analysis and aggregation services

Time Series analyzed API
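
The gatherer stage described on this slide (geo annotation, transformation and filtering) could look roughly like the sketch below. It is an illustration only: the prefix table, the event fields, the latency cut-off and the function names are hypothetical, and a real gatherer would consult a full geo/ASN database rather than a two-row list.

```python
import ipaddress

# Hypothetical prefix-to-location table (documentation address ranges).
GEO_TABLE = [
    (ipaddress.ip_network("192.0.2.0/24"), {"country": "US", "asn": 64500}),
    (ipaddress.ip_network("198.51.100.0/24"), {"country": "DE", "asn": 64501}),
]


def annotate(event):
    """Return a copy of the event with geo fields, or None if the IP is unknown."""
    ip = ipaddress.ip_address(event["client_ip"])
    for net, geo in GEO_TABLE:
        if ip in net:
            return {**event, **geo}
    return None


def gather(events, max_latency_ms=60_000):
    """Annotate, filter and transform a stream of raw events."""
    for event in events:
        annotated = annotate(event)
        if annotated is None:
            continue  # filter: drop events that cannot be placed
        if not 0 <= annotated["latency_ms"] <= max_latency_ms:
            continue  # filter: drop implausible timings (assumed cut-off)
        annotated["latency_s"] = annotated["latency_ms"] / 1000.0  # transform
        yield annotated


raw = [
    {"client_ip": "192.0.2.10", "latency_ms": 42},
    {"client_ip": "203.0.113.5", "latency_ms": 10},   # not in the table
    {"client_ip": "198.51.100.7", "latency_ms": -1},  # implausible timing
]
annotated_events = list(gather(raw))
```

Writing the stage as a generator keeps it streaming, which suits a component that has to keep up with a continuous event feed.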

Browser, Recursive, Authoritative, Injector & Beacon

KEY: HTTP, DNS, LOG & ANALYZE

GET - http://dyninsight.com/inject/CUST_ID/CUST_DATA

beacon = HMAC(secret, token)

token = encode(cust_id, client_ip, time, nodeid, referer)

Javascript “injection” – just like injecting an advertisement into a page

Writes a transparent iframe into the page

Loading the iframe requires resolving beacon

Guaranteed to cause recursive DNS cache miss

time, client_ip, beacon

time, recursive_ip, beacon

Collect

GET - http://beacon.dyninsight.com/CUST_ID/CUST_DATA/token

time, client_ip

Dynamic HTML - containing customer resources to test

Resources 1 .. N @ target origins tested

Resource timing information sent to collector

Per resource timing info

Gatherer

Time – 2 – Authoritative

Time – 2 – Recursive (inferred)
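
The two formulas on this slide, `token = encode(...)` and `beacon = HMAC(secret, token)`, and the two log records that share a beacon can be sketched as follows. The slide does not say how `encode` works or which hash is used, so the JSON plus URL-safe base64 encoding, SHA-256, the 32-character truncation (chosen here so the value fits inside a single 63-octet DNS label) and all function names are assumptions made for illustration.

```python
import base64
import hashlib
import hmac
import json


def encode_token(cust_id, client_ip, time, nodeid, referer):
    """Pack the five token fields; JSON + URL-safe base64 is an assumed encoding."""
    payload = json.dumps(
        [cust_id, client_ip, time, nodeid, referer], separators=(",", ":")
    )
    return base64.urlsafe_b64encode(payload.encode()).decode()


def make_beacon(secret: bytes, token: str) -> str:
    """beacon = HMAC(secret, token), hex-encoded and cut to 32 characters."""
    return hmac.new(secret, token.encode(), hashlib.sha256).hexdigest()[:32]


def map_clients_to_recursives(http_log, dns_log):
    """Join (time, client_ip, beacon) with (time, recursive_ip, beacon) on beacon."""
    recursive_by_beacon = {beacon: rip for _, rip, beacon in dns_log}
    return {
        cip: recursive_by_beacon[beacon]
        for _, cip, beacon in http_log
        if beacon in recursive_by_beacon
    }


secret = b"demo-secret"  # hypothetical key
token = encode_token(
    "CUST_ID", "192.0.2.10", 1444000000, "node-1", "http://example.com/"
)
beacon = make_beacon(secret, token)

# One record as seen over HTTP and one as seen by the authoritative DNS server.
http_log = [(1444000000, "192.0.2.10", beacon)]
dns_log = [(1444000000, "198.51.100.53", beacon)]
mapping = map_clients_to_recursives(http_log, dns_log)
```

Because the token includes the client IP and the time, each page view yields a hostname that has very likely never been resolved before, which is consistent with the slide's point about forcing a recursive DNS cache miss.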

Cooking – What’s going on in our Data Kitchen?

Aggregated @ 5min, 1H & 1D

MHD: Raw MHD formatted data at one minute granularity

Client IP STATS: Histograms. 5 minute timing histograms across 6 latency features

DNS IP STATS: Histograms. 5 minute timing histograms across 6 latency features

IP Maps: Client Recursive, Recursive Client

Client IP Sets: Typed Label IP Sets. Latencies: Country, City, Continent, ASN

DNS IP Sets: Typed Label IP Sets. Latencies: Country, City, Continent, ASN

Correlation: Scores and Ranks. Daily by Origin for every TYLIP feature

All data is GEO Redundant: Gathering, Raw, Intermediates & Aggregates

Geo annotated real-time API

Gatherers
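
One reason to store timing histograms rather than averages is that histograms merge cleanly: a 1H or 1D view is just the sum of the 5 minute histograms it covers. The sketch below illustrates that roll-up under assumptions; the bucket edges, the function names and the sample latencies are hypothetical and do not come from the slides.

```python
from collections import Counter

BUCKET_EDGES_MS = [10, 25, 50, 100, 250, 500, 1000]  # assumed bucket boundaries


def bucket_for(latency_ms):
    """Return the upper edge of the bucket a latency falls into."""
    for edge in BUCKET_EDGES_MS:
        if latency_ms <= edge:
            return edge
    return float("inf")  # overflow bucket


def histogram(latencies_ms):
    """Build a bucket -> count histogram for one window."""
    return Counter(bucket_for(v) for v in latencies_ms)


def rollup(histograms_by_window, span_seconds):
    """Merge fine-grained histograms into coarser windows by summing counts."""
    out = {}
    for start, hist in histograms_by_window.items():
        coarse = start - start % span_seconds
        out.setdefault(coarse, Counter()).update(hist)
    return out


# Keys are window start times in seconds; values are 5 minute histograms.
five_min = {
    0: histogram([8, 30, 45]),
    300: histogram([12, 90]),
    3600: histogram([700, 2000]),
}
hourly = rollup(five_min, 3600)
daily = rollup(hourly, 86400)
```

Percentiles can then be estimated from the merged counts at any of the three granularities without going back to the raw one minute data.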

QUESTIONS?