Flexible Network Analytics in the Cloud · 2017-10-11 · Your data isn’t neat and tidy ... be...


Flexible Network Analytics in the Cloud

Jon Dugan & Peter Murphy
ESnet Software Engineering Group
October 18, 2017
TechEx 2017, San Francisco

Introduction
● Harsh realities of network analytics
● netbeam
● Demo
● Technology Stack
● Alternative Approaches
● Lessons Learned

Architecture

The Harsh Realities of Network Analytics

1. It’s a mess

2. Things change

3. There’s always more

4. It’s never really done

● Your data isn’t neat and tidy

● Time and money are limited

● More devices & more telemetry

● What you need today may not be what you need tomorrow.

Coping strategies

1. It’s a mess

2. Things change

3. There’s always more

4. It’s never really done

● Design knowing things won’t be tidy

● “What” not “How”

● Rely on the cloud for scaling

● Keep raw data to keep your options open

netbeam

Network Analytics in Google Cloud

Three Pillars

1. Real time analytics
○ Low latency, incomplete

2. Offline analytics
○ High latency, complete

3. Flexible data model
○ Changing needs? Recompute from raw data!

Secret sauce: Apache Beam

What is Apache Beam?

1. The Beam Programming Model

2. SDKs for writing Beam pipelines

3. Runners for existing distributed processing backends

○ Apache Apex

○ Apache Flink

○ Apache Spark

○ Google Cloud Dataflow

○ Local runner for testing

Slide courtesy of the Apache Beam Project

The Evolution of Apache Beam

[Diagram: technologies in the lineage of Apache Beam: MapReduce, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, Millwheel, Apache Beam, Google Cloud Dataflow]

Slide courtesy of the Apache Beam Project

Architecture Diagram

[Diagram components: SNMP collection system; Old SNMP system; Apache Beam (stream processing); Apache Beam (batch processing); BigQuery (immutable); BigQuery (historical); Bigtable (realtime); Client. Processing steps: align/rates; rollups (5m, 1h, 1d avg); percentiles]

● Google Pubsub
● Uses Python outside of Google Cloud to poll devices and write to a Pubsub topic
● Code within Google Cloud subscribes to the topic to process data


● Apache Beam / Google Dataflow
● Stream processing
● Subscribes to Pubsub topic


● Raw data is written to BigQuery

● Real time transformed data (e.g. aligned data rates) is written to Bigtable
● Writes and makes use of metadata in Bigtable (not shown)
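The align/rates step can be illustrated in plain Python. This is a minimal sketch, not netbeam's implementation: it assumes samples are (unix_timestamp, counter_value) pairs from an increasing octet counter, skips counter resets rather than correcting for them, and attributes each rate to a hypothetical 30-second aligned bucket. The function name and the bucket size are assumptions made for this example.

```python
def counter_rates(samples, step=30):
    """Turn (timestamp, counter) samples into (aligned_timestamp, rate) pairs.

    Assumptions (not from the talk): counters only increase, and each rate is
    attributed to the step-aligned bucket that contains the later sample.
    """
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0 or c1 < c0:
            continue  # skip out-of-order samples and counter resets
        aligned = t1 - (t1 % step)  # round down to the bucket boundary
        rates.append((aligned, (c1 - c0) / dt))
    return rates


# Three polls of one counter, 31 seconds apart
samples = [(1000, 0), (1031, 3100), (1062, 9300)]
rates = counter_rates(samples)
```

Rounding down to a bucket boundary is the simplest form of alignment; a production pipeline might interpolate between samples instead.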


● Cloud Bigtable
● Like HBase
● Write to cells in rows, indexed by keys
● We write 1 day of data to a single row (columns are the time of day, key is metric and day)
● Fast access to row by key, can serve data from here
● Store one year
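The one-row-per-day layout described above could be sketched as follows. The exact key and column formats are not given in the slides, so the `::` separator, the date format, and the seconds-since-midnight column name are assumptions for illustration only.

```python
from datetime import datetime, timezone


def row_key(metric, ts):
    """Row key combining the metric name and the UTC day (hypothetical format)."""
    day = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d")
    return f"{metric}::{day}"


def column_qualifier(ts):
    """Column name: seconds since UTC midnight, zero-padded so columns sort by time."""
    return f"{ts % 86400:05d}"


ts = 1507766400 + 90  # 2017-10-12 00:01:30 UTC
key = row_key("rtr1::xe-0/0/0::in", ts)
col = column_qualifier(ts)
```

With this layout, reading one day of one metric is a single row lookup by key, which is what makes serving directly from Bigtable fast.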


● BigQuery
● Data warehousing solution
● Cheap storage, SQL access, but not suitable for real-time access
● Allows SQL queries for ad hoc investigation
● We store our source of truth here

● Also store historical data (7 years), imported via Avro files


● Batch processing
● Run with cron job

● Recalculate Bigtable data each night from source of truth in BigQuery


● Process Bigtable rows into new rows of 5min, 1 hr and 1 day aggregations

● Additional pre-computed views, e.g. percentiles for traffic distribution over a month
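The rollup step amounts to averaging points into fixed time windows. A minimal sketch in plain Python follows, assuming points are (unix_timestamp, value) pairs; the function name is hypothetical and the real pipeline runs this kind of logic in Beam rather than in a single process.

```python
from collections import defaultdict


def rollup_avg(points, window):
    """Average (timestamp, value) points into fixed windows of `window` seconds."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % window].append(value)  # key by window start time
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}


points = [(0, 10.0), (30, 20.0), (300, 40.0), (330, 60.0)]
five_min = rollup_avg(points, 300)   # 5m rollup
one_hour = rollup_avg(points, 3600)  # 1h rollup
```

The same function covers the 1 day case with `window=86400`.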

[Diagram: the same architecture, now with a Dataserver API (node.js) component added]

● API
● Currently runs on App Engine
● Node.js
● Serves data out of Bigtable
● Timeseries data is served as ‘tiles’; each tile is one row
● Would like to use Cloud Endpoints and provide a gRPC service
● Looking forward to grpc-web solution
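Serving data as tiles means a client request for a time range maps to a small set of row keys. The sketch below assumes one tile per UTC day and a hypothetical `metric::YYYY-MM-DD` key format; neither the key format nor the function name comes from the talk, and the actual data server is written in Node.js.

```python
from datetime import datetime, timedelta, timezone


def tile_keys(metric, start_ts, end_ts):
    """List the per-day row keys ("tiles") overlapping [start_ts, end_ts]."""
    day = datetime.fromtimestamp(start_ts, tz=timezone.utc).date()
    last = datetime.fromtimestamp(end_ts, tz=timezone.utc).date()
    keys = []
    while day <= last:
        keys.append(f"{metric}::{day.isoformat()}")
        day += timedelta(days=1)
    return keys


# Two full days starting at 2017-10-12 00:00:00 UTC
keys = tile_keys("rtr1::xe-0/0/0::in", 1507766400, 1507766400 + 2 * 86400 - 1)
```

Because each tile is a whole row, tiles are easy to cache on the client and to fetch in parallel.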

Use case example: Historical Trends

[Diagram components: Old SNMP system (avro); Stream to BQ; BigQuery; per-day interface totals; per-month totals; Bigtable; Client]

Bigtable rows:
● snmp-daily::2017-08::$interface
   columns: Jan 1, Jan 2, ... Dec 31
   values: 1.8 Pb, 1.9 Pb, ... 3.1 Pb
● snmp-monthly-totals
   columns: Jan 1991, Feb 1991, ... Sep 2017
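Deriving per-month totals from per-day totals is a simple grouped sum. A minimal sketch, assuming daily totals are held in a dict keyed by 'YYYY-MM-DD' strings (an assumption for this example; the talk stores them as Bigtable columns) and using invented sample values:

```python
from collections import defaultdict


def monthly_totals(daily_totals):
    """Sum per-day totals keyed by 'YYYY-MM-DD' into totals keyed by 'YYYY-MM'."""
    totals = defaultdict(float)
    for day, value in daily_totals.items():
        totals[day[:7]] += value  # the first 7 characters are 'YYYY-MM'
    return dict(totals)


daily = {"2017-08-30": 1.5, "2017-08-31": 2.5, "2017-09-01": 3.0}
monthly = monthly_totals(daily)
```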

Use case: real time anomaly detection

[Diagram components: Stream to BQ; BigQuery; Baseline generation; Anomaly detection; Bigtable; Client; time label: Mon2am]

● Baseline generation: generates avg for each interface over the past 3 months for that hour/day
● Anomaly detection: compares baseline to real time values to generate current deviation from normal

Bigtable rows:
● baseline::5m::avg::$interface
   columns: Mon 12am, Mon 1am, ... Sun 11pm
   values: 2.1, 1.9, ... 0.5
● anomaly::5m::avg
   columns: iface-1, iface-2, ... iface-n
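The baseline-and-compare idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the deployed logic: the slides do not say how deviation is measured, so the ratio of current value to baseline is an assumption, as are the function names and the use of UTC for the hour-of-week slot.

```python
from collections import defaultdict
from datetime import datetime, timezone


def hour_of_week(ts):
    """Return (weekday, hour) in UTC; Monday is weekday 0."""
    dt = datetime.fromtimestamp(ts, tz=timezone.utc)
    return dt.weekday(), dt.hour


def build_baseline(history):
    """Average historical (timestamp, value) points per hour of the week."""
    slots = defaultdict(list)
    for ts, value in history:
        slots[hour_of_week(ts)].append(value)
    return {slot: sum(v) / len(v) for slot, v in slots.items()}


def deviation(baseline, ts, value):
    """Ratio of the current value to the baseline for its slot (assumed metric).

    Returns None when there is no usable baseline for that slot.
    """
    base = baseline.get(hour_of_week(ts))
    return None if not base else value / base


t0 = 1507766400  # Thursday 2017-10-12 00:00:00 UTC
baseline = build_baseline([(t0, 2.0), (t0 + 7 * 86400, 4.0)])
current = deviation(baseline, t0 + 14 * 86400 + 600, 6.0)
```

In the talk's design the history would span about three months per interface, with the baseline written to Bigtable so the streaming job can look it up cheaply.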

Use case example: Percentiles

[Diagram components: Stream to Bigtable; Bigtable; Daily rollups (5m avg); Percentiles; Client]

Bigtable rows:
● rollup-month-5m::2017-08::$interface::in
   values: 6Gbps, 5Gbps, ... 2Gbps (last column: 8640)
● percentiles::2017-08::$interface::in
   values: 0.1 Gbps, 0.3 Gbps, ... 22.1Gbps (99 pct)
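Computing a percentile over a month of 5-minute averages can be sketched as below. The slides do not state which percentile definition is used, so the nearest-rank method here is an assumption, and the sample rates are illustrative values rather than measured data.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of values <= it."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Illustrative 5-minute average rates in Gbps
rates_gbps = [6.0, 5.0, 2.0, 0.1, 0.3, 22.1, 4.0, 3.0, 1.0, 7.0]
p50 = percentile(rates_gbps, 50)
p99 = percentile(rates_gbps, 99)
```

Precomputing these values into their own Bigtable row means the client never has to fetch or sort the full month of points.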

Example: Computing Total Traffic

# Python Beam SDK
import apache_beam as beam
from apache_beam.io import ReadFromText

# FormatCSVDoFn, group_by_device_interface and compute_rate are helpers
# defined in the full code (see the link that follows this example).
pipeline = beam.Pipeline('DirectRunner')

(pipeline
 | 'read' >> ReadFromText('./example.csv')
 | 'csv' >> beam.ParDo(FormatCSVDoFn())
 | 'ifName key' >> beam.Map(group_by_device_interface)
 | 'group by iface' >> beam.GroupByKey()
 | 'compute rate' >> beam.FlatMap(compute_rate)
 | 'timestamp key' >> beam.Map(lambda row: (row['timestamp'], row['rateIn']))
 | 'group by timestamp' >> beam.GroupByKey()
 | 'sum by timestamp' >> beam.Map(lambda rates: (rates[0], sum(rates[1])))
 | 'format' >> beam.Map(lambda row: '{},{}'.format(row[0], row[1]))
 | 'save' >> beam.io.WriteToText('./total_by_timestamp'))

# Block until the local run completes
pipeline.run().wait_until_finish()

Full code available at: http://x1024.net/blog/2017/05/chinog-flexible-network-analytics-in-the-cloud/

Our Stack
● Apache Beam using Scio
● Google Cloud Platform
○ Dataflow
○ Bigtable
○ BigQuery
○ Pub/Sub
○ App Engine
● Languages
○ Scala
○ JavaScript / TypeScript
○ Python

[Logos: Cloud Dataflow, BigQuery, Cloud Bigtable, Cloud Endpoints, App Engine, Cloud Pub/Sub]

Current Status & Future Plans

Current

Alpha version for SNMP data:
● Ingest to BigQuery is working
● Migration of historical data is implemented. Awaiting final details before full conversion
● Streaming ingest to Bigtable still in process
● Early version of utilization visualization
● Simple data server can provide data to clients, but gRPC API coming
● Interface timeseries charts functional

Future

More types of data:
● Flow data
● perfSONAR

Machine Learning

Anomaly Detection

“Mash up” various data sources

Why not InfluxDB, Elastic or ${FAVORITE_DB}?
● We have a data processing problem, not a data storage problem per se.
○ Beam and the ecosystem around it give a huge amount of flexibility -- can try new ideas as they occur to us
○ Ability to move to different platform components
○ Machine learning (TensorFlow and others)
● InfluxDB & Elastic
○ Require care and feeding -- have to think about disks and machines, etc.
○ At our last evaluation (a while ago now) InfluxDB wasn’t able to keep up with our load -- this may have changed but other benefits outweigh that.
○ Elastic doesn’t seem to be a good fit for long term storage -- everything is in the “hot” tier

Why the cloud? Why Google Cloud Platform?

Why the cloud?
● Focus on our problems not on infrastructure
● Scalability without needing to own lots of systems
● Managed services for databases and compute

Why Google Cloud?
● Apache Beam was Google Dataflow when we first encountered it
● More cohesive ecosystem than AWS in our experience

Lessons learned / Life in the cloud / Good & Bad
● This approach is not a silver bullet, but definitely makes many things easier
● Scaling is pretty sweet: we processed 4,005,271,066 points in 13 hours
● GCP Tech support could be better
● Despite early indications, Python streaming support in Beam has been slow to appear. Python is a second class citizen. Fortunately Scio and Scala allow working with the Java SDK at a high level of abstraction.
● Scala is powerful but challenging at times
● Focus on developing your services, not on setting up machines to run them
○ Nice options for decomposing services (Endpoints/esp, load balancing, etc)
○ Service oriented
○ Battle tested software stacks

Thank you!
Peter Murphy <pmurphy@es.net>
Jon Dugan <jdugan@es.net>

● MyESnet: https://my.es.net
● ESnet Open Source: http://software.es.net/
○ http://software.es.net/react-timeseries-charts/
○ http://software.es.net/pond/
○ http://software.es.net/react-network-diagrams/
● Scio: https://github.com/spotify/scio
● Beam: https://beam.apache.org
