Post on 30-Jul-2020
Flexible Network Analytics in the Cloud
Jon Dugan & Peter Murphy
ESnet Software Engineering Group
October 18, 2017
TechEx 2017, San Francisco
Introduction
● Harsh realities of network analytics
● netbeam
● Demo
● Technology Stack
● Alternative Approaches
● Lessons Learned
2
Architecture
3
The Harsh Realities of Network Analytics
1. It’s a mess
2. Things change
3. There’s always more
4. It’s never really done
● Your data isn’t neat and tidy
● Time and money are limited
● More devices & more telemetry
● What you need today may not be what you need tomorrow.
4
Coping strategies
1. It’s a mess
2. Things change
3. There’s always more
4. It’s never really done
● Design knowing things won’t be tidy
● “What” not “How”
● Rely on the cloud for scaling
● Keep raw data to keep your options open
5
netbeam
Network Analytics in Google Cloud
Three Pillars
1. Real time analytics
○ Low latency, incomplete
2. Offline analytics
○ High latency, complete
3. Flexible data model
○ Changing needs? Recompute from raw data!
Secret sauce: Apache Beam
6
What is Apache Beam?
1. The Beam Programming Model
2. SDKs for writing Beam pipelines
3. Runners for existing distributed processing backends
○ Apache Apex
○ Apache Flink
○ Apache Spark
○ Google Cloud Dataflow
○ Local runner for testing
Slide courtesy of the Apache Beam Project
7
The Evolution of Apache Beam
[Diagram: technologies leading to Apache Beam: MapReduce, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, Millwheel, Apache Beam, Google Cloud Dataflow]
Slide courtesy of the Apache Beam Project
8
Architecture Diagram
[Diagram components: SNMP collection system; Apache Beam (Stream Processing); BigQuery (immutable); Bigtable (realtime); Apache Beam (Batch Processing); BigQuery (historical); Old SNMP system (avro); API; Client; ...]
9
Architecture Diagram
[Diagram: same components as the previous slide, with the processing stages labeled Align/rates, Rollups (5m, 1h, 1d avg), Percentiles, ...]
● Google Pubsub
● Uses Python outside of Google Cloud to poll devices and write to Pubsub topic
● Code within Google Cloud subscribes to topic to process data
10
Architecture Diagram
● Apache Beam / Google Dataflow
● Stream processing
● Subscribes to Pubsub topic
● Raw data is written to BigQuery
● Real time transformed data (e.g. aligned data rates) written to Bigtable
● Writes and makes use of metadata in Bigtable (not shown)
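As a rough illustration of the "aligned data rates" idea, the sketch below turns raw counter samples into rates placed on a fixed time grid. The 30-second step, the function name aligned_rates and the sample layout are assumptions made for this example, not details from the talk, and counter wraps or resets are not handled.

```python
def aligned_rates(samples, step=30):
    """Convert (timestamp_seconds, counter_value) samples into rates.

    Each rate is keyed by the start of the fixed-width bucket in which
    the interval ends. Assumption: the counter only ever increases.
    """
    rates = {}
    ordered = sorted(samples)
    for (t0, c0), (t1, c1) in zip(ordered, ordered[1:]):
        if t1 == t0:
            continue  # skip duplicate timestamps to avoid dividing by zero
        rate = (c1 - c0) / (t1 - t0)   # counter units per second
        bucket = (t1 // step) * step   # align the timestamp to the grid
        rates[bucket] = rate           # a later interval in the same bucket wins
    return rates


# Hypothetical samples: devices rarely answer exactly on the grid.
samples = [(1000, 0), (1031, 3100), (1059, 5900)]
print(aligned_rates(samples))
```

Keeping the raw samples (as the slide describes for BigQuery) means a different step or a smarter interpolation could be applied later without re-polling anything.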
13
Architecture Diagram
● Cloud Bigtable
● Like HBase
● Write to cells in rows, indexed by keys
● We write 1 day of data to a single row (columns are the time of day, key is metric and day)
● Fast access to row by key, can serve data from here
● Store one year
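The one-row-per-metric-per-day layout can be sketched with a plain dictionary standing in for a Bigtable table. The "::" separator, the metric name and the helper names row_key, column and write_point are assumptions for illustration only; the slide does not give the exact key format.

```python
from collections import defaultdict
from datetime import datetime, timezone


def row_key(metric, ts):
    """Row key is metric plus day, e.g. 'rtr1-ifInOctets::2017-10-18'."""
    day = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d")
    return f"{metric}::{day}"


def column(ts):
    """Column qualifier is the time of day, e.g. '13:30:00'."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%H:%M:%S")


# Stand-in for a Bigtable table: row key -> {column: value}
table = defaultdict(dict)


def write_point(metric, ts, value):
    """Store one value in the cell for this metric, day and time of day."""
    table[row_key(metric, ts)][column(ts)] = value


write_point("rtr1-ifInOctets", 1508333400, 42.0)  # 2017-10-18 13:30:00 UTC
print(dict(table))
```

With this layout, reading a whole day for one metric is a single lookup by key, which is the property the slide relies on for serving data quickly.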
14
Architecture Diagram
● BigQuery
● Data warehousing solution
● Cheap storage, SQL access, but not suitable for real-time access
● Allows SQL queries for ad hoc investigation
● We store our source of truth here
● Also store historical data (7 years), imported via avro files
16
Architecture Diagram
● Apache Beam / Google Dataflow
● Batch processing
● Run with cron job
● Recalculate Bigtable data each night from source of truth in BigQuery
● Process Bigtable rows into new rows of 5min, 1 hr and 1 day aggregations
● Additional pre-computed views e.g. percentiles for traffic distribution over a month
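A minimal sketch of the 5 minute, 1 hour and 1 day average rollups, written in plain Python rather than as a Beam pipeline. The function name rollup_avg and the sample points are hypothetical. Each rollup here is computed from the raw points rather than from a finer rollup, since an average of averages can differ when buckets hold different numbers of points.

```python
from collections import defaultdict


def rollup_avg(points, width):
    """Average (timestamp_seconds, value) points into buckets of `width` seconds."""
    sums = defaultdict(lambda: [0.0, 0])  # bucket start -> [running total, count]
    for ts, value in points:
        bucket = ts - ts % width
        sums[bucket][0] += value
        sums[bucket][1] += 1
    return {b: total / n for b, (total, n) in sorted(sums.items())}


# Hypothetical raw points for one interface.
points = [(0, 10.0), (30, 20.0), (300, 40.0), (3600, 100.0)]
five_min = rollup_avg(points, 300)
hourly = rollup_avg(points, 3600)
daily = rollup_avg(points, 86400)
print(five_min, hourly, daily)
```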
20
Architecture Diagram
[Diagram: same components as before, with the API labeled Dataserver API (node.js)]
● API
● Currently runs on App Engine
● Node.js
● Serves data out of Bigtable
● Timeseries data is served as ‘tiles’, each tile is one row
● Would like to use Cloud Endpoints and provide a gRPC service
● Looking forward to grpc-web solution
21
Use case example: Historical Trends
22
Use case example: Historical Trends
[Diagram: SNMP collection system; Old SNMP system (avro); Stream to BQ; BigQuery; BigQuery (historical); Per-day interface totals; Per-month totals; Bigtable; Dataserver API (node.js); Client]
Bigtable rows
snmp-daily::2017-08::$interface
Jan 1     Jan 2     ...    Dec31
1.8 Pb    1.9 Pb    ...    3.1 Pb
snmp-monthly-totals
Jan 1991    Feb 1991    ...    Sep 2017
28 Gb       29 Gb       ...    56 Pb
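The step from per-day interface totals to per-month totals can be sketched as a simple grouped sum. The function name monthly_totals, the 'YYYY-MM-DD' key format and the numbers are assumptions for illustration; they are not taken from the slide.

```python
from collections import defaultdict


def monthly_totals(daily_totals):
    """Sum per-day totals keyed by 'YYYY-MM-DD' into totals keyed by 'YYYY-MM'."""
    totals = defaultdict(float)
    for day, value in daily_totals.items():
        totals[day[:7]] += value  # the first 7 characters are 'YYYY-MM'
    return dict(totals)


# Hypothetical per-day totals for a single interface.
daily = {"2017-08-01": 1.5, "2017-08-02": 2.0, "2017-09-01": 3.0}
print(monthly_totals(daily))
```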
23
Use case: real time anomaly detection
[Diagram: SNMP collection system; Stream to BQ; BigQuery; Baseline generation; Anomaly detection; Bigtable; Dataserver API (node.js); Client]
● Baseline generation: generates avg for each interface over the past 3 months for that hour/day
● Anomaly detection: compares baseline to real time values to generate current deviation from normal
Bigtable rows
baseline::5m::avg::$interface
Mon12am    Mon1am    Mon2am    ...    Sun11pm
2.1        1.9       0.3       ...    0.5
anomaly::5m::avg
iface-1    iface-2    ...    iface-n
+0.1       +2.0       ...    -1.5
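The baseline-and-deviation idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names hour_of_week, build_baseline and deviation are hypothetical, times are treated as UTC, and the caller is expected to pass in only the history window it wants (for example the past 3 months).

```python
from collections import defaultdict
from datetime import datetime, timezone


def hour_of_week(ts):
    """Return (weekday, hour) in UTC, with Monday as weekday 0."""
    dt = datetime.fromtimestamp(ts, tz=timezone.utc)
    return dt.weekday(), dt.hour


def build_baseline(history):
    """Average historical (timestamp, value) points per (weekday, hour) slot."""
    slots = defaultdict(list)
    for ts, value in history:
        slots[hour_of_week(ts)].append(value)
    return {slot: sum(vals) / len(vals) for slot, vals in slots.items()}


def deviation(baseline, ts, value):
    """Current value minus the baseline for the same slot (None if no baseline)."""
    expected = baseline.get(hour_of_week(ts))
    return None if expected is None else value - expected


# Two earlier Mondays at 00:00 UTC, then a current reading on Monday 2017-10-16.
history = [(1506902400, 2.0), (1507507200, 3.0)]
baseline = build_baseline(history)
print(deviation(baseline, 1508112000, 4.0))
```

A real detector would also need a threshold or a measure of spread to decide when a deviation counts as an anomaly; the slide only describes the deviation itself.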
24
Use case example: Percentiles
25
Use case example: Percentiles
[Diagram: SNMP collection system; Stream to Bigtable; Daily rollups (5m avg); Percentiles; Bigtable; Dataserver API (node.js); Client]
Bigtable rows
rollup-month-5m::2017-08::$interface::in
1        2        ...    8640
6Gbps    5Gbps    ...    2Gbps
percentiles::2017-08::$interface::in
1 pct       2 pct       ...    99 pct
0.1 Gbps    0.3 Gbps    ...    22.1Gbps
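One way to derive a percentile row from a month of 5 minute averages is the nearest-rank method, sketched below. The choice of nearest-rank, the function name percentiles and the sample values are assumptions; the slides do not say which percentile definition is used.

```python
def percentiles(values, ranks=range(1, 100)):
    """Nearest-rank percentiles.

    For rank p, return the smallest sample such that at least p percent
    of the samples are less than or equal to it.
    """
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        return {}
    # (p * n + 99) // 100 is an integer-only ceiling of p * n / 100
    return {p: ordered[(p * n + 99) // 100 - 1] for p in ranks}


# Hypothetical month of rates; a real row would hold the month's 5 minute averages.
rates = list(range(1, 201))
pct = percentiles(rates)
print(pct[1], pct[50], pct[99])
```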
26
Demo
27
Example: Computing Total Traffic
# Python Beam SDK
import apache_beam as beam
from apache_beam.io import ReadFromText

# FormatCSVDoFn, group_by_device_interface and compute_rate are helpers that
# are not shown on this slide; see the full code at the link below.
pipeline = beam.Pipeline('DirectRunner')
(pipeline
 | 'read' >> ReadFromText('./example.csv')
 | 'csv' >> beam.ParDo(FormatCSVDoFn())
 | 'ifName key' >> beam.Map(group_by_device_interface)
 | 'group by iface' >> beam.GroupByKey()
 | 'compute rate' >> beam.FlatMap(compute_rate)
 | 'timestamp key' >> beam.Map(lambda row: (row['timestamp'], row['rateIn']))
 | 'group by timestamp' >> beam.GroupByKey()
 | 'sum by timestamp' >> beam.Map(lambda rates: (rates[0], sum(rates[1])))
 | 'format' >> beam.Map(lambda row: '{},{}'.format(row[0], row[1]))
 | 'save' >> beam.io.WriteToText('./total_by_timestamp'))
pipeline.run()
Full code available at: http://x1024.net/blog/2017/05/chinog-flexible-network-analytics-in-the-cloud/
28
Our Stack
● Apache Beam using Scio
● Google Cloud Platform
○ Dataflow
○ Bigtable
○ BigQuery
○ Pub/Sub
○ App Engine
● Languages
○ Scala
○ Javascript / Typescript
○ Python
[Logos: Cloud Dataflow, BigQuery, Cloud Bigtable, Cloud Endpoints, App Engine, Cloud Pub/Sub]
29
Current Status & Future Plans
Current
Alpha version for SNMP data:
● Ingest to BigQuery is working
● Migration of historical data is implemented. Awaiting final details before full conversion
● Streaming ingest to Bigtable still in process
● Early version of utilization visualization
● Simple data server can provide data to clients, but gRPC API coming
● Interface timeseries charts functional
Future
More types of data:
● Flow data
● perfSONAR
Machine Learning
Anomaly Detection
“Mash up” various data sources
30
Why not InfluxDB, Elastic or ${FAVORITE_DB}
● We have a data processing problem, not a data storage problem per se.
○ Beam and the ecosystem around it give a huge amount of flexibility -- can try new ideas as they occur to us
○ Ability to move to different platform components
○ machine learning (TensorFlow and others)
● InfluxDB & Elastic
○ require care and feeding -- have to think about disks and machines, etc.
○ At our last evaluation (a while ago now) InfluxDB wasn’t able to keep up with our load -- this may have changed but other benefits outweigh that.
○ Elastic doesn’t seem to be a good fit for long term storage -- everything is in the “hot” tier
31
Why the cloud? Why Google Cloud Platform?
Why the cloud?
● Focus on our problems not on infrastructure
● Scalability without needing to own lots of systems
● Managed services for databases and compute
Why Google Cloud?
● Apache Beam was Google Dataflow when we first encountered it
● More cohesive ecosystem than AWS in our experience
32
Lessons learned / Life in the cloud / Good & Bad
● This approach is not a silver bullet, but definitely makes many things easier
● Scaling is pretty sweet: we processed 4,005,271,066 points in 13 hours
● GCP Tech support could be better
● Despite early indications Python streaming support in Beam has been slow to appear. Python is a second class citizen. Fortunately Scio and Scala allow working with the Java SDK at a high level of abstraction.
● Scala is powerful but challenging at times
● Focus on developing your services, not on setting up machines to run them
○ Nice options for decomposing services (Endpoints/esp, load balancing, etc)
○ Service oriented
○ Battle tested software stacks
33
Thank you!
Peter Murphy <pmurphy@es.net>
Jon Dugan <jdugan@es.net>
● MyESnet: https://my.es.net
● ESnet Open Source: http://software.es.net/
○ http://software.es.net/react-timeseries-charts/
○ http://software.es.net/pond/
○ http://software.es.net/react-network-diagrams/
● Scio: https://github.com/spotify/scio
● Beam: https://beam.apache.org
34