
High-Volume Data Collection and Real Time Analytics Using Redis

Date post: 09-May-2015
Upload: cacois
Description:
In this talk, we describe using Redis, an open-source, in-memory key-value store, to capture large volumes of data from numerous remote sources while also supporting real-time monitoring and analytics. With this approach, we captured a continuous, high-volume stream of data from numerous remote environmental sensors while concurrently querying the database for real-time monitoring and analytics.
See more of my work at http://www.codehenge.net
Page 1: High-Volume Data Collection and Real Time Analytics Using Redis

Large-Scale Data Collection Using Redis

C. Aaron Cois, Ph.D. -- Tim Palko
CMU Software Engineering Institute

© 2011 Carnegie Mellon University

Page 2: High-Volume Data Collection and Real Time Analytics Using Redis

Us

C. Aaron Cois, Ph.D.

Software Architect, Team Lead
CMU Software Engineering Institute
Digital Intelligence and Investigations Directorate

Tim Palko

Senior Software Engineer
CMU Software Engineering Institute
Digital Intelligence and Investigations Directorate

@aaroncois

Page 3: High-Volume Data Collection and Real Time Analytics Using Redis

Overview

• Problem Statement
• Sensor Hardware & System Requirements
• System Overview
  – Data Collection
  – Data Modeling
  – Data Access
  – Event Monitoring and Notification
• Conclusions and Future Work

Page 4: High-Volume Data Collection and Real Time Analytics Using Redis

The Goal

Critical infrastructure/facility protection

via

Environmental Monitoring

Page 5: High-Volume Data Collection and Real Time Analytics Using Redis

Why?

Stuxnet
• Two major components:
  1) Send centrifuges spinning wildly out of control
  2) Record ‘normal operations’ and play them back to operators during the attack [1]
• Environmental monitoring provides secondary indicators, such as abnormal heat/motion/sound

[1] http://www.nytimes.com/2011/01/16/world/middleeast/16stuxnet.html?_r=2&

Page 6: High-Volume Data Collection and Real Time Analytics Using Redis

The Broader Vision

Quick, flexible out-of-band monitoring

• Set up monitoring in minutes
• Versatile sensors, easily repurposed
• Data communication is secure (P2P VPN) and requires no existing systems other than outbound networking

Page 7: High-Volume Data Collection and Real Time Analytics Using Redis

The Platform

A CMU research project called Sensor Andrew

• Features:
  – Open-source sensor platform
  – Scalable and generalist system supporting a wide variety of applications
  – Extensible architecture
    • Can integrate diverse sensor types

Page 8: High-Volume Data Collection and Real Time Analytics Using Redis

Sensor Andrew

Page 9: High-Volume Data Collection and Real Time Analytics Using Redis

Sensor Andrew Overview

[Diagram: Nodes, two Gateways, a Server, and End Users]

Page 10: High-Volume Data Collection and Real Time Analytics Using Redis

What is a Node?

A node collects data and sends it to a collector, or gateway

Environment Node Sensors
• Light
• Audio
• Humidity
• Pressure
• Motion
• Temperature
• Acceleration

Power Node Sensors
• Current
• Voltage
• True Power
• Energy

Radiation Node Sensors
• Alpha particle count per minute

Particulate Node Sensors
• Small Particle Count
• Large Particle Count

Page 11: High-Volume Data Collection and Real Time Analytics Using Redis

What is a Gateway?

• A gateway receives UDP data from all nodes registered to it
• An internal service:
  – Receives data continuously
  – Opens a server on a specified port
  – Continually transmits UDP data over this port

Page 12: High-Volume Data Collection and Real Time Analytics Using Redis

Requirements

We need to...

1. Collect data from nodes once per second
2. Scale to 100 gateways, each with 64 nodes
3. Detect events in real time
4. Notify users about events in real time
5. Retain all collected data long term (years, at least)
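
The first two requirements fix the ingest rate the data store has to sustain. A quick back-of-the-envelope calculation, using only the figures on this slide (one sample per node per second, 100 gateways, 64 nodes each):

```python
# Figures taken from the requirements slide.
GATEWAYS = 100
NODES_PER_GATEWAY = 64
SAMPLES_PER_NODE_PER_SECOND = 1

nodes = GATEWAYS * NODES_PER_GATEWAY
datapoints_per_second = nodes * SAMPLES_PER_NODE_PER_SECOND
datapoints_per_day = datapoints_per_second * 24 * 60 * 60

print(nodes)                  # 6400
print(datapoints_per_second)  # 6400
print(datapoints_per_day)     # 552960000
```

At full scale that is 6,400 writes per second, sustained, while the same store is being read for event detection.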

Page 14: High-Volume Data Collection and Real Time Analytics Using Redis

What Is Big Data?

“When your data sets become so large that you have to start innovating around how to collect, store, organize, analyze and share it.”

Page 20: High-Volume Data Collection and Real Time Analytics Using Redis

Problems

• Size
• Transmission
• Storage
• Rate
• Retrieval

Page 21: High-Volume Data Collection and Real Time Analytics Using Redis

Collecting Data

Problem:
Store and retrieve immense amounts of data at a high rate.

Constraints:
Data cannot remain on the nodes or gateways due to security concerns.
Limited infrastructure.

[Diagram: Gateway sends 8 GB / hour to an as-yet-unknown data store ("?"), which must also support Complex Analytics]

Page 22: High-Volume Data Collection and Real Time Analytics Using Redis

We Tried PostgreSQL…

• Advantages:
  – Reliable, tested and scalable
  – Relational => complex queries => analytics
• Problems:
  – Performance problems reading while writing at a high rate; real-time event detection suffers
  – ‘COPY FROM’ doesn’t permit horizontal scaling

Page 24: High-Volume Data Collection and Real Time Analytics Using Redis

Q: How can we decrease I/O load?

A: Read and write collected data directly from memory

Page 25: High-Volume Data Collection and Real Time Analytics Using Redis

Enter Redis

Commonly used as a web application cache or pub/sub server

Redis is an in-memory NoSQL database

Page 26: High-Volume Data Collection and Real Time Analytics Using Redis

Redis

• Created in 2009
• Fully in-memory key-value store
  – Fast I/O: R/W operations are equally fast
  – Advanced data structures
• Publish/Subscribe functionality
  – In addition to data store functions
  – Separate from stored key-value data

Page 27: High-Volume Data Collection and Real Time Analytics Using Redis

Persistence

• Snapshotting
  – Data is asynchronously transferred from memory to disk
• AOF (Append Only File)
  – Each modifying operation is written to a file
  – Can recreate data store by replaying operations
  – Without interrupting service, Redis will rebuild the AOF as the shortest sequence of commands needed to rebuild the current dataset in memory
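
Both persistence mechanisms can be driven from a client. The sketch below assumes a redis-py client object is passed in (`config_set`, `bgsave` and `bgrewriteaof` are redis-py methods wrapping the CONFIG SET, BGSAVE and BGREWRITEAOF commands). The helper names are ours, not from the talk, and in practice these settings usually live in redis.conf rather than application code.

```python
def enable_aof(client):
    """Switch on the append-only file so every write is logged for replay."""
    return client.config_set("appendonly", "yes")


def take_snapshot(client):
    """Ask Redis to write a point-in-time snapshot to disk in the background."""
    return client.bgsave()


def compact_aof(client):
    """Ask Redis to rewrite the AOF as a shorter, equivalent command sequence."""
    return client.bgrewriteaof()


# Usage against a real server (requires the redis package), for context only:
#   import redis
#   client = redis.Redis(host="127.0.0.1", port=6379, db=0)
#   take_snapshot(client)
```

Each helper issues a single command because Redis may refuse to start a snapshot while another background save or rewrite is still running.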

Page 28: High-Volume Data Collection and Real Time Analytics Using Redis

Replication

• Redis supports master-slave replication
• Master-slave replication can be chained
• Be careful:
  – Slaves can be writeable! (They are read-only by default since Redis 2.6)
  – Potential for data inconsistency
• Fully compatible with Pub/Sub features

Page 29: High-Volume Data Collection and Real Time Analytics Using Redis

Redis Features: Advanced Data Structures

List: [A, B, C, D]
  ordered elements “A”, “B”, “C”, “D”
Set: {A, B, C, D}
  unique members “A”, “B”, “C”, “D”
Sorted Set: {C:1, D:2, A:3, B:4}
  {value:score}, ordered by score
Hash: {field1:“A”, field2:“B”…}
  {key:value}

Page 30: High-Volume Data Collection and Real Time Analytics Using Redis

Our Data Model

Page 31: High-Volume Data Collection and Real Time Analytics Using Redis

Constraints

Our data store must:

– Hold time-series data

– Be flexible in querying (by time, node, sensor)

– Allow efficient querying of many records

– Accept data out of order

Page 32: High-Volume Data Collection and Real Time Analytics Using Redis

Tradeoffs: Efficiency vs. Flexibility

MotionAudioLight

PressureHumidity

AccelerationTemperature

MotionVS

Light

Audio

Pressure

Temperature

Humidity

Acceleration

One record per timestamp

One record per sensor data type

A

Page 33: High-Volume Data Collection and Real Time Analytics Using Redis

Our Solution: Sorted Set

Datapoint
Key: sensor:env:101
Score: 1357542004000
Value: {"bat": 192, "temp": 523, "digital_temp": 216, "mac_address": "20f", "humidity": 22, "motion": 203, "pressure": 99007, "node_type": "env", "timestamp": 1357542004000, "audio_p2p": 460, "light": 820, "acc_z": 464, "acc_y": 351, "acc_x": 311}
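
A minimal sketch of how a datapoint like this one could be written and read back, assuming a redis-py 3.x (or later) client, where `zadd` takes a `{member: score}` mapping. The helper names and the `node_id` parameter are hypothetical; the slides show the key layout (`sensor:<node_type>:<id>`) but not the code that builds it.

```python
import json


def store_datapoint(client, node_id, datapoint):
    """Add one reading to its node's sorted set, scored by timestamp (ms)."""
    key = "sensor:%s:%s" % (datapoint["node_type"], node_id)
    member = json.dumps(datapoint, sort_keys=True)
    client.zadd(key, {member: datapoint["timestamp"]})
    return key


def read_window(client, key, start_ms, end_ms):
    """Return decoded datapoints with start_ms <= timestamp <= end_ms."""
    return [json.loads(m) for m in client.zrangebyscore(key, start_ms, end_ms)]


# Usage against a real server (requires the redis package), for context only:
#   import redis
#   client = redis.Redis(host="127.0.0.1", port=6379, db=0)
#   store_datapoint(client, "101", {"node_type": "env",
#                                   "timestamp": 1357542004000, "temp": 523})
```

Using the timestamp as the score is what makes "query by time" a range lookup, and one key per node covers "query by node". Filtering by individual sensor still means decoding the JSON value.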

Page 37: High-Volume Data Collection and Real Time Analytics Using Redis

Sorted Set

1357542004000: {"temp":523,..}
1357542005000: {"temp":523,..}

1357542007000: {"temp":530,..}
1357542008000: {"temp":531,..}
1357542009000: {"temp":540,..}
1357542010000: {"temp":545,..}
…

Page 38: High-Volume Data Collection and Real Time Analytics Using Redis

Sorted Set

1357542004000: {"temp":523,..}
1357542005000: {"temp":523,..}
1357542006000: {"temp":527,..} <- fits nicely
1357542007000: {"temp":530,..}
1357542008000: {"temp":531,..}
1357542009000: {"temp":540,..}
1357542010000: {"temp":545,..}
…
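
The late arrival on this slide can be imitated in plain Python. This is only a model of the ordering behaviour using the standard `bisect` module, not how Redis implements sorted sets internally:

```python
import bisect

# (score, value) pairs kept in score order, as a sorted set would return them.
readings = [
    (1357542004000, '{"temp": 523}'),
    (1357542005000, '{"temp": 523}'),
    (1357542007000, '{"temp": 530}'),
]

# A datapoint that arrives late still lands in timestamp order.
late = (1357542006000, '{"temp": 527}')
bisect.insort(readings, late)

scores = [score for score, _ in readings]
print(scores)
# [1357542004000, 1357542005000, 1357542006000, 1357542007000]
```

The writer never has to sort or reorder anything: the score decides the position, which is what satisfies the "accept data out of order" constraint.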

Page 39: High-Volume Data Collection and Real Time Analytics Using Redis

Know your data structure!
A set is still a set…

Datapoint
Score: 1357542004000
Value: {"bat": 192, "temp": 523, "digital_temp": 216, "mac_address": "20f", "humidity": 22, "motion": 203, "pressure": 99007, "node_type": "env", "timestamp": 1357542004000, "audio_p2p": 460, "light": 820, "acc_z": 464, "acc_y": 351, "acc_x": 311}
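
Members of a sorted set are unique: adding a member that already exists only updates its score. Our reading of this slide is that this is why the timestamp also appears inside the value. The sketch below models that behaviour with a plain dict (member to score); the `zadd` function here is a stand-in for illustration, not the Redis command itself.

```python
def zadd(sorted_set, score, member):
    """Model ZADD: members are unique, so re-adding one just updates its score."""
    sorted_set[member] = score


# Two identical readings one second apart, without a timestamp in the value.
without_ts = {}
zadd(without_ts, 1357542004000, '{"temp": 523}')
zadd(without_ts, 1357542005000, '{"temp": 523}')  # same member: overwrites

# The same readings with the timestamp embedded in the value.
with_ts = {}
zadd(with_ts, 1357542004000, '{"temp": 523, "timestamp": 1357542004000}')
zadd(with_ts, 1357542005000, '{"temp": 523, "timestamp": 1357542005000}')

print(len(without_ts))  # 1
print(len(with_ts))     # 2
```

Without the embedded timestamp, a sensor that reports the same values twice would silently lose the earlier datapoint.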

Page 40: High-Volume Data Collection and Real Time Analytics Using Redis

Requirement Satisfied

[Diagram: Gateway → Redis]

Page 41: High-Volume Data Collection and Real Time Analytics Using Redis

There is a disturbance in the Force..

Page 42: High-Volume Data Collection and Real Time Analytics Using Redis

Collecting Data

[Diagram: Gateway → Redis]

Page 43: High-Volume Data Collection and Real Time Analytics Using Redis

“In Memory” Means Many Things

• The data store capacity is aggressively capped
  – Redis can only store as much data as the server has RAM
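
To see how quickly that cap bites, here is a rough estimate using the 8 GB / hour figure from the earlier "Collecting Data" slide. The 64 GB server size is a hypothetical example, and the calculation ignores Redis's per-key and per-member memory overhead, so the real limit arrives sooner.

```python
def hours_until_full(ram_gb, ingest_gb_per_hour, used_gb=0.0):
    """Hours of ingest that fit in the remaining RAM at a constant rate."""
    if ingest_gb_per_hour <= 0:
        raise ValueError("ingest rate must be positive")
    return max(ram_gb - used_gb, 0.0) / ingest_gb_per_hour


# Hypothetical 64 GB server at the 8 GB / hour rate from the slides.
print(hours_until_full(64, 8))  # 8.0
```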

Page 44: High-Volume Data Collection and Real Time Analytics Using Redis

Collecting Big Data

[Diagram: Gateway → Redis]

Page 45: High-Volume Data Collection and Real Time Analytics Using Redis

We could throw away data…

• If we only cared about current values
• However, our data
  – Must be stored for 1+ years for compliance
  – Must be queryable for historical/trend analysis

Page 46: High-Volume Data Collection and Real Time Analytics Using Redis

We Still Need Long-term Data Storage

Solution? Migrate data to an archive with expansive storage capacity

Page 47: High-Volume Data Collection and Real Time Analytics Using Redis

Winning

[Diagram: Gateway → Redis → Archiver → PostgreSQL]
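
The slides do not show the archiver's code, so the following is only a sketch of one way such a component could work, assuming a redis-py client. `archive_older_than` and the `store_rows` callback (for example, a batched INSERT into PostgreSQL) are hypothetical names; `zrangebyscore` and `zremrangebyscore` are redis-py methods, and the `(` prefix is Redis syntax for an exclusive bound.

```python
def archive_older_than(client, key, cutoff_ms, store_rows):
    """Move datapoints with score < cutoff_ms from Redis to the archive."""
    upper = "(%d" % cutoff_ms  # "(" makes the upper bound exclusive
    members = client.zrangebyscore(key, "-inf", upper, withscores=True)
    if not members:
        return 0
    # With withscores=True each item is a (member, score) pair.
    rows = [(key, int(score), member) for member, score in members]
    store_rows(rows)
    # Delete from Redis only after the archive write returned without raising.
    client.zremrangebyscore(key, "-inf", upper)
    return len(rows)
```

One caveat of this simple version: a datapoint older than the cutoff that arrives between the read and the delete would be removed without being archived, so a real archiver would need to choose the cutoff with late arrivals in mind.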

Page 48: High-Volume Data Collection and Real Time Analytics Using Redis

Winning?

[Diagram: Gateway → Redis → Archiver → PostgreSQL; "Some Poor Client" is left with question marks over both Redis and PostgreSQL]

Page 49: High-Volume Data Collection and Real Time Analytics Using Redis

Yes, Winning

[Diagram: Gateway → Redis → Archiver → PostgreSQL; an API sits in front of both stores and serves "Some Happy Client"]

Page 50: High-Volume Data Collection and Real Time Analytics Using Redis

Best of both worlds

[Diagram: Gateway → Redis → Archiver → PostgreSQL, with the API in front of both stores]

Redis allows quick access to real-time data, for monitoring and event detection

PostgreSQL allows complex queries and scalable storage for deep and historical analysis

Page 53: High-Volume Data Collection and Real Time Analytics Using Redis

We Have the Data, Now What?

Incoming data must be monitored and analyzed, to detect significant events

What is “significant”?

What about new data types?

Page 54: High-Volume Data Collection and Real Time Analytics Using Redis

New guy (Django App): provide a way to read the data and create rules

Example rule: motion > x && pressure < y && audio > z

[Diagram: Gateway → Redis → Archiver → PostgreSQL; the API serves a Django App, which has its own App DB]

Page 55: High-Volume Data Collection and Real Time Analytics Using Redis

New guy (Event Monitor): read the rules and data, trigger alarms

motion > x
pressure < y
audio > z
All true?

[Diagram: as on the previous slide, plus multiple Event Monitor instances alongside the Django App]
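
A rule of this shape is a conjunction of threshold checks against one datapoint. The slides do not show how rules are stored or evaluated, so the representation below (a list of `(field, operator, threshold)` tuples) and the threshold values are assumptions for illustration. The field names come from the environment datapoint shown earlier.

```python
import operator

# Comparison operators a rule condition may use.
OPS = {">": operator.gt, "<": operator.lt}


def rule_matches(rule, datapoint):
    """True only if every condition holds; a missing field never matches."""
    for field, op, threshold in rule:
        value = datapoint.get(field)
        if value is None or not OPS[op](value, threshold):
            return False
    return True


# Hypothetical thresholds standing in for x, y and z.
rule = [("motion", ">", 150), ("pressure", "<", 100000), ("audio_p2p", ">", 400)]
datapoint = {"motion": 203, "pressure": 99007, "audio_p2p": 460}

print(rule_matches(rule, datapoint))  # True
```

Keeping rules as data rather than code fits the split on these slides: the Django app can create and store them, and the event monitor can load and evaluate them against incoming datapoints.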

Page 56: High-Volume Data Collection and Real Time Analytics Using Redis

Event monitor services can be scaled independently

[Diagram: same architecture, with multiple Event Monitor instances]

Page 60: High-Volume Data Collection and Real Time Analytics Using Redis

Getting The Message Out

Considerations

• The event monitor already has a job; avoid re-tasking it as a notification engine

• Notifications are most efficient as a “push”, rather than requiring clients to poll

• The notification system should be generalized across channels, e.g. SMTP, SMS

Page 61: High-Volume Data Collection and Real Time Analytics Using Redis

If only…

Page 62: High-Volume Data Collection and Real Time Analytics Using Redis

Pub/Sub with synchronized workers is an optimal solution to real-time event notifications.

No need to add another system: Redis offers pub/sub services as well!

[Diagram: Event Monitors publish to Redis Pub/Sub; Notification Workers subscribe and send via SMTP. Gateway, Redis Data, Archiver, PostgreSQL, API, Django App and App DB as before]
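
A sketch of the two sides of this arrangement, assuming a redis-py client. `publish`, `pubsub()`, `subscribe` and `listen()` are redis-py calls; the channel name, the helper names and the `send_email` notifier in the comment are hypothetical.

```python
import json

CHANNEL = "events"  # hypothetical channel name


def publish_event(client, event):
    """Event monitor side: push an event to every subscribed worker."""
    return client.publish(CHANNEL, json.dumps(event))


def handle_message(message, notify):
    """Worker side: decode one pub/sub message and hand it to a notifier."""
    if message.get("type") != "message":
        return False  # subscribe/unsubscribe confirmations carry no event
    notify(json.loads(message["data"]))
    return True


# Worker loop against a real server (requires the redis package), for context:
#   pubsub = client.pubsub()
#   pubsub.subscribe(CHANNEL)
#   for message in pubsub.listen():
#       handle_message(message, send_email)
```

Note that Redis pub/sub is fire-and-forget: a worker that is not connected when an event is published does not receive it later, and every subscriber on a channel receives every message.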

Page 63: High-Volume Data Collection and Real Time Analytics Using Redis

Conclusions

• Redis is a powerful tool for collecting large amounts of data in real time

• In addition to maintaining a rapid pace of data insertion, we were able to concurrently query, monitor, and detect events on our Redis data collection system

• Bonus: Redis also enabled a robust, scalable real-time notification system using pub/sub

Page 64: High-Volume Data Collection and Real Time Analytics Using Redis

Things to watch

• Data persistence
  – If Redis needs to restart, it takes 10-20 seconds per gigabyte to re-load all data into memory [1]
  – Redis is unresponsive during startup

[1] http://oldblog.antirez.com/post/redis-persistence-demystified.html

Page 65: High-Volume Data Collection and Real Time Analytics Using Redis

Future Work

• Improve scalability through:
  – Data encoding
  – Data compression
  – Parallel batch inserts for all nodes on a gateway
• Deep historical data analytics

Page 66: High-Volume Data Collection and Real Time Analytics Using Redis

Acknowledgements

• Project engineers Chris Taschner and Jeff Hamed @ CMU SEI
• Prof. Anthony Rowe & CMU ECE WiSE Lab
  http://wise.ece.cmu.edu/
• Our organizations
  CMU https://www.cmu.edu
  CERT http://www.cert.org
  SEI http://www.sei.cmu.edu
  Cylab https://www.cylab.cmu.edu

Page 68: High-Volume Data Collection and Real Time Analytics Using Redis

Thank You

Questions?

Page 69: High-Volume Data Collection and Real Time Analytics Using Redis

Slides of Live Redis Demo

Page 70: High-Volume Data Collection and Real Time Analytics Using Redis

A Closer Look at Redis Data

redis> keys *

1) "sensor:environment:f80"
2) "sensor:environment:f81"
3) "sensor:environment:f82"
4) "sensor:environment:f83"
5) "sensor:environment:f84"
6) "sensor:power:f85"
7) "sensor:power:f86"
8) "sensor:radiation:f87"
9) "sensor:particulate:f88"

Page 71: High-Volume Data Collection and Real Time Analytics Using Redis

A Closer Look at Redis Data

redis> keys sensor:power:*

1) "sensor:power:f85"
2) "sensor:power:f86"

Page 72: High-Volume Data Collection and Real Time Analytics Using Redis

A Closer Look at Redis Data

redis> zcount sensor:power:f85 -inf +inf

(integer) 3565958
(45.38s)

Page 73: High-Volume Data Collection and Real Time Analytics Using Redis

A Closer Look at Redis Data

redis> zcount sensor:power:f85 1359728113000 +inf

(integer) 47

Page 74: High-Volume Data Collection and Real Time Analytics Using Redis

A Closer Look at Redis Data

redis> zrange sensor:power:f85 -1000 -1

1) "{\"long_energy1\": 73692453, \"total_secs\": 6784, \"energy\": [49, 175, 62, 0, 0, 0], \"c2_center\": 485, \"socket_state\": 1, \"node_type\": \"power\", \"c_p2p_low2\": 437, \"socket_state1\": 0, \"mac_address\": \"103\", \"c_p2p_low\": 494, \"rms_current\": 6, \"true_power\": 1158, \"timestamp\": 1359728143000, \"v_p2p_low\": 170, \"c_p2p_high\": 511, \"rms_current1\": 113, \"freq\": 60, \"long_energy\": 4108081, \"v_center\": 530, \"c_p2p_high2\": 719, \"energy1\": [37, 117, 100, 4, 0, 0], \"v_p2p_high\": 883, \"c_center\": 509, \"rms_voltage\": 255, \"true_power1\": 23235}"

2)…

Page 75: High-Volume Data Collection and Real Time Analytics Using Redis

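Redis Python API

import redis

# Connect through a shared connection pool (the host must be a string)
pool = redis.ConnectionPool(host="127.0.0.1", port=6379, db=0)
r = redis.Redis(connection_pool=pool)

# Last 50 entries by index, i.e. the 50 highest scores (most recent timestamps)
byindex = r.zrange("sensor:env:f85", -50, -1)
# ['{"acc_z":663,"bat":0,"gpio_state":1,"temp":663,"light":…

# All entries whose score (timestamp in ms) falls in a one-second window
byscore = r.zrangebyscore("sensor:env:f85", 1361423071000, 1361423072000)
# ['{"acc_z":734,"bat":0,"gpio_state":1,"temp":734,"light":…

# Total number of entries in the sorted set
size = r.zcount("sensor:env:f85", "-inf", "+inf")
# 237327L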

