GAMPS: Compressing Multi-Sensor Data by Grouping & Amplitude Scaling
Sorabh Gandhi, UC Santa Barbara
Suman Nath, Microsoft Research
Subhash Suri, UC Santa Barbara
Jie Liu, Microsoft Research
Fine Grained Sensing & Data Glut
Advances in sensing technology enable fine-grained, ubiquitous sensing of the environment
Many applications, but the issue is data glut
Automated Data Center Cooling [MSFT DC Genome project]: physical parameters, e.g. humidity, temperature; 1000s of sensors, 10 bytes/sensor/sec; 10s of GBs/day
Server Performance Monitoring [MSFT server farm monitoring]: performance counters, e.g. CPU utilization, memory usage; 100s of counters, 1000s of servers, a few bytes/counter/sec; TBs/day
Focus and Objectives
Data archival + (reliable and fast) query processing, in a centralized setting
Point query: report the value for sensor x at time t
Similarity query: report sensors 'similar' to sensor x in a time range
Obvious solution: compression; the data is a set of time series
Initial idea: approximate every time series individually. Many approximation techniques are known, e.g. DFT, DCT, wavelets, piecewise constant/linear
Focus: L1 error [gives a guarantee on point queries], e.g. piecewise constant/linear approximations
Compression alone is not enough! It gives up to an order of magnitude improvement; we want more
Signals are Correlated!
Server dataset: 40 signals, 1 day, sampled once every 30 seconds; counter: # of connected users
[Figure: 40 overlapping signals; y-axis: # Connected Users, x-axis: Time]
Similar signals in a group
Shifted/Scaled groups Dynamic groups
We propose GAMPS, which exploits linear correlations among multiple signals while compressing them together, and gives L1 guarantees
Compression both along time and across signals
We propose an index structure for the compressed data that gives fast responses to many relevant queries
Through simulations on real data, we show that on large datasets GAMPS can achieve up to an order of magnitude improvement over state-of-the-art compression techniques
Contributions
State of the art: Single Signal Optimal L1 approximations
Problem: Given a time series S and an input parameter ε, approximate S with piecewise-constant segments such that the L1 error is ≤ ε
Greedy algorithm: PCGreedy(S, ε)
[Figure: original time series and its piecewise-constant approximation; each segment covers a value band of height 2ε]
ICDE’03, Lazaridis et al.
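A minimal sketch of such a greedy piecewise-constant pass; the slides do not give PCGreedy's internals, so the segment-extension rule below (grow a segment while max − min ≤ 2ε, then output the midpoint) is an assumption based on the 2ε band in the figure:

```python
def pc_greedy(series, eps):
    """Greedy piecewise-constant approximation (sketch, assumed details).

    Extends the current segment while its running max - min stays
    within 2*eps, then represents it by (max + min) / 2, so every
    point is within eps of its segment's value.
    Returns a list of (start_index, end_index, value) segments.
    """
    segments = []
    start = 0
    lo = hi = series[0]
    for i, v in enumerate(series[1:], start=1):
        new_lo, new_hi = min(lo, v), max(hi, v)
        if new_hi - new_lo > 2 * eps:
            # Close the current segment and start a new one at i.
            segments.append((start, i - 1, (lo + hi) / 2))
            start, lo, hi = i, v, v
        else:
            lo, hi = new_lo, new_hi
    segments.append((start, len(series) - 1, (lo + hi) / 2))
    return segments
```

Every point then lies within ε of its segment's value, which is exactly the per-point guarantee the point query relies on.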
GAMPS Overview
GAMPS takes as input the set of time series and an approximation parameter ε
Compression proceeds in three phases:
Partition phase: partitions the data into contiguous time intervals
Group phase: divides a given partition into groups of similar signals
Amplitude scaling phase: compression with sharing of representations
[Diagram: Data flows through the Partition Phase, Grouping Phase, and Amplitude Scaling Phase (compression), producing Compressed Data and an Index Structure (indexing)]
Compression by Amplitude Scaling
Given a group of k 'similar' signals, denote them by the set X = {X1, X2, …, Xk}
Key idea: express every signal Xi as a scaled function of some signal Xj: Xi = Ai·Xj
Ai is the ratio/amplitude signal and Xj is the base signal
If signal Xi is a perfectly scaled version of Xj, then Ai is constant; to reconstruct Xi we only need to store the constant and Xj
In reality there is no perfect correlation
However, we found that if there are enough linearly correlated signals, smartly approximating the Ai and Xj can give very good compression factors!
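The scaling idea above can be sketched in a few lines; the pointwise division and the helper names (`ratio_signal`, `reconstruct`) are illustrative assumptions, not the paper's code:

```python
def ratio_signal(xi, xbase):
    """Pointwise ratio A_i[t] = X_i[t] / X_base[t] (sketch; assumes the
    base signal is nonzero, e.g. humidity or utilization percentages)."""
    return [a / b for a, b in zip(xi, xbase)]

def reconstruct(ai, xbase):
    """Recover X_i from its ratio signal and the base signal."""
    return [r * b for r, b in zip(ai, xbase)]
```

When Xi really is a scaled version of Xj, the ratio signal is flat and compresses to a single piecewise-constant segment; in practice it is merely "sparse", which is where the savings come from.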
Illustration: Amplitude Scaling on Real Dataset
DataCenter dataset: 6 signals shown for ~3 days each; parameter: relative humidity
Input: X = {X1, X2, …, X6}, ε = 1%
Need to choose a base signal and divide ε between the base signal (εb) and the ratio signal approximations (εr)
Oracle: X4 is the base signal; the oracle also provides the values εb and εr
Run PCGreedy(X4, εb), and PCGreedy(Ai, εr) for each signal other than the base
DataCenter Dataset
Illustration: Amplitude Scaling on Real Dataset
Leftmost figure: all signals use PCGreedy() with ε = 1.0% (individual approximations; y-axis: relative humidity)
Middle figure: higher-fidelity base signal approximation, εb = 0.4% (y-axis: relative humidity)
Rightmost figure: ratio signal approximations (y-axis: ratio); very sparse, a small number of segments suffices
Compression factor = M1/M2
M1 = number of segments in the individual signal approximations; M2 = number of segments in the base + ratio signal approximations
For this illustrative dataset, the compression factor at 1% error is 1.9
Quantitative Comparison for Amplitude Scaling
Comparison with optimal individual approximations
Facility location problem: modeled on a graph G(V, E). Opening a facility at node j costs c(j); serving a demand point i using facility j costs w(i, j)
Objective: choose F ⊆ V to minimize Σ_{j∈F} c(j) + Σ_{i∈V} min_{j∈F} w(i, j)
Grouping & amplitude scaling is modeled as facility location
Complete graph, every signal is represented by a node
Cost of opening a facility: # segments needed to represent the base signal
Cost of serving a demand point: # segments needed to represent the ratio signal
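A small stand-in for this facility-location step; the paper solves it exactly with an integer linear program, while the brute-force enumeration below (a hypothetical `best_grouping` helper, viable only for small signal counts) just illustrates the objective Σ_{j∈F} c(j) + Σ_i min_{j∈F} w(i, j):

```python
from itertools import combinations

def best_grouping(open_cost, serve_cost):
    """Exact brute-force facility location over signals (sketch).

    open_cost[j]: # segments to represent signal j as a base signal.
    serve_cost[i][j]: # segments for signal i's ratio w.r.t. base j
    (serve_cost[j][j] == 0: a base serves itself for free).
    Enumerates every nonempty set of base signals and returns
    (total_cost, chosen_bases); each signal is served by the
    cheapest open base.
    """
    n = len(open_cost)
    best = (float("inf"), ())
    for r in range(1, n + 1):
        for bases in combinations(range(n), r):
            cost = sum(open_cost[j] for j in bases)
            cost += sum(min(serve_cost[i][j] for j in bases)
                        for i in range(n))
            best = min(best, (cost, bases))
    return best
```

With costs measured in segments, a large opening cost for bases naturally discourages too many groups, mirroring the trade-off in the slides.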
Grouping and Amplitude Scaling by Facility Location
Implementation Setup
We set εb = 0.4·ε [error allocation for the base signal]
Facility location is NP-hard. We show results with the exact solution (an integer linear program); approximate solutions are within 90% of the results shown. Solving the linear program takes at most a few seconds
We use three different datasets:
Server dataset: 240 signals, 1 day of data [CPU utilization counter]
DataCenter dataset: 24 signals, 3 days of data [humidity sensors]
IBT dataset: 45 signals, 1 day of data [temperature sensors in a building in Berkeley]
Quantitative Evaluation: GAMPS
The figure on the left shows the compression factor over raw data: for 1.5% error, ~300 for the Server data and ~50 for the other two
The figure on the right shows the compression factor over individual approximations: for 1.5% error, between a factor of 2 and 10
The compression factor is highest for the Server dataset, whose average group size is largest (60, compared to 4.5 and 6)
Scaling versus Group size
We extracted 60 signals in the same group for the Server dataset
Compression factor (versus individual approximations) increases as group size increases
Advantage of Grouping
Demonstrate the advantage of having multiple groups (datasets: IBT and Server)
Hybrid: an algorithm that allows only one group; every signal is either in the group or approximated individually
For both datasets and all errors, grouping gives a large advantage: compression factor 1.5 (IBT) to 9 (Server) at 1.5% error
Grouping: Geographical Locality
IBT dataset, 1 day, error = 1.5%
GAMPS runs the grouping on an entire day's data
The picture on the left shows the sensor layout in the Intel Berkeley lab: hexagons are sensor positions, crosses are sensors with no data for the day, rectangles are outliers (individually approximated)
The simple region boundaries confirm our intuition
[Figures: Sensor Layout (left), Group Layout (right)]
Indexing Compressed Data
[Diagram: a skip-list over groups 1–5; each group entry holds a pointer to its base signal and a skip-list of approximation segments for each ratio signal]
We propose a skip-list-based index structure. Point query: O(log n); range query: O(log n + range); similarity query: O(log n + #groups in range)
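A sketch of the point query over compressed data; Python's `bisect` over sorted segment start times stands in for the skip list here and gives the same O(log n) lookup. A signal's value at time t would then be its ratio-signal value times the base-signal value at t:

```python
import bisect

def point_query(segments, t):
    """Look up the value of a compressed signal at time t (sketch).

    segments: list of (start_time, value) pairs sorted by start_time;
    each value holds from its start until the next segment begins.
    The binary search plays the role of the skip-list descent.
    """
    starts = [s for s, _ in segments]
    k = bisect.bisect_right(starts, t) - 1
    return segments[k][1]

def signal_value(ratio_segs, base_segs, t):
    """Reconstruct X_i(t) = A_i(t) * X_base(t) from the two indexes."""
    return point_query(ratio_segs, t) * point_query(base_segs, t)
```

In the real structure the start times live inside skip-list nodes, so `starts` is never materialized; the list here is only for clarity.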
Future Work
How should the error be distributed between the base and ratio signals?
What about generic linear transformations? We use only the ratio signal (scaling): Xi = Ai·Xj; perhaps we can get much better compression using Xi = Ai·Xj + Bi
What about piecewise-linear signals? The underlying algorithm is not so trivial (convex hulls)
Can we apply this technique to 2D signals? Consider a video: every pixel's values over time form a time series, and every pixel time series is correlated with neighboring pixel time series
Thanks for your attention
Example Query: Similarity Query
Based on the grouping, we can define a similarity coefficient for a given time range (t1, t2): the average over t ∈ (t1, t2) of an indicator that equals 1 if signals Si and Sj are in the same group at time t, and 0 otherwise
[Figure: similarity query over part of the IBT dataset]
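A sketch of that similarity coefficient, assuming (as the indicator suggests) that it averages the same-group indicator over the time range; representing the grouping as `groups_x[t]`, signal x's group id at time t, is an illustrative choice:

```python
def similarity(groups_i, groups_j, t1, t2):
    """Fraction of time steps in [t1, t2] at which two signals fall
    in the same group (sketch of the slide's similarity coefficient).

    groups_i[t], groups_j[t]: group ids of the two signals at time t.
    """
    same = sum(1 for t in range(t1, t2 + 1)
               if groups_i[t] == groups_j[t])
    return same / (t2 - t1 + 1)
```

Because group membership is stored per partition in the index, the real query touches only the O(#groups in range) entries rather than every time step.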
Compression by Interval Sharing
Key idea: if two sensors have nearly overlapping time series, they can share a part of the approximation
Let the number of signals be k and the desired error be ε
(α, β) approximation algorithm: if the optimal algorithm takes OPT segments for error ε, then the (α, β) algorithm has error no more than αε and uses no more than βOPT segments
We propose a polynomial-time (5, log k + log OPT) approximation algorithm for piecewise-constant approximation with interval sharing
[Figure: Signal 1 and Signal 2; their overlapping portions can share one representation]
Multiple Correlated Signals: Example 1
Instant messaging service (Server dataset): 240 servers, 2 weeks, ≥ 100 performance counters
40 signals shown (normalized) for one day; counter: # connected users, sampled once every 30 seconds
The signals are correlated (almost overlapping) with each other; can we exploit this in compression?
Server Dataset
Multiple Correlated Signals: Example 2
Data center monitoring: 24 sensors, 2 years, 2 parameters (humidity, temperature)
6 signals shown for ~3 days each; parameter: relative humidity, sampled once every 30 seconds
The signals are not overlapping, but still correlated; shifting or scaling may help
Question: can we exploit this correlation? We propose a technique to compress multiple signals both along time and across signals
DataCenter Dataset
Partition Determination
Use the double-half-same size heuristic: start with some initial batch size (say 100 data points)
For the next batch, run group-and-compress with 200, 100, and 50 data points
For 200, compare with two batches of size 100; whichever takes less memory is chosen. Similarly for 50, compare two batches of size 50 with one batch of size 100
Memory taken = # segments + cluster delta
Cluster delta: every time the clusters change, we must update the base signals and the base-ratio signal relationships
GAMPS Illustration
[Figure: five signals; the data is partitioned along time, similar signals are grouped within each partition (e.g. {1, 2}, {3}, {4, 5}), and base signals (2, 4) and ratio signals (1, 3, 5) are selected]
GAMPS Compression Illustration
[Figure: the same five signals: partition along time (to overcome varying correlations), group similar signals together, then compress by amplitude scaling]