Monitoring General Functions in Distributed Systems with Minimal Communication
Amir Abboud
Technion - Computer Science Department - M.Sc. Thesis MSC-2012-21 - 2012
Monitoring General Functions in Distributed Systems with Minimal Communication
Research Thesis
Submitted in partial fulfillment of the requirements
for the degree of Master of Science in Computer Science
Amir Abboud
Submitted to the Senate
of the Technion — Israel Institute of Technology
Tammuz 5772 Haifa July 2012
This research was carried out under the supervision of Prof. Assaf Schuster and Prof.
Daniel Keren, in the Faculty of Computer Science.
Acknowledgements
I would first of all like to extend my sincere gratitude to my supervisors, Prof. Assaf
Schuster and Prof. Daniel Keren, for their enthusiastic encouragement and useful
critiques of this research work. Their wisdom, knowledge, and commitment to the
highest standards have certainly motivated me.
Expressions of gratitude are also in order for my colleagues in the research group,
Tsachi, David, Guy and Mario, whose useful comments and constructive ideas have
contributed largely to this work. Our discussions have always been stimulating, engaging
and highly enjoyable.
I would also like to thank my parents, my grandparents, my brother and my sister
who have always supported, encouraged, and believed in me.
Finally, I would like to thank my friends for standing by me, and cheering me up
through it all.
The Technion’s funding of this research is hereby acknowledged.
Contents
List of Figures
Abstract 1
1 Safe Zones: An Efficient Approach to Distributed Monitoring 3
1.1 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.1 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.2 Preliminaries: Minkowski Average . . . . . . . . . . . . . . . . . 8
1.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4 Overview of the Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4.1 Safe-Zone Allocation as an Optimization Problem . . . . . . . . 11
1.4.2 The Parametric Family of Allowable Safe Zone Shapes . . . . . . 13
1.4.3 Convexity of S and the Safe Zones . . . . . . . . . . . . . . . . . 13
1.5 Safe Zone Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.5.1 Computing the Target Function . . . . . . . . . . . . . . . . . . . 15
1.5.2 Checking the Constraints . . . . . . . . . . . . . . . . . . . . . . 15
1.6 Hierarchical Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.7 The Complexity of Computing Optimal Safe Zones . . . . . . . . . . . . 18
1.8 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.8.1 Data, Methods, and Monitored Functions . . . . . . . . . . . . . 21
1.8.2 Ratio Queries with Triangular Safe Zones . . . . . . . . . . . . . 22
1.8.3 Improvement over GM Algorithm . . . . . . . . . . . . . . . . . . 23
1.8.4 Ratio Queries: Hierarchical Implementation . . . . . . . . . . . 23
1.8.5 Chi-square monitoring in 5 dimensions with axis-aligned box-
shaped safe zones . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
1.8.6 3-Dimensional Data, Quadratic Function, Polygonal Safe Zones . 26
1.8.7 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
1.8.8 Improvement Factor and Dimensionality . . . . . . . . . . . . . . 28
1.9 Chapter Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2 Discrete Safe Zones: Biclique Approach 29
2.1 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.3 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.4 Biclique Formalization - k = 2 . . . . . . . . . . . . . . . . . . . . . . . . 32
2.4.1 Greedy Heuristic . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.4.2 Linear Programming . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.5 Generalized Biclique Formalization . . . . . . . . . . . . . . . . . . . . . 34
2.6 Hierarchical Heuristic . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.6.1 Classes of functions . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.6.2 Pruning nodes in the Biclique problem . . . . . . . . . . . . . . . 38
2.7 Advantages over the geometric Safe Zones . . . . . . . . . . . . . . . . . 39
3 Violation Resolution in Distributed Stream Networks 41
3.1 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.4 Violation Resolution and Minimum Resolving Set . . . . . . . . . . . . . 44
3.4.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.4.2 Resolving Local Violations . . . . . . . . . . . . . . . . . . . . . 45
3.4.3 Running Examples . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.5 Generic Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.5.1 Minimum Resolving Set is NP-Hard . . . . . . . . . . . . . . . . 51
3.5.2 Probabilistic Analysis of the Algorithm . . . . . . . . . . . . . . 52
3.6 Homogeneous Data Instance . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.7 Heterogeneous Data Instance . . . . . . . . . . . . . . . . . . . . . . . . 57
3.7.1 The Heterogeneous Data Challenge . . . . . . . . . . . . . . . . . 57
3.7.2 Maximum Matching Tree Algorithm . . . . . . . . . . . . . . . . 57
3.7.3 Maximum Matching Tree Construction . . . . . . . . . . . . . . 58
3.7.4 Distributed Variant of MMT . . . . . . . . . . . . . . . . . . . . 61
3.8 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.8.1 Data sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.8.2 Scoring Functions and Local Constraints . . . . . . . . . . . . . . 62
3.8.3 Performance Metrics . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.8.4 Compared Instances . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.8.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . 65
3.9 Chapter Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4 Conclusions 69
Hebrew Abstract i
List of Figures
1.1 Air quality example with four sensors, two pollutants, function max, and
threshold 1. The local data vectors are depicted by red circles, their
convex hull is outlined by a dashed line, and the global average vector is
depicted by a cross sign. The region of vectors whose value corresponds
to good air quality is colored gray. The maximal average value for
either pollutant is 0.5125; hence it is easy to construct constraints that
completely avoid communication (outlined for each pollutant by dotted
lines along the respective axis). However, as the figure clearly shows,
any attempt to cover the convex hull of the sensor readings will result in
constraint violations and unnecessary communication. . . . . . . . . . . 7
1.2 Illegal and legal safe zones. Top: left and right depict two-dimensional
data at two nodes A and B. The p.d.f at both nodes is a Gaussian
(normal) distribution. S, which must contain the Minkowski average of
the two safe zones, is the dotted ellipse in the middle. The point cloud in
the middle is a sample of the global data vectors, obtained by averaging
the data vectors at both nodes. The allowable family of safe zone shapes
consists of four-vertex polygons. The depicted safe zones (outlined in
black) fit the local data well, but are alas illegal, since their Minkowski
average (continuous dark line in the middle) is not inside S. Bottom:
This time the safe zones are legal: they satisfy the constraint (1.2). . . . 12
1.3 The half-plane approach. Top: S is equal to H, which is defined by θ
and b. Bottom: a supporting hyper-plane is used for every safe zone
candidate Si. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.4 Hierarchical clustering example. Data is taken from [web]. The allowable
family of safe zone shapes consists of pentagons. Each node’s data is
represented by a scatter diagram at the bottom. The yellow safe zones
are those computed for the original nodes. Clusters of the data of node
pairs are represented as supernodes in the middle row. The two safe zones
corresponding to the supernodes are colored green (as is their Minkowski
average, depicted inside S). The root is the diagram at the top row. S is
the blue ellipse in the root (the root node was scaled for better view). An
ellipse corresponds to monitoring a quadratic function (see also Section
1.8.6). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.5 A schematic example of the proof of equivalence of the safe zone and
the biclique problems (Theorem 1.3). The bipartite graph (top), nodes
(middle), and S (bottom). . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.6 Top: two of the GMM elements super-imposed on the data. Middle:
typical local concentrations of NO and NO2 as a function of time. Bottom:
typical behavior of the local ratio between NO and NO2 as a function of
time. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.7 Triangular safe zones used for ratio monitoring. . . . . . . . . . . . . . . 22
1.8 Example of optimal safe zones with four nodes. S is the dark triangle;
safe zones are outlined in green. . . . . . . . . . . . . . . . . . . . . . . . 23
1.9 Comparison of safe zones (green line) to GM (blue line) in terms of the
number of violations, up to 10 nodes. For more nodes (up to 200 were
used in this experiment), the average improvement of safe zones over GM
was by a factor of 17.5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
1.10 Comparison of the safe zone method to GM in terms of points which
cause a violation. At each node, the set S is depicted (dark triangle), the
safe zone (green triangles), and a sample of the data points (red dots).
The points which satisfy the GM constraints are depicted in blue. The
advantage of the safe zone method over GM is clear. . . . . . . . . . . . 24
1.11 Clustering example. Each row depicts two nodes from one cluster. . . . 24
1.12 Running time (in logarithmic scale) for “flat” – direct optimization over
all the nodes (blue) vs. hierarchical clustering (green). . . . . . . . . . . 25
1.13 Plots of the chi-square function for two nodes, an “oscillating” one (highly
varying data) in green, and a more stable node (in blue). Horizontal axis
stands for time (in hours), vertical axis for the chi-square value. . . . . . 26
1.14 The safe zones assigned to the two nodes in Fig. 1.13. The “oscillating”
node (top) is assigned a much larger safe zone, to account for its higher
variability. Since the data was 5-dimensional, only a 3-dimensional
projection is depicted, corresponding to the pollutants NO, NO2, and
SO2. Pink dots denote samples from the data, safe zones are in green. . 27
1.15 Comparing the number of violations between GM and the safe zone
method, for a period of 1,000 hours. The allowable family of safe zone
shapes used here consisted of 5-dimensional axis-aligned boxes. Horizontal
axis is the threshold for the chi-square function, vertical axis is the ratio
between numbers of violations. . . . . . . . . . . . . . . . . . . . . . . . 27
1.16 3D example. S is the pink ellipsoid, the safe zones are polyhedra with
eight vertices each (in pale blue), their Minkowski average is in green.
The axes stand for concentrations of NO, NO2, SO2. . . . . . . . . . . 27
2.1 Non-Convex SZs example: Plotted are the data samples of both
nodes (right and left clouds), and the "legal" global data points (center)
which are the averages of every pair of points, one from each node, such
that the average resides in the admissible region S. This S corresponds
to a function s.t. f(x, y) = β ⇐⇒ c1 ≤ (x − a1)² + (y − a2)² ≤ c2. The
data points colored in green, at the nodes, are points that were in the
resulting SZ. The SZs are non-convex sets. The green points in the center
are all the averages of a pair of points, one from each SZ. . . . . . . . . 40
3.1 A monitoring system, consisting of 3 monitoring nodes and a coordinator,
in a 2-dimensional space. The safe zones are given as rectangles in the
plane and the vectors are marked by dots. It’s easy to see that the average
of every 3 vectors, taken respectively from the local safe zones, resides
within the global safe zone (S). At t = 0, all the local vectors reside
within their respective safe zones and, consequently, the global vector is
also inside the global safe zone. In this case, none of the monitoring nodes
reports its local vector to the coordinator. At t = 1, a local violation has
occurred at N1 (v1¹ ∉ S1). N1 would now report its local vector to NC to
seek resolution. NC must poll N2 or N3 (or both) for their local vectors
in order to verify that v1 ∈ S. . . . . . . . . . . . . . . . . . . . . . . . . 46
3.2 Local violations in homogeneous and heterogeneous systems . . . . . . . 48
3.3 Hoeffding’s lower bound in a homogeneous system . . . . . . . . . . . . 54
3.4 A maximum matching tree over an 8-node system in a 2-dimensional
space. Every level of the tree defines a partition of N1, . . . , N8. The
distribution of the average vector and the safe zone are marked by a
cloud of dots and a rectangle, respectively, for every partition set. The
root node represents the distribution of the global vector and the global
safe zone. Note that, indeed, global violations rarely occur. There are
two types of nodes: type-1 nodes, which have high variance in the 1st
dimension and low variance in the 2nd dimension, and type-2 nodes,
which have high variance in the 2nd dimension and low variance in the 1st
dimension. 4 nodes of each type comprise the leaves of the tree, denoted
by the double outlined ellipses. As expected, MMT first pairs nodes of
the same type. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.5 An execution example of MMT over an 8-node system. At t = 1, a
snapshot of the system is given at the bottom. First, the violating nodes
(N2, N7) report their local vectors to the coordinator. Upon failure to
resolve the violation, the resolving set is extended with the non-violating
nodes from the sets containing N2, N7 in the 1st level of the MMTree. If
we consider the MMTree in Figure 3.4, nodes N4, N6 are polled for their
local vectors. At this point the violations are resolved and the algorithm
terminates. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.6 Data distributions and safe zones of 2 randomly chosen nodes from
each data set. In Syn-HT and RCV-HT the data are projected to a
3-dimensional space. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.7 Experimental results over Syn-HM (left) and Air-HM (right) homogeneous
data sets. The vertical axes of the line graphs are in logarithmic scale.
In the average communication cost, all algorithms (except the Naive)
approach the minimum, as denoted by the number of violations. The
average latency reflects the differences between the algorithms in the
expansion rate of the resolving set. DMMT outperforms the centralized
algorithms in reducing the average maximum communication load. . . . 64
3.8 Experimental results over Syn-HT (left) and RCV-HT (right) hetero-
geneous data sets. The vertical axes of the line graphs are in logarithmic
scale. The clear advantage of MMT and DMMT over the random al-
gorithms is apparent in each of the metrics. DMMT outperforms the
centralized algorithms in reducing the average maximum communication
load. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Abstract
In today’s connected, data-driven world, traditional database systems are replaced
by data stream systems which are fundamentally distributed. These large-scale and
widespread networked systems generate high-volume streams of data that often require
processing in real time. Examples include network traffic monitoring systems,
real-time analysis of financial data, distributed intrusion detection systems and sensor
networks. A principal concern within these distributed systems is threshold monitoring:
Determining whether the value of a certain function, evaluated over network-wide data,
crosses a certain threshold that may indicate a global phase change which calls for some
action.
Formally, the threshold monitoring problem over a data stream system, consisting of
n nodes, can be described as follows: Given are a function f, vectors v1, v2, ..., vn, where
vi is a data tuple (vector) at the ith node, and a threshold τ . The vectors are dynamic,
and the system needs to alert whenever the value of f(v1, v2, ..., vn) crosses τ (that is,
either changes from a value larger than τ to a value smaller than τ , or vice-versa). This
innocuous-looking condition of threshold crossing covers a very wide range of alerts that
complex, distributed, dynamic systems must trigger.
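The crossing semantics defined above can be sketched in a few lines. This is a purely illustrative, centralized reference (all names are hypothetical), not one of the distributed algorithms developed in this thesis:

```python
import numpy as np

def crossed(f, vectors, tau, prev_above):
    """Did f over the global average move across tau since the last check?"""
    global_avg = np.mean(vectors, axis=0)      # average of the n local vectors
    now_above = f(global_avg) > tau
    return now_above != prev_above, now_above

f = lambda v: float(np.max(v))                 # an example scoring function
vecs_t0 = [np.array([0.5, 0.05]), np.array([1.0, 0.05])]  # avg score 0.75
vecs_t1 = [np.array([0.9, 0.05]), np.array([1.5, 0.05])]  # avg score 1.20
hit0, above0 = crossed(f, vecs_t0, tau=1.0, prev_above=False)  # no alert
hit1, above1 = crossed(f, vecs_t1, tau=1.0, prev_above=above0) # alert: crossed up
```

The alert fires on the second update because the value of f over the average moved from below τ to above it.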
A common approach to the monitoring problem is to continuously or periodically
centralize all data or data summaries, thus transforming a distributed problem into
a centralized one. However, such centralization may simply be infeasible in realistic
settings, due to the sheer volume and dynamic nature of the data (implying huge
communication overheads and latencies, as well as rapid energy drain, in the case of
sensors).
In this thesis we will present three works which propose techniques for reducing
communication when monitoring a distributed network. The techniques are based on
the following simple observation: nodes in the system should not send a message every
time new data arrives, but rather send messages only when "interesting" things happen.
We formulate this observation using an idea of Safe Zones (SZs). Each node of the
system gets a Safe Zone (SZ), and is asked to communicate only when the data it
observes drifts out of this SZ. We will ensure that as long as each node is in its SZ,
the global "bad" event (threshold crossing) cannot have occurred.
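The per-node side of this observation reduces to a pure membership test. Below is a minimal sketch, assuming axis-aligned box safe zones (one of the shape families used later in the thesis); the function names are hypothetical:

```python
import numpy as np

def make_box_sz(lower, upper):
    """A safe zone given as an axis-aligned box; only membership is needed."""
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    return lambda v: bool(np.all(lower <= v) and np.all(v <= upper))

def node_step(in_sz, local_vector):
    """Local test on every update: communicate only on a safe-zone breach."""
    return not in_sz(local_vector)   # True means "report to coordinator"

sz = make_box_sz([0.0, 0.0], [1.9, 0.1])
quiet = node_step(sz, np.array([0.5, 0.05]))   # inside the SZ: stay silent
report = node_step(sz, np.array([0.5, 0.30]))  # drifted out: communicate
```

Note that the test is local and trivially cheap, which is exactly what makes the approach suitable for thin, battery-operated nodes.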
A great deal of work exists for the limited cases in which the threshold function (f)
is either linear or monotonic. However, many functions of interest are neither linear nor
monotonic, and work on these cases cannot be extended to general functions. We
propose novel generic techniques for monitoring arbitrary functions.
In Chapter 1 we introduce and formalize the idea of SZs, define the problem of
finding the optimal SZs, and try to solve this problem with geometric tools. In Chapter
2 we present a different approach to the problem of finding the optimal SZs, which
is more appropriate when the data sampled at the nodes is of a discrete nature. In
Chapter 3 we present communication-efficient algorithms for determining whether a
global "bad" event has happened when a node drifts out of its SZ.
The applicability of these techniques to various real problems is demonstrated in
experiments, which also show their advantage, by orders of magnitude, over previous
work. Both the experiments and theoretical analysis show that this advantage increases
with the dimension of the data.
Chapter 1
Safe Zones: An Efficient
Approach to Distributed
Monitoring
1.1 Chapter Summary
Many monitoring tasks over distributed data streams can be formulated as a continuous
query using a function that is defined over the global average of data vectors derived
from the streams. The query will typically produce an alert when the value of the
function crosses a predefined threshold. A fundamental problem in efficient scalable
implementation of such threshold queries is that the data streams are distributed,
sometimes over a wide geographical region. Moving all the data to a centralized data
center for query processing may incur infeasible communication overheads and inflated
data center resource costs. In some cases it may be prohibited altogether by the sheer
aggregated size of the data, or by privacy laws. The goal is thus to enhance scalability
by processing the query locally, using as little communication and global coordination
as possible.
We present a novel scheme for communication reduction in distributed monitoring
using local constraints. Communication and global coordination are required only in the
event that the local constraints are violated by the incoming data. Our work improves
on previous work in a few critical aspects. First, whereas previous work required
constructing a “distributed cover” of the entire convex hull of the local data vectors, our
work compiles constraints that are designed to cover only the global average; further,
they are directly matched and tailored to fit the local data distribution at each stream.
The result is a dramatic decrease in the required volume of communication compared to
previous state of the art, up to two orders of magnitude in our experiments with real-life
data. Both the experiments and theoretical study suggest that the improvement factor
increases with the dimension of the data. Also, in contrast to previous work, which
necessitated complicated constraints and required enormous computational effort over
each of the streams, our scheme can use very simple constraints which incur negligible
local overhead. This latter advantage makes our new approach applicable to thin,
battery-operated sensors and cellular devices.
1.2 Introduction
The need for scalable processing of distributed data streams arises in many important
applications, such as network traffic analysis, sensor networks and complex event
processing. These systems consist of sets of (geographically distributed) nodes where
each node receives a stream of data. The task of interest is formulated as a continuous
query, whose output may change as new data arrives on the stream.
In many problems of interest, a vector is derived from the data arriving on each
stream, and the query continuously monitors a global function that is defined over the
average of the current vectors. The query produces an alert every time the function
crosses a predefined threshold. Such queries, called threshold queries, are the building
blocks for important data processing and data mining tools, including top-k queries,
anomaly detection, feature selection, decision tree construction, association rule mining,
data classification, correlation monitoring, and system monitoring [HNG+07]. Consider,
for example, the analysis of frequency moments over the union of distributed data
streams [CG05]. Here the local data are the local histograms of the incoming stream
data, and the monitored global function, defined over the average of the histograms, is
typically the Lp norm for some p.
As a simple running example, which will allow us to demonstrate both an application
scenario and the improvement over previous work, assume that sensors are deployed at
various locations in a city, measuring the concentration of air pollutants. Each sensor
maintains a vector of its readings, such as the concentrations of CO2, NO, and NO2.
We use a function over the vector of measurements in order to determine the overall air
quality, and we are interested in detecting when the air quality drops below a certain
threshold. Since the air quality in a city may change rapidly from one point to another
and abruptly over short periods of time, we are not interested in determining the air
quality at the individual sensor locations. Rather, we want to determine the relatively
stable measure of overall air quality, to which end we apply the scoring function over
the average of the measurement vectors taken at the sensors. We shall later return to
this example.
In these applications and others, the problem can theoretically be solved by moving
all the data to a central location, computing the average vector, and testing whether
the value of the monitored function crossed the threshold. Alas, that is often impossible,
due to the volume and transmission rate of the distributed streams. The goal we pursue
here is to allow scalability by reducing the global query processing to a set of local
queries, using as little communication as possible.
Many previous studies solve the problem for restricted sets of query functions; see
Section 1.3. Several works attempt to determine when the sum of a set of distributed
counts exceeds a given threshold [KCR06] and to detect frequently occurring items
known as “heavy hitters” [YZ09, MSDO05]. The use of sketches has been proposed for
reducing communication in the construction of wavelets and histograms [CG05, CG09],
and in determining quantiles [CGMR05].
In contrast to many previous works, we present a generic method that provides a
solution for any threshold function defined over the average of current stream vectors.
When processing a threshold query, communication can be reduced by breaking down the
processing task into a set of constraints on the local values held at the nodes. As long as
none of these constraints is violated, it is guaranteed that the global function value has
not crossed the threshold, and therefore no communication is necessary. Communication
is required only in the event that a constraint is breached.
In [SSK08, SSK06] a generic “geometric” scheme was proposed for constructing local
constraints. Using only local information, each node constructs a sphere, where the
union of all spheres contains the convex hull of the local vectors. The local constraint at
each node consists of verifying that the value of the function at every vector in its sphere
does not cross the threshold. Since the spheres cover the convex hull of the local vectors,
they also cover their average vector. Thus, if all the constraints are satisfied, namely,
the value of the function is below the threshold on all the spheres, then the value of the
function does not cross the threshold at any point of the convex hull, including at the
average vector.
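For later contrast with safe zones, the local constraint of this geometric scheme can be approximated by brute force. The sketch below merely samples the ball (the original scheme solves an exact constrained-optimization problem, so this is an optimistic approximation for illustration only; all names are ours):

```python
import numpy as np

def sphere_constraint_ok(f, center, radius, tau, n_samples=2000, seed=0):
    """Approximate check that f stays below tau over an entire ball.
    Sampling can miss the true maximum; the real method must bound it."""
    rng = np.random.default_rng(seed)
    d = len(center)
    # draw points uniformly in the d-dimensional ball around `center`
    dirs = rng.normal(size=(n_samples, d))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    radii = radius * rng.random(n_samples) ** (1.0 / d)
    pts = center + dirs * radii[:, None]
    return bool(np.max([f(p) for p in pts]) <= tau)

f = lambda v: float(np.max(v))
ok = sphere_constraint_ok(f, np.array([0.5, 0.5]), 0.2, tau=1.0)       # holds
bad = sphere_constraint_ok(f, np.array([0.9, 0.9]), 0.3, tau=1.0)     # breached
```

Even in this toy form, the cost of evaluating f over a whole region on every update hints at the computational burden discussed next.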
The method described in [SSK08, SSK06] has several significant drawbacks. Most
importantly, the constraints guarantee that the value of the function on the entire
convex hull of the local data vectors has not crossed the threshold, whereas we are
interested only in the value on the average vector. In fact, [SSK08] uses the convex hull
area as an optimal lower bound on the size of its sphere cover, thus inducing an upper
bound on the performance of this method. In this sense the constraints proposed in
[SSK08, SSK06] are too conservative, leading to unnecessary communication.
To illustrate this problem, we take a closer look at the air quality example given
above. Say the network consists of four sensors. Two of them are at the center of
the city, where cars are the main source of pollution, and the dominant pollutant is
pa. The other two sensors are located outside the city center, where industrial plants
are the main source of pollution, and the dominant pollutant is pb. We consider the
quality of air to be good if the concentrations of both pollutants pa and pb are below
one unit; otherwise, we consider the quality of air to be bad. In other words, if (ca, cb)
denotes the average measurement vector, where ca is the concentration of pa and cb is
the concentration of pb, then our scoring function is max(ca, cb) and our threshold value
is 1. In Section 1.8 we shall deal with real air-pollution data and more complicated
functions which are not linear, monotonic, or convex.
In the early morning hours all sensors register no pollution (a concentration of 0
units) of any type. As the day progresses, the concentration of pa at the city center
sensors rises to 0.5 and 1.5, and the concentration of pb rises to 0.05. Similarly, the
concentration of pb at the other two sensors rises to 0.5 and 1.5, and the concentration
of pa rises to 0.05. Figure 1.1 depicts this example. Note that even though the score
of the average vector has not crossed the threshold, there are significant parts of the
Figure 1.1: Air quality example with four sensors, two pollutants, function max, and threshold 1. The local data vectors are depicted by red circles, their convex hull is outlined by a dashed line, and the global average vector is depicted by a cross sign. The region of vectors whose value corresponds to good air quality is colored gray. The maximal average value for either pollutant is 0.5125; hence it is easy to construct constraints that completely avoid communication (outlined for each pollutant by dotted lines along the respective axis). However, as the figure clearly shows, any attempt to cover the convex hull of the sensor readings will result in constraint violations and unnecessary communication.
convex hull of the local vectors which have. Therefore, using the constraints defined
in [SSK06, SSK08] will result in unnecessary communication. In fact, any method that
verifies that the value of the function over the entire convex hull has not crossed the
threshold will require unnecessary communication.
It is easy to come up with a more appropriate set of constraints for this example –
constraints which do not attempt to cover the convex hull: the two city center sensors
verify that ca does not exceed 1.9 and that cb does not exceed 0.1, and the other two
sensors verify that ca does not exceed 0.1 and that cb does not exceed 1.9. The regions
corresponding to these two types of constraints are outlined by dotted lines in Figure
1.1. Note that as long as all the constraints are upheld, the score of the average vector
is guaranteed not to cross the threshold (the average concentration of each pollutant
does not exceed one unit). Therefore, these constraints are valid. In fact, in the scenario
described above, none of these constraints is ever violated.
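The validity of these constraints can be checked with one line of arithmetic: since max is monotone in each coordinate and the constraints are axis-aligned boxes, the worst admissible average is the average of the boxes' upper corners. A sketch, with the numbers taken from the example above:

```python
import numpy as np

# Upper corners of the four local box constraints from the example:
# two city-center sensors allow (ca, cb) up to (1.9, 0.1); the other
# two sensors allow (ca, cb) up to (0.1, 1.9).
corners = np.array([[1.9, 0.1], [1.9, 0.1], [0.1, 1.9], [0.1, 1.9]])

# max(ca, cb) is monotone in each coordinate, so the worst average over
# vectors satisfying all constraints is the average of the upper corners.
worst_avg = corners.mean(axis=0)        # [1.0, 1.0]
worst_score = float(np.max(worst_avg))  # exactly the threshold, never above
```

The worst-case score equals the threshold of 1 but never exceeds it, so the constraints are indeed valid.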
The new set of constraints is defined by regions which are optimized for each sensor
(or, in this example, for each pair of sensors), and are called the sensors’ safe zones.
The local operation of deciding whether communication is necessary consists merely
of verifying that the local vector is contained in this region. As opposed to the large
majority of work on distributed monitoring, with the safe zone approach proposed here
we monitor the domain of the function, as opposed to its range. This not only allows a
great deal of freedom in determining local constraints; it also allows us to tailor these local
constraints to the behaviors of the data at the different nodes.
Returning to the constraints proposed in [SSK06, SSK08], we see an additional
major disadvantage: the complexity of checking whether they are upheld. Checking
these constraints requires determining whether the maximum (or minimum) value of the
scoring function inside a sphere exceeds a given threshold. This optimization problem
can be very complex and computationally demanding even for relatively simple functions.
Furthermore, it is performed continuously throughout the lifetime of the query for every
newly introduced data vector. Evidently, the high complexity translates into high power
consumption and demanding processing requirements. This drawback is thus particularly
prohibitive for ubiquitous battery-operated nodes, such as sensor networks and cellular devices.
1.2.1 Contribution
In this work we propose a novel approach to constructing local constraints by means
of simple forms of safe zones. These constraints focus directly on the average of
the data vectors rather than on covering the entire convex hull of the local vectors,
therefore dramatically reducing communication in comparison to previous algorithms.
Experiments performed on real-world data show that this improvement reaches two
orders of magnitude.
Further, in contrast to the high computational effort required to verify that the
previously proposed constraints are upheld, negligible computational effort is required
in our scheme due to the simplicity of the safe zones. This makes our scheme applicable
to thin clients, such as battery-operated devices and sensors.
The safe zones proposed in this work (Section 1.4) are computed by solving an
optimization problem, whose precise solution can provide optimal local constraints. We
develop a set of algorithmic solutions that significantly relax the complexity of the
problem and provide approximations to the exact solution (Sections 1.5 and 1.6). The
main goal of the proposed method is to minimize the overall probability for the local
data vectors to breach their safe zones. The safe zone concept also enables us to define
an algorithm for efficiently recovering from safe zone breaches, which will be presented
in Chapter 3 of this thesis. We then discuss the complexity of solving the optimization
problem and show that it is NP-hard (Section 1.7). An experimental evaluation on
real-life data is provided in Section 1.8, demonstrating that the method reduces
communication by up to two orders of magnitude over the state-of-the-art.
Related work is reviewed in Section 1.3, and conclusions are drawn in Section 1.9.
1.2.2 Preliminaries: Minkowski Average
Recall that the monitored functions are defined over the average of local vectors. The
safe zones we use are vector sets in Euclidean space for which the value of the function
on any average of vectors, one from each set, is still below the threshold. To manipulate
averages of vectors taken from sets, we use a well-known geometric operator, called the
Minkowski sum, and denoted by ⊕ [Ser82]. Given n sets S1, S2, ..., Sn, their Minkowski
sum is the set

S1 ⊕ ... ⊕ Sn = {v1 + ... + vn | v1 ∈ S1, ..., vn ∈ Sn}.
Their Minkowski average is their Minkowski sum where every element is divided by n. See
also Wikipedia for examples and illustrations: http://en.wikipedia.org/wiki/Minkowski_addition.
For a descriptive example of how the Minkowski average notion relates to the safe zones
defined here, see Fig. 1.2.
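For concreteness, the operator can be sketched directly from the definition for finite point sets (an illustrative Python sketch only; in this thesis the operator is applied to continuous regions such as polygons):

```python
import itertools

def minkowski_sum(*sets):
    """Minkowski sum of finite point sets in R^d:
    { v1 + ... + vn : vi in Si }."""
    sums = set()
    for combo in itertools.product(*sets):
        # combo holds one point per set; add them coordinate-wise
        sums.add(tuple(sum(coords) for coords in zip(*combo)))
    return sums

def minkowski_average(*sets):
    """Minkowski average: every element of the Minkowski sum divided by n."""
    n = len(sets)
    return {tuple(c / n for c in v) for v in minkowski_sum(*sets)}
```

For example, the Minkowski average of the segments {(0,0), (2,0)} and {(0,0), (0,2)} is the four corner points {(0,0), (1,0), (0,1), (1,1)} of the unit square.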
1.3 Related Work
Methods for reducing communication in distributed systems include sketching [AM04,
CG05, CG09]. Other research concerns detecting “heavy hitters” [MSDO05], [YZ09],
computing quantiles [CGMR05], and counting distinct elements [CMZ06]. Distributed
computation was also addressed in the context of top-k problems [MTW05, BO03a], set-
expression cardinality estimation [DGGR04], clustering [CMZ07], distributed verification
of logical expressions [ADNR07], optimal sampling [CMYZ10], choosing local thresholds
[AKT09], and ranking [LYJ09]. Theoretical analysis of the monitoring problem is
provided in [CMY08], and some non-monotonic functions of frequency moments are
treated in [ABC09].
The lion’s share of the work addresses the limited case in which the threshold function
is linear (e.g., aggregate, average) [KCR06]. In [SR08] the value of a polynomial in one
variable is monitored. A great deal of work was dedicated to distributed monitoring of
monotonic functions, usually weighted averages, max and min operators, etc. [MTW05].
[GT01, GT02] present algorithms for estimating aggregate functions over a sliding
window of the N most recent data items.
In [JW04], assigning appropriate local thresholds at each node is proposed. It
also touches on the importance and difficulty of the problem of monitoring non-linear
functions: “Standard database languages offer other aggregates including AVERAGE,
STDEV [standard deviation], MAX and MIN. Given a constraint on one of these global
aggregates (e.g. ‘ensure that the STDEV of latency is ≤ l second’), it is not immediately
clear what local ‘event’ should trigger global constraint checks”.
In [WDS09], a gossiping-based algorithm is presented, but it does not cover general
functions.
[HNG+07] suggests a distributed paradigm to decide on the dimension of an ap-
proximating subspace for distributed data, with the application of detecting system
anomalies such as a DDoS attack. [CMY08] discussed functional approximation in
a distributed setting but it only deals with obtaining lower bounds for vector norm
functions.
Monitoring non-monotonic functions by representing them as a difference of monotonic
functions is presented in [SKSS10], but for the static case only. Aggregative
ratio queries over streams are treated in [GRM10] (notice the difference from the
instantaneous, non-aggregative ratio, which we treat here).
A geometric method (GM) for monitoring threshold functions was studied in [SSK06,
SSK08]; we have already discussed its drawbacks in Section 1.2 above. In GM,
each node is assigned a subset of the data space such that as long as the local vectors
are inside their respective subsets, it is guaranteed that the function's value did not
cross the threshold. However, GM suffers from the following drawbacks, which are
solved by the safe zone method:
1. The shape of the subsets at different nodes is identical. This means that if data in
different nodes obey different distributions, GM will perform poorly, as it is based
on the assumption that data at all nodes is identically distributed. For example, if
the distribution at some nodes is elongated along the x-direction and at others along
the y-direction, it makes sense that the subsets at the respective nodes should be
elongated along the x (respectively y) direction, thus allowing them to “capture more probability”.
In this thesis we present an algorithm which allows us to assign different safe zones (SZs) to
different nodes. In Appendix IX-D, a simple theoretical analysis is presented
which proves that this freedom allows the SZ method, even for relatively simple
data distributions, to improve over GM by a factor which is 1) unbounded from
above, and 2) rapidly increasing with the dimension of the data. This theoretical
argument is supported by the experiments in Section 1.8.
2. With the GM method, there is no optimality criterion in the definition of the
subsets. Here we define the SZs as the solution of an optimization problem which
is defined so as to minimize communication during the monitoring task.
3. In previous work only a heuristic solution was proposed for local violation recovery
(a violation is defined to occur whenever a local data vector exits its SZ). Here,
building upon the novel SZ concept, we define a rigorous algorithm to overcome
violations with minimal communication overhead.
1.4 Overview of the Algorithm
We commence with an outline of the algorithm; its distinct stages will be described
throughout the thesis.
Recall that data streams arrive at distributed nodes, and that each node derives a
dynamic d-dimensional data vector from its stream. Denote the number of nodes by n
and the data vector at the i-th node at time t by v_i^t. The global vector, which represents
the entire system at time t, equals

v^t ≜ (v_1^t + v_2^t + ... + v_n^t)/n.

We are interested in determining when the value of the monitored function f() evaluated at v^t crosses a given threshold
T.
In addition to these n nodes, there is an additional coordinator node. The nodes
communicate exclusively with the coordinator. As discussed above, our goal is to set
local constraints at the nodes. To this end, upon initialization of the monitoring process,
or when determined by the algorithm, the coordinator collects data from the nodes,
determines the local constraints for each node, and sends them to the nodes. This
process is referred to as synchronization.
Assume w.l.o.g. that f(v^0) ≤ T, so we must submit an alert when f(v^t) > T. Next,
we define a set referred to as the admissible region, denoted by

S ≜ {v | f(v) ≤ T}.

As discussed in the Introduction, the constraints are defined by subsets of R^d; the
i-th node is assigned a subset, Si, referred to as its safe zone. These safe zones are chosen
from a suitable parametric family of shapes (e.g. polyhedra with a certain number of
vertices).
We assume that the probability distribution function (p.d.f hereafter) of the data
vectors at each node is known to the coordinator. These may either be known in advance
or approximated from the values arriving at the nodes and sent to the coordinator
upon synchronization. Denote the p.d.f at the i-th node by pi. These p.d.f’s can be e.g.
Gaussian [SSK08], random walk, uniform, or other.
As long as v_k^t ∈ S_k, the node does not initiate communication. Whenever v_k^t ∉ S_k –
an event we call a local violation – the node reports to the coordinator, which attempts
to resolve the violation by employing the violation recovery algorithms described in
Chapter 3, and, if unsuccessful, initiates a synchronization which determines new safe
zones at all nodes.
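The node-local protocol is deliberately lightweight; the following Python sketch (with hypothetical names `safe_zone_contains` and `report_violation` standing in for the containment test and the message to the coordinator) illustrates the per-update decision:

```python
def node_step(v, safe_zone_contains, report_violation):
    """One monitoring step at a node: communication is initiated only
    when the local data vector leaves the node's safe zone.
    Returns True iff a local violation was reported."""
    if safe_zone_contains(v):
        return False            # silent step: no communication needed
    report_violation(v)         # local violation: notify the coordinator
    return True
```

For instance, with a box-shaped safe zone `[0,1]^2`, the vector (0.5, 0.5) is processed silently, while (1.5, 0.5) triggers a report.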
1.4.1 Safe-Zone Allocation as an Optimization Problem
At the heart of our approach is the allocation of safe zones by the coordinator. Obviously,
in order for the algorithm to be correct, the safe zones must adhere to the following
condition:
(v1 ∈ S1) ∧ ... ∧ (vn ∈ Sn) ⇒ (v1 + ... + vn)/n ∈ S      (1.1)
In other words, we require the Minkowski average of the safe zones to be contained
in the admissible region. Furthermore, in order for the safe zones to be efficient, we
would like to maximize the probability of the data vectors at all nodes to remain within
their zones.
Assuming probability distributions p1, ..., pn on the data at the respective nodes
(pi is the probability density function of the data seen at node i), we can formulate a
constrained optimization problem as follows:
Maximize ∫_{S1} p1 dv1 · ∫_{S2} p2 dv2 · ... · ∫_{Sn} pn dvn      (1.2)

subject to (S1 ⊕ S2 ⊕ ... ⊕ Sn)/n ⊂ S.
The maximization of the target function ∫_{S1} p1 dv1 · ... · ∫_{Sn} pn dvn means that, under the
constraint, the expected time it will take one of the local data vectors to wander out of
Figure 1.2: Illegal and legal safe zones. Top: left and right depict two-dimensional data at two nodes A and B. The p.d.f at both nodes is a Gaussian (normal) distribution. S, which must contain the Minkowski average of the two safe zones, is the dotted ellipse in the middle. The point cloud in the middle is a sample of the global data vectors, obtained by averaging the data vectors at both nodes. The allowable family of safe zone shapes consists of four-vertex polygons. The depicted safe zones (outlined in black) fit the local data well, but are alas illegal, since their Minkowski average (continuous dark line in the middle) is not inside S. Bottom: This time the safe zones are legal: they satisfy the constraint (1.2).
its safe zone is maximal.¹
The very general formulation of the optimization problem (Eq. 1.2) allows us to assign
each node a safe zone tailored to match its data distribution; this is in contrast to
previous geometry-based monitoring algorithms [SSK06, SSK08], in which the local
conditions at all nodes are identical. Consequently, as demonstrated in the experiments,
the advantage of our approach increases with the diversity of the data distributions
across nodes. The safe zones try to match the shape of the p.d.f of the data at each node,
to maximize the probability of the local data falling inside its safe zone. For example,
a distribution which is wide along the x-axis and narrow along the y-axis should be
assigned a safe zone which is wide horizontally and narrow vertically. In Section 1.8.8
a brief theoretical analysis is provided which demonstrates that even in a very simple
setup, the improvement factor of the proposed method over [SSK06, SSK08] increases
exponentially with the dimensionality of the data vectors.
The Minkowski average of the safe zones should tightly approximate S (from the
inside). If it fills a relatively small part of S, this means that the safe zones can be
enlarged and the value of the optimized function increased.
Note that the geometric constraint and the target function to be maximized have to
reach a “compromise”: figuratively speaking, the Minkowski average constraint forces
the safe zones to be small, while the probability increases as the safe zones become larger.
Fig. 1.2 demonstrates the trade-off, central to the solution of the optimization problem
¹ Here we assume that data is not correlated between nodes, as is the case with the data used for the experiments in Section 1.8; if data is correlated, the algorithm is essentially the same, with the expression for the probability that data at some node breaches its safe zone modified accordingly.
(1.2), between the fit of the safe zones to the data and the necessity of maintaining their
Minkowski average inside S. Note how the Minkowski average “sticks” to S, which is a
result of maximizing the safe zones to contain as much p.d.f weight as possible.
1.4.2 The Parametric Family of Allowable Safe Zone Shapes
As mentioned above, we suggest that the optimization be restricted to a parametric
family of shapes, denoted by P . The safe zones will be chosen from the members of P .
Note that every node has to continuously test whether its local data vector is inside
its safe zone; if the shape of the safe zone is complicated, the test will consume time
and energy; therefore it is desirable to apply safe zones which are as simple as possible.
In addition, if computing the integrals of the p.d.f or the Minkowski average is very
time consuming, the optimization process may be lengthy, rendering the algorithm
impractical. Optimization time will also increase with the number of parameters required
to define a safe zone.
The above considerations lead to the following requirements for the family P :
• P should be sufficiently rich, so that its members can reasonably approximate
every given p.d.f with a viable candidate for a safe zone. If this does not hold, the
solution may be grossly sub-optimal.
• The shapes of members of P should not be too complicated, in order to allow
quick verification of containment of the data vectors in the safe zone.
• It should not be too difficult to compute the integral of the p.d.f on members of
P .
• It should be relatively easy to compute, or bound, the Minkowski average of any
members of P .
In our experiments (Section 1.8) we applied various types of polygons and polyhedra
as members of P . Good results were obtained for real-life data even when relatively
simple families were used. The questions of how to choose the exact parameters (e.g.,
the number of polyhedra vertices) and how to use other parametric families of shapes
are beyond the scope of this work, and their complete solution will require further study.
As noted, here we chose to work with safe zones which are both simple and provide
good performance in terms of reducing communication.
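For instance, with the polygonal safe zones used here, the per-update containment test a node must run amounts to a few arithmetic operations per edge. A minimal sketch (illustrative only, assuming a convex polygon whose vertices are listed counter-clockwise; not taken from the thesis implementation):

```python
def inside_convex_polygon(point, vertices):
    """Test containment in a convex polygon with counter-clockwise
    vertices: the point must lie on the left of (or on) every
    directed edge."""
    n = len(vertices)
    px, py = point
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        # cross product of the edge vector and the vertex-to-point vector
        if (x2 - x1) * (py - y1) - (y2 - y1) * (px - x1) < 0.0:
            return False
    return True
```

The test is O(k) for a k-vertex polygon, which is why simple polygonal safe zones are affordable even on thin clients.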
1.4.3 Convexity of S and the Safe Zones
The following theorem is useful when S is convex; it shows that in this case we can,
without loss of generality, restrict ourselves to convex safe zones.
Theorem 1.1. If S is convex, the optimal safe zones are convex.
Proof. For the sake of simplicity, assume there are two safe zones (the extension to more is
straightforward), so (S1 ⊕ S2)/2 ⊂ S. We will prove that (C(S1) ⊕ C(S2))/2 ⊂ S, where for every
set X, C(X) is the convex hull of X. This means that if S1, S2 are legal safe zones, so
are C(S1), C(S2). But Si ⊂ C(Si), which means that if Si were not convex they could
have been enlarged to legal safe zones, thus contradicting their optimality.

Now we return to the proof that (S1 ⊕ S2)/2 ⊂ S =⇒ (C(S1) ⊕ C(S2))/2 ⊂ S. Let x1, y1 ∈ S1, x2, y2 ∈ S2, and 0 ≤ λ1, λ2 ≤ 1. We need to prove that

v = (λ1x1 + (1 − λ1)y1 + λ2x2 + (1 − λ2)y2)/2 ∈ S.

Assume without loss of generality that λ2 ≤ λ1. Since (S1 ⊕ S2)/2 ⊂ S, it follows that
(x1 + x2)/2, (x1 + y2)/2, (y1 + y2)/2 ∈ S. A direct computation gives

v = λ2 · (x1 + x2)/2 + (λ1 − λ2) · (x1 + y2)/2 + (1 − λ1) · (y1 + y2)/2.

But this last expression is a convex combination of elements of S; since S is convex,
v is also in S.
If the safe zones are convex polygons with k vertices (as used in this chapter), the
following argument proves that they provide a reasonable approximation to general safe
zones in the case of two variables.
Let C be a convex set in the plane, and k a fixed integer. Denote by Ck the maximal
value of the ratio A(Pk)/A(C), where A denotes area and Pk is any convex polygon with k vertices
inscribed in C. Then [MGR95] show that the minimal value of Ck is obtained when C
is a disk. Intuitively, this means that disks (and spheres in higher dimensions) are the
convex sets which are hardest to approximate by inscribed polyhedra.

Using this result we get:

Theorem 1.2. Ck behaves as 1 − α/k² for a small constant α.
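The constant can be illustrated numerically for the worst case. For the regular k-gon inscribed in the unit disk, A(Pk)/A(C) = k·sin(2π/k)/(2π), and a Taylor expansion gives 1 − Ck ≈ (2π²/3)/k², i.e., α ≈ 6.58. A small numerical check (not part of the thesis):

```python
import math

def inscribed_ratio(k):
    """Area ratio A(P_k)/A(C) for the regular k-gon inscribed in a disk,
    the hardest convex set to approximate by inscribed polygons."""
    return k * math.sin(2 * math.pi / k) / (2 * math.pi)

# (1 - ratio) * k^2 should approach alpha = 2*pi^2/3 ~ 6.58 as k grows
for k in (8, 16, 32, 64):
    print(k, (1 - inscribed_ratio(k)) * k ** 2)
```

Already at k = 8 the ratio exceeds 0.9, supporting the claim that few-vertex polygons are reasonable safe zone candidates.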
1.5 Safe Zone Optimization
We now turn to the algorithmic core of the proposed method – solving the optimization
problem which defines the safe zones (Eq. 1.2). In a series of theorems deferred to
Section 1.7, we prove that generally the problem is NP-hard. Still, efficient solutions
can be found by applying computational techniques which are presented here and in
the following section.
The parameters which determine the difficulty of the search for good safe zones are:
1. The complexity of computing the target function.
2. The complexity of testing whether the constraints hold.
3. The complexity of the safe zone shapes: when that increases, so does the number
of variables to optimize over.
4. The number of nodes: when that increases, the number of variables to optimize
over increases linearly, which typically results in a super-linear increase in the
optimization complexity.
To solve problems 1 and 2 we applied tools from the realm of analysis and computational
geometry, as described in Sections 1.5.1, 1.5.2. The solution to problem 3 is by restricting
the parametric family of allowable shapes P , as discussed in Section 1.4.2. The solution
to problem 4 is described in Section 1.6.
1.5.1 Computing the Target Function
The target function is defined as the product of integrals of the respective p.d.f on the
candidate safe zones. Typically, data is provided as discrete samples (Section 1.8). The
integral can be computed by first approximating the discrete samples by a continuous
model, and then integrating it over the safe zone.
We have used this approach for 2- and 3-dimensional data, fitting a GMM (Gaussian
Mixture Model) and integrating the GMM over the safe zone, which was defined by
a polygon or a polyhedron; see Fig. 1.6. To compute the integral, we used Monte-Carlo
methods and Green's theorem, which allows us to reduce the dimension of the integration
domain.
Alternatively, the integral can be approximated from the discrete samples. The
simplest approximation is to estimate the integral by the number of points in the
safe zone. In order to improve accuracy (as well as to make the target function
continuous and thus more amenable to optimization), the integral was approximated by

∑_{i=1}^{k} exp(−λ d²(q_i, SZ)),

where k is the number of sample points, SZ the safe zone, λ a positive constant, q_i the
i-th data point, and d(q_i, SZ) the distance of q_i from SZ (which is defined to be
zero if q_i ∈ SZ).
Computation of the target function can be accelerated by using range searching
algorithms [Mat92].
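As an illustration of the smoothed approximation, the following sketch evaluates ∑ exp(−λ d²(q_i, SZ)) with an axis-aligned box standing in for the safe zone (a simplification: for the polygons actually used, a point-to-polygon distance would replace `distance_to_box`):

```python
import math

def distance_to_box(q, lo, hi):
    """Euclidean distance from point q to the axis-aligned box
    [lo, hi]; zero if q is inside the box."""
    d2 = 0.0
    for qc, lc, hc in zip(q, lo, hi):
        if qc < lc:
            d2 += (lc - qc) ** 2
        elif qc > hc:
            d2 += (qc - hc) ** 2
    return math.sqrt(d2)

def smooth_target(samples, lo, hi, lam=1.0):
    """Smoothed count of samples inside the safe zone:
    sum_i exp(-lambda * d^2(q_i, SZ))."""
    return sum(math.exp(-lam * distance_to_box(q, lo, hi) ** 2)
               for q in samples)
```

A point inside the zone contributes exactly 1, while a point just outside contributes slightly less than 1, so the target varies continuously as the candidate zone is deformed.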
1.5.2 Checking the Constraints
To implement a constrained optimization routine, a function is required which checks
whether the current parameters satisfy the constraints. It should return zero if the
constraints are satisfied, and a positive number if they are not. It should behave
“smoothly”: if the constraint violation is small, it should return a small value, and vice
versa.
In our case, the constraint is that the Minkowski average of the safe zones is contained
in S. One way to check the constraint is to compute the Minkowski average and check
whether it is inside S, and if not, determine some measure of its deviation from S. This
may entail very high computational complexity, especially in high dimensions, where
computing the Minkowski sum is computationally expensive.
The following method allows us to test the constraint without computing the
Minkowski average. For the sake of simplicity we start with two-dimensional data.
Figure 1.3: The half-plane approach. Top: S is equal to H, which is defined by θ and b. Bottom: a supporting hyperplane is used for every safe zone candidate Si.
Assume first that S is a half-plane, denoted H, whose boundary line is denoted lH, so H is
the set of points which are below lH (the algorithm proceeds similarly if lH is H's lower
boundary); see Fig. 1.3. lH is defined by the angle θ and by b, its distance from the
origin (in higher dimensions similar definitions hold, with the direction replaced by a
unit vector perpendicular to the hyperplane bounding H).
Then, in order to determine whether the Minkowski average of the candidate safe
zones S1...Sk is contained in H, one has to find for each Si the upper supporting line
(in higher dimensions, upper supporting hyperplane) in the same direction as that of
lH; denote by bi its distance from the origin. For a polytope Si, this requires rather low
computational complexity – only the vertices need to be considered, and line sweep
algorithms can be applied to further reduce running time. In order for the Minkowski
average of the Si's to be contained in S, it is sufficient that (∑ bi)/k ≤ b. This also yields
a measure of constraint violation, which depends on the value of (∑ bi)/k − b. With this
measure it is easy to apply standard optimization routines, such as Matlab's fmincon, to
solve the optimization problem, using the above measure of constraint violation and the
target function as defined in Section 1.5.1.
If S is a convex polytope it is equal to the intersection of half-planes, and the target
function is the sum, or maximum, of the target functions corresponding to the individual
half-planes. A general convex S can be efficiently approximated by an inscribed convex
polytope [MGR95]. Non-convex shapes can be represented as a union of convex ones
[Cha87].
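The half-plane test above can be sketched in a few lines (illustrative Python; each polytope is given as a list of vertices, and `u` is the unit normal of the half-plane H = {v : ⟨v, u⟩ ≤ b}):

```python
def constraint_violation(safe_zones, u, b):
    """Half-plane constraint check: for each candidate polytope S_i,
    b_i is the support value max over vertices of <v, u>; the Minkowski
    average of the S_i lies in H = {v : <v, u> <= b} iff
    (sum of b_i) / k <= b. Returns 0 when the constraint holds, and
    the (smoothly varying) deviation otherwise."""
    k = len(safe_zones)
    support = sum(max(sum(vc * uc for vc, uc in zip(v, u)) for v in zone)
                  for zone in safe_zones)
    return max(0.0, support / k - b)
```

Only the vertices of each polytope are examined, so the check is linear in the total number of vertices, as noted above.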
1.6 Hierarchical Clustering
As mentioned above, the complexity of the safe zone optimization problem (1.2) increases
very quickly with the number of nodes. Assume, for example, that 100 nodes are present,
Figure 1.4: Hierarchical clustering example. Data is taken from [web]. The allowable family of safe zone shapes consists of pentagons. Each node's data is represented by a scatter diagram at the bottom. The yellow safe zones are those computed for the original nodes. Clusters of the data of node pairs are represented as supernodes in the middle row. The two safe zones corresponding to the supernodes are colored green (as is their Minkowski average, depicted inside S). The root is the diagram at the top row. S is the blue ellipse in the root (the root node was scaled for better view). An ellipse corresponds to monitoring a quadratic function (see also Section 1.8.6).
the data is 3-dimensional, and we wish to use polyhedral safe zones with eight vertices
in each node. Since each vertex has three coordinates, the total number of parameters
to optimize over is 100 · 8 · 3 = 2400, which is quite high for a non-convex optimization
problem. To overcome this, we organize the data in a hierarchical structure, which
allows the problem to be solved recursively while reducing it to sub-problems with a
much smaller number of nodes.
The algorithm commences by performing hierarchical clustering on the nodes. Note
that we do not cluster the data in each node separately, but the nodes themselves;
that is, we form clusters of nodes. To achieve this, a distance measure needs to be
defined between clusters. This can be done in various ways – for example, GMMs or
other distributions can be fit to the data in the nodes, and the distance between nodes
can then be defined by some distance between the respective distributions (e.g., the
Kullback-Leibler divergence). A more direct method is to use some distance measure
between the data moments [MEA01]; the clustering results for the two methods were
quite similar for the data we tested. Some typical results are provided in Section 1.8.4.
The hierarchical clustering proceeds as follows: we start by partitioning the entire
set of nodes into a small number of clusters, which can be thought of as supernodes, each
containing the union of data of the nodes in the respective cluster. We fit safe zones to
the supernodes, and continue recursively, by partitioning each supernode into clusters,
and so on. This process constructs (top-down) a tree of supernodes. The leaves of the
tree can be either individual nodes, or node clusters which are uniform enough that
they do not require further partitioning into smaller clusters, and can all be assigned
safe zones with identical shapes.
To illustrate the hierarchical clustering algorithm, we present in Fig. 1.4 an example
with four nodes, where the safe zones are convex pentagons. The nodes are first clustered
into supernodes, depicted in the middle row. Each supernode is generated by sampling
and averaging from the data of two nodes which it represents, and the entire data (top
row) is generated by sampling and averaging from the two supernodes; it is the root of
the cluster tree.
Note that the Minkowski sum of the two safe zones of the left (right) node pair
is constrained to lie inside the safe zone of the left (right) supernode, depicted in the
middle row. Thus, assigning four safe zones at the nodes was achieved by solving
three optimization problems, each constructing two safe zones only. This leads to a
considerably faster solution than solving for all four nodes simultaneously. In general,
the time complexity of solving an optimization problem increases very rapidly with the
number of parameters, so optimizing three times over two pentagons is much faster
than optimizing once over four. Importantly, this approach also allows parallelization.
This example also demonstrates that nodes with similar data distributions are
typically clustered into the same supernodes, and are assigned similar safe zones. When
many nodes are present, it is usually possible to prune the cluster tree and assign the
same safe zone to all nodes in a supernode, given that it is sufficiently uniform. This,
too, can save a great deal of computation.
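To make the grouping step concrete, here is a toy sketch that pairs nodes bottom-up by the distance between their data means (first moments). The thesis builds the cluster tree top-down and may use richer node distances (higher moments [MEA01] or divergences between fitted distributions), so this is an illustration only:

```python
def node_signature(samples):
    """Represent a node by the mean of its data (first moment);
    higher moments could be appended for a finer distance."""
    n = len(samples)
    return tuple(sum(col) / n for col in zip(*samples))

def cluster_pairs(signatures):
    """Greedily pair nodes into supernodes by the squared Euclidean
    distance between their signatures (leaves one node unpaired if
    the count is odd)."""
    remaining = list(range(len(signatures)))
    pairs = []
    while len(remaining) > 1:
        i = remaining.pop(0)
        j = min(remaining, key=lambda m: sum(
            (a - b) ** 2 for a, b in zip(signatures[i], signatures[m])))
        remaining.remove(j)
        pairs.append((i, j))
    return pairs
```

On the four-node example of Fig. 1.4, such a step would group the two nodes with similar distributions into each supernode.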
1.7 The Complexity of Computing Optimal Safe Zones
In this section we study the complexity of the optimization problem (1.2). We prove
that generally, solving the safe zone problem is NP-hard.
Recall that the input to the safe zone problem consists of the subset S (determined
by the monitored function f() and the threshold T ), and the probability distributions
pi (which may be given in closed form or as sampled data).
Theorem 1.3. Even for two nodes and one-dimensional data, the safe zone problem is
NP-Hard.
Proof. We will show the equivalence of the safe zone problem to the biclique problem.
Given a bipartite graph G with sides L,R, the goal of the biclique problem is to find
the biclique (a complete bipartite subgraph of G) with the maximal number of edges.
Assume R has nodes r1...rn, and L has nodes l1...lm. Let the set of edges, E, be a subset
of {(i, j) | i ∈ {1...n}, j ∈ {1...m}}. Associate with this graph the distribution PR,
having delta function (pointwise) probability masses at locations xi, i = 1...n, and
similarly PL at locations yj, j = 1...m (narrow Gaussians can also be used). The
only restriction on the xi, yj is that xi + yj = xi′ + yj′ ⇒ i = i′, j = j′, which is trivial to
achieve.

Now, define S = {xi + yj | (i, j) ∈ E}. Note that optimal safe zones must be subsets of
{xi} and {yj} (including other points will not add any probability, as all the probability
Figure 1.5: A schematic example of the proof of equivalence of the safe zone and the biclique problems (Theorem 1.3). The bipartite graph (top), nodes (middle), and S (bottom).
mass resides in the {xi}, {yj}). Note also that safe zones Sx ⊂ {xi}, Sy ⊂ {yj} satisfy
the Minkowski sum constraint iff the respective subsets of R, L form a biclique, and
that the target function for the two safe zones is proportional to the number of edges
in that biclique.
We conclude that the two problems are equivalent.
A schematic drawing illustrating the reduction of the biclique problem to the problem
of computing the safe zones is provided in Fig. 1.5.
One may suspect that the difficulty of the general problem follows from allowing
such a discrete, disconnected S as in the proof of Theorem 1.3. The following theorems
prove that this is not the case.
Theorem 1.4. If the dimension of the data vectors is at least 4, the safe zone problem
is NP-complete for two nodes even when S is convex.
Proof. The same idea (and notation) is used as for Theorem 1.3. The tricky part is
to construct a convex S having the property that “makes the proof work”, i.e., such
that xi + yj = xi′ + yj′ ⇒ i = i′, j = j′ and that xi + yj ∈ S ⇐⇒ (i, j) ∈ E.

Since S has to be convex, we choose it to equal the convex hull of {xi + yj | (i, j) ∈ E}.
In order to guarantee that (i, j) ∉ E ⇒ xi + yj ∉ S, we construct the sets of points xi, yj
such that xi0 + yj0 is not in the convex hull of the points {xi + yj | i ≠ i0 OR j ≠ j0}
(such a construction, obviously, is not possible in one dimension).

Note that for any set of points on the unit circle in R², none is in the convex hull
of the others (as the unit circle is strictly convex). Take {ui}, i = 1...n, and {vj}, j = 1...m, to be any
two such sets, and define xi as (ui, 0, 0) ∈ R⁴ and yj as (0, 0, vj) ∈ R⁴ (there are four
coordinates since ui, vj ∈ R²). The points xi, yj satisfy the required property, since if

∑_{i,j | i ≠ i0 OR j ≠ j0} λ_{i,j} (xi + yj) = xi0 + yj0   (for λ_{i,j} ≥ 0 and ∑_{i,j} λ_{i,j} = 1),

the equality holds separately in the first two and the last two coordinates, which means it holds separately
for the ui and vj, violating the strict convexity of the ui and vj sets.
For two nodes and S a one-dimensional interval, if the number of points at the nodes is O(n), a polynomial algorithm exists for computing the optimal safe zones: a trivial solution, running in time O(n⁴), tests all safe zone pairs which are intervals with data points as endpoints, and the running time can be lowered to O(n² log(n)) by a binary search on the endpoints. However, even in this case, the general problem is NP-complete.
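The trivial O(n⁴) search just described can be sketched as follows, under the assumptions that each data point carries uniform empirical weight and that the Minkowski average of [a₁, b₁] and [a₂, b₂] is [(a₁+a₂)/2, (b₁+b₂)/2]; the function name and the sample data are illustrative.

```python
# Brute-force the optimal interval safe zones for two nodes: try every pair of
# intervals whose endpoints are data points, keep the legal pair maximizing the
# product of covered empirical masses.
from itertools import product

def best_interval_safe_zones(pts1, pts2, S):
    lo, hi = S
    best, best_zones = -1.0, None
    c1, c2 = sorted(set(pts1)), sorted(set(pts2))
    for a1, b1 in product(c1, c1):
        if a1 > b1:
            continue
        for a2, b2 in product(c2, c2):
            if a2 > b2:
                continue
            # Legality: the Minkowski average must lie inside S.
            if (a1 + a2) / 2 < lo or (b1 + b2) / 2 > hi:
                continue
            p1 = sum(a1 <= p <= b1 for p in pts1) / len(pts1)
            p2 = sum(a2 <= p <= b2 for p in pts2) / len(pts2)
            if p1 * p2 > best:
                best, best_zones = p1 * p2, ((a1, b1), (a2, b2))
    return best, best_zones

best, zones = best_interval_safe_zones([0, 1, 2, 3], [0, 2, 4], S=(0, 3))
print(best, zones)
```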
Theorem 1.5. If more than two nodes are allowed, the safe zone problem is NP-
complete for the case in which S is a one-dimensional interval.
Proof. We will show that the knapsack problem (which is known to be NP-complete) can be reduced to the safe zone problem in this case. Given a knapsack problem, that is, n objects O_1, ..., O_n with value v_i and weight w_i for O_i, and a knapsack which can carry a maximal weight of W, we reduce the problem to a safe zone problem whose optimal solution can be used to construct an optimal solution to the knapsack problem.
First, we create n nodes N_1, ..., N_n, with N_i corresponding to O_i, with the following data distribution:

p_i(x) =
  1/C                if x = 0
  (e^{v_i} − 1)/C    if x = w_i
  (C − e^{v_i})/C    if x = W + 1
  0                  otherwise        (1.3)

for C = max_i e^{v_i}. Note that the overall probability in each node equals 1. Now, we define the global safe zone S to be the interval [0, W/n].
Assume we can solve the safe zone problem and obtain an optimal solution, i.e., an interval S_i = [a_i, b_i] for each node N_i. Note that in this optimal solution we may assume that a_i = 0 and 0 ≤ b_i ≤ W for all i (it is not possible to take b_i > W, as this would violate the Minkowski average constraint, and it will not add anything to take a_i < 0, as all the probability mass is in the region x ≥ 0).
There are two types of possible safe zones at each node: those which contain only the origin, and those which equal [0, w_i]. Denote by S the subset of nodes in which [0, w_i] is taken. The solution is legal iff the Minkowski average of {[0, w_i] | i ∈ S} is inside S = [0, W/n], but this is equivalent to demanding Σ_{i∈S} w_i ≤ W – which is exactly the legality condition for the knapsack problem. Also, the product of the probability volumes at the nodes (which the safe zone problem attempts to maximize) clearly equals (Π_{i∈S} e^{v_i})/Cⁿ, so up to a constant factor it is e^{Σ_{i∈S} v_i}. Therefore, the safe zone problem is equivalent to maximizing Σ_{i∈S} v_i under the constraint Σ_{i∈S} w_i ≤ W, which proves that it is equivalent to the knapsack problem.
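The node distributions of Eq. (1.3) can be constructed mechanically. The sketch below assumes positive weights w_i (so the three support points 0, w_i, W+1 are distinct) and only checks the bookkeeping, not the reduction itself; the function name and sample instance are illustrative.

```python
# Build the per-node distributions of Eq. (1.3) for the knapsack reduction:
# support {0, w_i, W+1} with masses 1/C, (e^{v_i}-1)/C, (C-e^{v_i})/C.
import math

def reduction_pdfs(values, weights, W):
    C = max(math.exp(v) for v in values)   # C = max_i e^{v_i}
    pdfs = []
    for v, w in zip(values, weights):
        pdfs.append({0: 1 / C,
                     w: (math.exp(v) - 1) / C,
                     W + 1: (C - math.exp(v)) / C})
    return pdfs

pdfs = reduction_pdfs(values=[1.0, 2.0], weights=[3, 4], W=5)
print(pdfs[0])
```

Note that choosing the safe zone [0, w_i] at node i covers mass 1/C + (e^{v_i} − 1)/C = e^{v_i}/C, as used in the proof.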
Figure 1.6: Top: two of the GMM elements super-imposed on the data. Middle: typical local concentrations of NO and NO2 as a function of time. Bottom: typical behavior of the local ratio between NO and NO2 as a function of time.
1.8 Experiments
The proposed safe zone method was implemented and compared with the algorithm
proposed in [SSK06, SSK08], which we call the GM algorithm (Geometric Method
algorithm). We chose to compare to GM as it, too, constitutes a general approach to
monitoring arbitrary functions. We are not aware of other algorithms which can be
applied to monitor the functions treated here.
1.8.1 Data, Methods, and Monitored Functions
The data we used consists of air pollutant measurements taken from “AirBase – The
European air quality database” [web]. Concentrations were measured in micrograms per
cubic meter. Nodes correspond to sensors at different geographical locations. The data
at different nodes greatly varies in size and shape and is highly irregular as a function
of time; see Fig. 1.6.
Computing the target function in the optimization requires computing the integral of the p.d.f. over the respective safe zones. This is done by approximating the data
with a Gaussian Mixture Model (GMM), using a Matlab routine (see Fig. 1.6), or
by calculating a discrete approximation, as discussed in Section 1.5.1. The quality of
the results was measured by the reduction in safe zone violations, which is roughly
proportional to the reduction in communication operations.
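The integral of a fitted GMM over a safe zone can be estimated by Monte Carlo sampling. The sketch below uses a hand-rolled two-component mixture as a hypothetical stand-in for the GMM fitted to the sensor data by the Matlab routine; the mixture parameters and the box-shaped zone are illustrative.

```python
# Estimate the p.d.f. mass a safe zone covers by sampling from a Gaussian
# Mixture Model and counting the fraction of samples inside the zone.
import numpy as np

rng = np.random.default_rng(0)
weights = np.array([0.5, 0.5])              # mixture weights
means = np.array([[0.0, 0.0], [3.0, 3.0]])  # component means
stds = np.array([0.5, 1.0])                 # isotropic std per component

def sample_gmm(n):
    comp = rng.choice(len(weights), size=n, p=weights)
    return means[comp] + rng.normal(size=(n, 2)) * stds[comp, None]

def mass_in_box(lo, hi, n=200000):
    """Monte Carlo estimate of the mixture's mass inside an axis-aligned box."""
    samples = sample_gmm(n)
    return np.all((samples >= lo) & (samples <= hi), axis=1).mean()

print(round(mass_in_box([-2, -2], [2, 2]), 2))
```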
To emphasize the generality of the safe zone method, it was applied to monitor non-
Figure 1.7: Triangular safe zones used for ratio monitoring.
linear, non-monotonic functions. In Section 1.8.2 results are presented for monitoring the
ratio of NO to NO2, which is known to be an important indicator in air quality analysis
[KG07]. In Section 1.8.5 the chi-square distance between histograms was monitored
for 5-dimensional data. Section 1.8.6 presents an example of monitoring a quadratic
function in three variables. Quadratic functions are important in numerous applications.
For example, variance is a quadratic function, and the normal distribution's density is the exponential of a quadratic function; consequently, thresholding it is equivalent to thresholding a quadratic function.
1.8.2 Ratio Queries with Triangular Safe Zones
This set of experiments concerned monitoring the ratio between two pollutants, NO and NO2, measured in distinct sensors. Formally, each of the n nodes holds a vector (x_i, y_i) (the two concentrations), and the monitored function is Σy_i / Σx_i (in [GRM10] a ratio is also monitored, but over aggregates, while here the monitoring is of the instantaneous ratio). An alert must be sent whenever this function is above a threshold T. The safe zones tested were triangles of the form depicted in Fig. 1.7, a choice motivated by their simplicity and by their suitability to the data and to the definition of the queried function. Note that we allow some of the β_i to be positive, in order to cover nodes in which the ratio y_i/x_i is high.
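At runtime each node only needs a cheap membership test for its triangular safe zone. The sketch below uses a generic three-sign-test check over arbitrary vertices; it is not the (M_i, β_i) parametrization of Fig. 1.7, and the sample triangle and readings are illustrative.

```python
# Node-side check: stay silent while the (NO, NO2) reading lies in the
# triangular safe zone, given here by three vertices.
def in_triangle(p, a, b, c):
    def cross(o, u, w):
        return (u[0] - o[0]) * (w[1] - o[1]) - (u[1] - o[1]) * (w[0] - o[0])
    d1, d2, d3 = cross(a, b, p), cross(b, c, p), cross(c, a, p)
    has_neg = (d1 < 0) or (d2 < 0) or (d3 < 0)
    has_pos = (d1 > 0) or (d2 > 0) or (d3 > 0)
    return not (has_neg and has_pos)   # inside or on the boundary

tri = ((0.0, 0.0), (4.0, 0.0), (0.0, 2.0))
print(in_triangle((1.0, 0.5), *tri))   # True: no local violation
print(in_triangle((5.0, 5.0), *tri))   # False: the node must report
```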
Fig. 1.8 shows an example on four nodes. Note how nodes with more compact
distributions are assigned smaller safe zones, and how nodes with high values of the
monitored function (NO/NO2 ratio) are assigned safe zones which are translated to the
left in order to cover more data. This is especially evident in the upper right node, in
which the safe zone is shifted to the left so it can cover almost all the data points. In
order to satisfy the Minkowski sum constraint, the safe zone of the upper left node is
shifted to the right, which in that node hardly sacrifices any data points. Note that
the safe zone method allows safe zones which are larger than S, as opposed to the GM
method, in which the safe zones are restricted to translates of subsets of S.
Figure 1.8: Example of optimal safe zones with four nodes. S is the dark triangle; safe zones are outlined in green.
1.8.3 Improvement over GM Algorithm
Here we compare ratio monitoring with safe zones to the GM method. In Fig. 1.9, the
number of safe zone violations is compared for various numbers of nodes, and in Fig.
1.10 some of the safe zones for both methods are compared.
1.8.4 Ratio Queries: Hierarchical Implementation
Hierarchical clustering of the nodes was applied in order to reduce running time (Section
1.6). In Fig. 1.11 a typical result is depicted: 92 nodes were clustered into four groups.
Two representatives from each of the three largest groups are shown, which correspond
to three typical data types: small, indicating low concentrations of NO/NO2 (top), drift,
with many measurements near the origin but also a sizable number of measurements
with high NO concentrations (middle), and vertical, where most measurements are
concentrated in a vertical stripe near the origin and fewer have high NO (bottom). The
Matlab routine kmeans was used for clustering the moment vectors.
In order to test running time and performance, we ran the ratio monitoring algo-
rithm for n = 30 to 240 nodes with various thresholds, both in a “flat” mode (direct
optimization over 2n variables, see Section 1.8.7) and the hierarchical method using the
clustering and tree structure (Section 1.6). Table 1.1 summarizes the results for n = 60
and various values of the threshold T .
In each table entry the first number stands for running time (seconds) and the
second number for the value of the target function. The running time is higher for
the “flat” mode, as the number of parameters to optimize over is much higher, but the
Figure 1.9: Comparison of safe zones (green line) to GM (blue line) in terms of the number of violations, up to 10 nodes. For more nodes (up to 200 were used in this experiment), the average improvement of safe zones over GM was by a factor of 17.5.
Figure 1.10: Comparison of the safe zone method to GM in terms of points which cause a violation. At each node, the set S is depicted (dark triangle), the safe zone (green triangles), and a sample of the data points (red dots). The points which satisfy the GM constraints are depicted in blue. The advantage of the safe zone method over GM is clear.
Figure 1.11: Clustering example. Each row depicts two nodes from one cluster.
Figure 1.12: Running time (in logarithmic scale) for “flat” – direct optimization over all the nodes (blue) vs. hierarchical clustering (green).
Table 1.1: Optimization time (in seconds) and target function for ratio queries, i.e., the average integral of the p.d.f. covered by the safe zones.
T 4 3 2
Tree 23.9 s. — 0.980 23.4 s. — 0.985 23.8 s. — 0.962
Flat 243.6 s. — 0.999 244.9 s. — 0.994 177.4 s. — 0.970
performance is slightly better. In Fig. 1.12 the running times of “flat” vs. “hierarchical”
are compared for various numbers of nodes; note that running times for “hierarchical”
only increase linearly with the number of nodes.
1.8.5 Chi-square monitoring in 5 dimensions with axis-aligned box-shaped safe zones
Another important example of a non-linear, non-monotonic function is the chi-square distance between histograms, defined by χ(f, g) = Σ_i (f_i − g_i)² / (f_i + g_i) for histograms f, g. The histogram was defined as the concentration levels of five pollutants, and the monitored function was the chi-square distance between the hourly average of two nodes and their average calculated over the previous week (i.e., a measure of how much the hourly distribution deviates from last week’s average).
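The monitored distance is straightforward to compute; a minimal sketch, with hypothetical 5-bin histograms standing in for the pollutant data:

```python
# Chi-square distance between two histograms, chi(f, g) = sum (f_i-g_i)^2/(f_i+g_i),
# skipping bins where both histograms are empty.
def chi_square(f, g):
    return sum((fi - gi) ** 2 / (fi + gi) for fi, gi in zip(f, g) if fi + gi > 0)

hourly = [0.3, 0.2, 0.1, 0.25, 0.15]   # illustrative 5-bin pollutant histogram
weekly = [0.25, 0.25, 0.1, 0.2, 0.2]
print(chi_square(hourly, weekly))
```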
When the data distributions in two nodes substantially differ, the advantage of the
safe zone method over GM is very clear, since it can adapt its safe zones to fit the
distinct distributions at the nodes, allowing a much larger safe zone to the node with
the more varying data. In Figs. 1.13,1.14 the different behavior of the nodes’ data is
demonstrated and the safe zones allocated to them depicted. In Fig. 1.15 the advantage
over GM for various thresholds is shown. As the threshold increases, so does the safe zone method's
superiority to GM. For the low thresholds, 0.5 to 0.6, there are many actual (global)
violations, but as the threshold increases, GM still suffers from many “false alarms”
(local violations which are not associated with a global violation), while the safe zone
Figure 1.13: Plots of the chi-square function for two nodes, an “oscillating” one (highly varying data) in green, and a more stable node (in blue). The horizontal axis stands for time (in hours), the vertical axis for the chi-square value.
method performs well.
1.8.6 3-Dimensional Data, Quadratic Function, Polygonal Safe Zones
Another example consists of monitoring a quadratic function with more general polygonal
safe zones in three variables (Fig. 1.16). The data consists of measurements of three
pollutants (NO,NO2,SO2), and the safe zones are polyhedra with eight vertices. Since
each vertex has three degrees of freedom, the number of parameters to optimize over per
node is (8 vertices) · (3 degrees of freedom per vertex) = 24. S is the ellipsoid depicted
in pink. As the extent of the data is far larger than S, the safe zones surround the
regions in which the data is denser. The two safe zones contain 89.5 and 89 percent
of the data in the nodes. To check the constraints, the method in Section 1.5.2 was
applied, with bounding planes instead of lines. The ellipsoid was bounded by planes
defined as its tangent planes at a set of uniformly sampled points on its surface.
1.8.7 Optimization
For the triangular safe zones (Section 1.8.2) we must solve a constrained optimization
problem, with the target function evaluated as described in Section 1.5.1, and the
Minkowski sum constraints which can be checked as explained in Section 1.5.2. These
safe zones have two degrees of freedom each (Mi and βi). Hence, for n nodes, we have
2n parameters to optimize over. For the chi-square monitoring (Section 1.8.5), each safe zone is an axis-aligned box in R⁵ and therefore is defined by ten parameters. The safe zones in Section 1.8.6 require 24 parameters each. In all cases we used the Matlab routine fmincon to solve the optimization problem.
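A toy analogue of this constrained optimization, with scipy's `minimize` standing in for Matlab's `fmincon`: two 1-D safe zones [0, b_i], maximizing the product of covered probability masses subject to the Minkowski-average constraint (b₁ + b₂)/2 ≤ s, i.e. S = [0, s]. The exponential data distributions and all parameter values are illustrative, not from the thesis.

```python
# Constrained safe-zone sizing for two nodes with Exp(rate_i) data.
import numpy as np
from scipy.optimize import minimize

rates = np.array([1.0, 0.25])            # Exp(rate_i) data at node i
s = 2.0                                  # S = [0, s]

def neg_objective(b):
    mass = 1.0 - np.exp(-rates * b)      # P[x_i <= b_i] under Exp(rate_i)
    return -np.prod(mass)                # maximize the product => minimize minus

cons = [{"type": "ineq", "fun": lambda b: 2 * s - b[0] - b[1]}]
res = minimize(neg_objective, x0=[s, s],
               bounds=[(0, None), (0, None)], constraints=cons)
print(res.x)  # the node with more spread-out data receives the larger safe zone
```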
Figure 1.14: The safe zones assigned to the two nodes in Fig. 1.13. The “oscillating” node (top) is assigned a much larger safe zone, to account for its higher variability. Since the data was 5-dimensional, only a 3-dimensional projection is depicted, corresponding to the pollutants NO, NO2, and SO2. Pink dots denote samples from the data; safe zones are in green.
Figure 1.15: Comparing the number of violations between GM and the safe zone method, for a period of 1,000 hours. The allowable family of safe zone shapes used here consisted of 5-dimensional axis-aligned boxes. The horizontal axis is the threshold for the chi-square function, the vertical axis is the ratio between the numbers of violations.
Figure 1.16: 3D example. S is the pink ellipsoid, the safe zones are polyhedra with eight vertices each (in pale blue), and their Minkowski average is in green. The axes stand for concentrations of NO, NO2, SO2.
1.8.8 Improvement Factor and Dimensionality
In the experiments presented here, it can be seen that the improvement of the proposed
safe zone method over the GM approach in [SSK06, SSK08] increases with the dimen-
sionality of the data vectors. The following simple analysis indicates why the freedom
in assigning different safe zones to distinct nodes yields such an improvement. Assume a very simple setup: two nodes are present, a “small node” and a “large node”. The p.d.f. in the small node is uniform over a solid ball of radius 1 − ε, and similarly for the large node, with a radius of 1 + ε. The admissible region S is a ball of radius 1. Since the Minkowski average of balls of radii 1 − ε, 1 + ε is a ball of radius 1, the method proposed here can assign the small node a safe zone consisting of a ball of radius 1 − ε, and the large node a ball of radius 1 + ε. Thus it will incur no false alarms at all (zero communication). However, the GM method cannot assign any node a safe zone larger than S. Hence – since the volume of a d-dimensional ball is proportional to the d-th power of its radius – the part of the p.d.f. volume covered by the safe zone at the large node will be at most (1/(1 + ε))^d ≈ exp(−εd). Thus, even if the p.d.f.s at the two nodes are quite similar – e.g. ε = 0.1 – then for 20-dimensional data vectors, the safe zone approach will incur zero communication, while with GM the large node will have to submit an alert in about 86% of data updates.
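A quick numeric check of this analysis for ε = 0.1 and d = 20:

```python
# Fraction of the large node's mass a GM safe zone (radius <= 1) can cover,
# and the exp(-eps*d) approximation of the resulting alert rate.
import math

eps, d = 0.1, 20
covered = (1 / (1 + eps)) ** d            # exact volume ratio
print(round(covered, 3))                  # 0.149
print(round(1 - math.exp(-eps * d), 2))   # 0.86, the alert rate quoted above
```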
1.9 Chapter Conclusions
In this chapter, a general method for monitoring threshold queries on functions over
distributed streams was presented. In contrast to previous solutions which involved
a cover of the entire convex hull of the local data vectors, the new approach focuses
on direct computation of safe zones for the nodes. Consequently, safe zones are more
flexible than constraints introduced in previous work, as they fit the data distributions
much better.
While the optimization problem involved is proved to be computationally challenging,
approximate solutions are proposed, and are shown to be efficient and practical. More-
over, safe zones can be selected from families of very simple shapes and still outperform
previous methods. As a result, not only does the complexity of selecting safe zones
become reasonable, but the continuous task of violation checking at each node is also
dramatically simplified over previous work. With simple safe zones, the overhead at
every node is negligible, rendering the approach feasible even for thin battery-operated
devices.
Safe zones were implemented and tested for 2-, 3-, and 5-dimensional real-life data using simple families of shapes, demonstrating that the paradigm can reduce communication volume by orders of magnitude.
Chapter 2
Discrete Safe Zones: Biclique
Approach
2.1 Chapter Summary
In this chapter we present a new approach for reducing communication in networks, also using Safe Zones. First, we describe the setting and goals for which our method is relevant. Second, we formulate the problem using terms from Graph Theory. Third, we focus on the case of a system with 2 nodes and present solutions found in the literature. Fourth, we suggest a solution for the general case using a “Hierarchical Heuristic”. Then, we discuss the efficiency of the proposed solution, and we argue that for some very general classes of global functions, the solution is efficient. Finally, we discuss the advantages of this approach over the approach presented in the previous chapter.
2.2 Preliminaries
Suppose we have a network with many sites, each observing a local data vector which can have a different value in every time-step of the network’s lifetime, and someone called C – a central administrator or one of the sites – is interested in computing, in every time-step, the value of a Boolean function over all the local data vectors. Obviously, if C had all the local data vectors of all the sites in a certain time-step, then it could compute the desired function over this data with some effort, depending on the computational complexity of that function. Generally, the functions of interest can be computed very efficiently once all the data is available.
However, we are not concerned with the computational and time complexity, but
with the communications that take place in the network. Although a protocol in which
in every time-step, every site sends his local data vector to C gives a solution for the
task at hand - C knowing the value of the function - the protocol’s communication cost
is very high. We measure the cost of a protocol by the number of messages sent during
its runtime, while ignoring the size of the messages and the costs of the computations
that have to take place at the sites. However, we do seek protocols in which the sites run algorithms that are as efficient as possible. We propose protocols such that if all the
sites participate in these protocols, C will know the value of the function, certainly or
with high probability, while the number of messages sent in the protocol is as small as
possible. We describe a protocol which achieves the minimal number of messages sent,
in expectation, based on the available prior knowledge about the network. However, in
most of the interesting cases, in order to run this protocol the sites will have to perform
computations which are NP-Hard. Therefore, we suggest other computationally lighter
protocols and argue that their communication cost is often (for many interesting cases)
close to that of the optimal protocol. We plan to support this claim with experiments.
The methods proposed here are not limited to one architecture model of the network and can be adapted to many architecture models; we set this question aside for now.
In the following, we concentrate on data vectors which are discrete, so essentially, the local data is taken from a set U of size |U| = n. For example, each site can hold a (log n)-bit vector – U = {0, 1}^{log n} – or an array of length u with elements from {1, ..., m}, for some u, m ∈ N, n = m^u. If, in a network, the local data is not discrete, then we can use any quantization method to obtain discrete data and then apply the following discussion.
The global function which someone wants to compute can be, in the most general case, for a network of k sites, any Boolean f : U^k → {0, 1}. However, many interesting global functions are of simpler forms. Denote the k sites of the network S_1, S_2, ..., S_k, and assume that at time t, the local vector of S_i is x_i^{(t)} ∈ U, or x_i in short. Then, the global function at time t – f^{(t)}(x_1, x_2, ..., x_k) – may be expressed, in some cases, as f^{(t)}(x_1, x_2, ..., x_k) = h(g(x_1, x_2, ..., x_k)), for some simpler functions g : U^k → V and h : V → {0, 1}. For example, previous works have concentrated on global functions for which, in the above representation, g is simply the sum (or average) of the vectors: g(x_1, ..., x_k) = x_1 + ... + x_k. Another interesting example is when U contains binary vectors and g(x_1, ..., x_k) = x_1 ∨ ... ∨ x_k is the bitwise OR operator – this class of functions includes functions computed over bitmaps. We intend to discuss different solutions for different classes of functions based on their representation with simple g, h.
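The h ∘ g decomposition can be made concrete with the bitwise-OR example; the bit-count threshold used for h below is an arbitrary illustration, not taken from the thesis.

```python
# A bitmap-style global function decomposed as f = h o g:
# g is bitwise OR over the local bit-vectors, h thresholds the number of set bits.
def g(vectors):            # g : U^k -> V, bitwise OR of all local vectors
    out = 0
    for x in vectors:
        out |= x
    return out

def h(v, threshold=3):     # h : V -> {0, 1}, fires when enough bits are set
    return 1 if bin(v).count("1") >= threshold else 0

def f(vectors):            # global function f = h o g
    return h(g(vectors))

print(f([0b0011, 0b0100, 0b0001]))  # three distinct bits set -> 1
```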
We are mainly interested in global functions which correspond to monitoring a global property of the network. Therefore, we assume that for most of the time, the value of the global function on the present data is β (β ∈ {0, 1}), and that time-steps in which the value is not β do not happen very often. Therefore, we can think of C’s task as raising an alarm whenever the value of the global function is not β, while remaining silent when it is β.
Furthermore, we assume that C has a predicted p.d.f. for the future local data of each site in the network. That is, C has a prediction for Pr[x_i^{(t_j)} = α] for every site S_i, time-step t_j, and value α ∈ U. The performance of our method is highly dependent on the accuracy of these predictions: the more accurate these p.d.f.s are in predicting the future local data, the fewer messages the protocols send. In this work we only look into p.d.f.s which are not time-dependent, that is, for every two time-steps t_1 ≠ t_2, we assume that Pr[x_i^{(t_1)} = α] = Pr[x_i^{(t_2)} = α], for every site S_i and α ∈ U. In networks
where the local data at the nodes is distributed differently at different points in time, we intend to find sophisticated solutions using our method in future work. However, for
the moment, we suggest the straightforward solution of applying many copies of our
protocols, one for each future time-step. The problem of finding good predictions for
the p.d.f.s is out of the scope of our work, and we assume that either we are given
this knowledge from an outer source, or that we specify the first few time-steps of
the network’s lifetime to learn these p.d.f.s using some machine learning tool and
after the learning is done we start running our protocols. In addition, nodes can run
learning tools on their local data while running the protocols, and once they note a
significant change in their p.d.f. they can notify C and possibly result in a decision
to restart the protocol with this new information. In addition, we assume that the local data at each node is distributed independently of the data at the other nodes, that is: ∀i ≠ j, ∀α, γ ∈ U : Pr[x_i = α | x_j = γ] = Pr[x_i = α]. Handling cases in which this assumption does not hold is also possible with a slight modification to our methods; however, we do not go into it here.
2.3 Problem Definition
We are given a set U of size n; k sites S_1, ..., S_k, each site S_i observing in every time-step t an item x_i^{(t)} from U; a global function f : U^k → {0, 1}; a more probable value β ∈ {0, 1} for the global function; and, for every site S_i and item α ∈ U, the probability that x_i = α, which is equal for all time-steps: Pr[x_i = α]. The goal is to define a protocol in which C outputs f(x_1^{(t)}, ..., x_k^{(t)}) correctly in every time-step t, while the expected number of messages sent in the network during the protocol is minimal. Or, in other words, C raises an alarm whenever f(x_1^{(t)}, ..., x_k^{(t)}) ≠ β.
The protocols we present are of the following form: Each site S_i maintains a set SZ_i ⊆ U called the Safe-Zone (SZ) of S_i, which a site can either compute for itself or get from another site, e.g. C. Now, whenever a site observes a local data item x_i, it sends it to C if and only if x_i ∉ SZ_i. In time-steps for which C doesn’t receive any messages, it is certain that the value of the global function is β, and it remains silent. However, if at least one message was received, C must poll all (or some of) the sites for their local data and then compute the value of the global function by itself. Therefore, for our protocols to be correct, we need the following property to hold [legality of the SZs]: if ∀i ∈ [k] : x_i ∈ SZ_i, then f(x_1, ..., x_k) = β. That is, in time-steps when no communication occurs, the value of the function is guaranteed to be β and the output of C to be correct. In addition, we can see that the number of messages sent during the protocol depends on the sets SZ_i: the larger these sets are, the fewer messages are expected to be sent. Therefore, we seek such sets which will minimize the probability of sending a message in a given time-step, that is, minimize Pr[∃i ∈ [k] : x_i^{(t)} ∉ SZ_i], which is equivalent to maximizing Pr[∀i ∈ [k] : x_i^{(t)} ∈ SZ_i]. By that, we reduce the problem of finding good protocols to finding “large” and “legal” such sets SZ_i, and from now on we discuss the problem of finding the optimal legal SZs.
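The protocol loop can be sketched as follows. The universe, safe zones, data weights, and global function below are toy illustrations (with β = 0), chosen so that the legality property holds; none of them come from the thesis.

```python
# Protocol sketch: each site stays silent while its reading lies in its safe
# zone; on any report, C polls every site and evaluates f exactly.
import random

random.seed(1)
U = [0, 1, 2, 3]
k = 3
SZ = [{0, 1}, {0, 1}, {0, 1}]              # per-site safe zones, SZ_i subset of U
f = lambda xs: 1 if sum(xs) >= 5 else 0    # global function; beta = 0
# Legality: the largest sum inside the SZs is 1+1+1 = 3 < 5, so silence => f = 0.

messages = 0
for t in range(1000):
    xs = [random.choices(U, weights=[6, 2, 1, 1])[0] for _ in range(k)]
    reports = [i for i in range(k) if xs[i] not in SZ[i]]
    if reports:
        messages += len(reports) + k       # the reports, plus C polling all sites
        alarm = (f(xs) != 0)               # C computes f from the polled data
print(messages)
```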
Our problem can be written as follows:
Problem 2.3.1. (The optimal Safe-Zones problem)
• maximize Pr[∀i ∈ [k] : x_i ∈ SZ_i]
• subject to:
– ∀i ∈ [k] : SZ_i ⊆ U
– if ∀i ∈ [k] : x_i ∈ SZ_i, then f(x_1, ..., x_k) = β.
Finding the optimal solution for the above problem gives an optimal protocol of the
form described above. Next, we show an equivalent problem from which we can see the
complexity of finding the optimal solution for our problem, and hopefully, gain insights
for finding good sub-optimal solutions.
2.4 Biclique Formalization - k = 2
If the network has only 2 sites S_1, S_2 (k = 2), then we can reduce our problem to finding a maximum-weight complete bipartite sub-graph of a bipartite graph:
Define the bipartite graph G : (V_1, V_2, E), where |V_1| = |V_2| = |U| = n, and every value α ∈ U has a node on each side of G: v_α^1 ∈ V_1, v_α^2 ∈ V_2. The set of edges is defined based on the global function: e_{xy} = (v_x^1, v_y^2) ∈ E ⟺ f(x, y) = β. And we define a weight function on the edges: w : E → [0, 1], w(e_{xy}) = w(v_x^1, v_y^2) = Pr[x_1 = x ∧ x_2 = y] = Pr[x_1 = x] · Pr[x_2 = y].
A complete bipartite sub-graph (Biclique) of G is B : (W_1, W_2) for which W_1 ⊆ V_1, W_2 ⊆ V_2 and W_1 × W_2 ⊆ E. That is, every pair of vertices from W_1 and W_2 is an edge of G. The weight of a Biclique, for our purpose, is defined as w(B) = Σ_{w_1∈W_1, w_2∈W_2} w(e_{w_1 w_2}). A Biclique B : (W_1, W_2) corresponds to a solution for our problem in which SZ_1 = W_1 and SZ_2 = W_2 – converting vertices back to their respective items.
Claim 2.4.1. The maximum-weight Biclique of G corresponds to the optimal solution of the SZs problem.
Proof. First, we notice that the SZs we get are legal: if x_1 ∈ SZ_1 and x_2 ∈ SZ_2, then v_{x_1}^1 ∈ W_1 and v_{x_2}^2 ∈ W_2, and since W_1 × W_2 ⊆ E, we get that (v_{x_1}^1, v_{x_2}^2) ∈ E and therefore f(x_1, x_2) = β.
Second, we prove that these are the optimal SZs. Assume for contradiction that there are better SZs Z′_1, Z′_2, that is, they are legal and Pr[x_1 ∈ Z′_1 ∧ x_2 ∈ Z′_2] > Pr[x_1 ∈ SZ_1 ∧ x_2 ∈ SZ_2]. Then, we look at the Biclique B′ = (Z_1, Z_2), where Z_1, Z_2 contain all the vertices corresponding to items from Z′_1, Z′_2 respectively. B′ is indeed a Biclique because Z′_1, Z′_2 are legal SZs. However, w(B′) > w(B):
w(B′) = Σ_{v_x^1∈Z_1, v_y^2∈Z_2} w(e_{xy}) = Σ_{x∈Z′_1, y∈Z′_2} Pr[x_1 = x] · Pr[x_2 = y] = Pr[x_1 ∈ Z′_1 ∧ x_2 ∈ Z′_2] > Pr[x_1 ∈ SZ_1 ∧ x_2 ∈ SZ_2] = Σ_{x∈SZ_1, y∈SZ_2} Pr[x_1 = x] · Pr[x_2 = y] = Σ_{v_x^1∈W_1, v_y^2∈W_2} w(e_{xy}) = w(B),
contradicting the optimality of B.
Corollary 2.1. Solving the Biclique problem gives a solution for the optimal SZs
problem. In addition, from the proof we can see that the better the Biclique is, the better
the SZs are.
This leads us to the optimal protocol: find the optimal Biclique and use the corresponding SZs. Finding the optimal Biclique can be done in exponential time by going over every sub-graph, checking whether it is a legal Biclique, computing its weight, and choosing the best Biclique. Therefore, for a network of 2 sites, we have an optimal protocol which requires C to run an exponential-time algorithm.
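The exhaustive search just described can be sketched directly for a toy two-site instance; the global function, β, and item probabilities below are illustrative.

```python
# Enumerate all (W1, W2) pairs over U x U and keep the heaviest legal Biclique.
# Exponential in |U|; feasible only for tiny universes.
from itertools import chain, combinations

def powerset(s):
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))

def best_biclique(U, f, beta, p1, p2):
    best_w, best = -1.0, (set(), set())
    for W1 in powerset(U):
        for W2 in powerset(U):
            if all(f(x, y) == beta for x in W1 for y in W2):   # legal Biclique?
                w = sum(p1[x] * p2[y] for x in W1 for y in W2)
                if w > best_w:
                    best_w, best = w, (set(W1), set(W2))
    return best_w, best

U = [0, 1, 2]
p = {0: 0.7, 1: 0.2, 2: 0.1}
f = lambda x, y: 1 if x + y <= 2 else 0        # beta = 1
w, (SZ1, SZ2) = best_biclique(U, f, beta=1, p1=p, p2=p)
print(w, SZ1, SZ2)
```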
However, we seek efficient protocols, so we need efficient algorithms for the Biclique problem. Unfortunately, the Biclique problem is known to be NP-hard. This means that, in the general case, efficient algorithms which are provably good probably do not exist. Also, we showed (in the previous chapter) that the most general Biclique problem can be reduced to the optimal SZs problem with a complicated enough global function, meaning that the optimal SZs problem is also NP-hard in the general case.
Next, we describe some heuristic solutions for the Biclique problem found in the
literature.
2.4.1 Greedy Heuristic
The greedy algorithm starts by choosing the vertex with biggest number of neighbors,
removing all the nodes from the other side which are not in the set of neighbors of
the chosen vertex, and proceeding in the same way - choosing the vertex with biggest
number of neighbors from the vertices which are still in the graph. The algorithm stop
when there are no more vertices left to chose. The chosen vertices compose the resulting
Biclique.
Analysis: This algorithm always returns a maximal Biclique, that is: there isn’t
a larger Biclique which contains it. In some works which experimented using this
algorithm, the results were only a 12 -factor away from the optimal solution, however,
in other works the gap was usually larger. It is easy to show that this algorithm’s
approximation factor can be as bad as Ω(n) = Ω(√|E|). The running time, naively, is
O(|E| · |V |2), however it can be easily reduced to O(|E|+ |V |2).
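A sketch of this greedy heuristic: repeatedly pick the vertex with the most surviving neighbours and drop the other side's non-neighbours. The function name and the toy graph are illustrative, and ties are broken arbitrarily by (side, vertex).

```python
# Greedy biclique heuristic on a bipartite graph given as a set of (x, y) edges.
def greedy_biclique(V1, V2, edges):
    """Returns (W1, W2), a biclique of the graph (V1, V2, edges)."""
    alive = {1: set(V1), 2: set(V2)}
    chosen = {1: set(), 2: set()}
    def neighbours(side, v):
        if side == 1:
            return {y for y in alive[2] if (v, y) in edges}
        return {x for x in alive[1] if (x, v) in edges}
    while alive[1] or alive[2]:
        # pick the surviving vertex with the most surviving neighbours
        _, side, v = max((len(neighbours(s, u)), s, u)
                         for s in (1, 2) for u in alive[s])
        alive[side].remove(v)
        chosen[side].add(v)
        other = 2 if side == 1 else 1
        alive[other] = neighbours(side, v)   # keep only v's neighbours
    return chosen[1], chosen[2]

V1 = V2 = {0, 1, 2}
edges = {(x, y) for x in V1 for y in V2} - {(2, 1), (2, 2)}
W1, W2 = greedy_biclique(V1, V2, edges)
print(W1, W2)
```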
2.4.2 Linear Programming
The Biclique problem can be written as an integer linear program, which can be relaxed to a linear program with 2n + n² variables and n² linear constraints. The fractional solution to this linear program can be rounded to a binary solution which defines a Biclique. It is possible to prove that the number of edges not in the Biclique in this solution is at most twice the number of edges not in the optimal Biclique, which does not provide a good guaranteed approximation factor for our Biclique problem in the general case. The running time is the time needed to solve a linear program with O(n²) dimensions.
2.5 Generalized Biclique Formalization
If our network had only 2 sites, then the above algorithms could have given us good protocols with communication costs close to that of the optimal protocol. However, this case is very limited and we want to find protocols for networks with more sites. So, we generalize the reduction from the SZs problem (which is defined for any number of sites k) to another Graph-Theory problem – maximum-weight k-clique in k-uniform k-partite hypergraphs:
Define the k-partite hypergraph H : (V_1, V_2, ..., V_k, E), where |V_1| = |V_2| = ... = |V_k| = |U| = n, and every value α ∈ U has a node on every side of H: v_α^1 ∈ V_1, ..., v_α^k ∈ V_k. The set of edges is defined based on the global function: e_{(x_1...x_k)} = (v_{x_1}^1, ..., v_{x_k}^k) ∈ E ⟺ f(x_1, ..., x_k) = β. This hypergraph is k-uniform because for every edge e ∈ E, |e| = k, i.e., it contains k vertices. And we define a weight function on the edges: w : E → [0, 1], w(e_{y_1...y_k}) = w(v_{y_1}^1, ..., v_{y_k}^k) = Pr[x_1 = y_1 ∧ ... ∧ x_k = y_k] = Π_{i=1}^k Pr[x_i = y_i].
A k-clique of H, in this case, is B : (W_1, ..., W_k) for which W_1 ⊆ V_1, ..., W_k ⊆ V_k and W_1 × ... × W_k ⊆ E. That is, every k-tuple of vertices from W_1, ..., W_k is an edge of H. The
weight of a k-clique, for our purpose, is defined as w(B) =∑
w1∈W1,...,wk∈Wkw(ew1...wk
).
A k-clique B : (W1, ...,Wk) corresponds to a solution for our problem in which SZi = Wi
for all i = 1...k - converting vertices back to their respective items.
As above, the following claim and corollary hold:
Claim 2.5.1. The maximum weight k-clique of H corresponds to the optimal solution
of the SZs problem.
Corollary 2.2. Solving the k-clique problem gives a solution for the SZs problem and
the better the k-clique is, the better the SZs are.
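Concretely, the correspondence can be sketched in code (the helper names are illustrative; the global function f, the safe value β, and the per-site marginals Pr[xi = y] are assumed given):

```python
from itertools import product

def is_k_clique(W, f, beta):
    """A tuple of sets W = (W1, ..., Wk) is a k-clique iff every k-tuple of
    vertices, one from each set, is an edge, i.e. maps to beta under f."""
    return all(f(*combo) == beta for combo in product(*W))

def clique_weight(W, prob):
    """Weight of a k-clique: the sum, over all k-tuples, of the products of
    per-site marginals (prob[i][y] stands for Pr[x_i = y])."""
    total = 0.0
    for combo in product(*W):
        p = 1.0
        for i, y in enumerate(combo):
            p *= prob[i][y]
        total += p
    return total
```

Because the sites are independent, the weight factors as ∏_i ∑_{y ∈ Wi} Pr[xi = y], i.e., exactly the probability that no site leaves its safe zone.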
So, next, we look for solutions to this k-clique problem. We note that applying
the above algorithms for k > 2 is computationally infeasible, because their running time
depends on the number of edges, which is, in the worst case, n^k. This means
that even if |U| = 2, so that the data at each node is a single bit, the complexity of any
algorithm that is linear in the number of edges could be exponential in k, and therefore
impractical for large-scale networks. We also note that, for our use, the hypergraph is
not given explicitly; rather, we can check whether a given set of vertices forms an edge
by evaluating the global function on that set.
2.6 Hierarchical Heuristic
A possible way to solve the k-clique problem is to arrange the k sites in a hierarchy in
the form of a full binary tree (assume k = 2^ℓ) with the sites at the leaves. Then, in every
node of the tree we solve a Biclique problem and use the resulting Biclique to define
the Biclique problems of the two children of that node, which are then solved in turn.
Eventually, at the lowest level of the tree, we have 2 sites, so it is a Biclique problem of
the kind discussed above, which we can solve to obtain the SZs (or the k-clique).
To explain this hierarchical solution, consider a network with 4 sites: S1, S2, S3, S4.
Assume the hierarchy groups S1, S2 together and S3, S4 together. Then, instead of
solving a 4-site problem, we imagine we have only 2 sites, namely S12, S34, which take
their local data from the set U × U = U², and each value (y1, y2) ∈ U² is received at the
"site" S12 with probability Pr[x12 = (y1, y2)] = Pr[x1 = y1 ∧ x2 = y2], and similarly for
S34. The global function remains f1234((x1, x2), (x3, x4)) = f(x1, x2, x3, x4).
Therefore, we get a well-defined SZs problem with two sites, and correspondingly a
Biclique problem, which we can solve (using one of the suggested algorithms) to get
two SZs, SZ12, SZ34 ⊆ U². Then, going down to the leaves in order to decompose the
SZs of the pairs into SZs of individual sites, we define the following two SZ problems
over two sites each. In the first, the sites are S1, S2 with their regular data distributions;
however, the "global function" is no longer the same. Instead, it is defined as
f12 : U² → {0, 1}, f12(x1, x2) = β ↔ (x1, x2) ∈ SZ12. Again,
we solve this SZs problem by converting it to a Biclique problem, and eventually we
get sets of vertices W1, W2. Similarly, we define the problem for S3, S4, and get two sets
W3, W4. Then, the 4-clique we return is B : (W1, W2, W3, W4).
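The four-site decomposition described above can be sketched as follows; `solve_biclique` stands for any two-site SZ solver (e.g., one of the algorithms above), and all names are illustrative:

```python
from itertools import product

def hierarchical_szs(U, probs, f, beta, solve_biclique):
    """Hierarchical heuristic sketch for k = 4 sites. probs[i] maps each
    value in U to Pr[x_i = value]; solve_biclique(pA, pB, g, beta) returns
    a pair of safe zones for a two-site problem with global function g."""
    # Level 1: fuse sites pairwise; the "sites" S12, S34 draw values from U x U.
    p12 = {(y1, y2): probs[0][y1] * probs[1][y2] for y1, y2 in product(U, U)}
    p34 = {(y3, y4): probs[2][y3] * probs[3][y4] for y3, y4 in product(U, U)}
    f_top = lambda a, b: f(a[0], a[1], b[0], b[1])          # f1234
    SZ12, SZ34 = solve_biclique(p12, p34, f_top, beta)
    # Level 2: decompose each pair-SZ into per-site SZs, with a new
    # "global function" that checks membership in the parent SZ.
    f12 = lambda x1, x2: beta if (x1, x2) in SZ12 else None
    W1, W2 = solve_biclique(probs[0], probs[1], f12, beta)
    f34 = lambda x3, x4: beta if (x3, x4) in SZ34 else None
    W3, W4 = solve_biclique(probs[2], probs[3], f34, beta)
    return W1, W2, W3, W4
```

The sketch makes the efficiency concern below concrete: p12 and p34 each enumerate |U|² fused values, so without further structure the top-level problem squares in size at every level.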
This can be applied to any k that is a power of 2, yielding an algorithm for the
k-clique problem. Next, we analyze its correctness, optimality and efficiency.
Claim 2.6.1. The resulting sets B : (W1, ...,Wk) compose a legal k-clique.
Proof. It is enough to prove this for k = 4; the extension to greater k is trivial. If
v^1_{y1} ∈ W1, ..., v^4_{y4} ∈ W4, then (y1, y2) ∈ SZ12 and (y3, y4) ∈ SZ34, and therefore
f(y1, ..., y4) = f1234((y1, y2), (y3, y4)) = β.
Regarding the optimality of this solution, we can show that even if we solve the
Biclique problem optimally in every phase of the algorithm, the resulting k-clique may
be sub-optimal. However, we expect this heuristic to perform well in practice, and we
intend to test this claim experimentally. It may also be possible, with some effort, to
prove approximation factors for this heuristic. In addition, we note that the optimality
of the solution depends on the choice of hierarchy, and finding good or optimal
hierarchies is an interesting open question.
Finally, we claim that this algorithm can be efficient for many classes of functions,
even though in the general case it is not. Note that although we solve only k − 1
Biclique problems in this algorithm (one per internal node of the tree), the number of
vertices in these problems can be very large. For k = 4, we have seen that in the first
problem we solve, each value from U² becomes a vertex; that is, the graph has 2 · n²
vertices. Generally, the first problem will have n^{k/2} vertices on each side of the
bipartite graph, which yields a very inefficient algorithm.
2.6.1 Classes of functions
1. General functions over the OR/AND
Let us look at a class of very simple functions: f : ({0, 1}^{log n})^k → {0, 1}, such
that f(x1, ..., xk) = g(x1 ∨ ... ∨ xk) for some g : {0, 1}^{log n} → {0, 1}. That is,
general Boolean functions defined over the bitwise-OR of log n-bit vectors. In a
network scenario, this could mean that C is interested in monitoring a function
over the global vector which is the bitwise-OR of the local vectors; this can
represent many monitoring tasks in which bitmaps are used. For example, in the
problem of distinct count (F0) monitoring, each site has a binary vector, and C
wants to know whether the number of 1's in the OR of all the vectors is larger
than some threshold τ.
In this case, the above hierarchical solution can be very efficient, once one observes
that when going up in the hierarchy, we do not need to increase the number
of vertices in the graphs. Let us look at the k = 4 example again, and assume
U = {0, 1}^{log n} and f(x1, ..., xk) = g(x1 ∨ ... ∨ xk). Then, (x1, x2) can be replaced
with x1 ∨ x2, and the same algorithm applied. In the first problem, although we
have a vertex for every pair (x1, x2), many of them are equivalent, since there are
only n possible values for x1 ∨ x2, even though there are n² pairs. Therefore, we
give the value x1 ∨ x2 a probability, or weight, equal to the sum of the
probabilities of all pairs which produce this value.
Formally, in the first stage, we solve a SZ (or Biclique) problem with S12, S34, which
take their local data from the set U, and each value y12 ∈ U is received at the "site"
S12 with probability Pr[x12 = y12] = ∑_{y1, y2 : y1 ∨ y2 = y12} Pr[x1 = y1 ∧ x2 = y2], and
similarly for S34. The global function is defined as f1234(x12, x34) = g(x12 ∨ x34).
Then, we solve this problem as above, and get two SZs, SZ12, SZ34 ⊆ U.
Then, going down to the leaves, for example to decompose SZ12 to the two
sites S1, S2, we define a SZ problem as before, but the function now becomes
f12 : U² → {0, 1}, f12(x1, x2) = β ↔ (x1 ∨ x2) ∈ SZ12. We proceed in the same
way, obtaining the desired solution.
Therefore, the time complexity in this case is polynomial in both n and k.
Note that the case f(x1, ..., xk) = g(x1 ∧ ... ∧ xk), where x ∧ y is the bitwise
AND of the binary vectors x, y, can be solved efficiently in the same way.
2. General functions over the sum of integers
Another very simple example, which can help explain the ideas presented, is
when each site holds a number from 1 to n, i.e., U = {1, ..., n}, and C is interested
in computing a function of the sum of these numbers: f(x1, ..., xk) =
g(x1 + ... + xk). Again, in this case, the number of vertices in the higher levels
of the hierarchy does not become exponential in k, but grows by a factor of 2
at every level, since we only need to keep track of (x1 + x2) ∈ {1, ..., 2n} rather
than (x1, x2) ∈ {1, ..., n}². Therefore, all the Biclique problems we solve will be over
bipartite graphs with up to k · n nodes.
3. General functions over the union of streams
In the distributed streaming model, each site observes a stream of items from a
set {1, ..., u}, and the global stream is defined as the union of all these streams.
Usually, C is interested in monitoring a property of a sliding window of this stream,
and we assume that in any window, the number of received items is at most
m. An equivalent representation of this problem is that every site Si has
a vector xi of length u with values from {0, ..., m}, where xi(j) equals the number
of times item j was received at site Si in the last window. Then, the vector
over which C wants to compute a function is the sum of these u-dimensional vectors:
x1 + ... + xk. Therefore, in our notation, U = {0, ..., m}^u, and, in addition, we
know that the global vector is also in U. So, again, using the hierarchical solution,
we can use |U| = (m + 1)^u = n nodes in all the problems of the hierarchy, and get
a solution polynomial in n and k.
4. General Boolean functions
For the most general case of functions over Boolean log n-bit vectors, we do not
have an efficient solution yet; however, we present two ideas.
First, we can follow the same path as before and try to represent our function
as f(x1, ..., xk) = g(h1(x_{i1} ∧ ... ∧ x_{is}), ..., ht(x_{j1} ∨ ... ∨ x_{js})), and then use the
hierarchical solution, where at every level of the hierarchy, in the worst case, we
have to keep items from U^t rather than U^k, that is, O(n^t) instead of O(n^k).
If the representation obeys certain conditions, then with certain hierarchies
we might need only items from U at all levels.
Second, we can try to solve the k-clique problem directly, without dividing it hierarchically.
However, following the previous discussion, we need algorithms that are sub-linear
in the number of edges, which could be n^k. Our idea is to use "property testing"-
like techniques, so that we can work on a small sample of the hypergraph and
get a solution which is, with high probability, a legal k-clique, whose size is a
good approximation of the size of the optimal k-clique. One possible direction is
generalizing the greedy heuristic presented earlier to the k-clique case, where,
to find the node with the largest number of edges connected to it, we sample
t = O(poly(n, k)) edges and decide based on this sample. This can probably
give an approximate solution that is not much worse than the non-randomized
greedy algorithm; however, it does not guarantee a legal k-clique, because
verifying deterministically that a given tuple of vertex sets is a k-clique might
require checking Ω(n^k) tuples. This idea can be used if we allow the protocols to err
with some probability.
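The pair-collapsing step used in the OR, AND, and sum cases above can be written once, for any associative operator (a sketch; the function name is illustrative):

```python
from collections import defaultdict

def fuse(p1, p2, op):
    """Distribution of op(x1, x2) for two independent sites: Pr[x12 = y] is
    the total probability of all pairs (y1, y2) with op(y1, y2) = y. For
    bitwise OR/AND the support stays within U; for + it at most doubles
    per level of the hierarchy."""
    fused = defaultdict(float)
    for y1, q1 in p1.items():
        for y2, q2 in p2.items():
            fused[op(y1, y2)] += q1 * q2
    return dict(fused)
```

Passing `lambda a, b: a | b`, `lambda a, b: a & b`, or `lambda a, b: a + b` recovers the three tractable classes; the support of the fused distribution is what bounds the vertex count at each level.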
2.6.2 Pruning nodes in the Biclique problem
In many cases, n can be very large as well, and O(n³) solutions will not be good enough
for us. Thus, we propose "pruning" techniques in order to reduce the number of nodes
in the Biclique problems we solve.
First, we suggest removing, from any Biclique problem we intend to solve, nodes
whose probability is smaller than some ε ≥ 0. The removed nodes can either be
simply ignored, or used in more sophisticated ways.
Second, if we have a distance measure defined over U and U^k, and we know that
our global function f : U^k → {0, 1} obeys a Lipschitz condition over this metric, then we
can cluster nodes whose corresponding items are close enough, connecting this
"big" node to another node iff all the nodes composing the "big" node were connected
to that node. The weight of such an edge is the sum of the weights of all the old edges.
Of course, we may lose edges in this process, but the number of edges lost can be
traded off against the size of the clusters and, consequently, the number of nodes removed.
Third, we can generalize the clustering idea to cases where we do not have a metric
or a Lipschitz condition, by clustering together nodes whose merging does not cause
many edges to be lost. Assuming we have a Biclique problem with |V1| = |V2| = n, and
we know the set of edges E, we can perform a pre-processing step in O(|V|³) and obtain
a graph with fewer nodes, in which every group of nodes that can be clustered together
without losing more than ε edges (for a given ε ≥ 0) becomes a new "big" node.
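The first pruning rule can be sketched directly (illustrative code; how the discarded nodes are used afterwards is left open, as in the text):

```python
def prune_low_prob(p, eps):
    """Drop values whose probability is below eps, returning the surviving
    (unnormalized) distribution together with the discarded probability mass."""
    kept = {y: q for y, q in p.items() if q >= eps}
    return kept, sum(p.values()) - sum(kept.values())
```

Tracking the discarded mass matters: any value pruned from a site's distribution can no longer appear in that site's SZ, so the discarded mass adds to the probability of a local violation.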
2.7 Advantages over the geometric Safe Zones
In the previous chapter, we introduced a different approach for distributing Safe Zones
to the nodes, which also aims at finding communication-efficient monitoring protocols.
There, the Safe Zone of each site was a convex polygon in R^d, and we
demanded that the Minkowski sum (or average) of the Safe Zones of all the sites
be contained in the global admissible region, which we called S. In this chapter, in
contrast, we look at Safe Zones which are sets of discrete points, and the legality
constraint we demand is a "Biclique condition" instead of a Minkowski-sum condition.
Here we discuss the advantages of this approach over the geometric Safe Zones of the
previous chapter.
1. Support for more target functions: In the geometric Safe Zones approach, the
suggested algorithms only work for a global function f : U^k → {0, 1} that depends
only on the sum or average of the local vectors: f(x1, ..., xk) = g(∑_{i=1}^k xi) for
some g : U → {0, 1}. Otherwise, the Minkowski sum would not make sense, and
there is no immediate way to compute, or even write down geometrically, the
constraints of the optimization problem we solve to get the Safe Zones. The present
approach, in its general form, can be used to monitor any global
function, and is therefore more generic. Moreover, the class of functions for which
we propose practical, scalable algorithms for computing good Safe Zones with this
method is a superset of the class of functions that depend on the sum or average
of the local vectors.
2. Non-Convex Safe Zones: In the previous approach we were forced to restrict
the Safe Zones to convex shapes, because otherwise the computational complexity
of checking the constraints (and therefore of the whole optimization problem)
would grow exponentially in the number of sites k. In the Biclique approach, the
(non-)convexity of the Safe Zones does not matter; therefore, in cases where
the data at the nodes cannot easily be covered by legal convex Safe Zones, the
Biclique approach may give Safe Zones that contain more of the probability mass
of the nodes' p.d.f.s, and therefore result in a more efficient monitoring protocol.
To illustrate this, we designed a synthetic setup for which any protocol with convex
Safe Zones would result in a high probability of violations, and then computed
Safe Zones using the Biclique approach. The resulting Safe Zones are plotted
in Figure 2.1: the non-convexity of the Safe Zones allows them to contain a large
number of the data samples.
Figure 2.1: Non-convex SZs example. Plotted are the data samples of both nodes (right and left clouds), and the "legal" global data points (center), which are the averages of every pair of points, one from each node, such that the average resides in the admissible region S. This S corresponds to a function such that f(x, y) = β ⇐⇒ c1 ≤ (x − a1)² + (y − a2)² ≤ c2. The data points colored green, at the nodes, are points that were in the resulting SZ. The SZs are non-convex sets. The green points in the center are the averages of all pairs of points, one from each SZ.
3. No use of optimization toolboxes: To solve the optimization problem of finding
the optimal SZs, the two approaches use different algorithms. In the geometric
approach, an optimization toolbox is used: a black-box algorithm that receives
the parameters of the problem as input and returns a solution. The computations
inside that black box involve complicated gradient-descent algorithms, making it
very hard for a user of the method to understand what is going on inside. Here,
in contrast, the problem is converted into a Biclique problem and solved by an
algorithm for the Biclique problem; for example, one can choose the greedy and
hierarchical heuristics, which involve very simple and direct computations.
Chapter 3
Violation Resolution in
Distributed Stream Networks
3.1 Chapter Summary
Distributed stream networks continuously track the global score of the data and alert
whenever a given threshold is crossed. The global score is computed by applying a
scoring function over the aggregated streams. However, the sheer volume and dynamic
nature of the streams impose excessive communication overhead.
Most recent approaches eliminate the need for continuous communication by using
local constraints assigned to the individual streams. These constraints guarantee that
as long as no constraint is violated, the threshold is not crossed, and therefore no
communication is necessary. Regrettably, local constraint violations become more and
more frequent as the network grows, and in the presence of such violations,
communication is inevitable.
In this chapter, we show that in most cases the violations can be resolved efficiently.
Although our solution requires only a reduced subset of the network streams, finding the
minimum resolving set is NP-hard. Through an analysis of the probability of resolution, we
suggest methods for selecting the resolving set so as to minimize the expected communication
overhead and the expected latency of the process. Experimental results with both
synthetic and real-life data sets demonstrate that our methods yield considerable
improvements over existing approaches.
3.2 Introduction
Distributed stream networks have become very common in many fields of technology such
as sensor networks [MF02], analysis of financial time series [YSJ+00], Web applications
[KL10], and more. In these networks, numerous distributed nodes handle highly
dynamic, continuous data streams, and their goal is to detect some global property
over the distributed data. A fundamental application in distributed stream networks is
Threshold Monitoring, the goal of which is to constantly alert whenever the value of a
predetermined function, evaluated over the network-wide data, crosses a given threshold.
A trivial approach for monitoring is to continuously or periodically centralize all the
data, thus transforming a distributed problem into a centralized one. This, however,
places an intolerable burden on the network.
Considerable research efforts were made to reduce network communication overhead
in continuous distributed monitoring, as presented in a recent survey by Cormode
[Cor11]. Reviewed solutions include data sketching [CG05] and data sampling [CMYZ10]
algorithms, in which the minimum required amount of data is centralized in order to
approximately detect threshold crossing events. Another approach [BO03b, RN04,
SR08, GRM10, TZWL08, SSK07b] is to assign local constraints at the individual nodes,
such that as long as all the constraints are valid, it is guaranteed that the threshold has
not been crossed. The latter approach enables exact detection of threshold crossings,
while minimizing communication overhead.
The main challenge in the local constraints approach is to efficiently define the
constraints so as to minimize the number of violations over time. However, local
constraint violations are bound to happen from time to time due to local behaviour (e.g.,
reading errors, energy deficiency, or local interrupts). A local violation can sometimes
indicate that a global violation, i.e., threshold crossing, has occurred, but in most cases
it suggests nothing more than a local phenomenon. The process which determines the
network status in the presence of local violations is referred to as violation resolution.
As the size of the network increases, so does the probability that local violations
will occur, and with it the need to resolve them frequently and efficiently. The resolution
process is required to reduce the overall communication cost while meeting a certain
latency expectation.
A few violation resolution algorithms have been presented, mainly for star-shaped
networks (that consist of a central coordinator). In [BO03b, RN04, SR08, GRM10,
TZWL08], the network status is determined by the data and constraints held by the
coordinator and the violating nodes. If additional data are required, then the data for
the entire network are collected. As the number of local violations increases, this process
imposes an onerous communication cost. In order to reduce network overhead, the
coordinator in [CMY11, KCR06] waits for several violation reports before collecting the entire
network data. This process reduces the overall cost but still requires communicating
with all the nodes.
Recently, in [SSK07b, SSK07a], an incremental violation resolution method was
presented, which used randomly chosen subsets of non-violating nodes. However, this
method is tailored to their algorithms and no bounds were provided for the expected
size of these sets.
In this work we address the problem of violation resolution for local constraint
monitoring. We present a general approach that attempts to resolve local violations
by polling data from a subset of non-violating nodes, referred to as the resolving set.
Our goal is to reduce communication cost by detecting a minimum-size resolving set,
while maintaining a fair latency. To the best of our knowledge, this is the first time
the problem of a minimum-size resolving set is studied. We prove that this problem is
NP-hard and suggest heuristic approaches for solving it. Assuming homogeneous data
setups, we propose a random method (similar to [SSK07a]) and present some theoretical
bounds using Hoeffding [Hoe63] and Bernstein [Tro10] inequalities. Acknowledging the
challenge of heterogeneous data setups, we propose an efficient method using algorithms
from graph theory. Our methods were extensively tested over both synthetic and
real-life data sets, achieving a substantial reduction in communication cost and latency,
in comparison to current algorithms.
This chapter is organized into eight sections. In Section 3.3 we discuss related work,
and notations and terminology are presented in Section 3.4. In Section 3.5 we present
an overview of our generic approach. In Sections 3.6 and 3.7 we present our Random
Logarithmic algorithm (RLG) and Maximum Matching Tree algorithm (MMT) for
homogeneous and heterogeneous setups, respectively. Finally, in Sections 3.8 and 3.9
we present experimental results and conclusions.
3.3 Related Work
Resolution of local constraint violations is commonly addressed as a subproblem in
threshold monitoring algorithms. Threshold monitoring algorithms over star-shaped
networks [BO03b, RN04, SR08, GRM10, TZWL08] usually proposed a two-phase
resolution process. First, the coordinator attempts to complete resolution without involving
any nodes other than itself and the violating nodes. If it fails, it tries to resolve the
violations by communicating, in a single round, with the entire network. A similar
approach was suggested for value monitoring algorithms [CMY11, KCR06]. Communication
was somewhat reduced, as the coordinator would wait for several violation
reports before communicating with the entire network. However, in sufficiently large
networks, any algorithm that requires communicating with the entire network would
incur high communication cost. Furthermore, this approach exposes a trade-off between
the communication cost and the latency to alert about a global violation.
Recently, [SSK07b, SSK07a] suggested gradually increasing the size of the resolving
set in a number of rounds. At each round the resolving set was increased by a single
non-violating node [SSK07b] or by an exponentially increasing number of non-violating
nodes [SSK07a]. While these methods are the closest to our work, the nodes in both were
selected at random, assuming a homogeneous data setup. In addition, no bounds were
presented for the expected size of the resolving set or for the expected latency of the
resolution process.
Threshold monitoring problems have also been researched for tree-shaped [JDZ+07]
and peer-to-peer [WBK09] networks. In these algorithms, nodes communicate according
to a predefined overlay communication tree. Their resolution processes consist of
multiple rounds, where at each round the size of the resolving set is increased by the set
of adjacent (or ancestor) nodes of the current resolving set. While the communication
tree efficiently reduces the size of the resolving set, neither algorithm suggests a way
to define this tree. In our work we present a construction of an overlay tree structure,
tailored for heterogeneous data setups.
Finally, several threshold monitoring algorithms [OJW03, RN04, HNG+06] assumed
violation resolution processes, but these were not presented by the authors.
3.4 Violation Resolution and Minimum Resolving Set
3.4.1 Problem Definition
We consider a distributed online environment consisting of n remote monitoring nodes
N1, N2, ..., Nn and a central coordinator node NC . Nodes communicate through the
coordinator node, and direct communication between monitoring nodes is not allowed.
We assume that the nodes are synchronized and each monitoring node observes an
individual stream of multidimensional data over discrete time. Let v^t_i be the d-dimensional
real vector collected by node Ni at time t. We denote this vector as the local vector of
node Ni. Each node is assigned a weight ωi ∈ R, and we define the weighted average of
all the local vectors at time t as the global vector v^t, i.e., v^t = (∑_{i=1}^n ωi v^t_i) / (∑_{i=1}^n ωi).
Note that the weighted average operator can be replaced by any other commutative
and associative operator (e.g., multiplication). For simplicity, and w.l.o.g., in the rest of
the chapter we assume that ωi = 1 for every node Ni, such that v^t = (1/n) ∑_{i=1}^n v^t_i.
Given an arbitrary monitoring function f : R^d → R and a threshold value τ ∈ R,
the coordinator node needs to constantly alert whenever f(v^t) exceeds (or drops below)
τ. This threshold monitoring query can be reduced to a simple domain constraint over
the global vector. Let S be the entire set of vectors over which the monitoring function
does not cross the threshold, i.e., S = {v ∈ R^d | f(v) ≤ τ}. The coordinator node needs to
alert whenever v^t ∉ S. We denote S, the domain over which this constraint is satisfied,
as the global safe zone.
Assume that every monitoring node Ni is associated with a subset of the data space
Si ⊆ R^d, denoted as the local safe zone of node Ni, such that the following condition
holds:
(∧_{i=1}^n v_i ∈ S_i) → (1/n) ∑_{i=1}^n v_i ∈ S.    (3.1)
It follows that an overall satisfaction of the local domain constraints (imposed by
the local safe zones) over the local vectors would imply that the domain constraint
over the global vector is also satisfied. In other words, as long as all the local vectors
reside within their respective local safe zones (v^t_i ∈ S_i, i = 1 . . . n), the global vector
is guaranteed to reside within the global safe zone (v^t ∈ S). This case requires no
knowledge of the local vectors of the monitoring nodes, and thus eliminates any need
for communication.
At any time t a node’s local vector can deviate from its local safe zone. We refer to
this event as a local violation. When local violations occur, the violating nodes report
their local vectors to the coordinator, which then determines the network status to
see if a global violation has occurred. This decision process is referred to as violation
resolution.
An example of a monitoring system in a 2-dimensional space is presented in Figure
3.1. It depicts snapshots of the system at two consecutive time steps. In the first, all the
local constraints are satisfied and the coordinator can determine, without communication,
that the global constraint is also satisfied. In the second, a local violation has occurred
(at N1) and resolution is required.
The goal of this work is to reduce the communication required for resolution while
maintaining a fair latency for this process. The latency is the time required to
determine the network status, particularly in the case of a global violation. While
determining that a global violation has occurred usually requires knowledge of all the
local vectors, this case is rare in most monitoring applications (e.g., natural hazards
detection). Commonly, if there is no global violation, then the local violations can
be resolved by collecting the local vectors of a small set of nodes, referred to as the
resolving set.
3.4.2 Resolving Local Violations
We next show how violation resolution is achieved by the resolving set. To this end we
generalize the notions of local vector and local safe zone to a set of nodes. Given a set
of nodes A, let v^t_A denote the vector of A at time t, and let S_A denote the safe zone
of A. v^t_A is defined as the average of the local vectors of the nodes in A at time t, i.e.,
v^t_A = (1/|A|) ∑_{Ni ∈ A} v^t_i. Similarly, S_A = { (1/|A|) ∑_{Ni ∈ A} v_i | ∧_{Ni ∈ A} v_i ∈ S_i }.
These definitions are consistent with the operation that defines the global vector (as
presented in Section 3.4.1).
Lemma 3.4.1. The above definitions satisfy:
1. S_{N1,...,Nn} ⊆ S.
Figure 3.1: A monitoring system, consisting of 3 monitoring nodes and a coordinator, in a 2-dimensional space. The safe zones are given as rectangles in the plane and the vectors are marked by dots. It is easy to see that the average of any 3 vectors, taken respectively from the local safe zones, resides within the global safe zone (S). At t = 0, all the local vectors reside within their respective safe zones and, consequently, the global vector is also inside the global safe zone. In this case, none of the monitoring nodes reports its local vector to the coordinator. At t = 1, a local violation has occurred at N1 (v^1_1 ∉ S1). N1 would now report its local vector to NC to seek resolution. NC must poll N2 or N3 (or both) for their local vectors in order to verify that v^1 ∈ S.
2. For every time t and mutually disjoint subsets of nodes A1, . . . , Am:
(∧_{i=1}^m v^t_{Ai} ∈ S_{Ai}) → v^t_A ∈ S_A,
where A = ⋃_{i=1}^m Ai.
Proof.
1. S_{N1,...,Nn} = { (1/n) ∑_{i=1}^n v_i | ∧_{i=1}^n v_i ∈ S_i } by definition. S_{N1,...,Nn} ⊆ S
follows directly from Equation 3.1.
2. For every time t and mutually disjoint subsets of nodes A1, . . . , Am:
∧_{i=1}^m v^t_{Ai} ∈ S_{Ai}
↔ ∧_{i=1}^m (1/|Ai|) ∑_{Nj ∈ Ai} v^t_j ∈ { (1/|Ai|) ∑_{Nj ∈ Ai} v_j | ∧_{Nj ∈ Ai} v_j ∈ S_j }
→ ∧_{i=1}^m ∑_{Nj ∈ Ai} v^t_j ∈ { ∑_{Nj ∈ Ai} v_j | ∧_{Nj ∈ Ai} v_j ∈ S_j }
→ ∑_{i=1}^m ∑_{Nj ∈ Ai} v^t_j ∈ { ∑_{i=1}^m ∑_{Nj ∈ Ai} v_j | ∧_{i=1}^m ∧_{Nj ∈ Ai} v_j ∈ S_j }
→ ∑_{Nj ∈ A} v^t_j ∈ { ∑_{Nj ∈ A} v_j | ∧_{Nj ∈ A} v_j ∈ S_j }
→ (1/|A|) ∑_{Nj ∈ A} v^t_j ∈ { (1/|A|) ∑_{Nj ∈ A} v_j | ∧_{Nj ∈ A} v_j ∈ S_j }
↔ v^t_A ∈ S_A,
where A = ⋃_{i=1}^m Ai.
It follows that when local violations occur, the coordinator node can rule out a global
violation by acquiring only the local vectors of the resolving set. This is justified by the
following theorem.
Theorem 3.1. Let V be the entire set of violating nodes at time t. If there exists a set
of nodes A such that V ⊆ A and v^t_A ∈ S_A, then v^t ∈ S.
Proof. Let N = {N1, . . . , Nn}. Since V ⊆ A, N \ A consists of only non-violating nodes,
i.e., v^t_{Ni} ∈ S_{Ni} for every Ni ∈ N \ A. Since we also have v^t_A ∈ S_A, we conclude by
Lemma 3.4.1 that v^t = v^t_N ∈ S_N ⊆ S.
Following this theorem, we denote R = A \ V as the resolving set.
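Theorem 3.1 yields a simple coordinator-side check. The sketch below assumes axis-aligned box safe zones, as in Figure 3.1 (for boxes, S_A is itself a box whose bounds are the averages of the individual bounds); all names are illustrative:

```python
def resolves(violating, resolving, vectors, boxes):
    """Return True iff A = violating ∪ resolving rules out a global violation,
    i.e. the average vector of A lies in S_A. vectors[i] is node i's local
    vector; boxes[i] = (lo_i, hi_i) gives the bounds of the box safe zone S_i."""
    A = sorted(violating | resolving)
    d = len(vectors[A[0]])
    mean = [sum(vectors[i][j] for i in A) / len(A) for j in range(d)]
    lo = [sum(boxes[i][0][j] for i in A) / len(A) for j in range(d)]
    hi = [sum(boxes[i][1][j] for i in A) / len(A) for j in range(d)]
    return all(lo[j] <= mean[j] <= hi[j] for j in range(d))
```

If the check fails, the coordinator can grow the resolving set and repeat; once all nodes are included, it can evaluate the monitoring function on the global vector directly.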
(a) 4-node homogeneous system monitoring concentrations of air pollutants
(b) 4-node heterogeneous system monitoring occurrences of terms in news reports
Figure 3.2: Local violations in homogeneous and heterogeneous systems. The safe zones are given as rectangles in the plane and the local vectors are marked by the enlarged dots. The safe zones were intentionally fit to the distributions of past measurements, represented by the clouds of dots, to minimize local violations. (a) The monitoring nodes are homogeneous in the variance of their distributions and in the dimensions of their safe zones. There is no clear preference for one node over another in resolving local violations. At t = 1, the local violation of N3 can be resolved by any non-empty subset of the other nodes, and the minimum resolving set comprises a single node. (b) The monitoring nodes are heterogeneous in the variance of their distributions and in the dimensions of their safe zones. Some nodes are more likely to resolve certain local violations than others. At t = 1, the local violation of N2 can only be resolved with N4, and the minimum resolving set comprises N4 alone.
3.4.3 Running Examples
Next we present two running examples which illustrate the concepts and problems
addressed in this work.
Example 1. (homogeneous data) Assume that air quality sensors are deployed
at various geographic locations, measuring the concentration of pollutants in the air. Each
sensor maintains a vector of its readings, such as the concentrations of NO and NO2.
We evaluate a function over the vectors of measurements in order to determine the
overall air quality, and we wish to detect whenever the air quality drops below a certain
threshold. Figure 3.2a depicts a system consisting of four sensors. The safe zones aim
to minimize local violations and are therefore centered around the expected value of the
readings in an attempt to cover as much of the distribution area as possible. Normally,
the readings of the different nodes, for each pollutant, have the same variance; thus,
the safe zones of the different nodes are homogeneous in their dimensions. This implies
that there is no clear preference for one node over another in resolving local violations.
Example 2. (heterogeneous data) Assume an Internet news agency, which
constantly monitors news reports. The nodes are each assigned to a specific category of
news (e.g., economy, sports). A wide collection of news-related terms is assembled, and
each node tracks the occurrences of the terms in the reports it monitors over a sliding
window of one hour. Figure 3.2b depicts a system consisting of four nodes, of which two
are assigned to monitor economic news (N1, N3), while the other two monitor sports
news (N2, N4). The nodes track the occurrences of the terms “team” and “asset.” The
history of occurrence counts in each node determines its distribution area, and its safe
zone is assigned as an interval for each of the terms. As in the air quality example, the
safe zones are centered around the expected value and attempt to cover as much of the
distribution area as possible. However, in this case, the occurrence counts of each term
have a substantially different variance for different nodes; thus, the safe zones of the
different nodes are heterogeneous in their dimensions. For example, the term “team”
has a wider interval in the sports nodes, while the term “asset” has a wider interval in
the economy nodes. This implies, for example, that a sports node has greater flexibility
to resolve constraint violations of the term “team.”
Table 3.1: Frequently Used Notations
Notation Description
n Number of monitoring nodes
Ni Monitoring node i (i = 1 . . . n)
NC Coordinator node
vti Vector of node i at time t
vtA Vector of a set of nodes A at time t
vt Global vector at time t
Si Safe zone of node i
SA Safe zone of a set of nodes A
S Global safe zone
V Set of violating nodes
R Resolving set
3.5 Generic Algorithm
In this section we present our generic algorithm for violation resolution. In the presence
of local violations, the algorithm is executed at the coordinator to determine the network
status. The algorithm outputs whether or not the threshold has been crossed. We assume
that the coordinator is familiar with the safe zones assigned for the monitoring nodes.
Pseudo-code is given in Algorithm 3.1. At time step t, all violating nodes (V) report
their local vectors to the coordinator. Following Theorem 3.1, the coordinator attempts
to resolve the violations by detecting a resolving set (R) such that vtV∪R ∈ SV∪R. The
resolving set is initially empty and gradually extended in every round with the set of
nodes returned by the function getExtendingSet. The only requirement for this function
is to return a non-empty set of non-violating nodes that are not already included in
the resolving set. The algorithm terminates when either the violations are resolved
(i.e., the threshold was not crossed) or the resolving set comprises the entire set of
non-violating nodes. In the latter case, the coordinator directly verifies the network
status by evaluating f(vt). Note that the number of rounds it takes to assemble the
resolving set defines the latency of the algorithm, whereas the size of the set defines the
communication cost.
Algorithm 3.1 Generic Violation Resolution Algorithm
1: for all $N_i$ in $V$ do
2:   $N_i$ sends $v^t_i$ to $N_C$
3: end for
4: $N_C$ computes $v^t_V$, $S_V$
5: $resolved \leftarrow (v^t_V \in S_V)$    ▷ a boolean flag
6: $r \leftarrow 1$, $R \leftarrow \emptyset$
7: while not $resolved$ and $|V \cup R| < n$ do
8:   $R_r \leftarrow$ getExtendingSet($V, R, r$)
9:   for all $N_i$ in $R_r$ do
10:    $N_C$ polls $N_i$ for $v^t_i$
11:  end for
12:  $R \leftarrow R \cup R_r$
13:  $N_C$ computes $v^t_{V \cup R}$, $S_{V \cup R}$
14:  $resolved \leftarrow (v^t_{V \cup R} \in S_{V \cup R})$
15:  $r \leftarrow r + 1$
16: end while
17: if not $resolved$ then    ▷ $|V \cup R| = n$
18:  $resolved \leftarrow (f(v^t) \leq \tau)$
19: end if
20: return $resolved$
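A Python transcription of the loop may help fix the control flow. This is a sketch: `in_safe_zone`, `get_extending_set`, and `f_global` are stand-ins for the system's actual primitives, not part of the thesis's pseudocode.

```python
def resolve(violating, in_safe_zone, get_extending_set, f_global, tau, n):
    """Generic violation resolution at the coordinator (Algorithm 3.1).

    violating:          set V of ids of nodes reporting a local violation
    in_safe_zone(A):    tests whether v^t_A lies in S_A
    get_extending_set:  returns a non-empty set of unpolled, non-violating
                        nodes (the only requirement placed on it)
    f_global():         evaluates f(v^t) once all n vectors are known
    """
    R = set()
    resolved = in_safe_zone(violating)                 # lines 4-5
    r = 1
    while not resolved and len(violating | R) < n:     # lines 7-16
        R |= get_extending_set(violating, R, r)        # poll these nodes
        resolved = in_safe_zone(violating | R)
        r += 1
    if not resolved:                                   # here |V u R| = n
        resolved = f_global() <= tau                   # line 18
    return resolved
```

The number of loop iterations is the algorithm's latency, and the final size of `violating | R` its communication cost, matching the discussion above.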
Theorem 3.2. The generic algorithm always terminates and correctly determines the
network status.
Proof. The while loop (lines 7-16) terminates when either resolved = true or |V∪R| = n.
The properties of getExtendingSet guarantee that the loop will terminate eventually.
resolved = true indicates that the violations have been resolved by the resolving set,
namely vtV∪R ∈ SV∪R (line 5 or 14), and thus, by Theorem 3.1, a global violation did
not occur. Otherwise, |V ∪ R| = n, which suggests that the coordinator holds the local
vectors of all the nodes and is therefore able to compute f(vt) directly (line 18). In any
case, by the end of the algorithm, resolved = false if and only if a global violation, i.e.,
threshold crossing, has occurred. We conclude that the algorithm always terminates
and correctly determines the network status.
Throughout the run of the algorithm, the probability for violation resolution, namely $\Pr\{v^t_{V \cup R} \in S_{V \cup R}\}$, is monotonically non-decreasing. This is implied by the following theorem:
Theorem 3.3. Let $V, \overline{V}$ be the set of violating nodes at time $t$ and its complement ($\overline{V} = \{N_1, \dots, N_n\} \setminus V$). Then for every two subsets of nodes $R_1 \subseteq R_2 \subseteq \overline{V}$:
$$\Pr\{v^t_{V \cup R_2} \in S_{V \cup R_2}\} \geq \Pr\{v^t_{V \cup R_1} \in S_{V \cup R_1}\}.$$
Proof. Assume that $v^t_{V \cup R_1} \in S_{V \cup R_1}$ holds. As $\overline{V}$ is a set consisting only of non-violating nodes, i.e., $v^t_i \in S_i$ for every $N_i \in \overline{V}$, it then follows from Lemma 3.4.1 that $v^t_{V \cup R_2} \in S_{V \cup R_2}$. Namely, $v^t_{V \cup R_1} \in S_{V \cup R_1} \rightarrow v^t_{V \cup R_2} \in S_{V \cup R_2}$ and hence,
$$\Pr\{v^t_{V \cup R_2} \in S_{V \cup R_2}\} \geq \Pr\{v^t_{V \cup R_1} \in S_{V \cup R_1}\}.$$
The performance of the algorithm is dictated exclusively by the function getExtendingSet
(line 8), which determines the quality and scale of the extension $R_r$ to the
resolving set. For example, an instance of the generic algorithm, denoted the Naive
Algorithm, implements this function to always return the set of all non-violating nodes.
Consequently, this algorithm achieves the minimum latency (1 round) yet also the
maximum communication cost (maximum size resolving set). Another instance, denoted
Random Linear Algorithm (RLN), extends the resolving set by a single randomly chosen
node at each round. While this algorithm attempts to minimize the size of the resolving
set, it may incur a rather high number of rounds. These examples expose the trade-off
between the expected latency and the expected communication cost of the generic
algorithm.
An optimal instance of the generic algorithm is foremost required to reduce
communication cost, but at the same time it must maintain a reasonable latency. The
latency determines how long the nodes should keep their local vectors and moreover, it
determines how long it takes to detect a global violation. This is essential for real-time
monitoring applications such as natural hazard detection.
3.5.1 Minimum Resolving Set is NP-Hard
Clearly, the optimal communication cost is attained by detecting a minimum size
resolving set. However, we argue that even a relaxed version of this problem, in which
all the local vectors are known to the coordinator, is NP-hard. Denote this version the
minimum resolving set problem (MRS).
Theorem 3.4. Let $\mathcal{V}$ be the set of all the violating nodes at time $t$. Given the set of all the local vectors, namely $v^t_1, \dots, v^t_n$, the problem of finding a resolving set $R$ of minimum size, such that $v^t_{\mathcal{V} \cup R} \in S_{\mathcal{V} \cup R}$, is NP-hard.
Proof. We show a reduction to MRS from the maximum clique problem (MC), which is well known to be NP-hard. Given a graph $G = (V, E)$, the maximum clique problem is to find a maximum complete subgraph of $G$, i.e., a set of vertices $V' \subseteq V$ of maximal size that are pairwise adjacent: $\forall u_i, u_j \in V' : \{u_i, u_j\} \in E$. Given an instance of MC consisting of $G = (V, E)$, $V = \{u_1, \dots, u_{|V|}\}$, we construct an instance of MRS consisting
of $|V| + 1$ monitoring nodes, where $\mathcal{V} = \{N_{|V|+1}\}$, i.e., $N_{|V|+1}$ is the single violating node. Let $\{P_1, \dots, P_m\}$ be the set of all non-adjacent pairs of vertices in $G$; namely, $\{u_i, u_j\} \notin E$ if and only if $P_k = \{u_i, u_j\}$ for some $1 \leq k \leq m$. We specify the local vectors and the safe zones in the MRS instance as follows:
• For $i = 1, \dots, |V|$: $v^t_i = (v^t_i[1], \dots, v^t_i[m]) \in \mathbb{R}^m$, where
$$v^t_i[j] = \begin{cases} 1 & u_i \in P_j \\ 0 & \text{otherwise.} \end{cases}$$
• For $i = 1, \dots, |V|$: $S_i = \{v \in \mathbb{R}^m \mid v[j] \geq 0,\ \forall j = 1, \dots, m\}$.
• $v^t_{|V|+1} = v^t_{\mathcal{V}} = 0^m = (0, \dots, 0)$.
• $S_{|V|+1} = S_{\mathcal{V}} = \{v \in \mathbb{R}^m \mid v[j] > 0,\ \forall j = 1, \dots, m\}$.
It is evident that the construction above is polynomial and yields a legal instance of MRS, in which $v^t_i \in S_i$ for $i = 1, \dots, |V|$ and $v^t_{\mathcal{V}} \notin S_{\mathcal{V}}$. We now show that a solution to the MRS instance, i.e., a minimum resolving set $R \subseteq \{N_1, \dots, N_{|V|}\}$, defines a solution of the MC instance, i.e., a maximum clique $V' \subseteq V$, by the following two observations:
1. A clique $V'$ of size $k$ in the MC instance defines a resolving set $R$ of size $|V| - k$ in the MRS instance. We prove that $R = \{N_i \mid u_i \notin V'\}$ is a resolving set. Assume to the contrary that $v^t_{\mathcal{V} \cup R} \notin S_{\mathcal{V} \cup R}$. Since $S_{\mathcal{V} \cup R}$ consists of all strictly positive vectors, there exists $1 \leq j \leq m$ such that $v^t_{\mathcal{V} \cup R}[j] \leq 0$. Hence, for every $N_i \in R$, $v^t_i[j] = 0$. This suggests that $P_j \cap R = \emptyset$ and therefore $P_j \subseteq V'$. We conclude that $V'$ contains a pair of non-adjacent vertices, a contradiction.
2. A resolving set $R$ of size $k$ in the MRS instance defines a clique $V'$ of size $|V| - k$ in the MC instance. We prove that $V' = \{u_i \mid N_i \notin R\}$ is a clique. Assume to the contrary that $V'$ is not a clique. Then it contains a pair of non-adjacent vertices; namely, there exists $1 \leq j \leq m$ such that $P_j \subseteq V'$. Hence, $P_j \cap R = \emptyset$ and for every $N_i \in R$: $v^t_i[j] = 0$. We conclude that $v^t_{\mathcal{V} \cup R}[j] = 0$ and therefore $v^t_{\mathcal{V} \cup R} \notin S_{\mathcal{V} \cup R}$, a contradiction.
Thus, a minimum resolving set R in the MRS instance defines a maximum clique V ′ in
the MC instance.
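The reduction is easy to check mechanically. The sketch below (helper names are ours) builds the reduction's local vectors from a graph and tests observation 1: a set $R$ resolves exactly when the sum over $R$ is strictly positive in every coordinate, i.e., when every non-adjacent pair meets $R$.

```python
from itertools import combinations

def mrs_instance(vertices, edges):
    """Build the reduction's local vectors: one coordinate per
    non-adjacent pair P_j; v_i[j] = 1 iff vertex u_i belongs to P_j."""
    edge_set = {frozenset(e) for e in edges}
    pairs = [p for p in combinations(vertices, 2)
             if frozenset(p) not in edge_set]
    vectors = {u: [1 if u in p else 0 for p in pairs] for u in vertices}
    return vectors, pairs

def is_resolving(R, vectors, pairs):
    """R resolves iff every coordinate of the sum over R is positive,
    i.e. every non-adjacent pair contains a vertex whose node is in R."""
    return all(any(vectors[u][j] for u in R) for j in range(len(pairs)))
```

On the triangle $\{a, b, c\}$ plus an isolated vertex $d$, the clique $\{a, b, c\}$ yields the resolving set $\{d\}$, while the complement of a non-clique fails the test.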
3.5.2 Probabilistic Analysis of the Algorithm
We next present a few probabilistic bounds which we use to evaluate the expected size of the resolving set. We derive a lower bound on the probability that violation resolution is achieved by the resolving set, namely $\Pr\{v^t_{V \cup R} \in S_{V \cup R}\}$, and show that it is exponentially increasing in the size of the set. Our method of doing so is to define a region inside $S_{V \cup R}$ which contains the expected value of $v^t_{V \cup R}$, and then bound the
probability that vtV∪R belongs to this region. We derive two lower bounds for the
cases where the bounded region inside the safe zone is a box or a sphere, by employing
Hoeffding’s and Bernstein’s inequalities, respectively.
Hoeffding’s Lower Bound – Univariate Data
For simplicity, we first consider the case where the data of each node are one-dimensional ($v^t_i \in \mathbb{R}$). Assume that the safe zone of node $N_i$ is given by an interval on the real line $[a_i, b_i]$ and denote its length by $\Delta_i$. Further, assume that the data of each node are bounded within an interval whose length is $\alpha_i \Delta_i$. It follows that for a set of nodes $A$, $S_A$ is the interval $\left[\frac{1}{|A|}\sum_{N_i \in A} a_i, \frac{1}{|A|}\sum_{N_i \in A} b_i\right]$, whose length is $\frac{1}{|A|}\sum_{N_i \in A} \Delta_i$. Denote by $\delta^-, \delta^+$ the distances from the expected value of $v^t_{V \cup R}$ to the left and right end points of $S_{V \cup R}$, respectively. Let $A = V \cup R$; then:
$$\Pr\{v^t_A \in S_A\} \geq \Pr\{E(v^t_A) - v^t_A \leq \delta^- \wedge v^t_A - E(v^t_A) \leq \delta^+\}$$
$$= 1 - \Pr\{E(v^t_A) - v^t_A > \delta^- \vee v^t_A - E(v^t_A) > \delta^+\}$$
$$= 1 - \Pr\{E(v^t_A) - v^t_A > \delta^-\} - \Pr\{v^t_A - E(v^t_A) > \delta^+\}$$
$$\geq 1 - \Pr\{-v^t_A - E(-v^t_A) \geq \delta^-\} - \Pr\{v^t_A - E(v^t_A) \geq \delta^+\}.$$
Hoeffding provided an upper bound on the probability that the mean of random variables deviates from its expected value:
Theorem 3.5 (Hoeffding 1963, Theorem 2 [Hoe63]). Let $X_1, \dots, X_n$ be independent random variables such that $a_i \leq X_i \leq b_i$ ($i = 1, \dots, n$). Then for $t > 0$:
$$\Pr\{\overline{X} - E(\overline{X}) \geq t\} \leq \exp\left(-\frac{2n^2t^2}{\sum_{i=1}^{n}(b_i - a_i)^2}\right)$$
where $\overline{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$.
We employ Hoeffding's inequality to derive the following corollary:
Corollary 3.6. Given that the local vectors of the nodes are univariate and independent, a lower bound on the probability for violation resolution is given by:
$$\Pr\{v^t_{V \cup R} \in S_{V \cup R}\} \geq 1 - \varphi(V, R, \delta^-) - \varphi(V, R, \delta^+)$$
where
$$\varphi(V, R, \delta) = \exp\left(-\frac{2(|V| + |R|)^2\delta^2}{\sum_{N_i \in V}(\alpha_i\Delta_i)^2 + \sum_{N_i \in R}\Delta_i^2}\right).$$
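Numerically, the bound of Corollary 3.6 is cheap to evaluate, which is what makes it usable for estimating the expected size of the resolving set. A sketch (the parameter values in the usage example are illustrative only):

```python
import math

def phi(alpha_deltas_V, deltas_R, delta):
    """The exponential term phi(V, R, delta) of Corollary 3.6.
    alpha_deltas_V: the values alpha_i * Delta_i for nodes in V;
    deltas_R:       the safe-zone lengths Delta_i for nodes in R."""
    k = len(alpha_deltas_V) + len(deltas_R)
    denom = sum(x * x for x in alpha_deltas_V) + sum(x * x for x in deltas_R)
    return math.exp(-2 * k * k * delta * delta / denom)

def hoeffding_lower_bound(alpha_deltas_V, deltas_R, delta_minus, delta_plus):
    """Lower bound on Pr{v_{VuR} in S_{VuR}} for univariate data."""
    return (1 - phi(alpha_deltas_V, deltas_R, delta_minus)
              - phi(alpha_deltas_V, deltas_R, delta_plus))
```

With one violator ($\alpha = 2$, $\Delta = 1$) and $\delta^- = \delta^+ = \Delta/2$, this reproduces the closed form $1 - 2\exp(-(1+|R|)^2 / (2(\alpha^2+|R|)))$ plotted in Figure 3.3, and the bound grows as $|R|$ grows.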
This bound exponentially approaches 1 as the size of R increases, regardless of the data
distributions of the nodes. Figure 3.3 depicts these probability bounds for identically
distributed nodes.
[Figure 3.3 plots $\Pr\{v^t_{V \cup R} \in S_{V \cup R}\}$ against $|R|$, comparing the actual probability with the bound for $\alpha = 2, 2.5, 3$.]
Figure 3.3: Hoeffding bounds over univariate synthetic data. In this setup, the safe zone of each node $N_i$ is an interval of length $\Delta$ centered around the expected value of $v^t_i$, and $v^t_i$ is bounded within an interval of length $\alpha\Delta$. Hence, $S_{V \cup R}$ is also an interval of length $\Delta$, centered around the expected value of $v^t_{V \cup R}$. In addition, $V$ consists of a single violating node. Therefore, the lower bound for $\Pr\{v^t_{V \cup R} \in S_{V \cup R}\}$ is given by:
$$1 - 2\varphi(V, R, \Delta/2) = 1 - 2\exp\left(-\frac{(1 + |R|)^2}{2(\alpha^2 + |R|)}\right).$$
$\alpha$ denotes how far a node can deviate from its safe zone, in terms of the safe zone's width.
Hoeffding’s Lower Bound – Multivariate Data
For the multidimensional case, let [ai, bi] ⊆ Rd be the bounding box of Si and let
∆i = bi − ai. In other words, the projection of Si on the jth dimension (j = 1, . . . , d)
is the interval [ai[j], bi[j]] of length ∆i[j]. Assume that the data of each node Ni are
bounded within a d-dimensional box: [ci, di] ⊆ Rd, such that di − ci = αi ·∆i where
$\alpha_i \in \mathbb{R}^d$ (the product denotes multiplying corresponding entries). Let $\delta^-, \delta^+ \in \mathbb{R}^d$ such that $\delta^-[j], \delta^+[j]$ denote the distances from the expected value of $v^t_{V \cup R}$ to the left and right end points, respectively, of a box contained in $S_{V \cup R}$, when projected on the $j$th dimension.
Corollary 3.7. Given that the local vectors of the nodes are $d$-dimensional and independent, a lower bound on the probability for violation resolution is given by:
$$\Pr\{v^t_{V \cup R} \in S_{V \cup R}\} \geq 1 - \sum_{j=1}^{d}\left(\varphi(V, R, \delta^-, j) + \varphi(V, R, \delta^+, j)\right)$$
where
$$\varphi(V, R, \delta, j) = \exp\left(-\frac{2(|V| + |R|)^2\delta[j]^2}{\sum_{N_i \in V}(\alpha_i[j]\Delta_i[j])^2 + \sum_{N_i \in R}\Delta_i[j]^2}\right).$$
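The multivariate bound simply accumulates one pair of $\varphi$-terms per dimension. A self-contained sketch (parameter layout is our choice):

```python
import math

def phi_j(alpha_delta_V_j, delta_R_j, delta_j):
    """The per-dimension term phi(V, R, delta, j) of Corollary 3.7."""
    k = len(alpha_delta_V_j) + len(delta_R_j)
    denom = (sum(x * x for x in alpha_delta_V_j)
             + sum(x * x for x in delta_R_j))
    return math.exp(-2 * k * k * delta_j * delta_j / denom)

def hoeffding_lower_bound_multi(alpha_deltas_V, deltas_R,
                                delta_minus, delta_plus):
    """alpha_deltas_V: per-node lists of alpha_i[j] * Delta_i[j];
    deltas_R: per-node lists of Delta_i[j]. Returns the bound of
    Corollary 3.7 (a sum over the d dimensions)."""
    d = len(delta_minus)
    total = 0.0
    for j in range(d):
        av = [row[j] for row in alpha_deltas_V]
        dr = [row[j] for row in deltas_R]
        total += phi_j(av, dr, delta_minus[j]) + phi_j(av, dr, delta_plus[j])
    return 1 - total
```

For $d = 1$ it collapses to the univariate bound of Corollary 3.6.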
Bernstein’s Lower Bound
An even tighter bound can be attained if we consider a d-sphere of radius δ, inside SV∪R
that is centered around the expected value of vtV∪R. Assume that the data of each node
are bounded in a d-sphere of radius ∆. Let A = V ∪R. Then:
$$\Pr\{v^t_A \in S_A\} \geq \Pr\{\|v^t_A - E(v^t_A)\| \leq \delta\}$$
$$= \Pr\left\{\left\|\frac{1}{|A|}\sum_{N_i \in A} v^t_i - E\left(\frac{1}{|A|}\sum_{N_i \in A} v^t_i\right)\right\| \leq \delta\right\}$$
$$= \Pr\left\{\left\|\sum_{N_i \in A}\left(v^t_i - E(v^t_i)\right)\right\| \leq \delta|A|\right\} = \Pr\left\{\left\|\sum_{N_i \in A} Z_i\right\| \leq \delta|A|\right\}$$
where $Z_i = v^t_i - E(v^t_i)$ is a random vector of dimension $d$ which satisfies $E(Z_i) = 0$ and $\|Z_i\| < \Delta$. Tropp provided the following generalization of Bernstein's inequality to a sum of random matrices:
Theorem 3.8 (Matrix Bernstein [Tro10]). Given a finite sequence $\{Z_k\}$ of independent random matrices with dimensions $d_1 \times d_2$, assume that each random matrix satisfies $E(Z_k) = 0$ and $\|Z_k\| < R$ almost surely. Define:
$$\sigma^2 := \max\left\{\left\|\sum_k E(Z_k Z_k^*)\right\|, \left\|\sum_k E(Z_k^* Z_k)\right\|\right\}.$$
Then, for all $t \geq 0$,
$$\Pr\left\{\left\|\sum_k Z_k\right\| \geq t\right\} \leq (d_1 + d_2) \cdot \exp\left(-\frac{t^2/2}{\sigma^2 + Rt/3}\right).$$
We employ Bernstein's matrix inequality to derive the following corollary:
Corollary 3.9. Given that the local vectors of the nodes are $d$-dimensional and independent, a lower bound on the probability for violation resolution is given by:
$$\Pr\{v^t_{V \cup R} \in S_{V \cup R}\} \geq 1 - (1 + d) \cdot \exp\left(-\frac{(\delta|V \cup R|)^2/2}{\sigma^2 + \Delta\delta|V \cup R|/3}\right),$$
where
$$\sigma^2 := \max\left\{\left\|\sum_{N_i \in V \cup R} E(Z_i Z_i^T)\right\|, \left\|\sum_{N_i \in V \cup R} E(Z_i^T Z_i)\right\|\right\}.$$
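Given per-node sample histories, every quantity in Corollary 3.9 can be estimated empirically. A sketch (estimating $\sigma^2$ and $\Delta$ from samples is our choice here, not something prescribed by the corollary):

```python
import math
import numpy as np

def bernstein_lower_bound(samples_by_node, delta):
    """Evaluate the bound of Corollary 3.9 from per-node samples.
    samples_by_node: one (num_samples, d) array per node in V u R;
    delta: radius of the sphere inside S_{VuR} around E(v_{VuR})."""
    d = samples_by_node[0].shape[1]
    k = len(samples_by_node)
    S = np.zeros((d, d))      # accumulates sum_i E(Z_i Z_i^T)
    Delta = 0.0               # empirical almost-sure bound ||Z_i|| < Delta
    for X in samples_by_node:
        Z = X - X.mean(axis=0)            # centered samples of node i
        S += Z.T @ Z / len(Z)             # empirical E(Z_i Z_i^T)
        Delta = max(Delta, np.linalg.norm(Z, axis=1).max())
    # for column vectors, sum_i E(Z_i^T Z_i) is the scalar trace(S)
    sigma2 = max(np.linalg.norm(S, 2), np.trace(S))
    t = delta * k             # the event rescaled by |V u R|
    return 1 - (1 + d) * math.exp(-(t * t / 2) / (sigma2 + Delta * t / 3))
```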
3.6 Homogeneous Data Instance
In the case of homogeneous data, i.e., the monitoring nodes’ data are identically
distributed, it appears that no node is clearly preferable over any other in resolving
local violations. Therefore, we suggest the following instance of the generic algorithm
presented in Section 3.5, to which we refer as the Random Logarithmic algorithm (RLG).
Pseudo-code for the function getExtendingSet is given in Algorithm 3.2. The concept
behind this algorithm is straightforward: at each round $r$ the resolving set is extended
by an additional $2^r$ randomly selected non-violating nodes.
Following the probabilistic analysis of the algorithm, presented in Section 3.5.2,
we are able to estimate the expected size of the resolving set. Thus, we are able to
estimate the expected communication cost (|R|), as well as the latency (log(|R|)) of the
algorithm. As depicted in Figure 3.3, when the data are homogeneous, the probability
for resolution converges rapidly to 1 as the size of the resolving set increases. Therefore,
we expect this algorithm to perform well. Note that in the worst case scenario, i.e.,
the resolving set comprises all the non-violating nodes, RLG bounds the latency by
O(log(n)) rounds.
Clearly, the number of nodes added to the resolving set in each round determines
the algorithm’s latency. We have found that doubling the size of the resolving set at
each round yields a fair trade-off between communication cost and latency.
The correctness of RLG derives directly from Theorem 3.2, as getExtendingSet
always returns a non-empty set of non-violating nodes which were not already included
in the resolving set.
Algorithm 3.2 Random Logarithmic Algorithm
getExtendingSet($V, R, r$)
1: $R_r \leftarrow 2^r$ random nodes from $\{N_1, \dots, N_n\} \setminus (V \cup R)$
2: return $R_r$
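In Python, the RLG extension rule might look like the following sketch (node ids are assumed hashable; the clamp to the remaining candidates handles the final round):

```python
import random

def rlg_extending_set(all_nodes, violating, resolving, r, rng=random):
    """Round r of RLG: pick 2**r random non-violating nodes that are
    not yet in the resolving set (fewer if not enough remain)."""
    candidates = list(set(all_nodes) - violating - resolving)
    k = min(2 ** r, len(candidates))
    return set(rng.sample(candidates, k))
```

Since the pool doubles each round, the resolving set reaches all $n$ nodes within $O(\log n)$ calls, matching the latency bound stated above.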
3.7 Heterogeneous Data Instance
In various distributed stream networks, the data are heterogeneously distributed, i.e., the
distribution of the data may vary greatly among different streams. Data heterogeneity,
in most constraint monitoring algorithms, yields heterogeneous constraints. If the
constraints are boxes, for example, the safe zones among the different nodes may vary
notably in shape, having different width in each dimension. In this section we show that
random methods (such as RLG) may produce poor results over heterogeneous setups,
and present a more suited algorithm, which efficiently chooses the resolving set. Finally,
we discuss the complexity of this algorithm.
3.7.1 The Heterogeneous Data Challenge
Consider Example 2 presented in Section 3.4.3, in which a single local violation has occurred in one of the sports nodes ($N_2$). It is evident that the violation can only be resolved in collaboration with the other sports node ($N_4$), because economy nodes are characterized by narrow safe zones for sports-related terms. Consequently, they have little or no flexibility in resolving the violation. This is also expressed in the lower bound for the probability of resolution presented in Corollary 3.7, where the flexibility is represented by the width of the safe zone, $\Delta$. It follows that greater flexibility exponentially increases the probability for resolution.
3.7.2 Maximum Matching Tree Algorithm
We suggest a new instance of the generic algorithm presented in Section 3.5, referred to
as the Maximum-Matching Tree algorithm (MMT). The algorithm is preceded by an
initialization phase in which an overlay tree structure is defined over the nodes, such
that nodes with high probability to resolve each other’s local violations are stored under
the same sub-tree. The tree is later used in the implementation of getExtendingSet, to
retrieve the resolving set.
The construction of the overlay tree, MMTree, is presented in detail in the next
subsection. The pseudo-code for the construction, as well as the implementation of
getExtendingSet, are given in Algorithm 3.3. Each level in the tree defines a partition of $\{N_1, \dots, N_n\}$, as depicted in Figure 3.4. Denote the level of the leaves as 0 and the root level as $\log(n)$. Let $P_r$ be the partition defined by level $r$, and let $P_r[N_i]$ be the set in $P_r$ that contains the node $N_i$. In each round $r$, every node $N_i$ that is not already included in $R$ and shares the same set in $P_r$ with a violating node (i.e., $P_r[N_i] \cap V \neq \emptyset$) is added to $R$. Figure 3.5 illustrates an execution example of MMT over a system of 8 nodes.
The correctness of MMT derives directly from Theorem 3.2, as getExtendingSet
always returns a non-empty set of non-violating nodes which were not already included
in the resolving set.
3.7.3 Maximum Matching Tree Construction
We assume that the coordinator is familiar with the data distributions and the safe
zones of all the monitoring nodes. The maximum matching tree is the product of a
greedy process that recursively obtains a coarser partition of N by aggregating the
components of a finer partition. Given a partition $P$ of size $m$, the process obtains a coarser partition $P'$ of size $\lceil m/2 \rceil$ by optimally pairing the components of $P$. In other words, every component of $P'$ is formed by joining a pair of components from $P$ (if $m$ is odd, $P$ and $P'$ will share a single component). Pairing a partition $A_1, \dots, A_m$ is considered optimal if it yields a partition $B_1, \dots, B_{\lceil m/2 \rceil}$ such that $\Pr\left\{\bigwedge_{i=1}^{\lceil m/2 \rceil} v^t_{B_i} \in S_{B_i}\right\}$ is maximized. We initialize the process with the partition of $N = \{N_1, \dots, N_n\}$ into singletons ($\{N_1\}, \dots, \{N_n\}$). In turn, optimal pairings are recursively performed until the trivial partition ($N$) is reached. The result is a bottom-up construction of a binary tree where each generated partition defines a new level in the tree. An example of such a tree is depicted in Figure 3.4.
Optimal Pairing
Given a partition of $N$ into (disjoint) sets $A_1, \dots, A_m$, we define a weighted, non-directed, complete graph over these sets. The weight of the edge connecting $A_i$ and $A_j$ is defined as $\log(\Pr\{v^t_{A_i \cup A_j} \in S_{A_i \cup A_j}\})$ for all $1 \leq i < j \leq m$. We perform the pairing by computing a maximum weighted matching in this graph, as described in [Edm65]. As this graph is complete, the matching is perfect (or near-perfect if $m$ is odd). Thus, we obtain a partition $B_1, \dots, B_{\lceil m/2 \rceil}$ such that $\sum_{i=1}^{\lceil m/2 \rceil} \log(\Pr\{v^t_{B_i} \in S_{B_i}\})$ is maximized. It follows that $\prod_{i=1}^{\lceil m/2 \rceil} \Pr\{v^t_{B_i} \in S_{B_i}\}$ is maximized and, as the monitoring nodes are independent, we conclude that the pairing is optimal.
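With a probability estimate in hand, this pairing step maps directly onto an off-the-shelf maximum-weight matching routine. A sketch using networkx (the probability function `pr_resolve` is a stand-in for the estimates discussed next; the log weights are clamped only to avoid $\log 0$):

```python
import math
import networkx as nx

def optimal_pairing(parts, pr_resolve):
    """One level of the MMTree: pair the components of the partition
    `parts` (a list of frozensets of node ids) so that the product of
    Pr{v_B in S_B} over the resulting components is maximized."""
    G = nx.Graph()
    G.add_nodes_from(range(len(parts)))
    for i in range(len(parts)):
        for j in range(i + 1, len(parts)):
            p = pr_resolve(parts[i] | parts[j])
            G.add_edge(i, j, weight=math.log(max(p, 1e-300)))
    matching = nx.max_weight_matching(G, maxcardinality=True)
    paired, used = [], set()
    for i, j in matching:
        paired.append(parts[i] | parts[j])
        used |= {i, j}
    # if m is odd, one unmatched component carries over unchanged
    paired += [parts[i] for i in range(len(parts)) if i not in used]
    return paired
```

Maximizing the sum of log-probabilities over a maximum-cardinality matching is exactly the product maximization argued above.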
Computational Issues
An essential task in the tree construction is computing $\Pr\{v^t_A \in S_A\}$ for a given set of nodes $A$. To this end we assume that each node $N_i \in N$ is associated with a probability density function (p.d.f.) given as a discrete set $D_i \subseteq \mathbb{R}^d$ of sample history data. We generalize this notion to a set of nodes $A$ by aggregating the sampled data of the nodes in $A$ (i.e., $\frac{1}{|A|}\sum_{N_i \in A} v_i \in D_A$, where the $\{v_i \in D_i\}_{N_i \in A}$ were sampled at the same time steps). It follows that
$$\Pr\{v^t_A \in S_A\} = \frac{|D_A \cap S_A|}{|D_A|}.$$
If the nodes' p.d.f.s are given as explicit functions $f_i : \mathbb{R}^d \rightarrow [0, 1]$, then $\Pr\{v^t_A \in S_A\} = \int_{S_A} f_A$, where $f_A$ is the convolution of $\{f_i\}_{N_i \in A}$.
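The empirical estimate $|D_A \cap S_A| / |D_A|$ from synchronized history samples is a one-liner; a sketch in which the membership test `in_safe_zone` is a stand-in for whatever safe-zone shape is used:

```python
import numpy as np

def pr_resolve_empirical(histories, in_safe_zone):
    """histories: list of (T, d) arrays of synchronized samples, one per
    node in A. D_A is the set of per-time-step averages; the estimate
    is |D_A intersect S_A| / |D_A|."""
    D_A = np.mean(np.stack(histories), axis=0)   # (T, d) average vectors
    hits = sum(1 for v in D_A if in_safe_zone(v))
    return hits / len(D_A)
```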
Figure 3.4: A maximum matching tree over an 8-node system in a 2-dimensional space. Every level of the tree defines a partition of $\{N_1, \dots, N_8\}$. The distribution of the average vector and the safe zone are marked by a cloud of dots and a rectangle, respectively, for every partition set. The root node represents the distribution of the global vector and the global safe zone. Note that, indeed, global violations rarely occur. There are two types of nodes: type-1 nodes, which have high variance in the 1st dimension and low variance in the 2nd dimension, and type-2 nodes, which have high variance in the 2nd dimension and low variance in the 1st dimension. 4 nodes of each type comprise the leaves of the tree, denoted by the double-outlined ellipses. As expected, MMT first pairs nodes of the same type.
Algorithm 3.3 Maximum-Matching Tree Algorithm
buildMMTree()
1: $P_0 \leftarrow \{\{N_1\}, \dots, \{N_n\}\}$
2: $r \leftarrow 0$
3: while $|P_r| > 1$ do
4:   $G \leftarrow$ buildCompleteGraph($P_r$)
5:   $M \leftarrow$ findMaximumMatching($G$)
6:   $P_{r+1} \leftarrow \emptyset$
7:   for all $A$ in $P_r$ do
8:     Add $A \cup M(A)$ to $P_{r+1}$
9:   end for
10:  $r \leftarrow r + 1$
11: end while
12: MMTree $\leftarrow (P_0, \dots, P_r)$

getExtendingSet($V, R, r$)
1: $R_r \leftarrow \emptyset$
2: for all $N_i$ in $V$ do
3:   Add $P_r[N_i] \setminus (V \cup R)$ to $R_r$
4: end for
5: return $R_r$
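Given the per-level partitions $P_0, \dots, P_{\log n}$ produced by buildMMTree, the MMT extension rule is a direct lookup. A sketch (partitions are represented here as plain lists of sets):

```python
def mmt_extending_set(partitions, violating, resolving, r):
    """Round r of MMT: every node sharing a level-r component with a
    violating node, and not yet polled, joins the resolving set."""
    level = partitions[r]          # list of sets of node ids
    extra = set()
    for component in level:
        if component & violating:  # component contains a violator
            extra |= component - violating - resolving
    return extra
```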
Figure 3.5: An execution example of MMT over an 8-node system. At $t = 1$, a snapshot of the system is given at the bottom. First, the violating nodes ($N_2$, $N_7$) report their local vectors to the coordinator. Upon failure to resolve the violation, the resolving set is extended with the non-violating nodes from the sets containing $N_2$, $N_7$ in the 1st level of the MMTree. If we consider the MMTree in Figure 3.4, nodes $N_4$, $N_6$ are polled for their local vectors. At this point the violations are resolved and the algorithm terminates.
3.7.4 Distributed Variant of MMT
Up until now we assumed that the nodes communicate only through the coordinator
node. However, this is not necessarily the case in many distributed networks. A star
topology in ever expanding networks implies increasing energy costs for the distant nodes.
Moreover, as the number of local violations increases with the size of the network, it also
implies an increasing load on the coordinator. Next we present a variant of the MMT
algorithm, designed for network topologies that support inter-node communication,
denoted as Distributed MMT (DMMT). Unlike the generic algorithm, in which the
local violations were centralized to the coordinator for resolution, the violating nodes in
DMMT do not immediately address the coordinator but rather attempt to resolve their
violations locally.
DMMT initiates with the construction of the MMTree by the coordinator as in
MMT. In addition, the coordinator disseminates the MMTree to all the nodes. When
local violations occur, the violating nodes perform the MMT algorithm simultaneously,
yet independently. Each of the nodes constructs its own resolving set using the MMTree
until resolution is attained. In each of the sets, the resolution process is led by the smallest-index node. It is possible that some of the sets are unified during the process. The MMTree guarantees that, throughout the process, each node can belong to only a single resolving set. In other words, the process creates a partition of the network into disjoint sets of nodes, so that, by Lemma 3.4.1, the resolution is valid.
The distributed approach dramatically reduces the load on the coordinator. In
addition, it can also lead to savings in communication cost. The MMT algorithm
attempts to resolve the local violations as a whole and therefore extends the resolving
set until all violations are resolved. In DMMT, however, local violations are resolved
independently, thus allowing a different size resolving set to be tailored to each violation.
Finally, the construction of the MMTree in DMMT can be adapted to suit the
needs of the network. Factors can be applied to the weights of edges to reflect desired or
non-existent connections and to integrate distances between nodes.
3.8 Experiments
In this section we compare the performance of the presented violation resolution
algorithms. We tested these algorithms over homogeneous and heterogeneous setups,
using both synthetic and real-life data sets. The setups used, as well as the performance
metrics and compared algorithms, are now described.
3.8.1 Data sets
Following are the data sets over which we conducted our experiments:
Syn-HM-n (n = 16, 32, . . . , 1024) – A synthetically generated homogeneous data
set consisting of n streams (nodes) of random 3-dimensional data from the normal
distribution. The data set was generated such that data variance in each dimension was
the same in all the nodes.
Air-HM-n (n = 16, 32, . . . , 1024) – A homogeneous data set taken from the European
air quality database (AirBase) [web]. The data set consists of air pollutant measurements
read by geographically distributed sensors. The local vectors are 2-dimensional vectors
representing the concentrations of NO and NO2 in the air, which were measured in
micrograms per cubic meter. We have assembled n nodes having highly correlated data
distributions.
Syn-HT-n (n = 16, 32, . . . , 1024) – A synthetically generated heterogeneous data
set consisting of n streams (nodes) of random n/8-dimensional data from the normal
distribution. The data set was generated such that the data variance in each dimension was
the same in all the nodes except for 8 nodes, in which it was substantially higher.
RCV-HT-n (n = 16, 32, . . . , 256) – The Reuters Corpus (RCV1-v2) [RSW02] consists
of 804,014 news stories, each tagged as belonging to one or more of 103 content categories.
Every story comprises a precomputed list of terms [LYRL04]. We assembled n/8 roughly
equal-sized super-categories and selected, for each super-category, a term that highly
dominates the other categories (in the sense that the term occurs many more times in
stories within that category). The stories of each super-category were then divided into
8 nodes, which tracked the occurrences of all the selected terms over a sliding window of
100 stories (i.e., the local vectors were the occurrence count vectors). This resulted in
the nodes having roughly the same variance in every dimension of their data distribution,
except for the dimension that corresponds to the term of their super-category, in which
the variance was substantially higher.
3.8.2 Scoring Functions and Local Constraints
Following are the scoring functions and local constraints defined for the different data
sets:
Syn-HM, Syn-HT, RCV-HT – The global safe zone was defined as a multidimensional
rectangle (box), centered around the expected value of the global vector. A box essentially
sets a lower and an upper bound on the value of the global vector in every dimension.
Similarly, the local safe zone of each node was defined as a box centered around the
expectation of its data distribution. The boxes were defined such that the average box
of the local safe zones was contained in the box of the global safe zone (to ensure that
the safe zones condition of Equation 3.1 holds). In addition, the boxes achieved a fairly
high coverage of the data distribution, guaranteeing that both a global violation and a
local violation at any node occur with low probability.
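As an illustration, the containment condition for box safe zones can be sketched as follows. The centers, widths, and helper names are hypothetical, and the checks only stand in for the actual construction used in the experiments.

```python
def make_box(center, half_widths):
    """A box SZ: per-dimension lower and upper bounds around a center."""
    lo = [c - h for c, h in zip(center, half_widths)]
    hi = [c + h for c, h in zip(center, half_widths)]
    return lo, hi

def average_box(boxes):
    """The 'average box' of the local SZs: averaging the per-node bounds
    bounds the global average vector whenever every node is in its box."""
    n, dims = len(boxes), len(boxes[0][0])
    lo = [sum(b[0][d] for b in boxes) / n for d in range(dims)]
    hi = [sum(b[1][d] for b in boxes) / n for d in range(dims)]
    return lo, hi

def contained_in(inner, outer):
    """True if the inner box lies inside the outer box in every dimension."""
    return all(i >= o for i, o in zip(inner[0], outer[0])) and \
           all(i <= o for i, o in zip(inner[1], outer[1]))

def in_box(v, box):
    """Local SZ test: no local violation while v stays inside the box."""
    return all(l <= x <= h for x, l, h in zip(v, box[0], box[1]))

# Hypothetical 2-dimensional setup: a global SZ around the expected
# global vector, and local SZs around each node's own expectation.
global_sz = make_box([0.0, 0.0], [2.0, 2.0])
local_szs = [make_box([0.5, -0.5], [1.5, 1.5]),
             make_box([-0.5, 0.5], [1.5, 1.5])]

assert contained_in(average_box(local_szs), global_sz)  # Eq. 3.1 condition
assert in_box([0.4, -0.2], local_szs[0])                # no local violation
assert not in_box([2.5, 0.0], local_szs[0])             # local violation
```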
Air-HM – The scoring function was defined as the ratio between the average con-
centrations of NO and NO2, and the safe zones were defined as triangles – a choice
motivated by their simplicity and by their suitability to the data and the definition of
the queried function.
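For illustration, the Air-HM query and a triangular safe zone can be sketched as follows. The threshold, vertices, and function names are hypothetical and only stand in for the zones actually computed in the experiments.

```python
def ratio_score(vectors):
    """Air-HM scoring function: the ratio between the average NO and
    average NO2 concentrations (each local vector is (NO, NO2))."""
    avg_no = sum(v[0] for v in vectors) / len(vectors)
    avg_no2 = sum(v[1] for v in vectors) / len(vectors)
    return avg_no / avg_no2

def in_triangle(p, a, b, c):
    """Triangular SZ test: p is inside if it lies on the same side of
    all three (oriented) edges, via edge cross products."""
    def cross(o, u, v):
        return (u[0] - o[0]) * (v[1] - o[1]) - (u[1] - o[1]) * (v[0] - o[0])
    s1, s2, s3 = cross(a, b, p), cross(b, c, p), cross(c, a, p)
    return (s1 >= 0 and s2 >= 0 and s3 >= 0) or (s1 <= 0 and s2 <= 0 and s3 <= 0)

# Hypothetical triangle lying under the line NO = tau * NO2: every point
# of the SZ keeps the local ratio at or below the threshold tau.
tau = 2.0
sz = ((0.0, 0.0), (10.0, 5.0), (0.0, 5.0))
assert in_triangle((1.0, 3.0), *sz)      # safe reading
assert not in_triangle((8.0, 1.0), *sz)  # ratio 8.0 > tau: local violation
assert ratio_score([(1.0, 3.0), (3.0, 1.0)]) == 1.0
```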
(a) Syn-HM (b) Air-HM
(c) Syn-HT (d) RCV-HT
Figure 3.6: Data distributions and safe zones of 2 randomly chosen nodes from each data set. In Syn-HT and RCV-HT the data are projected to a 3-dimensional space.
Figure 3.6 provides a graphic illustration of the data sets and safe zones. In the
homogeneous data sets the variance of the data in each dimension is approximately
the same in all the nodes. Consequently, their safe zones are similar in shape (i.e.,
have approximately the same width in each dimension). In the heterogeneous data sets, on
the other hand, the variance of the data in each dimension differs greatly in some nodes,
and their safe zones differ in shape accordingly.
3.8.3 Performance Metrics
We have applied the following metrics:
Average communication cost – The average number of monitoring nodes that re-
ported their local vector during violation resolution. In the centralized algorithms this
corresponds to |𝒱 ∪ ℛ|. In the DMMT algorithm, this excludes the violating nodes, which
handle the resolution themselves. The actual communication cost (in bytes) is linear in this size.
Average size of resolving set – The average number of non-violating nodes that
participated in a violation resolution. This metric emphasizes the overhead in the
network resources allocated for resolving the local violations.
Average latency – The average running time (in rounds) of the violation resolution
algorithm (namely, the average number of rounds it took to assemble the resolving set).
Average maximum communication load – The maximum communication load is the
maximum communication that goes through a single node during violation resolution.
[Line graphs over n = 16, 32, . . . , 1024 for Naïve, RLN, RLG, MMT and DMMT, with the number of violations |𝒱| shown for reference: (a) Average communication cost; (b) Average resolving set size; (c) Average latency; (d) Average maximum communication load.]
Figure 3.7: Experimental results over the Syn-HM (left) and Air-HM (right) homogeneous data sets. The vertical axes of the line graphs are in logarithmic scale. In the average communication cost, all algorithms (except the Naive) approach the minimum, as denoted by the number of violations. The average latency reflects the differences between the algorithms in the expansion rate of the resolving set. DMMT outperforms the centralized algorithms in reducing the average maximum communication load.
In the centralized algorithms, the maximum communication load indicates the load on the coordinator.
Note that we’ve only considered time steps in which a global violation did not occur
(i.e., the local violations could be resolved). A global violation would always require any
algorithm to collect the entire network data. Moreover, the latency of the algorithms
would be maximal (i.e., Naive - 1, RLG/MMT/DMMT - log(n), RLN - n). As global
violations are rare, we’ve omitted them from our evaluation.
3.8.4 Compared Instances
We evaluated the four instances of the centralized generic algorithm, as well as the
distributed version:
Naive – Mentioned in Section 3.5. The resolving set is always defined as the entire
set of non-violating nodes.
RLN – Mentioned in Section 3.5. Extends the resolving set linearly with randomly
chosen nodes.
RLG – Presented in Section 3.6. Extends the resolving set exponentially with
randomly chosen nodes.
MMT – Presented in Section 3.7. Extends the resolving set exponentially using an
overlay tree structure.
DMMT – Presented in Subsection 3.7.4. A distributed variant of MMT.
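The growth schedules of the instances above can be sketched as follows. The sketch is illustrative: the `need` parameter stands in for the actual resolution test of the generic algorithm, and the tree-guided node selection of MMT and DMMT is not modeled.

```python
def resolution_rounds(n, num_violating, strategy, need):
    """Count the rounds a growth strategy needs until `need` non-violating
    nodes are collected (a stand-in for the real resolution test).
    Returns (rounds, nodes collected)."""
    non_violating = n - num_violating
    target = min(need, non_violating)
    collected, rounds, step = 0, 0, 1
    while collected < target:
        rounds += 1
        if strategy == "naive":    # collect everyone at once
            collected = non_violating
        elif strategy == "rln":    # linear growth: one random node per round
            collected += 1
        elif strategy == "rlg":    # exponential growth: double each round
            collected += step
            step *= 2
        collected = min(collected, non_violating)
    return rounds, collected

# With 8 violations in a network of 1024 nodes and 16 resolvers needed:
print(resolution_rounds(1024, 8, "naive", 16))  # (1, 1016)
print(resolution_rounds(1024, 8, "rln", 16))    # (16, 16)
print(resolution_rounds(1024, 8, "rlg", 16))    # (5, 31): 1+2+4+8+16
```

Doubling reaches the target in logarithmically many rounds, which matches the latency behavior reported for the exponential-growth instances.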
[Line graphs for Naïve, RLN, RLG, MMT and DMMT, with the number of violations |𝒱| shown for reference, over n = 16, . . . , 1024 (Syn-HT) and n = 16, . . . , 256 (RCV-HT): (a) Average communication cost; (b) Average resolving set size; (c) Average latency; (d) Average maximum communication load.]
Figure 3.8: Experimental results over the Syn-HT (left) and RCV-HT (right) heterogeneous data sets. The vertical axes of the line graphs are in logarithmic scale. The clear advantage of MMT and DMMT over the random algorithms is apparent in each of the metrics. DMMT outperforms the centralized algorithms in reducing the average maximum communication load.
3.8.5 Experimental Results
Homogeneous Setups
We compared the performance of the five algorithms over the Syn-HM and Air-HM data
sets. Results are presented in Figure 3.7. In the average communication cost, RLN,
RLG, MMT and DMMT perform similarly and are orders of magnitude away from the
Naive algorithm. The graphs of all algorithms, except the Naive, closely converge to
the graph of the average number of violations. The number of violations defines the
minimum communication cost of the resolution, and it increases linearly with the size
of the network. This reinforces the hypothesis that in homogeneous setups, no node
is clearly preferable to any other, and the choice of the resolving set can be made at
random. What really matters is the number of nodes participating in the resolution,
as suggested by the lower bounds in Section 3.5.2. The advantage of DMMT, namely
that the violating nodes handle the resolution themselves and do not report their local
vectors, is not reflected in the average communication cost. This is explained by the
graphs of the average size of the resolving set: since DMMT resolves each of the local
violations independently, it requires more resolving nodes than the centralized algorithms.
For all the algorithms, we observe a growth in the graphs of the average size of the
resolving set. However, since the number of violating nodes increases, one might instead
expect the number of resolving nodes to decrease. We reconcile this apparent contradiction
by noting that in both data sets the local violations occur in the positive directions of the
axes, and thus they virtually never resolve each other.
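A minimal numeric illustration of this effect, assuming an average query with a per-dimension upper bound (the bound and the values are made up):

```python
# An average query with an upper bound B: a violation above the bound can
# be balanced out by a node holding a value well below it, but two
# violations in the same (positive) direction cannot balance each other.
B = 10.0

def resolved(reported_values):
    """Resolved if the average of the reported values is back under B."""
    return sum(reported_values) / len(reported_values) <= B

assert resolved([12.0, 7.0])       # opposite-direction values cancel out
assert not resolved([12.0, 11.0])  # same-direction violations do not
```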
The graphs of the average latency reflect the differences between the algorithms
in the expansion rate of the resolving set. RLN extends the resolving set linearly and
exhibits the worst latency, as its graph diverges exponentially. RLG, MMT and DMMT
all extend the resolving set exponentially, yet MMT starts with as many nodes as there
are violating nodes. The graphs of RLG and DMMT diverge logarithmically, while the
graph of MMT shows a slow decay.
As expected, in the average maximum communication load, DMMT outperforms
the centralized algorithms, which demonstrate an exponential growth.
Heterogeneous Setups
We compared the performance of the five algorithms over the Syn-HT and RCV-HT data
sets. Results are presented in Figure 3.8. The clear advantage of MMT and DMMT
over the random algorithms is apparent in each of the metrics. Due to the diversity
of the nodes in the heterogeneous setups, the random algorithms selected nodes that
were of no use in the resolution. MMT and DMMT, on the other hand, selected
the most relevant nodes according to the preconstructed MMTree. Both MMT and
DMMT were successful in reducing the communication cost and the latency almost to
the minimum. Nevertheless, DMMT still outperforms MMT in reducing the average
maximum communication load. In addition, the savings in reporting the local violations
to the coordinator, and the ability to tailor a different size resolving set to each violation
independently, may explain the advantage of DMMT in the other metrics as well.
3.9 Chapter Conclusions
This chapter focused on minimizing the communication required for handling local
constraint violations in distributed threshold monitoring. The key insight was that
when there is no global violation, the local violations can typically be resolved without
collecting the entire network data.
We presented a formal and precise condition for resolving the violations by a set of
nodes. We showed that finding the minimum resolving set is NP-hard, and proposed
a general approach that incrementally collects the resolving set. The latency of the
process should be taken into account as it determines how long it takes to alert the
system about a global violation.
We distinguished between two types of networks: homogeneous and heterogeneous.
The network types are related to the correlation between the data distributions of
the nodes. We focused especially on the variance of the data in the different dimensions,
because it indicates the directions in which a node has a greater tendency to violate.
Consequently, it tells us where the safe zone of the node is expected to have greater
flexibility in resolving violations. We assumed that in homogeneous networks, no node
would be clearly preferable over another in resolving violations. On the other hand, in
heterogeneous networks, a careful selection of the resolving nodes can be crucial. These
assumptions were reinforced by the lower bounds we presented on the probability for
violation resolution. In homogeneous networks, the size of the resolving set was the
deciding factor, rather than the identity of its members.
We presented violation resolution algorithms for homogeneous (RLG) and heteroge-
neous (MMT) setups. Both algorithms guarantee a latency that is logarithmic in the
size of the network in the rare case of a global violation. Experimental results with both
synthetic and real-life data sets showed that, in homogeneous setups, both algorithms
reduced the average communication cost almost to the minimum, and reduced the
average latency as well. Due to its simplicity and speed, RLG is preferable. In the
heterogeneous setups, however, the superiority of MMT is evident. In addition, if the
infrastructure of the network allows it, using DMMT should be considered, in order to
avoid the load on the coordinator that is created in the centralized algorithms.
Chapter 4
Conclusions
In this thesis we presented techniques for reducing communication when monitoring
general functions over a distributed stream network. The techniques were based on
the idea of Safe Zones (SZs): each node of the system is assigned a Safe Zone (SZ), and is
asked to communicate only when the data it observes drifts out of this SZ. We make
sure that as long as each node is in its SZ, the global "bad" event (threshold crossing)
cannot have happened. When a node observes data that is outside its SZ, we say that
a local violation has happened, and a violation resolution algorithm is initiated in order
to efficiently detect whether the global "bad" event happened. The expected number of
messages we save using our techniques is proportional to (1) the probability that the
data of the nodes reside inside their respective SZs, and to (2), in cases of local violations,
the expected number of nodes involved in the violation resolution process. Geometric
and combinatorial tools were used in order to come up with practical algorithms that
attempt to minimize the number of messages sent during the monitoring protocol.
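From a single node's point of view, the protocol above can be sketched as follows; the SZ membership test and the reporting callback are placeholders for whatever SZ shape and resolution algorithm are in use.

```python
def run_node(stream, in_safe_zone, report_violation):
    """Node-side loop of the SZ protocol: stay silent while the local data
    remains inside the SZ, and send a message only on a local violation
    (which triggers the separate violation resolution algorithm)."""
    messages = 0
    for v in stream:
        if not in_safe_zone(v):
            report_violation(v)  # hand off to violation resolution
            messages += 1
    return messages

# Illustrative run with a one-dimensional interval SZ [0, 5]: only the
# two out-of-zone samples cost a message.
violations = []
sent = run_node([1.0, 4.2, 6.3, 2.0, -0.5],
                in_safe_zone=lambda v: 0.0 <= v <= 5.0,
                report_violation=violations.append)
print(sent, violations)  # 2 [6.3, -0.5]
```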
In Chapters 1 and 2 we presented our algorithms for assigning SZs to the nodes.
We defined the problem of finding the optimal SZs – the SZs that will save the maximal
number of messages – as an optimization problem, and presented two approaches for
solving it: the geometric "Minkowski Sum" approach of Chapter 1, which is more
appropriate for continuous data (e.g., sensor data) and monitoring functions that
are defined over the sum or average of the local vectors; and the discrete "Biclique"
approach of Chapter 2, which can handle arbitrary monitoring functions when the data
is taken from a discrete set of values. In contrast to previous solutions, which involved a
cover of the entire convex hull of the local data vectors, our techniques focus on direct
computation of safe zones for the nodes. Consequently, SZs are more flexible than the
constraints introduced in previous work, as they fit the data distributions much better.
In Chapter 3, we presented our algorithms for violation resolution. The key insight
was that when there is no global violation, the local violations can typically be resolved
without collecting the entire network data. We distinguished between "homogeneous"
and "heterogeneous" network types, and for each type we suggested a different strategy
for gathering data from the nodes for the resolution process (i.e., determining whether
the "bad" thing really happened). We used a maximum matching algorithm to find a
good strategy efficiently.
The applicability of these techniques to various real problems was demonstrated in exper-
iments (Chapters 1 and 3), which also showed their advantage. SZs were implemented
and tested on both real-life and synthetic data using simple families of shapes, proving
that the paradigm can reduce communication volume by orders of magnitude over
previous work.
Bibliography
[ABC09] C. Arackaparambil, J. Brody, and A. Chakrabarti. Functional moni-
toring without monotonicity. In ICALP (1), pages 95–106, 2009.
[ADNR07] S. Agrawal, S. Deb, K. V. M. Naidu, and R. Rastogi. Efficient detection
of distributed constraint violations. In ICDE, pages 1320–1324, 2007.
[AKT09] A. Abbasi, A. Khonsari, and M. Sadegh Talebi. Flooding-assisted
threshold assignment for aggregate monitoring in sensor networks. In
ICDCN, 2009.
[AM04] A. Arasu and G. S. Manku. Approximate counts and quantiles over
sliding windows. In PODS, 2004.
[BO03a] B. Babcock and C. Olston. Distributed top-k monitoring. In SIGMOD,
pages 28–39. ACM, 2003.
[BO03b] B. Babcock and C. Olston. Distributed top-k monitoring. In SIGMOD
Conf., pages 28–39, New York, NY, USA, 2003. ACM Press.
[CG05] G. Cormode and M. N. Garofalakis. Sketching streams through the
net: Distributed approximate query tracking. In VLDB, pages 13–24,
2005.
[CG09] G. Cormode and M. N. Garofalakis. Histograms and wavelets on
probabilistic data. In ICDE, 2009.
[CGMR05] G. Cormode, M. N. Garofalakis, S. Muthukrishnan, and R. Rastogi.
Holistic aggregates in a networked world: Distributed tracking of
approximate quantiles. In SIGMOD, pages 25–36, 2005.
[Cha87] B. Chazelle. Approximation and decomposition of shapes. 1987.
[CMY08] G. Cormode, S. Muthukrishnan, and K. Yi. Algorithms for distributed
functional monitoring. In SODA, pages 1076–1085, 2008.
[CMY11] G. Cormode, S. Muthukrishnan, and K. Yi. Algorithms for distributed
functional monitoring. ACM Transactions on Algorithms, 7(2):21,
2011.
[CMYZ10] G. Cormode, S. Muthukrishnan, K. Yi, and Q. Zhang. Optimal
sampling from distributed streams. In PODS, pages 77–86, 2010.
[CMZ06] G. Cormode, S. Muthukrishnan, and W. Zhuang. What’s different:
Distributed, continuous monitoring of duplicate-resilient aggregates
on data streams. In ICDE, pages 20–31, 2006.
[CMZ07] G. Cormode, S. Muthukrishnan, and W. Zhuang. Conquering the
divide: Continuous clustering of distributed data streams. In ICDE,
2007.
[Cor11] G. Cormode. Continuous distributed monitoring: a short survey. In
Proceedings of the First International Workshop on Algorithms and
Models for Distributed Event Processing, AlMoDEP ’11, pages 1–10,
New York, NY, USA, 2011. ACM.
[DGGR04] A. Das, S. Ganguly, M. N. Garofalakis, and R. Rastogi. Distributed
set expression cardinality estimation. In VLDB, pages 312–323, 2004.
[Edm65] J. Edmonds. Paths, trees, and flowers. Canadian Journal of Mathe-
matics, 17(3):449–467, 1965.
[GRM10] R. Gupta, K. Ramamritham, and M. K. Mohania. Ratio threshold
queries over distributed data sources. In ICDE, pages 581–584, 2010.
[GT01] P. B. Gibbons and S. Tirthapura. Estimating simple functions on
union of data streams. In SPAA, 2001.
[GT02] P. B. Gibbons and S. Tirthapura. Distributed streams algorithms for
sliding windows. In SPAA, 2002.
[HNG+06] L. Huang, XuanLong Nguyen, M. N. Garofalakis, Michael I. Jordan,
Anthony D. Joseph, and Nina Taft. In-network pca and anomaly
detection. In NIPS, pages 617–624, 2006.
[HNG+07] L. Huang, X. Nguyen, M. N. Garofalakis, J. M. Hellerstein, M. I.
Jordan, A. D. Joseph, and N. Taft. Communication-efficient online
detection of network-wide anomalies. In INFOCOM, 2007.
[Hoe63] W. Hoeffding. Probability inequalities for sums of bounded random
variables. Journal of the American Statistical Association, pages 13–30,
1963.
[JDZ+07] N. Jain, M. Dahlin, Y. Zhang, D. Kit, P. Mahajan, and P. Yalagandula.
Star: Self-tuning aggregation for scalable monitoring. In VLDB, pages
962–973, 2007.
[JW04] A. Jain, J. M. Hellerstein, S. Ratnasamy, and D. Wetherall. A wakeup call
for internet monitoring systems: The case for distributed triggers. In
HotNets-III, 2004.
[KCR06] R. Keralapura, G. Cormode, and J. Ramamirtham. Communication-
efficient distributed monitoring of thresholded counts. In SIGMOD,
2006.
[KG07] M. R. Kurpius and A. H. Goldstein. Gas-phase chemistry dominates
O3 loss to a forest, implying a source of aerosols and hydroxyl radicals
to the atmosphere. Geophysical Research Letters, 30(7), 2007.
[KL10] E. Kiciman and V. B. Livshits. Ajaxscope: A platform for remotely
monitoring the client-side behavior of web 2.0 applications. TWEB,
4(4), 2010.
[LYJ09] F. Li, K. Yi, and J. Jestes. Ranking distributed probabilistic data. In
SIGMOD, 2009.
[LYRL04] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li. Rcv1: A new bench-
mark collection for text categorization research. Journal of Machine
Learning Research, 5(4):361–397, 2004.
[Mat92] J. Matousek. Range searching with efficient hierarchical cuttings. In
SOCG, pages 276–285, 1992.
[MEA01] M. Elad, A. Tal, and S. Ar. Content based retrieval of VRML objects.
In EG Multimedia, pages 97–108, 2001.
[MF02] S. Madden and M. J. Franklin. Fjording the stream: An architecture
for queries over streaming sensor data. In ICDE, pages 555–566, 2002.
[MGR95] M. Meyer, Y. Gordon, and S. Reisner. Constructing a polytope to
approximate a convex body. Geometriae Dedicata, 57:217–222, 1995.
[MSDO05] A. Manjhi, V. Shkapenyuk, K. Dhamdhere, and C. Olston. Finding
(recently) frequent items in distributed data streams. In ICDE, 2005.
[MTW05] S. Michel, P. Triantafillou, and G. Weikum. Klee: a framework for
distributed top-k query algorithms. In VLDB, pages 637–648, 2005.
[OJW03] C. Olston, J. Jiang, and J. Widom. Adaptive filters for continuous
queries over distributed data streams. In SIGMOD Conf., pages
563–574, 2003.
[RN04] M. Rabbat and R. D. Nowak. Distributed optimization in sensor
networks. In IPSN, pages 20–27, 2004.
[RSW02] T. Rose, M. Stevenson, and M. Whitehead. The Reuters Corpus
Volume 1 - from Yesterday’s News to Tomorrow’s Language Resources.
In Proceedings of the Third International Conference on Language
Resources and Evaluation, Las Palmas de Gran Canaria, May 2002.
[Ser82] J.P. Serra. Image Analysis and Mathematical Morphology. Academic
Press, 1982.
[SKSS10] G. Sagy, D. Keren, I. Sharfman, and A. Schuster. Distributed threshold
querying of general functions by a difference of monotonic representa-
tion. PVLDB, 4(2):46–57, 2010.
[SR08] S. Shah and K. Ramamritham. Handling non-linear polynomial queries
over dynamic data. In ICDE, 2008.
[SSK06] I. Sharfman, A. Schuster, and D. Keren. A geometric approach to
monitoring threshold functions over distributed data streams. In
SIGMOD, 2006.
[SSK07a] I. Sharfman, A. Schuster, and D. Keren. Aggregate threshold queries
in sensor networks. In IPDPS, pages 1–10, 2007.
[SSK07b] I. Sharfman, A. Schuster, and D. Keren. A geometric approach to
monitoring threshold functions over distributed data streams. ACM
Trans. Database Syst., 32(4), 2007.
[SSK08] I. Sharfman, A. Schuster, and D. Keren. Shape sensitive geometric
monitoring. In PODS, 2008.
[Tro10] J. A. Tropp. User-friendly tail bounds for sums of random matrices.
ArXiv e-prints, April 2010.
[TZWL08] L. Tian, P. Zou, F. Wu, and A. Li. Research on communication-
efficient method for distributed threshold monitoring. In WAIM,
pages 441–448, 2008.
[WBK09] R. Wolff, K. Bhaduri, and H. Kargupta. A generic local algorithm
for mining data streams in large distributed systems. IEEE Trans.
Knowl. Data Eng., 21(4):465–478, 2009.
[WDS09] F. Wuhib, M. Dam, and R. Stadler. Gossiping for threshold detection.
In Proceedings of 11th IFIP/IEEE International Symposium on In-
tegrated Network Management, Long Island, NY, USA, 2009. Work
done within the SICS Center for Networked Systems.
[web] The European air quality database,
http://dataservice.eea.europa.eu/dataservice/.
[YSJ+00] B. K. Yi, N. Sidiropoulos, T. Johnson, H. V. Jagadish, C. Faloutsos,
and A. Biliris. Online data mining for co-evolving time sequences. In
ICDE, pages 13–22, 2000.
[YZ09] K. Yi and Q. Zhang. Optimal tracking of distributed heavy hitters
and quantiles. In PODS, 2009.
Abstract (Hebrew)

The use of distributed data stream systems is very common in many branches of technology. In these systems, a large number of nodes, possibly widely dispersed in space, manage continuous streams of dynamic, high-dimensional data. The goal of such systems is to detect, in real time, some global phenomenon based on the entire body of data held by the nodes. Examples include predicting the state of the stock market by monitoring stocks, detecting fires, warning of natural disasters via sensor systems, and more.

One of the basic applications in distributed data stream systems is threshold monitoring. The goal in this case is to issue an alert whenever the value of a given scoring function, computed at each time unit over the entire body of data held by the nodes, crosses a fixed, predefined threshold. The threshold crossing, in effect, indicates the global phenomenon that we wish to detect and alert on when it occurs. Many applications, including the examples mentioned above, can be viewed as instances of threshold monitoring.

Formally, the threshold monitoring problem over a data stream distributed over n nodes can be described as follows: we are given a function f, vectors v1, v2, ..., vn, where vi is the collection or vector of data received by the i-th node, and a threshold value τ. The vectors are dynamic, i.e., their values may change at every time unit, and the system must alert the moment the value of f(v1, v2, ..., vn) crosses the threshold τ – that is, passes from a value smaller than the threshold to a value larger than it, or vice versa. This simple threshold-crossing condition, which may seem naive and of little use, suffices to represent a very wide variety of "bad" events that we would like dynamic distributed systems to alert on.

Consider, for example, a system for monitoring the air quality in some country, composed of a hundred sensors distributed over the country's cities, n = 100, where each sensor measures, every second, the concentrations of certain air pollutants in the region in which it is located at that second. The function f can express some measure of the level of air pollution, for example the average concentration of a certain pollutant over the country, or the ratio between the averages of two different pollutants. The air pollution monitoring system would then want to alert when the value of the function – the pollution measure – exceeds a certain threshold.

A common approach today to the distributed monitoring problem is to centralize, continuously or at fixed time intervals, all the data received in the network, or summaries of this data, thereby reducing the distributed problem to a non-distributed one. Such centralization, however, may simply be infeasible in practice due to the enormous volume and dynamic nature of the data, which would cause heavy loads and delays in the communication network. Moreover, in the case of a sensor network in which every sensor runs on a battery, this large amount of communication would quickly drain the batteries.

In this thesis we present work proposing techniques for saving communication when monitoring a distributed network. The techniques are based on the following simple observation: the nodes of the system need not send a message for every new data item that arrives; it suffices to send messages only when something "interesting" happens. We formalize this observation via the idea of "Safe Zones". Each node of the system receives a "Safe Zone", and is asked to communicate only when it receives a new data item that takes it out of its Safe Zone. We ensure that as long as every node is inside its Safe Zone, the global "bad" event – the threshold crossing – cannot occur, and hence no communication is needed at all. However, in the case of a local violation, when the data sampled by some node in the system is no longer inside the Safe Zone allocated to it, nothing is guaranteed with respect to the threshold crossing, and communication is unavoidable. The central challenge in this approach is therefore to set Safe Zones that reduce, as much as possible, the probability of a local violation.

Many works exist for the simple cases in which the monitored function is linear or monotonic. However, many interesting and important functions are neither linear nor monotonic, and these works cannot be extended to handle general functions. We propose novel, generic techniques for distributed monitoring of arbitrary functions. We introduce and formalize the notion of Safe Zones, and then define the problem of finding the optimal Safe Zones – the Safe Zones whose use yields, in expectation, the maximal saving in communication. We first show that this problem is hard to solve, so there is no choice but to look for heuristic solutions that come close to the optimum. We attack the problem with geometric tools, and present an algorithm for computing communication-efficient Safe Zones that constitute some geometric shape, e.g., a convex polygon. We then present a different approach to the problem of finding the optimal Safe Zones, which is more suitable when the data arriving at the nodes is of a discrete nature. In this method, the Safe Zone each node receives is simply a subset of the data values it may receive in the future, rather than a geometric shape as in the previous method. This second method is even more general than the first, since the function need not be defined over the sum of the data of all the nodes, but can be an arbitrary function of them.

Finally, we present communication-efficient algorithms that determine whether the global "bad" event – the threshold crossing – has occurred when one of the nodes leaves its Safe Zone – a local violation. Unfortunately, violations of the Safe Zones, good as they may be, cannot be avoided. Moreover, the chance of a local violation grows as the number of nodes in the system grows. Even though in most cases local violations indicate a merely local phenomenon, they may indicate a threshold crossing; communication is therefore, as noted, unavoidable. We show that in most cases the violations can be resolved without involving all the nodes in the system. In particular, in cases where the threshold is not crossed, this can be deduced from the data and Safe Zones of the set of violating nodes together with additional, non-violating nodes, which we call the resolving set. We present, for the first time, a precise and formal statement of the conditions required for resolution, and then present methods for searching for a resolving set that is as small as possible, in order to resolve the local violation with a minimal number of messages.

The applicability of our techniques to several real-world problems is demonstrated by experiments, which also show their advantage, by orders of magnitude, over the previous works in the field. The experiments and the theoretical analysis show that the advantage of our methods becomes more and more pronounced as the dimension of the data grows.