Monitoring General Functions in Distributed Systems with Minimal Communication
Amir Abboud
Technion - Computer Science Department - M.Sc. Thesis MSC-2012-21 - 2012
Monitoring General Functions in Distributed Systems with Minimal Communication
Research Thesis
Submitted in partial fulfillment of the requirements
for the degree of Master of Science in Computer Science
Amir Abboud
Submitted to the Senate
of the Technion — Israel Institute of Technology
Tammuz 5772 Haifa July 2012
This research was carried out under the supervision of Prof. Assaf Schuster and Prof.
Daniel Keren, in the Faculty of Computer Science.
Acknowledgements
I would first of all like to extend my sincere gratitude to my supervisors, Prof. Assaf
Schuster and Prof. Daniel Keren, for their enthusiastic encouragement and useful
critiques of this research work. Their wisdom, knowledge, and commitment to the
highest standards have certainly motivated me.
Expressions of gratitude are also in order for my colleagues in the research group,
Tsachi, David, Guy and Mario, whose useful comments and constructive ideas have
contributed largely to this work. Our discussions have always been stimulating, engaging
and highly enjoyable.
I would also like to thank my parents, my grandparents, my brother and my sister
who have always supported, encouraged, and believed in me.
Finally, I would like to thank my friends for standing by me, and cheering me up
through it all.
The Technion’s funding of this research is hereby acknowledged.
Contents
List of Figures
Abstract 1
1 Safe Zones: An Efficient Approach to Distributed Monitoring 3
1.1 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.1 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.2 Preliminaries: Minkowski Average . . . . . . . . . . . . . . . . . 8
1.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4 Overview of the Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4.1 Safe-Zone Allocation as an Optimization Problem . . . . . . . . 11
1.4.2 The Parametric Family of Allowable Safe Zone Shapes . . . . . . 13
1.4.3 Convexity of S and the Safe Zones . . . . . . . . . . . . . . . . . 13
1.5 Safe Zone Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.5.1 Computing the Target Function . . . . . . . . . . . . . . . . . . . 15
1.5.2 Checking the Constraints . . . . . . . . . . . . . . . . . . . . . . 15
1.6 Hierarchical Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.7 The Complexity of Computing Optimal Safe Zones . . . . . . . . . . . . 18
1.8 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.8.1 Data, Methods, and Monitored Functions . . . . . . . . . . . . . 21
1.8.2 Ratio Queries with Triangular Safe Zones . . . . . . . . . . . . . 22
1.8.3 Improvement over GM Algorithm . . . . . . . . . . . . . . . . . . 23
1.8.4 Ratio Queries: Hierarchical Implementation . . . . . . . . . . . 23
1.8.5 Chi-square monitoring in 5 dimensions with axis-aligned box-
shaped safe zones . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
1.8.6 3-Dimensional Data, Quadratic Function, Polygonal Safe Zones . 26
1.8.7 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
1.8.8 Improvement Factor and Dimensionality . . . . . . . . . . . . . . 28
1.9 Chapter Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2 Discrete Safe Zones: Biclique Approach 29
2.1 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.3 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.4 Biclique Formalization - k = 2 . . . . . . . . . . . . . . . . . . . . . . . . 32
2.4.1 Greedy Heuristic . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.4.2 Linear Programming . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.5 Generalized Biclique Formalization . . . . . . . . . . . . . . . . . . . . . 34
2.6 Hierarchical Heuristic . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.6.1 Classes of functions . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.6.2 Pruning nodes in the Biclique problem . . . . . . . . . . . . . . . 38
2.7 Advantages over the geometric Safe Zones . . . . . . . . . . . . . . . . . 39
3 Violation Resolution in Distributed Stream Networks 41
3.1 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.4 Violation Resolution and Minimum Resolving Set . . . . . . . . . . . . . 44
3.4.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.4.2 Resolving Local Violations . . . . . . . . . . . . . . . . . . . . . 45
3.4.3 Running Examples . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.5 Generic Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.5.1 Minimum Resolving Set is NP-Hard . . . . . . . . . . . . . . . . 51
3.5.2 Probabilistic Analysis of the Algorithm . . . . . . . . . . . . . . 52
3.6 Homogeneous Data Instance . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.7 Heterogeneous Data Instance . . . . . . . . . . . . . . . . . . . . . . . . 57
3.7.1 The Heterogeneous Data Challenge . . . . . . . . . . . . . . . . . 57
3.7.2 Maximum Matching Tree Algorithm . . . . . . . . . . . . . . . . 57
3.7.3 Maximum Matching Tree Construction . . . . . . . . . . . . . . 58
3.7.4 Distributed Variant of MMT . . . . . . . . . . . . . . . . . . . . 61
3.8 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.8.1 Data sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.8.2 Scoring Functions and Local Constraints . . . . . . . . . . . . . . 62
3.8.3 Performance Metrics . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.8.4 Compared Instances . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.8.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . 65
3.9 Chapter Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4 Conclusions 69
Hebrew Abstract i
List of Figures
1.1 Air quality example with four sensors, two pollutants, function max, and
threshold 1. The local data vectors are depicted by red circles, their
convex hull is outlined by a dashed line, and the global average vector is
depicted by a cross sign. The region of vectors whose value corresponds
to good air quality is colored gray. The maximal average value for
either pollutant is 0.5125; hence it is easy to construct constraints that
completely avoid communication (outlined for each pollutant by dotted
lines along the respective axis). However, as the figure clearly shows,
any attempt to cover the convex hull of the sensor readings will result in
constraint violations and unnecessary communication. . . . . . . . . . . 7
1.2 Illegal and legal safe zones. Top: left and right depict two-dimensional
data at two nodes A and B. The p.d.f at both nodes is a Gaussian
(normal) distribution. S, which must contain the Minkowski average of
the two safe zones, is the dotted ellipse in the middle. The point cloud in
the middle is a sample of the global data vectors, obtained by averaging
the data vectors at both nodes. The allowable family of safe zone shapes
consists of four-vertex polygons. The depicted safe zones (outlined in
black) fit the local data well, but are alas illegal, since their Minkowski
average (continuous dark line in the middle) is not inside S. Bottom:
This time the safe zones are legal: they satisfy the constraint (1.2). . . . 12
1.3 The half-plane approach. Top: S is equal to H, which is defined by θ
and b. Bottom: a supporting hyper-plane is used for every safe zone
candidate Si. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.4 Hierarchical clustering example. Data is taken from [web]. The allowable
family of safe zone shapes consists of pentagons. Each node’s data is
represented by a scatter diagram at the bottom. The yellow safe zones
are those computed for the original nodes. Clusters of the data of node
pairs are represented as supernodes in the middle row. The two safe zones
corresponding to the supernodes are colored green (as is their Minkowski
average, depicted inside S). The root is the diagram at the top row. S is
the blue ellipse in the root (the root node was scaled for better view). An
ellipse corresponds to monitoring a quadratic function (see also Section
1.8.6). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.5 A schematic example of the proof of equivalence of the safe zone and
the biclique problems (Theorem 1.3). The bipartite graph (top), nodes
(middle), and S (bottom). . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.6 Top: two of the GMM elements super-imposed on the data. Middle:
typical local concentrations of NO and NO2 as a function of time. Bottom:
typical behavior of the local ratio between NO and NO2 as a function of
time. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.7 Triangular safe zones used for ratio monitoring. . . . . . . . . . . . . . . 22
1.8 Example of optimal safe zones with four nodes. S is the dark triangle;
safe zones are outlined in green. . . . . . . . . . . . . . . . . . . . . . . . 23
1.9 Comparison of safe zones (green line) to GM (blue line) in terms of the
number of violations, up to 10 nodes. For more nodes (up to 200 were
used in this experiment), the average improvement of safe zones over GM
was by a factor of 17.5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
1.10 Comparison of the safe zone method to GM in terms of points which
cause a violation. At each node, the set S is depicted (dark triangle), the
safe zone (green triangles), and a sample of the data points (red dots).
The points which satisfy the GM constraints are depicted in blue. The
advantage of the safe zone method over GM is clear. . . . . . . . . . . . 24
1.11 Clustering example. Each row depicts two nodes from one cluster. . . . 24
1.12 Running time (in logarithmic scale) for “flat” – direct optimization over
all the nodes (blue) vs. hierarchical clustering (green). . . . . . . . . . . 25
1.13 Plots of the chi-square function for two nodes, an “oscillating” one (highly
varying data) in green, and a more stable node (in blue). Horizontal axis
stands for time (in hours), vertical axis for the chi-square value. . . . . . 26
1.14 The safe zones assigned to the two nodes in Fig. 1.13. The “oscillating”
node (top) is assigned a much larger safe zone, to account for its higher
variability. Since the data was 5-dimensional, only a 3-dimensional
projection is depicted, corresponding to the pollutants NO, NO2, and
SO2. Pink dots denote samples from the data, safe zones are in green. . 27
1.15 Comparing the number of violations between GM and the safe zone
method, for a period of 1,000 hours. The allowable family of safe zone
shapes used here consisted of 5-dimensional axis-aligned boxes. Horizontal
axis is the threshold for the chi-square function, vertical axis is the ratio
between numbers of violations. . . . . . . . . . . . . . . . . . . . . . . . 27
1.16 3D example. S is the pink ellipsoid, the safe zones are polyhedra with
eight vertices each (in pale blue), their Minkowski average is in green.
The axes stand for concentrations of NO, NO2, SO2. . . . . . . . . . . 27
2.1 Non-Convex SZs example: Plotted are the data samples of both
nodes (right and left clouds), and the "legal" global data points (center)
which are the averages of every pair of points, one from each node, such
that the average resides in the admissible region S. This S corresponds
to a function s.t. f(x, y) = β ⇐⇒ c1 ≤ (x − a1)² + (y − a2)² ≤ c2. The
data points colored in green, at the nodes, are points that were in the
resulting SZ. The SZs are non-convex sets. The green points in the center
are all the averages of a pair of points, one from each SZ. . . . . . . . . 40
3.1 A monitoring system, consisting of 3 monitoring nodes and a coordinator,
in a 2-dimensional space. The safe zones are given as rectangles in the
plane and the vectors are marked by dots. It’s easy to see that the average
of every 3 vectors, taken respectively from the local safe zones, resides
within the global safe zone (S). At t = 0, all the local vectors reside
within their respective safe zones and, consequently, the global vector is
also inside the global safe zone. In this case, none of the monitoring nodes
reports its local vector to the coordinator. At t = 1, a local violation has
occurred at N1 (v1¹ ∉ S1). N1 would now report its local vector to NC to
seek resolution. NC must poll N2 or N3 (or both) for their local vectors
in order to verify that v1 ∈ S. . . . . . . . . . . . . . . . . . . . . . . . . 46
3.2 Local violations in homogeneous and heterogeneous systems . . . . . . . 48
3.3 Hoeffding’s lower bound in a homogeneous system . . . . . . . . . . . . 54
3.4 A maximum matching tree over an 8-node system in a 2-dimensional
space. Every level of the tree defines a partition of N1, . . . , N8. The
distribution of the average vector and the safe zone are marked by a
cloud of dots and a rectangle, respectively, for every partition set. The
root node represents the distribution of the global vector and the global
safe zone. Note that, indeed, global violations rarely occur. There are
two types of nodes: type-1 nodes, which have high variance in the 1st
dimension and low variance in the 2nd dimension, and type-2 nodes,
which have high variance in the 2nd dimension and low variance in the 1st
dimension. 4 nodes of each type comprise the leaves of the tree, denoted
by the double outlined ellipses. As expected, MMT first pairs nodes of
the same type. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.5 An execution example of MMT over an 8-node system. At t = 1, a
snapshot of the system is given at the bottom. First, the violating nodes
(N2, N7) report their local vectors to the coordinator. Upon failure to
resolve the violation, the resolving set is extended with the non-violating
nodes from the sets containing N2, N7 in the 1st level of the MMTree. If
we consider the MMTree in Figure 3.4, nodes N4, N6 are polled for their
local vectors. At this point the violations are resolved and the algorithm
terminates. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.6 Data distributions and safe zones of 2 randomly chosen nodes from
each data set. In Syn-HT and RCV-HT the data are projected to a
3-dimensional space. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.7 Experimental results over Syn-HM (left) and Air-HM (right) homogeneous
data sets. The vertical axes of the line graphs are in logarithmic scale.
In the average communication cost, all algorithms (except the Naive)
approach the minimum, as denoted by the number of violations. The
average latency reflects the differences between the algorithms in the
expansion rate of the resolving set. DMMT outperforms the centralized
algorithms in reducing the average maximum communication load. . . . 64
3.8 Experimental results over Syn-HT (left) and RCV-HT (right) hetero-
geneous data sets. The vertical axes of the line graphs are in logarithmic
scale. The clear advantage of MMT and DMMT over the random al-
gorithms is apparent in each of the metrics. DMMT outperforms the
centralized algorithms in reducing the average maximum communication
load. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Abstract
In today’s connected, data-driven world, traditional database systems are replaced
by data stream systems which are fundamentally distributed. These large-scale and
widespread networked systems generate high-volume streams of data that often require
processing in real time. Examples include network traffic monitoring systems,
real-time analysis of financial data, distributed intrusion detection systems and sensor
networks. A principal concern within these distributed systems is threshold monitoring:
Determining whether the value of a certain function, evaluated over network-wide data,
crosses a certain threshold that may indicate a global phase change which calls for some
action.
Formally, the threshold monitoring problem over a data stream system, consisting of
n nodes, can be described as follows: Given are a function f, vectors v1, v2, ..., vn, where
vi is a data tuple (vector) at the ith node, and a threshold τ . The vectors are dynamic,
and the system needs to alert whenever the value of f(v1, v2, ..., vn) crosses τ (that is,
either changes from a value larger than τ to a value smaller than τ , or vice-versa). This
innocuous-looking condition of threshold crossing covers a very wide range of alerts that
complex, distributed, dynamic systems must trigger.
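The crossing semantics defined above can be sketched in a few lines. This is a purely illustrative, centralized reference (all names are hypothetical), not one of the distributed algorithms developed in this thesis:

```python
import numpy as np

def crossed(f, vectors, tau, prev_above):
    """Did f over the global average move across tau since the last check?"""
    global_avg = np.mean(vectors, axis=0)      # average of the n local vectors
    now_above = f(global_avg) > tau
    return now_above != prev_above, now_above

f = lambda v: float(np.max(v))                 # an example scoring function
vecs_t0 = [np.array([0.5, 0.05]), np.array([1.0, 0.05])]  # avg score 0.75
vecs_t1 = [np.array([0.9, 0.05]), np.array([1.5, 0.05])]  # avg score 1.20
hit0, above0 = crossed(f, vecs_t0, tau=1.0, prev_above=False)  # no alert
hit1, above1 = crossed(f, vecs_t1, tau=1.0, prev_above=above0) # alert: crossed up
```

The alert fires on the second update because the value of f over the average moved from below τ to above it.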
A common approach to the monitoring problem is to continuously or periodically
centralize all data or data summaries, thus transforming a distributed problem into
a centralized one. However, such centralization may simply be infeasible in realistic
settings, due to the sheer volume and dynamic nature of the data (implying huge
communication overheads and latencies, as well as rapid energy drain, in the case of
sensors).
In this thesis we will present three works which propose techniques for reducing
communication when monitoring a distributed network. The techniques are based on
the following simple observation: nodes in the system should not send a message every
time new data arrives, but rather send messages only when "interesting" things happen.
We formulate this observation using an idea of Safe Zones (SZs). Each node of the
system gets a Safe Zone (SZ), and is asked to communicate only when the data it
observes drifts out of this SZ. We will ensure that as long as each node is in its SZ,
the global "bad" event (threshold crossing) cannot have occurred.
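The per-node side of this observation reduces to a pure membership test. Below is a minimal sketch, assuming axis-aligned box safe zones (one of the shape families used later in the thesis); the function names are hypothetical:

```python
import numpy as np

def make_box_sz(lower, upper):
    """A safe zone given as an axis-aligned box; only membership is needed."""
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    return lambda v: bool(np.all(lower <= v) and np.all(v <= upper))

def node_step(in_sz, local_vector):
    """Local test on every update: communicate only on a safe-zone breach."""
    return not in_sz(local_vector)   # True means "report to coordinator"

sz = make_box_sz([0.0, 0.0], [1.9, 0.1])
quiet = node_step(sz, np.array([0.5, 0.05]))   # inside the SZ: stay silent
report = node_step(sz, np.array([0.5, 0.30]))  # drifted out: communicate
```

Note that the test is local and trivially cheap, which is exactly what makes the approach suitable for thin, battery-operated nodes.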
A great deal of work exists for the limited cases in which the threshold function (f)
is either linear or monotonic. However, many functions of interest are neither linear nor
monotonic, and work on these cases cannot be extended to general functions. We
propose novel generic techniques for monitoring arbitrary functions.
In Chapter 1 we introduce and formalize the idea of SZs, define the problem of
finding the optimal SZs, and try to solve this problem with geometric tools. In Chapter
2 we present a different approach to the problem of finding the optimal SZs, which
is more appropriate when the data sampled at the nodes is of a discrete nature. In
Chapter 3 we present communication-efficient algorithms for determining whether a
global "bad" event has happened when a node drifts out of its SZ.
The applicability of these techniques to various real problems is demonstrated in
experiments, which also show their advantage, by orders of magnitude, over previous
work. Both the experiments and theoretical analysis show that this advantage increases
with the dimension of the data.
Chapter 1
Safe Zones: An Efficient
Approach to Distributed
Monitoring
1.1 Chapter Summary
Many monitoring tasks over distributed data streams can be formulated as a continuous
query using a function that is defined over the global average of data vectors derived
from the streams. The query will typically produce an alert when the value of the
function crosses a predefined threshold. A fundamental problem in efficient scalable
implementation of such threshold queries is that the data streams are distributed,
sometimes over a wide geographical region. Moving all the data to a centralized data
center for query processing may incur infeasible communication overheads and inflated
data center resource costs. In some cases it may be prohibited altogether by the sheer
aggregated size of the data, or by privacy laws. The goal is thus to enhance scalability
by processing the query locally, using as little communication and global coordination
as possible.
We present a novel scheme for communication reduction in distributed monitoring
using local constraints. Communication and global coordination are required only in the
event that the local constraints are violated by the incoming data. Our work improves
on previous work in a few critical aspects. First, whereas previous work required
constructing a “distributed cover” of the entire convex hull of the local data vectors, our
work compiles constraints that are designed to cover only the global average; further,
they are directly matched and tailored to fit the local data distribution at each stream.
The result is a dramatic decrease in the required volume of communication compared to
previous state of the art, up to two orders of magnitude in our experiments with real-life
data. Both the experiments and theoretical study suggest that the improvement factor
increases with the dimension of the data. Also, in contrast to previous work, which
necessitated complicated constraints and required enormous computational effort over
each of the streams, our scheme can use very simple constraints which incur negligible
local overhead. This latter advantage makes our new approach applicable to thin,
battery-operated sensors and cellular devices.
1.2 Introduction
The need for scalable processing of distributed data streams arises in many important
applications, such as network traffic analysis, sensor networks and complex event
processing. These systems consist of sets of (geographically distributed) nodes where
each node receives a stream of data. The task of interest is formulated as a continuous
query, whose output may change as new data arrives on the stream.
In many problems of interest, a vector is derived from the data arriving on each
stream, and the query continuously monitors a global function that is defined over the
average of the current vectors. The query produces an alert every time the function
crosses a predefined threshold. Such queries, called threshold queries, are the building
blocks for important data processing and data mining tools, including top-k queries,
anomaly detection, feature selection, decision tree construction, association rule mining,
data classification, correlation monitoring, and system monitoring [HNG+07]. Consider,
for example, the analysis of frequency moments over the union of distributed data
streams [CG05]. Here the local data are the local histograms of the incoming stream
data, and the monitored global function, defined over the average of the histograms, is
typically the Lp norm for some p.
As a simple running example, which will allow us to demonstrate both an application
scenario and the improvement over previous work, assume that sensors are deployed at
various locations in a city, measuring the concentration of air pollutants. Each sensor
maintains a vector of its readings, such as the concentrations of CO2, NO, and NO2.
We use a function over the vector of measurements in order to determine the overall air
quality, and we are interested in detecting when the air quality drops below a certain
threshold. Since the air quality in a city may change rapidly from one point to another
and abruptly over short periods of time, we are not interested in determining the air
quality at the individual sensor locations. Rather, we want to determine the relatively
stable measure of overall air quality, to which end we apply the scoring function over
the average of the measurement vectors taken at the sensors. We shall later return to
this example.
In these applications and others, the problem can theoretically be solved by moving
all the data to a central location, computing the average vector, and testing whether
the value of the monitored function crossed the threshold. Alas, that is often impossible,
due to the volume and transmission rate of the distributed streams. The goal we pursue
here is to allow scalability by reducing the global query processing to a set of local
queries, using as little communication as possible.
Many previous studies solve the problem for restricted sets of query functions; see
Section 1.3. Several works attempt to determine when the sum of a set of distributed
counts exceeds a given threshold [KCR06] and to detect frequently occurring items
known as “heavy hitters” [YZ09, MSDO05]. The use of sketches has been proposed for
reducing communication in the construction of wavelets and histograms [CG05, CG09],
and in determining quantiles [CGMR05].
In contrast to many previous works, we present a generic method that provides a
solution for any threshold function defined over the average of current stream vectors.
When processing a threshold query, communication can be reduced by breaking down the
processing task into a set of constraints on the local values held at the nodes. As long as
none of these constraints is violated, it is guaranteed that the global function value has
not crossed the threshold, and therefore no communication is necessary. Communication
is required only in the event that a constraint is breached.
In [SSK08, SSK06] a generic “geometric” scheme was proposed for constructing local
constraints. Using only local information, each node constructs a sphere, where the
union of all spheres contains the convex hull of the local vectors. The local constraint at
each node consists of verifying that the value of the function at every vector in its sphere
does not cross the threshold. Since the spheres cover the convex hull of the local vectors,
they also cover their average vector. Thus, if all the constraints are satisfied, namely,
the value of the function is below the threshold on all the spheres, then the value of the
function does not cross the threshold at any point of the convex hull, including at the
average vector.
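For later contrast with safe zones, the local constraint of this geometric scheme can be approximated by brute force. The sketch below merely samples the ball (the original scheme solves an exact constrained-optimization problem, so this is an optimistic approximation for illustration only; all names are ours):

```python
import numpy as np

def sphere_constraint_ok(f, center, radius, tau, n_samples=2000, seed=0):
    """Approximate check that f stays below tau over an entire ball.
    Sampling can miss the true maximum; the real method must bound it."""
    rng = np.random.default_rng(seed)
    d = len(center)
    # draw points uniformly in the d-dimensional ball around `center`
    dirs = rng.normal(size=(n_samples, d))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    radii = radius * rng.random(n_samples) ** (1.0 / d)
    pts = center + dirs * radii[:, None]
    return bool(np.max([f(p) for p in pts]) <= tau)

f = lambda v: float(np.max(v))
ok = sphere_constraint_ok(f, np.array([0.5, 0.5]), 0.2, tau=1.0)       # holds
bad = sphere_constraint_ok(f, np.array([0.9, 0.9]), 0.3, tau=1.0)     # breached
```

Even in this toy form, the cost of evaluating f over a whole region on every update hints at the computational burden discussed next.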
The method described in [SSK08, SSK06] has several significant drawbacks. Most
importantly, the constraints guarantee that the value of the function on the entire
convex hull of the local data vectors has not crossed the threshold, whereas we are
interested only in the value on the average vector. In fact, [SSK08] uses the convex hull
area as an optimal lower bound on the size of its sphere cover, thus inducing an upper
bound on the performance of this method. In this sense the constraints proposed in
[SSK08, SSK06] are too conservative, leading to unnecessary communication.
To illustrate this problem, we take a closer look at the air quality example given
above. Say the network consists of four sensors. Two of them are at the center of
the city, where cars are the main source of pollution, and the dominant pollutant is
pa. The other two sensors are located outside the city center, where industrial plants
are the main source of pollution, and the dominant pollutant is pb. We consider the
quality of air to be good if the concentrations of both pollutants pa and pb are below
one unit; otherwise, we consider the quality of air to be bad. In other words, if (ca, cb)
denotes the average measurement vector, where ca is the concentration of pa and cb is
the concentration of pb, then our scoring function is max(ca, cb) and our threshold value
is 1. In Section 1.8 we shall deal with real air-pollution data and more complicated
functions which are not linear, monotonic, or convex.
In the early morning hours all sensors register no pollution (a concentration of 0
units) of any type. As the day progresses, the concentration of pa at the city center
sensors rises to 0.5 and 1.5, and the concentration of pb rises to 0.05. Similarly, the
concentration of pb at the other two sensors rises to 0.5 and 1.5, and the concentration
of pa rises to 0.05. Figure 1.1 depicts this example. Note that even though the score
of the average vector has not crossed the threshold, there are significant parts of the
Figure 1.1: Air quality example with four sensors, two pollutants, function max, and threshold 1. The local data vectors are depicted by red circles, their convex hull is outlined by a dashed line, and the global average vector is depicted by a cross sign. The region of vectors whose value corresponds to good air quality is colored gray. The maximal average value for either pollutant is 0.5125; hence it is easy to construct constraints that completely avoid communication (outlined for each pollutant by dotted lines along the respective axis). However, as the figure clearly shows, any attempt to cover the convex hull of the sensor readings will result in constraint violations and unnecessary communication.
convex hull of the local vectors which have. Therefore, using the constraints defined
in [SSK06, SSK08] will result in unnecessary communication. In fact, any method that
verifies that the value of the function over the entire convex hull has not crossed the
threshold will require unnecessary communication.
It is easy to come up with a more appropriate set of constraints for this example –
constraints which do not attempt to cover the convex hull: the two city center sensors
verify that ca does not exceed 1.9 and that cb does not exceed 0.1, and the other two
sensors verify that ca does not exceed 0.1 and that cb does not exceed 1.9. The regions
corresponding to these two types of constraints are outlined by dotted lines in Figure
1.1. Note that as long as all the constraints are upheld, the score of the average vector
is guaranteed not to cross the threshold (the average concentration of each pollutant
does not exceed one unit). Therefore, these constraints are valid. In fact, in the scenario
described above, none of these constraints is ever violated.
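The validity of these constraints can be checked with one line of arithmetic: since max is monotone in each coordinate and the constraints are axis-aligned boxes, the worst admissible average is the average of the boxes' upper corners. A sketch, with the numbers taken from the example above:

```python
import numpy as np

# Upper corners of the four local box constraints from the example:
# two city-center sensors allow (ca, cb) up to (1.9, 0.1); the other
# two sensors allow (ca, cb) up to (0.1, 1.9).
corners = np.array([[1.9, 0.1], [1.9, 0.1], [0.1, 1.9], [0.1, 1.9]])

# max(ca, cb) is monotone in each coordinate, so the worst average over
# vectors satisfying all constraints is the average of the upper corners.
worst_avg = corners.mean(axis=0)        # [1.0, 1.0]
worst_score = float(np.max(worst_avg))  # exactly the threshold, never above
```

The worst-case score equals the threshold of 1 but never exceeds it, so the constraints are indeed valid.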
The new set of constraints is defined by regions which are optimized for each sensor
(or, in this example, for each pair of sensors), and are called the sensors’ safe zones.
The local operation of deciding whether communication is necessary consists merely
of verifying that the local vector is contained in this region. As opposed to the large
majority of work on distributed monitoring, with the safe zone approach proposed here
we monitor the domain of the function, as opposed to its range. This not only allows a
great deal of freedom in determining local constraints; it also allows us to tailor these local
constraints to the behaviors of the data at the different nodes.
Returning to the constraints proposed in [SSK06, SSK08], we see an additional
major disadvantage: the complexity of checking whether they are upheld. Checking
these constraints requires determining whether the maximum (or minimum) value of the
scoring function inside a sphere exceeds a given threshold. This optimization problem
can be very complex and computationally demanding even for relatively simple functions.
Furthermore, it is performed continuously throughout the lifetime of the query for every
newly introduced data vector. Evidently, the high complexity translates into high power
consumption and demanding processing requirements. This drawback is thus particularly
prohibitive for ubiquitous battery-operated nodes, such as sensor networks and cellular devices.
1.2.1 Contribution
In this work we propose a novel approach to constructing local constraints by means
of simple forms of safe zones. These constraints focus directly on the average of
the data vectors rather than on covering the entire convex hull of the local vectors,
therefore dramatically reducing communication in comparison to previous algorithms.
Experiments performed on real-world data show that this improvement reaches two
orders of magnitude.
Further, in contrast to the high computational effort required to verify that the
previously proposed constraints are upheld, negligible computational effort is required
in our scheme due to the simplicity of the safe zones. This makes our scheme applicable
to thin clients, such as battery-operated devices and sensors.
The safe zones proposed in this work (Section 1.4) are computed by solving an
optimization problem, whose precise solution can provide optimal local constraints. We
develop a set of algorithmic solutions that significantly relax the complexity of the
problem and provide approximations to the exact solution (Sections 1.5 and 1.6). The
main goal of the proposed method is to minimize the overall probability for the local
data vectors to breach their safe zones. The safe zone concept also enables us to define
an algorithm for efficiently recovering from safe zone breaches, which will be presented
in Chapter 3 of this thesis. We then discuss the complexity of solving the optimization
problem and show that it is NP-hard (Section 1.7). An experimental evaluation on
real-life data is provided in Section 1.8, demonstrating that the method reduces
communication by up to two orders of magnitude over the state-of-the-art.
Related work is reviewed in Section 1.3, and conclusions are drawn in Section 1.9.
1.2.2 Preliminaries: Minkowski Average
Recall that the monitored functions are defined over the average of local vectors. The
safe zones we use are vector sets in Euclidean space for which the value of the function
on any average of vectors, one from each set, is still below the threshold. To manipulate
averages of vectors taken from sets, we use a well-known geometric operator, called the
Minkowski sum, and denoted by ⊕ [Ser82]. Given n sets S1, S2, ..., Sn, their Minkowski
sum is the set

S1 ⊕ ... ⊕ Sn = {v1 + ... + vn | v1 ∈ S1, ..., vn ∈ Sn}.
Their Minkowski average is their Minkowski sum where every element is divided by n. See
also Wikipedia for examples and illustrations: http://en.wikipedia.org/wiki/Minkowski_addition.
For a descriptive example of how the Minkowski average notion relates to the safe zones
defined here, see Fig. 1.2.
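For concreteness, the operator can be sketched directly from the definition for finite point sets (an illustrative Python sketch only; in this thesis the operator is applied to continuous regions such as polygons):

```python
import itertools

def minkowski_sum(*sets):
    """Minkowski sum of finite point sets in R^d:
    { v1 + ... + vn : vi in Si }."""
    sums = set()
    for combo in itertools.product(*sets):
        # combo holds one point per set; add them coordinate-wise
        sums.add(tuple(sum(coords) for coords in zip(*combo)))
    return sums

def minkowski_average(*sets):
    """Minkowski average: every element of the Minkowski sum divided by n."""
    n = len(sets)
    return {tuple(c / n for c in v) for v in minkowski_sum(*sets)}
```

For example, the Minkowski average of the segments {(0,0), (2,0)} and {(0,0), (0,2)} is the four corner points {(0,0), (1,0), (0,1), (1,1)} of the unit square.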
1.3 Related Work
Methods for reducing communication in distributed systems include sketching [AM04,
CG05, CG09]. Other research concerns detecting “heavy hitters” [MSDO05], [YZ09],
computing quantiles [CGMR05], and counting distinct elements [CMZ06]. Distributed
computation was also addressed in the context of top-k problems [MTW05, BO03a], set-
expression cardinality estimation [DGGR04], clustering [CMZ07], distributed verification
of logical expressions [ADNR07], optimal sampling [CMYZ10], choosing local thresholds
[AKT09], and ranking [LYJ09]. Theoretical analysis of the monitoring problem is
provided in [CMY08], and some non-monotonic functions of frequency moments are
treated in [ABC09].
The lion’s share of the work addresses the limited case in which the threshold function
is linear (e.g., aggregate, average) [KCR06]. In [SR08] the value of a polynomial in one
variable is monitored. A great deal of work was dedicated to distributed monitoring of
monotonic functions, usually weighted averages, max and min operators, etc. [MTW05].
[GT01, GT02] present algorithms for estimating aggregate functions over a sliding
window of the N most recent data items.
In [JW04], assigning appropriate local thresholds at each node is proposed. It
also touches on the importance and difficulty of the problem of monitoring non-linear
functions: “Standard database languages offer other aggregates including AVERAGE,
STDEV [standard deviation], MAX and MIN. Given a constraint on one of these global
aggregates (e.g. ‘ensure that the STDEV of latency is ≤ l second’), it is not immediately
clear what local ‘event’ should trigger global constraint checks”.
In [WDS09], a gossiping-based algorithm is presented, but it does not cover general
functions.
[HNG+07] suggests a distributed paradigm to decide on the dimension of an ap-
proximating subspace for distributed data, with the application of detecting system
anomalies such as a DDoS attack. [CMY08] discussed functional approximation in
a distributed setting but it only deals with obtaining lower bounds for vector norm
functions.
Monitoring non-monotonic functions by representing them as a difference of monotonic
functions is presented in [SKSS10], but for the static case only. Aggregative
ratio queries over streams are treated in [GRM10] (notice the difference from the
instantaneous, non-aggregative ratio, which we treat here).
A geometric method (GM) for monitoring threshold functions was studied in [SSK06,
SSK08]; we have already discussed its drawbacks in Section 1.2 above. In GM,
each node is assigned a subset of the data space such that as long as the local vectors
are inside their respective subsets, it is guaranteed that the function's value did not
cross the threshold. However, GM suffers from the following drawbacks, which are
solved by the safe zone method:
1. The shape of the subsets at different nodes is identical. This means that if data in
different nodes obey different distributions, GM will perform poorly, as it is based
on the assumption that data at all nodes is identically distributed. For example, if
the distribution at some nodes is elongated along the x-direction and at others along
the y-direction, it makes sense that the subsets at the respective nodes should be
elongated along the x (respectively y) direction, thus allowing them to “capture more probability”.
In this thesis we present an algorithm which allows us to assign different safe zones (SZs) to
different nodes. In Appendix IX-D, a simple theoretical analysis is presented
which proves that this freedom allows the SZ method, even for relatively simple
data distributions, to improve over GM by a factor which is 1) unbounded from
above, and 2) rapidly increasing with the dimension of the data. This theoretical
argument is supported by the experiments in Section 1.8.
2. With the GM method, there is no optimality criterion in the definition of the
subsets. Here we define the SZs as the solution of an optimization problem which
is defined so as to minimize communication during the monitoring task.
3. In previous work only a heuristic solution was proposed for local violation recovery
(a violation is defined to occur whenever a local data vector exits its SZ). Here,
building upon the novel SZ concept, we define a rigorous algorithm to overcome
violations with minimal communication overhead.
1.4 Overview of the Algorithm
We commence with an outline of the algorithm; its distinct stages will be described
throughout the thesis.
Recall that data streams arrive at distributed nodes, and that each node derives a
dynamic d-dimensional data vector from its stream. Denote the number of nodes by n
and the data vector at the i-th node at time t by v_i^t. The global vector, which represents
the entire system at time t, equals

v^t ≜ (v_1^t + v_2^t + ... + v_n^t)/n.

We are interested in determining when the value of the monitored function f() evaluated at v^t crosses a given threshold
T.
In addition to these n nodes, there is an additional coordinator node. The nodes
communicate exclusively with the coordinator. As discussed above, our goal is to set
local constraints at the nodes. To this end, upon initialization of the monitoring process,
or when determined by the algorithm, the coordinator collects data from the nodes,
determines the local constraints for each node, and sends them to the nodes. This
process is referred to as synchronization.
Assume w.l.o.g. that f(v^0) ≤ T, so we must submit an alert when f(v^t) > T. Next,
we define a set referred to as the admissible region, denoted by

S ≜ {v | f(v) ≤ T}.

As discussed in the Introduction, the constraints are defined by subsets of R^d; the
i-th node is assigned a subset, Si, referred to as its safe zone. These safe zones are chosen
from a suitable parametric family of shapes (e.g. polyhedra with a certain number of
vertices).
We assume that the probability distribution function (p.d.f hereafter) of the data
vectors at each node is known to the coordinator. These may either be known in advance
or approximated from the values arriving at the nodes and sent to the coordinator
upon synchronization. Denote the p.d.f at the i-th node by pi. These p.d.f’s can be e.g.
Gaussian [SSK08], random walk, uniform, or other.
As long as v_k^t ∈ S_k, the node does not initiate communication. Whenever v_k^t ∉ S_k –
an event we call a local violation – the node reports to the coordinator, which attempts
to resolve the violation by employing the violation recovery algorithms described in
Chapter 3, and, if unsuccessful, initiates a synchronization which determines new safe
zones at all nodes.
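The node-local protocol is deliberately lightweight; the following Python sketch (with hypothetical names `safe_zone_contains` and `report_violation` standing in for the containment test and the message to the coordinator) illustrates the per-update decision:

```python
def node_step(v, safe_zone_contains, report_violation):
    """One monitoring step at a node: communication is initiated only
    when the local data vector leaves the node's safe zone.
    Returns True iff a local violation was reported."""
    if safe_zone_contains(v):
        return False            # silent step: no communication needed
    report_violation(v)         # local violation: notify the coordinator
    return True
```

For instance, with a box-shaped safe zone `[0,1]^2`, the vector (0.5, 0.5) is processed silently, while (1.5, 0.5) triggers a report.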
1.4.1 Safe-Zone Allocation as an Optimization Problem
At the heart of our approach is the allocation of safe zones by the coordinator. Obviously,
in order for the algorithm to be correct, the safe zones must adhere to the following
condition:
(v1 ∈ S1) ∧ ... ∧ (vn ∈ Sn) ⇒ (v1 + ... + vn)/n ∈ S      (1.1)
In other words, we require the Minkowski average of the safe zones to be contained
in the admissible region. Furthermore, in order for the safe zones to be efficient, we
would like to maximize the probability of the data vectors at all nodes to remain within
their zones.
Assuming probability distributions p1, ..., pn on the data at the respective nodes
(pi is the probability density function of the data seen at node i), we can formulate a
constrained optimization problem as follows:
Maximize ∫_{S1} p1 dv1 · ∫_{S2} p2 dv2 · ... · ∫_{Sn} pn dvn      (1.2)

subject to (S1 ⊕ S2 ⊕ ... ⊕ Sn)/n ⊂ S.
The maximization of the target function ∫_{S1} p1 dv1 · ... · ∫_{Sn} pn dvn means that, under the
constraint, the expected time it will take one of the local data vectors to wander out of
Figure 1.2: Illegal and legal safe zones. Top: left and right depict two-dimensional data at two nodes A and B. The p.d.f at both nodes is a Gaussian (normal) distribution. S, which must contain the Minkowski average of the two safe zones, is the dotted ellipse in the middle. The point cloud in the middle is a sample of the global data vectors, obtained by averaging the data vectors at both nodes. The allowable family of safe zone shapes consists of four-vertex polygons. The depicted safe zones (outlined in black) fit the local data well, but are alas illegal, since their Minkowski average (continuous dark line in the middle) is not inside S. Bottom: This time the safe zones are legal: they satisfy the constraint (1.2).
its safe zone is maximal.¹
The very general formulation of the optimization problem (Eq. 1.2) allows us to assign
each node a safe zone tailored to match its data distribution; this is in contrast to
previous geometry-based monitoring algorithms [SSK06, SSK08], in which the local
conditions at all nodes are identical. Consequently, as demonstrated in the experiments,
the advantage of our approach increases with the diversity of the data distributions
across nodes. The safe zones try to match the shape of the p.d.f of the data at each node,
to maximize the probability of the local data falling inside its safe zone. For example,
a distribution which is wide along the x-axis and narrow along the y-axis should be
assigned a safe zone which is wide horizontally and narrow vertically. In Section 1.8.8
a brief theoretical analysis is provided which demonstrates that even in a very simple
setup, the improvement factor of the proposed method over [SSK06, SSK08] increases
exponentially with the dimensionality of the data vectors.
The Minkowski average of the safe zones should tightly approximate S (from the
inside). If it fills a relatively small part of S, this means that the safe zones can be
enlarged and the value of the optimized function increased.
Note that the geometric constraint and the target function to be maximized have to
reach a “compromise”: figuratively speaking, the Minkowski average constraint forces
the safe zones to be small, while the probability increases as the safe zones become larger.
Fig. 1.2 demonstrates the trade-off, central to the solution of the optimization problem
¹ Here we assume that data is not correlated between nodes, as is the case with the data used for the experiments in Section 1.8; if data is correlated, the algorithm is essentially the same, with the expression for the probability that data at some node breaches its safe zone modified accordingly.
(1.2), between the fit of the safe zones to the data and the necessity of maintaining their
Minkowski average inside S. Note how the Minkowski average “sticks” to S, which is a
result of maximizing the safe zones to contain as much p.d.f weight as possible.
1.4.2 The Parametric Family of Allowable Safe Zone Shapes
As mentioned above, we suggest that the optimization be restricted to a parametric
family of shapes, denoted by P . The safe zones will be chosen from the members of P .
Note that every node has to continuously test whether its local data vector is inside
its safe zone; if the shape of the safe zone is complicated, the test will consume time
and energy; therefore it is desirable to apply safe zones which are as simple as possible.
In addition, if computing the integrals of the p.d.f or the Minkowski average is very
time consuming, the optimization process may be lengthy, rendering the algorithm
impractical. Optimization time will also increase with the number of parameters required
to define a safe zone.
The above considerations lead to the following requirements for the family P :
• P should be sufficiently rich, so that its members can reasonably approximate
every given p.d.f with a viable candidate for a safe zone. If this does not hold, the
solution may be grossly sub-optimal.
• The shapes of members of P should not be too complicated, in order to allow
quick verification of containment of the data vectors in the safe zone.
• It should not be too difficult to compute the integral of the p.d.f on members of
P .
• It should be relatively easy to compute, or bound, the Minkowski average of any
members of P .
In our experiments (Section 1.8) we applied various types of polygons and polyhedra
as members of P . Good results were obtained for real-life data even when relatively
simple families were used. The questions of how to choose the exact parameters (e.g.,
the number of polyhedra vertices) and how to use other parametric families of shapes
are beyond the scope of this work, and their complete solution will require further study.
As noted, here we chose to work with safe zones which are both simple and provide
good performance in terms of reducing communication.
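For instance, with the polygonal safe zones used here, the per-update containment test a node must run amounts to a few arithmetic operations per edge. A minimal sketch (illustrative only, assuming a convex polygon whose vertices are listed counter-clockwise; not taken from the thesis implementation):

```python
def inside_convex_polygon(point, vertices):
    """Test containment in a convex polygon with counter-clockwise
    vertices: the point must lie on the left of (or on) every
    directed edge."""
    n = len(vertices)
    px, py = point
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        # cross product of the edge vector and the vertex-to-point vector
        if (x2 - x1) * (py - y1) - (y2 - y1) * (px - x1) < 0.0:
            return False
    return True
```

The test is O(k) for a k-vertex polygon, which is why simple polygonal safe zones are affordable even on thin clients.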
1.4.3 Convexity of S and the Safe Zones
The following theorem is useful when S is convex; it shows that in this case we can,
without loss of generality, restrict ourselves to convex safe zones.
Theorem 1.1. If S is convex, the optimal safe zones are convex.
Proof. For the sake of simplicity, assume there are two safe zones (the extension to more is
straightforward), so (S1 ⊕ S2)/2 ⊂ S. We will prove that (C(S1) ⊕ C(S2))/2 ⊂ S, where for every
set X, C(X) is the convex hull of X. This means that if S1, S2 are legal safe zones, so
are C(S1), C(S2). But Si ⊂ C(Si), which means that if Si were not convex they could
have been enlarged to legal safe zones, thus contradicting their optimality.

Now we return to the proof that (S1 ⊕ S2)/2 ⊂ S =⇒ (C(S1) ⊕ C(S2))/2 ⊂ S. Let x1, y1 ∈ S1, x2, y2 ∈ S2, and 0 ≤ λ1, λ2 ≤ 1. We need to prove that

v = (λ1x1 + (1 − λ1)y1 + λ2x2 + (1 − λ2)y2)/2 ∈ S.

Assume without loss of generality that λ2 ≤ λ1. Since (S1 ⊕ S2)/2 ⊂ S, it follows that
(x1 + x2)/2, (x1 + y2)/2, (y1 + y2)/2 ∈ S. A direct computation gives

v = λ2 · (x1 + x2)/2 + (λ1 − λ2) · (x1 + y2)/2 + (1 − λ1) · (y1 + y2)/2.

But this last expression is a convex combination of elements of S; since S is convex,
v is also in S.
If the safe zones are convex polygons with k vertices (as used in this chapter), the
following argument proves that they provide a reasonable approximation to general safe
zones in the case of two variables.
Let C be a convex set in the plane, and k a fixed integer. Denote by Ck the maximal
value of the ratio A(Pk)/A(C), where A denotes area and Pk is any convex polygon with k vertices
inscribed in C. Then [MGR95] show that the minimal value of Ck is obtained when C
is a disk. Intuitively, this means that disks (and spheres in higher dimensions) are the
convex sets which are hardest to approximate by inscribed polyhedra.

Using this result we get:

Theorem 1.2. Ck behaves as 1 − α/k² for a small constant α.
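The constant can be illustrated numerically for the worst case. For the regular k-gon inscribed in the unit disk, A(Pk)/A(C) = k·sin(2π/k)/(2π), and a Taylor expansion gives 1 − Ck ≈ (2π²/3)/k², i.e., α ≈ 6.58. A small numerical check (not part of the thesis):

```python
import math

def inscribed_ratio(k):
    """Area ratio A(P_k)/A(C) for the regular k-gon inscribed in a disk,
    the hardest convex set to approximate by inscribed polygons."""
    return k * math.sin(2 * math.pi / k) / (2 * math.pi)

# (1 - ratio) * k^2 should approach alpha = 2*pi^2/3 ~ 6.58 as k grows
for k in (8, 16, 32, 64):
    print(k, (1 - inscribed_ratio(k)) * k ** 2)
```

Already at k = 8 the ratio exceeds 0.9, supporting the claim that few-vertex polygons are reasonable safe zone candidates.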
1.5 Safe Zone Optimization
We now turn to the algorithmic core of the proposed method – solving the optimization
problem which defines the safe zones (Eq. 1.2). In a series of theorems deferred to
Section 1.7, we prove that generally the problem is NP-hard. Still, efficient solutions
can be found by applying computational techniques which are presented here and in
the following section.
The parameters which determine the difficulty of the search for good safe zones are:
1. The complexity of computing the target function.
2. The complexity of testing whether the constraints hold.
3. The complexity of the safe zone shapes: when that increases, so does the number
of variables to optimize over.
4. The number of nodes: when that increases, the number of variables to optimize
over increases linearly, which typically results in a super-linear increase in the
optimization complexity.
To solve problems 1 and 2 we applied tools from the realm of analysis and computational
geometry, as described in Sections 1.5.1, 1.5.2. The solution to problem 3 is by restricting
the parametric family of allowable shapes P , as discussed in Section 1.4.2. The solution
to problem 4 is described in Section 1.6.
1.5.1 Computing the Target Function
The target function is defined as the product of integrals of the respective p.d.f on the
candidate safe zones. Typically, data is provided as discrete samples (Section 1.8). The
integral can be computed by first approximating the discrete samples by a continuous
model, and then integrating it over the safe zone.
We have used this approach for 2- and 3-dimensional data, fitting a GMM (Gaussian
Mixture Model) and integrating the GMM over the safe zone, which was defined by
a polygon or a polyhedron; see Fig. 1.6. To compute the integral, we used Monte-Carlo
methods and Green's theorem, which allows us to reduce the dimension of the integration
domain.
Alternatively, the integral can be approximated from the discrete samples. The
simplest approximation is to estimate the integral by the number of points in the
safe zone. In order to improve accuracy (as well as to make the target function
continuous and thus more amenable to optimization), the integral was approximated by

∑_{i=1}^{k} exp(−λ d²(q_i, SZ)),

where k is the number of sample points, SZ the safe zone, λ a positive constant, q_i the
i-th data point, and d(q_i, SZ) the distance of q_i from SZ (which is defined to be
zero if q_i ∈ SZ).
Computation of the target function can be accelerated by using range searching
algorithms [Mat92].
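As an illustration of the smoothed approximation, the following sketch evaluates ∑ exp(−λ d²(q_i, SZ)) with an axis-aligned box standing in for the safe zone (a simplification: for the polygons actually used, a point-to-polygon distance would replace `distance_to_box`):

```python
import math

def distance_to_box(q, lo, hi):
    """Euclidean distance from point q to the axis-aligned box
    [lo, hi]; zero if q is inside the box."""
    d2 = 0.0
    for qc, lc, hc in zip(q, lo, hi):
        if qc < lc:
            d2 += (lc - qc) ** 2
        elif qc > hc:
            d2 += (qc - hc) ** 2
    return math.sqrt(d2)

def smooth_target(samples, lo, hi, lam=1.0):
    """Smoothed count of samples inside the safe zone:
    sum_i exp(-lambda * d^2(q_i, SZ))."""
    return sum(math.exp(-lam * distance_to_box(q, lo, hi) ** 2)
               for q in samples)
```

A point inside the zone contributes exactly 1, while a point just outside contributes slightly less than 1, so the target varies continuously as the candidate zone is deformed.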
1.5.2 Checking the Constraints
To implement a constrained optimization routine, a function is required which checks
whether the current parameters satisfy the constraints. It should return zero if the
constraints are satisfied, and a positive number if they are not. It should behave
“smoothly”: if the constraint violation is small, it should return a small value, and vice
versa.
In our case, the constraint is that the Minkowski average of the safe zones is contained
in S. One way to check the constraint is to compute the Minkowski average and check
whether it is inside S, and if not, determine some measure of its deviation from S. This
may entail very high computational complexity, especially in high dimensions, where
computing the Minkowski sum is computationally expensive.
The following method allows us to test the constraint without computing the
Minkowski average. For the sake of simplicity we start with two-dimensional data.
Figure 1.3: The half-plane approach. Top: S is equal to H, which is defined by θ and b. Bottom: a supporting hyperplane is used for every safe zone candidate Si.
Assume first that S is a half-plane, denoted H, whose boundary line is denoted lH, so H is
the set of points which are below lH (the algorithm proceeds similarly if lH is H's lower
boundary); see Fig. 1.3. lH is defined by the angle θ and by b, its distance from the
origin (in higher dimensions similar definitions hold, with the direction replaced by a
unit vector perpendicular to the hyperplane bounding H).
Then, in order to determine whether the Minkowski average of the candidate safe
zones S1...Sk is contained in H, one has to find for each Si the upper supporting line
(in higher dimensions, upper supporting hyperplane) in the same direction as that of
lH; denote by bi its distance from the origin. For a polytope Si, this requires rather low
computational complexity – only the vertices need to be considered, and line sweep
algorithms can be applied to further reduce running time. In order for the Minkowski
average of the Si's to be contained in S, it is sufficient that (∑ bi)/k ≤ b. This also yields
a measure of constraint violation, which depends on the value of (∑ bi)/k − b. With this
measure it is easy to apply standard optimization routines, such as Matlab's fmincon, to
solve the optimization problem, using the above measure of constraint violation and the
target function as defined in Section 1.5.1.
If S is a convex polytope it is equal to the intersection of half-planes, and the target
function is the sum, or maximum, of the target functions corresponding to the individual
half-planes. A general convex S can be efficiently approximated by an inscribed convex
polytope [MGR95]. Non-convex shapes can be represented as a union of convex ones
[Cha87].
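The half-plane test above can be sketched in a few lines (illustrative Python; each polytope is given as a list of vertices, and `u` is the unit normal of the half-plane H = {v : ⟨v, u⟩ ≤ b}):

```python
def constraint_violation(safe_zones, u, b):
    """Half-plane constraint check: for each candidate polytope S_i,
    b_i is the support value max over vertices of <v, u>; the Minkowski
    average of the S_i lies in H = {v : <v, u> <= b} iff
    (sum of b_i) / k <= b. Returns 0 when the constraint holds, and
    the (smoothly varying) deviation otherwise."""
    k = len(safe_zones)
    support = sum(max(sum(vc * uc for vc, uc in zip(v, u)) for v in zone)
                  for zone in safe_zones)
    return max(0.0, support / k - b)
```

Only the vertices of each polytope are examined, so the check is linear in the total number of vertices, as noted above.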
1.6 Hierarchical Clustering
As mentioned above, the complexity of the safe zone optimization problem (1.2) increases
very quickly with the number of nodes. Assume, for example, that 100 nodes are present,
Figure 1.4: Hierarchical clustering example. Data is taken from [web]. The allowable family of safe zone shapes consists of pentagons. Each node's data is represented by a scatter diagram at the bottom. The yellow safe zones are those computed for the original nodes. Clusters of the data of node pairs are represented as supernodes in the middle row. The two safe zones corresponding to the supernodes are colored green (as is their Minkowski average, depicted inside S). The root is the diagram at the top row. S is the blue ellipse in the root (the root node was scaled for better view). An ellipse corresponds to monitoring a quadratic function (see also Section 1.8.6).
the data is 3-dimensional, and we wish to use polyhedral safe zones with eight vertices
in each node. Since each vertex has three coordinates, the total number of parameters
to optimize over is 100 · 8 · 3 = 2400, which is quite high for a non-convex optimization
problem. To overcome this, we organize the data in a hierarchical structure, which
allows the problem to be solved recursively while reducing it to sub-problems with a
much smaller number of nodes.
The algorithm commences by performing hierarchical clustering on the nodes. Note
that we do not cluster the data in each node separately, but the nodes themselves;
that is, we form clusters of nodes. To achieve this, a distance measure needs to be
defined between clusters. This can be done in various ways – for example, GMMs or
other distributions can be fit to the data in the nodes, and the distance between nodes
can then be defined by some distance between the respective distributions (e.g., the
Kullback-Leibler divergence). A more direct method is to use some distance measure
between the data moments [MEA01]; the clustering results for the two methods were
quite similar for the data we tested. Some typical results are provided in Section 1.8.4.
The hierarchical clustering proceeds as follows: we start by partitioning the entire
set of nodes into a small number of clusters, which can be thought of as supernodes, each
containing the union of data of the nodes in the respective cluster. We fit safe zones to
the supernodes, and continue recursively, by partitioning each supernode into clusters,
and so on. This process constructs (top-down) a tree of supernodes. The leaves of the
tree can be either individual nodes, or node clusters which are uniform enough that
they do not require further partitioning into smaller clusters, and can all be assigned
safe zones with identical shapes.
To illustrate the hierarchical clustering algorithm, we present in Fig. 1.4 an example
with four nodes, where the safe zones are convex pentagons. The nodes are first clustered
into supernodes, depicted in the middle row. Each supernode is generated by sampling
and averaging from the data of two nodes which it represents, and the entire data (top
row) is generated by sampling and averaging from the two supernodes; it is the root of
the cluster tree.
Note that the Minkowski sum of the two safe zones of the left (right) node pair
is constrained to lie inside the safe zone of the left (right) supernode, depicted in the
middle row. Thus, assigning four safe zones at the nodes was achieved by solving
three optimization problems, each constructing two safe zones only. This leads to a
considerably faster solution than solving for all four nodes simultaneously. In general,
the time complexity of solving an optimization problem increases very rapidly with the
number of parameters, so optimizing three times over two pentagons is much faster
than optimizing once over four. Importantly, this approach also allows parallelization.
This example also demonstrates that nodes with similar data distributions are
typically clustered into the same supernodes, and are assigned similar safe zones. When
many nodes are present, it is usually possible to prune the cluster tree and assign the
same safe zone to all nodes in a supernode, given that it is sufficiently uniform. This,
too, can save a great deal of computation.
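To make the grouping step concrete, here is a toy sketch that pairs nodes bottom-up by the distance between their data means (first moments). The thesis builds the cluster tree top-down and may use richer node distances (higher moments [MEA01] or divergences between fitted distributions), so this is an illustration only:

```python
def node_signature(samples):
    """Represent a node by the mean of its data (first moment);
    higher moments could be appended for a finer distance."""
    n = len(samples)
    return tuple(sum(col) / n for col in zip(*samples))

def cluster_pairs(signatures):
    """Greedily pair nodes into supernodes by the squared Euclidean
    distance between their signatures (leaves one node unpaired if
    the count is odd)."""
    remaining = list(range(len(signatures)))
    pairs = []
    while len(remaining) > 1:
        i = remaining.pop(0)
        j = min(remaining, key=lambda m: sum(
            (a - b) ** 2 for a, b in zip(signatures[i], signatures[m])))
        remaining.remove(j)
        pairs.append((i, j))
    return pairs
```

On the four-node example of Fig. 1.4, such a step would group the two nodes with similar distributions into each supernode.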
1.7 The Complexity of Computing Optimal Safe Zones
In this section we study the complexity of the optimization problem (1.2). We prove
that generally, solving the safe zone problem is NP-hard.
Recall that the input to the safe zone problem consists of the subset S (determined
by the monitored function f() and the threshold T ), and the probability distributions
pi (which may be given in closed form or as sampled data).
Theorem 1.3. Even for two nodes and one-dimensional data, the safe zone problem is
NP-Hard.
Proof. We will show the equivalence of the safe zone problem to the biclique problem.
Given a bipartite graph G with sides L,R, the goal of the biclique problem is to find
the biclique (a complete bipartite subgraph of G) with the maximal number of edges.
Assume R has nodes r1...rn, and L has nodes l1...lm. Let the set of edges, E, be a subset
of {(i, j) | i ∈ {1...n}, j ∈ {1...m}}. Associate with this graph the distribution PR,
having delta function (pointwise) probability masses at locations xi, i = 1...n, and
similarly PL at locations yj, j = 1...m (narrow Gaussians can also be used). The
only restriction on the xi, yj is that xi + yj = xi′ + yj′ ⇒ i = i′, j = j′, which is trivial to
achieve.

Now, define S = {xi + yj | (i, j) ∈ E}. Note that optimal safe zones must be subsets of
{xi} and {yj} (including other points will not add any probability, as all the probability
Figure 1.5: A schematic example of the proof of equivalence of the safe zone and the biclique problems (Theorem 1.3). The bipartite graph (top), nodes (middle), and S (bottom).
mass resides in the {xi}, {yj}). Note also that safe zones Sx ⊂ {xi}, Sy ⊂ {yj} satisfy
the Minkowski sum constraint iff the respective subsets of R, L form a biclique, and
that the target function for the two safe zones is proportional to the number of edges
in that biclique.
We conclude that the two problems are equivalent.
A schematic drawing illustrating the reduction of the biclique problem to the problem
of computing the safe zones is provided in Fig. 1.5.
One may suspect that the difficulty of the general problem follows from allowing
such a discrete, disconnected S as in the proof of Theorem 1.3. The following theorems
prove that this is not the case.
Theorem 1.4. If the dimension of the data vectors is at least 4, the safe zone problem
is NP-complete for two nodes even when S is convex.
Proof. The same idea (and notation) is used as for Theorem 1.3. The tricky part is
to construct a convex S having the property that “makes the proof work”, i.e., such
that xi + yj = xi′ + yj′ ⇒ i = i′, j = j′ and that xi + yj ∈ S ⇐⇒ (i, j) ∈ E.

Since S has to be convex, we choose it to equal the convex hull of {xi + yj | (i, j) ∈ E}.
In order to guarantee that (i, j) ∉ E ⇒ xi + yj ∉ S, we construct the sets of points xi, yj
such that xi0 + yj0 is not in the convex hull of the points {xi + yj | i ≠ i0 OR j ≠ j0}
(such a construction, obviously, is not possible in one dimension).

Note that for any set of points on the unit circle in R², none is in the convex hull
of the others (as the unit circle is strictly convex). Take {ui}, i = 1...n, and {vj}, j = 1...m, to be any
two such sets, and define xi as (ui, 0, 0) ∈ R⁴ and yj as (0, 0, vj) ∈ R⁴ (there are four
coordinates since ui, vj ∈ R²). The points xi, yj satisfy the required property, since if

∑_{i,j | i ≠ i0 OR j ≠ j0} λ_{i,j} (xi + yj) = xi0 + yj0   (for λ_{i,j} ≥ 0 and ∑_{i,j} λ_{i,j} = 1),

the equality holds separately in the first two and the last two coordinates, which means it holds separately
for the ui and vj, violating the strict convexity of the ui and vj sets.
For two nodes and S a one-dimensional interval, if the number of points at the nodes is O(n), a polynomial algorithm exists for computing the optimal safe zones: a trivial solution, running in time O(n⁴), tests all safe zone pairs which are intervals with data points as endpoints, and the running time can be lowered to O(n² log(n)) by a binary search on the endpoints. However, even in this case, the general problem is NP-complete.
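The trivial O(n⁴) search just described can be sketched as follows, under the assumptions that each data point carries uniform empirical weight and that the Minkowski average of [a₁, b₁] and [a₂, b₂] is [(a₁+a₂)/2, (b₁+b₂)/2]; the function name and the sample data are illustrative.

```python
# Brute-force the optimal interval safe zones for two nodes: try every pair of
# intervals whose endpoints are data points, keep the legal pair maximizing the
# product of covered empirical masses.
from itertools import product

def best_interval_safe_zones(pts1, pts2, S):
    lo, hi = S
    best, best_zones = -1.0, None
    c1, c2 = sorted(set(pts1)), sorted(set(pts2))
    for a1, b1 in product(c1, c1):
        if a1 > b1:
            continue
        for a2, b2 in product(c2, c2):
            if a2 > b2:
                continue
            # Legality: the Minkowski average must lie inside S.
            if (a1 + a2) / 2 < lo or (b1 + b2) / 2 > hi:
                continue
            p1 = sum(a1 <= p <= b1 for p in pts1) / len(pts1)
            p2 = sum(a2 <= p <= b2 for p in pts2) / len(pts2)
            if p1 * p2 > best:
                best, best_zones = p1 * p2, ((a1, b1), (a2, b2))
    return best, best_zones

best, zones = best_interval_safe_zones([0, 1, 2, 3], [0, 2, 4], S=(0, 3))
print(best, zones)
```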
Theorem 1.5. If more than two nodes are allowed, the safe zone problem is NP-
complete for the case in which S is a one-dimensional interval.
Proof. We will show that the knapsack problem (which is known to be NP-complete) can be reduced to the safe zone problem in this case. Given a knapsack problem, that is, n objects O_1, ..., O_n with value v_i and weight w_i for O_i, and a knapsack which can carry a maximal weight of W, we reduce the problem to a safe zone problem whose optimal solution can be used to construct an optimal solution to the knapsack problem.
First, we create n nodes N_1, ..., N_n, with N_i corresponding to O_i, with the following data distribution:

p_i(x) =
  1/C                if x = 0
  (e^{v_i} − 1)/C    if x = w_i
  (C − e^{v_i})/C    if x = W + 1
  0                  otherwise        (1.3)

for C = max_i e^{v_i}. Note that the overall probability in each node equals 1. Now, we define the global safe zone S to be the interval [0, W/n].
Assume we can solve the safe zone problem and obtain an optimal solution, i.e., an interval S_i = [a_i, b_i] for each node N_i. Note that in this optimal solution we may assume that a_i = 0 and 0 ≤ b_i ≤ W for all i (it is not possible to take b_i > W, as this would violate the Minkowski average constraint, and it will not add anything to take a_i < 0, as all the probability mass is in the region x ≥ 0).
There are two types of possible safe zones at each node: those which contain only the origin, and those which equal [0, w_i]. Denote by S the subset of nodes in which [0, w_i] is taken. The solution is legal iff the Minkowski average of {[0, w_i] | i ∈ S} is inside S = [0, W/n], but this is equivalent to demanding Σ_{i∈S} w_i ≤ W – which is exactly the legality condition for the knapsack problem. Also, the product of the probability volumes at the nodes (which the safe zone problem attempts to maximize) clearly equals (Π_{i∈S} e^{v_i})/Cⁿ, so up to a constant factor it is e^{Σ_{i∈S} v_i}. Therefore, the safe zone problem is equivalent to maximizing Σ_{i∈S} v_i under the constraint Σ_{i∈S} w_i ≤ W, which proves that it is equivalent to the knapsack problem.
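The node distributions of Eq. (1.3) can be constructed mechanically. The sketch below assumes positive weights w_i (so the three support points 0, w_i, W+1 are distinct) and only checks the bookkeeping, not the reduction itself; the function name and sample instance are illustrative.

```python
# Build the per-node distributions of Eq. (1.3) for the knapsack reduction:
# support {0, w_i, W+1} with masses 1/C, (e^{v_i}-1)/C, (C-e^{v_i})/C.
import math

def reduction_pdfs(values, weights, W):
    C = max(math.exp(v) for v in values)   # C = max_i e^{v_i}
    pdfs = []
    for v, w in zip(values, weights):
        pdfs.append({0: 1 / C,
                     w: (math.exp(v) - 1) / C,
                     W + 1: (C - math.exp(v)) / C})
    return pdfs

pdfs = reduction_pdfs(values=[1.0, 2.0], weights=[3, 4], W=5)
print(pdfs[0])
```

Note that choosing the safe zone [0, w_i] at node i covers mass 1/C + (e^{v_i} − 1)/C = e^{v_i}/C, as used in the proof.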
Figure 1.6: Top: two of the GMM elements super-imposed on the data. Middle: typical local concentrations of NO and NO2 as a function of time. Bottom: typical behavior of the local ratio between NO and NO2 as a function of time.
1.8 Experiments
The proposed safe zone method was implemented and compared with the algorithm
proposed in [SSK06, SSK08], which we call the GM algorithm (Geometric Method
algorithm). We chose to compare to GM as it, too, constitutes a general approach to
monitoring arbitrary functions. We are not aware of other algorithms which can be
applied to monitor the functions treated here.
1.8.1 Data, Methods, and Monitored Functions
The data we used consists of air pollutant measurements taken from “AirBase – The
European air quality database” [web]. Concentrations were measured in micrograms per
cubic meter. Nodes correspond to sensors at different geographical locations. The data
at different nodes greatly varies in size and shape and is highly irregular as a function
of time; see Fig. 1.6.
Computing the target function in the optimization requires computing the integral of the p.d.f. over the respective safe zones. This is done by approximating the data
with a Gaussian Mixture Model (GMM), using a Matlab routine (see Fig. 1.6), or
by calculating a discrete approximation, as discussed in Section 1.5.1. The quality of
the results was measured by the reduction in safe zone violations, which is roughly
proportional to the reduction in communication operations.
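The integral of a fitted GMM over a safe zone can be estimated by Monte Carlo sampling. The sketch below uses a hand-rolled two-component mixture as a hypothetical stand-in for the GMM fitted to the sensor data by the Matlab routine; the mixture parameters and the box-shaped zone are illustrative.

```python
# Estimate the p.d.f. mass a safe zone covers by sampling from a Gaussian
# Mixture Model and counting the fraction of samples inside the zone.
import numpy as np

rng = np.random.default_rng(0)
weights = np.array([0.5, 0.5])              # mixture weights
means = np.array([[0.0, 0.0], [3.0, 3.0]])  # component means
stds = np.array([0.5, 1.0])                 # isotropic std per component

def sample_gmm(n):
    comp = rng.choice(len(weights), size=n, p=weights)
    return means[comp] + rng.normal(size=(n, 2)) * stds[comp, None]

def mass_in_box(lo, hi, n=200000):
    """Monte Carlo estimate of the mixture's mass inside an axis-aligned box."""
    samples = sample_gmm(n)
    return np.all((samples >= lo) & (samples <= hi), axis=1).mean()

print(round(mass_in_box([-2, -2], [2, 2]), 2))
```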
To emphasize the generality of the safe zone method, it was applied to monitor non-
Figure 1.7: Triangular safe zones used for ratio monitoring.
linear, non-monotonic functions. In Section 1.8.2 results are presented for monitoring the
ratio of NO to NO2, which is known to be an important indicator in air quality analysis
[KG07]. In Section 1.8.5 the chi-square distance between histograms was monitored
for 5-dimensional data. Section 1.8.6 presents an example of monitoring a quadratic
function in three variables. Quadratic functions are important in numerous applications.
For example, variance is a quadratic function, and the normal distribution's density is the exponential of a quadratic function; consequently, thresholding it is equivalent to thresholding a quadratic function.
1.8.2 Ratio Queries with Triangular Safe Zones
This set of experiments concerned monitoring the ratio between two pollutants, NO and NO2, measured in distinct sensors. Formally, each of the n nodes holds a vector (x_i, y_i) (the two concentrations), and the monitored function is Σy_i / Σx_i (in [GRM10] a ratio is also monitored, but over aggregates, while here the monitoring is of the instantaneous ratio). An alert must be sent whenever this function is above a threshold T. The safe zones tested were triangles of the form depicted in Fig. 1.7, a choice motivated by their simplicity and by their suitability to the data and to the definition of the queried function. Note that we allow some of the β_i to be positive, in order to cover nodes in which the ratio y_i/x_i is high.
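At runtime each node only needs a cheap membership test for its triangular safe zone. The sketch below uses a generic three-sign-test check over arbitrary vertices; it is not the (M_i, β_i) parametrization of Fig. 1.7, and the sample triangle and readings are illustrative.

```python
# Node-side check: stay silent while the (NO, NO2) reading lies in the
# triangular safe zone, given here by three vertices.
def in_triangle(p, a, b, c):
    def cross(o, u, w):
        return (u[0] - o[0]) * (w[1] - o[1]) - (u[1] - o[1]) * (w[0] - o[0])
    d1, d2, d3 = cross(a, b, p), cross(b, c, p), cross(c, a, p)
    has_neg = (d1 < 0) or (d2 < 0) or (d3 < 0)
    has_pos = (d1 > 0) or (d2 > 0) or (d3 > 0)
    return not (has_neg and has_pos)   # inside or on the boundary

tri = ((0.0, 0.0), (4.0, 0.0), (0.0, 2.0))
print(in_triangle((1.0, 0.5), *tri))   # True: no local violation
print(in_triangle((5.0, 5.0), *tri))   # False: the node must report
```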
Fig. 1.8 shows an example on four nodes. Note how nodes with more compact
distributions are assigned smaller safe zones, and how nodes with high values of the
monitored function (NO/NO2 ratio) are assigned safe zones which are translated to the
left in order to cover more data. This is especially evident in the upper right node, in
which the safe zone is shifted to the left so it can cover almost all the data points. In
order to satisfy the Minkowski sum constraint, the safe zone of the upper left node is
shifted to the right, which in that node hardly sacrifices any data points. Note that
the safe zone method allows safe zones which are larger than S, as opposed to the GM
method, in which the safe zones are restricted to translates of subsets of S.
Figure 1.8: Example of optimal safe zones with four nodes. S is the dark triangle; safe zones are outlined in green.
1.8.3 Improvement over GM Algorithm
Here we compare ratio monitoring with safe zones to the GM method. In Fig. 1.9, the
number of safe zone violations is compared for various numbers of nodes, and in Fig.
1.10 some of the safe zones for both methods are compared.
1.8.4 Ratio Queries: Hierarchical Implementation
Hierarchical clustering of the nodes was applied in order to reduce running time (Section
1.6). In Fig. 1.11 a typical result is depicted: 92 nodes were clustered into four groups.
Two representatives from each of the three largest groups are shown, which correspond
to three typical data types: small, indicating low concentrations of NO/NO2 (top), drift,
with many measurements near the origin but also a sizable number of measurements
with high NO concentrations (middle), and vertical, where most measurements are
concentrated in a vertical stripe near the origin and fewer have high NO (bottom). The
Matlab routine kmeans was used for clustering the moment vectors.
In order to test running time and performance, we ran the ratio monitoring algo-
rithm for n = 30 to 240 nodes with various thresholds, both in a “flat” mode (direct
optimization over 2n variables, see Section 1.8.7) and the hierarchical method using the
clustering and tree structure (Section 1.6). Table 1.1 summarizes the results for n = 60
and various values of the threshold T .
In each table entry the first number stands for running time (seconds) and the
second number for the value of the target function. The running time is higher for
the “flat” mode, as the number of parameters to optimize over is much higher, but the
Figure 1.9: Comparison of safe zones (green line) to GM (blue line) in terms of the number of violations, up to 10 nodes. For more nodes (up to 200 were used in this experiment), the average improvement of safe zones over GM was by a factor of 17.5.
Figure 1.10: Comparison of the safe zone method to GM in terms of points which cause a violation. At each node, the set S is depicted (dark triangle), the safe zone (green triangles), and a sample of the data points (red dots). The points which satisfy the GM constraints are depicted in blue. The advantage of the safe zone method over GM is clear.
Figure 1.11: Clustering example. Each row depicts two nodes from one cluster.
Figure 1.12: Running time (in logarithmic scale) for “flat” – direct optimization over all the nodes (blue) vs. hierarchical clustering (green).
Table 1.1: Optimization time (in seconds) and target function for ratio queries, i.e., the average integral of the p.d.f. covered by the safe zones.
T 4 3 2
Tree 23.9 s. — 0.980 23.4 s. — 0.985 23.8 s. — 0.962
Flat 243.6 s. — 0.999 244.9 s. — 0.994 177.4 s. — 0.970
performance is slightly better. In Fig. 1.12 the running times of “flat” vs. “hierarchical”
are compared for various numbers of nodes; note that running times for “hierarchical”
only increase linearly with the number of nodes.
1.8.5 Chi-square monitoring in 5 dimensions with axis-aligned box-shaped safe zones
Another important example of a non-linear, non-monotonic function is the chi-square distance between histograms, defined by χ(f, g) = Σ_i (f_i − g_i)² / (f_i + g_i) for histograms f, g. The histogram was defined as the concentration levels of five pollutants, and the monitored function was the chi-square distance between the hourly average of two nodes and their average calculated over the previous week (i.e., a measure of how much the hourly distribution deviates from last week’s average).
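The monitored distance is straightforward to compute; a minimal sketch, with hypothetical 5-bin histograms standing in for the pollutant data:

```python
# Chi-square distance between two histograms, chi(f, g) = sum (f_i-g_i)^2/(f_i+g_i),
# skipping bins where both histograms are empty.
def chi_square(f, g):
    return sum((fi - gi) ** 2 / (fi + gi) for fi, gi in zip(f, g) if fi + gi > 0)

hourly = [0.3, 0.2, 0.1, 0.25, 0.15]   # illustrative 5-bin pollutant histogram
weekly = [0.25, 0.25, 0.1, 0.2, 0.2]
print(chi_square(hourly, weekly))
```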
When the data distributions in two nodes substantially differ, the advantage of the
safe zone method over GM is very clear, since it can adapt its safe zones to fit the
distinct distributions at the nodes, allowing a much larger safe zone to the node with
the more varying data. In Figs. 1.13,1.14 the different behavior of the nodes’ data is
demonstrated and the safe zones allocated to them depicted. In Fig. 1.15 the advantage
over GM for various thresholds is shown. As the threshold increases, so does the safe zone method's
superiority to GM. For the low thresholds, 0.5 to 0.6, there are many actual (global)
violations, but as the threshold increases, GM still suffers from many “false alarms”
(local violations which are not associated with a global violation), while the safe zone
Figure 1.13: Plots of the chi-square function for two nodes, an “oscillating” one (highly varying data) in green, and a more stable node (in blue). The horizontal axis stands for time (in hours), the vertical axis for the chi-square value.
method performs well.
1.8.6 3-Dimensional Data, Quadratic Function, Polygonal Safe Zones
Another example consists of monitoring a quadratic function with more general polygonal
safe zones in three variables (Fig. 1.16). The data consists of measurements of three
pollutants (NO,NO2,SO2), and the safe zones are polyhedra with eight vertices. Since
each vertex has three degrees of freedom, the number of parameters to optimize over per
node is (8 vertices) · (3 degrees of freedom per vertex) = 24. S is the ellipsoid depicted
in pink. As the extent of the data is far larger than S, the safe zones surround the
regions in which the data is denser. The two safe zones contain 89.5 and 89 percent
of the data in the nodes. To check the constraints, the method in Section 1.5.2 was
applied, with bounding planes instead of lines. The ellipsoid was bounded by planes
defined as its tangent planes at a set of uniformly sampled points on its surface.
1.8.7 Optimization
For the triangular safe zones (Section 1.8.2) we must solve a constrained optimization
problem, with the target function evaluated as described in Section 1.5.1, and the
Minkowski sum constraints which can be checked as explained in Section 1.5.2. These
safe zones have two degrees of freedom each (Mi and βi). Hence, for n nodes, we have
2n parameters to optimize over. For the chi-square monitoring (Section 1.8.5), each safe zone is an axis-aligned box in R⁵ and therefore is defined by ten parameters. The safe zones in Section 1.8.6 require 24 parameters each. In all cases we used the Matlab routine fmincon to solve the optimization problem.
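A toy analogue of this constrained optimization, with scipy's `minimize` standing in for Matlab's `fmincon`: two 1-D safe zones [0, b_i], maximizing the product of covered probability masses subject to the Minkowski-average constraint (b₁ + b₂)/2 ≤ s, i.e. S = [0, s]. The exponential data distributions and all parameter values are illustrative, not from the thesis.

```python
# Constrained safe-zone sizing for two nodes with Exp(rate_i) data.
import numpy as np
from scipy.optimize import minimize

rates = np.array([1.0, 0.25])            # Exp(rate_i) data at node i
s = 2.0                                  # S = [0, s]

def neg_objective(b):
    mass = 1.0 - np.exp(-rates * b)      # P[x_i <= b_i] under Exp(rate_i)
    return -np.prod(mass)                # maximize the product => minimize minus

cons = [{"type": "ineq", "fun": lambda b: 2 * s - b[0] - b[1]}]
res = minimize(neg_objective, x0=[s, s],
               bounds=[(0, None), (0, None)], constraints=cons)
print(res.x)  # the node with more spread-out data receives the larger safe zone
```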
Figure 1.14: The safe zones assigned to the two nodes in Fig. 1.13. The “oscillating” node (top) is assigned a much larger safe zone, to account for its higher variability. Since the data was 5-dimensional, only a 3-dimensional projection is depicted, corresponding to the pollutants NO, NO2, and SO2. Pink dots denote samples from the data; safe zones are in green.
Figure 1.15: Comparing the number of violations between GM and the safe zone method, for a period of 1,000 hours. The allowable family of safe zone shapes used here consisted of 5-dimensional axis-aligned boxes. The horizontal axis is the threshold for the chi-square function, the vertical axis is the ratio between the numbers of violations.
Figure 1.16: 3D example. S is the pink ellipsoid, the safe zones are polyhedra with eight vertices each (in pale blue), and their Minkowski average is in green. The axes stand for concentrations of NO, NO2, SO2.
1.8.8 Improvement Factor and Dimensionality
In the experiments presented here, it can be seen that the improvement of the proposed
safe zone method over the GM approach in [SSK06, SSK08] increases with the dimen-
sionality of the data vectors. The following simple analysis indicates why the freedom
in assigning different safe zones to distinct nodes yields such an improvement. Assume a very simple setup: two nodes are present, a “small node” and a “large node”. The p.d.f. in the small node is uniform over a solid ball of radius 1 − ε, and similarly for the large node, with a radius of 1 + ε. The admissible region S is a ball of radius 1. Since the Minkowski average of balls of radii 1 − ε, 1 + ε is a ball of radius 1, the method proposed here can assign the small node a safe zone consisting of a ball of radius 1 − ε, and the large node a ball of radius 1 + ε. Thus it will incur no false alarms at all (zero communication). However, the GM method cannot assign any node a safe zone larger than S. Hence – since the volume of a d-dimensional ball is proportional to the d-th power of its radius – the part of the p.d.f. volume covered by the safe zone at the large node will be at most (1/(1 + ε))^d ≈ exp(−εd). Thus, even if the p.d.f.s at the two nodes are quite similar – e.g. ε = 0.1 – then for 20-dimensional data vectors, the safe zone approach will incur zero communication, while with GM the large node will have to submit an alert in about 86% of data updates.
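A quick numeric check of this analysis for ε = 0.1 and d = 20:

```python
# Fraction of the large node's mass a GM safe zone (radius <= 1) can cover,
# and the exp(-eps*d) approximation of the resulting alert rate.
import math

eps, d = 0.1, 20
covered = (1 / (1 + eps)) ** d            # exact volume ratio
print(round(covered, 3))                  # 0.149
print(round(1 - math.exp(-eps * d), 2))   # 0.86, the alert rate quoted above
```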
1.9 Chapter Conclusions
In this chapter, a general method for monitoring threshold queries on functions over
distributed streams was presented. In contrast to previous solutions which involved
a cover of the entire convex hull of the local data vectors, the new approach focuses
on direct computation of safe zones for the nodes. Consequently, safe zones are more
flexible than constraints introduced in previous work, as they fit the data distributions
much better.
While the optimization problem involved is proved to be computationally challenging,
approximate solutions are proposed, and are shown to be efficient and practical. More-
over, safe zones can be selected from families of very simple shapes and still outperform
previous methods. As a result, not only does the complexity of selecting safe zones
become reasonable, but the continuous task of violation checking at each node is also
dramatically simplified over previous work. With simple safe zones, the overhead at
every node is negligible, rendering the approach feasible even for thin battery-operated
devices.
Safe zones were implemented and tested for 2-, 3-, and 5-dimensional real-life data using simple families of shapes, demonstrating that the paradigm can reduce communication volume by orders of magnitude.
Chapter 2
Discrete Safe Zones: Biclique
Approach
2.1 Chapter Summary
In this chapter we present a new approach for reducing communication in networks, also using Safe Zones. First, we describe the setting and goals for which our method is relevant. Second, we formulate the problem using terms from Graph Theory. Third, we focus on the case of a system with 2 nodes and present solutions found in the literature. Fourth, we suggest a solution for the general case using a “Hierarchical Heuristic”. Then, we discuss the efficiency of the proposed solution, and we argue that for some very general classes of global functions, the solution is efficient. Finally, we discuss the advantages of this approach over the approach presented in the previous chapter.
2.2 Preliminaries
Suppose we have a network with many sites, each observing a local data vector which can have a different value in every time-step of the network’s lifetime, and someone called C – a central administrator or one of the sites – is interested in computing, in every time-step, the value of a Boolean function over all the local data vectors. Obviously, if C had all the local data vectors of all the sites in a certain time-step, then it could compute the desired function over this data with some effort, depending on the computational complexity of that function. Generally, the functions of interest can be computed very efficiently once all the data is available.
However, we are not concerned with the computational and time complexity, but
with the communications that take place in the network. Although a protocol in which
in every time-step, every site sends his local data vector to C gives a solution for the
task at hand - C knowing the value of the function - the protocol’s communication cost
is very high. We measure the cost of a protocol by the number of messages sent during
its runtime, while ignoring the size of the messages and the costs of the computations
that have to take place at the sites. However, we do seek protocols in which the sites run algorithms that are as efficient as possible. We propose protocols such that if all the
sites participate in these protocols, C will know the value of the function, certainly or
with high probability, while the number of messages sent in the protocol is as small as
possible. We describe a protocol which achieves the minimal number of messages sent,
in expectation, based on the available prior knowledge about the network. However, in
most of the interesting cases, in order to run this protocol the sites will have to perform
computations which are NP-Hard. Therefore, we suggest other computationally lighter
protocols and argue that their communication cost is often (for many interesting cases)
close to that of the optimal protocol. We plan to support this claim with experiments.
The methods proposed here are not limited to one architecture model of the network and can be adapted to many architecture models; we set this question aside for now.
In the following, we concentrate on data vectors which are discrete, so essentially, the local data is taken from a set U of size |U| = n. For example, each site can hold a (log n)-bit vector – U = {0, 1}^{log n} – or an array of length u with elements from {1, ..., m}, for some u, m ∈ N, n = m^u. If, in a network, the local data is not discrete, then we can use any quantization method to obtain discrete data and then apply the following discussion.
The global function which someone wants to compute can be, in the most general case, for a network of k sites, any Boolean f : U^k → {0, 1}. However, many interesting global functions are of simpler forms. Denote the k sites of the network S_1, S_2, ..., S_k, and assume that at time t, the local vector of S_i is x_i^{(t)} ∈ U, or x_i in short. Then, the global function at time t – f^{(t)}(x_1, x_2, ..., x_k) – may be expressed, in some cases, as f^{(t)}(x_1, x_2, ..., x_k) = h(g(x_1, x_2, ..., x_k)), for some simpler functions g : U^k → V and h : V → {0, 1}. For example, previous works have concentrated on global functions for which, in the above representation, g is simply the sum (or average) of the vectors: g(x_1, ..., x_k) = x_1 + ... + x_k. Another interesting example is when U contains binary vectors and g(x_1, ..., x_k) = x_1 ∨ ... ∨ x_k is the bitwise OR operator – this class of functions includes functions computed over bitmaps. We intend to discuss different solutions for different classes of functions based on their representation with simple g, h.
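The h ∘ g decomposition can be made concrete with the bitwise-OR example; the bit-count threshold used for h below is an arbitrary illustration, not taken from the thesis.

```python
# A bitmap-style global function decomposed as f = h o g:
# g is bitwise OR over the local bit-vectors, h thresholds the number of set bits.
def g(vectors):            # g : U^k -> V, bitwise OR of all local vectors
    out = 0
    for x in vectors:
        out |= x
    return out

def h(v, threshold=3):     # h : V -> {0, 1}, fires when enough bits are set
    return 1 if bin(v).count("1") >= threshold else 0

def f(vectors):            # global function f = h o g
    return h(g(vectors))

print(f([0b0011, 0b0100, 0b0001]))  # three distinct bits set -> 1
```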
We are mainly interested in global functions which correspond to monitoring a global property of the network. Therefore, we assume that for most of the time, the value of the global function on the present data is β (β ∈ {0, 1}), and that time-steps in which the value is not β do not happen very often. Therefore, we can think of C’s task as raising an alarm whenever the value of the global function is not β, while remaining silent when it is β.
Furthermore, we assume that C has a predicted p.d.f. for the future local data of each site in the network. That is, C has a prediction for Pr[x_i^{(t_j)} = α] for every site S_i, time-step t_j, and value α ∈ U. The performance of our method is highly dependent on the accuracy of these predictions: the more accurate these p.d.f.s are in predicting the future local data, the fewer messages the protocols send. In this work we only look into p.d.f.s which are not time-dependent, that is, for every two time-steps t_1 ≠ t_2, we assume that Pr[x_i^{(t_1)} = α] = Pr[x_i^{(t_2)} = α], for every site S_i and α ∈ U. In networks
where the local data at the nodes is distributed differently at different points in time, we intend to find sophisticated solutions using our method in future work. However, for
the moment, we suggest the straightforward solution of applying many copies of our
protocols, one for each future time-step. The problem of finding good predictions for
the p.d.f.s is out of the scope of our work, and we assume that either we are given
this knowledge from an outer source, or that we specify the first few time-steps of
the network’s lifetime to learn these p.d.f.s using some machine learning tool and
after the learning is done we start running our protocols. In addition, nodes can run
learning tools on their local data while running the protocols, and once they note a
significant change in their p.d.f. they can notify C and possibly result in a decision
to restart the protocol with this new information. In addition, we assume that the local data at each node is distributed independently of the data at the other nodes, that is: ∀i ≠ j, ∀α, γ ∈ U : Pr[x_i = α | x_j = γ] = Pr[x_i = α]. Handling cases in which this assumption does not hold is also possible with a slight modification to our methods; however, we do not go into it here.
2.3 Problem Definition
We are given a set U of size n; k sites S_1, ..., S_k, each site S_i observing in every time-step t an item x_i^{(t)} from U; a global function f : U^k → {0, 1}; a more probable value β ∈ {0, 1} for the global function; and, for every site S_i and item α ∈ U, the probability that x_i = α, which is equal for all time-steps: Pr[x_i = α]. The goal is to define a protocol in which C outputs f(x_1^{(t)}, ..., x_k^{(t)}) correctly in every time-step t, while the expected number of messages sent in the network during the protocol is minimal. Or, in other words, C raises an alarm whenever f(x_1^{(t)}, ..., x_k^{(t)}) ≠ β.
The protocols we present are of the following form: Each site S_i maintains a set SZ_i ⊆ U called the Safe-Zone (SZ) of S_i, which a site can either compute for itself or get from another site, e.g. C. Now, whenever a site observes a local data item x_i, it sends it to C if and only if x_i ∉ SZ_i. In time-steps for which C doesn’t receive any messages, it is certain that the value of the global function is β, and it remains silent. However, if at least one message was received, C must poll all (or some of) the sites for their local data and then compute the value of the global function by itself. Therefore, for our protocols to be correct, we need the following property to hold [legality of the SZs]: if ∀i ∈ [k] : x_i ∈ SZ_i, then f(x_1, ..., x_k) = β. That is, in time-steps when no communication occurs, the value of the function is guaranteed to be β and the output of C to be correct. In addition, we can see that the number of messages sent during the protocol depends on the sets SZ_i: the larger these sets are, the fewer messages are expected to be sent. Therefore, we seek such sets which will minimize the probability of sending a message in a given time-step, that is, minimize Pr[∃i ∈ [k] : x_i^{(t)} ∉ SZ_i], which is equivalent to maximizing Pr[∀i ∈ [k] : x_i^{(t)} ∈ SZ_i]. By that, we reduce the problem of finding good protocols to finding “large” and “legal” such sets SZ_i, and from now on we discuss the problem of finding the optimal legal SZs.
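The protocol loop can be sketched as follows. The universe, safe zones, data weights, and global function below are toy illustrations (with β = 0), chosen so that the legality property holds; none of them come from the thesis.

```python
# Protocol sketch: each site stays silent while its reading lies in its safe
# zone; on any report, C polls every site and evaluates f exactly.
import random

random.seed(1)
U = [0, 1, 2, 3]
k = 3
SZ = [{0, 1}, {0, 1}, {0, 1}]              # per-site safe zones, SZ_i subset of U
f = lambda xs: 1 if sum(xs) >= 5 else 0    # global function; beta = 0
# Legality: the largest sum inside the SZs is 1+1+1 = 3 < 5, so silence => f = 0.

messages = 0
for t in range(1000):
    xs = [random.choices(U, weights=[6, 2, 1, 1])[0] for _ in range(k)]
    reports = [i for i in range(k) if xs[i] not in SZ[i]]
    if reports:
        messages += len(reports) + k       # the reports, plus C polling all sites
        alarm = (f(xs) != 0)               # C computes f from the polled data
print(messages)
```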
Our problem can be written as follows:
Problem 2.3.1. (The optimal Safe-Zones problem)
• maximize Pr[∀i ∈ [k] : x_i ∈ SZ_i]
• subject to:
– ∀i ∈ [k] : SZ_i ⊆ U
– if ∀i ∈ [k] : x_i ∈ SZ_i, then f(x_1, ..., x_k) = β.
Finding the optimal solution for the above problem gives an optimal protocol of the
form described above. Next, we show an equivalent problem from which we can see the
complexity of finding the optimal solution for our problem, and hopefully, gain insights
for finding good sub-optimal solutions.
2.4 Biclique Formalization - k = 2
If the network has only 2 sites S_1, S_2 (k = 2), then we can reduce our problem to finding a maximum-weight complete bipartite sub-graph of a bipartite graph:
Define the bipartite graph G : (V_1, V_2, E), where |V_1| = |V_2| = |U| = n, and every value α ∈ U has a node on each side of G: v_α^1 ∈ V_1, v_α^2 ∈ V_2. The set of edges is defined based on the global function: e_{xy} = (v_x^1, v_y^2) ∈ E ⟺ f(x, y) = β. And we define a weight function on the edges: w : E → [0, 1], w(e_{xy}) = w(v_x^1, v_y^2) = Pr[x_1 = x ∧ x_2 = y] = Pr[x_1 = x] · Pr[x_2 = y].
A complete bipartite sub-graph (Biclique) of G is B : (W_1, W_2) for which W_1 ⊆ V_1, W_2 ⊆ V_2 and W_1 × W_2 ⊆ E. That is, every pair of vertices from W_1 and W_2 is an edge of G. The weight of a Biclique, for our purpose, is defined as w(B) = Σ_{w_1∈W_1, w_2∈W_2} w(e_{w_1 w_2}). A Biclique B : (W_1, W_2) corresponds to a solution for our problem in which SZ_1 = W_1 and SZ_2 = W_2 – converting vertices back to their respective items.
Claim 2.4.1. The maximum-weight Biclique of G corresponds to the optimal solution of the SZs problem.
Proof. First, we notice that the SZs we get are legal: if x_1 ∈ SZ_1 and x_2 ∈ SZ_2, then v_{x_1}^1 ∈ W_1 and v_{x_2}^2 ∈ W_2, and since W_1 × W_2 ⊆ E, we get that (v_{x_1}^1, v_{x_2}^2) ∈ E and therefore f(x_1, x_2) = β.
Second, we prove that these are the optimal SZs. Assume for contradiction that there are better SZs Z′_1, Z′_2, that is, they are legal and Pr[x_1 ∈ Z′_1 ∧ x_2 ∈ Z′_2] > Pr[x_1 ∈ SZ_1 ∧ x_2 ∈ SZ_2]. Then, we look at the Biclique B′ = (Z_1, Z_2), where Z_1, Z_2 contain all the vertices corresponding to items from Z′_1, Z′_2 respectively. B′ is indeed a Biclique because Z′_1, Z′_2 are legal SZs. However, w(B′) > w(B):
w(B′) = Σ_{v_x^1∈Z_1, v_y^2∈Z_2} w(e_{xy}) = Σ_{x∈Z′_1, y∈Z′_2} Pr[x_1 = x] · Pr[x_2 = y] = Pr[x_1 ∈ Z′_1 ∧ x_2 ∈ Z′_2] > Pr[x_1 ∈ SZ_1 ∧ x_2 ∈ SZ_2] = Σ_{x∈SZ_1, y∈SZ_2} Pr[x_1 = x] · Pr[x_2 = y] = Σ_{v_x^1∈W_1, v_y^2∈W_2} w(e_{xy}) = w(B),
contradicting the optimality of B.
Corollary 2.1. Solving the Biclique problem gives a solution for the optimal SZs
problem. In addition, from the proof we can see that the better the Biclique is, the better
the SZs are.
This leads us to the optimal protocol: find the optimal Biclique and use the corresponding SZs. Finding the optimal Biclique can be done in exponential time by going over every sub-graph, checking whether it is a legal Biclique, computing its weight, and choosing the best Biclique. Therefore, for a network of 2 sites, we have an optimal protocol which requires C to run an exponential-time algorithm.
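The exhaustive search just described can be sketched directly for a toy two-site instance; the global function, β, and item probabilities below are illustrative.

```python
# Enumerate all (W1, W2) pairs over U x U and keep the heaviest legal Biclique.
# Exponential in |U|; feasible only for tiny universes.
from itertools import chain, combinations

def powerset(s):
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))

def best_biclique(U, f, beta, p1, p2):
    best_w, best = -1.0, (set(), set())
    for W1 in powerset(U):
        for W2 in powerset(U):
            if all(f(x, y) == beta for x in W1 for y in W2):   # legal Biclique?
                w = sum(p1[x] * p2[y] for x in W1 for y in W2)
                if w > best_w:
                    best_w, best = w, (set(W1), set(W2))
    return best_w, best

U = [0, 1, 2]
p = {0: 0.7, 1: 0.2, 2: 0.1}
f = lambda x, y: 1 if x + y <= 2 else 0        # beta = 1
w, (SZ1, SZ2) = best_biclique(U, f, beta=1, p1=p, p2=p)
print(w, SZ1, SZ2)
```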
However, we seek efficient protocols, so we need efficient algorithms for the Biclique problem. Unfortunately, the Biclique problem is known to be NP-hard. This means that, in the general case, efficient algorithms which are provably good probably do not exist. Also, we showed (in the previous chapter) that the most general Biclique problem can be reduced to the optimal SZs problem with a complicated enough global function, meaning that the optimal SZs problem is also NP-hard in the general case.
Next, we describe some heuristic solutions for the Biclique problem found in the
literature.
2.4.1 Greedy Heuristic
The greedy algorithm starts by choosing the vertex with biggest number of neighbors,
removing all the nodes from the other side which are not in the set of neighbors of
the chosen vertex, and proceeding in the same way - choosing the vertex with biggest
number of neighbors from the vertices which are still in the graph. The algorithm stop
when there are no more vertices left to chose. The chosen vertices compose the resulting
Biclique.
Analysis: This algorithm always returns a maximal Biclique, that is: there isn’t
a larger Biclique which contains it. In some works which experimented using this
algorithm, the results were only a 12 -factor away from the optimal solution, however,
in other works the gap was usually larger. It is easy to show that this algorithm’s
approximation factor can be as bad as Ω(n) = Ω(√|E|). The running time, naively, is
O(|E| · |V |2), however it can be easily reduced to O(|E|+ |V |2).
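A sketch of this greedy heuristic: repeatedly pick the vertex with the most surviving neighbours and drop the other side's non-neighbours. The function name and the toy graph are illustrative, and ties are broken arbitrarily by (side, vertex).

```python
# Greedy biclique heuristic on a bipartite graph given as a set of (x, y) edges.
def greedy_biclique(V1, V2, edges):
    """Returns (W1, W2), a biclique of the graph (V1, V2, edges)."""
    alive = {1: set(V1), 2: set(V2)}
    chosen = {1: set(), 2: set()}
    def neighbours(side, v):
        if side == 1:
            return {y for y in alive[2] if (v, y) in edges}
        return {x for x in alive[1] if (x, v) in edges}
    while alive[1] or alive[2]:
        # pick the surviving vertex with the most surviving neighbours
        _, side, v = max((len(neighbours(s, u)), s, u)
                         for s in (1, 2) for u in alive[s])
        alive[side].remove(v)
        chosen[side].add(v)
        other = 2 if side == 1 else 1
        alive[other] = neighbours(side, v)   # keep only v's neighbours
    return chosen[1], chosen[2]

V1 = V2 = {0, 1, 2}
edges = {(x, y) for x in V1 for y in V2} - {(2, 1), (2, 2)}
W1, W2 = greedy_biclique(V1, V2, edges)
print(W1, W2)
```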
2.4.2 Linear Programming
The Biclique problem can be written as an integer linear program, which can be relaxed to a linear program with 2n + n² variables and n² linear constraints. The fractional solution to this linear program can be rounded to a binary solution which defines a Biclique. It is possible to prove that the number of edges not in the Biclique in this solution is at most twice the number of edges not in the optimal Biclique, which does not provide a good guaranteed approximation factor for our Biclique problem in the general case. The running time is the time needed to solve a linear program with O(n²) dimensions.
2.5 Generalized Biclique Formalization
If our network had only 2 sites, then the above algorithms could have given us good protocols with communication costs close to that of the optimal protocol. However, this case is very limited and we want to find protocols for networks with more sites. So, we generalize the reduction from the SZs problem (which is defined for any number of sites k) to another Graph-Theory problem – maximum-weight k-clique in k-uniform k-partite hypergraphs:
Define the k-partite hypergraph H : (V_1, V_2, ..., V_k, E), where |V_1| = |V_2| = ... = |V_k| = |U| = n, and every value α ∈ U has a node on every side of H: v_α^1 ∈ V_1, ..., v_α^k ∈ V_k. The set of edges is defined based on the global function: e_{(x_1...x_k)} = (v_{x_1}^1, ..., v_{x_k}^k) ∈ E ⟺ f(x_1, ..., x_k) = β. This hypergraph is k-uniform because for every edge e ∈ E, |e| = k, i.e., it contains k vertices. And we define a weight function on the edges: w : E → [0, 1], w(e_{y_1...y_k}) = w(v_{y_1}^1, ..., v_{y_k}^k) = Pr[x_1 = y_1 ∧ ... ∧ x_k = y_k] = Π_{i=1}^k Pr[x_i = y_i].
A k-clique of H, in this case, is B : (W_1, ..., W_k) for which W_1 ⊆ V_1, ..., W_k ⊆ V_k and W_1 × ... × W_k ⊆ E. That is, every k-tuple of vertices from W_1, ..., W_k is an edge of H. The
weight of a k-clique, for our purpose, is defined as w(B) =∑
w1∈W1,...,wk∈Wkw(ew1...wk
).
A k-clique B : (W1, ...,Wk) corresponds to a solution for our problem in which SZi = Wi
for all i = 1...k - converting vertices back to their respective items.
As above, the following claim and corollary hold:
Claim 2.5.1. The maximum weight k-clique of H corresponds to the optimal solution
of the SZs problem.
Corollary 2.2. Solving the k-clique problem gives a solution for the SZs problem and
the better the k-clique is, the better the SZs are.
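Concretely, the correspondence can be sketched in code (the helper names are illustrative; the global function f, the safe value β, and the per-site marginals Pr[xi = y] are assumed given):

```python
from itertools import product

def is_k_clique(W, f, beta):
    """A tuple of sets W = (W1, ..., Wk) is a k-clique iff every k-tuple of
    vertices, one from each set, is an edge, i.e. maps to beta under f."""
    return all(f(*combo) == beta for combo in product(*W))

def clique_weight(W, prob):
    """Weight of a k-clique: the sum, over all k-tuples, of the products of
    per-site marginals (prob[i][y] stands for Pr[x_i = y])."""
    total = 0.0
    for combo in product(*W):
        p = 1.0
        for i, y in enumerate(combo):
            p *= prob[i][y]
        total += p
    return total
```

Because the sites are independent, the weight factors as ∏_i ∑_{y ∈ Wi} Pr[xi = y], i.e., exactly the probability that no site leaves its safe zone.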
So, next, we look for solutions to this k-clique problem. We note that applying
the above algorithms for k > 2 is computationally infeasible, because their running time
depends on the number of edges, which is, in the worst case, n^k. This means
that even if |U| = 2, so that the data at each node is a single bit, the complexity of any
algorithm that is linear in the number of edges could be exponential in k, and therefore
impractical for large-scale networks. We also note that, for our use, the hypergraph is
not given explicitly; rather, we can check whether a given set of vertices forms an edge
by evaluating the global function on that set.
2.6 Hierarchical Heuristic
A possible way to solve the k-clique problem is to arrange the k sites in a hierarchy in
the form of a full binary tree (assume k = 2^ℓ) with the sites at the leaves. Then, in every
node of the tree we solve a Biclique problem and use the resulting Biclique to define
the Biclique problems of the two children of that node, which are then solved in turn.
Eventually, at the lowest level of the tree, we have 2 sites, so it is a Biclique problem of
the kind discussed above, which we can solve to obtain the SZs (or the k-clique).
To explain this hierarchical solution, consider a network with 4 sites: S1, S2, S3, S4.
Assume the hierarchy groups S1, S2 together and S3, S4 together. Then, instead of
solving a 4-site problem, we imagine we have only 2 sites, namely S12, S34, which take
their local data from the set U × U = U², and each value (y1, y2) ∈ U² is received at the
"site" S12 with probability Pr[x12 = (y1, y2)] = Pr[x1 = y1 ∧ x2 = y2], and similarly for
S34. The global function remains f1234((x1, x2), (x3, x4)) = f(x1, x2, x3, x4).
Therefore, we get a well-defined SZs problem with two sites, and correspondingly a
Biclique problem, which we can solve (using one of the suggested algorithms) to get
two SZs, SZ12, SZ34 ⊆ U². Then, going down to the leaves in order to decompose the
SZs of the pairs into SZs of individual sites, we define the following two SZ problems
over two sites each. In the first, the sites are S1, S2 with their regular data distributions;
however, the "global function" is no longer the same. Instead, it is defined as
f12 : U² → {0, 1}, f12(x1, x2) = β ↔ (x1, x2) ∈ SZ12. Again,
we solve this SZs problem by converting it to a Biclique problem, and eventually we
get sets of vertices W1, W2. Similarly, we define the problem for S3, S4, and get two sets
W3, W4. Then, the 4-clique we return is B : (W1, W2, W3, W4).
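The four-site decomposition described above can be sketched as follows; `solve_biclique` stands for any two-site SZ solver (e.g., one of the algorithms above), and all names are illustrative:

```python
from itertools import product

def hierarchical_szs(U, probs, f, beta, solve_biclique):
    """Hierarchical heuristic sketch for k = 4 sites. probs[i] maps each
    value in U to Pr[x_i = value]; solve_biclique(pA, pB, g, beta) returns
    a pair of safe zones for a two-site problem with global function g."""
    # Level 1: fuse sites pairwise; the "sites" S12, S34 draw values from U x U.
    p12 = {(y1, y2): probs[0][y1] * probs[1][y2] for y1, y2 in product(U, U)}
    p34 = {(y3, y4): probs[2][y3] * probs[3][y4] for y3, y4 in product(U, U)}
    f_top = lambda a, b: f(a[0], a[1], b[0], b[1])          # f1234
    SZ12, SZ34 = solve_biclique(p12, p34, f_top, beta)
    # Level 2: decompose each pair-SZ into per-site SZs, with a new
    # "global function" that checks membership in the parent SZ.
    f12 = lambda x1, x2: beta if (x1, x2) in SZ12 else None
    W1, W2 = solve_biclique(probs[0], probs[1], f12, beta)
    f34 = lambda x3, x4: beta if (x3, x4) in SZ34 else None
    W3, W4 = solve_biclique(probs[2], probs[3], f34, beta)
    return W1, W2, W3, W4
```

The sketch makes the efficiency concern below concrete: p12 and p34 each enumerate |U|² fused values, so without further structure the top-level problem squares in size at every level.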
This can be applied to any k that is a power of 2, yielding an algorithm for the
k-clique problem. Next, we analyze its correctness, optimality and efficiency.
Claim 2.6.1. The resulting sets B : (W1, ...,Wk) compose a legal k-clique.
Proof. It is enough to prove this for k = 4; the extension to greater k is trivial. If
v^1_{y1} ∈ W1, ..., v^4_{y4} ∈ W4, then (y1, y2) ∈ SZ12 and (y3, y4) ∈ SZ34, and therefore
f(y1, ..., y4) = f1234((y1, y2), (y3, y4)) = β.
Regarding the optimality of this solution, we can show that even if we solve the
Biclique problem optimally in every phase of the algorithm, the resulting k-clique may
be sub-optimal. However, we expect this heuristic to perform well in practice, and we
intend to test this claim experimentally. It may also be possible, with some effort, to
prove approximation factors for this heuristic. In addition, we note that the optimality
of the solution depends on the choice of hierarchy, and finding good or optimal
hierarchies is an interesting open question.
Finally, we claim that this algorithm can be efficient for many classes of functions,
even though in the general case it is not. Note that although we solve only k − 1
Biclique problems in this algorithm (one per internal node of the tree), the number of
vertices in these problems can be very large. For k = 4, we have seen that in the first
problem we solve, each value from U² becomes a vertex; that is, the graph has 2 · n²
vertices. Generally, the first problem will have n^{k/2} vertices on each side of the
bipartite graph, which yields a very inefficient algorithm.
2.6.1 Classes of functions
1. General functions over the OR/AND
Let us look at a class of very simple functions: f : ({0, 1}^{log n})^k → {0, 1}, such
that f(x1, ..., xk) = g(x1 ∨ ... ∨ xk) for some g : {0, 1}^{log n} → {0, 1}. That is,
general Boolean functions defined over the bitwise-OR of log n-bit vectors. In a
network scenario, this could mean that C is interested in monitoring a function
over the global vector which is the bitwise-OR of the local vectors; this can
represent many monitoring tasks in which bitmaps are used. For example, in the
problem of distinct count (F0) monitoring, each site has a binary vector, and C
wants to know whether the number of 1's in the OR of all the vectors is larger
than some threshold τ.
In this case, the above hierarchical solution can be very efficient, once one observes
that when going up in the hierarchy, we do not need to increase the number
of vertices in the graphs. Let us look at the k = 4 example again, and assume
U = {0, 1}^{log n} and f(x1, ..., xk) = g(x1 ∨ ... ∨ xk). Then, (x1, x2) can be replaced
with x1 ∨ x2, and the same algorithm applied. In the first problem, although we
have a vertex for every pair (x1, x2), many of them are equivalent, since there are
only n possible values for x1 ∨ x2, even though there are n² pairs. Therefore, we
give the value x1 ∨ x2 a probability, or weight, equal to the sum of the
probabilities of all pairs which produce this value.
Formally, in the first stage, we solve a SZ (or Biclique) problem with S12, S34, which
take their local data from the set U, and each value y12 ∈ U is received at the "site"
S12 with probability Pr[x12 = y12] = ∑_{y1, y2 : y1 ∨ y2 = y12} Pr[x1 = y1 ∧ x2 = y2], and
similarly for S34. The global function is defined as f1234(x12, x34) = g(x12 ∨ x34).
Then, we solve this problem as above, and get two SZs, SZ12, SZ34 ⊆ U.
Then, going down to the leaves, for example to decompose SZ12 to the two
sites S1, S2, we define a SZ problem as before, but the function now becomes
f12 : U² → {0, 1}, f12(x1, x2) = β ↔ (x1 ∨ x2) ∈ SZ12. We proceed in the same
way, obtaining the desired solution.
Therefore, the time complexity in this case is polynomial in both n and k.
Note that the case f(x1, ..., xk) = g(x1 ∧ ... ∧ xk), where x ∧ y is the bitwise
AND of the binary vectors x, y, can be solved efficiently in the same way.
2. General functions over the sum of integers
Another very simple example, which can help explain the ideas presented, is
when each site holds a number from 1 to n, i.e., U = {1, ..., n}, and C is interested
in computing a function of the sum of these numbers: f(x1, ..., xk) =
g(x1 + ... + xk). Again, in this case, the number of vertices in the higher levels
of the hierarchy does not become exponential in k, but grows by a factor of 2
at every level, since we only need to keep track of (x1 + x2) ∈ {1, ..., 2n} rather
than (x1, x2) ∈ {1, ..., n}². Therefore, all the Biclique problems we solve will be over
bipartite graphs with up to k · n nodes.
3. General functions over the union of streams
In the distributed streaming model, each site observes a stream of items from a
set {1, ..., u}, and the global stream is defined as the union of all these streams.
Usually, C is interested in monitoring a property of a sliding window of this stream,
and we assume that in any window, the number of received items is at most
m. An equivalent representation of this problem is that every site Si has
a vector xi of length u with values from {0, ..., m}, where xi(j) equals the number
of times item j was received at site Si in the last window. Then, the vector
over which C wants to compute a function is the sum of these u-dimensional vectors:
x1 + ... + xk. Therefore, in our notation, U = {0, ..., m}^u, and, in addition, we
know that the global vector is also in U. So, again, using the hierarchical solution,
we can use |U| = (m + 1)^u = n nodes in all the problems of the hierarchy, and get
a solution polynomial in n and k.
4. General Boolean functions
For the most general case of functions over Boolean log n-bit vectors, we do not
have an efficient solution yet; however, we present two ideas.
First, we can follow the same path as before and try to represent our function
as f(x1, ..., xk) = g(h1(x_{i1} ∧ ... ∧ x_{is}), ..., ht(x_{j1} ∨ ... ∨ x_{js})), and then use the
hierarchical solution, where at every level of the hierarchy, in the worst case, we
have to keep items from U^t rather than U^k, that is, O(n^t) instead of O(n^k).
If the representation obeys certain conditions, then with certain hierarchies
we might need only items from U at all levels.
Second, we can try to solve the k-clique problem directly, without dividing it hierarchically.
However, following the previous discussion, we need algorithms that are sub-linear
in the number of edges, which could be n^k. Our idea is to use "property testing"-
like techniques, so that we can work on a small sample of the hypergraph and
get a solution which is, with high probability, a legal k-clique, whose size is a
good approximation of the size of the optimal k-clique. One possible direction is
generalizing the greedy heuristic presented earlier to the k-clique case, where,
to find the node with the largest number of edges connected to it, we sample
t = O(poly(n, k)) edges and decide based on this sample. This can probably
give an approximate solution that is not much worse than the non-randomized
greedy algorithm; however, it does not guarantee a legal k-clique, because
verifying deterministically that a given tuple of vertex sets is a k-clique might
require checking Ω(n^k) tuples. This idea can be used if we allow the protocols to err
with some probability.
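The pair-collapsing step used in the OR, AND, and sum cases above can be written once, for any associative operator (a sketch; the function name is illustrative):

```python
from collections import defaultdict

def fuse(p1, p2, op):
    """Distribution of op(x1, x2) for two independent sites: Pr[x12 = y] is
    the total probability of all pairs (y1, y2) with op(y1, y2) = y. For
    bitwise OR/AND the support stays within U; for + it at most doubles
    per level of the hierarchy."""
    fused = defaultdict(float)
    for y1, q1 in p1.items():
        for y2, q2 in p2.items():
            fused[op(y1, y2)] += q1 * q2
    return dict(fused)
```

Passing `lambda a, b: a | b`, `lambda a, b: a & b`, or `lambda a, b: a + b` recovers the three tractable classes; the support of the fused distribution is what bounds the vertex count at each level.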
2.6.2 Pruning nodes in the Biclique problem
In many cases, n can be very large as well, and O(n³) solutions will not be good enough
for us. Thus, we propose "pruning" techniques in order to reduce the number of nodes
in the Biclique problems we solve.
First, we suggest removing, from any Biclique problem we intend to solve, nodes
whose probability is smaller than some ε ≥ 0. The removed nodes can either be
simply ignored, or used in more sophisticated ways.
Second, if we have a distance measure defined over U and U^k, and we know that
our global function f : U^k → {0, 1} obeys a Lipschitz condition over this metric, then we
can cluster nodes whose corresponding items are close enough, connecting this
"big" node to another node iff all the nodes composing the "big" node were connected
to that node. The weight of such an edge is the sum of the weights of all the old edges.
Of course, we may lose edges in this process, but the number of edges lost can be
traded off against the size of the clusters and, consequently, the number of nodes removed.
Third, we can generalize the clustering idea to cases where we do not have a metric
or a Lipschitz condition, by clustering together nodes whose merging does not cause
many edges to be lost. Assuming we have a Biclique problem with |V1| = |V2| = n, and
we know the set of edges E, we can perform a pre-processing step in O(|V|³) and obtain
a graph with fewer nodes, in which every group of nodes that can be clustered together
without losing more than ε edges (for a given ε ≥ 0) becomes a new "big" node.
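The first pruning rule can be sketched directly (illustrative code; how the discarded nodes are used afterwards is left open, as in the text):

```python
def prune_low_prob(p, eps):
    """Drop values whose probability is below eps, returning the surviving
    (unnormalized) distribution together with the discarded probability mass."""
    kept = {y: q for y, q in p.items() if q >= eps}
    return kept, sum(p.values()) - sum(kept.values())
```

Tracking the discarded mass matters: any value pruned from a site's distribution can no longer appear in that site's SZ, so the discarded mass adds to the probability of a local violation.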
2.7 Advantages over the geometric Safe Zones
In the previous chapter, we introduced a different approach for distributing Safe Zones
to the nodes, which also aims at finding communication-efficient monitoring protocols.
There, the Safe Zone of each site was a convex polygon in R^d, and we
demanded that the Minkowski sum (or average) of the Safe Zones of all the sites
be contained in the global admissible region, which we called S. In this chapter, in
contrast, we look at Safe Zones which are sets of discrete points, and the legality
constraint we demand is a "Biclique condition" instead of a Minkowski-sum condition.
Here we discuss the advantages of this approach over the geometric Safe Zones of the
previous chapter.
1. Support for more target functions: In the geometric Safe Zones approach, the
suggested algorithms only work for a global function f : U^k → {0, 1} that depends
only on the sum or average of the local vectors: f(x1, ..., xk) = g(∑_{i=1}^k xi) for
some g : U → {0, 1}. Otherwise, the Minkowski sum would not make sense, and
there is no immediate way to compute, or even write down geometrically, the
constraints of the optimization problem we solve to get the Safe Zones. The present
approach, in its general form, can be used to monitor any global
function, and is therefore more generic. Moreover, the class of functions for which
we propose practical, scalable algorithms for computing good Safe Zones with this
method is a superset of the class of functions that depend on the sum or average
of the local vectors.
2. Non-Convex Safe Zones: In the previous approach we were forced to restrict
the Safe Zones to convex shapes, because otherwise the computational complexity
of checking the constraints (and therefore of the whole optimization problem)
would grow exponentially in the number of sites k. In the Biclique approach, the
(non-)convexity of the Safe Zones does not matter; therefore, in cases where
the data at the nodes cannot easily be covered by legal convex Safe Zones, the
Biclique approach may give Safe Zones that contain more of the probability mass
of the nodes' p.d.f.s, and therefore result in a more efficient monitoring protocol.
To illustrate this, we designed a synthetic setup for which any protocol with convex
Safe Zones would result in a high probability of violations, and then computed
Safe Zones using the Biclique approach. The resulting Safe Zones are plotted
in Figure 2.1: the non-convexity of the Safe Zones allows them to contain a large
number of the data samples.
Figure 2.1: Non-convex SZs example. Plotted are the data samples of both nodes (right and left clouds), and the "legal" global data points (center), which are the averages of every pair of points, one from each node, such that the average resides in the admissible region S. This S corresponds to a function such that f(x, y) = β ⇐⇒ c1 ≤ (x − a1)² + (y − a2)² ≤ c2. The data points colored green, at the nodes, are points that were in the resulting SZ. The SZs are non-convex sets. The green points in the center are the averages of all pairs of points, one from each SZ.
3. No use of optimization toolboxes: To solve the optimization problem of finding
the optimal SZs, the two approaches use different algorithms. In the geometric
approach, an optimization toolbox is used: a black-box algorithm that receives
the parameters of the problem as input and returns a solution. The computations
inside that black box involve complicated gradient-descent algorithms, making it
very hard for a user of the method to understand what is going on inside. Here,
in contrast, the problem is converted into a Biclique problem and solved by an
algorithm for the Biclique problem; for example, one can choose the greedy and
hierarchical heuristics, which involve very simple and direct computations.
Chapter 3
Violation Resolution in
Distributed Stream Networks
3.1 Chapter Summary
Distributed stream networks continuously track the global score of the data and alert
whenever a given threshold is crossed. The global score is computed by applying a
scoring function over the aggregated streams. However, the sheer volume and dynamic
nature of the streams impose excessive communication overhead.
Most recent approaches eliminate the need for continuous communication by using
local constraints assigned to the individual streams. These constraints guarantee that
as long as no constraint is violated, the threshold is not crossed, and therefore no
communication is necessary. Regrettably, local constraint violations become more and
more frequent as the network grows, and in the presence of such violations,
communication is inevitable.
In this chapter, we show that in most cases the violations can be resolved efficiently.
Although our solution requires only a reduced subset of the network streams, finding the
minimum resolving set is NP-hard. Through an analysis of the probability of resolution, we
suggest methods for selecting the resolving set so as to minimize the expected communication
overhead and the expected latency of the process. Experimental results with both
synthetic and real-life data sets demonstrate that our methods yield considerable
improvements over existing approaches.
3.2 Introduction
Distributed stream networks have become very common in many fields of technology such
as sensor networks [MF02], analysis of financial time series [YSJ+00], Web applications
[KL10], and more. In these networks, numerous distributed nodes handle highly
dynamic, continuous data streams, and their goal is to detect some global property
over the distributed data. A fundamental application in distributed stream networks is
Threshold Monitoring, the goal of which is to constantly alert whenever the value of a
predetermined function, evaluated over the network-wide data, crosses a given threshold.
A trivial approach for monitoring is to continuously or periodically centralize all the
data, thus transforming a distributed problem into a centralized one. This, however,
places an intolerable burden on the network.
Considerable research efforts were made to reduce network communication overhead
in continuous distributed monitoring, as presented in a recent survey by Cormode
[Cor11]. Reviewed solutions include data sketching [CG05] and data sampling [CMYZ10]
algorithms, in which the minimum required amount of data is centralized in order to
approximately detect threshold crossing events. Another approach [BO03b, RN04,
SR08, GRM10, TZWL08, SSK07b] is to assign local constraints at the individual nodes,
such that as long as all the constraints are valid, it is guaranteed that the threshold has
not been crossed. The latter approach enables exact detection of threshold crossings,
while minimizing communication overhead.
The main challenge in the local constraints approach is to efficiently define the
constraints so as to minimize the number of violations over time. However, local
constraint violations are bound to happen from time to time due to local behaviour (e.g.,
reading errors, energy deficiency, or local interrupts). A local violation can sometimes
indicate that a global violation, i.e., threshold crossing, has occurred, but in most cases
it suggests nothing more than a local phenomenon. The process which determines the
network status in the presence of local violations is referred to as violation resolution.
As the size of the network increases, so does the probability that local violations
will occur, and with it the need to resolve them frequently and efficiently. The resolution
process is required to reduce the overall communication cost while meeting a certain
latency expectation.
A few violation resolution algorithms have been presented, mainly for star-shaped
networks (that consist of a central coordinator). In [BO03b, RN04, SR08, GRM10,
TZWL08], the network status is determined by the data and constraints held by the
coordinator and the violating nodes. If additional data are required, then the data for
the entire network are collected. As the number of local violations increases, this process
imposes an onerous communication cost. In order to reduce network overhead, the
coordinator in [CMY11, KCR06] waits for several violation reports before collecting the entire
network data. This process reduces the overall cost but still requires communicating
with all the nodes.
Recently, in [SSK07b, SSK07a], an incremental violation resolution method was
presented, which used randomly chosen subsets of non-violating nodes. However, this
method is tailored to their algorithms and no bounds were provided for the expected
size of these sets.
In this work we address the problem of violation resolution for local constraint
monitoring. We present a general approach that attempts to resolve local violations
by polling data from a subset of non-violating nodes, referred to as the resolving set.
Our goal is to reduce communication cost by detecting a minimum-size resolving set,
while maintaining a fair latency. To the best of our knowledge, this is the first time
the problem of a minimum-size resolving set is studied. We prove that this problem is
NP-hard and suggest heuristic approaches for solving it. Assuming homogeneous data
setups, we propose a random method (similar to [SSK07a]) and present some theoretical
bounds using Hoeffding [Hoe63] and Bernstein [Tro10] inequalities. Acknowledging the
challenge of heterogeneous data setups, we propose an efficient method using algorithms
from graph theory. Our methods were extensively tested over both synthetic and
real-life data sets, achieving a substantial reduction in communication cost and latency,
in comparison to current algorithms.
This chapter is organized into eight sections. In Section 3.3 we discuss related work,
and notations and terminology are presented in Section 3.4. In Section 3.5 we present
an overview of our generic approach. In Sections 3.6 and 3.7 we present our Random
Logarithmic algorithm (RLG) and Maximum Matching Tree algorithm (MMT) for
homogeneous and heterogeneous setups, respectively. Finally, in Sections 3.8 and 3.9
we present experimental results and conclusions.
3.3 Related Work
Resolution of local constraint violations is commonly addressed as a subproblem in
threshold monitoring algorithms. Threshold monitoring algorithms over star-shaped
networks [BO03b, RN04, SR08, GRM10, TZWL08] usually proposed a two-phase
resolution process. First, the coordinator attempts to complete resolution without involving
any nodes other than itself and the violating nodes. If it fails, it tries to resolve the
violations by communicating, in a single round, with the entire network. A similar
approach was suggested for value monitoring algorithms [CMY11, KCR06]. Communication
was somewhat reduced, as the coordinator would wait for several violation
reports before communicating with the entire network. However, in sufficiently large
networks, any algorithm that requires communicating with the entire network would
incur high communication cost. Furthermore, this approach exposes a trade-off between
the communication cost and the latency to alert about a global violation.
Recently, [SSK07b, SSK07a] suggested gradually increasing the size of the resolving
set in a number of rounds. At each round the resolving set was increased by a single
non-violating node [SSK07b] or by an exponentially increasing number of non-violating
nodes [SSK07a]. While these methods are the closest to our work, the nodes in both were
selected at random, assuming a homogeneous data setup. In addition, no bounds were
presented for the expected size of the resolving set or for the expected latency of the
resolution process.
Threshold monitoring problems have also been researched for tree-shaped [JDZ+07]
and peer-to-peer [WBK09] networks. In these algorithms, nodes communicate according
to a predefined overlay communication tree. Their resolution processes consist of
multiple rounds, where at each round the size of the resolving set is increased by the set
of adjacent (or ancestor) nodes of the current resolving set. While the communication
tree efficiently reduces the size of the resolving set, neither algorithm suggests a way
to define this tree. In our work we present a construction of an overlay tree structure,
tailored for heterogeneous data setups.
Finally, several threshold monitoring algorithms [OJW03, RN04, HNG+06] assumed
violation resolution processes, but these were not presented by the authors.
3.4 Violation Resolution and Minimum Resolving Set
3.4.1 Problem Definition
We consider a distributed online environment consisting of n remote monitoring nodes
N1, N2, ..., Nn and a central coordinator node NC . Nodes communicate through the
coordinator node, and direct communication between monitoring nodes is not allowed.
We assume that the nodes are synchronized and each monitoring node observes an
individual stream of multidimensional data over discrete time. Let v^t_i be the d-dimensional
real vector collected by node Ni at time t. We denote this vector as the local vector of
node Ni. Each node is assigned a weight ωi ∈ R, and we define the weighted average of
all the local vectors at time t as the global vector v^t, i.e., v^t = (∑_{i=1}^n ωi v^t_i) / (∑_{i=1}^n ωi).
Note that the weighted average operator can be replaced by any other commutative
and associative operator (e.g., multiplication). For simplicity, and w.l.o.g., in the rest of
the chapter we assume that ωi = 1 for every node Ni, such that v^t = (1/n) ∑_{i=1}^n v^t_i.
Given an arbitrary monitoring function f : R^d → R and a threshold value τ ∈ R,
the coordinator node needs to constantly alert whenever f(v^t) exceeds (or drops below)
τ. This threshold monitoring query can be reduced to a simple domain constraint over
the global vector. Let S be the entire set of vectors over which the monitoring function
does not cross the threshold, i.e., S = {v ∈ R^d | f(v) ≤ τ}. The coordinator node needs to
alert whenever v^t ∉ S. We denote S, the domain over which this constraint is satisfied,
as the global safe zone.
Assume that every monitoring node Ni is associated with a subset of the data space
Si ⊆ R^d, denoted as the local safe zone of node Ni, such that the following condition
holds:
(∧_{i=1}^n v_i ∈ S_i) → (1/n) ∑_{i=1}^n v_i ∈ S.    (3.1)
It follows that an overall satisfaction of the local domain constraints (imposed by
the local safe zones) over the local vectors would imply that the domain constraint
over the global vector is also satisfied. In other words, as long as all the local vectors
reside within their respective local safe zones (v^t_i ∈ S_i, i = 1 . . . n), the global vector
is guaranteed to reside within the global safe zone (v^t ∈ S). This case requires no
knowledge of the local vectors of the monitoring nodes, and thus eliminates any need
for communication.
At any time t a node’s local vector can deviate from its local safe zone. We refer to
this event as a local violation. When local violations occur, the violating nodes report
their local vectors to the coordinator, which then determines the network status to
see if a global violation has occurred. This decision process is referred to as violation
resolution.
An example of a monitoring system in a 2-dimensional space is presented in Figure
3.1. It depicts snapshots of the system at two consecutive time steps. In the first, all the
local constraints are satisfied and the coordinator can determine, without communication,
that the global constraint is also satisfied. In the second, a local violation has occurred
(at N1) and resolution is required.
The goal of this work is to reduce the communication required for resolution while
maintaining a fair latency for this process. The latency is the time required to
determine the network status, particularly in the case of a global violation. While
determining that a global violation has occurred usually requires knowledge of all the
local vectors, this case is rare in most monitoring applications (e.g., natural hazards
detection). Commonly, if there is no global violation, then the local violations can
be resolved by collecting the local vectors of a small set of nodes, referred to as the
resolving set.
3.4.2 Resolving Local Violations
We next show how violation resolution is achieved by the resolving set. To this end we
generalize the notions of local vector and local safe zone to a set of nodes. Given a set
of nodes A, let v^t_A denote the vector of A at time t, and let S_A denote the safe zone
of A. v^t_A is defined as the average of the local vectors of the nodes in A at time t, i.e.,
v^t_A = (1/|A|) ∑_{Ni ∈ A} v^t_i. Similarly, S_A = { (1/|A|) ∑_{Ni ∈ A} v_i | ∧_{Ni ∈ A} v_i ∈ S_i }.
These definitions are consistent with the operation that defines the global vector (as
presented in Section 3.4.1).
Lemma 3.4.1. The above definitions satisfy:
1. S_{N1,...,Nn} ⊆ S.
Figure 3.1: A monitoring system, consisting of 3 monitoring nodes and a coordinator, in a 2-dimensional space. The safe zones are given as rectangles in the plane and the vectors are marked by dots. It is easy to see that the average of any 3 vectors, taken respectively from the local safe zones, resides within the global safe zone (S). At t = 0, all the local vectors reside within their respective safe zones and, consequently, the global vector is also inside the global safe zone. In this case, none of the monitoring nodes reports its local vector to the coordinator. At t = 1, a local violation has occurred at N1 (v^1_1 ∉ S1). N1 would now report its local vector to NC to seek resolution. NC must poll N2 or N3 (or both) for their local vectors in order to verify that v^1 ∈ S.
2. For every time t and mutually disjoint subsets of nodes A1, . . . , Am:
(∧_{i=1}^m v^t_{Ai} ∈ S_{Ai}) → v^t_A ∈ S_A,
where A = ⋃_{i=1}^m Ai.
Proof.
1. S_{N1,...,Nn} = { (1/n) ∑_{i=1}^n v_i | ∧_{i=1}^n v_i ∈ S_i } by definition. S_{N1,...,Nn} ⊆ S
follows directly from Equation 3.1.
2. For every time t and mutually disjoint subsets of nodes A1, . . . , Am:
∧_{i=1}^m v^t_{Ai} ∈ S_{Ai}
↔ ∧_{i=1}^m (1/|Ai|) ∑_{Nj ∈ Ai} v^t_j ∈ { (1/|Ai|) ∑_{Nj ∈ Ai} v_j | ∧_{Nj ∈ Ai} v_j ∈ S_j }
→ ∧_{i=1}^m ∑_{Nj ∈ Ai} v^t_j ∈ { ∑_{Nj ∈ Ai} v_j | ∧_{Nj ∈ Ai} v_j ∈ S_j }
→ ∑_{i=1}^m ∑_{Nj ∈ Ai} v^t_j ∈ { ∑_{i=1}^m ∑_{Nj ∈ Ai} v_j | ∧_{i=1}^m ∧_{Nj ∈ Ai} v_j ∈ S_j }
→ ∑_{Nj ∈ A} v^t_j ∈ { ∑_{Nj ∈ A} v_j | ∧_{Nj ∈ A} v_j ∈ S_j }
→ (1/|A|) ∑_{Nj ∈ A} v^t_j ∈ { (1/|A|) ∑_{Nj ∈ A} v_j | ∧_{Nj ∈ A} v_j ∈ S_j }
↔ v^t_A ∈ S_A,
where A = ⋃_{i=1}^m Ai.
It follows that when local violations occur, the coordinator node can rule out a global
violation by acquiring only the local vectors of the resolving set. This is justified by the
following theorem.
Theorem 3.1. Let V be the entire set of violating nodes at time t. If there exists a set
of nodes A such that V ⊆ A and v^t_A ∈ S_A, then v^t ∈ S.
Proof. Let N = {N1, . . . , Nn}. Since V ⊆ A, N \ A consists of only non-violating nodes,
i.e., v^t_{Ni} ∈ S_{Ni} for every Ni ∈ N \ A. Since we also have v^t_A ∈ S_A, we conclude by
Lemma 3.4.1 that v^t = v^t_N ∈ S_N ⊆ S.
Following this theorem, we denote R = A \ V as the resolving set.
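Theorem 3.1 yields a simple coordinator-side check. The sketch below assumes axis-aligned box safe zones, as in Figure 3.1 (for boxes, S_A is itself a box whose bounds are the averages of the individual bounds); all names are illustrative:

```python
def resolves(violating, resolving, vectors, boxes):
    """Return True iff A = violating ∪ resolving rules out a global violation,
    i.e. the average vector of A lies in S_A. vectors[i] is node i's local
    vector; boxes[i] = (lo_i, hi_i) gives the bounds of the box safe zone S_i."""
    A = sorted(violating | resolving)
    d = len(vectors[A[0]])
    mean = [sum(vectors[i][j] for i in A) / len(A) for j in range(d)]
    lo = [sum(boxes[i][0][j] for i in A) / len(A) for j in range(d)]
    hi = [sum(boxes[i][1][j] for i in A) / len(A) for j in range(d)]
    return all(lo[j] <= mean[j] <= hi[j] for j in range(d))
```

If the check fails, the coordinator can grow the resolving set and repeat; once all nodes are included, it can evaluate the monitoring function on the global vector directly.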
(a) 4-node homogeneous system monitoring concentrations of air pollutants
(b) 4-node heterogeneous system monitoring occurrences of terms in news reports
Figure 3.2: Local violations in homogeneous and heterogeneous systems. The safe zones are given as rectangles in the plane and the local vectors are marked by the enlarged dots. The safe zones were intentionally fit to the distributions of past measurements, represented by the clouds of dots, to minimize local violations. (a) The monitoring nodes are homogeneous in the variance of their distributions and in the dimensions of their safe zones. There is no clear preference for one node over another in resolving local violations. At t = 1, the local violation of N3 can be resolved by any non-empty subset of the other nodes, and the minimum resolving set comprises a single node. (b) The monitoring nodes are heterogeneous in the variance of their distributions and in the dimensions of their safe zones. Some nodes are more likely to resolve certain local violations than others. At t = 1, the local violation of N2 can only be resolved with N4, and the minimum resolving set comprises N4 alone.
3.4.3 Running Examples
Next we present two running examples which illustrate the concepts and problems
addressed in this work.
Example 1. (homogeneous data) Assume that air quality sensors are deployed
at various geographic locations, measuring the concentration of pollutants in the air. Each
sensor maintains a vector of its readings, such as the concentrations of NO and NO2.
We evaluate a function over the vectors of measurements in order to determine the
overall air quality, and we wish to detect whenever the air quality drops below a certain
threshold. Figure 3.2a depicts a system consisting of four sensors. The safe zones aim
to minimize local violations and are therefore centered around the expected value of the
readings in an attempt to cover as much of the distribution area as possible. Normally,
the readings of the different nodes, for each pollutant, have the same variance; thus,
the safe zones of the different nodes are homogeneous in their dimensions. This implies
that there is no clear preference for one node over another in resolving local violations.
Example 2. (heterogeneous data) Assume an Internet news agency, which
constantly monitors news reports. The nodes are each assigned to a specific category of
news (e.g., economy, sports). A wide collection of news-related terms is assembled, and
each node tracks the occurrences of the terms in the reports it monitors over a sliding
window of one hour. Figure 3.2b depicts a system consisting of four nodes, of which two
are assigned to monitor economic news (N1, N3), while the other two monitor sports
news (N2, N4). The nodes track the occurrences of the terms “team” and “asset.” The
history of occurrence counts in each node determines its distribution area, and its safe
zone is assigned as an interval for each of the terms. As in the air quality example, the
safe zones are centered around the expected value and attempt to cover as much of the
distribution area as possible. However, in this case, the occurrence counts of each term
have a substantially different variance for different nodes; thus, the safe zones of the
different nodes are heterogeneous in their dimensions. For example, the term “team”
has a wider interval in the sports nodes, while the term “asset” has a wider interval in
the economy nodes. This implies, for example, that a sports node has greater flexibility
to resolve constraint violations of the term “team.”
Table 3.1: Frequently Used Notations
Notation Description
n Number of monitoring nodes
Ni Monitoring node i (i = 1 . . . n)
NC Coordinator node
vti Vector of node i at time t
vtA Vector of a set of nodes A at time t
vt Global vector at time t
Si Safe zone of node i
SA Safe zone of a set of nodes A
S Global safe zone
V Set of violating nodes
R Resolving set
3.5 Generic Algorithm
In this section we present our generic algorithm for violation resolution. In the presence
of local violations, the algorithm is executed at the coordinator to determine the network
status. The algorithm outputs whether or not the threshold has been crossed. We assume
that the coordinator is familiar with the safe zones assigned for the monitoring nodes.
Pseudo-code is given in Algorithm 3.1. At time step t, all violating nodes (V) report
their local vectors to the coordinator. Following Theorem 3.1, the coordinator attempts
to resolve the violations by detecting a resolving set (R) such that vtV∪R ∈ SV∪R. The
resolving set is initially empty and gradually extended in every round with the set of
nodes returned by the function getExtendingSet. The only requirement for this function
is to return a non-empty set of non-violating nodes that are not already included in
the resolving set. The algorithm terminates when either the violations are resolved
(i.e., the threshold was not crossed) or the resolving set comprises the entire set of
non-violating nodes. In the latter case, the coordinator directly verifies the network
status by evaluating f(vt). Note that the number of rounds it takes to assemble the
resolving set defines the latency of the algorithm, whereas the size of the set defines the
communication cost.
Algorithm 3.1 Generic Violation Resolution Algorithm
1: for all $N_i$ in $V$ do
2:   $N_i$ sends $v^t_i$ to $N_C$
3: end for
4: $N_C$ computes $v^t_V$, $S_V$
5: $resolved \leftarrow (v^t_V \in S_V)$    ▷ a boolean flag
6: $r \leftarrow 1$, $R \leftarrow \emptyset$
7: while not $resolved$ and $|V \cup R| < n$ do
8:   $R_r \leftarrow$ getExtendingSet($V, R, r$)
9:   for all $N_i$ in $R_r$ do
10:    $N_C$ polls $N_i$ for $v^t_i$
11:  end for
12:  $R \leftarrow R \cup R_r$
13:  $N_C$ computes $v^t_{V \cup R}$, $S_{V \cup R}$
14:  $resolved \leftarrow (v^t_{V \cup R} \in S_{V \cup R})$
15:  $r \leftarrow r + 1$
16: end while
17: if not $resolved$ then    ▷ $|V \cup R| = n$
18:  $resolved \leftarrow (f(v^t) \leq \tau)$
19: end if
20: return $resolved$
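A Python transcription of the loop may help fix the control flow. This is a sketch: `in_safe_zone`, `get_extending_set`, and `f_global` are stand-ins for the system's actual primitives, not part of the thesis's pseudocode.

```python
def resolve(violating, in_safe_zone, get_extending_set, f_global, tau, n):
    """Generic violation resolution at the coordinator (Algorithm 3.1).

    violating:          set V of ids of nodes reporting a local violation
    in_safe_zone(A):    tests whether v^t_A lies in S_A
    get_extending_set:  returns a non-empty set of unpolled, non-violating
                        nodes (the only requirement placed on it)
    f_global():         evaluates f(v^t) once all n vectors are known
    """
    R = set()
    resolved = in_safe_zone(violating)                 # lines 4-5
    r = 1
    while not resolved and len(violating | R) < n:     # lines 7-16
        R |= get_extending_set(violating, R, r)        # poll these nodes
        resolved = in_safe_zone(violating | R)
        r += 1
    if not resolved:                                   # here |V u R| = n
        resolved = f_global() <= tau                   # line 18
    return resolved
```

The number of loop iterations is the algorithm's latency, and the final size of `violating | R` its communication cost, matching the discussion above.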
Theorem 3.2. The generic algorithm always terminates and correctly determines the
network status.
Proof. The while loop (lines 7-16) terminates when either resolved = true or |V∪R| = n.
The properties of getExtendingSet guarantee that the loop will terminate eventually.
resolved = true indicates that the violations have been resolved by the resolving set,
namely vtV∪R ∈ SV∪R (line 5 or 14), and thus, by Theorem 3.1, a global violation did
not occur. Otherwise, |V ∪ R| = n, which suggests that the coordinator holds the local
vectors of all the nodes and is therefore able to compute f(vt) directly (line 18). In any
case, by the end of the algorithm, resolved = false if and only if a global violation, i.e.,
threshold crossing, has occurred. We conclude that the algorithm always terminates
and correctly determines the network status.
Throughout the run of the algorithm, the probability for violation resolution, namely $\Pr\{v^t_{V \cup R} \in S_{V \cup R}\}$, is monotonically non-decreasing. This is implied by the following theorem:
Theorem 3.3. Let $V, \overline{V}$ be the set of violating nodes at time $t$ and its complement ($\overline{V} = \{N_1, \dots, N_n\} \setminus V$). Then for every two subsets of nodes $R_1 \subseteq R_2 \subseteq \overline{V}$:
$$\Pr\{v^t_{V \cup R_2} \in S_{V \cup R_2}\} \geq \Pr\{v^t_{V \cup R_1} \in S_{V \cup R_1}\}.$$
Proof. Assume that $v^t_{V \cup R_1} \in S_{V \cup R_1}$ holds. As $\overline{V}$ is a set consisting only of non-violating nodes, i.e., $v^t_i \in S_i$ for every $N_i \in \overline{V}$, it then follows from Lemma 3.4.1 that $v^t_{V \cup R_2} \in S_{V \cup R_2}$. Namely, $v^t_{V \cup R_1} \in S_{V \cup R_1} \rightarrow v^t_{V \cup R_2} \in S_{V \cup R_2}$ and hence,
$$\Pr\{v^t_{V \cup R_2} \in S_{V \cup R_2}\} \geq \Pr\{v^t_{V \cup R_1} \in S_{V \cup R_1}\}.$$
The performance of the algorithm is dictated exclusively by the function getExtendingSet
(line 8), which determines the quality and scale of the extension $R_r$ to the
resolving set. For example, an instance of the generic algorithm, denoted the Naive
Algorithm, implements this function to always return the set of all non-violating nodes.
Consequently, this algorithm achieves the minimum latency (1 round) yet also the
maximum communication cost (maximum size resolving set). Another instance, denoted
Random Linear Algorithm (RLN), extends the resolving set by a single randomly chosen
node at each round. While this algorithm attempts to minimize the size of the resolving
set, it may incur a rather high number of rounds. These examples expose the trade-off
between the expected latency and the expected communication cost of the generic
algorithm.
An optimal instance of the generic algorithm is foremost required to reduce
communication cost, but at the same time it must maintain a reasonable latency. The
latency determines how long the nodes should keep their local vectors and moreover, it
determines how long it takes to detect a global violation. This is essential for real-time
monitoring applications such as natural hazard detection.
3.5.1 Minimum Resolving Set is NP-Hard
Clearly, the optimal communication cost is attained by detecting a minimum size
resolving set. However, we argue that even a relaxed version of this problem, in which
all the local vectors are known to the coordinator, is NP-hard. Denote this version the
minimum resolving set problem (MRS).
Theorem 3.4. Let $\mathcal{V}$ be the set of all the violating nodes at time $t$. Given the set of all the local vectors, namely $v^t_1, \dots, v^t_n$, the problem of finding a resolving set $R$ of minimum size, such that $v^t_{\mathcal{V} \cup R} \in S_{\mathcal{V} \cup R}$, is NP-hard.
Proof. We show a reduction to MRS from the maximum clique problem (MC), which is well known to be NP-hard. Given a graph $G = (V, E)$, the maximum clique problem is to find a maximum complete subgraph of $G$, i.e., a set of vertices $V' \subseteq V$ of maximal size that are pairwise adjacent: $\forall u_i, u_j \in V' : \{u_i, u_j\} \in E$. Given an instance of MC consisting of $G = (V, E)$, $V = \{u_1, \dots, u_{|V|}\}$, we construct an instance of MRS consisting
of $|V| + 1$ monitoring nodes, where $\mathcal{V} = \{N_{|V|+1}\}$, i.e., $N_{|V|+1}$ is the single violating node. Let $\{P_1, \dots, P_m\}$ be the set of all non-adjacent pairs of vertices in $G$; namely, $\{u_i, u_j\} \notin E$ if and only if $P_k = \{u_i, u_j\}$ for some $1 \leq k \leq m$. We specify the local vectors and the safe zones in the MRS instance as follows:
• For $i = 1, \dots, |V|$: $v^t_i = (v^t_i[1], \dots, v^t_i[m]) \in \mathbb{R}^m$, where
$$v^t_i[j] = \begin{cases} 1 & u_i \in P_j \\ 0 & \text{otherwise.} \end{cases}$$
• For $i = 1, \dots, |V|$: $S_i = \{v \in \mathbb{R}^m \mid v[j] \geq 0,\ \forall j = 1, \dots, m\}$.
• $v^t_{|V|+1} = v^t_{\mathcal{V}} = 0^m = (0, \dots, 0)$.
• $S_{|V|+1} = S_{\mathcal{V}} = \{v \in \mathbb{R}^m \mid v[j] > 0,\ \forall j = 1, \dots, m\}$.
It is evident that the construction above is polynomial and yields a legal instance of MRS, in which $v^t_i \in S_i$ for $i = 1, \dots, |V|$ and $v^t_{\mathcal{V}} \notin S_{\mathcal{V}}$. We now show that a solution to the MRS instance, i.e., a minimum resolving set $R \subseteq \{N_1, \dots, N_{|V|}\}$, defines a solution of the MC instance, i.e., a maximum clique $V' \subseteq V$, by the following two observations:
1. A clique $V'$ of size $k$ in the MC instance defines a resolving set $R$ of size $|V| - k$ in the MRS instance. We prove that $R = \{N_i \mid u_i \notin V'\}$ is a resolving set. Assume to the contrary that $v^t_{\mathcal{V} \cup R} \notin S_{\mathcal{V} \cup R}$. Since $S_{\mathcal{V} \cup R}$ consists of all strictly positive vectors, there exists $1 \leq j \leq m$ such that $v^t_{\mathcal{V} \cup R}[j] \leq 0$. Hence, for every $N_i \in R$, $v^t_i[j] = 0$. This suggests that $P_j \cap R = \emptyset$ and therefore $P_j \subseteq V'$. We conclude that $V'$ contains a pair of non-adjacent vertices, a contradiction.
2. A resolving set $R$ of size $k$ in the MRS instance defines a clique $V'$ of size $|V| - k$ in the MC instance. We prove that $V' = \{u_i \mid N_i \notin R\}$ is a clique. Assume to the contrary that $V'$ is not a clique. Then it contains a pair of non-adjacent vertices; namely, there exists $1 \leq j \leq m$ such that $P_j \subseteq V'$. Hence, $P_j \cap R = \emptyset$ and for every $N_i \in R$: $v^t_i[j] = 0$. We conclude that $v^t_{\mathcal{V} \cup R}[j] = 0$ and therefore $v^t_{\mathcal{V} \cup R} \notin S_{\mathcal{V} \cup R}$, a contradiction.
Thus, a minimum resolving set R in the MRS instance defines a maximum clique V ′ in
the MC instance.
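The reduction is easy to check mechanically. The sketch below (helper names are ours) builds the reduction's local vectors from a graph and tests observation 1: a set $R$ resolves exactly when the sum over $R$ is strictly positive in every coordinate, i.e., when every non-adjacent pair meets $R$.

```python
from itertools import combinations

def mrs_instance(vertices, edges):
    """Build the reduction's local vectors: one coordinate per
    non-adjacent pair P_j; v_i[j] = 1 iff vertex u_i belongs to P_j."""
    edge_set = {frozenset(e) for e in edges}
    pairs = [p for p in combinations(vertices, 2)
             if frozenset(p) not in edge_set]
    vectors = {u: [1 if u in p else 0 for p in pairs] for u in vertices}
    return vectors, pairs

def is_resolving(R, vectors, pairs):
    """R resolves iff every coordinate of the sum over R is positive,
    i.e. every non-adjacent pair contains a vertex whose node is in R."""
    return all(any(vectors[u][j] for u in R) for j in range(len(pairs)))
```

On the triangle $\{a, b, c\}$ plus an isolated vertex $d$, the clique $\{a, b, c\}$ yields the resolving set $\{d\}$, while the complement of a non-clique fails the test.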
3.5.2 Probabilistic Analysis of the Algorithm
We next present a few probabilistic bounds which we use to evaluate the expected size of the resolving set. We derive a lower bound on the probability that violation resolution is achieved by the resolving set, namely $\Pr\{v^t_{V \cup R} \in S_{V \cup R}\}$, and show that it is exponentially increasing in the size of the set. Our method of doing so is to define a region inside $S_{V \cup R}$ which contains the expected value of $v^t_{V \cup R}$, and then bound the
probability that vtV∪R belongs to this region. We derive two lower bounds for the
cases where the bounded region inside the safe zone is a box or a sphere, by employing
Hoeffding’s and Bernstein’s inequalities, respectively.
Hoeffding’s Lower Bound – Univariate Data
For simplicity, we first consider the case where the data of each node are one-dimensional ($v^t_i \in \mathbb{R}$). Assume that the safe zone of node $N_i$ is given by an interval on the real line $[a_i, b_i]$ and denote its length by $\Delta_i$. Further, assume that the data of each node are bounded within an interval whose length is $\alpha_i \Delta_i$. It follows that for a set of nodes $A$, $S_A$ is the interval $\left[\frac{1}{|A|}\sum_{N_i \in A} a_i, \frac{1}{|A|}\sum_{N_i \in A} b_i\right]$, whose length is $\frac{1}{|A|}\sum_{N_i \in A} \Delta_i$. Denote by $\delta^-, \delta^+$ the distances from the expected value of $v^t_{V \cup R}$ to the left and right end points of $S_{V \cup R}$, respectively. Let $A = V \cup R$; then:
$$\Pr\{v^t_A \in S_A\} \geq \Pr\{E(v^t_A) - v^t_A \leq \delta^- \wedge v^t_A - E(v^t_A) \leq \delta^+\}$$
$$= 1 - \Pr\{E(v^t_A) - v^t_A > \delta^- \vee v^t_A - E(v^t_A) > \delta^+\}$$
$$= 1 - \Pr\{E(v^t_A) - v^t_A > \delta^-\} - \Pr\{v^t_A - E(v^t_A) > \delta^+\}$$
$$\geq 1 - \Pr\{-v^t_A - E(-v^t_A) \geq \delta^-\} - \Pr\{v^t_A - E(v^t_A) \geq \delta^+\}.$$
Hoeffding provided an upper bound on the probability that the mean of random variables deviates from its expected value:
Theorem 3.5 (Hoeffding 1963, Theorem 2 [Hoe63]). Let $X_1, \dots, X_n$ be independent random variables such that $a_i \leq X_i \leq b_i$ ($i = 1, \dots, n$). Then for $t > 0$:
$$\Pr\{\overline{X} - E(\overline{X}) \geq t\} \leq \exp\left(-\frac{2n^2t^2}{\sum_{i=1}^{n}(b_i - a_i)^2}\right)$$
where $\overline{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$.
We employ Hoeffding's inequality to derive the following corollary:
Corollary 3.6. Given that the local vectors of the nodes are univariate and independent, a lower bound on the probability for violation resolution is given by:
$$\Pr\{v^t_{V \cup R} \in S_{V \cup R}\} \geq 1 - \varphi(V, R, \delta^-) - \varphi(V, R, \delta^+)$$
where
$$\varphi(V, R, \delta) = \exp\left(-\frac{2(|V| + |R|)^2\delta^2}{\sum_{N_i \in V}(\alpha_i\Delta_i)^2 + \sum_{N_i \in R}\Delta_i^2}\right).$$
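Numerically, the bound of Corollary 3.6 is cheap to evaluate, which is what makes it usable for estimating the expected size of the resolving set. A sketch (the parameter values in the usage example are illustrative only):

```python
import math

def phi(alpha_deltas_V, deltas_R, delta):
    """The exponential term phi(V, R, delta) of Corollary 3.6.
    alpha_deltas_V: the values alpha_i * Delta_i for nodes in V;
    deltas_R:       the safe-zone lengths Delta_i for nodes in R."""
    k = len(alpha_deltas_V) + len(deltas_R)
    denom = sum(x * x for x in alpha_deltas_V) + sum(x * x for x in deltas_R)
    return math.exp(-2 * k * k * delta * delta / denom)

def hoeffding_lower_bound(alpha_deltas_V, deltas_R, delta_minus, delta_plus):
    """Lower bound on Pr{v_{VuR} in S_{VuR}} for univariate data."""
    return (1 - phi(alpha_deltas_V, deltas_R, delta_minus)
              - phi(alpha_deltas_V, deltas_R, delta_plus))
```

With one violator ($\alpha = 2$, $\Delta = 1$) and $\delta^- = \delta^+ = \Delta/2$, this reproduces the closed form $1 - 2\exp(-(1+|R|)^2 / (2(\alpha^2+|R|)))$ plotted in Figure 3.3, and the bound grows as $|R|$ grows.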
This bound exponentially approaches 1 as the size of R increases, regardless of the data
distributions of the nodes. Figure 3.3 depicts these probability bounds for identically
distributed nodes.
[Figure 3.3 plots $\Pr\{v^t_{V \cup R} \in S_{V \cup R}\}$ against $|R|$, comparing the actual probability with the bound for $\alpha = 2, 2.5, 3$.]
Figure 3.3: Hoeffding bounds over univariate synthetic data. In this setup, the safe zone of each node $N_i$ is an interval of length $\Delta$ centered around the expected value of $v^t_i$, and $v^t_i$ is bounded within an interval of length $\alpha\Delta$. Hence, $S_{V \cup R}$ is also an interval of length $\Delta$, centered around the expected value of $v^t_{V \cup R}$. In addition, $V$ consists of a single violating node. Therefore, the lower bound for $\Pr\{v^t_{V \cup R} \in S_{V \cup R}\}$ is given by:
$$1 - 2\varphi(V, R, \Delta/2) = 1 - 2\exp\left(-\frac{(1 + |R|)^2}{2(\alpha^2 + |R|)}\right).$$
$\alpha$ denotes how far a node can deviate from its safe zone, in terms of the safe zone's width.
Hoeffding’s Lower Bound – Multivariate Data
For the multidimensional case, let [ai, bi] ⊆ Rd be the bounding box of Si and let
∆i = bi − ai. In other words, the projection of Si on the jth dimension (j = 1, . . . , d)
is the interval [ai[j], bi[j]] of length ∆i[j]. Assume that the data of each node Ni are
bounded within a d-dimensional box: [ci, di] ⊆ Rd, such that di − ci = αi ·∆i where
$\alpha_i \in \mathbb{R}^d$ (the product denotes multiplying corresponding entries). Let $\delta^-, \delta^+ \in \mathbb{R}^d$ such that $\delta^-[j], \delta^+[j]$ denote the distances from the expected value of $v^t_{V \cup R}$ to the left and right end points, respectively, of a box contained in $S_{V \cup R}$, when projected on the $j$th dimension.
Corollary 3.7. Given that the local vectors of the nodes are $d$-dimensional and independent, a lower bound on the probability for violation resolution is given by:
$$\Pr\{v^t_{V \cup R} \in S_{V \cup R}\} \geq 1 - \sum_{j=1}^{d}\left(\varphi(V, R, \delta^-, j) + \varphi(V, R, \delta^+, j)\right)$$
where
$$\varphi(V, R, \delta, j) = \exp\left(-\frac{2(|V| + |R|)^2\delta[j]^2}{\sum_{N_i \in V}(\alpha_i[j]\Delta_i[j])^2 + \sum_{N_i \in R}\Delta_i[j]^2}\right).$$
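The multivariate bound simply accumulates one pair of $\varphi$-terms per dimension. A self-contained sketch (parameter layout is our choice):

```python
import math

def phi_j(alpha_delta_V_j, delta_R_j, delta_j):
    """The per-dimension term phi(V, R, delta, j) of Corollary 3.7."""
    k = len(alpha_delta_V_j) + len(delta_R_j)
    denom = (sum(x * x for x in alpha_delta_V_j)
             + sum(x * x for x in delta_R_j))
    return math.exp(-2 * k * k * delta_j * delta_j / denom)

def hoeffding_lower_bound_multi(alpha_deltas_V, deltas_R,
                                delta_minus, delta_plus):
    """alpha_deltas_V: per-node lists of alpha_i[j] * Delta_i[j];
    deltas_R: per-node lists of Delta_i[j]. Returns the bound of
    Corollary 3.7 (a sum over the d dimensions)."""
    d = len(delta_minus)
    total = 0.0
    for j in range(d):
        av = [row[j] for row in alpha_deltas_V]
        dr = [row[j] for row in deltas_R]
        total += phi_j(av, dr, delta_minus[j]) + phi_j(av, dr, delta_plus[j])
    return 1 - total
```

For $d = 1$ it collapses to the univariate bound of Corollary 3.6.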
Bernstein’s Lower Bound
An even tighter bound can be attained if we consider a d-sphere of radius δ, inside SV∪R
that is centered around the expected value of vtV∪R. Assume that the data of each node
are bounded in a d-sphere of radius ∆. Let A = V ∪R. Then:
$$\Pr\{v^t_A \in S_A\} \geq \Pr\{\|v^t_A - E(v^t_A)\| \leq \delta\}$$
$$= \Pr\left\{\left\|\frac{1}{|A|}\sum_{N_i \in A} v^t_i - E\left(\frac{1}{|A|}\sum_{N_i \in A} v^t_i\right)\right\| \leq \delta\right\}$$
$$= \Pr\left\{\left\|\sum_{N_i \in A}\left(v^t_i - E(v^t_i)\right)\right\| \leq \delta|A|\right\} = \Pr\left\{\left\|\sum_{N_i \in A} Z_i\right\| \leq \delta|A|\right\}$$
where $Z_i = v^t_i - E(v^t_i)$ is a random vector of dimension $d$ which satisfies $E(Z_i) = 0$ and $\|Z_i\| < \Delta$. Tropp provided the following generalization of Bernstein's inequality to a sum of random matrices:
Theorem 3.8 (Matrix Bernstein [Tro10]). Given a finite sequence $\{Z_k\}$ of independent random matrices with dimensions $d_1 \times d_2$, assume that each random matrix satisfies $E(Z_k) = 0$ and $\|Z_k\| < R$ almost surely. Define:
$$\sigma^2 := \max\left\{\left\|\sum_k E(Z_k Z_k^*)\right\|, \left\|\sum_k E(Z_k^* Z_k)\right\|\right\}.$$
Then, for all $t \geq 0$,
$$\Pr\left\{\left\|\sum_k Z_k\right\| \geq t\right\} \leq (d_1 + d_2) \cdot \exp\left(-\frac{t^2/2}{\sigma^2 + Rt/3}\right).$$
We employ Bernstein's matrix inequality to derive the following corollary:
Corollary 3.9. Given that the local vectors of the nodes are $d$-dimensional and independent, a lower bound on the probability for violation resolution is given by:
$$\Pr\{v^t_{V \cup R} \in S_{V \cup R}\} \geq 1 - (1 + d) \cdot \exp\left(-\frac{(\delta|V \cup R|)^2/2}{\sigma^2 + \Delta\delta|V \cup R|/3}\right),$$
where
$$\sigma^2 := \max\left\{\left\|\sum_{N_i \in V \cup R} E(Z_i Z_i^T)\right\|, \left\|\sum_{N_i \in V \cup R} E(Z_i^T Z_i)\right\|\right\}.$$
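Given per-node sample histories, every quantity in Corollary 3.9 can be estimated empirically. A sketch (estimating $\sigma^2$ and $\Delta$ from samples is our choice here, not something prescribed by the corollary):

```python
import math
import numpy as np

def bernstein_lower_bound(samples_by_node, delta):
    """Evaluate the bound of Corollary 3.9 from per-node samples.
    samples_by_node: one (num_samples, d) array per node in V u R;
    delta: radius of the sphere inside S_{VuR} around E(v_{VuR})."""
    d = samples_by_node[0].shape[1]
    k = len(samples_by_node)
    S = np.zeros((d, d))      # accumulates sum_i E(Z_i Z_i^T)
    Delta = 0.0               # empirical almost-sure bound ||Z_i|| < Delta
    for X in samples_by_node:
        Z = X - X.mean(axis=0)            # centered samples of node i
        S += Z.T @ Z / len(Z)             # empirical E(Z_i Z_i^T)
        Delta = max(Delta, np.linalg.norm(Z, axis=1).max())
    # for column vectors, sum_i E(Z_i^T Z_i) is the scalar trace(S)
    sigma2 = max(np.linalg.norm(S, 2), np.trace(S))
    t = delta * k             # the event rescaled by |V u R|
    return 1 - (1 + d) * math.exp(-(t * t / 2) / (sigma2 + Delta * t / 3))
```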
3.6 Homogeneous Data Instance
In the case of homogeneous data, i.e., the monitoring nodes’ data are identically
distributed, it appears that no node is clearly preferable over any other in resolving
local violations. Therefore, we suggest the following instance of the generic algorithm
presented in Section 3.5, to which we refer as the Random Logarithmic algorithm (RLG).
Pseudo-code for the function getExtendingSet is given in Algorithm 3.2. The concept
behind this algorithm is straightforward: at each round $r$ the resolving set is extended
by an additional $2^r$ randomly selected non-violating nodes.
Following the probabilistic analysis of the algorithm, presented in Section 3.5.2,
we are able to estimate the expected size of the resolving set. Thus, we are able to
estimate the expected communication cost (|R|), as well as the latency (log(|R|)) of the
algorithm. As depicted in Figure 3.3, when the data are homogeneous, the probability
for resolution converges rapidly to 1 as the size of the resolving set increases. Therefore,
we expect this algorithm to perform well. Note that in the worst case scenario, i.e.,
the resolving set comprises all the non-violating nodes, RLG bounds the latency by
O(log(n)) rounds.
Clearly, the number of nodes added to the resolving set in each round determines
the algorithm’s latency. We have found that doubling the size of the resolving set at
each round yields a fair trade-off between communication cost and latency.
The correctness of RLG derives directly from Theorem 3.2, as getExtendingSet
always returns a non-empty set of non-violating nodes which were not already included
in the resolving set.
Algorithm 3.2 Random Logarithmic Algorithm
getExtendingSet($V, R, r$)
1: $R_r \leftarrow 2^r$ random nodes from $\{N_1, \dots, N_n\} \setminus (V \cup R)$
2: return $R_r$
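In Python, the RLG extension rule might look like the following sketch (node ids are assumed hashable; the clamp to the remaining candidates handles the final round):

```python
import random

def rlg_extending_set(all_nodes, violating, resolving, r, rng=random):
    """Round r of RLG: pick 2**r random non-violating nodes that are
    not yet in the resolving set (fewer if not enough remain)."""
    candidates = list(set(all_nodes) - violating - resolving)
    k = min(2 ** r, len(candidates))
    return set(rng.sample(candidates, k))
```

Since the pool doubles each round, the resolving set reaches all $n$ nodes within $O(\log n)$ calls, matching the latency bound stated above.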
3.7 Heterogeneous Data Instance
In various distributed stream networks, the data are heterogeneously distributed, i.e., the
distribution of the data may vary greatly among different streams. Data heterogeneity,
in most constraint monitoring algorithms, yields heterogeneous constraints. If the
constraints are boxes, for example, the safe zones among the different nodes may vary
notably in shape, having different width in each dimension. In this section we show that
random methods (such as RLG) may produce poor results over heterogeneous setups,
and present a more suited algorithm, which efficiently chooses the resolving set. Finally,
we discuss the complexity of this algorithm.
3.7.1 The Heterogeneous Data Challenge
Consider Example 2 presented in Section 3.4.3, in which a single local violation has occurred in one of the sports nodes ($N_2$). It is evident that the violation can only be resolved in collaboration with the other sports node ($N_4$), because economy nodes are characterized by narrow safe zones for sports-related terms. Consequently, they have little or no flexibility in resolving the violation. This is also expressed in the lower bound for the probability of resolution presented in Corollary 3.7, where the flexibility is represented by the width of the safe zone, $\Delta$. It follows that greater flexibility exponentially increases the probability for resolution.
3.7.2 Maximum Matching Tree Algorithm
We suggest a new instance of the generic algorithm presented in Section 3.5, referred to
as the Maximum-Matching Tree algorithm (MMT). The algorithm is preceded by an
initialization phase in which an overlay tree structure is defined over the nodes, such
that nodes with high probability to resolve each other’s local violations are stored under
the same sub-tree. The tree is later used in the implementation of getExtendingSet, to
retrieve the resolving set.
The construction of the overlay tree, MMTree, is presented in detail in the next
subsection. The pseudo-code for the construction, as well as the implementation of
getExtendingSet, are given in Algorithm 3.3. Each level in the tree defines a partition of $\{N_1, \dots, N_n\}$, as depicted in Figure 3.4. Denote the level of the leaves as 0 and the root level as $\log(n)$. Let $P_r$ be the partition defined by level $r$, and let $P_r[N_i]$ be the set in $P_r$ that contains the node $N_i$. In each round $r$, every node $N_i$ that is not already included in $R$ and shares the same set in $P_r$ with a violating node (i.e., $P_r[N_i] \cap V \neq \emptyset$) is added to $R$. Figure 3.5 illustrates an execution example of MMT over a system of 8 nodes.
The correctness of MMT derives directly from Theorem 3.2, as getExtendingSet
always returns a non-empty set of non-violating nodes which were not already included
in the resolving set.
3.7.3 Maximum Matching Tree Construction
We assume that the coordinator is familiar with the data distributions and the safe
zones of all the monitoring nodes. The maximum matching tree is the product of a
greedy process that recursively obtains a coarser partition of N by aggregating the
components of a finer partition. Given a partition $P$ of size $m$, the process obtains a coarser partition $P'$ of size $\lceil m/2 \rceil$ by optimally pairing the components of $P$. In other words, every component of $P'$ is formed by joining a pair of components from $P$ (if $m$ is odd, $P$ and $P'$ will share a single component). Pairing a partition $A_1, \dots, A_m$ is considered optimal if it yields a partition $B_1, \dots, B_{\lceil m/2 \rceil}$ such that $\Pr\left\{\bigwedge_{i=1}^{\lceil m/2 \rceil} v^t_{B_i} \in S_{B_i}\right\}$ is maximized. We initialize the process with the partition of $N = \{N_1, \dots, N_n\}$ into singletons ($\{N_1\}, \dots, \{N_n\}$). In turn, optimal pairings are recursively performed until the trivial partition ($N$) is reached. The result is a bottom-up construction of a binary tree where each generated partition defines a new level in the tree. An example of such a tree is depicted in Figure 3.4.
Optimal Pairing
Given a partition of $N$ into (disjoint) sets $A_1, \dots, A_m$, we define a weighted, non-directed, complete graph over these sets. The weight of the edge connecting $A_i$ and $A_j$ is defined as $\log(\Pr\{v^t_{A_i \cup A_j} \in S_{A_i \cup A_j}\})$ for all $1 \leq i < j \leq m$. We perform the pairing by computing a maximum weighted matching in this graph, as described in [Edm65]. As this graph is complete, the matching is perfect (or near-perfect if $m$ is odd). Thus, we obtain a partition $B_1, \dots, B_{\lceil m/2 \rceil}$ such that $\sum_{i=1}^{\lceil m/2 \rceil} \log(\Pr\{v^t_{B_i} \in S_{B_i}\})$ is maximized. It follows that $\prod_{i=1}^{\lceil m/2 \rceil} \Pr\{v^t_{B_i} \in S_{B_i}\}$ is maximized and, as the monitoring nodes are independent, we conclude that the pairing is optimal.
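With a probability estimate in hand, this pairing step maps directly onto an off-the-shelf maximum-weight matching routine. A sketch using networkx (the probability function `pr_resolve` is a stand-in for the estimates discussed next; the log weights are clamped only to avoid $\log 0$):

```python
import math
import networkx as nx

def optimal_pairing(parts, pr_resolve):
    """One level of the MMTree: pair the components of the partition
    `parts` (a list of frozensets of node ids) so that the product of
    Pr{v_B in S_B} over the resulting components is maximized."""
    G = nx.Graph()
    G.add_nodes_from(range(len(parts)))
    for i in range(len(parts)):
        for j in range(i + 1, len(parts)):
            p = pr_resolve(parts[i] | parts[j])
            G.add_edge(i, j, weight=math.log(max(p, 1e-300)))
    matching = nx.max_weight_matching(G, maxcardinality=True)
    paired, used = [], set()
    for i, j in matching:
        paired.append(parts[i] | parts[j])
        used |= {i, j}
    # if m is odd, one unmatched component carries over unchanged
    paired += [parts[i] for i in range(len(parts)) if i not in used]
    return paired
```

Maximizing the sum of log-probabilities over a maximum-cardinality matching is exactly the product maximization argued above.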
Computational Issues
An essential task in the tree construction is computing $\Pr\{v^t_A \in S_A\}$ for a given set of nodes $A$. To this end we assume that each node $N_i \in N$ is associated with a probability density function (p.d.f.) given as a discrete set $D_i \subseteq \mathbb{R}^d$ of sample history data. We generalize this notion to a set of nodes $A$ by aggregating the sampled data of the nodes in $A$ (i.e., $\frac{1}{|A|}\sum_{N_i \in A} v_i \in D_A$, where the $\{v_i \in D_i\}_{N_i \in A}$ were sampled at the same time steps). It follows that
$$\Pr\{v^t_A \in S_A\} = \frac{|D_A \cap S_A|}{|D_A|}.$$
If the nodes' p.d.f.s are given as explicit functions $f_i : \mathbb{R}^d \rightarrow [0, 1]$, then $\Pr\{v^t_A \in S_A\} = \int_{S_A} f_A$, where $f_A$ is the convolution of $\{f_i\}_{N_i \in A}$.
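The empirical estimate $|D_A \cap S_A| / |D_A|$ from synchronized history samples is a one-liner; a sketch in which the membership test `in_safe_zone` is a stand-in for whatever safe-zone shape is used:

```python
import numpy as np

def pr_resolve_empirical(histories, in_safe_zone):
    """histories: list of (T, d) arrays of synchronized samples, one per
    node in A. D_A is the set of per-time-step averages; the estimate
    is |D_A intersect S_A| / |D_A|."""
    D_A = np.mean(np.stack(histories), axis=0)   # (T, d) average vectors
    hits = sum(1 for v in D_A if in_safe_zone(v))
    return hits / len(D_A)
```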
Figure 3.4: A maximum matching tree over an 8-node system in a 2-dimensional space. Every level of the tree defines a partition of $\{N_1, \dots, N_8\}$. The distribution of the average vector and the safe zone are marked by a cloud of dots and a rectangle, respectively, for every partition set. The root node represents the distribution of the global vector and the global safe zone. Note that, indeed, global violations rarely occur. There are two types of nodes: type-1 nodes, which have high variance in the 1st dimension and low variance in the 2nd dimension, and type-2 nodes, which have high variance in the 2nd dimension and low variance in the 1st dimension. 4 nodes of each type comprise the leaves of the tree, denoted by the double-outlined ellipses. As expected, MMT first pairs nodes of the same type.
Algorithm 3.3 Maximum-Matching Tree Algorithm
buildMMTree()
1: $P_0 \leftarrow \{\{N_1\}, \dots, \{N_n\}\}$
2: $r \leftarrow 0$
3: while $|P_r| > 1$ do
4:   $G \leftarrow$ buildCompleteGraph($P_r$)
5:   $M \leftarrow$ findMaximumMatching($G$)
6:   $P_{r+1} \leftarrow \emptyset$
7:   for all $A$ in $P_r$ do
8:     Add $A \cup M(A)$ to $P_{r+1}$
9:   end for
10:  $r \leftarrow r + 1$
11: end while
12: MMTree $\leftarrow (P_0, \dots, P_r)$

getExtendingSet($V, R, r$)
1: $R_r \leftarrow \emptyset$
2: for all $N_i$ in $V$ do
3:   Add $P_r[N_i] \setminus (V \cup R)$ to $R_r$
4: end for
5: return $R_r$
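Given the per-level partitions $P_0, \dots, P_{\log n}$ produced by buildMMTree, the MMT extension rule is a direct lookup. A sketch (partitions are represented here as plain lists of sets):

```python
def mmt_extending_set(partitions, violating, resolving, r):
    """Round r of MMT: every node sharing a level-r component with a
    violating node, and not yet polled, joins the resolving set."""
    level = partitions[r]          # list of sets of node ids
    extra = set()
    for component in level:
        if component & violating:  # component contains a violator
            extra |= component - violating - resolving
    return extra
```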
Figure 3.5: An execution example of MMT over an 8-node system. At $t = 1$, a snapshot of the system is given at the bottom. First, the violating nodes ($N_2$, $N_7$) report their local vectors to the coordinator. Upon failure to resolve the violation, the resolving set is extended with the non-violating nodes from the sets containing $N_2$, $N_7$ in the 1st level of the MMTree. If we consider the MMTree in Figure 3.4, nodes $N_4$, $N_6$ are polled for their local vectors. At this point the violations are resolved and the algorithm terminates.
3.7.4 Distributed Variant of MMT
Up until now we assumed that the nodes communicate only through the coordinator
node. However, this is not necessarily the case in many distributed networks. A star
topology in ever expanding networks implies increasing energy costs for the distant nodes.
Moreover, as the number of local violations increases with the size of the network, it also
implies an increasing load on the coordinator. Next we present a variant of the MMT
algorithm, designed for network topologies that support inter-node communication,
denoted as Distributed MMT (DMMT). Unlike the generic algorithm, in which the
local violations were centralized to the coordinator for resolution, the violating nodes in
DMMT do not immediately address the coordinator but rather attempt to resolve their
violations locally.
DMMT initiates with the construction of the MMTree by the coordinator as in
MMT. In addition, the coordinator disseminates the MMTree to all the nodes. When
local violations occur, the violating nodes perform the MMT algorithm simultaneously,
yet independently. Each of the nodes constructs its own resolving set using the MMTree
until resolution is attained. In each of the sets, the resolution process is led by the smallest-index node. It is possible that some of the sets are unified during the process. The MMTree guarantees that, throughout the process, each node can belong to only a single resolving set. In other words, the process creates a partition of the network into disjoint sets of nodes, so that, by Lemma 3.4.1, the resolution is valid.
The distributed approach dramatically reduces the load on the coordinator. In
addition, it can also lead to savings in communication cost. The MMT algorithm
attempts to resolve the local violations as a whole and therefore extends the resolving
set until all violations are resolved. In DMMT, however, local violations are resolved
independently, thus allowing a different size resolving set to be tailored to each violation.
Finally, the construction of the MMTree in DMMT can be adapted to suit the
needs of the network. Factors can be applied to the weights of edges to reflect desired or
non-existent connections and to integrate distances between nodes.
3.8 Experiments
In this section we compare the performance of the presented violation resolution
algorithms. We tested these algorithms over homogeneous and heterogeneous setups,
using both synthetic and real-life data sets. The setups used, as well as the performance
metrics and compared algorithms, are now described.
3.8.1 Data sets
Following are the data sets over which we conducted our experiments:
Syn-HM-n (n = 16, 32, . . . , 1024) – A synthetically generated homogeneous data
set consisting of n streams (nodes) of random 3-dimensional data from the normal
distribution. The data set was generated such that data variance in each dimension was
the same in all the nodes.
Air-HM-n (n = 16, 32, . . . , 1024) – A homogeneous data set taken from the European
air quality database (AirBase) [web]. The data set consists of air pollutant measurements
read by geographically distributed sensors. The local vectors are 2-dimensional vectors
representing the concentrations of NO and NO2 in the air, which were measured in
micrograms per cubic meter. We have assembled n nodes having highly correlated data
distributions.
Syn-HT-n (n = 16, 32, . . . , 1024) – A synthetically generated heterogeneous data
set consisting of n streams (nodes) of random n/8-dimensional data from the normal
distribution. The data set was generated such that the data variance in each dimension was
the same in all the nodes except for 8 nodes, in which it was substantially higher.
RCV-HT-n (n = 16, 32, . . . , 256) – The Reuters Corpus (RCV1-v2) [RSW02] consists
of 804,014 news stories, each tagged as belonging to one or more of 103 content categories.
Every story comprises a precomputed list of terms [LYRL04]. We assembled n/8 roughly
equal-sized super-categories and selected, for each super-category, a term that highly
dominates the other categories (in the sense that the term occurs many more times in
stories within that category). The stories of each super-category were then divided into
8 nodes, which tracked the occurrences of all the selected terms over a sliding window of
100 stories (i.e., the local vectors were the occurrence count vectors). This resulted in
the nodes having roughly the same variance in every dimension of their data distribution,
except for the dimension that corresponds to the term of their super-category, in which
the variance was substantially higher.
3.8.2 Scoring Functions and Local Constraints
Following are the scoring functions and local constraints defined for the different data
sets:
Syn-HM, Syn-HT, RCV-HT – The global safe zone was defined as a multidimensional
rectangle (box), centered around the expected value of the global vector. A box essentially
sets a lower and an upper bound on the value of the global vector in every dimension.
Similarly, the local safe zone of each node was defined as a box centered around the
expectation of its data distribution. The boxes were defined such that the average box
of the local safe zones was contained in the box of the global safe zone (to ensure that
the safe zones condition of Equation 3.1 holds). In addition, the boxes achieved a fairly
high coverage of the data distribution, guaranteeing that both a global violation and a
local violation at any node occur with low probability.
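As an illustration, the containment condition for box safe zones can be sketched as follows. The centers, widths, and helper names are hypothetical, and the checks only stand in for the actual construction used in the experiments.

```python
def make_box(center, half_widths):
    """A box SZ: per-dimension lower and upper bounds around a center."""
    lo = [c - h for c, h in zip(center, half_widths)]
    hi = [c + h for c, h in zip(center, half_widths)]
    return lo, hi

def average_box(boxes):
    """The 'average box' of the local SZs: averaging the per-node bounds
    bounds the global average vector whenever every node is in its box."""
    n, dims = len(boxes), len(boxes[0][0])
    lo = [sum(b[0][d] for b in boxes) / n for d in range(dims)]
    hi = [sum(b[1][d] for b in boxes) / n for d in range(dims)]
    return lo, hi

def contained_in(inner, outer):
    """True if the inner box lies inside the outer box in every dimension."""
    return all(i >= o for i, o in zip(inner[0], outer[0])) and \
           all(i <= o for i, o in zip(inner[1], outer[1]))

def in_box(v, box):
    """Local SZ test: no local violation while v stays inside the box."""
    return all(l <= x <= h for x, l, h in zip(v, box[0], box[1]))

# Hypothetical 2-dimensional setup: a global SZ around the expected
# global vector, and local SZs around each node's own expectation.
global_sz = make_box([0.0, 0.0], [2.0, 2.0])
local_szs = [make_box([0.5, -0.5], [1.5, 1.5]),
             make_box([-0.5, 0.5], [1.5, 1.5])]

assert contained_in(average_box(local_szs), global_sz)  # Eq. 3.1 condition
assert in_box([0.4, -0.2], local_szs[0])                # no local violation
assert not in_box([2.5, 0.0], local_szs[0])             # local violation
```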
Air-HM – The scoring function was defined as the ratio between the average con-
centrations of NO and NO2, and the safe zones were defined as triangles – a choice
motivated by their simplicity and by their suitability to the data and the definition of
the queried function.
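For illustration, the Air-HM query and a triangular safe zone can be sketched as follows. The threshold, vertices, and function names are hypothetical and only stand in for the zones actually computed in the experiments.

```python
def ratio_score(vectors):
    """Air-HM scoring function: the ratio between the average NO and
    average NO2 concentrations (each local vector is (NO, NO2))."""
    avg_no = sum(v[0] for v in vectors) / len(vectors)
    avg_no2 = sum(v[1] for v in vectors) / len(vectors)
    return avg_no / avg_no2

def in_triangle(p, a, b, c):
    """Triangular SZ test: p is inside if it lies on the same side of
    all three (oriented) edges, via edge cross products."""
    def cross(o, u, v):
        return (u[0] - o[0]) * (v[1] - o[1]) - (u[1] - o[1]) * (v[0] - o[0])
    s1, s2, s3 = cross(a, b, p), cross(b, c, p), cross(c, a, p)
    return (s1 >= 0 and s2 >= 0 and s3 >= 0) or (s1 <= 0 and s2 <= 0 and s3 <= 0)

# Hypothetical triangle lying under the line NO = tau * NO2: every point
# of the SZ keeps the local ratio at or below the threshold tau.
tau = 2.0
sz = ((0.0, 0.0), (10.0, 5.0), (0.0, 5.0))
assert in_triangle((1.0, 3.0), *sz)      # safe reading
assert not in_triangle((8.0, 1.0), *sz)  # ratio 8.0 > tau: local violation
assert ratio_score([(1.0, 3.0), (3.0, 1.0)]) == 1.0
```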
(a) Syn-HM (b) Air-HM
(c) Syn-HT (d) RCV-HT
Figure 3.6: Data distributions and safe zones of 2 randomly chosen nodes from each data set. In Syn-HT and RCV-HT the data are projected to a 3-dimensional space.
Figure 3.6 provides a graphic illustration of the data sets and safe zones. In the
homogeneous data sets the variance of the data in each dimension is approximately
the same in all the nodes. Consequently, their safe zones are similar in shape (i.e.,
have approximately the same width in each dimension). In the heterogeneous data sets, on
the other hand, the variance of the data in each dimension differs greatly in some nodes,
and their safe zones differ in shape accordingly.
3.8.3 Performance Metrics
We have applied the following metrics:
Average communication cost – The average number of monitoring nodes that re-
ported their local vector during violation resolution. In the centralized algorithms this
corresponds to |𝒱 ∪ ℛ|. In the DMMT algorithm, this excludes the violating nodes, which
handle the resolution themselves. The actual communication cost (in bytes) is linear in this size.
Average size of resolving set – The average number of non-violating nodes that
participated in a violation resolution. This metric emphasizes the overhead in the
network resources allocated for resolving the local violations.
Average latency – The average running time (in rounds) of the violation resolution
algorithm (namely, the average number of rounds it took to assemble the resolving set).
Average maximum communication load – The maximum communication load is the
maximum communication that goes through a single node during violation resolution.
[Line graphs over n = 16, 32, . . . , 1024 for Naïve, RLN, RLG, MMT and DMMT, with the number of violations |𝒱| shown for reference: (a) Average communication cost; (b) Average resolving set size; (c) Average latency; (d) Average maximum communication load.]
Figure 3.7: Experimental results over the Syn-HM (left) and Air-HM (right) homogeneous data sets. The vertical axes of the line graphs are in logarithmic scale. In the average communication cost, all algorithms (except the Naive) approach the minimum, as denoted by the number of violations. The average latency reflects the differences between the algorithms in the expansion rate of the resolving set. DMMT outperforms the centralized algorithms in reducing the average maximum communication load.
In the centralized algorithms, the maximum communication load indicates the load on the coordinator.
Note that we’ve only considered time steps in which a global violation did not occur
(i.e., the local violations could be resolved). A global violation would always require any
algorithm to collect the entire network data. Moreover, the latency of the algorithms
would be maximal (i.e., Naive - 1, RLG/MMT/DMMT - log(n), RLN - n). As global
violations are rare, we’ve omitted them from our evaluation.
3.8.4 Compared Instances
We evaluated the four instances of the centralized generic algorithm, as well as the
distributed version:
Naive – Mentioned in Section 3.5. The resolving set is always defined as the entire
set of non-violating nodes.
RLN – Mentioned in Section 3.5. Extends the resolving set linearly with randomly
chosen nodes.
RLG – Presented in Section 3.6. Extends the resolving set exponentially with
randomly chosen nodes.
MMT – Presented in Section 3.7. Extends the resolving set exponentially using an
overlay tree structure.
DMMT – Presented in Subsection 3.7.4. A distributed variant of MMT.
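The growth schedules of the instances above can be sketched as follows. The sketch is illustrative: the `need` parameter stands in for the actual resolution test of the generic algorithm, and the tree-guided node selection of MMT and DMMT is not modeled.

```python
def resolution_rounds(n, num_violating, strategy, need):
    """Count the rounds a growth strategy needs until `need` non-violating
    nodes are collected (a stand-in for the real resolution test).
    Returns (rounds, nodes collected)."""
    non_violating = n - num_violating
    target = min(need, non_violating)
    collected, rounds, step = 0, 0, 1
    while collected < target:
        rounds += 1
        if strategy == "naive":    # collect everyone at once
            collected = non_violating
        elif strategy == "rln":    # linear growth: one random node per round
            collected += 1
        elif strategy == "rlg":    # exponential growth: double each round
            collected += step
            step *= 2
        collected = min(collected, non_violating)
    return rounds, collected

# With 8 violations in a network of 1024 nodes and 16 resolvers needed:
print(resolution_rounds(1024, 8, "naive", 16))  # (1, 1016)
print(resolution_rounds(1024, 8, "rln", 16))    # (16, 16)
print(resolution_rounds(1024, 8, "rlg", 16))    # (5, 31): 1+2+4+8+16
```

Doubling reaches the target in logarithmically many rounds, which matches the latency behavior reported for the exponential-growth instances.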
[Line graphs for Naïve, RLN, RLG, MMT and DMMT, with the number of violations |𝒱| shown for reference, over n = 16, . . . , 1024 (Syn-HT) and n = 16, . . . , 256 (RCV-HT): (a) Average communication cost; (b) Average resolving set size; (c) Average latency; (d) Average maximum communication load.]
Figure 3.8: Experimental results over the Syn-HT (left) and RCV-HT (right) heterogeneous data sets. The vertical axes of the line graphs are in logarithmic scale. The clear advantage of MMT and DMMT over the random algorithms is apparent in each of the metrics. DMMT outperforms the centralized algorithms in reducing the average maximum communication load.
3.8.5 Experimental Results
Homogeneous Setups
We compared the performance of the five algorithms over the Syn-HM and Air-HM data
sets. Results are presented in Figure 3.7. In the average communication cost, RLN,
RLG, MMT and DMMT perform similarly and are orders of magnitude away from the
Naive algorithm. The graphs of all algorithms, except the Naive, closely converge to
the graph of the average number of violations. The number of violations defines the
minimum communication cost of the resolution, and it increases linearly with the size
of the network. This reinforces the hypothesis that in homogeneous setups, no node
is clearly preferable to any other, and the choice of the resolving set can be made at
random. What really matters is the number of nodes participating in the resolution,
as suggested by the lower bounds in Section 3.5.2. The advantage of DMMT, namely
that the violating nodes handle the resolution themselves and do not report their local
vectors, is not reflected in the average communication cost. This is explained by the
graphs of the average size of the resolving set: since DMMT resolves each of the local
violations independently, it requires more resolving nodes than the centralized algorithms.
For all the algorithms, we observe a growth in the graphs of the average size of the
resolving set. However, since the number of violating nodes increases, one might instead
expect the number of resolving nodes to decrease. We reconcile this apparent contradiction
by noting that in both data sets the local violations occur in the positive directions of the
axes, and thus they virtually never resolve each other.
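A minimal numeric illustration of this effect, assuming an average query with a per-dimension upper bound (the bound and the values are made up):

```python
# An average query with an upper bound B: a violation above the bound can
# be balanced out by a node holding a value well below it, but two
# violations in the same (positive) direction cannot balance each other.
B = 10.0

def resolved(reported_values):
    """Resolved if the average of the reported values is back under B."""
    return sum(reported_values) / len(reported_values) <= B

assert resolved([12.0, 7.0])       # opposite-direction values cancel out
assert not resolved([12.0, 11.0])  # same-direction violations do not
```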
The graphs of the average latency reflect the differences between the algorithms
in the expansion rate of the resolving set. RLN extends the resolving set linearly and
exhibits the worst latency, as its graph diverges exponentially. RLG, MMT and DMMT
all extend the resolving set exponentially, yet MMT starts with as many nodes as there
are violating nodes. The graphs of RLG and DMMT diverge logarithmically, while the
graph of MMT shows a slow decay.
As expected, in the average maximum communication load, DMMT outperforms
the centralized algorithms, which demonstrate an exponential growth.
Heterogeneous Setups
We compared the performance of the five algorithms over the Syn-HT and RCV-HT data
sets. Results are presented in Figure 3.8. The clear advantage of MMT and DMMT
over the random algorithms is apparent in each of the metrics. Due to the diversity
of the nodes in the heterogeneous setups, the random algorithms selected nodes that
were of no use in the resolution. MMT and DMMT, on the other hand, selected
the most relevant nodes according to the preconstructed MMTree. Both MMT and
DMMT were successful in reducing the communication cost and the latency almost to
the minimum. Nevertheless, DMMT still outperforms MMT in reducing the average
maximum communication load. In addition, the savings in reporting the local violations
to the coordinator, and the ability to tailor a different size resolving set to each violation
independently, may explain the advantage of DMMT in the other metrics as well.
3.9 Chapter Conclusions
This chapter focused on minimizing the communication required for handling local
constraint violations in distributed threshold monitoring. The key insight was that
when there is no global violation, the local violations can typically be resolved without
collecting the entire network data.
We presented a formal and precise condition for resolving the violations by a set of
nodes. We showed that finding the minimum resolving set is NP-hard, and proposed
a general approach that incrementally collects the resolving set. The latency of the
process should be taken into account as it determines how long it takes to alert the
system about a global violation.
We distinguished between two types of networks: homogeneous and heterogeneous.
The network types are related to the correlation between the data distributions of
the nodes. We focused especially on the variance of the data in the different dimensions,
because it indicates the directions in which a node has a greater tendency to violate.
Consequently, it tells us where the safe zone of the node is expected to have greater
flexibility in resolving violations. We assumed that in homogeneous networks, no node
would be clearly preferable over another in resolving violations. On the other hand, in
heterogeneous networks, a careful selection of the resolving nodes can be crucial. These
assumptions were reinforced by the lower bounds we presented on the probability for
violation resolution. In homogeneous networks, the size of the resolving set was the
deciding factor, rather than the identity of its members.
We presented violation resolution algorithms for homogeneous (RLG) and heteroge-
neous (MMT) setups. Both algorithms guarantee a latency that is logarithmic in the
size of the network in the rare case of a global violation. Experimental results with both
synthetic and real-life data sets showed that, in homogeneous setups, both algorithms
reduced the average communication cost almost to the minimum, and reduced the
average latency as well. Due to its simplicity and speed, RLG is preferable. In the
heterogeneous setups, however, the superiority of MMT is evident. In addition, if the
infrastructure of the network allows it, using DMMT should be considered, in order to
avoid the load on the coordinator that is created in the centralized algorithms.
Chapter 4
Conclusions
In this thesis we presented techniques for reducing communication when monitoring
general functions over a distributed stream network. The techniques were based on
the idea of Safe Zones (SZs): each node of the system is assigned a Safe Zone (SZ), and is
asked to communicate only when the data it observes drifts out of this SZ. We make
sure that as long as each node is in its SZ, the global "bad" event (threshold crossing)
cannot have happened. When a node observes data that is outside its SZ, we say that
a local violation has happened, and a violation resolution algorithm is initiated in order
to efficiently detect whether the global "bad" event happened. The expected number of
messages we save using our techniques is proportional to (1) the probability that the
data of the nodes reside inside their respective SZs, and to (2), in cases of local violations,
the expected number of nodes involved in the violation resolution process. Geometric
and combinatorial tools were used in order to come up with practical algorithms that
attempt to minimize the number of messages sent during the monitoring protocol.
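From a single node's point of view, the protocol above can be sketched as follows; the SZ membership test and the reporting callback are placeholders for whatever SZ shape and resolution algorithm are in use.

```python
def run_node(stream, in_safe_zone, report_violation):
    """Node-side loop of the SZ protocol: stay silent while the local data
    remains inside the SZ, and send a message only on a local violation
    (which triggers the separate violation resolution algorithm)."""
    messages = 0
    for v in stream:
        if not in_safe_zone(v):
            report_violation(v)  # hand off to violation resolution
            messages += 1
    return messages

# Illustrative run with a one-dimensional interval SZ [0, 5]: only the
# two out-of-zone samples cost a message.
violations = []
sent = run_node([1.0, 4.2, 6.3, 2.0, -0.5],
                in_safe_zone=lambda v: 0.0 <= v <= 5.0,
                report_violation=violations.append)
print(sent, violations)  # 2 [6.3, -0.5]
```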
In Chapters 1 and 2 we presented our algorithms for assigning SZs to the nodes.
We defined the problem of finding the optimal SZs – the SZs that will save the maximal
number of messages – as an optimization problem, and presented two approaches for
solving it: the geometric "Minkowski Sum" approach of Chapter 1, which is more
appropriate for continuous data (e.g., sensor data) and monitoring functions that
are defined over the sum or average of the local vectors; and the discrete "Biclique"
approach of Chapter 2, which can handle arbitrary monitoring functions when the data
is taken from a discrete set of values. In contrast to previous solutions, which involved a
cover of the entire convex hull of the local data vectors, our techniques focus on direct
computation of safe zones for the nodes. Consequently, SZs are more flexible than the
constraints introduced in previous work, as they fit the data distributions much better.
In Chapter 3, we presented our algorithms for violation resolution. The key insight
was that when there is no global violation, the local violations can typically be resolved
without collecting the entire network data. We distinguished between "homogeneous"
and "heterogeneous" network types, and for each type we suggested a different strategy
for gathering data from the nodes for the resolution process (i.e., determining whether
the "bad" thing really happened). We used a maximum matching algorithm to find a
good strategy efficiently.
The applicability of these techniques to various real problems was demonstrated in exper-
iments (Chapters 1 and 3), which also showed their advantage. SZs were implemented
and tested on both real-life and synthetic data using simple families of shapes, proving
that the paradigm can reduce communication volume by orders of magnitude over
previous work.
Bibliography
[ABC09] C. Arackaparambil, J. Brody, and A. Chakrabarti. Functional moni-
toring without monotonicity. In ICALP (1), pages 95–106, 2009.
[ADNR07] S. Agrawal, S. Deb, K. V. M. Naidu, and R. Rastogi. Efficient detection
of distributed constraint violations. In ICDE, pages 1320–1324, 2007.
[AKT09] A. Abbasi, A. Khonsari, and M. Sadegh Talebi. Flooding-assisted
threshold assignment for aggregate monitoring in sensor networks. In
ICDCN, 2009.
[AM04] A. Arasu and G. S. Manku. Approximate counts and quantiles over
sliding windows. In PODS, 2004.
[BO03a] B. Babcock and C. Olston. Distributed top-k monitoring. In SIGMOD,
pages 28–39. ACM, 2003.
[BO03b] B. Babcock and C. Olston. Distributed top-k monitoring. In SIGMOD
Conf., pages 28–39, New York, NY, USA, 2003. ACM Press.
[CG05] G. Cormode and M. N. Garofalakis. Sketching streams through the
net: Distributed approximate query tracking. In VLDB, pages 13–24,
2005.
[CG09] G. Cormode and M. N. Garofalakis. Histograms and wavelets on
probabilistic data. In ICDE, 2009.
[CGMR05] G. Cormode, M. N. Garofalakis, S. Muthukrishnan, and R. Rastogi.
Holistic aggregates in a networked world: Distributed tracking of
approximate quantiles. In SIGMOD, pages 25–36, 2005.
[Cha87] B. Chazelle. Approximation and decomposition of shapes. 1987.
[CMY08] G. Cormode, S. Muthukrishnan, and K. Yi. Algorithms for distributed
functional monitoring. In SODA, pages 1076–1085, 2008.
[CMY11] G. Cormode, S. Muthukrishnan, and K. Yi. Algorithms for distributed
functional monitoring. ACM Transactions on Algorithms, 7(2):21,
2011.
[CMYZ10] G. Cormode, S. Muthukrishnan, K. Yi, and Q. Zhang. Optimal
sampling from distributed streams. In PODS, pages 77–86, 2010.
[CMZ06] G. Cormode, S. Muthukrishnan, and W. Zhuang. What’s different:
Distributed, continuous monitoring of duplicate-resilient aggregates
on data streams. In ICDE, pages 20–31, 2006.
[CMZ07] G. Cormode, S. Muthukrishnan, and W. Zhuang. Conquering the
divide: Continuous clustering of distributed data streams. In ICDE,
2007.
[Cor11] G. Cormode. Continuous distributed monitoring: a short survey. In
Proceedings of the First International Workshop on Algorithms and
Models for Distributed Event Processing, AlMoDEP ’11, pages 1–10,
New York, NY, USA, 2011. ACM.
[DGGR04] A. Das, S. Ganguly, M. N. Garofalakis, and R. Rastogi. Distributed
set expression cardinality estimation. In VLDB, pages 312–323, 2004.
[Edm65] J. Edmonds. Paths, trees, and flowers. Canadian Journal of Mathe-
matics, 17(3):449–467, 1965.
[GRM10] R. Gupta, K. Ramamritham, and M. K. Mohania. Ratio threshold
queries over distributed data sources. In ICDE, pages 581–584, 2010.
[GT01] P. B. Gibbons and S. Tirthapura. Estimating simple functions on
union of data streams. In SPAA, 2001.
[GT02] P. B. Gibbons and S. Tirthapura. Distributed streams algorithms for
sliding windows. In SPAA, 2002.
[HNG+06] L. Huang, XuanLong Nguyen, M. N. Garofalakis, Michael I. Jordan,
Anthony D. Joseph, and Nina Taft. In-network pca and anomaly
detection. In NIPS, pages 617–624, 2006.
[HNG+07] L. Huang, X. Nguyen, M. N. Garofalakis, J. M. Hellerstein, M. I.
Jordan, A. D. Joseph, and N. Taft. Communication-efficient online
detection of network-wide anomalies. In INFOCOM, 2007.
[Hoe63] W. Hoeffding. Probability inequalities for sums of bounded random
variables. Journal of the American Statistical Association, pages 13–30,
1963.
[JDZ+07] N. Jain, M. Dahlin, Y. Zhang, D. Kit, P. Mahajan, and P. Yalagandula.
Star: Self-tuning aggregation for scalable monitoring. In VLDB, pages
962–973, 2007.
[JW04] A. Jain, J. M. Hellerstein, S. Ratnasamy, and D. Wetherall. A wakeup call
for internet monitoring systems: The case for distributed triggers. In
HotNets-III, 2004.
[KCR06] R. Keralapura, G. Cormode, and J. Ramamirtham. Communication-
efficient distributed monitoring of thresholded counts. In SIGMOD,
2006.
[KG07] M. R. Kurpius and A. H. Goldstein. Gas-phase chemistry dominates
O3 loss to a forest, implying a source of aerosols and hydroxyl radicals
to the atmosphere. Geophysical Research Letters, 30(7), 2007.
[KL10] E. Kiciman and V. B. Livshits. Ajaxscope: A platform for remotely
monitoring the client-side behavior of web 2.0 applications. TWEB,
4(4), 2010.
[LYJ09] F. Li, K. Yi, and J. Jestes. Ranking distributed probabilistic data. In
SIGMOD, 2009.
[LYRL04] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li. Rcv1: A new bench-
mark collection for text categorization research. Journal of Machine
Learning Research, 5(4):361–397, 2004.
[Mat92] J. Matousek. Range searching with efficient hierarchical cuttings. In
SOCG, pages 276–285, 1992.
[MEA01] M. Elad, A. Tal, and S. Ar. Content based retrieval of VRML objects.
In EG Multimedia, pages 97–108, 2001.
[MF02] S. Madden and M. J. Franklin. Fjording the stream: An architecture
for queries over streaming sensor data. In ICDE, pages 555–566, 2002.
[MGR95] M. Meyer, Y. Gordon, and S. Reisner. Constructing a polytope to
approximate a convex body. Geometriae Dedicata, 57:217–222, 1995.
[MSDO05] A. Manjhi, V. Shkapenyuk, K. Dhamdhere, and C. Olston. Finding
(recently) frequent items in distributed data streams. In ICDE, 2005.
[MTW05] S. Michel, P. Triantafillou, and G. Weikum. Klee: a framework for
distributed top-k query algorithms. In VLDB, pages 637–648, 2005.
[OJW03] C. Olston, J. Jiang, and J. Widom. Adaptive filters for continuous
queries over distributed data streams. In SIGMOD Conf., pages
563–574, 2003.
[RN04] M. Rabbat and R. D. Nowak. Distributed optimization in sensor
networks. In IPSN, pages 20–27, 2004.
[RSW02] T. Rose, M. Stevenson, and M. Whitehead. The Reuters Corpus
Volume 1 - from Yesterday’s News to Tomorrow’s Language Resources.
In Proceedings of the Third International Conference on Language
Resources and Evaluation, Las Palmas de Gran Canaria, May 2002.
[Ser82] J.P. Serra. Image Analysis and Mathematical Morphology. Academic
Press, 1982.
[SKSS10] G. Sagy, D. Keren, I. Sharfman, and A. Schuster. Distributed threshold
querying of general functions by a difference of monotonic representa-
tion. PVLDB, 4(2):46–57, 2010.
[SR08] S. Shah and K. Ramamritham. Handling non-linear polynomial queries
over dynamic data. In ICDE, 2008.
[SSK06] I. Sharfman, A. Schuster, and D. Keren. A geometric approach to
monitoring threshold functions over distributed data streams. In
SIGMOD, 2006.
[SSK07a] I. Sharfman, A. Schuster, and D. Keren. Aggregate threshold queries
in sensor networks. In IPDPS, pages 1–10, 2007.
[SSK07b] I. Sharfman, A. Schuster, and D. Keren. A geometric approach to
monitoring threshold functions over distributed data streams. ACM
Trans. Database Syst., 32(4), 2007.
[SSK08] I. Sharfman, A. Schuster, and D. Keren. Shape sensitive geometric
monitoring. In PODS, 2008.
[Tro10] J. A. Tropp. User-friendly tail bounds for sums of random matrices.
ArXiv e-prints, April 2010.
[TZWL08] L. Tian, P. Zou, F. Wu, and A. Li. Research on communication-
efficient method for distributed threshold monitoring. In WAIM,
pages 441–448, 2008.
[WBK09] R. Wolff, K. Bhaduri, and H. Kargupta. A generic local algorithm
for mining data streams in large distributed systems. IEEE Trans.
Knowl. Data Eng., 21(4):465–478, 2009.
[WDS09] F. Wuhib, M. Dam, and R. Stadler. Gossiping for threshold detection.
In Proceedings of 11th IFIP/IEEE International Symposium on In-
tegrated Network Management, Long Island, NY, USA, 2009. Work
done within the SICS Center for Networked Systems.
[web] The European air quality database,
http://dataservice.eea.europa.eu/dataservice/.
[YSJ+00] B. K. Yi, N. Sidiropoulos, T. Johnson, H. V. Jagadish, C. Faloutsos,
and A. Biliris. Online data mining for co-evolving time sequences. In
ICDE, pages 13–22, 2000.
[YZ09] K. Yi and Q. Zhang. Optimal tracking of distributed heavy hitters
and quantiles. In PODS, 2009.
Abstract (Hebrew)

The use of distributed data stream systems is very common in many branches of technology. In these systems, a large number of nodes, possibly widely dispersed in space, manage continuous streams of dynamic, high-dimensional data. The goal of such systems is to detect, in real time, some global phenomenon based on the entire body of data held by the nodes. Examples include predicting the state of the stock market by monitoring stocks, detecting fires, warning of natural disasters via sensor systems, and more.

One of the basic applications in distributed data stream systems is threshold monitoring. The goal in this case is to issue an alert whenever the value of a given scoring function, computed at each time unit over the entire body of data held by the nodes, crosses a fixed, predefined threshold. The threshold crossing, in effect, indicates the global phenomenon that we wish to detect and alert on when it occurs. Many applications, including the examples mentioned above, can be viewed as instances of threshold monitoring.

Formally, the threshold monitoring problem over a data stream distributed over n nodes can be described as follows: we are given a function f, vectors v1, v2, ..., vn, where vi is the collection or vector of data received by the i-th node, and a threshold value τ. The vectors are dynamic, i.e., their values may change at every time unit, and the system must alert the moment the value of f(v1, v2, ..., vn) crosses the threshold τ – that is, passes from a value smaller than the threshold to a value larger than it, or vice versa. This simple threshold-crossing condition, which may seem naive and of little use, suffices to represent a very wide variety of "bad" events that we would like dynamic distributed systems to alert on.

Consider, for example, a system for monitoring the air quality in some country, composed of a hundred sensors distributed over the country's cities, n = 100, where each sensor measures, every second, the concentrations of certain air pollutants in the region in which it is located at that second. The function f can express some measure of the level of air pollution, for example the average concentration of a certain pollutant over the country, or the ratio between the averages of two different pollutants. The air pollution monitoring system would then want to alert when the value of the function – the pollution measure – exceeds a certain threshold.

A common approach today to the distributed monitoring problem is to centralize, continuously or at fixed time intervals, all the data received in the network, or summaries of this data, thereby reducing the distributed problem to a non-distributed one. Such centralization, however, may simply be infeasible in practice due to the enormous volume and dynamic nature of the data, which would cause heavy loads and delays in the communication network. Moreover, in the case of a sensor network in which every sensor runs on a battery, this large amount of communication would quickly drain the batteries.

In this thesis we present work proposing techniques for saving communication when monitoring a distributed network. The techniques are based on the following simple observation: the nodes of the system need not send a message for every new data item that arrives; it suffices to send messages only when something "interesting" happens. We formalize this observation via the idea of "Safe Zones". Each node of the system receives a "Safe Zone", and is asked to communicate only when it receives a new data item that takes it out of its Safe Zone. We ensure that as long as every node is inside its Safe Zone, the global "bad" event – the threshold crossing – cannot occur, and hence no communication is needed at all. However, in the case of a local violation, when the data sampled by some node in the system is no longer inside the Safe Zone allocated to it, nothing is guaranteed with respect to the threshold crossing, and communication is unavoidable. The central challenge in this approach is therefore to set Safe Zones that reduce, as much as possible, the probability of a local violation.

Many works exist for the simple cases in which the monitored function is linear or monotonic. However, many interesting and important functions are neither linear nor monotonic, and these works cannot be extended to handle general functions. We propose novel, generic techniques for distributed monitoring of arbitrary functions. We introduce and formalize the notion of Safe Zones, and then define the problem of finding the optimal Safe Zones – the Safe Zones whose use yields, in expectation, the maximal saving in communication. We first show that this problem is hard to solve, so there is no choice but to look for heuristic solutions that come close to the optimum. We attack the problem with geometric tools, and present an algorithm for computing communication-efficient Safe Zones that constitute some geometric shape, e.g., a convex polygon. We then present a different approach to the problem of finding the optimal Safe Zones, which is more suitable when the data arriving at the nodes is of a discrete nature. In this method, the Safe Zone each node receives is simply a subset of the data values it may receive in the future, rather than a geometric shape as in the previous method. This second method is even more general than the first, since the function need not be defined over the sum of the data of all the nodes, but can be an arbitrary function of them.

Finally, we present communication-efficient algorithms that determine whether the global "bad" event – the threshold crossing – has occurred when one of the nodes leaves its Safe Zone – a local violation. Unfortunately, violations of the Safe Zones, good as they may be, cannot be avoided. Moreover, the chance of a local violation grows as the number of nodes in the system grows. Even though in most cases local violations indicate a merely local phenomenon, they may indicate a threshold crossing; communication is therefore, as noted, unavoidable. We show that in most cases the violations can be resolved without involving all the nodes in the system. In particular, in cases where the threshold is not crossed, this can be deduced from the data and Safe Zones of the set of violating nodes together with additional, non-violating nodes, which we call the resolving set. We present, for the first time, a precise and formal statement of the conditions required for resolution, and then present methods for searching for a resolving set that is as small as possible, in order to resolve the local violation with a minimal number of messages.

The applicability of our techniques to several real-world problems is demonstrated by experiments, which also show their advantage, by orders of magnitude, over the previous works in the field. The experiments and the theoretical analysis show that the advantage of our methods becomes more and more pronounced as the dimension of the data grows.