Scalable, Asynchronous, Distributed
Eigen-Monitoring of Astronomy Data Streams
Kanishka Bhaduri∗, Kamalika Das†, Kirk Borne‡, Chris Giannella§, Tushar
Mahule¶, Hillol Kargupta¶‖
∗Mission Critical Technologies Inc.,
NASA Ames Research Center, MS 269-1, Moffett Field, CA-94035
Email:[email protected]
†Stinger Ghaffarian Technologies Inc.,
NASA Ames Research Center, MS 269-3, Moffett Field, CA-94035
Email:[email protected]
‡Computational and Data Sciences Dept., GMU, VA-22030
Email:[email protected]
§The MITRE Corporation
300 Sentinel Dr. Suite 600, Annapolis Junction MD 20701
Email:[email protected]
¶CSEE Department, UMBC
1000 Hilltop Circle, Baltimore, Maryland, 21250, USA
Email:{tusharm1,hillol}@cs.umbc.edu
‖AGNIK, LLC., Columbia, MD, USA
A shorter version of this paper was published in SIAM Data Mining Conference 2009
February 9, 2011 DRAFT
Abstract
In this paper we develop a distributed algorithm for monitoring the principal components (PCs) of next-generation astronomy petascale data pipelines such as the Large Synoptic Survey Telescope (LSST). This telescope will take repeat images of the night sky every 20 seconds, thereby generating 30 terabytes of calibrated imagery every night that will need to be co-analyzed with other astronomical data stored at different locations around the world. Event detection, classification and isolation in such data sets may provide useful insights into unique astronomical phenomena displaying astrophysically significant variations: quasars, supernovae, variable stars, and potentially hazardous asteroids. However, performing such data mining tasks is a challenging problem for such high-throughput distributed data streams. In this paper we propose a highly scalable, distributed, asynchronous algorithm for monitoring the principal components of such dynamic data streams and discuss a prototype web-based system, PADMINI (Peer-to-Peer Astronomy Data Mining), which implements this algorithm for use by astronomers. We demonstrate the algorithm on a large set of distributed astronomical data to accomplish well-known astronomy tasks such as measuring variations in the fundamental plane of galaxy parameters. The proposed algorithm is provably correct (i.e., it converges to the correct PCs without centralizing any data) and can seamlessly handle changes to the data or the network. Experiments performed on real Sloan Digital Sky Survey (SDSS) catalog data show the effectiveness of the algorithm.
I. INTRODUCTION
Data mining is playing an increasingly important role in astronomy research [25][9][4] involving very large sky surveys such as the Sloan Digital Sky Survey (SDSS) and the 2-Micron All-Sky Survey (2MASS). These sky surveys offer a new way to study and analyze the behavior of astronomical objects. The next generation of sky surveys is poised to take a step further by incorporating sensors that will stream in large volumes of data at a high rate. For example, the Large Synoptic Survey Telescope (LSST) will take repeated images of the night sky every 20 seconds. This will generate 30 terabytes of calibrated imagery every night that will need to be co-analyzed with other astronomical data stored at different locations around the world. Event identification and classification in such data sets may provide useful insights into unique astronomical phenomena displaying astrophysically significant variations: quasars, supernovae, variable stars, and potentially hazardous asteroids. Analyzing such high-throughput data streams will require large distributed computing environments for offering scalable performance. The knowledge discovery potential of these future massive data streams will not be achieved unless novel data mining algorithms are developed to handle decentralized petascale data flows, often from multiple distributed sensors (data producers) and archives (data providers). Several distributed computing frameworks such as [16], [22], [23], [19], and [14] are being developed for such applications. We need distributed data mining algorithms that can operate on such distributed computing environments. These algorithms should be highly scalable, provide good accuracy, and have a low communication overhead.
This paper considers the problem of monitoring the spectral properties of data streams in a distributed environment. It offers an asynchronous, communication-efficient distributed eigen-monitoring algorithm for tracking the principal components (PCs) of dynamic astronomical data streams. It particularly considers an important problem in astronomy regarding the variation of the fundamental plane structure of galaxies with respect to spatial galactic density, and demonstrates the power of distributed data mining (DDM) algorithms using this example application. This paper presents the algorithm, analytical findings, and results from experiments performed using currently available astronomy data sets from virtual observatories. Our distributed algorithm is a first step toward analyzing the astronomy data arriving from such high-throughput data streams of the future.
The specific contributions of this paper can be summarized as follows:
• To the best of the authors' knowledge, this is one of the first attempts at developing a completely asynchronous and local algorithm for performing eigen analysis in distributed data streams.
• Based on data sets downloaded from astronomy catalogs such as SDSS and 2MASS, we demonstrate how the galactic fundamental plane structure varies with galactic density.
• We discuss the architecture, workflow and deployment of an entirely web-based P2P astronomy data mining prototype (PADMINI) that allows astronomers to perform event detection and analysis using P2P data mining technology.
Section II describes the astronomy problem. Section III presents the related work. Section IV describes the centralized version of the problem, while Section V models the distributed version and explains our distributed eigenstate monitoring algorithm. Section VI presents the experimental results, followed by the web-based PADMINI astronomy system in Section VII. Finally, Section VIII concludes this paper.
II. ASTRONOMY DATA STREAM CHALLENGE PROBLEMS
When the LSST astronomy project becomes operational within the next decade, it will pose enormous petascale data challenges. This telescope will repeatedly take images of the night sky every 20 seconds, throughout every night, for 10 years. Each image will consist of 3 gigapixels, yielding 6 gigabytes of raw imagery every 20 seconds and nearly 30 terabytes of calibrated imagery every night. From this “cosmic cinematography”, a new vision of the night sky will emerge – a vision of the temporal domain – a ten-year time series (movie) of the Universe. Astronomers will monitor these repeat images night after night, for 10 years, for everything that has changed – changes in position and intensity (flux) will be monitored, detected, measured, and reported. For those temporal variations that are novel, unexpected, previously unknown, or outside the bounds of our existing classification schemes, astronomers will want to know quickly if such an event has occurred. This event alert notification must necessarily include as much information as possible to help astronomers around the world prioritize their response to each time-critical event. This information packet will include a probabilistic classification of the event, with some measure of the confidence of the classification. What makes the LSST so incredibly beyond current projects is that most time-domain sky surveys today detect 5-10 events per week; LSST will detect 10 to 100 thousand events per night! Without good classification information in those alert packets, and hence without some means with which to prioritize the huge number of events, the astronomy community would be buried in the data deluge and would miss some of the greatest astronomical discoveries of the next 20 years, e.g., even the next “killer asteroid” heading for Earth.
To solve the astronomers’ massive event classification problem, a collection of high-throughput monitoring and event detection algorithms will be needed. These algorithms will need to access distributed astronomical databases worldwide to correlate with each of those 100,000 nightly events, in order to model, classify, and prioritize each event correctly and rapidly. One known category of temporally varying astronomical object is the variable star. There are dozens of well-known classes of variable stars, and there are hundreds (even thousands) of known examples of these classes. These stars are not “interesting” in the sense that they should not produce alerts, even though they are changing in brightness from hour to hour, night to night, week to week – their variability is known, well studied, and well characterized already. However, if one of these stars’ class of variability were to change, that would be extremely interesting and would be a signal that some very exotic astrophysical processes are involved. Astronomers will definitely want to be notified promptly (with an alert) of these types of variations. One way of characterizing this variation is by studying changes in the Fourier components (eigenvectors) of the temporal flux curve, which astronomers call “the light curve”.
This problem has several interesting data challenge characteristics: (1) the data streaming rate is enormous (6 gigabytes every 20 seconds); (2) there are roughly 100 million astronomical objects in each of these images, all of which need to be monitored for change (i.e., a new variable object, or a known variable object with a new class of variability); (3) 10 to 100 thousand “new” events will be detected each and every night for 10 years; and (4) distributed data collections accessed through the Virtual Astronomy Observatory’s worldwide distribution of databases and data repositories will need to be correlated and mined in conjunction with each new variable object’s data from LSST, in order to provide the best classification models and probabilities, and thus to generate the most informed alert notification messages.
Astronomers cannot wait until the year 2016, when LSST begins its sky survey operations, for new algorithms to begin being researched. Those algorithms should be able to analyze the data in situ, without the costly need for centralizing all of it for analysis at each time point. Furthermore, the distributed mining algorithms will need to be robust, scalable, and validated already at that time. So, it is imperative to begin now to research, test, and validate such data mining paradigms through experiments that replicate the expected conditions of the LSST sky survey. Consequently, we have chosen an astronomical research problem that is both scientifically valid (i.e., a real astronomy research problem today) and that parallels the eigenvector monitoring problem described above. We have chosen to study the principal components of galaxy parameters as a function of an independent variable, similar to the temporal dynamic stream mining described above. In our current experiments, the independent variable is not the time dimension, but local galaxy density. We specifically investigate this problem because it is scientifically current and interesting, thereby producing new astronomical research results, and also because it tests our algorithms on the same types of distributed databases that will be used in the future LSST event classification problems.
The class of elliptical galaxies has been known for 20 years to show dimension reduction among a subset of physical attributes, such that the 3-dimensional distribution of three of those astrophysical parameters reduces to a 2-dimensional plane. The normal to that plane represents the principal eigenvector of the distribution, and it is found that the first two principal components capture significantly more than 90% of the variance among those 3 parameters.
By analyzing existing large astronomy databases (such as the Sloan Digital Sky Survey SDSS and the 2-Micron All-Sky Survey 2MASS), we have generated a very large data set of galaxies. Each galaxy in this large data set was then assigned (labeled with) a new “local galaxy density” attribute, calculated through a volumetric Voronoi tessellation of the total galaxy distribution in space. The entire galaxy data set was then horizontally partitioned across several dozen partitions as a function of our independent variable: the local galaxy density.
As a result, we have been able to study eigenvector changes of the fundamental plane of elliptical galaxies as a function of density. Computing these eigenvectors for a very large number of galaxies, one density bin at a time, in a distributed environment, thus mimics the future LSST dynamic data stream mining (eigenvector change) challenge problem described earlier. In addition, this galaxy problem has actually uncovered some new astrophysical results: we find that the variance captured in the first 2 principal components increases systematically from low-density regions of space to high-density regions of space, and we find that the direction of the principal eigenvector also drifts systematically in the 3-dimensional parameter space from low-density regions to the highest-density regions. However, since the focus of this paper is on the distributed algorithms themselves and not the discovery of new astrophysical results, we leave a detailed discussion of them to another paper.
III. RELATED WORK
The work related to this area of research can broadly be subdivided into data analysis for large scientific data repositories and distributed data mining in dynamic networks of nodes. We discuss each of them in the following two sections.
A. Analysis of large scientific data collections
The U.S. National Virtual Observatory (NVO) [35] is a large-scale effort to develop an information technology infrastructure enabling easy and robust access to distributed astronomical archives. It will provide services for users to search and gather data across multiple archives and will provide some basic statistical analysis and visualization functions. The International Virtual Observatory Alliance (IVOA) [27] is the international steering body that federates the work of about two dozen national VOs across the world (including the NVO in the US). The IVOA oversees this large-scale effort to develop an IT infrastructure enabling easy and robust access to distributed astronomical archives worldwide.
There are several instances in the astronomy and space sciences research communities where data mining is being applied to large data collections [16][13][2]. Another recent area of research is distributed data mining [30][28], which deals with the problem of data analysis in environments with distributed data, computing nodes, and users. Distributed eigen-analysis and outlier detection algorithms have been developed for analyzing astronomy data stored at different locations by Dutta et al. [20]. Kargupta et al. [29] have developed a technique for performing distributed principal component analysis by first projecting the local data along its principal components and then centralizing the projected data. In both these cases, the data is distributed vertically (different full attribute columns reside at different sites), while in this paper, the data is distributed horizontally (different data tuple sets reside at different sites). Moreover, none of the above efforts address the problem of analyzing rapidly changing astronomy data streams.
B. Data analysis in large dynamic networks
There is a significant amount of recent research considering data analysis in large-scale dynamic networks. Since efficient data analysis algorithms can often be developed based on efficient primitives, approaches have been developed for computing basic operations (e.g., average, sum, max, random sampling) on large-scale, dynamic networks. Kempe et al. [31] and Boyd et al. [10] developed gossip-based randomized algorithms. These approaches use an epidemic model of computation. Bawa et al. [5] developed an approach based on probabilistic counting. In addition, techniques have been developed for addressing more complex data mining problems over large-scale dynamic networks: association rule mining [39], facility location [32], outlier detection [11], decision tree induction [8], ensemble classification [33], support vector machine-based classification [1], k-means clustering [15], and top-k query processing [3].
A related line of research concerns the monitoring of various kinds of data models over large numbers of data streams. Sharfman et al. [37] develop an algorithm for monitoring arbitrary threshold functions over distributed data streams. Most relevant to this paper, Wolff et al. [38] developed an algorithm for monitoring the L2 norm. We use this technique to monitor eigenstates of the fundamental plane of elliptical galaxies. The formulation and experimental results presented in this paper are new and do not appear in [38].
Huang et al. [26] consider the problem of detecting network-wide volume anomalies via thresholding the length of a data vector (representing current network volume) projected onto a subspace closely related to the dominant principal component subspace of past network volume data vectors. Unlike us, these authors consider the analysis of a vertically distributed data set. Each network node holds a sliding window stream of numbers (representing volume through it with time), and the network-wide volume is represented as a matrix with each column a node stream. Because of this difference in data distribution (vertical vs. horizontal), their approach is not applicable to our problem. We assume that each node receives a stream of tuples and the network-wide data set is the matrix formed by the union of all nodes’ currently held tuples (each node holds a collection of rows of the matrix rather than a single column as considered by Huang).
In the next few sections we first discuss our analysis for identifying the fundamental plane
of elliptical galaxies, and then show how the same computation can be carried out if the data is
stored at multiple locations.
IV. CENTRALIZED PRINCIPAL COMPONENT ANALYSIS FOR THE FUNDAMENTAL PLANE COMPUTATION
The identification of certain correlations among parameters has led to important discoveries in astronomy. For example, the class of elliptical and spiral galaxies (including dwarfs) has been found to occupy a 2D space inside a 3D space of observed parameters — radius, mean surface brightness and velocity dispersion. From this 3D space of observed parameters, the 2D plane can be derived by projecting the data on the top 2 eigenvectors of the covariance matrix, i.e., performing a principal component analysis (PCA) of the covariance matrix of the data. This 2D plane has been referred to as the Fundamental Plane [21]. We study the variation of this fundamental plane with the density of each galaxy, derived from location and other observed parameters.
A. Data preparation
For identifying the variability of the fundamental plane on the basis of galactic densities, we have used the SDSS and 2MASS data sets available individually through the NVO. Since galactic density is not observed by the NVOs, we have cross-matched the two data sets and computed the densities based on other property values, the details of which we describe next.
We create a large, aggregate data set by downloading the 2MASS XSC extended source catalog (http://irsa.ipac.caltech.edu/applications/Gator/) for the entire sky and cross-match it against the SDSS catalog using the SDSS CrossID tool (http://cas.sdss.org/astro/en/tools/crossid/upload.asp) such that we select all unique attributes from the PhotoObjAll and SpecObjAll tables as well as the photozd1 attribute from the Photoz2 table, which is an estimated redshift value. We filter the data based on the SDSS identified type to remove all non-galaxy tuples. We then filter the data again based on reasonable redshift (actual or estimated) values between 0.003 ≤ z ≤ 0.300.
Next, we create a new attribute, local galactic density, to quantify the proximity of nearby galaxies to a given galaxy (this attribute has strong astrophysical significance). We transformed the attributes cx, cy, cz (unit vectors), z, and photozd1 to 3D Euclidean coordinates

(X, Y, Z) = (Distance × cx, Distance × cy, Distance × cz)

where Distance = 2 × [1 − 1/√(1 + redshift)]. We finally use these Cartesian coordinates to compute the Delaunay tessellation [18] of each point (galaxy) in 3D space. The local galactic density of a given galaxy i is computed using the Delaunay Tessellation Field Estimator (DTFE) formulation [36]:

den(i) = 4/vol(i)

where vol(i) denotes the volume of the Delaunay cell containing galaxy i. A small number of galaxies have undefined den(i) because they are on the boundary and have vol(i) = ∞. These galaxies are dropped from our calculations.
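The coordinate transform and density estimate above can be sketched as follows. This is a minimal illustration, not the pipeline actually used: it assumes the attributes are already available as NumPy arrays, and it uses the volume of each galaxy's Voronoi cell (via `scipy.spatial`) as a stand-in for the Delaunay cell volume described in the text. Galaxies with unbounded cells get an undefined (NaN) density and are dropped, as above.

```python
import numpy as np
from scipy.spatial import ConvexHull, Voronoi

def cartesian_coords(cx, cy, cz, redshift):
    # Distance = 2 * [1 - 1/sqrt(1 + redshift)], as in the text
    d = 2.0 * (1.0 - 1.0 / np.sqrt(1.0 + redshift))
    return np.column_stack((d * cx, d * cy, d * cz))

def local_density(points):
    # den(i) = 4 / vol(i); unbounded boundary cells are left as NaN
    vor = Voronoi(points)
    den = np.full(len(points), np.nan)
    for i, r in enumerate(vor.point_region):
        region = vor.regions[r]
        if -1 in region or len(region) < 4:   # unbounded or degenerate cell
            continue
        den[i] = 4.0 / ConvexHull(vor.vertices[region]).volume
    return den
```

Interior galaxies receive positive densities; the NaN entries mimic the boundary galaxies with vol(i) = ∞ that the text discards.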
B. Binning and PCA
The astronomy question that we want to address here is whether the fundamental plane structure of galaxies in low-density regions differs from that of galaxies in high-density regions. For this, we take the above data set containing 155650 tuples and associate with each tuple a measure of its local galactic density. Our final aggregated data set has the following attributes from SDSS: Petrosian I band angular effective radius (Iaer), redshift (rs), and velocity dispersion (vd); and the following attribute from 2MASS: K band mean surface brightness (Kmsb). We produce a new attribute, logarithm of the Petrosian I band effective radius (log(Ier)), as log(Iaer*rs), and a new attribute, logarithm of the velocity dispersion (log(vd)), by applying the logarithm to vd. We finally append the galactic density (cellDensity) associated with each of the tuples as the last attribute of our aggregated data set. We divide the tuples into 30 bins based on increasing cell density, such that there is an equal number of tuples in each bin. For each bin we carry out the fundamental plane calculation, or principal component analysis, and observe that the percentage of variance captured by the first two PCs is very high. This implies that the galaxies can be represented by the plane defined by the first two eigenvectors. It is also observed that this percentage increases for bins with higher mean galactic density. We report these results in Section VI.
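The equal-count binning and per-bin PCA described above can be sketched as follows. The bin count follows the text; the assumption here is that the data matrix holds the three fundamental-plane attributes (e.g., log(Ier), Kmsb, log(vd)) as columns.

```python
import numpy as np

def variance_in_top2_by_bin(X, density, n_bins=30):
    """Sort tuples by local galactic density, split them into n_bins
    equal-count bins, and return for each bin the fraction of variance
    captured by the first two principal components."""
    order = np.argsort(density)
    fractions = []
    for idx in np.array_split(order, n_bins):
        B = X[idx] - X[idx].mean(axis=0)      # mean-reduce the bin
        C = B.T @ B / len(B)                  # covariance matrix of the bin
        evals = np.sort(np.linalg.eigvalsh(C))[::-1]
        fractions.append(evals[:2].sum() / evals.sum())
    return fractions
```

A fraction close to 1 for a bin means the galaxies in that bin lie close to a plane, i.e., the fundamental plane holds in that density range.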
As discussed earlier, analysis of very large astronomy catalogs can pose serious scalability issues, especially when considering streaming data from multiple sources. In the next section we describe a distributed architecture for addressing these issues and then show how the centralized eigen analysis of the covariance matrix can be formulated as a distributed computation and solved in a communication-efficient manner.
V. DISTRIBUTED PRINCIPAL COMPONENT ANALYSIS
When resources become a constraint for doing data mining on massive data sets such as astronomical catalogs, distributed data mining provides a communication-efficient solution. For the problem discussed in the last section, we can formulate a distributed architecture where, after cross-matching the data using a centralized cross-matching tool, we store the metadata information at a central location. Such a service-oriented architecture would allow astronomers to query multiple databases and do data mining on large data sets without downloading the data to their local computing resources. The data set is downloaded in parts at a number of computing nodes (that are either dedicated computers connected through communication channels or part of a large grid) based on the metadata information maintained at the central server site. In such a computational environment, distributed data mining algorithms can run in the background seamlessly, providing fast and efficient solutions to the astronomers by distributing the task among a number of nodes. Figure 1 represents one such architecture, in which the user submits jobs through the web server and the DDM server executes these jobs using the underlying P2P architecture.
Fig. 1. System diagram showing the different components.
A. Notations
Let V = {P1, . . . , Pn} be a set of nodes connected to one another via an underlying communication infrastructure such that the set of Pi's neighbors, Γi, is known to Pi. Additionally, at any time, Pi holds a time-varying data matrix Mi of size |Mi|, where the rows correspond to the instances and the columns correspond to attributes or features. Mathematically, Mi = [x⃗i,1 x⃗i,2 . . . ]ᵀ, where each x⃗i,ℓ = [xi,ℓ1 xi,ℓ2 . . . xi,ℓd] ∈ Rᵈ is a row (tuple). The covariance matrix of the data at node Pi, denoted by Ci, is the matrix whose (j, k)-th entry is the covariance between the j-th and k-th feature (column) of Mi. The global data set of all the nodes is G = ⋃_{i=1}^{n} Mi and the global covariance matrix is C. Let V⃗, Θ and µ⃗ denote the eigenvector (assumed to be of unit length), eigenvalue and mean of the global data, respectively. Throughout this discussion we drop the explicit time subscript.
B. Problem formulation
The problem that we want to solve in this paper can be stated as follows:
Definition 5.1 (Problem Statement): Given a time-varying data set Mi at each node, maintain an up-to-date set of eigenvectors (V⃗) and eigenvalues (Θ) of the global covariance matrix C, i.e., find V⃗ and Θ such that

C V⃗ = Θ V⃗.

In our scenario, since the data is constantly changing, we relax this requirement and use the following as an admissible solution.
Definition 5.2 (Relaxed Problem Statement): Given a time-varying data set Mi at each node, maintain an up-to-date set of eigenvectors (V⃗) and eigenvalues (Θ) of the global covariance matrix C such that

‖C V⃗ − Θ V⃗‖ < ε,

where ε is a user-chosen parameter denoting the error threshold.
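The relaxed admissibility test of Definition 5.2 is a single norm check; a minimal sketch (function name hypothetical):

```python
import numpy as np

def eigenstate_ok(C, V, theta, eps):
    """True if the tracked eigenpair (V, theta) still fits the global
    covariance matrix C, i.e., ||C V - theta V|| < eps."""
    return np.linalg.norm(C @ V - theta * V) < eps
```

When this check fails for the global data, the model is considered stale and must be rebuilt.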
C. Distributed PCA monitoring
One way of keeping the model up-to-date is by periodically rebuilding it. However, this wastes resources if the data is stationary. Alternatively, one may risk model inaccuracy if the period of recomputation is too long and the data changes in between.
In this work, we take a different approach. Starting with an arbitrary model at each node, we propose an algorithm which raises an alert whenever the global data of the nodes can no longer fit this model. If the data has changed enough, we use a feedback loop to collect data from the network (using a convergecast), rebuild a new model, and then distribute this new model to all the nodes to be tracked again by the peers against the current data. Below, we reduce the problem of monitoring the eigenvectors and eigenvalues to checking if a local vector at each peer is inside a sphere of radius ε.
Note that, if all the columns of G are mean-reduced (using the global mean) by the respective columns, i.e., the mean of each column is subtracted from each entry of that column, the covariance matrix is decomposable into the per-node matrices Ci = MiᵀMi. With abuse of notation, let G and Mi denote the mean-reduced versions of the variables themselves. Then, writing [M1; M2; . . . ; Mn] for the row-wise stacking of the local data sets,

C = (1/Σ_{i=1}^{n}|Mi|) GᵀG
  = (1/Σ_{i=1}^{n}|Mi|) [M1; M2; . . . ; Mn]ᵀ [M1; M2; . . . ; Mn]
  = (1/Σ_{i=1}^{n}|Mi|) (M1ᵀ M2ᵀ . . . Mnᵀ) [M1; M2; . . . ; Mn]
  = (1/Σ_{i=1}^{n}|Mi|) Σ_{i=1}^{n} MiᵀMi
  = (1/Σ_{i=1}^{n}|Mi|) Σ_{i=1}^{n} Ci.    (1)
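Equation (1) can be checked numerically. The sketch below builds a small global data set, mean-reduces it with the global mean, and verifies that the normalized sum of the per-node matrices MiᵀMi equals the global covariance (the data and partition sizes are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.normal(size=(500, 4))        # global data set: 500 tuples, d = 4
G = G - G.mean(axis=0)               # mean-reduce with the *global* mean
parts = np.split(G, [100, 350])      # three horizontal partitions M_i

N = sum(len(Mi) for Mi in parts)
C_global = G.T @ G / N
C_sum = sum(Mi.T @ Mi for Mi in parts) / N   # (1/sum|Mi|) * sum C_i
print(np.allclose(C_global, C_sum))          # True
```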
Thus, for horizontally partitioned, mean-reduced data distributed among n nodes, the covariance matrix is completely decomposable. Assuming that each peer is provided with an initial estimate of V⃗ (with ‖V⃗‖ = 1) and Θ, the eigen-monitoring instance (denoted by I1) can be reformulated as:
‖C V⃗ − Θ V⃗‖ < ε1
⇔ ‖ ( (1/Σ_{i=1}^{n}|Mi|) Σ_{i=1}^{n} Ci ) V⃗ − Θ V⃗ ‖ < ε1
⇔ ‖ (1/Σ_{i=1}^{n}|Mi|) Σ_{i=1}^{n} [ Ci V⃗ − Θ V⃗ |Mi| ] ‖ < ε1
⇔ ‖ Σ_{i=1}^{n} (|Mi| / Σ_{i=1}^{n}|Mi|) [ Ci V⃗ / |Mi| − Θ V⃗ ] ‖ < ε1
⇔ ‖ Σ_{i=1}^{n} (|Mi| / Σ_{i=1}^{n}|Mi|) [ I1.E⃗i ] ‖ < ε1    (2)
where I1.E⃗i is a local error vector at node Pi (based on Mi, V⃗ and Θ) defined as I1.E⃗i = Ci V⃗ / |Mi| − Θ V⃗. Let I1.E⃗G = Σ_{i=1}^{n} (|Mi| / Σ_{i=1}^{n}|Mi|) [I1.E⃗i] denote the convex combination of the I1.E⃗i's.
Checking if the norm of I1.E⃗i is less than ε1 is equivalent to checking if the vector I1.E⃗i is inside a sphere of radius ε1. Now if each peer determines that its own vector I1.E⃗i is inside the sphere, then so is their convex combination I1.E⃗G. This is the crux of the idea used in developing the distributed algorithm. However, this argument falls apart when the vectors are outside the sphere. To circumvent this problem and apply the same methodology, the region outside the sphere is approximated by a set of half-spaces defined by tangent planes to the sphere. As before, if every peer's I1.E⃗i lies in the same half-space, I1.E⃗G will also lie there. This paves the way for the distributed algorithm, which is discussed next.
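The covering-region test behind this idea can be sketched as follows (a simplified, hypothetical helper; the actual algorithm of [38] also maintains agreement and held vectors). A tangent point t with ‖t‖ = ε defines the half-space {x : x·t ≥ ε²} lying beyond the tangent plane at t.

```python
import numpy as np

def classify(v, eps, tangent_points):
    """Map a local vector v to one of the convex regions covering R^d:
    the inside of the sphere of radius eps, one of the tangent half-spaces,
    or a tie region (covered by neither, which forces communication)."""
    if np.linalg.norm(v) < eps:
        return "inside-sphere"
    for k, t in enumerate(tangent_points):    # each t lies on the sphere
        if v @ t >= eps ** 2:                 # beyond the tangent plane at t
            return f"half-space-{k}"
    return "tie-region"
```

If all the vectors a peer sees fall into one such convex region, so does their convex combination, which is what lets the peer reason locally about the global vector.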
Note that, in the above formulation we have assumed that the data is mean-shifted. In a dynamic setting, it may be expensive to recompute the mean at each time step. Given an initial estimate of the mean µ⃗ at all the peers (possibly a random choice), we set up another monitoring instance I2 for checking if the (column-wise) average vector over all peers deviates from µ⃗ by more than a threshold ε2:

‖ (1/Σ_{i=1}^{n}|Mi|) Σ_{i=1}^{n} Σ_{j=1}^{|Mi|} x⃗i,j − µ⃗ ‖ < ε2
⇔ ‖ (1/Σ_{i=1}^{n}|Mi|) Σ_{i=1}^{n} [ Σ_{j=1}^{|Mi|} x⃗i,j − µ⃗ |Mi| ] ‖ < ε2
⇔ ‖ Σ_{i=1}^{n} (|Mi| / Σ_{i=1}^{n}|Mi|) [ (Σ_{j=1}^{|Mi|} x⃗i,j) / |Mi| − µ⃗ ] ‖ < ε2
⇔ ‖ Σ_{i=1}^{n} (|Mi| / Σ_{i=1}^{n}|Mi|) [ I2.E⃗i ] ‖ < ε2    (3)

where, as before, I2.E⃗i = (Σ_{j=1}^{|Mi|} x⃗i,j) / |Mi| − µ⃗ is the local vector and I2.E⃗G = Σ_{i=1}^{n} (|Mi| / Σ_{i=1}^{n}|Mi|) [I2.E⃗i] is a convex combination of the I2.E⃗i's. The same convex methodology for checking the inside and outside of the sphere can be applied here.
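Both local vectors can be computed directly from a node's current data. The sketch below (hypothetical helper names) also verifies that their |Mi|-weighted convex combinations recover the global quantities bounded in (2) and (3).

```python
import numpy as np

def i1_error(Mi, V, theta):
    # I1.E_i = C_i V / |M_i| - theta V, with C_i = M_i^T M_i (mean-reduced data)
    return (Mi.T @ Mi) @ V / len(Mi) - theta * V

def i2_error(Mi, mu):
    # I2.E_i = column-wise average of M_i minus the mean estimate mu
    return Mi.mean(axis=0) - mu

rng = np.random.default_rng(1)
G = rng.normal(size=(400, 3))
G = G - G.mean(axis=0)                       # global mean reduction
parts = np.split(G, [150, 250])              # three nodes' data
V, theta, mu = np.array([1.0, 0.0, 0.0]), 0.5, np.zeros(3)
N = len(G)

e1 = sum(len(Mi) / N * i1_error(Mi, V, theta) for Mi in parts)
e2 = sum(len(Mi) / N * i2_error(Mi, mu) for Mi in parts)
print(np.allclose(e1, (G.T @ G / N) @ V - theta * V))   # True
print(np.allclose(e2, G.mean(axis=0) - mu))             # True
```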
Satisfying the relaxed problem statement: In the appendix, we show that if the bounds (2) and (3) hold, then the problem statement in the above definition holds with ε = ε1 + ε2².
D. Notations and thresholding criterion
In our algorithm, each node sends messages to its immediate neighbors to converge to a globally correct solution. There are three kinds of messages that can be transmitted: (i) monitoring messages, which are used by the algorithm to check if the model is up-to-date, (ii) data messages, which are used to sample data for rebuilding a model (convergecast), and (iii) model messages, which are used to disseminate the newly built model in the entire network (broadcast). Any monitoring message sent by node Pi to Pj contains information that Pi has gathered about the network which Pj may not know. In our case, the message sent by Pi to Pj consists of a set of vectors or, equivalently, a matrix Si,j with each row corresponding to observations and each column corresponding to features.
We know that if each peer's I1.E⃗i (or I2.E⃗i) lies in the same convex region, then I1.E⃗G (or I2.E⃗G) also lies in that region. Therefore, each peer needs information about the state of its neighbors. The trick is to do this computation without collecting all of the data of all the peers. We define three sets of vectors at each node, each summarized by two sufficient statistics (the average vector of the set and the number of points in the set), for each instance of the monitoring problem separately; based on these, a peer can do the thresholding more efficiently. For the rest of the paper, we only discuss the computations with respect to I1, since the other instance is very similar.
• Knowledge K⃗i: all the information that Pi has about the network.
• Agreement A⃗i,j: what Pi and Pj have in common.
• Held H⃗i,j: what Pi has not yet communicated to Pj.
For the sizes of these sets we can write:
• |Ki| = |Mi| + Σ_{Pj∈Γi} |Sj,i|
• |Ai,j| = |Si,j| + |Sj,i|
• |Hi,j| = |Ki| − |Ai,j|
Similarly, for the averages of the sets we can write:
• −→Ki = (1/|Ki|) (|Mi|−→Ei + Σ_{Pj∈Γi} |Sj,i|−→Sj,i)
• −−→Ai,j = (1/|Ai,j|) (|Si,j|−→Si,j + |Sj,i|−→Sj,i)
• −−→Hi,j = (1/|Hi,j|) (|Ki|−→Ki − |Ai,j|−−→Ai,j)
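As an illustration, this bookkeeping can be sketched in a few lines of Python. The system described in this paper is Java-based, so the class and method names below are ours; each statistic is stored as a (count, average) pair, exactly mirroring the formulas above.

```python
def _combine(c1, v1, c2, v2):
    """Merge two (count, average) summaries into one."""
    if c1 == 0:
        return c2, v2
    if c2 == 0:
        return c1, v1
    n = c1 + c2
    return n, [(c1 * a + c2 * b) / n for a, b in zip(v1, v2)]

class PeerStats:
    """Sufficient statistics one peer keeps for one monitoring instance.

    local_count = |M_i|, local_avg = E_i (average of the local vectors).
    sent[j] / recv[j] hold the (count, average) of the last message
    exchanged with neighbor j in each direction (S_ij and S_ji).
    """
    def __init__(self, local_avg, local_count):
        self.local_avg, self.local_count = list(local_avg), local_count
        self.sent = {}   # neighbor -> (|S_ij|, average of S_ij)
        self.recv = {}   # neighbor -> (|S_ji|, average of S_ji)

    def knowledge(self):
        """K_i: local data combined with everything received."""
        n, v = self.local_count, self.local_avg
        for c, u in self.recv.values():
            n, v = _combine(n, v, c, u)
        return n, v

    def agreement(self, j):
        """A_ij: what peers i and j have told each other."""
        cs, vs = self.sent.get(j, (0, None))
        cr, vr = self.recv.get(j, (0, None))
        return _combine(cs, vs, cr, vr)

    def held(self, j):
        """H_ij = K_i - A_ij: known to i but not yet shared with j."""
        nk, vk = self.knowledge()
        na, va = self.agreement(j)
        n = nk - na
        if n == 0:
            return 0, None
        if na == 0:
            return nk, vk
        return n, [(nk * a - na * b) / n for a, b in zip(vk, va)]
```

Because only counts and averages travel between neighbors, each message is O(d) regardless of |Mi|, which is what makes the monitoring cheap.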
In this work we assume that the communication takes place over an overlay tree. This
ensures that vectors sent to a node are never sent back to it, which avoids double counting.
Interested readers are urged to see [38] and [8] for a discussion of how this assumption can be
accommodated or, if desired, removed.
At each peer, we need to check if the local vector −→Ki lies in a convex region. To achieve this,
we need to split the domain of the monitoring function into non-overlapping convex regions. Since
the monitoring function is the L2 norm in Rd, checking if the norm is less than ǫ is equivalent to
checking if the vector lies inside a sphere, which is a convex region by definition. However, the outside
of the sphere is not convex. So we make it convex by drawing tangents to the sphere at arbitrary
points. Each of the resulting half-spaces is again convex, and so the general rule can be applied there.
However, the area between the sphere and these half-spaces is not convex. These small
uncovered spaces are known as the tie regions. Denoting the region inside the sphere as Rin and
each of the half-spaces as {Rh1, Rh2, . . .}, the entire set of convex regions covering the space
is Cω = {Rin, Rh1, Rh2, . . .}. Fig. 2 shows the convex regions in Rd, the tangent lines and the
tie region. Given these convex regions and the local vectors, we now state a theorem based on
which any peer can stop sending messages and output the correct result.
¹We use the two terms interchangeably here.
Theorem 5.1 ([38]): Let −→EG, −→Ki, −−→Ai,j, and −−→Hi,j be as defined in the previous section. Let R be
any region in Cω. If at time t no messages traverse the network, and for each Pi, −→Ki ∈ R and
for every Pj ∈ Γi, −−→Ai,j ∈ R and either −−→Hi,j ∈ R or Hi,j = ∅, then −→EG ∈ R.
Proof: For the proof, the reader is referred to [38].
Using this theorem, each node can check if ∥−→Ki∥ < ǫ. If the result holds for every node, then
their convex combination −→EG will also be in R. If there is any disagreement, it will be between
two neighbors; in that case, messages will be exchanged and the nodes will converge to the correct
result. In either case, eventual global correctness is guaranteed.
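To make the region test concrete, the following Python sketch selects the active region of Cω for a vector and applies the stopping condition of Theorem 5.1 at one peer. It assumes each tangent half-space is described by a unit normal u, giving the region {y : y · u ≥ ǫ}; the paper leaves the choice of tangent points arbitrary, and all names here are illustrative.

```python
import math

def active_region(x, eps, tangent_units):
    """Return the region of C_omega containing vector x.

    'inside' if ||x|| < eps; the index of a covering half-space
    {y : y . u >= eps} otherwise; None if x lies in a tie region.
    """
    norm = math.sqrt(sum(a * a for a in x))
    if norm < eps:
        return 'inside'
    for h, u in enumerate(tangent_units):
        if sum(a * b for a, b in zip(x, u)) >= eps:
            return h
    return None  # tie region: the peer must send all of its data

def may_stay_silent(K, eps, tangent_units, neighbors):
    """Theorem 5.1 check at one peer.

    neighbors is a list of (A_ij, H_ij) pairs, with H_ij = None when
    the held set is empty. If K_i, every A_ij, and every non-empty
    H_ij fall in the same region R, no message needs to be sent.
    """
    R = active_region(K, eps, tangent_units)
    if R is None:
        return False
    for A, H in neighbors:
        if active_region(A, eps, tangent_units) != R:
            return False
        if H is not None and active_region(H, eps, tangent_units) != R:
            return False
    return True
```

With ǫ = 1 and four axis-aligned tangents in 2-d, a vector like (0.8, 0.8) has norm above ǫ but satisfies no half-space, so it falls in a tie region and forces the peer to communicate.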
Fig. 2. (A) The area inside an ǫ circle. (B) A random vector. (C) A tangent defining a half-space. (D) The areas between the circle and the union of half-spaces are the tie areas.
E. Algorithm
Both the mean monitoring and the eigenvector monitoring algorithms rely on the result of
Theorem 5.1 to output the correct result. For the eigenvector monitoring, the inputs to each node
are the eigenvector −→V, the eigenvalue Θ and the error threshold ǫ1. From Section V-C, the input
for this monitoring instance is:
• I1.−→Ei = ([Mi^T Mi] · −→V)/|Mi| − Θ−→V
• I1.|Ei| = |Mi|
Similarly, for the mean monitoring algorithm, the inputs are the mean −→µ ∈ Rd and the error
threshold ǫ2. In this case, each node subtracts the mean −→µ from its local average input vector.
For the problem instance denoted by I2, the inputs are:
• I2.−→Ei = (Σ_{j=1}^{|Mi|} −→xi,j)/|Mi| − −→µ
• I2.|Ei| = |Mi|
Algorithm 1 presents the pseudo-code of the monitoring algorithm, while Alg. 2 presents the
pseudo-code of the convergecast/broadcast process. The inputs to the monitoring algorithm are
Mi, −→Ei (depending on how it is defined), Γi, ǫ1 or ǫ2, and Cω. For each problem instance I1 and I2, each
node initializes its local vectors −→Ki, −−→Ai,j and −−→Hi,j. The algorithm is entirely event-driven. Events
can be one of the following: (i) a change in the local data Mi, (ii) receipt of a message, and (iii)
a change in Γi. In any of these cases, the node checks if the condition of the theorem holds. Based
on the value of its knowledge −→Ki, the node selects the active region R ∈ Cω such that −→Ki ∈ R.
If no such region exists, R = ∅. If R = ∅, then −→Ki lies in the tie region and hence Pi has to send
all its data. On the other hand, if R ≠ ∅, the node can rely on the result of Theorem 5.1 to decide
whether to send a message. If, for all Pj ∈ Γi, both −−→Ai,j ∈ R and −−→Hi,j ∈ R, Pi does nothing;
otherwise it needs to set −→Si,j and |Si,j|. By the conditions of the theorem, these are
the only two cases in which a node needs to send a message. Whenever a node receives a message (−→S
and |S|), it sets −→Sj,i ← −→S and |Sj,i| ← |S|. This may trigger another round of communication,
since its −→Ki can now change.
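For concreteness, the two input vectors can be computed from a local data matrix as follows. This is a pure-Python sketch with illustrative names; rows of M are observations.

```python
def eigen_input(M, V, theta):
    """I1.E_i = (M^T M V)/|M| - theta*V for local data matrix M
    (rows are observations), current eigenvector V and eigenvalue theta."""
    n, d = len(M), len(V)
    MV = [sum(row[k] * V[k] for k in range(d)) for row in M]       # M V
    MtMV = [sum(M[r][k] * MV[r] for r in range(n)) for k in range(d)]
    return [MtMV[k] / n - theta * V[k] for k in range(d)]

def mean_input(M, mu):
    """I2.E_i = (average of the rows of M) - mu."""
    n, d = len(M), len(mu)
    return [sum(row[k] for row in M) / n - mu[k] for k in range(d)]
```

If −→V and Θ are exactly the first eigenvector and eigenvalue of (M^T M)/|M|, then I1.−→Ei is the zero vector; the norm of the knowledge vector thus measures how stale the current model is.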
To prevent a message explosion, our event-based system employs a "leaky bucket" mechanism
which ensures that no two messages are sent within a period shorter than a constant L. Note
that this mechanism does not enforce synchronization or affect correctness; at most it may
delay convergence. This technique has also been used elsewhere [38][7].
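The leaky bucket can be sketched as a tiny rate limiter (illustrative code, not the DDMT implementation):

```python
class LeakyBucket:
    """Rate-limits outgoing messages: at most one send per L time units.

    A required message is delayed, never suppressed, so correctness is
    unaffected; only convergence time can grow.
    """
    def __init__(self, L):
        self.L = L
        self.last_sent = None

    def try_send(self, now, send_fn):
        if self.last_sent is None or now - self.last_sent >= self.L:
            send_fn()
            self.last_sent = now
            return True
        return False  # caller re-checks once the interval expires
```

A caller that is refused simply waits L time units and re-evaluates the theorem condition, matching the "wait and check again" branch of Algorithm 1.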
Algorithm 1: Monitoring eigenvector/eigenvalues.
Input: ǫ1, Cω, −→Ei, Γi and L
Output: 0 if ∥−→Ki∥ < ǫ, 1 otherwise
Initialization: Initialize vectors;
if MessageRecvdFrom(Pj, −→S, |S|) then
  −→Sj,i ← −→S; |Sj,i| ← |S|; Update vectors;
if Mi, Γi or Ki changes then
  forall Pj ∈ Γi do
    if LastMsgSent > L time units ago then
      if R = ∅ then
        −→Si,j ← (|Ki|−→Ki − |Sj,i|−→Sj,i)/(|Ki| − |Sj,i|);
        |Si,j| ← |Ki| − |Sj,i|;
      if −−→Ai,j ∉ R or −−→Hi,j ∉ R then
        Set −→Si,j and |Si,j| such that −−→Ai,j ∈ R and −−→Hi,j ∈ R;
      MessageSentTo(Pj, −→Si,j, |Si,j|);
      LastMsgSent ← CurrentTime; Update all vectors;
    else Wait L time units and then check again;
The monitoring algorithm raises a flag whenever either ∥I1.−→Ki∥ > ǫ1 or ∥I2.−→Ki∥ > ǫ2. Once
the flag is set to 1, the nodes engage in a convergecast-broadcast process to accumulate data up
to the root of the tree, recompute the model, and disseminate it in the network.
For the mean monitoring algorithm, in the convergecast phase, whenever a flag is raised each
leaf node in the tree forwards its local mean up to the root of the tree. In this phase, each node
maintains a user-selected alert mitigation constant τ, which requires an alert to be stable for a
period of time τ before the node sends its data. Experimental results show that this is crucial in
preventing a false alarm from propagating, thereby saving resources. To implement this,
whenever the monitoring algorithm raises a flag, the node notes the time and sets a timer to τ
time units. Now, if the timer expires, or a data message is received from one of its neighbors, Pi
first checks if there is an existing alert. If the alert was recorded τ or more time units ago, the node
does one of the following. If it has received messages from all its neighbors, it recomputes the
new mean, sends it to all its neighbors, and restarts its monitoring algorithm with the new mean.
On the other hand, if it has received the mean from all but one of its neighbors, it combines
its own data with all of its neighbors' data and then sends the result to the neighbor from which it
has not received any data. In all other cases, a node does nothing.
For the eigenvector monitoring, in place of sending a local mean vector, each node forwards its
covariance matrix Ci. Any intermediate node accumulates the covariance matrices of its children,
adds its local matrix, and sends the sum to its parent up the tree. The root computes the new eigenvectors
and eigenvalues. The first eigenstate is passed to the monitoring algorithm.
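The convergecast arithmetic at an intermediate node, and the eigen-analysis at the root, might look as follows. This is an illustrative pure-Python sketch; the root's eigen-decomposition is shown here as power iteration, which recovers the first eigenstate that the monitoring algorithm consumes.

```python
def add_matrices(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def convergecast_sum(children_sums, local_cov):
    """What an intermediate node forwards up the tree: its local
    covariance matrix plus the matrices received from its children."""
    total = local_cov
    for S in children_sums:
        total = add_matrices(total, S)
    return total

def principal_eigenstate(C, iters=200):
    """Power iteration at the root: first eigenvector/eigenvalue of C."""
    d = len(C)
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(C[r][k] * v[k] for k in range(d)) for r in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    theta = sum(v[r] * sum(C[r][k] * v[k] for k in range(d))
                for r in range(d))
    return v, theta
```

The root would then broadcast the returned (−→V, Θ) pair back down the tree as a model message.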
F. Correctness and complexity analysis
The eigen monitoring algorithm is eventually correct.
Theorem 5.2: The eigen monitoring algorithm is eventually correct.
Proof: For the eigen monitoring algorithm, the computation continues at each node
unless one of the following happens:
• for every node, −→Ki = −→EG;
• for every Pi and every neighbor Pj, −→Ki, −−→Ai,j, and −−→Hi,j are in the same convex region R ∈ Cω.
In the former case, every node obviously computes the correct output, since the knowledge of
each node becomes equal to the global knowledge. In the latter case, Theorem 5.1 dictates that
−→EG ∈ R. Note that, by construction, the output of the thresholding function (in this case ‖−→x‖ > ǫ)
is invariant inside any R ∈ Cω. In other words, the binary predicates ∥−→EG∥ < ǫ and ∥−→Ki∥ < ǫ
have the same output inside R. Therefore, in either case, the eigen monitoring algorithm
is correct.
Moreover, since C = (1/Σ_{j=1}^n |Mj|) Σ_{j=1}^n Cj (see Eqn. 1) and
−→µ = (Σ_{i=1}^n Σ_{j=1}^{|Mi|} −→xi,j)/(Σ_{i=1}^n |Mi|), the models built
are also identical to those of a centralized algorithm having access to all the data.
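This exactness is easy to check numerically: summing per-node second-moment matrices and dividing by the total count gives exactly the statistic a centralized algorithm would compute on the pooled data. The small sketch below uses made-up local datasets at three hypothetical nodes.

```python
def second_moment(M):
    """Per-node statistic C_i = M^T M (rows of M are observations)."""
    d = len(M[0])
    return [[sum(row[a] * row[b] for row in M) for b in range(d)]
            for a in range(d)]

# Hypothetical local datasets at three nodes.
parts = [[[1.0, 2.0], [0.0, 1.0]], [[2.0, 0.0]], [[1.0, 1.0], [3.0, 1.0]]]
n_total = sum(len(M) for M in parts)

# Distributed: sum the local C_i, then divide by the total count.
C_dist = [[sum(second_moment(M)[a][b] for M in parts) / n_total
           for b in range(2)] for a in range(2)]

# Centralized: pool all rows first, then compute the same statistic.
pooled = [row for M in parts for row in M]
C_cent = [[v / n_total for v in row] for row in second_moment(pooled)]

assert C_dist == C_cent  # the distributed model is exact, not approximate
```

The same additivity argument applies to the global mean, which is why the convergecast needs only sums, not raw data.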
Determining the communication complexity of local algorithms in dynamic environments is
still an open research issue. Researchers have proposed definitions of locality [7][38]. Note that for
an exact algorithm such as the eigen monitoring algorithm, the worst-case communication complexity
is O(size of network). This can happen, for example, when each node has a vector in a
different convex region and the global average is in yet another region. However, as shown
in this paper and also by several authors [38][7], there are several problem instances for which
the resource consumption becomes independent of the size of the network. Interested readers
Algorithm 2: P2P Eigen-monitoring Algorithm.
Input: ǫ1, ǫ2, Cω, Mi, Γi, L, τ
Output: (i) −→V, Θ such that ∥C · −→V − Θ · −→V∥ < ǫ1 and ∥−→V∥ = 1, and (ii) −→µ such that
∥(Σ_{i=1}^n Σ_{j=1}^{|Mi|} −→xi,j)/(Σ_{i=1}^n |Mi|) − −→µ∥ < ǫ2
Initialization:
  Initialize vectors;
MsgType = MessageRecvdFrom(Pj);
if MsgType = Monitoring Msg then
  Pass message to the appropriate monitoring algorithm;
if MsgType = New Model Msg then
  Update −→V, Θ, −→µ;
  Forward the new model to all neighbors; Datasent = false;
  Restart the monitoring algorithm with the new models;
if MsgType = Dataset Msg then
  if received from all but one neighbor then
    flag = OutputMonitoringAlgorithm();
    if Datasent = false and flag = 1 then
      if DataAlert stable for τ time then
        D1 = Ci + received covariance;
        D2 = (Σ_{j=1}^{|Mi|} −→xi,j)/|Mi| + received mean;
        Datasent = true; send D1, D2 to the remaining neighbor;
      else DataAlert = CurrentTime;
  if received from all neighbors then
    D1 = Ci + received covariance;
    D2 = (Σ_{j=1}^{|Mi|} −→xi,j)/|Mi| + received mean;
    (−→V, Θ) = EigAnalysis(D1) where ∥−→V∥ = 1; −→µ = Mean(D2);
    Forward the new −→V, Θ, −→µ to all neighbors;
    Datasent = false; Restart the monitoring algorithm with the new models;
if Mi, Γi or −→Ki changes then
  Run the monitoring algorithm; flag = OutputMonitoringAlgorithm();
  if flag = 1 and Pi = IsLeaf() then
    Execute the same steps as for MsgType = Dataset Msg;
are referred to [6] for a detailed discussion of the communication complexity and locality of such
algorithms.
VI. EXPERIMENTAL RESULTS
In this section we present experimental results for the distributed eigen monitoring
algorithm. Before doing that, we describe centralized experiments showing how the fundamental
plane changes with variations in local galactic density. Then we describe distributed experiments
showing the performance of the eigen monitoring algorithm in a distributed streaming scenario
where the same data is streamed at multiple nodes. Our goal is to demonstrate that, using
our distributed eigen monitoring algorithm to compute the principal components and monitor
them in a streaming scenario, we can find very similar results to those obtained by applying
centralized PCA. As an interesting aside, even though our goal was not to make a new
discovery in astronomy, the results are astronomically noteworthy. We argue that our distributed
algorithm could have found very similar results to the centralized approach at a fraction of
the communication cost. We also want to emphasize that this distributed eigen monitoring
algorithm can be applied to a number of change-detection applications in high-throughput
streaming scenarios (such as the LSST) for important astronomical discoveries of many types.
The importance and novelty of this algorithm compared to existing distributed PCA algorithms
is that it is an exact algorithm that deterministically converges to the correct result.
A. Fundamental Plane results
As noted in Section IV-A, we divide the entire dataset into 30 bins. The bins are arranged from
low to high density. In this section we present the results of our fundamental plane experiments
for those 30 bins. We have used only the elliptical galaxies from the SDSS and 2MASS datasets
in our experiments.
Fig. 3. Variance captured by PCs 1 and 2 w.r.t. log of mean density of each bin. Bin 1 has the lowest mean density and Bin 30 the highest.
Fig. 4. Plot of the variation of θ and φ with bin number. The bins are numbered in increasing order of density. (a) Variation of θ and φ independently w.r.t. log of bin density for 30 bins. (b) Joint variation of θ and φ w.r.t. log of bin density for 30 bins.
Figure 3 provides the most significant scientific result. It demonstrates the dependence of the
variance captured by the first 2 PCs on the log of bin density (the x-axis shows the mean
density of each bin in log scale). As seen, the variance increases monotonically from almost
95% to 98% with increasing galactic bin density. This clearly demonstrates a new astrophysical
effect, beyond that traditionally reported in the astronomical literature. It results from the
application of distributed data mining (DDM) to a significantly larger (by 1000 times) set of
data. More such remarkable discoveries can be anticipated when DDM algorithms of the type
reported here are applied to massive scientific (and non-scientific) data streams of the future.
To analyze more deeply the nature of the variation of the first two PCs with respect to
increasing galactic density, we plot the direction of the normal to the plane defined by the first
2 PCs, i.e. pc1 and pc2. Since each of these PCs is a vector in 3-d, so is the normal to the
plane. The normal vector is represented by its two directional angles: the spherical polar angles
θ and φ. Figure 4 shows a plot of θ and φ for the 30 bins. Figure 4(a) shows the variation of θ and
φ independently with the log of mean galactic density. Figure 4(b) shows the joint variation of both with
the log of mean density. The systematic trend in the change of direction of the normal vector seen
in Figure 4(b) is a new astronomy result. This represents exactly the type of change detection
from eigen monitoring that will need to be applied to massive scientific data streams, including
large astronomy applications (LSST) and large-scale geo-distributed sensor networks, in order
to facilitate knowledge discovery from these petascale data collections.
B. Results of the distributed PCA algorithm
The distributed PCA implementation makes use of a Java-based simulated environment for
simulating thousands of peers on a single computer. For generating realistic topologies, the
simulator uses BRITE [12], a universal topology generator from Boston University. In
our simulations we used topologies generated according to the Barabasi-Albert (BA) model. On
top of the network generated by BRITE, we overlaid a spanning tree. We have experimented
with network sizes ranging from 50 to 1000 nodes. We report all times in terms of
simulator ticks, since wall-clock time is meaningless when simulating thousands of nodes on a single
PC. We set up the simulator such that an edge delay of x msec in the BRITE topology corresponds
to x simulator ticks. We assume that the time required for local processing is trivial
compared to the overall network latency; therefore, convergence time for the distributed PCA
algorithm is reported in terms of the average edge delay.
We have divided the data of the centralized experiments into 5 bins (instead of 30), sorted
by galactic density. Each bin represents the data distribution at a certain time in the streaming
scenario, and the distribution changes every 200,000 simulation ticks, which we call an epoch.
This implies that every 200,000 simulation ticks we supply the nodes with a new bin of data.
The whole experiment therefore executes for 200,000 × 5 = 1,000,000 simulator ticks. Furthermore,
within each epoch, we stream the data at a rate of 10% of the bin size every 10,000 simulation
ticks, an interval which we call a sub-epoch. Thus, starting from the beginning of any epoch, the
whole data is changed by 100,000 ticks and no data is changed during the later 100,000 ticks of
that epoch. In other words, all 10,000 points are received simultaneously by all nodes at the first
tick of each sub-epoch (except during the last 100,000 ticks of each epoch).
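The streaming schedule can be summarized as follows; this is an illustrative sketch with the constants taken from the text above.

```python
EPOCH = 200_000      # ticks per data distribution (one bin)
SUB_EPOCH = 10_000   # ticks between stream updates
N_BINS = 5           # bins used in the distributed experiments

def data_change_ticks():
    """Ticks at which nodes receive new data: ten 10% updates in the
    first half of each epoch, then a quiet second half."""
    ticks = []
    for b in range(N_BINS):
        start = b * EPOCH
        ticks.extend(start + k * SUB_EPOCH for k in range(10))
    return ticks

changes = data_change_ticks()
assert len(changes) == 50                # 10 updates x 5 epochs
assert changes[:3] == [0, 10_000, 20_000]
assert changes[10] == 200_000            # a new bin starts each epoch
assert N_BINS * EPOCH == 1_000_000       # total length of the run
```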
The two quantities measured in our experiments are the quality of the result and the cost of
the algorithm. For the eigen monitoring algorithm, quality can be measured as (1) the number of
peers which report an agreement between the model at each node and the data, i.e. ∥I1.−→Ki∥ < ǫ1
or ∥I2.−→Ki∥ < ǫ2 for each time instance, and (2) the average L2 norm distance between the
principal eigenvector and the computed eigenvector in the distributed scenario over all
the bins. For cost, we measure the number of monitoring messages and the number of computation
messages separately.
We have used the following default values for the algorithm: size of leaky bucket L = 500,
error thresholds ǫ1 = 2.0 and ǫ2 = 0.02, alert mitigation constant τ = 500, and number of peers = 50.
Fig. 5. Variation of ∥I1.−→Ki∥ (left) and ∥I2.−→Ki∥ (right) across all the peers vs. time. (a) ∥I1.−→Ki∥ vs. time for eigen monitoring. (b) ∥I2.−→Ki∥ vs. time for mean monitoring.
Figure 5 shows the variation of the local knowledge of each peer throughout the execution of
the experiment, along with the thresholds (red dotted lines). The left figure shows ∥I1.−→Ki∥ (eigen
monitoring) while the right figure shows ∥I2.−→Ki∥ (mean monitoring). In both figures, the norm of
the knowledge vectors exceeds the respective threshold at the beginning of each epoch (200,000,
400,000, 600,000, and 800,000 ticks), because the data corresponds to a new bin. The peers then
jointly infer this disagreement using the monitoring algorithm, and a convergecast/broadcast
round is invoked which rebuilds and distributes a new set of eigenvectors and eigenvalues. As
a result, the norm of the local knowledge at each peer drops below the corresponding threshold
and only the monitoring algorithm operates for the rest of the epoch.
The accuracy and convergence of the distributed eigen monitoring algorithm are shown in Figure
6. The left figure shows the accuracy of eigen monitoring while the right one shows the same
for mean monitoring. As shown, accuracy is low for the first 100,000 ticks of each epoch, since
the data is changing during that time. Accuracy increases to 100% during the later 100,000 ticks
since the model is then in accordance with the data. This pattern is repeated for all the epochs. The
convergence rate of the algorithm is shown in Figure 7 by zooming in on the second epoch.
The data changes every 10,000 ticks between 200,000 and 300,000 ticks, which is why the
accuracy is low during this period. The algorithm converges to 100% accuracy by 330,000
ticks, i.e. within 30,000 ticks after the data stops changing. The average edge delay is 1000
simulator ticks; hence the algorithm converges in approximately 30 times the average edge delay.
Figure 8 shows the messages exchanged per peer throughout the experiment. The monitoring
messages, shown in the left figure, increase whenever the data changes but decrease once the
algorithm converges. The number of messages exchanged during the stationary period is very
low compared to an algorithm which broadcasts all the information every sub-epoch; the message
rate of the latter is 2 per sub-epoch (considering two neighbors per peer on average).
The data messages are shown as a cumulative plot in the right figure. As shown, there is a high
number of data messages at each epoch change, and the number decreases during the later 100,000
ticks of each epoch. In all experiments, new models are built 2 to 3 times per epoch.
Fig. 6. Percentage of peers agreeing to ∥I1.−→Ki∥ < ǫ1 (left figure, accuracy vs. time for eigen monitoring) and ∥I2.−→Ki∥ < ǫ2 (right figure, accuracy vs. time for mean monitoring). As clearly shown, the algorithm achieves high accuracy.
The last set of experiments shows that the quality of the models built by the algorithm and
its communication complexity are independent of the number of nodes in the network, thereby
guaranteeing high scalability. We first compare the quality of the models built by the distributed
eigen monitoring algorithm to that of a centralized algorithm having access to all the data.
Since we compute the principal eigenvector for each bin separately, we plot the average
L2 norm distance between the centralized and distributed eigenvectors for every experiment.
The experiments have been repeated for 10 independent trials. Figure 9 shows the quality of
Fig. 7. Convergence of the monitoring algorithm to 100% accuracy.
Fig. 8. Messages exchanged by the eigen monitoring algorithm per peer throughout the experiment. (a) Monitoring messages vs. time. (b) Cumulative data messages vs. time.
the computed models for different network sizes. As shown in the figure, the proposed eigen
monitoring algorithm produces results which are highly accurate compared to their centralized
counterpart. Moreover, the quality does not degrade with increasing network size. Because our
algorithm is provably correct, the number of nodes has no influence on the quality of the result.
Figures 10 and 11 show the number of messages exchanged per node as the number of
nodes is increased from 50 to 1000. In this context, normalized messages per node means the
number of messages sent by a node per sub-epoch (i.e. per data change). Since the length
of each sub-epoch is 10,000 ticks and L = 500, the maximal rate at which any node can send
messages in our distributed algorithm is 10,000/500 × 2 = 40, assuming two neighbors per node
on average. For an algorithm which uses broadcast as its communication model, the normalized
message count is 2 per sub-epoch, again assuming two neighbors per node on average. In all our
experiments, the normalized message count per peer is close to 0.3, well below these maximal
rates. Thus the proposed algorithm is highly efficient with respect to
communication. Also, as shown, the monitoring messages remain constant even as the number of
nodes increases. This demonstrates the excellent scalability of the algorithm.
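These rates follow directly from the constants above; the small sanity check below reproduces the arithmetic, with 0.3 being the empirical value reported in the text.

```python
SUB_EPOCH = 10_000     # ticks between data changes
L = 500                # leaky bucket: minimum gap between messages
AVG_NEIGHBORS = 2      # average degree assumed in the text

# Upper bound for our algorithm: one message per L ticks per neighbor.
max_rate = SUB_EPOCH // L * AVG_NEIGHBORS
assert max_rate == 40

# A broadcast-based algorithm sends once per neighbor per sub-epoch.
broadcast_rate = 1 * AVG_NEIGHBORS
assert broadcast_rate == 2

# The observed normalized rate sits well below both bounds.
observed = 0.3
assert observed < broadcast_rate < max_rate
```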
Finally, we also plot the number of times data is collected per epoch. In most cases, the
number of such convergecast rounds is 3 per epoch. Note that this can be reduced further by
using a larger alert mitigation constant τ, or larger error thresholds ǫ1 or ǫ2.
Fig. 9. L2 norm distance between distributed and centralized eigenvectors vs. number of nodes. The distance remains the same, showing good accuracy. Plotted are the average and standard deviation over multiple trials.
Fig. 10. Normalized monitoring messages vs. number of nodes. The number of messages remains constant, showing excellent scalability.
VII. PADMINI–A PEER-TO-PEER ASTRONOMY DATA MINING SYSTEM
PADMINI is a web-based peer-to-peer data mining system that aims to be a computation
tool for researchers and users in the fields of astronomy and data mining. There
are several challenges to centralizing the massive astronomy catalogs (some of which have been
elucidated in the previous section) and running traditional data mining algorithms on them. To cope with
this data avalanche, PADMINI is powered by a back-end peer-to-peer computation network to provide
Fig. 11. Number of convergecast rounds per epoch vs. number of nodes. In most cases the number of convergecast rounds is less than 3 per epoch.
the required scalability. The back-end computation network supports two distributed computation
frameworks, namely the Distributed Data Mining Toolkit (DDMT) [17] and Hadoop [24]. The web-based
PADMINI system is available online at http://padmini.cs.umbc.edu/. In the
next few sections we first describe the different components of the system and then describe the
implementation details.
A. System components
The system architecture is shown in Figure 1. It consists of a web server, a DDM server, a server
database, a jobs database and the back-end P2P network. Each of the components is discussed
in detail next.
1) Web server: The web server hosts the main website and is the primary interface for
submitting jobs and retrieving their results. Each new user signs up for an
account on the website and sets up a job to be run on the system. Every user has a dedicated
profile page where the user can keep track of the jobs he or she has submitted. The current status of
the jobs and a projected time for their completion are also displayed on the same page.
Each job submitted by the user triggers a distributed algorithm to run on the back-end P2P
network. The results of the algorithm are pushed back to the web server. The user can then
download a copy of the results of their jobs. The web service methods exposed by the DDM
server are used by the web server to start a job and receive results. The web server is thus the
consumer of the web service methods exposed by the DDM server.
2) Server database:The server database primarily deals with user and identity management.
The database stores the information related to the registered users of the system and the privileges
they have. The job activity details of a user are also stored in this database. These include the
inputs submitted by the user, the algorithm selected, the output of the job, etc. The list of
supported astronomy data catalogs and the attributes that a user can use as inputs are also
stored in this database.
3) DDM server: The Distributed Data Mining (DDM) server is an intermediate tier between
the web server and the back-end peer-to-peer computation network. The multiple job requests
coming in from the web server are directed to the DDM server and stored in a job queue. The
jobs are then submitted serially to the back-end computation network for completion. The DDM
server exposes a set of methods that can be used to set up jobs and to push the results of
completed jobs to the web server. The web service methods encourage openness; hence,
a new system can easily be built around the available back-end P2P computation network.
4) Jobs database: The jobs database persists the book-keeping information related to the
jobs. This includes the list of all the jobs submitted by users, including the ones not
yet submitted to the computation network. The status of the running and waiting jobs and
the results of the recently completed jobs are stored here. The database also stores information
related to the back-end P2P computation network, including the total number of active nodes,
failed nodes, etc.
5) P2P network: The peer-to-peer network forms the backbone of the back-end computation
framework. All the peers in this network are configured to support two computation frameworks,
namely the Distributed Data Mining Toolkit (DDMT) and Hadoop. The type of jobs the user can
submit is restricted by the algorithms supported by the system. Some algorithms are implemented
using the DDMT while others are built on top of the Hadoop framework. The DDM server picks
up a job from the queue and assigns it to be executed on the appropriate framework. The
framework choice depends on the type of the job and hence is implicitly set by the user.
B. Implementation details
1) Language: The website is developed using HTML, Javascript and JSPs. DDMT is implemented
in Java and is based on the Java Agent Development Framework (JADE). The important
methods, such as starting a job, stopping it, and providing input, have been exposed as web service
methods. This enables future systems to be built around the existing computation network.
Hadoop provides an extensive Java API using which highly scalable MapReduce algorithms
can be created. For running either DDMT or Hadoop, Java support is the only feature expected
from a peer. Thus, the P2P computation network can easily be expanded.
2) Databases: MySQL is used as the database for the web server database as well as for the
jobs database. Hibernate is used for object-relational mapping at the web server database end.
Classes corresponding to the database tables ensure that operations made on the class objects
are reflected and persisted in the database. Such a system not only saves development time, but
also guarantees a robust database system.
3) Web service: Axis2 is used as the core engine for web services. Axis2 is built on a new
architecture designed to be much more flexible, efficient, and configurable. With the
new object model defined by Axis2, it is easier to handle SOAP messages. All web service
requests are directed to the DDM server, which then calls the corresponding methods
and starts the requested job. Axis2 also has excellent support for sending binary data or files
using SOAP messages, which eases moving the inputs and outputs between the web server and
the DDM server.
4) User interface: A user needs to sign up on the home page to get an account and start
submitting jobs. On signing up, each user gets a personal profile page. Each algorithm supported
by the website has a dedicated page on which the user can create and submit a specific job. The
user can then track the status of the submitted jobs and also retrieve the results of the most recently
completed jobs on the profile page. The Google Maps interface on the PADMINI website aids an
astronomer in specifying an area of the sky intuitively and effectively. Controls to select the
astronomy catalogs and the supported attributes are also provided. Thus, a job can be specified
with only a few clicks, and the user does not need to wait for the results.
VIII. CONCLUSION
This paper presents a local and completely asynchronous algorithm for monitoring the eigenstates
of distributed and streaming data. The algorithm is efficient and exact in the sense that,
once computation terminates, each node in the network has computed the globally correct model. We
have taken a relatively well understood problem in astronomy, that of galactic fundamental
plane computation, and shown how our distributed algorithm can be used to arrive at the same
results without any data centralization. We argue that this might become extremely useful when
petabyte-scale data repositories such as the LSST project start to generate high-throughput data
February 9, 2011 DRAFT
32
streams which need to be co-analyzed with other data repositories located at diverse geographic
location. For such large scale tasks, distributing the dataand running the algorithm on a number
of nodes might prove to be cost effective. Our algorithm is a first step to achieving this goal.
Experiments on current SDSS and 2MASS dataset show that the proposed algorithm is efficient,
accurate, and highly scalable.
ACKNOWLEDGMENTS
This research is supported by the NASA Grant NNX07AV70G and the AFOSR MURI Grant
2008-11. K. Das completed the research for this paper at University of Maryland, Baltimore
County. C. Giannella completed the research for this paper while being an Assistant Professor
of Computer Science at Loyola College in Maryland and New Mexico State University. The
paper was approved for Public Release, unlimited distribution, by The MITRE Corporation: 10-0814;
© 2010 The MITRE Corporation, all rights reserved. The authors would also like to thank
Sugandha Arora and Wesley Griffin for helping with the experimental setup.
REFERENCES
[1] H. Ang, V. Gopalkrishnan, S. Hoi, and W. Ng. Cascade RSVM in Peer-to-Peer Networks. In Proceedings of PKDD'08, pages 55–70, 2008.
[2] The AUTON Project. http://www.autonlab.org/autonweb/showProject/3/.
[3] W. Balke, W. Nejdl, W. Siberski, and U. Thaden. Progressive Distributed Top-K Retrieval in Peer-to-Peer Networks. In Proceedings of ICDE'05, pages 174–185, 2005.
[4] N. M. Ball and R. J. Brunner. Data Mining and Machine Learning in Astronomy. arXiv:0906.2173v1, 2009.
[5] M. Bawa, A. Gionis, H. Garcia-Molina, and R. Motwani. The Price of Validity in Dynamic Networks. Journal of Computer and System Sciences, 73(3):245–264, 2007.
[6] K. Bhaduri. Efficient Local Algorithms for Distributed Data Mining in Large Scale Peer to Peer Environments: A Deterministic Approach. PhD thesis, University of Maryland, Baltimore County, March 2008.
[7] K. Bhaduri and H. Kargupta. A Scalable Local Algorithm for Distributed Multivariate Regression. Statistical Analysis and Data Mining, 1(3):177–194, November 2008.
[8] K. Bhaduri, R. Wolff, C. Giannella, and H. Kargupta. Distributed Decision Tree Induction in Peer-to-Peer Systems. Statistical Analysis and Data Mining, 1(2):85–103, June 2008.
[9] K. Borne. Scientific Data Mining in Astronomy. In Next Generation of Data Mining, chapter 5, pages 91–114. CRC Press, 2009.
[10] S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah. Gossip Algorithms: Design, Analysis, and Applications. In IEEE Infocom, volume 3, pages 1653–1664, 2005.
[11] J. Branch, B. Szymanski, C. Giannella, R. Wolff, and H. Kargupta. In-Network Outlier Detection in Wireless Sensor Networks. In Proceedings of ICDCS'06, page 51, 2006.
[12] Boston University Representative Internet Topology Generator. http://www.cs.bu.edu/brite/.
[13] The ClassX Project: Classifying the High-Energy Universe. http://heasarc.gsfc.nasa.gov/classx/.
[14] DAME: DAta Mining and Exploration. http://voneural.na.infn.it/.
[15] S. Datta, C. Giannella, and H. Kargupta. Approximate Distributed K-Means Clustering over a Peer-to-Peer Network. IEEE Transactions on Knowledge and Data Engineering, 21(10):1372–1388, 2009.
[16] Digital Dig - Data Mining in Astronomy. http://www.astrosociety.org/pubs/ezine/datamining.html.
[17] The Distributed Data Mining Toolkit. http://www.umbc.edu/ddm/wiki/software/DDMT.
[18] CGAL: Delaunay Triangulation. http://www.cgal.org/.
[19] Data Mining Grid. http://www.datamininggrid.org/.
[20] H. Dutta, C. Giannella, K. Borne, and H. Kargupta. Distributed Top-K Outlier Detection from Astronomy Catalogs using the DEMAC System. In Proceedings of SDM'07, 2007.
[21] Elliptical Galaxies: Merger Simulations and the Fundamental Plane. http://irs.ub.rug.nl/ppn/244277443.
[22] Framework for Mining and Analysis of Space Science Data. http://www.itsc.uah.edu/f-mass/.
[23] GRIST: Grid Data Mining for Astronomy. http://grist.caltech.edu.
[24] Hadoop Home Page. http://hadoop.apache.org/.
[25] T. Hinke and J. Novotny. Data Mining on NASA's Information Power Grid. In Proceedings of HPDC'00, page 292, 2000.
[26] L. Huang, X. Nguyen, M. Garofalakis, M. Jordan, A. Joseph, and N. Taft. Distributed PCA and Network Anomaly Detection. Technical Report UCB/EECS-2006-99, EECS Department, University of California, Berkeley, 2006.
[27] International Virtual Observatory. http://www.ivoa.net.
[28] H. Kargupta and P. Chan, editors. Advances in Distributed and Parallel Knowledge Discovery. MIT Press, 2000.
[29] H. Kargupta, W. Huang, K. Sivakumar, and E. L. Johnson. Distributed Clustering Using Collective Principal Component Analysis. Knowledge and Information Systems, 3(4):422–448, 2001.
[30] H. Kargupta and K. Sivakumar. Existential Pleasures of Distributed Data Mining. In Data Mining: Next Generation Challenges and Future Directions. AAAI/MIT Press, 2004.
[31] D. Kempe, A. Dobra, and J. Gehrke. Computing Aggregate Information Using Gossip. In Proceedings of FOCS'03, pages 482–491, 2003.
[32] D. Krivitski, A. Schuster, and R. Wolff. A Local Facility Location Algorithm for Large-Scale Distributed Systems. Journal of Grid Computing, 5(4):361–378, 2007.
[33] P. Luo, H. Xiong, K. Lu, and Z. Shi. Distributed Classification in Peer-to-Peer Networks. In Proceedings of KDD'07, pages 968–976, 2007.
[34] C. Meyer. Matrix Analysis and Applied Linear Algebra. Society for Industrial and Applied Mathematics (SIAM), 2001.
[35] US National Virtual Observatory. http://www.us-vo.org/.
[36] W. E. Schaap. The Delaunay Tessellation Field Estimator. PhD thesis, University of Groningen, 2007.
[37] I. Sharfman, A. Schuster, and D. Keren. A Geometric Approach to Monitoring Threshold Functions Over Distributed Data Streams. ACM Transactions on Database Systems, 32(4):23, 2007.
[38] R. Wolff, K. Bhaduri, and H. Kargupta. A Generic Local Algorithm for Mining Data Streams in Large Distributed Systems. IEEE Transactions on Knowledge and Data Engineering, 21(4):465–478, 2009.
[39] R. Wolff and A. Schuster. Association Rule Mining in Peer-to-Peer Systems. IEEE Transactions on Systems, Man and Cybernetics, Part B, 34:2426–2438, 2004.
APPENDIX
We now show that if bounds (2) and (3) hold, then the relaxed problem statement holds
with $\epsilon = \epsilon_1 + \epsilon_2^2$. To do so, we must introduce some more notation.
Let $\vec{\mu}(G)$ denote the column mean vector of $G$, the global dataset. Let $\vec{\mu}$ denote the column
mean vector computed the last time the model was rebuilt (the last convergecast), an estimate
of $\vec{\mu}(G)$. Let $\Delta\vec{\mu}$ denote $\vec{\mu}(G) - \vec{\mu}$. If bound (3) holds, then $\|\Delta\vec{\mu}\| < \epsilon_2$.
Let $C(G)$ denote the covariance matrix of $G$, the global dataset. Let $C$ denote the estimate
of the covariance matrix generated by mean-shifting using $\vec{\mu}$. Specifically, the $(i,j)$ entry of $C$ is defined to be
$$C(i,j) = \frac{\sum_{k=1}^{|G|} (x_{k,i} - \mu_i)(x_{k,j} - \mu_j)}{|G|}$$
where $\mu_i$ and $\mu_j$ are the $i$th and $j$th components of $\vec{\mu}$, and $x_{k,i}$ and $x_{k,j}$ are the $i$th and $j$th components
of the $k$th data vector in $G$. Let $\vec{V}$ and $\theta$ denote a vector and a number computed the last time
the model was built such that $\|\vec{V}\| = 1$ and bound (2), $\|C\vec{V} - \theta\vec{V}\| < \epsilon_1$, is satisfied.
Now we can state precisely the claim we will prove: if $\|C\vec{V} - \theta\vec{V}\| < \epsilon_1$ (bound (2)),
$\|\vec{V}\| = 1$, and $\|\Delta\vec{\mu}\| < \epsilon_2$ (bound (3)), then $\|C(G)\vec{V} - \vec{V}\theta\| \leq \epsilon_1 + \epsilon_2^2$. The proof proceeds as
follows.
Straightforward algebraic manipulation shows that $C(G)$ is a rank-one update of $C$:
$$C(G) = C - \Delta\vec{\mu}\,(\Delta\vec{\mu})^T \qquad (4)$$
Thus, with $\|\cdot\|_F$ denoting the Frobenius norm and $\mathrm{Tr}(\cdot)$ the matrix trace:
\begin{align*}
\|C(G)\vec{V} - \vec{V}\theta\|
  &= \|[C\vec{V} - \vec{V}\theta] + [-\Delta\vec{\mu}(\Delta\vec{\mu})^T \vec{V}]\| && \text{[by (4)]}\\
  &\leq \epsilon_1 + \|\Delta\vec{\mu}(\Delta\vec{\mu})^T \vec{V}\| && \text{[triangle inequality and bound (2)]}\\
  &\leq \epsilon_1 + \|\Delta\vec{\mu}(\Delta\vec{\mu})^T\|_F\,\|\vec{V}\| && \text{[by (5.2.2) in [34]]}\\
  &= \epsilon_1 + \sqrt{\mathrm{Tr}[(\Delta\vec{\mu}(\Delta\vec{\mu})^T)^2]} && \text{[by (5.2.1) in [34] and } \|\vec{V}\| = 1\text{]}\\
  &= \epsilon_1 + \|\Delta\vec{\mu}\|^2 && \text{[straightforward algebraic manipulation]}\\
  &\leq \epsilon_1 + \epsilon_2^2 && \text{[by bound (3)]}
\end{align*}
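The rank-one mean-shift update and the final bound can be sanity-checked numerically. The pure-Python sketch below uses a made-up four-point dataset; `mu_old` plays the role of the stale mean from the last convergecast, and all names and values are illustrative only.

```python
# Numerical sanity check of the appendix: the rank-one mean-shift update
# C(G) = C - dmu dmu^T, and the bound ||C(G)V - theta V|| <= eps1 + eps2^2.
# The dataset and the stale mean mu_old are made-up toy values.
import math

def mat_vec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

# Toy global dataset G (|G| = 4 rows, d = 2 attributes)
G = [[1.0, 2.0], [2.0, 3.5], [4.0, 1.0], [3.0, 2.5]]
n, d = len(G), len(G[0])

mu_true = [sum(row[i] for row in G) / n for i in range(d)]  # mu(G)
mu_old = [2.3, 2.0]                                         # stale estimate mu

def shifted_cov(data, mu):
    # Covariance-style matrix centered at an arbitrary vector mu
    return [[sum((row[i] - mu[i]) * (row[j] - mu[j]) for row in data) / n
             for j in range(d)] for i in range(d)]

C_G = shifted_cov(G, mu_true)  # C(G): covariance with the true mean
C = shifted_cov(G, mu_old)     # C: covariance mean-shifted by the stale mean
dmu = [mu_true[i] - mu_old[i] for i in range(d)]

# Rank-one update, checked entrywise: C(G) = C - dmu dmu^T
for i in range(d):
    for j in range(d):
        assert abs(C_G[i][j] - (C[i][j] - dmu[i] * dmu[j])) < 1e-12

# Final bound: with ||V|| = 1, eps1 = ||C V - theta V||, eps2 = ||dmu||,
# the appendix shows ||C(G) V - theta V|| <= eps1 + eps2^2.
V = [1 / math.sqrt(2), 1 / math.sqrt(2)]     # arbitrary unit vector
CV = mat_vec(C, V)
theta = sum(CV[i] * V[i] for i in range(d))  # Rayleigh-quotient estimate
eps1 = norm([CV[i] - theta * V[i] for i in range(d)])
eps2 = norm(dmu)
CGV = mat_vec(C_G, V)
lhs = norm([CGV[i] - theta * V[i] for i in range(d)])
assert lhs <= eps1 + eps2 ** 2 + 1e-12
```

The entrywise check mirrors equation (4), and the last assertion mirrors the chain of inequalities above; any unit vector $\vec{V}$ and any stale mean satisfy them, since both facts are purely algebraic.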