
CS4202 Research & Development Project

Literature Survey on Data Visualization and

Complex Event Processing Rule Generation

Project Group Name : Vivarana

Project Supervisers

Prof. Gihan Dias

Eng. Charith Chitraranjan

Group Members

100112V - E.A.S.D.Edirisinghe

100132G - W.V.D.Fernando

100440A - R.H.T.D.Ranasinghe

100444N - M.C.S.Ranatunga

Table of Contents

1. Introduction
2. Multidimensional data visualization
3. Visualization techniques
3.1 Scatter Plots
3.1.1 Rank-by-feature framework
3.1.2 Rolling Dice Framework
3.1.3 Shortcomings of Scatterplot Matrix (SPLOM)
3.2 Parallel Coordinates
3.2.1 Definition and Representation
3.2.3 Brushing
3.2.4 Axis Reordering
3.2.5 Data Clustering
3.2.6 Statistical Coloring
3.2.7 Scaling
3.2.8 Limitations
3.3 Radviz
3.4 Mosaic Plots
3.5 Self Organizing Maps
3.6 Sunburst Visualization
3.7 Trellis Visualization
3.8 Grand Tour
3.8.1 Tours
3.8.2 Tour methods
4. CEP Rule generation
4.1 iCEP
4.2 Tuning rule parameters using the Prediction-Correction Paradigm
4.2.1 Model
4.2.2 System State
4.2.3 Rule Tuning Mechanism
References

List of Figures.

Figure 1: A scatterplot of the distribution of drivers’ visibility range against their age

Figure 2: A scatterplot matrix display of data with three variables X, Y, and Z.

Figure 3: Rank-by-feature framework interface for scatterplots (2D).

Figure 4: Rank by feature visualization for a data set of a demographic and health related

statistics for 3138 U.S. counties

Figure 5: Scatterplot matrix navigation for a digital camera dataset.

Figure 6: Stage-by-stage overview of the scatterplot animated transition

Figure 7: Scatterplot matrix for the “Nuts-and-bolts” dataset

Figure 8: Generalized Plot Matrix for the “Nuts-and-bolts” dataset

Figure 9: Parallel coordinate plot with 8 variables for 250 cars

Figure 10: Parallel Coordinate plot for a point

Figure 11: Parallel Coordinate plot for points in a line with m < 0

Figure 12: Parallel Coordinate plot for points in a line with 0<m<1

Figure 13: Negative correlation between Car Weight and the Year

Figure 14: Using brushing to filter Cars with 6 cylinders

Figure 15: Using composite brushing to Filter Cars with 6 cylinders made in 76’

Figure 16: An example of smooth brushing

Figure 17: Angular Brushing

Figure 18: Multiple ways of ordering N axes in parallel coordinates

Figure 19: Two clusters represented in parallel coordinates

Figure 20: Multiple clusters visualized in parallel coordinates in different colors

Figure 21: Variable length Opacity Bands representing a cluster in parallel coordinate

Figure 22: Parallel-coordinates plot using polylines and using bundled curves

Figure 23: Statistically colored Parallel Coordinates plot on weight of cars

Figure 24: Three scaling options for visualizing the stage times in the Tour de France

Figure 25: Parallel Coordinates plot for a data set with 8000 rows

Figure 26: Parallel coordinates for the “Olive Oils” data

Figure 27: Parallel Coordinates visualization with Z score coloring

Figure 28: Parallel Coordinates drawn on same data set using data selection

Figure 29: Radviz Visualization for multi-dimensional data

Figure 30: Mosaic plot for the Titanic data showing the distribution of passenger’s survival based

on their class and sex

Figure 31: Double decker plot for the Titanic data

Figure 32: Training a self-organizing map.

Figure 33: A self-organizing map trained on the poverty levels of countries

Figure 34: A sunburst visualization summarizing user paths through a fictional e-commerce site.

Figure 35: Trellis Chart for a data set on sales

Figure 36: Minnesota Barley Data Trellis Chart

Figure 37: Grand Tours

Figure 38: 1D grand tour path in 3D

Figure 39: Structure of the iCEP framework

Figure 40: Prediction Correction Paradigm

Figure 41: An overview of rules tuning method


1. Introduction

Nowadays, every action or event occurring in the real world, whether it is a change of temperature detected by a sensor, a change in stock market prices, or the movement of objects tracked through GPS coordinates, is digitally collected and stored for further exploration and analysis. Sometimes a pre-specified action is triggered in real time when a particular event occurs. Complex Event Processing (CEP) engines are used to analyze these events on the fly and to execute appropriate pre-specified actions.

One of the downsides of this real-time event monitoring and processing using a CEP engine is that a domain expert must write the necessary CEP rules in order to detect interesting events and to trigger an appropriate response. Sometimes the domain expert might lack the knowledge to write efficient CEP rules for a particular CEP engine using its query language, or might need to explore, understand and analyze the incoming event stream prior to writing any rules.

By providing an interactive visualization of data to domain experts, we can help them in their process of generating CEP rules. Hence, this literature survey mainly contains two sections. Section 3 presents our findings on interactive visualization techniques; in that section we describe scatterplots (Section 3.1) and parallel coordinates (Section 3.2) in detail and briefly introduce other promising visualization techniques. Section 4 contains our findings on two methods of CEP rule generation, namely iCEP and rule parameter tuning. Further, Section 2 contains an overview of multidimensional data visualization (principles, techniques, problems) for the sake of completeness.


2. Multidimensional data visualization

Recent advances in technology have enabled the generation of vast amounts of data in a wide range of fields, and these data keep getting more complex. Data analysts want to look for patterns, anomalies and structures in the data. Analyzing the data can lead to important knowledge discoveries which are valuable to users. The benefits of such understanding are reflected in business decision making, more accurate medical diagnoses, finer engineering and more refined conclusions in a general sense.

Visualizing these complex data can provide an overview and a summary of the data, and can help in identifying areas of interest within it. Good data visualization techniques that allow users to explore and manipulate the data can empower them in analyzing the data and identifying important patterns and trends that may otherwise have stayed hidden.

Multi-dimensional data visualization is a very active research area that goes back many years [1]. In this survey we have focused on 2D multi-dimensional data visualization techniques, because 2D visualizations make it easy for users to analyze and interact with the data: a 2D surface is more familiar to users and easier to navigate.

There are multiple challenges that need to be overcome in multidimensional data visualization. Finding a good visualization involves finding a good compromise among them:

● Mapping - Finding a good mapping from a multi-dimensional space to a two

dimensional space is not a simple task. The final representation of the data should be

intuitive and interpretable. Users should be able to identify patterns and trends in the

multi-dimensional data using the two dimensional representation.

● Large amounts of data - Modern datasets contain very large amounts of data that can lead to very dense data visualizations. This causes a loss of information in the visualization because users lose the ability to distinguish small differences in the data.

● Dimensionality - Displaying the information of multiple dimensions in a two-dimensional space can also lead to very dense and cluttered visualizations. Techniques need to be developed to allow users to reduce the clutter and identify important information in the data. Techniques such as principal component analysis [2] can help in identifying important dimensions in the data.

● Assessing effectiveness - Information needs vary widely with each data set, so there is no "silver bullet" visualization technique that can solve all problems. Different datasets and requirements can lead to different visualization methods. There is no standard way to assess the effectiveness of one visualization method over another, and hence no process that can be followed to come up with a visualization method that works for any dataset.

Further, according to E.R. Tufte [3], a good visualization comprises the qualities below.

● Show data variations instead of design variations. This quality encourages the viewer to think about the substance rather than about methodology, graphic design, the technology of graphic production, etc. One way to achieve this quality in a visualization is to have a high data-to-ink ratio [4] and a high data density.

● Clear, detailed and thorough labeling and appropriate scales. A visualization can use layering and separation techniques to show the labels of the data items.

● The size of a graphic effect should be directly proportional to the numeric quantities. This can be achieved by avoiding chart junk such as unnecessary 3D and shadowing effects and by reducing the lie factor [5].

In order to make a visualization more user friendly, a number of interaction techniques have been proposed [6]. It should be noted that the behavior of these interaction techniques differs from one visualization technique to another. However, interaction techniques allow the user to directly interact with the visualization and to change it according to the exploration objective. The list below contains the major interaction techniques we have identified.

● Dynamic Projections

Dynamic projection means dynamically changing the projection in order to explore a multidimensional data set. A classic example is the Grand Tour [7], which tries to show all interesting pairs of dimensions of a multidimensional dataset as a series of scatterplots. The sequence of projections can be random, manual, pre-computed, or even data driven depending on the visualization technique.

● Interactive Filtering

When exploring large datasets interactively, partitioning the data and focusing on interesting subsets is a must. This can be achieved through direct selection of the desired subset (browsing) or through specifying the properties of the desired subset (querying). However, browsing becomes difficult and querying becomes inaccurate as the dataset grows larger. As a solution to this problem, a number of techniques such as Magic Lens [8] and InfoCrystal [9] have been developed to improve interactive filtering in data exploration.

● Interactive Zooming

Zooming is used in almost all interactive visualizations. When dealing with large amounts of data, the data is sometimes highly compressed in order to provide an overview of it. In such cases zooming does not only mean displaying the data objects larger; it also means that the data representation should automatically change to present more details at higher zoom levels (decompressing). The initial (compressed) view allows the user to identify patterns, correlations and outliers, and by zooming in to an area of interest the user can study the data objects within that region in more detail.

● Interactive Distortion

Interactive distortion techniques help the data exploration process by providing a way to focus on details while preserving an overview of the data. The basic idea of distortion is to show a portion of the data with a high level of detail while other portions are shown at a lower level of detail.

● Interactive Linking and Brushing

The idea of linking and brushing is to combine different visualization methods to overcome the shortcomings of single techniques. As an example, one could visualize a scatterplot matrix (Section 3.1) for a data set, and when some points in a particular scatterplot are brushed, those points are highlighted in all other scatterplots. Hence interactive changes made in one visualization are automatically reflected in the other visualizations.


3. Visualization techniques

3.1 Scatter Plots

Scatterplots are a commonly used visualization technique for multivariate data sets. Mainly there are 2D and 3D scatterplot visualizations. In a 2D scatterplot, data points from two dimensions of a dataset are plotted in a Cartesian coordinate system where the two axes represent the selected dimensions, resulting in a scattering of points. An example of a scatterplot showing the distribution of drivers' visibility range against their age is shown in Figure 1.

Figure 1: A scatterplot of the distribution of drivers’ visibility range against their age

The positions of the data points represent the corresponding dimension values. Scatterplots are useful for visually identifying correlations between two selected variables of a multidimensional data set, or for finding clusters of points or outliers in the dataset. A single scatterplot can only depict the correlation between two dimensions; a limited number of additional dimensions can be mapped to the color, size or shape of the plotted points.

Advocates of 3D scatterplots argue that since the natural world is three dimensional, users can readily grasp 3D representations. However, there is substantial empirical evidence that for multidimensional ordinal data (rather than 3D real objects such as chairs or skeletons), users struggle with occlusion and the cognitive burden of navigation as they try to find desired viewpoints [10]. Advocates of higher-dimensional displays have demonstrated attractive possibilities, but their strategies are still difficult to grasp for most users.

Since two-dimensional scatterplot presentations offer ample power while maintaining comprehensibility, many variations have been proposed. One of the methods used to visualize multivariate data using 2D scatterplots is the scatterplot matrix (SPLOM) [1].

Figure 2: A scatterplot matrix display of data with three variables X, Y, and Z [1].

Each individual plot in the SPLOM is identified by its row and column number in the matrix [1]. For example, the identity of the upper left plot of the matrix in Figure 2 is (1, 3) and the lower right plot is (3, 1). The empty diagonal displays the variable names. Plot (2, 1) is the scatterplot of parameter X against Y, while plot (1, 2) is the reverse, i.e. Y versus X.

One of the major disadvantages of SPLOM is that as the number of dimensions of the data set grows, the n-by-n SPLOM grows and each individual scatterplot in the SPLOM gets less space. The following frameworks provide a solution to this problem by incorporating interactive techniques into the traditional SPLOM.

3.1.1 Rank-by-feature framework

Many variations of the initial SPLOM have been proposed to enhance its interactivity and interpretability. One such enhancement is the rank-by-feature framework [10]. Instead of directly visualizing the data points against all pairs of dimensions, this framework allows the user to select an interesting ranking criterion, as described later in this section.


Figure 3: Rank-by-feature framework interface for scatterplots (2D). All pairs of dimensions are

sorted according to the current ordering criterion (Correlation coefficient) (A) in the ordered list

(C). The score overview (B) shows an overview of scores of pairs of dimensions. A mouse over

event activates a cell in the score overview, highlights the corresponding item in the ordered list

(C) and shows the corresponding scatterplot in the scatterplot browser (D) simultaneously. A

scatterplot is shown in the scatterplot browser (D), where it is also easy to traverse scatterplot

space by changing X or Y axis using item sliders on the horizontal or vertical axis. (A

demographic and health related statistics for 3138 U.S. counties with 17 attributes.)

Figure 3 shows a dataset of demographic and health-related statistics for 3138 U.S. counties with 17 attributes, visualized through the rank-by-feature framework. Its interface consists of four coordinated components: the control panel (Figure 3A), the score overview (Figure 3B), the ordered list (Figure 3C), and the scatterplot browser (Figure 3D).

The user can select an ordering criterion in the control panel (Figure 3A), and the ordered list (Figure 3C) shows the pairs of dimensions (scatterplots) sorted according to the score of that criterion, with the scores color-coded in the background. However, users cannot see an overview of all relationships between variables at a glance in the ordered list. Hence the score overview (Figure 3B), an m-by-m grid view where all dimensions are aligned in the rows and columns, has been implemented. Each cell of the score overview represents a scatterplot whose horizontal and vertical axes are the dimensions at the corresponding column and row respectively.

Since this matrix is symmetric, only the lower-triangular part is shown. Each cell is color-coded by its score value using the same mapping scheme as in the ordered list. The scatterplot corresponding to the cell is shown in the scatterplot browser (Figure 3D) simultaneously, and the corresponding item is highlighted in the ordered list (Figure 3C). In the scatterplot browser, users can quickly look through scatterplots by using item sliders attached to the scatterplot view. Simply by dragging the vertical or horizontal item slider bar, users can change the dimension for the horizontal or vertical axis respectively while preserving the other axis.

The list below contains the ranking criteria suggested by this framework.

● Correlation coefficient (-1 to 1):

The Pearson correlation coefficient r for a scatterplot S with n points (x_i, y_i) [12] is defined in Equation 1.

Equation 1: r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}

Pearson's r is a number between -1 and 1. The sign and the magnitude indicate the direction and the strength of the relationship respectively. Although correlation does not necessarily imply causality, it can provide a good clue to the true cause, which could be another variable. Linear relationships are common and simple to understand. As a visual representation of the linear relationship between two variables, the line of best fit (regression line) is drawn over the scatterplots.

● Least square error for curvilinear regression (0 to 1)

This criterion sort scatterplots in terms of least-square errors from the optimal quadratic

curve fit so that the user can isolate the scatterplots where all points are closely/loosely

arranged along a quadratic curve. In some scenarios it might be interesting to find

nonlinear relationships in the data set in addition to linear relationship.

● Quadracity (0 to infinity)

The quadracity criterion is added to emphasize real quadratic relationships. It ranks scatterplots according to the coefficient of the highest-degree term, so that users can easily identify the ones that are more quadratic than others.

● The number of potential outliers (0 to n)

Distance-based outlier detection methods such as DB-out [13] or Density based outlier

detection methods such as Local Outlier Factor (LOF)-based method [14] can be used to

detect outliers in a scatterplot and rank by-feature framework uses LOF-based method

(Figure 4), since it is more flexible and dynamic in terms of outlier definition and

9

detection. The outliers are highlighted with yellow triangles in the scatterplot browser

view.

Figure 4: Rank-by-feature visualization for a data set of demographic and health-related statistics for 3138 U.S. counties with 17 attributes, visualized with the number of potential outliers ranking criterion.

● The number of items in the region of interest (0 to n)

This criterion allows the user to draw a free-formed polygon region of interest on the

scatterplot. Then the framework will use the number of data points in the region to order

all scatterplots so that user can easily find the ones with most/least number of items in the

specified region.

● Uniformity of scatterplots (0 to infinity)

To calculate this criterion the two-dimensional space is divided into regular grid cells and

then each cell is used as a bin. For example, if k-by-k grid has been generated, the

entropy of a scatterplot S would be

Where Pij is the probability that an item belongs to the cell at (i, j) of the grid.
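As an illustration of how such ranking criteria can be computed, the following sketch ranks all pairs of dimensions of a numeric table by absolute Pearson correlation and scores a single pair with the grid-entropy criterion described above. It is a minimal example, not the original framework's implementation; the column names and the grid size k are arbitrary assumptions.

import itertools
import numpy as np

def rank_pairs_by_correlation(data, names):
    """Score every pair of dimensions by |Pearson r| and sort descending."""
    scores = []
    for i, j in itertools.combinations(range(data.shape[1]), 2):
        r = np.corrcoef(data[:, i], data[:, j])[0, 1]
        scores.append((abs(r), names[i], names[j], r))
    return sorted(scores, reverse=True)

def grid_entropy(x, y, k=10):
    """Uniformity criterion: entropy of a k-by-k binning of the scatterplot."""
    counts, _, _ = np.histogram2d(x, y, bins=k)
    p = counts.ravel() / counts.sum()
    p = p[p > 0]                      # empty cells contribute 0 * log 0 = 0
    return -np.sum(p * np.log2(p))

# Toy data with 4 dimensions; in practice these would be the dataset's columns.
rng = np.random.default_rng(0)
data = rng.normal(size=(500, 4))
data[:, 1] = 0.8 * data[:, 0] + 0.2 * data[:, 1]   # inject one correlated pair
names = ["d1", "d2", "d3", "d4"]

for score, a, b, r in rank_pairs_by_correlation(data, names):
    print(f"{a} vs {b}: r = {r:+.2f}")
print("entropy of (d1, d2):", round(grid_entropy(data[:, 0], data[:, 1]), 3))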

3.1.2 Rolling Dice Framework

Rolling Dice is another framework which utilizes a SPLOM to visualize multidimensional data [15]. In this framework, transitions from one scatterplot to another are performed as animated rotations in 3D space, similar to rolling a die. The Rolling Dice framework also suggests a visual querying technique, so that a user can refine a query by exploring how the same query would look in any scatterplot.

Figure 5: Scatterplot matrix navigation for a digital camera dataset [15]. The main interface proposed by the Rolling Dice framework consists of a scatterplot matrix component (A), a scatterplot component (B) and a query layer component (C).

The interface proposed by the framework mainly consists of three components: the scatterplot component (Figure 5B), the scatterplot matrix component (Figure 5A) and the query layer component (Figure 5C). The scatterplot component shows the currently viewed cell of the scatterplot matrix with the names and labels of the two displayed axes. The scatterplot matrix component can be used both as an overview and as a navigational tool. Navigation in the scatterplot matrix is restricted to orthogonal movement along the same row or column of the matrix, so that one dimension in the focused scatterplot is always preserved while the other changes. The change is visualized using a 3D rotation animation, which gives a semantic meaning to the movement of the points, allowing the human mind to interpret the motion as shape [16].


The transition between scatterplots is performed as a three-stage animation: extrusion into 3D, rotation, and projection back into 2D. More specifically, given the two currently visualized dimensions x and y and a vertical transition to a new dimension y', the transition follows the steps below (also depicted in Figure 6).

Figure 6: Stage-by-stage overview of the scatterplot animated transition: Extrusion (A, B),

rotation (C), Projection (D, E)

● Extrusion: The scatterplot visualizing the x and y axes is extruded into 3D, where y' becomes the new depth coordinate for each data point. At the end of this step the 2D scatterplot has become 3D (Figures 6A and 6B).

● Rotation: The scatterplot is rotated 90 degrees up or down, causing the axis previously along the depth dimension to become the new vertical axis (Figure 6C).

● Projection: The 3D plot is projected back into 2D, with x and y' as the new horizontal and vertical axes (Figures 6D and 6E).
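To make the three-stage transition concrete, the sketch below computes intermediate 2D point positions for a vertical y-to-y' transition by interpolating the rotation angle. It is only an illustrative reading of the animation described above, not the authors' implementation, and the sample data and frame count are assumptions.

import numpy as np

def rolling_dice_frames(x, y, y_new, steps=30):
    """Yield 2D point positions for the extrude-rotate-project transition.

    At angle 0 the plot shows (x, y); at 90 degrees it shows (x, y_new).
    In between, each point is the screen projection of the rotating 3D
    point (x, y, y_new).
    """
    for t in np.linspace(0.0, np.pi / 2, steps):
        vertical = y * np.cos(t) + y_new * np.sin(t)   # rotated vertical coordinate
        yield np.column_stack([x, vertical])

# Toy example: 5 points with three dimensions x, y and y'.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
y_new = np.array([0.5, 2.5, 1.5, 4.5, 3.5])

frames = list(rolling_dice_frames(x, y, y_new))
print(frames[0])    # identical to the (x, y) scatterplot
print(frames[-1])   # identical to the (x, y') scatterplot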

Further, the Rolling Dice framework suggests a method called query sculpting, which allows selecting data items in the main scatterplot visualization using 2D bounding shapes (convex hulls) and iteratively refining that selection from other viewpoints while navigating the scatterplot matrix. As shown in Figure 5C, the query layer component is used for selecting, naming and clearing color-coded queries during the visual exploration. Clicking and dragging one query onto another performs a union or intersection operation (by dragging with the left or right mouse button respectively). Each query layer also provides a visual indication of the percentage of items currently selected by it.


3.1.3 Shortcomings of Scatterplot Matrix (SPLOM)

In order to discuss the shortcomings of SPLOM, let us consider a fictitious "nuts-and-bolts" dataset. This dataset, shown in Table 1, involves 3 (independent) categorical variables: Region (North, Central, and South), Month (January, February, ...), and Product (Nuts or Bolts). It also contains 3 (dependent) continuous variables: Sales, Equipment costs, and Labor costs.

Region   Month   Product   Sales   Equipment costs   Labor costs
North    Jan     Nuts      2.78    0.92              4.30
North    Feb     Nuts      4.92    1.64              4.30
...      ...     ...       ...     ...               ...
South    Dec     Bolts     9.50    2.44              5.20

Table 1: “Nuts-and-Bolts” dataset

Figure 7 shows the SPLOM for the "nuts-and-bolts" dataset. The top three scatterplots (e.g. Month vs. Region) each show a crossing of two categorical variables, resulting in an uninformative grid of points. Further, scatterplots showing continuous vs. categorical variables suffer from overplotting (e.g. Sales vs. Product).

Figure 7: Scatterplot matrix for the “Nuts-and-bolts” dataset


In order to overcome this issue, the Generalized Plot Matrix (GPLOM) [17] has been proposed. The GPLOM uses heatmaps to visualize pairs of categorical variables, bar charts to visualize continuous vs. categorical variables, and scatterplots to visualize pairs of continuous variables. It is important to note that in this scenario the scatterplots show individual tuples, whereas the bar charts and heatmaps show aggregated data. Figure 8 shows the GPLOM for the "nuts-and-bolts" dataset. Even though a GPLOM is a better choice than a SPLOM for visualizing a combination of continuous and categorical variables, since it uses three types of charts it loses the consistency of the matrix.

Figure 8: Generalized Plot Matrix for the “Nuts-and-bolts” dataset
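A minimal sketch of the GPLOM idea described above is to pick the chart type for each cell from the types of the two variables; the rule table below is a plausible reading of [17], and the data-frame columns are assumptions made for illustration, not taken from the survey.

from itertools import combinations

import pandas as pd

def gplom_chart_types(df):
    """For every pair of columns, choose the GPLOM cell type:
    categorical x categorical -> heatmap (aggregated counts)
    categorical x continuous  -> bar chart (aggregated values)
    continuous  x continuous  -> scatterplot (individual tuples)
    """
    kinds = {c: ("categorical" if df[c].dtype == object else "continuous")
             for c in df.columns}
    chart_for = {("categorical", "categorical"): "heatmap",
                 ("categorical", "continuous"): "bar chart",
                 ("continuous", "categorical"): "bar chart",
                 ("continuous", "continuous"): "scatterplot"}
    return {(a, b): chart_for[(kinds[a], kinds[b])]
            for a, b in combinations(df.columns, 2)}

# Tiny stand-in for the "nuts-and-bolts" table.
df = pd.DataFrame({"Region": ["North", "North", "South"],
                   "Product": ["Nuts", "Nuts", "Bolts"],
                   "Sales": [2.78, 4.92, 9.50],
                   "Labor costs": [4.30, 4.30, 5.20]})
for pair, chart in gplom_chart_types(df).items():
    print(pair, "->", chart)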

3.2 Parallel Coordinates

Parallel coordinates, introduced by Inselberg and Dimsdale [11, 19], are a popular technique for transforming multidimensional data into a 2D image. The m-dimensional data items are represented as lines crossing m parallel axes, each axis corresponding to one dimension of the original data. Fundamentally, parallel coordinates differ from other visualization methodologies since they yield a graphical representation of multidimensional relations rather than just visualizing a finite set of points [19].

Figure 9 displays a parallel coordinate plot with 8 variables using a dataset [58] which contains information about cars, such as economy (mpg), cylinders and displacement (cc), for a selected sample of cars manufactured between 1970 and 1982.

Figure 9: Parallel coordinate plot with 8 variables for 250 cars

3.2.1 Definition and Representation

On the plane with xy-Cartesian coordinates, starting on the y-axis, N copies of the real line, labeled x1, x2, x3, ..., xN, are placed equidistant and perpendicular to the x-axis. They are the axes of the parallel coordinate system for the Euclidean N-dimensional space R^N, all having the same positive orientation as the y-axis [11].

Figure 10: Parallel Coordinate plot for a point


Figure 10 shows how a point C with coordinates (c1, c2, c3, ..., cN) can be represented by a polygonal line. In this way, m data points can be represented by m polygonal lines.

A point in 2D Cartesian space is represented by a single line in parallel coordinates. Extending on this, a line in 2D Cartesian space is represented in parallel coordinates by selecting a set of collinear points on the line and representing each of those points in the parallel coordinates visualization. The lines in the parallel coordinates visualization that represent those points intersect at a single point. If the distance between the axes is d, the intersection point \bar{l} for the line l: x_2 = m x_1 + b is

\bar{l} = \left( \frac{d}{1 - m}, \frac{b}{1 - m} \right)

For lines with negative slope (m < 0) the intersection point lies between the axes, as in Figure 11.

Figure 11: Parallel Coordinate plot for points in a line with m < 0

For m > 1 the intersection point lies to the left of the X1 axis, while for lines with 0 < m < 1 it lies to the right of the X2 axis, as in Figure 12.

Figure 12: Parallel Coordinate plot for points in a line with 0<m<1
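As a quick worked check of the duality above (using the reconstructed formula, so treat the numbers as an illustration rather than part of the original text): with inter-axis distance d = 1, the line x_2 = 2x_1 + 1 (m = 2 > 1) maps to

\bar{l} = \left(\frac{1}{1-2}, \frac{1}{1-2}\right) = (-1, -1),

which lies to the left of the X1 axis at x = 0, while the line x_2 = 0.5x_1 + 1 (0 < m < 1) maps to (1/(1-0.5), 1/(1-0.5)) = (2, 2), to the right of the X2 axis at x = d = 1.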

16

The above property can be considered one of the main advantages of parallel coordinates: parallel coordinate representations can support statistical interpretations of the data. In the statistical setting, the following interpretations can be made: for highly negatively correlated pairs, the dual line segments in parallel coordinates tend to cross near a single point between the two axes, while parallel or almost parallel lines between axes indicate positive correlation between variables [20, 21]. For example, we can see that there is a highly negative correlation between weight and year in Figure 13.

Figure 13: Negative correlation between Car Weight and the Year

Over the years, parallel coordinates have been enhanced by many researchers, who have improved the technique for better data investigation and for easier, user-friendly interaction by adding brushing, data clustering, real-time re-ordering of coordinate axes, etc.

3.2.3 Brushing

Brushing is considered to be a very effective technique for specifying an explicit focus during information visualization [22]. The user actively marks subsets of the data set as being especially interesting, and the points contained by the brush are colored differently from other points to make them stand out [23]. For example, if the user is interested in cars having 6 cylinders, brushing can be used as depicted in Figure 14.


Figure 14: Using brushing to filter Cars with 6 cylinders

The introduction of composite brushes [23] allows users to define their focus more specifically. Composite brushes are a combination of single brushes that results in the conjunction of those single brushes. For example, if the user is interested in cars having 6 cylinders that were produced in '76, composite brushing can be used as depicted in Figure 15.

Figure 15: Using composite brushing to Filter Cars with 6 cylinders made in 76’

The brushing techniques we have seen up to now use a discrete distinction between focus and context, so we do not learn how similar other data points are to the focused data points. The solution brought forward for this is called smooth brushing [22], where a multi-valued or even continuous transition is allowed, which inherently expresses the similarity between data points in focus and their context. This corresponds to a degree-of-interest (DOI) function which maps non-binarily into the [0, 1] range. Often, such a non-binary DOI function is defined by means of spatial distances, i.e. the DOI value reflects the distance of a data point from a so-called center of interest.


Figure 16: An example of Smooth brushing: note the gradual changes of drawing intensity which

reflect the respective degree of interest, after smooth brushing of the 2nd axis.
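The brush types described above reduce, in data terms, to boolean masks and degree-of-interest weights. The sketch below expresses a composite brush and a distance-based smooth-brush DOI over a small table; it is only a conceptual illustration, and the column names and thresholds are assumptions rather than anything taken from the cited systems.

import numpy as np
import pandas as pd

# Toy cars table standing in for the dataset used in Figures 14-16.
cars = pd.DataFrame({"cylinders": [4, 6, 6, 8, 6],
                     "year":      [75, 76, 76, 76, 80],
                     "weight":    [2200, 3100, 3000, 4300, 2900]})

# Single brush: cars with 6 cylinders (Figure 14).
brush_a = cars["cylinders"] == 6

# Composite brush: conjunction of two single brushes (Figure 15).
brush_b = brush_a & (cars["year"] == 76)

# Smooth brush: degree of interest in [0, 1] that falls off linearly with the
# distance of 'weight' from a chosen center of interest.
center, radius = 3000, 500
doi = np.clip(1.0 - np.abs(cars["weight"] - center) / radius, 0.0, 1.0)

print(cars[brush_b])          # rows in focus under the composite brush
print(doi.round(2).tolist())  # per-row drawing intensity for smooth brushing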

Standard brushing primarily acts along the axes, but the technique called angular brushing uses the space between axes for brushing [22]. The user can interactively specify a subset of slopes, which causes all data points that exhibit the matching correlation between the brushed axes to be marked as part of the current focus. For example, if the user is interested only in data with a negative correlation between Horsepower and Acceleration, angular brushing can be used as shown in Figure 17.


Figure 17: Angular Brushing: reading between the lines. Whereas most line segments go up between the 2nd and the 3rd axis (visualizing a positive correlation of values there), just a few go down; those have been emphasized through angular brushing.

3.2.4 Axis Reordering

One strength of parallel coordinates, as described in Section 3.2.1, is their effectiveness at visualizing relations between coordinate axes. By bringing axes next to each other interactively, the user can investigate how values are related with respect to two of the data dimensions. The order of the axes clearly affects the patterns revealed by parallel coordinate plots. Figure 18 shows 3 of the N! (N = 8 in this case) ways of ordering the axes, but only plot C in Figure 18 is capable of showing that there is a highly negative correlation between weight and economy.

Many researchers address this problem using a measure to score an ordering of the axes, while others discuss how to visualize multiple orderings in a single display [24]. Several approaches based on combining the Nonlinear Correlation Coefficient with the Singular Value Decomposition algorithm [25] have been suggested. Using these approaches, the most notable axis can be selected based on mathematical theory, and all axes are re-ordered according to the degree of similarity among them [25].

Figure 18: Multiple ways of ordering N axes in parallel coordinates: (A): The default Order of

the Axes, (B): Axes are re-ordered to see the correlation between Year and Power - highly

negative correlation is observed. (C): Axes are reordered to see the correlation between Weight

and the Economy - highly negative correlation is observed.
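One simple way to realize such an ordering, under the assumption that absolute Pearson correlation is used as the pairwise similarity (the cited work [25] uses a nonlinear correlation measure, so this is only a stand-in), is a greedy chain that repeatedly appends the axis most similar to the current end:

import numpy as np

def greedy_axis_order(data, names):
    """Order parallel-coordinate axes so that adjacent axes are highly similar.

    Similarity is taken here as the absolute Pearson correlation between
    columns; the ordering starts from the most correlated pair and greedily
    extends the chain at its right end.
    """
    corr = np.abs(np.corrcoef(data, rowvar=False))
    np.fill_diagonal(corr, 0.0)
    i, j = np.unravel_index(np.argmax(corr), corr.shape)
    order, remaining = [i, j], set(range(len(names))) - {i, j}
    while remaining:
        last = order[-1]
        nxt = max(remaining, key=lambda k: corr[last, k])
        order.append(nxt)
        remaining.remove(nxt)
    return [names[k] for k in order]

rng = np.random.default_rng(1)
base = rng.normal(size=(300, 1))
data = np.hstack([base,
                  rng.normal(size=(300, 2)),
                  -0.9 * base + 0.1 * rng.normal(size=(300, 1))])
print(greedy_axis_order(data, ["weight", "noise1", "noise2", "economy"]))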

3.2.5 Data Clustering

Parallel Coordinates are a good technique to show clusters in the data set. There are

many techniques that researchers have used to show clusters in parallel coordinates.


Coloring is one method that has been used to show clusters in parallel coordinates [26]: different colors are assigned to different clusters. Figure 19 shows two clusters, given explicitly, represented with two different colors.

Figure 19: Two clusters represented in parallel coordinates with two different colors (red and

blue)

Figure 20 shows the same cluster visualization technique for many clusters on a data set taken from the USDA National Nutrient Database.

Figure 20: Multiple clusters visualized in parallel coordinates in different colors

Variable-length opacity bands [26] are another technique for showing clusters in parallel coordinates. Figure 21 shows a graduated band, faded from a dense middle to transparent edges, that visually encodes information for a cluster. The mean stretches across the middle of the band and is encoded with the deepest opacity. This allows the user to differentiate sparse, broad clusters from narrow, dense clusters. The top and bottom edges of the band have full transparency, and the opacity across the rest of the band is linearly interpolated. The thickness of the band at each axis represents the extent of the cluster in that dimension.


Figure 21: Variable length Opacity Bands representing a cluster in parallel coordinate

Curved bundling [27] is also used to visualize clusters in parallel coordinates. Bundled curve plots extend the traditional polyline plots and are designed to reveal the structure of clusters previously identified in the input data. Given a data point (P1, P2, ..., PN), its corresponding polyline is replaced by a piecewise cubic Bézier curve preserving the following properties (denote the main axes by X1, X2, X3, ..., XN to avoid confusion between them and the added virtual axes):

● The curve interpolates P1, P2,..., PN at the main axes

● Curves corresponding to data points that belong to the same cluster are bundled between

adjacent main axes. This is accomplished by inserting a virtual axis midway between the

main axes and by appropriately positioning the Bézier control points along the virtual

axis. To support curve bundling, control points that define curves within the same cluster

are attracted toward a cluster centroid along the virtual axis.

Figure 22 compares a polyline plot with its counterpart using bundled curves. Polylines require color coding to distinguish clusters, whereas curve bundles rely on geometric proximity to naturally represent cluster information. The cluttered appearance of color-coded polylines, which are the standard approach to cluster-membership visualization, motivates the new geometry-based method.


Figure 22: Parallel-coordinates plot (A) using polylines with color coding to show clusters, and

(B) using bundled curves

Bundling violates the point-line duality discussed in Section 3.2.1, but it can be used to visualize clusters using geometry only, leaving the color channel free for other uses such as the statistical coloring described in Section 3.2.6. Many algorithms have been proposed for adjusting the shape of the Bézier curves [27, 28, 29].

3.2.6 Statistical Coloring

Coloring the polygonal lines can also be used to display statistics of a selected axis. A popular color scheme is to color by the z-score for that dimension, so that the data distribution of that dimension can be understood. Figure 23 shows how z-score coloring has been applied to the weight dimension of the cars data set.

Figure 23: Statistically colored Parallel Coordinates plot on weight of cars - Cars that have a high

weight will be blue in color while low weight vehicles are colored red.
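A minimal sketch of this coloring scheme, assuming a simple red-to-blue mapping of the z-score (the exact color scale used for Figure 23 is not specified in the survey):

import numpy as np

def zscore_colors(values):
    """Map one dimension to polyline colors via its z-score.

    Returns RGB tuples that shade from red (low z-score) to blue (high),
    one simple way to realize the coloring shown in Figure 23.
    """
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    # Squash z-scores into [0, 1]; 0 -> pure red, 1 -> pure blue.
    t = np.clip((z + 2.0) / 4.0, 0.0, 1.0)
    return [(1.0 - ti, 0.0, ti) for ti in t]

weights = [1613, 2130, 2720, 3169, 3821, 4955]   # example car weights (lbs)
for w, rgb in zip(weights, zscore_colors(weights)):
    print(w, tuple(round(c, 2) for c in rgb))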


3.2.7 Scaling

Scaling of the axes is also an interesting property of parallel coordinates. The default is to plot all values over the full range of each axis, between the minimum and the maximum of the variable. Several other scaling methods have been suggested by researchers [21]; a common one is to use a common scale over all axes. Figure 24 shows the difference between the scaling methods. The data are the individual stage times of the 155 cyclists who finished the 2005 Tour de France bicycle race. Figure 24A is plotted with the default scaling and Figure 24B is plotted using a common scale over all axes. Neither Figure 24A nor Figure 24B is capable of revealing correlations between axes, even though Figure 24B shows the outliers clearly; the spread between the first and the last cyclist is almost invisible for most of the stages. In Figure 24C, a common scale for all stages is used, but each stage is aligned at the median value of that stage. It is the user's experience, domain knowledge and use case that define the scale and alignment of the parallel coordinates [21].

Figure 24: Three scaling options for visualizing the stage times in the Tour de France 2005: (A): All stages are scaled individually between the minimum and maximum value of the stage (the usual default for parallel coordinate plots). (B): A common scale is used, i.e., the minimum/maximum time over all stages is used as the global minimum/maximum for all axes. (C): Common scale for all stages, but each stage is aligned at the median value of that stage.
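The three options in Figure 24 differ only in how each column is normalized before plotting. The helper below sketches them; the mode names and the median-centering constant of 0.5 are assumptions made for illustration.

import numpy as np

def scale_axes(data, mode="per-axis"):
    """Normalize columns of `data` for a parallel-coordinate plot.

    per-axis : each axis scaled between its own min and max (Figure 24A)
    common   : one global min/max shared by all axes (Figure 24B)
    median   : common scale, but each axis shifted so its median sits at 0.5
               of the plot height (Figure 24C)
    """
    data = np.asarray(data, dtype=float)
    if mode == "per-axis":
        lo, hi = data.min(axis=0), data.max(axis=0)
        return (data - lo) / (hi - lo)
    span = data.max() - data.min()
    scaled = (data - data.min()) / span
    if mode == "common":
        return scaled
    if mode == "median":
        return scaled - np.median(scaled, axis=0) + 0.5
    raise ValueError(f"unknown mode: {mode}")

stage_times = np.array([[220.0, 350.0, 180.0],
                        [225.0, 340.0, 182.0],
                        [260.0, 400.0, 210.0]])
for mode in ("per-axis", "common", "median"):
    print(mode, scale_axes(stage_times, mode).round(2).tolist())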

3.2.8 Limitations

Even though parallel coordinates are a great tool for visualizing high-dimensional data, they soon reach their limits. When using a very large dataset there are some identified weaknesses in parallel coordinates, such as:

1. Cross-over Problem - The zigzagging polygonal lines used for data representation are not

continuous. They generally lose visual continuation across the parallel-coordinates axes,

making it difficult to follow lines that share a common point along an axis.

2. When two or more data points have the same or similar values for a subset of the

attributes, the corresponding polylines may overlap and clutter the visualization.

Figure 25 depicts these two problems in a parallel coordinate plot drawn for 8000 data points.

Figure 25: Parallel Coordinates plot for a data set with 8000 rows. (Food information taken from

USDA National Nutrient Database)

Given a very large data set, these two problems make it hard to draw conclusions about the correlations between axes, and brushing will not give a clear idea about the data either. One solution to the above problems is to use α-blending [21]. When α-blending is used, each polyline is plotted with an opacity of α. With smaller α values, areas of high line density remain visible and are better contrasted against areas of low density.


The data in Figure 26 are real data from Forina et al. [32] on the fatty acid content of Italian olive oil samples from nine regions. Figures 26A, B and C show the same plot of all eight fatty acids with α-values of 0.5, 0.1, and 0.01 respectively. Depending on the amount of α-blending applied, the group structure of some of the nine regions is more or less visible [21]. It is hard to settle on a single value for α; the user must adjust it until the graph gives enough insight.

Figure 26: Parallel coordinates for the “Olive Oils” data with different alpha values. α = 0.5 (A),

α = 0.1 (B), and α = 0.05 (C)
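A minimal matplotlib sketch of α-blended parallel coordinates, assuming min-max scaling per axis and synthetic data (the olive-oil data of Figure 26 is not reproduced here):

import matplotlib.pyplot as plt
import numpy as np

def parallel_coordinates(data, alpha=0.1, ax=None):
    """Draw a parallel-coordinate plot where each polyline has opacity alpha."""
    data = np.asarray(data, dtype=float)
    lo, hi = data.min(axis=0), data.max(axis=0)
    scaled = (data - lo) / (hi - lo)              # min-max scale every axis
    xs = np.arange(data.shape[1])
    ax = ax or plt.gca()
    for row in scaled:
        ax.plot(xs, row, color="steelblue", alpha=alpha, linewidth=0.5)
    ax.set_xticks(xs)
    return ax

# Dense synthetic data: 8000 rows over 5 dimensions, as in Figure 25.
rng = np.random.default_rng(2)
data = rng.normal(size=(8000, 5)).cumsum(axis=1)
parallel_coordinates(data, alpha=0.05)
plt.savefig("alpha_blended_parallel_coordinates.png", dpi=150)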


Clustering and statistical coloring, discussed in Sections 3.2.5 and 3.2.6, will also reduce these weaknesses of parallel coordinates.

Figure 27: Parallel coordinates visualization with Z score coloring: Z score coloring based on the

amount of water - foods with high water percentage will have blue color while foods with lower

water percentage will have red color

As in Figure 27, the point-line duality is better preserved when statistical coloring is used.

Data preprocessing techniques can also be used to overcome the limitations of parallel coordinates: data selection and data aggregation. Data selection means that a display does not represent the dataset as a whole but only a portion of it, which is selected in a certain way [30]. The display is supplied with interactive controls for changing the current selection, which results in showing another portion of the data [30].

Figure 28 shows how displaying a portion of the data overcomes these weaknesses of parallel coordinates. Figure 28A only displays the food group of sausages and luncheon meats, while Figures 28B and 28C display the food groups of beef products and of spices and herbs respectively, which is a better visualization than visualizing the whole data set.

Data aggregation reduces the amount of data under visualization by grouping individual items into subsets, often called 'aggregates', for which some collective characteristics can be computed. The aggregates and their characteristics (jointly called 'aggregated data') are then explored instead of the original data. For example, in parallel coordinates there is then just one polygonal line for a whole cluster, so the limitations mentioned at the beginning of this section are reduced.


Figure 28: Parallel Coordinates drawn on same data set using data selection: (A): Displays food

group of sausages and luncheon meats. (B): Displays food groups of beef products. (C): Displays

food groups of spices and herbs

Parallel coordinates might be the plot least affected by the curse of dimensionality, since they can represent as many dimensions as the screen width permits. But this too becomes a limitation for high-dimensional data, because the distance d between two adjacent axes decreases as the number of dimensions increases. As a result, the correlation between axes might not be clear in the plot. Most applications assume it is up to the user to decide which attributes should be kept in, or removed from, a visualization. This is not a good approach for a user who does not have domain knowledge; parallel coordinates themselves can be used to reduce the dimensions of the data set [31].


When discussing axis reordering in Section 3.2.4, we talked about obtaining a measure of axis similarity. Once the most similar axes are identified through that algorithm, the application can suggest that the user remove them and keep one significant axis out of each group of similar axes [31]. In that way redundant attributes can be removed from the visualization and the space can be used efficiently to represent the remaining attributes.

Parallel coordinates are a good technique for visualizing data. They support many user interactions and data analytic techniques, and even though they have limits, researchers have found many ways to overcome those limitations. Parallel coordinates are still a hot topic in data visualization research.

3.3 Radviz

The Radviz (Radial Visualization) visualization method [33] maps a set of n dimensional

data points onto a two dimensional space. All dimensions are represented by a set of equally

spaced anchor points on the circumference of a circle.

For each data instance, imagine a set of springs that connects the data point to the anchor

point for each dimension. The spring constant for the spring that connects to the ith anchor

corresponds to the value of the ith dimension of the data instance. Each data point is then

displayed where the sum of all the spring forces equals 0. All the data point values are usually

normalized to have values between 0 and 1.
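Under the spring model described above, the equilibrium position works out to a weighted average of the anchor positions, with weights given by the (normalized) dimension values. A minimal sketch, assuming min-max normalization per dimension:

import numpy as np

def radviz_positions(data):
    """Project each row of `data` (n_samples x n_dims) to 2D Radviz coordinates.

    Anchors are spaced evenly on the unit circle; the equilibrium of the
    spring model is the weighted mean of the anchors, weighted by the
    normalized dimension values.
    """
    data = np.asarray(data, dtype=float)
    lo, hi = data.min(axis=0), data.max(axis=0)
    w = (data - lo) / np.where(hi > lo, hi - lo, 1.0)   # normalize to [0, 1]
    n_dims = data.shape[1]
    angles = 2 * np.pi * np.arange(n_dims) / n_dims
    anchors = np.column_stack([np.cos(angles), np.sin(angles)])
    totals = w.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0                           # avoid division by zero
    return (w @ anchors) / totals

points = np.array([[1, 1, 1, 1],        # overlaps with the row below:
                   [10, 10, 10, 10],    # all dimensions pull equally
                   [9, 1, 1, 1]])
print(radviz_positions(points).round(3))

The first two rows land on the same spot, reproducing the overlap problem mentioned below for the points (1, 1, 1, 1) and (10, 10, 10, 10).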

Consider the example in Figure 29A: the data has 8 dimensions {d1, d2, ..., d8}. Each data point is connected as shown in the diagram using springs. Following this procedure for all the records in the dataset produces the Radviz display. Figure 29B shows a Radviz representation of a dataset on transitional cell carcinoma (TCC) of the bladder generated by the Clifford Lab at LSUHSC-S [34].

One major disadvantage of this method is the overlap of points. Consider the following

two points on a 4 dimensional data space, (1, 1, 1, 1) and (10, 10, 10, 10). These two data records

will overlap in a Radviz display even though they are clearly different because the dimensions

pull them both equally.


Figure 29: Radviz Visualization for multi-dimensional data. (A): Shows the set of springs and

the forces exerted by those springs on a single data point. (B): A Radviz representation for a

dataset on transitional cell carcinoma

Categorical dimensions cannot be visualized directly with Radviz and require additional preprocessing: each categorical dimension first needs to be flattened into a new dimension for each possible category. This becomes problematic as the number of possible categories increases and may lead to poor visualizations.

Another challenge in generating good visualizations with this method is identifying a

good ordering for the anchor points that correspond to the dimensions. A good ordering needs to

be found that makes it easy to identify patterns in the data. An interactive approach that allows

for changing the position of anchor points can be used to help users overcome this issue.

3.4 Mosaic Plots

Mosaic plots [35, 36] are a popular method of visualizing categorical data. They provide

a way of visualizing the counts in a multivariate n-way contingency table. The frequencies in the

contingency table are represented by a group of rectangles whose areas are proportional to the

frequency of each cell in the contingency table.

A mosaic plot starts as a rectangle. Then, at each stage of plot creation, the rectangles are split parallel to one of the two axes based on the proportions of data belonging to each category. An example of a mosaic plot is shown in Figure 30. It shows a mosaic plot for the Titanic dataset, which describes the attributes of the passengers on the Titanic and details of their survival.

Figure 30: Mosaic plot for the Titanic data showing the distribution of passenger’s survival based

on their class and sex

The process of creating a mosaic display can be described as below [37].

Let us assume that we want to construct a mosaic plot for p categorical variables X1,...,

Xp. Let ci be the number of categories of variable Xi, i = 1, . . . , p.

1. Start with one single rectangle r0 (of width w and height h), and let i = 1.

2. Cut rectangle r_(i-1) into c_i pieces: find all observations corresponding to rectangle r_(i-1), and find the breakdown for variable X_i (i.e., count the number of observations that fall into each of its categories). Split the width (height) of rectangle r_(i-1) into c_i pieces, where the widths (heights) are proportional to the breakdown, and keep the height (width) of each piece the same as r_(i-1). Call these new rectangles r_i^j, with j = 1, ..., c_i.

3. Increase i by 1.

4. While i <= p, repeat steps 2 and 3 for all r_(i-1)^j with j = 1, ..., c_(i-1).

In standard mosaic plots the rectangle is divided both horizontally and vertically. A variation of mosaic plots that only divides the rectangle horizontally, called Double Decker plots [38], has been proposed; these can be used to visualize association rules. An example of a double decker plot is shown in Figure 31 for the same data as in Figure 30. There are other variations of mosaic plots, such as fluctuation diagrams, that try to increase their usability.

Figure 31: Double decker plot for the Titanic data showing the distribution of passenger’s

survival based on their class and sex

Mosaic plots are an interesting visualization technique for categorical data, but they cannot handle continuous data. To display continuous data using a mosaic plot, the data first needs to be converted to categorical form through a process such as binning. Mosaic plots require the visual comparison of rectangles and their sizes to understand the data, and this becomes complicated as the number of rectangles grows and the distance between the rectangles being compared increases, so they become harder to interpret and understand. Vastly different aspect ratios of the rectangles also compound the difficulty of comparing their sizes.

Another issue with mosaic plots is that they become more complex as the number of dimensions in the data increases. Each additional dimension requires the rectangles to be split again, which at least doubles the possible number of rectangles, leading to a final visualization that is not very user friendly.

3.5 Self Organizing Maps

Self-organizing maps (SOMs) [39] are a type of neural network that has been used widely in data exploration and visualization, among many other uses. SOMs use an unsupervised learning algorithm to perform a topology-preserving mapping from a high-dimensional data space to a lower-dimensional map (usually a two-dimensional lattice). The mapping preserves the topology of the high-dimensional data space such that data points lying near each other in the original multidimensional space map to nearby units in the output space.

Generating a self-organizing map consists of training a set of neurons with the dataset. At each step of the training, an input data item is matched against the neurons, and the closest one is chosen as the winner. Then the weights of the winner and of its neighborhood are updated to reinforce this behavior. The final result is a topology-preserving ordering in which similar new data entries match to neurons near each other.
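The training loop just described can be sketched in a few lines; this is a bare-bones 1D-lattice SOM with a fixed learning rate and Gaussian neighborhood, chosen for brevity rather than taken from [39]:

import numpy as np

def train_som(data, n_units=10, epochs=20, lr=0.3, sigma=2.0, seed=0):
    """Train a 1D self-organizing map and return the unit weight vectors."""
    rng = np.random.default_rng(seed)
    dim = data.shape[1]
    weights = rng.uniform(data.min(0), data.max(0), size=(n_units, dim))
    grid = np.arange(n_units)
    for _ in range(epochs):
        for x in rng.permutation(data):
            # Winner: the unit whose weight vector is closest to the input.
            bmu = np.argmin(np.linalg.norm(weights - x, axis=1))
            # Neighborhood: units near the winner on the lattice move more.
            influence = np.exp(-((grid - bmu) ** 2) / (2 * sigma ** 2))
            weights += lr * influence[:, None] * (x - weights)
    return weights

# Two well-separated blobs; nearby lattice units should specialize per blob.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0, 0.3, size=(100, 2)),
                  rng.normal(5, 0.3, size=(100, 2))])
print(train_som(data).round(2))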

Figure 32: Training a self-organizing map. For each data item, the closest neuron is selected

using some distance metric

An example of a self-organizing map is shown in Figure 33, which shows a map trained on the poverty levels of countries [40]. As can be seen, countries with similar poverty levels are matched to neurons close to each other: the USA, Canada and other countries with low poverty are together in the yellow and green areas, while countries such as Afghanistan and Mali, which have high poverty levels, are grouped together in the purple areas. This shows the topology-preserving aspect of SOMs.

Figure 33: A self-organizing map trained on the poverty levels of countries


There are some challenges with using self-organizing maps for multidimensional data

visualization.

1. SOMs are not unique. The same data can lead to widely different outcomes depending on the initialization of the SOM, so the same data may yield different visualizations and lead to confusion.

2. While similar data points are grouped together in a SOM, similar groups are not guaranteed to be close to each other. Some SOMs may place similar groups in multiple places in the map.

3. SOMs are not very user friendly compared with other visualization techniques. It is not easy to look at a SOM and interpret the data.

4. The process of creating a SOM is computationally expensive, and the computational requirements grow as the dimensionality of the data increases. For modern data sources that are highly complex and detailed, this becomes a major drawback.

3.6 Sunburst Visualization

The Sunburst technique, like the Tree Map [44], is a space-filling visualization, but it uses a radial rather than a rectangular layout to visualize hierarchical information [43]. It is comparable to nested pie charts. It can be used to show hierarchical information such as the elements of a decision tree; this compact visualization avoids the problem of decision trees getting too wide to fit the display area. It is akin to visualizing the tree in a top-down manner: the center represents the root of the decision tree and the ring around it its children.

In Sunburst, the top of the hierarchy is at the center and deeper levels lie farther away from the center. The angle swept out by an item and its color correspond to some attributes of the data. For instance, in a visualization of a file system, the angle may correspond to the file/directory size and the color may correspond to the file type. An example Sunburst display is shown in Figure 34; this visualization has been used to summarize user navigation paths through a website [41]. Further, this visualization has been used to visualize frequent item sets [42].


Figure 34: A sunburst visualization summarizing user paths through a fictional e-commerce site.

The inner ring represents the first event in the visit (showing here, for example, that most visits

start on the homepage and approximately one-third start on a product page). The outer rings

represent the subsequent events.

3.7 Trellis Visualization

A Trellis chart, also known as Small Multiples [45], Panel Chart, Lattice Chart or Grid Chart, is a layout of smaller charts in a grid with consistent scales. Each smaller chart represents an item in a category, named "conditions" [48]; the data displayed on each smaller chart is conditional on items in the category. Trellis charts are useful for finding structure and patterns in complex data. The grid layout looks similar to a garden trellis, hence the name.


Figure 35: Trellis Chart for a data set on sales

The main aspects of trellis displays are columns, rows, panels and pages [46]. Figure 35 consists of 4 columns, 1 row, 4 panels and 1 page. Trellised visualizations enable the user to quickly recognize similarities or differences between different categories in the data. Each individual panel in a trellis visualization displays a subset of the original data table, where the subsets are defined by the categories available in a column or hierarchy. To make plots comparable across rows and columns, the same scales are used in all the panel plots [47].
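A minimal sketch of a trellis layout with pandas and matplotlib, assuming a sales-like table with a single conditioning column (the column names are invented for illustration):

import matplotlib.pyplot as plt
import pandas as pd

def trellis(df, condition, x, y, ncols=4):
    """Draw one small chart per category of `condition`, sharing x and y scales."""
    groups = list(df.groupby(condition))
    nrows = -(-len(groups) // ncols)                   # ceiling division
    fig, axes = plt.subplots(nrows, ncols, sharex=True, sharey=True,
                             figsize=(3 * ncols, 2.5 * nrows), squeeze=False)
    for ax, (name, part) in zip(axes.ravel(), groups):
        ax.plot(part[x], part[y], marker="o")
        ax.set_title(str(name))
    for ax in axes.ravel()[len(groups):]:
        ax.set_visible(False)                          # hide unused panels
    return fig

sales = pd.DataFrame({"region": ["North", "North", "South", "South",
                                 "East", "East", "West", "West"],
                      "month": [1, 2, 1, 2, 1, 2, 1, 2],
                      "sales": [10, 12, 7, 9, 14, 13, 5, 8]})
trellis(sales, "region", "month", "sales").savefig("trellis.png", dpi=150)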

The benefits of trellis charts are:

● They are easy to understand. A Trellis Chart is a basic chart type repeated many times. If

you understand the basic chart type, you can understand the whole Trellis Chart.

● Having many small charts enables you to view complex multi-dimensional data in a flat

2D layout avoiding the need for confusing 3D charts.

● The grid layout combined with consistent scales makes data comparison simple. Just look

up/down or across the charts.

Figure 36 contains a trellis chart for the Minnesota barley data from The Design of Experiments [59] by R.A. Fisher. The trial involved planting 10 varieties of barley at 6 different sites over two different years, and the researchers measured the yield in bushels per acre for each of the 120 possibilities.


Figure 36: Minnesota Barley Data Trellis Chart

3.8 Grand Tour

The grand tour is one of the tour methods used to find structure in multidimensional data, and it can be applied to show multidimensional data on a 2D computer display. A tour is a subset of all the possible projections of the multidimensional data; the different tour methods combine several static projections, using different interpolation techniques, into a movie which is called a tour [50].

3.8.1 Tours

In a static projection, some of the information in the dataset is lost to the viewer. If several projections onto different planes are shown step by step, however, the viewer can build up an overview of the structure of the multivariate data.


Tours provide a general approach to choosing and viewing data projections, allowing the viewer to mentally connect disparate views and thus supporting the exploration of a high-dimensional space.


Figure 37: (A) A scatterplot of a multidimensional data set (census data [49]); the data is mapped to coordinates in a multidimensional space. (B) A snapshot of the grand tour: a projection of the data onto a single plane.

3.8.2 Tour methods

● Grand Tour - shows all projections of the multivariate data over time by a random walk through the landscape of projections.

● Projection Pursuit (PP) guided tour - the tour concentrates on more interesting views, as ranked by a PP index.

● Manual Control - the user decides which tour direction to take.

The grand tour method chooses the target plane by random selection: a frame is randomly selected from the space of all possible projections. A target frame is chosen by standardizing a random vector drawn from a standard multivariate normal distribution: sample p values from a standard univariate normal distribution, which yields a sample from a standard multivariate normal. Standardizing this vector to have length one gives a random point on the (p−1)-dimensional sphere, that is, a randomly generated projection vector. Doing this twice gives a 2D projection, where the second vector is orthonormalized against the first. Figure 38 illustrates the tour path.

Figure 38: Grand tour path in 3D space.

The solid circle in Figure 38 indicates the first point on the tour path corresponding to the

starting frame. The solid square indicates the last point in the tour path, or the last projection

computed. Each point corresponds to a projection from 3 dimensions to one dimension; the projection looks as if the data space is viewed from that direction. In the grand tour, each such point is chosen randomly.
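A minimal sketch of this random target-frame selection, assuming p-dimensional data and using NumPy, could look as follows; the Gram–Schmidt step orthonormalizes the second vector against the first.

import numpy as np

def random_target_frame(p, rng=np.random.default_rng()):
    """Randomly choose a 2D projection frame for a p-dimensional grand tour step."""
    # Sample p values from a standard univariate normal and normalize to length 1:
    # this gives a uniformly random direction on the (p-1)-sphere.
    v1 = rng.standard_normal(p)
    v1 /= np.linalg.norm(v1)
    # Second random direction, orthonormalized against the first (Gram-Schmidt).
    v2 = rng.standard_normal(p)
    v2 -= (v2 @ v1) * v1
    v2 /= np.linalg.norm(v2)
    return np.column_stack([v1, v2])        # p x 2 projection matrix

# Usage: project the data onto the randomly chosen plane.
data = np.random.default_rng(0).normal(size=(100, 5))   # hypothetical 5D data
frame = random_target_frame(data.shape[1])
projected = data @ frame                                 # one 2D snapshot of the tour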


4. CEP Rule generation

Recent advances in technology have enabled the generation of vast amounts of data in a wide range of fields. This data is created continuously, in large quantities, over time, as data streams. Complex Event Processing (CEP) can be used to analyze and process these large data streams to identify interesting situations and respond to them as quickly as possible. Complex event processors are used in almost every domain: vehicular traffic analysis, network monitoring, sensor data analysis [51], stock market trend analysis [52], and fraud detection [53]. Any system that requires real-time monitoring can use a complex event processor.

In CEP, the processing takes place according to user-defined rules, which specify the

relations between the observed events and the actions required by the user. For example, in a network monitoring system a complex event processor can be used to notify the system administrator about excessive internet usage by a user on that particular network. An example rule looks like this:

from currentsums[bandwidth>100000]

select User_IP

insert into shouldNotify;

Here, if a user's bandwidth exceeds the limit (100000 in this example), the admin will receive a notification. The value of the limit should be low enough to catch heavy usage, yet high enough to ignore normal users.

Any complex event processing rule will have a condition to check, and an action

associated with that condition. So regardless of the domain, any system using a CEP heavily

depends on the rules defined by the user.

In current complex event processing applications, users need to manually specify the

rules that are used to identify and act on important patterns in the event streams. This is a

complex and arduous task that is time-consuming, involves a lot of trial and error, and typically requires domain-specific information that is hard to identify accurately.

Rule writing is therefore typically done by domain experts who study the parameters available in the event streams, manually or using external data analysis tools, to identify the events that need to be specially handled. Needless to say, incorrect estimation of the relevant parameters in the rules negatively impacts the utility of the systems that depend on accurate processing of these events. Even for domain experts, manually specifying textual rules in a CEP-specific rule language is not a very user-friendly experience. Moreover, keeping a rule working as data and behavior change over time may require periodic updates that demand as much effort as was spent initially.

Several approaches [54, 55, 56] have been proposed to overcome these difficulties using

data mining and knowledge discovery techniques to generate rules based on available data. These give users the ability to automatically generate rules based on their requirements.

Two such approaches are discussed below. One is a framework that learns, from historical traces, the hidden causality between the received events and the situations to detect, and uses it to automatically generate CEP rules [54]. The other starts from a skeleton of the rule and uses historical traces to tune the parameters of the final rule [55].

4.1 iCEP

iCEP [54] analyzes historical traces and learns from them. It adopts a highly modular

design, with different components considering different aspects of the rule.

The following terminology and definitions are used in the framework.

Each event notification is assumed to be characterized by a type and a set of attributes.

The event type defines the number, order, names, and types of the attributes that compose the

event itself. It is also assumed that events occur instantaneously at some points in time.

Accordingly, each notification includes a timestamp, which represents the time of occurrence of

the event it encodes. The authors use the following example event of type ‘Temp’:

Temp@10(room=123, value=24.5)

This event encodes the fact that the air temperature measured inside room 123 at time 10 was 24.5 °C.

Another aspect of the terminology used by the authors is the distinction between primitive and composite events. Simple events like the one given above are considered primitive events. A composite event is defined using a pattern of primitive events: when such a pattern is identified, the CEP engine derives that a composite event has occurred and notifies the interested components. An event trace that ends with the occurrence of the composite event is called a positive event trace.

The iCEP framework uses the following basic building blocks, found in most CEP systems, to generate filters for events.

➔ Selection: filters relevant event notifications according to the values of their attributes.

➔ Conjunction: combines event notifications together

➔ Parameterization: introduces constraints involving the values carried by different events.

➔ Sequence: introduces ordering relations among events.

➔ Window: defines the maximum timeframe of a pattern.

➔ Aggregation: introduces constraints involving aggregated values.

iCEP uses a set of modules that generate a combination of the above building blocks to form CEP rules. The framework uses a training data set created from historical traces and a supervised learning technique to generate rules.

The learning method is based on the following intuition.

Consider the following positive event trace

ε1 : A@0, B@2, C@3

This implies the following set of constraints Sε1

- A: an event of type A must occur

- B: an event of type B must occur

- C: an event of type C must occur

- A→B: the event of type A must occur before that of type B

- A→C: the event of type A must occur before that of type C

- B→C: the event of type B must occur before that of type C

We can assert that for each rule r and event trace ε, r fires if and only if Sr ⊆ Sε, where Sr is the complete set of constraints that must be satisfied for the rule to fire. Using these considerations, the problem of rule generation can be expressed as the problem of identifying Sr. Given a single positive trace ε, Sε can be considered an over-constraining approximation of Sr. To produce a better approximation of Sr we can consider all positive traces collectively and take the conjunction (intersection) of the sets of constraints they generate.
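A minimal sketch of this intuition, with hypothetical traces already converted to constraint sets, simply intersects the sets of all positive traces:

from functools import reduce

# Hypothetical positive traces, each already turned into its set of constraints
# (event-type and ordering constraints, as in the example above).
positive_constraint_sets = [
    {"A", "B", "C", "A->B", "A->C", "B->C"},                                  # e1: A@0, B@2, C@3
    {"A", "B", "C", "D", "A->B", "A->C", "B->C", "A->D", "B->D", "C->D"},
    {"A", "B", "C", "E", "E->A", "A->B", "A->C", "B->C", "E->B", "E->C"},
]

# Approximation of Sr: the constraints shared by every positive trace.
approx_rule = reduce(set.intersection, positive_constraint_sets)
print(approx_rule)   # {'A', 'B', 'C', 'A->B', 'A->C', 'B->C'}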

Using these intuitions, the iCEP framework follows these steps when generating rules:

1. Determine the relevant timeframe to consider (window size)

2. Identify the relevant event types and attributes

3. Determine the selection and parameter constraints

4. Discover ordering constraints (sequences)

5. Identify aggregate and negation constraints.

Figure 39: Structure of the iCEP framework

The final structure of the framework is shown in Figure 39. The problem is broken down into sub-problems that are solved by different modules (described below) working together.

● Event Learner: The event learner tries to determine which primitive event types are

required for the composite event to occur. It considers the window size as an optional

input parameter. It cuts each positive trace so that it ends with the occurrence of the composite event. For each

positive trace, the event learner extracts the set of event types it contains. Then, according

to the general intuition described above, it computes and outputs the intersection of all

these sets.

● Window Learner: The window learner is responsible for learning the size of the window that includes all primitive events required for a composite event. If the required event types are known, the window learner tries to identify a window size that ensures all required primitive events are present in all positive traces. If the required event types are not known, the window learner and event learner use an iterative approach in which increasing window sizes are fed to the event learner until the required accuracy in the rule is reached (a simplified sketch of this loop is given after this list).


● Constraint Learner: This module receives the filtered event traces from the above two modules and tries to identify possible constraints on the parameters. For each parameter it first looks for an equality constraint, where all positive traces contain a single value; failing that, it generates an inequality constraint that accepts values between the minimum and maximum values observed across all positive traces.

● Aggregate Learner: As shown in Figure 39, the aggregate learner runs in parallel with

the constraint learner. Instead of looking for single-value constraints, the aggregate learner applies aggregation functions such as ‘sum’ and ‘average’ over all events of a certain type within the time window to generate constraints.
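A highly simplified sketch of the iterative interaction between the window learner and the event learner might look as follows; accuracy is a hypothetical callback that evaluates the candidate rule, and traces are assumed to be lists of (type, timestamp) pairs ending with the composite event.

def learn_events(positive_traces, window):
    # Event learner: intersect the event types observed within the last `window`
    # time units of every positive trace.
    type_sets = []
    for trace in positive_traces:
        end = trace[-1][1]
        type_sets.append({etype for etype, ts in trace if ts >= end - window})
    return set.intersection(*type_sets)

def learn_window_and_events(positive_traces, candidate_windows, accuracy, target=0.9):
    # Window learner: try increasing window sizes until the rule built from the
    # intersected event types reaches the target accuracy on held-out traces.
    for window in sorted(candidate_windows):
        events = learn_events(positive_traces, window)
        if accuracy(events, window) >= target:
            return window, events
    return None, set()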

The other modules in the framework use the same methods to identify different aspects of the rule.

The effectiveness of the framework has been assessed using the following steps.

1. Use an existing rule created by a domain expert that identifies a set of composite

events in a data stream and collect the positive traces.

2. Use iCEP with the data collected in the above step to generate a rule

3. Run the data again through the CEP with the generated rule and capture the

composite events triggered.

4. Compare the two versions and calculate precision and recall

The results have been promising with a precision of around 94% based on some of the

tests that were run by the authors. But the system is far from perfect, and the following are some of the challenges that need to be overcome.

1. A large training dataset with many positive traces is required to generate good rules with high precision. The training methodology considers only the conjunction of all the positive traces to generate rules, so without a large number of positive traces that cover the variations in the data, generating accurate rules is difficult.

2. High computational requirements. The iterative approach used with the window learner and event learner translates into a large amount of computation, so without hints from a domain expert on the window size or the required events and parameters, the runtime and computational cost increase rapidly.


3. The generated rules require tuning and cleanup from the user. As the rules are generated automatically, the constraints may be over-constraining or may contain mistakes when applied to previously unseen conditions, so they require a final cleanup by the users.

4.2 Tuning rule parameters using the Prediction-Correction Paradigm

A mechanism has been proposed by Turchin et al. to automate both the initial definition of the rules and their update over time [55]. It consists of two main repetitive stages, namely rule parameter prediction and rule parameter correction. Parameter prediction updates the parameters using available expert knowledge about how the parameters are expected to change. Rule parameter correction uses expert feedback about the actual past occurrence of events, together with the events materialized by the CEP framework, to tune the rule parameters.

For example, in an intrusion detection system [57] a domain expert can specify a rule as follows: "If the size of a packet received from a user deviates strongly from the 'normal' packet size, with estimated mean m1 and standard deviation σ1, infer an event E1 representing the anomaly level of the packet size." It is hard to determine the values for m1 and σ1, and moreover the specified values can change over time due to the dynamic nature of network traffic.
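In code, such a rule reduces to a simple deviation test whose parameters are exactly what the tuning mechanism has to supply and keep up to date; the values and the threshold below are purely illustrative.

# Rule parameters the expert can only estimate initially and that drift over time.
m1, sigma1 = 512.0, 120.0     # assumed "normal" packet size and its standard deviation

def anomaly_level(packet_size, m=m1, sigma=sigma1):
    # Number of standard deviations by which the packet deviates from "normal".
    return abs(packet_size - m) / sigma

def infer_E1(packet_size, threshold=3.0):
    # Materialize event E1 when the deviation is high; the threshold is illustrative.
    return anomaly_level(packet_size) > threshold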

Rule parameter determination and tuning can then be done as follows: given a set of rules, provide initial values for the rule parameters and then modify them as required. For example, for a given rule the tuning algorithm might suggest replacing the value m1 with a value m2 such that m2 < m1. The initial prediction of m1 can be treated as a special case of tuning in which an arbitrary value is corrected to m1 by the tuning algorithm. The rule tuning algorithm should be tied to the ability of the system to correctly predict events, so that it can detect, for instance, that the parameter m1 is too high, that many intrusions were consequently missed, and that it therefore needs to be reduced to m2.


Figure 40: Prediction-Correction Paradigm

The proposed framework is based on the Kalman estimator, a simple type of supervised, Bayesian, predict-correct estimator [18]. As shown in Figure 40, the framework learns and updates the system state in two stages, namely rule parameter prediction and rule parameter update. Rule parameter prediction is unsupervised: the parameters are updated without any user feedback, relying on preexisting knowledge about how the parameters might change over time and on the events created by the inference algorithm. In the rule parameter update stage, the parameters are tuned in a supervised manner, using the domain expert's feedback and the recently generated events, to produce the parameters for the next stage. User feedback can take two forms, direct and indirect. Direct feedback involves changes to the system state, while indirect feedback provides an assessment of the correctness of the estimated event history.

4.2.1 Model

The model of this method consists of events, rules, and the system. Here an event is a significant (i.e., of interest to the system) actual occurrence; examples include a notification of a login attempt or the failure of an IT component. We can therefore define an event history h as the set of all events of interest to the system, together with their associated data, and an event notification as an estimation of the occurrence of an event. Some events may not be notified and some non-occurring events may be notified, for example because of faulty equipment, so we can also define an estimated event history h' containing the notified events. Events can be of two types: explicit events and inferred events. Explicit events are signaled by event sources; for example, a new network connection request is an explicit event. Inferred events are events materialized by the system based on other events; for example, an illegal connection attempt event is an inferred event materialized by the network security system based on the explicit event of a new network connection and an inferred event of an unsuccessful user authorization. Inferred events, just like explicit events, belong to event histories: inferred events that actually occurred in the real world belong to the event history h, and those that are only estimated to have occurred belong to the estimated event history h'.

Events can be inferred by rules. A rule can be represented by a quadruple r = <sr, pr, ar, mr>. sr is a selection function that filters events according to rule r; its input is an event history h, and the events it selects are said to be relevant events. pr is a predicate, defined over a filtered event history, determining when events become candidates for materialization. ar is an association function, which defines how many events should be materialized, as well as which subsets of the selectable events are associated with each materialized event. mr is a mapping function that determines the attribute values of the materialized events.
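Read schematically, the quadruple is a small data structure holding four functions; the sketch below is only an illustrative interpretation of the definition, with hypothetical type signatures.

from dataclasses import dataclass
from typing import Any, Callable, Dict, List

EventHistory = List[Dict[str, Any]]      # each event is a record of attribute values

@dataclass
class Rule:
    select: Callable[[EventHistory], EventHistory]            # s_r: filter the relevant events
    predicate: Callable[[EventHistory], bool]                 # p_r: when to materialize
    associate: Callable[[EventHistory], List[EventHistory]]   # a_r: event groups per materialization
    map_attrs: Callable[[EventHistory], Dict[str, Any]]       # m_r: attributes of the inferred event

    def apply(self, history: EventHistory) -> List[Dict[str, Any]]:
        relevant = self.select(history)
        if not self.predicate(relevant):
            return []
        return [self.map_attrs(group) for group in self.associate(relevant)]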

4.2.2 System State

It is expected that an expert can provide the form of sr, pr, ar, and mr, but providing accurate values for them is difficult. These values are called rule parameters, and the set of all parameters is called the system state. The system state is updated as shown in Figure 40. In the predict stage, the parameters are updated using knowledge of how the rule might change over time and the updated event history h. In the update stage, the parameters are corrected either by direct feedback, where the exact rule parameter values are given, or in an indirect manner, where events in the estimated event history h' are marked according to whether they actually occurred or not.

4.2.3 Rule Tuning Mechanism

In order to tune the rule parameters, the framework uses the discrete Kalman filter technique. The filter estimates the process state at some time and then obtains feedback in the form of (noisy) measurements.
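For reference, the standard discrete Kalman filter [18] on which the framework builds alternates between a predict step and a correct step; in this setting the state vector x_k collects the rule parameters and the measurement z_k is derived from expert feedback (the notation below is the textbook form, not necessarily that of [55]):

Predict:  \hat{x}_{k|k-1} = A\,\hat{x}_{k-1|k-1}, \qquad P_{k|k-1} = A\,P_{k-1|k-1}\,A^{\top} + Q

Correct:  K_k = P_{k|k-1}H^{\top}\,(H\,P_{k|k-1}\,H^{\top} + R)^{-1}

          \hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k\,(z_k - H\,\hat{x}_{k|k-1}), \qquad P_{k|k} = (I - K_k H)\,P_{k|k-1}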


The rule tuning model consists of two recursive equations: a time equation, which shows how the parameters change over time, and a history equation, which shows the outcome of a set of rules and their parameters. The time equation is a function of the previous system state (the set of rule parameters) and the actual event history of that time period, and its output is the current system state. The history equation is a function of the current set of rule parameters, the set of explicit events during that time period, and the actual event history of the previous time period, and its output is the actual event history. Since the current system state is not known, a third equation, the estimated event history equation, is used; it differs from the history equation in using the estimated current system state (the estimated set of rule parameters), and its output is the estimated current event history. This estimate can be used to evaluate the performance of the inference mechanism.

Performance evaluation is based on comparing the estimated event history produced by the inference mechanism with the actual event history provided by expert feedback at the end of time interval k. From this comparison we can compute precision and recall. Precision is the percentage of correctly inferred events relative to the total number of events inferred in the time interval. Recall is the percentage of correctly inferred events (i.e., true positives) relative to the total number of events that actually occurred in the time interval.
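In the usual notation, with TP the correctly inferred events, FP the inferred events that did not actually occur, and FN the actual events that were not inferred, these measures are:

precision = TP / (TP + FP),        recall = TP / (TP + FN)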

Figure 41: An overview of the rule tuning method

The Rule Tuning Method consists of a repetitive sequence of actions that should be

performed for correct evaluation and dynamic update of rule parameters. The sequence is

illustrated in Figure 41.


The above model is a generic model for automating rule parameter tuning in CEP systems. Furthermore, it serves as a proof of concept of automatic rule parameter tuning in situations where doing so manually becomes a cognitive challenge. Because the model is generic, an actual implementation requires considerable work and tailoring to the specific application (such as the intrusion detection example mentioned here). Nevertheless, given the promising results of the empirical study, the model can serve as a theoretical basis for any such work.


References

1. Wong, Pak Chung, and R. Daniel Bergeron. "30 Years of Multidimensional Multivariate

Visualization." In Scientific Visualization, pp. 3-33. 1994.

2. Jolliffe, Ian. Principal component analysis. John Wiley & Sons, Ltd, 2005.

3. Tufte, E. R., & Graves-Morris, P. R. (1983). The visual display of quantitative

information (Vol. 2). Cheshire, CT: Graphics press.

4. Data-Ink Ratio. [ONLINE] Available at: http://www.infovis-wiki.net/index.php/Data-

Ink_Ratio. [Last Accessed 5 Nov. 2014].

5. Lie Factor. [ONLINE] Available at: http://www.infovis-

wiki.net/index.php?title=Lie_Factor. [Last Accessed 5 Nov. 2014].

6. Keim, D. A. (2002). Information visualization and visual data mining. Visualization and

Computer Graphics, IEEE Transactions on, 8(1), 1-8.

7. Asimov, D. (1985). The grand tour: a tool for viewing multidimensional data. SIAM

Journal on Scientific and Statistical Computing, 6(1), 128-143.

8. Bier, E. A., Stone, M. C., Pier, K., Buxton, W., & DeRose, T. D. (1993, September).

Toolglass and magic lenses: the see-through interface. In Proceedings of the 20th annual

conference on Computer graphics and interactive techniques (pp. 73-80). ACM.

9. Spoerri, A. (1995). InfoCrystal, a visual tool for information retrieval (Doctoral

dissertation, Massachusetts Institute of Technology).

10. Seo, J., & Shneiderman, B. (2005). A rank-by-feature framework for interactive

exploration of multidimensional data. Information Visualization, 4(2), 96-113.

11. Inselberg, A., & Dimsdale, B. (1987). Parallel coordinates for visualizing multi-

dimensional geometry (pp. 25-44). Springer Japan.

12. Pearson, K. (1895). Note on regression and inheritance in the case of two parents.

Proceedings of the Royal Society of London, 58(347-352), 240-242.

13. Knorr, E. M., Ng, R. T., & Tucakov, V. (2000). Distance-based outliers: algorithms and

applications. The VLDB Journal—The International Journal on Very Large Data Bases,

8(3-4), 237-253.

14. Breunig, M. M., Kriegel, H. P., Ng, R. T., & Sander, J. (2000, May). LOF: identifying

density-based local outliers. In ACM Sigmod Record (Vol. 29, No. 2, pp. 93-104). ACM.

15. Elmqvist, N., Dragicevic, P., & Fekete, J. D. (2008). Rolling the dice: Multidimensional

visual exploration using scatterplot matrix navigation. Visualization and Computer

Graphics, IEEE Transactions on, 14(6), 1539-1148.

16. Ullman, S. (1979). The interpretation of visual motion. Massachusetts Inst of Technology

Pr.

17. Im, J. F., McGuffin, M. J., & Leung, R. (2013). Gplom: The generalized plot matrix for

visualizing multidimensional multivariate data. Visualization and Computer Graphics,

IEEE Transactions on, 19(12), 2606-2614.


18. R. Kalman. A new approach to linear filtering and prediction problems. Journal of Basic

Engineering, 82(1):35 – 45, 1960.

19. A. Inselberg and B. Dimsdale. Parallel Coordinates: A Tool for Visualizing Multi-

dimensional Geometry , 1990

20. Savoska, S., & Loskovska, S. (2009, November). Parallel Coordinates as Tool of

Exploratory Data Analysis. In 17th Telecommunications Forum TELFOR, Belgrade,

Serbia (pp. 24-26).

21. Chen, C. H., Härdle, W., & Unwin, A. (2008). Handbooks of Computational Statistics:

Data Visualization. 164 - 174

22. Hauser, H., Ledermann, F., & Doleisch, H. (2002). Angular brushing of extended parallel

coordinates. In Information Visualization, 2002. INFOVIS 2002. IEEE Symposium on

(pp. 127-130). IEEE.

23. Martin, A. R., & Ward, M. O. (1995, October). High dimensional brushing for interactive

exploration of multivariate data. In Proceedings of the 6th Conference on

Visualization'95 (p. 271). IEEE Computer Society.

24. Heinrich, J., & Weiskopf, D. (2012). State of the art of parallel coordinates. In

Eurographics 2013-State of the Art Reports (pp. 95-116). The Eurographics Association.

25. Lu, L. F., Huang, M. L., & Huang, T. H. (2012, December). A new axes re-ordering

method in parallel coordinates visualization. In Machine Learning and Applications

(ICMLA), 2012 11th International Conference on (Vol. 2, pp. 252-257). IEEE.

26. Fua, Y. H., Ward, M. O., & Rundensteiner, E. A. (1999, October). Hierarchical parallel

coordinates for exploration of large datasets. In Proceedings of the conference on

Visualization'99: celebrating ten years (pp. 43-50). IEEE Computer Society Press.

27. Yuan Luo, Daniel Weiskopf, Member, IEEE Computer Society, Hao Zhang, Member,

IEEE Computer Society, and Arthur E. Kirkpatrick : Cluster Visualization in Parallel

Coordinates Using Curve Bundles

28. Heinrich, J., Luo, Y., Kirkpatrick, A. E., Zhang, H., & Weiskopf, D. (2011). Evaluation

of a bundling technique for parallel coordinates. arXiv preprint arXiv:1109.6073.

29. Zhou, H., Yuan, X., Qu, H., Cui, W., & Chen, B. (2008, May). Visual clustering in

parallel coordinates. In Computer Graphics Forum (Vol. 27, No. 3, pp. 1047-1054).

Blackwell Publishing Ltd.

30. Andrienko, G., & Andrienko, N. (2005). Blending aggregation and selection: Adapting

parallel coordinates for the visualization of large datasets. The Cartographic Journal,

42(1), 49-60.

31. Artero, A. O., de Oliveira, M. C. F., & Levkowitz, H. (2006, July). Enhanced high

dimensional data visualization through dimension reduction and attribute arrangement.

In Information Visualization, 2006. IV 2006. Tenth International Conference on (pp. 707-

712). IEEE.


32. Forina, M., Armanino, C., Lanteri, S. and Tiscornia, E. (1983). Classification of olive oils

from their fatty acid composition, in H. Martens and H. Russwurm (eds), Food Research

and Data Analysis, Applied Science Publishers, London UK, pp. 189-214

33. Hoffman, Patrick, Georges Grinstein, Kenneth Marx, Ivo Grosse, and Eugene Stanley.

"DNA visual and analytic data mining." In Visualization'97., Proceedings, pp. 437-441.

IEEE, 1997.

34. R. Stone II, A.L. Sabichi, J. Gill, I.Lee, R. Loganatharaj, M. Trutschl, U. Cvek, J.L.

Clifford. Identification of genes involved in early stage bladder cancer progression

[Unpublished].

35. Hartigan, J. A., and Kleiner, B. (1981). Mosaics for contingency tables. In W. F. Eddy

(Ed.), Computer Science and Statistics: Proceedings of the 13th Symposium on the

Interface. New York: Springer-Verlag.

36. Friendly, M. (2002). A brief history of the mosaic display. Journal of Computational and

Graphical Statistics, 11(1).

37. Hofmann, H. (2008). Mosaic plots and their variants. In Handbook of data visualization

(pp. 617-642). Springer Berlin Heidelberg.

38. Hofmann, H., Siebes, A. P., & Wilhelm, A. F. (2000, August). Visualizing association

rules with interactive mosaic plots. In Proceedings of the sixth ACM SIGKDD

international conference on Knowledge discovery and data mining (pp. 227-235). ACM.

39. Kohonen, T. (1990). The self-organizing map. Proceedings of the IEEE, 78(9), 1464-

1480.

40. Kaski, S., & Kohonen, T. (1996). Exploratory data analysis by the self-organizing map:

Structures of welfare and poverty in the world. In Neural networks in financial

engineering. Proceedings of the third international conference on neural networks in the

capital markets.

41. K. Rodden, “Applying a sunburst visualization to summarize user navigation sequences”,

IEEE Comput. Graph. Appl. Mag., Vol. 34, iss. 5, pp. 36-40, Sept.-Oct. 2014.

42. Keim, D. A., Schneidewind, J., & Sips, M. (2005). Fp-viz: Visual frequent pattern

mining. Bibliothek der Universität Konstanz.

43. J. Stasko. SunBurst [Online]. Available: http://www.cc.gatech.edu/gvu/ii/sunburst/

44. R. Vliegen, J.J. van Wijk and E.-J. van der Linden, "Visualizing Business Data with

Generalized Treemaps", IEEE Trans. Visualization and Computer Graphics, vol. 12, no.

5, pp. 789-796, Sept./Oct. 2006.

45. E. Tufte, “Small Multiples,” in Envisioning Information, Cheshire, CT: Graphics Press,

ch. 4, pp. 67-80.

46. R.A. Becker, et al., "The visual design and control of trellis display,” J. Comp. Graph.

Stat., vol. 5, iss. 2, pp. 123-155, 1996.

47. M. Theus, “High Dimensional Data Visualizations,” in C. Chen et al. Handbook of Data

Visualization, Berlin: Springer, part II, ch. 6, sec. 3, pp. 156-163.

48. What is a Trellis Chart [Online]. Available: http://trellischarts.com/what-is-a-trellis-chart


49. M.Y. Huh, K. Kiyeol. "Visualization of multidimensional data using modifications of the

Grand Tour," J. Appl. Statist.,vol. 29, no. 5, pp. 721-728,2002.

50. D. Cook et al., “Grand Tours, Projection, Pursuit Guided Tours, and Manual Controls,”

in C. Chen et al. Handbook of Data Visualization, Berlin: Springer, part III, ch. 2, pp.

296-312.

51. Broda, K., Clark, K., Miller, R., & Russo, A. (2009). SAGE: a logical agent-based

environment monitoring and control system (pp. 112-117). Springer Berlin Heidelberg.

52. A. J. Demers, J. Gehrke, M. Hong, M. Riedewald, and W. M. White. Towards expressive

publish/subscribe systems. In EDBT, pages 627–644, 2006.

53. N. P. Schultz-Møller, M. Migliavacca, and P. Pietzuch. Distributed complex event

processing with query rewriting. In DEBS, pages 4:1–4:12. ACM, 2009.

54. Margara, A., Cugola, G., & Tamburrelli, G. (2014, May). Learning from the past:

automated rule generation for complex event processing. In Proceedings of the 8th ACM

International Conference on Distributed Event-Based Systems (pp. 47-58). ACM.

55. Turchin, Yulia, Avigdor Gal, and Segev Wasserkrug. "Tuning complex event processing

rules using the prediction-correction paradigm." In Proceedings of the Third ACM

International Conference on Distributed Event-Based Systems, p. 10. ACM, 2009.

56. Mutschler, C., & Philippsen, M. (2012). Learning event detection rules with noise hidden

Markov models. In AHS (pp. 159-166).

57. Axelsson, S. (2000). Intrusion detection systems: A survey and taxonomy (Vol. 99).

Technical report.

58. A selected set of attributes for a sample of cars manufactured within 1970 to 1982.

[ONLINE] Available at: http://web.pdx.edu/~gerbing/data/cars.csv. [Last Accessed 5

Nov. 2014].

59. Fisher, R. A. (1935). The design of experiments.

