Post on 30-Nov-2015
Elysium Technologies Private Limited Singapore | Madurai | Chennai | Trichy | Coimbatore | Cochin | Ramnad |
Pondicherry | Trivandrum | Salem | Erode | Tirunelveli
http://www.elysiumtechnologies.com, info@elysiumtechnologies.com
13 Years of Experience
Automated Services
24/7 Help Desk Support
Experienced & Expert Developers
Advanced Technologies & Tools
Legitimate Member of all Journals
150,000+ Successful Records in all Languages
More than 12 Branches in Tamilnadu, Kerala & Karnataka
Ticketing & Appointment Systems
Individual Care for every Student
Around 250 Developers & 20 Researchers
227-230 Church Road, Anna Nagar, Madurai – 625020.
0452-4390702, 4392702, + 91-9944793398.
info@elysiumtechnologies.com, elysiumtechnologies@gmail.com
S.P.Towers, No.81 Valluvar Kottam High Road, Nungambakkam,
Chennai - 600034. 044-42072702, +91-9600354638,
chennai@elysiumtechnologies.com
15, III Floor, SI Towers, Melapudur main Road, Trichy – 620001.
0431-4002234, + 91-9790464324.
trichy@elysiumtechnologies.com
577/4, DB Road, RS Puram, Opp to KFC, Coimbatore – 641002
0422- 4377758, +91-9677751577.
coimbatore@elysiumtechnologies.com
Plot No: 4, C Colony, P&T Extension, Perumal Puram, Tirunelveli - 627007.
0462-2532104, +91-9677733255.
tirunelveli@elysiumtechnologies.com
1st Floor, A.R. IT Park, Rasi Color Scan Building, Ramanathapuram - 623501.
04567-223225, +91-9677704922.
ramnad@elysiumtechnologies.com
74, 2nd Floor, K.V.K Complex, Upstairs Krishna Sweets, Mettur Road,
Opp. Bus Stand, Erode - 638011. 0424-4030055, +91-9677748477.
erode@elysiumtechnologies.com
No: 88, First Floor, S.V. Patel Salai, Pondicherry - 605001.
0413-4200640, +91-9677704822.
pondy@elysiumtechnologies.com
TNHB A-Block, D.No. 10, Opp: Hotel Ganesh, Near Bus Stand, Salem - 636007.
0427-4042220, +91-9894444716.
salem@elysiumtechnologies.com
ETPL
DM-001 Fast Clustering-Based Feature Subset Selection Algorithm for High-Dimensional Data
Abstract: Feature selection involves identifying a subset of the most useful features that produces results
comparable to those obtained with the original entire set of features. A feature selection algorithm may be evaluated from both the
efficiency and effectiveness points of view. While the efficiency concerns the time required to find a subset of
features, the effectiveness is related to the quality of the subset of features. Based on these criteria, a fast
clustering-based feature selection algorithm (FAST) is proposed and experimentally evaluated in this paper.
The FAST algorithm works in two steps. In the first step, features are divided into clusters by using graph-
theoretic clustering methods. In the second step, the most representative feature that is strongly related to target
classes is selected from each cluster to form a subset of features. Because features in different clusters are relatively
independent, the clustering-based strategy of FAST has a high probability of producing a subset of useful and
independent features. To ensure the efficiency of FAST, we adopt the efficient minimum-spanning tree (MST)
clustering method. The efficiency and effectiveness of the FAST algorithm are evaluated through an empirical
study. Extensive experiments are carried out to compare FAST and several representative feature selection
algorithms, namely, FCBF, ReliefF, CFS, Consist, and FOCUS-SF, with respect to four types of well-known
classifiers, namely, the probability-based Naive Bayes, the tree-based C4.5, the instance-based IB1, and the
rule-based RIPPER, before and after feature selection. The results, on 35 publicly available real-world
high-dimensional image, microarray, and text data sets, demonstrate that FAST not only produces smaller subsets of
features but also improves the performance of the four types of classifiers.
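The two-step structure lends itself to a compact illustration. The sketch below is a hedged, simplified reading of the FAST idea, not the authors' implementation: the correlation measure, cut threshold, and tiny data set are all assumptions made for illustration. Features are clustered by cutting weak edges of a maximum-spanning tree built over pairwise feature correlations, and the feature most correlated with the target is kept from each cluster.

```python
# Simplified two-step sketch of the FAST idea (illustrative only).
from itertools import combinations

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

def fast_select(features, target, cut=0.5):
    names = list(features)
    # Step 1: maximum spanning tree over |correlation| (Kruskal).
    edges = sorted(((abs(pearson(features[a], features[b])), a, b)
                    for a, b in combinations(names, 2)), reverse=True)
    parent = {f: f for f in names}
    def find(f):
        while parent[f] != f:
            f = parent[f]
        return f
    tree = []
    for w, a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
            tree.append((w, a, b))
    # Cut weak tree edges to obtain feature clusters.
    parent = {f: f for f in names}
    for w, a, b in tree:
        if w >= cut:
            parent[find(a)] = find(b)
    clusters = {}
    for f in names:
        clusters.setdefault(find(f), []).append(f)
    # Step 2: keep the feature most relevant to the target per cluster.
    return sorted(max(c, key=lambda f: abs(pearson(features[f], target)))
                  for c in clusters.values())

features = {
    "f1": [1, 2, 3, 4, 5],
    "f2": [2, 4, 6, 8, 10],   # redundant with f1
    "f3": [5, 1, 4, 2, 3],    # weakly related noise
}
target = [1, 2, 3, 4, 5]
print(fast_select(features, target))
```

Here f1 and f2 land in one cluster (correlation 1.0), f3 in another, so one representative of the redundant pair plus f3 is returned.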
ETPL
DM-002
Graph-Based Consensus Maximization Approach for Combining Multiple Supervised
and Unsupervised Models
Abstract: Ensemble learning has emerged as a powerful method for combining multiple models. Well-known methods,
such as bagging, boosting, and model averaging, have been shown to improve accuracy and robustness over
single models. However, due to the high costs of manual labeling, it is hard to obtain sufficient and reliable
labeled data for effective training. Meanwhile, large amounts of unlabeled data exist in such applications, and we can readily
obtain multiple unsupervised models. Although unsupervised models do not directly generate a class label
prediction for each object, they provide useful constraints on the joint predictions for a set of related objects.
Therefore, incorporating these unsupervised models into the ensemble of supervised models can lead to better
prediction performance. In this paper, we study ensemble learning with outputs from multiple supervised and
unsupervised models, a topic where little work has been done. We propose to consolidate a classification
solution by maximizing the consensus among both supervised predictions and unsupervised constraints. We
cast this ensemble task as an optimization problem on a bipartite graph, where the objective function favors the
smoothness of the predictions over the graph, but penalizes the deviations from the initial labeling provided by
the supervised models. We solve this problem through iterative propagation of probability estimates among
neighboring nodes and prove the optimality of the solution. The proposed method can be interpreted as
conducting a constrained embedding in a transformed space, or a ranking on the graph. Experimental results on
different applications with heterogeneous data sources demonstrate the benefits of the proposed method over
existing alternatives.
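The propagation scheme can be illustrated with a toy sketch. The code below is a hedged simplification, not the paper's exact objective or solver: supervised models supply initial label probabilities for each object, unsupervised clusters act as group nodes, and probability estimates are averaged back and forth between the two sides until they stabilize.

```python
# Toy consensus-maximization sketch (illustrative, two classes).
def consensus(initial, groups, alpha=0.5, iters=50):
    probs = {o: list(p) for o, p in initial.items()}
    for _ in range(iters):
        # Group estimate = average over member objects.
        gprob = {g: [sum(probs[o][k] for o in members) / len(members)
                     for k in range(2)]
                 for g, members in groups.items()}
        # Object estimate = blend of supervised prior and its groups.
        for o in probs:
            mine = [g for g, m in groups.items() if o in m]
            for k in range(2):
                avg = sum(gprob[g][k] for g in mine) / len(mine)
                probs[o][k] = alpha * initial[o][k] + (1 - alpha) * avg
    return {o: max(range(2), key=lambda k: p[k]) for o, p in probs.items()}

# "d" is ambiguous on its own; the unsupervised grouping with "c"
# tips it toward class 1.
initial = {"a": [0.9, 0.1], "b": [0.6, 0.4], "c": [0.2, 0.8], "d": [0.5, 0.5]}
groups = {"g1": {"a", "b"}, "g2": {"c", "d"}}
print(consensus(initial, groups))
```

The unsupervised constraint never assigns labels itself; it only pulls objects in the same group toward a shared prediction, which is the intuition the paper formalizes on a bipartite graph.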
ETPL
DM-003
Automatic Semantic Content Extraction in Videos Using a Fuzzy Ontology and Rule-
Based Model
Abstract: The recent increase in the use of video-based applications has revealed the need for extracting the
content in videos. Raw data and low-level features alone are not sufficient to fulfill the user's needs; that is, a
deeper understanding of the content at the semantic level is required. Currently, manual techniques, which are
inefficient, subjective, time-consuming, and limited in their querying capabilities, are being used to bridge the gap
between low-level representative features and high-level semantic content. Here, we propose a semantic
content extraction system that allows the user to query and retrieve objects, events, and concepts that are
extracted automatically. We introduce an ontology-based fuzzy video semantic content model that uses
spatial/temporal relations in event and concept definitions. This metaontology definition provides a
rule-construction standard applicable across a wide range of domains, allowing the user to construct an ontology for a given domain.
In addition to domain ontologies, we use additional rule definitions (without using ontology) to lower spatial
relation computation cost and to be able to define some complex situations more effectively. The proposed
framework has been fully implemented and tested on three different domains. We have obtained satisfactory
precision and recall rates for object, event and concept extraction.
ETPL
DM-004 Clustering Sentence-Level Text Using a Novel Fuzzy Relational Clustering Algorithm
Abstract: In comparison with hard clustering methods, in which a pattern belongs to a single cluster, fuzzy clustering
algorithms allow patterns to belong to all clusters with differing degrees of membership. This is important in
domains such as sentence clustering, since a sentence is likely to be related to more than one theme or topic
present within a document or set of documents. However, because most sentence similarity measures do not
represent sentences in a common metric space, conventional fuzzy clustering approaches based on prototypes
or mixtures of Gaussians are generally not applicable to sentence clustering. This paper presents a novel fuzzy
clustering algorithm that operates on relational input data; i.e., data in the form of a square matrix of pairwise
similarities between data objects. The algorithm uses a graph representation of the data, and operates in an
Expectation-Maximization framework in which the graph centrality of an object in the graph is interpreted as a
likelihood. Results of applying the algorithm to sentence clustering tasks demonstrate that the algorithm is
capable of identifying overlapping clusters of semantically related sentences, and that it is therefore of potential
use in a variety of text mining tasks. We also include results of applying the algorithm to benchmark data sets
in several other domains.
ETPL
DM-005 Distributed Processing of Probabilistic Top-k Queries in Wireless Sensor Networks
Abstract: In this paper, we introduce the notion of sufficient set and necessary set for distributed processing of
probabilistic top-k queries in cluster-based wireless sensor networks. These two concepts have desirable
properties that facilitate localized data pruning within clusters. Accordingly, we develop a suite of algorithms,
namely, sufficient set-based (SSB), necessary set-based (NSB), and boundary-based (BB), for intercluster
query processing with bounded rounds of communications. Moreover, in responding to dynamic changes of
data distribution in the network, we develop an adaptive algorithm that dynamically switches among the three
proposed algorithms to minimize the transmission cost. We show the applicability of sufficient set and
necessary set to wireless sensor networks with both two-tier hierarchical and tree-structured network
topologies. Experimental results show that the proposed algorithms reduce data transmissions significantly and
incur only small constant rounds of data communications. The experimental results also demonstrate the
superiority of the adaptive algorithm, which achieves a near-optimal performance under various conditions.
ETPL
DM-006
Evaluating Data Reliability: An Evidential Answer with Application to a Web-Enabled
Data Warehouse
Abstract: There are many available methods to integrate information source reliability in an uncertainty
representation, but there are only a few works focusing on the problem of evaluating this reliability. However,
data reliability and confidence are essential components of a data warehousing system, as they influence
subsequent retrieval and analysis. In this paper, we propose a generic method to assess data reliability from a
set of criteria using the theory of belief functions. Customizable criteria and insightful decisions are provided.
The chosen illustrative example uses real-world data from the Sym'Previus predictive
microbiology-oriented data warehouse.
ETPL
DM-007 Large Graph Analysis in the GMine System
Abstract: Current applications have produced graphs on the order of hundreds of thousands of nodes and
millions of edges. To take advantage of such graphs, one must be able to find patterns, outliers, and
communities. These tasks are better performed in an interactive environment, where human expertise can guide
the process. For large graphs, though, there are some challenges: the excessive processing requirements are
prohibitive, and drawing hundreds of thousands of nodes results in cluttered images that are hard to comprehend. To cope with
these problems, we propose an innovative framework suited for any kind of tree-like graph visual design.
GMine integrates 1) a representation for graphs organized as hierarchies of partitions (the concepts of
SuperGraph and Graph-Tree) and 2) a graph summarization methodology (CEPS). Our graph representation
deals with the problem of tracing the connection aspects of a graph hierarchy with sublinear complexity,
allowing one to grasp the neighborhood of a single node or of a group of nodes in a single click. As a proof of
concept, the visual environment of GMine is instantiated as a system in which large graphs can be investigated
globally and locally.
ETPL
DM-008
Maximum Likelihood Estimation from Uncertain Data in the Belief Function
Framework
Abstract: We consider the problem of parameter estimation in statistical models in the case where data are
uncertain and represented as belief functions. The proposed method is based on the maximization of a
generalized likelihood criterion, which can be interpreted as a degree of agreement between the statistical
model and the uncertain observations. We propose a variant of the EM algorithm that iteratively maximizes
this criterion. As an illustration, the method is applied to uncertain data clustering using finite mixture models,
in the cases of categorical and continuous attributes.
ETPL
DM-009
Nonadaptive Mastermind Algorithms for String and Vector Databases, with Case
Studies
Abstract: In this paper, we study sparsity-exploiting Mastermind algorithms for attacking the privacy of an
entire database of character strings or vectors, such as DNA strings, movie ratings, or social network friendship
data. Based on reductions to nonadaptive group testing, our methods are able to take advantage of minimal
amounts of privacy leakage, such as contained in a single bit that indicates if two people in a medical database
have any common genetic mutations, or if two people have any common friends in an online social network.
We analyze our Mastermind attack algorithms using theoretical characterizations that provide sublinear bounds
on the number of queries needed to clone the database, as well as experimental tests on genomic information,
collaborative filtering data, and online social networks. By taking advantage of the generally sparse nature of
these real-world databases and modulating a parameter that controls query sparsity, we demonstrate that
relatively few nonadaptive queries are needed to recover a large majority of each database.
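The group-testing connection can be illustrated with a standard nonadaptive decoder. The sketch below is not the paper's attack; it shows the classic COMP decoding rule such reductions rely on, under the assumption that each query reports the OR of a chosen index subset of one sparse binary row.

```python
# COMP decoding for nonadaptive group testing (illustrative sketch):
# any index appearing in a negative test must be 0; everything else
# stays a candidate 1.
import random

def comp_decode(n, tests):
    candidate = [1] * n
    for subset, outcome in tests:
        if outcome == 0:
            for i in subset:
                candidate[i] = 0
    return candidate

random.seed(7)
secret = [0] * 20
secret[3] = secret[11] = 1          # sparse "database row"
tests = []
for _ in range(60):                 # fixed (nonadaptive) query plan
    subset = [i for i in range(20) if random.random() < 0.3]
    tests.append((subset, int(any(secret[i] for i in subset))))
print(comp_decode(20, tests))
```

COMP never produces false negatives (a positive entry poisons every test containing it), so the decoded vector is always an entrywise superset of the secret; with enough random tests the false positives vanish too.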
ETPL
DM-010 On the Recovery of R-Trees
Abstract: We consider the recoverability of traditional R-tree index structures under concurrent updating
transactions, an important issue that is neglected or treated inadequately in many proposals of R-tree
concurrency control. We present two solutions to ARIES-based recovery of transactions on R-trees. These
assume a standard fine-grained single-version update model with physiological write-ahead logging and steal-
and-no-force buffering where records with uncommitted updates by a transaction may migrate from their
original page to another page due to structure modifications caused by other transactions. Both solutions
guarantee that an R-tree will remain in a consistent and balanced state in the presence of any number of
concurrent forward-rolling and (totally or partially) backward-rolling multiaction transactions and in the event
of process failures and system crashes. One solution maintains the R-tree in a strictly consistent state in which
the bounding rectangles of pages are as tight as possible, while in the other solution this requirement is relaxed.
In both solutions, only a small constant number of simultaneous exclusive latches (write latches) are needed,
and in the solution that maintains only relaxed consistency, the number of simultaneous nonexclusive
latches is similarly limited. In both solutions, deletions are handled uniformly with insertions, and a
logarithmic insertion-path length is maintained under all circumstances.
ETPL
DM-011 Ontology Matching: State of the Art and Future Challenges
Abstract: After years of research on ontology matching, it is reasonable to consider several questions: is the
field of ontology matching still making progress? Is this progress significant enough to pursue further research?
If so, what are the particularly promising directions? To answer these questions, we review the state of the art
of ontology matching and analyze the results of recent ontology matching evaluations. These results show a
measurable improvement in the field, although the pace of improvement is slowing. We conjecture that significant
improvements can be obtained only by addressing important challenges for ontology matching. We present
such challenges with insights on how to approach them, thereby aiming to direct research into the most
promising tracks and to facilitate the progress of the field.
ETPL
DM-012 Ranking on Data Manifold with Sink Points
Abstract: Ranking is an important problem in various applications, such as Information Retrieval (IR), natural
language processing, computational biology, and social sciences. Many ranking approaches have been
proposed to rank objects according to their degrees of relevance or importance. Beyond these two goals,
diversity has also been recognized as a crucial criterion in ranking. Top ranked results are expected to convey
as little redundant information as possible, and cover as many aspects as possible. However, existing ranking
approaches either take no account of diversity, or handle it separately with some heuristics. In this paper, we
introduce a novel approach, Manifold Ranking with Sink Points (MRSP), to address diversity as well as
relevance and importance in ranking. Specifically, our approach uses a manifold ranking process over the data
manifold, which can naturally find the most relevant and important data objects. Meanwhile, by turning ranked
objects into sink points on data manifold, we can effectively prevent redundant objects from receiving a high
rank. MRSP not only shows a nice convergence property, but also has an interesting and satisfying
optimization explanation. We applied MRSP on two application tasks, update summarization and query
recommendation, where diversity is of great concern in ranking. Experimental results on both tasks present a
strong empirical performance of MRSP as compared to existing ranking approaches.
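One hedged way to read the sink-point mechanism is the following sketch, a simplification rather than the authors' implementation: iterate the standard manifold-ranking update f = a*S*f + (1-a)*y, clamp already-selected items to score zero so they stop spreading relevance, and greedily pick the top remaining item each round. The affinity matrix and query vector here are invented for illustration.

```python
# Simplified manifold ranking with sink points (illustrative sketch).
def mrsp(W, y, rounds=2, a=0.5, iters=200):
    n = len(W)
    deg = [sum(row) for row in W]
    # Symmetrically normalized affinity S = D^-1/2 W D^-1/2.
    S = [[W[i][j] / (deg[i] ** 0.5 * deg[j] ** 0.5) for j in range(n)]
         for i in range(n)]
    sinks, picks = set(), []
    for _ in range(rounds):
        f = y[:]
        for _ in range(iters):
            f = [a * sum(S[i][j] * f[j] for j in range(n)) + (1 - a) * y[i]
                 for i in range(n)]
            for s in sinks:          # sink points stop spreading score
                f[s] = 0.0
        best = max((i for i in range(n) if i not in sinks),
                   key=lambda i: f[i])
        picks.append(best)
        sinks.add(best)
    return picks

# Two tight clusters {0, 1} and {2, 3}; the query slightly prefers 0.
W = [[0.0, 0.9, 0.1, 0.1],
     [0.9, 0.0, 0.1, 0.1],
     [0.1, 0.1, 0.0, 0.9],
     [0.1, 0.1, 0.9, 0.0]]
y = [1.0, 0.9, 0.8, 0.7]
print(mrsp(W, y))
```

Plain manifold ranking would rank item 1 second because it sits next to item 0; once item 0 becomes a sink, its cluster-mate loses that support and the second pick jumps to the other cluster, which is exactly the diversity effect described above.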
ETPL
DM-013 Region-Based Foldings in Process Discovery
Abstract: A central problem in the area of Process Mining is to obtain a formal model that represents the
processes that are conducted in a system. If realized, this simple motivation allows for powerful techniques that
can be used to formally analyze and optimize a system, without the need to resort to its semiformal and
sometimes inaccurate specification. The problem addressed in this paper is known as Process Discovery: to
obtain a formal model from a set of system executions. The theory of regions is a valuable tool in process
discovery: it aims at learning a formal model (Petri nets) from a set of traces. In its original form, the theory is
applied to an automaton, and therefore one should convert the traces into an acyclic automaton in order to
apply these techniques. Given that the complexity of the region-based techniques depends on the size of the
input automata, revealing the underlying cycles and folding the initial automaton can significantly reduce the
complexity of those techniques. In this paper, we follow this idea by incorporating
region information in the cycle detection algorithm, enabling the identification of complex cycles that cannot
be obtained efficiently with state-of-the-art techniques. The experimental results obtained by the devised tool
suggest that the techniques presented in this paper are a significant step toward widening the application of the theory of
regions in Process Mining for industrial scenarios.
ETPL
DM-014
Relationships between Diversity of Classification Ensembles and Single-Class
Performance Measures
Abstract: In class imbalance learning problems, how to better recognize examples from the minority class is
the key focus, since it is usually more important and expensive than the majority class. Quite a few ensemble
solutions have been proposed in the literature with varying degrees of success. It is generally believed that
diversity in an ensemble could help to improve the performance of class imbalance learning. However, no
study has actually investigated diversity in depth in terms of its definitions and effects in the context of class
imbalance learning. It is unclear whether diversity will have a similar or different impact on the performance of
minority and majority classes. In this paper, we aim to gain a deeper understanding of whether and when ensemble
diversity has a positive impact on the classification of imbalanced data sets. First, we explain when and why
diversity measured by Q-statistic can bring improved overall accuracy based on two classification patterns
proposed by Kuncheva et al. We define and give insights into good and bad patterns in imbalanced scenarios.
Then, the pattern analysis is extended to single-class performance measures, including recall, precision, and F-
measure, which are widely used in class imbalance learning. Six different situations of diversity's impact on
these measures are obtained through theoretical analysis. Finally, to further understand how diversity affects
the single class performance and overall performance in class imbalance problems, we carry out extensive
experimental studies on both artificial data sets and real-world benchmarks with highly skewed class
distributions. We find strong correlations between diversity and discussed performance measures. Diversity
shows a positive impact on the minority class in general. It is also beneficial to the overall performance in
terms of AUC and G-mean.
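The Q-statistic discussed above has a simple closed form for a pair of classifiers, built from counts of joint correct and incorrect decisions. The snippet below computes it; the correctness vectors are invented for illustration.

```python
# Pairwise Q-statistic: Q = (N11*N00 - N01*N10) / (N11*N00 + N01*N10),
# where Nxy counts examples classifier 1 got x-right and classifier 2
# got y-right (1 = correct, 0 = wrong).
def q_statistic(correct1, correct2):
    pairs = list(zip(correct1, correct2))
    n11 = sum(a and b for a, b in pairs)                # both correct
    n00 = sum((not a) and (not b) for a, b in pairs)    # both wrong
    n10 = sum(a and not b for a, b in pairs)
    n01 = sum(b and not a for a, b in pairs)
    return (n11 * n00 - n01 * n10) / (n11 * n00 + n01 * n10)

c1 = [True, True, False, True, False, True]
c2 = [True, False, True, True, False, False]
print(q_statistic(c1, c1), q_statistic(c1, c2))
```

Identical classifiers give Q = 1; classifiers that tend to err on different examples push Q toward -1, which is the "diverse" end of the scale the analysis works with.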
ETPL
DM-015 T-Drive: Enhancing Driving Directions with Taxi Drivers' Intelligence
Abstract: This paper presents a smart driving direction system leveraging the intelligence of experienced
drivers. In this system, GPS-equipped taxis are employed as mobile sensors probing the traffic rhythm of a city
and taxi drivers' intelligence in choosing driving directions in the physical world. We propose a time-dependent
landmark graph to model the dynamic traffic pattern as well as the intelligence of experienced drivers so as to
provide a user with the practically fastest route to a given destination at a given departure time. Then, a
Variance-Entropy-Based Clustering approach is devised to estimate the distribution of travel time between two
landmarks in different time slots. Based on this graph, we design a two-stage routing algorithm to compute the
practically fastest and customized route for end users. We build our system based on a real-world trajectory
data set generated by over 33,000 taxis in a period of three months, and evaluate the system by conducting both
synthetic experiments and in-the-field evaluations. As a result, 60-70 percent of the routes suggested by our
method are faster than the competing methods, and 20 percent of the routes share the same results. On average,
50 percent of our routes are at least 20 percent faster than the competing approaches.
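The time-dependent routing step can be sketched with a small Dijkstra variant in which an edge's travel time depends on the time slot at which the edge is entered. This is a hedged simplification of the paper's two-stage algorithm; the landmark graph, time slots, and cost functions below are invented for illustration.

```python
# Time-dependent Dijkstra on a toy landmark graph (illustrative).
import heapq

def fastest_route(graph, src, dst, depart):
    # graph[u] = list of (v, cost_fn); cost_fn(t) -> travel minutes
    # when the edge is entered at minute-of-day t.
    best = {src: depart}
    heap = [(depart, src, [src])]
    while heap:
        t, u, path = heapq.heappop(heap)
        if u == dst:
            return t - depart, path
        if t > best.get(u, float("inf")):
            continue                      # stale heap entry
        for v, cost in graph[u]:
            arrive = t + cost(t)
            if arrive < best.get(v, float("inf")):
                best[v] = arrive
                heapq.heappush(heap, (arrive, v, path + [v]))
    return None

rush = lambda t: 30 if 480 <= t < 600 else 10   # congested 8:00-10:00
flat = lambda t: 12                              # steady side road
graph = {"A": [("B", rush), ("C", flat)],
         "B": [("D", flat)], "C": [("D", flat)], "D": []}
print(fastest_route(graph, "A", "D", depart=500))   # during rush hour
print(fastest_route(graph, "A", "D", depart=700))   # off-peak
```

The fastest path through the same graph changes with departure time, which is the core property the time-dependent landmark graph is built to capture.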
ETPL
DM-016
A Generalized Flow-Based Method for Analysis of Implicit Relationships on
Wikipedia
Abstract: We focus on measuring relationships between pairs of objects in Wikipedia whose pages
can be regarded as individual objects. Two kinds of relationships between two objects exist: in
Wikipedia, an explicit relationship is represented by a single link between the two pages for the
objects, and an implicit relationship is represented by a link structure containing the two pages. Some
of the previously proposed methods for measuring relationships are cohesion-based methods, which
underestimate objects having high degrees, although such objects could be important in constituting
relationships in Wikipedia. The other methods are inadequate for measuring implicit relationships
because they use only one or two of the following three important factors: distance, connectivity, and
cocitation. We propose a new method using a generalized maximum flow that reflects all three
factors and does not underestimate objects having high degrees. We confirm through experiments that
our method can measure the strength of a relationship more appropriately than these previously
proposed methods do. Another remarkable aspect of our method is mining elucidatory objects, that is,
objects constituting a relationship. We explain that mining elucidatory objects would open a novel
way to deeply understand a relationship.
ETPL
DM-017
A Proxy-Based Approach to Continuous Location-Based Spatial Queries in
Mobile Environments
Abstract: Caching valid regions of spatial queries at mobile clients is effective in reducing the number
of queries submitted by mobile clients and query load on the server. However, mobile clients suffer
from longer waiting time for the server to compute valid regions. We propose in this paper a proxy-
based approach to continuous nearest-neighbor (NN) and window queries. The proxy creates
estimated valid regions (EVRs) for mobile clients by exploiting spatial and temporal locality of
spatial queries. For NN queries, we devise two new algorithms to accelerate EVR growth, leading the
proxy to build effective EVRs even when the cache size is small. On the other hand, we propose to
represent the EVRs of window queries in the form of vectors, called estimated window vectors
(EWVs), to achieve larger estimated valid regions. This novel representation and the associated
creation algorithm result in more effective EVRs of window queries. In addition, due to the distinct
characteristics, we use separate index structures, namely EVR-tree and grid index, for NN queries and
window queries, respectively. To further increase efficiency, we develop algorithms to exploit the
results of NN queries to aid grid index growth, benefiting EWV creation of window queries.
Similarly, the grid index is utilized to support NN query answering and EVR updating. We conduct
several experiments for performance evaluation. The experimental results show that the proposed
approach significantly outperforms the existing proxy-based approaches.
ETPL
DM-018
A Rough-Set-Based Incremental Approach for Updating Approximations under
Dynamic Maintenance Environments
Abstract: Approximations of a concept by a variable precision rough-set model (VPRS) usually vary
under a dynamic information system environment. It is thus effective to carry out incremental
updating approximations by utilizing previous data structures. This paper focuses on a new
incremental method for updating approximations of VPRS while objects in the information system
change dynamically. It discusses properties of information granulation and approximations under the
dynamic environment while objects in the universe evolve over time. The variation of an attribute's
domain is also considered to perform incremental updating for approximations under VPRS. Finally,
an extensive experimental evaluation validates the efficiency of the proposed method for dynamic
maintenance of VPRS approximations.
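The approximations being maintained can be stated compactly. The sketch below shows only the static variable-precision approximations, as one hedged reading of the model (the incremental update machinery is the paper's contribution and is not reproduced here): an equivalence class enters the lower approximation of a concept when the fraction of its members in the concept is at least 1 - beta, and the upper approximation when that fraction exceeds beta.

```python
# Variable-precision rough-set approximations in miniature.
def vprs_approximations(classes, concept, beta=0.2):
    lower, upper = set(), set()
    for eq in classes:
        overlap = len(eq & concept) / len(eq)
        if overlap >= 1 - beta:   # class is "mostly inside" the concept
            lower |= eq
        if overlap > beta:        # class overlaps the concept enough
            upper |= eq
    return lower, upper

classes = [{1, 2, 3}, {4, 5}, {6, 7, 8, 9}]   # equivalence classes
concept = {1, 2, 3, 4, 6}
lower, upper = vprs_approximations(classes, concept, beta=0.2)
print(sorted(lower), sorted(upper))
```

With beta = 0, this reduces to the classical rough-set lower and upper approximations; raising beta tolerates a controlled fraction of misclassified objects per class.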
ETPL
DM-019 A System to Filter Unwanted Messages from OSN User Walls
Abstract: One fundamental issue in today's Online Social Networks (OSNs) is giving users the ability
to control the messages posted on their own private space so that unwanted content is not displayed.
To date, OSNs provide little support for this requirement. To fill this gap, in this paper we propose a
system that allows OSN users to exercise direct control over the messages posted on their walls. This is
achieved through a flexible rule-based system that allows users to customize the filtering criteria to
be applied to their walls, and a Machine Learning-based soft classifier that automatically labels
messages in support of content-based filtering.
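The two layers described, filtering rules plus a soft classifier, can be sketched as follows. The keyword scorer stands in for the paper's Machine Learning classifier and is purely illustrative, as are the rule fields, author names, and threshold.

```python
# Toy two-layer wall filter: hard rules first, then a soft score.
def soft_label(message, vulgar={"spam", "junk"}):
    # Stand-in for the ML soft classifier: fraction of flagged words,
    # read as a membership degree in the "unwanted" class.
    words = message.lower().split()
    return sum(w in vulgar for w in words) / max(len(words), 1)

def wall_filter(posts, rules):
    kept = []
    for author, msg in posts:
        if author in rules["blocked_authors"]:      # rule-based layer
            continue
        if soft_label(msg) > rules["unwanted_threshold"]:  # soft layer
            continue
        kept.append((author, msg))
    return kept

rules = {"blocked_authors": {"troll42"}, "unwanted_threshold": 0.3}
posts = [("alice", "lunch photos from the trip"),
         ("troll42", "look at this"),
         ("bob", "spam spam junk offer")]
print(wall_filter(posts, rules))
```

The point of the soft score is that the threshold is a per-user filtering criterion, so the same classifier output can yield different walls for different users.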
ETPL
DM-020 AML: Efficient Approximate Membership Localization within a Web-Based Join Framework
Abstract: In this paper, we propose a new type of Dictionary-based Entity Recognition Problem,
named Approximate Membership Localization (AML). The popular Approximate Membership
Extraction (AME) provides a full coverage to the true matched substrings from a given document, but
many redundancies lower the efficiency of the AME process and degrade the performance of
real-world applications that use the extracted substrings. The AML problem aims to locate
nonoverlapping substrings, which better approximate the true matched substrings without
generating overlapped redundancies. In order to perform AML efficiently, we propose the optimized
algorithm P-Prune that prunes a large part of overlapped redundant matched substrings before
generating them. Our study using several real-world data sets demonstrates the efficiency of P-Prune
over a baseline method. We also study AML as applied to a proposed web-based join framework,
a search-based approach that joins two tables using dictionary-based entity
recognition from web documents. The results not only prove the advantage of AML over AME, but
also demonstrate the effectiveness of our search-based approach.
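The AME/AML contrast can be seen in miniature. The sketch below is a hedged illustration and is not P-Prune itself: AME reports every dictionary hit, overlaps included, while the toy AML routine keeps a nonoverlapping selection via a naive greedy longest-match-first scan.

```python
# AME vs. AML on a toy document (illustrative; not P-Prune).
def ame(text, dictionary):
    # Every (position, entry) hit, overlapping matches included.
    return sorted((i, w) for w in dictionary
                  for i in range(len(text)) if text.startswith(w, i))

def aml(text, dictionary):
    # Nonoverlapping hits via greedy longest-match left-to-right.
    hits, i = [], 0
    while i < len(text):
        match = max((w for w in dictionary if text.startswith(w, i)),
                    key=len, default=None)
        if match:
            hits.append((i, match))
            i += len(match)
        else:
            i += 1
    return hits

text = "newyorknewyorkcity"
dictionary = {"newyork", "york", "newyorkcity", "city"}
print(ame(text, dictionary))
print(aml(text, dictionary))
```

AME returns six hits here, four of them overlapping redundancies; the nonoverlapping AML output covers the same text with just two, which is the efficiency gap the abstract describes.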
ETPL
DM-021 Anonymization of Centralized and Distributed Social Networks by Sequential Clustering
Abstract: We study the problem of privacy-preservation in social networks. We consider the
distributed setting in which the network data is split between several data holders. The goal is to
arrive at an anonymized view of the unified network without revealing to any of the data holders
information about links between nodes that are controlled by other data holders. To that end, we start
with the centralized setting and offer two variants of an anonymization algorithm which is based on
sequential clustering (Sq). Our algorithms significantly outperform the SaNGreeA algorithm due to
Campan and Truta, which is the leading algorithm for achieving anonymity in networks by means of
clustering. We then devise secure distributed versions of our algorithms. To the best of our
knowledge, this is the first study of privacy preservation in distributed social networks. We conclude
by outlining future research proposals in that direction.
ETPL
DM-022 Clustering Large Probabilistic Graphs
Abstract: We study the problem of clustering probabilistic graphs. Similar to the problem of clustering
standard graphs, probabilistic graph clustering has numerous applications, such as finding complexes
in probabilistic protein-protein interaction (PPI) networks and discovering groups of users in
affiliation networks. We extend the edit-distance-based definition of graph clustering to probabilistic
graphs. We establish a connection between our objective function and correlation clustering to
propose practical approximation algorithms for our problem. A benefit of our approach is that our
objective function is parameter-free. Therefore, the number of clusters is part of the output. We also
develop methods for testing the statistical significance of the output clustering and study the case of
noisy clusterings. Using a real protein-protein interaction network and ground-truth data, we show
that our methods discover the correct number of clusters and identify established protein
relationships. Finally, we show the practicality of our techniques using a large social network of
Yahoo! users consisting of one billion edges.
ETPL
DM-023 Detecting Intrinsic Loops Underlying Data Manifold
Abstract: Detecting intrinsic loop structures of a data manifold is a necessary first step for the proper
employment of manifold learning techniques, and is of fundamental importance in the discovery of
the essential representational features underlying the data lying on the loopy manifold. An effective
strategy is proposed to solve this problem in this study. In line with our intuition, a formal definition
of a loop residing on a manifold is first given. Based on this definition, theoretical properties of loopy
manifolds are rigorously derived. In particular, a necessary and sufficient condition for detecting
essential loops of a manifold is derived. An effective algorithm for loop detection is then constructed.
The soundness of the proposed theory and algorithm is validated by a series of experiments performed
on synthetic and real-life data sets. In each of the experiments, the essential loops underlying the data
manifold can be properly detected, and the intrinsic representational features of the data manifold can
be revealed along the loop structure so detected. Particularly, some of these features can hardly be
discovered by the conventional manifold learning methods.
ETPL
DM-024 Event Tracking for Real-Time Unaware Sensitivity Analysis (EventTracker)
Abstract: This paper introduces a platform for online Sensitivity Analysis (SA) that is applicable in
large scale real-time data acquisition (DAQ) systems. Here, we use the term real-time in the context
of a system that has to respond to externally generated input stimuli within a finite and specified
period. Complex industrial systems such as manufacturing, healthcare, transport, and finance require
high-quality information on which to base timely responses to events occurring in their volatile
environments. The motivation for the proposed EventTracker platform is the assumption that modern
industrial systems are able to capture data in real-time and have the necessary technological flexibility
to adjust to changing system requirements. The flexibility to adapt can only be assured if data is
succinctly interpreted and translated into corrective actions in a timely manner. An important factor
that facilitates data interpretation and information modeling is an appreciation of the effect system
inputs have on each output at the time of occurrence. Many existing sensitivity analysis methods
appear to hamper efficient and timely analysis due to a reliance on historical data, or sluggishness in
providing a timely solution that would be of use in real-time applications. This inefficiency is further
compounded by computational limitations and the complexity of some existing models. In dealing
with real-time event driven systems, the underpinning logic of the proposed method is based on the
assumption that in the vast majority of cases changes in input variables will trigger events. Every
single or combination of events could subsequently result in a change to the system state. The
proposed event tracking sensitivity analysis method describes variables and the system state as a
collection of events. The higher the numeric occurrence of an input variable at the trigger level during
an event monitoring interval, the greater is its impact on the final analysis of the system state.
Experiments were designed to compare the proposed event tracking sensitivity analysis method with a
comparable method (that of Entropy). An improvement of 10 percent in computational efficiency
without loss in accuracy was observed. The comparison also showed that the time taken to perform
the sensitivity analysis was 0.5 percent of that required when using the comparable Entropy-based
method.
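The scoring idea described above (count how often each input variable fires at trigger level inside one monitoring interval, and treat higher counts as higher impact) can be sketched as follows. This is a simplified illustration; the variable names and the frequency-based score are our assumptions, not the EventTracker implementation.

```python
from collections import Counter

def event_impact(events, interval_start, interval_end):
    """Rank input variables by how often they fire at trigger level
    within one event-monitoring interval (illustrative sketch only)."""
    counts = Counter(
        var for t, var in events if interval_start <= t < interval_end
    )
    total = sum(counts.values())
    # Impact score: relative occurrence frequency within the interval.
    return {var: n / total for var, n in counts.items()}

# Hypothetical stream of (timestamp, input-variable) trigger events.
stream = [(0.1, "temp"), (0.2, "pressure"), (0.3, "temp"), (1.2, "flow")]
scores = event_impact(stream, 0.0, 1.0)
```

Within the interval [0.0, 1.0), "temp" fires twice and "pressure" once, so "temp" receives the higher impact score; "flow" falls outside the interval and is ignored.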
ETPL
DM-025 Fast Activity Detection: Indexing for Temporal Stochastic Automaton-Based
Abstract: Today, numerous applications require the ability to monitor a continuous stream of fine-
grained data for the occurrence of certain high-level activities. A number of computerized systems,
including ATM networks, web servers, and intrusion detection systems, systematically track every
atomic action we perform, thus generating massive streams of timestamped observation data, possibly
from multiple concurrent activities. In this paper, we address the problem of efficiently detecting
occurrences of high-level activities from such interleaved data streams. A solution to this important
problem would greatly benefit a broad range of applications, including fraud detection, video
surveillance, and cyber security. There has been extensive work in the last few years on modeling
activities using probabilistic models. In this paper, we propose a temporal probabilistic graph so that
the elapsed time between observations also plays a role in defining whether a sequence of
observations constitutes an activity. We first propose a data structure called “temporal multiactivity
graph” to store multiple activities that need to be concurrently monitored. We then define an index
called Temporal Multiactivity Graph Index Creation (tMAGIC) that, based on this data structure,
examines and links observations as they occur. We define algorithms for insertion and bulk insertion
into the tMAGIC index and show that this can be efficiently accomplished. We also define algorithms
to solve two problems: the “evidence” problem that tries to find all occurrences of an activity (with
probability over a threshold) within a given sequence of observations, and the “identification”
problem that tries to find the activity that best matches a sequence of observations. We introduce
complexity-reducing restrictions and pruning strategies to make the problem, which is intrinsically
exponential, linear in the number of observations. Our experiments confirm that tMAGIC has time and
space complexity linear to the size of the input, and can efficiently retrieve instances of the monitored
activities.
ETPL
DM-026 Finding Rare Classes: Active Learning with Generative and Discriminative
Abstract: Discovering rare categories and classifying new instances of them are important data mining
issues in many fields, but fully supervised learning of a rare class classifier is prohibitively costly in
labeling effort. There has therefore been increasing interest both in active discovery: to identify new
classes quickly, and active learning: to train classifiers with minimal supervision. These goals occur
together in practice and are intrinsically related because examples of each class are required to train a
classifier. Nevertheless, very few studies have tried to optimise them together, meaning that data
mining for rare classes in new domains makes inefficient use of human supervision. Developing
active learning algorithms to optimise both rare class discovery and classification simultaneously is
challenging because discovery and classification have conflicting requirements in query criteria. In
this paper, we address these issues with two contributions: a unified active learning model to jointly
discover new categories and learn to classify them by adapting query criteria online; and a classifier
combination algorithm that switches generative and discriminative classifiers as learning progresses.
Extensive evaluation on a batch of standard UCI and vision data sets demonstrates the superiority of
this approach over existing methods.
ETPL
DM-027 Halite: Fast and Scalable Multiresolution Local-Correlation Clustering
Abstract: This paper proposes Halite, a novel, fast, and scalable clustering method that looks for
clusters in subspaces of multidimensional data. Existing methods are typically superlinear in space or
execution time. Halite's strengths are that it is fast and scalable, while still giving highly accurate
results. Specifically, the main contributions of Halite are: 1) Scalability: it is linear or quasi-linear in
time and space regarding the data size and dimensionality, and the dimensionality of the clusters'
subspaces; 2) Usability: it is deterministic, robust to noise, doesn't take the number of clusters as an
input parameter, and detects clusters in subspaces generated by original axes or by their linear
combinations, including space rotation; 3) Effectiveness: it is accurate, providing results with equal or
better quality compared to top related works; and 4) Generality: it includes a soft clustering approach.
Experiments on synthetic data ranging from five to 30 axes and up to 1 million points were
performed. Halite was on average at least 12 times faster than seven representative works, and always
presented highly accurate results. On real data, Halite was at least 11 times faster than others,
increasing their accuracy in up to 35 percent. Finally, we report experiments in a real scenario where
soft clustering is desirable.
ETPL
DM-028 k-Pattern Set Mining under Constraints
Abstract: We introduce the problem of k-pattern set mining, concerned with finding a set of k related
patterns under constraints. This contrasts to regular pattern mining, where one searches for many
individual patterns. The k-pattern set mining problem is a very general problem that can be
instantiated to a wide variety of well-known mining tasks including concept-learning, rule-learning,
redescription mining, conceptual clustering and tiling. To this end, we formulate a large number of
constraints for use in k-pattern set mining, both at the local level, that is, on individual patterns, and
on the global level, that is, on the overall pattern set. Building general solvers for the pattern set
mining problem remains a challenge. Here, we investigate to what extent constraint programming
(CP) can be used as a general solution strategy. We present a mapping of pattern set constraints to
constraints currently available in CP. This allows us to investigate a large number of settings within a
unified framework and to gain insight in the possibilities and limitations of these solvers. This is
important as it allows us to create guidelines in how to model new problems successfully and how to
model existing problems more efficiently. It also opens up the way for other solver technologies.
ETPL
DM-029 Minimally Supervised Novel Relation Extraction Using a Latent Relational
Abstract: The World Wide Web includes semantic relations of numerous types that exist among
different entities. Extracting the relations that exist between two entities is an important step in
various Web-related tasks such as information retrieval (IR), information extraction, and social
network extraction. A supervised relation extraction system that is trained to extract a particular
relation type (source relation) might not accurately extract a new type of a relation (target relation) for
which it has not been trained. However, it is costly to create training data manually for every new
relation type that one might want to extract. We propose a method to adapt an existing relation
extraction system to extract new relation types with minimum supervision. Our proposed method
comprises two stages: learning a lower dimensional projection between different relations, and
learning a relational classifier for the target relation type with instance sampling. First, to represent a
semantic relation that exists between two entities, we extract lexical and syntactic patterns from
contexts in which those two entities co-occur. Then, we construct a bipartite graph between relation-
specific (RS) and relation-independent (RI) patterns. Spectral clustering is performed on the bipartite
graph to compute a lower dimensional projection. Second, we train a classifier for the target relation
type using a small number of labeled instances. To account for the lack of target relation training
instances, we present a one-sided undersampling method. We evaluate the proposed method using a
data set that contains 2,000 instances for 20 different relation types. Our experimental results show
that the proposed method achieves a statistically significant macroaverage F-score of 62.77.
Moreover, the proposed method outperforms numerous baselines and a previously proposed weakly
supervised relation extraction method.
ETPL
DM-030 Mining User Queries with Markov Chains: Application to Online Image Retrieval
Abstract: We propose a novel method for automatic annotation, indexing and annotation-based
retrieval of images. The new method, that we call Markovian Semantic Indexing (MSI), is presented
in the context of an online image retrieval system. Assuming such a system, the users' queries are
used to construct an Aggregate Markov Chain (AMC) through which the relevance between the
keywords seen by the system is defined. The users' queries are also used to automatically annotate the
images. A stochastic distance between images, based on their annotation and the keyword relevance
captured in the AMC, is then introduced. Geometric interpretations of the proposed distance are
provided and its relation to a clustering in the keyword space is investigated. By means of a new
measure of Markovian state similarity, the mean first cross passage time (CPT), optimality properties
of the proposed distance are proved. Images are modeled as points in a vector space and their
similarity is measured with MSI. The new method is shown to possess certain theoretical advantages
and also to achieve better Precision versus Recall results when compared to Latent Semantic Indexing
(LSI) and probabilistic Latent Semantic Indexing (pLSI) methods in Annotation-Based Image
Retrieval (ABIR) tasks.
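The construction of an Aggregate Markov Chain from user queries can be sketched as a keyword-to-keyword transition matrix estimated from query sessions. This is a simplified assumption-laden illustration (session structure, keyword vocabulary, and maximum-likelihood estimation are ours), not the MSI system itself.

```python
import numpy as np

def build_amc(query_sessions, keywords):
    """Build a row-stochastic Aggregate Markov Chain over keywords from
    consecutive keywords in user query sessions (a simplified sketch of
    how query logs can define keyword relevance)."""
    idx = {k: i for i, k in enumerate(keywords)}
    counts = np.zeros((len(keywords), len(keywords)))
    for session in query_sessions:
        for a, b in zip(session, session[1:]):
            counts[idx[a], idx[b]] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Keywords that never transition keep an all-zero row.
    row_sums[row_sums == 0] = 1
    return counts / row_sums

# Hypothetical query sessions: each is an ordered list of keywords.
sessions = [["beach", "sunset"], ["beach", "sea"], ["beach", "sunset"]]
P = build_amc(sessions, ["beach", "sunset", "sea"])
```

Here "beach" is followed by "sunset" twice and by "sea" once, so the chain captures that "sunset" is the more relevant continuation.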
ETPL
DM-031 Reinforced Similarity Integration in Image-Rich Information Networks
Abstract: Social multimedia sharing and hosting websites, such as Flickr and Facebook, contain
billions of user-submitted images. Popular Internet commerce websites such as Amazon.com are also
furnished with tremendous amounts of product-related images. In addition, images in such social
networks are also accompanied by annotations, comments, and other information, thus forming
heterogeneous image-rich information networks. In this paper, we introduce the concept of
(heterogeneous) image-rich information network and the problem of how to perform information
retrieval and recommendation in such networks. We propose a fast algorithm heterogeneous
minimum order k-SimRank (HMok-SimRank) to compute link-based similarity in weighted
heterogeneous information networks. Then, we propose an algorithm Integrated Weighted Similarity
Learning (IWSL) to account for both link-based and content-based similarities by considering the
network structure and mutually reinforcing link similarity and feature weight learning. Both local and
global feature learning methods are designed. Experimental results on Flickr and Amazon data sets
show that our approach is significantly better than traditional methods in terms of both relevance and
speed. A new product search and recommendation system for e-commerce has been implemented
based on our algorithm.
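For orientation, the textbook SimRank iteration that HMok-SimRank accelerates can be written down directly. The sketch below is the plain, unweighted baseline (two nodes are similar when their in-neighbors are similar), not the paper's heterogeneous minimum-order variant; the decay constant C = 0.8 is a conventional choice.

```python
import numpy as np

def simrank(adj, C=0.8, iters=10):
    """Plain SimRank iteration on a directed graph given as an adjacency
    matrix. HMok-SimRank in the paper is a faster weighted heterogeneous
    variant; this is only the textbook baseline it builds on."""
    n = adj.shape[0]
    S = np.eye(n)
    in_nbrs = [np.nonzero(adj[:, j])[0] for j in range(n)]
    for _ in range(iters):
        S_new = np.eye(n)
        for a in range(n):
            for b in range(n):
                if a == b:
                    continue
                Ia, Ib = in_nbrs[a], in_nbrs[b]
                if len(Ia) == 0 or len(Ib) == 0:
                    continue
                # Average similarity over all in-neighbor pairs.
                S_new[a, b] = C * S[np.ix_(Ia, Ib)].sum() / (len(Ia) * len(Ib))
        S = S_new
    return S

# Tiny example: nodes 1 and 2 share the single in-neighbor 0.
A = np.array([[0, 1, 1], [0, 0, 0], [0, 0, 0]])
S = simrank(A)
```

Nodes 1 and 2 have an identical in-neighborhood, so their similarity converges to the decay constant C.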
ETPL
DM-032 Supporting Search-As-You-Type Using SQL in Databases
Abstract: A search-as-you-type system computes answers on-the-fly as a user types in a keyword
query character by character. We study how to support search-as-you-type on data residing in a
relational DBMS. We focus on how to support this type of search using the native database language,
SQL. A main challenge is how to leverage existing database functionalities to meet the high-
performance requirement to achieve an interactive speed. We study how to use auxiliary indexes
stored as tables to increase search performance. We present solutions for both single-keyword queries
and multikeyword queries, and develop novel techniques for fuzzy search using SQL by allowing
mismatches between query keywords and answers. We present techniques to answer first-N queries
and discuss how to support updates efficiently. Experiments on large, real data sets show that our
techniques enable DBMS systems on a commodity computer to support search-as-you-type on tables
with millions of records.
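The central point, that per-keystroke prefix search can be expressed in plain SQL against a relational DBMS, can be sketched with a minimal example. The schema and the single-table LIKE query below are simplified assumptions for illustration; the paper's techniques use auxiliary index tables and fuzzy matching on top of this idea.

```python
import sqlite3

# In-memory table standing in for the relational data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany("INSERT INTO docs (title) VALUES (?)",
                 [("database systems",), ("data mining",), ("networks",)])

def search_as_you_type(prefix):
    """Recompute answers for the partially typed keyword using native SQL."""
    cur = conn.execute(
        "SELECT title FROM docs WHERE title LIKE ? ORDER BY title",
        (prefix + "%",))
    return [row[0] for row in cur]

results = search_as_you_type("data")
```

Each keystroke simply reissues the query with a longer prefix; the performance challenge the paper addresses is making this interactive on tables with millions of records.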
ETPL
DM-033 Simple Hybrid and Incremental Postpruning Techniques for Rule Induction
Abstract: Pruning achieves the dual goal of reducing the complexity of the final hypothesis for improved
comprehensibility, and improving its predictive accuracy by minimizing the overfitting due to noisy
data. This paper presents a new hybrid pruning technique for rule induction, as well as an incremental
postpruning technique based on a misclassification tolerance. Although both have been designed for
RULES-7, the latter is also applicable to any rule induction algorithm in general. A thorough
empirical evaluation reveals that the proposed techniques enable RULES-7 to outperform other state-
of-the-art classification techniques. The improved classifier is also more accurate and up to two orders
of magnitude faster than before.
ETPL
DM-034 λ-Diverse Nearest Neighbors Browsing for Multidimensional Data
Abstract: Traditional search methods try to obtain the most relevant information and rank it according
to the degree of similarity to the queries. Diversity in query results is also preferred by a variety of
applications since results very similar to each other cannot capture all aspects of the queried topic. In
this paper, we focus on the lambda-diverse k-nearest neighbor search problem on spatial and
multidimensional data. Unlike the approach of diversifying query results in a postprocessing step, we
naturally obtain diverse results with the proposed geometric and index-based methods. We first make
an analogy with the concept of Natural Neighbors (NatN) and propose a natural neighbor-based
method for 2D and 3D data and an incremental browsing algorithm based on Gabriel graphs for
higher dimensional spaces. We then introduce a diverse browsing method based on the distance
browsing feature of spatial index structures, such as R-trees. The algorithm maintains a Priority
Queue with mindivdist of the objects depending on both relevancy and angular diversity and
efficiently prunes nondiverse items and nodes. We experiment with a number of spatial and high-
dimensional data sets, including Factual's (http://www.factual.com/) US points-of-interest data set of
13M entries. On the experimental setup, the diverse browsing method is shown to be more efficient
(regarding disk accesses) than k-NN search on R-trees, and more effective (regarding Maximal
Marginal Relevance (MMR)) than the diverse nearest neighbor search techniques found in the
literature.
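The relevance-versus-diversity trade-off underlying diverse nearest-neighbor search can be sketched with a greedy MMR-style selection. This is a generic illustration under our own assumptions (a linear trade-off weight `lam`, Euclidean distance, brute-force candidates), not the paper's natural-neighbor or R-tree index-based browsing methods.

```python
import math

def diverse_knn(query, points, k, lam=0.5):
    """Greedily pick k results, trading closeness to the query (relevance)
    against distance to already chosen results (diversity)."""
    dist = math.dist
    chosen = []
    candidates = sorted(points, key=lambda p: dist(p, query))
    while candidates and len(chosen) < k:
        def score(p):
            rel = -dist(p, query)                                  # relevance
            div = min((dist(p, c) for c in chosen), default=0.0)   # diversity
            return lam * rel + (1 - lam) * div
        best = max(candidates, key=score)
        chosen.append(best)
        candidates.remove(best)
    return chosen

pts = [(0.0, 0.1), (0.0, 0.2), (5.0, 5.0), (0.1, 0.0)]
res = diverse_knn((0.0, 0.0), pts, k=2)
```

The first pick is the nearest point; the second balances nearness to the query against separation from the first pick, so the far-away outlier is still skipped while near-duplicates of the first result are penalized.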
ETPL
DM-035 A Bound on Kappa-Error Diagrams for Analysis of Classifier Ensembles
Abstract: Kappa-error diagrams are used to gain insights about why an ensemble method is better
than another on a given data set. A point on the diagram corresponds to a pair of classifiers. The x-
axis is the pairwise diversity (kappa), and the y-axis is the averaged individual error. In this study,
kappa is calculated from the 2 × 2 correct/wrong contingency matrix. We derive a lower bound on
kappa which determines the feasible part of the kappa-error diagram. Simulations and experiments
with real data show that there is unoccupied feasible space on the diagram corresponding to
(hypothetical) better ensembles, and that individual accuracy is the leading factor in improving the
ensemble accuracy.
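The two coordinates of a kappa-error diagram are directly computable from the 2 × 2 correct/wrong contingency counts of a classifier pair. The sketch below uses the standard Cohen's kappa on that table; the count naming (a, b, c, d) is a common convention, assumed here for illustration.

```python
def kappa_2x2(a, b, c, d):
    """Pairwise kappa from the 2x2 correct/wrong contingency counts:
    a = both correct, b = only the first correct, c = only the second
    correct, d = both wrong (standard Cohen's kappa on the table)."""
    m = a + b + c + d
    p_o = (a + d) / m                                       # observed agreement
    p_e = ((a + b) * (a + c) + (c + d) * (b + d)) / (m * m)  # chance agreement
    return (p_o - p_e) / (1 - p_e)

def avg_error(a, b, c, d):
    """Averaged individual error of the pair (the diagram's y-axis)."""
    m = a + b + c + d
    e1 = (c + d) / m  # first classifier wrong
    e2 = (b + d) / m  # second classifier wrong
    return (e1 + e2) / 2

point = (kappa_2x2(70, 0, 0, 30), avg_error(70, 0, 0, 30))
```

Two identical classifiers (b = c = 0) land at kappa = 1 with their common error as the y-coordinate; statistically independent ones land at kappa = 0.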
ETPL
DM-036 A New Algorithm for Inferring User Search Goals with Feedback Sessions
Abstract: For a broad-topic and ambiguous query, different users may have different search goals
when they submit it to a search engine. The inference and analysis of user search goals can be very
useful in improving search engine relevance and user experience. In this paper, we propose a novel
approach to infer user search goals by analyzing search engine query logs. First, we propose a
framework to discover different user search goals for a query by clustering the proposed feedback
sessions. Feedback sessions are constructed from user click-through logs and can efficiently reflect
the information needs of users. Second, we propose a novel approach to generate pseudo-documents
to better represent the feedback sessions for clustering. Finally, we propose a new criterion,
“Classified Average Precision (CAP)”, to evaluate the performance of inferring user search goals.
Experimental results are presented using user click-through logs from a commercial search engine to
validate the effectiveness of our proposed methods.
ETPL
DM-037 Annotating Search Results from Web Databases
Abstract: An increasing number of databases have become web accessible through HTML form-based
search interfaces. The data units returned from the underlying database are usually encoded into the
result pages dynamically for human browsing. For the encoded data units to be machine processable,
which is essential for many applications such as deep web data collection and Internet comparison
shopping, they need to be extracted out and assigned meaningful labels. In this paper, we present an
automatic annotation approach that first aligns the data units on a result page into different groups
such that the data in the same group have the same semantics. Then, for each group we annotate it
from different aspects and aggregate the different annotations to predict a final annotation label for it.
An annotation wrapper for the search site is automatically constructed and can be used to annotate
new result pages from the same web database. Our experiments indicate that the proposed approach is
highly effective.
ETPL
DM-038 Building a Scalable Database-Driven Reverse Dictionary
Abstract: In this paper, we describe the design and implementation of a reverse dictionary. Unlike a
traditional forward dictionary, which maps from words to their definitions, a reverse dictionary takes
a user input phrase describing the desired concept, and returns a set of candidate words that satisfy the
input phrase. This work has significant application not only for the general public, particularly those
who work closely with words, but also in the general field of conceptual search. We present a set of
algorithms and the results of a set of experiments showing the retrieval accuracy of our methods and
the runtime response time performance of our implementation. Our experimental results show that our
approach can provide significant improvements in performance scale without sacrificing the quality
of the result. Our experiments comparing the quality of our approach to that of currently available
reverse dictionaries show that our approach can provide significantly higher quality than either of
the other currently available implementations.
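The basic mapping a reverse dictionary inverts can be sketched with an inverted index over definition terms. This toy treats matching as conjunctive keyword lookup; the actual system described above uses much richer conceptual similarity, and the sample entries are invented for illustration.

```python
from collections import defaultdict

# Toy forward dictionary: word -> definition.
forward = {
    "telescope": "instrument for viewing distant objects",
    "microscope": "instrument for viewing very small objects",
    "sonnet": "poem of fourteen lines",
}

# Invert it: definition term -> set of words whose definitions use it.
index = defaultdict(set)
for word, definition in forward.items():
    for term in definition.split():
        index[term].add(word)

def reverse_lookup(phrase):
    """Return candidate words whose definitions cover every term of the
    user's input phrase (conjunctive keyword match)."""
    sets = [index.get(t, set()) for t in phrase.split()]
    return set.intersection(*sets) if sets else set()

candidates = reverse_lookup("viewing distant objects")
```

Querying with "viewing distant objects" narrows the candidates to the single word whose definition contains all three terms.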
ETPL
DM-039 Discovering Temporal Change Patterns in the Presence of Taxonomies
Abstract: Frequent itemset mining is a widely used exploratory technique that focuses on discovering
recurrent correlations among data. The steadfast evolution of markets and business environments
prompts the need of data mining algorithms to discover significant correlation changes in order to
reactively suit product and service provision to customer needs. Change mining, in the context of
frequent itemsets, focuses on detecting and reporting significant changes in the set of mined itemsets
from one time period to another. The discovery of frequent generalized itemsets, i.e., itemsets that 1)
frequently occur in the source data, and 2) provide a high-level abstraction of the mined knowledge,
issues new challenges in the analysis of itemsets that become rare, and thus are no longer extracted,
from a certain point. This paper proposes a novel kind of dynamic pattern, namely the History
GENeralized Pattern (HIGEN), that represents the evolution of an itemset in consecutive time
periods, by reporting the information about its frequent generalizations characterized by minimal
redundancy (i.e., minimum level of abstraction) in case it becomes infrequent in a certain time period.
To address HIGEN mining, it proposes HIGEN MINER, an algorithm that focuses on avoiding
itemset mining followed by postprocessing by exploiting a support-driven itemset generalization
approach. To focus the attention on the minimally redundant frequent generalizations and thus reduce
the amount of the generated patterns, the discovery of a smart subset of HIGENs, namely the
NONREDUNDANT HIGENs, is addressed as well. Experiments performed on both real and
synthetic datasets show the efficiency and the effectiveness of the proposed approach as well as its
usefulness in a real application context.
ETPL
DM-040 Extending BCDM to Cope with Proposals and Evaluations of Updates
Abstract: The cooperative construction of data/knowledge bases has recently had a significant impulse
(see, e.g., Wikipedia [1]). In cases in which data/knowledge quality and reliability are crucial,
proposals of update/insertion/deletion need to be evaluated by experts. To the best of our knowledge,
no theoretical framework has been devised to model the semantics of update proposal/ evaluation in
the relational context. Since time is an intrinsic part of most domains (as well as of the
proposal/evaluation process itself), semantic approaches to temporal relational databases (specifically,
Bitemporal Conceptual Data Model (henceforth, BCDM) [2]) are the starting point of our approach.
In this paper, we propose BCDMPV, a semantic temporal relational model that extends BCDM to deal
with multiple update/insertion/deletion proposals and with acceptances/rejections of proposals
themselves. We propose a theoretical framework, defining the new data structures, manipulation
operations and temporal relational algebra and proving some basic properties, namely that
BCDMPV is a consistent extension of BCDM and that it is reducible to BCDM. These properties
ensure consistency with most relational temporal database frameworks, facilitating implementations.
ETPL
DM-041 Facilitating Effective User Navigation through Website Structure Improvement
Abstract: Designing well-structured websites to facilitate effective user navigation has long been a
challenge. A primary reason is that the web developers' understanding of how a website should be
structured can be considerably different from that of the users. While various methods have been
proposed to relink webpages to improve navigability using user navigation data, the completely
reorganized new structure can be highly unpredictable, and the cost of disorienting users after the
changes remains unanalyzed. This paper addresses how to improve a website without introducing
substantial changes. Specifically, we propose a mathematical programming model to improve the user
navigation on a website while minimizing alterations to its current structure. Results from extensive
tests conducted on a publicly available real data set indicate that our model not only significantly
improves the user navigation with very few changes, but also can be effectively solved. We have also
tested the model on large synthetic data sets to demonstrate that it scales up very well. In addition, we
define two evaluation metrics and use them to assess the performance of the improved website using
the real data set. Evaluation results confirm that the user navigation on the improved structure is
indeed greatly enhanced. More interestingly, we find that heavily disoriented users are more likely to
benefit from the improved structure than the less disoriented users.
ETPL
DM-042 Information-Theoretic Outlier Detection for Large-Scale Categorical Data
Abstract: Outlier detection can usually be considered as a pre-processing step for locating, in a data
set, those objects that do not conform to well-defined notions of expected behavior. It is very
important in data mining for discovering novel or rare events, anomalies, vicious actions, exceptional
phenomena, etc. We are investigating outlier detection for categorical data sets. This problem is
especially challenging because of the difficulty of defining a meaningful similarity measure for
categorical data. In this paper, we propose a formal definition of outliers and an optimization model
of outlier detection, via a new concept of holoentropy that takes both entropy and total correlation into
consideration. Based on this model, we define a function for the outlier factor of an object which is
solely determined by the object itself and can be updated efficiently. We propose two practical
1-parameter outlier detection methods, named ITB-SS and ITB-SP, which require no user-defined
parameters for deciding whether an object is an outlier. Users need only provide the number of
outliers they want to detect. Experimental results show that ITB-SS and ITB-SP are more effective
and efficient than mainstream methods and can be used to deal with both large and high-dimensional
data sets where existing algorithms fail.
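A much-simplified version of an entropy-driven outlier factor for categorical data can be sketched by scoring each record by how improbable its attribute values are. This is a crude stand-in: the holoentropy-based factor above additionally accounts for total correlation between attributes, which the sketch ignores.

```python
import math
from collections import Counter

def outlier_factors(records):
    """Score each categorical record by the sum of the negative log
    frequencies of its attribute values: rare values raise the score.
    Illustrative only; ignores correlations between attributes."""
    n = len(records)
    n_attrs = len(records[0])
    # Per-attribute value frequencies over the whole data set.
    freqs = [Counter(r[j] for r in records) for j in range(n_attrs)]
    return [
        sum(-math.log(freqs[j][r[j]] / n) for j in range(n_attrs))
        for r in records
    ]

data = [("red", "small"), ("red", "small"), ("red", "small"), ("blue", "large")]
scores = outlier_factors(data)
```

The ("blue", "large") record, whose values each occur only once, receives the highest outlier factor; to mimic the top-k interface described above, one would return the requested number of highest-scoring records.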
ETPL
DM-043 Modeling and Solving Distributed Configuration Problems: A CSP-Based Approach
Abstract: Product configuration can be defined as the task of tailoring a product according to the
specific needs of a customer. Due to the inherent complexity of this task, which for example includes
the consideration of complex constraints or the automatic completion of partial configurations,
various Artificial Intelligence techniques have been explored in the last decades to tackle such
configuration problems. Most of the existing approaches adopt a single-site, centralized approach. In
modern supply chain settings, however, the components of a customizable product may themselves be
configurable, thus requiring a multisite, distributed approach. In this paper, we analyze the challenges
of modeling and solving such distributed configuration problems and propose an approach based on
Distributed Constraint Satisfaction. In particular, we advocate the use of Generative Constraint
Satisfaction for knowledge modeling and show in an experimental evaluation that the use of generic
constraints is particularly advantageous also in the distributed problem solving phase.
ETPL
DM-044 On Similarity Preserving Feature Selection
Abstract: In the literature of feature selection, different criteria have been proposed to evaluate the
goodness of features. In our investigation, we notice that a number of existing selection criteria
implicitly select features that preserve sample similarity, and can be unified under a common
framework. We further point out that any feature selection criteria covered by this framework cannot
handle redundant features, a common drawback of these criteria. Motivated by these observations, we
propose a new "Similarity Preserving Feature Selection" framework in an explicit and rigorous way.
We show, through theoretical analysis, that the proposed framework not only encompasses many
widely used feature selection criteria, but also naturally overcomes their common weakness in
handling feature redundancy. In developing this new framework, we begin with a conventional
combinatorial optimization formulation for similarity preserving feature selection, then extend it with
a sparse multiple-output regression formulation to improve its efficiency and effectiveness. A set of
Three algorithms are devised to efficiently solve the proposed formulations, each of which has its own
advantages in terms of computational complexity and selection performance. As exhibited by our
extensive experimental study, the proposed framework achieves superior feature selection
performance and attractive properties.
ETPL
DM-045 Protecting Sensitive Labels in Social Network Data Anonymization
Abstract: Privacy is one of the major concerns when publishing or sharing social network data for
social science research and business analysis. Recently, researchers have developed privacy models
similar to k-anonymity to prevent node reidentification through structure information. However, even
when these privacy models are enforced, an attacker may still be able to infer one's private
information if a group of nodes largely share the same sensitive labels (i.e., attributes). In other words,
the label-node relationship is not well protected by pure structure anonymization methods.
Furthermore, existing approaches, which rely on edge editing or node clustering, may significantly
alter key graph properties. In this paper, we define a k-degree-l-diversity anonymity model that
considers the protection of structural information as well as sensitive labels of individuals. We further
propose a novel anonymization methodology based on adding noise nodes. We develop a new algorithm that adds noise nodes to the original graph while introducing the least
distortion to graph properties. Most importantly, we provide a rigorous analysis of the theoretical
bounds on the number of noise nodes added and their impacts on an important graph property. We
conduct extensive experiments to evaluate the effectiveness of the proposed technique.
ETPL
DM-046 Robust Module-Based Data Management
Abstract: The current trend for building an ontology-based data management system (DMS) is to
capitalize on efforts made to design a preexisting well-established DMS (a reference system). The
method amounts to extracting from the reference DMS a piece of schema relevant to the new
application needs-a module-, possibly personalizing it with extra constraints w.r.t. the application
under construction, and then managing a data set using the resulting schema. In this paper, we extend
the existing definitions of modules and we introduce novel properties of robustness that provide
means for easily checking that a robust module-based DMS evolves safely w.r.t. both the schema and
the data of the reference DMS. We carry out our investigations in the setting of description logics
which underlie modern ontology languages, like RDFS, OWL, and OWL2 from W3C. Notably, we
focus on the DL-Lite_A dialect of the DL-Lite family, which encompasses the foundations of the QL
profile of OWL2 (i.e., DL-Lite_R): the W3C recommendation for efficiently managing large data sets.
ETPL
DM-047 Sampling Online Social Networks
Abstract: As online social networking emerges, there has been increased interest to utilize the
underlying network structure as well as the available information on social peers to improve the
information needs of a user. In this paper, we focus on improving the performance of information
collection from the neighborhood of a user in a dynamic social network. We introduce sampling-
based algorithms to efficiently explore a user's social network respecting its structure and to quickly
approximate quantities of interest. We introduce and analyze variants of the basic sampling scheme
exploring correlations across our samples. Models of centralized and distributed social networks are
considered. We show that our algorithms can be utilized to rank items in the neighborhood of a user,
assuming that information for each user in the network is available. Using real and synthetic data sets,
we validate the results of our analysis and demonstrate the efficiency of our algorithms in
approximating quantities of interest. The methods we describe are general and can easily be adapted to a variety of strategies aiming to efficiently collect information from a social graph.
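One generic sampling primitive that fits this setting is reservoir sampling, which draws a uniform sample from a stream of unknown length without materializing the whole neighborhood. The sketch below is an illustrative assumption, not the paper's own sampling scheme:

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Uniform sample of k items from a stream of unknown length.

    Each item seen so far survives in the reservoir with probability
    k / items_seen, so the final sample is uniform over the stream.
    """
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = rng.randint(0, i)   # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample

rng = random.Random(42)
print(reservoir_sample(range(1000), 5, rng))
```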
ETPL
DM-048
Supporting Flexible, Efficient, and User-Interpretable Retrieval of Similar
Time Series
Abstract: Supporting decision making in domains where the dynamics of an observed phenomenon must be dealt with can greatly benefit from the retrieval of past cases, provided that proper representation and retrieval techniques are implemented. In particular, when the parameters of interest take the form of time series, dimensionality reduction and flexible retrieval have to be addressed to this end. Classical
methodological solutions proposed to cope with these issues, typically based on mathematical
transforms, are characterized by strong limitations, such as a difficult interpretation of retrieval results
for end users, reduced flexibility and interactivity, or inefficiency. In this paper, we describe a novel
framework, in which time-series features are summarized by means of Temporal Abstractions, and
then retrieved by resorting to abstraction similarity. Our approach guarantees interpretability of the output
results, and understandability of the (user-guided) retrieval process. In particular, multilevel
abstraction mechanisms and proper indexing techniques are provided, for flexible query issuing, and
efficient and interactive query answering. Experimental results have shown the efficiency of our
approach in a scalability test, and its superiority with respect to the use of a classical mathematical
technique in flexibility, user friendliness, and also quality of results.
ETPL
DM-049
The Minimum Consistent Subset Cover Problem: A Minimization View of Data
Mining
Abstract: In this paper, we introduce and study the minimum consistent subset cover (MCSC)
problem. Given a finite ground set X and a constraint t, find the minimum number of consistent
subsets that cover X, where a subset of X is consistent if it satisfies t. The MCSC problem generalizes
the traditional set covering problem and has minimum clique partition (MCP), a dual problem of
graph coloring, as an instance. Many common data mining tasks in rule learning, clustering, and
pattern mining can be formulated as MCSC instances. In particular, we discuss the minimum rule set
(MRS) problem that minimizes model complexity of decision rules, the converse k-clustering
problem that minimizes the number of clusters, and the pattern summarization problem that
minimizes the number of patterns. For any of these MCSC instances, our proposed generic algorithm
CAG can be directly applicable. CAG starts by constructing a maximal optimal partial solution, then
performs an example-driven specific-to-general search on a dynamically maintained bipartite
assignment graph to simultaneously learn a set of consistent subsets with small cardinality covering
the ground set.
ETPL
DM-050 Transductive Multilabel Learning via Label Set Propagation
Abstract: The problem of multilabel classification has attracted great interest in the last decade, where
each instance can be assigned with a set of multiple class labels simultaneously. It has a wide variety
of real-world applications, e.g., automatic image annotations and gene function analysis. Current
research on multilabel classification focuses on supervised settings which assume existence of large
amounts of labeled training data. However, in many applications, the labeling of multilabeled data is
extremely expensive and time consuming, while there are often abundant unlabeled data available. In
this paper, we study the problem of transductive multilabel learning and propose a novel solution,
called Transductive Multilabel Classification (TraM), to effectively assign a set of multiple labels to
each instance. Different from supervised multilabel learning methods, we estimate the label sets of the
unlabeled instances effectively by utilizing the information from both labeled and unlabeled data. We
first formulate the transductive multilabel learning as an optimization problem of estimating label
concept compositions. Then, we derive a closed-form solution to this optimization problem and
propose an effective algorithm to assign label sets to the unlabeled instances. Empirical studies on
several real-world multilabel learning tasks demonstrate that our TraM method can effectively boost
the performance of multilabel classification by using both labeled and unlabeled data.
ETPL
DM-051
A Method for Mining Infrequent Causal Associations and Its Application in
Finding Adverse Drug Reaction Signal Pairs
Abstract: In many real-world applications, it is important to mine causal relationships where an event
or event pattern causes certain outcomes with low probability. Discovering this kind of causal
relationships can help us prevent or correct negative outcomes caused by their antecedents. In this
paper, we propose an innovative data mining framework and apply it to mine potential causal
associations in electronic patient data sets where the drug-related events of interest occur infrequently.
Specifically, we created a novel interestingness measure, exclusive causal-leverage, based on a
computational, fuzzy recognition-primed decision (RPD) model that we previously developed. On the
basis of this new measure, a data mining algorithm was developed to mine the causal relationship
between drugs and their associated adverse drug reactions (ADRs). The algorithm was tested on real
patient data retrieved from the Veterans Affairs Medical Center in Detroit, Michigan. The retrieved
data included 16,206 patients (15,605 male, 601 female). The exclusive causal-leverage was
employed to rank the potential causal associations between each of the three selected drugs (i.e.,
enalapril, pravastatin, and rosuvastatin) and 3,954 recorded symptoms, each of which corresponded to
a potential ADR. The top 10 drug-symptom pairs for each drug were evaluated by the physicians on
our project team. The numbers of symptoms considered as likely real ADRs for enalapril, pravastatin,
and rosuvastatin were 8, 7, and 6, respectively. These preliminary results indicate the usefulness of
our method in finding potential ADR signal pairs for further analysis (e.g., epidemiology study) and
investigation (e.g., case review) by drug safety professionals.
ETPL
DM-052
A Survey of Discretization Techniques: Taxonomy and Empirical Analysis in
Supervised Learning
Abstract: Discretization is an essential preprocessing technique used in many knowledge discovery
and data mining tasks. Its main goal is to transform a set of continuous attributes into discrete ones,
by associating categorical values to intervals and thus transforming quantitative data into qualitative
data. In this manner, symbolic data mining algorithms can be applied over continuous data and the
representation of information is simplified, making it more concise and specific. The literature
provides numerous proposals of discretization and some attempts to categorize them into a taxonomy
can be found. However, in previous papers, there is a lack of consensus in the definition of the
properties and no formal categorization has been established yet, which may be confusing for
practitioners. Furthermore, only a small set of discretizers have been widely considered, while many
other methods have gone unnoticed. With the intention of alleviating these problems, this paper
provides a survey of discretization methods proposed in the literature from a theoretical and empirical
perspective. From the theoretical perspective, we develop a taxonomy based on the main properties
pointed out in previous research, unifying the notation and including all the known methods up to
date. Empirically, we conduct an experimental study in supervised classification involving the most
representative and newest discretizers, different types of classifiers, and a large number of data sets.
The results of their performances measured in terms of accuracy, number of intervals, and
inconsistency have been verified by means of nonparametric statistical tests. Additionally, a set of
discretizers are highlighted as the best performing ones.
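Two of the simplest unsupervised discretizers covered by such taxonomies are equal-width and equal-frequency binning; the minimal sketch below illustrates how the choice of cut points changes the resulting qualitative intervals:

```python
import numpy as np

def equal_width(values, k):
    """Split the value range into k intervals of equal width."""
    edges = np.linspace(values.min(), values.max(), k + 1)
    # np.digitize against the k-1 inner edges yields bin indices 0..k-1
    return np.clip(np.digitize(values, edges[1:-1]), 0, k - 1)

def equal_frequency(values, k):
    """Choose cut points so each interval holds roughly the same count."""
    edges = np.quantile(values, np.linspace(0, 1, k + 1))
    return np.clip(np.digitize(values, edges[1:-1]), 0, k - 1)

x = np.array([1.0, 2.0, 2.5, 3.0, 10.0, 11.0])
print(equal_width(x, 2))      # width-based: the large values dominate the split
print(equal_frequency(x, 2))  # frequency-based: balanced bin counts
```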
ETPL
DM-053 Clustering Uncertain Data Based on Probability Distribution Similarity
Abstract: Clustering on uncertain data, one of the essential tasks in mining uncertain data, poses
significant challenges on both modeling similarity between uncertain objects and developing efficient
computational methods. The previous methods extend traditional partitioning clustering methods like
k-means and density-based clustering methods like DBSCAN to uncertain data, and thus rely on
geometric distances between objects. Such methods cannot handle uncertain objects that are
geometrically indistinguishable, such as products with the same mean but very different variances in
customer ratings. Surprisingly, probability distributions, which are essential characteristics of
uncertain objects, have not been considered in measuring similarity between uncertain objects. In this
paper, we systematically model uncertain objects in both continuous and discrete domains, where an
uncertain object is modeled as a continuous and discrete random variable, respectively. We use the
well-known Kullback-Leibler divergence to measure similarity between uncertain objects in both the
continuous and discrete cases, and integrate it into partitioning and density-based clustering methods to cluster uncertain objects. Nevertheless, a naive implementation is very costly. Particularly,
computing exact KL divergence in the continuous case is very costly or even infeasible. To tackle the
problem, we estimate KL divergence in the continuous case by kernel density estimation and employ
the fast Gauss transform technique to further speed up the computation. Our extensive experiment
results verify the effectiveness, efficiency, and scalability of our approaches.
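The discrete-case similarity can be illustrated with a direct KL divergence computation. The example below (an illustrative sketch, not the paper's clustering algorithms) shows two rating distributions with the same mean but very different spread, exactly the case geometric distances cannot separate:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D(P || Q) for discrete distributions given as probability vectors."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    # smooth q slightly so that log never sees an exact zero
    return float(np.sum(p[mask] * np.log(p[mask] / (q[mask] + eps))))

# Two rating distributions over scores 1..5 with the same mean (3)
# but very different spread -- indistinguishable by the distance
# between their means, yet clearly different as distributions.
narrow = [0.0, 0.1, 0.8, 0.1, 0.0]
wide   = [0.4, 0.1, 0.0, 0.1, 0.4]
print(kl_divergence(narrow, wide))    # large: very dissimilar objects
print(kl_divergence(narrow, narrow))  # ~0: identical distributions
```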
ETPL
DM-054 Efficient Evaluation of SUM Queries over Probabilistic Data
Abstract: SUM queries are crucial for many applications that need to deal with uncertain data. In this
paper, we are interested in the queries, called ALL_SUM, that return all possible sum values and their
probabilities. In general, there is no efficient solution for the problem of evaluating ALL_SUM
queries. But, for many practical applications, where aggregate values are small integers or real
numbers with small precision, it is possible to develop efficient solutions. In this paper, based on a
recursive approach, we propose a new solution for those applications. We implemented our solution
and conducted an extensive experimental evaluation over synthetic and real-world data sets; the
results show its effectiveness.
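The recursive idea can be sketched as a dynamic program that folds in one uncertain tuple at a time; this is a generic illustration, while the paper's solution additionally exploits aggregate values being small integers or low-precision reals:

```python
from collections import defaultdict

def all_sum(tuples):
    """Distribution of SUM over uncertain tuples.

    Each tuple is (value, probability-of-existence); tuples are independent.
    The recursion processes one tuple at a time: either it is absent
    (prob 1-p) or present and adds v to the running sum (prob p).
    """
    dist = {0: 1.0}
    for v, p in tuples:
        nxt = defaultdict(float)
        for s, prob in dist.items():
            nxt[s] += prob * (1 - p)      # tuple absent
            nxt[s + v] += prob * p        # tuple present
        dist = dict(nxt)
    return dist

d = all_sum([(1, 0.5), (2, 0.5)])
# sums 0, 1, 2, 3, each with probability 0.25
print(sorted(d.items()))
```

Keeping values as small integers bounds the number of distinct sums, which is what makes this style of evaluation practical.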
ETPL
DM-055 Efficient Service Skyline Computation for Composite Service Selection
Abstract: Service composition is emerging as an effective vehicle for integrating existing web services
to create value-added and personalized composite services. As web services with similar functionality
are expected to be provided by competing providers, a key challenge is to find the “best” web services
to participate in the composition. When multiple quality aspects (e.g., response time, fee, etc.) are
considered, a weighting mechanism is usually adopted by most existing approaches, which requires
users to specify their preferences as numeric values. We propose to exploit the dominance
relationship among service providers to find a set of “best” possible composite services, referred to as
a composite service skyline. We develop efficient algorithms that allow us to find the composite
service skyline from a significantly reduced searching space instead of considering all possible
service compositions. We propose a novel bottom-up computation framework that enables the skyline
algorithm to scale well with the number of services in a composition. We conduct a comprehensive
analytical and experimental study to evaluate the effectiveness, efficiency, and scalability of the
composite skyline computation approaches.
ETPL
DM-056 Finding Probabilistic Prevalent Colocations in Spatially Uncertain Data Sets
Abstract: A spatial colocation pattern is a group of spatial features whose instances are frequently
located together in geographic space. Discovering colocations has many useful applications. For
example, colocated plant species discovered from plant distribution data sets can contribute to the
analysis of plant geography, phytosociology studies, and plant protection recommendations. In this
paper, we study the colocation mining problem in the context of uncertain data, as the data generated
from a wide range of data sources are inherently uncertain. One straightforward method to mine the
prevalent colocations in a spatially uncertain data set is to simply compute the expected participation
index of a candidate and decide if it exceeds a minimum prevalence threshold. Although this
definition has been widely adopted, it misses important information about the confidence which can
be associated with the participation index of a colocation. We propose another definition, probabilistic
prevalent colocations, trying to find all the colocations that are likely to be prevalent in a randomly
generated possible world. Finding probabilistic prevalent colocations (PPCs) turns out to be difficult.
First, we propose pruning strategies for candidates to reduce the amount of computation of the
probabilistic participation index values. Next, we design an improved dynamic programming
algorithm for identifying candidates. This algorithm is suitable for parallel computation and
approximate computation. Finally, the effectiveness and efficiency of the methods proposed as well as
the pruning strategies and the optimization techniques are verified by extensive experiments with
“real + synthetic” spatially uncertain data sets.
ETPL
DM-057
Fuzzy Web Data Tables Integration Guided by an Ontological and
Terminological Resource
Abstract: In this paper, we present the design of the ONDINE system, which allows the loading and the
querying of a data warehouse opened on the Web, guided by an Ontological and Terminological
Resource (OTR). The data warehouse, composed of data tables extracted from Web documents, has
been built to supplement existing local data sources. First, we present the main steps of our
semiautomatic method to annotate data tables driven by an OTR. The output of this method is an
XML/RDF data warehouse composed of XML documents representing data tables with their fuzzy
RDF annotations. We then present our flexible querying system which allows the local data sources
and the data warehouse to be simultaneously and uniformly queried, using the OTR. This system
relies on SPARQL and allows approximate answers to be retrieved by comparing preferences
expressed as fuzzy sets with fuzzy RDF annotations.
ETPL
DM-058 PMSE: A Personalized Mobile Search Engine
Abstract: We propose a personalized mobile search engine (PMSE) that captures the users'
preferences in the form of concepts by mining their clickthrough data. Due to the importance of
location information in mobile search, PMSE classifies these concepts into content concepts and
location concepts. In addition, users' locations (positioned by GPS) are used to supplement the
location concepts in PMSE. The user preferences are organized in an ontology-based, multifacet user
profile, which is used to adapt a personalized ranking function for rank adaptation of future search
results. To characterize the diversity of the concepts associated with a query and their relevances to
the user's need, four entropies are introduced to balance the weights between the content and location
facets. Based on the client-server model, we also present a detailed architecture and design for
implementation of PMSE. In our design, the client collects and stores locally the clickthrough data to
protect privacy, whereas heavy tasks such as concept extraction, training, and reranking are
performed at the PMSE server. Moreover, we address the privacy issue by restricting the information
in the user profile exposed to the PMSE server with two privacy parameters. We prototype PMSE on
the Google Android platform. Experimental results show that PMSE significantly improves the
precision compared to the baseline.
ETPL
DM-059 Range-Based Skyline Queries in Mobile Environments
Abstract: Skyline query processing for location-based services, which considers both spatial and
nonspatial attributes of the objects being queried, has recently received increasing attention. Existing
solutions focus on solving point- or line-based skyline queries, in which the query location is an exact
location point or a line segment. However, due to privacy concerns and limited precision of
localization devices, the input of a user location is often a spatial range. This paper studies a new
problem of how to process such range-based skyline queries. Two novel algorithms are proposed: one
is index-based (I-SKY) and the other is not based on any index (N-SKY). To handle frequent
movements of the objects being queried, we also propose incremental versions of I-SKY and N-SKY,
which avoid recomputing the query index and results from scratch. Additionally, we develop efficient
solutions for probabilistic and continuous range-based skyline queries. Experimental results show that
our proposed algorithms clearly outperform the baseline algorithm that adopts the existing line-based
skyline solution. Moreover, the incremental versions of I-SKY and N-SKY save substantial
computation cost, especially when the objects move frequently.
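The dominance relation that underlies all skyline variants can be illustrated with a naive point-based skyline; this is the classical definition, not the range-based I-SKY/N-SKY algorithms:

```python
def dominates(p, q):
    """p dominates q when p is no worse in every dimension and strictly
    better in at least one (smaller is better, e.g. distance and price)."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def skyline(points):
    """Naive O(n^2) skyline: keep the points no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# hypothetical (distance, price) pairs for candidate objects
hotels = [(1, 9), (2, 7), (4, 4), (3, 8), (9, 1), (7, 6)]
print(skyline(hotels))  # → [(1, 9), (2, 7), (4, 4), (9, 1)]
```

A range-based query must instead report every point that could belong to the skyline for some location inside the user's range, which is what makes incremental maintenance worthwhile.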
ETPL
DM-060 Skyline Processing on Distributed Vertical Decompositions
Abstract: We assume a data set that is vertically decomposed among several servers, and a client that
wishes to compute the skyline by obtaining the minimum number of points. Existing solutions for this
problem are restricted to the case where each server maintains exactly one dimension. This paper
proposes a general solution for vertical decompositions of arbitrary dimensionality. We first
investigate some interesting problem characteristics regarding the pruning power of points. Then, we
introduce vertical partition skyline (VPS), an algorithmic framework that includes two steps. Phase 1
searches for an anchor point P_anc that dominates, and hence eliminates, a large number of records. Starting with P_anc, Phase 2 incrementally constructs a pruning area using an interesting union-
intersection property of dominance regions. Servers do not transmit points that fall within the pruning
area in their local subspace. Our experiments confirm the effectiveness of the proposed methods
under various settings.
ETPL
DM-061 Spatial Query Integrity with Voronoi Neighbors
Abstract: With the popularity of location-based services and the abundant usage of smart phones and GPS-
enabled devices, the necessity of outsourcing spatial data has grown rapidly over the past few years.
Meanwhile, the fast arising trend of cloud storage and cloud computing services has provided a
flexible and cost-effective platform for hosting data from businesses and individuals, further enabling
many location-based applications. Nevertheless, in this database outsourcing paradigm, the
authentication of the query results at the client remains a challenging problem. In this paper, we focus
on the Outsourced Spatial Database (OSDB) model and propose an efficient scheme, called VN-Auth,
which allows a client to verify the correctness and completeness of the result set. Our approach is
based on neighborhood information derived from the Voronoi diagram of the underlying spatial data
set and can handle fundamental spatial query types, such as k nearest neighbor and range queries, as
well as more advanced query types like reverse k nearest neighbor, aggregate nearest neighbor, and
spatial skyline. We evaluated VN-Auth based on real-world data sets using mobile devices (Google
Droid smart phones with Android OS) as query clients. Compared to the current state-of-the-art
approaches (i.e., methods based on Merkle Hash Trees), our experiments show that VN-Auth
produces significantly smaller verification objects and is more computationally efficient, especially
for queries with low selectivity.
ETPL
DM-062 Supporting Pattern-Preserving Anonymization for Time-Series Data
Abstract: Time series is an important form of data available in numerous applications and often
contains vast amount of personal privacy. The need to protect privacy in time-series data while
effectively supporting complex queries on them poses nontrivial challenges to the database
community. We study the anonymization of time series while trying to support complex queries, such
as range and pattern matching queries, on the published data. The conventional k-anonymity model
cannot effectively address this problem as it may suffer severe pattern loss. We propose a novel
anonymization model called (k, P)-anonymity for pattern-rich time series. This model publishes both
the attribute values and the patterns of time series in separate data forms. We demonstrate that our
model can prevent linkage attacks on the published data while effectively supporting a wide variety of
queries on the anonymized data. We propose two algorithms to enforce (k, P)-anonymity on time-
series data. Our anonymity model supports customized data publishing, which allows a certain part of
the values but a different part of the pattern of the anonymized time series to be published
simultaneously. We present estimation techniques to support query processing on such customized
data. The proposed methods are evaluated in a comprehensive experimental study. Our results verify
the effectiveness and efficiency of our approach.
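To illustrate the idea of publishing values and patterns in separate data forms, a hypothetical direction-based pattern encoding (not the paper's own representation) might look like:

```python
def pattern(series):
    """Encode a time series by its direction of change -- a coarse 'pattern'
    that can be published separately from the exact attribute values."""
    return "".join("u" if b > a else "d" if b < a else "s"
                   for a, b in zip(series, series[1:]))

print(pattern([1, 3, 2, 2, 5]))  # → "udsu"
```

Pattern matching queries can then run against such symbolic strings even after the underlying values have been k-anonymized.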
ETPL
DM-063 Synchronization-Inspired Partitioning and Hierarchical Clustering
Abstract: Synchronization is a powerful and inherently hierarchical concept regulating a large variety of
complex processes ranging from the metabolism in a cell to opinion formation in a group of
individuals. Synchronization phenomena in nature have been widely investigated and models
concisely describing the dynamical synchronization process have been proposed, e.g., the well-known
Extensive Kuramoto Model. We explore the potential of the Extensive Kuramoto Model for data
clustering. We regard each data object as a phase oscillator and simulate the dynamical behavior of
the objects over time. By interaction with similar objects, the phase of an object gradually aligns with
its neighborhood, resulting in a nonlinear object movement naturally driven by the local cluster
structure. We demonstrate that our framework has several attractive benefits: 1) It is suitable to detect
clusters of arbitrary number, shape, and data distribution, even in difficult settings with noise points
and outliers. 2) Combined with the Minimum Description Length (MDL) principle, it allows
partitioning and hierarchical clustering without requiring any input parameters which are difficult to
estimate. 3) Synchronization faithfully captures the natural hierarchical cluster structure of the data
and MDL suggests meaningful levels of abstraction. Extensive experiments demonstrate the
effectiveness and efficiency of our approach.
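The core dynamics can be sketched in one dimension: each object is treated as a phase oscillator that aligns with its eps-neighbors through a Kuramoto-style sin coupling. This is an illustrative toy, not the full MDL-based partitioning and hierarchical algorithm:

```python
import numpy as np

def sync_step(x, eps=0.25, dt=1.0):
    """One synchronization step: every point moves toward the phases of
    its eps-neighbors via a Kuramoto-style sin coupling. Each point is
    always its own neighbor, so the neighborhood is never empty."""
    new = x.copy()
    for i in range(len(x)):
        nbrs = x[np.abs(x - x[i]) <= eps]
        new[i] = x[i] + dt * np.mean(np.sin(nbrs - x[i]))
    return new

# Two groups of 1-D "phases"; after a few steps each group collapses
# onto a single value -- the cluster structure emerges from the dynamics.
x = np.array([0.0, 0.1, 0.2, 2.0, 2.1, 2.2])
for _ in range(20):
    x = sync_step(x)
print(np.round(x, 3))  # two tight clusters, near 0.1 and near 2.1
```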
ETPL
DM-064 Transfer across Completely Different Feature Spaces via Spectral Embedding
Abstract: In many applications, it is very expensive or time consuming to obtain a lot of labeled
examples. One practically important problem is: can the labeled data from other related sources help
predict the target task, even if they have 1) different feature spaces (e.g., image versus text data), 2)
different data distributions, and 3) different output spaces? This paper proposes a solution and
discusses the conditions where this is highly likely to produce better results. It first unifies the feature
spaces of the target and source data sets by spectral embedding, even when they have completely
different feature spaces. The principle is to devise an optimization objective that preserves the original
structure of the data, while at the same time, maximizes the similarity between the two. A linear
projection model, as well as a nonlinear approach, is derived on the basis of this principle in closed
form. Second, a judicious sample selection strategy is applied to select only the related source
examples. Finally, a Bayesian-based approach is applied to model the relationship between different
output spaces. The three steps can bridge related heterogeneous sources in order to learn the target
task. Among the 20 experiment data sets, for example, images with wavelet-transform-based
features are used to predict another set of images whose features are constructed from the color-histogram
space; documents are used to help image classification, etc. By using these extracted examples from
heterogeneous sources, the models can reduce the error rate by as much as 50 percent, compared with
the methods using only the examples from the target task.
ETPL
DM-065
Tweet Analysis for Real-Time Event Detection and Earthquake Reporting
System Development
Abstract: Twitter has received much attention recently. An important characteristic of Twitter is its
real-time nature. We investigate the real-time interaction of events such as earthquakes in Twitter and
propose an algorithm to monitor tweets and to detect a target event. To detect a target event, we
devise a classifier of tweets based on features such as the keywords in a tweet, the number of words,
and their context. Subsequently, we produce a probabilistic spatiotemporal model for the target event
that can find the center of the event location. We regard each Twitter user as a sensor and apply
particle filtering, which is widely used for location estimation. The particle filter works better than
other comparable methods for estimating the locations of target events. As an application, we develop
an earthquake reporting system for use in Japan. Because of the numerous earthquakes and the large
number of Twitter users throughout the country, we can detect an earthquake with high probability
(93 percent of earthquakes of Japan Meteorological Agency (JMA) seismic intensity scale 3 or more
are detected) merely by monitoring tweets. Our system detects earthquakes promptly, and notifications
are delivered much faster than JMA broadcast announcements.
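The user-as-sensor idea can be illustrated with a minimal 1D particle-filter sketch. The region bounds, noise level, and particle count below are assumed for illustration only:

```python
import math
import random

def estimate_event_location(reports, n_particles=2000, noise=1.0, seed=7):
    """Minimal 1D particle-filter sketch of the location-estimation step:
    every user report is a noisy sensor reading of the event position;
    particles are weighted by how well they explain each report, then
    resampled. Region bounds and noise level are illustrative assumptions."""
    rng = random.Random(seed)
    # prior: particles spread uniformly over an assumed region [0, 100]
    particles = [rng.uniform(0.0, 100.0) for _ in range(n_particles)]
    for z in reports:
        # Gaussian likelihood of the report given each particle
        weights = [math.exp(-((p - z) ** 2) / (2.0 * noise ** 2))
                   for p in particles]
        if sum(weights) == 0.0:
            continue  # report explains no particle; skip it
        # resample in proportion to the weights
        particles = rng.choices(particles, weights=weights, k=n_particles)
        # small jitter keeps particle diversity after resampling
        particles = [p + rng.gauss(0.0, 0.1) for p in particles]
    return sum(particles) / len(particles)
```

A handful of reports clustered around one position is enough to pull the particle cloud, and hence the posterior mean, toward the true event location.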
ETPL
DM-066
TW-$k$-Means: Automated Two-Level Variable Weighting Clustering
Algorithm for Multiview Data
Abstract: This paper proposes TW-k-means, an automated two-level variable weighting clustering
algorithm for multiview data, which can simultaneously compute weights for views and individual
variables. In this algorithm, a view weight is assigned to each view to identify the compactness of the
view and a variable weight is also assigned to each variable in the view to identify the importance of
the variable. Both view weights and variable weights are used in the distance function to determine
the clusters of objects. In the new algorithm, two additional steps are added to the iterative k-means
clustering process to automatically compute the view weights and the variable weights. We used two
real-life data sets to investigate the properties of two types of weights in TW-k-means and
investigated the difference between the weights of TW-k-means and the weights of the individual
variable weighting method. The experiments have revealed the convergence property of the view
weights in TW-k-means. We compared TW-k-means with five clustering algorithms on three real-life
data sets and the results have shown that the TW-k-means algorithm significantly outperformed the
other five clustering algorithms in four evaluation indices.
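The two-level weighted distance at the core of the algorithm can be written compactly. The function below is an illustrative rendering (the names are ours, not the paper's notation):

```python
def two_level_distance(x, center, views, view_w, var_w):
    """Illustrative form of the two-level weighted distance used in
    TW-k-means: view t carries weight view_w[t]; variable v carries weight
    var_w[v]; both scale the per-variable squared error between object x
    and the cluster center."""
    d = 0.0
    for t, variables in enumerate(views):
        d += view_w[t] * sum(var_w[v] * (x[v] - center[v]) ** 2
                             for v in variables)
    return d
```

In the full algorithm, both weight vectors are recomputed in two extra steps of each k-means iteration rather than fixed in advance.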
ETPL
DM-067 U-Skyline: A New Skyline Query for Uncertain Databases
Abstract: The skyline query, aiming at identifying a set of skyline tuples that are not dominated by
any other tuple, is particularly useful for multicriteria data analysis and decision making. For
uncertain databases, a probabilistic skyline query, called P-Skyline, has been developed to return
skyline tuples by specifying a probability threshold. However, the answer obtained via a P-Skyline
query usually includes skyline tuples undesirably dominating each other when a small threshold is
specified, or it may contain far fewer skyline tuples if a larger threshold is employed. To address
this concern, we propose a new uncertain skyline query, called U-Skyline query, in this paper. Instead
of setting a probabilistic threshold to qualify each skyline tuple independently, the U-Skyline query
searches for a set of tuples that has the highest probability (aggregated from all possible scenarios) as
the skyline answer. In order to answer U-Skyline queries efficiently, we propose a number of
optimization techniques for query processing, including 1) computational simplification of U-Skyline
probability, 2) pruning of unqualified candidate skylines and early termination of query processing, 3)
reduction of the input data set, and 4) partition and conquest of the reduced data set. We perform a
comprehensive performance evaluation on our algorithm and an alternative approach that formulates
the U-Skyline processing problem by integer programming. Experimental results demonstrate that our
algorithm is 10-100 times faster than using CPLEX, a parallel integer programming solver, to answer
the U-Skyline query.
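For intuition, the basic certain-data skyline that P-Skyline and U-Skyline generalize can be computed directly (smaller values assumed better in every dimension):

```python
def skyline(points):
    """Basic certain-data skyline: keep each tuple not dominated by any
    other. p dominates q when p <= q in every dimension and p < q in at
    least one. U-Skyline generalizes this test to probability-weighted
    possible worlds of an uncertain database."""
    def dominates(p, q):
        return (all(a <= b for a, b in zip(p, q))
                and any(a < b for a, b in zip(p, q)))
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]
```

The quadratic pairwise test shown here is exactly what the paper's pruning and partitioning techniques are designed to avoid at scale.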
ETPL
DM-068
A Novel Profit Maximizing Metric for Measuring Classification Performance of
Customer Churn Prediction Models
Abstract: The interest for data mining techniques has increased tremendously during the past decades,
and numerous classification techniques have been applied in a wide range of business applications.
Hence, the need for adequate performance measures has become more important than ever. In this
paper, a cost-benefit analysis framework is formalized in order to define performance measures which
are aligned with the main objectives of the end users, i.e., profit maximization. A new performance
measure is defined, the expected maximum profit criterion. This general framework is then applied to
the customer churn problem with its particular cost-benefit structure. The advantage of this approach
is that it assists companies with selecting the classifier which maximizes the profit. Moreover, it aids
with the practical implementation in the sense that it provides guidance about the fraction of the
customer base to be included in the retention campaign.
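The idea of scoring a classifier by campaign profit rather than accuracy can be sketched as follows. The parameter names (clv, incentive, accept_rate) and values are illustrative assumptions, not the paper's notation:

```python
def best_campaign_fraction(scores, churners, clv=200.0, incentive=10.0,
                           accept_rate=0.3):
    """Hedged sketch of a profit-driven evaluation of a churn model:
    include the top-scored fraction of customers in a retention campaign,
    pay each an incentive, gain accept_rate * clv for every true churner
    retained, and return the profit-maximizing fraction and its profit."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    best_frac, best_profit, profit = 0.0, 0.0, 0.0
    for k, i in enumerate(order, start=1):
        profit += (accept_rate * clv if churners[i] else 0.0) - incentive
        if profit > best_profit:
            best_profit, best_frac = profit, k / len(order)
    return best_frac, best_profit
```

The returned fraction is the practical guidance mentioned above: how much of the customer base to include in the retention campaign.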
ETPL
DM-069
A Predictive-Reactive Method for Improving the Robustness of Real-Time Data
Services
Abstract: Supporting timely data services using fresh data in data-intensive real-time applications,
such as e-commerce and transportation management, is desirable but challenging, since the workload
may vary dynamically. To control the data service delay to be below the specified threshold, we
develop a predictive as well as reactive method for database admission control. The predictive method
derives the workload bound for admission control in a predictive manner, making no statistical or
queuing-theoretic assumptions about workloads. Also, our reactive scheme based on formal feedback
control theory continuously adjusts the database load bound to support the delay threshold. By
adapting the load bound in a proactive fashion, we attempt to avoid severe overload conditions and
excessive delays before they occur. Also, the feedback control scheme enhances the timeliness by
compensating for potential prediction errors due to dynamic workloads. Hence, the predictive and
reactive methods complement each other, enhancing the robustness of real-time data services as a
whole. We implement the integrated approach and several baselines in an open-source database.
Compared to the tested open-loop, feedback-only, and statistical prediction + feedback baselines
representing the state of the art, our integrated method significantly improves the average/transient
delay and real-time data service throughput.
ETPL
DM-070 Achieving Data Privacy through Secrecy Views and Null-Based Virtual Updates
Abstract: We may want to keep sensitive information in a relational database hidden from a user or
group thereof. We characterize sensitive data as the extensions of secrecy views. The database, before
returning the answers to a query posed by a restricted user, is updated to make the secrecy views
empty or contain a single tuple with null values. Then, a query about any of those views returns no
meaningful information. Since the database is not supposed to be physically changed for this purpose,
the updates are only virtual, and also minimal. Minimality ensures that query answers, while being
privacy preserving, are also maximally informative. The virtual updates are based on null values as
used in the SQL standard. We provide the semantics of secrecy views, virtual updates, and secret
answers (SAs) to queries. The different instances resulting from the virtual updates are specified as
the models of a logic program with stable model semantics, which becomes the basis for computation
of the SAs.
ETPL
DM-071 Co-Occurrence-Based Diffusion for Expert Search on the Web
Abstract: Expert search has been studied in different contexts, e.g., enterprises and academic communities.
We examine a general expert search problem: searching experts on the web, where millions of
webpages and thousands of names are considered. It has mainly two challenging issues: 1) webpages
could be of varying quality and full of noises; 2) The expertise evidences scattered in webpages are
usually vague and ambiguous. We propose to leverage the large amount of co-occurrence information
to assess relevance and reputation of a person name for a query topic. The co-occurrence structure is
modeled using a hypergraph, on which a heat diffusion based ranking algorithm is proposed. Query
keywords are regarded as heat sources, and a person name that has a strong connection with the
query (i.e., frequently co-occurs with query keywords and co-occurs with other names related to query
keywords) will receive most of the heat, thus being ranked high. Experiments on the ClueWeb09 web
collection show that our algorithm is effective for retrieving experts and outperforms baseline
algorithms significantly. This work can be regarded as one step toward addressing the more general
entity search problem without sophisticated NLP techniques.
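The heat-diffusion ranking idea can be sketched on a plain co-occurrence graph (the paper itself uses a hypergraph; this simplification is ours):

```python
def diffusion_rank(adj, sources, alpha=0.5, steps=30):
    """Toy heat-diffusion ranking: query keywords act as constant heat
    sources; heat flows along co-occurrence edges, so names strongly
    connected to the query end up hottest and are ranked first.
    adj maps each node to its list of neighbors."""
    heat = {v: (1.0 if v in sources else 0.0) for v in adj}
    for _ in range(steps):
        new = {}
        for v in adj:
            inflow = (sum(heat[u] / len(adj[u]) for u in adj[v])
                      if adj[v] else 0.0)
            # sources keep injecting heat; every node receives diffused heat
            new[v] = (1 - alpha) * (1.0 if v in sources else 0.0) + alpha * inflow
        heat = new
    return sorted(adj, key=lambda v: -heat[v])
```

Nodes reachable from the query only through long, weakly connected paths receive little heat and fall to the bottom of the ranking.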
ETPL
DM-072
Efficient All Top-$k$ Computation—A Unified Solution for All Top-
$k$, Reverse Top-$k$ and Top-$m$ Influential Queries
Abstract: Given a set of objects $P$ and a set of ranking functions $F$ over $P$, an interesting
problem is to compute the top ranked objects for all functions. Evaluation of multiple top-$k$
queries finds application in systems where there is a heavy workload of ranking queries (e.g., online
search engines and product recommendation systems). The simple solution of evaluating the top-$k$
queries one by one does not scale well; instead, the system can make use of the fact that similar
queries share common results to accelerate search. This paper is, to our knowledge, the first thorough
study of this problem. We propose methods that compute all top-$k$ queries in batch. Our first
solution applies the block indexed nested loops paradigm, while our second technique is a view-based
algorithm. We propose appropriate optimization techniques for the two approaches and demonstrate
experimentally that the second approach is consistently the best. Our approach facilitates evaluation
of other complex queries that depend on the computation of multiple top-$k$ queries, such as
reverse top-$k$ and top-$m$ influential queries. We show that our batch processing technique for
these complex queries outperforms the state-of-the-art by orders of magnitude.
ETPL
DM-073 Efficient and Effective Duplicate Detection in Hierarchical Data
Abstract: Although there is a long line of work on identifying duplicates in relational data, only a few
solutions focus on duplicate detection in more complex hierarchical structures, like XML data. In this
paper, we present a novel method for XML duplicate detection, called XMLDup. XMLDup uses a
Bayesian network to determine the probability of two XML elements being duplicates, considering
not only the information within the elements, but also the way that information is structured. In
addition, to improve the efficiency of the network evaluation, a novel pruning strategy, capable of
significant gains over the unoptimized version of the algorithm, is presented. Through experiments,
we show that our algorithm is able to achieve high precision and recall scores in several data sets.
XMLDup is also able to outperform another state-of-the-art duplicate detection solution, both in terms
of efficiency and of effectiveness.
ETPL
DM-074 Failure-Aware Cascaded Suppression in Wireless Sensor Networks
Abstract: Wireless sensor networks are widely used to continuously collect data from the
environment. Because of energy constraints on battery-powered nodes, it is critical to minimize
communication. Suppression has been proposed as a way to reduce communication by using
predictive models to suppress reporting of predictable data. However, in the presence of
communication failures, missing data are difficult to interpret because they could have been either
suppressed or lost in transmission. There is no existing solution for handling failures for general,
spatiotemporal suppression that uses cascading. While cascading further reduces communication, it
makes failure handling difficult, because nodes can act on incomplete or incorrect information and in
turn affect other nodes. We propose a cascaded suppression framework that exploits both temporal
and spatial data correlation to reduce communication, and applies coding theory and Bayesian
inference to recover missing data resulting from suppression and communication failures. Experimental
results show that cascaded suppression significantly reduces communication cost and improves
missing data recovery compared to existing approaches.
ETPL
DM-075 Multiview Partitioning via Tensor Methods
Abstract: Clustering by integrating multiview representations has become a crucial issue for
knowledge discovery in heterogeneous environments. However, most prior approaches assume that
the multiple representations share the same dimension, limiting their applicability to homogeneous
environments. In this paper, we present a novel tensor-based framework for integrating heterogeneous
multiview data in the context of spectral clustering. Our framework includes two novel formulations,
that is, multiview clustering based on the integration of the Frobenius-norm objective function
(MC-FR-OI) and that based on matrix integration in the Frobenius-norm objective function (MC-FR-MI).
We show that the solutions for both formulations can be computed by tensor decompositions. We
evaluated our methods on synthetic data and two real-world data sets in comparison with baseline
methods. Experimental results demonstrate that the proposed formulations are effective in integrating
multiview data in heterogeneous environments.
ETPL
DM-076 Novel Biobjective Clustering (BiGC) Based on Cooperative Game Theory
Abstract: We propose a new approach to clustering. Our idea is to map cluster formation to coalition
formation in cooperative games, and to use the Shapley value of the patterns to identify clusters and
cluster representatives. We show that the underlying game is convex and this leads to an efficient
biobjective clustering algorithm that we call BiGC. The algorithm yields high-quality clustering with
respect to average point-to-center distance (potential) as well as average intracluster point-to-point
distance (scatter). We demonstrate the superiority of BiGC over state-of-the-art clustering algorithms
(including the center based and the multiobjective techniques) through a detailed experimentation
using standard cluster validity criteria on several benchmark data sets. We also show that BiGC
satisfies key clustering properties such as order independence, scale invariance, and richness.
ETPL
DM-077
On Generalizable Low False-Positive Learning Using Asymmetric Support
Vector Machines
Abstract: Support Vector Machines (SVMs) have been widely used for classification due to their
ability to give low generalization error. In many practical applications of classification, however, the
wrong prediction of a certain class is much more severe than that of the other classes, making the original
SVM unsatisfactory. In this paper, we propose the notion of Asymmetric Support Vector Machine
(ASVM), an asymmetric extension of the SVM, for these applications. Different from the existing
SVM extensions such as thresholding and parameter tuning, ASVM employs a new objective that
models the imbalance between the costs of false predictions from different classes in a novel way
such that user tolerance of the false-positive rate can be explicitly specified. Such a new objective
formulation allows us to obtain a lower false-positive rate without much degradation of the
prediction accuracy or increase in training time. Furthermore, we show that the generalization ability
is preserved with the new objective. We also study the effects of the parameters in ASVM objective
and address some implementation issues related to the Sequential Minimal Optimization (SMO) to
cope with large-scale data. An extensive simulation is conducted and shows that ASVM is able to
yield either a noticeable improvement in performance or a reduction in training time compared to
previous approaches.
ETPL
DM-078 Optimal Route Queries with Arbitrary Order Constraints
Abstract: Given a set of spatial points $DS$, each of which is associated with categorical
information, e.g., restaurant, pub, etc., the optimal route query finds the shortest path that starts from
the query point (e.g., a home or hotel), and covers a user-specified set of categories (e.g., {pub,
restaurant, museum}). The user may also specify partial order constraints between different
categories, e.g., a restaurant must be visited before a pub. Previous work has focused on a special case
where the query contains the total order of all categories to be visited (e.g., museum $\rightarrow$
restaurant $\rightarrow$ pub). For the general scenario without such a total order, the only known
solution reduces the problem to multiple, total-order optimal route queries. As we show in this paper,
this naïve approach incurs a significant amount of repeated computations, and, thus, is not
scalable to large data sets. Motivated by this, we propose novel solutions to the general optimal route
query, based on two different methodologies, namely backward search and forward search. In
addition, we discuss how the proposed methods can be adapted to answer a variant of the optimal
route queries, in which the route only needs to cover a subset of the given categories. Extensive
experiments, using both real and synthetic data sets, confirm that the proposed solutions are efficient
and practical, and outperform existing methods by large margins.
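The naive reduction the paper improves on can be sketched directly: enumerate every category order consistent with the partial-order constraints and evaluate each one. The 1D positions and the greedy nearest-instance choice per category are simplifications for illustration:

```python
from itertools import permutations

def optimal_route(start, places, constraints):
    """Brute-force baseline for the optimal route query: try each category
    order consistent with the (before, after) partial-order constraints,
    and for each order greedily visit the nearest instance of the category.
    places maps category -> list of 1D positions."""
    def route_length(order):
        pos, total = start, 0.0
        for cat in order:
            nxt = min(places[cat], key=lambda p: abs(p - pos))
            total += abs(nxt - pos)
            pos = nxt
        return total
    best = None
    for order in permutations(places):
        # keep only orders satisfying every (before, after) constraint
        if all(order.index(a) < order.index(b) for a, b in constraints):
            d = route_length(order)
            if best is None or d < best[0]:
                best = (d, order)
    return best
```

The factorial number of candidate orders is precisely the repeated computation that the paper's backward-search and forward-search methods avoid.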
ETPL
DM-079 Pay-As-You-Go Entity Resolution
Abstract: Entity resolution (ER) is the problem of identifying which records in a database refer to the
same entity. In practice, many applications need to resolve large data sets efficiently, but do not
require the ER result to be exact. For example, people data from the web may simply be too large to
completely resolve with a reasonable amount of work. As another example, real-time applications
may not be able to tolerate any ER processing that takes longer than a certain amount of time. This
paper investigates how we can maximize the progress of ER with a limited amount of work using
“hints,” which give information on records that are likely to refer to the same real-world
entity. A hint can be represented in various formats (e.g., a grouping of records based on their
likelihood of matching), and ER can use this information as a guideline for which records to compare
first. We introduce a family of techniques for constructing hints efficiently and techniques for using
the hints to maximize the number of matching records identified using a limited amount of work.
Using real data sets, we illustrate the potential gains of our pay-as-you-go approach compared to
running ER without using hints.
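The hint-guided, budget-limited comparison order can be sketched as follows; the matcher and the grouping-style hint format are illustrative assumptions:

```python
def resolve_with_hints(records, hint_groups, budget, match):
    """Sketch of hint-guided pay-as-you-go ER: hint_groups lists indices of
    records likely to co-refer, so pairs inside a group are compared first;
    remaining pairs follow, and work stops when the comparison budget is
    spent. match is any pairwise matcher supplied by the application."""
    def candidate_pairs():
        seen = set()
        for g in hint_groups:          # hinted pairs first
            for a in range(len(g)):
                for b in range(a + 1, len(g)):
                    pair = (g[a], g[b])
                    seen.add(pair)
                    yield pair
        for i in range(len(records)):  # then everything else
            for j in range(i + 1, len(records)):
                if (i, j) not in seen:
                    yield (i, j)
    matches, compared = [], 0
    for i, j in candidate_pairs():
        if compared >= budget:
            break
        compared += 1
        if match(records[i], records[j]):
            matches.append((i, j))
    return matches
```

With the same budget, comparing hinted pairs first finds more true matches than scanning pairs in arbitrary order, which is the pay-as-you-go gain described above.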
ETPL
DM-080
Single-Database Private Information Retrieval from Fully Homomorphic
Encryption
Abstract: Private Information Retrieval (PIR) allows a user to retrieve the $i$th bit of an $n$-bit
database without revealing to the database server the value of $i$. In this paper, we present a PIR
protocol with a communication complexity of $O(\gamma \log n)$ bits, where $\gamma$ is the
ciphertext size. Furthermore, we extend the PIR protocol to a private block retrieval (PBR) protocol, a
natural and more practical extension of PIR in which the user retrieves a block of bits, instead of
a single bit. Our protocols are built on state-of-the-art fully homomorphic encryption
(FHE) techniques and provide privacy for the user if the underlying FHE scheme is semantically
secure. The total communication complexity of our PBR is $O(\gamma \log m + \gamma n/m)$ bits,
where $m$ is the number of blocks. The total computation complexity of our PBR is $O(m \log m)$
modular multiplications plus $O(n/2)$ modular additions. In terms of total protocol execution time,
our PBR protocol is more efficient than existing PBR protocols, which usually require computing
$O(n/2)$ modular multiplications when the size of a block in the database is large and a high-speed
network is available.
ETPL
DM-081
Toward SWSs Discovery: Mapping from WSDL to OWL-S Based on Ontology
Search and Standardization Engine
Abstract: Semantic Web Services (SWSs) represent the most recent and revolutionary technology
developed for machine-to-machine interaction on Web 3.0. As with conventional web services,
the problem of discovering and selecting the most suitable web service represents a challenge for
SWSs to be widely used. In this paper, we propose a mapping algorithm that facilitates the
redefinition of the conventional web services annotations (i.e., WSDL) using semantic annotations
(i.e., OWL-S). This algorithm will be a part of a new discovery mechanism that relies on the semantic
annotations of the web services to perform its task. The “local ontology repository”
and “ontology search and standardization engine” are the backbone of this
algorithm. Both aim to define any data type in the system using a standard ontology-based
concept. The originality of the proposed mapping algorithm is its applicability and consideration of
the standardization problem. The proposed algorithm is implemented and its components are
validated using some test collections and real examples. An experimental test of the proposed
techniques is reported, showing the impact of the proposed algorithm in decreasing the time and
effort of the mapping process. Moreover, the experimental results suggest that the proposed
algorithm will have a positive impact on the discovery process as a whole.
ETPL
DM-082
Trace Ratio Optimization-Based Semi-Supervised Nonlinear Dimensionality
Reduction for Marginal Manifold Visualization
Abstract: Visualizing similarity data of different objects by exhibiting more separate organizations
with local and multimodal characteristics preserved is important in multivariate data analysis.
Laplacian Eigenmaps (LAE) and Locally Linear Embedding (LLE) aim at preserving the embeddings
of all similarity pairs in the close vicinity of the reduced output space, but they are unable to identify
and separate interclass neighbors. This paper considers the semi-supervised manifold learning
problems. We apply the pairwise Cannot-Link and Must-Link constraints induced by the
neighborhood graph to specify the types of neighboring pairs. More flexible regulation on supervised
information is provided. Two novel multimodal nonlinear techniques, which we call trace ratio (TR)
criterion-based semi-supervised LAE ($\mathrm{S}^2$LAE) and LLE ($\mathrm{S}^2$LLE),
are then proposed for marginal manifold visualization. We also present the kernelized
$\mathrm{S}^2$LAE and $\mathrm{S}^2$LLE. We verify the feasibility of $\mathrm{S}^2$LAE and
$\mathrm{S}^2$LLE through extensive simulations over the benchmark real-world MIT CBCL, CMU PIE,
MNIST, and USPS data sets. Manifold visualizations show that $\mathrm{S}^2$LAE and
$\mathrm{S}^2$LLE are able to deliver large margins between different clusters or classes with
multimodal distributions preserved. Clustering evaluations show they can achieve results comparable
to or even better than some widely used methods.
ETPL
DM-083 Update Summarization via Graph-Based Sentence Ranking
Abstract: Due to the fast evolution of the information on the Internet, update summarization has
received much attention in recent years. The task is to summarize an evolving document collection at
the current time, assuming that users have read some related previous documents. In this paper, we
propose a graph-ranking-based method. It performs constrained reinforcements on a sentence graph,
which unifies previous and current documents, to determine the salience of the sentences. The
constraints ensure that the most salient sentences in current documents are updates to previous
documents. Since this problem is NP-hard, we then propose an approximate method, which is
polynomial-time solvable. Experiments on the TAC 2008 and 2009 benchmark data sets show the
effectiveness and efficiency of our method.
ETPL
DM-084 Change Detection in Streaming Multivariate Data Using Likelihood Detectors
Abstract: Change detection in streaming data relies on a fast estimation of the probability that the data
in two consecutive windows come from different distributions. Choosing the criterion is one of the
multitude of questions that need to be addressed when designing a change detection procedure. This
paper gives a log-likelihood justification for two well-known criteria for detecting change in
streaming multidimensional data: Kullback-Leibler (K-L) distance and Hotelling's T-square test for
equal means (H). We propose a semiparametric log-likelihood criterion (SPLL) for change detection.
Compared to the existing log-likelihood change detectors, SPLL trades some theoretical rigor for
computation simplicity. We examine SPLL together with K-L and H on detecting induced change on
30 real data sets. The criteria were compared using the area under the respective Receiver Operating
Characteristic (ROC) curve (AUC). SPLL was found to be on par with H and better than K-L for
the nonnormalized data, and better than both on the normalized data.
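A simple 1D stand-in for the window-comparison criteria above fits a Gaussian to each window and computes their K-L divergence; the detection threshold below is an assumed tuning constant, not from the paper:

```python
import math

def gaussian_kl(window1, window2):
    """Fit a Gaussian to each window and return KL(N1 || N2); a large value
    signals that the two windows likely come from different distributions."""
    def fit(w):
        m = sum(w) / len(w)
        v = sum((x - m) ** 2 for x in w) / len(w)
        return m, max(v, 1e-12)  # guard against zero variance
    m1, v1 = fit(window1)
    m2, v2 = fit(window2)
    return 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def change_detected(window1, window2, threshold=0.5):
    # threshold is an illustrative tuning constant
    return gaussian_kl(window1, window2) > threshold
```

Identical windows yield a divergence of zero, while a shift in mean or variance between consecutive windows drives the criterion up and triggers detection.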
ETPL
DM-085 Coping with Events in Temporal Relational Databases
Abstract: Event relations are used in many temporal relational database approaches to represent facts
occurring at time instants. However, to the best of our knowledge, none of such approaches fully
copes with the definition of events as provided, e.g., by the “consensus” temporal
database glossary. We propose a new approach which overcomes such a limitation, allowing one to
cope with multiple events occurring in the same temporal granule. This move involves major
extensions to current approaches, since indeterminacy about the time and number of occurrences of
events needs to be faced. Specifically, we have introduced a new data model and new definitions of
relational algebraic operators coping with the above issues, and we have studied their reducibility.
Last, but not least, we have shown that our approach can be easily extended in order to cope with a
general form of temporal indeterminacy. Such an extension further increases the applicability of our
approach.
ETPL
DM-086 Cutting Plane Training for Linear Support Vector Machines
Abstract: Support Vector Machines (SVMs) have been shown to achieve high performance on
classification tasks across many domains, and a great deal of work has been dedicated to developing
computationally efficient training algorithms for linear SVMs. One approach [1] approximately
minimizes risk through use of cutting planes, and is improved by [2], [3]. We build upon this work,
presenting a modification to the algorithm developed by Franc and Sonnenburg [2]. We demonstrate
empirically that our changes can reduce cutting plane training time by up to 40 percent, and discuss
how changes in data sets and parameter settings affect the effectiveness of our method.
ETPL
DM-087 Successive Group Selection for Microaggregation
Abstract: In this paper, we propose an efficient clustering algorithm that has been applied to the
microaggregation problem. The goal is to partition $N$ given records into clusters, each grouping
at least $K$ records, so that the sum of the within-partition squared error (SSE) is minimized. We
propose a successive Group Selection algorithm that approximately solves the microaggregation
problem in $O(N^2 \log N)$ time, based on sequential minimization of the SSE.
Experimental results and comparisons to existing methods with similar computation cost on real and
synthetic data sets demonstrate the high performance and robustness of the proposed scheme.
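A toy 1D version of the microaggregation operation (not the paper's Group Selection algorithm) makes the objective concrete: groups of at least k records, each record replaced by its group centroid, with the within-group SSE of that replacement being the quantity the real algorithm minimizes:

```python
def microaggregate(values, k):
    """Toy 1D microaggregation: sort the records, cut them into consecutive
    groups of at least k, and replace every record by its group centroid."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        # the last group absorbs the remainder so all groups have >= k records
        j = len(order) if len(order) - i < 2 * k else i + k
        group = order[i:j]
        centroid = sum(values[g] for g in group) / len(group)
        for g in group:
            out[g] = centroid
        i = j
    return out
```

Because every published value is shared by at least k records, no individual record can be singled out, which is the privacy guarantee microaggregation provides.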
ETPL
DM-088 Modeling and Computing Ternary Projective Relations between Regions
Abstract: We report a corrected version of the algorithms to compute ternary projective relations
between regions appeared in E. Clementini and R. Billen, "Modeling and computing ternary
projective relations between regions," IEEE Transactions on Knowledge and Data Engineering, vol.
18, pp. 799-814, 2006.
ETPL
DM-089
A Survival Modeling Approach to Biomedical Search Result Diversification
Using Wikipedia
Abstract: In this paper, we propose a survival modeling approach to promoting ranking diversity for
biomedical information retrieval. The proposed approach is concerned with finding relevant documents
that cover more distinct aspects of a query. First, two probabilistic models derived from the
survival analysis theory are proposed for measuring aspect novelty. Second, a new method using
Wikipedia to detect aspects covered by retrieved documents is presented. Third, an aspect filter based
on a two-stage model is introduced. It ranks the detected aspects in decreasing order of the probability
that an aspect is generated by the query. Finally, the relevance and the novelty of retrieved documents
are combined at the aspect level for reranking. Experiments conducted on the TREC 2006 and 2007
Genomics collections demonstrate the effectiveness of the proposed approach in promoting ranking
diversity for biomedical information retrieval. Moreover, we further evaluate our approach in the Web
retrieval environment. The evaluation results on the ClueWeb09-T09B collection show that our
approach can achieve promising performance improvements.
ETPL
DM-090 Centroid-Based Actionable 3D Subspace Clustering
Abstract: Demand-side management, together with the integration of distributed energy generation and
storage, is considered an increasingly essential element for implementing the smart grid concept and
balancing massive energy production from renewable sources. We focus on a smart grid in which the
demand side comprises
traditional users as well as users owning some kind of distributed energy sources and/or energy storage
devices. By means of a day-ahead optimization process regulated by an independent central unit, the latter
users intend to reduce their monetary energy expense by producing or storing energy rather than just
purchasing their energy needs from the grid. In this paper, we formulate the resulting grid optimization
problem as a noncooperative game and analyze the existence of optimal strategies. Furthermore, we present a
distributed algorithm to be run on the users' smart meters, which provides the optimal production and/or
storage strategies, while preserving the privacy of the users and minimizing the required signaling with the
central unit. Finally, the proposed day-ahead optimization is tested in a realistic situation.
ETPL
DM-091 Constrained Text Coclustering with Supervised and Unsupervised Constraints
Abstract: In this paper, we propose a novel constrained coclustering method to achieve two goals.
First, we combine information-theoretic coclustering and constrained clustering to improve clustering
performance. Second, we adopt both supervised and unsupervised constraints to demonstrate the
effectiveness of our algorithm. The unsupervised constraints are automatically derived from existing
knowledge sources, thus saving the effort and cost of using manually labeled constraints. To achieve
our first goal, we develop a two-sided hidden Markov random field (HMRF) model to represent both
document and word constraints. We then use an alternating expectation maximization (EM) algorithm
to optimize the model. We also propose two novel methods to automatically construct and incorporate
document and word constraints to support unsupervised constrained clustering: 1) automatically
construct document constraints based on overlapping named entities (NE) extracted by an NE
extractor; 2) automatically construct word constraints based on their semantic distance inferred from
WordNet. The results of our evaluation over two benchmark data sets demonstrate the superiority of
our approaches against a number of existing approaches.
ETPL
DM-092 Crowdsourced Trace Similarity with Smartphones
Abstract: Smartphones are nowadays equipped with a number of sensors, such as WiFi, GPS, accelerometers,
etc. This capability allows smartphone users to easily engage in crowdsourced computing services,
which contribute to the solution of complex problems in a distributed manner. In this work, we
leverage such a computing paradigm to solve efficiently the following problem: comparing a query
trace Q against a crowd of traces generated and stored on distributed smartphones. Our proposed
framework, coined SmartTrace+, provides an effective solution without disclosing any part
of the crowd traces to the query processor. SmartTrace+ relies on an in-situ data storage
model and intelligent top-K query processing algorithms that exploit distributed trajectory similarity
measures, resilient to spatial and temporal noise, in order to derive the most relevant answers to
Q. We evaluate our algorithms on both synthetic and real workloads. We describe our prototype
system developed on the Android OS. The solution is deployed over our own SmartLab testbed of 25
smartphones. Our study reveals that computations over SmartTrace+ result in substantial
energy conservation; in addition, results can be computed faster than competitive approaches.
ETPL
DM-093 Customized Policies for Handling Partial Information in Relational Databases
Abstract: Most real-world databases have at least some missing data. Today, users of such databases
are “on their own” in terms of how they manage this incompleteness. In this paper,
we propose the general concept of partial information policy (PIP) operator to handle incompleteness
in relational databases. PIP operators build upon preference frameworks for incomplete information,
but accommodate different types of incomplete data (e.g., a value exists but is not known; a value
does not exist; a value may or may not exist). Different users in the real world have different ways in
which they want to handle incompleteness—PIP operators allow them to specify a policy that
matches their attitude to risk and their knowledge of the application and how the data was collected.
We propose index structures for efficiently evaluating PIP operators and experimentally assess their
effectiveness on a real-world airline data set. We also study how relational algebra operators and PIP
operators interact with one another.
ETPL
DM-094 Decision Trees for Mining Data Streams Based on the McDiarmid's Bound
Abstract: In mining data streams the most popular tool is the Hoeffding tree algorithm. It uses
Hoeffding's bound to determine the smallest number of examples needed at a node to select a splitting
attribute. In the literature the same bound has been used for any evaluation function (heuristic
measure), e.g., information gain or Gini index. In this paper, it is shown that Hoeffding's
inequality is not appropriate to solve the underlying problem. We prove two theorems presenting
McDiarmid's bound both for the information gain, used in the ID3 algorithm, and for the Gini index, used in the
Classification and Regression Trees (CART) algorithm. The results of the paper guarantee that a
decision tree learning system, applied to data streams and based on the McDiarmid's bound, has the
property that its output is nearly identical to that of a conventional learner. The results of the paper
have a great impact on the state of the art of mining data streams, and various methods and algorithms
developed so far should be reconsidered.
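For reference, the Hoeffding bound that the paper argues against has a simple closed form. This sketch (parameter names are illustrative) computes the epsilon that a Hoeffding tree compares against the observed gap between the two best splitting attributes:

```python
import math

def hoeffding_epsilon(value_range, delta, n):
    """Hoeffding bound: with probability at least 1 - delta, the true
    mean of a random variable with range `value_range` lies within
    epsilon of the mean observed over n independent examples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

# a Hoeffding tree splits a node once the gain gap between the two best
# attributes exceeds epsilon at the chosen confidence level
eps = hoeffding_epsilon(value_range=1.0, delta=1e-7, n=1000)
```

The paper's point is that this bound applies to a mean of i.i.d. terms, whereas heuristics like information gain are nonlinear functions of the sample, for which McDiarmid's inequality is the appropriate tool.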
ETPL
DM-095 Discovering Characterizations of the Behavior of Anomalous Subpopulations
Abstract: We consider the problem of discovering attributes, or properties, accounting for the a priori
stated abnormality of a group of anomalous individuals (the outliers) with respect to an overall given
population (the inliers). To this aim, we introduce the notion of exceptional property and define the
concept of exceptionality score, which measures the significance of a property. In particular, in order
to single out exceptional properties, we resort to a form of minimum distance estimation for
evaluating the badness of fit of the values assumed by the outliers compared to the probability
distribution associated with the values assumed by the inliers. Suitable exceptionality scores are
introduced for both numeric and categorical attributes. These scores are designed, from both the
analytical and the empirical points of view, to be effective for small samples, as is the case for outliers.
We present an algorithm, called EXPREX, for efficiently discovering exceptional
properties. The algorithm is able to reduce the needed computational effort by not exploring many
irrelevant numerical intervals and by exploiting suitable pruning rules. The experimental results
confirm that our technique is able to provide knowledge characterizing outliers in a natural manner.
ETPL
DM-096 FoCUS: Learning to Crawl Web Forums
Abstract: In this paper, we present Forum Crawler Under Supervision (FoCUS), a supervised web-
scale forum crawler. The goal of FoCUS is to crawl relevant forum content from the web with
minimal overhead. Forum threads contain information content that is the target of forum crawlers.
Although forums have different layouts or styles and are powered by different forum software
packages, they always have similar implicit navigation paths connected by specific URL types to lead
users from entry pages to thread pages. Based on this observation, we reduce the web forum crawling
problem to a URL-type recognition problem, and we show how to learn accurate and effective
regular expression patterns of implicit navigation paths from automatically created training sets using
aggregated results from weak page type classifiers. Robust page type classifiers can be trained from
as few as five annotated forums and applied to a large set of unseen forums. Our test results show that
FoCUS achieved over 98 percent effectiveness and 97 percent coverage on a large set of test forums
powered by over 150 different forum software packages. In addition, the results of applying FoCUS
on more than 100 community Question and Answer sites and Blog sites demonstrated that the concept
of implicit navigation path could apply to other social media sites.
ETPL
DM-097
Improving Word Similarity by Augmenting PMI with Estimates of Word
Polysemy
Abstract: Pointwise mutual information (PMI) is a widely used word similarity measure, but it lacks a
clear explanation of how it works. We explore how PMI differs from distributional similarity, and we
introduce a novel metric, PMImax, that augments PMI with information about a word's
number of senses. The coefficients of PMImax are determined empirically by
maximizing a utility function based on the performance of automatic thesaurus generation. We show
that it outperforms traditional PMI in the application of automatic thesaurus generation and in two
word similarity benchmark tasks: human similarity ratings and TOEFL synonym questions.
PMImax achieves a correlation coefficient comparable to the best knowledge-based
approaches on the Miller-Charles similarity rating data set.
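Plain PMI, the baseline that PMImax augments, can be computed directly from corpus counts. A minimal sketch with made-up counts:

```python
import math

def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information from co-occurrence counts:
    PMI(x, y) = log2( p(x, y) / (p(x) p(y)) ),
    i.e., how much more often x and y co-occur than chance predicts."""
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))

# a word pair that co-occurs far more often than chance gets a high score
score = pmi(count_xy=100, count_x=1000, count_y=200, total=1_000_000)
```

The paper's observation is that for polysemous words such marginal counts are inflated across senses, which PMImax corrects with per-sense estimates.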
ETPL
DM-098 Incentive Compatible Privacy-Preserving Data Analysis
Abstract: In many cases, competing parties who have private data may collaboratively conduct
privacy-preserving distributed data analysis (PPDA) tasks to learn beneficial data models or analysis
results. Most often, the competing parties have different incentives. Although certain PPDA
techniques guarantee that nothing other than the final analysis result is revealed, it is impossible to
verify whether participating parties are truthful about their private input data. Unless proper
incentives are set, current PPDA techniques cannot prevent participating parties from modifying their
private inputs. This raises the question of how to design incentive compatible privacy-preserving data
analysis techniques that motivate participating parties to provide truthful inputs. In this paper, we first
develop key theorems; then, based on these theorems, we analyze certain important privacy-preserving
data analysis tasks that can be conducted in a way that makes telling the truth the best choice for any
participating party.
ETPL
DM-099 Nonnegative Matrix Factorization: A Comprehensive Review
Abstract: Nonnegative Matrix Factorization (NMF), a relatively novel paradigm for dimensionality
reduction, has been in the ascendant since its inception. It incorporates the nonnegativity constraint
and thus obtains the parts-based representation as well as enhancing the interpretability of the issue
correspondingly. This survey paper mainly focuses on the theoretical research into NMF over the last
5 years, where the principles, basic models, properties, and algorithms of NMF along with its various
modifications, extensions, and generalizations are summarized systematically. The existing NMF
algorithms are divided into four categories: Basic NMF (BNMF), Constrained NMF (CNMF),
Structured NMF (SNMF), and Generalized NMF (GNMF), upon which the design principles,
characteristics, problems, relationships, and evolution of these algorithms are presented and analyzed
comprehensively. Related work outside NMF that NMF can learn from or that has connections
with NMF is also covered. Moreover, some open issues that remain to be solved are discussed. Several
relevant application areas of NMF are also briefly described. This survey aims to construct an
integrated, state-of-the-art framework for the NMF concept, from which follow-up research may
benefit.
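As a concrete instance of the Basic NMF category surveyed above, the classical Lee-Seung multiplicative updates for the Frobenius objective can be sketched as follows (a minimal illustration, not any one algorithm from the survey):

```python
import numpy as np

def nmf(V, rank, iters=300, seed=0):
    """Basic NMF via Lee-Seung multiplicative updates (Frobenius norm):
    approximate V ~ W @ H with all entries of W and H nonnegative."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank)) + 1e-4
    H = rng.random((rank, m)) + 1e-4
    for _ in range(iters):
        # multiplicative updates keep W and H nonnegative by construction
        H *= (W.T @ V) / (W.T @ W @ H + 1e-10)
        W *= (V @ H.T) / (W @ H @ H.T + 1e-10)
    return W, H

# a nonnegative rank-1 matrix is recovered almost exactly
V = np.outer([1.0, 2.0], [3.0, 4.0])
W, H = nmf(V, rank=1)
err = np.abs(V - W @ H).max()
```

The nonnegativity of the factors is what yields the parts-based, interpretable representation the abstract refers to.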
ETPL
DM-100 On Identifying Critical Nuggets of Information during Classification Tasks
Abstract: In large databases, there may exist critical nuggets: small collections of records or instances
that contain domain-specific important information. This information can be used for future decision
making such as labeling of critical, unlabeled data records and improving classification results by
reducing false positive and false negative errors. This work introduces the idea of critical nuggets,
proposes an innovative domain-independent method to measure criticality, suggests a heuristic to
reduce the search space for finding critical nuggets, and isolates and validates critical nuggets from
some real-world data sets. It seems that only a few subsets may qualify to be critical nuggets,
underlining the importance of finding them. The proposed methodology can detect them. This work
also identifies certain properties of critical nuggets and provides experimental validation of the
properties. Experimental results also helped validate that critical nuggets can assist in improving
classification accuracies in real-world data sets.
ETPL
DM-101
Radio Database Compression for Accurate Energy-Efficient Localization in
Fingerprinting Systems
Abstract: Location fingerprinting is a positioning method that exploits the already existing
infrastructures such as cellular networks or WLANs. Regarding the recent demand for energy
efficient networks and the emergence of issues like green networking, we propose a clustering
technique to compress the radio database in the context of cellular fingerprinting systems. The aim of
the proposed technique is to reduce the computation cost and transmission load in the mobile-based
implementations. The presented method may be called Block-based Weighted Clustering (BWC)
technique, which is applied in a concatenated location-radio signal space, and attributes different
weight factors to the location and radio components. Computer simulations and real experiments have
been conducted to evaluate the performance of our proposed technique in the context of a GSM
network. The obtained results confirm the efficiency of the BWC technique, and show that it
improves the performance of standard k-means and hierarchical clustering methods.
ETPL
DM-102
Semi-Supervised Nonlinear Hashing Using Bootstrap Sequential Projection
Learning
Abstract: In this paper, we study the effective semi-supervised hashing method under the framework of
regularized learning-based hashing. A nonlinear hash function is introduced to capture the underlying
relationship among data points. Thus, the dimensionality of the matrix for computation is not only
independent from the dimensionality of the original data space but also much smaller than the one
using a linear hash function. To effectively deal with the error accumulated when converting the real-
valued embeddings into binary codes after relaxation, we propose a semi-supervised nonlinear
hashing algorithm using bootstrap sequential projection learning, which effectively corrects the errors
by taking into account all the previously learned bits holistically, without incurring extra
computational overhead. Experimental results on six benchmark data sets demonstrate that the
presented method outperforms state-of-the-art hashing algorithms by a large margin.
ETPL
DM-103 Spatial Approximate String Search
Abstract: This work deals with the approximate string search in large spatial databases. Specifically,
we investigate range queries augmented with a string similarity search predicate in both euclidean
space and road networks. We dub this query the spatial approximate string (Sas) query. In euclidean
space, we propose an approximate solution, the MhR-tree, which embeds min-wise signatures into an
R-tree. The min-wise signature for an index node u keeps a concise representation of the union of
q-grams from strings under the subtree of u. We analyze the pruning functionality of such
signatures based on the set resemblance between the query string and the q-grams from the
subtrees of index nodes. We also discuss how to estimate the selectivity of a Sas query in euclidean
space, for which we present a novel adaptive algorithm to find balanced partitions using both the
spatial and string information stored in the tree. For queries on road networks, we propose a novel
exact method, RsasSol, which significantly outperforms the baseline algorithm in practice. The
RsasSol method combines the q-gram-based inverted lists and the reference-nodes-based pruning.
Extensive experiments on large real data sets demonstrate the efficiency and effectiveness of our
approaches.
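The q-gram sets and set resemblance that drive this pruning can be illustrated in a few lines (a simplified sketch: the index described above approximates this resemblance with min-wise signatures rather than computing it exactly):

```python
def qgrams(s, q=2):
    """Set of q-grams of a string (here without boundary padding)."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def resemblance(a, b, q=2):
    """Set resemblance (Jaccard similarity) between the q-gram sets of
    two strings; a low resemblance bound lets an index subtree be
    pruned without inspecting the strings it contains."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    return len(ga & gb) / len(ga | gb)

# spelling variants share most of their 2-grams
sim = resemblance("theatre", "theater")
```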
ETPL
DM-104 SVStream: A Support Vector-Based Algorithm for Clustering Data Streams
Abstract: In this paper, we propose a novel data stream clustering algorithm, termed SVStream, which
is based on support vector domain description and support vector clustering. In the proposed
algorithm, the data elements of a stream are mapped into a kernel space, and the support vectors are
used as the summary information of the historical elements to construct cluster boundaries of arbitrary
shape. To adapt to both dramatic and gradual changes, multiple spheres are dynamically maintained,
each describing the corresponding data domain presented in the data stream. By allowing for bounded
support vectors (BSVs), the proposed SVStream algorithm is capable of identifying overlapping
clusters. A BSV decaying mechanism is designed to automatically detect and remove outliers (noise).
We perform experiments over synthetic and real data streams, with the overlapping, evolving, and
noise situations taken into consideration. Comparison results with state-of-the-art data stream
clustering methods demonstrate the effectiveness and efficiency of the proposed method.
ETPL
DM-105 The Move-Split-Merge Metric for Time Series
Abstract: A novel metric for time series, called Move-Split-Merge (MSM), is proposed. This metric
uses as building blocks three fundamental operations: Move, Split, and Merge, which can be applied
in sequence to transform any time series into any other time series. A Move operation changes the
value of a single element, a Split operation converts a single element into two consecutive elements,
and a Merge operation merges two consecutive elements into one. Each operation has an associated
cost, and the MSM distance between two time series is defined to be the cost of the cheapest sequence
of operations that transforms the first time series into the second one. An efficient, quadratic-time
algorithm is provided for computing the MSM distance. MSM has the desirable properties of being a
metric, in contrast to the Dynamic Time Warping (DTW) distance, and of being invariant to the choice of
origin, in contrast to the Edit Distance with Real Penalty (ERP) metric. At the same time, experiments
with public time series data sets demonstrate that MSM is a meaningful distance measure that
oftentimes leads to lower nearest neighbor classification error rates compared to DTW and ERP.
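The quadratic-time dynamic program can be sketched directly from the three operations described above (a hedged reconstruction; the split/merge cost constant c is a tunable parameter):

```python
def msm_distance(x, y, c=0.1):
    """Move-Split-Merge distance between two numeric sequences via the
    standard quadratic-time dynamic program (a sketch, not the paper's
    exact pseudocode). c is the cost of a Split or Merge operation."""
    def cost(new, a, b):
        # cost of splitting/merging to value `new` between neighbours a, b
        if min(a, b) <= new <= max(a, b):
            return c
        return c + min(abs(new - a), abs(new - b))

    m, n = len(x), len(y)
    D = [[0.0] * n for _ in range(m)]
    D[0][0] = abs(x[0] - y[0])
    for i in range(1, m):
        D[i][0] = D[i - 1][0] + cost(x[i], x[i - 1], y[0])
    for j in range(1, n):
        D[0][j] = D[0][j - 1] + cost(y[j], x[0], y[j - 1])
    for i in range(1, m):
        for j in range(1, n):
            D[i][j] = min(
                D[i - 1][j - 1] + abs(x[i] - y[j]),       # Move
                D[i - 1][j] + cost(x[i], x[i - 1], y[j]),  # Split/Merge
                D[i][j - 1] + cost(y[j], x[i], y[j - 1]),  # Split/Merge
            )
    return D[m - 1][n - 1]

d = msm_distance([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
```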
ETPL
DM-106 A User-Friendly Patent Search Paradigm
Abstract: As an important operation for finding existing relevant patents and validating a new patent
application, patent search has attracted considerable attention recently. However, many users have
limited knowledge about the underlying patents, and they have to use a try-and-see approach to
repeatedly issue different queries and check answers, which is a very tedious process. To address this
problem, in this paper, we propose a new user-friendly patent search paradigm, which can help users
find relevant patents more easily and improve user search experience. We propose three effective
techniques, error correction, topic-based query suggestion, and query expansion, to improve the
usability of patent search. We also study how to efficiently find relevant answers from a large
collection of patents. We first partition patents into small partitions based on their topics and classes.
Then, given a query, we find highly relevant partitions and answer the query in each such highly
relevant partition. Finally, we combine the answers of each partition and generate the top-k answers
of the patent-search query.
ETPL
DM-107
A Methodology for Direct and Indirect Discrimination Prevention in Data
Mining
Abstract: Data mining is an increasingly important technology for extracting useful knowledge hidden
in large collections of data. There are, however, negative social perceptions about data mining, among
which are potential privacy invasion and potential discrimination. The latter consists of unfairly treating
people on the basis of their belonging to a specific group. Automated data collection and data mining
techniques such as classification rule mining have paved the way to making automated decisions, like
loan granting/denial, insurance premium computation, etc. If the training data sets are biased with
regard to discriminatory (sensitive) attributes like gender, race, religion, etc., discriminatory decisions
may ensue. For this reason, anti-discrimination techniques including discrimination discovery and
prevention have been introduced in data mining. Discrimination can be either direct or indirect. Direct
discrimination occurs when decisions are made based on sensitive attributes. Indirect discrimination
occurs when decisions are made based on nonsensitive attributes which are strongly correlated with
biased sensitive ones. In this paper, we tackle discrimination prevention in data mining and propose
new techniques applicable for direct or indirect discrimination prevention individually or both at the
same time. We discuss how to clean training data sets and outsourced data sets in such a way that
direct and/or indirect discriminatory decision rules are converted to legitimate (nondiscriminatory)
classification rules. We also propose new metrics to evaluate the utility of the proposed approaches
and we compare these approaches. The experimental evaluations demonstrate that the proposed
techniques are effective at removing direct and/or indirect discrimination biases in the original data
set while preserving data quality.
ETPL
DM-108 Anomaly Detection via Online Oversampling Principal Component Analysis
Abstract: Anomaly detection has been an important research topic in data mining and machine
learning. Many real-world applications such as intrusion or credit card fraud detection require an
effective and efficient framework to identify deviated data instances. However, most anomaly
detection methods are typically implemented in batch mode, and thus cannot be easily extended to
large-scale problems without sacrificing computation and memory requirements. In this paper, we
propose an online oversampling principal component analysis (osPCA) algorithm to address this
problem, and we aim at detecting the presence of outliers from a large amount of data via an online
updating technique. Unlike prior principal component analysis (PCA)-based approaches, we do not
store the entire data matrix or covariance matrix, and thus our approach is especially of interest in
online or large-scale problems. By oversampling the target instance and extracting the principal
direction of the data, the proposed osPCA allows us to determine the anomaly of the target instance
according to the variation of the resulting dominant eigenvector. Since our osPCA need not perform
eigen analysis explicitly, the proposed framework is favored for online applications which have
computation or memory limitations. Compared with the well-known power method for PCA and
other popular anomaly detection algorithms, our experimental results verify the feasibility of our
proposed method in terms of both accuracy and efficiency.
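The core osPCA idea, duplicating the target instance and checking how much the dominant principal direction rotates, can be sketched as follows (an illustrative reconstruction using batch SVD rather than the paper's online updating technique; names and the oversampling ratio are assumptions):

```python
import numpy as np

def ospca_score(data, target, ratio=0.1):
    """Anomaly score of `target` via oversampling PCA: append extra
    copies of the target and measure how far the dominant eigenvector
    of the mean-centred data rotates (1 - |cosine of the angle|)."""
    def dominant_direction(A):
        A = A - A.mean(axis=0)
        _, _, Vt = np.linalg.svd(A, full_matrices=False)
        return Vt[0]   # first right singular vector = principal direction

    u = dominant_direction(data)
    k = max(1, int(ratio * len(data)))
    over = np.vstack([data, np.tile(target, (k, 1))])
    v = dominant_direction(over)
    return 1.0 - abs(u @ v)

# points on a line: an on-line target barely moves the principal
# direction, while a far-off perpendicular target rotates it strongly
line = np.column_stack([np.linspace(-1, 1, 50), np.zeros(50)])
inlier_score = ospca_score(line, np.array([0.5, 0.0]))
outlier_score = ospca_score(line, np.array([0.0, 5.0]))
```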
ETPL
DM-109
CDAMA: Concealed Data Aggregation Scheme for Multiple Applications in
Wireless Sensor Networks
Abstract: For wireless sensor networks, a data aggregation scheme that reduces the large amount of
transmission is the most practical technique. In previous studies, homomorphic encryptions have been
applied to conceal communication during aggregation, such that enciphered data can be aggregated
algebraically without decryption. Since aggregators collect data without decryption, adversaries are
not able to forge aggregated results by compromising them. However, these schemes have drawbacks.
First, they do not support multi-application environments. Second, they become insecure when some
sensor nodes are compromised. Third, they do not provide secure counting; thus, they may suffer
unauthorized aggregation attacks. Therefore, we propose a new concealed data aggregation scheme
extended from Boneh et al.'s homomorphic public encryption system. The proposed scheme has three
contributions. First, it is designed for a multi-application environment. The base station extracts
application-specific data from aggregated ciphertexts. Next, it mitigates the impact of compromising
attacks in single application environments. Finally, it degrades the damage from unauthorized
aggregations. To prove the proposed scheme's robustness and efficiency, we also conducted
comprehensive analyses and comparisons.
ETPL
DM-110
Classification and Adaptive Novel Class Detection of Feature-Evolving Data
Streams
Abstract: Data stream classification poses many challenges to the data mining community. In this
paper, we address four such major challenges, namely, infinite length, concept-drift, concept-
evolution, and feature-evolution. Since a data stream is theoretically infinite in length, it is impractical
to store and use all the historical data for training. Concept-drift is a common phenomenon in data
streams, which occurs as a result of changes in the underlying concepts. Concept-evolution occurs as
a result of new classes evolving in the stream. Feature-evolution is a frequently occurring process in
many streams, such as text streams, in which new features (i.e., words or phrases) appear as the
stream progresses. Most existing data stream classification techniques address only the first two
challenges, and ignore the latter two. In this paper, we propose an ensemble classification framework,
where each classifier is equipped with a novel class detector, to address concept-drift and concept-
evolution. To address feature-evolution, we propose a feature set homogenization technique. We also
enhance the novel class detection module by making it more adaptive to the evolving stream, and
enabling it to detect more than one novel class at a time. Comparison with state-of-the-art data stream
classification techniques establishes the effectiveness of the proposed approach.
ETPL
DM-111 Comparable Entity Mining from Comparative Questions
Abstract: Comparing one thing with another is a typical part of the human decision-making process.
However, it is not always easy to know what to compare and what the alternatives are. In this paper,
we present a novel way to address this difficulty by automatically mining comparable entities from
comparative questions that users have posted online. To ensure high precision and high recall, we develop a
weakly supervised bootstrapping approach for comparative question identification and comparable
entity extraction by leveraging a large online question archive. The experimental results
show that our method achieves an F1-measure of 82.5 percent in comparative question identification and
83.3 percent in comparable entity extraction. Both significantly outperform an existing state-of-the-art
method. Additionally, our ranking results show high relevance to users' comparison intents on the web.
ETPL
DM-112 Cross-Space Affinity Learning with Its Application to Movie Recommendation
Abstract: In this paper, we propose a novel cross-space affinity learning algorithm over different
spaces with heterogeneous structures. Unlike most affinity learning algorithms, which operate on a
homogeneous space, we construct a cross-space tensor model to learn the affinity measures on
heterogeneous spaces subject to a set of order constraints from the training pool. We further enhance
the model with a factorization form which greatly reduces the number of parameters of the model
with a controlled complexity. Moreover, from the practical perspective, we show the proposed
factorized cross-space tensor model can be efficiently optimized by a series of simple quadratic
optimization problems in an iterative manner. The proposed cross-space affinity learning algorithm
can be applied to many real-world problems, which involve multiple heterogeneous data objects
defined over different spaces. In this paper, we apply it to a recommendation system to measure
the affinity between users and the product items, where a higher affinity means a higher rating of the
user on the product. For an empirical evaluation, a widely used benchmark movie recommendation
data set—MovieLens—is used to compare the proposed algorithm with other state-of-
the-art recommendation algorithms and we show that very competitive results can be obtained.
ETPL
DM-113 Distributed Strategies for Mining Outliers in Large Data Sets
Abstract: We introduce a distributed method for detecting distance-based outliers in very large data
sets. Our approach is based on the concept of outlier detection solving set [2], which is a small subset
of the data set that can be also employed for predicting novel outliers. The method exploits parallel
computation in order to obtain vast time savings. Indeed, beyond preserving the correctness of the
result, the proposed scheme exhibits excellent performance. From the theoretical point of view, for
common settings, our algorithm is expected to be at least three orders of
magnitude faster than the classical nested-loop-like approach to detecting outliers. Experimental results
show that the algorithm is efficient and that its running time scales quite well for an increasing
number of nodes. We also discuss a variant of the basic strategy which reduces the amount of data to
be transferred in order to improve both the communication cost and the overall runtime. Importantly,
the solving set computed by our approach in a distributed environment has the same quality as that
produced by the corresponding centralized method.
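The distance-based outlier notion underlying the solving-set method can be conveyed with a minimal k-nearest-neighbor scorer. This sketch is our assumption for illustration only: it omits the solving set and the distributed machinery, which are precisely what make the paper's approach scale.

```python
import math

def knn_outlier_scores(points, k):
    """Score each point by the distance to its k-th nearest neighbor:
    the larger the score, the more isolated (outlying) the point.
    A naive O(n^2 log n) illustration of the distance-based outlier
    definition; the paper's solving-set and distributed strategies
    are not reproduced here."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores
```

On a small 2D set with one far-away point, that point receives the highest score, matching the intuition of distance-based outlierness.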
ETPL
DM-114 Enhancing Access Privacy of Range Retrievals over B+-Trees
Abstract: Users of databases that are hosted on shared servers cannot take for granted that their queries
will not be disclosed to unauthorized parties. Even if the database is encrypted, an adversary who is
monitoring the I/O activity on the server may still be able to infer some information about a user
query. For the particular case of a B+-tree that has its nodes encrypted, we identify
properties that enable the ordering among the leaf nodes to be deduced. These properties allow us to
construct adversarial algorithms to recover the B+-tree structure from the I/O traces
generated by range queries. Combining this structure with knowledge of the key distribution (or the
plaintext database itself), the adversary can infer the selection range of user queries. To counter the
threat, we propose a privacy-enhancing PB+-tree index which ensures that there is high
uncertainty about what data the user has worked on, even to a knowledgeable adversary who has
observed numerous query executions. The core idea in PB+-tree is to conceal the order of
the leaf nodes in an encrypted B+-tree. In particular, it groups the nodes of the tree into
buckets, and employs homomorphic encryption techniques to prevent the adversary from pinpointing
the exact nodes retrieved by range queries. PB+-tree can be tuned to balance its privacy
strength with the computational and I/O overheads incurred. Moreover, it can be adapted to protect
access privacy in cases where the attacker additionally knows a priori the access frequencies of key
values. Experiments demonstrate that PB+-tree effectively impairs the adversary's ability
to recover the B+-tree structure and deduce the query ranges in all considered scenarios.
ETPL
DM-115 Inferring Statistically Significant Hidden Markov Models
Abstract: Hidden Markov models (HMMs) are used to analyze real-world problems. We consider an
approach that constructs minimum entropy HMMs directly from a sequence of observations. If an
insufficient amount of observation data is used to generate the HMM, the model will not represent the
underlying process. Current methods assume that observations completely represent the underlying
process. It is often the case that the training data size is not large enough to adequately capture all
statistical dependencies in the system. It is, therefore, important to know the statistical significance
level at which the constructed model represents the underlying process, not only the training set. In
this paper, we present a method to determine if the observation data and constructed model fully
express the underlying process with a given level of statistical significance. We use the statistics of
the process to calculate an upper bound on the number of samples required to guarantee that the
model has a given significance level. We provide theoretical and experimental results that confirm the
utility of this approach. The experiment is conducted on a real private Tor network.
ETPL
DM-116
Lineage Encoding: An Efficient Wireless XML Streaming Supporting Twig
Pattern Queries
Abstract: In this paper, we propose an energy- and latency-efficient XML dissemination scheme for
mobile computing. We define a novel unit structure called G-node for streaming XML data in the
wireless environment. It exploits the benefits of the structure indexing and attribute summarization
that can integrate relevant XML elements into a group. It provides a way for selective access of their
attribute values and text content. We also propose a lightweight and effective encoding scheme, called
Lineage Encoding, to support evaluation of predicates and twig pattern queries over the stream. The
Lineage Encoding scheme represents the parent-child relationships among XML elements as a
sequence of bit-strings, called Lineage Code(V, H), and provides basic operators and functions for
effective twig pattern query processing at mobile clients. Extensive experiments using real and
synthetic data sets demonstrate that our scheme outperforms conventional wireless XML broadcasting
methods for simple path queries as well as complex twig pattern queries with predicate conditions.
ETPL
DM-117 MKBoost: A Framework of Multiple Kernel Boosting
Abstract: Multiple kernel learning (MKL) is a promising family of machine learning algorithms using
multiple kernel functions for various challenging data mining tasks. Conventional MKL methods
often formulate the problem as an optimization task of learning the optimal combinations of both
kernels and classifiers, which usually results in some forms of challenging optimization tasks that are
often difficult to solve. Different from existing MKL methods, in this paper, we investigate a
boosting framework of MKL for classification tasks, i.e., we adopt boosting to solve a variant of
MKL problem, which avoids solving the complicated optimization tasks. Specifically, we present a
novel framework of Multiple kernel boosting (MKBoost), which applies the idea of boosting
techniques to learn kernel-based classifiers with multiple kernels for classification problems. Based
on the proposed framework, we propose several variants of MKBoost algorithms and extensively
examine their empirical performance on a number of benchmark data sets in comparisons to various
state-of-the-art MKL algorithms on classification tasks. Experimental results show that the proposed
method is more effective and efficient than the existing MKL techniques.
ETPL
DM-118 Mining Order-Preserving Submatrices from Data with Repeated Measurements
Abstract: Order-preserving submatrices (OPSM's) have been shown useful in capturing concurrent
patterns in data when the relative magnitudes of data items are more important than their exact values.
For instance, in analyzing gene expression profiles obtained from microarray experiments, the relative
magnitudes are important both because they represent the change of gene activities across the
experiments, and because there is typically a high level of noise in data that makes the exact values
untrustable. To cope with data noise, repeated experiments are often conducted to collect multiple
measurements. We propose and study a more robust version of OPSM, where each data item is
represented by a set of values obtained from replicated experiments. We call the new problem OPSM-
RM (OPSM with repeated measurements). We define OPSM-RM based on a number of practical
requirements. We discuss the computational challenges of OPSM-RM and propose a generic mining
algorithm. We further propose a series of techniques to speed up the two time-dominating components of
the algorithm. We show the effectiveness and efficiency of our methods through a series of
experiments conducted on real microarray data.
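The core notion of an order-preserving pattern fits in a few lines: a row supports a candidate column order if its values strictly increase along that order. The sketch below covers only the deterministic single-measurement case as an illustration (function names are our own); handling replicated measurements, as OPSM-RM does, is substantially more involved.

```python
def supports_order(row, col_order):
    """True if the row's values strictly increase along col_order,
    i.e., the row supports that order-preserving pattern."""
    vals = [row[c] for c in col_order]
    return all(a < b for a, b in zip(vals, vals[1:]))

def supporting_rows(matrix, col_order):
    """Indices of the rows that support the candidate column order."""
    return [i for i, row in enumerate(matrix) if supports_order(row, col_order)]
```

An OPSM is then a set of rows and a column order such that every row in the set supports that order.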
ETPL
DM-119 Modeling Noisy Annotated Data with Application to Social Annotation
Abstract: We propose a probabilistic topic model for analyzing and extracting content-related
annotations from noisy annotated discrete data such as webpages stored using social bookmarking
services. With these services, because users can attach annotations freely, some annotations do not
describe the semantics of the content, thus they are noisy, i.e., not content related. The extraction of
content-related annotations can be used as a preprocessing step in machine learning tasks such as text
classification and image recognition, or can improve information retrieval performance. The proposed
model is a generative model for content and annotations, in which the annotations are assumed to
originate either from topics that generated the content or from a general distribution unrelated to the
content. We demonstrate the effectiveness of the proposed method by using synthetic data and real
social annotation data for text and images.
ETPL
DM-120 Multiparty Access Control for Online Social Networks: Model and Mechanisms
Abstract: Online social networks (OSNs) have experienced tremendous growth in recent years and become a de
facto portal for hundreds of millions of Internet users. These OSNs offer attractive means for digital
social interactions and information sharing, but also raise a number of security and privacy issues.
While OSNs allow users to restrict access to shared data, they currently do not provide any
mechanism to enforce privacy concerns over data associated with multiple users. To this end, we
propose an approach to enable the protection of shared data associated with multiple users in OSNs.
We formulate an access control model to capture the essence of multiparty authorization
requirements, along with a multiparty policy specification scheme and a policy enforcement
mechanism. In addition, we present a logical representation of our access control model that allows us to
leverage the features of existing logic solvers to perform various analysis tasks on our model. We also
discuss a proof-of-concept prototype of our approach as part of an application in Facebook and
provide a usability study and system evaluation of our method.
ETPL
DM-121 On the Analytical Properties of High-Dimensional Randomization
Abstract: In this paper, we will provide the first comprehensive analysis of high-dimensional
randomization. The goal is to examine the strengths and weaknesses of randomization and explore
both the potential and the pitfalls of high-dimensional randomization. Our theoretical analysis results
in a number of interesting and insightful conclusions. 1) The privacy effects of randomization reduce
rapidly with increasing dimensionality. 2) The properties of the underlying data set can affect the
anonymity level of the randomization method. For example, natural properties of real data sets such
as clustering improve the effectiveness of randomization. On the other hand, variations in data density
of nonempty data localities and outliers create privacy preservation challenges for the randomization
method. 3) The use of a public information-sensitive attack method makes the choice of perturbing
distribution more critical than previously thought. In particular, Gaussian perturbations are
significantly more effective than uniformly distributed perturbations for the high dimensional case.
These insights are very useful for future research and design of the randomization method. We use the
insights gained from our analysis to discuss and suggest future research directions for improvements
and extensions of the randomization method.
ETPL
DM-122 TACI: Taxonomy-Aware Catalog Integration
Abstract: A fundamental data integration task faced by online commercial portals and commerce search engines
is the integration of products coming from multiple providers into their product catalogs. In this
scenario, the commercial portal has its own taxonomy (the “master taxonomy”), while each data
provider organizes its products into a different taxonomy (the “provider taxonomy”). In this paper, we
consider the problem of categorizing products from the data providers into the master taxonomy,
while making use of the provider taxonomy information. Our approach is based on a taxonomy-aware
processing step that adjusts the results of a text-based classifier to ensure that products that are close
together in the provider taxonomy remain close in the master taxonomy. We formulate this intuition
as a structured prediction optimization problem. To the best of our knowledge, this is the first
approach that leverages the structure of taxonomies in order to enhance catalog integration. We
propose algorithms that are scalable and thus applicable to the large data sets that are typical on the
web. We evaluate our algorithms on real-world data and we show that taxonomy-aware classification
provides a significant improvement over existing approaches.
ETPL
DM -123 The Skyline of a Probabilistic Relation
Abstract: In a deterministic relation R, tuple u dominates tuple v if u is no worse than v on all the
attributes of interest, and better than v on at least one attribute. This concept is at the heart of skyline
queries, that return the set of undominated tuples in R. In this paper, we extend the notion of skyline
to probabilistic relations by generalizing to this context the definition of tuple domination. Our
approach is parametric in the semantics for linearly ranking probabilistic tuples and, being based on
order-theoretic principles, preserves the three fundamental properties the skyline has in the
deterministic case: 1) It equals the union of all top-1 results of monotone scoring functions; 2) it
requires no additional parameter; and 3) it is insensitive to actual attribute scales. We then show how
domination among probabilistic tuples (or P-domination for short) can be efficiently checked by
means of a set of rules. We detail such rules for the cases in which tuples are ranked using either the
“expected rank” or the “expected score” semantics, and explain how the approach can be applied to
other semantics as well. Since computing the skyline of a probabilistic relation is a time-consuming
task, we introduce a family of algorithms for checking P-domination rules in an optimized way.
Experiments show that these algorithms can significantly reduce the actual execution times with
respect to a naive evaluation.
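The deterministic domination and skyline definitions the abstract starts from can be sketched directly, assuming for illustration that smaller is better on every attribute; the probabilistic P-domination rules are the paper's contribution and are not reproduced here.

```python
def dominates(u, v):
    """u dominates v: u is no worse than v on all attributes and
    strictly better on at least one (here 'better' means smaller)."""
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))

def skyline(tuples):
    """Undominated tuples of a deterministic relation (naive O(n^2) scan)."""
    return [t for t in tuples
            if not any(dominates(u, t) for u in tuples if u != t)]
```

For instance, among {(1,4), (2,2), (4,1), (3,3)}, only (3,3) is dominated (by (2,2)), so the skyline is the other three tuples.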
ETPL
DM-124
Unsupervised Hybrid Feature Extraction Selection for High-Dimensional Non-
Gaussian Data Clustering with Variational Inference
Abstract: Clustering has been a subject of extensive research in data mining, pattern recognition, and
other areas for several decades. The main goal is to assign samples, which are typically non-Gaussian
and expressed as points in high-dimensional feature spaces, to one of a number of clusters. It is well
known that in such high-dimensional settings, the existence of irrelevant features generally
compromises modeling capabilities. In this paper, we propose a variational inference framework for
unsupervised non-Gaussian feature selection, in the context of finite generalized Dirichlet (GD)
mixture-based clustering. Under the proposed principled variational framework, we simultaneously
estimate, in a closed form, all the involved parameters and determine the complexity (i.e., both model
and feature selection) of the GD mixture. Extensive simulations using synthetic data along with an
analysis of real-world data and human action videos demonstrate that our variational approach
achieves better results than comparable techniques.
ETPL
DM-125 A Context-Based Word Indexing Model for Document Summarization
Abstract: Existing models for document summarization mostly use the similarity between sentences in the
document to extract the most salient sentences. The documents as well as the sentences are indexed
using traditional term indexing measures, which do not take the context into consideration. Therefore,
the sentence similarity values remain independent of the context. In this paper, we propose a context
sensitive document indexing model based on the Bernoulli model of randomness. The Bernoulli
model of randomness has been used to find the probability of the cooccurrences of two terms in a
large corpus. A new approach using the lexical association between terms to give a context sensitive
weight to the document terms has been proposed. The resulting indexing weights are used to compute
the sentence similarity matrix. The proposed sentence similarity measure has been used with the
baseline graph-based ranking models for sentence extraction. Experiments have been conducted over
the benchmark DUC data sets and it has been shown that the proposed Bernoulli-based sentence
similarity model provides consistent improvements over the baseline IntraLink and UniformLink
methods [1].
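A generic cooccurrence-based association weight conveys the flavor of context-sensitive term weighting. The sketch below uses a simple observed-over-expected log ratio under an independence assumption; it is an illustrative stand-in of our own, not the paper's Bernoulli-model formula.

```python
from math import log

def association(cooc, count_a, count_b, n):
    """Log ratio of the observed cooccurrence count of two terms to the
    count expected if they occurred independently across n contexts.
    Positive values indicate the terms are lexically associated."""
    expected = count_a * count_b / n
    if cooc == 0 or expected == 0:
        return 0.0
    return log(cooc / expected)
```

Such association weights can then modulate per-document term weights before the sentence similarity matrix is computed.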
ETPL
DM-126
A Segmentation and Graph-Based Video Sequence Matching Method for Video
Copy Detection
Abstract: We propose in this paper a segmentation and graph-based video sequence matching method
for video copy detection. Specifically, due to the good stability and discriminative ability of local
features, we use the SIFT descriptor for video content description. However, matching based on the
SIFT descriptor is computationally expensive due to the large number of points and their high dimensionality. Thus, to
reduce the computational complexity, we first use the dual-threshold method to segment the videos
into segments with homogeneous content and extract keyframes from each segment. SIFT features are
extracted from the keyframes of the segments. Then, we propose an SVD-based method to match two
video frames with SIFT point set descriptors. To obtain the video sequence matching result, we
propose a graph-based method. It can convert the video sequence matching into finding the longest
path in the frame matching-result graph with time constraint. Experimental results demonstrate that
the segmentation and graph-based video sequence matching method can detect video copies
effectively. Also, the proposed method has further advantages. Specifically, it can automatically find
the optimal sequence matching result from the disordered matching results based on spatial features. It can also
reduce the noise caused by spatial feature matching, and it is adaptive to video frame rate changes.
Experimental results also demonstrate that the proposed method can obtain a better tradeoff between
the effectiveness and the efficiency of video copy detection.
ETPL
DM-127 Cross-Domain Sentiment Classification Using a Sentiment Sensitive Thesaurus
Abstract: Automatic classification of sentiment is important for numerous applications such as opinion mining,
opinion summarization, contextual advertising, and market analysis. Typically, sentiment
classification has been modeled as the problem of training a binary classifier using reviews annotated
for positive or negative sentiment. However, sentiment is expressed differently in different domains,
and annotating corpora for every possible domain of interest is costly. Applying a sentiment classifier
trained using labeled data for a particular domain to classify sentiment of user reviews on a different
domain often results in poor performance because words that occur in the train (source) domain might
not appear in the test (target) domain. We propose a method to overcome this problem in cross-
domain sentiment classification. First, we create a sentiment sensitive distributional thesaurus using
labeled data for the source domains and unlabeled data for both source and target domains. Sentiment
sensitivity is achieved in the thesaurus by incorporating document level sentiment labels in the
context vectors used as the basis for measuring the distributional similarity between words. Next, we
use the created thesaurus to expand feature vectors during train and test times in a binary classifier.
The proposed method significantly outperforms numerous baselines and returns results that are
comparable with previously proposed cross-domain sentiment classification methods on a benchmark
data set containing Amazon user reviews for different types of products. We conduct an extensive
empirical analysis of the proposed method on single- and multisource domain adaptation,
unsupervised and supervised domain adaptation, and numerous similarity measures for creating the
sentiment sensitive thesaurus. Moreover, our comparisons against SentiWordNet, a lexical
resource for word polarity, show that the created sentiment-sensitive thesaurus accurately captures
words that express similar sentiments.
ETPL
DM-128
Determining k-Most Demanding Products with Maximum Expected
Number of Total Customers
Abstract: In this paper, a problem of production plans, named k-most demanding products (k-MDP)
discovering, is formulated. Given a set of customers demanding a certain type of products with
multiple attributes, a set of existing products of the type, a set of candidate products that can be
offered by a company, and a positive integer k, we want to help the company select k
products from the candidate products such that the expected number of the total customers for the
k products is maximized. We show the problem is NP-hard when the number of attributes for a
product is 3 or more. One greedy algorithm is proposed to find an approximate solution for the problem.
We also attempt to find the optimal solution of the problem by estimating the upper bound of the
expected number of the total customers for a set of k candidate products, reducing the search
space of the optimal solution. An exact algorithm is then provided to find the optimal solution of the
problem by using this pruning strategy. The experiment results demonstrate that both the efficiency
and memory requirement of the exact algorithm are comparable to those of the greedy algorithm, and
the greedy algorithm scales well with respect to k.
ETPL
DM-129
Dirichlet Process Mixture Model for Document Clustering with Feature
Partition
Abstract: Finding the appropriate number of clusters into which documents should be partitioned is
crucial in document clustering. In this paper, we propose a novel approach, namely DPMFP, to
discover the latent cluster structure based on the DPM model without requiring the number of clusters
as input. Document features are automatically partitioned into two groups, in particular,
discriminative words and nondiscriminative words, and contribute differently to document clustering.
A variational inference algorithm is investigated to infer the document collection structure as well as
the partition of document words at the same time. Our experiments indicate that our proposed
approach performs well on the synthetic data set as well as real data sets. The comparison between
our approach and state-of-the-art document clustering approaches shows that our approach is robust
and effective for document clustering.
ETPL
DM-130 Discriminative Nonnegative Spectral Clustering with Out-of-Sample Extension
Abstract: Data clustering is one of the fundamental research problems in data mining and machine
learning. Most of the existing clustering methods, for example, normalized cut and k-means, have
been suffering from the fact that their optimization processes normally lead to an NP-hard problem
due to the discretization of the elements in the cluster indicator matrix. A practical way to cope with
this problem is to relax this constraint to allow the elements to be continuous values. The eigenvalue
decomposition can be applied to generate a continuous solution, which has to be further discretized.
However, the continuous solution is probably mixed-signed. This result may cause it to deviate severely
from the true solution, which should be naturally nonnegative. In this paper, we propose a novel
clustering algorithm, i.e., discriminative nonnegative spectral clustering, to explicitly impose an
additional nonnegative constraint on the cluster indicator matrix to seek for a more interpretable
solution. Moreover, we show an effective regularization term which is able to not only provide more
useful discriminative information but also learn a mapping function to predict cluster labels for the
out-of-sample test data. Extensive experiments on various data sets illustrate the superiority of our
proposal compared to the state-of-the-art clustering algorithms.
ETPL
DM-131
Efficient Algorithms for Mining High Utility Itemsets from Transactional
Databases
Abstract: Mining high utility itemsets from a transactional database refers to the discovery of itemsets
with high utility like profits. Although a number of relevant algorithms have been proposed in recent
years, they incur the problem of producing a large number of candidate itemsets for high utility
itemsets. Such a large number of candidate itemsets degrades the mining performance in terms of
execution time and space requirement. The situation may become worse when the database contains
lots of long transactions or long high utility itemsets. In this paper, we propose two algorithms,
namely utility pattern growth (UP-Growth) and UP-Growth+, for mining high utility itemsets with a
set of effective strategies for pruning candidate itemsets. The information of high utility itemsets is
maintained in a tree-based data structure named utility pattern tree (UP-Tree) such that candidate
itemsets can be generated efficiently with only two scans of the database. The performance of UP-Growth
and UP-Growth+ is compared with the state-of-the-art algorithms on many types of both real and
synthetic data sets. Experimental results show that the proposed algorithms, especially UP-Growth+,
not only reduce the number of candidates effectively but also outperform other algorithms
substantially in terms of runtime, especially when databases contain lots of long transactions.
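The utility measure being mined can be stated in a few lines: the utility of an itemset is quantity times unit profit per item, summed over the transactions that contain the whole itemset. The sketch below illustrates only this measure (the transaction format and names are our assumptions); the UP-Tree structure and pruning strategies are the paper's actual contribution.

```python
def itemset_utility(transactions, unit_profit, itemset):
    """Total utility of an itemset: for each transaction containing
    every item of the set, add quantity * unit profit per item.
    transactions: list of {item: quantity} dicts."""
    total = 0
    for txn in transactions:
        if all(item in txn for item in itemset):
            total += sum(txn[item] * unit_profit[item] for item in itemset)
    return total
```

High utility itemset mining then asks for all itemsets whose total utility exceeds a user-given threshold.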
ETPL
DM-132
Entity Translation Mining from Comparable Corpora: Combining Graph
Mapping with Corpus Latent Features
Abstract: This paper addresses the problem of mining named entity translations from comparable
corpora, specifically, mining English and Chinese named entity translation. We first observe that
existing approaches use one or more of the following named entity similarity metrics: entity, entity
context, and relationship. Motivated by this observation, we propose a new holistic approach by 1)
combining all similarity types used and 2) additionally considering relationship context similarity
between pairs of named entities, a missing quadrant in the taxonomy of similarity metrics. We
abstract the named entity translation problem as the matching of two named entity graphs extracted
from the comparable corpora. Specifically, named entity graphs are first constructed from comparable
corpora to extract relationships between named entities. Entity similarity and entity context similarity
are then calculated from every pair of bilingual named entities. A reinforcing method is utilized to
reflect relationship similarity and relationship context similarity between named entities. We also
discover "latent" features lost in the graph extraction process and integrate this into our framework.
According to our experimental results, our holistic graph-based approach and its enhancement using
corpus latent features are highly effective and our framework significantly outperforms previous
approaches.
ETPL
DM-133 Harnessing Folksonomies to Produce a Social Classification of Resources
Abstract: In our daily lives, organizing resources like books or webpages into a set of categories to
ease future access is a common task. The usual size of these collections makes manual organization
a vast and expensive endeavor. As an approach to effectively produce an automated
classification of resources, we consider the immense amounts of annotations provided by users on
social tagging systems in the form of bookmarks. In this paper, we deal with the utilization of these
user-provided tags to perform a social classification of resources. For this purpose, we have created
three large-scale social tagging data sets including tagging data for different types of resources,
webpages and books. Those resources are accompanied by categorization data from sound expert-
driven taxonomies. We analyze the characteristics of the three social tagging systems and perform an
analysis on the usefulness of social tags to perform a social classification of resources that resembles
the classification by experts as much as possible. We analyze six different representations using tags
and compare them to other data sources by using three different settings of SVM classifiers. Finally, we
explore combinations of different data sources with tags using classifier committees to best classify
the resources.
ETPL
DM-134 Optimizing Multi-Top-k Queries over Uncertain Data Streams
Abstract: Query processing over uncertain data streams, in particular top-k query processing, has
become increasingly important due to its wide application in many fields such as sensor network
monitoring and Internet traffic control. In many real applications, multiple top-k queries are
registered in the system. Sharing the results of these queries is a key factor in saving computation
cost and providing real-time responses. However, due to the complex semantics of uncertain top-k
query processing, it is nontrivial to implement sharing among different top-k queries, and few
works have addressed the sharing issue. In this paper, we formulate various types of sharing among
multiple top-k queries over uncertain data streams, based on the frequency upper bound of each
top-k query. We present an optimal dynamic programming solution as well as a more efficient (in
terms of time and space complexity) greedy algorithm to compute an execution plan for the registered
queries that saves computation cost across them. Experiments have demonstrated that the
greedy algorithm can find the optimal solution in most cases, and it can almost achieve the same
performance (in terms of latency and throughput) as the dynamic programming approach.
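The core sharing idea can be sketched in miniature: several registered top-k queries are all served from one shared buffer of size max(k), so the stream is scanned once rather than once per query. This deliberately ignores the uncertain-stream semantics and frequency upper bounds central to the paper; scores and items below are hypothetical.

```python
import heapq

class SharedTopK:
    """Serve several registered top-k queries from one shared buffer of size max(k)."""
    def __init__(self, ks):
        self.ks = ks
        self.kmax = max(ks)
        self.heap = []  # min-heap of (score, item), size <= kmax

    def insert(self, item, score):
        if len(self.heap) < self.kmax:
            heapq.heappush(self.heap, (score, item))
        elif score > self.heap[0][0]:
            # New item beats the weakest retained one; swap it in.
            heapq.heapreplace(self.heap, (score, item))

    def answer(self, k):
        """Answer any registered top-k query as a prefix of the shared buffer."""
        assert k in self.ks
        top = sorted(self.heap, reverse=True)[:k]
        return [item for score, item in top]

stream = [("a", 0.9), ("b", 0.4), ("c", 0.7), ("d", 0.95), ("e", 0.2)]
shared = SharedTopK(ks=[1, 3])
for item, score in stream:
    shared.insert(item, score)
print(shared.answer(1))  # ['d']
print(shared.answer(3))  # ['d', 'a', 'c']
```

The paper's contribution lies in deciding *which* queries should share when their semantics differ; this sketch shows only the payoff of sharing once a plan groups them together.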
ETPL
DM-135 Prequery Discovery of Domain-Specific Query Forms: A Survey
Abstract: The discovery of HTML query forms is one of the main challenges in Deep web crawling.
Automatic solutions for this problem perform two main tasks. The first is locating HTML forms on
the web, which is done through the use of traditional/focused crawlers. The second is identifying
which of these forms are indeed meant for querying, which also typically involves determining a
domain for the underlying data source (and thus for the form as well). This problem has attracted a
great deal of interest, resulting in a long list of algorithms and techniques. Some methods submit
requests through the forms and then analyze the data retrieved in response, typically requiring a great
deal of knowledge about the domain as well as semantic processing. Others do not employ form
submission, to avoid such difficulties, although some techniques rely to some extent on semantics and
domain knowledge. This survey gives an up-to-date review of methods for the discovery of domain-
specific query forms that do not involve form submission. We detail these methods and discuss how
form discovery has become increasingly more automated over time. We conclude with a forecast of
what we believe are the immediate next steps in this trend.
ETPL
DM-136 Preventing Private Information Inference Attacks on Social Networks
Abstract: Online social networks, such as Facebook, are increasingly utilized by many people. These
networks allow users to publish details about themselves and to connect to their friends. Some of the
information revealed inside these networks is meant to be private. Yet it is possible to use learning
algorithms on released data to predict private information. In this paper, we explore how to launch
inference attacks using released social networking data to predict private information. We then devise
three possible sanitization techniques that could be used in various situations. Next, we explore the
effectiveness of these techniques and attempt to use methods of collective inference to discover
sensitive attributes of the data set. We show that we can decrease the effectiveness of both local and
relational classification algorithms by using the sanitization methods described.
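A toy version of a relational inference attack, and of one link-based sanitization response, can be sketched as follows. The user names, attribute values, and the specific strategy of dropping revealing friendship links are illustrative only; they are not the paper's three techniques.

```python
from collections import Counter

def infer_attribute(user, friends_of, public_attr):
    """Relational inference: predict a user's hidden attribute by majority vote over friends."""
    votes = Counter(public_attr[f] for f in friends_of[user] if f in public_attr)
    return votes.most_common(1)[0][0] if votes else None

def sanitize_links(friends_of, public_attr, sensitive_value):
    """Link sanitization: drop friendship links to users who publicly reveal the sensitive value."""
    return {u: [f for f in fs if public_attr.get(f) != sensitive_value]
            for u, fs in friends_of.items()}

friends_of = {"alice": ["bob", "carol", "dave"]}
public_attr = {"bob": "party_X", "carol": "party_X", "dave": "party_Y"}

print(infer_attribute("alice", friends_of, public_attr))  # party_X: the attack succeeds
cleaned = sanitize_links(friends_of, public_attr, "party_X")
print(infer_attribute("alice", cleaned, public_attr))     # prediction degraded after sanitization
```

Even this toy example shows the tension the paper studies: removing links degrades the attacker's relational classifier, but also removes genuine utility from the released graph.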
ETPL
DM-137 Principal Composite Kernel Feature Analysis: Data-Dependent Kernel Approach
Abstract: Principal composite kernel feature analysis (PC-KFA) is presented to show kernel
adaptations for nonlinear features of medical image data sets (MIDS) in computer-aided diagnosis
(CAD). The proposed algorithm PC-KFA has extended the existing studies on kernel feature analysis
(KFA), which extracts salient features from a sample of unclassified patterns by use of a kernel
method. The principal composite process for PC-KFA herein has been applied to kernel principal
component analysis [34] and to our previously developed accelerated kernel feature analysis [20].
Unlike other kernel-based feature selection algorithms, PC-KFA iteratively constructs a linear
subspace of a high-dimensional feature space by maximizing a variance condition for the nonlinearly
transformed samples, which we call the data-dependent kernel approach. The resulting kernel subspace
can be first chosen by principal component analysis, and then be processed for composite kernel
subspace through the efficient combination representations used for further reconstruction and
classification. Numerical experiments based on several MID feature spaces of cancer CAD data have
shown that PC-KFA generates an efficient and effective feature representation, and yields better
classification performance for the proposed composite kernel subspace using a simple pattern
classifier.
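The kernel feature analysis underlying PC-KFA builds on kernel PCA, which can be sketched in pure Python: form an RBF kernel matrix, center it in feature space, and extract the leading component by power iteration. This is standard kernel PCA, not the composite-kernel construction of PC-KFA itself; the data points and gamma are hypothetical.

```python
import math, random

def rbf(x, y, gamma=1.0):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def centered_kernel_matrix(X, gamma=1.0):
    """Kernel matrix centered in feature space: K - 1K/n - K1/n + 1K1/n^2."""
    n = len(X)
    K = [[rbf(X[i], X[j], gamma) for j in range(n)] for i in range(n)]
    row = [sum(K[i]) / n for i in range(n)]
    tot = sum(row) / n
    return [[K[i][j] - row[i] - row[j] + tot for j in range(n)] for i in range(n)]

def top_eigenvector(K, iters=200):
    """Power iteration for the coefficient vector of the first kernel principal component."""
    n = len(K)
    random.seed(0)
    v = [random.random() for _ in range(n)]
    for _ in range(iters):
        w = [sum(K[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Two well-separated groups; their first kernel PC should separate them.
X = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
K = centered_kernel_matrix(X)
alpha = top_eigenvector(K)
# Projection of sample i onto the first kernel PC: sum_j alpha[j] * K[i][j].
proj = [sum(alpha[j] * K[i][j] for j in range(len(X))) for i in range(len(X))]
print([round(p, 3) for p in proj])
```

In the projections, the two points in each group share a sign and the groups take opposite signs, which is the nonlinear separation that kernel feature analysis exploits for CAD feature spaces.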
ETPL
DM-138 Revealing Density-Based Clustering Structure from the Core-Connected Tree of a Network
Abstract: Clustering is an important technique for mining the intrinsic community structures in
networks. The density-based network clustering method is able to not only detect communities of
arbitrary size and shape, but also identify hubs and outliers. However, it requires manual parameter
specification to define clusters, and it is sensitive to the density threshold parameter, which is
difficult to determine. Furthermore, many real-world networks exhibit a hierarchical structure with
communities embedded within other communities. Therefore, the clustering result of a global
parameter setting cannot always describe the intrinsic clustering structure accurately. In this paper, we
introduce a novel density-based network clustering method, called graph-skeleton-based clustering
(gSkeletonClu). By projecting an undirected network to its core-connected maximal spanning tree, the
clustering problem can be converted to detecting core-connectivity components on the tree. Both the
density-based clustering for a specific parameter setting and the hierarchical clustering structure can
be efficiently extracted from the tree. Moreover, the method provides a convenient way to
automatically select the parameter and to obtain a meaningful cluster tree for a network. Extensive
experiments on both
real-world and synthetic networks demonstrate the superior performance of gSkeletonClu for
effective and efficient density-based clustering.
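The tree-based view can be sketched as follows: build a maximal spanning tree over similarity-weighted edges (Kruskal on descending weight), then cut tree edges below a density threshold so the remaining components form the clusters. The edge weights here are given directly as a stand-in for the structural similarity gSkeletonClu computes, and the nodes are hypothetical.

```python
class DSU:
    """Disjoint-set union for Kruskal's algorithm and component extraction."""
    def __init__(self, nodes):
        self.p = {n: n for n in nodes}
    def find(self, x):
        while self.p[x] != x:
            self.p[x] = self.p[self.p[x]]
            x = self.p[x]
        return x
    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.p[ra] = rb
            return True
        return False

def maximal_spanning_tree(nodes, edges):
    """Kruskal on descending similarity: keep the strongest edge joining each pair of components."""
    dsu = DSU(nodes)
    return [e for e in sorted(edges, key=lambda e: -e[2]) if dsu.union(e[0], e[1])]

def clusters_at_threshold(nodes, tree, eps):
    """Cut tree edges with similarity below eps; remaining components are the clusters."""
    dsu = DSU(nodes)
    for u, v, w in tree:
        if w >= eps:
            dsu.union(u, v)
    groups = {}
    for n in nodes:
        groups.setdefault(dsu.find(n), set()).add(n)
    return sorted(groups.values(), key=min)

nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "b", 0.9), ("b", "c", 0.8), ("c", "d", 0.2), ("d", "e", 0.85), ("a", "c", 0.7)]
tree = maximal_spanning_tree(nodes, edges)
print(clusters_at_threshold(nodes, tree, eps=0.5))  # [{'a', 'b', 'c'}, {'d', 'e'}]
```

Sweeping eps over the tree's edge weights yields the whole hierarchy at once (eps=0.1 merges everything into one cluster), which is why the spanning tree supports both a fixed-parameter clustering and the full cluster tree.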
ETPL
DM-139 Secure Provenance Transmission for Streaming Data
Abstract: Many application domains, such as real-time financial analysis, e-healthcare systems, and
sensor networks, are characterized by continuous data streaming from multiple sources and through
intermediate processing by multiple aggregators. Keeping track of data provenance in such a highly
dynamic context is an important requirement, since data provenance is a key factor in assessing data
trustworthiness, which is crucial for many applications. Provenance management for streaming data
requires addressing several challenges, including the assurance of high processing throughput, low
bandwidth consumption, storage efficiency, and secure transmission. In this paper, we propose a novel
approach to securely transmit provenance for streaming data (focusing on sensor networks) by
embedding provenance into the interpacket timing domain while addressing the above-mentioned
issues. As the provenance is hidden in another host medium, our solution can be conceptualized as a
watermarking technique. However, unlike traditional watermarking approaches, we embed
provenance over the interpacket delays (IPDs) rather than in the sensor data themselves, hence
avoiding the problem of data degradation due to watermarking. Provenance is extracted by the data
receiver utilizing an optimal threshold-based mechanism which minimizes the probability of
provenance decoding errors. The resiliency of the scheme against outside and inside attackers is
established through an extensive security analysis. Experiments show that our technique can recover
provenance up to a certain level against perturbations to inter-packet timing characteristics.
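The timing-channel idea can be sketched minimally: each provenance bit widens or leaves alone one inter-packet delay, and the receiver decodes with a threshold. The base delay, widening delta, threshold, and jitter values are hypothetical; the paper derives an optimal threshold and a far more robust embedding.

```python
def embed_provenance(bits, base_delay=10.0, delta=4.0):
    """Encode each provenance bit in an inter-packet delay: short gap = 0, long gap = 1."""
    return [base_delay + (delta if b else 0.0) for b in bits]

def extract_provenance(ipds, threshold=12.0):
    """Threshold-based decoding at the data receiver."""
    return [1 if d > threshold else 0 for d in ipds]

provenance = [1, 0, 1, 1, 0]
ipds = embed_provenance(provenance)
# Simulate mild network jitter perturbing the timing channel.
noisy = [d + j for d, j in zip(ipds, [0.6, -0.4, 1.1, -0.9, 0.3])]
print(extract_provenance(noisy))  # [1, 0, 1, 1, 0]
```

Because the bits ride on delays rather than on the sensor readings, the data values themselves are never degraded; resilience then hinges on how much timing perturbation the threshold can absorb.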
ETPL
DM-140 The Adaptive Clustering Method for the Long Tail Problem of Recommender Systems
Abstract: This is a study of the long tail problem of recommender systems, in which many items in
the long tail have only a few ratings, making them hard to use in recommendations. The approach
presented in this paper clusters items according to their popularity, so that recommendations for tail
items are based on the ratings of more intensively clustered groups, while those for head items are
based on the ratings of individual items or of groups clustered to a lesser extent. We
apply this method to two real-life data sets and compare the results with those of the nongrouping and
fully grouped methods in terms of recommendation accuracy and scalability. The results show that if
such adaptive clustering is done properly, this method reduces the recommendation error rates for the
tail items, while maintaining reasonable computational performance.
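The head/tail split can be sketched as follows: items with enough ratings keep an individual estimate, while sparse tail items fall back to a pooled estimate over their cluster. The cluster assignments are supplied externally here as a stub; the adaptive clustering of tail items is exactly what the paper contributes, and the items and thresholds are hypothetical.

```python
from statistics import mean

def build_predictor(ratings, item_cluster, head_threshold=3):
    """ratings: {item: [rating, ...]}. Head items (>= head_threshold ratings) keep their
    own mean; tail items fall back to the pooled mean of their cluster's ratings."""
    cluster_pool = {}
    for item, rs in ratings.items():
        if len(rs) < head_threshold:
            cluster_pool.setdefault(item_cluster[item], []).extend(rs)

    def predict(item):
        rs = ratings.get(item, [])
        if len(rs) >= head_threshold:
            return mean(rs)                              # head: individual estimate
        return mean(cluster_pool[item_cluster[item]])    # tail: group estimate

    return predict

ratings = {"hit": [5, 4, 5, 4], "rare1": [5], "rare2": [4], "rare3": [4, 4]}
item_cluster = {"hit": 0, "rare1": 1, "rare2": 1, "rare3": 1}
predict = build_predictor(ratings, item_cluster)
print(predict("hit"))    # 4.5  (own mean, 4 ratings)
print(predict("rare1"))  # 4.25 (pooled over the tail cluster's ratings)
```

Pooling trades bias for variance: a tail item's one or two ratings are too noisy to trust, so borrowing from similar items lowers the error, which is the effect the paper measures.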
ETPL
DM-141 VChunkJoin: An Efficient Algorithm for Edit Similarity Joins
Abstract: Similarity joins play an important role in many application areas, such as data integration
and cleaning, record linkage, and pattern recognition. In this paper, we study efficient algorithms for
similarity joins with an edit distance constraint. Currently, the most prevalent approach is based on
extracting overlapping grams from strings and considering only strings that share a certain number of
grams as candidates. Unlike these existing approaches, we propose a novel approach to edit similarity
join based on extracting nonoverlapping substrings, or chunks, from strings. We propose a class of
chunking schemes based on the notion of tail-restricted chunk boundary dictionary. A new algorithm,
VChunkJoin, is designed by integrating existing filtering methods and several new filters unique to
our chunk-based method. We also design a greedy algorithm to automatically select a good chunking
scheme for a given data set. We demonstrate experimentally that the new algorithm is faster than
alternative methods yet occupies less space.
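The chunk-based idea can be sketched with a deliberately simple boundary dictionary: split each string into nonoverlapping chunks ending at vowel positions, then require candidates to share enough chunks given that each edit operation destroys at most a bounded number of chunks. The single-character dictionary and the destruction bound of 2 are simplifying assumptions; the paper's tail-restricted chunk boundary dictionaries and filters are more general and tighter.

```python
def chunks(s, boundary_chars=frozenset("aeiou")):
    """Split s into nonoverlapping chunks, ending each chunk right after a boundary character.
    (A single-character chunk boundary dictionary, for illustration.)"""
    out, start = [], 0
    for i, ch in enumerate(s):
        if ch in boundary_chars:
            out.append(s[start:i + 1])
            start = i + 1
    if start < len(s):
        out.append(s[start:])  # trailing chunk with no boundary character
    return out

def is_candidate(s, t, tau, max_destroyed_per_edit=2):
    """Count filter: strings within edit distance tau must still share enough chunks,
    assuming each edit destroys at most max_destroyed_per_edit chunks (simplified bound)."""
    cs, ct = chunks(s), chunks(t)
    shared = len(set(cs) & set(ct))
    needed = max(len(cs), len(ct)) - max_destroyed_per_edit * tau
    return shared >= needed

print(chunks("similarity"))                             # ['si', 'mi', 'la', 'ri', 'ty']
print(is_candidate("similarity", "similarly", tau=2))   # True: survives the filter
```

Because chunks never overlap, a string of length n yields far fewer chunks than overlapping q-grams, which is the source of the space savings the paper reports.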