Post on 30-Nov-2015
Elysium Technologies Private Limited Singapore | Madurai | Chennai | Trichy | Coimbatore | Cochin | Ramnad |
Pondicherry | Trivandrum | Salem | Erode | Tirunelveli
http://www.elysiumtechnologies.com, info@elysiumtechnologies.com
13 Years of Experience
Automated Services
24/7 Help Desk Support
Experienced & Expert Developers
Advanced Technologies & Tools
Legitimate Member of all Journals
150,000+ Successful Records in all Languages
More than 12 Branches in Tamilnadu, Kerala & Karnataka
Ticketing & Appointment Systems
Individual Care for every Student
Around 250 Developers & 20 Researchers
227-230 Church Road, Anna Nagar, Madurai – 625020.
0452-4390702, 4392702, + 91-9944793398.
info@elysiumtechnologies.com, elysiumtechnologies@gmail.com
S.P.Towers, No.81 Valluvar Kottam High Road, Nungambakkam,
Chennai - 600034. 044-42072702, +91-9600354638,
chennai@elysiumtechnologies.com
15, III Floor, SI Towers, Melapudur main Road, Trichy – 620001.
0431-4002234, + 91-9790464324.
trichy@elysiumtechnologies.com
577/4, DB Road, RS Puram, Opp to KFC, Coimbatore – 641002
0422- 4377758, +91-9677751577.
coimbatore@elysiumtechnologies.com
Plot No: 4, C Colony, P&T Extension, Perumal Puram, Tirunelveli - 627007.
0462-2532104, +91-9677733255.
tirunelveli@elysiumtechnologies.com
1st Floor, A.R. IT Park, Rasi Color Scan Building, Ramanathapuram - 623501.
04567-223225, +91-9677704922.
ramnad@elysiumtechnologies.com
74, 2nd Floor, K.V.K Complex, Upstairs Krishna Sweets, Mettur Road,
Opp. Bus Stand, Erode - 638011. 0424-4030055, +91-9677748477.
erode@elysiumtechnologies.com
No: 88, First Floor, S.V. Patel Salai, Pondicherry - 605001.
0413-4200640, +91-9677704822.
pondy@elysiumtechnologies.com
TNHB A-Block, D.No. 10, Opp: Hotel Ganesh, Near Bus Stand, Salem - 636007.
0427-4042220, +91-9894444716.
salem@elysiumtechnologies.com
ETPL
DM-001 Fast Clustering-Based Feature Subset Selection Algorithm for High-Dimensional Data
Abstract: Feature selection involves identifying a subset of the most useful features that produces results
comparable to those obtained with the original entire set of features. A feature selection algorithm may be evaluated from both the
efficiency and effectiveness points of view. While the efficiency concerns the time required to find a subset of
features, the effectiveness is related to the quality of the subset of features. Based on these criteria, a fast
clustering-based feature selection algorithm (FAST) is proposed and experimentally evaluated in this paper.
The FAST algorithm works in two steps. In the first step, features are divided into clusters by using graph-
theoretic clustering methods. In the second step, the most representative feature that is strongly related to target
classes is selected from each cluster to form a subset of features. Because features in different clusters are relatively
independent, the clustering-based strategy of FAST has a high probability of producing a subset of useful and
independent features. To ensure the efficiency of FAST, we adopt the efficient minimum-spanning tree (MST)
clustering method. The efficiency and effectiveness of the FAST algorithm are evaluated through an empirical
study. Extensive experiments are carried out to compare FAST and several representative feature selection
algorithms, namely, FCBF, ReliefF, CFS, Consist, and FOCUS-SF, with respect to four types of well-known
classifiers, namely, the probability-based Naive Bayes, the tree-based C4.5, the instance-based IB1, and the
rule-based RIPPER, before and after feature selection. The results, on 35 publicly available real-world
high-dimensional image, microarray, and text data sets, demonstrate that FAST not only produces smaller subsets of
features but also improves the performance of the four types of classifiers.
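The two-step structure lends itself to a compact illustration. The sketch below is a hedged, simplified reading of the FAST idea, not the authors' implementation: the correlation measure, cut threshold, and tiny data set are all assumptions made for illustration. Features are clustered by cutting weak edges of a maximum-spanning tree built over pairwise feature correlations, and the feature most correlated with the target is kept from each cluster.

```python
# Simplified two-step sketch of the FAST idea (illustrative only).
from itertools import combinations

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

def fast_select(features, target, cut=0.5):
    names = list(features)
    # Step 1: maximum spanning tree over |correlation| (Kruskal).
    edges = sorted(((abs(pearson(features[a], features[b])), a, b)
                    for a, b in combinations(names, 2)), reverse=True)
    parent = {f: f for f in names}
    def find(f):
        while parent[f] != f:
            f = parent[f]
        return f
    tree = []
    for w, a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
            tree.append((w, a, b))
    # Cut weak tree edges to obtain feature clusters.
    parent = {f: f for f in names}
    for w, a, b in tree:
        if w >= cut:
            parent[find(a)] = find(b)
    clusters = {}
    for f in names:
        clusters.setdefault(find(f), []).append(f)
    # Step 2: keep the feature most relevant to the target per cluster.
    return sorted(max(c, key=lambda f: abs(pearson(features[f], target)))
                  for c in clusters.values())

features = {
    "f1": [1, 2, 3, 4, 5],
    "f2": [2, 4, 6, 8, 10],   # redundant with f1
    "f3": [5, 1, 4, 2, 3],    # weakly related noise
}
target = [1, 2, 3, 4, 5]
print(fast_select(features, target))
```

Here f1 and f2 land in one cluster (correlation 1.0), f3 in another, so one representative of the redundant pair plus f3 is returned.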
ETPL
DM-002
Graph-Based Consensus Maximization Approach for Combining Multiple Supervised
and Unsupervised Models
Abstract: Ensemble learning has emerged as a powerful method for combining multiple models. Well-known methods,
such as bagging, boosting, and model averaging, have been shown to improve accuracy and robustness over
single models. However, due to the high costs of manual labeling, it is hard to obtain sufficient and reliable
labeled data for effective training. Meanwhile, large amounts of unlabeled data exist in such applications, and we can readily
obtain multiple unsupervised models. Although unsupervised models do not directly generate a class label
prediction for each object, they provide useful constraints on the joint predictions for a set of related objects.
Therefore, incorporating these unsupervised models into the ensemble of supervised models can lead to better
prediction performance. In this paper, we study ensemble learning with outputs from multiple supervised and
unsupervised models, a topic where little work has been done. We propose to consolidate a classification
solution by maximizing the consensus among both supervised predictions and unsupervised constraints. We
cast this ensemble task as an optimization problem on a bipartite graph, where the objective function favors the
smoothness of the predictions over the graph, but penalizes the deviations from the initial labeling provided by
the supervised models. We solve this problem through iterative propagation of probability estimates among
neighboring nodes and prove the optimality of the solution. The proposed method can be interpreted as
conducting a constrained embedding in a transformed space, or a ranking on the graph. Experimental results on
different applications with heterogeneous data sources demonstrate the benefits of the proposed method over
existing alternatives.
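The propagation scheme can be illustrated with a toy sketch. The code below is a hedged simplification, not the paper's exact objective or solver: supervised models supply initial label probabilities for each object, unsupervised clusters act as group nodes, and probability estimates are averaged back and forth between the two sides until they stabilize.

```python
# Toy consensus-maximization sketch (illustrative, two classes).
def consensus(initial, groups, alpha=0.5, iters=50):
    probs = {o: list(p) for o, p in initial.items()}
    for _ in range(iters):
        # Group estimate = average over member objects.
        gprob = {g: [sum(probs[o][k] for o in members) / len(members)
                     for k in range(2)]
                 for g, members in groups.items()}
        # Object estimate = blend of supervised prior and its groups.
        for o in probs:
            mine = [g for g, m in groups.items() if o in m]
            for k in range(2):
                avg = sum(gprob[g][k] for g in mine) / len(mine)
                probs[o][k] = alpha * initial[o][k] + (1 - alpha) * avg
    return {o: max(range(2), key=lambda k: p[k]) for o, p in probs.items()}

# "d" is ambiguous on its own; the unsupervised grouping with "c"
# tips it toward class 1.
initial = {"a": [0.9, 0.1], "b": [0.6, 0.4], "c": [0.2, 0.8], "d": [0.5, 0.5]}
groups = {"g1": {"a", "b"}, "g2": {"c", "d"}}
print(consensus(initial, groups))
```

The unsupervised constraint never assigns labels itself; it only pulls objects in the same group toward a shared prediction, which is the intuition the paper formalizes on a bipartite graph.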
ETPL
DM-003
Automatic Semantic Content Extraction in Videos Using a Fuzzy Ontology and Rule-
Based Model
Abstract: The recent increase in the use of video-based applications has revealed the need for extracting the
content in videos. Raw data and low-level features alone are not sufficient to fulfill the user's needs; that is, a
deeper understanding of the content at the semantic level is required. Currently, manual techniques, which are
inefficient, subjective, time-consuming, and limited in their querying capabilities, are being used to bridge the gap
between low-level representative features and high-level semantic content. Here, we propose a semantic
content extraction system that allows the user to query and retrieve objects, events, and concepts that are
extracted automatically. We introduce an ontology-based fuzzy video semantic content model that uses
spatial/temporal relations in event and concept definitions. This metaontology definition provides a
rule-construction standard applicable across a wide range of domains, allowing the user to construct an ontology for a given domain.
In addition to domain ontologies, we use additional rule definitions (without using ontology) to lower spatial
relation computation cost and to be able to define some complex situations more effectively. The proposed
framework has been fully implemented and tested on three different domains. We have obtained satisfactory
precision and recall rates for object, event and concept extraction.
ETPL
DM-004 Clustering Sentence-Level Text Using a Novel Fuzzy Relational Clustering Algorithm
Abstract: In comparison with hard clustering methods, in which a pattern belongs to a single cluster, fuzzy clustering
algorithms allow patterns to belong to all clusters with differing degrees of membership. This is important in
domains such as sentence clustering, since a sentence is likely to be related to more than one theme or topic
present within a document or set of documents. However, because most sentence similarity measures do not
represent sentences in a common metric space, conventional fuzzy clustering approaches based on prototypes
or mixtures of Gaussians are generally not applicable to sentence clustering. This paper presents a novel fuzzy
clustering algorithm that operates on relational input data; i.e., data in the form of a square matrix of pairwise
similarities between data objects. The algorithm uses a graph representation of the data, and operates in an
Expectation-Maximization framework in which the graph centrality of an object in the graph is interpreted as a
likelihood. Results of applying the algorithm to sentence clustering tasks demonstrate that the algorithm is
capable of identifying overlapping clusters of semantically related sentences, and that it is therefore of potential
use in a variety of text mining tasks. We also include results of applying the algorithm to benchmark data sets
in several other domains.
ETPL
DM-005 Distributed Processing of Probabilistic Top-k Queries in Wireless Sensor Networks
Abstract: In this paper, we introduce the notion of sufficient set and necessary set for distributed processing of
probabilistic top-k queries in cluster-based wireless sensor networks. These two concepts have desirable
properties that facilitate localized data pruning within clusters. Accordingly, we develop a suite of algorithms,
namely, sufficient set-based (SSB), necessary set-based (NSB), and boundary-based (BB), for intercluster
query processing with bounded rounds of communications. Moreover, in responding to dynamic changes of
data distribution in the network, we develop an adaptive algorithm that dynamically switches among the three
proposed algorithms to minimize the transmission cost. We show the applicability of sufficient set and
necessary set to wireless sensor networks with both two-tier hierarchical and tree-structured network
topologies. Experimental results show that the proposed algorithms reduce data transmissions significantly and
incur only small constant rounds of data communications. The experimental results also demonstrate the
superiority of the adaptive algorithm, which achieves a near-optimal performance under various conditions.
ETPL
DM-006
Evaluating Data Reliability: An Evidential Answer with Application to a Web-Enabled
Data Warehouse
Abstract: There are many available methods to integrate information source reliability in an uncertainty
representation, but there are only a few works focusing on the problem of evaluating this reliability. However,
data reliability and confidence are essential components of a data warehousing system, as they influence
subsequent retrieval and analysis. In this paper, we propose a generic method to assess data reliability from a
set of criteria using the theory of belief functions. Customizable criteria and insightful decisions are provided.
The chosen illustrative example uses real-world data from the Sym'Previus predictive
microbiology-oriented data warehouse.
ETPL
DM-007 Large Graph Analysis in the GMine System
Abstract: Current applications have produced graphs on the order of hundreds of thousands of nodes and
millions of edges. To take advantage of such graphs, one must be able to find patterns, outliers, and
communities. These tasks are better performed in an interactive environment, where human expertise can guide
the process. For large graphs, though, there are some challenges: the excessive processing requirements are
prohibitive, and drawing hundreds of thousands of nodes results in cluttered images that are hard to comprehend. To cope with
these problems, we propose an innovative framework suited for any kind of tree-like graph visual design.
GMine integrates 1) a representation for graphs organized as hierarchies of partitions (the concepts of
SuperGraph and Graph-Tree) and 2) a graph summarization methodology (CEPS). Our graph representation
deals with the problem of tracing the connection aspects of a graph hierarchy with sublinear complexity,
allowing one to grasp the neighborhood of a single node or of a group of nodes in a single click. As a proof of
concept, the visual environment of GMine is instantiated as a system in which large graphs can be investigated
globally and locally.
ETPL
DM-008
Maximum Likelihood Estimation from Uncertain Data in the Belief Function
Framework
Abstract: We consider the problem of parameter estimation in statistical models in the case where data are
uncertain and represented as belief functions. The proposed method is based on the maximization of a
generalized likelihood criterion, which can be interpreted as a degree of agreement between the statistical
model and the uncertain observations. We propose a variant of the EM algorithm that iteratively maximizes
this criterion. As an illustration, the method is applied to uncertain data clustering using finite mixture models,
in the cases of categorical and continuous attributes.
ETPL
DM-009
Nonadaptive Mastermind Algorithms for String and Vector Databases, with Case
Studies
Abstract: In this paper, we study sparsity-exploiting Mastermind algorithms for attacking the privacy of an
entire database of character strings or vectors, such as DNA strings, movie ratings, or social network friendship
data. Based on reductions to nonadaptive group testing, our methods are able to take advantage of minimal
amounts of privacy leakage, such as contained in a single bit that indicates if two people in a medical database
have any common genetic mutations, or if two people have any common friends in an online social network.
We analyze our Mastermind attack algorithms using theoretical characterizations that provide sublinear bounds
on the number of queries needed to clone the database, as well as experimental tests on genomic information,
collaborative filtering data, and online social networks. By taking advantage of the generally sparse nature of
these real-world databases and modulating a parameter that controls query sparsity, we demonstrate that
relatively few nonadaptive queries are needed to recover a large majority of each database.
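The group-testing connection can be illustrated with a standard nonadaptive decoder. The sketch below is not the paper's attack; it shows the classic COMP decoding rule such reductions rely on, under the assumption that each query reports the OR of a chosen index subset of one sparse binary row.

```python
# COMP decoding for nonadaptive group testing (illustrative sketch):
# any index appearing in a negative test must be 0; everything else
# stays a candidate 1.
import random

def comp_decode(n, tests):
    candidate = [1] * n
    for subset, outcome in tests:
        if outcome == 0:
            for i in subset:
                candidate[i] = 0
    return candidate

random.seed(7)
secret = [0] * 20
secret[3] = secret[11] = 1          # sparse "database row"
tests = []
for _ in range(60):                 # fixed (nonadaptive) query plan
    subset = [i for i in range(20) if random.random() < 0.3]
    tests.append((subset, int(any(secret[i] for i in subset))))
print(comp_decode(20, tests))
```

COMP never produces false negatives (a positive entry poisons every test containing it), so the decoded vector is always an entrywise superset of the secret; with enough random tests the false positives vanish too.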
ETPL
DM-010 On the Recovery of R-Trees
Abstract: We consider the recoverability of traditional R-tree index structures under concurrent updating
transactions, an important issue that is neglected or treated inadequately in many proposals of R-tree
concurrency control. We present two solutions to ARIES-based recovery of transactions on R-trees. These
assume a standard fine-grained single-version update model with physiological write-ahead logging and steal-
and-no-force buffering where records with uncommitted updates by a transaction may migrate from their
original page to another page due to structure modifications caused by other transactions. Both solutions
guarantee that an R-tree will remain in a consistent and balanced state in the presence of any number of
concurrent forward-rolling and (totally or partially) backward-rolling multiaction transactions and in the event
of process failures and system crashes. One solution maintains the R-tree in a strictly consistent state in which
the bounding rectangles of pages are as tight as possible, while in the other solution this requirement is relaxed.
In both solutions, only a small constant number of simultaneous exclusive latches (write latches) are needed,
and in the solution that maintains only relaxed consistency, the number of simultaneous nonexclusive
latches is similarly limited. In both solutions, deletions are handled uniformly with insertions, and a
logarithmic insertion-path length is maintained under all circumstances.
ETPL
DM-011 Ontology Matching: State of the Art and Future Challenges
Abstract: After years of research on ontology matching, it is reasonable to consider several questions: is the
field of ontology matching still making progress? Is this progress significant enough to pursue further research?
If so, what are the particularly promising directions? To answer these questions, we review the state of the art
of ontology matching and analyze the results of recent ontology matching evaluations. These results show a
measurable improvement in the field, although the pace of improvement is slowing. We conjecture that significant
improvements can be obtained only by addressing important challenges for ontology matching. We present
such challenges with insights on how to approach them, thereby aiming to direct research into the most
promising tracks and to facilitate the progress of the field.
ETPL
DM-012 Ranking on Data Manifold with Sink Points
Abstract: Ranking is an important problem in various applications, such as Information Retrieval (IR), natural
language processing, computational biology, and social sciences. Many ranking approaches have been
proposed to rank objects according to their degrees of relevance or importance. Beyond these two goals,
diversity has also been recognized as a crucial criterion in ranking. Top ranked results are expected to convey
as little redundant information as possible, and cover as many aspects as possible. However, existing ranking
approaches either take no account of diversity, or handle it separately with some heuristics. In this paper, we
introduce a novel approach, Manifold Ranking with Sink Points (MRSP), to address diversity as well as
relevance and importance in ranking. Specifically, our approach uses a manifold ranking process over the data
manifold, which can naturally find the most relevant and important data objects. Meanwhile, by turning ranked
objects into sink points on data manifold, we can effectively prevent redundant objects from receiving a high
rank. MRSP not only shows a nice convergence property, but also has an interesting and satisfying
optimization explanation. We applied MRSP on two application tasks, update summarization and query
recommendation, where diversity is of great concern in ranking. Experimental results on both tasks present a
strong empirical performance of MRSP as compared to existing ranking approaches.
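One hedged way to read the sink-point mechanism is the following sketch, a simplification rather than the authors' implementation: iterate the standard manifold-ranking update f = a*S*f + (1-a)*y, clamp already-selected items to score zero so they stop spreading relevance, and greedily pick the top remaining item each round. The affinity matrix and query vector here are invented for illustration.

```python
# Simplified manifold ranking with sink points (illustrative sketch).
def mrsp(W, y, rounds=2, a=0.5, iters=200):
    n = len(W)
    deg = [sum(row) for row in W]
    # Symmetrically normalized affinity S = D^-1/2 W D^-1/2.
    S = [[W[i][j] / (deg[i] ** 0.5 * deg[j] ** 0.5) for j in range(n)]
         for i in range(n)]
    sinks, picks = set(), []
    for _ in range(rounds):
        f = y[:]
        for _ in range(iters):
            f = [a * sum(S[i][j] * f[j] for j in range(n)) + (1 - a) * y[i]
                 for i in range(n)]
            for s in sinks:          # sink points stop spreading score
                f[s] = 0.0
        best = max((i for i in range(n) if i not in sinks),
                   key=lambda i: f[i])
        picks.append(best)
        sinks.add(best)
    return picks

# Two tight clusters {0, 1} and {2, 3}; the query slightly prefers 0.
W = [[0.0, 0.9, 0.1, 0.1],
     [0.9, 0.0, 0.1, 0.1],
     [0.1, 0.1, 0.0, 0.9],
     [0.1, 0.1, 0.9, 0.0]]
y = [1.0, 0.9, 0.8, 0.7]
print(mrsp(W, y))
```

Plain manifold ranking would rank item 1 second because it sits next to item 0; once item 0 becomes a sink, its cluster-mate loses that support and the second pick jumps to the other cluster, which is exactly the diversity effect described above.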
ETPL
DM-013 Region-Based Foldings in Process Discovery
Abstract: A central problem in the area of Process Mining is to obtain a formal model that represents the
processes that are conducted in a system. If realized, this simple motivation allows for powerful techniques that
can be used to formally analyze and optimize a system, without the need to resort to its semiformal and
sometimes inaccurate specification. The problem addressed in this paper is known as Process Discovery: to
obtain a formal model from a set of system executions. The theory of regions is a valuable tool in process
discovery: it aims at learning a formal model (Petri nets) from a set of traces. In its original form, the theory is
applied to an automaton, and therefore one should convert the traces into an acyclic automaton in order to
apply these techniques. Given that the complexity of the region-based techniques depends on the size of the
input automata, revealing the underlying cycles and folding the initial automaton can significantly reduce the
complexity of those techniques. In this paper, we follow this idea by incorporating
region information in the cycle detection algorithm, enabling the identification of complex cycles that cannot
be obtained efficiently with state-of-the-art techniques. The experimental results obtained by the devised tool
suggest that the techniques presented in this paper are a significant step toward widening the application of the theory of
regions in Process Mining for industrial scenarios.
ETPL
DM-014
Relationships between Diversity of Classification Ensembles and Single-Class
Performance Measures
Abstract: In class imbalance learning problems, how to better recognize examples from the minority class is
the key focus, since it is usually more important and expensive than the majority class. Quite a few ensemble
solutions have been proposed in the literature with varying degrees of success. It is generally believed that
diversity in an ensemble could help to improve the performance of class imbalance learning. However, no
study has actually investigated diversity in depth in terms of its definitions and effects in the context of class
imbalance learning. It is unclear whether diversity will have a similar or different impact on the performance of
minority and majority classes. In this paper, we aim to gain a deeper understanding of whether and when ensemble
diversity has a positive impact on the classification of imbalanced data sets. First, we explain when and why
diversity measured by Q-statistic can bring improved overall accuracy based on two classification patterns
proposed by Kuncheva et al. We define and give insights into good and bad patterns in imbalanced scenarios.
Then, the pattern analysis is extended to single-class performance measures, including recall, precision, and F-
measure, which are widely used in class imbalance learning. Six different situations of diversity's impact on
these measures are obtained through theoretical analysis. Finally, to further understand how diversity affects
the single class performance and overall performance in class imbalance problems, we carry out extensive
experimental studies on both artificial data sets and real-world benchmarks with highly skewed class
distributions. We find strong correlations between diversity and discussed performance measures. Diversity
shows a positive impact on the minority class in general. It is also beneficial to the overall performance in
terms of AUC and G-mean.
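The Q-statistic discussed above has a simple closed form for a pair of classifiers, built from counts of joint correct and incorrect decisions. The snippet below computes it; the correctness vectors are invented for illustration.

```python
# Pairwise Q-statistic: Q = (N11*N00 - N01*N10) / (N11*N00 + N01*N10),
# where Nxy counts examples classifier 1 got x-right and classifier 2
# got y-right (1 = correct, 0 = wrong).
def q_statistic(correct1, correct2):
    pairs = list(zip(correct1, correct2))
    n11 = sum(a and b for a, b in pairs)                # both correct
    n00 = sum((not a) and (not b) for a, b in pairs)    # both wrong
    n10 = sum(a and not b for a, b in pairs)
    n01 = sum(b and not a for a, b in pairs)
    return (n11 * n00 - n01 * n10) / (n11 * n00 + n01 * n10)

c1 = [True, True, False, True, False, True]
c2 = [True, False, True, True, False, False]
print(q_statistic(c1, c1), q_statistic(c1, c2))
```

Identical classifiers give Q = 1; classifiers that tend to err on different examples push Q toward -1, which is the "diverse" end of the scale the analysis works with.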
ETPL
DM-015 T-Drive: Enhancing Driving Directions with Taxi Drivers' Intelligence
Abstract: This paper presents a smart driving direction system leveraging the intelligence of experienced
drivers. In this system, GPS-equipped taxis are employed as mobile sensors probing the traffic rhythm of a city
and taxi drivers' intelligence in choosing driving directions in the physical world. We propose a time-dependent
landmark graph to model the dynamic traffic pattern as well as the intelligence of experienced drivers so as to
provide a user with the practically fastest route to a given destination at a given departure time. Then, a
Variance-Entropy-Based Clustering approach is devised to estimate the distribution of travel time between two
landmarks in different time slots. Based on this graph, we design a two-stage routing algorithm to compute the
practically fastest and customized route for end users. We build our system based on a real-world trajectory
data set generated by over 33,000 taxis in a period of three months, and evaluate the system by conducting both
synthetic experiments and in-the-field evaluations. As a result, 60-70 percent of the routes suggested by our
method are faster than the competing methods, and 20 percent of the routes share the same results. On average,
50 percent of our routes are at least 20 percent faster than the competing approaches.
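The time-dependent routing step can be sketched with a small Dijkstra variant in which an edge's travel time depends on the time slot at which the edge is entered. This is a hedged simplification of the paper's two-stage algorithm; the landmark graph, time slots, and cost functions below are invented for illustration.

```python
# Time-dependent Dijkstra on a toy landmark graph (illustrative).
import heapq

def fastest_route(graph, src, dst, depart):
    # graph[u] = list of (v, cost_fn); cost_fn(t) -> travel minutes
    # when the edge is entered at minute-of-day t.
    best = {src: depart}
    heap = [(depart, src, [src])]
    while heap:
        t, u, path = heapq.heappop(heap)
        if u == dst:
            return t - depart, path
        if t > best.get(u, float("inf")):
            continue                      # stale heap entry
        for v, cost in graph[u]:
            arrive = t + cost(t)
            if arrive < best.get(v, float("inf")):
                best[v] = arrive
                heapq.heappush(heap, (arrive, v, path + [v]))
    return None

rush = lambda t: 30 if 480 <= t < 600 else 10   # congested 8:00-10:00
flat = lambda t: 12                              # steady side road
graph = {"A": [("B", rush), ("C", flat)],
         "B": [("D", flat)], "C": [("D", flat)], "D": []}
print(fastest_route(graph, "A", "D", depart=500))   # during rush hour
print(fastest_route(graph, "A", "D", depart=700))   # off-peak
```

The fastest path through the same graph changes with departure time, which is the core property the time-dependent landmark graph is built to capture.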
ETPL
DM-016
A Generalized Flow-Based Method for Analysis of Implicit Relationships on
Wikipedia
Abstract: We focus on measuring relationships between pairs of objects in Wikipedia whose pages
can be regarded as individual objects. Two kinds of relationships between two objects exist: in
Wikipedia, an explicit relationship is represented by a single link between the two pages for the
objects, and an implicit relationship is represented by a link structure containing the two pages. Some
of the previously proposed methods for measuring relationships are cohesion-based methods, which
underestimate objects having high degrees, although such objects could be important in constituting
relationships in Wikipedia. The other methods are inadequate for measuring implicit relationships
because they use only one or two of the following three important factors: distance, connectivity, and
cocitation. We propose a new method using a generalized maximum flow that reflects all three
factors and does not underestimate objects having high degrees. We confirm through experiments that
our method can measure the strength of a relationship more appropriately than these previously
proposed methods do. Another remarkable aspect of our method is mining elucidatory objects, that is,
objects constituting a relationship. We explain that mining elucidatory objects would open a novel
way to deeply understand a relationship.
ETPL
DM-017
A Proxy-Based Approach to Continuous Location-Based Spatial Queries in
Mobile Environments
Abstract: Caching valid regions of spatial queries at mobile clients is effective in reducing the number
of queries submitted by mobile clients and query load on the server. However, mobile clients suffer
from longer waiting time for the server to compute valid regions. We propose in this paper a proxy-
based approach to continuous nearest-neighbor (NN) and window queries. The proxy creates
estimated valid regions (EVRs) for mobile clients by exploiting spatial and temporal locality of
spatial queries. For NN queries, we devise two new algorithms to accelerate EVR growth, leading the
proxy to build effective EVRs even when the cache size is small. On the other hand, we propose to
represent the EVRs of window queries in the form of vectors, called estimated window vectors
(EWVs), to achieve larger estimated valid regions. This novel representation and the associated
creation algorithm result in more effective EVRs of window queries. In addition, due to the distinct
characteristics, we use separate index structures, namely EVR-tree and grid index, for NN queries and
window queries, respectively. To further increase efficiency, we develop algorithms to exploit the
results of NN queries to aid grid index growth, benefiting EWV creation of window queries.
Similarly, the grid index is utilized to support NN query answering and EVR updating. We conduct
several experiments for performance evaluation. The experimental results show that the proposed
approach significantly outperforms the existing proxy-based approaches.
ETPL
DM-018
A Rough-Set-Based Incremental Approach for Updating Approximations under
Dynamic Maintenance Environments
Abstract: Approximations of a concept by a variable precision rough-set model (VPRS) usually vary
under a dynamic information system environment. It is thus effective to carry out incremental
updating approximations by utilizing previous data structures. This paper focuses on a new
incremental method for updating approximations of VPRS while objects in the information system
change dynamically. It discusses properties of information granulation and approximations under the
dynamic environment while objects in the universe evolve over time. The variation of an attribute's
domain is also considered to perform incremental updating for approximations under VPRS. Finally,
an extensive experimental evaluation validates the efficiency of the proposed method for dynamic
maintenance of VPRS approximations.
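The approximations being maintained can be stated compactly. The sketch below shows only the static variable-precision approximations, as one hedged reading of the model (the incremental update machinery is the paper's contribution and is not reproduced here): an equivalence class enters the lower approximation of a concept when the fraction of its members in the concept is at least 1 - beta, and the upper approximation when that fraction exceeds beta.

```python
# Variable-precision rough-set approximations in miniature.
def vprs_approximations(classes, concept, beta=0.2):
    lower, upper = set(), set()
    for eq in classes:
        overlap = len(eq & concept) / len(eq)
        if overlap >= 1 - beta:   # class is "mostly inside" the concept
            lower |= eq
        if overlap > beta:        # class overlaps the concept enough
            upper |= eq
    return lower, upper

classes = [{1, 2, 3}, {4, 5}, {6, 7, 8, 9}]   # equivalence classes
concept = {1, 2, 3, 4, 6}
lower, upper = vprs_approximations(classes, concept, beta=0.2)
print(sorted(lower), sorted(upper))
```

With beta = 0, this reduces to the classical rough-set lower and upper approximations; raising beta tolerates a controlled fraction of misclassified objects per class.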
ETPL
DM-019 A System to Filter Unwanted Messages from OSN User Walls
Abstract: One fundamental issue in today's Online Social Networks (OSNs) is giving users the ability
to control the messages posted on their own private space so that unwanted content is not displayed.
To date, OSNs provide little support for this requirement. To fill this gap, in this paper we propose a
system that allows OSN users to exercise direct control over the messages posted on their walls. This is
achieved through a flexible rule-based system that allows users to customize the filtering criteria to
be applied to their walls, and a Machine Learning-based soft classifier that automatically labels
messages in support of content-based filtering.
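The two layers described, filtering rules plus a soft classifier, can be sketched as follows. The keyword scorer stands in for the paper's Machine Learning classifier and is purely illustrative, as are the rule fields, author names, and threshold.

```python
# Toy two-layer wall filter: hard rules first, then a soft score.
def soft_label(message, vulgar={"spam", "junk"}):
    # Stand-in for the ML soft classifier: fraction of flagged words,
    # read as a membership degree in the "unwanted" class.
    words = message.lower().split()
    return sum(w in vulgar for w in words) / max(len(words), 1)

def wall_filter(posts, rules):
    kept = []
    for author, msg in posts:
        if author in rules["blocked_authors"]:      # rule-based layer
            continue
        if soft_label(msg) > rules["unwanted_threshold"]:  # soft layer
            continue
        kept.append((author, msg))
    return kept

rules = {"blocked_authors": {"troll42"}, "unwanted_threshold": 0.3}
posts = [("alice", "lunch photos from the trip"),
         ("troll42", "look at this"),
         ("bob", "spam spam junk offer")]
print(wall_filter(posts, rules))
```

The point of the soft score is that the threshold is a per-user filtering criterion, so the same classifier output can yield different walls for different users.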
ETPL
DM-020 AML: Efficient Approximate Membership Localization within a Web-Based Join Framework
Abstract: In this paper, we propose a new type of Dictionary-based Entity Recognition Problem,
named Approximate Membership Localization (AML). The popular Approximate Membership
Extraction (AME) provides a full coverage to the true matched substrings from a given document, but
many redundancies lower the efficiency of the AME process and degrade the performance of
real-world applications that use the extracted substrings. The AML problem aims to locate
nonoverlapping substrings, which better approximate the true matched substrings without
generating overlapped redundancies. In order to perform AML efficiently, we propose the optimized
algorithm P-Prune that prunes a large part of overlapped redundant matched substrings before
generating them. Our study using several real-world data sets demonstrates the efficiency of P-Prune
over a baseline method. We also study AML as applied to a proposed web-based join framework,
a search-based approach that joins two tables using dictionary-based entity
recognition from web documents. The results not only prove the advantage of AML over AME, but
also demonstrate the effectiveness of our search-based approach.
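The AME/AML contrast can be seen in miniature. The sketch below is a hedged illustration and is not P-Prune itself: AME reports every dictionary hit, overlaps included, while the toy AML routine keeps a nonoverlapping selection via a naive greedy longest-match-first scan.

```python
# AME vs. AML on a toy document (illustrative; not P-Prune).
def ame(text, dictionary):
    # Every (position, entry) hit, overlapping matches included.
    return sorted((i, w) for w in dictionary
                  for i in range(len(text)) if text.startswith(w, i))

def aml(text, dictionary):
    # Nonoverlapping hits via greedy longest-match left-to-right.
    hits, i = [], 0
    while i < len(text):
        match = max((w for w in dictionary if text.startswith(w, i)),
                    key=len, default=None)
        if match:
            hits.append((i, match))
            i += len(match)
        else:
            i += 1
    return hits

text = "newyorknewyorkcity"
dictionary = {"newyork", "york", "newyorkcity", "city"}
print(ame(text, dictionary))
print(aml(text, dictionary))
```

AME returns six hits here, four of them overlapping redundancies; the nonoverlapping AML output covers the same text with just two, which is the efficiency gap the abstract describes.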
ETPL
DM-021 Anonymization of Centralized and Distributed Social Networks by Sequential Clustering
Abstract: We study the problem of privacy-preservation in social networks. We consider the
distributed setting in which the network data is split between several data holders. The goal is to
arrive at an anonymized view of the unified network without revealing to any of the data holders
information about links between nodes that are controlled by other data holders. To that end, we start
with the centralized setting and offer two variants of an anonymization algorithm which is based on
sequential clustering (Sq). Our algorithms significantly outperform the SaNGreeA algorithm due to
Campan and Truta, which is the leading algorithm for achieving anonymity in networks by means of
clustering. We then devise secure distributed versions of our algorithms. To the best of our
knowledge, this is the first study of privacy preservation in distributed social networks. We conclude
by outlining future research proposals in that direction.
ETPL
DM-022 Clustering Large Probabilistic Graphs
Abstract: We study the problem of clustering probabilistic graphs. Similar to the problem of clustering
standard graphs, probabilistic graph clustering has numerous applications, such as finding complexes
in probabilistic protein-protein interaction (PPI) networks and discovering groups of users in
affiliation networks. We extend the edit-distance-based definition of graph clustering to probabilistic
graphs. We establish a connection between our objective function and correlation clustering to
propose practical approximation algorithms for our problem. A benefit of our approach is that our
objective function is parameter-free. Therefore, the number of clusters is part of the output. We also
develop methods for testing the statistical significance of the output clustering and study the case of
noisy clusterings. Using a real protein-protein interaction network and ground-truth data, we show
that our methods discover the correct number of clusters and identify established protein
relationships. Finally, we show the practicality of our techniques using a large social network of
Yahoo! users consisting of one billion edges.
ETPL
DM-023 Detecting Intrinsic Loops Underlying Data Manifold
Abstract: Detecting intrinsic loop structures of a data manifold is a necessary first step for the proper
employment of manifold learning techniques, and is of fundamental importance in the discovery of
the essential representational features underlying the data lying on the loopy manifold. An effective
strategy is proposed to solve this problem in this study. In line with our intuition, a formal definition
of a loop residing on a manifold is first given. Based on this definition, theoretical properties of loopy
manifolds are rigorously derived. In particular, a necessary and sufficient condition for detecting
essential loops of a manifold is derived. An effective algorithm for loop detection is then constructed.
The soundness of the proposed theory and algorithm is validated by a series of experiments performed
on synthetic and real-life data sets. In each of the experiments, the essential loops underlying the data
manifold can be properly detected, and the intrinsic representational features of the data manifold can
be revealed along the loop structure so detected. Particularly, some of these features can hardly be
discovered by the conventional manifold learning methods.
ETPL
DM-024 Event Tracking for Real-Time Unaware Sensitivity Analysis (EventTracker)
Abstract: This paper introduces a platform for online Sensitivity Analysis (SA) that is applicable in
large scale real-time data acquisition (DAQ) systems. Here, we use the term real-time in the context
of a system that has to respond to externally generated input stimuli within a finite and specified
period. Complex industrial systems such as manufacturing, healthcare, transport, and finance require
high-quality information on which to base timely responses to events occurring in their volatile
environments. The motivation for the proposed EventTracker platform is the assumption that modern
industrial systems are able to capture data in real-time and have the necessary technological flexibility
to adjust to changing system requirements. The flexibility to adapt can only be assured if data is
succinctly interpreted and translated into corrective actions in a timely manner. An important factor
that facilitates data interpretation and information modeling is an appreciation of the effect system
inputs have on each output at the time of occurrence. Many existing sensitivity analysis methods
appear to hamper efficient and timely analysis due to a reliance on historical data, or sluggishness in
providing a timely solution that would be of use in real-time applications. This inefficiency is further
compounded by computational limitations and the complexity of some existing models. In dealing
with real-time event driven systems, the underpinning logic of the proposed method is based on the
assumption that in the vast majority of cases changes in input variables will trigger events. Every
single or combination of events could subsequently result in a change to the system state. The
proposed event tracking sensitivity analysis method describes variables and the system state as a
collection of events. The higher the numeric occurrence of an input variable at the trigger level during
an event monitoring interval, the greater is its impact on the final analysis of the system state.
Experiments were designed to compare the proposed event tracking sensitivity analysis method with a
comparable method (that of Entropy). An improvement of 10 percent in computational efficiency
without loss in accuracy was observed. The comparison also showed that the time taken to perform
the sensitivity analysis was 0.5 percent of that required when using the comparable Entropy-based
method.
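The scoring idea described above (count how often each input variable fires at trigger level inside one monitoring interval, and treat higher counts as higher impact) can be sketched as follows. This is a simplified illustration; the variable names and the frequency-based score are our assumptions, not the EventTracker implementation.

```python
from collections import Counter

def event_impact(events, interval_start, interval_end):
    """Rank input variables by how often they fire at trigger level
    within one event-monitoring interval (illustrative sketch only)."""
    counts = Counter(
        var for t, var in events if interval_start <= t < interval_end
    )
    total = sum(counts.values())
    # Impact score: relative occurrence frequency within the interval.
    return {var: n / total for var, n in counts.items()}

# Hypothetical stream of (timestamp, input-variable) trigger events.
stream = [(0.1, "temp"), (0.2, "pressure"), (0.3, "temp"), (1.2, "flow")]
scores = event_impact(stream, 0.0, 1.0)
```

Within the interval [0.0, 1.0), "temp" fires twice and "pressure" once, so "temp" receives the higher impact score; "flow" falls outside the interval and is ignored.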
ETPL
DM-025 Fast Activity Detection: Indexing for Temporal Stochastic Automaton-Based
Abstract: Today, numerous applications require the ability to monitor a continuous stream of fine-
grained data for the occurrence of certain high-level activities. A number of computerized systems,
including ATM networks, web servers, and intrusion detection systems, systematically track every
atomic action we perform, thus generating massive streams of timestamped observation data, possibly
from multiple concurrent activities. In this paper, we address the problem of efficiently detecting
occurrences of high-level activities from such interleaved data streams. A solution to this important
problem would greatly benefit a broad range of applications, including fraud detection, video
surveillance, and cyber security. There has been extensive work in the last few years on modeling
activities using probabilistic models. In this paper, we propose a temporal probabilistic graph so that
the elapsed time between observations also plays a role in defining whether a sequence of
observations constitutes an activity. We first propose a data structure called “temporal multiactivity
graph” to store multiple activities that need to be concurrently monitored. We then define an index
called Temporal Multiactivity Graph Index Creation (tMAGIC) that, based on this data structure,
examines and links observations as they occur. We define algorithms for insertion and bulk insertion
into the tMAGIC index and show that this can be efficiently accomplished. We also define algorithms
to solve two problems: the “evidence” problem that tries to find all occurrences of an activity (with
probability over a threshold) within a given sequence of observations, and the “identification”
problem that tries to find the activity that best matches a sequence of observations. We introduce
complexity-reducing restrictions and pruning strategies to make the problem, which is intrinsically
exponential, linear in the number of observations. Our experiments confirm that tMAGIC has time and
space complexity linear to the size of the input, and can efficiently retrieve instances of the monitored
activities.
ETPL
DM-026 Finding Rare Classes: Active Learning with Generative and Discriminative
Abstract: Discovering rare categories and classifying new instances of them are important data mining
issues in many fields, but fully supervised learning of a rare class classifier is prohibitively costly in
labeling effort. There has therefore been increasing interest both in active discovery: to identify new
classes quickly, and active learning: to train classifiers with minimal supervision. These goals occur
together in practice and are intrinsically related because examples of each class are required to train a
classifier. Nevertheless, very few studies have tried to optimise them together, meaning that data
mining for rare classes in new domains makes inefficient use of human supervision. Developing
active learning algorithms to optimise both rare class discovery and classification simultaneously is
challenging because discovery and classification have conflicting requirements in query criteria. In
this paper, we address these issues with two contributions: a unified active learning model to jointly
discover new categories and learn to classify them by adapting query criteria online; and a classifier
combination algorithm that switches generative and discriminative classifiers as learning progresses.
Extensive evaluation on a batch of standard UCI and vision data sets demonstrates the superiority of
this approach over existing methods.
ETPL
DM-027 Halite: Fast and Scalable Multiresolution Local-Correlation Clustering
Abstract: This paper proposes Halite, a novel, fast, and scalable clustering method that looks for
clusters in subspaces of multidimensional data. Existing methods are typically superlinear in space or
execution time. Halite's strengths are that it is fast and scalable, while still giving highly accurate
results. Specifically, the main contributions of Halite are: 1) Scalability: it is linear or quasi-linear in
time and space regarding the data size and dimensionality, and the dimensionality of the clusters'
subspaces; 2) Usability: it is deterministic, robust to noise, doesn't take the number of clusters as an
input parameter, and detects clusters in subspaces generated by original axes or by their linear
combinations, including space rotation; 3) Effectiveness: it is accurate, providing results with equal or
better quality compared to top related works; and 4) Generality: it includes a soft clustering approach.
Experiments on synthetic data ranging from five to 30 axes and up to 1 million points were
performed. Halite was on average at least 12 times faster than seven representative works, and always
presented highly accurate results. On real data, Halite was at least 11 times faster than others,
increasing their accuracy in up to 35 percent. Finally, we report experiments in a real scenario where
soft clustering is desirable.
ETPL
DM-028 k-Pattern Set Mining under Constraints
Abstract: We introduce the problem of k-pattern set mining, concerned with finding a set of k related
patterns under constraints. This contrasts to regular pattern mining, where one searches for many
individual patterns. The k-pattern set mining problem is a very general problem that can be
instantiated to a wide variety of well-known mining tasks including concept-learning, rule-learning,
redescription mining, conceptual clustering and tiling. To this end, we formulate a large number of
constraints for use in k-pattern set mining, both at the local level, that is, on individual patterns, and
on the global level, that is, on the overall pattern set. Building general solvers for the pattern set
mining problem remains a challenge. Here, we investigate to what extent constraint programming
(CP) can be used as a general solution strategy. We present a mapping of pattern set constraints to
constraints currently available in CP. This allows us to investigate a large number of settings within a
unified framework and to gain insight in the possibilities and limitations of these solvers. This is
important as it allows us to create guidelines in how to model new problems successfully and how to
model existing problems more efficiently. It also opens up the way for other solver technologies.
ETPL
DM-029 Minimally Supervised Novel Relation Extraction Using a Latent Relational
Abstract: The World Wide Web includes semantic relations of numerous types that exist among
different entities. Extracting the relations that exist between two entities is an important step in
various Web-related tasks such as information retrieval (IR), information extraction, and social
network extraction. A supervised relation extraction system that is trained to extract a particular
relation type (source relation) might not accurately extract a new type of a relation (target relation) for
which it has not been trained. However, it is costly to create training data manually for every new
relation type that one might want to extract. We propose a method to adapt an existing relation
extraction system to extract new relation types with minimum supervision. Our proposed method
comprises two stages: learning a lower dimensional projection between different relations, and
learning a relational classifier for the target relation type with instance sampling. First, to represent a
semantic relation that exists between two entities, we extract lexical and syntactic patterns from
contexts in which those two entities co-occur. Then, we construct a bipartite graph between relation-
specific (RS) and relation-independent (RI) patterns. Spectral clustering is performed on the bipartite
graph to compute a lower dimensional projection. Second, we train a classifier for the target relation
type using a small number of labeled instances. To account for the lack of target relation training
instances, we present a one-sided undersampling method. We evaluate the proposed method using a
data set that contains 2,000 instances for 20 different relation types. Our experimental results show
that the proposed method achieves a statistically significant macroaverage F-score of 62.77.
Moreover, the proposed method outperforms numerous baselines and a previously proposed weakly
supervised relation extraction method.
ETPL
DM-030 Mining User Queries with Markov Chains: Application to Online Image Retrieval
Abstract: We propose a novel method for automatic annotation, indexing and annotation-based
retrieval of images. The new method, that we call Markovian Semantic Indexing (MSI), is presented
in the context of an online image retrieval system. Assuming such a system, the users' queries are
used to construct an Aggregate Markov Chain (AMC) through which the relevance between the
keywords seen by the system is defined. The users' queries are also used to automatically annotate the
images. A stochastic distance between images, based on their annotation and the keyword relevance
captured in the AMC, is then introduced. Geometric interpretations of the proposed distance are
provided and its relation to a clustering in the keyword space is investigated. By means of a new
measure of Markovian state similarity, the mean first cross passage time (CPT), optimality properties
of the proposed distance are proved. Images are modeled as points in a vector space and their
similarity is measured with MSI. The new method is shown to possess certain theoretical advantages
and also to achieve better Precision versus Recall results when compared to Latent Semantic Indexing
(LSI) and probabilistic Latent Semantic Indexing (pLSI) methods in Annotation-Based Image
Retrieval (ABIR) tasks.
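The construction of an Aggregate Markov Chain from user queries can be sketched as a keyword-to-keyword transition matrix estimated from query sessions. This is a simplified assumption-laden illustration (session structure, keyword vocabulary, and maximum-likelihood estimation are ours), not the MSI system itself.

```python
import numpy as np

def build_amc(query_sessions, keywords):
    """Build a row-stochastic Aggregate Markov Chain over keywords from
    consecutive keywords in user query sessions (a simplified sketch of
    how query logs can define keyword relevance)."""
    idx = {k: i for i, k in enumerate(keywords)}
    counts = np.zeros((len(keywords), len(keywords)))
    for session in query_sessions:
        for a, b in zip(session, session[1:]):
            counts[idx[a], idx[b]] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Keywords that never transition keep an all-zero row.
    row_sums[row_sums == 0] = 1
    return counts / row_sums

# Hypothetical query sessions: each is an ordered list of keywords.
sessions = [["beach", "sunset"], ["beach", "sea"], ["beach", "sunset"]]
P = build_amc(sessions, ["beach", "sunset", "sea"])
```

Here "beach" is followed by "sunset" twice and by "sea" once, so the chain captures that "sunset" is the more relevant continuation.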
ETPL
DM-031 Reinforced Similarity Integration in Image-Rich Information Networks
Abstract: Social multimedia sharing and hosting websites, such as Flickr and Facebook, contain
billions of user-submitted images. Popular Internet commerce websites such as Amazon.com are also
furnished with tremendous amounts of product-related images. In addition, images in such social
networks are also accompanied by annotations, comments, and other information, thus forming
heterogeneous image-rich information networks. In this paper, we introduce the concept of
(heterogeneous) image-rich information network and the problem of how to perform information
retrieval and recommendation in such networks. We propose a fast algorithm heterogeneous
minimum order k-SimRank (HMok-SimRank) to compute link-based similarity in weighted
heterogeneous information networks. Then, we propose an algorithm Integrated Weighted Similarity
Learning (IWSL) to account for both link-based and content-based similarities by considering the
network structure and mutually reinforcing link similarity and feature weight learning. Both local and
global feature learning methods are designed. Experimental results on Flickr and Amazon data sets
show that our approach is significantly better than traditional methods in terms of both relevance and
speed. A new product search and recommendation system for e-commerce has been implemented
based on our algorithm.
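For orientation, the textbook SimRank iteration that HMok-SimRank accelerates can be written down directly. The sketch below is the plain, unweighted baseline (two nodes are similar when their in-neighbors are similar), not the paper's heterogeneous minimum-order variant; the decay constant C = 0.8 is a conventional choice.

```python
import numpy as np

def simrank(adj, C=0.8, iters=10):
    """Plain SimRank iteration on a directed graph given as an adjacency
    matrix. HMok-SimRank in the paper is a faster weighted heterogeneous
    variant; this is only the textbook baseline it builds on."""
    n = adj.shape[0]
    S = np.eye(n)
    in_nbrs = [np.nonzero(adj[:, j])[0] for j in range(n)]
    for _ in range(iters):
        S_new = np.eye(n)
        for a in range(n):
            for b in range(n):
                if a == b:
                    continue
                Ia, Ib = in_nbrs[a], in_nbrs[b]
                if len(Ia) == 0 or len(Ib) == 0:
                    continue
                # Average similarity over all in-neighbor pairs.
                S_new[a, b] = C * S[np.ix_(Ia, Ib)].sum() / (len(Ia) * len(Ib))
        S = S_new
    return S

# Tiny example: nodes 1 and 2 share the single in-neighbor 0.
A = np.array([[0, 1, 1], [0, 0, 0], [0, 0, 0]])
S = simrank(A)
```

Nodes 1 and 2 have an identical in-neighborhood, so their similarity converges to the decay constant C.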
ETPL
DM-032 Supporting Search-As-You-Type Using SQL in Databases
Abstract: A search-as-you-type system computes answers on-the-fly as a user types in a keyword
query character by character. We study how to support search-as-you-type on data residing in a
relational DBMS. We focus on how to support this type of search using the native database language,
SQL. A main challenge is how to leverage existing database functionalities to meet the high-
performance requirement to achieve an interactive speed. We study how to use auxiliary indexes
stored as tables to increase search performance. We present solutions for both single-keyword queries
and multikeyword queries, and develop novel techniques for fuzzy search using SQL by allowing
mismatches between query keywords and answers. We present techniques to answer first-N queries
and discuss how to support updates efficiently. Experiments on large, real data sets show that our
techniques enable DBMS systems on a commodity computer to support search-as-you-type on tables
with millions of records.
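The central point, that per-keystroke prefix search can be expressed in plain SQL against a relational DBMS, can be sketched with a minimal example. The schema and the single-table LIKE query below are simplified assumptions for illustration; the paper's techniques use auxiliary index tables and fuzzy matching on top of this idea.

```python
import sqlite3

# In-memory table standing in for the relational data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany("INSERT INTO docs (title) VALUES (?)",
                 [("database systems",), ("data mining",), ("networks",)])

def search_as_you_type(prefix):
    """Recompute answers for the partially typed keyword using native SQL."""
    cur = conn.execute(
        "SELECT title FROM docs WHERE title LIKE ? ORDER BY title",
        (prefix + "%",))
    return [row[0] for row in cur]

results = search_as_you_type("data")
```

Each keystroke simply reissues the query with a longer prefix; the performance challenge the paper addresses is making this interactive on tables with millions of records.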
ETPL
DM-033 Simple Hybrid and Incremental Postpruning Techniques for Rule Induction
Abstract: Pruning achieves the dual goal of reducing the complexity of the final hypothesis for improved
comprehensibility, and improving its predictive accuracy by minimizing the overfitting due to noisy
data. This paper presents a new hybrid pruning technique for rule induction, as well as an incremental
postpruning technique based on a misclassification tolerance. Although both have been designed for
RULES-7, the latter is also applicable to any rule induction algorithm in general. A thorough
empirical evaluation reveals that the proposed techniques enable RULES-7 to outperform other state-
of-the-art classification techniques. The improved classifier is also more accurate and up to two orders
of magnitude faster than before.
ETPL
DM-034 λ-Diverse Nearest Neighbors Browsing for Multidimensional Data
Abstract: Traditional search methods try to obtain the most relevant information and rank it according
to the degree of similarity to the queries. Diversity in query results is also preferred by a variety of
applications since results very similar to each other cannot capture all aspects of the queried topic. In
this paper, we focus on the lambda-diverse k-nearest neighbor search problem on spatial and
multidimensional data. Unlike the approach of diversifying query results in a postprocessing step, we
naturally obtain diverse results with the proposed geometric and index-based methods. We first make
an analogy with the concept of Natural Neighbors (NatN) and propose a natural neighbor-based
method for 2D and 3D data and an incremental browsing algorithm based on Gabriel graphs for
higher dimensional spaces. We then introduce a diverse browsing method based on the distance
browsing feature of spatial index structures, such as R-trees. The algorithm maintains a Priority
Queue with mindivdist of the objects depending on both relevancy and angular diversity and
efficiently prunes nondiverse items and nodes. We experiment with a number of spatial and high-
dimensional data sets, including Factual's (http://www.factual.com/) US points-of-interest data set of
13M entries. On the experimental setup, the diverse browsing method is shown to be more efficient
(regarding disk accesses) than k-NN search on R-trees, and more effective (regarding Maximal
Marginal Relevance (MMR)) than the diverse nearest neighbor search techniques found in the
literature.
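The relevance-versus-diversity trade-off underlying diverse nearest-neighbor search can be sketched with a greedy MMR-style selection. This is a generic illustration under our own assumptions (a linear trade-off weight `lam`, Euclidean distance, brute-force candidates), not the paper's natural-neighbor or R-tree index-based browsing methods.

```python
import math

def diverse_knn(query, points, k, lam=0.5):
    """Greedily pick k results, trading closeness to the query (relevance)
    against distance to already chosen results (diversity)."""
    dist = math.dist
    chosen = []
    candidates = sorted(points, key=lambda p: dist(p, query))
    while candidates and len(chosen) < k:
        def score(p):
            rel = -dist(p, query)                                  # relevance
            div = min((dist(p, c) for c in chosen), default=0.0)   # diversity
            return lam * rel + (1 - lam) * div
        best = max(candidates, key=score)
        chosen.append(best)
        candidates.remove(best)
    return chosen

pts = [(0.0, 0.1), (0.0, 0.2), (5.0, 5.0), (0.1, 0.0)]
res = diverse_knn((0.0, 0.0), pts, k=2)
```

The first pick is the nearest point; the second balances nearness to the query against separation from the first pick, so the far-away outlier is still skipped while near-duplicates of the first result are penalized.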
ETPL
DM-035 A Bound on Kappa-Error Diagrams for Analysis of Classifier Ensembles
Abstract: Kappa-error diagrams are used to gain insights about why an ensemble method is better
than another on a given data set. A point on the diagram corresponds to a pair of classifiers. The x-
axis is the pairwise diversity (kappa), and the y-axis is the averaged individual error. In this study,
kappa is calculated from the 2 × 2 correct/wrong contingency matrix. We derive a lower bound on
kappa which determines the feasible part of the kappa-error diagram. Simulations and experiments
with real data show that there is unoccupied feasible space on the diagram corresponding to
(hypothetical) better ensembles, and that individual accuracy is the leading factor in improving the
ensemble accuracy.
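The two coordinates of a kappa-error diagram are directly computable from the 2 × 2 correct/wrong contingency counts of a classifier pair. The sketch below uses the standard Cohen's kappa on that table; the count naming (a, b, c, d) is a common convention, assumed here for illustration.

```python
def kappa_2x2(a, b, c, d):
    """Pairwise kappa from the 2x2 correct/wrong contingency counts:
    a = both correct, b = only the first correct, c = only the second
    correct, d = both wrong (standard Cohen's kappa on the table)."""
    m = a + b + c + d
    p_o = (a + d) / m                                       # observed agreement
    p_e = ((a + b) * (a + c) + (c + d) * (b + d)) / (m * m)  # chance agreement
    return (p_o - p_e) / (1 - p_e)

def avg_error(a, b, c, d):
    """Averaged individual error of the pair (the diagram's y-axis)."""
    m = a + b + c + d
    e1 = (c + d) / m  # first classifier wrong
    e2 = (b + d) / m  # second classifier wrong
    return (e1 + e2) / 2

point = (kappa_2x2(70, 0, 0, 30), avg_error(70, 0, 0, 30))
```

Two identical classifiers (b = c = 0) land at kappa = 1 with their common error as the y-coordinate; statistically independent ones land at kappa = 0.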
ETPL
DM-036 A New Algorithm for Inferring User Search Goals with Feedback Sessions
Abstract: For a broad-topic and ambiguous query, different users may have different search goals
when they submit it to a search engine. The inference and analysis of user search goals can be very
useful in improving search engine relevance and user experience. In this paper, we propose a novel
approach to infer user search goals by analyzing search engine query logs. First, we propose a
framework to discover different user search goals for a query by clustering the proposed feedback
sessions. Feedback sessions are constructed from user click-through logs and can efficiently reflect
the information needs of users. Second, we propose a novel approach to generate pseudo-documents
to better represent the feedback sessions for clustering. Finally, we propose a new criterion,
“Classified Average Precision (CAP)”, to evaluate the performance of inferring user search goals.
Experimental results are presented using user click-through logs from a commercial search engine to
validate the effectiveness of our proposed methods.
ETPL
DM-037 Annotating Search Results from Web Databases
Abstract: An increasing number of databases have become web accessible through HTML form-based
search interfaces. The data units returned from the underlying database are usually encoded into the
result pages dynamically for human browsing. For the encoded data units to be machine processable,
which is essential for many applications such as deep web data collection and Internet comparison
shopping, they need to be extracted out and assigned meaningful labels. In this paper, we present an
automatic annotation approach that first aligns the data units on a result page into different groups
such that the data in the same group have the same semantics. Then, for each group we annotate it
from different aspects and aggregate the different annotations to predict a final annotation label for it.
An annotation wrapper for the search site is automatically constructed and can be used to annotate
new result pages from the same web database. Our experiments indicate that the proposed approach is
highly effective.
ETPL
DM-038 Building a Scalable Database-Driven Reverse Dictionary
Abstract: In this paper, we describe the design and implementation of a reverse dictionary. Unlike a
traditional forward dictionary, which maps from words to their definitions, a reverse dictionary takes
a user input phrase describing the desired concept, and returns a set of candidate words that satisfy the
input phrase. This work has significant application not only for the general public, particularly those
who work closely with words, but also in the general field of conceptual search. We present a set of
algorithms and the results of a set of experiments showing the retrieval accuracy of our methods and
the runtime response time performance of our implementation. Our experimental results show that our
approach can provide significant improvements in performance scale without sacrificing the quality
of the result. Our experiments comparing the quality of our approach to that of currently available
reverse dictionaries show that our approach can provide significantly higher quality than either of
the other currently available implementations.
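The basic mapping a reverse dictionary inverts can be sketched with an inverted index over definition terms. This toy treats matching as conjunctive keyword lookup; the actual system described above uses much richer conceptual similarity, and the sample entries are invented for illustration.

```python
from collections import defaultdict

# Toy forward dictionary: word -> definition.
forward = {
    "telescope": "instrument for viewing distant objects",
    "microscope": "instrument for viewing very small objects",
    "sonnet": "poem of fourteen lines",
}

# Invert it: definition term -> set of words whose definitions use it.
index = defaultdict(set)
for word, definition in forward.items():
    for term in definition.split():
        index[term].add(word)

def reverse_lookup(phrase):
    """Return candidate words whose definitions cover every term of the
    user's input phrase (conjunctive keyword match)."""
    sets = [index.get(t, set()) for t in phrase.split()]
    return set.intersection(*sets) if sets else set()

candidates = reverse_lookup("viewing distant objects")
```

Querying with "viewing distant objects" narrows the candidates to the single word whose definition contains all three terms.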
ETPL
DM-039 Discovering Temporal Change Patterns in the Presence of Taxonomies
Abstract: Frequent itemset mining is a widely used exploratory technique that focuses on discovering
recurrent correlations among data. The steadfast evolution of markets and business environments
prompts the need of data mining algorithms to discover significant correlation changes in order to
reactively suit product and service provision to customer needs. Change mining, in the context of
frequent itemsets, focuses on detecting and reporting significant changes in the set of mined itemsets
from one time period to another. The discovery of frequent generalized itemsets, i.e., itemsets that 1)
frequently occur in the source data, and 2) provide a high-level abstraction of the mined knowledge,
issues new challenges in the analysis of itemsets that become rare, and thus are no longer extracted,
from a certain point. This paper proposes a novel kind of dynamic pattern, namely the History
GENeralized Pattern (HIGEN), that represents the evolution of an itemset in consecutive time
periods, by reporting the information about its frequent generalizations characterized by minimal
redundancy (i.e., minimum level of abstraction) in case it becomes infrequent in a certain time period.
To address HIGEN mining, it proposes HIGEN MINER, an algorithm that focuses on avoiding
itemset mining followed by postprocessing by exploiting a support-driven itemset generalization
approach. To focus the attention on the minimally redundant frequent generalizations and thus reduce
the amount of the generated patterns, the discovery of a smart subset of HIGENs, namely the
NONREDUNDANT HIGENs, is addressed as well. Experiments performed on both real and
synthetic datasets show the efficiency and the effectiveness of the proposed approach as well as its
usefulness in a real application context.
ETPL
DM-040 Extending BCDM to Cope with Proposals and Evaluations of Updates
Abstract: The cooperative construction of data/knowledge bases has recently had a significant impulse
(see, e.g., Wikipedia [1]). In cases in which data/knowledge quality and reliability are crucial,
proposals of update/insertion/deletion need to be evaluated by experts. To the best of our knowledge,
no theoretical framework has been devised to model the semantics of update proposal/ evaluation in
the relational context. Since time is an intrinsic part of most domains (as well as of the
proposal/evaluation process itself), semantic approaches to temporal relational databases (specifically,
Bitemporal Conceptual Data Model (henceforth, BCDM) [2]) are the starting point of our approach.
In this paper, we propose BCDMPV, a semantic temporal relational model that extends BCDM to deal
with multiple update/insertion/deletion proposals and with acceptances/rejections of proposals
themselves. We propose a theoretical framework, defining the new data structures, manipulation
operations and temporal relational algebra and proving some basic properties, namely that
BCDMPV is a consistent extension of BCDM and that it is reducible to BCDM. These properties
ensure consistency with most relational temporal database frameworks, facilitating implementations.
ETPL
DM-041 Facilitating Effective User Navigation through Website Structure Improvement
Abstract: Designing well-structured websites to facilitate effective user navigation has long been a
challenge. A primary reason is that the web developers' understanding of how a website should be
structured can be considerably different from that of the users. While various methods have been
proposed to relink webpages to improve navigability using user navigation data, the completely
reorganized new structure can be highly unpredictable, and the cost of disorienting users after the
changes remains unanalyzed. This paper addresses how to improve a website without introducing
substantial changes. Specifically, we propose a mathematical programming model to improve the user
navigation on a website while minimizing alterations to its current structure. Results from extensive
tests conducted on a publicly available real data set indicate that our model not only significantly
improves the user navigation with very few changes, but also can be effectively solved. We have also
tested the model on large synthetic data sets to demonstrate that it scales up very well. In addition, we
define two evaluation metrics and use them to assess the performance of the improved website using
the real data set. Evaluation results confirm that the user navigation on the improved structure is
indeed greatly enhanced. More interestingly, we find that heavily disoriented users are more likely to
benefit from the improved structure than the less disoriented users.
ETPL
DM-042 Information-Theoretic Outlier Detection for Large-Scale Categorical Data
Abstract: Outlier detection can usually be considered as a pre-processing step for locating, in a data
set, those objects that do not conform to well-defined notions of expected behavior. It is very
important in data mining for discovering novel or rare events, anomalies, vicious actions, exceptional
phenomena, etc. We are investigating outlier detection for categorical data sets. This problem is
especially challenging because of the difficulty of defining a meaningful similarity measure for
categorical data. In this paper, we propose a formal definition of outliers and an optimization model
of outlier detection, via a new concept of holoentropy that takes both entropy and total correlation into
consideration. Based on this model, we define a function for the outlier factor of an object which is
solely determined by the object itself and can be updated efficiently. We propose two practical
1-parameter outlier detection methods, named ITB-SS and ITB-SP, which require no user-defined
parameters for deciding whether an object is an outlier. Users need only provide the number of
outliers they want to detect. Experimental results show that ITB-SS and ITB-SP are more effective
and efficient than mainstream methods and can be used to deal with both large and high-dimensional
data sets where existing algorithms fail.
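A much-simplified version of an entropy-driven outlier factor for categorical data can be sketched by scoring each record by how improbable its attribute values are. This is a crude stand-in: the holoentropy-based factor above additionally accounts for total correlation between attributes, which the sketch ignores.

```python
import math
from collections import Counter

def outlier_factors(records):
    """Score each categorical record by the sum of the negative log
    frequencies of its attribute values: rare values raise the score.
    Illustrative only; ignores correlations between attributes."""
    n = len(records)
    n_attrs = len(records[0])
    # Per-attribute value frequencies over the whole data set.
    freqs = [Counter(r[j] for r in records) for j in range(n_attrs)]
    return [
        sum(-math.log(freqs[j][r[j]] / n) for j in range(n_attrs))
        for r in records
    ]

data = [("red", "small"), ("red", "small"), ("red", "small"), ("blue", "large")]
scores = outlier_factors(data)
```

The ("blue", "large") record, whose values each occur only once, receives the highest outlier factor; to mimic the top-k interface described above, one would return the requested number of highest-scoring records.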
ETPL
DM-043 Modeling and Solving Distributed Configuration Problems: A CSP-Based Approach
Abstract: Product configuration can be defined as the task of tailoring a product according to the
specific needs of a customer. Due to the inherent complexity of this task, which for example includes
the consideration of complex constraints or the automatic completion of partial configurations,
various Artificial Intelligence techniques have been explored in the last decades to tackle such
configuration problems. Most of the existing approaches adopt a single-site, centralized approach. In
modern supply chain settings, however, the components of a customizable product may themselves be
configurable, thus requiring a multisite, distributed approach. In this paper, we analyze the challenges
of modeling and solving such distributed configuration problems and propose an approach based on
Distributed Constraint Satisfaction. In particular, we advocate the use of Generative Constraint
Satisfaction for knowledge modeling and show in an experimental evaluation that the use of generic
constraints is particularly advantageous also in the distributed problem solving phase.
ETPL
DM-044 On Similarity Preserving Feature Selection
Abstract: In the literature of feature selection, different criteria have been proposed to evaluate the
goodness of features. In our investigation, we notice that a number of existing selection criteria
implicitly select features that preserve sample similarity, and can be unified under a common
framework. We further point out that any feature selection criteria covered by this framework cannot
handle redundant features, a common drawback of these criteria. Motivated by these observations, we
propose a new "Similarity Preserving Feature Selection" framework in an explicit and rigorous way.
We show, through theoretical analysis, that the proposed framework not only encompasses many
widely used feature selection criteria, but also naturally overcomes their common weakness in
handling feature redundancy. In developing this new framework, we begin with a conventional
combinatorial optimization formulation for similarity preserving feature selection, then extend it with
a sparse multiple-output regression formulation to improve its efficiency and effectiveness. A set of
Three algorithms are devised to efficiently solve the proposed formulations, each of which has its own
advantages in terms of computational complexity and selection performance. As exhibited by our
extensive experimental study, the proposed framework achieves superior feature selection
performance and attractive properties.
ETPL
DM-045 Protecting Sensitive Labels in Social Network Data Anonymization
Abstract: Privacy is one of the major concerns when publishing or sharing social network data for
social science research and business analysis. Recently, researchers have developed privacy models
similar to k-anonymity to prevent node reidentification through structure information. However, even
when these privacy models are enforced, an attacker may still be able to infer one's private
information if a group of nodes largely share the same sensitive labels (i.e., attributes). In other words,
the label-node relationship is not well protected by pure structure anonymization methods.
Furthermore, existing approaches, which rely on edge editing or node clustering, may significantly
alter key graph properties. In this paper, we define a k-degree-l-diversity anonymity model that
considers the protection of structural information as well as sensitive labels of individuals. We further
propose a novel anonymization methodology based on adding noise nodes. We develop a new algorithm that adds noise nodes to the original graph while introducing the least
distortion to graph properties. Most importantly, we provide a rigorous analysis of the theoretical
bounds on the number of noise nodes added and their impacts on an important graph property. We
conduct extensive experiments to evaluate the effectiveness of the proposed technique.
ETPL
DM-046 Robust Module-Based Data Management
Abstract: The current trend for building an ontology-based data management system (DMS) is to
capitalize on efforts made to design a preexisting well-established DMS (a reference system). The
method amounts to extracting from the reference DMS a piece of schema relevant to the new
application needs-a module-, possibly personalizing it with extra constraints w.r.t. the application
under construction, and then managing a data set using the resulting schema. In this paper, we extend
the existing definitions of modules and we introduce novel properties of robustness that provide
means for easily checking that a robust module-based DMS evolves safely w.r.t. both the schema and
the data of the reference DMS. We carry out our investigations in the setting of description logics
which underlie modern ontology languages, like RDFS, OWL, and OWL2 from W3C. Notably, we
focus on the DL-Lite_A dialect of the DL-Lite family, which encompasses the foundations of the QL
profile of OWL2 (i.e., DL-Lite_R): the W3C recommendation for efficiently managing large data sets.
ETPL
DM-047 Sampling Online Social Networks
Abstract: As online social networking emerges, there has been increased interest to utilize the
underlying network structure as well as the available information on social peers to improve the
information needs of a user. In this paper, we focus on improving the performance of information
collection from the neighborhood of a user in a dynamic social network. We introduce sampling-
based algorithms to efficiently explore a user's social network respecting its structure and to quickly
approximate quantities of interest. We introduce and analyze variants of the basic sampling scheme
exploring correlations across our samples. Models of centralized and distributed social networks are
considered. We show that our algorithms can be utilized to rank items in the neighborhood of a user,
assuming that information for each user in the network is available. Using real and synthetic data sets,
we validate the results of our analysis and demonstrate the efficiency of our algorithms in
approximating quantities of interest. The methods we describe are general and can easily be adapted to a variety of strategies aiming to efficiently collect information from a social graph.
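One generic sampling primitive that fits this setting is reservoir sampling, which draws a uniform sample from a stream of unknown length without materializing the whole neighborhood. The sketch below is an illustrative assumption, not the paper's own sampling scheme:

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Uniform sample of k items from a stream of unknown length.

    Each item seen so far survives in the reservoir with probability
    k / items_seen, so the final sample is uniform over the stream.
    """
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = rng.randint(0, i)   # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample

rng = random.Random(42)
print(reservoir_sample(range(1000), 5, rng))
```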
ETPL
DM-048
Supporting Flexible, Efficient, and User-Interpretable Retrieval of Similar
Time Series
Abstract: Supporting decision making in domains where the dynamics of an observed phenomenon must be dealt with can greatly benefit from the retrieval of past cases, provided that proper representation and retrieval techniques are implemented. In particular, when the parameters of interest take the form of time series, dimensionality reduction and flexible retrieval have to be addressed to this end. Classical
methodological solutions proposed to cope with these issues, typically based on mathematical
transforms, are characterized by strong limitations, such as a difficult interpretation of retrieval results
for end users, reduced flexibility and interactivity, or inefficiency. In this paper, we describe a novel
framework, in which time-series features are summarized by means of Temporal Abstractions, and
then retrieved by resorting to abstraction similarity. Our approach guarantees interpretability of the output
results, and understandability of the (user-guided) retrieval process. In particular, multilevel
abstraction mechanisms and proper indexing techniques are provided, for flexible query issuing, and
efficient and interactive query answering. Experimental results have shown the efficiency of our
approach in a scalability test, and its superiority with respect to the use of a classical mathematical
technique in flexibility, user friendliness, and also quality of results.
ETPL
DM-049
The Minimum Consistent Subset Cover Problem: A Minimization View of Data
Mining
Abstract: In this paper, we introduce and study the minimum consistent subset cover (MCSC)
problem. Given a finite ground set X and a constraint t, find the minimum number of consistent
subsets that cover X, where a subset of X is consistent if it satisfies t. The MCSC problem generalizes
the traditional set covering problem and has minimum clique partition (MCP), a dual problem of
graph coloring, as an instance. Many common data mining tasks in rule learning, clustering, and
pattern mining can be formulated as MCSC instances. In particular, we discuss the minimum rule set
(MRS) problem that minimizes model complexity of decision rules, the converse k-clustering
problem that minimizes the number of clusters, and the pattern summarization problem that
minimizes the number of patterns. For any of these MCSC instances, our proposed generic algorithm
CAG can be directly applicable. CAG starts by constructing a maximal optimal partial solution, then
performs an example-driven specific-to-general search on a dynamically maintained bipartite
assignment graph to simultaneously learn a set of consistent subsets with small cardinality covering
the ground set.
ETPL
DM-050 Transductive Multilabel Learning via Label Set Propagation
Abstract: The problem of multilabel classification has attracted great interest in the last decade, where
each instance can be assigned with a set of multiple class labels simultaneously. It has a wide variety
of real-world applications, e.g., automatic image annotations and gene function analysis. Current
research on multilabel classification focuses on supervised settings which assume existence of large
amounts of labeled training data. However, in many applications, the labeling of multilabeled data is
extremely expensive and time consuming, while there are often abundant unlabeled data available. In
this paper, we study the problem of transductive multilabel learning and propose a novel solution,
called Transductive Multilabel Classification (TraM), to effectively assign a set of multiple labels to
each instance. Different from supervised multilabel learning methods, we estimate the label sets of the
unlabeled instances effectively by utilizing the information from both labeled and unlabeled data. We
first formulate the transductive multilabel learning as an optimization problem of estimating label
concept compositions. Then, we derive a closed-form solution to this optimization problem and
propose an effective algorithm to assign label sets to the unlabeled instances. Empirical studies on
several real-world multilabel learning tasks demonstrate that our TraM method can effectively boost
the performance of multilabel classification by using both labeled and unlabeled data.
ETPL
DM-051
A Method for Mining Infrequent Causal Associations and Its Application in
Finding Adverse Drug Reaction Signal Pairs
Abstract: In many real-world applications, it is important to mine causal relationships where an event
or event pattern causes certain outcomes with low probability. Discovering this kind of causal
relationships can help us prevent or correct negative outcomes caused by their antecedents. In this
paper, we propose an innovative data mining framework and apply it to mine potential causal
associations in electronic patient data sets where the drug-related events of interest occur infrequently.
Specifically, we created a novel interestingness measure, exclusive causal-leverage, based on a
computational, fuzzy recognition-primed decision (RPD) model that we previously developed. On the
basis of this new measure, a data mining algorithm was developed to mine the causal relationship
between drugs and their associated adverse drug reactions (ADRs). The algorithm was tested on real
patient data retrieved from the Veterans Affairs Medical Center in Detroit, Michigan. The retrieved
data included 16,206 patients (15,605 male, 601 female). The exclusive causal-leverage was
employed to rank the potential causal associations between each of the three selected drugs (i.e.,
enalapril, pravastatin, and rosuvastatin) and 3,954 recorded symptoms, each of which corresponded to
a potential ADR. The top 10 drug-symptom pairs for each drug were evaluated by the physicians on
our project team. The numbers of symptoms considered as likely real ADRs for enalapril, pravastatin,
and rosuvastatin were 8, 7, and 6, respectively. These preliminary results indicate the usefulness of
our method in finding potential ADR signal pairs for further analysis (e.g., epidemiology study) and
investigation (e.g., case review) by drug safety professionals.
ETPL
DM-052
A Survey of Discretization Techniques: Taxonomy and Empirical Analysis in
Supervised Learning
Abstract: Discretization is an essential preprocessing technique used in many knowledge discovery
and data mining tasks. Its main goal is to transform a set of continuous attributes into discrete ones,
by associating categorical values to intervals and thus transforming quantitative data into qualitative
data. In this manner, symbolic data mining algorithms can be applied over continuous data and the
representation of information is simplified, making it more concise and specific. The literature
provides numerous proposals of discretization and some attempts to categorize them into a taxonomy
can be found. However, in previous papers, there is a lack of consensus in the definition of the
properties and no formal categorization has been established yet, which may be confusing for
practitioners. Furthermore, only a small set of discretizers have been widely considered, while many
other methods have gone unnoticed. With the intention of alleviating these problems, this paper
provides a survey of discretization methods proposed in the literature from a theoretical and empirical
perspective. From the theoretical perspective, we develop a taxonomy based on the main properties
pointed out in previous research, unifying the notation and including all the known methods up to
date. Empirically, we conduct an experimental study in supervised classification involving the most
representative and newest discretizers, different types of classifiers, and a large number of data sets.
The results of their performances measured in terms of accuracy, number of intervals, and
inconsistency have been verified by means of nonparametric statistical tests. Additionally, a set of
discretizers are highlighted as the best performing ones.
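Two of the simplest unsupervised discretizers covered by such taxonomies are equal-width and equal-frequency binning; the minimal sketch below illustrates how the choice of cut points changes the resulting qualitative intervals:

```python
import numpy as np

def equal_width(values, k):
    """Split the value range into k intervals of equal width."""
    edges = np.linspace(values.min(), values.max(), k + 1)
    # np.digitize against the k-1 inner edges yields bin indices 0..k-1
    return np.clip(np.digitize(values, edges[1:-1]), 0, k - 1)

def equal_frequency(values, k):
    """Choose cut points so each interval holds roughly the same count."""
    edges = np.quantile(values, np.linspace(0, 1, k + 1))
    return np.clip(np.digitize(values, edges[1:-1]), 0, k - 1)

x = np.array([1.0, 2.0, 2.5, 3.0, 10.0, 11.0])
print(equal_width(x, 2))      # width-based: the large values dominate the split
print(equal_frequency(x, 2))  # frequency-based: balanced bin counts
```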
ETPL
DM-053 Clustering Uncertain Data Based on Probability Distribution Similarity
Abstract: Clustering on uncertain data, one of the essential tasks in mining uncertain data, poses
significant challenges on both modeling similarity between uncertain objects and developing efficient
computational methods. The previous methods extend traditional partitioning clustering methods like
k-means and density-based clustering methods like DBSCAN to uncertain data, and thus rely on
geometric distances between objects. Such methods cannot handle uncertain objects that are
geometrically indistinguishable, such as products with the same mean but very different variances in
customer ratings. Surprisingly, probability distributions, which are essential characteristics of
uncertain objects, have not been considered in measuring similarity between uncertain objects. In this
paper, we systematically model uncertain objects in both continuous and discrete domains, where an
uncertain object is modeled as a continuous and discrete random variable, respectively. We use the
well-known Kullback-Leibler divergence to measure similarity between uncertain objects in both the
continuous and discrete cases, and integrate it into partitioning and density-based clustering methods to cluster uncertain objects. Nevertheless, a naive implementation is very costly. Particularly,
computing exact KL divergence in the continuous case is very costly or even infeasible. To tackle the
problem, we estimate KL divergence in the continuous case by kernel density estimation and employ
the fast Gauss transform technique to further speed up the computation. Our extensive experiment
results verify the effectiveness, efficiency, and scalability of our approaches.
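The discrete-case similarity can be illustrated with a direct KL divergence computation. The example below (an illustrative sketch, not the paper's clustering algorithms) shows two rating distributions with the same mean but very different spread, exactly the case geometric distances cannot separate:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D(P || Q) for discrete distributions given as probability vectors."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    # smooth q slightly so that log never sees an exact zero
    return float(np.sum(p[mask] * np.log(p[mask] / (q[mask] + eps))))

# Two rating distributions over scores 1..5 with the same mean (3)
# but very different spread -- indistinguishable by the distance
# between their means, yet clearly different as distributions.
narrow = [0.0, 0.1, 0.8, 0.1, 0.0]
wide   = [0.4, 0.1, 0.0, 0.1, 0.4]
print(kl_divergence(narrow, wide))    # large: very dissimilar objects
print(kl_divergence(narrow, narrow))  # ~0: identical distributions
```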
ETPL
DM-054 Efficient Evaluation of SUM Queries over Probabilistic Data
Abstract: SUM queries are crucial for many applications that need to deal with uncertain data. In this
paper, we are interested in the queries, called ALL_SUM, that return all possible sum values and their
probabilities. In general, there is no efficient solution for the problem of evaluating ALL_SUM
queries. But, for many practical applications, where aggregate values are small integers or real
numbers with small precision, it is possible to develop efficient solutions. In this paper, based on a
recursive approach, we propose a new solution for those applications. We implemented our solution
and conducted an extensive experimental evaluation over synthetic and real-world data sets; the
results show its effectiveness.
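The recursive idea can be sketched as a dynamic program that folds in one uncertain tuple at a time; this is a generic illustration, while the paper's solution additionally exploits aggregate values being small integers or low-precision reals:

```python
from collections import defaultdict

def all_sum(tuples):
    """Distribution of SUM over uncertain tuples.

    Each tuple is (value, probability-of-existence); tuples are independent.
    The recursion processes one tuple at a time: either it is absent
    (prob 1-p) or present and adds v to the running sum (prob p).
    """
    dist = {0: 1.0}
    for v, p in tuples:
        nxt = defaultdict(float)
        for s, prob in dist.items():
            nxt[s] += prob * (1 - p)      # tuple absent
            nxt[s + v] += prob * p        # tuple present
        dist = dict(nxt)
    return dist

d = all_sum([(1, 0.5), (2, 0.5)])
# sums 0, 1, 2, 3, each with probability 0.25
print(sorted(d.items()))
```

Keeping values as small integers bounds the number of distinct sums, which is what makes this style of evaluation practical.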
ETPL
DM-055 Efficient Service Skyline Computation for Composite Service Selection
Abstract: Service composition is emerging as an effective vehicle for integrating existing web services
to create value-added and personalized composite services. As web services with similar functionality
are expected to be provided by competing providers, a key challenge is to find the “best” web services
to participate in the composition. When multiple quality aspects (e.g., response time, fee, etc.) are
considered, a weighting mechanism is usually adopted by most existing approaches, which requires
users to specify their preferences as numeric values. We propose to exploit the dominance
relationship among service providers to find a set of “best” possible composite services, referred to as
a composite service skyline. We develop efficient algorithms that allow us to find the composite
service skyline from a significantly reduced searching space instead of considering all possible
service compositions. We propose a novel bottom-up computation framework that enables the skyline
algorithm to scale well with the number of services in a composition. We conduct a comprehensive
analytical and experimental study to evaluate the effectiveness, efficiency, and scalability of the
composite skyline computation approaches.
ETPL
DM-056 Finding Probabilistic Prevalent Colocations in Spatially Uncertain Data Sets
Abstract: A spatial colocation pattern is a group of spatial features whose instances are frequently
located together in geographic space. Discovering colocations has many useful applications. For
example, colocated plant species discovered from plant distribution data sets can contribute to the
analysis of plant geography, phytosociology studies, and plant protection recommendations. In this
paper, we study the colocation mining problem in the context of uncertain data, as the data generated
from a wide range of data sources are inherently uncertain. One straightforward method to mine the
prevalent colocations in a spatially uncertain data set is to simply compute the expected participation
index of a candidate and decide if it exceeds a minimum prevalence threshold. Although this
definition has been widely adopted, it misses important information about the confidence which can
be associated with the participation index of a colocation. We propose another definition, probabilistic
prevalent colocations, trying to find all the colocations that are likely to be prevalent in a randomly
generated possible world. Finding probabilistic prevalent colocations (PPCs) turns out to be difficult.
First, we propose pruning strategies for candidates to reduce the amount of computation of the
probabilistic participation index values. Next, we design an improved dynamic programming
algorithm for identifying candidates. This algorithm is suitable for parallel computation and
approximate computation. Finally, the effectiveness and efficiency of the methods proposed as well as
the pruning strategies and the optimization techniques are verified by extensive experiments with
“real + synthetic” spatially uncertain data sets.
ETPL
DM-057
Fuzzy Web Data Tables Integration Guided by an Ontological and
Terminological Resource
Abstract: In this paper, we present the design of the ONDINE system, which allows the loading and the
querying of a data warehouse opened on the Web, guided by an Ontological and Terminological
Resource (OTR). The data warehouse, composed of data tables extracted from Web documents, has
been built to supplement existing local data sources. First, we present the main steps of our
semiautomatic method to annotate data tables driven by an OTR. The output of this method is an
XML/RDF data warehouse composed of XML documents representing data tables with their fuzzy
RDF annotations. We then present our flexible querying system which allows the local data sources
and the data warehouse to be simultaneously and uniformly queried, using the OTR. This system
relies on SPARQL and allows approximate answers to be retrieved by comparing preferences
expressed as fuzzy sets with fuzzy RDF annotations.
ETPL
DM-058 PMSE: A Personalized Mobile Search Engine
Abstract: We propose a personalized mobile search engine (PMSE) that captures the users'
preferences in the form of concepts by mining their clickthrough data. Due to the importance of
location information in mobile search, PMSE classifies these concepts into content concepts and
location concepts. In addition, users' locations (positioned by GPS) are used to supplement the
location concepts in PMSE. The user preferences are organized in an ontology-based, multifacet user
profile, which is used to adapt a personalized ranking function for rank adaptation of future search
results. To characterize the diversity of the concepts associated with a query and their relevances to
the user's need, four entropies are introduced to balance the weights between the content and location
facets. Based on the client-server model, we also present a detailed architecture and design for
implementation of PMSE. In our design, the client collects and stores locally the clickthrough data to
protect privacy, whereas heavy tasks such as concept extraction, training, and reranking are
performed at the PMSE server. Moreover, we address the privacy issue by restricting the information
in the user profile exposed to the PMSE server with two privacy parameters. We prototype PMSE on
the Google Android platform. Experimental results show that PMSE significantly improves the
precision compared to the baseline.
ETPL
DM-059 Range-Based Skyline Queries in Mobile Environments
Abstract: Skyline query processing for location-based services, which considers both spatial and
nonspatial attributes of the objects being queried, has recently received increasing attention. Existing
solutions focus on solving point- or line-based skyline queries, in which the query location is an exact
location point or a line segment. However, due to privacy concerns and limited precision of
localization devices, the input of a user location is often a spatial range. This paper studies a new
problem of how to process such range-based skyline queries. Two novel algorithms are proposed: one
is index-based (I-SKY) and the other is not based on any index (N-SKY). To handle frequent
movements of the objects being queried, we also propose incremental versions of I-SKY and N-SKY,
which avoid recomputing the query index and results from scratch. Additionally, we develop efficient
solutions for probabilistic and continuous range-based skyline queries. Experimental results show that
our proposed algorithms clearly outperform the baseline algorithm that adopts the existing line-based
skyline solution. Moreover, the incremental versions of I-SKY and N-SKY save substantial
computation cost, especially when the objects move frequently.
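The dominance relation that underlies all skyline variants can be illustrated with a naive point-based skyline; this is the classical definition, not the range-based I-SKY/N-SKY algorithms:

```python
def dominates(p, q):
    """p dominates q when p is no worse in every dimension and strictly
    better in at least one (smaller is better, e.g. distance and price)."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def skyline(points):
    """Naive O(n^2) skyline: keep the points no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# hypothetical (distance, price) pairs for candidate objects
hotels = [(1, 9), (2, 7), (4, 4), (3, 8), (9, 1), (7, 6)]
print(skyline(hotels))  # → [(1, 9), (2, 7), (4, 4), (9, 1)]
```

A range-based query must instead report every point that could belong to the skyline for some location inside the user's range, which is what makes incremental maintenance worthwhile.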
ETPL
DM-060 Skyline Processing on Distributed Vertical Decompositions
Abstract: We assume a data set that is vertically decomposed among several servers, and a client that
wishes to compute the skyline by obtaining the minimum number of points. Existing solutions for this
problem are restricted to the case where each server maintains exactly one dimension. This paper
proposes a general solution for vertical decompositions of arbitrary dimensionality. We first
investigate some interesting problem characteristics regarding the pruning power of points. Then, we
introduce vertical partition skyline (VPS), an algorithmic framework that includes two steps. Phase 1
searches for an anchor point P_anc that dominates, and hence eliminates, a large number of records. Starting with P_anc, Phase 2 incrementally constructs a pruning area using an interesting union-
intersection property of dominance regions. Servers do not transmit points that fall within the pruning
area in their local subspace. Our experiments confirm the effectiveness of the proposed methods
under various settings.
ETPL
DM-061 Spatial Query Integrity with Voronoi Neighbors
Abstract: With the popularity of location-based services and the abundant usage of smart phones and GPS-
enabled devices, the necessity of outsourcing spatial data has grown rapidly over the past few years.
Meanwhile, the fast arising trend of cloud storage and cloud computing services has provided a
flexible and cost-effective platform for hosting data from businesses and individuals, further enabling
many location-based applications. Nevertheless, in this database outsourcing paradigm, the
authentication of the query results at the client remains a challenging problem. In this paper, we focus
on the Outsourced Spatial Database (OSDB) model and propose an efficient scheme, called VN-Auth,
which allows a client to verify the correctness and completeness of the result set. Our approach is
based on neighborhood information derived from the Voronoi diagram of the underlying spatial data
set and can handle fundamental spatial query types, such as k nearest neighbor and range queries, as
well as more advanced query types like reverse k nearest neighbor, aggregate nearest neighbor, and
spatial skyline. We evaluated VN-Auth based on real-world data sets using mobile devices (Google
Droid smart phones with Android OS) as query clients. Compared to the current state-of-the-art
approaches (i.e., methods based on Merkle Hash Trees), our experiments show that VN-Auth
produces significantly smaller verification objects and is more computationally efficient, especially
for queries with low selectivity.
ETPL
DM-062 Supporting Pattern-Preserving Anonymization for Time-Series Data
Abstract: Time series is an important form of data available in numerous applications and often
contains vast amount of personal privacy. The need to protect privacy in time-series data while
effectively supporting complex queries on them poses nontrivial challenges to the database
community. We study the anonymization of time series while trying to support complex queries, such
as range and pattern matching queries, on the published data. The conventional k-anonymity model
cannot effectively address this problem as it may suffer severe pattern loss. We propose a novel
anonymization model called (k, P)-anonymity for pattern-rich time series. This model publishes both
the attribute values and the patterns of time series in separate data forms. We demonstrate that our
model can prevent linkage attacks on the published data while effectively supporting a wide variety of
queries on the anonymized data. We propose two algorithms to enforce (k, P)-anonymity on time-
series data. Our anonymity model supports customized data publishing, which allows a certain part of
the values but a different part of the pattern of the anonymized time series to be published
simultaneously. We present estimation techniques to support query processing on such customized
data. The proposed methods are evaluated in a comprehensive experimental study. Our results verify
the effectiveness and efficiency of our approach.
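To illustrate the idea of publishing values and patterns in separate data forms, a hypothetical direction-based pattern encoding (not the paper's own representation) might look like:

```python
def pattern(series):
    """Encode a time series by its direction of change -- a coarse 'pattern'
    that can be published separately from the exact attribute values."""
    return "".join("u" if b > a else "d" if b < a else "s"
                   for a, b in zip(series, series[1:]))

print(pattern([1, 3, 2, 2, 5]))  # → "udsu"
```

Pattern matching queries can then run against such symbolic strings even after the underlying values have been k-anonymized.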
ETPL
DM-063 Synchronization-Inspired Partitioning and Hierarchical Clustering
Abstract: Synchronization is a powerful and inherently hierarchical concept regulating a large variety of
complex processes ranging from the metabolism in a cell to opinion formation in a group of
individuals. Synchronization phenomena in nature have been widely investigated and models
concisely describing the dynamical synchronization process have been proposed, e.g., the well-known
Extensive Kuramoto Model. We explore the potential of the Extensive Kuramoto Model for data
clustering. We regard each data object as a phase oscillator and simulate the dynamical behavior of
the objects over time. By interaction with similar objects, the phase of an object gradually aligns with
its neighborhood, resulting in a nonlinear object movement naturally driven by the local cluster
structure. We demonstrate that our framework has several attractive benefits: 1) It is suitable to detect
clusters of arbitrary number, shape, and data distribution, even in difficult settings with noise points
and outliers. 2) Combined with the Minimum Description Length (MDL) principle, it allows
partitioning and hierarchical clustering without requiring any input parameters which are difficult to
estimate. 3) Synchronization faithfully captures the natural hierarchical cluster structure of the data
and MDL suggests meaningful levels of abstraction. Extensive experiments demonstrate the
effectiveness and efficiency of our approach.
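The core dynamics can be sketched in one dimension: each object is treated as a phase oscillator that aligns with its eps-neighbors through a Kuramoto-style sin coupling. This is an illustrative toy, not the full MDL-based partitioning and hierarchical algorithm:

```python
import numpy as np

def sync_step(x, eps=0.25, dt=1.0):
    """One synchronization step: every point moves toward the phases of
    its eps-neighbors via a Kuramoto-style sin coupling. Each point is
    always its own neighbor, so the neighborhood is never empty."""
    new = x.copy()
    for i in range(len(x)):
        nbrs = x[np.abs(x - x[i]) <= eps]
        new[i] = x[i] + dt * np.mean(np.sin(nbrs - x[i]))
    return new

# Two groups of 1-D "phases"; after a few steps each group collapses
# onto a single value -- the cluster structure emerges from the dynamics.
x = np.array([0.0, 0.1, 0.2, 2.0, 2.1, 2.2])
for _ in range(20):
    x = sync_step(x)
print(np.round(x, 3))  # two tight clusters, near 0.1 and near 2.1
```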
ETPL
DM-064 Transfer across Completely Different Feature Spaces via Spectral Embedding
Abstract: In many applications, it is very expensive or time consuming to obtain a lot of labeled
examples. One practically important problem is: can the labeled data from other related sources help
predict the target task, even if they have 1) different feature spaces (e.g., image versus text data), 2)
different data distributions, and 3) different output spaces? This paper proposes a solution and
discusses the conditions where this is highly likely to produce better results. It first unifies the feature
spaces of the target and source data sets by spectral embedding, even when they have completely
different feature spaces. The principle is to devise an optimization objective that preserves the original
structure of the data, while at the same time, maximizes the similarity between the two. A linear
projection model, as well as a nonlinear approach, is derived on the basis of this principle in closed
form. Second, a judicious sample selection strategy is applied to select only the related source
examples. Finally, a Bayesian-based approach is applied to model the relationship between different
output spaces. The three steps can bridge related heterogeneous sources in order to learn the target
task. Among the 20 experiment data sets, for example, images with wavelet-transform-based
features are used to predict another set of images whose features are constructed from the color-histogram
space; documents are used to help image classification, etc. By using these extracted examples from
heterogeneous sources, the models can reduce the error rate by as much as 50 percent, compared with
the methods using only the examples from the target task.
ETPL
DM-065
Tweet Analysis for Real-Time Event Detection and Earthquake Reporting
System Development
Abstract: Twitter has received much attention recently. An important characteristic of Twitter is its
real-time nature. We investigate the real-time interaction of events such as earthquakes in Twitter and
propose an algorithm to monitor tweets and to detect a target event. To detect a target event, we
devise a classifier of tweets based on features such as the keywords in a tweet, the number of words,
and their context. Subsequently, we produce a probabilistic spatiotemporal model for the target event
that can find the center of the event location. We regard each Twitter user as a sensor and apply
particle filtering, which is widely used for location estimation. The particle filter works better than
other comparable methods for estimating the locations of target events. As an application, we develop
an earthquake reporting system for use in Japan. Because of the numerous earthquakes and the large
number of Twitter users throughout the country, we can detect an earthquake with high probability
(93 percent of earthquakes of Japan Meteorological Agency (JMA) seismic intensity scale 3 or more
are detected) merely by monitoring tweets. Our system detects earthquakes promptly, and notifications
are delivered much faster than JMA broadcast announcements.
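The user-as-sensor idea can be illustrated with a minimal 1D particle-filter sketch. The region bounds, noise level, and particle count below are assumed for illustration only:

```python
import math
import random

def estimate_event_location(reports, n_particles=2000, noise=1.0, seed=7):
    """Minimal 1D particle-filter sketch of the location-estimation step:
    every user report is a noisy sensor reading of the event position;
    particles are weighted by how well they explain each report, then
    resampled. Region bounds and noise level are illustrative assumptions."""
    rng = random.Random(seed)
    # prior: particles spread uniformly over an assumed region [0, 100]
    particles = [rng.uniform(0.0, 100.0) for _ in range(n_particles)]
    for z in reports:
        # Gaussian likelihood of the report given each particle
        weights = [math.exp(-((p - z) ** 2) / (2.0 * noise ** 2))
                   for p in particles]
        if sum(weights) == 0.0:
            continue  # report explains no particle; skip it
        # resample in proportion to the weights
        particles = rng.choices(particles, weights=weights, k=n_particles)
        # small jitter keeps particle diversity after resampling
        particles = [p + rng.gauss(0.0, 0.1) for p in particles]
    return sum(particles) / len(particles)
```

A handful of reports clustered around one position is enough to pull the particle cloud, and hence the posterior mean, toward the true event location.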
ETPL
DM-066
TW-$k$-Means: Automated Two-Level Variable Weighting Clustering
Algorithm for Multiview Data
Abstract: This paper proposes TW-k-means, an automated two-level variable weighting clustering
algorithm for multiview data, which can simultaneously compute weights for views and individual
variables. In this algorithm, a view weight is assigned to each view to identify the compactness of the
view and a variable weight is also assigned to each variable in the view to identify the importance of
the variable. Both view weights and variable weights are used in the distance function to determine
the clusters of objects. In the new algorithm, two additional steps are added to the iterative k-means
clustering process to automatically compute the view weights and the variable weights. We used two
real-life data sets to investigate the properties of two types of weights in TW-k-means and
investigated the difference between the weights of TW-k-means and the weights of the individual
variable weighting method. The experiments have revealed the convergence property of the view
weights in TW-k-means. We compared TW-k-means with five clustering algorithms on three real-life
data sets and the results have shown that the TW-k-means algorithm significantly outperformed the
other five clustering algorithms in four evaluation indices.
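The two-level weighted distance at the core of the algorithm can be written compactly. The function below is an illustrative rendering (the names are ours, not the paper's notation):

```python
def two_level_distance(x, center, views, view_w, var_w):
    """Illustrative form of the two-level weighted distance used in
    TW-k-means: view t carries weight view_w[t]; variable v carries weight
    var_w[v]; both scale the per-variable squared error between object x
    and the cluster center."""
    d = 0.0
    for t, variables in enumerate(views):
        d += view_w[t] * sum(var_w[v] * (x[v] - center[v]) ** 2
                             for v in variables)
    return d
```

In the full algorithm, both weight vectors are recomputed in two extra steps of each k-means iteration rather than fixed in advance.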
ETPL
DM-067 U-Skyline: A New Skyline Query for Uncertain Databases
Abstract: The skyline query, aiming at identifying a set of skyline tuples that are not dominated by
any other tuple, is particularly useful for multicriteria data analysis and decision making. For
uncertain databases, a probabilistic skyline query, called P-Skyline, has been developed to return
skyline tuples by specifying a probability threshold. However, the answer obtained via a P-Skyline
query usually includes skyline tuples undesirably dominating each other when a small threshold is
specified, or it may contain far fewer skyline tuples if a larger threshold is employed. To address
this concern, we propose a new uncertain skyline query, called U-Skyline query, in this paper. Instead
of setting a probabilistic threshold to qualify each skyline tuple independently, the U-Skyline query
searches for a set of tuples that has the highest probability (aggregated from all possible scenarios) as
the skyline answer. In order to answer U-Skyline queries efficiently, we propose a number of
optimization techniques for query processing, including 1) computational simplification of U-Skyline
probability, 2) pruning of unqualified candidate skylines and early termination of query processing, 3)
reduction of the input data set, and 4) partition and conquest of the reduced data set. We perform a
comprehensive performance evaluation on our algorithm and an alternative approach that formulates
the U-Skyline processing problem by integer programming. Experimental results demonstrate that our
algorithm is 10-100 times faster than using CPLEX, a parallel integer programming solver, to answer
the U-Skyline query.
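For intuition, the basic certain-data skyline that P-Skyline and U-Skyline generalize can be computed directly (smaller values assumed better in every dimension):

```python
def skyline(points):
    """Basic certain-data skyline: keep each tuple not dominated by any
    other. p dominates q when p <= q in every dimension and p < q in at
    least one. U-Skyline generalizes this test to probability-weighted
    possible worlds of an uncertain database."""
    def dominates(p, q):
        return (all(a <= b for a, b in zip(p, q))
                and any(a < b for a, b in zip(p, q)))
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]
```

The quadratic pairwise test shown here is exactly what the paper's pruning and partitioning techniques are designed to avoid at scale.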
ETPL
DM-068
A Novel Profit Maximizing Metric for Measuring Classification Performance of
Customer Churn Prediction Models
Abstract: The interest for data mining techniques has increased tremendously during the past decades,
and numerous classification techniques have been applied in a wide range of business applications.
Hence, the need for adequate performance measures has become more important than ever. In this
paper, a cost-benefit analysis framework is formalized in order to define performance measures which
are aligned with the main objectives of the end users, i.e., profit maximization. A new performance
measure is defined, the expected maximum profit criterion. This general framework is then applied to
the customer churn problem with its particular cost-benefit structure. The advantage of this approach
is that it assists companies with selecting the classifier which maximizes the profit. Moreover, it aids
with the practical implementation in the sense that it provides guidance about the fraction of the
customer base to be included in the retention campaign.
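The idea of scoring a classifier by campaign profit rather than accuracy can be sketched as follows. The parameter names (clv, incentive, accept_rate) and values are illustrative assumptions, not the paper's notation:

```python
def best_campaign_fraction(scores, churners, clv=200.0, incentive=10.0,
                           accept_rate=0.3):
    """Hedged sketch of a profit-driven evaluation of a churn model:
    include the top-scored fraction of customers in a retention campaign,
    pay each an incentive, gain accept_rate * clv for every true churner
    retained, and return the profit-maximizing fraction and its profit."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    best_frac, best_profit, profit = 0.0, 0.0, 0.0
    for k, i in enumerate(order, start=1):
        profit += (accept_rate * clv if churners[i] else 0.0) - incentive
        if profit > best_profit:
            best_profit, best_frac = profit, k / len(order)
    return best_frac, best_profit
```

The returned fraction is the practical guidance mentioned above: how much of the customer base to include in the retention campaign.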
ETPL
DM-069
A Predictive-Reactive Method for Improving the Robustness of Real-Time Data
Services
Abstract: Supporting timely data services using fresh data in data-intensive real-time applications,
such as e-commerce and transportation management, is desirable but challenging, since the workload
may vary dynamically. To control the data service delay to be below the specified threshold, we
develop a predictive as well as reactive method for database admission control. The predictive method
derives the workload bound for admission control in a predictive manner, making no statistical or
queuing-theoretic assumptions about workloads. Also, our reactive scheme based on formal feedback
control theory continuously adjusts the database load bound to support the delay threshold. By
adapting the load bound in a proactive fashion, we attempt to avoid severe overload conditions and
excessive delays before they occur. Also, the feedback control scheme enhances the timeliness by
compensating for potential prediction errors due to dynamic workloads. Hence, the predictive and
reactive methods complement each other, enhancing the robustness of real-time data services as a
whole. We implement the integrated approach and several baselines in an open-source database.
Compared to the tested open-loop, feedback-only, and statistical prediction + feedback baselines
representing the state of the art, our integrated method significantly improves the average/transient
delay and real-time data service throughput.
ETPL
DM-070 Achieving Data Privacy through Secrecy Views and Null-Based Virtual Updates
Abstract: We may want to keep sensitive information in a relational database hidden from a user or
group thereof. We characterize sensitive data as the extensions of secrecy views. The database, before
returning the answers to a query posed by a restricted user, is updated to make the secrecy views
empty or contain a single tuple with null values. Then, a query about any of those views returns no
meaningful information. Since the database is not supposed to be physically changed for this purpose,
the updates are only virtual, and also minimal. Minimality ensures that query answers, while being
privacy preserving, are also maximally informative. The virtual updates are based on null values as
used in the SQL standard. We provide the semantics of secrecy views, virtual updates, and secret
answers (SAs) to queries. The different instances resulting from the virtual updates are specified as
the models of a logic program with stable model semantics, which becomes the basis for computation
of the SAs.
ETPL
DM-071 Co-Occurrence-Based Diffusion for Expert Search on the Web
Abstract: Expert search has been studied in different contexts, e.g., enterprises and academic communities.
We examine a general expert search problem: searching experts on the web, where millions of
webpages and thousands of names are considered. It has mainly two challenging issues: 1) webpages
could be of varying quality and full of noises; 2) The expertise evidences scattered in webpages are
usually vague and ambiguous. We propose to leverage the large amount of co-occurrence information
to assess relevance and reputation of a person name for a query topic. The co-occurrence structure is
modeled using a hypergraph, on which a heat diffusion based ranking algorithm is proposed. Query
keywords are regarded as heat sources, and a person name that has a strong connection with the
query (i.e., frequently co-occurs with query keywords and co-occurs with other names related to query
keywords) will receive most of the heat, thus being ranked high. Experiments on the ClueWeb09 web
collection show that our algorithm is effective for retrieving experts and outperforms baseline
algorithms significantly. This work can be regarded as one step toward addressing the more general
entity search problem without sophisticated NLP techniques.
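The heat-diffusion ranking idea can be sketched on a plain co-occurrence graph (the paper itself uses a hypergraph; this simplification is ours):

```python
def diffusion_rank(adj, sources, alpha=0.5, steps=30):
    """Toy heat-diffusion ranking: query keywords act as constant heat
    sources; heat flows along co-occurrence edges, so names strongly
    connected to the query end up hottest and are ranked first.
    adj maps each node to its list of neighbors."""
    heat = {v: (1.0 if v in sources else 0.0) for v in adj}
    for _ in range(steps):
        new = {}
        for v in adj:
            inflow = (sum(heat[u] / len(adj[u]) for u in adj[v])
                      if adj[v] else 0.0)
            # sources keep injecting heat; every node receives diffused heat
            new[v] = (1 - alpha) * (1.0 if v in sources else 0.0) + alpha * inflow
        heat = new
    return sorted(adj, key=lambda v: -heat[v])
```

Nodes reachable from the query only through long, weakly connected paths receive little heat and fall to the bottom of the ranking.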
ETPL
DM-072
Efficient All Top-$k$ Computation—A Unified Solution for All Top-
$k$, Reverse Top-$k$ and Top-$m$ Influential Queries
Abstract: Given a set of objects $P$ and a set of ranking functions $F$ over $P$, an interesting
problem is to compute the top ranked objects for all functions. Evaluation of multiple top-$k$
queries finds application in systems where there is a heavy workload of ranking queries (e.g., online
search engines and product recommendation systems). The simple solution of evaluating the top-$k$
queries one by one does not scale well; instead, the system can make use of the fact that similar
queries share common results to accelerate search. This paper is, to our knowledge, the first thorough
study of this problem. We propose methods that compute all top-$k$ queries in batch. Our first
solution applies the block indexed nested loops paradigm, while our second technique is a view-based
algorithm. We propose appropriate optimization techniques for the two approaches and demonstrate
experimentally that the second approach is consistently the best. Our approach facilitates evaluation
of other complex queries that depend on the computation of multiple top-$k$ queries, such as
reverse top-$k$ and top-$m$ influential queries. We show that our batch processing technique for
these complex queries outperforms the state-of-the-art by orders of magnitude.
ETPL
DM-073 Efficient and Effective Duplicate Detection in Hierarchical Data
Abstract: Although there is a long line of work on identifying duplicates in relational data, only a few
solutions focus on duplicate detection in more complex hierarchical structures, like XML data. In this
paper, we present a novel method for XML duplicate detection, called XMLDup. XMLDup uses a
Bayesian network to determine the probability of two XML elements being duplicates, considering
not only the information within the elements, but also the way that information is structured. In
addition, to improve the efficiency of the network evaluation, a novel pruning strategy, capable of
significant gains over the unoptimized version of the algorithm, is presented. Through experiments,
we show that our algorithm is able to achieve high precision and recall scores in several data sets.
XMLDup is also able to outperform another state-of-the-art duplicate detection solution, both in terms
of efficiency and of effectiveness.
ETPL
DM-074 Failure-Aware Cascaded Suppression in Wireless Sensor Networks
Abstract: Wireless sensor networks are widely used to continuously collect data from the
environment. Because of energy constraints on battery-powered nodes, it is critical to minimize
communication. Suppression has been proposed as a way to reduce communication by using
predictive models to suppress reporting of predictable data. However, in the presence of
communication failures, missing data are difficult to interpret because they could have been either
suppressed or lost in transmission. There is no existing solution for handling failures for general,
spatiotemporal suppression that uses cascading. While cascading further reduces communication, it
makes failure handling difficult, because nodes can act on incomplete or incorrect information and in
turn affect other nodes. We propose a cascaded suppression framework that exploits both temporal
and spatial data correlation to reduce communication, and applies coding theory and Bayesian
inference to recover missing data resulting from suppression and communication failures. Experimental
results show that cascaded suppression significantly reduces communication cost and improves
missing data recovery compared to existing approaches.
ETPL
DM-075 Multiview Partitioning via Tensor Methods
Abstract: Clustering by integrating multiview representations has become a crucial issue for
knowledge discovery in heterogeneous environments. However, most prior approaches assume that
the multiple representations share the same dimension, limiting their applicability to homogeneous
environments. In this paper, we present a novel tensor-based framework for integrating heterogeneous
multiview data in the context of spectral clustering. Our framework includes two novel formulations,
that is, multiview clustering based on the integration of the Frobenius-norm objective function
(MC-FR-OI) and that based on matrix integration in the Frobenius-norm objective function (MC-FR-MI).
We show that the solutions for both formulations can be computed by tensor decompositions. We
evaluated our methods on synthetic data and two real-world data sets in comparison with baseline
methods. Experimental results demonstrate that the proposed formulations are effective in integrating
multiview data in heterogeneous environments.
ETPL
DM-076 Novel Biobjective Clustering (BiGC) Based on Cooperative Game Theory
Abstract: We propose a new approach to clustering. Our idea is to map cluster formation to coalition
formation in cooperative games, and to use the Shapley value of the patterns to identify clusters and
cluster representatives. We show that the underlying game is convex and this leads to an efficient
biobjective clustering algorithm that we call BiGC. The algorithm yields high-quality clustering with
respect to average point-to-center distance (potential) as well as average intracluster point-to-point
distance (scatter). We demonstrate the superiority of BiGC over state-of-the-art clustering algorithms
(including the center based and the multiobjective techniques) through a detailed experimentation
using standard cluster validity criteria on several benchmark data sets. We also show that BiGC
satisfies key clustering properties such as order independence, scale invariance, and richness.
ETPL
DM-077
On Generalizable Low False-Positive Learning Using Asymmetric Support
Vector Machines
Abstract: Support Vector Machines (SVMs) have been widely used for classification due to their
ability to give low generalization error. In many practical applications of classification, however, the
wrong prediction of a certain class is much more severe than that of the other classes, making the original
SVM unsatisfactory. In this paper, we propose the notion of Asymmetric Support Vector Machine
(ASVM), an asymmetric extension of the SVM, for these applications. Different from the existing
SVM extensions such as thresholding and parameter tuning, ASVM employs a new objective that
models the imbalance between the costs of false predictions from different classes in a novel way
such that user tolerance of the false-positive rate can be explicitly specified. Such a new objective
formulation allows us to obtain a lower false-positive rate without much degradation of the
prediction accuracy or increase in training time. Furthermore, we show that the generalization ability
is preserved with the new objective. We also study the effects of the parameters in ASVM objective
and address some implementation issues related to the Sequential Minimal Optimization (SMO) to
cope with large-scale data. An extensive simulation is conducted and shows that ASVM is able to
yield either a noticeable improvement in performance or a reduction in training time compared to
previous approaches.
ETPL
DM-078 Optimal Route Queries with Arbitrary Order Constraints
Abstract: Given a set of spatial points $DS$, each of which is associated with categorical
information, e.g., restaurant, pub, etc., the optimal route query finds the shortest path that starts from
the query point (e.g., a home or hotel), and covers a user-specified set of categories (e.g., {pub,
restaurant, museum}). The user may also specify partial order constraints between different
categories, e.g., a restaurant must be visited before a pub. Previous work has focused on a special case
where the query contains the total order of all categories to be visited (e.g., museum $\rightarrow$
restaurant $\rightarrow$ pub). For the general scenario without such a total order, the only known
solution reduces the problem to multiple, total-order optimal route queries. As we show in this paper,
this naïve approach incurs a significant amount of repeated computations, and, thus, is not
scalable to large data sets. Motivated by this, we propose novel solutions to the general optimal route
query, based on two different methodologies, namely backward search and forward search. In
addition, we discuss how the proposed methods can be adapted to answer a variant of the optimal
route queries, in which the route only needs to cover a subset of the given categories. Extensive
experiments, using both real and synthetic data sets, confirm that the proposed solutions are efficient
and practical, and outperform existing methods by large margins.
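The naive reduction the paper improves on can be sketched directly: enumerate every category order consistent with the partial-order constraints and evaluate each one. The 1D positions and the greedy nearest-instance choice per category are simplifications for illustration:

```python
from itertools import permutations

def optimal_route(start, places, constraints):
    """Brute-force baseline for the optimal route query: try each category
    order consistent with the (before, after) partial-order constraints,
    and for each order greedily visit the nearest instance of the category.
    places maps category -> list of 1D positions."""
    def route_length(order):
        pos, total = start, 0.0
        for cat in order:
            nxt = min(places[cat], key=lambda p: abs(p - pos))
            total += abs(nxt - pos)
            pos = nxt
        return total
    best = None
    for order in permutations(places):
        # keep only orders satisfying every (before, after) constraint
        if all(order.index(a) < order.index(b) for a, b in constraints):
            d = route_length(order)
            if best is None or d < best[0]:
                best = (d, order)
    return best
```

The factorial number of candidate orders is precisely the repeated computation that the paper's backward-search and forward-search methods avoid.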
ETPL
DM-079 Pay-As-You-Go Entity Resolution
Abstract: Entity resolution (ER) is the problem of identifying which records in a database refer to the
same entity. In practice, many applications need to resolve large data sets efficiently, but do not
require the ER result to be exact. For example, people data from the web may simply be too large to
completely resolve with a reasonable amount of work. As another example, real-time applications
may not be able to tolerate any ER processing that takes longer than a certain amount of time. This
paper investigates how we can maximize the progress of ER with a limited amount of work using
“hints,” which give information on records that are likely to refer to the same real-world
entity. A hint can be represented in various formats (e.g., a grouping of records based on their
likelihood of matching), and ER can use this information as a guideline for which records to compare
first. We introduce a family of techniques for constructing hints efficiently and techniques for using
the hints to maximize the number of matching records identified using a limited amount of work.
Using real data sets, we illustrate the potential gains of our pay-as-you-go approach compared to
running ER without using hints.
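The hint-guided, budget-limited comparison order can be sketched as follows; the matcher and the grouping-style hint format are illustrative assumptions:

```python
def resolve_with_hints(records, hint_groups, budget, match):
    """Sketch of hint-guided pay-as-you-go ER: hint_groups lists indices of
    records likely to co-refer, so pairs inside a group are compared first;
    remaining pairs follow, and work stops when the comparison budget is
    spent. match is any pairwise matcher supplied by the application."""
    def candidate_pairs():
        seen = set()
        for g in hint_groups:          # hinted pairs first
            for a in range(len(g)):
                for b in range(a + 1, len(g)):
                    pair = (g[a], g[b])
                    seen.add(pair)
                    yield pair
        for i in range(len(records)):  # then everything else
            for j in range(i + 1, len(records)):
                if (i, j) not in seen:
                    yield (i, j)
    matches, compared = [], 0
    for i, j in candidate_pairs():
        if compared >= budget:
            break
        compared += 1
        if match(records[i], records[j]):
            matches.append((i, j))
    return matches
```

With the same budget, comparing hinted pairs first finds more true matches than scanning pairs in arbitrary order, which is the pay-as-you-go gain described above.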
ETPL
DM-080
Single-Database Private Information Retrieval from Fully Homomorphic
Encryption
Abstract: Private Information Retrieval (PIR) allows a user to retrieve the $i$th bit of an $n$-bit
database without revealing to the database server the value of $i$. In this paper, we present a PIR
protocol with a communication complexity of $O(\gamma \log n)$ bits, where $\gamma$ is the
ciphertext size. Furthermore, we extend the PIR protocol to a private block retrieval (PBR) protocol, a
natural and more practical extension of PIR in which the user retrieves a block of bits, instead of
a single bit. Our protocols are built on state-of-the-art fully homomorphic encryption
(FHE) techniques and provide privacy for the user if the underlying FHE scheme is semantically
secure. The total communication complexity of our PBR is $O(\gamma \log m + \gamma n/m)$ bits,
where $m$ is the number of blocks. The total computation complexity of our PBR is $O(m \log m)$
modular multiplications plus $O(n/2)$ modular additions. In terms of total protocol execution time,
our PBR protocol is more efficient than existing PBR protocols, which usually require computing
$O(n/2)$ modular multiplications when the size of a block in the database is large and a high-speed
network is available.
ETPL
DM-081
Toward SWSs Discovery: Mapping from WSDL to OWL-S Based on Ontology
Search and Standardization Engine
Abstract: Semantic Web Services (SWSs) represent the most recent and revolutionary technology
developed for machine-to-machine interaction on Web 3.0. As with conventional web services,
the problem of discovering and selecting the most suitable web service represents a challenge for
SWSs to be widely used. In this paper, we propose a mapping algorithm that facilitates the
redefinition of the conventional web services annotations (i.e., WSDL) using semantic annotations
(i.e., OWL-S). This algorithm will be a part of a new discovery mechanism that relies on the semantic
annotations of the web services to perform its task. The “local ontology repository”
and “ontology search and standardization engine” are the backbone of this
algorithm. Both aim to define any data type in the system using a standard ontology-based
concept. The originality of the proposed mapping algorithm is its applicability and consideration of
the standardization problem. The proposed algorithm is implemented and its components are
validated using some test collections and real examples. An experimental test of the proposed
techniques is reported, showing the impact of the proposed algorithm in decreasing the time and
effort of the mapping process. Moreover, the experimental results suggest that the proposed
algorithm will have a positive impact on the discovery process as a whole.
ETPL
DM-082
Trace Ratio Optimization-Based Semi-Supervised Nonlinear Dimensionality
Reduction for Marginal Manifold Visualization
Abstract: Visualizing similarity data of different objects by exhibiting more separate organizations
with local and multimodal characteristics preserved is important in multivariate data analysis.
Laplacian Eigenmaps (LAE) and Locally Linear Embedding (LLE) aim at preserving the embeddings
of all similarity pairs in the close vicinity of the reduced output space, but they are unable to identify
and separate interclass neighbors. This paper considers the semi-supervised manifold learning
problems. We apply the pairwise Cannot-Link and Must-Link constraints induced by the
neighborhood graph to specify the types of neighboring pairs. More flexible regulation on supervised
information is provided. Two novel multimodal nonlinear techniques, which we call trace ratio (TR)
criterion-based semi-supervised LAE ($\mathrm{S}^2$LAE) and LLE ($\mathrm{S}^2$LLE),
are then proposed for marginal manifold visualization. We also present the kernelized
$\mathrm{S}^2$LAE and $\mathrm{S}^2$LLE. We verify the feasibility of $\mathrm{S}^2$LAE and
$\mathrm{S}^2$LLE through extensive simulations over the benchmark real-world MIT CBCL, CMU PIE,
MNIST, and USPS data sets. Manifold visualizations show that $\mathrm{S}^2$LAE and
$\mathrm{S}^2$LLE are able to deliver large margins between different clusters or classes with
multimodal distributions preserved. Clustering evaluations show they can achieve results comparable
to or even better than some widely used methods.
ETPL
DM-083 Update Summarization via Graph-Based Sentence Ranking
Abstract: Due to the fast evolution of the information on the Internet, update summarization has
received much attention in recent years. The task is to summarize an evolving document collection at
the current time, assuming that users have read some related previous documents. In this paper, we
propose a graph-ranking-based method. It performs constrained reinforcements on a sentence graph,
which unifies previous and current documents, to determine the salience of the sentences. The
constraints ensure that the most salient sentences in current documents are updates to previous
documents. Since this problem is NP-hard, we then propose an approximate method, which is
polynomial-time solvable. Experiments on the TAC 2008 and 2009 benchmark data sets show the
effectiveness and efficiency of our method.
ETPL
DM-084 Change Detection in Streaming Multivariate Data Using Likelihood Detectors
Abstract: Change detection in streaming data relies on a fast estimation of the probability that the data
in two consecutive windows come from different distributions. Choosing the criterion is one of the
multitude of questions that need to be addressed when designing a change detection procedure. This
paper gives a log-likelihood justification for two well-known criteria for detecting change in
streaming multidimensional data: Kullback-Leibler (K-L) distance and Hotelling's T-square test for
equal means (H). We propose a semiparametric log-likelihood criterion (SPLL) for change detection.
Compared to the existing log-likelihood change detectors, SPLL trades some theoretical rigor for
computation simplicity. We examine SPLL together with K-L and H on detecting induced change on
30 real data sets. The criteria were compared using the area under the respective Receiver Operating
Characteristic (ROC) curve (AUC). SPLL was found to be on par with H and better than K-L for
the nonnormalized data, and better than both on the normalized data.
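A simple 1D stand-in for the window-comparison criteria above fits a Gaussian to each window and computes their K-L divergence; the detection threshold below is an assumed tuning constant, not from the paper:

```python
import math

def gaussian_kl(window1, window2):
    """Fit a Gaussian to each window and return KL(N1 || N2); a large value
    signals that the two windows likely come from different distributions."""
    def fit(w):
        m = sum(w) / len(w)
        v = sum((x - m) ** 2 for x in w) / len(w)
        return m, max(v, 1e-12)  # guard against zero variance
    m1, v1 = fit(window1)
    m2, v2 = fit(window2)
    return 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def change_detected(window1, window2, threshold=0.5):
    # threshold is an illustrative tuning constant
    return gaussian_kl(window1, window2) > threshold
```

Identical windows yield a divergence of zero, while a shift in mean or variance between consecutive windows drives the criterion up and triggers detection.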
ETPL
DM-085 Coping with Events in Temporal Relational Databases
Abstract: Event relations are used in many temporal relational database approaches to represent facts
occurring at time instants. However, to the best of our knowledge, none of such approaches fully
copes with the definition of events as provided, e.g., by the “consensus” temporal
database glossary. We propose a new approach which overcomes such a limitation, allowing one to
cope with multiple events occurring in the same temporal granule. This move involves major
extensions to current approaches, since indeterminacy about the time and number of occurrences of
events needs to be faced. Specifically, we have introduced a new data model and new definitions of
relational algebraic operators coping with the above issues, and we have studied their reducibility.
Last, but not least, we have shown that our approach can be easily extended in order to cope with a
general form of temporal indeterminacy. Such an extension further increases the applicability of our
approach.
ETPL
DM-086 Cutting Plane Training for Linear Support Vector Machines
Abstract: Support Vector Machines (SVMs) have been shown to achieve high performance on
classification tasks across many domains, and a great deal of work has been dedicated to developing
computationally efficient training algorithms for linear SVMs. One approach [1] approximately
minimizes risk through use of cutting planes, and is improved by [2], [3]. We build upon this work,
presenting a modification to the algorithm developed by Franc and Sonnenburg [2]. We demonstrate
empirically that our changes can reduce cutting plane training time by up to 40 percent, and discuss
how changes in data sets and parameter settings affect the effectiveness of our method.
ETPL
DM-087 Successive Group Selection for Microaggregation
Abstract: In this paper, we propose an efficient clustering algorithm that has been applied to the
microaggregation problem. The goal is to partition $N$ given records into clusters, each grouping
at least $K$ records, so that the sum of the within-partition squared error (SSE) is minimized. We
propose a successive Group Selection algorithm that approximately solves the microaggregation
problem in $O(N^2 \log N)$ time, based on sequential minimization of the SSE.
Experimental results and comparisons to existing methods with similar computation cost on real and
synthetic data sets demonstrate the high performance and robustness of the proposed scheme.
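A toy 1D version of the microaggregation operation (not the paper's Group Selection algorithm) makes the objective concrete: groups of at least k records, each record replaced by its group centroid, with the within-group SSE of that replacement being the quantity the real algorithm minimizes:

```python
def microaggregate(values, k):
    """Toy 1D microaggregation: sort the records, cut them into consecutive
    groups of at least k, and replace every record by its group centroid."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        # the last group absorbs the remainder so all groups have >= k records
        j = len(order) if len(order) - i < 2 * k else i + k
        group = order[i:j]
        centroid = sum(values[g] for g in group) / len(group)
        for g in group:
            out[g] = centroid
        i = j
    return out
```

Because every published value is shared by at least k records, no individual record can be singled out, which is the privacy guarantee microaggregation provides.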
ETPL
DM-088 Modeling and Computing Ternary Projective Relations between Regions
Abstract: We report a corrected version of the algorithms to compute ternary projective relations
between regions appeared in E. Clementini and R. Billen, "Modeling and computing ternary
projective relations between regions," IEEE Transactions on Knowledge and Data Engineering, vol.
18, pp. 799-814, 2006.
ETPL
DM-089
A Survival Modeling Approach to Biomedical Search Result Diversification
Using Wikipedia
Abstract: In this paper, we propose a survival modeling approach to promoting ranking diversity for
biomedical information retrieval. The proposed approach is concerned with finding relevant documents
that cover more distinct aspects of a query. First, two probabilistic models derived from the
survival analysis theory are proposed for measuring aspect novelty. Second, a new method using
Wikipedia to detect aspects covered by retrieved documents is presented. Third, an aspect filter based
on a two-stage model is introduced. It ranks the detected aspects in decreasing order of the probability
that an aspect is generated by the query. Finally, the relevance and the novelty of retrieved documents
are combined at the aspect level for reranking. Experiments conducted on the TREC 2006 and 2007
Genomics collections demonstrate the effectiveness of the proposed approach in promoting ranking
diversity for biomedical information retrieval. Moreover, we further evaluate our approach in the Web
retrieval environment. The evaluation results on the ClueWeb09-T09B collection show that our
approach can achieve promising performance improvements.
ETPL
DM-090 Centroid-Based Actionable 3D Subspace Clustering
Abstract: Demand-side management, together with the integration of distributed energy generation and
storage, is considered an increasingly essential element for implementing the smart grid concept and
balancing massive energy production from renewable sources. We focus on a smart grid in which the
demand side comprises
traditional users as well as users owning some kind of distributed energy sources and/or energy storage
devices. By means of a day-ahead optimization process regulated by an independent central unit, the latter
users intend to reduce their monetary energy expense by producing or storing energy rather than just
purchasing their energy needs from the grid. In this paper, we formulate the resulting grid optimization
problem as a noncooperative game and analyze the existence of optimal strategies. Furthermore, we present a
distributed algorithm to be run on the users' smart meters, which provides the optimal production and/or
storage strategies, while preserving the privacy of the users and minimizing the required signaling with the
central unit. Finally, the proposed day-ahead optimization is tested in a realistic situation.
ETPL
DM-091 Constrained Text Coclustering with Supervised and Unsupervised Constraints
Abstract: In this paper, we propose a novel constrained coclustering method to achieve two goals.
First, we combine information-theoretic coclustering and constrained clustering to improve clustering
performance. Second, we adopt both supervised and unsupervised constraints to demonstrate the
effectiveness of our algorithm. The unsupervised constraints are automatically derived from existing
knowledge sources, thus saving the effort and cost of using manually labeled constraints. To achieve
our first goal, we develop a two-sided hidden Markov random field (HMRF) model to represent both
document and word constraints. We then use an alternating expectation maximization (EM) algorithm
to optimize the model. We also propose two novel methods to automatically construct and incorporate
document and word constraints to support unsupervised constrained clustering: 1) automatically
construct document constraints based on overlapping named entities (NE) extracted by an NE
extractor; 2) automatically construct word constraints based on their semantic distance inferred from
WordNet. The results of our evaluation over two benchmark data sets demonstrate the superiority of
our approaches against a number of existing approaches.
ETPL
DM-092 Crowdsourced Trace Similarity with Smartphones
Abstract: Smartphones are nowadays equipped with a number of sensors, such as WiFi, GPS, accelerometers,
etc. This capability allows smartphone users to easily engage in crowdsourced computing services,
which contribute to the solution of complex problems in a distributed manner. In this work, we
leverage such a computing paradigm to solve efficiently the following problem: comparing a query
trace Q against a crowd of traces generated and stored on distributed smartphones. Our proposed
framework, coined SmartTrace+, provides an effective solution without disclosing any part
of the crowd traces to the query processor. SmartTrace+ relies on an in-situ data storage
model and intelligent top-K query processing algorithms that exploit distributed trajectory similarity
measures, resilient to spatial and temporal noise, in order to derive the most relevant answers to
Q. We evaluate our algorithms on both synthetic and real workloads. We describe our prototype
system developed on the Android OS. The solution is deployed over our own SmartLab testbed of 25
smartphones. Our study reveals that computations over SmartTrace+ result in substantial
energy conservation; in addition, results can be computed faster than competitive approaches.
ETPL
DM-093 Customized Policies for Handling Partial Information in Relational Databases
Abstract: Most real-world databases have at least some missing data. Today, users of such databases
are “on their own” in terms of how they manage this incompleteness. In this paper,
we propose the general concept of partial information policy (PIP) operator to handle incompleteness
in relational databases. PIP operators build upon preference frameworks for incomplete information,
but accommodate different types of incomplete data (e.g., a value exists but is not known; a value
does not exist; a value may or may not exist). Different users in the real world have different ways in
which they want to handle incompleteness—PIP operators allow them to specify a policy that
matches their attitude to risk and their knowledge of the application and how the data was collected.
We propose index structures for efficiently evaluating PIP operators and experimentally assess their
effectiveness on a real-world airline data set. We also study how relational algebra operators and PIP
operators interact with one another.
ETPL
DM-094 Decision Trees for Mining Data Streams Based on the McDiarmid's Bound
Abstract: In mining data streams the most popular tool is the Hoeffding tree algorithm. It uses
Hoeffding's bound to determine the smallest number of examples needed at a node to select a splitting
attribute. In the literature the same bound has been used for any evaluation function (heuristic
measure), e.g., information gain or Gini index. In this paper, it is shown that Hoeffding's
inequality is not appropriate to solve the underlying problem. We prove two theorems presenting
McDiarmid's bound both for the information gain, used in the ID3 algorithm, and for the Gini index, used in the
Classification and Regression Trees (CART) algorithm. The results of the paper guarantee that a
decision tree learning system, applied to data streams and based on the McDiarmid's bound, has the
property that its output is nearly identical to that of a conventional learner. The results of the paper
have a great impact on the state of the art of mining data streams, and various methods and algorithms
developed so far should be reconsidered.
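For reference, the Hoeffding bound that the paper argues against has a simple closed form. This sketch (parameter names are illustrative) computes the epsilon that a Hoeffding tree compares against the observed gap between the two best splitting attributes:

```python
import math

def hoeffding_epsilon(value_range, delta, n):
    """Hoeffding bound: with probability at least 1 - delta, the true
    mean of a random variable with range `value_range` lies within
    epsilon of the mean observed over n independent examples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

# a Hoeffding tree splits a node once the gain gap between the two best
# attributes exceeds epsilon at the chosen confidence level
eps = hoeffding_epsilon(value_range=1.0, delta=1e-7, n=1000)
```

The paper's point is that this bound applies to a mean of i.i.d. terms, whereas heuristics like information gain are nonlinear functions of the sample, for which McDiarmid's inequality is the appropriate tool.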
ETPL
DM-095 Discovering Characterizations of the Behavior of Anomalous Subpopulations
Abstract: We consider the problem of discovering attributes, or properties, accounting for the a priori
stated abnormality of a group of anomalous individuals (the outliers) with respect to an overall given
population (the inliers). To this aim, we introduce the notion of exceptional property and define the
concept of exceptionality score, which measures the significance of a property. In particular, in order
to single out exceptional properties, we resort to a form of minimum distance estimation for
evaluating the badness of fit of the values assumed by the outliers compared to the probability
distribution associated with the values assumed by the inliers. Suitable exceptionality scores are
introduced for both numeric and categorical attributes. These scores are designed, from both the
analytical and the empirical points of view, to be effective for small samples, as is the case for outliers.
We present an algorithm, called EXPREX, for efficiently discovering exceptional
properties. The algorithm is able to reduce the needed computational effort by not exploring many
irrelevant numerical intervals and by exploiting suitable pruning rules. The experimental results
confirm that our technique is able to provide knowledge characterizing outliers in a natural manner.
ETPL
DM-096 FoCUS: Learning to Crawl Web Forums
Abstract: In this paper, we present Forum Crawler Under Supervision (FoCUS), a supervised web-
scale forum crawler. The goal of FoCUS is to crawl relevant forum content from the web with
minimal overhead. Forum threads contain information content that is the target of forum crawlers.
Although forums have different layouts or styles and are powered by different forum software
packages, they always have similar implicit navigation paths connected by specific URL types to lead
users from entry pages to thread pages. Based on this observation, we reduce the web forum crawling
problem to a URL-type recognition problem, and we show how to learn accurate and effective
regular expression patterns of implicit navigation paths from automatically created training sets using
aggregated results from weak page type classifiers. Robust page type classifiers can be trained from
as few as five annotated forums and applied to a large set of unseen forums. Our test results show that
FoCUS achieved over 98 percent effectiveness and 97 percent coverage on a large set of test forums
powered by over 150 different forum software packages. In addition, the results of applying FoCUS
on more than 100 community Question and Answer sites and Blog sites demonstrated that the concept
of implicit navigation path could apply to other social media sites.
ETPL
DM-097
Improving Word Similarity by Augmenting PMI with Estimates of Word
Polysemy
Abstract: Pointwise mutual information (PMI) is a widely used word similarity measure, but it lacks a
clear explanation of how it works. We explore how PMI differs from distributional similarity, and we
introduce a novel metric, PMImax, that augments PMI with information about a word's
number of senses. The coefficients of PMImax are determined empirically by
maximizing a utility function based on the performance of automatic thesaurus generation. We show
that it outperforms traditional PMI in the application of automatic thesaurus generation and in two
word similarity benchmark tasks: human similarity ratings and TOEFL synonym questions.
PMImax achieves a correlation coefficient comparable to the best knowledge-based
approaches on the Miller-Charles similarity rating data set.
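Plain PMI, the baseline that PMImax augments, can be computed directly from corpus counts. A minimal sketch with made-up counts:

```python
import math

def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information from co-occurrence counts:
    PMI(x, y) = log2( p(x, y) / (p(x) p(y)) ),
    i.e., how much more often x and y co-occur than chance predicts."""
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))

# a word pair that co-occurs far more often than chance gets a high score
score = pmi(count_xy=100, count_x=1000, count_y=200, total=1_000_000)
```

The paper's observation is that for polysemous words such marginal counts are inflated across senses, which PMImax corrects with per-sense estimates.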
ETPL
DM-098 Incentive Compatible Privacy-Preserving Data Analysis
Abstract: In many cases, competing parties who have private data may collaboratively conduct
privacy-preserving distributed data analysis (PPDA) tasks to learn beneficial data models or analysis
results. Most often, the competing parties have different incentives. Although certain PPDA
techniques guarantee that nothing other than the final analysis result is revealed, it is impossible to
verify whether participating parties are truthful about their private input data. Unless proper
incentives are set, current PPDA techniques cannot prevent participating parties from modifying their
private inputs. This raises the question of how to design incentive compatible privacy-preserving data
analysis techniques that motivate participating parties to provide truthful inputs. In this paper, we first
develop key theorems; then, based on these theorems, we analyze certain important privacy-preserving
data analysis tasks that can be conducted in a way that makes telling the truth the best choice for any
participating party.
ETPL
DM-099 Nonnegative Matrix Factorization: A Comprehensive Review
Abstract: Nonnegative Matrix Factorization (NMF), a relatively novel paradigm for dimensionality
reduction, has been in the ascendant since its inception. It incorporates the nonnegativity constraint
and thus obtains the parts-based representation as well as enhancing the interpretability of the issue
correspondingly. This survey paper mainly focuses on the theoretical research into NMF over the last
5 years, where the principles, basic models, properties, and algorithms of NMF along with its various
modifications, extensions, and generalizations are summarized systematically. The existing NMF
algorithms are divided into four categories: Basic NMF (BNMF), Constrained NMF (CNMF),
Structured NMF (SNMF), and Generalized NMF (GNMF), upon which the design principles,
characteristics, problems, relationships, and evolution of these algorithms are presented and analyzed
comprehensively. Related work outside NMF that NMF can learn from or that has connections
with NMF is also covered. Moreover, some open issues that remain to be solved are discussed. Several
relevant application areas of NMF are also briefly described. This survey aims to construct an
integrated, state-of-the-art framework for the NMF concept, from which follow-up research may
benefit.
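As a concrete instance of the Basic NMF category surveyed above, the classical Lee-Seung multiplicative updates for the Frobenius objective can be sketched as follows (a minimal illustration, not any one algorithm from the survey):

```python
import numpy as np

def nmf(V, rank, iters=300, seed=0):
    """Basic NMF via Lee-Seung multiplicative updates (Frobenius norm):
    approximate V ~ W @ H with all entries of W and H nonnegative."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank)) + 1e-4
    H = rng.random((rank, m)) + 1e-4
    for _ in range(iters):
        # multiplicative updates keep W and H nonnegative by construction
        H *= (W.T @ V) / (W.T @ W @ H + 1e-10)
        W *= (V @ H.T) / (W @ H @ H.T + 1e-10)
    return W, H

# a nonnegative rank-1 matrix is recovered almost exactly
V = np.outer([1.0, 2.0], [3.0, 4.0])
W, H = nmf(V, rank=1)
err = np.abs(V - W @ H).max()
```

The nonnegativity of the factors is what yields the parts-based, interpretable representation the abstract refers to.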
ETPL
DM-100 On Identifying Critical Nuggets of Information during Classification Tasks
Abstract: In large databases, there may exist critical nuggets: small collections of records or instances
that contain domain-specific important information. This information can be used for future decision
making such as labeling of critical, unlabeled data records and improving classification results by
reducing false positive and false negative errors. This work introduces the idea of critical nuggets,
proposes an innovative domain-independent method to measure criticality, suggests a heuristic to
reduce the search space for finding critical nuggets, and isolates and validates critical nuggets from
some real-world data sets. It seems that only a few subsets may qualify to be critical nuggets,
underlining the importance of finding them. The proposed methodology can detect them. This work
also identifies certain properties of critical nuggets and provides experimental validation of the
properties. Experimental results also helped validate that critical nuggets can assist in improving
classification accuracies in real-world data sets.
ETPL
DM-101
Radio Database Compression for Accurate Energy-Efficient Localization in
Fingerprinting Systems
Abstract: Location fingerprinting is a positioning method that exploits the already existing
infrastructures such as cellular networks or WLANs. Regarding the recent demand for energy
efficient networks and the emergence of issues like green networking, we propose a clustering
technique to compress the radio database in the context of cellular fingerprinting systems. The aim of
the proposed technique is to reduce the computation cost and transmission load in the mobile-based
implementations. The presented method may be called Block-based Weighted Clustering (BWC)
technique, which is applied in a concatenated location-radio signal space, and attributes different
weight factors to the location and radio components. Computer simulations and real experiments have
been conducted to evaluate the performance of our proposed technique in the context of a GSM
network. The obtained results confirm the efficiency of the BWC technique, and show that it
improves the performance of standard k-means and hierarchical clustering methods.
ETPL
DM-102
Semi-Supervised Nonlinear Hashing Using Bootstrap Sequential Projection
Learning
Abstract: In this paper, we study the effective semi-supervised hashing method under the framework of
regularized learning-based hashing. A nonlinear hash function is introduced to capture the underlying
relationship among data points. Thus, the dimensionality of the matrix for computation is not only
independent from the dimensionality of the original data space but also much smaller than the one
using a linear hash function. To effectively deal with the error accumulated when converting the real-
valued embeddings into binary codes after relaxation, we propose a semi-supervised nonlinear
hashing algorithm using bootstrap sequential projection learning, which effectively corrects the errors
by taking into account all the previously learned bits holistically, without incurring extra
computational overhead. Experimental results on six benchmark data sets demonstrate that the
presented method outperforms state-of-the-art hashing algorithms by a large margin.
ETPL
DM-103 Spatial Approximate String Search
Abstract: This work deals with the approximate string search in large spatial databases. Specifically,
we investigate range queries augmented with a string similarity search predicate in both euclidean
space and road networks. We dub this query the spatial approximate string (Sas) query. In euclidean
space, we propose an approximate solution, the MhR-tree, which embeds min-wise signatures into an
R-tree. The min-wise signature for an index node u keeps a concise representation of the union of
q-grams from strings under the subtree of u. We analyze the pruning functionality of such
signatures based on the set resemblance between the query string and the q-grams from the
subtrees of index nodes. We also discuss how to estimate the selectivity of a Sas query in euclidean
space, for which we present a novel adaptive algorithm to find balanced partitions using both the
spatial and string information stored in the tree. For queries on road networks, we propose a novel
exact method, RsasSol, which significantly outperforms the baseline algorithm in practice. The
RsasSol method combines the q-gram-based inverted lists and the reference-nodes-based pruning.
Extensive experiments on large real data sets demonstrate the efficiency and effectiveness of our
approaches.
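The q-gram sets and set resemblance that drive this pruning can be illustrated in a few lines (a simplified sketch: the index described above approximates this resemblance with min-wise signatures rather than computing it exactly):

```python
def qgrams(s, q=2):
    """Set of q-grams of a string (here without boundary padding)."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def resemblance(a, b, q=2):
    """Set resemblance (Jaccard similarity) between the q-gram sets of
    two strings; a low resemblance bound lets an index subtree be
    pruned without inspecting the strings it contains."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    return len(ga & gb) / len(ga | gb)

# spelling variants share most of their 2-grams
sim = resemblance("theatre", "theater")
```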
ETPL
DM-104 SVStream: A Support Vector-Based Algorithm for Clustering Data Streams
Abstract: In this paper, we propose a novel data stream clustering algorithm, termed SVStream, which
is based on support vector domain description and support vector clustering. In the proposed
algorithm, the data elements of a stream are mapped into a kernel space, and the support vectors are
used as the summary information of the historical elements to construct cluster boundaries of arbitrary
shape. To adapt to both dramatic and gradual changes, multiple spheres are dynamically maintained,
each describing the corresponding data domain presented in the data stream. By allowing for bounded
support vectors (BSVs), the proposed SVStream algorithm is capable of identifying overlapping
clusters. A BSV decaying mechanism is designed to automatically detect and remove outliers (noise).
We perform experiments over synthetic and real data streams, with the overlapping, evolving, and
noise situations taken into consideration. Comparison results with state-of-the-art data stream
clustering methods demonstrate the effectiveness and efficiency of the proposed method.
ETPL
DM-105 The Move-Split-Merge Metric for Time Series
Abstract: A novel metric for time series, called Move-Split-Merge (MSM), is proposed. This metric
uses as building blocks three fundamental operations: Move, Split, and Merge, which can be applied
in sequence to transform any time series into any other time series. A Move operation changes the
value of a single element, a Split operation converts a single element into two consecutive elements,
and a Merge operation merges two consecutive elements into one. Each operation has an associated
cost, and the MSM distance between two time series is defined to be the cost of the cheapest sequence
of operations that transforms the first time series into the second one. An efficient, quadratic-time
algorithm is provided for computing the MSM distance. MSM has the desirable properties of being a
metric, in contrast to the Dynamic Time Warping (DTW) distance, and of being invariant to the choice of
origin, in contrast to the Edit Distance with Real Penalty (ERP) metric. At the same time, experiments
with public time series data sets demonstrate that MSM is a meaningful distance measure that
oftentimes leads to lower nearest neighbor classification error rates compared to DTW and ERP.
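The quadratic-time dynamic program can be sketched directly from the three operations described above (a hedged reconstruction; the split/merge cost constant c is a tunable parameter):

```python
def msm_distance(x, y, c=0.1):
    """Move-Split-Merge distance between two numeric sequences via the
    standard quadratic-time dynamic program (a sketch, not the paper's
    exact pseudocode). c is the cost of a Split or Merge operation."""
    def cost(new, a, b):
        # cost of splitting/merging to value `new` between neighbours a, b
        if min(a, b) <= new <= max(a, b):
            return c
        return c + min(abs(new - a), abs(new - b))

    m, n = len(x), len(y)
    D = [[0.0] * n for _ in range(m)]
    D[0][0] = abs(x[0] - y[0])
    for i in range(1, m):
        D[i][0] = D[i - 1][0] + cost(x[i], x[i - 1], y[0])
    for j in range(1, n):
        D[0][j] = D[0][j - 1] + cost(y[j], x[0], y[j - 1])
    for i in range(1, m):
        for j in range(1, n):
            D[i][j] = min(
                D[i - 1][j - 1] + abs(x[i] - y[j]),       # Move
                D[i - 1][j] + cost(x[i], x[i - 1], y[j]),  # Split/Merge
                D[i][j - 1] + cost(y[j], x[i], y[j - 1]),  # Split/Merge
            )
    return D[m - 1][n - 1]

d = msm_distance([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
```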
ETPL
DM-106 A User-Friendly Patent Search Paradigm
Abstract: As an important operation for finding existing relevant patents and validating a new patent
application, patent search has attracted considerable attention recently. However, many users have
limited knowledge about the underlying patents, and they have to use a try-and-see approach to
repeatedly issue different queries and check answers, which is a very tedious process. To address this
problem, in this paper, we propose a new user-friendly patent search paradigm, which can help users
find relevant patents more easily and improve user search experience. We propose three effective
techniques, error correction, topic-based query suggestion, and query expansion, to improve the
usability of patent search. We also study how to efficiently find relevant answers from a large
collection of patents. We first partition patents into small partitions based on their topics and classes.
Then, given a query, we find highly relevant partitions and answer the query in each such highly
relevant partition. Finally, we combine the answers of each partition and generate the top-k answers
of the patent-search query.
ETPL
DM-107
A Methodology for Direct and Indirect Discrimination Prevention in Data
Mining
Abstract: Data mining is an increasingly important technology for extracting useful knowledge hidden
in large collections of data. There are, however, negative social perceptions about data mining, among
which are potential privacy invasion and potential discrimination. The latter consists of unfairly treating
people on the basis of their belonging to a specific group. Automated data collection and data mining
techniques such as classification rule mining have paved the way to making automated decisions, like
loan granting/denial, insurance premium computation, etc. If the training data sets are biased with
regard to discriminatory (sensitive) attributes like gender, race, religion, etc., discriminatory decisions
may ensue. For this reason, anti-discrimination techniques including discrimination discovery and
prevention have been introduced in data mining. Discrimination can be either direct or indirect. Direct
discrimination occurs when decisions are made based on sensitive attributes. Indirect discrimination
occurs when decisions are made based on nonsensitive attributes which are strongly correlated with
biased sensitive ones. In this paper, we tackle discrimination prevention in data mining and propose
new techniques applicable for direct or indirect discrimination prevention individually or both at the
same time. We discuss how to clean training data sets and outsourced data sets in such a way that
direct and/or indirect discriminatory decision rules are converted to legitimate (nondiscriminatory)
classification rules. We also propose new metrics to evaluate the utility of the proposed approaches
and we compare these approaches. The experimental evaluations demonstrate that the proposed
techniques are effective at removing direct and/or indirect discrimination biases in the original data
set while preserving data quality.
ETPL
DM-108 Anomaly Detection via Online Oversampling Principal Component Analysis
Abstract: Anomaly detection has been an important research topic in data mining and machine
learning. Many real-world applications such as intrusion or credit card fraud detection require an
effective and efficient framework to identify deviated data instances. However, most anomaly
detection methods are typically implemented in batch mode, and thus cannot be easily extended to
large-scale problems without sacrificing computation and memory requirements. In this paper, we
propose an online oversampling principal component analysis (osPCA) algorithm to address this
problem, and we aim at detecting the presence of outliers from a large amount of data via an online
updating technique. Unlike prior principal component analysis (PCA)-based approaches, we do not
store the entire data matrix or covariance matrix, and thus our approach is especially of interest in
online or large-scale problems. By oversampling the target instance and extracting the principal
direction of the data, the proposed osPCA allows us to determine the anomaly of the target instance
according to the variation of the resulting dominant eigenvector. Since our osPCA need not perform
eigen analysis explicitly, the proposed framework is favored for online applications which have
computation or memory limitations. Compared with the well-known power method for PCA and
other popular anomaly detection algorithms, our experimental results verify the feasibility of our
proposed method in terms of both accuracy and efficiency.
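The core osPCA idea, duplicating the target instance and checking how much the dominant principal direction rotates, can be sketched as follows (an illustrative reconstruction using batch SVD rather than the paper's online updating technique; names and the oversampling ratio are assumptions):

```python
import numpy as np

def ospca_score(data, target, ratio=0.1):
    """Anomaly score of `target` via oversampling PCA: append extra
    copies of the target and measure how far the dominant eigenvector
    of the mean-centred data rotates (1 - |cosine of the angle|)."""
    def dominant_direction(A):
        A = A - A.mean(axis=0)
        _, _, Vt = np.linalg.svd(A, full_matrices=False)
        return Vt[0]   # first right singular vector = principal direction

    u = dominant_direction(data)
    k = max(1, int(ratio * len(data)))
    over = np.vstack([data, np.tile(target, (k, 1))])
    v = dominant_direction(over)
    return 1.0 - abs(u @ v)

# points on a line: an on-line target barely moves the principal
# direction, while a far-off perpendicular target rotates it strongly
line = np.column_stack([np.linspace(-1, 1, 50), np.zeros(50)])
inlier_score = ospca_score(line, np.array([0.5, 0.0]))
outlier_score = ospca_score(line, np.array([0.0, 5.0]))
```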
ETPL
DM-109
CDAMA: Concealed Data Aggregation Scheme for Multiple Applications in
Wireless Sensor Networks
Abstract: For wireless sensor networks, a data aggregation scheme that reduces the large amount of
transmission is the most practical technique. In previous studies, homomorphic encryptions have been
applied to conceal communication during aggregation, such that enciphered data can be aggregated
algebraically without decryption. Since aggregators collect data without decryption, adversaries are
not able to forge aggregated results by compromising them. However, these schemes have drawbacks.
First, they do not support multi-application environments. Second, they become insecure when some
sensor nodes are compromised. Third, they do not provide secure counting; thus, they may suffer
unauthorized aggregation attacks. Therefore, we propose a new concealed data aggregation scheme
extended from Boneh et al.'s homomorphic public encryption system. The proposed scheme has three
contributions. First, it is designed for a multi-application environment. The base station extracts
application-specific data from aggregated ciphertexts. Next, it mitigates the impact of compromising
attacks in single application environments. Finally, it degrades the damage from unauthorized
aggregations. To prove the proposed scheme's robustness and efficiency, we also conducted
comprehensive analyses and comparisons.
ETPL
DM-110
Classification and Adaptive Novel Class Detection of Feature-Evolving Data
Streams
Abstract: Data stream classification poses many challenges to the data mining community. In this
paper, we address four such major challenges, namely, infinite length, concept-drift, concept-
evolution, and feature-evolution. Since a data stream is theoretically infinite in length, it is impractical
to store and use all the historical data for training. Concept-drift is a common phenomenon in data
streams, which occurs as a result of changes in the underlying concepts. Concept-evolution occurs as
a result of new classes evolving in the stream. Feature-evolution is a frequently occurring process in
many streams, such as text streams, in which new features (i.e., words or phrases) appear as the
stream progresses. Most existing data stream classification techniques address only the first two
challenges, and ignore the latter two. In this paper, we propose an ensemble classification framework,
where each classifier is equipped with a novel class detector, to address concept-drift and concept-
evolution. To address feature-evolution, we propose a feature set homogenization technique. We also
enhance the novel class detection module by making it more adaptive to the evolving stream, and
enabling it to detect more than one novel class at a time. Comparison with state-of-the-art data stream
classification techniques establishes the effectiveness of the proposed approach.
ETPL
DM-111 Comparable Entity Mining from Comparative Questions
Abstract: Comparing one thing with another is a typical part of the human decision-making process.
However, it is not always easy to know what to compare and what the alternatives are. In this paper,
we present a novel way to address this difficulty by automatically mining comparable entities from
comparative questions that users have posted online. To ensure high precision and high recall, we develop a
weakly supervised bootstrapping approach for comparative question identification and comparable
entity extraction by leveraging a large online question archive. The experimental results
show that our method achieves an F1-measure of 82.5 percent in comparative question identification and
83.3 percent in comparable entity extraction. Both significantly outperform an existing state-of-the-art
method. Additionally, our ranking results show high relevance to users' comparison intents on the web.
ETPL
DM-112 Cross-Space Affinity Learning with Its Application to Movie Recommendation
Abstract: In this paper, we propose a novel cross-space affinity learning algorithm over different
spaces with heterogeneous structures. Unlike most affinity learning algorithms, which operate on a
homogeneous space, we construct a cross-space tensor model to learn the affinity measures on
heterogeneous spaces subject to a set of order constraints from the training pool. We further enhance
the model with a factorization form which greatly reduces the number of parameters of the model
with a controlled complexity. Moreover, from the practical perspective, we show the proposed
factorized cross-space tensor model can be efficiently optimized by a series of simple quadratic
optimization problems in an iterative manner. The proposed cross-space affinity learning algorithm
can be applied to many real-world problems, which involve multiple heterogeneous data objects
defined over different spaces. In this paper, we apply it to a recommendation system to measure
the affinity between users and the product items, where a higher affinity means a higher rating of the
user on the product. For an empirical evaluation, a widely used benchmark movie recommendation
data set—MovieLens—is used to compare the proposed algorithm with other state-of-
the-art recommendation algorithms and we show that very competitive results can be obtained.
ETPL
DM-113 Distributed Strategies for Mining Outliers in Large Data Sets
Abstract: We introduce a distributed method for detecting distance-based outliers in very large data
sets. Our approach is based on the concept of outlier detection solving set [2], which is a small subset
of the data set that can be also employed for predicting novel outliers. The method exploits parallel
computation in order to obtain vast time savings. Indeed, beyond preserving the correctness of the
result, the proposed scheme exhibits excellent performance. From the theoretical point of view, for
common settings, our algorithm is expected to be at least three orders of
magnitude faster than the classical nested-loop-like approach to detecting outliers. Experimental results
show that the algorithm is efficient and that its running time scales quite well for an increasing
number of nodes. We also discuss a variant of the basic strategy which reduces the amount of data to
be transferred in order to improve both the communication cost and the overall runtime. Importantly,
the solving set computed by our approach in a distributed environment has the same quality as that
produced by the corresponding centralized method.
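The distance-based outlier notion underlying the solving-set method can be conveyed with a minimal k-nearest-neighbor scorer. This sketch is our assumption for illustration only: it omits the solving set and the distributed machinery, which are precisely what make the paper's approach scale.

```python
import math

def knn_outlier_scores(points, k):
    """Score each point by the distance to its k-th nearest neighbor:
    the larger the score, the more isolated (outlying) the point.
    A naive O(n^2 log n) illustration of the distance-based outlier
    definition; the paper's solving-set and distributed strategies
    are not reproduced here."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores
```

On a small 2D set with one far-away point, that point receives the highest score, matching the intuition of distance-based outlierness.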
ETPL
DM-114 Enhancing Access Privacy of Range Retrievals over B+-Trees
Abstract: Users of databases that are hosted on shared servers cannot take for granted that their queries
will not be disclosed to unauthorized parties. Even if the database is encrypted, an adversary who is
monitoring the I/O activity on the server may still be able to infer some information about a user
query. For the particular case of a B+-tree that has its nodes encrypted, we identify
properties that enable the ordering among the leaf nodes to be deduced. These properties allow us to
construct adversarial algorithms to recover the B+-tree structure from the I/O traces
generated by range queries. Combining this structure with knowledge of the key distribution (or the
plaintext database itself), the adversary can infer the selection range of user queries. To counter the
threat, we propose a privacy-enhancing PB+-tree index which ensures that there is high
uncertainty about what data the user has worked on, even to a knowledgeable adversary who has
observed numerous query executions. The core idea in PB+-tree is to conceal the order of
the leaf nodes in an encrypted B+-tree. In particular, it groups the nodes of the tree into
buckets, and employs homomorphic encryption techniques to prevent the adversary from pinpointing
the exact nodes retrieved by range queries. PB+-tree can be tuned to balance its privacy
strength with the computational and I/O overheads incurred. Moreover, it can be adapted to protect
access privacy in cases where the attacker additionally knows a priori the access frequencies of key
values. Experiments demonstrate that PB+-tree effectively impairs the adversary's ability
to recover the B+-tree structure and deduce the query ranges in all considered scenarios.
ETPL
DM-115 Inferring Statistically Significant Hidden Markov Models
Abstract: Hidden Markov models (HMMs) are used to analyze real-world problems. We consider an
approach that constructs minimum entropy HMMs directly from a sequence of observations. If an
insufficient amount of observation data is used to generate the HMM, the model will not represent the
underlying process. Current methods assume that observations completely represent the underlying
process. It is often the case that the training data size is not large enough to adequately capture all
statistical dependencies in the system. It is, therefore, important to know the statistical significance
level at which the constructed model represents the underlying process, not only the training set. In
this paper, we present a method to determine if the observation data and constructed model fully
express the underlying process with a given level of statistical significance. We use the statistics of
the process to calculate an upper bound on the number of samples required to guarantee that the
model has a given significance level. We provide theoretical and experimental results that confirm the
utility of this approach. The experiment is conducted on a real private Tor network.
ETPL
DM-116
Lineage Encoding: An Efficient Wireless XML Streaming Supporting Twig
Pattern Queries
Abstract: In this paper, we propose an energy- and latency-efficient XML dissemination scheme for
mobile computing. We define a novel unit structure called G-node for streaming XML data in the
wireless environment. It exploits the benefits of the structure indexing and attribute summarization
that can integrate relevant XML elements into a group. It provides a way for selective access of their
attribute values and text content. We also propose a lightweight and effective encoding scheme, called
Lineage Encoding, to support evaluation of predicates and twig pattern queries over the stream. The
Lineage Encoding scheme represents the parent-child relationships among XML elements as a
sequence of bit-strings, called Lineage Code(V, H), and provides basic operators and functions for
effective twig pattern query processing at mobile clients. Extensive experiments using real and
synthetic data sets demonstrate that our scheme outperforms conventional wireless XML broadcasting
methods for simple path queries as well as complex twig pattern queries with predicate conditions.
ETPL
DM-117 MKBoost: A Framework of Multiple Kernel Boosting
Abstract: Multiple kernel learning (MKL) is a promising family of machine learning algorithms using
multiple kernel functions for various challenging data mining tasks. Conventional MKL methods
often formulate the problem as an optimization task of learning the optimal combinations of both
kernels and classifiers, which usually results in some forms of challenging optimization tasks that are
often difficult to solve. Different from existing MKL methods, in this paper, we investigate a
boosting framework of MKL for classification tasks, i.e., we adopt boosting to solve a variant of
MKL problem, which avoids solving the complicated optimization tasks. Specifically, we present a
novel framework of Multiple kernel boosting (MKBoost), which applies the idea of boosting
techniques to learn kernel-based classifiers with multiple kernels for classification problems. Based
on the proposed framework, we propose several variants of MKBoost algorithms and extensively
examine their empirical performance on a number of benchmark data sets in comparisons to various
state-of-the-art MKL algorithms on classification tasks. Experimental results show that the proposed
method is more effective and efficient than the existing MKL techniques.
ETPL
DM-118 Mining Order-Preserving Submatrices from Data with Repeated Measurements
Abstract: Order-preserving submatrices (OPSM's) have been shown useful in capturing concurrent
patterns in data when the relative magnitudes of data items are more important than their exact values.
For instance, in analyzing gene expression profiles obtained from microarray experiments, the relative
magnitudes are important both because they represent the change of gene activities across the
experiments, and because there is typically a high level of noise in data that makes the exact values
untrustable. To cope with data noise, repeated experiments are often conducted to collect multiple
measurements. We propose and study a more robust version of OPSM, where each data item is
represented by a set of values obtained from replicated experiments. We call the new problem OPSM-
RM (OPSM with repeated measurements). We define OPSM-RM based on a number of practical
requirements. We discuss the computational challenges of OPSM-RM and propose a generic mining
algorithm. We further propose a series of techniques to speed up the two time-dominating components of
the algorithm. We show the effectiveness and efficiency of our methods through a series of
experiments conducted on real microarray data.
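The core notion of an order-preserving pattern fits in a few lines: a row supports a candidate column order if its values strictly increase along that order. The sketch below covers only the deterministic single-measurement case as an illustration (function names are our own); handling replicated measurements, as OPSM-RM does, is substantially more involved.

```python
def supports_order(row, col_order):
    """True if the row's values strictly increase along col_order,
    i.e., the row supports that order-preserving pattern."""
    vals = [row[c] for c in col_order]
    return all(a < b for a, b in zip(vals, vals[1:]))

def supporting_rows(matrix, col_order):
    """Indices of the rows that support the candidate column order."""
    return [i for i, row in enumerate(matrix) if supports_order(row, col_order)]
```

An OPSM is then a set of rows and a column order such that every row in the set supports that order.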
ETPL
DM-119 Modeling Noisy Annotated Data with Application to Social Annotation
Abstract: We propose a probabilistic topic model for analyzing and extracting content-related
annotations from noisy annotated discrete data such as webpages stored using social bookmarking
services. With these services, because users can attach annotations freely, some annotations do not
describe the semantics of the content, thus they are noisy, i.e., not content related. The extraction of
content-related annotations can be used as a preprocessing step in machine learning tasks such as text
classification and image recognition, or can improve information retrieval performance. The proposed
model is a generative model for content and annotations, in which the annotations are assumed to
originate either from topics that generated the content or from a general distribution unrelated to the
content. We demonstrate the effectiveness of the proposed method by using synthetic data and real
social annotation data for text and images.
ETPL
DM-120 Multiparty Access Control for Online Social Networks: Model and Mechanisms
Abstract: Online social networks (OSNs) have experienced tremendous growth in recent years and become a de
facto portal for hundreds of millions of Internet users. These OSNs offer attractive means for digital
social interactions and information sharing, but also raise a number of security and privacy issues.
While OSNs allow users to restrict access to shared data, they currently do not provide any
mechanism to enforce privacy concerns over data associated with multiple users. To this end, we
propose an approach to enable the protection of shared data associated with multiple users in OSNs.
We formulate an access control model to capture the essence of multiparty authorization
requirements, along with a multiparty policy specification scheme and a policy enforcement
mechanism. In addition, we present a logical representation of our access control model that allows us to
leverage the features of existing logic solvers to perform various analysis tasks on our model. We also
discuss a proof-of-concept prototype of our approach as part of an application in Facebook and
provide a usability study and system evaluation of our method.
ETPL
DM-121 On the Analytical Properties of High-Dimensional Randomization
Abstract: In this paper, we will provide the first comprehensive analysis of high-dimensional
randomization. The goal is to examine the strengths and weaknesses of randomization and explore
both the potential and the pitfalls of high-dimensional randomization. Our theoretical analysis results
in a number of interesting and insightful conclusions. 1) The privacy effects of randomization reduce
rapidly with increasing dimensionality. 2) The properties of the underlying data set can affect the
anonymity level of the randomization method. For example, natural properties of real data sets such
as clustering improve the effectiveness of randomization. On the other hand, variations in data density
of nonempty data localities and outliers create privacy preservation challenges for the randomization
method. 3) The use of a public information-sensitive attack method makes the choice of perturbing
distribution more critical than previously thought. In particular, Gaussian perturbations are
significantly more effective than uniformly distributed perturbations for the high dimensional case.
These insights are very useful for future research and design of the randomization method. We use the
insights gained from our analysis to discuss and suggest future research directions for improvements
and extensions of the randomization method.
ETPL
DM-122 TACI: Taxonomy-Aware Catalog Integration
Abstract: A fundamental data integration task faced by online commercial portals and commerce search engines
is the integration of products coming from multiple providers into their product catalogs. In this
scenario, the commercial portal has its own taxonomy (the “master taxonomy”), while each data
provider organizes its products into a different taxonomy (the “provider taxonomy”). In this paper, we
consider the problem of categorizing products from the data providers into the master taxonomy,
while making use of the provider taxonomy information. Our approach is based on a taxonomy-aware
processing step that adjusts the results of a text-based classifier to ensure that products that are close
together in the provider taxonomy remain close in the master taxonomy. We formulate this intuition
as a structured prediction optimization problem. To the best of our knowledge, this is the first
approach that leverages the structure of taxonomies in order to enhance catalog integration. We
propose algorithms that are scalable and thus applicable to the large data sets that are typical on the
web. We evaluate our algorithms on real-world data and we show that taxonomy-aware classification
provides a significant improvement over existing approaches.
ETPL
DM -123 The Skyline of a Probabilistic Relation
Abstract: In a deterministic relation R, tuple u dominates tuple v if u is no worse than v on all the
attributes of interest, and better than v on at least one attribute. This concept is at the heart of skyline
queries, that return the set of undominated tuples in R. In this paper, we extend the notion of skyline
to probabilistic relations by generalizing to this context the definition of tuple domination. Our
approach is parametric in the semantics for linearly ranking probabilistic tuples and, being based on
order-theoretic principles, preserves the three fundamental properties the skyline has in the
deterministic case: 1) It equals the union of all top-1 results of monotone scoring functions; 2) it
requires no additional parameter; and 3) it is insensitive to actual attribute scales. We then show how
domination among probabilistic tuples (or P-domination for short) can be efficiently checked by
means of a set of rules. We detail such rules for the cases in which tuples are ranked using either the
“expected rank” or the “expected score” semantics, and explain how the approach can be applied to
other semantics as well. Since computing the skyline of a probabilistic relation is a time-consuming
task, we introduce a family of algorithms for checking P-domination rules in an optimized way.
Experiments show that these algorithms can significantly reduce the actual execution times with
respect to a naive evaluation.
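The deterministic domination and skyline definitions the abstract starts from can be sketched directly, assuming for illustration that smaller is better on every attribute; the probabilistic P-domination rules are the paper's contribution and are not reproduced here.

```python
def dominates(u, v):
    """u dominates v: u is no worse than v on all attributes and
    strictly better on at least one (here 'better' means smaller)."""
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))

def skyline(tuples):
    """Undominated tuples of a deterministic relation (naive O(n^2) scan)."""
    return [t for t in tuples
            if not any(dominates(u, t) for u in tuples if u != t)]
```

For instance, among {(1,4), (2,2), (4,1), (3,3)}, only (3,3) is dominated (by (2,2)), so the skyline is the other three tuples.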
ETPL
DM-124
Unsupervised Hybrid Feature Extraction Selection for High-Dimensional Non-
Gaussian Data Clustering with Variational Inference
Abstract: Clustering has been a subject of extensive research in data mining, pattern recognition, and
other areas for several decades. The main goal is to assign samples, which are typically non-Gaussian
and expressed as points in high-dimensional feature spaces, to one of a number of clusters. It is well
known that in such high-dimensional settings, the existence of irrelevant features generally
compromises modeling capabilities. In this paper, we propose a variational inference framework for
unsupervised non-Gaussian feature selection, in the context of finite generalized Dirichlet (GD)
mixture-based clustering. Under the proposed principled variational framework, we simultaneously
estimate, in a closed form, all the involved parameters and determine the complexity (i.e., both model
and feature selection) of the GD mixture. Extensive simulations using synthetic data along with an
analysis of real-world data and human action videos demonstrate that our variational approach
achieves better results than comparable techniques.
ETPL
DM-125 A Context-Based Word Indexing Model for Document Summarization
Abstract: Existing models for document summarization mostly use the similarity between sentences in the
document to extract the most salient sentences. The documents as well as the sentences are indexed
using traditional term indexing measures, which do not take the context into consideration. Therefore,
the sentence similarity values remain independent of the context. In this paper, we propose a context
sensitive document indexing model based on the Bernoulli model of randomness. The Bernoulli
model of randomness has been used to find the probability of the cooccurrences of two terms in a
large corpus. A new approach using the lexical association between terms to give a context sensitive
weight to the document terms has been proposed. The resulting indexing weights are used to compute
the sentence similarity matrix. The proposed sentence similarity measure has been used with the
baseline graph-based ranking models for sentence extraction. Experiments have been conducted over
the benchmark DUC data sets and it has been shown that the proposed Bernoulli-based sentence
similarity model provides consistent improvements over the baseline IntraLink and UniformLink
methods [1].
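A generic cooccurrence-based association weight conveys the flavor of context-sensitive term weighting. The sketch below uses a simple observed-over-expected log ratio under an independence assumption; it is an illustrative stand-in of our own, not the paper's Bernoulli-model formula.

```python
from math import log

def association(cooc, count_a, count_b, n):
    """Log ratio of the observed cooccurrence count of two terms to the
    count expected if they occurred independently across n contexts.
    Positive values indicate the terms are lexically associated."""
    expected = count_a * count_b / n
    if cooc == 0 or expected == 0:
        return 0.0
    return log(cooc / expected)
```

Such association weights can then modulate per-document term weights before the sentence similarity matrix is computed.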
ETPL
DM-126
A Segmentation and Graph-Based Video Sequence Matching Method for Video
Copy Detection
Abstract: We propose in this paper a segmentation and graph-based video sequence matching method
for video copy detection. Specifically, due to the good stability and discriminative ability of local
features, we use the SIFT descriptor for video content description. However, matching based on the
SIFT descriptor is computationally expensive due to the large number of points and their high dimensionality. Thus, to
reduce the computational complexity, we first use the dual-threshold method to segment the videos
into segments with homogeneous content and extract keyframes from each segment. SIFT features are
extracted from the keyframes of the segments. Then, we propose an SVD-based method to match two
video frames with SIFT point set descriptors. To obtain the video sequence matching result, we
propose a graph-based method. It can convert the video sequence matching into finding the longest
path in the frame matching-result graph with time constraint. Experimental results demonstrate that
the segmentation and graph-based video sequence matching method can detect video copies
effectively. Also, the proposed method has further advantages. Specifically, it can automatically find
the optimal sequence matching result from the disordered matching results based on spatial features. It can also
reduce the noise caused by spatial feature matching, and it is adaptive to video frame rate changes.
Experimental results also demonstrate that the proposed method can obtain a better tradeoff between
the effectiveness and the efficiency of video copy detection.
ETPL
DM-127 Cross-Domain Sentiment Classification Using a Sentiment Sensitive Thesaurus
Abstract: Automatic classification of sentiment is important for numerous applications such as opinion mining,
opinion summarization, contextual advertising, and market analysis. Typically, sentiment
classification has been modeled as the problem of training a binary classifier using reviews annotated
for positive or negative sentiment. However, sentiment is expressed differently in different domains,
and annotating corpora for every possible domain of interest is costly. Applying a sentiment classifier
trained using labeled data for a particular domain to classify sentiment of user reviews on a different
domain often results in poor performance because words that occur in the train (source) domain might
not appear in the test (target) domain. We propose a method to overcome this problem in cross-
domain sentiment classification. First, we create a sentiment sensitive distributional thesaurus using
labeled data for the source domains and unlabeled data for both source and target domains. Sentiment
sensitivity is achieved in the thesaurus by incorporating document level sentiment labels in the
context vectors used as the basis for measuring the distributional similarity between words. Next, we
use the created thesaurus to expand feature vectors during train and test times in a binary classifier.
The proposed method significantly outperforms numerous baselines and returns results that are
comparable with previously proposed cross-domain sentiment classification methods on a benchmark
data set containing Amazon user reviews for different types of products. We conduct an extensive
empirical analysis of the proposed method on single- and multisource domain adaptation,
unsupervised and supervised domain adaptation, and numerous similarity measures for creating the
sentiment sensitive thesaurus. Moreover, our comparisons against SentiWordNet, a lexical
resource for word polarity, show that the created sentiment-sensitive thesaurus accurately captures
words that express similar sentiments.
ETPL
DM-128
Determining k-Most Demanding Products with Maximum Expected
Number of Total Customers
Abstract: In this paper, a problem of production plans, named k-most demanding products (k-MDP)
discovering, is formulated. Given a set of customers demanding a certain type of products with
multiple attributes, a set of existing products of the type, a set of candidate products that can be
offered by a company, and a positive integer k, we want to help the company select k
products from the candidate products such that the expected number of the total customers for the
k products is maximized. We show the problem is NP-hard when the number of attributes for a
product is 3 or more. One greedy algorithm is proposed to find an approximate solution for the problem.
We also attempt to find the optimal solution of the problem by estimating the upper bound of the
expected number of the total customers for a set of k candidate products, reducing the search
space of the optimal solution. An exact algorithm is then provided to find the optimal solution of the
problem by using this pruning strategy. The experiment results demonstrate that both the efficiency
and memory requirement of the exact algorithm are comparable to those of the greedy algorithm, and
the greedy algorithm scales well with respect to k.
ETPL
DM-129
Dirichlet Process Mixture Model for Document Clustering with Feature
Partition
Abstract: Finding the appropriate number of clusters into which documents should be partitioned is
crucial in document clustering. In this paper, we propose a novel approach, namely DPMFP, to
discover the latent cluster structure based on the DPM model without requiring the number of clusters
as input. Document features are automatically partitioned into two groups, in particular,
discriminative words and nondiscriminative words, and contribute differently to document clustering.
A variational inference algorithm is investigated to infer the document collection structure as well as
the partition of document words at the same time. Our experiments indicate that our proposed
approach performs well on the synthetic data set as well as real data sets. The comparison between
our approach and state-of-the-art document clustering approaches shows that our approach is robust
and effective for document clustering.
ETPL
DM-130 Discriminative Nonnegative Spectral Clustering with Out-of-Sample Extension
Abstract: Data clustering is one of the fundamental research problems in data mining and machine
learning. Most of the existing clustering methods, for example, normalized cut and k-means, have
been suffering from the fact that their optimization processes normally lead to an NP-hard problem
due to the discretization of the elements in the cluster indicator matrix. A practical way to cope with
this problem is to relax this constraint to allow the elements to be continuous values. The eigenvalue
decomposition can be applied to generate a continuous solution, which has to be further discretized.
However, the continuous solution is probably mixed-signed. This result may cause it to deviate severely
from the true solution, which should be naturally nonnegative. In this paper, we propose a novel
clustering algorithm, i.e., discriminative nonnegative spectral clustering, to explicitly impose an
additional nonnegative constraint on the cluster indicator matrix to seek for a more interpretable
solution. Moreover, we show an effective regularization term which is able to not only provide more
useful discriminative information but also learn a mapping function to predict cluster labels for the
out-of-sample test data. Extensive experiments on various data sets illustrate the superiority of our
proposal compared to the state-of-the-art clustering algorithms.
ETPL
DM-131
Efficient Algorithms for Mining High Utility Itemsets from Transactional
Databases
Abstract: Mining high utility itemsets from a transactional database refers to the discovery of itemsets
with high utility like profits. Although a number of relevant algorithms have been proposed in recent
years, they incur the problem of producing a large number of candidate itemsets for high utility
itemsets. Such a large number of candidate itemsets degrades the mining performance in terms of
execution time and space requirement. The situation may become worse when the database contains
lots of long transactions or long high utility itemsets. In this paper, we propose two algorithms,
namely utility pattern growth (UP-Growth) and UP-Growth+, for mining high utility itemsets with a
set of effective strategies for pruning candidate itemsets. The information of high utility itemsets is
maintained in a tree-based data structure named utility pattern tree (UP-Tree) such that candidate
itemsets can be generated efficiently with only two scans of the database. The performance of UP-Growth
and UP-Growth+ is compared with the state-of-the-art algorithms on many types of both real and
synthetic data sets. Experimental results show that the proposed algorithms, especially UP-Growth+,
not only reduce the number of candidates effectively but also outperform other algorithms
substantially in terms of runtime, especially when databases contain lots of long transactions.
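The utility measure being mined can be stated in a few lines: the utility of an itemset is quantity times unit profit per item, summed over the transactions that contain the whole itemset. The sketch below illustrates only this measure (the transaction format and names are our assumptions); the UP-Tree structure and pruning strategies are the paper's actual contribution.

```python
def itemset_utility(transactions, unit_profit, itemset):
    """Total utility of an itemset: for each transaction containing
    every item of the set, add quantity * unit profit per item.
    transactions: list of {item: quantity} dicts."""
    total = 0
    for txn in transactions:
        if all(item in txn for item in itemset):
            total += sum(txn[item] * unit_profit[item] for item in itemset)
    return total
```

High utility itemset mining then asks for all itemsets whose total utility exceeds a user-given threshold.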
ETPL
DM-132
Entity Translation Mining from Comparable Corpora: Combining Graph
Mapping with Corpus Latent Features
Abstract: This paper addresses the problem of mining named entity translations from comparable
corpora, specifically, mining English and Chinese named entity translation. We first observe that
existing approaches use one or more of the following named entity similarity metrics: entity, entity
context, and relationship. Motivated by this observation, we propose a new holistic approach by 1)
combining all similarity types used and 2) additionally considering relationship context similarity
between pairs of named entities, a missing quadrant in the taxonomy of similarity metrics. We
abstract the named entity translation problem as the matching of two named entity graphs extracted
from the comparable corpora. Specifically, named entity graphs are first constructed from comparable
corpora to extract relationships between named entities. Entity similarity and entity context similarity
are then calculated from every pair of bilingual named entities. A reinforcing method is utilized to
reflect relationship similarity and relationship context similarity between named entities. We also
discover "latent" features lost in the graph extraction process and integrate this into our framework.
According to our experimental results, our holistic graph-based approach and its enhancement using
corpus latent features are highly effective and our framework significantly outperforms previous
approaches.
ETPL
DM-133 Harnessing Folksonomies to Produce a Social Classification of Resources
Abstract: In our daily lives, organizing resources like books or webpages into a set of categories to
ease future access is a common task. The usual size of these collections makes manual organization
a vast and expensive endeavor. As an approach to effectively produce an automated
classification of resources, we consider the immense amounts of annotations provided by users on
social tagging systems in the form of bookmarks. In this paper, we deal with the utilization of these
user-provided tags to perform a social classification of resources. For this purpose, we have created
three large-scale social tagging data sets including tagging data for different types of resources,
webpages and books. Those resources are accompanied by categorization data from sound expert-
driven taxonomies. We analyze the characteristics of the three social tagging systems and perform an
analysis on the usefulness of social tags to perform a social classification of resources that resembles
the classification by experts as much as possible. We analyze six different representations using tags
and compare them to other data sources by using three different settings of SVM classifiers. Finally, we
explore combinations of different data sources with tags using classifier committees to best classify
the resources.
ETPL
DM-134 Optimizing Multi-Top-k Queries over Uncertain Data Streams
Abstract: Query processing over uncertain data streams, in particular top-k query processing, has
become increasingly important due to its wide application in many fields such as sensor network
monitoring and Internet traffic control. In many real applications, multiple top-k queries are
registered in the system. Sharing the results of these queries is a key factor in saving computation
cost and providing real-time responses. However, due to the complex semantics of uncertain top-k
query processing, it is nontrivial to implement sharing among different top-k queries, and few
works have addressed the sharing issue. In this paper, we formulate various types of sharing among
multiple top-k queries over uncertain data streams, based on the frequency upper bound of each
top-k query. We present an optimal dynamic programming solution as well as a more efficient (in
terms of time and space complexity) greedy algorithm to compute an execution plan for the registered
queries that saves computation cost across them. Experiments have demonstrated that the
greedy algorithm can find the optimal solution in most cases, and it can almost achieve the same
performance (in terms of latency and throughput) as the dynamic programming approach.
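The core sharing idea can be sketched in miniature: several registered top-k queries are all served from one shared buffer of size max(k), so the stream is scanned once rather than once per query. This deliberately ignores the uncertain-stream semantics and frequency upper bounds central to the paper; scores and items below are hypothetical.

```python
import heapq

class SharedTopK:
    """Serve several registered top-k queries from one shared buffer of size max(k)."""
    def __init__(self, ks):
        self.ks = ks
        self.kmax = max(ks)
        self.heap = []  # min-heap of (score, item), size <= kmax

    def insert(self, item, score):
        if len(self.heap) < self.kmax:
            heapq.heappush(self.heap, (score, item))
        elif score > self.heap[0][0]:
            # New item beats the weakest retained one; swap it in.
            heapq.heapreplace(self.heap, (score, item))

    def answer(self, k):
        """Answer any registered top-k query as a prefix of the shared buffer."""
        assert k in self.ks
        top = sorted(self.heap, reverse=True)[:k]
        return [item for score, item in top]

stream = [("a", 0.9), ("b", 0.4), ("c", 0.7), ("d", 0.95), ("e", 0.2)]
shared = SharedTopK(ks=[1, 3])
for item, score in stream:
    shared.insert(item, score)
print(shared.answer(1))  # ['d']
print(shared.answer(3))  # ['d', 'a', 'c']
```

The paper's contribution lies in deciding *which* queries should share when their semantics differ; this sketch shows only the payoff of sharing once a plan groups them together.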
ETPL
DM-135 Prequery Discovery of Domain-Specific Query Forms: A Survey
Abstract: The discovery of HTML query forms is one of the main challenges in Deep web crawling.
Automatic solutions for this problem perform two main tasks. The first is locating HTML forms on
the web, which is done through the use of traditional/focused crawlers. The second is identifying
which of these forms are indeed meant for querying, which also typically involves determining a
domain for the underlying data source (and thus for the form as well). This problem has attracted a
great deal of interest, resulting in a long list of algorithms and techniques. Some methods submit
requests through the forms and then analyze the data retrieved in response, typically requiring a great
deal of knowledge about the domain as well as semantic processing. Others do not employ form
submission, to avoid such difficulties, although some techniques rely to some extent on semantics and
domain knowledge. This survey gives an up-to-date review of methods for the discovery of domain-
specific query forms that do not involve form submission. We detail these methods and discuss how
form discovery has become increasingly more automated over time. We conclude with a forecast of
what we believe are the immediate next steps in this trend.
ETPL
DM-136 Preventing Private Information Inference Attacks on Social Networks
Abstract: Online social networks, such as Facebook, are increasingly utilized by many people. These
networks allow users to publish details about themselves and to connect to their friends. Some of the
information revealed inside these networks is meant to be private. Yet it is possible to use learning
algorithms on released data to predict private information. In this paper, we explore how to launch
inference attacks using released social networking data to predict private information. We then devise
three possible sanitization techniques that could be used in various situations. Next, we explore the
effectiveness of these techniques and attempt to use methods of collective inference to discover
sensitive attributes of the data set. We show that we can decrease the effectiveness of both local and
relational classification algorithms by using the sanitization methods described.
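A toy version of a relational inference attack, and of one link-based sanitization response, can be sketched as follows. The user names, attribute values, and the specific strategy of dropping revealing friendship links are illustrative only; they are not the paper's three techniques.

```python
from collections import Counter

def infer_attribute(user, friends_of, public_attr):
    """Relational inference: predict a user's hidden attribute by majority vote over friends."""
    votes = Counter(public_attr[f] for f in friends_of[user] if f in public_attr)
    return votes.most_common(1)[0][0] if votes else None

def sanitize_links(friends_of, public_attr, sensitive_value):
    """Link sanitization: drop friendship links to users who publicly reveal the sensitive value."""
    return {u: [f for f in fs if public_attr.get(f) != sensitive_value]
            for u, fs in friends_of.items()}

friends_of = {"alice": ["bob", "carol", "dave"]}
public_attr = {"bob": "party_X", "carol": "party_X", "dave": "party_Y"}

print(infer_attribute("alice", friends_of, public_attr))  # party_X: the attack succeeds
cleaned = sanitize_links(friends_of, public_attr, "party_X")
print(infer_attribute("alice", cleaned, public_attr))     # prediction degraded after sanitization
```

Even this toy example shows the tension the paper studies: removing links degrades the attacker's relational classifier, but also removes genuine utility from the released graph.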
ETPL
DM-137 Principal Composite Kernel Feature Analysis: Data-Dependent Kernel Approach
Abstract: Principal composite kernel feature analysis (PC-KFA) is presented to show kernel
adaptations for nonlinear features of medical image data sets (MIDS) in computer-aided diagnosis
(CAD). The proposed algorithm PC-KFA has extended the existing studies on kernel feature analysis
(KFA), which extracts salient features from a sample of unclassified patterns by use of a kernel
method. The principal composite process for PC-KFA herein has been applied to kernel principal
component analysis [34] and to our previously developed accelerated kernel feature analysis [20].
Unlike other kernel-based feature selection algorithms, PC-KFA iteratively constructs a linear
subspace of a high-dimensional feature space by maximizing a variance condition for the nonlinearly
transformed samples, which we call the data-dependent kernel approach. The resulting kernel subspace
can be first chosen by principal component analysis, and then be processed for composite kernel
subspace through the efficient combination representations used for further reconstruction and
classification. Numerical experiments based on several MID feature spaces of cancer CAD data have
shown that PC-KFA generates an efficient and effective feature representation, and yields better
classification performance for the proposed composite kernel subspace using a simple pattern
classifier.
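The kernel feature analysis underlying PC-KFA builds on kernel PCA, which can be sketched in pure Python: form an RBF kernel matrix, center it in feature space, and extract the leading component by power iteration. This is standard kernel PCA, not the composite-kernel construction of PC-KFA itself; the data points and gamma are hypothetical.

```python
import math, random

def rbf(x, y, gamma=1.0):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def centered_kernel_matrix(X, gamma=1.0):
    """Kernel matrix centered in feature space: K - 1K/n - K1/n + 1K1/n^2."""
    n = len(X)
    K = [[rbf(X[i], X[j], gamma) for j in range(n)] for i in range(n)]
    row = [sum(K[i]) / n for i in range(n)]
    tot = sum(row) / n
    return [[K[i][j] - row[i] - row[j] + tot for j in range(n)] for i in range(n)]

def top_eigenvector(K, iters=200):
    """Power iteration for the coefficient vector of the first kernel principal component."""
    n = len(K)
    random.seed(0)
    v = [random.random() for _ in range(n)]
    for _ in range(iters):
        w = [sum(K[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Two well-separated groups; their first kernel PC should separate them.
X = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
K = centered_kernel_matrix(X)
alpha = top_eigenvector(K)
# Projection of sample i onto the first kernel PC: sum_j alpha[j] * K[i][j].
proj = [sum(alpha[j] * K[i][j] for j in range(len(X))) for i in range(len(X))]
print([round(p, 3) for p in proj])
```

In the projections, the two points in each group share a sign and the groups take opposite signs, which is the nonlinear separation that kernel feature analysis exploits for CAD feature spaces.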
ETPL
DM-138 Revealing Density-Based Clustering Structure from the Core-Connected Tree of a Network
Abstract: Clustering is an important technique for mining the intrinsic community structures in
networks. The density-based network clustering method is able to not only detect communities of
arbitrary size and shape, but also identify hubs and outliers. However, it requires manual parameter
specification to define clusters, and it is sensitive to the density threshold parameter, which is
difficult to determine. Furthermore, many real-world networks exhibit a hierarchical structure with
communities embedded within other communities. Therefore, the clustering result of a global
parameter setting cannot always describe the intrinsic clustering structure accurately. In this paper, we
introduce a novel density-based network clustering method, called graph-skeleton-based clustering
(gSkeletonClu). By projecting an undirected network to its core-connected maximal spanning tree, the
clustering problem can be converted to detecting core-connectivity components on the tree. Both the
density-based clustering for a specific parameter setting and the hierarchical clustering structure can
be efficiently extracted from the tree. Moreover, the method provides a convenient way to
automatically select the parameter and to obtain a meaningful cluster tree for a network. Extensive
experiments on both
real-world and synthetic networks demonstrate the superior performance of gSkeletonClu for
effective and efficient density-based clustering.
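The tree-based view can be sketched as follows: build a maximal spanning tree over similarity-weighted edges (Kruskal on descending weight), then cut tree edges below a density threshold so the remaining components form the clusters. The edge weights here are given directly as a stand-in for the structural similarity gSkeletonClu computes, and the nodes are hypothetical.

```python
class DSU:
    """Disjoint-set union for Kruskal's algorithm and component extraction."""
    def __init__(self, nodes):
        self.p = {n: n for n in nodes}
    def find(self, x):
        while self.p[x] != x:
            self.p[x] = self.p[self.p[x]]
            x = self.p[x]
        return x
    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.p[ra] = rb
            return True
        return False

def maximal_spanning_tree(nodes, edges):
    """Kruskal on descending similarity: keep the strongest edge joining each pair of components."""
    dsu = DSU(nodes)
    return [e for e in sorted(edges, key=lambda e: -e[2]) if dsu.union(e[0], e[1])]

def clusters_at_threshold(nodes, tree, eps):
    """Cut tree edges with similarity below eps; remaining components are the clusters."""
    dsu = DSU(nodes)
    for u, v, w in tree:
        if w >= eps:
            dsu.union(u, v)
    groups = {}
    for n in nodes:
        groups.setdefault(dsu.find(n), set()).add(n)
    return sorted(groups.values(), key=min)

nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "b", 0.9), ("b", "c", 0.8), ("c", "d", 0.2), ("d", "e", 0.85), ("a", "c", 0.7)]
tree = maximal_spanning_tree(nodes, edges)
print(clusters_at_threshold(nodes, tree, eps=0.5))  # [{'a', 'b', 'c'}, {'d', 'e'}]
```

Sweeping eps over the tree's edge weights yields the whole hierarchy at once (eps=0.1 merges everything into one cluster), which is why the spanning tree supports both a fixed-parameter clustering and the full cluster tree.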
ETPL
DM-139 Secure Provenance Transmission for Streaming Data
Abstract: Many application domains, such as real-time financial analysis, e-healthcare systems, and
sensor networks, are characterized by continuous data streaming from multiple sources and through
intermediate processing by multiple aggregators. Keeping track of data provenance in such a highly
dynamic context is an important requirement, since data provenance is a key factor in assessing data
trustworthiness, which is crucial for many applications. Provenance management for streaming data
requires addressing several challenges, including the assurance of high processing throughput, low
bandwidth consumption, storage efficiency, and secure transmission. In this paper, we propose a novel
approach to securely transmit provenance for streaming data (focusing on sensor networks) by
embedding provenance into the interpacket timing domain while addressing the above-mentioned
issues. As the provenance is hidden in another host medium, our solution can be conceptualized as a
watermarking technique. However, unlike traditional watermarking approaches, we embed
provenance over the interpacket delays (IPDs) rather than in the sensor data themselves, hence
avoiding the problem of data degradation due to watermarking. Provenance is extracted by the data
receiver utilizing an optimal threshold-based mechanism which minimizes the probability of
provenance decoding errors. The resiliency of the scheme against outside and inside attackers is
established through an extensive security analysis. Experiments show that our technique can recover
provenance up to a certain level against perturbations to inter-packet timing characteristics.
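The timing-channel idea can be sketched minimally: each provenance bit widens or leaves alone one inter-packet delay, and the receiver decodes with a threshold. The base delay, widening delta, threshold, and jitter values are hypothetical; the paper derives an optimal threshold and a far more robust embedding.

```python
def embed_provenance(bits, base_delay=10.0, delta=4.0):
    """Encode each provenance bit in an inter-packet delay: short gap = 0, long gap = 1."""
    return [base_delay + (delta if b else 0.0) for b in bits]

def extract_provenance(ipds, threshold=12.0):
    """Threshold-based decoding at the data receiver."""
    return [1 if d > threshold else 0 for d in ipds]

provenance = [1, 0, 1, 1, 0]
ipds = embed_provenance(provenance)
# Simulate mild network jitter perturbing the timing channel.
noisy = [d + j for d, j in zip(ipds, [0.6, -0.4, 1.1, -0.9, 0.3])]
print(extract_provenance(noisy))  # [1, 0, 1, 1, 0]
```

Because the bits ride on delays rather than on the sensor readings, the data values themselves are never degraded; resilience then hinges on how much timing perturbation the threshold can absorb.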
ETPL
DM-140 The Adaptive Clustering Method for the Long Tail Problem of Recommender Systems
Abstract: This is a study of the long tail problem of recommender systems, in which many items in
the long tail have only a few ratings, making them hard to use in recommendations. The approach
presented in this paper clusters items according to their popularity, so that recommendations for tail
items are based on the ratings of more intensively clustered groups, while those for head items are
based on the ratings of individual items or of groups clustered to a lesser extent. We
apply this method to two real-life data sets and compare the results with those of the nongrouping and
fully grouped methods in terms of recommendation accuracy and scalability. The results show that if
such adaptive clustering is done properly, this method reduces the recommendation error rates for the
tail items, while maintaining reasonable computational performance.
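The head/tail split can be sketched as follows: items with enough ratings keep an individual estimate, while sparse tail items fall back to a pooled estimate over their cluster. The cluster assignments are supplied externally here as a stub; the adaptive clustering of tail items is exactly what the paper contributes, and the items and thresholds are hypothetical.

```python
from statistics import mean

def build_predictor(ratings, item_cluster, head_threshold=3):
    """ratings: {item: [rating, ...]}. Head items (>= head_threshold ratings) keep their
    own mean; tail items fall back to the pooled mean of their cluster's ratings."""
    cluster_pool = {}
    for item, rs in ratings.items():
        if len(rs) < head_threshold:
            cluster_pool.setdefault(item_cluster[item], []).extend(rs)

    def predict(item):
        rs = ratings.get(item, [])
        if len(rs) >= head_threshold:
            return mean(rs)                              # head: individual estimate
        return mean(cluster_pool[item_cluster[item]])    # tail: group estimate

    return predict

ratings = {"hit": [5, 4, 5, 4], "rare1": [5], "rare2": [4], "rare3": [4, 4]}
item_cluster = {"hit": 0, "rare1": 1, "rare2": 1, "rare3": 1}
predict = build_predictor(ratings, item_cluster)
print(predict("hit"))    # 4.5  (own mean, 4 ratings)
print(predict("rare1"))  # 4.25 (pooled over the tail cluster's ratings)
```

Pooling trades bias for variance: a tail item's one or two ratings are too noisy to trust, so borrowing from similar items lowers the error, which is the effect the paper measures.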
ETPL
DM-141 VChunkJoin: An Efficient Algorithm for Edit Similarity Joins
Abstract: Similarity joins play an important role in many application areas, such as data integration
and cleaning, record linkage, and pattern recognition. In this paper, we study efficient algorithms for
similarity joins with an edit distance constraint. Currently, the most prevalent approach is based on
extracting overlapping grams from strings and considering only strings that share a certain number of
grams as candidates. Unlike these existing approaches, we propose a novel approach to edit similarity
join based on extracting nonoverlapping substrings, or chunks, from strings. We propose a class of
chunking schemes based on the notion of tail-restricted chunk boundary dictionary. A new algorithm,
VChunkJoin, is designed by integrating existing filtering methods and several new filters unique to
our chunk-based method. We also design a greedy algorithm to automatically select a good chunking
scheme for a given data set. We demonstrate experimentally that the new algorithm is faster than
alternative methods yet occupies less space.
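The chunk-based idea can be sketched with a deliberately simple boundary dictionary: split each string into nonoverlapping chunks ending at vowel positions, then require candidates to share enough chunks given that each edit operation destroys at most a bounded number of chunks. The single-character dictionary and the destruction bound of 2 are simplifying assumptions; the paper's tail-restricted chunk boundary dictionaries and filters are more general and tighter.

```python
def chunks(s, boundary_chars=frozenset("aeiou")):
    """Split s into nonoverlapping chunks, ending each chunk right after a boundary character.
    (A single-character chunk boundary dictionary, for illustration.)"""
    out, start = [], 0
    for i, ch in enumerate(s):
        if ch in boundary_chars:
            out.append(s[start:i + 1])
            start = i + 1
    if start < len(s):
        out.append(s[start:])  # trailing chunk with no boundary character
    return out

def is_candidate(s, t, tau, max_destroyed_per_edit=2):
    """Count filter: strings within edit distance tau must still share enough chunks,
    assuming each edit destroys at most max_destroyed_per_edit chunks (simplified bound)."""
    cs, ct = chunks(s), chunks(t)
    shared = len(set(cs) & set(ct))
    needed = max(len(cs), len(ct)) - max_destroyed_per_edit * tau
    return shared >= needed

print(chunks("similarity"))                             # ['si', 'mi', 'la', 'ri', 'ty']
print(is_candidate("similarity", "similarly", tau=2))   # True: survives the filter
```

Because chunks never overlap, a string of length n yields far fewer chunks than overlapping q-grams, which is the source of the space savings the paper reports.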