DT09 Abstracts 29 · 2018-05-25 · DT09 Abstracts 29 IP1 A Geometric Perspective on Machine...

DT09 Abstracts 29

IP1

A Geometric Perspective on Machine Learning and Data Mining

Increasingly, we face machine learning problems in very high dimensional spaces. We proceed with the intuition that although natural data lives in very high dimensions, it has relatively few degrees of freedom. One way to formalize this intuition is to model the data as lying on or near a low dimensional manifold embedded in the high dimensional space. This point of view leads to a new class of algorithms that are “manifold motivated” and a new set of theoretical questions that surround their analysis. A central construction in these algorithms is a graph or simplicial complex that is data-derived and we will relate the geometry of these to the geometry of the underlying manifold. Applications to embedding, clustering, classification, and semi-supervised learning will be considered.

Partha Niyogi
University of Chicago
Department of Computer [email protected]

IP2

Automated Learning and Data Visualization

Automated numeric methods of data mining, statistics, and machine learning adapt themselves to systematic patterns in data to carry out predictive tasks, or to describe the patterns in a way that provides fundamental understanding. Data visualization is critical in all phases of the analysis of data, from the moment of arrival when data checking and cleaning are needed, to the final presentation of results. Visualization allows us to learn which patterns occur out of an immensely broad collection of possible patterns; it is difficult to select and carry out, a priori, automated learning methods to cover nearly as broad a collection of possibilities. It is widely accepted that an effective knowledge of patterns is necessary for fundamental understanding. But the knowledge can be of immense benefit for predictive tasks as well because it gives us valuable information about which automated numeric methods will likely produce best performance. Selecting best automated methods by trying a number of them in a training-test framework runs the risk of simply finding the best among a collection of poor performers. So visualization supports the automated methods. But the reverse is true, too. It is difficult to make progress just displaying raw data without the benefit of automated methods that provide fits to patterns, which are then displayed, and provide displays of remaining variation in the data after adjusting for the fits. Automation and visualization are symbiotic. Today, an immense challenge to data visualization, as it is to all technical areas of data analysis, is the rapid expansion in the size and complexity of datasets. This should not deter our commitment to an understanding of patterns in data, but does require new frameworks for how we approach data visualization. One such framework is visualization databases; for a single complex dataset, it consists of a large number of displays, many of which consist of many pages. The displays become a new database that is queried and studied on an as-needed basis. Production, management, and viewing a visualization database need many new ideas. For example, methods are needed for view selection to populate the database when the number of views can be millions or more. Examples are statistical sampling methods that find a representative collection of views, and automation algorithms that find interesting views by searching for certain patterns.

William S. Cleveland
Purdue University
Department of [email protected]

IP3

Semantics on the Web: How Do We Get There?

It is becoming increasingly clear that the next generation of web search and advertising will rely on a deeper understanding of user intent and task modeling, and a correspondingly richer interpretation of content on the web. How we get there, in particular, how we understand web content in richer terms than bags of words and links, is a wide open and fascinating question. I will discuss some of the options here, and look closely at the role that information extraction can play.

Raghu Ramakrishnan
Yahoo! [email protected]

IP4

Applied Nonparametric Bayes

Computer Science has historically been strong on data structures and weak on inference from data, whereas Statistics has historically been weak on data structures and strong on inference from data. One way to draw on the strengths of both disciplines is to develop “inferential methods for data structures”; i.e., methods that are based on probability distributions on recursively-defined objects such as trees, graphs, grammars and function calls. This is accommodated in the world of “nonparametric Bayes,” where prior and posterior distributions are allowed to be general stochastic processes. In this talk I discuss a variety of applied problems that are naturally tackled from this point of view. I will discuss nonparametric Bayesian solutions to problems in natural language parsing, computational vision, information retrieval, statistical genetics and protein structural modeling.

Michael I. Jordan
University of California, [email protected]

CP1

GAD: General Activity Detection for Fast Clustering on Large Data

Abstract not available at time of publication.

Xin Jin, Sangkyum Kim, Jiawei Han, Liangliang Cao, Zhijun Yin
University of Illinois at [email protected], [email protected], [email protected], [email protected], [email protected]

CP1

Hybrid Clustering of Text Mining and Bibliometrics Applied to Journal Sets

A new hybrid clustering framework integrating text mining and bibliometrics is proposed. We propose a novel adaptive kernel K-means clustering algorithm to combine textual content and citation information for clustering. Based on several validation indices, the experimental results, on a clustering problem of 1869 journals published in 2002-2006, demonstrate that our hybrid clustering strategy is able to cluster as well as the best individual data source.

Xinhai Liu
K.U. Leuven, ESAT-SCD
Wuhan University of Science and Technology, College [email protected]

Shi Yu, Yves Moreau, Bart De Moor
K.U. Leuven, Dept. of Electrical Engineering [email protected], [email protected], [email protected]

Wolfgang Glanzel
K.U. Leuven, Steunpunt O&O [email protected]

Frizo Janssens
K.U. Leuven, Dept. of Electrical Engineering ESAT-SCD
Attentio Company in [email protected]

CP1

Constraint-Based Subspace Clustering

Since the performance of traditional clustering algorithms decreases in high-dimensional data, subspace clustering techniques, which compute clusters in subsets of dimensions, have been developed. Nevertheless, due to the huge number of subspaces to consider, they often lack efficiency. In this paper we integrate background knowledge and, in particular, instance-level constraints into subspace clustering techniques and show experimentally that this increases not only the efficiency of the techniques but also the accuracy of the resultant clustering.

Elisa Fromont
Universite de Lyon, Laboratoire Hubert-Curien
UMR CNRS [email protected]

Adriana Prado
University of [email protected]

Celine Robardet
INSA-Lyon, LIRIS UMR5205, F-69621 Villeurbanne, [email protected]

CP1

CORE: Nonparametric Clustering of Large Numeric Datasets

We propose CORE, a new nonparametric clustering technique that explicitly computes local maxima of the density and represents them with cores. CORE uses an adaptive grid and gradients to define and compute cores of clusters. The incrementally constructed adaptive grid and gradients make the identification of cores robust, scalable, and independent of density fluctuations. The experimental studies show that CORE, without any model parameters, produces better quality clustering than related techniques and is efficient for large datasets.

Andrej Taliun
Free University of [email protected]

Michael Bohlen
Free University of [email protected]

Arturas Mazeika
Max-Planck-Institut fur [email protected]

CP1

Integrated KL (K-Means - Laplacian) Clustering: A New Clustering Approach by Combining Attribute Data and Pairwise Relations

Most datasets in real applications come from multiple sources. As a result, we often have attribute information about data objects and various pairwise relations (similarity) between data objects. Traditional clustering algorithms use either data attributes only or pairwise similarity only. We propose to combine K-means clustering on data attributes and normalized cut spectral clustering on pairwise relations. We show that these two methods can be coherently integrated to make use of different data sources to obtain good clustering results. We also show that our integrated KL (K-means - Laplacian) clustering method can be naturally extended to semi-supervised clustering, data embedding and metric learning. Finally, experimental results on benchmark data sets are presented to show the effectiveness of our method.

Fei Wang
School of CIS, [email protected]

Chris Ding
University of Texas at [email protected]

Tao Li
Florida International [email protected]

CP2

Proximity-Based Anomaly Detection Using Sparse Structure Learning

We propose a new anomaly detection framework for correlation anomalies in highly noisy multivariate data. We show that fitting a sparse graphical model to the data is extremely useful to capture meaningful correlation changes. We then define the correlation anomaly scores by evaluating the distances between the fitted conditional distributions. Using real-world data, we demonstrate that our matrix-based sparse structure learning approach successfully detects correlation anomalies under collinearities and heavy noise.

Tsuyoshi Ide
IBM Research
Tokyo Research [email protected]

Aurelie Lozano, Naoki Abe, Yan Liu
IBM Research
T. J. Watson Research [email protected], [email protected], [email protected]

CP2

FuncICA for Time Series Pattern Discovery

FuncICA is a new independent component analysis method for pattern discovery in functional data like time series. FuncICA is an analog to functional PCA; instead of extracting components to minimize L2 loss, we maximize independence of optimally-smoothed components over functional observations. Results for synthetic, gene expression, and electroencephalographic event-related potential data show FuncICA recovers scientific phenomena and improves classification accuracy. We conclude with a novel framework for fMRI data analysis using FuncICA.

Nishant A. Mehta
College of Computing
Georgia Institute of [email protected]

Alexander Gray
Georgia Institute of [email protected]

CP2

Event Discovery in Time Series

The discovery of events in time series can have important implications, such as identifying microlensing events in astronomical surveys. In this work, we develop probability models for calculating the significance of an arbitrary-sized sliding window and use these probabilities to find areas of significance. We apply our method to over 100,000 astronomical time series from the MACHO survey. In addition to successfully identifying known events, we were able to identify events that do not pass traditional event discovery procedures.

Dan R. Preston
Initiative in Innovative Computing, Harvard University
Department of Computer Science, Tufts [email protected]

Pavlos Protopapas
Initiative in Innovative Computing, Harvard University
Harvard-Smithsonian Center for [email protected]

Carla Brodley
Department of Computer Science, Tufts [email protected]

CP2

Optimal Distance Bounds on Time-Series Data

We present new mechanisms for very fast search operations over compressed time-series data, with specific focus on weblog data. An important contribution of this work is the derivation of optimally tight bounds on the Euclidean distance estimation between compressed sequences. Since our methodology is applicable to sequential data in general, the proposed technique is of independent interest. Additionally, our distance estimation strategy is not tied to a specific compression methodology, but can be applied on top of any orthonormal based compression technique (Fourier, Wavelet, PCA, etc.). The experimental results indicate that the new optimal bounds lead to a significant improvement in the pruning power of search compared to previous state-of-the-art, in many cases eliminating more than 80% of the candidate search sequences.
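Why an orthonormal transform permits distance bounds at all can be illustrated with a small sketch. It is not the optimally tight bound derived in this work: it is the elementary triangle-inequality bound, and it assumes (as an illustration only) that both sequences keep the same leading Haar coefficients plus the energy of the discarded ones. The function names `haar`, `compress`, and `distance_bounds` are hypothetical.

```python
import math

def haar(x):
    """Orthonormal Haar transform; len(x) must be a power of two."""
    c = [float(v) for v in x]
    n = len(c)
    while n > 1:
        half = n // 2
        sums = [(c[2 * i] + c[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        diffs = [(c[2 * i] - c[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        c[:n] = sums + diffs  # coarse part first, detail coefficients after it
        n = half
    return c

def compress(x, k):
    """Keep the first k coefficients and the energy of everything dropped."""
    c = haar(x)
    return c[:k], sum(v * v for v in c[k:])

def distance_bounds(a, b):
    """Lower and upper bounds on the Euclidean distance of the originals."""
    (kept_a, res_a), (kept_b, res_b) = a, b
    d_kept = sum((u - v) ** 2 for u, v in zip(kept_a, kept_b))
    # The dropped parts have norms sqrt(res_a) and sqrt(res_b); their distance
    # lies between the difference and the sum of those norms.
    lower = math.sqrt(d_kept + (math.sqrt(res_a) - math.sqrt(res_b)) ** 2)
    upper = math.sqrt(d_kept + (math.sqrt(res_a) + math.sqrt(res_b)) ** 2)
    return lower, upper

x = [1, 3, 2, 5, 4, 4, 0, 1]
y = [2, 2, 2, 4, 5, 3, 1, 1]
lower, upper = distance_bounds(compress(x, 3), compress(y, 3))
```

A search procedure can discard a candidate whenever its lower bound already exceeds the best distance found so far, which is the pruning effect the abstract refers to.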

Michail Vlachos
IBM [email protected]

Serdar Kozat
Koc University
Istanbul, [email protected]

Philip S Yu
University of [email protected]

CP2

Autocannibalistic and Anyspace Indexing Algorithms with Applications to Sensor Data Mining

Efficient indexing is at the heart of many data mining algorithms. A simple and extremely effective algorithm for indexing under any metric space was introduced in 1991 by Orchard. Orchard’s algorithm has not received much attention in the data mining and database community because of a fatal flaw: it requires quadratic space. In this work we show that we can produce a reduced version of Orchard’s algorithm that requires much less space, but produces nearly identical speedup. We achieve this by casting the algorithm in an anyspace framework, allowing deployed applications to take as much of an index as their main memory/sensor can afford.
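For readers unfamiliar with the baseline, the following is a minimal sketch of Orchard's nearest-neighbour search in its commonly described form, which makes the quadratic space cost visible: every item stores a list of all other items sorted by distance. It does not implement the reduced, anyspace variant proposed here. The names `build_index` and `nearest` are hypothetical, and `dist` is assumed to be a true metric.

```python
def build_index(items, dist):
    """For each item, all other items sorted by distance (quadratic space)."""
    index = []
    for i, a in enumerate(items):
        row = sorted((dist(a, b), j) for j, b in enumerate(items) if j != i)
        index.append(row)
    return index

def nearest(query, items, index, dist, start=0):
    """Return (position, distance) of the item nearest to query."""
    best = start
    best_d = dist(query, items[best])
    improved = True
    while improved:
        improved = False
        for d_to_best, j in index[best]:
            # Triangle inequality: an item at least 2 * best_d away from the
            # current best cannot be closer to the query than the best is,
            # and neither can any item later in the sorted list.
            if d_to_best >= 2 * best_d:
                break
            d = dist(query, items[j])
            if d < best_d:
                best, best_d = j, d
                improved = True
                break  # restart the scan from the new best item's list
    return best, best_d

def abs_diff(a, b):
    return abs(a - b)

points = [0.0, 1.0, 4.0, 9.0, 16.0]
idx = build_index(points, abs_diff)
found = nearest(7.0, points, idx, abs_diff)
```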

Lexiang Ye, Xiaoyue Wang, Eamonn Keogh
University of California, [email protected], [email protected], [email protected]

Agenor Mafra-Neto
ISCA [email protected]

CP3

Prior-Free Rare Category Detection

Rare category detection is an open challenge in machine learning. In this paper, we propose a new method for rare category detection named SEDER, which requires no prior information about the data set. It implicitly performs semiparametric density estimation using specially designed exponential families, and then picks the examples for labeling where the neighborhood density changes the most. Experimental results on both synthetic and real data sets demonstrate the superiority of SEDER.

Jingrui He
Machine Learning Department
Carnegie Mellon [email protected]

Jaime Carbonell
Language Technologies Institute
Carnegie Mellon [email protected]


CP3

Learning Random-Walk Kernels for Protein Remote Homology Identification and Motif Discovery

It is very difficult to choose the optimal number of random steps in random-walk kernels. In this paper, we will discuss how to better identify protein remote homology than any other algorithm using a learned random-walk kernel based on a positive linear combination of random-walk kernels with different random steps, which leads to a convex combination of kernels. The resulting kernel has much better prediction performance than the state-of-the-art profile kernel for protein remote homology identification. Moreover, our approach based on learned random-walk kernels can effectively identify meaningful protein sequence motifs that are responsible for discriminating the memberships of protein sequences’ remote homology in SCOP.

Renqiang Min
Dept Computer Science
University of [email protected]

Rui Kuang
Dept Computer Science
University of [email protected]

Anthony Bonner
Dept Computer Science
University of [email protected]

Zhaolei Zhang
Banting and Best Dept of Medical Research
University of [email protected]

CP3

Application of Bayesian Partition Models in Warranty Data Analysis

Warranty data analysis helps automotive engineers in their task of resolving manufacturing or design related quality issues. In this contribution we outline how Bayesian partition models can be integrated with interactive decision trees to support root cause investigations. Our approach considers taxonomies and identifies the most likely, semantically meaningful partitions that are close to the concept that actually caused a quality issue. Real-world case studies illustrate how the approach is applied in practice.

Markus Mueller, Christoph Schlieder
University of [email protected], [email protected]

Axel Blumenstock
Daimler [email protected]

CP3

A Family of Large Margin Linear Classifiers and Its Application in Dynamic Environments

We combine regularization mechanisms with online large margin learning algorithms to learn robust classifiers in nonstationary environments. We prove bounds on their error and show that removing features with small weights has little influence on the accuracy, suggesting that these methods exhibit feature selection ability. We show that such regularized learning algorithms automatically decrease the influence of the old training instances and focus on the more recent ones.

Jianqiang Shen, Thomas Dietterich
School of EECS, Oregon State [email protected], [email protected]

CP3

Outlier Detection with Globally Optimal Exemplar-Based GMM

Outlier detection has recently become an important problem in many data mining applications. In this paper, a novel unsupervised algorithm for outlier detection is proposed. First we apply a provably globally optimal Expectation Maximization (EM) algorithm to fit a Gaussian Mixture Model (GMM) to a given data set. In our approach, a Gaussian is centered at each data point, and hence, the estimated mixture proportions can be interpreted as probabilities of being a cluster center for all data points. The outlier factor at each data point is then defined as a weighted sum of the mixture proportions with weights representing the similarities to other data points. The proposed outlier factor is thus based on global properties of the data set. This is in contrast to most existing approaches to outlier detection, which are strictly local. Our experiments performed on several simulated and real life data sets demonstrate superior performance of the proposed approach. Moreover, we also demonstrate the ability to detect unusual shapes.
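The construction described above can be sketched in a few lines, under assumptions that are not taken from the paper: every component is an isotropic Gaussian with one shared bandwidth `sigma`, only the mixture proportions are updated by EM, and the similarity weights in the outlier factor reuse the same Gaussian kernel. The function name and parameters are hypothetical. With these choices a low factor marks a likely outlier.

```python
import math

def exemplar_gmm_outlier_factors(points, sigma=1.0, iters=100):
    """Return (mixture proportions, outlier factors) for a list of points."""
    n = len(points)

    def sqdist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    # k[i][j]: unnormalised density of point i under the Gaussian centred at
    # point j. The shared normalising constant cancels in the EM update.
    k = [[math.exp(-sqdist(p, c) / (2 * sigma ** 2)) for c in points]
         for p in points]

    q = [1.0 / n] * n  # mixture proportions, one per exemplar
    for _ in range(iters):
        # E-step denominator: mixture density at each point.
        mix = [sum(q[j] * k[i][j] for j in range(n)) for i in range(n)]
        # M-step: average responsibility of component j over all points.
        q = [q[j] * sum(k[i][j] / mix[i] for i in range(n)) / n
             for j in range(n)]

    # Outlier factor: similarity-weighted sum of the mixture proportions.
    factors = [sum(k[i][j] * q[j] for j in range(n)) for i in range(n)]
    return q, factors

data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
proportions, factors = exemplar_gmm_outlier_factors(data)
most_outlying = min(range(len(data)), key=lambda i: factors[i])
```

Because the component means and variances are fixed, the log-likelihood is concave in the proportions, which is one way to see why a globally optimal fit is attainable in this setting.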

Xingwei Yang
Department of Computer and Information Science
Temple [email protected]

Longin Jan Latecki
Temple University
Department of Computer and Information [email protected]

Dragoljub Pokrajac
Delaware State [email protected]

CP4

Discovering Substantial Distinctions Among Incremental Bi-Clusters

A fundamental task of data analysis is comprehending what distinguishes clusters found within the data. We present the problem of mining distinguishing sets, which seeks to find sets of objects or attributes that induce the most change among the incremental bi-clusters of a binary dataset. Unlike emerging patterns and contrast sets, which only focus on statistical differences between support of itemsets, our approach considers distinctions in both the attribute space and the object space. Viewing the lattice of bi-clusters formed within a data set as a weighted directed graph, we mine the most significant distinguishing sets by growing a maximal cost spanning tree of the lattice. In this paper we present a weighting function for measuring distinction among bi-clusters in the lattice and the novel MIDS algorithm. MIDS simultaneously enumerates bi-clusters, constructs the bi-cluster lattice, and computes the distinguishing sets. The efficient computational performance of MIDS is exhibited in a performance test on real world and benchmark data sets. The utility of distinguishing sets is also demonstrated with experiments on synthetic and real data.

Faris Alqadah, Raj Bhatnagar
University of [email protected], [email protected]

CP4

A Framework for Exploring Categorical Data

In this paper, we present a framework for categorical data analysis which allows such data sets to be explored using a rich set of techniques that are only applicable to continuous data sets. We introduce the concept of separability statistics in the context of exploratory categorical data analysis. We show how these statistics can be used as a way to map categorical data to continuous space given a labeled reference data set. This mapping enables visualization of categorical data using techniques that are applicable to continuous data. We show that in the transformed continuous space, the performance of the standard k-nn based outlier detection technique is comparable to the performance of the k-nn based outlier detection technique using the best of the similarity measures designed for categorical data. The proposed framework can also be used to devise similarity measures best suited for a particular type of data set.

Varun Chandola, Shyam Boriah
Department of Computer Science
University of [email protected], [email protected]

Vipin Kumar
University of [email protected]

CP4

DensEst: Density Estimation for Data Mining in High Dimensional Spaces

Subspace clustering and frequent itemset mining algorithms do not scale to large high dimensional databases as the search space gets enormous. Efficiency improvements can be achieved by estimates of object counts in selective subspace regions. In this work, we propose DensEst, an efficient density estimator. By incorporating correlations between dimensions, DensEst achieves highly accurate estimations. We integrated DensEst into subspace clustering and frequent itemset mining algorithms and show both their improved efficiency and accuracy.

Emmanuel Muller
RWTH Aachen [email protected]

Ira Assent
Aalborg [email protected]

Ralph Krieger, Stephan Gunnemann, Thomas Seidl
RWTH Aachen [email protected], [email protected], [email protected]

CP4

Bayesian Cluster Ensembles

Cluster ensembles provide a framework for combining multiple base clusterings of a dataset to generate a stable and robust consensus clustering. There are important variants of the basic cluster ensemble problem, notably including cluster ensembles with missing values, as well as row-distributed or column-distributed cluster ensembles. Existing cluster ensemble algorithms are applicable only to a small subset of these variants. In this paper, we propose Bayesian Cluster Ensembles (BCE), which is a mixed-membership model for learning cluster ensembles, and is applicable to all the primary variants of the problem. We propose two methods, respectively based on variational approximation and Gibbs sampling, for learning a Bayesian cluster ensemble. We compare BCE extensively with several other cluster ensemble algorithms, and demonstrate that BCE is not only versatile in terms of its applicability, but also outperforms the other algorithms in terms of stability and accuracy.

Hanhuai Shan
Department of Computer Science and Engineering
University of Minnesota, Twin [email protected]

Hongjun Wang
School of Computer Science
Sichuan University, Chengdu, [email protected]

Arindam Banerjee
University of [email protected]

CP4

Agglomerative Mean-Shift Clustering Via Query Set Compression

Mean-Shift (MS) is a powerful non-parametric clustering method. Although good accuracy can be achieved, its computational cost is high even on moderate data sets. In this paper, for the purpose of algorithm speedup, we develop an agglomerative MS clustering method called Agglo-MS, along with its mode-seeking ability and convergence property analysis. Our method is built upon an iterative query set compression mechanism which is motivated by the quadratic bounding optimization nature of MS. The whole framework can be efficiently implemented in linear running time complexity. Furthermore, we show that the pairwise constraint information can be naturally integrated into our framework to derive a semi-supervised non-parametric clustering method. Extensive experiments on toy and real-world data sets validate the speedup advantage and numerical accuracy of our method, as well as the superiority of its semi-supervised version.

Xiao-Tong Yuan, Bao-Gang Hu, Ran He
National Laboratory of Pattern Recognition
Institute of Automation, Chinese Academy of [email protected], [email protected], [email protected]

CP5

Scalable Distributed Change Detection from Astronomy Data Streams Using Local, Asynchronous Eigen Monitoring Algorithms

This paper considers the problem of change detection using distributed eigen monitoring algorithms for astronomy data pipelines. Change point detection in such datasets may provide useful insights into unique astronomical phenomena. However, this is a challenging problem for such high-throughput distributed data streams. In this paper we propose a highly scalable and distributed asynchronous algorithm for monitoring the eigenstates of such data streams. Experiments performed on SDSS catalogue data show the effectiveness of the algorithm.

Kamalika Das
University of Maryland, Baltimore [email protected]

Kanishka Bhaduri
Mission Critical Tech
NASA Ames Research [email protected]

Sugandha Arora, Wesley Griffin
Dept of CSEE
University of Maryland, Baltimore [email protected], [email protected]

Kirk Borne
Computational and Data Sciences Dept
George Mason University
kborne@gmu.edu

Chris Giannella
Computer Science Dept
New Mexico State [email protected]

Hillol Kargupta
Department of Computer Science
University of Maryland Baltimore [email protected]

CP5

Adaptive Concept Drift Detection

An established method to detect concept drift in data streams is to perform statistical hypothesis testing on the multivariate data in the stream. Statistical decision theory offers rank-based statistics for this task. However, these statistics depend on a fixed set of characteristics of the underlying distribution. Thus, they work well whenever the change in the underlying distribution affects the properties measured by the statistic, but they perform poorly if the drift influences the characteristics caught by the test statistic only to a small degree. To address this problem, we present three novel drift detection tests, whose test statistics are dynamically adapted to match the actual data at hand. The first one is based on a rank statistic on density estimates for a binary representation of the data, the second compares average margins of a linear classifier induced by the 1-norm support vector machine (SVM), and the last one is based on the average zero-one or sigmoid error rate of an SVM classifier. Experiments show that the margin- and error-based tests outperform the multivariate Wald-Wolfowitz test for concept drift detection. We also show that the tests work even if the drift is gradual in nature and that the new methods are faster than the Wald-Wolfowitz test.

Ulrich Ruckert
International Computer Science [email protected]

Anton Dries
Katholieke Universiteit [email protected]

CP5

Positive Unlabeled Learning for Data Stream Classification

This paper studies how to devise PU learning techniques for the data stream environment. Unlike existing data stream classification methods that assume both positive and negative training data are available for learning, we propose a novel PU learning technique LELC (PU Learning by Extracting Likely positive and negative micro-Clusters) for document classification. LELC only requires a small set of positive examples and a set of unlabeled examples, which is easily obtainable in the data stream environment, to build accurate classifiers.

Xiaoli Li
Institute for Infocomm [email protected]

Philip Yu, Bing Liu
University of Illinois at [email protected], [email protected]

See-Kiong Ng
Institute for Infocomm [email protected]

CP5

Time-Decayed Correlated Aggregates over Data Streams

Data stream analysis frequently relies on identifying correlations and posing conditional queries on the data after it has been seen. Correlated aggregates form an important example of such queries. Since recent events are typically more important, time decay is used to downweight old values. In this talk, we present space-efficient algorithms as well as space lower bounds for the time-decayed correlated sum, a problem at the heart of many related aggregations.
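To make the query concrete, here is an exact baseline that stores the whole stream, which is precisely what the space-efficient algorithms of the talk avoid. The problem definition used is an assumption for illustration: each item is a tuple `(t, x, y)`, the threshold on `x` is supplied only at query time, and the decay is exponential with a hypothetical `rate` parameter.

```python
import math

def decayed_correlated_sum(stream, threshold, now, rate):
    """Sum of y over items with x <= threshold, each weighted by its age."""
    total = 0.0
    for t, x, y in stream:
        if x <= threshold:  # the condition is only known at query time
            total += y * math.exp(-rate * (now - t))
    return total

# (timestamp, x, y) triples; with rate ln 2 a value halves every time unit.
stream = [(0, 1, 10.0), (1, 5, 20.0), (2, 2, 30.0)]
result = decayed_correlated_sum(stream, threshold=2, now=2, rate=math.log(2))
```

Here the first item is two time units old and contributes 10 × 0.25, the second fails the condition, and the third contributes 30 in full.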

Graham Cormode
AT&T [email protected]

Srikanta Tirthapura, Bojian Xu
Department of Electrical and Computer Engineering
Iowa State [email protected], [email protected]

CP5

Multi-Modal Hierarchical Dirichlet Process Model for Predicting Image Annotation and Image-Object Label Correspondence

We address the problem of predicting image captions and labels for individual objects in the image using a multi-modal hierarchical Dirichlet Process model (MoM-HDP). The model groups related words and image features using hidden mixture components and uses a stochastic process to generate the mixture components, thus circumventing the need for an a priori choice of the number of mixture components or the computational expense of model selection. The model parameters are estimated efficiently using variational inference. The model is then evaluated on image annotation and object recognition tasks on 2 large scale real world image datasets.

Oksana Yakhnenko, Vasant Honavar
Computer Science Department
Iowa State [email protected], [email protected]

CP6

Hierarchical Linear Discriminant Analysis for Beamforming

We demonstrate the applicability of the recently proposed hierarchical linear discriminant analysis (h-LDA) to beamforming. h-LDA tackles the unimodal limitation of LDA by variance decomposition at the subcluster level. We present an efficient h-LDA algorithm using Cholesky decomposition and generalized singular value decomposition for oversampled data, and analyze its data model. Our experiments for beamforming simulation show that h-LDA outperforms LDA, kernel discriminant analysis, regularized least squares and kernelized support vector regression.

Jaegul Choo
College of Computing
Georgia Institute of [email protected]

Barry L. Drake
SEAL/AMDD
Georgia Tech Research [email protected]

Haesun Park
Georgia Institute of [email protected]

CP6

Toward Optimal Ordering of Prediction Tasks

We study the problem of ordering a series of inter-dependent prediction tasks that must be accomplished sequentially through user interaction. We propose an approximate formulation in terms of pairwise task order preferences, reducing it to the well-known Linear Ordering Problem. Our experiments on two practical applications show encouraging improvements in predictive performance, as compared to approaches that do not take task dependencies into account.

Abhimanyu Lad
Language Technologies Institute
Carnegie Mellon [email protected]

Yiming Yang
Carnegie Mellon [email protected]

Rayid Ghani
Accenture Technology [email protected]

Bryan Kisiel
Language Technologies Institute
Carnegie Mellon [email protected]

CP6

AMORI: A Metric-Based One Rule Inducer

The objectives of data mining applications vary extensively. We have implemented a supervised concept learner called A Metric-based One Rule Inducer (AMORI), for which it is possible to select the learning/objective metric based on the problem at hand. We have compared the performance of this algorithm on 19 UCI data sets by embedding three different learning metrics. Experiments show that a performance gain is achieved when using identical metrics for learning and evaluation.
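The idea of a rule inducer with a selectable learning metric can be sketched as follows. This is not the AMORI algorithm, whose details the abstract does not give: it is a classic one-rule learner (one attribute, majority class per attribute value) in which the attribute is chosen by whatever metric function the caller passes in. The names `induce_one_rule` and `accuracy` are hypothetical.

```python
from collections import Counter, defaultdict

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def induce_one_rule(rows, labels, metric):
    """Return (score, attribute index, value -> class) for the best rule."""
    best = None
    for a in range(len(rows[0])):
        # Count class frequencies for every value of attribute a.
        counts = defaultdict(Counter)
        for row, label in zip(rows, labels):
            counts[row[a]][label] += 1
        # Candidate rule: predict the majority class of each value.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        predictions = [rule[row[a]] for row in rows]
        # The pluggable metric decides which attribute's rule is kept.
        score = metric(labels, predictions)
        if best is None or score > best[0]:
            best = (score, a, rule)
    return best

rows = [("sunny", "hot"), ("sunny", "cool"), ("rain", "hot"), ("rain", "cool")]
labels = [1, 1, 0, 0]
score, attribute, rule = induce_one_rule(rows, labels, accuracy)
```

Swapping `accuracy` for another function with the same signature changes the learning objective without touching the inducer, which is the property the abstract's experiments examine.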

Niklas Lavesson, Paul Davidsson
Blekinge Institute of [email protected], [email protected]

CP6

The Metric Dilemma: Competence-Conscious Associative Classification

The classification performance of an associative classifier is strongly dependent on the statistical measure or metric that is used to quantify the strength of the association between features and classes (e.g., confidence, correlation, etc.). Previous studies have shown that classifiers produced by different metrics may provide conflicting predictions, and that the best metric to use is data-dependent and rarely known while designing the classifier. This uncertainty concerning the optimal match between metrics and problems is a dilemma, and prevents associative classifiers from achieving their maximal performance. This dilemma is the focus of this paper.

Adriano [email protected]

Mohammed Zaki
Rensselaer Polytechnic [email protected]

Wagner Meira JR., Marcos [email protected], [email protected]

CP6

Twin Vector Machines for Online Learning on a Budget

This paper proposes Twin Vector Machine (TVM), a constant space and sublinear time Support Vector Machine (SVM) algorithm for online learning. TVM achieves its favorable scaling by maintaining only a fixed number of examples, called the twin vectors, and their associated information in memory during training. In addition, TVM guarantees that Kuhn-Tucker conditions are satisfied on all twin vectors at any time. To maximize the accuracy of TVM, twin vectors are adjusted during the training phase in order to approximate the data distribution near the decision boundary. Given a new training example, TVM is updated in three steps. First, the new example is added as a new twin vector if it is near the decision boundary. If this happens, two twin vectors are selected and merged into a single twin vector to maintain the budget. Finally, TVM is updated by incremental and decremental learning to account for the change in twin vectors. Several methods for twin vector merging were proposed and experimentally evaluated. TVMs were thoroughly tested on 12 large data sets. In most cases, the accuracy of low-budget TVMs was comparable to the state of the art resource-unconstrained SVMs. Additionally, the TVM accuracy was substantially larger than that of SVM trained on a random sample of the same size. Even larger difference in accuracy was observed when comparing to Forgetron, a popular kernel perceptron algorithm on a budget. The results illustrate that highly accurate online SVMs could be trained from large data streams using devices with severely limited memory budgets.

Slobodan Vucetic, Zhuang WangTemple [email protected], [email protected]
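
The budget-maintenance step described in the Twin Vector Machine abstract can be illustrated with a minimal sketch. The abstract does not say which merging rule is used, so the choices below (merge the two closest stored vectors into their weighted mean) and the name `add_with_budget` are assumptions made for illustration only. Class labels, the "near the decision boundary" test, and the incremental/decremental SVM update are all omitted.

```python
import numpy as np

def add_with_budget(vectors, weights, x, budget):
    """Store x; if the budget is exceeded, merge the two closest stored vectors.

    Hypothetical helper: the closest-pair / weighted-mean rule is an assumed
    merging strategy, not the one evaluated in the paper.
    """
    vectors = vectors + [np.asarray(x, dtype=float)]
    weights = weights + [1.0]
    if len(vectors) <= budget:
        return vectors, weights
    # Find the closest pair of stored vectors (quadratic scan, fine for small budgets).
    best = None
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            d = np.linalg.norm(vectors[i] - vectors[j])
            if best is None or d < best[0]:
                best = (d, i, j)
    _, i, j = best
    # Replace the pair by its weighted mean, carrying the combined weight.
    w = weights[i] + weights[j]
    merged = (weights[i] * vectors[i] + weights[j] * vectors[j]) / w
    keep = [k for k in range(len(vectors)) if k not in (i, j)]
    return [vectors[k] for k in keep] + [merged], [weights[k] for k in keep] + [w]
```

Memory stays bounded by `budget` regardless of stream length, which is the property the abstract emphasizes.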

CP7

Identifying Unsafe Routes for Network-Based Trajectory Privacy

We propose a privacy model that offers trajectory privacy to the requesters of location-based services (LBSs). Our model assumes movement on a road network as well as attackers who have knowledge of the users’ movement statistics. The privacy model has been implemented as a framework that automatically identifies routes where user privacy is at risk. Then, it anonymizes user requests based on the location of the requester (w.r.t. his/her unsafe routes) from the time of request until the service provision.

Aris Gkoulalas-Divanis, Vassilios VerykiosDepartment of Computer & Communication EngineeringUniversity of [email protected], [email protected]

Mohamed MokbelDepartment of Computer Science and EngineeringUniversity of [email protected]

CP7

Privacy Preservation in Social Networks with Sensitive Edge Weights

In addition to the current social network anonymity de-identification techniques, in this paper we consider perturbing the weights of some edges to preserve data privacy when the network is published, while retaining the shortest path and the approximate cost of the path between some pairs of nodes in the original network. We develop two privacy-preserving strategies for this application, the Gaussian randomization multiplication and the greedy perturbation.

Jun Zhang, Lian LiuUniversity of KentuckyDepartment of Computer [email protected], [email protected]

Jie WangMinnesota State University MankatoComputer Science [email protected]

Jinze LiuUniversity of KentuckyDepartment of Computer [email protected]
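
As a rough illustration of the first strategy named in the edge-weight privacy abstract, the sketch below multiplies every edge weight by an independent Gaussian factor with mean 1. This is an assumed reading of "Gaussian randomization multiplication"; the function name `perturb_weights`, the `sigma` parameter, and the edge-dictionary representation are hypothetical, and the sketch makes no attempt to preserve shortest paths.

```python
import numpy as np

def perturb_weights(weights, sigma, seed=0):
    """Multiply each edge weight by a factor drawn from N(1, sigma^2).

    `weights` maps an edge (u, v) to a positive weight. A large sigma can
    produce negative factors; a real scheme would need to guard against that.
    """
    rng = np.random.default_rng(seed)
    factors = rng.normal(loc=1.0, scale=sigma, size=len(weights))
    return {edge: w * f for (edge, w), f in zip(weights.items(), factors)}
```

With a small `sigma`, relative path costs change only slightly, which is the intuition behind retaining approximate path costs after publishing.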

CP7

A Bayesian Approach Toward Finding Communities and Their Evolutions in Dynamic Social Networks

In this paper, we propose a dynamic stochastic block model for finding communities and their evolutions in a dynamic social network. In this study, we employ a Bayesian treatment for parameter estimation that computes the posterior distributions for all the unknown parameters. Extensive experimental studies based on both synthetic data and real-life data demonstrate that our model achieves higher accuracy and reveals more insights in the data than several state-of-the-art algorithms.

Tianbao YangMichigan State [email protected]

Yun Chi, Shenghuo Zhu, Yihong GongNEC Laboratories [email protected], [email protected],[email protected]

Rong Jin
Michigan State University
rongjin@msu.edu

CP7

Graph Generation with Prescribed Feature Constraints

In this paper, we study the problem of how to generate synthetic graphs matching various properties of a real social network with two applications, privacy preserving social network publishing and significance testing of network analysis results. We investigate potential disclosures of sensitive links due to the preserved features. Our algorithms on graph generation are based on the Metropolis-Hastings sampling.

Xiaowei Ying, Xintao WuUniversity of North Carolina at [email protected], [email protected]

CP7

Detecting Communities in Social Networks Using Max-Min Modularity

Many datasets can be described in the form of graphs or networks where nodes in the graph represent entities and edges represent relationships between pairs of entities. A common property of these networks is their community structure, considered as clusters of densely connected groups of vertices, with only sparser connections between groups. The identification of such communities relies on some notion of clustering or density measure, which defines the communities that can be found. However, previous community detection methods usually apply the same structural measure on all kinds of networks, despite their distinct features. In this paper, we present a new community mining measure, Max-Min Modularity, which considers both connected pairs and criteria defined by domain experts in finding communities, and then specify a hierarchical clustering algorithm to detect communities in networks. When applied to real world networks for which the community structures are already known, our method shows improvement over previous algorithms. In addition, when applied to randomly generated networks for which we only have approximate information about communities, it gives promising results which show the algorithm’s robustness against noise.

Jiyang Chen, Osmar Zaiane, Randy GoebelUniversity of [email protected], [email protected],[email protected]
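
For context on the kind of structural measure the Max-Min Modularity abstract builds on, the sketch below computes the standard Newman modularity of a partition, Q = (1/2m) Σ_ij (A_ij − k_i k_j / 2m) δ(c_i, c_j). This is the classical measure, not the Max-Min variant proposed in the paper, whose definition is not given in the abstract. The function name `modularity` is chosen here for illustration.

```python
import numpy as np

def modularity(A, labels):
    """Standard Newman modularity for an undirected graph.

    A is a symmetric adjacency matrix; labels[i] is the community of node i.
    """
    A = np.asarray(A, dtype=float)
    labels = np.asarray(labels)
    k = A.sum(axis=1)                 # node degrees
    two_m = k.sum()                   # twice the number of edges
    same = labels[:, None] == labels[None, :]
    return float(((A - np.outer(k, k) / two_m) * same).sum() / two_m)
```

A partition that matches densely connected groups scores higher than one that lumps all nodes together, which always scores zero.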

CP8

Efficient Discovery of Interesting Patterns Based on Strong Closedness

Regarding all patterns above a certain frequency threshold as interesting is one way of defining interestingness in frequent pattern mining. We argue that in many applications, a different notion of interestingness is required in order to be able to capture “long”, and thus particularly informative, patterns that are correspondingly of low frequency. To identify such patterns, we propose a new measure of interestingness that is based on their degree of closedness.

Mario BoleyFraunhofer [email protected]

Tamas HorvathUniversity of BonnFraunhofer [email protected]

Stefan WrobelFraunhofer IAISUniversity of [email protected]

CP8

Efficient Computation of Partial-Support for Mining Interesting Itemsets

Mining interesting itemsets is a popular topic in the data mining community. The objective of this problem is to mine all interesting itemsets, with respect to a given interestingness measure. While considerable efforts have been spent on justifying the various interestingness measures, the algorithms that mine them are not quite well-studied, except in the case of support, which has resulted in the famous frequent itemset mining (FIM) problem. In this paper, we show that a certain class of interesting itemsets can be represented by functions of their partial support. This class includes some definitions of fault-tolerant itemsets, estimated support of itemsets in noisy data, and bond of itemsets. As the name implies, partial support of an itemset is the number of transactions containing some part of the given itemset. This paper addresses the problem of efficiently calculating partial supports, which leads to efficient algorithms for mining interesting itemsets in that class. We show that there exists a recurrence relation between partial supports. Hence, we can calculate the partial supports of an itemset by simply extending any FIM algorithm (even the implementation). This allows us to benefit from innovations and optimizations in FIM algorithms. Theoretical analysis shows that our approaches retain the running time complexity of the base FIM algorithms for only a small cost in space. Extensive experiments on several real-world datasets also demonstrate that algorithms based on our approach are significantly faster than previously proposed techniques for corresponding definitions.

Ardian K. Poernomo, Vivekanand GopalkrishnanNanyang Technological University, [email protected], [email protected]
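
To make the notion of partial support in the abstract above concrete, here is a brute-force sketch under one plausible reading: for each k, count the transactions that contain exactly k items of the itemset. The exact definition and the recurrence relation used in the paper are not stated in the abstract, so both the reading and the name `partial_supports` are assumptions.

```python
from collections import Counter

def partial_supports(transactions, itemset):
    """Return a list s where s[k] is the number of transactions that contain
    exactly k items of `itemset` (assumed reading of 'partial support')."""
    itemset = frozenset(itemset)
    counts = Counter(len(itemset & set(t)) for t in transactions)
    return [counts.get(k, 0) for k in range(len(itemset) + 1)]
```

Under this reading the last entry is the ordinary support of the itemset, so frequent itemset mining is the special case k = |itemset|.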

CP8

Top-K Correlative Graph Mining

We study the problem of mining top-k correlative subgraphs from a graph database, which share similar occurrence distributions with a given query graph. We propose an efficient algorithm, TopCor, which effectively directs the search to highly correlative candidates by three key techniques: an effective correlation checking mechanism, a powerful pruning criterion, and a set of rules for candidate exploration. Experiments show that TopCor is significantly faster than CGSearch, the state-of-the-art threshold-based correlative graph mining algorithm.

Yiping KeThe Chinese University of Hong [email protected]

James ChengNanyang Technological [email protected]

Jeffrey YuThe Chinese University of Hong [email protected]

CP8

Grammar Mining

We introduce the problem of grammar mining, where patterns are context-free grammars, as a generalization of a large number of common pattern mining tasks, such as tree, sequence and itemset mining. The proposed system offers data miners the possibility to specify and explore pattern domains declaratively, in a way which is very similar to the declarative specification of regular expressions in popular scripting languages.

Siegfried Nijssen, Luc De RaedtKatholieke Universiteit [email protected],[email protected]

CP8

High Performance Parallel/Distributed Biclustering Using Barycenter Heuristic

Biclustering refers to simultaneous clustering of objects and their features. It has been shown that Bipartite Spectral Partitioning can be reformulated as a graph drawing problem where the objective is to minimize Hall’s energy of the bipartite graph representation of the input data. We provide an embarrassingly parallel algorithm for biclustering, based on parallel energy minimization using the barycenter heuristic. Experimental evaluation shows large superlinear speedups, scalability and a high level of accuracy.

Arifa Nisar

Northwestern UniversityDepartment of Electrical Engineering and [email protected]

Waseem [email protected]

Wei-Keng Liao, Alok ChoudharyDepartment of Electrical Engineering and ComputerScienceNorthwestern [email protected],[email protected]
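
The barycenter heuristic mentioned in the biclustering abstract can be sketched as a serial iteration on a bipartite graph: each feature is placed at the weighted mean position of the objects it is connected to, and each object at the weighted mean position of its features. The version below is a generic one-dimensional sketch, not the parallel algorithm of the paper. The name `barycenter_layout`, the centering and normalization steps, and the requirement that every row and column has at least one nonzero entry are assumptions.

```python
import numpy as np

def barycenter_layout(A, iters=50, seed=0):
    """One-dimensional barycenter placement for a bipartite graph.

    A[i, j] > 0 means object i is linked to feature j. Assumes no all-zero
    rows or columns. Returns (row_positions, col_positions).
    """
    A = np.asarray(A, dtype=float)
    rng = np.random.default_rng(seed)
    rows = rng.standard_normal(A.shape[0])
    for _ in range(iters):
        cols = (A.T @ rows) / A.sum(axis=0)   # features move to barycenter of objects
        rows = (A @ cols) / A.sum(axis=1)     # objects move to barycenter of features
        rows -= rows.mean()                   # remove the trivial constant solution
        rows /= np.linalg.norm(rows)          # keep the layout from collapsing
    cols = (A.T @ rows) / A.sum(axis=0)
    return rows, cols
```

Objects and features that end up with positions of the same sign can then be read off as one bicluster, which is the usual way a one-dimensional spectral layout is turned into a two-way partition.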

CP9

Polynomial-Delay and Polynomial-Space Algorithms for Mining Closed Sequences, Graphs, and Pictures in Accessible Set Systems

This paper studies efficient closed pattern mining from semi-structured data. By modeling semi-structured data with a framework of set systems, we present an efficient depth-first algorithm that finds all closed patterns in accessible set systems without duplicates in polynomial-delay and polynomial-space w.r.t. the total input size. We also apply this result to efficient closed pattern mining for classes of semi-structured patterns including rigid sequence motifs, itemset sequences, relational graphs, 2-D convex hulls, and 2-D picture patterns.

Hiroki ArimuraGraduate School of IST, Hokkaido [email protected]

Takeaki UnoNational Institute of [email protected]

CP9

Link Propagation: A Fast Semi-Supervised Learning Algorithm for Link Prediction

We propose Link Propagation as a new semi-supervised learning method for link prediction problems, where the task is to predict unknown parts of the network structure by using auxiliary information such as node similarities. Since the proposed method can fill in missing parts of tensors, it is applicable to multi-relational domains, allowing us to handle multiple types of links simultaneously. We also give a novel efficient algorithm for Link Propagation based on an accelerated conjugate gradient method.

Hisashi KashimaIBM Research, Tokyo Research [email protected]

CP9

MultiVis: Content-Based Social Network Exploration Through Multi-Way Visual Analysis

With the explosion of social media, scalability becomes a key challenge. There are two main aspects of the problems that arise: 1) data volume: how to manage and analyze huge datasets to efficiently extract patterns, 2) data understanding: how to facilitate understanding of the patterns by users? To address both aspects of the scalability challenge, we present a hybrid approach that leverages two complementary disciplines, data mining and information visualization. In particular, we propose 1) an analytic data model for content-based networks using tensors; 2) an efficient high-order clustering framework for analyzing the data; 3) a scalable context-sensitive graph visualization to present the clusters. We evaluate the proposed methods using both synthetic and real datasets. In terms of computational efficiency, the proposed methods are an order of magnitude faster compared to the baseline. In terms of effectiveness, we present several case studies of real corporate social networks.

Jimeng SunIBM T.J. Watson Research [email protected]

Spiros Papadimitriou, Ching-Yung Lin, Nan Cao, ShixiaLiu, Weihong QianIBM [email protected], [email protected], [email protected], [email protected], [email protected]

CP9

Near-Optimal Supervised Feature Selection Among Frequent Subgraphs

Graph classification is an increasingly important step in numerous application domains. A popular classification approach using frequent subgraphs suffers from the enormous problem that the number of extracted features may grow exponentially with the size of the graphs. In order to ensure an efficient graph representation of high discriminative power, we propose a submodular approach to feature selection on frequent subgraphs which can be integrated into gSpan, the state-of-the-art tool for frequent subgraph mining.

Marisa ThomaInstitute for InformaticsLudwig-Maximilians-Universitat [email protected]

Hong ChengDepartment of Systems Engineering and EngineeringManagementChinese University of Hong [email protected]

Arthur GrettonMax-Planck Institute for Biological [email protected]

Jiawei HanUniversity of Illinois at [email protected]

Hans-Peter KriegelLudwig-Maximilians University [email protected]

Alex SmolaYahoo! Research, Santa Clara, [email protected]

Le SongSchool of Computer Science

Carnegie Mellon [email protected]

Philip YuUniversity of Illinois at [email protected]

Xifeng YanDepartment of Computer ScienceUniversity of California at Santa [email protected]

Karsten BorgwardtUniversity of CambridgeMax-Planck-Institutes for BiologicalCybernetics,[email protected]

CP9

Understanding Importance of Collaborations in Co-Authorship Networks: A Supportiveness Analysis Approach

In co-authorship networks, the fact that two authors co-author a paper can be regarded as one author supporting the other’s scientific work. Such characteristics can be measured by supportiveness, a novel and interesting measure on co-authorship relation. In our work, several efficient algorithms are developed to compute the top-n most supportive authors and most supportive groups. The empirical study conducted on the DBLP data set indicates that the supportiveness measures are interesting, and that the methods are effective and efficient.

Yi HanNational University of Defense [email protected]

Bin ZhouSimon Fraser [email protected]

Jian PeiSchool of Computing ScienceSimon Fraser [email protected]

Yan JiaNational University of Defense [email protected]

CP10

Local Relevance Weighted Maximum Margin Criterion for Text Classification

We propose a feature extraction method for text classification, named Local Relevance Weighted Maximum Margin Criterion. It aims to learn a subspace in which the documents in the same class are as near as possible while the documents in the different classes are as far as possible in the local region of each document. Furthermore, the relevance is taken into account as a weight to determine the extent to which the documents will be projected.

Quanquan Gu, Jie ZhouDepartment of Automation, Tsinghua [email protected], [email protected]

CP10

Parallel Large Scale Feature Selection for Logistic Regression

In this paper we examine the problem of efficient feature evaluation for logistic regression on very large data sets. We present a new forward feature selection heuristic that ranks features by their estimated effect on the resulting model’s performance. An approximate optimization, based on backfitting, provides a fast and accurate estimate of each new feature’s coefficient in the logistic regression model. Further, the algorithm is highly scalable by parallelizing simultaneously over both features and records, allowing us to quickly evaluate billions of potential features even for very large data sets.

Sameer SinghUniversity of Massachusetts, AmherstDepartment of Computer [email protected]

Jeremy Kubica, Scott LarsenGoogle Inc.Pittsburgh PA [email protected], [email protected]

Daria SorokinaCarnegie Mellon UniversityPittsburgh PA [email protected]

CP10

Multi-Topic Based Query-Oriented Summarization

In this paper, we study a new setup of the problem of multi-topic based query-oriented summarization. We propose using a probabilistic approach to solve this problem. More specifically, we propose two strategies to incorporate the query information into a probabilistic model. Experimental results on two different genres of data show that our proposed approach can effectively extract a multi-topic summary from a document collection and the summarization performance is better than baseline methods.

Jie TangTsinghua [email protected]

Limin YaoUniversity of Massachusetts [email protected]

Dewei ChenTsinghua [email protected]

CP10

Straightforward Feature Selection for Scalable Latent Semantic Indexing

Latent Semantic Indexing (LSI) has been validated to be effective on many small scale text collections. However, little evidence has shown its effectiveness on unsampled large scale text corpora due to its high computational complexity. In this paper, we propose a straightforward feature selection strategy, named Feature Selection for Latent Semantic Indexing (FSLSI), as a preprocessing step such that LSI can be efficiently approximated on large scale text corpora.

Jun YanMicrosoft Research [email protected]

Shuicheng YanNational University of [email protected]

Ning Liu, Zheng ChenMicrosoft Research [email protected], [email protected]

CP10

Topic Cube: Topic Modeling for OLAP on Multidimensional Text Databases

As the amount of textual information grows explosively in various kinds of business systems, it becomes more and more desirable to analyze both structured data records and unstructured text data simultaneously. While OLAP techniques have been proven very useful for analyzing and mining structured data, they face challenges in handling text data. On the other hand, probabilistic topic models are among the most effective approaches to latent topic analysis and mining on text data. In this lecture, we will describe a new data model called topic cube which combines OLAP with probabilistic topic modeling and enables OLAP on the dimension of text data in a multidimensional text database.

Duo Zhang, Chengxiang ZhaiUniversity of Illinois at Urbana [email protected], [email protected]

Jiawei HanUniversity of Illinois at [email protected]

CP11

Travel-Time Prediction Using Gaussian Process Regression: A Trajectory-Based Approach

We tackle the task of travel-time prediction for an arbitrary origin-destination pair on a map. Unlike most of the existing studies, our method allows us to probabilistically predict the travel time along an unknown path if the similarity between paths is defined as a kernel function. Our first innovation is to use a string kernel to represent the similarity between paths. Our second new idea is to apply Gaussian process regression for probabilistic travel-time prediction.

Tsuyoshi IdeIBM ResearchTokyo Research [email protected]

Sei KatoIBM Research, Tokyo Research [email protected]
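
The combination described in the travel-time abstract (a string kernel over paths plus Gaussian process regression) can be sketched as follows. The abstract does not specify which string kernel is used, so this sketch substitutes a simple k-spectrum kernel that counts shared runs of k consecutive road links. The names `spectrum_kernel` and `gp_predict`, the zero-mean prior, and the `noise` parameter are all assumptions.

```python
import numpy as np
from collections import Counter

def spectrum_kernel(p, q, k=2):
    """Similarity of two paths (sequences of link ids): number of matching
    pairs of length-k contiguous subsequences. Assumed stand-in kernel."""
    cp = Counter(tuple(p[i:i + k]) for i in range(len(p) - k + 1))
    cq = Counter(tuple(q[i:i + k]) for i in range(len(q) - k + 1))
    return float(sum(cp[g] * cq[g] for g in cp))

def gp_predict(paths, times, new_path, noise=1e-2, k=2):
    """Posterior mean and variance of travel time for new_path under a
    zero-mean GP with the spectrum kernel and Gaussian observation noise."""
    times = np.asarray(times, dtype=float)
    K = np.array([[spectrum_kernel(a, b, k) for b in paths] for a in paths])
    k_star = np.array([spectrum_kernel(new_path, a, k) for a in paths])
    C = K + noise * np.eye(len(paths))
    mean = k_star @ np.linalg.solve(C, times)
    var = spectrum_kernel(new_path, new_path, k) - k_star @ np.linalg.solve(C, k_star)
    return float(mean), float(var)
```

The returned variance is what makes the prediction probabilistic: a path sharing few link sequences with the training trajectories gets a wide predictive distribution.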

CP11

Discretized Spatio-Temporal Scan Window

The focus of this paper is the discovery of anomalous spatio-temporal windows. We propose a Discretized Spatio-Temporal Scan Window approach to address the question of how we can treat Space and Time together without compromising on the properties of each and their impact on each other. In doing so we discover anomalous Spatio-Temporal windows, identify at what point in time the window changes, identify the spatial patterns of change over time and identify a spatial extent in time which is completely deviant with respect to the rest of the anomalous spatio-temporal windows. None of the current approaches address all these issues in combination. Subsequently we perform experiments on several real world datasets to validate our approach while comparing with the established approach of discovering a cylindrical spatio-temporal Scan window.

Vandana JanejaInformation Systems DepartmentUniversity Of Maryland Baltimore [email protected]

Seyed MohammadiUMBC, Johns Hopkins [email protected]

Aryya GangopadhyayUniversity of Maryland, Baltimore [email protected]

CP11

Efficient Multiplicative Updates for Support Vector Machines

The dual formulation of the support vector machine (SVM) objective function is an instance of a nonnegative quadratic programming problem. We reformulate the SVM objective function as a matrix factorization problem which establishes a connection with the regularized nonnegative matrix factorization (NMF) problem. This allows us to derive a novel multiplicative algorithm for solving hard and soft margin SVM. The algorithm follows as a natural extension of the updates for NMF and semi-NMF. No additional parameter setting, such as choosing learning rate, is required. Exploiting the connection between SVM and NMF formulation, we show how NMF algorithms can be applied to the SVM problem. Multiplicative updates that we derive for SVM problem also represent novel updates for semi-NMF. Further this unified view yields algorithmic insights in both directions: we demonstrate that the Kernel Adatron algorithm for solving SVMs can be adapted to NMF problems. Experiments demonstrate rapid convergence to good classifiers. We analyze the rates of asymptotic convergence of the updates and establish tight bounds. We test them on several datasets using various kernels and report equivalent classification performance to that of a standard SVM.

Vamsi PotluruDept of Computer Science, University of New [email protected]

Sergey PlisDepartment of Computer ScienceUniversity of New [email protected]

Morten MorupTechnical University of [email protected]

Vince CalhounElectrical and Computer Engineering

University of New [email protected]

Terran LaneDepartment of Computer ScienceUniversity of New [email protected]
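
To illustrate what a multiplicative update for the SVM dual looks like, the sketch below applies the well-known multiplicative rule for nonnegative quadratic programming to a bias-free, hard-margin, linear-kernel SVM: minimize (1/2) αᵀAα − Σ α_i subject to α ≥ 0, with A_ij = y_i y_j x_i·x_j. This is the earlier generic NQP update, not the NMF-derived update proposed in the abstract above, which is not spelled out there. The function name `svm_multiplicative` is chosen here, and the sketch assumes no training point is the zero vector.

```python
import numpy as np

def svm_multiplicative(X, y, iters=200):
    """Multiplicative updates for the bias-free hard-margin SVM dual.

    Update: alpha_i <- alpha_i * (1 + sqrt(1 + 4 a_i c_i)) / (2 a_i),
    where a = A_plus @ alpha and c = A_minus @ alpha. No learning rate is needed.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = (y[:, None] * y[None, :]) * (X @ X.T)
    A_pos = np.maximum(A, 0.0)    # positive part of A
    A_neg = np.maximum(-A, 0.0)   # magnitude of the negative part of A
    alpha = np.ones(len(y))
    for _ in range(iters):
        a = A_pos @ alpha
        c = A_neg @ alpha
        alpha = alpha * (1.0 + np.sqrt(1.0 + 4.0 * a * c)) / (2.0 * a)
    return alpha
```

A coefficient stops changing either when it reaches zero or when (Aα)_i = 1, which are exactly the Kuhn-Tucker conditions of this dual; the weight vector is then recovered as w = Σ α_i y_i x_i.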

CP11

Finding Links and Initiators: a Graph-Reconstruction Problem

Consider a 0–1 observation matrix M, where rows correspond to entities and columns correspond to signals; a value of 1 (or 0) in cell (i, j) of M indicates that signal j has been observed (or not observed) in entity i. Given such a matrix we study the problem of inferring the underlying directed links between entities (rows) and finding which entries in the matrix are initiators. We formally define this problem and propose an MCMC framework for estimating the links and the initiators given the matrix of observations M. We also show how this framework can be extended to incorporate a temporal aspect; instead of considering a single observation matrix M we consider a sequence of observation matrices M1, ..., Mt over time. We show the connection between our problem and several problems studied in the field of social-network analysis. We apply our method to paleontological and ecological data and show that our algorithms work well in practice and give reasonable results.

Heikki MannilaHIIT, Helsinki University of TechnologyUniversity of [email protected]

Evimaria TerziIBM Almaden Research [email protected]

CP11

Efficient Active Learning with Boosting

We construct a novel objective function to unify semi-supervised learning and active learning boosting. Minimization of this objective is achieved through alternating optimization w.r.t. the classifier ensemble and the queried data set iteratively. More importantly, we derive an efficient active learning algorithm under this framework, based on a novel query mechanism called query by incremental committee. It not only saves considerable computational cost, but also outperforms conventional active learning methods based on boosting.

Zheng WangDepartment of Automation, Tsinghua [email protected]

Yangqiu SongTsinghua [email protected]

Changshui ZhangTsinghua [email protected]

PP0

On Segment-Based Stream Modeling and Its Applications

The primary constraint in the effective mining of data streams is the large volume of data which must be processed in real time. In many cases, it is desirable to store a summary of the data stream segments in order to perform data mining tasks. Since density estimation provides a comprehensive overview of the probabilistic data distribution of a stream segment, it is a natural choice for this purpose. A direct use of density distributions can however turn out to be an inefficient storage and processing mechanism in practice. In this paper, we introduce the concept of cluster histograms, which provides an efficient way to estimate and summarize the most important data distribution profiles over different stream segments. These profiles can be constructed in a supervised or unsupervised way depending upon the nature of the underlying application. The profiles can also be used for change detection, anomaly detection, segmental nearest neighbor search, or supervised stream segment classification. The flexibility of the tasks which can be performed from the cluster histogram framework follows from its generality in storing the historical density profile of the data stream. As a result, this method provides a holistic framework for density based mining of data streams. We discuss and test the application of the cluster histogram framework to a variety of interesting data mining applications such as speaker recognition and intrusion detection.

Charu C. AggarwalIBM T. J. Watson Research [email protected]

PP0

Structure and Dynamics of Research Collaboration in Computer Science

We use the DBLP bibliographic database of Computer Science publications in top tier conferences to construct collaboration networks and examine the properties of these networks. We perform community structure analysis, examine various forms of centralization, and use PCA on the various areas of computer science research to compare and contrast their collaboration patterns. Our analysis examines the entire network, separate networks based on research area, and looks at how they have changed over time.

Christian A. BirdUniversity of California, [email protected]

Earl Barr, Andre Nash, Premkumar Devanbu, VladimirFilkov, Zhendong SuUC [email protected], [email protected], [email protected], [email protected], [email protected]

PP0

On the Comparison of Relative Clustering Validity Criteria

This paper presents an alternative methodology for comparing clustering validity criteria and uses it to make an extensive comparison of the performances of 4 well-known validity criteria and 20 variants of them over a collection of 142,560 partitions of 324 different data sets of a given class of interest.

Lucas VendraminDepartment of Computer SciencesUniversity of Sao Paulo at Sao [email protected]

Ricardo J. CampelloDepartment of Computer SciencesUniversity of Sao Paulo at Sao [email protected]

Eduardo HruschkaDepartment of Computer SciencesUniversity of Sao Paulo at Sao [email protected]

PP0

Context Aware Trace Clustering: Towards Improving Process Mining Results

Process Mining refers to the extraction of process models from event logs. Real-life processes tend to be less structured and more flexible. Traditional process mining algorithms have problems dealing with such unstructured processes and generate spaghetti-like process models that are hard to comprehend. An approach to overcome this is to cluster process instances such that each of the resulting clusters corresponds to a coherent set of process instances that can be adequately represented by a process model. In this paper, we propose a context aware approach to trace clustering based on generic edit distance. It is well known that the generic edit distance framework is highly sensitive to the costs of edit operations. We define an automated approach to derive the costs of edit operations. The method proposed in this paper outperforms contemporary approaches to trace clustering in process mining. We evaluate the goodness of the formed clusters using established fitness and comprehensibility metrics defined in the context of process mining. The proposed approach is able to generate clusters such that the process models mined from the clustered traces show a high degree of fitness and comprehensibility when compared to contemporary approaches.

Jagadeesh Chandra Bose R.PDepartment of Mathematics and Computer ScienceUniversity of Technology Eindhoven (TU/e), [email protected]

Wil Van Der AalstDepartment of Mathematics and Computer ScienceUniversity of Technology, Eindhoven, The [email protected]
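
The trace clustering abstract relies on a generic edit distance whose result depends on the costs assigned to each edit operation. A minimal sketch of such a distance with pluggable cost functions is given below. The unit costs used as defaults are placeholders, and the paper's automated, context-aware cost derivation is not reproduced here; the function name `edit_distance` and its parameters are chosen for illustration.

```python
def edit_distance(a, b, ins_cost=None, del_cost=None, sub_cost=None):
    """Generic edit distance between two traces (sequences of activities).

    Each cost argument is a function; the defaults charge 1.0 per insertion,
    deletion, or mismatching substitution (plain Levenshtein distance).
    """
    ins_cost = ins_cost or (lambda x: 1.0)
    del_cost = del_cost or (lambda x: 1.0)
    sub_cost = sub_cost or (lambda x, y: 0.0 if x == y else 1.0)
    n, m = len(a), len(b)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + del_cost(a[i - 1])
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + ins_cost(b[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j] + del_cost(a[i - 1]),                 # delete a[i-1]
                d[i][j - 1] + ins_cost(b[j - 1]),                 # insert b[j-1]
                d[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]),   # substitute or match
            )
    return d[n][m]
```

Changing `sub_cost` so that substituting two activities that occur in similar contexts is cheap is one way the sensitivity to costs noted in the abstract shows up in the resulting clusters.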

PP0

A Semi-Supervised Framework for Feature Mapping and Multiclass Classification

We propose a semi-supervised framework incorporating feature mapping with multiclass classification. By learning multiple classification tasks simultaneously, this framework can learn the latent feature space effectively for both labeled and unlabeled data. The knowledge in the transformed space can be transferred not only between the labeled and unlabeled data, but also across multiple classes, so as to improve the classification performance given a small amount of labeled data. We show that this problem is equivalent to a sequential convex optimization problem by applying the constrained concave-convex procedure (CCCP). An efficient algorithm with theoretical guarantees is proposed and computational issues are investigated. Extensive experiments have been conducted to demonstrate the effectiveness of our proposed framework.

Bo Chen, Wai LamChinese University of Hong [email protected], [email protected]

Ivor TsangNanyang Technological [email protected]

Tak-Lam WongChinese University of Hong [email protected]

PP0

Divide and Conquer Strategies for Effective Information Retrieval

Latent Semantic Indexing, a well-known technique for information retrieval, requires the computation of a partial SVD of the term-document matrix. This computation becomes infeasible for large document collections, since it is very demanding both in time and memory. We discuss two divide and conquer strategies, with the goal of alleviating these difficulties. An additional benefit is that the computation can be easily adapted to a parallel computing environment. Experimental results confirm that the proposed strategies are effective.

Jie ChenDepartment of Computer Science and EngineeringUniversity of [email protected]

Yousef SaadDepartment of Computer ScienceUniversity of [email protected]
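
The partial SVD that the abstract above identifies as the bottleneck of Latent Semantic Indexing can be shown on a small dense matrix. The sketch computes a full SVD with NumPy and truncates it to rank k, which is only practical for small collections; it does not implement the divide and conquer strategies of the paper. The function names `lsi` and `project_query` are chosen here for illustration.

```python
import numpy as np

def lsi(term_doc, k):
    """Rank-k truncated SVD of a term-document matrix (terms x documents)."""
    U, s, Vt = np.linalg.svd(np.asarray(term_doc, dtype=float), full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

def project_query(q, U_k, s_k):
    """Map a query term vector into the k-dimensional latent space."""
    return (U_k.T @ np.asarray(q, dtype=float)) / s_k
```

Documents are compared with a projected query through the columns of the returned `Vt`; for large collections the full SVD above is what becomes too demanding in time and memory.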

PP0

A Bayesian Approach to Graph Regression with Relevant Subgraph Selection

This paper introduces a Bayesian approach to graph regression problems requiring relevant subgraph selection which provides a posterior distribution on the target variable as opposed to a single estimate. The intractability issue arising from the representation of the graphs as binary vectors of indicators of subgraphs is solved using a column generation approach, where the most violated constraints are found by weighted subgraph mining. The model is evaluated on several molecular graph datasets.

Silvia ChiappaMax-Planck Institute for Biological [email protected]

Hiroto SaigoMax-Planck Institute for InformaticsCampus E1 4, 66123 Saarbruecken, [email protected]

Koji Tsuda

Max Planck Institute for Biological [email protected]

PP0

A New Constraint for Mining Sets in Sequences

Discovering interesting patterns in event sequences is a popular task in the field of data mining. Most existing methods try to do this based on some measure of cohesion to determine an occurrence of a pattern, and a frequency threshold to determine if the pattern occurs often enough. We introduce a new constraint based on a new interestingness measure combining the cohesion and the frequency of a pattern.

Boris Cule, Bart GoethalsUniversity of [email protected], [email protected]

Celine RobardetINSA-Lyon, LIRIS UMR5205, F-69621 Villeurbanne,[email protected]

PP0

Non-Parametric Information-Theoretic Measures of One-Dimensional Distribution Functions from Continuous Time Series

We study non-parametric measures for the problem of comparing distributions, which arise in anomaly detection for continuous time series. Some of these measures are for PDFs and others are for CDFs. We show how to adapt PDF measures to compare CDFs, and we compare 23 CDF measures. We provide a unified functional form for all measures. We determine the measure significance by simulations only. Finally, we evaluate them for anomaly detection in continuous time series.

Paolo D’Alberto, Ali DasdanYahoo! [email protected], [email protected]

PP0

Noise Robust Classification Based On Spread Spectrum

In this paper we develop a robust classification mechanism based on a connectionist model in order to classify objects from arbitrary feature spaces. Our main contribution is to adapt the spread spectrum method from signal transmission technology to the noise-robust classification of feature vectors using a recurrent neural network. We applied our technique to four publicly available classification benchmarks, providing higher classification accuracies (2% to 16% improvement) than support vector machines and meta-classification techniques.

Joern DavidTechnical University [email protected]

PP0

Finding Representative Association Rules from Large Rule Collections

One of the most well-studied problems in data mining is computing association rules from large transactional databases. Often, the rule collections extracted from existing data-mining methods can be far too large to be carefully examined and understood by the data analysts. In this paper, we address exactly this issue of overwhelmingly large rule collections by introducing and studying the following problem: Given a large collection R of association rules we want to pick a subset of them S ⊆ R that best represents the original collection R as well as the dataset from which R was extracted. We first quantify the notion of the goodness of a ruleset using two very simple and intuitive definitions. Based on these definitions we then formally define and study the corresponding optimization problems of picking the best ruleset S ⊆ R. We propose algorithms for solving these problems and present experiments to show that our algorithms work well for real datasets and lead to a large reduction in the size of the original rule collection.

Warren L. DavisIBM Almaden Research [email protected]

Peter [email protected]

Evimaria TerziIBM Almaden Research [email protected]

PP0

Discovery of Geospatial Discriminating Patterns from Remote Sensing Datasets

Large amounts of remotely sensed data call for data mining techniques to fully utilize their rich information content. In this paper, a new value-iteration method is introduced to optimally split the spatial domain of the selected variable into two classes. This division is used to calculate the set of patterns that are emerging with respect to the two classes. A new method for a concise summarization is introduced to construct super patterns of controlling factors.

Wei DingUMass [email protected]

Tomasz StepinskiLunar and Planetary [email protected]

Josue SalazarUniversity of Houston-Clear [email protected]

PP0

Mining for Surprise Events Within Text Streams

Text streams are a fundamental source of information that can be used to detect and characterize strategic intent of individuals and organizations as well as to detect abrupt or surprising events. We describe our algorithm development and analysis methodology for mining the evolving content in text streams. Our approach focuses on the temporal characteristics in a text stream to identify relevant features, and on the analysis and algorithmic methodology to communicate these characteristics to a user.

Dave Engel, Paul Whitney, Nick Cramer

Pacific Northwest National [email protected], [email protected],[email protected]

PP0

Topic Evolution in a Stream of Documents

Document collections evolve over time: new topics emerge and old ones decline. At the same time, the terminology evolves as well. We propose Topic Monitor for monitoring and understanding of topic and vocabulary evolution over an infinite document stream. We use PLSA for topic modeling and propose new folding-in techniques for topic adaptation under an evolving vocabulary. We extract a series of models, on which we detect topic threads as descriptions of topic evolution.

Andre GohrLeibniz Institute of Plant Biochemistry,IPB, [email protected]

Alexander HinneburgMartin-Luther-University [email protected]

Rene Schult, Myra SpiliopoulouOtto-von-Guericke-University [email protected],[email protected]

PP0

Randomization Techniques for Graphs

Within the framework of statistical hypothesis testing, we focus on randomization techniques for unweighted undirected graphs. Given an input graph, our randomization method will sample data from the class of graphs that share certain structural properties with the input graph. We present three alternative algorithms based on local edge swapping and Metropolis sampling. We test our framework in graph clustering and frequent subgraph mining.
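As an illustration of the local edge swapping mentioned in this abstract, the following Python sketch performs degree-preserving swaps on an undirected simple graph. It is not one of the three algorithms of the abstract, whose details are not given here. The function names `degree_preserving_swaps` and `degree_sequence`, and the choice of preserving only the degree sequence, are assumptions made for this example.

```python
import random


def degree_sequence(edges):
    """Return a dict mapping each node to its degree."""
    degree = {}
    for a, b in edges:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    return degree


def degree_preserving_swaps(edges, num_swaps, seed=0):
    """Randomize a simple undirected graph with local edge swaps.

    Each attempted swap replaces edges (a, b) and (c, d) with (a, d) and
    (c, b), which keeps every node degree unchanged. Attempts that would
    create a self-loop or a duplicate edge are rejected.
    Assumes the input has at least two edges, no self-loops, no duplicates.
    """
    rng = random.Random(seed)
    edge_list = [tuple(sorted(e)) for e in edges]
    edge_set = set(edge_list)
    for _ in range(num_swaps):
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        # Pick one of the two possible rewirings at random.
        if rng.random() < 0.5:
            c, d = d, c
        new1 = tuple(sorted((a, d)))
        new2 = tuple(sorted((c, b)))
        if a == d or c == b:
            continue  # would create a self-loop
        if new1 in edge_set or new2 in edge_set or new1 == new2:
            continue  # would create a duplicate edge
        edge_set.discard(edge_list[i])
        edge_set.discard(edge_list[j])
        edge_set.add(new1)
        edge_set.add(new2)
        edge_list[i], edge_list[j] = new1, new2
    return edge_list
```

A statistic computed on the input graph can then be compared against the same statistic on many such randomized graphs to obtain an empirical p-value.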

Sami HanhijarviHelsinki Institute for Information Technology HIITHelsinki University of [email protected]

Gemma Garriga, Kai PuolamakiHelsinki Institute for Information Technology [email protected], [email protected]

PP0

Musk: Uniform Sampling of k Maximal Patterns

We propose Musk, an algorithm to obtain representative frequent patterns by sampling uniformly from the pool of all maximal frequent patterns; uniformity is achieved by a variant of the Markov Chain Monte Carlo (MCMC) algorithm. Musk simulates a random walk on the frequent pattern partial order graph with a prescribed transition probability matrix, whose values are computed locally during the simulation. In the stationary distribution of the random walk, all maximal frequent pattern nodes in the partial order graph are sampled uniformly. Experiments on various large datasets validate that Musk is effective in obtaining representative frequent patterns when complete enumeration of all the frequent patterns is infeasible by traditional algorithms.

Mohammad A. HasanDepartment of Computer ScienceRensselaer Polytechnic [email protected]

Mohammed ZakiRensselaer Polytechnic [email protected]

PP0

Low-Entropy Set Selection

Most pattern discovery algorithms easily generate very large numbers of patterns, making the results impossible to understand and hard to use. In this paper we present a succinct way of representing data on the basis of itemsets that identify strong interactions. This new approach, LESS, provides a powerful and general MDL-based technique for data description. We consider the data symmetrically and describe all interactions between attributes, not just co-occurrences, in only a handful of sets.

Hannes HeikinheimoHelsinki University of Technology TKKDepartment of Information and Computer Science, [email protected]

Jilles VreekenDepartment of Computer ScienceUniversiteit [email protected]

Arno SiebesDept. of Information and Computing SciencesUniversiteit [email protected]

Heikki MannilaHIIT, Helsinki University of TechnologyUniversity of [email protected]

PP0

A Re-Evaluation of the Over-Searching Phenomenon in Inductive Rule Learning

We evaluate the spectrum of different search strategies to see whether separate-and-conquer rule learners are able to gain performance by using more powerful search strategies like beam or exhaustive search. Unlike previous results that demonstrated that rule learners suffer from over-searching, our work pays particular attention to the connection between the search heuristic and the search strategy, and we show that for some heuristics, complex search algorithms will consistently improve results without suffering from over-searching.

Frederik JanssenTU Darmstadt, Knowledge Engineering [email protected]

Johannes FurnkranzTU DarmstadtKnowledge Engineering [email protected]

PP0

Change-Point Detection in Time-Series Data by Direct Density-Ratio Estimation

Change-point detection is the problem of discovering time points at which properties of time-series data change. This covers a broad range of real-world problems and has been actively discussed in the community of statistics and data mining. In this paper, we present a novel non-parametric approach to detecting the change of probability distributions of sequence data. Our key idea is to estimate the ratio of probability densities, not the probability densities themselves. This formulation allows us to avoid non-parametric density estimation, which is known to be a difficult problem. We provide a change-point detection algorithm based on direct density-ratio estimation that can be computed very efficiently in an online manner. The usefulness of the proposed method is demonstrated through experiments using artificial and real datasets.

Yoshinobu Kawahara, Masashi SugiyamaTokyo Institute of [email protected], [email protected]

PP0

PICC Counting: Who Needs Joins when you Can Propagate Efficiently?

Counting is a common task in many data mining applications. In situations where the attributes of interest span multiple tables in databases, computing instance counts can be expensive. In this paper, we propose PICC, a technique for discovering instance counts. We propose a propagation-based instance counting scheme which avoids joins to obtain a single table. We then present a method for summarizing a database into a concise synopsis and thus estimating the required counts efficiently.

Jong Wook Kim, K. Selcuk CandanComputer Science and Engineering Dept.Arizona State [email protected], [email protected]

PP0

Spatially Cost-Sensitive Active Learning

In active learning, one attempts to maximize classifier performance for a given number of labeled training points by allowing the active learning algorithm to choose which points should be labeled. Typically, when the active learner requests labels for the selected points, it assumes that all points require the same amount of effort to label and that the cost of labeling a point is independent of other selected points. In spatially distributed data such as hyperspectral imagery for land-cover classification, the act of labeling a point (i.e., determining the land-type) may involve physically traveling to a location and determining ground truth. In this case, both assumptions about label acquisition costs made by traditional active learning are broken, since costs will depend on physical locations and accessibility of all the visited points. This paper formulates and analyzes the novel problem of performing active learning on spatial data where label acquisition costs are proportional to distance traveled.

Alexander Liu, Goo Jun, Joydeep GhoshUniversity of Texas at [email protected], [email protected],[email protected]

PP0

Highlighting Diverse Concepts in Documents

We show the underpinnings of a method for summarizing documents: it ingests a document and automatically highlights a small set of sentences that are expected to cover the different aspects of the document. The sentences are picked using simple coverage and orthogonality criteria. We describe a novel combinatorial formulation that captures exactly the document-summarization problem, and we develop simple and efficient algorithms for solving it. We compare our algorithms with many popular document-summarization techniques via a broad set of experiments on real data. The results demonstrate that our algorithms work well in practice and give high-quality summaries.

Kun Liu, Evimaria Terzi, Tyrone GrandisonIBM Almaden Research [email protected], [email protected],[email protected]

PP0

Fedra: A Fast and Efficient Dimensionality Reduction Algorithm

Motivated by the problems occurring while mining data in high dimensional spaces we propose FEDRA, a fast and efficient dimensionality reduction algorithm that uses a set of landmark points to project data to a lower dimensional Euclidean space. FEDRA is faster and requires less memory than other comparable algorithms, without compromising the projection’s quality. We theoretically assess the quality of the resulting projection and provide a bound for the error induced in pairwise distances.

Panagis Magdalinos, Christos Doulkeridis, Michalis‘VazirgiannisAthens University of Economics and [email protected], [email protected], [email protected]

PP0

Mining Cohesive Patterns from Graphs with Feature Vectors

In this paper, we introduce the novel problem of mining cohesive patterns from graphs with feature vectors. A cohesive pattern is a dense and connected subgraph that has homogeneous values in a large enough feature subspace. We present the algorithm CoPaM which exploits various pruning strategies. Our theoretical analysis proves the correctness of CoPaM, and our experimental evaluation demonstrates its efficiency and effectiveness in driving applications such as social network analysis and molecular biology.

Flavia S. MoserSimon Fraser [email protected]

PP0

Exact Discovery of Time Series Motifs

Time series motifs are sets of very similar individual time series or subsequences of a long time series. Because of the quadratic search space, only approximate motifs have been found in the past. We designed a tractable algorithm (MK) to find exact motifs for the first time. Empirically, MK is much faster than brute-force search and applicable as a subroutine in high level data mining tasks like anytime classification, near-duplicate detection and summarization.

Abdullah Mueen, Eamonn Keogh, Qiang ZhuUniversity of California, [email protected], [email protected], [email protected]

Sydne CashMassachusetts General HospitalHarvard Medical [email protected]

Brandon WestoverMassachusetts General HospitalBrigham and Women’s [email protected]

PP0

Providing Privacy Through Plausibly Deniable Search

Query-based web search is an integral part of many people’s daily activities. Most do not realize that their search history can be used to identify them (and their interests). In July 2006, AOL released an anonymized search query log of some 600K randomly selected users. While valuable as a research tool, the anonymization was insufficient: individuals were identified from the contents of the queries alone. Government requests for such logs increase the concern. To address this problem, we propose a client-centered approach of plausibly deniable search. Each user query is substituted with a standard, closely-related query intended to fetch the desired results. In addition, a set of k-1 cover queries is issued; these have characteristics similar to the standard query but on unrelated topics. The system ensures that any of these k queries will produce the same set of k queries, giving k possible topics the user could have been searching for. We use a Latent Semantic Indexing (LSI) based approach to generate queries, and evaluate on the DMOZ webpage collection to show effectiveness of the proposed approach.

Mummoorthy MurugesanDepartment of Computer Science,Purdue [email protected]

Chris CliftonDepartment of Computer SciencePurdue [email protected]

PP0

The Set Classification Problem and Solution Methods

This paper focuses on developing classification algorithms for problems in which there is a need to predict the class based on multiple observations (examples) of the same phenomenon (class). These problems give rise to a new classification problem, referred to as set classification, that requires the prediction of a set of instances given the prior knowledge that all the instances of the set belong to the same unknown class. This problem falls under the general class of problems whose instances have class label dependencies. Four methods for solving the set classification problem are developed and studied. The first is based on a straightforward extension of the traditional classification paradigm whereas the other three are designed to explicitly take into account the known dependencies among the instances of the unlabeled set during learning or classification. A comprehensive experimental evaluation of the various methods and their underlying parameters shows that some of them lead to significant gains in performance.

Xia NingUniversity of Minnesota, Twin [email protected]

George KarypisUniversity of Minnesota / [email protected]

PP0

Text Categorization with All Substring Features

This paper presents a novel document classification method using all substrings as features. Learning by using all substrings has a prohibitive computational cost because the number of candidate substrings can be very large. We show that the idea of equivalence classes of substrings can help determine all effective substrings exhaustively in linear time. In experiments, we show that our method can extract effective substrings efficiently, and achieves more accurate results than previous methods.

Daisuke Okanohara, Jun’ichi TsujiiUniversity of [email protected], [email protected]

PP0

Exploiting Semantic Constraints for Estimating Supersenses with CRFs

The annotation of words by ontology concepts is extremely helpful for semantic interpretation. We employ conditional random fields to predict the coarse meanings (supersenses) of words. As the annotation of training data is costly we modify the CRF algorithm to process a set of possible labels for each training instance (lumped labels). Using only unlabelled data for training it turns out that the resulting F-value is only slightly lower than for the fully labelled data.

Frank Reichartz, Gerhard PaassFraunhofer [email protected],[email protected]

PP0

Analyses for Service Interaction Networks with Applications to Service Delivery

In this work we focus on learning individual and team behavior of different people or agents of a service organization by studying the patterns and outcomes of historical interactions. We develop the notion of service interaction networks which is an abstraction of the historical data and allows one to cast practical problems in a formal setting. Towards this goal we develop new algorithms based on eigenvalue methods and an iterative approach.

Vinayaka Pandit, SAMEEP MEHTA, GYANA PARIJAIBM India Research [email protected], [email protected],[email protected]

S. KAMESHWARAN, VISWANADHAM N.

Indian School of Business (ISB)kameshwaran [email protected], n [email protected]

SUDHANSHU SINGHUniversity of North Carolina, Chapel [email protected]

PP0

Measuring Discrimination in Socially-Sensitive Decision Records

We tackle the problem of determining, given a dataset of historical decision records, a precise measure of the degree of discrimination suffered by a given group of people. This problem is rephrased in a classification rule setting by introducing a collection of quantitative measures of discrimination. Based on these measures, we are able to unveil discriminatory decision patterns hidden in the historical data or in classifiers that learn over training data biased by discriminatory decisions.

Dino Pedreschi, Salvatore Ruggieri, Franco TuriniDipartimento di Informatica, Universita’ di [email protected], [email protected], [email protected]

PP0

A Hybrid Data Mining Metaheuristic for the P-Median Problem

The main contribution of this work is a hybrid version of the GRASP metaheuristic, which incorporates a data mining process, to solve the p-median problem. Patterns mined from a set of sub-optimal solutions are used to guide the GRASP procedures in the search for better solutions. Computational experiments comparing traditional GRASP and different hybrid proposals showed that employing the mined patterns made the hybrid GRASP find better results in less computational time.

Alexandre Plastino, Erick Fonseca, Richard Fuchshuber,Simone MartinsFluminense Federal [email protected], [email protected],[email protected], [email protected]

Alex Freitas, Martino Luis, Said SalhiUniversity of [email protected], [email protected],[email protected]

PP0

Aligned Graph Classification with Regularized Logistic Regression

We consider a classification problem in which there is a fixed and known binary relation defined on the features of a set of multivariate random variables, which we call an aligned graph classification problem. We aim to improve classification performance over conventional learning by incorporating feature relation information in the learning process through extending logistic regression to include the normalized Laplacian of the graph. We validate our method using simulated and real data sets.
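One common way to bring a normalized graph Laplacian into a linear model is a quadratic penalty on the coefficient vector. The sketch below computes such a penalty for an unweighted feature graph. It assumes the penalty has the form w^T L w with L = I - D^{-1/2} A D^{-1/2}, which the abstract does not spell out, and the function name `normalized_laplacian_penalty` is hypothetical.

```python
import math


def normalized_laplacian_penalty(weights, edges):
    """Compute w^T L w for the normalized Laplacian of an unweighted graph.

    Uses the identity
        w^T L w = sum over edges (i, j) of (w_i / sqrt(d_i) - w_j / sqrt(d_j))^2,
    where d_i is the degree of feature i. Features without any edge
    contribute nothing to the penalty.
    """
    degree = {}
    for i, j in edges:
        degree[i] = degree.get(i, 0) + 1
        degree[j] = degree.get(j, 0) + 1
    penalty = 0.0
    for i, j in edges:
        diff = weights[i] / math.sqrt(degree[i]) - weights[j] / math.sqrt(degree[j])
        penalty += diff * diff
    return penalty
```

Under this assumed form, the penalty would be added, scaled by a regularization constant, to the logistic loss, so that related features are pushed toward similar degree-scaled coefficients.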

Brian Quanz, Jun HuanInformation and Telecommunication Technology CenterUniversity of [email protected], [email protected]

PP0

Feature Weighted SVMs Using Receiver Operating Characteristics

Support Vector Machines (SVMs) are a leading tool in classification and pattern recognition and the kernel function is one of their most important components. This function is used to map the input space into a high dimensional feature space. However, it can perform rather poorly when there are too many dimensions (e.g. for gene expression data) or when there is a lot of noise. In this paper, we investigate the suitability of using a new feature weighting scheme for SVM kernel functions, based on receiver operating characteristics (ROC). This strategy is clean, simple and surprisingly effective. We experimentally demonstrate that it can significantly and substantially boost classification performance, across a range of datasets.

Shaoyi Zhang, M. Maruf Hossain, Md. Rafiul Hassan,James Bailey, Kotagiri RamamohanaraoDepartment of Computer Science and SoftwareEngineeringThe University of Melbourne, [email protected],[email protected],[email protected],[email protected], [email protected]

PP0

Identifying Information-Rich Subspace Trends in High-Dimensional Data

Identifying information-rich subsets in high-dimensional spaces and representing them as order revealing patterns is an important research problem in many science and engineering applications. In this paper, we seek an information-revealing representation of the data subsets and formalize the problem of identifying subspace trends focusing on information-rich subsets and develop a new algorithm to extract such subspace trends. We demonstrate our results on both synthetic and real-world datasets and show the advantages of the proposed methodology.

Chandan K. ReddyDepartment of Computer ScienceWayne State [email protected]

Snehal PokharkarWayne State [email protected]

PP0

On Maximum Coverage in the Streaming Model and Application To

The set-streaming model is the generalization of the graph-streaming model to hyper-graphs. We consider the problem of maximum coverage, in which k sets have to be selected that maximize the total weight of the covered elements in this model and show that our algorithm achieves an approximation factor of 1/4. Using this algorithm, we provide an efficient online solution to a multi-topic blog-watch application, an extension of blog-alert, for handling simultaneous multiple-topic requests.
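For readers unfamiliar with the weighted maximum coverage problem, the sketch below shows the classic offline greedy heuristic, which repeatedly picks the set with the largest newly covered weight. This is not the streaming algorithm of the abstract: it assumes all sets are available in memory at once. The function name `greedy_max_coverage` and the input layout (a dict of named sets plus a dict of element weights) are assumptions for this example.

```python
def greedy_max_coverage(sets, weights, k):
    """Pick at most k sets greedily to maximize total covered weight.

    sets:    dict mapping a set name to a Python set of elements
    weights: dict mapping an element to its non-negative weight
             (elements missing from the dict count as weight 0)
    Returns (chosen set names in pick order, total covered weight).
    """
    covered = set()
    chosen = []
    for _ in range(k):
        best_name, best_gain = None, 0.0
        for name, elements in sets.items():
            if name in chosen:
                continue
            # Weight of the elements this set would newly cover.
            gain = sum(weights.get(e, 0.0) for e in elements - covered)
            if gain > best_gain:
                best_name, best_gain = name, gain
        if best_name is None:
            break  # no remaining set adds any weight
        chosen.append(best_name)
        covered |= sets[best_name]
    total = sum(weights.get(e, 0.0) for e in covered)
    return chosen, total
```

In the blog-watch setting described above, each set could stand for a blog and each element for a topic of interest, although that mapping is only one possible reading of the abstract.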

Barna SahaUniversity of Maryland College [email protected]

Lise GetoorUniversity of Maryland, College [email protected]

PP0

Multi-Field Correlated Topic Modeling

Popular methods for probabilistic topic modeling like Correlated Topic Models (CTM) and Latent Dirichlet Allocation (LDA) share an important property, i.e. using a common set of topics to model all the data. This can be too restrictive for modeling complex data entries consisting of multiple heterogeneous fields. We propose a new extension of the CTM method to enable modeling with multi-field topics in a global graphical structure.

Konstantin Salomatin, Yiming YangCarnegie Mellon [email protected], [email protected]

Abhimanyu LadLanguage Technologies InstituteCarnegie Mellon [email protected]

PP0

FutureRank: Ranking Scientific Articles by Predicting Their Future PageRank

The dynamic nature of citation networks makes the task of ranking scientific articles hard. We argue that what is most useful is the expected future references. We define a new measure, FutureRank, which is the expected future PageRank score based on citations that will be obtained in the future. In addition to making use of the citation network, FutureRank uses the authorship network and the publication time of the article in order to predict future citations.

Hassan SayyadiUniversity of Maryland-College [email protected]

Lise GetoorUniversity of Maryland, College [email protected]

PP0

Diversity-Based Weighting Schemes for Clustering Ensembles

We propose general weighting schemes for clustering ensembles. These schemes are independent of the particular method of clustering ensembles and consider the individual clustering solutions in different ways, based on different implementations of the notion of diversity. We show how the proposed schemes can be instantiated into any instance-based, cluster-based and hybrid clustering ensembles methods. Experiments have shown that the performance of clustering ensembles algorithms increases when the proposed weighting schemes are employed.

Andrea TagarelliDept. of Electronics, Computer and System SciencesUniversity of Calabria, [email protected]

Francesco Gullo

Department of Electronics, Computer and SystemSciencesUniversity of [email protected]

Sergio GrecoDept. of Electronics, Computer and System SciencesUniversity of Calabria, [email protected]

PP0

Detection and Characterization of Anomalies in Multivariate Time Series

This talk presents a robust algorithm for detecting anomalies in noisy multivariate time series data by employing a kernel matrix alignment method to capture the dependence relationships among variables in the time series. We show that the algorithm is flexible enough to handle different types of time series anomalies including subsequence-based and local anomalies. A case study is also presented to illustrate the ability of the algorithm to detect ecosystem disturbances in Earth science data.

Pang-Ning Tan, Haibin ChengMichigan State [email protected], [email protected]

Christopher PotterNASA Ames Research [email protected]

Steven KloosterCalifornia State [email protected]

PP0

Tracking User Mobility to Detect Suspicious Behavior

Popularity of mobile devices is accompanied by widespread security problems, such as MAC address spoofing in wireless networks. We propose a probabilistic approach to temporal anomaly detection using a smoothing technique for sparse data. Our technique builds upon the Markov chain, and clustering is presented for reduced storage requirements. Wireless networks suffer from oscillations between locations, which result in weaker statistical models. Our technique identifies such oscillations, resulting in higher accuracy. Experimental results on publicly available wireless network data sets indicate that our technique is more effective than a Markov chain at detecting anomalies for location, time, or both.

Gaurav TandonNuance [email protected]

Philip ChanDeptt. of Computer Sciences, Florida Institute ofTechnology150 W. University Blvd., Melbourne, FL 32901, [email protected]

PP0

ShatterPlots: Fast Tools for Mining Large Graphs

Graphs appear in several settings, like social networks and recommendation systems, among others. The main contribution of this paper is ShatterPlots, a simple, scalable (O(E)) and powerful algorithm to extract patterns from real graphs that help us spot synthetic graphs. The highlight patterns are: “30-per-cent”, at the Shattering point all real and synthetic graphs have about 30% more nodes than edges; “NodeShatteringRatio”, which can almost perfectly separate the real graphs from the synthetic.

Ana Paula AppelSo Paulo [email protected]

Deepayan ChakrabartiYahoo! ResearchSunnyvale, [email protected]

Christos FaloutsosCarnegie Mellon [email protected]

Ravi KumarYahoo! [email protected]

Jure LeskovecCornell UniversityComputer Science [email protected]

Andrew TomkinsYahoo! [email protected]

Hanghang TongMLD SCS [email protected]

PP0

Non-Negative Matrix Factorization, Convexity and Isometry

In this paper we explore avenues for improving the reliability of dimensionality reduction methods such as Non-Negative Matrix Factorization as interpretive exploratory data analysis tools. We show for the first time that non-trivial NMF solutions always exist and that the optimization problem is actually convex, by using the theory of Completely Positive Factorization. We subsequently explore four novel approaches to finding globally-optimal NMF solutions using various ideas from convex optimization. We then develop a new method, isometric NMF (isoNMF), which preserves non-negativity while also providing an isometric embedding, simultaneously achieving two properties which are helpful for interpretation.

Nikolaos VasiloglouGeorgia Institute of [email protected]

PP0

Mining Complex Spatio-Temporal Sequence Patterns

Mining sequential movement patterns describing group behaviour in potentially streaming spatio-temporal data sets is a challenging problem. This work mines sequences of rules (k-STARs) that describe complex behaviours including spatio-temporal gaps and paths formed by ‘replenishment’. The user may drill down and roll up on a lattice defined over the sequences for exploratory analysis. The algorithm runs linearly in the number of patterns mined and interesting sequences are found in a real world dataset.

Florian VerheinInstitut fur InformatikLudwig-Maximilians-Universitat Munchen, [email protected]

PP0

An Entity Based Model for Coreference Resolution

Recently, many advanced machine learning approaches have been proposed for coreference resolution; however, all of the discriminatively-trained models reason over mentions, rather than entities. That is, they do not explicitly contain variables indicating the “canonical” values for each attribute of an entity (e.g., name, venue, title, etc.). This canonicalization step is typically implemented as a post-processing routine to coreference resolution prior to adding the extracted entity to a database. In this paper, we propose a discriminatively-trained model that jointly performs coreference resolution and canonicalization, enabling features over hypothesized entities. We validate our approach on two different coreference problems: newswire anaphora resolution and research paper citation matching, demonstrating improvements in both tasks and achieving an error reduction of up to 62% when compared to a method that reasons about mentions only.

Michael WickUniversity of [email protected]

PP0

Semi-Supervised Learning by Sparse Representation

The L1 graph proposed in this work is motivated by the observation that each datum can be reconstructed by the sparse linear superposition of the training data. The sparse reconstruction coefficients, used to deduce the weights of the directed L1 graph, are derived by solving an L1 optimization problem on sparse representation. Then we propose a semi-supervised learning framework based on the L1 graph to utilize both labeled and unlabeled data for inference on a graph.

Shuicheng YanNational University of [email protected]

Huan WangChinese University of Hong [email protected]

PP0

On Randomness Measures for Social Networks

In this paper, we theoretically analyze graph randomness and present a framework which provides a series of non-randomness measures at levels of edge, node, and the overall graph. We show that graph non-randomness can be obtained mathematically from the spectra of the adjacency matrix of the network.

Xiaowei Ying, Xintao WuUniversity of North Carolina at [email protected], [email protected]

PP0

Parallel Pairwise Clustering

We propose a simple strategy for pairwise clustering of massive data by randomly splitting their affinity matrix into small manageable affinity matrices that are clustered independently, for example using a parallel platform. We demonstrate that this approach yields high quality clustering for various real world problems, even though at each iteration only small fractions of the original data are examined and at no point is the entire affinity matrix stored in memory or even computed.

Elad Yom-Tov, Noam SlonimIBM Haifa Research [email protected], [email protected]

PP0

Speeding Up Secure Computations via Embedded Caching

High computation overheads of many cryptography-based Privacy Preserving Data Mining algorithms have rendered them less practical. In this paper, we address the efficiency issue of these algorithms by proposing a caching approach. After carefully examining micro-steps of several secure computation blocks, we identify iterative portions and reduce their overall computational cost by caching intermediate results. We show empirically that the overall system efficiency is greatly improved without affecting result quality or compromising data privacy.

Ke Zhai, Wee Keong Ng, Andre HeriantoSchool of Computer EngineeringNanyang Technological [email protected], [email protected],[email protected]

Shuguo HanNanyang Technological [email protected]

PP0

Multiple Kernel Clustering

Maximum margin clustering (MMC) has recently attracted considerable interest in both the data mining and machine learning communities. As in other kernel methods, choosing a suitable kernel function is imperative to the success of maximum margin clustering. In this paper, we propose a multiple kernel clustering (MKC) algorithm that simultaneously finds the maximum margin hyperplane, the best cluster labeling, and the optimal kernel. Experimental results demonstrate the effectiveness and efficiency of the MKC algorithm.

Bin ZhaoTsinghua [email protected]

James Kwok

Dept. Computer Science and EngineeringHong Kong Univ. of Science and [email protected]

Changshui ZhangTsinghua [email protected]
