
Eurographics Conference on Visualization (EuroVis) 2016
K.-L. Ma, G. Santucci, and J. van Wijk (Guest Editors)

Volume 35 (2016), Number 3

Visualizing the Impact of Geographical Variations on Multivariate Clustering

Y. Zhang¹, W. Luo², E. A. Mack³, R. Maciejewski¹

¹Arizona State University, Tempe, USA
²University of California, Santa Barbara, USA
³Michigan State University, East Lansing, USA

Abstract

Traditional multivariate clustering approaches are common in many geovisualization applications. These algorithms are used to define geodemographic profiles, ecosystems and various other land use patterns that are based on multivariate measures. Cluster labels are then projected onto a choropleth map to enable analysts to explore spatial dependencies and heterogeneity within the multivariate attributes. However, local variations in the data and choices of clustering parameters can greatly impact the resultant visualization. In this work, we develop a visual analytics framework for exploring and comparing the impact of geographical variations for multivariate clustering. Our framework employs a variety of graphical configurations and summary statistics to explore the spatial extents of clustering. It also allows users to discover patterns that can be concealed by traditional global clustering via several interactive visualization techniques including a novel drag & drop clustering difference view. We demonstrate the applicability of our framework over a demographics dataset containing quick facts about counties in the continental United States and demonstrate the need for analytical tools that can enable users to explore and compare clustering results over varying geographical features and scales.

Categories and Subject Descriptors (according to ACM CCS): I.3.3 [Computer Graphics]: Applications—

1. Introduction

Traditionally, multivariate clustering has been applied to measurements aggregated over various geographical areas (e.g., counties, states, etc.) in order to identify similar and dissimilar regions. Examples include clustering vegetation measures to define regional ecosystems [MHKH11], clustering common surnames to generate cultural demographic maps [CML11], and clustering socioeconomic demographic measures to identify at-risk neighborhoods [AL05]. The most common clustering methods applied include k-means (e.g., [MHKH11]), hierarchical clustering (e.g., [CML11]), and self-organizing maps (e.g., [CMG08]). These methods are usually, but not always (e.g., [MJ07]), applied at a level that is agnostic to the spatial relationships between the data (i.e., the positions of the regions are not used as features in the clustering method). As such, local geographic variations may be obscured in a global clustering approach, and the quality of the clustering results needs to be explored locally and globally.

Previous work in geographical visualization has focused on a variety of techniques designed for analyzing multivariate relationships in spatiotemporal data. Systems in this area commonly utilize coordinated multiple views [Rob07] in which users can observe patterns of multivariate data in scatter plots [And72], parallel coordinate plots [Ins85] and other visual representations and then interactively select areas in those views that will be highlighted on a map. These systems have been deployed for a variety of different application domain areas and utilize a number of different analytical algorithms. For example, work by von Landesberger et al. [vLBA∗12] presented an approach for classifying spatiotemporal categorical data supported by algorithms for the selection of globally and focally representative time steps based on categorical changes, and Goodwin et al. [GDST16] developed a suite of novel interactive visualization methods to identify interdependencies in multivariate data coupled with a series of correlation matrix views.

While there are a variety of visual analytics frameworks designed to enable the exploration of multivariate spatiotemporal data, previous methods focus primarily on the visual inspection of the quality of the clustering result. In this paper, we present a visual analytics framework that links users to a variety of clustering quality measurements to enable them to develop an understanding of global and local multivariate clustering for geographical visualization. Our framework supports the visual exploration of geographical projections of both k-means and hierarchical clustering. Clustering comparisons and rose plots are provided to assess cluster quality, and interactive manipulation and selection allow users to dynamically apply multivariate clustering to different geographical locations, features and scales. We showcase our approach using real-world data from the US Census Bureau and demonstrate how our approach helps in understanding geographical variations in multivariate clustering, ultimately enabling the discovery of patterns that may be concealed.

© 2016 The Author(s)
Computer Graphics Forum © 2016 The Eurographics Association and John Wiley & Sons Ltd. Published by John Wiley & Sons Ltd.

2. Related Work

Our work aims to integrate fundamental theories from geography into multivariate spatial clustering through a geovisual analytics framework. In this section we review the relevant fields describing the characteristics of spatial processes, current multivariate geovisualization techniques, and clustering comparison methods in order to frame our contribution with respect to the current state-of-the-art.

2.1. Geographical Variation

Data generating processes associated with spatial data are often characterized as spatial dependence or spatial heterogeneity [Ans88]. Spatial dependence refers to the similarity in attribute values of nearby spatial units [Ans88] as proposed in Tobler's first law of geography [Tob70]. In contrast, spatial heterogeneity or nonstationarity refers to variation rather than similarity in values for a particular measure across all spatial units [BFC96]. Spatial stationarity is often assumed in statistical analyses, but this is problematic in the presence of spatial heterogeneity where assumptions of a global trend do not reflect the underlying data generating processes [BFC96]. As such, the diagnosis of local dependence and heterogeneity is particularly valuable to understanding statistical output.

Due to this persistent issue in spatial data, tools such as the Moran scatterplot [Ans93], local indicators of spatial association [Ans95], and geographically weighted regression (GWR) [BFC98] are critical to diagnosing outliers which might otherwise be obscured in global and local statistics not designed to diagnose spatial heterogeneity. The development of the local Moran's I and GWR in particular was critical to analyses of spatial data because prior local statistics including the G statistic [GO92] and the G* statistic [OG95] are not capable of assessing spatial heterogeneity in the form of local outliers.

The inability to diagnose spatial dependence and heterogeneity is an issue with current aspatial multivariate clustering techniques. Currently, the most common approach for evaluating multivariate clustering is to assign a label/class/category to each observation in the dataset. Post-classification, a choropleth map can then be generated from clustering results, from which the analyst can begin to diagnose local and global spatial dependence. For example, Turkay et al. [TSH∗14] have developed methods for exploring geographically referenced multivariate data over location and scale through a variety of linked small multiples and summary statistics. However, research highlights that visual inspection of maps can be misleading [Mon14] and that specific local statistics are necessary to diagnose dependence and heterogeneity [Ans95]. Research has also highlighted that multivariate results, when mapped, can produce nonsensical results [MGK07] because closeness in multivariate space is not necessarily the same as closeness in geographic space. Thus, it is necessary to design toolkits for examining spatial processes in clustering results that move beyond potentially misleading visual inspection of maps and beyond global summary statistics that obscure important local variations in multivariate data.

2.2. Multivariate Geovisualization

Previous work in geographical visualization has focused on a variety of techniques designed for analyzing multivariate relationships in spatial data. One well-known example is the UK National Statistics Output Area Classification (OAC) which is an open geodemographic classification with a hierarchical structure of 7 supergroups, 21 groups and 52 subgroups [VR07]. This type of multivariate clustering has served as the basis for various geovisualization techniques. For example, Slingsby et al. [SDW10] developed rectangular hierarchical cartograms for mapping socio-economic data of OAC, and also proposed a set of interactive visualization techniques to explore population profiles of areas and how uncertainty in OAC varies geographically and by OAC category [SDW11]. Singleton and Longley [SL15] presented a London classification (LOAC) based upon the OAC methodology to accommodate local structures that diverge from national patterns. While Singleton and Longley compared the model of LOAC to the 2011 OAC in a statistical and semantic manner, our work explores quantitative and visual comparison of clustering outputs. As for localized geographical exploration, various geographically weighted (GW) statistics have been developed (e.g., GW summary statistics [BFC02], GWR [BFC98], GWPCA [HBC11]). Dykes and Brunsdon [DB07] introduced geographically weighted interactive graphics for exploring and hypothesizing the spatial relationships under different scale-based variations. Goodwin et al. [GDST16] developed a suite of novel interactive visualization methods to identify interdependencies in multivariate data coupled with a series of correlation matrix views. While Goodwin et al. focus primarily on spatial extents of pairwise correlations, our work explores spatial extents in the multivariate clustering space and enables exploratory analysis between clustering differences. Previous work by Lex et al. [LSP∗10] also explores comparing clustering results; however, their domain focus is on genomics which does not have issues related to spatial extent.

2.3. Clustering Comparison

Figure 1: An example of the clustering exploration approaches provided in our framework. (A) Choropleth clustering map with the thumbnail plot displaying the clustering criteria. (B) Scatterplot in PCA projection mode with scope lens enabled. (C) PCP area profiler of the data values of the local highlighted counties. (D) PCP area profiler of the data values from all counties. (E) PCP area profiler showing the data values from counties of the blue cluster. (F) Rose plots of all five clusters.

For many clustering evaluation and comparison techniques, researchers assume a true cluster structure exists and use an external criterion of clustering quality, such as the Rand index [Ran71] or NMI (Normalized Mutual Information) [MRS08], to measure the concordance between the true structure and the output of the clustering algorithms [KF75, FH09, FLK11]. Jung et al. defined clustering gain, which is based on the squared error sum, as a measure for cluster optimality [JPDD03]. Their measurements can be utilized to estimate the desired number of clusters for partitional clustering methods. Meila [Mei05] characterized criteria for comparing two clustering results directly by treating clusters as elements of a lattice. However, those works still remain at the arithmetic level (i.e., only numerical indicators have been provided and no visual information is available for illustration). Hoffman and Hargrove created simple multivariate geographic clustering comparison according to their state space color assignments [HHDG03], yet they do not apply a uniform comparison method. Recently, Zhou et al. [ZKG09] extended parallel sets to provide the mutual comparison and evaluation of multiple partitions. Their visualization can present the overall change between clusterings but may not be suitable for showing the detail of changes in geographical applications. Hu et al. [HKV12] described a heuristic to promote dynamic cluster stability and maximize stability between labels. Their approach for visualizing multiple relationships ensures mental map preservation but lacks the capability to show detailed local comparison. In this work, we expand on these methods and provide a Triple-D (Drag & Drop clustering Difference) View to interactively display the legible visual results for clustering comparison.

3. Framework and Design

In order to merge previously aspatial interactive multivariate clustering techniques with local techniques to evaluate trends in spatial data, we have developed a framework that enables the manipulation and exploration of spatial data in a multivariate clustering environment. To address important differences in local trends related to either spatial dependence or spatial heterogeneity, our work characterizes space in the following four categories:

• Discrete spatial extent - Particular types of data may be reported in such a way that they are bounded by a fixed spatial extent. Previous research tends to apply multivariate clustering to the entire spatial extent, which can conceal local variations. The proposed framework enables geographical selections at a constant spatial extent and allows users to apply multivariate clustering to the selected spatial extent. The interaction and visual encoding enable users to identify local patterns between places in order to understand the impact of the spatial heterogeneity on the multivariate clustering procedure.

• Discrete geographical features - Different geographical features are not always spatially continuous. Thus the proposed framework allows users to distinguish geographical features (e.g., urban vs. rural) and then apply multivariate clustering to the geographical features of interest.

• User-defined (continuous) spatial extent - While many geographic studies examine phenomena where the spatial extent is fixed, many other questions require an analyst to modify the spatial extent by zooming in to a particular set of spatial units or zooming out to a particular extent, and then performing a cluster analysis. In this context, the arrangement and the neighborhood structure of the data are variable. The proposed framework allows users to adjust spatial scales around a fixed location to understand the impact of the spatial dependence on the multivariate clustering procedure.

• Continuous geographic resolution - Another issue to consider when evaluating multivariate clustering results in geographic space is the impact on the results of varying the resolution of the data. This variation in the resolution of the spatial units of interest is otherwise known as the problem of modifiable areal units [Ope83]. We know that using larger areal units (i.e., states as opposed to counties) reduces the variance in the data [BBR09]. The proposed tool allows users to aggregate multivariate attributes at different spatial resolutions (e.g., county, state) to understand the impact of the spatial dependence on the multivariate clustering procedure.
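The effect of changing resolution can be illustrated with a small sketch. It is not the framework's implementation: the function name `aggregate_to_states`, the unweighted mean used for aggregation, and the six county values are all assumptions made for illustration. The only fact relied on is that the first two digits of a five-digit county FIPS code identify the state.

```python
from collections import defaultdict
from statistics import mean, pvariance


def aggregate_to_states(county_values):
    """Average county-level values per state (unweighted mean, an assumption).

    county_values maps a 5-digit county FIPS string to a number.
    The first two digits of a county FIPS code identify the state.
    """
    by_state = defaultdict(list)
    for fips, value in county_values.items():
        by_state[fips[:2]].append(value)
    return {state: mean(values) for state, values in by_state.items()}


# Hypothetical values for six counties in two states (not real census data).
counties = {
    "04001": 18000.0, "04013": 27000.0, "04019": 25000.0,
    "53033": 39000.0, "53035": 30000.0, "53061": 31000.0,
}
states = aggregate_to_states(counties)

# Variance of the attribute at each resolution.
county_variance = pvariance(list(counties.values()))
state_variance = pvariance(list(states.values()))
```

In this toy example the state-level variance is smaller than the county-level variance, which is the kind of smoothing that larger areal units introduce. A population-weighted mean would be an equally plausible aggregation choice.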

3.1. Group Selection

To enable the visual analysis of the spatial impact on multivariate clustering, our framework extends the traditional selection operation through the concept of group operations. Our framework has fully implemented three types of selection:

• Rubber band selection in geographic space;

• Selections from multivariate space utilizing the histogram, scatterplot, categorical view, and box plot (note that all views mentioned are fully implemented in the system);

• Automated geographical selections such as selecting based on a boundary layer or using a neighborhood.

These three selection methods enable users to define any desired areas. The group level operations include updating all the exploratory data analysis widgets (e.g., histogram, scatterplot), applying local clustering, and aggregating local clustering statistics.

3.2. Clustering Exploration

Figure 2: Here we demonstrate the coherent clustering color mapping when both maps have five clusters. By maintaining label consistency for generalized clustering comparison, users can quickly tell that the clustering results are similar while at the same time noting that there exist differences in the northern part of the US (the red circle). However, it is still difficult for users to figure out exactly how many differences there are.

Due to the often non-intuitive connection between multivariate space and geospace, it is a challenge to simultaneously explore the clustering result in both multivariate space and geographical space. We provide several criteria in order to help users assess the clustering results. The first criterion is the silhouette coefficient [Rou87] for multidimensional space. The silhouette coefficient is a method of interpreting clusters that tells users how well each object lies within its cluster and is defined as:

\[
S(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}},
\]

where a(i) is the average dissimilarity of object i with all other objects within the same cluster, b(i) is the lowest average dissimilarity of object i to any other cluster of which object i is not a member, and S(i) lies in the range −1 ≤ S(i) ≤ 1. When S(i) is close to 1, the datum is appropriately clustered. When S(i) is close to −1, object i would be more appropriately labeled if it were assigned to one of its neighboring clusters. When S(i) is close to zero, the datum is on the border of two natural clusters. Therefore, we leverage this coefficient as a means of assessing the goodness of a clustering result.
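A minimal sketch of the silhouette computation follows. It assumes Euclidean distance as the dissimilarity, at least two clusters, and a score of 0 for singleton clusters; none of these choices is stated in the text, and the function name `silhouette` and the sample points are hypothetical.

```python
from math import dist


def silhouette(points, labels):
    """Return S(i) for every point, using Euclidean distance as dissimilarity.

    Requires at least two distinct cluster labels.
    """
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            # Assumption: a singleton cluster gets S(i) = 0.
            scores.append(0.0)
            continue
        # a(i): mean dissimilarity to the other members of i's own cluster.
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b(i): lowest mean dissimilarity to any cluster i does not belong to.
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


# Hypothetical 2D data: two tight pairs plus one point midway between them.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (2.5, 3)]
labels = [0, 0, 1, 1, 0]
scores = silhouette(points, labels)
```

Here the first point scores well above zero, while the midway point scores zero because it is equally far, on average, from both clusters, which matches the borderline case described above.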

Users may also be interested in the relativeness of cluster labels within a neighboring area. We use the Gini index-like [Gin12] purity indicator as the second criterion for geographic multivariate clustering inspection. The purity indicator is defined as:

\[
P(i) = \left(\frac{n_{C_i}}{N(i)}\right)^{2} - \sum_{C_i \neq C_j} \left(\frac{n_{C_j}}{N(i)}\right)^{2},
\]

where $n_{C_i}$ is the number of units in i's neighborhood that belong to the same cluster as i ($C_i$), $n_{C_j}$ is the number of units in i's neighborhood that belong to a different cluster $C_j$, N(i) is the total number of i's neighbors, and P(i) also lies in the range −1 ≤ P(i) ≤ 1. When P(i) is close to 1, unit i is almost entirely surrounded by neighbors within the same cluster. When P(i) is close to −1, unit i is almost entirely surrounded by neighbors from another cluster. When P(i) is near zero, its neighbors are randomly scattered in different clusters. Thus, the higher the purity value of a unit, the stronger the spatial association around that unit.
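The purity indicator can be sketched directly from its definition. The function name `purity`, the adjacency mapping, and the neutral score for a unit without neighbors are assumptions for illustration.

```python
from collections import Counter


def purity(unit, labels, neighbors):
    """Purity P(i) of `unit` given cluster labels and an adjacency mapping."""
    neighbor_labels = [labels[n] for n in neighbors[unit]]
    total = len(neighbor_labels)
    if total == 0:
        # Assumption: a unit without neighbors gets a neutral score of 0.
        return 0.0
    counts = Counter(neighbor_labels)
    own_label = labels[unit]
    # Squared share of neighbors in the unit's own cluster.
    same = (counts.get(own_label, 0) / total) ** 2
    # Sum of squared shares of every other cluster among the neighbors.
    other = sum((c / total) ** 2 for lab, c in counts.items() if lab != own_label)
    return same - other


# Hypothetical units, cluster labels, and first-order contiguity.
labels = {"a": 0, "b": 0, "c": 0, "d": 1}
neighbors = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a", "d"], "d": ["b", "c"]}

p_a = purity("a", labels, neighbors)  # all neighbors share a's cluster
p_c = purity("c", labels, neighbors)  # neighbors split evenly
p_d = purity("d", labels, neighbors)  # all neighbors in one other cluster
```

The three calls cover the cases described in the text: a unit surrounded by its own cluster, a unit with evenly mixed neighbors, and a unit surrounded by a different cluster.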

These indices can be displayed in the thumbnail plot when users hover the mouse over a geographical unit (Figure 1(A)), where the summarized value will be displayed in the group view. In addition to the numerical criteria, our framework provides three visual analytics methods for clustering exploration: PCA (Principal Component Analysis) scatterplot, PCP (Parallel Coordinate Plots) area profiler, and rose plot.

PCA Scatterplot: We utilize PCA to project data into 2D space and provide users a generalized overview of how the clusters are distributed. For instance, from the PCA scatterplot (Figure 1(B)), we can quickly tell where a point is and on which cluster border it lies.

PCP Area Profiler: While PCA is good for visualizing the multivariate distance, it lacks consistency in the appearance as the data change. Hence we implement a PCP area profiler that can visualize the multivariate relations of different area profiles in a single click. There is one customizable area profile where users can select the units of interest and three predefined area profiles: the local neighboring area, which only considers the units within the first order contiguity of the selected unit; the intra-cluster area, which only considers the units in the same cluster as the selected unit; and the global area, which considers all the units. By switching among those area profiles, users can explore how the datum is distributed in the multidimensional space (Figure 1(C-E)).

Rose Plot: While PCPs provide a detailed view of the data values, they are often very cluttered. We employ a modified version of the traditional rose plot (Figure 1(F)) akin to Schreck et al. [SOBL13]. Each variable axis has five points which indicate the lower bound, three quartiles, and upper bound of each variable respectively. While Gestalt principles note that humans are good at shape comparison, drawbacks of the rose plot include shape changes due to axis ordering and the often unintuitive scaling that must be done per axis.
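The five points drawn on each rose plot axis can be sketched as a five-number summary per variable. The function names, the row layout, and the choice of the "inclusive" quartile method are assumptions; the text does not say how the quartiles are computed. Note that `statistics.quantiles` needs at least two values per variable.

```python
from statistics import quantiles


def five_number_summary(values):
    """Lower bound, three quartiles, and upper bound of one variable."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return (min(values), q1, q2, q3, max(values))


def rose_profile(rows, labels, cluster):
    """Per-variable five-number summaries for the units in one cluster.

    rows is a list of dicts mapping variable name -> value.
    """
    members = [row for row, lab in zip(rows, labels) if lab == cluster]
    variables = members[0].keys()
    return {v: five_number_summary([row[v] for row in members]) for v in variables}


# Hypothetical units with two variables; the last unit is in another cluster.
rows = [
    {"income": 1, "commute": 10},
    {"income": 2, "commute": 20},
    {"income": 3, "commute": 30},
    {"income": 4, "commute": 40},
    {"income": 5, "commute": 50},
    {"income": 100, "commute": 5},
]
labels = [0, 0, 0, 0, 0, 1]
profile = rose_profile(rows, labels, cluster=0)
```

Each tuple in `profile` holds the five radial positions for one axis. Any per-axis scaling needed for drawing is left out of this sketch.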

3.3. Clustering Comparison

Figure 3: Comparing two clustering results for the same group of 15 objects. In the top figure, the left part is a clustering result Ω with three clusters, and the right part is a clustering result Ω′ with four clusters. The bottom figure is the illustration of the comparison process. The value on the arrow indicates the proportion of the sub-cluster in that step.

To enhance the coherence and generalized cluster comparison between different clustering results, Hu et al. [HKV12] proposed the coherent clustering color mapping that attempts to keep cluster labels of spatial units consistent between different clustering results. They assign the same label (color) to the clusters with the maximum number of correspondences to facilitate the comparison of clusters. However, this method has a few drawbacks: it lacks a detailed comparison capability (in Figure 2, users cannot tell the exact difference between two clustering results when the number of spatial units or the number of clusters is large), and it can only compare clusterings with exactly the same number of clusters and the same units. To overcome these issues, we have designed a novel visual analytics tool called the Triple-D View (Drag and Drop clustering Difference View).
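The idea of reusing a label for the clusters that share the most units can be approximated with a greedy matching on overlap counts. This is only an illustrative sketch and not the heuristic of Hu et al.; the function name `match_labels`, the greedy strategy, and the handling of unmatched clusters are all assumptions.

```python
from collections import Counter


def match_labels(reference, other):
    """Relabel `other` so each cluster reuses the label of the reference
    cluster it overlaps most, matched greedily by overlap size."""
    # Count how many units each (other cluster, reference cluster) pair shares.
    overlap = Counter(zip(other, reference))
    mapping, used = {}, set()
    for (o, r), _count in overlap.most_common():
        if o not in mapping and r not in used:
            mapping[o] = r
            used.add(r)
    # Assumption: clusters left without a partner get a distinct marker label.
    for o in set(other):
        mapping.setdefault(o, ("unmatched", o))
    return [mapping[o] for o in other]


# Hypothetical labelings of the same six units.
reference = ["red", "red", "red", "blue", "blue", "green"]
other = [2, 2, 0, 0, 0, 1]
relabeled = match_labels(reference, other)
```

After relabeling, units that keep their color agree between the two results, and the third unit changes from red to blue. As the text notes, a consistent coloring of this kind shows where results differ but not how the clusters split or merge.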

Figure 4: An example of the Triple-D View (Drag and Drop clustering Difference View). The top three maps are three different clustering results using K-means but with different initial centroids. The bottom two maps are the comparison results of the first two and last two respectively. When users click on a certain unit in the comparison result, indicator lines will be drawn on top of them to mark the corresponding units from the two compared clustering results.

Each cluster is essentially a set, thus comparing the difference between clusterings is equivalent to exploring the changes among those sets. According to our observations, we generalize the changes into a combination of a splitting step and a merging step. To keep the idea simple, consider the example in Figure 3. Here, we have a clustering result Ω of 3 clusters A, B, C for 15 objects on the left, and another clustering result Ω′ of 4 clusters A′, B′, C′, D′ for those same 15 objects on the right. In the splitting step, we subdivide Ω into small clusters. For instance, for cluster A, objects 1, 3 are formed into the same cluster in Ω′, objects 4, 6, 7, 9 are formed into the same cluster in Ω′, and object 14 is merged into another cluster in Ω′. Thus we will have three sub-clusters in Ω′′ for cluster A. In the merging step, we just need to check each cluster in Ω′ to find out which small sub-clusters in Ω′′ it contains. The intermediate clusters are actually the mutual information between these two clustering results. This process is demonstrated in Figure 3.
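The splitting and merging steps can be sketched as grouping objects by their (source cluster, target cluster) pair. In the example data below, only the way cluster A breaks into {1, 3}, {4, 6, 7, 9}, and {14} comes from the text; which target clusters those pieces land in, and the membership of every other object, are made up for illustration, as is the function name.

```python
from collections import defaultdict


def split_and_merge(omega, omega_prime):
    """Return the intermediate sub-clusters between two clusterings.

    Both arguments map object id -> cluster label. Each sub-cluster is the
    set of objects sharing one (source cluster, target cluster) pair.
    """
    sub = defaultdict(set)
    for obj, src in omega.items():
        sub[(src, omega_prime[obj])].add(obj)
    return dict(sub)


# Hypothetical assignment of 15 objects (see the assumptions above).
omega = {1: "A", 3: "A", 4: "A", 6: "A", 7: "A", 9: "A", 14: "A",
         2: "B", 5: "B", 8: "B", 10: "B",
         11: "C", 12: "C", 13: "C", 15: "C"}
omega_prime = {1: "A'", 3: "A'", 4: "B'", 6: "B'", 7: "B'", 9: "B'",
               14: "C'", 2: "C'", 5: "C'", 8: "C'", 10: "C'",
               11: "D'", 12: "D'", 13: "D'", 15: "D'"}

sub_clusters = split_and_merge(omega, omega_prime)

# Splitting step: the pieces that source cluster A breaks into.
pieces_of_a = {dst: objs for (src, dst), objs in sub_clusters.items() if src == "A"}

# Merging step: the sub-clusters that target cluster C' is assembled from.
parts_of_c_prime = {src: objs for (src, dst), objs in sub_clusters.items() if dst == "C'"}
```

Reading the result by source label gives the splitting step, and reading it by target label gives the merging step, so a single pass over the objects yields both.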

By dragging one clustering result and dropping it onto another clustering result in the Triple-D view, the Triple-D view will map the changes (i.e., the intermediate sub-clusters) under the two clusterings being compared (Figure 4). The layout of our difference view is an inverted pyramid which is similar to the GTdiff method [HWH∗11], yet GTdiff only provides comparison for temporal bins as a difference of values between time steps. Here we utilize this layout to show the difference between different clustering results. To represent the changes, we define three criteria for the proportion: less than 50 percent, larger than 50 percent, and equal to 100 percent. In the splitting step, the proportion refers to the ratio between the size of the sub-cluster and the size of the original cluster from which it splits; in the merging step, it refers to the ratio between the size of the sub-cluster and the size of the successive cluster into which it merges. As there are three criteria for both steps, there will be 9 variables that can be used to represent the observed changes.
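The two proportions and their three-way binning can be sketched as follows. The function names and example labels are hypothetical, and because the text does not say where exactly 50 percent belongs, this sketch places it in the "larger than 50%" bin as a labeled assumption.

```python
from collections import Counter


def proportion_class(p):
    """Bin a proportion into the three criteria named in the text."""
    if p == 1.0:
        return "equal to 100%"
    # Assumption: exactly 50% is grouped with the "larger than 50%" bin,
    # since the text does not say where that boundary case belongs.
    return "larger than 50%" if p >= 0.5 else "less than 50%"


def change_classes(omega, omega_prime):
    """Map each (source, target) sub-cluster to its (split, merge) classes."""
    pair_sizes = Counter((omega[o], omega_prime[o]) for o in omega)
    src_sizes = Counter(omega.values())
    dst_sizes = Counter(omega_prime.values())
    return {
        (src, dst): (
            proportion_class(n / src_sizes[src]),  # share of the source cluster
            proportion_class(n / dst_sizes[dst]),  # share of the target cluster
        )
        for (src, dst), n in pair_sizes.items()
    }


# Hypothetical clusterings of 15 objects, given as object id -> label.
omega = {1: "A", 3: "A", 4: "A", 6: "A", 7: "A", 9: "A", 14: "A",
         2: "B", 5: "B", 8: "B", 10: "B",
         11: "C", 12: "C", 13: "C", 15: "C"}
omega_prime = {1: "A'", 3: "A'", 4: "B'", 6: "B'", 7: "B'", 9: "B'",
               14: "C'", 2: "C'", 5: "C'", 8: "C'", 10: "C'",
               11: "D'", 12: "D'", 13: "D'", 15: "D'"}
classes = change_classes(omega, omega_prime)
```

Each sub-cluster ends up with one of the 3 × 3 = 9 possible (split, merge) combinations, which is one way to read the nine variables mentioned above.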

Figure 5: An example of exploration between clusterings in different geographical locations. The top and bottom rows are results based on surroundings of King County and Harris County respectively. (A) Selections with the circle selection tool. (B) Local clustering results and their clustering statistics. (C) Rose plots for the local clusters; the variables from the top in a clockwise manner are: percentage of other languages, percentage of education level above high school, mean time to work, and per capita income.

The Triple-D view can not only visualize the difference between clustering results regardless of the cluster number and coloring scheme, but can also generate a numerical proximity metric that obeys all the metric properties (positivity, symmetry, triangle inequality, indiscernibility) to help users assess the clustering similarity. In contrast, the Rand Index is not suitable for unlabeled clustering comparison as it requires a ground truth, NMI (Normalized Mutual Information)/Variation of Information cannot handle the situation when mutual information is 0, and the similarity measure introduced by Torres et al. [TBS∗09] does not provide diversity/entropy information for the comparison, which makes the result less meaningful. Our split-merge metric is therefore defined as:

$$\mathrm{SM}(\Omega, \Omega'') = -\sum_i \sum_j \frac{|C_i \cap C'_j|}{N} \log \frac{|C_i \cap C'_j|^2}{|C_i|\,|C'_j|},$$

where $\Omega$ and $\Omega''$ are the two clustering results being compared, $C_i$ is the $i$th cluster in $\Omega$, $C'_j$ is the $j$th cluster in $\Omega''$, and $N$ is the total number of units. Larger metric values represent more dissimilarity between clusterings. Potential future work could explore the use of recent set visualization methods (e.g., OnSet [SMDS14]) as a novel means of comparing clustering results.
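As a concrete reading of the split-merge metric, the following Python sketch evaluates the sum over all non-empty cluster intersections. It rests on two assumptions that this section does not state: empty intersections contribute zero (the usual $0 \log 0 = 0$ convention), and the logarithm is natural. The function name is hypothetical.

```python
import math
from collections import Counter


def split_merge_metric(labels_a, labels_b):
    """Split-merge dissimilarity between two labelings of the same units."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same units")
    n_units = len(labels_a)
    size_a = Counter(labels_a)                   # |C_i|
    size_b = Counter(labels_b)                   # |C'_j|
    overlap = Counter(zip(labels_a, labels_b))   # |C_i ∩ C'_j|, non-empty only
    total = 0.0
    for (a, b), n_ab in overlap.items():
        # Assumption: natural logarithm; empty intersections are skipped.
        total -= (n_ab / n_units) * math.log(n_ab ** 2 / (size_a[a] * size_b[b]))
    return total


same_partition = split_merge_metric([0, 0, 1, 1], [5, 5, 7, 7])
crossed_partition = split_merge_metric([0, 0, 1, 1], [0, 1, 0, 1])
```

Two labelings that describe the same partition under different names give a value of 0, while the fully crossed example above gives $2 \ln 2 \approx 1.386$, consistent with larger values meaning more dissimilarity.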

4. Case Studies

In the following case studies, we used a demographics data set containing quick facts about counties in the continental United States from the US Census Bureau (http://quickfacts.census.gov/qfd/download_data.html). There are 3106 counties in this dataset and 52 demographic variables. The counties are distinguished by their FIPS (Federal Information Processing Standards) number, and the variables by their mnemonic identifier. Note that the variables used here are chosen to clearly highlight observable patterns in the data and to demonstrate our framework features.

4.1. Discrete spatial extent

POP815213 (percentage speaking a language other than English), EDU635213 (education level above high school), LFE305213 (mean travel time to work), and INC910213 (per capita income) are the variables of interest in this case study. Here we explore the surroundings of King County (Seattle) and the surroundings of Harris County (Houston) to determine whether they share any common clustering patterns. We first choose the surrounding counties within the same radius of both counties using the circle selection tool (Figure 5(A)) and then apply hierarchical clustering using Ward's method [War63] and Euclidean distance (Figure 5(B)). The higher average Silhouette coefficient of King County's clustering indicates that its clustering quality is slightly better than that of Harris County's clustering under the same clustering method. From the rose plots (Figure 5(C)), we also notice that the characteristics of each cluster in King County's clustering are more distinguishable. As King County belongs to the orange cluster, we identify from the rose plots (Figure 5(C) Top) that it has more well-educated people, higher income, and longer travel times to work than the other clusters. The difference in cluster characteristics is small from King County westward, but large eastward, as the eastern counties have significantly shorter travel times to work and more non-English language speakers.


Figure 6: An example of partial clustering for units with different geographical features. The left map shows the units with positive population change and the right map shows the units with negative population change in 2014.

The situation in Harris County is similar; however, the difference in cluster characteristics does not show a similar west-east pattern. The percentage of non-English speakers drops significantly towards the north but remains high towards the south, which is plausible given that the Mexican border lies to the south of Harris County.

Interestingly, the other two values, Shannon diversity and average purity, suggest that the clusters in Harris County's clustering are more spatially associated, whereas the clusters in King County's clustering are more scattered. The spatial distribution of the clustering results in Figure 5(B) displays the corresponding patterns. Though these two counties are both near the ocean, close to the country border, and contain major metropolitan areas, the differences in population composition, education level and traffic level of their surrounding counties appear to contribute to different spatial patterns.
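The workflow of this case study (circle selection, Ward clustering on Euclidean distance, and the average Silhouette coefficient) can be approximated with SciPy as follows. This is a sketch on synthetic stand-in data, not the Census data or the framework's own code: the function names, the planar coordinates, and the z-scoring of the selected units are all assumptions made for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import cdist


def select_within_radius(coords, center, radius):
    """Indices of units whose centroid lies within `radius` of `center`."""
    return np.flatnonzero(np.linalg.norm(coords - center, axis=1) <= radius)


def ward_clusters(features, n_clusters):
    """Ward linkage on Euclidean distance, cut into at most n_clusters groups."""
    tree = linkage(features, method="ward", metric="euclidean")
    return fcluster(tree, t=n_clusters, criterion="maxclust")


def mean_silhouette(features, labels):
    """Average Silhouette coefficient [Rou87]; assumes at least two clusters."""
    dist = cdist(features, features)
    cluster_ids = np.unique(labels)
    scores = []
    for i, own in enumerate(labels):
        same = labels == own
        if same.sum() == 1:
            scores.append(0.0)  # singleton clusters score 0 by convention
            continue
        a = dist[i, same].sum() / (same.sum() - 1)  # mean distance within own cluster
        b = min(dist[i, labels == other].mean() for other in cluster_ids if other != own)
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))


# Synthetic stand-in for county centroids and four demographic variables.
rng = np.random.default_rng(0)
coords = rng.uniform(0.0, 10.0, size=(200, 2))
profiles = np.array([[0.0, 0.0, 0.0, 0.0], [5.0, 5.0, 5.0, 5.0], [10.0, 0.0, 10.0, 0.0]])
features = profiles[np.arange(200) % 3] + rng.normal(scale=0.3, size=(200, 4))

selected = select_within_radius(coords, center=np.array([5.0, 5.0]), radius=3.0)
local = features[selected]
local = (local - local.mean(axis=0)) / local.std(axis=0)  # z-score (assumption)
labels = ward_clusters(local, n_clusters=3)
score = mean_silhouette(local, labels)
```

Running the same selection radius around two different centers and comparing the two `score` values mirrors the King County versus Harris County comparison, under the stated assumptions.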

4.2. Discrete geographical features

We noticed that variable PST120214, representing the percent change of population in 2014, has both negative and positive values, so we explore clustering on units with positive values and units with negative values separately. We first use the scatterplot (or categorical view) to select units with only positive values and then apply hierarchical clustering using Ward's method and Euclidean distance. The number of clusters chosen is 6, and the variables clustered on are SEX255213 (percent female in 2013), POP645213 (percent of foreign born persons in 2013), EDU635213 (percent of persons who are high school graduates or higher), INC910213 (per capita money income in 2013), and PVY020213 (percent of persons below poverty level). Then we repeat the same process for the units with only negative percent change of population. Figure 6 shows the clustering results for units with positive and negative population change.
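The partition-then-cluster procedure can be outlined as below. The sketch uses random stand-in data rather than the Census variables, the function name is hypothetical, and units whose change is exactly zero are left out of both groups, which is an assumption since the text does not mention them.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def cluster_by_sign(change, features, n_clusters):
    """Cluster units with positive and negative change separately (Ward, Euclidean)."""
    results = {}
    for name, mask in (("positive", change > 0), ("negative", change < 0)):
        idx = np.flatnonzero(mask)  # units belonging to this partition
        tree = linkage(features[idx], method="ward", metric="euclidean")
        results[name] = (idx, fcluster(tree, t=n_clusters, criterion="maxclust"))
    return results


# Random stand-in for the percent population change and five clustering variables.
rng = np.random.default_rng(1)
change = rng.normal(size=300)
features = rng.normal(size=(300, 5))
groups = cluster_by_sign(change, features, n_clusters=6)
pos_idx, pos_labels = groups["positive"]
neg_idx, neg_labels = groups["negative"]
```

Because the two partitions are clustered independently, their cluster labels are unrelated; any correspondence between them has to be established afterwards, as is done with the rose plots in this case study.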

After investigating the rose plot for each clustering result, we find a similar matching pattern based on the clusters' characteristics. As shown in Figure 6, a matching of A ⇔ e/c, B ⇔ d, C ⇔ a, D ⇔ b, E ⇔ f can be easily identified from the rose plot of each cluster. This means that spatial units with positive and negative population changes do share some similar patterns in those 5 variables. We also notice that only cluster F does not have a matching cluster in the other clustering result. Cluster F has the highest levels of education, income and foreign born persons. Also, judging from the map, the units in cluster F are all major metropolitan areas such as Los Angeles, New York City, etc. We hypothesize that areas with high income levels, education levels and foreign born population will have positive population change; in other words, areas with negative population change usually do not have high income, education and foreign born persons levels.

Figure 7: An example of the scale effect on clustering around Cook County. (A) The Triple-D view of the clustering comparison across the scale change. (B) The change of scales shown in 4 colors. (C) PCP area profiler for scales 1 and 2. (D) PCP area profiler for scales 1 and 3. (E) PCP area profiler for scales 1 and 4.

4.3. Continuous spatial extent

Next, we explore the effects of scale change on clustering. The clustering uses 6 clusters with Ward's method and Euclidean distance. It is based on the variables of age (percent of persons under 5, under 18, and above 65, respectively), house living (percent living in the same house for more than 1 year), and education (percent of persons with a bachelor's degree or higher). We first choose Cook County (the Chicago metro area). We select the start scale and end scale as shown in Figure 7(A-1) and (A-4), respectively. The framework then automatically interpolates the steps between those two scales (Figure 7(A-2)(A-3)). After clustering on each of the scales and applying the comparison in the Triple-D view, we find that the comparison metric, which measures dissimilarity, slowly increases as the scale changes (scores circled in red in Figure 7(A)). We therefore label the change of scales in different colors (Figure 7(B)) and visualize the difference between them with the PCP area profiler (Figure 7(C)(D)(E)). As shown in the PCP area profiler, Cook County and its neighboring counties have higher values for the percentage of population under 18 and for the education variable (darker green lines as scale 1 in Figure 7(C)). When the scale expands outward to the next contiguous set of neighbors, the outer counties have a higher percentage of elderly residents and a lower education level (light green lines as scale 2 in Figure 7(C)). As the characteristics of these two area profiles are quite distinguishable, the scale change from A-1 to A-2 (B-2) may have less effect on the clustering of A-1. That also explains why the comparison between the clustering results at scales A-1 and A-2 shows little difference. When the spatial extent increases, more units that are similar to the units in scale A-1 are introduced (overlapping light green lines in Figure 7(D)), and that interferes with the clustering results. Note that the clustering results here are further confounded because the variables are dependent proportions of the population.
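The text does not specify how the intermediate scales are interpolated. One simple possibility, shown here purely as an assumption, is to space the selection radii linearly between the start and end scale and to collect the units that fall inside each radius. The function names and the grid of unit centroids are hypothetical.

```python
import numpy as np


def interpolate_radii(start_radius, end_radius, n_scales):
    """Linearly spaced radii from the start scale to the end scale (assumption)."""
    return np.linspace(start_radius, end_radius, n_scales)


def units_at_each_scale(coords, center, radii):
    """For each radius, the indices of units whose centroid lies inside the circle."""
    dist = np.linalg.norm(coords - np.asarray(center), axis=1)
    return [np.flatnonzero(dist <= r) for r in radii]


# A 7 x 7 grid of hypothetical unit centroids around a center unit at the origin.
xs, ys = np.meshgrid(np.arange(-3, 4), np.arange(-3, 4))
coords = np.column_stack([xs.ravel(), ys.ravel()]).astype(float)

radii = interpolate_radii(1.0, 4.0, 4)
scales = units_at_each_scale(coords, (0.0, 0.0), radii)
```

Each scale contains all units of the previous one, so clustering every entry of `scales` and comparing the results follows the nested expansion shown in Figure 7(B). A contiguity-based expansion (adding one ring of neighbors at a time) would be an equally plausible reading of the text.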

4.4. Continuous geographical resolution

As shown in Figure 8, we applied hierarchical clustering with 6 clusters at the county level and at the state level across the mainland US. The clustering uses the Education level (EDU685213), Veterans (VET605213), Mean travel time to work (LFE305213), and Private nonfarm employment (BZA110213). We can tell that the county-level rose plots (Figure 8(A)) show more variance within each of the clusters. The PCA scatterplot also shows more overlap at the county level, but clear separation at the state level. From the statistics of the two clustering results, the average Silhouette coefficient of the clusters at the state level is higher than at the county level, which indicates higher intra-cluster similarity under state-level aggregation. These observations, together with the work of Sun and Wong [SW15], indicate that spatial aggregation can improve data quality.
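Moving from county to state resolution requires an aggregation step that the text does not detail. The sketch below groups counties by the first two digits of their five-digit FIPS code, which identify the state, and takes an unweighted mean per state. The unweighted mean is an assumption (a population-weighted mean may be more appropriate for percentage variables), the function name is hypothetical, and the feature values are made up.

```python
import numpy as np


def aggregate_to_states(fips, features):
    """Average county rows per state; the state is the first two FIPS digits."""
    state_codes = np.array([code[:2] for code in fips])
    states, inverse = np.unique(state_codes, return_inverse=True)
    sums = np.zeros((len(states), features.shape[1]))
    np.add.at(sums, inverse, features)  # accumulate county rows into their state
    counts = np.bincount(inverse, minlength=len(states))
    return states, sums / counts[:, None]  # unweighted mean (assumption)


# Four real county FIPS codes with made-up feature values.
fips = ["04013", "04019", "53033", "48201"]  # Maricopa, Pima, King, Harris
features = np.array([[10.0, 1.0], [20.0, 3.0], [30.0, 5.0], [40.0, 7.0]])
states, state_features = aggregate_to_states(fips, features)
```

The resulting `state_features` array can then be clustered in the same way as the county-level data, so that the two Silhouette coefficients can be compared.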

5. Conclusion and Future Work

This paper presents a geovisual analytics framework that allows users to understand the impact of geographical variations across locations and scales on multivariate data clustering. We categorize the space into four aspects: discrete spatial extent, discrete geographical features, continuous spatial extent, and continuous geographical resolution, in order to characterize the impact of spatial dependence and heterogeneity. A variety of visualization and interaction techniques (e.g., PCA scatterplot, PCP area profiler, Rose plot) have been implemented to facilitate clustering exploration over geographical variations, with statistical measures (e.g., Silhouette coefficient) to evaluate cluster quality. We provide methods for comparing within (k-means vs. k-means) and between (hierarchical vs. k-means) cluster results, and demonstrate potential ways of interacting with data to explore cluster results.

Figure 8: An example of clustering results under different geographical resolutions. (A) Clustering result, corresponding rose plot and PCA scatterplot at the county level. (B) Clustering result, corresponding rose plot and PCA scatterplot at the state level. Both use the same hierarchical clustering with 6 clusters.

While this framework enables the exploration and comparison of clustering methods over different scales, there is still a need to enable quick identification of similar and dissimilar regions. Currently, the comparative analysis between clusters is done in a purely visual manner, and while humans are capable of identifying patterns, the integration of further analytical methods to help highlight and identify statistically significant similarities and differences between clusters is critical. Furthermore, this exploration focused primarily on the spatial extent of the data; however, extensions to the spatiotemporal domain are critical in analyzing how underlying physical properties may develop in the data. It may also be possible to automatically explore the impact of scale simply by defining levels of aggregation and to present a summary comparison to end users to suggest appropriate scales of analysis for the data. Future work should explore a combination of automation with human-in-the-loop exploration and recommendations.

6. Acknowledgments

This work was supported by the NSF under Grant No. 1350573 and in part by the U.S. Department of Homeland Security's VACCINE Center under Award Number 2009-ST-061-CI0001. We thank the anonymous reviewers whose comments helped improve the manuscript.

References

[AL05] ASHBY D. I., LONGLEY P. A.: Geocomputation, geodemographics and resource allocation for local policing. Transactions in GIS 9, 1 (2005), 53–72.

[And72] ANDREWS D. F.: Plots of high-dimensional data. Biometrics (1972), 125–136.

[Ans88] ANSELIN L.: Spatial econometrics: Methods and models, vol. 4. Springer Science & Business Media, 1988.

[Ans93] ANSELIN L.: The Moran scatterplot as an ESDA tool to assess local instability in spatial association. Regional Research Institute, West Virginia University Morgantown, WV, 1993.

[Ans95] ANSELIN L.: Local indicators of spatial association - LISA. Geographical Analysis 27, 2 (1995), 93–115.

[BBR09] BURT J. E., BARBER G. M., RIGBY D. L.: Elementary statistics for geographers. Guilford Press, 2009.

[BFC96] BRUNSDON C., FOTHERINGHAM A. S., CHARLTON M. E.: Geographically weighted regression: A method for exploring spatial nonstationarity. Geographical analysis 28, 4 (1996), 281–298.

[BFC98] BRUNSDON C., FOTHERINGHAM S., CHARLTON M.: Geographically weighted regression. Journal of the Royal Statistical Society: Series D (The Statistician) 47, 3 (1998), 431–443.

[BFC02] BRUNSDON C., FOTHERINGHAM A., CHARLTON M.: Geographically weighted summary statistics - a framework for localised exploratory data analysis. Computers, Environment and Urban Systems 26, 6 (2002), 501–524.


[CMG08] CHEN J., MACEACHREN A. M., GUO D.: Supporting the process of exploring and interpreting space–time multivariate patterns: The visual inquiry toolkit. Cartography and geographic information science 35, 1 (2008), 33–50.

[CML11] CHESHIRE J., MATEOS P., LONGLEY P. A.: Delineating Europe's cultural regions: Population structure and surname clustering. Human Biology 83, 5 (2011), 573–598.

[DB07] DYKES J., BRUNSDON C.: Geographically weighted visualization: Interactive graphics for scale-varying exploratory analysis. IEEE Transactions on Visualization and Computer Graphics 13, 6 (2007), 1161–1168.

[FH09] FERREIRA L., HITCHCOCK D. B.: A comparison of hierarchical methods for clustering functional data. Communications in Statistics-Simulation and Computation 38, 9 (2009), 1925–1949.

[FLK11] FATTAH S. A., LIN C.-C., KUNG S.-Y.: A mutual information based approach for evaluating the quality of clustering. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2011), IEEE, pp. 601–604.

[GDST16] GOODWIN S., DYKES J., SLINGSBY A., TURKAY C.: Visualizing multiple variables across scale and geography. IEEE Transactions on Visualization and Computer Graphics 22, 1 (2016), 599–608.

[Gin12] GINI C.: Variabilità e mutabilità. 1912.

[GO92] GETIS A., ORD J. K.: The analysis of spatial association by use of distance statistics. Geographical analysis 24, 3 (1992), 189–206.

[HBC11] HARRIS P., BRUNSDON C., CHARLTON M.: Geographically weighted principal components analysis. International Journal of Geographical Information Science 25, 10 (2011), 1717–1736.

[HHDG03] HOFFMAN F. M., HARGROVE W. W., DEL GENIO A. D.: Multivariate spatio-temporal clustering of time-series data: An approach for diagnosing cloud properties and understanding ARM site representativeness. In Thirteenth ARM Science Team Meeting Proc., Broomfield, Colorado (2003).

[HKV12] HU Y., KOBOUROV S. G., VEERAMONI S.: Embedding, clustering and coloring for dynamic maps. In Pacific Visualization Symposium (PacificVis) (2012), IEEE, pp. 33–40.

[HWH∗11] HOEBER O., WILSON G., HARDING S., ENGUEHARD R., DEVILLERS R.: Exploring geo-temporal differences using GTdiff. In Pacific Visualization Symposium (2011), IEEE, pp. 139–146.

[Ins85] INSELBERG A.: The plane with parallel coordinates. The Visual Computer 1, 2 (1985), 69–91.

[JPDD03] JUNG Y., PARK H., DU D.-Z., DRAKE B. L.: A decision criterion for the optimal number of clusters in hierarchical clustering. Journal of Global Optimization 25, 1 (2003), 91–111.

[KF75] KUIPER F. K., FISHER L.: 391: A Monte Carlo comparison of six clustering procedures. Biometrics (1975), 777–783.

[LSP∗10] LEX A., STREIT M., PARTL C., KASHOFER K., SCHMALSTIEG D.: Comparative analysis of multidimensional, quantitative data. IEEE Transactions on Visualization and Computer Graphics 16, 6 (2010), 1027–1035.

[Mei05] MEILA M.: Comparing clusterings: An axiomatic view. In Proceedings of the 22nd international conference on Machine learning (2005), ACM, pp. 577–584.

[MGK07] MACK E., GRUBESIC T. H., KESSLER E.: Indices of industrial diversity and regional economic composition. Growth and Change 38, 3 (2007), 474–509.

[MHKH11] MILLS R. T., HOFFMAN F. M., KUMAR J., HARGROVE W. W.: Cluster analysis-based approaches for geospatiotemporal data mining of massive data sets for identification of forest threats. Procedia Computer Science 4 (2011), 1612–1621.

[MJ07] MASON G., JACOBSON R.: Fuzzy geographically weighted clustering. In Proceedings of the 9th international conference on geocomputation, Maynooth, Eire, Ireland (2007), pp. 3–5.

[Mon14] MONMONIER M.: How to lie with maps. University of Chicago Press, 2014.

[MRS08] MANNING C. D., RAGHAVAN P., SCHÜTZE H.: Introduction to information retrieval, vol. 1. Cambridge university press, 2008.

[OG95] ORD J. K., GETIS A.: Local spatial autocorrelation statistics: distributional issues and an application. Geographical analysis 27, 4 (1995), 286–306.

[Ope83] OPENSHAW S.: The modifiable areal unit problem. Norwich [Norfolk]: Geo Books, 1983.

[Ran71] RAND W. M.: Objective criteria for the evaluation of clustering methods. Journal of the American Statistical association 66, 336 (1971), 846–850.

[Rob07] ROBERTS J. C.: State of the art: Coordinated & multiple views in exploratory visualization. In Fifth International Conference on Coordinated and Multiple Views in Exploratory Visualization (2007), IEEE, pp. 61–71.

[Rou87] ROUSSEEUW P. J.: Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of computational and applied mathematics 20 (1987), 53–65.

[SDW10] SLINGSBY A., DYKES J., WOOD J.: Rectangular hierarchical cartograms for socio-economic data. Journal of Maps 6, 1 (2010), 330–345.

[SDW11] SLINGSBY A., DYKES J., WOOD J.: Exploring uncertainty in geodemographics with interactive graphics. IEEE Transactions on Visualization and Computer Graphics 17, 12 (2011), 2545–2554.

[SL15] SINGLETON A. D., LONGLEY P.: The internal structure of Greater London: A comparison of national and regional geodemographic models. Geo: Geography and Environment 2, 1 (2015), 69–87.

[SMDS14] SADANA R., MAJOR T., DOVE A., STASKO J.: OnSet: A visualization technique for large-scale binary set data. IEEE Transactions on Visualization and Computer Graphics 20, 12 (2014), 1993–2002.

[SOBL13] SCHRECK T., OMER I., BAK P., LERMAN Y.: Geographic Information Science at the Heart of Europe. Springer International Publishing, Cham, 2013, ch. A Visual Analytics Approach for Assessing Pedestrian Friendliness of Urban Environments, pp. 353–368.

[SW15] SUN M., WONG D. W.: Spatial aggregation as a means to improve data quality. In Proceedings of the 13th International Conference on GeoComputation (2015).

[TBS∗09] TORRES G. J., BASNET R. B., SUNG A. H., MUKKAMALA S., RIBEIRO B. M.: A similarity measure for clustering and its applications. Int. J. of Elec. Comput. & Syst. Eng 3 (2009), 164–170.

[Tob70] TOBLER W. R.: A computer movie simulating urban growth in the Detroit region. Economic geography (1970), 234–240.

[TSH∗14] TURKAY C., SLINGSBY A., HAUSER H., WOOD J., DYKES J.: Attribute signatures: Dynamic visual summaries for analyzing multivariate geographical data. IEEE Transactions on Visualization and Computer Graphics 20, 12 (2014), 2033–2042.

[vLBA∗12] VON LANDESBERGER T., BREMM S., ANDRIENKO N., ANDRIENKO G., TEKUSOVA M.: Visual analytics methods for categoric spatio-temporal data. In IEEE Conference on Visual Analytics Science and Technology (2012), IEEE, pp. 183–192.

[VR07] VICKERS D., REES P.: Creating the UK National Statistics 2001 output area classification. Journal of the Royal Statistical Society: Series A (Statistics in Society) 170, 2 (2007), 379–403.

[War63] WARD J. H.: Hierarchical grouping to optimize an objective function. Journal of the American statistical association 58, 301 (1963), 236–244.

[ZKG09] ZHOU J., KONECNI S., GRINSTEIN G.: Visually comparing multiple partitions of data with applications to clustering. In IS&T/SPIE Electronic Imaging (2009), International Society for Optics and Photonics, pp. 72430J–72430J.
