
Interactive Visualization and Navigation in Large Data Collections using the Hyperbolic Space

Jörg Walter · Jörg Ontrup · Daniel Wessling · Helge Ritter

Neuroinformatics Group · Department of Computer Science · University of Bielefeld · D-33615 Bielefeld · Germany

E-mail: [email protected]

Abstract

We propose the combination of two recently introduced methods for the interactive visual data mining of large collections of data. Both Hyperbolic Multi-Dimensional Scaling (HMDS) and Hyperbolic Self-Organizing Maps (HSOM) employ the extraordinary advantages of the hyperbolic plane (ℍ²): (i) the underlying space grows exponentially with its radius around each point, which is ideal for embedding high-dimensional (or hierarchical) data; (ii) the Poincaré model of the ℍ² exhibits a fish-eye perspective with a focus area and a context-preserving surrounding; (iii) the mouse binding of focus transfer allows intuitive interactive navigation.

The HMDS approach extends multi-dimensional scaling and generates a spatial embedding of the data that represents their dissimilarity structure as faithfully as possible. It is very suitable for interactive browsing of data object collections, but calls for batch precomputation for larger collection sizes.

The HSOM is an extension of Kohonen's Self-Organizing Map and generates a partitioning of the data collection assigned to an ℍ² tessellating grid. While the algorithm's complexity is linear in the collection size, the data browsing is rigidly bound to the underlying grid.

By integrating the two approaches we gain the synergetic effect of adding the advantages of both, and the hybrid architecture consistently uses the ℍ² visualization and navigation concept. We present the successful application to a text mining example involving the Reuters-21578 text corpus.

1. Introduction

The demand for techniques handling large collections of data is rapidly growing. While the power of information systems increases, the amount of information a human user can directly digest does not. The challenge is to provide good clues for the right questions, which is the key to discoveries. The human expert possesses not only valuable background knowledge, intuition, and creativity, but is also vested with powerful pattern recognition and processing capabilities, especially for the visual information channel. The design goals for an optimal user–data interaction strongly depend on the given exploration task, but they certainly include an easy and intuitive navigation with strong support for the user's orientation.

[Figure 1 diagram labels: Data, Layout Technique, Layout space, Projection Technique, Display Area]

Figure 1. Displaying larger collections of data with limited display area requires careful usage of space. The allocation of spatial representation for providing overview and detail is a challenge. After the level of detail is chosen, the data layout operates on the canvas (“layout space”) before it is suitably projected onto the display area (e.g. after panning and zooming).

Visualizing large data collections requires means to use a limited display space effectively and to give the user the overview as well as the details. Since most available data display devices, paper and screens, are two-dimensional, the following problem must be solved: finding a meaningful spatial mapping of data onto the display area. One limiting factor is the “restricted neighborhood” around a point in a Euclidean 2D surface. Hyperbolic spaces open an interesting loophole. The extraordinary property of exponential growth of the neighborhood with increasing radius around all points enables us to build novel displays.

The “hyperbolic tree viewer”, developed at Xerox PARC [6], demonstrated remarkably elegant interactive capabilities. The hyperbolic model appears as a continuously graded focus+context mapping to the display. See [9, 10] for comparative studies with traditional display types.

Unfortunately, previous usage of direct hyperbolic visualization was limited to hierarchical, tree-like, or “quasi-graph” data. Two remedies were recently introduced, suggesting more general ℍ²-layout techniques: one generalizes Kohonen's SOM algorithm to the Hyperbolic Self-Organizing Map algorithm (HSOM) [11]; the other introduces Hyperbolic Multi-Dimensional Scaling (HMDS) [16] for a direct construction of a distance-preserving embedding of high-dimensional data into the hyperbolic space.

In Sec. 2 and 3 we review the hyperbolic space and the three mentioned layout techniques for visualizing data in ℍ². Sec. 4 discusses a synthesis for a two-step visualization architecture. Even though the look and feel of an interactive visualization and navigation cannot really be conveyed in a static format, we report in Sec. 5 several results and snapshots of an application to text mining. This approach allows one to visualize, navigate, and search in a space of documents appearing in the Reuters news stream.

2. The Hyperbolic Space ℍ²

Historically, the hyperbolic space is a “recent” discovery of the 19th century. Lobachevsky, Bolyai, and Gauss independently discovered the non-Euclidean geometries by negating the “parallel axiom” which Euclid formulated 2300 years ago. Today we know three geometries with uniform curvature. Our daily experience is governed by the flat or Euclidean geometry with zero curvature. Still familiar is the spherical geometry with positive curvature, describing the surface of a sphere, like the earth or an orange. Its counterpart with constant negative curvature is known as the hyperbolic plane ℍ² (with analogous generalizations to higher dimensions) [2, 14]. Unfortunately, there is no “perfect” embedding of the ℍ² in ℝ³, which makes it harder to grasp the unusual properties of the ℍ². Local patches resemble the situation at a saddle point, where the neighborhood grows faster than in flat space (see Fig. 2). Standard textbooks on Riemannian geometry (see, e.g. [7]) show that the area a and the circumference c of a circle of radius ρ in ℍ² are given by

$$ a(\rho) = 4\pi \sinh^2(\rho/2) \tag{1} $$

$$ c(\rho) = 2\pi \sinh(\rho). \tag{2} $$

This bears two remarkable asymptotic properties: (i) for small radius ρ the space “looks flat” since a(ρ) ≈ πρ² and c(ρ) ≈ 2πρ; (ii) for larger ρ both grow exponentially with the radius. As observed in [6, 5], this trait makes the hyperbolic space ideal for embedding hierarchical structures. Fig. 2 illustrates the spatial relations by embedding a small patch of the ℍ² in ℝ³.
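The two asymptotic regimes can be checked numerically. The following Python sketch evaluates Eqs. 1 and 2 and divides them by their flat counterparts πρ² and 2πρ; the function names are illustrative and not part of the original work.

```python
import math

def hyperbolic_circle_area(rho):
    """Area of a circle of radius rho in the hyperbolic plane (Eq. 1)."""
    return 4.0 * math.pi * math.sinh(rho / 2.0) ** 2

def hyperbolic_circle_circumference(rho):
    """Circumference of a circle of radius rho in the hyperbolic plane (Eq. 2)."""
    return 2.0 * math.pi * math.sinh(rho)

if __name__ == "__main__":
    # Ratio of the hyperbolic quantity to the Euclidean one for growing radius.
    for rho in (0.01, 1.0, 5.0, 10.0):
        area_ratio = hyperbolic_circle_area(rho) / (math.pi * rho ** 2)
        circ_ratio = hyperbolic_circle_circumference(rho) / (2.0 * math.pi * rho)
        print(f"rho={rho:5.2f}  area ratio={area_ratio:10.3f}  "
              f"circumference ratio={circ_ratio:10.3f}")
```

For ρ = 0.01 both ratios stay close to 1 (the space looks flat), while for ρ = 10 both exceed 100.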

To use the visualization potential of the ℍ² we must solve the two problems displayed in Fig. 1. We turn first to the projection problem, which was solved for the ℍ² more than a century ago.

2.1. The Projection Solution for ℍ²

The perfect projection into the flat display area should preserve length, area, and angles (≈ form). But it lies in the nature of a curved space to resist the attempt to achieve these goals simultaneously. Consequently, several projections or maps of the hyperbolic space were developed. Four are especially well examined: (i) the Minkowski, (ii) the upper-half plane, (iii) the Klein-Beltrami, and (iv) the Poincaré or disk mapping. See [2] for more details and geometric mappings to transform between (i)–(iv).

Figure 2. There is literally more room in hyperbolic space than in Euclidean space, as shown in this illustrated embedding of the hyperbolic plane into 3D Euclidean space (courtesy of Jeffrey Weeks). (Right:) Exponential growth (Eqs. 1 and 2) of the circumference c(ρ) and area a(ρ) is experienced if a “circle” with radius ρ is drawn in the wrinkling structure. (Left:) The sum of angles in a triangle is smaller than 180°.

The Poincaré projection is the most suitable for our purpose. Its main characteristics are:

Display compatibility: The infinitely large area of the ℍ² is mapped entirely into a circle, the Poincaré disk PD. This representation of infinity fascinated Maurits Escher and inspired several of his woodcuts [15].

Circle rim “= ∞”: All remote points are close to the rim, without touching it.

Focus+Context: The focus can be moved to each location in ℍ², like a “fovea”. The zooming factor is 0.5 in the center and falls off (exponentially) with distance to the fovea. Therefore, the context appears very natural: the more remote things are, the less spatial representation they are assigned in the current display.

Lines become circles: All ℍ²-lines (see footnote 1) appear as circle arc segments or centered straight lines in PD (both belong to the set of so-called “generalized circles”). Their extensions always cross the PD rim perpendicularly on both ends.

Conformal mapping: Angles (and therefore form) are preserved in PD; area and length relations obviously are not.

Regular tessellations with triangles offer richer possibilities than in ℝ². It turns out that there is an infinite set of choices to tessellate ℍ²: for any integer n ≥ 7, one can construct a regular tessellation in which n triangles meet at each vertex (in contrast to the plane, which allows only n = 3, 4, 6, and the sphere, which allows only n = 3, 4, 5). Fig. 3 depicts examples for n = 7 and n = 10.

Footnote 1: A line is by definition the shortest path between two points.

One way to compute these tessellations algorithmically is by repeated application of a suitable set of generators of their symmetry group to a suitably sized “starting triangle” (see also Eq. 3 and [11]).
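How large the “starting triangle” has to be follows from the requirement that n congruent equilateral triangles close up around a vertex, so each interior angle is 2π/n. The Python sketch below derives the side length from the standard hyperbolic law of cosines for angles (curvature −1); this formula is textbook geometry brought in for illustration, not taken from the paper, and the function names are hypothetical.

```python
import math

def triangle_side_length(n):
    """Hyperbolic side length of the equilateral triangle with interior angle 2*pi/n.

    Uses cosh(s) = (cos A + cos^2 A) / sin^2 A = cos A / (1 - cos A), A = 2*pi/n.
    """
    if n < 7:
        raise ValueError("a regular triangular tessellation of H^2 needs n >= 7")
    cos_a = math.cos(2.0 * math.pi / n)
    return math.acosh(cos_a / (1.0 - cos_a))

def first_ring_radius(n):
    """Euclidean radius in the Poincare disk of the neighbors of a node at the origin."""
    return math.tanh(triangle_side_length(n) / 2.0)

if __name__ == "__main__":
    for n in (7, 8, 10):
        print(n, triangle_side_length(n), first_ring_radius(n))
```

The side length grows with n, in line with the remark in the caption of Fig. 3 that the triangle side length increases for n = 10.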

2.2. Moving the Focus

For changing the focus point in PD we need a translation operation which can be bound to mouse click and drag events. In the Poincaré disk model the Möbius transformation T(z) is the appropriate solution. By describing the Poincaré disk PD as the unit circle in the complex plane, the isometric transformations for a point z ∈ PD can be written

$$ z' = T(z; c, \theta) = \frac{\theta z + c}{\bar{c}\,\theta z + 1}, \qquad |\theta| = 1, \quad |c| < 1. \tag{3} $$

Here the complex number θ describes a pure rotation of PD around the origin 0. The subsequent translation by c maps the origin to c, and −c becomes the new center 0 (if θ = 1). The Möbius transformations are also called the “circle automorphisms” of the complex plane, since they map circles to (generalized) circles. Here they serve to translate ℍ² straight lines to lines, both appearing as generalized circles in the PD projection. For further details, see [5, 15].
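Eq. 3 maps directly onto complex arithmetic. The following Python sketch is a minimal illustration; `move_focus` is a hypothetical helper (not from the paper) that assumes θ = 1 and picks c as the negative of the desired new focus, so that this point lands at the origin of the disk.

```python
def mobius(z, c, theta=1.0 + 0.0j):
    """Apply the Moebius transformation of Eq. 3 to a point z of the unit disk."""
    z, c, theta = complex(z), complex(c), complex(theta)
    if abs(abs(theta) - 1.0) > 1e-9 or abs(c) >= 1.0:
        raise ValueError("Eq. 3 requires |theta| = 1 and |c| < 1")
    return (theta * z + c) / (c.conjugate() * theta * z + 1.0)

def move_focus(points, new_focus):
    """Re-center the display so that new_focus is mapped to the origin.

    Assumption for this sketch: pure translation (theta = 1) with c = -new_focus.
    """
    c = -complex(new_focus)
    return [mobius(p, c) for p in points]

if __name__ == "__main__":
    c = 0.3 + 0.4j
    print(mobius(0.0, c))   # the origin is mapped to c
    print(mobius(-c, c))    # -c becomes the new center
    print(move_focus([0.5j, 0.2, -0.7], 0.5j))
```

In an interactive viewer, a mouse drag would supply `new_focus`, and all displayed points would be passed through the same transformation.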

Figure 3. Regular ℍ² tessellation with congruent triangles. (Left:) Here, n = 7 triangles meet at each vertex. Due to the angular deficit of the angle sum in ℍ²-triangles, the minimal number to complete a circle is 7. (Right:) For n = 10 the triangle side length increases to perfectly fill the plane. Note that all ℍ²-lines appear as circle arcs, which extend perpendicularly to the “∞-rim”.

3. Layout Techniques in ℍ²

In the following section we discuss three layout techniques for the ℍ².

3.1. Hyperbolic Tree Layout (HTL) for Tree-Like Graph Data

Now we turn to the question raised earlier: how to accommodate data in the hyperbolic space. A solution to this question for the case of acyclic, tree-like graph data was provided by Lamping and Rao [6, 5]. By using mainly successive applications of the transformation Eq. 3 they developed (and patented) a method to find a suitable layout for this data type in ℍ². Each tree node receives a certain open space “pie segment”, in which the node chooses the locations of its children. For each child i it recursively calls the layout routine after applying the Möbius transformation Eq. 3 in order to center i.

Tamara Munzner developed another graph layout algorithm for the three-dimensional hyperbolic space [8]. While she gains much more space for the layout, the problem of more complex navigation (and viewport control) in 3D and, more seriously, the problem of occlusion appear.

The next two layout techniques are freed from the requirement of hierarchical data.

3.2. Hyperbolic Self-Organizing Map (HSOM)

[Figure 4 diagram labels: w_a*, x, Grid of Neurons, a*, Input Space X]

Figure 4. The “Self-Organizing Map” (“SOM”) is formed by a grid of processing units, called formal neurons. Here the usual case, a two-dimensional grid, is illustrated at the right side. Each neuron has a reference or prototype vector $w_a$ attached, which is a point in the embedding input space X. A presented input x will select the neuron with $w_a$ closest to it. The HSOM uses a hyperbolic grid as displayed in Fig. 3 and the appropriate neighborhood function h(·).

The standard Self-Organizing Map (SOM) algorithm is used in many applications for learning and visualization (Kohonen, e.g. [3]). Fig. 4 illustrates the basic operation. The feature map is built by a lattice of nodes (or formal neurons) $a \in A$, each with a reference vector or “prototype vector” $w_a$ attached, projecting into the input space X. The response of a SOM to an input vector x is determined by the reference vector $w_{a^*}$ of the discrete “best-match” node $a^*$, i.e. the node which has its prototype vector $w_a$ closest to the given input:

$$ a^* = \arg\min_{a \in A} \| w_a - x \|. \tag{4} $$

The distribution of the reference vectors $w_a$ is iteratively adapted by a sequence of training vectors x. After finding the best-match neuron $a^*$, all reference vectors are updated, $w_a^{(new)} := w_a^{(old)} + \Delta w_a$, by the adaptation rule

$$ \Delta w_a = \varepsilon \, h(a, a^*) \, (x - w_a), \tag{5} $$

with

$$ h(a, a^*) = \exp\left[ -(d_{a,a^*}/\lambda)^2 \right]. \tag{6} $$

Here $h(a, a^*)$ is a bell-shaped Gaussian centered at the “winner” $a^*$ and decaying with increasing distance $d_{a,a^*} = |g_a - g_{a^*}|$ in the neuron grid $\{g_a\}$. Thus, each node or “neuron” in the neighborhood of the “winner” $a^*$ participates in the current learning step (as indicated by the gray shading in Fig. 4).

The network starts with a given node grid A and a random initialization of the reference vectors. During the course of learning, the width λ of the neighborhood bell function and the learning step size parameter ε are continuously decreased in order to allow more and more specialization and fine tuning of the (then increasingly) weakly coupled neurons.
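The adaptation loop described by Eqs. 4 to 6 can be sketched in a few lines of plain Python. The sketch keeps the grid abstract: `grid_distance(a, b)` is a caller-supplied function, so a Euclidean lattice or a hyperbolic grid can be plugged in. The exponential decay schedule in `train` and all parameter defaults are assumptions made for illustration; the text only states that λ and ε are decreased continuously.

```python
import math

def best_match(weights, x):
    """Index a* of the node whose prototype vector is closest to x (Eq. 4)."""
    return min(range(len(weights)), key=lambda a: math.dist(weights[a], x))

def som_step(weights, grid_distance, x, eps, lam):
    """Adapt all prototype vectors in place towards x (Eqs. 5 and 6)."""
    winner = best_match(weights, x)
    for a, w_a in enumerate(weights):
        # Gaussian neighborhood function of the grid distance to the winner.
        h = math.exp(-(grid_distance(a, winner) / lam) ** 2)
        for q in range(len(w_a)):
            w_a[q] += eps * h * (x[q] - w_a[q])
    return winner

def train(weights, grid_distance, samples, epochs=20,
          eps_start=0.5, eps_end=0.05, lam_start=2.0, lam_end=0.5):
    """Hypothetical schedule: eps and lam decay exponentially over the epochs."""
    for t in range(epochs):
        frac = t / max(epochs - 1, 1)
        eps = eps_start * (eps_end / eps_start) ** frac
        lam = lam_start * (lam_end / lam_start) ** frac
        for x in samples:
            som_step(weights, grid_distance, x, eps, lam)
    return weights

if __name__ == "__main__":
    # Toy example: three nodes on a chain, two-dimensional inputs.
    chain = lambda a, b: abs(a - b)
    weights = [[0.0, 0.0], [1.0, 1.0], [0.0, 0.0]]
    print(som_step(weights, chain, [2.0, 2.0], eps=0.5, lam=1.0), weights)
```

Passing the grid distance as a function is a design choice of this sketch: switching from a flat to a hyperbolic map then only requires a different `grid_distance`.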

This neighborhood cooperation in the adaptation algorithm has important advantages: (i) it is able to generate topological order between the $w_a$, which means that similar inputs are mapped to neighboring nodes; (ii) as a result, the convergence of the algorithm can be sped up by involving a whole group of neighboring neurons in each learning step.

The structure of this neighborhood is essentially governed by the structure of $h(a, a^*) = h(d_{a,a^*})$, which is therefore also called the neighborhood function. Most learning and visualization applications choose $d_{a,a^*}$ as distances in a regular two- or three-dimensional Euclidean lattice.

SOMs in non-Euclidean spaces were suggested by one of the authors [11]. The core idea of the Hyperbolic Self-Organizing Map (HSOM) is to employ an ℍ²-grid of nodes. A particularly convenient choice is to take the $\{g_a\} \in PD$ of a finite patch of the triangular tessellation grid introduced in Sec. 2.1. The internode distance is computed in the appropriate Poincaré metric as

$$ d_{a,a^*} = 2 \operatorname{arctanh}\left( \frac{|g_a - g_{a^*}|}{|1 - g_a \bar{g}_{a^*}|} \right). \tag{7} $$
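With node positions stored as complex numbers inside the unit disk, Eq. 7 is a one-liner. The Python sketch below is illustrative only; the function name is not from the paper.

```python
import math

def poincare_distance(u, v):
    """Hyperbolic distance between two points of the unit disk (Eq. 7)."""
    u, v = complex(u), complex(v)
    return 2.0 * math.atanh(abs(u - v) / abs(1.0 - u * v.conjugate()))

if __name__ == "__main__":
    # A point at Euclidean radius tanh(0.5) lies at hyperbolic distance 1 from the origin.
    print(poincare_distance(0.0, math.tanh(0.5)))
    # Points near the rim are far apart although they look close on the display.
    print(poincare_distance(0.99, 0.99j))
```

Such a function can serve directly as the grid distance in the neighborhood function of Eq. 6.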

3.3. Hyperbolic Multidimensional Scaling (HMDS)

Multidimensional scaling refers to a class of algorithms for finding a suitable representation of proximity relations of N objects by distances between points in a low-dimensional, usually Euclidean, space. In the following we represent proximity as dissimilarity values between pairs of objects, mathematically written as dissimilarity $\delta_{ij} \in \mathbb{R}^+_0$ between the i-th and j-th item. As usual we assume symmetry, i.e. $\delta_{ij} = \delta_{ji}$. Often the raw dissimilarity distribution is not suitable for the low-dimensional embedding and an additional δ-processing step is applied. We model it here as a monotonic transformation D(·) of dissimilarities $\delta_{ij}$ into disparities $D_{ij} = D(\delta_{ij})$.

The goal of the MDS algorithm is to find a spatial representation $x_i$ of each object i in the L-dimensional space, where the pair distances $d_{ij} \equiv d(x_i, x_j)$ match the disparities $D_{ij}$ as faithfully as possible, $\forall i \neq j: D_{ij} \approx d_{ij}$. The pair distance is usually measured by the Euclidean distance:

$$ d_{ij} = \| x_i - x_j \| \quad \text{with} \quad x_i \in \mathbb{R}^L, \quad i, j \in \{1, 2, \ldots, N\}. \tag{8} $$

One of the most widely known MDS algorithms was introduced by Sammon [12] in 1969. He formulates a minimization problem of a cost function which sums over the squares of disparity–distance misfits, here written as

$$ E(\{x_i\}) = \sum_{i=1}^{N} \sum_{j>i} w_{ij} (d_{ij} - D_{ij})^2. \tag{9} $$

The factors $w_{ij}$ are introduced to weight the disparities individually and also to normalize the cost function E to be independent of the absolute scale of the disparities $D_{ij}$. Depending on the given analysis task the factors can be chosen to weight all the disparities equally, the global variant ($w^{(g)}_{ij} = \text{const}$), or to emphasize the local structure by reducing the influence of larger disparities ($w^{(l)}_{ij}$, which we are using in the following):

$$ w^{(g)}_{ij} = \frac{1}{\sum_{k=1}^{N} \sum_{l>k} D_{kl}^2}, \qquad w^{(l)}_{ij} = \frac{2}{N(N-1)} \, \frac{1}{D_{ij}^2}. \tag{10} $$

Note that the latter is undefined if any pair has zero disparity. In his original work [12] Sammon suggested an intermediate normalization $w^{(m)}_{ij} = \left( \sum_{k=1}^{N} \sum_{l>k} D_{kl} \right)^{-1} D_{ij}^{-1}$. The set of $x_i$ is found by a gradient descent procedure, iteratively minimizing the cost or stress function Eq. 9. The reader is referred to [12, 1] for further details on this and other MDS algorithms.
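The stress function of Eq. 9 and the two weighting variants of Eq. 10 can be written down compactly. The Python sketch below is a minimal illustration with hypothetical function names; disparities are passed as a symmetric nested list `D`, and the distance function defaults to the Euclidean distance of Eq. 8.

```python
import math
from itertools import combinations

def global_weights(D):
    """Constant weights w^(g) of Eq. 10."""
    n = len(D)
    total = sum(D[k][l] ** 2 for k, l in combinations(range(n), 2))
    return [[1.0 / total] * n for _ in range(n)]

def local_weights(D):
    """Weights w^(l) of Eq. 10; fails if two distinct objects have zero disparity."""
    n = len(D)
    norm = 2.0 / (n * (n - 1))
    return [[norm / D[i][j] ** 2 if i != j else 0.0 for j in range(n)]
            for i in range(n)]

def stress(points, D, weights, dist=math.dist):
    """Cost function E of Eq. 9, summed over all pairs i < j."""
    n = len(points)
    return sum(weights[i][j] * (dist(points[i], points[j]) - D[i][j]) ** 2
               for i, j in combinations(range(n), 2))

if __name__ == "__main__":
    pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
    D = [[math.dist(p, q) for q in pts] for p in pts]
    doubled = [(2.0 * x, 2.0 * y) for x, y in pts]
    print(stress(pts, D, local_weights(D)))      # perfect fit
    print(stress(doubled, D, local_weights(D)))  # every distance is twice its disparity
```

Both normalizations make E independent of the absolute scale of the disparities: a configuration in which every distance is twice its disparity has stress 1 under either weighting, whatever the units of D.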

The recently introduced Hyperbolic Multi-Dimensional Scaling (HMDS) [16] combines the concepts of MDS and hyperbolic geometry. The core idea turns out to be very simple: instead of finding an MDS solution in the low-dimensional Euclidean $\mathbb{R}^L$ and transferring it to the ℍ² (which cannot work well), the MDS formalism operates in the hyperbolic space from the beginning. The key point is Eq. 8. The Euclidean distance in the target space is replaced by the appropriate distance metric for the Poincaré model (see, e.g. [7] and compare Eq. 7):

$$ d_{ij} = 2 \operatorname{arctanh}\left( \frac{|x_i - x_j|}{|1 - x_i \bar{x}_j|} \right), \qquad x_i, x_j \in PD. \tag{11} $$

While the gradients $\partial d_{ij} / \partial x_{i,q}$ required for the gradient descent are rather simple to compute for the Euclidean geometry, the case becomes complex for Eq. 11 (see footnote 2). Details can be found in [16, 15].
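To make the idea concrete without reproducing the analytic gradients of [16], the following Python sketch minimizes the stress of Eq. 9 (with the local weights of Eq. 10 and the distance of Eq. 11) using a finite-difference gradient and a simple backtracking step. This is a deliberately naive stand-in for the authors' optimizer; the function names, step sizes, and the radius limit `max_radius` are assumptions chosen for illustration.

```python
import math
import random
from itertools import combinations

def poincare_distance(u, v):
    """Distance between two complex points of the unit disk (Eq. 11)."""
    return 2.0 * math.atanh(abs(u - v) / abs(1.0 - u * v.conjugate()))

def stress(points, D):
    """Eq. 9 with the local weights of Eq. 10."""
    n = len(points)
    norm = 2.0 / (n * (n - 1))
    return sum(norm * ((poincare_distance(points[i], points[j]) - D[i][j]) / D[i][j]) ** 2
               for i, j in combinations(range(n), 2))

def hmds(D, sweeps=100, step_size=0.1, h=1e-6, max_radius=0.999, seed=0):
    """Toy HMDS: point-wise descent with a forward-difference gradient."""
    rng = random.Random(seed)
    n = len(D)
    pts = [complex(rng.uniform(-0.1, 0.1), rng.uniform(-0.1, 0.1)) for _ in range(n)]
    current = stress(pts, D)
    history = [current]
    for _ in range(sweeps):
        for i in range(n):
            original = pts[i]
            # Numerical gradient with respect to Re(x_i) and Im(x_i).
            grad = 0j
            for direction in (1.0 + 0j, 1j):
                pts[i] = original + h * direction
                grad += direction * (stress(pts, D) - current) / h
            pts[i] = original
            # Backtracking: halve the step until the stress decreases inside the disk.
            step = step_size
            for attempt in range(10):
                candidate = original - step * grad
                if abs(candidate) < max_radius:
                    pts[i] = candidate
                    trial = stress(pts, D)
                    if trial < current:
                        current = trial
                        break
                    pts[i] = original
                step *= 0.5
        history.append(current)
    return pts, history

if __name__ == "__main__":
    truth = [0j, 0.6 + 0j, 0.6j, -0.5 - 0.5j]
    D = [[poincare_distance(a, b) for b in truth] for a in truth]
    layout, history = hmds(D, sweeps=50)
    print(history[0], history[-1])
```

Because a step is accepted only if it lowers the stress, the recorded `history` never increases; how close it gets to zero depends on the data and on the step parameters.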

Disparity preprocessing: Due to the non-linearity of Eq. 11, the preprocessing function D(·) (see Sec. 3.3) has more influence in ℍ². Consider, e.g., linear rescaling of the dissimilarities $D_{ij} = \alpha \delta_{ij}$: in the Euclidean case the visual structure is not affected, only magnified by α. In contrast, in ℍ², α scales the distribution and with it the amount of curvature felt by the data. The optimal α depends on the given task, the dataset, and its dissimilarity structure. We set α manually and choose a compromise between visibility of the entire structure and space for navigation in the detail-rich areas.

4. Hyperbolic Data Viewer: Combining the Advantages

Before we introduce a new hybrid architecture, we first compare the advantages and disadvantages of the three ℍ² layout methods with respect to several aspects.

Input data type: The HTL requires acyclic graph data and is therefore limited to hierarchically ordered data (preferably balanced with a branch count ≈ 4–12).

The HSOM processes only vectorial data representations, while the HMDS uses dissimilarity data. Since a suitable distance function can directly transform any data type into dissimilarity data (but not vice versa) and the handling of missing data is easy, this can be considered the most general data type.

Scaling behavior for the number of objects N: Both HTL and HSOM share the advantage of linear scaling with the number of objects. HSOM also scales linearly with the input space dimension and the number of nodes. HMDS does not scale well and requires processing N(N − 1)/2 distance pairs. When N grows to several hundred objects, the layout generation becomes slow and the results less convincing. Precomputation may then help to keep the interactive exploration unhindered.

Layout result: The HTL returns the ℍ²-location determined by the recursive space partitioning.

The HSOM returns the rigid grid, i.e., the triangular tessellation grid. Each object or document is mapped to the node with the best-matching prototype vector assigned (Eq. 4). Each node is associated with two sources of descriptive information: the collected set of assigned objects and the prototype vector representing the group. This information can be transformed into various kinds of graphical attribution and annotation.

In contrast to the former, the layout results of the HMDS directly carry information on the data level, since the spatial locations represent the similarity structure of the given pair distance data. Therefore, the map metaphor of closeness and proximity is here brought to the detail level.

Footnote 2: Note that no complex gradient information is required in the HSOM approach.

[Figure 5 diagram labels: HSOM, ℍ² prototypes, Poincaré Projection, Display Area, navigate, select, Data Selection, HMDS, ℍ² objects, Poincaré Projection, Display Area, navigate]

Figure 5. The proposed architecture combines the advantages of the two layout techniques: (upper part) the HSOM for obtaining a coarse map of a large data collection and (lower part) the HMDS for mapping a smaller data set to a spatially continuous representation of data relationships. The display concept is unified: both employ the extraordinary visualization and navigation features of the hyperbolic plane.

New objects: For the HTL a new object requires a new partial layout of the smallest subtree(s) containing the new objects.

The HSOM maps a new object to the best-matching node, i.e. a location in the map. The mapping time scales with the grid size since it involves as many comparisons as there are nodes. Furthermore, the architecture may choose to implement online learning in order to adapt to new training data.

The HMDS requires a new minimization of the global cost function. For speedup, it can employ the previous object locations as the start configuration.

New Hybrid Architecture: The previous discussion of advantages and disadvantages of the layout techniques motivates the synthesis proposed here, which has three core components: (i) the HSOM for building a coarse-grain theme map in a self-organized manner; (ii) HMDS for detailed inspection of data subsets where data similarities are continuously reflected as spatial proximities; (iii) a display paradigm that employs the hyperbolic plane in both cases in order to profit from its focus and context technique. Fig. 5 displays the basic architecture.

5. Application to Navigation in Unstructured Text

In times of exponential growth of digital information, semantic navigation in datasets, particularly for the case of unstructured text documents, is a major challenge. In this experiment we demonstrate the application of the proposed architecture to this situation.

As an example we use the “Reuters-21578” collection (see footnote 3) of news articles that appeared on the Reuters newswire between 1987/02/26 and 1987/10/20. Most of the documents were manually tagged with 135 different category names such as “earn”, “trade”, or “jobs”. We employed the 9603 documents of the training set from the “ModApte” split to form the HSOM input vectors x using the standard bag-of-words model with the TFIDF scheme (term frequency times inverse document frequency), resulting in a 5561-dimensional vector space (equal to the number of derived word stems).

Footnote 3: As compiled by David Lewis from the AT&T Research Lab in 1987. The data can be found at http://www.daviddlewis.com/resources/testcollections/reuters21578

Distances, and therewith dissimilarities, of two documents are computed with the cosine metric

$$ \delta_{ij} = 1 - \cos(\vec{f}_{t_i}, \vec{f}_{t_j}) = 1 - \vec{f}'_{t_i} \cdot \vec{f}'_{t_j}, \quad \text{with} \quad \vec{f}' = \frac{\vec{f}}{\|\vec{f}\|}, \tag{12} $$

and efficiently implemented by storing the normalized document feature vectors $\vec{f}'$.
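The efficiency argument is that normalization happens once per document, after which each dissimilarity is a single dot product. A minimal Python sketch of Eq. 12 follows, with illustrative function names and dense lists standing in for the TFIDF vectors (a real implementation would likely use a sparse representation).

```python
import math

def normalize(f):
    """Return f' = f / ||f||, the unit-length version of a feature vector."""
    norm = math.sqrt(sum(v * v for v in f))
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero feature vector")
    return [v / norm for v in f]

def cosine_dissimilarity(fa_unit, fb_unit):
    """Eq. 12 for two already normalized feature vectors."""
    return 1.0 - sum(a * b for a, b in zip(fa_unit, fb_unit))

if __name__ == "__main__":
    # Toy TFIDF-like vectors over a four-term vocabulary.
    docs = [[3.0, 4.0, 0.0, 0.0], [6.0, 8.0, 0.0, 0.0], [0.0, 0.0, 2.0, 1.0]]
    unit = [normalize(f) for f in docs]   # stored once
    print(cosine_dissimilarity(unit[0], unit[1]))  # same direction
    print(cosine_dissimilarity(unit[0], unit[2]))  # no shared terms
```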

5.1. Interactive Browsing of the Overview Map

Embedding the 2D Poincaré model in 3D Euclidean space and placing a 3D glyph at each node position allows for the simultaneous visualization of several attributes. The glyph size, form, color, or height above the PD ground plane might characterize the number of documents, the predominant category in the corresponding node, or the number of new documents mapped.

The number of nodes for the HSOM is chosen depending on the size of the text database to be mined. In Figure 6 an HSOM with a total of 1306 nodes is shown. Since the HMDS approach can handle sizes of several hundred documents, such a map could easily contain a million articles.

Figure 6. An HSOM projection of a large collection of newswire articles forms semantically related category clusters (shown as different glyphs).

In the case of the Reuters-21578 collection we show results with an HSOM consisting of a tessellation with 3 “rings” and 8 neighbors per node, resulting in 161 prototype vectors as shown in Figure 7. By mapping the 12902 documents of the Reuters-21578 training and test collection we obtain a mean of 80 documents per node, which are given to the HMDS module for further inspection. This number of documents can be handled in real time by HMDS and allows an online interactive text mining process.
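The node counts quoted here can be reproduced with a short calculation. The Python sketch below assumes (this is not stated in the text) that ring k of a regular triangular tessellation with n neighbors per node contains r_k nodes, with r_0 = 0, r_1 = n, and r_{k+1} = (n − 4) r_k − r_{k−1}. Under that assumption, 3 rings with n = 8 give the 161 nodes reported above. The function name is illustrative.

```python
def hsom_node_count(n_neighbors, rings):
    """Number of grid nodes: one center node plus `rings` surrounding rings.

    Assumed ring recurrence: r_1 = n, r_{k+1} = (n - 4) * r_k - r_{k-1}, r_0 = 0.
    """
    if n_neighbors < 7:
        raise ValueError("a hyperbolic triangular grid needs at least 7 neighbors")
    previous, current = 0, n_neighbors
    total = 1  # the center node
    for _ in range(rings):
        total += current
        previous, current = current, (n_neighbors - 4) * current - previous
    return total

if __name__ == "__main__":
    nodes = hsom_node_count(8, 3)
    print(nodes, 12902 / nodes)  # node count and mean number of documents per node
```

Under the same assumption, n = 9 with 4 rings yields 1306 nodes, which matches the map size quoted for Figure 6, although the text does not state the parameters used there.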

Figure 7. The Reuters-21578 corpus coarsely mapped with an HSOM containing 161 nodes. The glyph size corresponds to the number of documents before 1987/04/07, the height above the ground plane to the number of articles after 1987/04/07.

Figure 7 is the starting point for an interactive text mining session, which we describe in the following. The global hyperbolic overview map reveals several large clusters which mainly contain the top 10 topics. There is only one larger glyph (at the 3 o'clock position) indicating documents not belonging to the top 10 categories. The area marked with the question mark “?” contains an isolated yellow sphere which is surrounded by green cubes. In Figure 8 the user has adjusted the fovea of the hyperbolic map to inspect this selected region more closely. The figure shows that the relation of the nodes in this region can now be inspected more easily while the global context is still in view. When a node is interactively selected, the HSOM prototype vectors are used to automatically generate a keyword list which annotates and semantically describes the selected glyph. To this end, the words corresponding to the ten largest components of the reference vector are selected. These describe the prototype document, which resembles a non-linear superposition of the texts for which this node is the best-match node. In our example the most prominent words “strike”, “union”, and “port” indicate that this area of the map probably contains articles describing worker strikes in ports.
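The keyword annotation step amounts to ranking the components of a prototype vector and looking up the corresponding word stems. A minimal Python sketch follows; the function name, the toy vocabulary, and the toy weights are made up for illustration.

```python
def top_keywords(prototype, vocabulary, k=10):
    """Return the k words whose components in the prototype vector are largest."""
    ranked = sorted(range(len(prototype)), key=lambda q: prototype[q], reverse=True)
    return [vocabulary[q] for q in ranked[:k]]

if __name__ == "__main__":
    vocabulary = ["port", "grain", "strike", "union", "rate"]
    prototype = [0.5, 0.1, 0.9, 0.7, 0.0]   # hypothetical prototype components
    print(top_keywords(prototype, vocabulary, k=3))
```

With the 5561-dimensional vectors used here, `vocabulary` would be the list of derived word stems and k = 10.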

By mouse selection, all documents assigned to the node are sent to the HMDS module. Fig. 9 displays the HMDS result. The presentation here is a lean 2D display with minimal occlusion and uses markers to indicate the category of each object. Several clusters can be easily recognized. The group marked “A” is a category mixture while the others are quite homogeneous.

By turning on the labels, the document titles become visible and the semantic homogeneity can be verified. Fig. 10 displays a screenshot after sweeping the navigation focus


Figure 8. Inspection of a picked node by adjusting the fovea. The nodes’ annotations were generated by evaluation of the corresponding prototype vectors. They indicate the nodes’ contents and show the semantic relationship of adjacent nodes.

[Figure 9: H2-MDS display. Legend: Other, Corn, Ship, Wheat, Interest, Trade, Grain, Crude, Money-FX, Acquisition, Earn. Overlay markers: A, B, C, D, E.]

Figure 9. The screenshot of the HMDS visualizes all documents in the selected node #143, positioned in the H2. The legend at the right explains the marker types used for the top 10 categories each document can be labeled with. The cross visually indicates the zero point, and the markers A–E are an overlay added for explanation.

to the cluster structure “C”, which previously lay in the lower left direction.
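Such a focus transfer can be sketched as follows, under the assumption that it is realized by the standard Möbius translation of the Poincaré disk, with points represented as complex numbers of modulus below one. The function name `move_focus` and the sample points are hypothetical; this section does not show the actual implementation.

```python
def move_focus(z, c):
    """Moebius translation of the Poincare disk that maps the point c to the center.

    Assumption: points are complex numbers with |z| < 1 and |c| < 1.
    The map is an isometry of the hyperbolic plane, so all points
    stay inside the unit disk.
    """
    return (z - c) / (1 - c.conjugate() * z)

# Hypothetical display positions before the focus change.
points = [0.0 + 0.0j, 0.5 + 0.0j, -0.3 + 0.4j]
new_focus = 0.5 + 0.0j

moved = [move_focus(z, new_focus) for z in points]
for before, after in zip(points, moved):
    print(before, "->", after)
```

The selected point lands in the center of the disk, while the old center moves towards the rim but remains visible as context.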

5.2. Similarity Search for Queries or New Documents

The hybrid approach can also be used to find documents similar to a query within a large collection. This is achieved by generating a query feature vector which is compared to all prototypes. The corresponding best-match node is then visually highlighted, such that the user can adjust the focus of attention and “zoom” into that region. In order to demonstrate that context plays an important role, we formulate a query containing the word “strike” (which was the most important entry for the node inspected in Figure 8). Figure 11

[Figure 10: H2-MDS display with labels turned on. Visible document labels: CARGILL_U.K._STRIKE_TALKS_BREAK_OFF_WITHOUT_RESULT-12425, CARGILL_U.K._STRIKE_TALKS_TO_RESUME_THIS_AFTERNOON-9197, CARGILL_U.K._STRIKE_TALKS_TO_RESUME_TUESDAY-6993, CARGILL_STRIKE_TALKS_CONTINUING_TODAY-5833, CARGILL_U.K._STRIKE_TALKS_TO_CONTINUE_MONDAY-3710, CARGILL_U.K._STRIKE_TALKS_POSTPONED_TILL_MONDAY-1966, CARGILL_U.K._STRIKE_TALKS_POSTPONED-486. Legend: Other, Corn, Ship, Wheat, Interest, Trade, Grain, Crude, Money-FX, Acquisition, Earn.]

Figure 10. Navigation to the “C”-marked document cluster in Fig. 9. Now the cluster is in focus and labels are turned on. All documents are related to a strike at Cargill U.K. Ltd’s oilseed processing plant at Seaforth at the beginning of 1987. Note how the quartering lines in Fig. 9 are transferred to other H2 lines (!), which appear as circle arcs perpendicular to the rim.

shows a map where the focus is centered on the winning node of the query: “USA leading the strike in a Gulf war against Iraq?”.
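The best-match search for a query can be sketched as below. The similarity measure is an assumption: cosine similarity is used here because it is common for text feature vectors, and this section does not state which measure is applied. The function name `best_match_node` and the toy prototypes are hypothetical.

```python
import numpy as np

def best_match_node(query, prototypes):
    """Return the index of the prototype most similar to the query.

    Assumption: similarity is measured by the cosine of the angle
    between the query vector and each prototype vector.
    """
    q = np.asarray(query, dtype=float)
    P = np.asarray(prototypes, dtype=float)
    norms = np.linalg.norm(P, axis=1) * np.linalg.norm(q)
    # Guard against zero vectors to avoid division by zero.
    sims = (P @ q) / np.maximum(norms, 1e-12)
    return int(np.argmax(sims)), sims

# Three hypothetical prototypes over a three-word vocabulary.
prototypes = np.array([
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.3],
    [0.0, 0.2, 0.9],
])
query = np.array([0.0, 0.1, 1.0])

winner, sims = best_match_node(query, prototypes)
print("best-match node:", winner)
```

The comparison costs one pass over the prototypes, so it depends on the number of nodes rather than on the size of the document collection.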

Fig. 12 presents the drill-down with the HMDS and labels the documents neighboring the new query. We find texts which deal with tensions in the Gulf at that time and also mention the word “strike”, but in a different meaning than in the previously inspected node, and on a very distant HSOM node. A further query comes from another news stream: CNN reported a very promising article “Bush: Ending Saddam’s regime will bring stability to Mideast” (03/02/27; see footnote 4), which we find in the upper left corner in Fig. 12.

Figure 11. A query document was mapped to the HSOM and the fovea moved to the highlighted “best match” node. The automatic annotation scheme provides insightful information about the semantic content in this area of the map.

4 http://www.cnn.com/2003/WORLD/meast/02/27/sprj.irq.bush.speech/index.html


[Figure 12: H2-MDS display with labels turned on. Visible document labels: U.S._LAWMAKERS_SUPPORT_GULF_ACTION <20890>, SENATE_BACKS_U.S._RETALIATION_IN_GULF <20828>, CONVOY_RUNS_GULF_GAUNTLET,_OTHER_SHIPS_STAY_CLEAR <20774>, US_SENATE_CUTS_OFF_STALL_TACTICS_ON_GULF_BILL <20624>, US_WARNS_IRAN,_BEGINS_ESCORTING_TANKER_CONVOY <20464>, GULF_AND_WESTERN_<GW>_UPS_INTEREST_IN_NETWORK <19380>, U.S._HOUSE_PASSES_MIDEAST_GULF_BILL <18357>, EC_WATCHING_GULF_WAR_DEVELOPMENTS <18340>, U.S._SENATE_TEAM_WANTS_MULTINATIONAL_GULF_FORCE <18329>, U.S._HOUSE_PASSES_GULF_BILL_DESPITE_OPPOSITION <18328>, U.S._TO_PROTECT_ONLY_AMERICAN_SHIPS <18231>, REAGAN_HINTS_U.S._WANTS_HELP_IN_PATROLLING_GULF <17750>, U.S._GULF_OF_MEXICO_RIG_COUNT_CLIMBS_TO_38.9_PCT <17658>. Query markers: “Q: Bush: Ending Saddam’s regime... <CNN:2003/02/27>” and “Q: USA leading the strike in a gulf war against iraq?”. Legend: Search, Other, Corn, Ship, Wheat, Interest, Trade, Grain, Crude, Money-FX, Acquisition, Earn.]

Figure 12. HMDS location of a manual query (4 o’clock) and another news document from these days (10 o’clock). The titles reveal that the mapping succeeded in a meaningful manner.

6. Discussion and Conclusion

Document visualization efforts like ThemeScapes [17] or the SOM-based WebSOM [4], as well as Skupin’s cartographic approach [13], have impressively demonstrated the usefulness of compressed, map-like 2D representations of massive data collections, even if the data items contain extremely high-dimensional information such as text.

Recent work, such as [6, 11, 16], shows that the task of information visualization can significantly benefit from the use of hyperbolic space as a projection manifold. On the one hand, it gains the exponentially growing space around each point, which provides extra space for compressing semantic relationships. On the other hand, the Poincaré model offers superb visualization and navigation properties, which were found to yield significant improvement in task time compared to traditional browsing methods [10]. By simple mouse interaction the focus can be transferred to any location of interest. The core area close to the center of the Poincaré disk magnifies the data with a zoom factor of 0.5, and the magnification decreases exponentially towards the outer area. By this means a very natural visualization behavior is constructed: the fovea is an area with high resolution, while remote areas are gradually compressed and still visible as context.
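The zoom behavior can be checked numerically with a short sketch. It assumes the standard Poincaré disk metric of curvature -1, for which a point at hyperbolic distance rho from the focus is displayed at Euclidean radius r = tanh(rho/2), and one unit of hyperbolic length is drawn with display length (1 - r^2)/2. The function name `zoom_factor` is hypothetical.

```python
import math

def zoom_factor(rho):
    """Local magnification at hyperbolic distance rho from the disk center.

    Assumption: Poincare disk of curvature -1 with metric
    ds = 2 |dz| / (1 - |z|^2).
    """
    r = math.tanh(rho / 2.0)      # Euclidean radius in the unit disk
    return (1.0 - r * r) / 2.0    # display length per unit hyperbolic length

for rho in (0.0, 1.0, 2.0, 4.0, 8.0):
    print(f"rho = {rho:4.1f}   zoom = {zoom_factor(rho):.5f}")

# Far from the center, each additional unit of hyperbolic distance
# shrinks the zoom factor by roughly a factor of e.
ratio = zoom_factor(10.0) / zoom_factor(9.0)
print(f"ratio = {ratio:.5f}   exp(-1) = {math.exp(-1.0):.5f}")
```

Under these assumptions the factor equals 0.5 at the center and behaves like 2·exp(-rho) for large rho, in line with the exponential compression towards the rim.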

Another advantage is scalability. Due to its favorable linear scaling of O(N), the HSOM can be used to form an initial overview map for very large data collections. This map then offers all strengths of the hyperbolic focus+context navigation, permitting the user to rapidly narrow down the to-be-investigated data to a much smaller subset, which can then be interactively mapped at the level of individual documents with the HMDS technique. Again, the same hyperbolic focus+context navigation is available.

Both approaches produce spatial representations of data similarity: on a coarse level, the HSOM produces the “master map”, providing the thematic overview. Additionally, the neurons of the HSOM can be regarded as data-collecting agents offering the potential to visualize temporal developments in data streams on the map. In the second stage, the HMDS can represent the semantic closeness of the individual documents and is able to give a much more precise representation since it is decoupled from the rigid grid.

While the hybrid scheme may appear conceptually simple, we think that hybrid approaches which strike a flexible balance between the scaling of computational demands and achievable precision can be crucial for making new methods applicable to massive data collections, an important goal towards which the present research is meant to be a modest but useful step.

References

[1] T. Cox and M. Cox. Multidimensional Scaling. Chapman and Hall, 1994.

[2] H.S.M. Coxeter. Non-Euclidean Geometry. University of Toronto Press, 1957.

[3] T. Kohonen. Self-Organizing Maps, volume 30 of Springer Series in Information Sciences. Springer, 1995.

[4] T. Kohonen et al. Organization of a massive document collection. IEEE TNN Spec Issue Neural Networks for Data Mining and Knowledge Discovery, 11(3):574–585, 2000.

[5] J. Lamping, R. Rao, and P. Pirolli. A focus+context technique based on hyperbolic geometry for viewing large hierarchies. In ACM SIGCHI, pages 401–408, 1995.

[6] J. Lamping and R. Rao. Laying out and visualizing large trees using a hyperbolic space. In ACM Symp User Interface Software and Technology, pages 13–14, 1994.

[7] F. Morgan. Riemannian Geometry: A Beginner’s Guide. Jones and Bartlett Publishers, 1993.

[8] T. Munzner. H3: Laying out large directed graphs in 3d hyperbolic space. In Proc IEEE Symp Info Vis, pages 2–10, 1997.

[9] P. Pirolli, S. Card, and M. M. Van Der Wege. Visual information foraging in a focus + context visualization. In CHI, pages 506–513, 2001.

[10] K. Risden, M. Czerwinski, T. Munzner, and D. Cook. An initial examination of ease of use for 2d and 3d information visualizations of web content. Int J Human Computer Studies, 53(5):695–714, 2000.

[11] H. Ritter. Self-organizing maps on non-euclidean spaces. In Kohonen Maps, pages 97–110. Elsevier, 1999.

[12] J. W. Sammon, Jr. A non-linear mapping for data structure analysis. IEEE Trans Computers, 18:401–409, 1969.

[13] A. Skupin. A cartographic approach to visualizing conference abstracts. IEEE Computer Graphics and Applications, pages 50–58, 2002.

[14] J.A. Thorpe. Elementary Topics in Differential Geometry. Springer, 1979.

[15] J. Walter. H-MDS: a new approach for interactive visualization with multidimensional scaling in the hyperbolic space. Information Systems, (in print), 2003.

[16] J. Walter and H. Ritter. On interactive visualization of high-dimensional data using the hyperbolic plane. In ACM SIGKDD Int Conf Knowledge Discovery and Data Mining, pages 123–131, 2002.

[17] J. Wise. The ecological approach to text visualization. J Am Soc Information Sci, 50(13):1224–1233, 1999.

