1
Interactive Visual Exploration of High Dimensional Datasets
Jing Yang
Fall 2008
2
Challenges of High Dimensional Datasets
High dimensional datasets are common: digital libraries, bioinformatics, simulations, process monitoring, and surveys
Examples:
  Ticdata2000 dataset: 86 dimensions
  OHSUMED dataset: 215 dimensions
  SkyServer dataset: 361 dimensions
Challenges of visualizing high dimensional datasets:
  Clutter on the screen
  Difficult user navigation in the data space
3
Example
OHSUMED dataset: 215 dimensions
  Parallel coordinates: 215 axes
  Scatterplot matrix: 215*215 = 46,225 plots
4
Visual Hierarchical Dimension Reduction (VHDR)
J. Yang, M.O. Ward, E.A. Rundensteiner and S. Huang
VisSym’03
5
Motivation - Dimension Reduction
Idea:
  Project a high-dimensional dataset to a lower-dimensional subspace
  Visualize data items in the lower-dimensional subspace
Existing Approaches:
  Principal Component Analysis
  Multidimensional Scaling
  Kohonen’s Self Organizing Map
Problems:
  Information loss
  No intuitive meaning of generated dimensions
  Little user interaction allowed
6
Inspiration
Hierarchical parallel coordinates: data item hierarchy
7
Key Ideas of VHDR
Use dimension hierarchy to convey dimension relationships
Allow users to learn the dimension hierarchy
Allow users to select dimensions or dimension clusters to form subspaces of interest
8
Dimension Hierarchy
Similar dimensions form clusters; clusters are grouped into larger clusters
a dimension hierarchy of a 5-d dataset
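One way such a hierarchy could be built bottom-up is sketched below. The slides do not name a clustering algorithm, so average-linkage agglomerative merging is an assumption, and `build_hierarchy` and the nested-tuple tree format are hypothetical names chosen for illustration.

```python
def build_hierarchy(names, dis):
    """Repeatedly merge the two closest clusters; return a nested-tuple tree."""
    # Each cluster is (tree, member indices); leaves are dimension names.
    clusters = [(name, [i]) for i, name in enumerate(names)]

    def linkage(c1, c2):
        # Assumed linkage: average pairwise dissimilarity between members.
        pairs = [(i, j) for i in c1[1] for j in c2[1]]
        return sum(dis[i][j] for i, j in pairs) / len(pairs)

    while len(clusters) > 1:
        a, b = min(
            ((x, y) for x in range(len(clusters))
             for y in range(x + 1, len(clusters))),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        merged = ((clusters[a][0], clusters[b][0]),
                  clusters[a][1] + clusters[b][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return clusters[0][0]
```

Given a dissimilarity matrix in which dimensions a, b are close and c, d are close, the result groups each pair first and then joins the two pairs under one root.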
9
VHDR Framework
Step 1: build dimension hierarchy
Step 2: navigate and manipulate dimension hierarchy
Step 3: interactively select clusters from dimension hierarchy to form lower-dimensional subspaces
10
Overview
11
Build Dimension Hierarchy
Automatic dimension clustering
  Cluster dimensions according to dissimilarities* among them
  * Dissimilarity - measure of how dimensions are dissimilar to each other
Manual hierarchy modification
Discussion:
  How to calculate dissimilarity between two dimensions?
Ref: Ankerst, M., Berchtold, S., and Keim, D. A. Similarity clustering of dimensions for an enhanced visualization of multidimensional data. InfoVis’98
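One possible answer to the discussion question is sketched below. It assumes a correlation-based measure, 1 - |Pearson correlation|, which is an illustrative choice and not necessarily the measure used in VHDR or in the cited paper. The function names are hypothetical.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equally long lists of values."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # assumption: a constant dimension counts as unrelated
    return cov / sqrt(var_a * var_b)


def dissimilarity(a, b):
    # Assumed measure: 0 for perfectly (anti-)correlated, 1 for uncorrelated.
    return 1.0 - abs(pearson(a, b))


def dissimilarity_matrix(columns):
    """Pairwise dissimilarities; columns[i] holds the values of dimension i."""
    return [[dissimilarity(a, b) for b in columns] for a in columns]
```

Taking the absolute value treats strongly negatively correlated dimensions as similar, which may or may not be desirable for a given dataset.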
12
Navigate and Manipulate Dimension Hierarchy
InterRing - Radial space filling hierarchy navigation tool [yang:2002]
  Modification
  Selection
  Radius distortion
  Circular distortion
  Rolling up/Drilling down
  Rotation
  Zooming/Panning
13
Construct Lower-Dimensional Subspaces
Strategy 1: construct a subspace with closely related dimensions
14
Construct Lower-Dimensional Subspaces
Strategy 2: construct a subspace that covers major variance of the dataset
15
Dimension Cluster Representation
Representative Dimension - a dimension that represents a cluster of dimensions
Approaches to assigning or generating a representative dimension:
1. Select a dimension from the cluster
2. Average all dimensions in the cluster
3. Use principal component analysis to generate a weighted sum of dimensions within a cluster
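The first two approaches can be sketched as follows. The slides do not say how a dimension is selected, so picking the member with the smallest total dissimilarity to the rest of the cluster is an assumption; both function names are hypothetical. The PCA variant is omitted because it needs an eigen-decomposition.

```python
def select_representative(cluster, dis):
    """Approach 1 (assumed rule): pick the most central member of the cluster.

    cluster is a list of dimension indices, dis a dissimilarity matrix.
    """
    return min(cluster, key=lambda d: sum(dis[d][e] for e in cluster))


def average_representative(columns):
    """Approach 2: average the member dimensions, one value per data item.

    columns[i] holds the (already normalized) values of member dimension i.
    """
    n = len(columns)
    return [sum(values) / n for values in zip(*columns)]
```

Averaging only makes sense when the member dimensions share a scale, so this sketch assumes the values were normalized beforehand.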
16
Examples
Approach - average: representative dimension
Approach - select: representative dimension
17
Dissimilarity Representation
Approaches:
  Axis Width
  Three Axes
  Diagonal Plots
  Outer and Inner Sticks
  Mean-Band
Goal: visualize dissimilarity of dimensions in a dimension cluster.
Example: select 3 dimension clusters (dimensions) in Census-Income dataset
18
Dissimilarity Representation : Axis Width Method
19
Dissimilarity Representation : Three Axis Method
20
Dissimilarity Representation : Three Axis Method
21
Dissimilarity Representation : Diagonal Plots Method
No dissimilarities representation
Dissimilarities represented in diagonal plots
22
Generality
VHDR is a general framework that can be extended to multiple display techniques
We have applied VHDR to:
  Parallel Coordinates -> Hierarchical Parallel Coordinates
  Star Glyphs -> Hierarchical Star Glyphs
  Scatterplot Matrices -> Hierarchical Scatterplot Matrices
  Dimensional Stacking -> Hierarchical Dimensional Stacking
23
Other Clustering Approach
Visualization of Large-Scale Customer Satisfaction Surveys Using a Parallel Coordinate Tree, D. Brodbeck et al., InfoVis 2003
24
25
Interactive Hierarchical Dimension Ordering, Spacing and Filtering For Exploration of High Dimensional Datasets
Jing Yang, Wei Peng, Matthew O. Ward and Elke A. Rundensteiner
InfoVis’03
26
Motivation
A large number of dimensions needs to be managed
Ordering, spacing, filtering, etc.
27
Overview
General: includes dimension ordering, dimension spacing and dimension filtering
Interactive: allows user interactions throughout the whole process
Hierarchical: groups dimensions into a hierarchy and builds most algorithms and user interactions upon this hierarchy to increase scalability
28
Dimension Ordering (1)
Random order
29
Dimension Ordering (2)
Ordered by similarity
30
Dimension Ordering (3)
Order dimensions according to different purposes:
  Similarity-oriented ordering: put similar dimensions close to each other
  Importance-oriented ordering: map more important dimensions to more significant positions or attributes. The order of importance can be decided by Principal Component Analysis (PCA)
31
Dimension Ordering (4)
Challenges for ordering high dimensional datasets:
  Similarity-oriented ordering is NP-complete
  It is hard to decide the order of importance of a large number of dimensions using PCA
Our solution: reduce the complexity of the ordering problem using the dimension hierarchy
  Order each dimension cluster
  The order of the dimensions is decided in a depth-first traversal of the dimension hierarchy
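The depth-first step can be sketched as below, assuming the hierarchy is stored as nested tuples with dimension names as leaves (a representation chosen here for illustration). The optional `order_children` callback is a hypothetical hook standing in for whatever per-cluster ordering is applied; each cluster has few children, so ordering them is a much smaller problem than ordering all dimensions at once.

```python
def order_dimensions(node, order_children=None):
    """Return leaf dimensions in depth-first order.

    node is a dimension name (leaf) or a tuple of child nodes (cluster).
    order_children, if given, reorders the children of each cluster.
    """
    if isinstance(node, str):
        return [node]
    children = list(node)
    if order_children is not None:
        children = order_children(children)
    result = []
    for child in children:
        result.extend(order_dimensions(child, order_children))
    return result


# Illustrative 5-d hierarchy; the grouping is made up for this example.
hierarchy = (("d3", "d1"), "d5", ("d4", "d2"))
default_order = order_dimensions(hierarchy)
```

Because whole clusters are emitted contiguously, dimensions that the hierarchy considers similar end up adjacent in the final order.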
32
Hierarchical Ordering
Illustration
33
Dimension Ordering (6)
Random order Similarity-oriented order
34
Dimension Spacing (1)
Idea of dimension spacing:
  Convey dimension relationship information by varying the spacing between adjacent axes
35
Dimension Spacing (2)
Dimensions spaced according to similarity: similar dimensions are close to each other
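A minimal sketch of similarity-based spacing follows. It assumes the gap between two adjacent axes grows linearly with their dissimilarity, plus a small minimum gap so that near-identical dimensions do not collapse onto each other. The linear mapping, the `min_gap` parameter and the function name are assumptions; the slides show only the resulting display.

```python
def axis_positions(adjacent_dissimilarities, width=1.0, min_gap=0.05):
    """Horizontal axis positions for an already ordered list of dimensions.

    adjacent_dissimilarities[i] is the dissimilarity between axis i and i+1.
    Positions are scaled so the first axis is at 0 and the last at width.
    """
    gaps = [min_gap + d for d in adjacent_dissimilarities]
    scale = width / sum(gaps)
    positions = [0.0]
    for gap in gaps:
        positions.append(positions[-1] + gap * scale)
    return positions
```

For example, `axis_positions([0.1, 0.9], min_gap=0.0)` places the second axis close to the first and far from the third.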
36
Dimension Spacing Distortion (1)
37
Dimension Spacing Distortion (2)
Before After
38
Dimension Filtering (1)
Idea of dimension filtering:
  Similar dimensions can be omitted;
  Unimportant dimensions can be omitted.
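The two filtering rules can be sketched as a single pass over the ordered dimensions. The thresholds, the per-dimension `importance` scores and the function name are assumptions for illustration; the slides do not specify how similarity or importance is thresholded.

```python
def filter_dimensions(ordered, dis, importance, sim_threshold, imp_threshold):
    """Keep dimensions that are important and not too similar to a kept one.

    ordered: dimension indices in display order
    dis: dissimilarity matrix
    importance: importance score per dimension (higher is more important)
    """
    kept = []
    for d in ordered:
        if importance[d] < imp_threshold:
            continue  # rule 2: unimportant dimensions are omitted
        if any(dis[d][k] < sim_threshold for k in kept):
            continue  # rule 1: a similar dimension is already shown
        kept.append(d)
    return kept
```

Raising `sim_threshold` filters more aggressively, which is the kind of parameter a user would adjust interactively.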
39
Dimension Filtering (2)
Unfiltered Filtered
40
Conclusion
The proposed approach
  Improves the manageability of dimensions in high dimensional data sets and reduces the complexity of the ordering, spacing and filtering tasks;
  Allows flexible user interactions for dimension ordering, spacing and filtering with dimension hierarchies.
41
Value and Relation (VaR) Display
Jing Yang, Anilkumar Patro, Shiping Huang, Nishant Mehta, Matthew O. Ward and Elke A. Rundensteiner
InfoVis’04
42
Motivation
Challenges:
  Can high dimensional datasets be visualized without dimension reduction to avoid information loss?
  Can dimension relationships be visualized in the same display as data values?
43
Challenge - Visualization without Dimension Reduction
Visualize SkyServer dataset (361 dimensions) using existing techniques:
  Parallel Coordinates: 361 axes
  Scatterplot Matrix: 130,321 scatterplots
  Pixel-Oriented techniques without overlaps: 50,000 data items: 18,050,000 pixels (23 times the number of pixels on a 1024*768 screen)
Hint: Use Pixel-Oriented techniques and allow overlaps
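The figures on this slide follow from simple arithmetic, reproduced here as a quick check. The variable names are illustrative; the inputs are the numbers stated on the slide.

```python
dimensions = 361          # SkyServer dataset
data_items = 50_000
screen_pixels = 1024 * 768

parallel_axes = dimensions                  # one axis per dimension
scatterplots = dimensions ** 2              # one plot per dimension pair
pixels_needed = dimensions * data_items     # one pixel per value, no overlap
screens_needed = pixels_needed / screen_pixels

print(parallel_axes, scatterplots, pixels_needed, round(screens_needed))
```

The pixel count alone exceeds a single screen by more than an order of magnitude, which motivates allowing overlaps.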
44
Challenge - Dimension Relationship Visualization
Sorting dimensions in a 1D or 2D grid [Ankerst 98]
Not effective beyond hundreds of dimensions
Spacing between dimensions [Yang 2003]
Only relationships of adjacent dimensions are revealed
Pixel-Oriented: Sort 50 dimensions in a 2D grid [Ankerst 98]
45
Challenge - Dimension Relationship Visualization (cont.)
SPIRE Galaxies: Map data items to a 2D display using MDS [Wise:95]
Recall data item relationship visualization:
  MDS: SPIRE Galaxies [Wise:95]
Hint: Use MDS to lay out dimensions
46
Our Proposal: Value and Relation (VaR) Display
Dissimilarity matrix of dimensions d1 to d4:
      d1    d2    d3    d4
d1    0     0.7   0.6   0.7
d2    0.7   0     0.3   0.2
d3    0.6   0.3   0     0.1
d4    0.7   0.2   0.1   0
Multi-Dimensional Scaling: 2D layout of d1, d2, d3, d4
Pixel-Oriented glyph
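The layout step can be sketched with classical (Torgerson) MDS applied to the 4x4 dissimilarity matrix from this slide. The slides do not say which MDS variant the VaR display uses, so classical MDS is an assumption, and the shifted power iteration is simply a dependency-free stand-in for a library eigen-solver. All function names are hypothetical.

```python
from math import sqrt

# Dissimilarity matrix from the slide (rows/columns: d1, d2, d3, d4).
D = [
    [0.0, 0.7, 0.6, 0.7],
    [0.7, 0.0, 0.3, 0.2],
    [0.6, 0.3, 0.0, 0.1],
    [0.7, 0.2, 0.1, 0.0],
]


def double_center(dist):
    """Return B = -1/2 * J * D^2 * J, the matrix classical MDS decomposes."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    total_mean = sum(row_mean) / n
    return [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
             for j in range(n)] for i in range(n)]


def top_eigenpair(matrix, iterations=2000):
    """Largest eigenvalue and its eigenvector via shifted power iteration."""
    n = len(matrix)
    # Shifting by the largest absolute row sum makes every eigenvalue >= 0,
    # so the iteration converges to the algebraically largest one.
    shift = max(sum(abs(v) for v in row) for row in matrix)
    if shift == 0:
        return 0.0, [0.0] * n
    vec = [float(i + 1) for i in range(n)]
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(n)) + shift * vec[i]
               for i in range(n)]
        norm = sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    value = sum(vec[i] * sum(matrix[i][j] * vec[j] for j in range(n))
                for i in range(n))
    return value, vec


def classical_mds_2d(dist):
    """2D coordinates whose distances approximate the given dissimilarities."""
    b = double_center(dist)
    n = len(b)
    coords = [[0.0, 0.0] for _ in range(n)]
    for axis in range(2):
        value, vec = top_eigenpair(b)
        scale = sqrt(max(value, 0.0))
        for i in range(n):
            coords[i][axis] = scale * vec[i]
        # Deflate: remove the component just extracted before the next axis.
        b = [[b[i][j] - value * vec[i] * vec[j] for j in range(n)]
             for i in range(n)]
    return coords


positions = classical_mds_2d(D)
```

In the resulting layout d3 and d4 (dissimilarity 0.1) land close together while d1 sits apart from the other three, which is the arrangement the slide's diagram suggests. A glyph for each dimension would then be drawn at its position.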
47
SkyServer dataset: 361 dimensions, 50,000 data items
Value and Relation Display
Features:
  Explicitly conveys data values without dimension reduction
  Explicitly conveys dimension relationships
  Provides a rich set of interaction tools
48
Overlap Detection and Reduction
Extent Scaling
Dynamic Masking
Zooming and Panning
Showing Names
Layer Reordering
Manual Relocation
Automatic Shifting
SkyServer Dataset
49
Distortion
Goal: Focus-within-context
Features:
  Enlarges clicked glyphs
  Keeps size of other glyphs
SkyServer Dataset
50
Data Item Reordering
Pixel-oriented techniques:
  Data item ordering is critical
VaR display:
  Initial display
  Manual reordering
Census-Income-Part Dataset: 42 dimensions, 20,000 data items
51
Comparing
Goal: Compare base dimension with all others
Feature: Coloring by value difference of dimensions being compared
AAUP Dataset: 14 dimensions, 1,131 data items
52
Selection
Goal: Select dimensions for further interaction or visualization
Selection tools in VaR display:
  Manual selection - flexibility
  Automatic selection - efficiency
    Select related dimensions
    Select unrelated dimensions
53
Automatic Selection for Unrelated Dimensions
Input:
  A base dimension
  “Related” threshold
Output: Dimensions covering major data variance
Algorithm: Iteratively select unrelated dimensions and filter related dimensions
Related work: Maximum subspace [MacEachren:03]
SkyServer Dataset
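A minimal sketch of the iterative select-and-filter loop is given below. The slide states only the inputs, the output and the one-line algorithm, so two details are assumptions: a dimension counts as "related" when its dissimilarity to a selected dimension is below the threshold, and the next pick is the remaining candidate farthest from the most recent pick. The function name is hypothetical.

```python
def select_unrelated(base, dis, threshold):
    """Grow a set of mutually unrelated dimensions starting from base.

    dis is a dissimilarity matrix; values below threshold mean "related".
    """
    selected = [base]
    candidates = [d for d in range(len(dis)) if d != base]
    while True:
        # Filter: drop every candidate related to any selected dimension.
        candidates = [d for d in candidates
                      if all(dis[d][s] >= threshold for s in selected)]
        if not candidates:
            return selected
        # Select (assumed rule): the candidate farthest from the latest pick.
        nxt = max(candidates, key=lambda d: dis[d][selected[-1]])
        selected.append(nxt)
        candidates.remove(nxt)
```

A larger threshold filters more candidates per round and so returns a smaller subspace.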
54
Scale to Large Datasets
Store glyphs as texture objects
  Extent scaling and relocating: resize, relocate texture objects ☺
  Reordering and recoloring: regenerate texture objects
Use random sampling
  Users interactively set threshold
  Random sampling is triggered automatically
Without sampling (16K data items)
With sampling (5K data items)
Out5D Dataset
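The sampling rule can be sketched as follows: when the number of data items exceeds a user-set threshold, a uniform random sample of that size is drawn automatically. Treating the threshold as the sample size is an assumption, as are the function name and the `seed` parameter.

```python
import random


def maybe_sample(items, threshold, seed=None):
    """Return items unchanged if small enough, else a uniform random sample."""
    if len(items) <= threshold:
        return list(items)
    return random.Random(seed).sample(list(items), threshold)


# Mirrors the slide's example sizes: 16K items reduced to 5K.
sampled = maybe_sample(range(16_000), 5_000, seed=0)
```

Since a glyph's texture only has to be regenerated from the sampled items, reordering and recoloring stay responsive on larger datasets.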
55
Discussion
Is the pixel-oriented technique the only choice for generating dimension glyphs?
  Histogram, Scatterplot, …
Is 2D MDS the only approach to lay out dimensions?
  3D MDS, SOM, Treemap, Animation…
Is correlation the most informative relationship among dimensions?
  Multivariate relationships
56
Value and Relation Display: Interactive Visual Exploration of Large Datasets with Hundreds of Dimensions.
J. Yang, D. Hubball, M. Ward, E. Rundensteiner and W. Ribarsky
IEEE Transactions on Visualization and Computer Graphics 13(3)
57
XRay Dimension Glyphs
Each glyph: a scatterplot matrix
  X: a base dimension that is the same for all glyphs
  Y: the dimension it represents
Density based display
  Bright: sparse
  Dark: dense
Unoccupied area: semi-transparent
58
A real dataset with 89 dimensions and 10,417 data items in Pixel and XRay VaRs.
XRay Dimension Glyphs
59
Jigsaw Map Dimension Layout
Dimension hierarchy
Using H-Curve to create a Jigsaw Map
M. Wattenberg. A note on space-filling visualizations and space-filling curves. InfoVis 2005, pages 181–186
60
A real dataset with 838 dimensions and 11,413 data items in Pixel-Jigsaw VaR and XRay-Jigsaw VaR
Jigsaw Map Dimension Layout
61
Rainfall Dimension Layout
Metaphor: Rain
Center Bottom: focus dimension D
Speed of a dimension: related to its correlation to D
Time: user controllable
62
Rainfall Dimension Layout
63
Data Item Selection and Masking
Visual query style data item selection
Data item based masking
(a) No mask (b) Opaque mask (c) Semi-transparent mask
64
Labeling
(a) All labels are shown
(b) Labels of selected dimensions are shown
(c) Angled labels in Jigsaw map layout
65
Possible Applications of VaR Display
Interactively exploring high dimensional data
  Revealing data item relationships
  Revealing dimension relationships
Guiding automatic data analysis
  Assessing results
  Manually tuning parameters
Human-driven dimension reduction
  Constructing subspaces using selection tools
  Visualizing subspaces in VaR or other displays
66
Semantic Image Browser: Bridging Information Visualization with Automated Intelligent Image Analysis
Jing Yang1, Jianping Fan1, Daniel Hubball1, Yuli Gao1, Hangzai Luo1, William Ribarsky1, and Matthew Ward2
1 University of North Carolina at Charlotte
2 Worcester Polytechnic Institute
Acknowledgements: This work is supported by NVAC
67
Motivation
Interactive image exploration:
  Applications: personal image management, satellite image analysis, ...
Background:
  Automated semantic image analysis
  Gap between semantic image analysis and image exploration
Goals:
  Facilitate image exploration using analysis results
  Evaluate, monitor and improve analysis processes
68
Semantic Image Browser Overview
Annotation engine
  Automated semantic image classification process
Multiple coordinated views
  Image view - MDS, Rainfall, Sequential
  Content view - VaR
Interactions
  Search by sample image
  Search by semantic content
  Interactive annotation examination and modification
  Zooming, panning, distortion
69
Annotation Engine
Content-Based Image Annotation [fan:2004]
Low level visual features
Semantic contents
Semantic concepts
Semantic contents: high dimensional dataset
  data items: images
  dimensions: contents
  values: 1 (image contains the content) or 0 (otherwise)
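The mapping from annotations to a high dimensional dataset can be sketched as a binary matrix with one row per image and one column per content. The content names and the set-based annotation format below are made up for illustration; the annotation engine's actual output format is not described on the slide.

```python
def to_binary_dataset(annotations, contents):
    """Rows are images, columns are contents; 1 means the image contains it.

    annotations: one set of content names per image
    contents: ordered list of all content names (the dimensions)
    """
    return [[1 if c in found else 0 for c in contents]
            for found in annotations]


# Hypothetical contents and annotations for three images.
contents = ["sky", "water", "flower"]
annotations = [{"sky", "water"}, {"flower"}, set()]
dataset = to_binary_dataset(annotations, contents)
```

Each column of `dataset` is then one dimension that a VaR-style content view can draw as a pixel-oriented block.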
70
Image View – MDS layout
Corel collection (1100 images, 20 contents)
71
Navigation Tools
72
Image View – Rainfall Layout
73
Content View
VaR display [yang:2004]
Content blocks
  Pixel-oriented techniques [Keim 94]
  Color assignment
    Unselected images: Red - 1, Gray - 0
    Selected images: Blue - 1, Light gray - 0
MDS layout of content blocks
Interactions
Corel image collection (1100 images, 20 contents)
74
Search by Sample Image
75
Search by Semantic Content
76
Annotation Evaluation and Modification
Case 1: Redflower
Case 2: Sailcloth
77
User Study
Subjects: 10 UNCC students
Dataset: Corel dataset (20 contents, 1100 images)
Systems compared:
  No annotation: Unsorted Thumbnails in Explorer
  Semantic contents: Semantic image browser
  Semantic concepts: Thumbnails sorted by concepts in Explorer
Tasks:
  Task1: Find three given images
  Task2: Find images with certain contents
  Task3: Estimate percentage of images containing certain contents
78
User Study Results
Task1: Find three given images
  Result: Sorted Explorer was better than Semantic browser; Semantic browser was better than Unsorted Explorer
  Major reason: Annotations at the semantic concept level were more “error tolerant”
Task2: Find images with certain contents
  Result was similar to Task1
Task3: Estimate percentage of images containing certain contents
  Result: Semantic browser was faster and more accurate than sorted Explorer and unsorted Explorer
Post experiment questionnaire (1 to 10 scale)
  Semantic browser was preferred
  Semantic browser was useful
79
Multivariate Visual Explanation for High Dimensional Datasets
S. Barlowe, T. Zhang, Y. Liu, J. Yang and D. Jacobs
VAST 2008
80
Worldview Gap
Worldview Gap - gap between what is being shown and what actually needs to be shown to draw a straightforward representational conclusion for decision making
- Amar and Stasko, InfoVis 2004 best paper
Filling Worldview Gap:
  Our approach - Embedding automatic analysis into information visualization
81
Multivariate Visual Explanation
With Scott Barlowe, Tianyi Zhang, Yujie Liu, and Donald Jacobs
82
Motivation
Understanding multivariate relationships is critical in a vast number of applications
Example: Economic forecasting
83
What is the relationship?
Scatterplot Matrix
Parallel Coordinates
y0 = x0x1 + x2
84
Worldview Gap
Worldview Gap - gap between what is being shown and what actually needs to be shown to draw a straightforward representational conclusion for decision making
- Amar and Stasko, InfoVis 2004 best paper
85
Multivariate Visual Explanation
Goals:
  Multivariate relationship understanding
  Dimension Reduction
  Model Construction
Approach: Integrate partial derivative calculation into multivariate visualization
  Partial derivative calculation and inspection
  Step by step visual exploration with interactive model construction and dimension reduction
86
Partial Derivative
Derivative: a measure of how a function changes when the values of its inputs change
Example: the derivative of a car's position with respect to time is its instantaneous speed
Partial derivative of a function of several variables: the derivative with respect to one of those variables with the others held constant
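As a worked illustration, the partial derivatives of the example function used later in these slides, y = x0x1 + x2, can be approximated with central differences. This sketch assumes the function is known and can be evaluated anywhere; estimating derivatives from a sampled dataset is a different problem that the slides do not detail. `partial_derivative` is a hypothetical helper name.

```python
def partial_derivative(f, point, index, h=1e-5):
    """Central-difference estimate of the partial derivative of f at point.

    Only coordinate `index` is varied; all other inputs are held constant.
    """
    up, down = list(point), list(point)
    up[index] += h
    down[index] -= h
    return (f(up) - f(down)) / (2 * h)


def y(x):
    # Example relationship from the slides: y = x0*x1 + x2
    return x[0] * x[1] + x[2]


point = [2.0, 3.0, 5.0]
gradient = [partial_derivative(y, point, i) for i in range(3)]
```

Analytically, ∂y/∂x0 = x1, ∂y/∂x1 = x0 and ∂y/∂x2 = 1, so at (2, 3, 5) the estimates should be close to 3, 2 and 1.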
87
Partial Derivative Inspection
Partial derivative calculation introduces errors
88
Partial Derivative Inspection
Visually present errors to users
Error inspection of a segmented dataset:
  y = 8x0 + x1 if x0 ≥ 0.6 and x1 ≤ 0.3
  y = x0 − 7x1 otherwise
89
Visual Exploration of Partial Derivatives
Show all partial derivatives together with the original dimensions?
Scalability Challenge: 4-d dataset with dependent variable y and independent variables x0, x1, and x2
  1st order derivatives: ∂y/∂x0, ∂y/∂x1, ∂y/∂x2
  2nd order derivatives: ∂²y/∂x0∂x0, ∂²y/∂x0∂x1, ∂²y/∂x0∂x2, ∂²y/∂x1∂x1, ∂²y/∂x1∂x2, ∂²y/∂x2∂x2
Screen will be cluttered!
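The growth in the number of derivative dimensions can be checked by enumeration: n independent variables give n first-order derivatives and n(n+1)/2 distinct second-order derivatives, since mixed partials are counted once per unordered pair. The helper name below is hypothetical.

```python
from itertools import combinations_with_replacement


def second_order_pairs(variables):
    """Unordered variable pairs, one per distinct 2nd order partial derivative."""
    return list(combinations_with_replacement(variables, 2))


variables = ["x0", "x1", "x2"]
first_order = len(variables)
second_order = second_order_pairs(variables)
print(first_order, len(second_order))
```

With three independent variables this already adds 3 + 6 = 9 extra dimensions to the original 4, and the quadratic term dominates as n grows.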
90
Visual Exploration of Partial Derivatives
Examine all types of relationships from one display?
Users would be overwhelmed!
91
Step By Step Visual Exploration
Different types of correlations are examined in different steps
Correlations that are easier to detect are examined first
Variables with detected relationships are excluded from further analysis
92
Step1: 1st Order Partial Derivative Histograms
Display: the histograms of 1st order partial derivatives
Information to be detected:
  Significant independent variables
  Ignorable independent variables
  Independent variables that linearly impact the dependent variable
  Whether an independent variable impacts the dependent variable positively or negatively
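How these readings arise from a histogram can be sketched on synthetic data for y = x0x1 + x2, the example relationship from these slides. Derivatives are estimated by central differences on the known function, which is an assumption made to keep the example short; the bin range, the sample size and all names are illustrative.

```python
import random


def y(x):
    return x[0] * x[1] + x[2]


def partial(f, point, index, h=1e-5):
    """Central-difference estimate of one 1st order partial derivative."""
    up, down = list(point), list(point)
    up[index] += h
    down[index] -= h
    return (f(up) - f(down)) / (2 * h)


def histogram(values, bins=10, low=0.0, high=1.5):
    """Count values per equal-width bin; out-of-range values are clamped."""
    counts = [0] * bins
    for v in values:
        k = int((v - low) / (high - low) * bins)
        counts[max(0, min(k, bins - 1))] += 1
    return counts


rng = random.Random(0)
points = [[rng.random() for _ in range(3)] for _ in range(1000)]
derivs = {i: [partial(y, p, i) for p in points] for i in range(3)}
hists = {i: histogram(derivs[i]) for i in range(3)}
```

All values of ∂y/∂x2 fall into a single bin around 1, the signature of a linear, positive influence. The values of ∂y/∂x0 spread across many bins because they equal x1, which hints that x0 and x1 are entangled. A single spike at zero would mark an ignorable variable.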
93
94
95
96
Step2: 1st Order Partial Derivatives vs. Original Dimensions Scatterplots
Information: Entangled?
Dataset: y0 = x0x1 + x2, 1000 data items
97
Coordinated Visual Exploration
98
Model Construction