10/1/2007 TCSS555A Isabelle Bichindaritz 1
Main Concepts of Data Mining
Introduction to Data Preprocessing
Learning Objectives
• Study some examples of data mining systems
• Understand why to preprocess the data.
• Understand how to describe the data (descriptive data summarization)
Acknowledgements
Some of these slides are adapted from Jiawei Han and Micheline Kamber
Data Mining: Confluence of Multiple Disciplines
• Data mining draws on:
– Database technology
– Statistics
– Machine learning
– Visualization
– Information science
– Other disciplines
Data Mining: Classification Schemes
• General functionality
– Descriptive data mining
– Predictive data mining
• Different views, different classifications
– Kinds of databases to be mined
– Kinds of knowledge to be discovered
– Kinds of techniques utilized
– Kinds of applications adapted
Major Issues in Data Mining (1)
• Mining methodology and user interaction
– Mining different kinds of knowledge in databases
– Interactive mining of knowledge at multiple levels of abstraction
– Incorporation of background knowledge
– Data mining query languages and ad-hoc data mining
– Expression and visualization of data mining results
– Handling noise and incomplete data
– Pattern evaluation: the interestingness problem
• Performance and scalability
– Efficiency and scalability of data mining algorithms
– Parallel, distributed and incremental mining methods
Major Issues in Data Mining (2)
• Issues relating to the diversity of data types
– Handling relational and complex types of data
– Mining information from heterogeneous databases and global information systems (WWW)
• Issues related to applications and social impacts
– Application of discovered knowledge
  • Domain-specific data mining tools
  • Intelligent query answering
  • Process control and decision making
– Integration of the discovered knowledge with existing knowledge: A knowledge fusion problem
– Protection of data security, integrity, and privacy
Main Concepts in Data Mining
• Data mining: discovering interesting patterns from large amounts of data
• A natural evolution of database technology, in great demand, with wide applications
• A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
• Mining can be performed in a variety of information repositories
• Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.
• Classification of data mining systems
• Major issues in data mining
Case-Based Reasoning
• Case-based reasoning (CBR)
– Problem-solving method from artificial intelligence (AI) that proposes to reuse previously solved and memorized problem situations, called cases
– Instance-based method from machine learning
– Can be used for classification/prediction tasks
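The idea of reusing memorized cases can be sketched in a few lines of Python. This is only an illustrative sketch, not any particular CBR system: the case base, the feature vectors, the "plan" labels, and the choice of Euclidean distance for retrieval are all assumptions made for the example.

```python
import math

# Hypothetical case base: each case pairs a problem description
# (a numeric feature vector) with the solution that was used for it.
CASE_BASE = [
    ((1.0, 1.0), "plan A"),
    ((5.0, 5.0), "plan B"),
    ((9.0, 1.0), "plan C"),
]

def retrieve(case_base, problem):
    """Return the stored case whose problem is closest (Euclidean) to `problem`."""
    return min(case_base, key=lambda case: math.dist(case[0], problem))

def solve(case_base, problem):
    """Retrieve the nearest case and reuse its solution unchanged."""
    _, solution = retrieve(case_base, problem)
    return solution

def retain(case_base, problem, solution):
    """Store a newly solved case so that it can be reused later."""
    case_base.append((tuple(problem), solution))

if __name__ == "__main__":
    print(solve(CASE_BASE, (4.0, 6.0)))  # prints "plan B"
```

A real system would normally adapt and test the retrieved solution before retaining it; the sketch skips those steps to keep the retrieval idea visible.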
Case-Based Reasoning
[Figure: the CBR cycle. PROBLEM (via USER INTERFACE) → Interpretation → New Case / Target case → Retrieve → Retrieved Case → Reuse → Solved Case (Solution) → Revise → Tested Case → Retain → CASE BASE (Previous Cases); the SOLUTION is returned through the USER INTERFACE.]
Fifth Workshop on Case-Based Reasoning in the Health Sciences
Isabelle Bichindaritz, University of Washington, Tacoma, Washington, USA
Stefania Montani, University of Piemonte Orientale, Italy
Workshop Stats
• Papers accepted: 10 papers
• Attendees: 19 participants
• Good news !!!
Workshop Goals
• Provide a forum for identifying important contributions and opportunities for research on the application of CBR to the Health Sciences
• Promote the systematic study of how to apply CBR to the Health Sciences
• Showcase applications of CBR in the Health Sciences
A CBR Solution for Missing Medical Data
Olga Vorobieva and Rainer Schmidt
Institute for Medical Informatics and Biometry University of Rostock, Germany
Alexander Rumiantzev
Pavlov State Medical University, St.Petersburg, Russia
Summary
• Application domain
– dialysis medicine
– effects of fitness on dialysis
• System context
– ISOR, a CBR system that explains the exceptional cases – those for which fitness does not improve renal function
• Task / problem addressed
– restoration of missing data
• Research hypothesis
– case-based reasoning can be applied to restore missing data in a dataset/case base
• Main contribution
– synergy between CBR and statistics (statistical modeling)
A Case-Based Reasoning Approach to Dose Planning in Radiotherapy
Xueyan Song¹, Sanja Petrovic¹, and Santhanam Sundar²
¹ Automated Scheduling, Optimisation and Planning Group, School of Computer Science, University of Nottingham, UK
² Dept. of Oncology, City Hospital Campus, Nottingham University Hospitals NHS Trust, Nottingham, UK
Summary
• Application domain
– dose planning in radiotherapy for prostate cancer
• System context
– trade-off between the benefit in terms of cancer control and the risk in terms of harmful side effects to neighboring tissues
• Task / problem addressed
– planning problem – designing a radiotherapy dose plan
• Research hypothesis
– case-based reasoning can be applied to propose dose plans
• Main contribution
– fuzzy representation of attribute values and similarity measure
– fusion of similar cases by Dempster-Shafer theory
On-Line Domain Knowledge Management for
Case-Based Medical Recommendation
Amélie Cordier¹, Béatrice Fuchs¹, Jean Lieber², and Alain Mille¹
¹ LIRIS CNRS, UMR 5202, Université Lyon 1, INSA Lyon, Université Lyon 2, ECL, 43, bd du 11 Novembre 1918, Villeurbanne Cedex, France, {Amelie.Cordier, Beatrice.Fuchs, Alain.Mille}@liris.cnrs.fr
² LORIA (UMR 7503 CNRS–INRIA–Nancy Universities), BP 239, 54506 Vandoeuvre-lès-Nancy, France
Summary
• Application domain
– breast cancer treatment
• System context
– Kasimir is a knowledge management and decision-support system in oncology focusing on case-based protocol treatment recommendations
• Task / problem addressed
– planning problem – recommending a treatment plan based on a protocol
• Research hypotheses
– conservative adaptation is recommended for adapting a protocol to a new case through case-based reasoning
– new domain knowledge can be acquired by analysis of failures
• Main contribution
– improvement of adaptation
– method for learning from failures of the case-based reasoning
Concepts for Novelty Detection and Handling
based on Case-Based Reasoning
Petra Perner
Institute of Computer Vision and applied Computer Sciences, IBaI
Summary
• Application domain
– Hep-2 cell image interpretation
• System context
– case-based image interpretation
• Task / problem addressed
– classification problem – improve recognition of over 30 different nuclear and cytoplasmic patterns when patterns change over time or new patterns emerge
• Research hypothesis
– case-based reasoning can be applied to the problem of novelty detection and also of concept drift
• Main contribution
– novel application for CBR: detecting novelty, detecting concept drift
Similarity of Medical Cases in Health Care
Using Cosine Similarity and Ontology
Shahina Begum, Mobyen Uddin Ahmed, Peter Funk, Ning Xiong, Bo von Schéele
Mälardalen University, Department of Computer Science and Electronics
PO Box 883 SE-721 23, Västerås, Sweden
{firstname.lastname}@mdh.se
Summary
• Application domain
– any medical domain
• System context
– electronic medical records
• Task / problem addressed
– retrieval task – finding similar cases represented with structured and semi-structured data
• Research hypothesis
– a hybrid similarity measure combining the cosine similarity measure, an ontology, and the nearest neighbor method permits successful retrieval of similar cases
• Main contribution
– synergy between case-based reasoning and information retrieval
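As a reminder of the cosine component only, the sketch below computes cosine similarity between two term-weight vectors and uses it for nearest-neighbor retrieval. It does not reproduce the hybrid measure of the paper (the ontology part is left out), and the vectors and case names are made up for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here: a zero vector is similar to nothing
    return dot / (norm_a * norm_b)

def most_similar(cases, query):
    """Return the name of the stored case whose vector is most similar to `query`."""
    return max(cases, key=lambda name: cosine_similarity(cases[name], query))

# Hypothetical term-weight vectors for three stored cases.
CASES = {
    "case 1": [1.0, 0.0, 2.0],
    "case 2": [0.0, 3.0, 0.0],
    "case 3": [1.0, 1.0, 1.0],
}

if __name__ == "__main__":
    print(most_similar(CASES, [2.0, 0.0, 4.0]))  # prints "case 1"
```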
Towards Case-Based Reasoning for Diabetes Management
Cindy Marling¹, Jay Shubrook² and Frank Schwartz²
¹ School of Electrical Engineering and Computer Science, Russ College of Engineering and Technology, Ohio University, Athens, Ohio 45701, USA
² Appalachian Rural Health Institute, Diabetes and Endocrine Center, College of Osteopathic Medicine, Ohio University, Athens, Ohio 45701, USA
Summary
• Application domain
– type I diabetes management
• System context
– real-time monitoring of glucose level through insulin pump
• Task / problem addressed
– treatment planning – adjusting insulin dosage
• Research hypothesis
– case-based reasoning can adjust insulin dosage in real time
– cases required for the future CBR system can be acquired through an online Web-based interface
• Main contribution
– planning the development of a case-based reasoning system for automatic type I diabetes monitoring
Hypothetico-Deductive Case-Based Reasoning
David McSherry
School of Computing and Information Engineering, University of Ulster, Northern Ireland
Summary
• Application domain
– contact lenses classification
• System context
– conversational CBR
• Task / problem addressed
– classification problem – recommending type of contact lenses
• Research hypothesis
– a hypothetico-deductive CBR approach to test selection can minimize the number of tests required to confirm a hypothesis proposed by the system or user
• Main contribution
– synergy between case-based reasoning and hypothetico-deductive reasoning
– explanations in CBR
Other Papers Summaries
• Case-based Reasoning for managing non-compliance with clinical guidelines, Stefania Montani, University of Piemonte Orientale, Alessandria, Italy
A CBR system able to
– Retrieve similar past episodes (cases) of non-compliance with guidelines, to be suggested to the physician
– Learn more general indications from ground non-compliance cases, adoptable for a formal guideline (GL) revision by an expert committee
• CBR for Temporal Abstractions Configuration in Haemodialysis, Leonardi Giorgio, Bottrighi Alessio, Portinale Luigi, Montani Stefania, University of Piemonte Orientale, Alessandria, Italy
A CBR system able to choose the appropriate parameters for the configuration of temporal abstractions in the medical domain of haemodialysis
Other Papers Summaries
• Prototypical Cases for Knowledge Maintenance in Biomedical CBR, Isabelle Bichindaritz, University of Washington, Tacoma, WA, USA
Prototypical cases have served various purposes in biomedical CBR systems: organizing and structuring the memory, guiding the retrieval as well as the reuse of cases, and bootstrapping a CBR system memory when real cases are not available in sufficient quantity and/or quality. Knowledge maintenance is yet another role that these prototypical cases can play in biomedical CBR systems.
Discussion
• Trends and issues
– Integration of CBR with electronic patient records and/or in clinical practice (Begum et al., Marling et al.)
– Importance of prototypical cases (Bichindaritz)
– Incompleteness / non-reliability of cases or CBR system knowledge (Vorobieva et al., Cordier et al., Bichindaritz)
– Novel domains of applications for CBR (Perner, Leonardi et al., Montani)
– Need for synergy with other AI methods (Song et al., McSherry)
Discussion
• Pearls of wisdom
– Remember Occam’s razor – introducing complexity in CBR should be carefully justified
– Knowledge in medical cases / domain knowledge is often questionable – finding methods for dealing with this reality is essential for the development of CBR in biomedical domains
– CBR can be promoted as the methodology of choice for evidence gathering in evidence-based medicine
Future Plans
• A second special issue on CBR in the Health Sciences, based on papers from this Fifth Workshop on CBR in the Health Sciences, is going to be published in Computational Intelligence.
• The Web-site (version 1.beta) and mailing list for our research group are now live:
http://www.cbr-health.org
http://www.cbr-biomed.org
Why Data Preprocessing?
• Data mining aims at discovering relationships and other forms of knowledge from data in the real world.
• Data map entities in the application domain to symbolic representation through a measurement function.
• Data in the real world is dirty
– incomplete: missing data, lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
– noisy: containing errors, such as measurement errors, or outliers
– inconsistent: containing discrepancies in codes or names
– distorted: sampling distortion
• No quality data, no quality mining results! (GIGO)
– Quality decisions must be based on quality data
– Data warehouse needs consistent integration of quality data
Multi-Dimensional Measure of Data Quality
• Data quality is multidimensional:
– Accuracy
– Preciseness (=reliability)
– Completeness
– Consistency
– Timeliness
– Believability (=validity)
– Value added
– Interpretability
– Accessibility
• Broad categories:
– intrinsic, contextual, representational, and accessibility.
Major Tasks in Data Preprocessing
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies and errors
• Data integration
– Integration of multiple databases, data cubes, or files
• Data transformation
– Normalization and aggregation
• Data reduction
– Obtains reduced representation in volume but produces the same or similar analytical results
• Data discretization
– Part of data reduction but with particular importance, especially for numerical data
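Two of these tasks are easy to illustrate in Python: filling in missing values (cleaning) and normalization (transformation). The sketch below assumes missing values are encoded as `None`, fills them with the attribute mean, and rescales with min-max normalization; both choices are just one common option among several, and the function names are hypothetical.

```python
def fill_missing_with_mean(values):
    """Replace None entries with the mean of the known values."""
    known = [v for v in values if v is not None]
    if not known:
        raise ValueError("no known values to compute a mean from")
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant attribute: every value maps to the lower bound.
        return [new_min for _ in values]
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values]

if __name__ == "__main__":
    raw = [10.0, None, 30.0]
    cleaned = fill_missing_with_mean(raw)  # [10.0, 20.0, 30.0]
    print(min_max_normalize(cleaned))      # [0.0, 0.5, 1.0]
```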
Forms of data preprocessing
Mining Data Descriptive Characteristics
• Motivation
– To better understand the data: central tendency, variation and spread
• Data dispersion characteristics
– median, max, min, quantiles, outliers, variance, etc.
• Numerical dimensions correspond to sorted intervals
– Data dispersion: analyzed with multiple granularities of precision
– Boxplot or quantile analysis on sorted intervals
• Dispersion analysis on computed measures
– Folding measures into numerical dimensions
– Boxplot or quantile analysis on the transformed cube
Measuring the Central Tendency
• Mean (algebraic measure) (sample vs. population):
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ (sample), $\mu = \frac{\sum x}{N}$ (population)
– Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
– Trimmed mean: chopping extreme values
• Median: A holistic measure
– Middle value if odd number of values, or average of the middle two values otherwise
– Estimated by interpolation (for grouped data): $\text{median} = L_1 + \left(\frac{N/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}}\right) \times \text{width}$
• Mode
– Value that occurs most frequently in the data
– Unimodal, bimodal, trimodal
– Empirical formula: $\text{mean} - \text{mode} = 3 \times (\text{mean} - \text{median})$
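These measures can be computed directly in Python. The sketch below uses the standard `statistics` module for the mean and median and small helper functions for the rest. The sample values in `DATA` are made up for illustration, and the parameter names of `grouped_median` (lower boundary of the median interval, total count, cumulative frequency below that interval, frequency of that interval, interval width) are an assumed reading of the interpolation formula.

```python
import statistics
from collections import Counter

# Made-up sample values, used only to exercise the functions.
DATA = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

def trimmed_mean(values, fraction):
    """Drop `fraction` of the values at each end of the sorted data, then average."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

def modes(values):
    """All values that occur with the highest frequency (one, two, or more modes)."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

def grouped_median(l1, n, freq_below, freq_median, width):
    """Interpolated median for grouped data: L1 + ((N/2 - freq_below) / freq_median) * width."""
    return l1 + (n / 2 - freq_below) / freq_median * width

if __name__ == "__main__":
    print(statistics.mean(DATA))    # 58
    print(statistics.median(DATA))  # 54.0
    print(modes(DATA))              # [52, 70] (bimodal)
    print(trimmed_mean(DATA, 0.1))  # drops 30 and 110 before averaging
```

The single large value 110 pulls the mean (58) above the median (54); trimming one value from each end brings the mean back toward the median.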
Symmetric vs. Skewed Data
• Median, mean and mode of symmetric, positively and negatively skewed data
[Figure: three distribution curves labeled symmetric, positively skewed, and negatively skewed]
Measuring the Dispersion of Data
• Quartiles, outliers and boxplots
– Quartiles: Q1 (25th percentile), Q3 (75th percentile)
– Inter-quartile range: IQR = Q3 – Q1
– Five number summary: min, Q1, M, Q3, max
– Boxplot: ends of the box are the quartiles, median is marked, whiskers are added, and outliers are plotted individually
– Outlier: usually, a value more than 1.5 x IQR above Q3 or below Q1
• Variance and standard deviation (sample: s, population: σ)
– Variance: (algebraic, scalable computation)
$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$
$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$
– Standard deviation s (or σ) is the square root of variance s² (or σ²)
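A short Python sketch of these dispersion measures follows. Quartiles can be defined in several slightly different ways; the sketch assumes the "inclusive" interpolation method of `statistics.quantiles`, so other tools may report slightly different Q1 and Q3 for the same data. The sample values are made up. The last function shows why the variance is called a scalable computation: it only needs the running sums of x and x², which can be accumulated in one pass.

```python
import statistics

# Made-up sample values, used only to exercise the functions.
DATA = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using inclusive quartile interpolation."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

def iqr_outliers(values):
    """Values more than 1.5 x IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

def sample_variance_one_pass(values):
    """Sample variance from the sums of x and x^2 (second form of the formula)."""
    n = len(values)
    total = sum(values)
    total_sq = sum(v * v for v in values)
    return (total_sq - total * total / n) / (n - 1)

if __name__ == "__main__":
    print(five_number_summary(DATA))       # (30, 49.25, 54.0, 64.75, 110)
    print(iqr_outliers(DATA))              # [110]
    print(sample_variance_one_pass(DATA))  # agrees with statistics.variance(DATA)
```

The one-pass form can lose precision when the mean is very large relative to the spread, so the two-pass definition is the safer choice when numerical accuracy matters more than a single scan.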
Boxplot Analysis
• Five-number summary of a distribution: Minimum, Q1, M, Q3, Maximum
• Boxplot
– Data is represented with a box
– The ends of the box are at the first and third quartiles, i.e., the height of the box is IQR
– The median is marked by a line within the box
– Whiskers: two lines outside the box extend to Minimum and Maximum
Visualization of Data Dispersion: 3-D Boxplots
Properties of Normal Distribution Curve
• The normal (distribution) curve
– From μ–σ to μ+σ: contains about 68% of the measurements (μ: mean, σ: standard deviation)
– From μ–2σ to μ+2σ: contains about 95% of them
– From μ–3σ to μ+3σ: contains about 99.7% of them
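These percentages can be checked numerically. For a normal distribution, the probability of falling within k standard deviations of the mean is erf(k/√2); the sketch below evaluates it with Python's `math.erf`. The function name `normal_coverage` is just a label chosen for this example.

```python
import math

def normal_coverage(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

if __name__ == "__main__":
    for k in (1, 2, 3):
        print(k, round(100 * normal_coverage(k), 2))
    # 1 68.27
    # 2 95.45
    # 3 99.73
```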
Graphic Displays of Basic Statistical Descriptions
• Boxplot: graphic display of five-number summary
• Histogram: x-axis represents values, y-axis represents frequencies
• Quantile plot: each value xi is paired with fi indicating that approximately 100 fi % of data are ≤ xi
• Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another
• Scatter plot: each pair of values is a pair of coordinates and plotted as points in the plane
• Loess (local regression) curve: add a smooth curve to a scatter plot to provide better perception of the pattern of dependence
Histogram Analysis
• Graph displays of basic statistical class descriptions
– Frequency histograms
  • A univariate graphical method
  • Consists of a set of rectangles that reflect the counts or frequencies of the classes present in the given data
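The counts behind a frequency histogram can be computed as follows. This is a minimal sketch that assumes equal-width classes of the form [low + i·width, low + (i+1)·width); the sample values and the bin settings are made up, and values outside the covered range are simply ignored.

```python
def histogram(values, low, width, bins):
    """Count how many values fall into each of `bins` equal-width classes starting at `low`."""
    counts = [0] * bins
    for v in values:
        index = int((v - low) // width)
        if 0 <= index < bins:
            counts[index] += 1
    return counts

if __name__ == "__main__":
    data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
    # Classes: [20,40), [40,60), [60,80), [80,100), [100,120)
    for i, count in enumerate(histogram(data, low=20, width=20, bins=5)):
        print(f"{20 + 20 * i:>3}-{40 + 20 * i:<3} {'#' * count}")
```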
Histograms Often Tell More than Boxplots
• The two histograms shown on the left may have the same boxplot representation
– The same values for: min, Q1, median, Q3, max
• But they have rather different data distributions
Quantile Plot
• Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences)
• Plots quantile information
– For data xi sorted in increasing order, fi indicates that approximately 100 fi% of the data are below or equal to the value xi
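The points of a quantile plot can be generated as below. The slide does not say how fi is computed; the sketch assumes the common convention fi = (i − 0.5)/n for the i-th smallest of n values, which is one of several conventions in use.

```python
def quantile_plot_points(values):
    """Return (f_i, x_i) pairs for sorted data, with f_i = (i - 0.5) / n (assumed convention)."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]

if __name__ == "__main__":
    for f, x in quantile_plot_points([3, 1, 2, 4]):
        print(f, x)
    # 0.125 1
    # 0.375 2
    # 0.625 3
    # 0.875 4
```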
Quantile-Quantile (Q-Q) Plot
• Graphs the quantiles of one univariate distribution against the corresponding quantiles of another
• Allows the user to view whether there is a shift in going from one distribution to another
Scatter plot
• Provides a first look at bivariate data to see clusters of points, outliers, etc.
• Each pair of values is treated as a pair of coordinates and plotted as points in the plane
Loess Curve
• Adds a smooth curve to a scatter plot in order to provide better perception of the pattern of dependence
• Loess curve is fitted by setting two parameters: a smoothing parameter, and the degree of the polynomials that are fitted by the regression
Positively and Negatively Correlated Data
• The left half fragment is positively correlated
• The right half is negatively correlated
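The slide judges correlation visually from the scatter plot. One common way to quantify the same idea, not mentioned on the slide and introduced here only as an illustration, is the Pearson correlation coefficient: it is close to +1 for positively correlated data, close to −1 for negatively correlated data, and near 0 when there is no linear relationship.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

if __name__ == "__main__":
    xs = [1, 2, 3, 4]
    print(pearson(xs, [2, 4, 6, 8]))  # close to +1: positively correlated
    print(pearson(xs, [8, 6, 4, 2]))  # close to -1: negatively correlated
```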
Not Correlated Data
Data Visualization and Its Methods
• Why data visualization?
– Gain insight into an information space by mapping data onto graphical primitives
– Provide qualitative overview of large data sets
– Search for patterns, trends, structure, irregularities, relationships among data
– Help find interesting regions and suitable parameters for further quantitative analysis
– Provide a visual proof of computer representations derived
• Typical visualization methods:
– Geometric techniques
– Icon-based techniques
– Hierarchical techniques
Direct Data Visualization
[Figure: Ribbons with Twists Based on Vorticity]
Geometric Techniques
• Visualization of geometric transformations and projections of the data
• Methods
– Landscapes
– Projection pursuit technique
  • Finding meaningful projections of multidimensional data
– Scatterplot matrices
– Prosection views
– Hyperslice
– Parallel coordinates
Scatterplot Matrices
Matrix of scatterplots (x-y-diagrams) of the k-dim. data [total of (k² − k)/2 distinct scatterplots, one per unordered pair of dimensions]
Used by permission of M. Ward, Worcester Polytechnic Institute
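The number of distinct panels in a scatterplot matrix is the number of unordered pairs of dimensions, k(k − 1)/2. A small Python sketch enumerating those pairs is given below; the attribute names are made up for illustration.

```python
from itertools import combinations

def scatterplot_pairs(attributes):
    """All unordered pairs of attributes, one per distinct scatterplot panel."""
    return list(combinations(attributes, 2))

if __name__ == "__main__":
    attrs = ["age", "income", "height", "weight"]  # hypothetical attributes, k = 4
    pairs = scatterplot_pairs(attrs)
    print(len(pairs))  # 6, i.e. (4*4 - 4) / 2
    for x_attr, y_attr in pairs:
        print(x_attr, "vs", y_attr)
```

A full k × k matrix shows each pair twice (once with the axes swapped) plus the k diagonal cells, which is why only (k² − k)/2 of the panels carry distinct information.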
Landscapes
• Visualization of the data as perspective landscape
• The data needs to be transformed into a (possibly artificial) 2D spatial representation which preserves the characteristics of the data
[Figure: news articles visualized as a landscape. Used by permission of B. Wright, Visible Decisions Inc.]
Parallel Coordinates
• n equidistant axes which are parallel to one of the screen axes and correspond to the attributes
• The axes are scaled to the [minimum, maximum] range of the corresponding attribute
• Every data item corresponds to a polygonal line which intersects each of the axes at the point which corresponds to the value for the attribute
[Figure: parallel axes labeled Attr. 1, Attr. 2, Attr. 3, ..., Attr. k]
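The construction can be sketched in Python as a function that turns each data item into the vertices of its polygonal line. The sketch assumes vertical axes placed at equal spacing across a unit-wide screen and scales each attribute to [0, 1] using its own minimum and maximum; the sample rows are made up, and constant attributes are placed at mid-height by convention.

```python
def parallel_coordinates(rows):
    """Map each row to a polyline [(x, y), ...] with one vertex per attribute axis.

    x is the axis position (axes are equidistant in [0, 1]);
    y is the attribute value scaled to the [minimum, maximum] range of that attribute.
    """
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    k = len(columns)
    polylines = []
    for row in rows:
        line = []
        for j, value in enumerate(row):
            span = highs[j] - lows[j]
            y = 0.5 if span == 0 else (value - lows[j]) / span
            x = j / (k - 1) if k > 1 else 0.0
            line.append((x, y))
        polylines.append(line)
    return polylines

if __name__ == "__main__":
    rows = [(1, 10, 5), (3, 20, 5), (2, 30, 5)]  # three items, three attributes
    for line in parallel_coordinates(rows):
        print(line)
```

The resulting vertex lists can be handed to any plotting library that draws line segments.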
Parallel Coordinates of a Data Set
Icon-based Techniques
• Visualization of the data values as features of icons
• Methods:
– Chernoff Faces
– Stick Figures
– Shape Coding
– Color Icons
– TileBars: The use of small icons representing the relevance feature vectors in document retrieval
Chernoff Faces
• A way to display variables on a two-dimensional surface, e.g., let x be eyebrow slant, y be eye size, z be nose length, etc.
• The figure shows faces produced using 10 characteristics (head eccentricity, eye size, eye spacing, eye eccentricity, pupil size, eyebrow slant, nose size, mouth shape, mouth size, and mouth opening), each assigned one of 10 possible values, generated using Mathematica (S. Dickson)
• REFERENCE: Gonick, L. and Smith, W. The Cartoon Guide to Statistics. New York: Harper Perennial, p. 212, 1993
• Weisstein, Eric W. "Chernoff Face." From MathWorld--A Wolfram Web Resource. mathworld.wolfram.com/ChernoffFace.html
Stick Figures
[Figure: census data showing age, income, gender, education, etc. Used by permission of G. Grinstein, University of Massachusetts at Lowell]
Hierarchical Techniques
• Visualization of the data using a hierarchical partitioning into subspaces.
• Methods
– Dimensional Stacking
– Worlds-within-Worlds
– Treemap
– Cone Trees
– InfoCube
Dimensional Stacking
[Figure: nested axes labeled attribute 1, attribute 2, attribute 3, attribute 4]
• Partitioning of the n-dimensional attribute space in 2-D subspaces which are ‘stacked’ into each other
• Partitioning of the attribute value ranges into classes; the important attributes should be used on the outer levels
• Adequate for data with ordinal attributes of low cardinality
• But, difficult to display more than nine dimensions
• Important to map dimensions appropriately
Dimensional Stacking
[Figure: visualization of oil mining data with longitude and latitude mapped to the outer x-, y-axes and ore grade and depth mapped to the inner x-, y-axes. Used by permission of M. Ward, Worcester Polytechnic Institute]
Tree-Map
• Screen-filling method which uses a hierarchical partitioning of the screen into regions depending on the attribute values
• The x- and y-dimension of the screen are partitioned alternately according to the attribute values (classes)
[Figure: MSR Netscan image]
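The alternating partitioning can be sketched as a short recursive function, often called a slice-and-dice layout. The sketch below is an illustration under stated assumptions rather than any specific tool's algorithm: a tree node is a `(name, size)` pair for a leaf or a `(name, [children])` pair for an inner node, and each rectangle is split in proportion to the total size of each child, switching between the x- and y-direction at every level.

```python
def size(node):
    """Total size of a node: its own number for a leaf, the sum of its children otherwise."""
    _, payload = node
    if isinstance(payload, list):
        return sum(size(child) for child in payload)
    return payload

def treemap(node, x, y, w, h, split_x=True, out=None):
    """Return {leaf name: (x, y, width, height)} for a slice-and-dice layout."""
    if out is None:
        out = {}
    name, payload = node
    if not isinstance(payload, list):
        out[name] = (x, y, w, h)
        return out
    total = size(node)
    offset = 0.0  # fraction of the rectangle already handed to earlier children
    for child in payload:
        share = size(child) / total if total else 0.0
        if split_x:
            treemap(child, x + offset * w, y, w * share, h, False, out)
        else:
            treemap(child, x, y + offset * h, w, h * share, True, out)
        offset += share
    return out

if __name__ == "__main__":
    # Hypothetical file system: one file of size 2 and a directory with two files of size 1.
    tree = ("root", [("a", 2), ("dir", [("b", 1), ("c", 1)])])
    for name, rect in treemap(tree, 0.0, 0.0, 4.0, 2.0).items():
        print(name, rect)
    # a (0.0, 0.0, 2.0, 2.0)
    # b (2.0, 0.0, 2.0, 1.0)
    # c (2.0, 1.0, 2.0, 1.0)
```

Each leaf's area is proportional to its size, so large files or classes stand out immediately.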
Tree-Map of a File System (Shneiderman)