Date post: 19-Jul-2020
Visualizing and Exploring Data
Transcript
Page 1: Visualizing and Exploring Datasrihari/CSE626/Slide/Ch3-part1 1.pdfVisualizing and Exploring Data. Visual Methods for finding ... • For higher dimensions -- Principal Components Analysis.

Visualizing and Exploring Data

Page 2:

Visual Methods for finding structures in data

• Power of human eye/brain to detect structures
  – Product of eons of evolution

• Display data in ways that capitalize on human pattern processing abilities

• Can find unexpected relationships
  – Limitation: very large data sets

Page 3:

Exploratory Data Analysis

• Explore the data without any clear ideas of what we are looking for

• EDA techniques are
  – Interactive
  – Visual

• Many graphical methods for low-dimensional data

• For higher dimensions: Principal Components Analysis

Page 4:

Topics in Visualization

1. Summarizing Data: Mean, Variance, Standard Deviation, Skewness
2. Tools for Single Variables (histogram)
3. Tools for Pairs of Variables (scatterplot)
4. Tools for Multiple Variables
5. Principal Components Analysis
  – Reduced number of dimensions

Page 5:

1. Summarizing the data

• Centrality

$$\text{Mean, } \mu = \frac{1}{n} \sum_{i=1}^{n} x(i)$$

  – Minimizes the sum of squared errors to all samples
  – If there are n data values, the mean is the value such that the sum of n copies of the mean equals the sum of the data values

• Measures of Location
  – Mean is a measure of location
  – Median (value that has an equal number of points above and below)
  – Quartile (value greater than a quarter of the data points)
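The properties of the mean and the other location measures listed above can be checked numerically. The following is a minimal sketch using only the Python standard library; the sample values and the helper name `sum_squared_error` are assumptions made for illustration, not part of the slides.

```python
import statistics


def sum_squared_error(values, centre):
    """Sum of squared differences between each value and a candidate centre."""
    return sum((x - centre) ** 2 for x in values)


# Hypothetical sample, chosen only for illustration.
data = [2.0, 3.0, 5.0, 7.0, 11.0, 13.0, 17.0, 19.0]
n = len(data)

mean = sum(data) / n                          # (1/n) * sum of x(i)
median = statistics.median(data)              # equal number of points above and below
q1, q2, q3 = statistics.quantiles(data, n=4)  # quartile cut points; q2 is the median

# Compare the sum of squared errors at the mean with two nearby candidates.
sse_mean = sum_squared_error(data, mean)
sse_below = sum_squared_error(data, mean - 0.5)
sse_above = sum_squared_error(data, mean + 0.5)

print(f"mean={mean}, median={median}, quartiles=({q1}, {q2}, {q3})")
print(f"SSE at mean={sse_mean:.3f}, below={sse_below:.3f}, above={sse_above:.3f}")
```

Note that quartile values depend on the interpolation convention; `statistics.quantiles` uses its default "exclusive" method here.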

Page 6:

Measures of Dispersion, or Variability

Variance (average squared error in the mean representing the data):

$$\sigma^2 = \frac{1}{n-1} \sum_{i=1}^{n} [x(i) - \mu]^2$$

Standard Deviation:

$$\sigma = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} [x(i) - \mu]^2}$$

Skewness (measures how much the data is one-sided, i.e., has a single long tail), where $\hat{\mu}$ is the sample mean:

$$\text{Skewness} = \frac{\sum_{i} (x(i) - \hat{\mu})^3}{\left(\sum_{i} (x(i) - \hat{\mu})^2\right)^{3/2}}$$
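The three dispersion formulas can be written out directly. This is a small sketch, assuming the 1/(n − 1) factor for the variance and the ratio form of skewness given on this slide; the function names and the two sample lists are hypothetical.

```python
import math


def sample_mean(values):
    return sum(values) / len(values)


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    mu = sample_mean(values)
    return sum((x - mu) ** 2 for x in values) / (len(values) - 1)


def sample_std(values):
    """Square root of the variance."""
    return math.sqrt(sample_variance(values))


def skewness(values):
    """Sum of cubed deviations over (sum of squared deviations) ** (3/2)."""
    mu = sample_mean(values)
    cubed = sum((x - mu) ** 3 for x in values)
    squared = sum((x - mu) ** 2 for x in values)
    return cubed / squared ** 1.5


# Hypothetical samples: one symmetric, one with a single long right tail.
symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
long_right_tail = [1.0, 1.0, 2.0, 2.0, 3.0, 20.0]

print(sample_variance(symmetric), sample_std(symmetric))
print(skewness(symmetric), skewness(long_right_tail))
```

A symmetric sample gives a skewness of zero, while the sample with a long right tail gives a positive value.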

Page 7:

2. Tools for Displaying Single Variables

• Basic display for univariate data is the histogram
  – Number of values of the variable that lie in consecutive intervals
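Counting the values that fall in consecutive intervals is all a histogram needs. Here is a minimal sketch with equal-width intervals; the function name `histogram_counts`, the choice of four bins, and the data (weeks of card usage) are illustrative assumptions.

```python
def histogram_counts(values, low, high, bins):
    """Count how many values fall in each of `bins` equal-width intervals on [low, high]."""
    width = (high - low) / bins
    counts = [0] * bins
    for x in values:
        if low <= x <= high:
            # A value equal to `high` is placed in the last interval.
            k = min(int((x - low) / width), bins - 1)
            counts[k] += 1
    return counts


# Hypothetical data: number of weeks in a year in which a card was used.
weeks_used = [0, 0, 0, 1, 2, 5, 12, 25, 51, 52, 52, 52]
counts = histogram_counts(weeks_used, low=0, high=52, bins=4)

# Text rendering of the histogram, one row per interval.
for k, count in enumerate(counts):
    print(f"[{k * 13:2d}, {(k + 1) * 13:2d}) {'#' * count}")
```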

Page 8:

Histogram of supermarket credit card usage

[Figure: histogram; horizontal axis: weeks. Annotations: "Many did not use it at all"; "These used it every week except holidays".]

Page 9:

Histogram of Diastolic blood pressure of individuals (UCI ML archive)

[Figure: histogram. Annotation: "Zero BP means data missing".]
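If zero is being used as a code for a missing reading, it distorts any summary computed from the raw values. The sketch below shows one way to separate such codes before summarizing; the readings are invented for illustration and are not taken from the UCI data.

```python
# Hypothetical diastolic readings; 0 stands for "value not recorded".
readings = [72, 0, 80, 64, 0, 90, 70]

valid = [r for r in readings if r != 0]
missing = len(readings) - len(valid)

mean_with_zeros = sum(readings) / len(readings)  # pulled down by the missing-value codes
mean_valid = sum(valid) / len(valid)             # computed on recorded values only

print(f"missing={missing}, mean with zeros={mean_with_zeros:.1f}, mean of valid={mean_valid:.1f}")
```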

Page 10:

Smoothing estimates

• Kernel Function K

• Estimated density at point x is

$$\hat{f}(x) = \frac{1}{n} \sum_{i=1}^{n} K\left(\frac{x - x(i)}{h}\right)$$

• Gaussian Kernel with std dev h

$$K(t, h) = C e^{-\frac{1}{2} (t/h)^2}, \quad \text{where } t = x - x(i)$$
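A kernel estimate with the Gaussian kernel can be sketched in a few lines. The slide leaves the constant C unspecified; the code below assumes C = 1/(h√(2π)), which makes each kernel, and therefore the estimate, integrate to one. The sample, the bandwidth, and the function names are hypothetical.

```python
import math


def gaussian_kernel(t, h):
    """K(t, h) = C * exp(-0.5 * (t / h) ** 2), with C assumed to be 1 / (h * sqrt(2 * pi))."""
    c = 1.0 / (h * math.sqrt(2.0 * math.pi))
    return c * math.exp(-0.5 * (t / h) ** 2)


def kde(x, sample, h):
    """Estimated density at x: the average of the kernels centred on each data point."""
    return sum(gaussian_kernel(x - xi, h) for xi in sample) / len(sample)


# Hypothetical sample and bandwidth.
sample = [1.0, 1.5, 2.0, 2.2, 3.0, 6.0]
h = 0.5

# Riemann-sum check that the estimate integrates to roughly one.
step = 0.01
grid = [-5.0 + k * step for k in range(2001)]  # covers -5.0 to 15.0
area = sum(kde(x, sample, h) for x in grid) * step

print(f"density at 2.0 = {kde(2.0, sample, h):.4f}, area = {area:.4f}")
```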

Page 11:

Kernel estimates with different values of h: small values lead to spiky estimates

[Figure: kernel estimates for several values of h. Annotations: "Data is right skewed with hint of multimodality"; "Higher smoothing".]

$$K(t, h) = C e^{-\frac{1}{2} (t/h)^2}$$
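One way to see the effect of h is to count the local maxima of the estimate on a grid. The sketch below does this for a small and a large bandwidth; the sample, the grid limits, the normalising constant, and the function names are assumptions for illustration.

```python
import math


def kde(x, sample, h):
    """Gaussian kernel estimate at x, with the constant assumed to be 1 / (h * sqrt(2 * pi))."""
    c = 1.0 / (h * math.sqrt(2.0 * math.pi))
    return sum(c * math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in sample) / len(sample)


def count_modes(sample, h, low=-5.0, high=15.0, step=0.05):
    """Count grid points where the estimate is strictly higher than both neighbours."""
    steps = round((high - low) / step)
    density = [kde(low + k * step, sample, h) for k in range(steps + 1)]
    return sum(
        1
        for k in range(1, len(density) - 1)
        if density[k] > density[k - 1] and density[k] > density[k + 1]
    )


# Hypothetical sample with one isolated point on the right.
sample = [1.0, 2.0, 3.0, 7.0]

spiky_modes = count_modes(sample, h=0.2)   # small h: one bump per data point
smooth_modes = count_modes(sample, h=3.0)  # large h: bumps merge into one

print(spiky_modes, smooth_modes)
```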

Page 12:

3. Tools for Displaying Relationship between two variables

• Box Plots
• Scatter Plots
• Contour Plots
• Time as one of the two variables

Page 13:

Box Plot

[Figure: box plot. Labels: Median; Upper Quartile; Lower Quartile; 1.5 times inter-quartile range.]
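The quantities labelled in a box plot can be computed directly. This sketch assumes one common convention, in which whiskers reach the most extreme data values lying within 1.5 times the inter-quartile range of the box and anything beyond is drawn as a separate point; the slide does not say which convention its figure uses. The function name and data are hypothetical.

```python
import statistics


def box_plot_summary(values):
    """Quartiles, whisker ends, and outlying points under the 1.5 * IQR convention."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [x for x in ordered if low_fence <= x <= high_fence]
    return {
        "lower_quartile": q1,
        "median": median,
        "upper_quartile": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": [x for x in ordered if x < low_fence or x > high_fence],
    }


# Hypothetical sample with one extreme value.
summary = box_plot_summary([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0])
print(summary)
```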

Page 14:

[Figure labels: Healthy; Diabetic; Multiple Variables]

Page 15:

Scatterplot

[Figure: scatterplot of credit card repayment data]

• Highly correlated data
• Significant number depart from pattern: worth investigating

Page 16:

Scatterplot Disadvantages

1. With a large number of data points, reveals little structure

2. Can conceal overprinting, which can be significant for multimodal data
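Overprinting means that several observations land on the same plotted position, so the plot shows one mark where there are many. A quick count of coincident points reveals how much a scatterplot would hide; the points below are invented for illustration.

```python
from collections import Counter

# Hypothetical (x, y) observations; several share exactly the same coordinates.
points = [(1, 1), (1, 1), (1, 1), (2, 3), (2, 3), (4, 0)]

multiplicity = Counter(points)
overprinted = {p: c for p, c in multiplicity.items() if c > 1}

visible_marks = len(multiplicity)             # marks that can be told apart on the plot
hidden_points = len(points) - visible_marks   # observations concealed behind another mark

print(f"{len(points)} observations, {visible_marks} visible marks, {hidden_points} hidden")
print(overprinted)
```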

Page 17:

Contour plot

1. Overcomes some scatterplot problems
   – Unimodality can be seen: not apparent in scatterplot

2. Requires a 2-D density estimate to be constructed with a 2-D kernel
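The 2-D density estimate behind a contour plot can be sketched by evaluating a 2-D kernel estimate on a grid; contour lines are then drawn through grid values of equal height. The code assumes an isotropic Gaussian kernel with a single bandwidth h, which is only one possible choice of 2-D kernel; the points, grid, and names are hypothetical.

```python
import math


def kde_2d(x, y, points, h):
    """Average of isotropic 2-D Gaussian bumps of width h centred on the data points."""
    norm = 1.0 / (2.0 * math.pi * h * h * len(points))
    return norm * sum(
        math.exp(-((x - px) ** 2 + (y - py) ** 2) / (2.0 * h * h)) for px, py in points
    )


# Hypothetical data: a cluster near the origin and one point further out.
points = [(0.0, 0.0), (0.5, 0.2), (-0.3, 0.4), (0.1, -0.5), (3.0, 3.0)]
h = 1.0

step = 0.25
axis = [-6.0 + k * step for k in range(61)]  # covers -6.0 to 9.0 on both axes
grid = [[kde_2d(x, y, points, h) for x in axis] for y in axis]

# The grid values should sum to roughly one when weighted by the cell area.
total_mass = sum(sum(row) for row in grid) * step * step

# Location of the highest grid value (the peak a contour plot would surround).
peak_value, peak_x, peak_y = max(
    (grid[r][c], axis[c], axis[r]) for r in range(len(axis)) for c in range(len(axis))
)

print(f"total mass = {total_mass:.4f}, peak near ({peak_x}, {peak_y})")
```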

Page 18:

Display when one of the variables is time

[Figure: time series plot, Jan 1963 to Dec 1970. Annotations: "Annual fees introduced"; "Peaks in early and late summer and around new year".]

Page 19:

Page 20:

Tools for Displaying More than Two Variables

• Scatter plots for all pairs of variables
• Trellis Plot
• Parallel Coordinates Plot

Page 21:

More than two variables

• Sheets of paper and computer screens are fine for two variables

• Need projections from higher-dimensional data to 2-D plane

• Methods
  – Examine all pairs of variables

• Scatterplot matrix
• Trellis plot
• Icons

Page 22:

Scatter Plot Matrix: CPU performance

209 CPU data:
• Cycle Time
• Minimum Memory
• Maximum Memory
• Cache Size (Kb)
• Minimum Channels
• Maximum Channels
• Relative Performance
• Estimated rel perf (wrt IBM)

[Figure: scatter plot matrix of the CPU variables. Annotations: "Independent"; "Correlated".]
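A scatter plot matrix has one panel per pair of variables, so p variables give p(p − 1)/2 distinct pairs. The sketch below enumerates the pairs and computes a Pearson correlation for each, as a numeric stand-in for reading "independent" or "correlated" off a panel. The column names and values are invented and are not the CPU data.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical columns; "b" is an exact linear function of "a".
columns = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
}

# One entry per off-diagonal panel of the scatter plot matrix.
correlations = {
    (u, v): pearson(columns[u], columns[v]) for u, v in combinations(columns, 2)
}

for pair, r in correlations.items():
    print(pair, round(r, 3))
```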

Page 23:

Disadvantage of Scatter Plot Matrices

• Scatter Plot Matrices are multiple bivariate solutions

• Not a multivariate solution

• Such projections sacrifice information

[Figure: 3 variables, 8 cubes that are alternately empty and full, shown with a 2-D projection. Each 1-D and 2-D projection is uniformly distributed!]
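The eight-cube example can be reproduced on a small lattice. In this sketch each axis is split into two halves, a cube counts as "full" when the sum of its three half-indices is even, and the full cubes are filled with lattice points. The lattice size and the parity rule are assumptions chosen to match the "alternately empty and full" description.

```python
from collections import Counter
from itertools import product

# Lattice points on a 4 x 4 x 4 grid, kept only when they fall in a "full" cube.
points = [
    (i, j, k)
    for i, j, k in product(range(4), repeat=3)
    if (i // 2 + j // 2 + k // 2) % 2 == 0
]

# 2-D projections: drop one coordinate and count points per cell.
proj_xy = Counter((i, j) for i, j, k in points)
proj_xz = Counter((i, k) for i, j, k in points)
proj_yz = Counter((j, k) for i, j, k in points)

# 1-D projection onto the first axis.
proj_x = Counter(i for i, j, k in points)

print(f"{len(points)} of 64 lattice cells occupied in 3-D")
print("distinct counts per 2-D cell:", set(proj_xy.values()))
print("distinct counts per 1-D cell:", set(proj_x.values()))
```

Every 1-D and 2-D cell receives the same count, so no panel of a scatter plot matrix would show the checkerboard structure that is obvious in three dimensions.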

Page 24:

Trellis Plot

• Rather than displaying a scatter plot for each pair of variables

• Fix a particular pair of variables and produce a series of scatter plots, histograms, time series plots, contour plots, etc., one for each setting of the other (conditioning) variables
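The data preparation behind a trellis of scatterplots amounts to grouping the fixed (x, y) pair by the values of the conditioning variables. The sketch below does that and fits a least-squares line per panel; the records, field layout, and function name are hypothetical.

```python
from collections import defaultdict


def fit_line(points):
    """Least-squares slope and intercept for a list of (x, y) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    slope = sxy / sxx
    return slope, my - slope * mx


# Hypothetical records: (sex, age group, x, y). The pair (x, y) is the fixed pair.
records = [
    ("male", "younger", 1.0, 2.0), ("male", "younger", 2.0, 4.0), ("male", "younger", 3.0, 6.0),
    ("male", "older", 1.0, 1.0), ("male", "older", 2.0, 1.5), ("male", "older", 3.0, 2.0),
    ("female", "younger", 1.0, 3.0), ("female", "younger", 2.0, 3.0), ("female", "younger", 4.0, 6.0),
    ("female", "older", 0.0, 1.0), ("female", "older", 2.0, 2.0), ("female", "older", 4.0, 5.0),
]

# One panel per combination of the conditioning variables.
panels = defaultdict(list)
for sex, age_group, x, y in records:
    panels[(sex, age_group)].append((x, y))

lines = {key: fit_line(pts) for key, pts in panels.items()}
for key, (slope, intercept) in lines.items():
    print(key, f"slope={slope:.2f}", f"intercept={intercept:.2f}")
```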

Page 25:

Trellis Plot (with scatterplots)

[Figure: trellis of scatterplots with a best fit line. Panel labels: Male, Female; Younger, Older. Axis labels: epileptic seizures in 2 week period; epileptic seizures in later 2 week period.]

Page 26:

Icon Plot

• Star Plot: each direction corresponds to a variable; length corresponds to a value

[Figure: star plots for 53 samples of minerals, 12 chemical properties]
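The geometry of a star icon is simple to compute: variable k of p gets the direction 2πk/p, and the ray length is the scaled value. The sketch below assumes each value is scaled by a per-variable maximum so that rays have length at most one; that scaling, the function name, and the numbers are illustrative choices.

```python
import math


def star_vertices(values, maxima):
    """Return the (x, y) end point of each ray of a star icon for one sample."""
    p = len(values)
    vertices = []
    for k, (value, maximum) in enumerate(zip(values, maxima)):
        angle = 2.0 * math.pi * k / p   # one direction per variable
        length = value / maximum        # ray length proportional to the value
        vertices.append((length * math.cos(angle), length * math.sin(angle)))
    return vertices


# Hypothetical sample with four variables and assumed per-variable maxima.
vertices = star_vertices([2.0, 1.0, 4.0, 3.0], maxima=[4.0, 2.0, 4.0, 6.0])
for x, y in vertices:
    print(f"({x:+.2f}, {y:+.2f})")
```

Joining the end points in order gives the outline of the star for that sample.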

Page 27:

Parallel Coordinates Plot

• Each path represents an individual
• Each count represents a 2-week period
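In a parallel coordinates plot each variable gets its own vertical axis and each individual becomes a path across those axes. The sketch below builds the paths, assuming min-max scaling of each variable to [0, 1] so the axes share a common height; the scaling choice, function name, and rows are hypothetical.

```python
def parallel_coordinate_paths(rows):
    """Turn each row into a list of (axis index, scaled value) points."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    paths = []
    for row in rows:
        path = []
        for axis, value in enumerate(row):
            span = highs[axis] - lows[axis]
            # A constant column has no spread; place it mid-axis.
            scaled = (value - lows[axis]) / span if span else 0.5
            path.append((axis, scaled))
        paths.append(path)
    return paths


# Hypothetical individuals measured on three variables.
rows = [
    [1.0, 10.0, 100.0],
    [2.0, 30.0, 300.0],
    [3.0, 20.0, 200.0],
]
paths = parallel_coordinate_paths(rows)
for path in paths:
    print(path)
```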

