Haim Lecture2 Data And Data Modeling 2ppg

Date post: 20-Jul-2015

Lecture 2

Part 1 – Data and data modeling

Introduction

• Data has many sources

it can be

gathered from sensors or surveys, or

generated by simulations and computations

• Data can be

raw (untreated) or

derived from raw data via some process,

such as

smoothing, noise-removal, scaling, or interpolation

Data Models and Management

• Data

• Data Objects and Models

• Visualization Objects

• Metadata

• Data Retrieval

• DBMS

Data, tasks and simple visualizations

• Data

1D, 2D, 3D, …, nD

Structured and unstructured

• Tasks

Present, Confirm, Explore

Query

Summarize

Analyze

• Simple Visualizations

Points

Line and curves

Charts and graphs

Some Key Data Factors?

• Very large number of parameters

more than 10^5

• Very large data sets

more than 10^7

• Multiple data types

discrete and continuous

• Noisy data

often not uniform

• Missing values

could be important

• Lots of different tasks

• Many visualizations

Data sets

• List of records

• Each record consists of one or more observations

• Each observation or variable may be

A single number or symbol

A more complex structure

• Each variable may be independent or dependent

• Data may be generated by a process or function

independent variables define function’s domain

dependent variables define function’s range

Types or Categories of Data

• Categorical

• Continuous

• Nominal

• Ordinal

• Interval

• Ratio

• Qualitative

levels of measurement proposed by Stanley Smith Stevens in his 1946 article "On the theory of scales of measurement"

Categorical Data

• Values each having label or category

may / may not have

ordering relationship

could have implied / imposed partial order

distance metric

absolute zero

• Marital status

Single, married, divorced, widowed, …

• Profession

Teacher, student, janitor, tailor, pilot, …

• Weapon used

Machine gun, rifle, gun, knife, paper clip, …

• Insurance company

AAA, Hanover, Aetna, …

• Employed

Yes, no, not sure

Continuous Data

• Numeric values each belonging to some interval

almost all interval values possible

• Weight -- pounds

• Height -- meters

• Age -- years

• Gene expression

• Salary

• Distance from pump -- feet

• Temperature -- degrees

Nominal Data (symbolic, categorical)

• Categorical data

order of categories arbitrary, but

numerals assigned as labels or names

• Variables placed into mutually exclusive

categories

• members of one category qualitatively different

from members of any other category

• ==> classification without ordering

• Examples

hair color: brown, black, blond, red

gender: male, female

genomic base pairs (A, C, T, G)

marital status: yes, no

Nominal Data (symbolic, categorical)

• Mapping of numbers to labels possible

(e.g. male = 0, female = 1)

• One value not necessarily greater than another

• Statistical computations typically have no

meaning

(although mode can)

Nominal Data (symbolic, categorical)

• may be only way to measure qualitative variable

(religion, gender)

• Operations

equality / inequality (same / different)

• does not establish quantitative relationship

between categories

Ordinal Data

• order of categories relevant, and

• numerals / labels having interpretation assigned

to labels

• Categorization of data with ordering

order information available, but

no information about magnitude of distance

between adjacent categories

• Some statistical computations may not have any

meaning

• Examples

• Perceptual difficulty scale

very difficult = 10, moderately difficult = 8,

average difficulty = 5, …, easy = 0

• Weapon used by severity

Machine gun = 1, rifle = 2, gun = 3, knife = 4,

paper clip = 5, …

• Likert scale of agreement

5 = strongly agree, to 1 = strongly disagree

Ordinal Data

• Operations

Equality / inequality

less than / more than (order)

• Example: students 1st, 2nd, 3rd

1st better than 2nd -- by how much?

cannot compare differences between

categories

Numeric (discrete vs. continuous)

• Discrete values: Integer

Numerical distance between adjacent units is equal

• Continuous values: Real

Any value with arbitrary precision is possible

no gaps in scale

• May lack absolute zero

represents complete absence of characteristic being

measured

zero value is arbitrary starting point

could be replaced by any other value

Numeric (discrete vs. continuous)

• Operations

equality / inequality

less than / more than (order)

addition / subtraction

distance metrics

Interval Data

• Continuous data where

data falls in range of numbers

data differences meaningful, but

ratios may have no meaning, since

ranges can be linearly transformed to other

scales

changing interpretation of zero

• Distance differences have meaning

90-100 and 80-90 are similar

• Ratios of differences can have meaning

• mean and median have meaning

Examples

• Temperature -- Celsius / Fahrenheit

Twice the temperature depends on scale used

• IQ measure
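The point that "twice the temperature" depends on the scale can be checked numerically. The following is a minimal Python sketch; the function name `celsius_to_fahrenheit` and the sample readings are illustrative assumptions, not part of the lecture.

```python
def celsius_to_fahrenheit(c: float) -> float:
    """Linear change of scale: both the unit size and the zero point move."""
    return c * 9.0 / 5.0 + 32.0


# Hypothetical readings in Celsius
a_c, b_c, c_c = 10.0, 20.0, 40.0
a_f, b_f, c_f = (celsius_to_fahrenheit(t) for t in (a_c, b_c, c_c))

# A plain ratio changes with the scale, so "twice as warm" is not meaningful.
ratio_c = b_c / a_c            # ratio of the Celsius readings
ratio_f = b_f / a_f            # ratio of the same readings in Fahrenheit

# A ratio of differences is unchanged by the linear transformation.
diff_ratio_c = (c_c - b_c) / (b_c - a_c)
diff_ratio_f = (c_f - b_f) / (b_f - a_f)
```

Here `ratio_c` and `ratio_f` differ, while `diff_ratio_c` and `diff_ratio_f` agree, which matches the statement that differences (and ratios of differences) carry meaning on an interval scale but plain ratios do not.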

Ratio Data

• Continuous data, where

both differences and ratios meaningful

zero has meaning

• can be classified as Interval data

• ==> can often be classified ratio data

• Geometric mean can only be applied to ratio

data, and

• arithmetic mean extremely meaningful

Examples

• Temperature, mass, energy, ...

• Age, weight

• Number of students at colloquia

Relationship among categories

• Each category provides more computational

possibilities ==>

Ratio more meaningful than interval

Interval more meaningful than ordinal

Ordinal more meaningful than nominal

References

• Babbie, E., 'The Practice of Social Research', 10th edition, Wadsworth, Thomson Learning Inc., ISBN 0534620299

• Michell, J. (1986). Measurement scales and statistics: a clash of paradigms. Psychological Bulletin, 100(3), 398-407.

• Stevens, S.S. (1946). On the theory of scales of measurement. Science, 103, 677-680.

• Stevens, S.S. (1951). Mathematics, measurement and psychophysics. In S.S. Stevens (Ed.), Handbook of experimental psychology (pp. 1-49). New York: Wiley.

• Velleman, P. F. & Wilkinson, L. (1993). Nominal, ordinal, interval, and ratio typologies are misleading. The American Statistician, 47(1), 65-72. [On line] http://www.spss.com/research/wilkinson/Publications/Stevens.pdf

Typical Data

• Cars

make

model

year

miles per gallon

cost

number of cylinders

weight

...

Typical Data Classes

• 2D scalar

• 3D scalar

• vector data

• organizational data

• complex data models

2D Scalar

• Sequence of ordered pairs v_i = (x, y)

with x and y in some scalar set

• Where indices are, for example

i ∈ {1, 2, 3, ..., n}

i ∈ {a, b, c, ..., z}

i ∈ a subset of R

• Examples

time series

set of points in (x, y) plane

3D Scalar

• Sequence of ordered triplets v_i = (x, y, z)

with x, y and z in some scalar set

• Where indices are, for example

i ∈ {1, 2, 3, ..., n}

i ∈ {a, b, c, ..., z}

i ∈ a subset of R

• Examples

time series of 2D points

set of points in (x, y, z) space

Vector Data

• generalization of the above

• n-dim vectors v_k = (x_1, x_2, x_3, ..., x_n)

where x_i in some scalar set

• indices are, for example

k ∈ {1, 2, 3, ..., n}

k ∈ {a, b, c, ..., z}

k ∈ a subset of R

• Examples

time series of n - 1 dim points

set of points in n dim space

Time Series Data

• generalization of the above

• n-dim vectors v_k = (x_1, x_2, x_3, ..., x_n)

where x_i in some scalar set

• And index set based on time

0 ≤ t_1 < t_2 < t_3 < ... < t_n

• index set often included as parameter in n-dim vector

• but brought out here as special case

because of its importance

• This identifies time as “special” variable

Multidimensional Data

• generalization of the above

• n-dimensional vector v_k = (x_1, x_2, x_3, ..., x_n)

where each xi of some possibly varying data

type

not necessarily all the same

==> Each record consists of number of variables

each having own data type

• index set {k} as before

• extends concept of vector

w/ each coordinate of same data type

to one w/ different data types

• Examples

patient records

census data

Structured Data

• Data Tables

Often,

take raw data

xform --> more workable form

• Main idea

Individual items called “cases”

Cases have variables (attributes)

Table View for simple records

Can think of as vector valued function:

f (Record 1) = (190, Red, 2.6)

            Attribute 1   Attribute 2   Attribute 3
Record 1    190           Red            2.6
Record 2    200           Blue          -7.4
Record 3     70           Green          9.0
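The "table as vector-valued function" view can be sketched in a few lines of Python. The names `table`, `f` and `column` are illustrative assumptions; the values are the three records from the table view.

```python
from typing import Dict, List, Tuple

# One case (record) maps to a tuple of (Attribute 1, Attribute 2, Attribute 3).
Record = Tuple[int, str, float]

table: Dict[str, Record] = {
    "Record 1": (190, "Red", 2.6),
    "Record 2": (200, "Blue", -7.4),
    "Record 3": (70, "Green", 9.0),
}


def f(case: str) -> Record:
    """Vector-valued function: maps a case to its tuple of attribute values."""
    return table[case]


def column(index: int) -> List[object]:
    """All values of one variable (attribute) across the cases, in table order."""
    return [row[index] for row in table.values()]
```

Reading by case (`f("Record 1")`) gives one row, while `column(1)` gives one variable across all cases; note that the three attributes have different data types.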

Sources of Patient Information

• Billing System / Patient Information / Claims and numerous others

• Pharmacy System

Adverse Drug Reactions

Patient drug history

• Laboratory Results / DNA Tests

• Medical Records / Demographics data

Dictation Transcripts

Other notes

• Data provided varies / project

can include hospital, outpatient, medical, drug, lab and other results

• Redundant + wide varieties of data input enriches outcome hypotheses

• Note: data consists of large varieties of 2D, 3D, & nD scalar, vector, & structured data

Data Models

• Data Objects consist of three parts

Data

Geometry (physical)

Topology (relational)

• Any visualization pipeline includes

data objects

data mappings

data displays

geometry and topology

• Geometry represents spatial (physical) layout

(embedding) of data in R^n

• Topology represents interconnections

(relationships) between data elements in physical

space.

geometry and topology 2

• Points in space (no connection with each other)

• Lines in space (points may have connection) …

• Surfaces in space …

Points in space (no connection with

each other)

• Relationship is topology

• Distances may / may not have meaning

• Relative position may / may not have meaning

Lines in space (points may have

connection)

• Line lengths may / may not have meaning

• Above / below may / may not have meaning

Surfaces in space

• Points / lines may lie on surface

• Distance may have absolute / relative / no

meaning

• Above / below may / may not have meaning

Data databases and subsets

• origin for data objects

• persistent aspect of data objects

• Any visualization pipeline today will deal with

databases

• may be distributed and their schemas different

• ==> great deal of visualization preprocessing

activity is really database work

A Simple Visualization Pipeline

[Figure: pipeline diagram between DBMS and USER.
Data stages: simulated or sampled data; derived or massaged data; logical data representation; image.
Steps between stages: data transformations (interpolation, filtering, etc.); representation mappings (geometry, color, sound, etc.); rendering (viewing, shading, device transforms, etc.).
The user issues queries and probes.]

Data Objects: The Role of Metadata

• Advanced system function

provides data description

supports rule-based operation

addresses data import problems

(file format standardization)

• Knowledge engineering and data mining

for determining structure in data automatically

• Metadata complexity

descriptive -- frames

active -- production rules

A Conceptual Meta-Model for Metadata (Trivedi and Smith, 1991)

• Metadata Entities

data quality

data dictionary, data directory

conceptual schema

data content

data location

data access

• Metaprocess Entities

operators

transactions

modeling primitives

procedures / functions

host language programs

• Metaenvironment Entities

user profiles

maintenance

physical devices

miscellany

Database Management Systems

• Commercial databases

• Example database /visualization application

(GIS)

• Database models

Relational

Object Relational

Object-Oriented

• Database queries

Commercial Databases

• Relational

Sybase

Oracle

MySQL

DB2

• Object-Oriented

Objectivity

ObjectDB

Versant

Gemstone

Db4o

Example: Geographic Information Systems

[Figure: two-layer GIS schema. Country and City sit in separate layers (Layer 1, Layer 2). Contains is an inter-layer relationship; Adjacent_To and Neighbors are intra-layer relationships.]

Example: Relational Model

2 entity tables, 3 relationship tables

Country
Name          Language
Switzerland   French
Switzerland   German

City
Name     Population
Geneva   210,000
Zurich   135,000

Contains
Country Name   City
Switzerland    Geneva
Germany        Bonn

Adjacent_to
Country       Adjacent Country
Switzerland   Germany

Neighbors
City     Neighbor
Geneva   Lausanne
Geneva   Zurich

Example: Object Relational Model

CREATE TYPE Country {
    Name      CharString REQUIRED;
    Languages CharString MANY;
    Contains  City MANY;
}

CREATE TYPE City {
    ............
}

Retains relational notions (e.g., key)

Support for aggregation in data structure

More language like

Example: Object-Oriented Model

class Country : public Region {
    String Name;
    Set<String> Languages;
    Set<Ref<City>> contains inverse contained_by;
public:
    boolean isSpoken(string language);
};

class City {
    ........
}

almost identical to programming language

supports programming language data structures

encapsulates behavior

Query Language Issues

Formal Visualization Interactions

• Expressive power

What queries can be expressed in query language?

• Extensibility

Can user-defined types be queried like built-in types?

• Result model

How is query result returned to application level?

• Result management

How much of result can be seen at a time ?

Example: SQL Relational Queries

• Typical query has this form:

select A1, A2, ... , An
from r1, r2, ... , rm
where P

a simple example...

select temp, press
from measures
where temp > 35 and press = 10.5 and region = 'southwest'

(the measures table has the columns region, temp, press, …)

• Predicate can be complex

can include any number of logical connectives

can include sub-queries that perform an initial selection
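The simple select / from / where example can be tried out with Python's built-in `sqlite3` module. This is only a sketch: the in-memory database and the four sample rows are made-up assumptions, while the table name `measures` and its columns come from the slide.

```python
import sqlite3

# In-memory database with a hypothetical "measures" table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measures (region TEXT, temp REAL, press REAL)")
conn.executemany(
    "INSERT INTO measures VALUES (?, ?, ?)",
    [
        ("southwest", 38.0, 10.5),   # satisfies all three conditions
        ("southwest", 30.0, 10.5),   # temp too low
        ("northeast", 40.0, 10.5),   # wrong region
        ("southwest", 41.0, 9.8),    # wrong pressure
    ],
)

# The predicate combines three conditions with the logical connective AND.
rows = conn.execute(
    "SELECT temp, press FROM measures "
    "WHERE temp > 35 AND press = 10.5 AND region = ?",
    ("southwest",),
).fetchall()
conn.close()
```

Only the first sample row passes the predicate, so `rows` holds a single (temp, press) pair. The region is passed as a bound parameter rather than pasted into the query string, which is the usual way to supply values from application code.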

Example: Object Oriented Query Language

• Query Interpretation:

retrieve countries whose names start with "Ge" into an os_list.

os_Set country_extent;
os_list ge_list =
    country_extent->query(
        "Country",
        this->name == "Ge*",
        ...
    );

Particular data structure within database can be queried. Queries can also apply to entire database

Result is a complex data structure

No impedance mismatch between query language and programming language

Trees, Graphs and Networks

Lots of examples

Hierarchical data

File systems are typical hierarchical data

Hierarchical data

• hierarchical data can be represented through

Graphs

Example of a Web Site Hierarchy

Hierarchical data

Relational database model

Website ID Parent ID Child ID Name of the Website

0 NULL 1 Index

0 0 2 Index

0 0 3 Index

0 0 4 Index

0 0 5 Index

0 0 6 Index

0 0 7 Index

0 0 8 Index

0 0 9 Index

1 0 10 About Me

2 0 NULL Resume

3 0 NULL GuestBook

… … … …

10 1 NULL Boot

11 5 NULL 2001

12 5 NULL Dune

13 5 NULL Multiplicity

14 5 18 Star Wars

14 5 19 Star Wars

… … … …

18 14 NULL Books

19 14 NULL Lucas

Hierarchical data

• In most cases data not given in hierarchical form,

but

• stored in multi-dim variables

• Goal: Transform data into hierarchical form

• Algorithm …

Algorithm:

Repeat

(1) Select dim - sequence of dim selection

important,

always select most important dim

(2) Segment attributes into some classes

provided chosen attributes not categorical

until maximum hierarchy level reached
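The repeat loop above can be sketched as a recursive grouping in Python. This is one possible reading of the algorithm, not the lecture's own code: the function name `build_hierarchy`, the `binners` argument (which segments non-categorical attributes into classes) and the sample car records are all assumptions made for illustration.

```python
from typing import Any, Callable, Dict, List, Optional, Sequence, Union

Row = Dict[str, Any]
Tree = Union[List[Row], Dict[Any, "Tree"]]


def build_hierarchy(
    rows: List[Row],
    dims: Sequence[str],
    binners: Optional[Dict[str, Callable[[Any], Any]]] = None,
) -> Tree:
    """Group rows by each dimension in turn; `dims` is ordered by importance."""
    binners = binners or {}
    if not dims:                     # maximum hierarchy level reached
        return rows
    dim, rest = dims[0], dims[1:]    # (1) select the most important dimension
    to_class = binners.get(dim, lambda value: value)  # categorical: keep as is
    groups: Dict[Any, List[Row]] = {}
    for row in rows:                 # (2) segment attribute values into classes
        groups.setdefault(to_class(row[dim]), []).append(row)
    return {key: build_hierarchy(members, rest, binners)
            for key, members in groups.items()}


def mpg_class(mpg: float) -> str:
    """Hypothetical segmentation of a numeric attribute into two classes."""
    return "high" if mpg >= 25 else "low"


cars = [
    {"make": "A", "mpg": 31},
    {"make": "A", "mpg": 18},
    {"make": "B", "mpg": 22},
]
tree = build_hierarchy(cars, ["make", "mpg"], {"mpg": mpg_class})
```

The order of `dims` decides the shape of the tree, which is why the slide stresses that the sequence of dimension selection is important.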

Complex structured data (graph)

• graph G = (V, E) consists of

set V, (vertices / nodes)

set E, (edges)

• Each edge assigned in unique way to

ordered / unordered pair of (not necessarily

different) nodes

• edges connect vertices

• …

Complex structured data (graph)

• directed graph …

• undirected graph …

• Network …

• properties, metrics …

directed graph

• If every e in E assigned to ordered pair of nodes

e = (v, w),

• then graph called directed graph

undirected graph

• If every e in E assigned to unordered pair of

nodes

e = {v, w},

• then graph called undirected graph

network

• Edges may have additional meanings (weights)

• ==> graph often called network

properties, metrics

• can define

cyclic, acyclic, DAG, tree,

various metrics

such as

in-number

out-number
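As a small illustration of graph metrics, the sketch below counts incoming and outgoing edges per node from an edge list. It assumes that "in-number" and "out-number" mean what is usually called in-degree and out-degree; the function name `degrees` and the sample edges are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Tuple

Edge = Tuple[Hashable, Hashable]


def degrees(edges: Iterable[Edge], directed: bool = True):
    """Return (in_degree, out_degree) dictionaries for an edge list.

    Directed: e = (v, w) counts as outgoing for v and incoming for w.
    Undirected: e = {v, w} is counted in both directions, so the two
    dictionaries agree (a self-loop would be counted twice here).
    """
    in_deg: Dict[Hashable, int] = defaultdict(int)
    out_deg: Dict[Hashable, int] = defaultdict(int)
    for v, w in edges:
        out_deg[v] += 1
        in_deg[w] += 1
        if not directed:
            out_deg[w] += 1
            in_deg[v] += 1
    return dict(in_deg), dict(out_deg)


edges = [("a", "b"), ("a", "c"), ("b", "c")]   # hypothetical directed graph
in_deg, out_deg = degrees(edges)
und_in, und_out = degrees(edges, directed=False)
```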

Complex structured data (graph)

Example of

directed graph

(Paper flow in

government system)

Complex structured data (graph)

Example of

undirected graph

(Social network

of 9-11 Terrorists)

Data preprocessing

• Metadata and statistics

• Missing values

• Data Cleansing

• Normalization

• Segmentation

• Sampling and subsetting

• Dimensional reduction

• Aggregation and summarization

• Smoothing and filtering

Metadata and Statistics

• “Data about Data”

• describes content, quality, condition, other characteristics

of data

e.g. min, max, avg, …

• not actual data itself

• may include

Identification (name of dataset, …)

Data Quality (completeness, attribute accuracy,…)

Distribution (formats, media, who holds the data,…)

• Important for correct / useful visualization

Missing Values and Data Cleaning

• Missing and empty values …

• Problem definition …

• Approximation vs interpolation …

• Linear regression …

• Piecewise polynomial (spline) interpolation …

Missing and empty values

• missing value of variable

actual value exists in real world measurement

made, but

not entered into data set

• empty value in variable

no real world value exists

Missing and empty values

• Example: Sandwich Shop

sells sandwich

with

turkey

Swiss / American cheese

to determine customer preferences + control inventory,

keeps records of customer purchases

Data structure contains “gender”, “cheese type”

gender: “M”: Male ; “F”: Female

cheese: “S”: Swiss ; “A”: American

Missing and Empty Values

Suppose during recording of sale,

customer requests sandwich with no cheese

salesperson forgets to enter customer’s gender

transaction generates record with

both “gender” and “cheese” w/ no entry

Missing and empty values

“gender” can be Male or Female (missing value)

“cheese” not measured because no value exists

(empty value)

Problem Definition

• General Problem with missing values twofold

some information content may be missing

Example: Credit application may warn and identify

that certain useful information appears as a result of

certain fields not completed by an applicant

missing value necessary for computation

Example: Age important for estimating reliability

Problem Definition

• Create and insert some replacement value for missing

value

objective is to insert value that neither adds nor

subtracts information from data set

Note that for age this is tricky (older typically increases

reliability) and we might decide not to fill in values

• Solution

Use approximation or interpolation to find

missing values
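One simple way to insert replacement values is linear interpolation between the nearest known neighbours. The sketch below assumes that missing entries are marked with `None`, that `xs` is sorted in ascending order, and that values outside the known range are left missing rather than extrapolated; the function name `fill_missing_linear` is an illustrative choice.

```python
from typing import List, Optional, Sequence


def fill_missing_linear(
    xs: Sequence[float], ys: Sequence[Optional[float]]
) -> List[Optional[float]]:
    """Replace None entries in ys by linear interpolation over xs."""
    known = [(x, y) for x, y in zip(xs, ys) if y is not None]
    filled: List[Optional[float]] = []
    for x, y in zip(xs, ys):
        if y is not None:
            filled.append(y)              # measured value: keep unchanged
            continue
        left = [(kx, ky) for kx, ky in known if kx < x]
        right = [(kx, ky) for kx, ky in known if kx > x]
        if not left or not right:
            filled.append(None)           # no neighbour on one side: leave gap
            continue
        (x0, y0), (x1, y1) = left[-1], right[0]
        filled.append(y0 + (y1 - y0) * (x - x0) / (x1 - x0))
    return filled


result = fill_missing_linear([0, 1, 2, 3, 4], [0.0, None, 4.0, None, None])
```

Whether a filled-in value really "neither adds nor subtracts information" depends on the variable, so a gap that cannot be bracketed by known values is deliberately kept as missing here.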

Problem Definition

• General Problem:

given set of n points {xi, yi} with i = 1, ..., n

find function y = f (x) for which yi = f (xi) for all i = 1, ..., n

There may be

several such functions, or even

no simple ones that can deal with

Problem Definition

• Information carried in

relationship between

values within single variable (its distribution)

and

relationship to other variables

Approximation vs Interpolation

• For approximation want

| f (x) – f (xi) | < ε for small ε > 0

• For interpolation want

| f (x) – f (xi) | = 0

Note that approximation less stringent

Approximation vs Interpolation

• Approximation

Regression (linear, quadratic, …)

• Interpolation

Polynomial (Lagrange basis, Newton form)

Piecewise polynomial (cubic splines, …)

Orthogonal polynomials (Legendre, …)

Trigonometric functions

Approximation

Climate data approximation based on triangulation

(a), (f) temperature

(b) air pressure

(c) humidity

(d) sea surface temperature

(e) vapor pressure

Linear Regression

• Concept

determine the value of one variable given the value of

another variable

• It assumes that the variables’ values change, one with

the other, in some mathematically defined way

• The technique involved simply fits a manifold through the

two-dimensional state space formed by the two variables

• Example

in case of linear regression a straight line is fit through a

two-dimensional point set

Linear Regression

• Linear regression techniques involve discovering the joint

variability of two or more variables

• Linear regression determines which values of the

predicted variable match values of the predictor variable

• Joint variability

measure how one variable varies as another one varies

Linear Regression

• Linear regression tries to discover the parameters of the straight line equation that best fits the data points

• The expression describing a straight line is

y = a x + b

where b is a constant that indicates where the straight

line crosses the y-axis in state space (the y-intercept)

and a represents the slope of the line

Linear Regression

Linear Regression minimizes the least square error

$$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \to \min$$

$y_i$: actual value

$\hat{y}_i$: estimated value

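Linear Regression Solution

Step 1

Determine b

$$b = \frac{n \sum x y - \left(\sum x\right)\left(\sum y\right)}{n \sum x^2 - \left(\sum x\right)^2}$$

Step 2

Determine a once b is known

$$a = \bar{y} - b\,\bar{x}$$

$\bar{x}$ is the mean value of x

$\bar{y}$ is the mean value of y

Note: in these two formulas b is the slope and a is the y-intercept, i.e. the fitted line is y = a + b x; the roles of a and b are swapped relative to the earlier form y = a x + b.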
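The two solution steps translate directly into code. A minimal Python sketch follows, using the naming of the solution slides (b is the slope, a is the intercept, so the fitted line is y = a + b x); the function name `fit_line` and the sample points are assumptions for illustration.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Least-squares line through the points (xs[i], ys[i]); returns (a, b)."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    denom = n * sum_xx - sum_x * sum_x
    if denom == 0:
        raise ValueError("all x values are equal; the slope is undefined")
    b = (n * sum_xy - sum_x * sum_y) / denom   # Step 1: slope
    a = sum_y / n - b * sum_x / n              # Step 2: a = mean(y) - b * mean(x)
    return a, b


# Hypothetical points lying exactly on y = 1 + 2x
a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
```

For points that lie exactly on a line the fit recovers that line; for noisy points it returns the line with the smallest sum of squared differences between actual and estimated values.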

Piecewise Polynomial (Spline) Interpolation

• Set of data points $\{(x_0, y_0), \ldots, (x_n, y_n)\}$

• Passing a single polynomial through many data points can sometimes lead to oscillations in the interpolating polynomial

• The interpolating polynomial linking the data points $(x_k, y_k)$ and $(x_{k+1}, y_{k+1})$ is most often selected to be a cubic

• Cubics are differentiable and provide second order continuity (the derivatives of neighbouring cubics can be matched)

$$s_k(x) = s_{k,0} + s_{k,1}(x - x_k) + s_{k,2}(x - x_k)^2 + s_{k,3}(x - x_k)^3, \quad x \in [x_k, x_{k+1}]$$

subject to the following constraints:

Interpolation of $(x_k, y_k)$: $s_k(x_k) = y_k$

Continuity of interpolant: $s_k(x_{k+1}) = s_{k+1}(x_{k+1})$

Continuity of first derivatives: $s'_k(x_{k+1}) = s'_{k+1}(x_{k+1})$

Continuity of second derivatives: $s''_k(x_{k+1}) = s''_{k+1}(x_{k+1})$
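The constraints above leave two degrees of freedom open. The sketch below closes them with the "natural" boundary condition (second derivative zero at both ends), which is an assumption added here and not stated in the slides. It returns, for each interval k, the four coefficients of the cubic in powers of (x − x_k); the function names are illustrative.

```python
from bisect import bisect_right
from typing import List, Sequence, Tuple

Coeffs = Tuple[float, float, float, float]   # (s_k0, s_k1, s_k2, s_k3)


def natural_cubic_spline(xs: Sequence[float], ys: Sequence[float]) -> List[Coeffs]:
    """Coefficients of a natural cubic spline through (xs[i], ys[i]), xs ascending."""
    n = len(xs) - 1
    if n < 1:
        raise ValueError("need at least two data points")
    h = [xs[i + 1] - xs[i] for i in range(n)]
    # Right-hand side of the tridiagonal system for the quadratic coefficients
    alpha = [0.0] * (n + 1)
    for i in range(1, n):
        alpha[i] = 3.0 * (ys[i + 1] - ys[i]) / h[i] - 3.0 * (ys[i] - ys[i - 1]) / h[i - 1]
    # Forward elimination
    diag, mu, z = [1.0] * (n + 1), [0.0] * (n + 1), [0.0] * (n + 1)
    for i in range(1, n):
        diag[i] = 2.0 * (xs[i + 1] - xs[i - 1]) - h[i - 1] * mu[i - 1]
        mu[i] = h[i] / diag[i]
        z[i] = (alpha[i] - h[i - 1] * z[i - 1]) / diag[i]
    # Back substitution; c[0] = c[n] = 0 is the natural boundary condition
    c = [0.0] * (n + 1)
    b = [0.0] * n
    d = [0.0] * n
    for j in range(n - 1, -1, -1):
        c[j] = z[j] - mu[j] * c[j + 1]
        b[j] = (ys[j + 1] - ys[j]) / h[j] - h[j] * (c[j + 1] + 2.0 * c[j]) / 3.0
        d[j] = (c[j + 1] - c[j]) / (3.0 * h[j])
    return [(float(ys[k]), b[k], c[k], d[k]) for k in range(n)]


def evaluate(xs: Sequence[float], coeffs: List[Coeffs], x: float) -> float:
    """Evaluate the piece whose interval contains x (end pieces used outside)."""
    k = min(max(bisect_right(xs, x) - 1, 0), len(coeffs) - 1)
    s0, s1, s2, s3 = coeffs[k]
    t = x - xs[k]
    return s0 + t * (s1 + t * (s2 + t * s3))


xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 1.0, 8.0, 27.0]          # hypothetical data points
coeffs = natural_cubic_spline(xs, ys)
```

Each piece reproduces its left data point exactly, and neighbouring pieces agree in value, first derivative and second derivative at the shared knot, which is what the four constraints require.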

Normalization

[Figures: normalization example comparing Chicago-County and LA-County]

Segmentation

• Manual/Automatic Segmentation

• Problem Definition

• k-Means

• Linkage-based Methods

• Kernel Density Estimation

Manual/Automatic Segmentation

• Manual Segmentation

based upon

• Attribute values/ranges

• Topological properties

• Automatic Segmentation Algorithms (Clustering

Algorithms)

k-Means,

Kernel Density Estimation, …

Problem Definition

Given:

A data set with N d-dimensional data items

Task:

Determine a (natural) partitioning of the

data set into a number of clusters (k) and

a noise parameter

Problem Definition

• Effective and efficient clustering algorithms for large high-dimensional data sets with high noise level

• Requires Scalability with respect to

the number of data points (N)

the number of dimensions (d)

the noise level

k-Means

• k-Means

Determine k prototypes (p) of a given data set

Assign data points to nearest prototype

Minimize distance criterion:

$$\sum_{i=1}^{k} \sum_{j=1}^{N} d(p_i, x_j^i) \to \min$$

k-Means 2

• Iterative Algorithm

Shift the prototypes towards the mean of their point set

Re-assign the data points to the nearest prototype
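The iteration can be sketched in plain Python as below. It assumes Euclidean distance for d, takes the starting prototypes as an argument instead of choosing them, and stops when the prototypes no longer move; the function name `kmeans` and the sample points are illustrative assumptions.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def kmeans(
    points: Sequence[Point], prototypes: Sequence[Point], max_iter: int = 100
) -> Tuple[List[Point], List[int]]:
    """Alternate assignment and mean-shift steps; returns (prototypes, assignment)."""
    protos: List[Point] = [tuple(p) for p in prototypes]
    assignment: List[int] = []
    for _ in range(max_iter):
        # Assign every data point to its nearest prototype
        assignment = [
            min(range(len(protos)), key=lambda i: math.dist(pt, protos[i]))
            for pt in points
        ]
        # Shift each prototype to the mean of the points assigned to it
        shifted: List[Point] = []
        for i, proto in enumerate(protos):
            members = [pt for pt, a in zip(points, assignment) if a == i]
            if members:
                shifted.append(
                    tuple(sum(m[d] for m in members) / len(members)
                          for d in range(len(proto)))
                )
            else:
                shifted.append(proto)     # empty cluster: keep the old prototype
        if shifted == protos:             # no prototype moved: converged
            break
        protos = shifted
    return protos, assignment


# Two hypothetical, well separated groups of 2D points
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
protos, assignment = kmeans(points, [(0, 0), (10, 10)])
```

The result depends on the starting prototypes; with poorly chosen starts the iteration can settle in a different, worse partition.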

Linkage-based Methods

• Single Linkage: connected components for distance d

Linkage-based Methods

• Method of Wishart

Reduce data

set

Apply Single

Linkage

Kernel Density Estimation

Density Function

Influence Function: Influence of a data point in its neighborhood

Density Function: Sum of the influences of all data points

Kernel Density Estimation

• Influence Function

The influence of a data point y at a point x in the data

space is modeled by a function

e.g.,

y x

Kernel Density Estimation

• Density Function

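The density at a point x in the data space is defined as the sum of the influences of all data points $x_i$, i.e. $f_D(x) = \sum_{i=1}^{N} f(x_i, x)$, where $f(x_i, x)$ denotes the influence of data point $x_i$ at $x$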
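A minimal Python sketch of the two definitions follows. The Gaussian form of the influence function and the bandwidth parameter `sigma` are assumptions chosen for illustration, since the slide's own example formula is not reproduced in this transcript; the function names are likewise hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, ...]


def influence(x: Point, y: Point, sigma: float = 1.0) -> float:
    """Influence of data point y at location x (assumed Gaussian kernel)."""
    return math.exp(-math.dist(x, y) ** 2 / (2.0 * sigma ** 2))


def density(x: Point, data: Sequence[Point], sigma: float = 1.0) -> float:
    """Density at x: the sum of the influences of all data points."""
    return sum(influence(x, xi, sigma) for xi in data)


data = [(0.0, 0.0), (0.0, 0.5), (5.0, 5.0)]   # hypothetical data set
near = density((0.0, 0.2), data)               # close to two of the points
far = density((10.0, 10.0), data)              # far away from all points
```

Locations inside a dense group of points get a high value and remote locations a value close to zero, which is what lets clusters be read off as regions around local maxima of the density.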

Subsetting

Start with a set of data items and generate a subset

of these data items

• Sampling

Random sampling

• Querying

SQL

Sampling

• Motivation: the data set is much larger than is feasible to work on (time- and/or space-wise)

• Example: voters of an election

( too large to study all of them, so use a representative

sample)

• Important:

The subset must be chosen such that it represents some well-defined characteristics of the whole data set, especially those we're interested in

Sampling

• Types of sampling

Non-probabilistic samples

Sample selected on some non-random basis (such as

volunteers, accidental, convenience, self-selected, etc.)

Probabilistic samples

Sample selected on the basis of random selection so

that every element of the data set has an equal chance of being

selected

Sampling

• Types of probabilistic sampling

Simple random sampling

Systematic random sampling

Stratified random sampling

Cluster random sampling

Biased sampling

Simple Random Sampling

A random sampling strategy is the least biased sampling method. Using

this method, the locations were determined by generating a list of random

coordinates and placing the points at those coordinates.

Systematic random sampling

• Elements are numbered 1 to N in some order

• Every k-th element is selected

starting with a randomly chosen

number between 1 and k

where k = N / n
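Systematic random sampling can be sketched as follows. The sketch assumes that n is the desired sample size, that k is rounded down to an integer when N is not a multiple of n, and that the result is cut to exactly n elements; the function name `systematic_sample` is an illustrative choice.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def systematic_sample(
    items: Sequence[T], n: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Pick every k-th element, starting at a random position between 1 and k."""
    rng = rng or random.Random()
    total = len(items)                      # N in the slide's notation
    if not 0 < n <= total:
        raise ValueError("sample size must be between 1 and len(items)")
    k = total // n                          # sampling interval
    start = rng.randint(1, k)               # 1-based start, both ends inclusive
    picked = [items[i - 1] for i in range(start, total + 1, k)]
    return picked[:n]


population = list(range(1, 101))            # elements numbered 1 to N = 100
sample = systematic_sample(population, 10, random.Random(0))
```

Only the starting position is random; once it is fixed, the rest of the sample is determined, so a periodic pattern in the ordering of the elements can bias the result.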

Stratified random sampling

• The data set is divided into non-overlapping subsets

called strata

• Sampling from the strata is simple random

Cluster random sampling

• The sample consists of a selection from randomly chosen groups of

neighbouring elements (clusters)

• Clusters need not necessarily be natural aggregates, but can simply

be artificial

divide the population into

population clusters based on

geographical location (districts,

counties, states, ...)

Biased sampling

Dimensional Reduction

What is the problem?

• Large number of features represent an object

• The data is difficult to visualize, especially when some of

the features are not discriminatory

• Irrelevant features may cause a reduction in the accuracy

of the analysis algorithms

Problem Definition

Concept

• Identify the most important features of an object

to simplify the processing without loss of quality

to directly visualize the two/three most important

features

Problem Definition

• Solution

The simplest approach is to identify important attributes

based on input from domain experts

Another common approach is Principal Component

Analysis (PCA) which defines new attributes (principal

components or PCs) as mutually-orthogonal linear

combinations of the original attributes

Principal Component Analysis

• Goal

to discover the key hidden factors that explain the data

to reduce the dimensionality of the data

• Similar to cluster centroids

[Figures: PCA illustration; Computing the Eigenvalues; The Eigenvalues]

PCA – Dimension Reduction

Data can be projected onto a subspace spanned by the

most important eigenvectors

$$X_{PCA} = C X$$

where the matrix $C_{k \times m}$ contains the k eigenvectors corresponding to the k largest eigenvalues
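For two-dimensional data the projection onto the most important eigenvector (k = 1) can be written out without a linear algebra library, because the eigen-decomposition of a 2 × 2 covariance matrix has a closed form. The sketch below is restricted to that special case by assumption; the function name and the sample points are illustrative.

```python
import math
from typing import List, Sequence, Tuple


def first_principal_component(
    points: Sequence[Tuple[float, float]]
) -> Tuple[Tuple[float, float], float, List[float]]:
    """Return (unit eigenvector, largest eigenvalue, projected 1D scores)."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]
    # Entries of the sample covariance matrix [[sxx, sxy], [sxy, syy]]
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)
    # Closed form for a symmetric 2x2 matrix: angle of the major axis
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    direction = (math.cos(theta), math.sin(theta))
    largest = (sxx + syy) / 2.0 + math.hypot((sxx - syy) / 2.0, sxy)
    # Projection of the centered data onto the eigenvector (one row of C)
    scores = [x * direction[0] + y * direction[1] for x, y in centered]
    return direction, largest, scores


# Hypothetical points lying on the diagonal y = x
direction, largest, scores = first_principal_component(
    [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
)
```

For the diagonal sample data all variance lies along the direction (1, 1) / √2, so a single coordinate per point is enough to reproduce the data. The data is centered before projection here, which is common practice but is not shown in the slide formula.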

PCA – Dimension Reduction

• PCA is the optimal way to project data in the mean-square sense:

the squared error introduced in the projection is minimized over all projections onto a k-dimensional space

• But the eigenvalue decomposition of the data covariance matrix (size $m \times m$ for m-dimensional data) is very expensive to compute
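
The PCA projection described above can be sketched with NumPy. This is an illustration under stated assumptions rather than the lecture's own procedure: the data matrix X is assumed to be m x n with one observation per column, so that the projection matrix C is k x m. The function name `pca_project` and the random data are hypothetical.

```python
import numpy as np

def pca_project(X, k):
    """Project m-dimensional data onto its k leading principal components.

    X is assumed to be m x n: one row per attribute, one column per observation.
    """
    # Center each attribute so the covariance is taken about the mean.
    X_centered = X - X.mean(axis=1, keepdims=True)
    # m x m covariance matrix (np.cov treats rows as variables by default).
    cov = np.cov(X_centered)
    # eigh is meant for symmetric matrices; eigenvalues come back ascending.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    order = np.argsort(eigenvalues)[::-1][:k]   # indices of the k largest
    C = eigenvectors[:, order].T                # k x m, eigenvectors as rows
    return C @ X_centered, C, eigenvalues[order]

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 200))    # hypothetical data: m = 5, n = 200
X_pca, C, top = pca_project(X, k=2)
print(X_pca.shape)  # (2, 200)
```

The rows of C are mutually orthogonal, and the covariance of the projected data is diagonal with the selected eigenvalues on the diagonal.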

Page 73: Haim Lecture2 Data And Data Modeling 2ppg

73

SVD – Dimension Reduction

• Singular value decomposition:

$X = U S V^T$

where the orthogonal matrices U and V contain the

left and right singular vectors of X, and the diagonal

matrix S contains the singular values of X

SVD – Dimension Reduction

Data can be projected onto a subspace spanned by the left singular vectors corresponding to the k largest singular values,

using the $k \times m$ matrix $U_k$ that contains these k

singular vectors
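
A minimal NumPy sketch of the SVD-based projection follows. The projection formula did not survive in the transcript, so the form `U_k @ X` is an assumption made by analogy with the PCA projection. X is assumed to be m x n with observations as columns; the function name `svd_project` and the random data are hypothetical.

```python
import numpy as np

def svd_project(X, k):
    """Project the columns of X (m x n) onto the k leading left singular vectors."""
    # Thin SVD: X = U @ diag(s) @ Vt, with s sorted in descending order.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    U_k = U[:, :k].T          # k x m, singular vectors stored as rows
    return U_k @ X, U_k, s

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 50))               # hypothetical data: m = 4, n = 50
X = X - X.mean(axis=1, keepdims=True)      # center so the result matches PCA
X_svd, U_k, s = svd_project(X, k=2)
print(X_svd.shape)  # (2, 50)
```

For centered data, the squared singular values divided by n - 1 equal the eigenvalues of the covariance matrix, so this route gives the same subspace as PCA without forming the m x m covariance matrix explicitly.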

Page 74: Haim Lecture2 Data And Data Modeling 2ppg

74

3.2.7 Aggregation / Summarization

• Aggregation Functions

count the items in a data set

• For example, the count of the items in (1, 3, 6, 4) is 4

sum the items in a list

• For example, the sum of the list (1, 3, 6, 4) is 14

average (avg) of all items in a data set

• For example, the avg of the items in (1, 3, 6, 4) is 3.5
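
The three aggregation functions above can be checked directly in Python, using the same list as the slide. This is only a sketch; a database would typically expose these as query-language aggregates.

```python
from statistics import mean

items = (1, 3, 6, 4)

print(len(items))   # count -> 4
print(sum(items))   # sum   -> 14
print(mean(items))  # avg   -> 3.5
```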

Effect of Display Resolution

• Data sets are large (large result sets)

millions of data points/results for a query

• Limitation of screen resolution

1-3 megapixels on average currently

• This forces coding in space, sound, time, and

creativity

adaptive system and user interfaces

abstracts and level-of-detail
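
One simple form of level-of-detail is to aggregate a long series down to one summary per pixel column before drawing it. The sketch below is an illustration only: the function name `downsample_to_width` and the choice of a (min, max) pair per column are assumptions, not something the lecture prescribes.

```python
def downsample_to_width(values, width):
    """Reduce a series to at most `width` (min, max) pairs, one per pixel column."""
    n = len(values)
    if n == 0 or width <= 0:
        return []
    columns = min(width, n)   # never more columns than data points
    summary = []
    for px in range(columns):
        # Integer arithmetic splits the series into contiguous, non-empty chunks.
        start = px * n // columns
        end = (px + 1) * n // columns
        chunk = values[start:end]
        summary.append((min(chunk), max(chunk)))
    return summary

# Hypothetical query result: one million points drawn on a 1000-pixel-wide plot.
values = list(range(1_000_000))
cols = downsample_to_width(values, 1000)
print(len(cols), cols[0], cols[-1])  # 1000 (0, 999) (999000, 999999)
```

Keeping both the minimum and the maximum of each chunk preserves spikes that a plain average would hide.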

Page 75: Haim Lecture2 Data And Data Modeling 2ppg

75

Simple Visualizations

Simple

• Tables

• 2D and 3D Scatterplots

• Statistical Charts

• Line and Multi-line Graphs

• Polar Chart

• Images

Complex

• Matrix of Scatterplots

• Heatmaps

• Surfaces

• Volumes

Interactions

• Select

• Probe

• Query

• Analyze

• Explore

The Base Visualization Techniques

Page 76: Haim Lecture2 Data And Data Modeling 2ppg

76

Visualization Operations (queries)

• Data selection operations

sampling, association, etc.

• Data manipulation operations

functions, filters, interpolations, etc.

• Representation operations

attribute mappings, color maps, etc.

• Image orientation / viewing operations

pan, zoom, rotate, lighting, etc.

• Visualization interactions

direct manipulation, data selection, etc.

Page 77: Haim Lecture2 Data And Data Modeling 2ppg

77

Univariate

Bivariate (or scatterplots)

Page 78: Haim Lecture2 Data And Data Modeling 2ppg

78

Scatterplot Matrix

Trivariate

Page 79: Haim Lecture2 Data And Data Modeling 2ppg

79

Alternatives

Page 80: Haim Lecture2 Data And Data Modeling 2ppg

80

10th Century Timeline

Lines and curves

The same curve under different scalings …

Page 81: Haim Lecture2 Data And Data Modeling 2ppg

81

[Figure: "Trends in fruit sales" chart; series: bananas, apples, pears; horizontal axis: year (83 to 98); vertical axis: 0 to 14]

Excel Graphs and Charts

Page 82: Haim Lecture2 Data And Data Modeling 2ppg

82

Images, Surfaces and Volumes

ExVis MRI

1988

Page 83: Haim Lecture2 Data And Data Modeling 2ppg

83

Electron density of C-60

Page 84: Haim Lecture2 Data And Data Modeling 2ppg

84

HIV Reverse Transcriptase Inhibitor

[Figure: Van der Waals surface colored by electrostatic potential (ESP); color scale from -0.05 to 0.25]

Page 85: Haim Lecture2 Data And Data Modeling 2ppg

85

Maps

Maps

Page 86: Haim Lecture2 Data And Data Modeling 2ppg

86

Surfaces

Volumes

Page 87: Haim Lecture2 Data And Data Modeling 2ppg

87

Surfaces and Volumes

Graphs and Networks

Page 88: Haim Lecture2 Data And Data Modeling 2ppg

88

Graph Layout Optimization

[Figure: "Taurine and hypotaurine metabolism" pathway graph, shown before and after layout optimization. Nodes include metabolites (3-Sulfino-L-alanine, Hypotaurine, Cysteamine, Taurine, L-Cysteate, Taurocholate, Acetate, Sulfoacetaldehyde, Isethionate, 5-Glutamyl-taurine, Taurocyamine, Taurocyamine phosphate, L-Cysteine), enzyme numbers (e.g. 1.13.11.19, 1.8.1.3, 4.1.1.29), links to neighboring pathways (cysteine, cyanoamino acid, and glutathione metabolism), and excretion.]

Query-based Layouts

Page 89: Haim Lecture2 Data And Data Modeling 2ppg

89

adding simple interaction

to static visualization

is surprisingly powerful

Starfield (Early Spotfire)

Page 90: Haim Lecture2 Data And Data Modeling 2ppg

90

Starfield (Early Spotfire) 2

add interaction to everything?

Page 91: Haim Lecture2 Data And Data Modeling 2ppg

91

Page 92: Haim Lecture2 Data And Data Modeling 2ppg

92

Trains from Paris to Lyon

Page 93: Haim Lecture2 Data And Data Modeling 2ppg

93

Trains from Paris to Lyon

Trains from Paris to Lyon

Page 94: Haim Lecture2 Data And Data Modeling 2ppg

94

[Figure: hotels by region, collapsed view]

region                        hotels
UK                            3444
  Channel Islands             33
  Midlands                    811
  North England               639
  Northern Ireland            10
  Scotland                    531
  South England               1200
  Wales                       220

[Figure: hotels by region, with North England expanded into tourist board areas]

region                        hotels
UK                            3444
  Channel Islands             33
  Midlands                    811
  North England               639
    Cumbria                   89
    Isle of Man               7
    North West                209
    Northumbria               111
    Yorkshire & Humberside    223
  Northern Ireland            10
  Scotland                    531
  South England               1200
  Wales                       220
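
The hotel counts in the drill-down figure above form a hierarchy in which each region's count is the sum of its tourist board areas. A minimal sketch of that roll-up and of the expand interaction follows. The counts are copied from the slide; the nested-dictionary layout and the function names `total` and `view` are assumptions made for illustration.

```python
# Hotel counts copied from the slide; the nested layout is an assumption.
hotels = {
    "Channel Islands": 33,
    "Midlands": 811,
    "North England": {
        "Cumbria": 89,
        "Isle of Man": 7,
        "North West": 209,
        "Northumbria": 111,
        "Yorkshire & Humberside": 223,
    },
    "Northern Ireland": 10,
    "Scotland": 531,
    "South England": 1200,
    "Wales": 220,
}

def total(node):
    """Sum the leaf counts: a dict is an expandable region, an int is a leaf."""
    if isinstance(node, dict):
        return sum(total(child) for child in node.values())
    return node

def view(tree, expanded=()):
    """Rows (depth, name, count) for a view in which only `expanded` regions are open."""
    rows = []
    for name, node in tree.items():
        rows.append((0, name, total(node)))
        if isinstance(node, dict) and name in expanded:
            rows.extend((1, child, total(sub)) for child, sub in node.items())
    return rows

print(total(hotels))                   # 3444
print(total(hotels["North England"]))  # 639
for depth, name, count in view(hotels, expanded={"North England"}):
    print("  " * depth + name, count)
```

Calling `view(hotels)` with nothing expanded gives the collapsed view; passing a region name opens just that branch.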

Page 95: Haim Lecture2 Data And Data Modeling 2ppg

95

interaction is the key

add interaction to everything.


Page 96: Haim Lecture2 Data And Data Modeling 2ppg

96

Visualizations

• Which visualization to use?

requires a taxonomy and more understanding: plots, surfaces, volumes, iconographic displays, ...

research issue - domain dependent

• How best to support the human data explorer?

provide user interactions (selections, navigation)

• What can be automated?

data mining, KDD, user advice, ...

