Date post: 21-May-2020
Measurements and Data Sargur Srihari University at Buffalo The State University of New York
Page 1: Measurements and Data - Welcome to CEDARsrihari/CSE626/Lecture... · Measurements and Data Sargur Srihari University at Buffalo The State University of New York . Topics ... two weights

Measurements and Data

Sargur Srihari

University at Buffalo The State University of New York


Topics

•  Types of Data
•  Distance Measurement
•  Data Transformation
•  Forms of Data
•  Data Quality


Importance of Measurement

•  Aim of mining structured data is to discover relationships that exist in the real world
   –  business, physical, conceptual
•  Instead of looking at the real world we look at data describing it
•  Data maps entities in the domain of interest to a symbolic representation by means of a measurement procedure
•  Numerical relationships between variables capture relationships between objects
•  Measurement process is crucial


Types of Measurement

•  Ordinal
   –  e.g., excellent=5, very good=4, good=3, …
•  Nominal
   –  e.g., color, religion, profession
   –  Need non-metric methods
•  Ratio
   –  e.g., weight
   –  has concatenation property: two weights add to balance a third (2 + 3 = 5)
   –  changing scale (multiplying by a constant) does not change ratios
•  Interval
   –  e.g., temperature, calendar time
   –  Unit of measurement is arbitrary, as is the origin
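The difference between ratio and interval scales can be checked numerically. The sketch below is only an illustration: the conversion helpers and the sample values are assumptions chosen for the example, not part of the slides. Changing units on a ratio scale (kg to lb) preserves ratios of values, while on an interval scale (Celsius to Fahrenheit) only ratios of differences survive.

```python
def kg_to_lb(kg):
    """Ratio scale: a change of unit is multiplication by a constant."""
    return kg * 2.20462


def celsius_to_fahrenheit(c):
    """Interval scale: a change of unit rescales and also shifts the origin."""
    return c * 9.0 / 5.0 + 32.0


# Ratio of two weights is the same in either unit.
ratio_kg = 10.0 / 5.0
ratio_lb = kg_to_lb(10.0) / kg_to_lb(5.0)

# Ratio of two temperatures is not preserved by the change of unit...
ratio_c = 20.0 / 10.0
ratio_f = celsius_to_fahrenheit(20.0) / celsius_to_fahrenheit(10.0)

# ...but the ratio of two temperature differences is.
diff_ratio_c = (30.0 - 20.0) / (20.0 - 10.0)
diff_ratio_f = (
    (celsius_to_fahrenheit(30.0) - celsius_to_fahrenheit(20.0))
    / (celsius_to_fahrenheit(20.0) - celsius_to_fahrenheit(10.0))
)
```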


Operational Measurement

•  Measuring Programming Effort (Halstead 1977)

   Programming effort e = a m (n + m) log(a + b) / (2b), where
   a = number of unique operators
   b = number of unique operands
   n = total number of operator occurrences
   m = total number of operand occurrences

•  Defines programming effort as well as a way of measuring it
•  Operational measurements are concerned with prediction, whereas non-operational measurements are concerned with description
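As a small sketch, the effort formula can be evaluated directly. The function name is hypothetical, and the slide does not state the base of the logarithm; base 2 is assumed here.

```python
import math


def halstead_effort(a, b, n, m):
    """Programming effort e = a*m*(n+m)*log(a+b) / (2*b).

    a: unique operators, b: unique operands,
    n: total operator occurrences, m: total operand occurrences.
    The log base is an assumption (base 2); the slide only writes "log".
    """
    if a <= 0 or b <= 0:
        raise ValueError("a and b must be positive")
    return a * m * (n + m) * math.log2(a + b) / (2 * b)


# Hypothetical counts for a tiny program.
effort = halstead_effort(a=4, b=2, n=6, m=2)
```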


Distance and Similarity

•  Many data mining techniques are based on similarity measures between objects
   –  nearest-neighbor classification
   –  cluster analysis
   –  multi-dimensional scaling
•  s(i,j): similarity, d(i,j): dissimilarity
•  Possible transformations:
   d(i,j) = 1 – s(i,j)  or  d(i,j) = sqrt(2*(1 – s(i,j)))
•  Proximity is a general term covering both similarity and dissimilarity
•  Distance is used to indicate dissimilarity
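The two transformations can be sketched as follows. The function names are hypothetical, and the code assumes similarities are scaled to the range [0, 1], which the slide does not state explicitly.

```python
import math


def dissimilarity_linear(s):
    """d = 1 - s, assuming the similarity s lies in [0, 1]."""
    if not 0.0 <= s <= 1.0:
        raise ValueError("similarity is assumed to lie in [0, 1]")
    return 1.0 - s


def dissimilarity_sqrt(s):
    """d = sqrt(2 * (1 - s)), assuming the similarity s lies in [0, 1]."""
    if not 0.0 <= s <= 1.0:
        raise ValueError("similarity is assumed to lie in [0, 1]")
    return math.sqrt(2.0 * (1.0 - s))
```

Both transformations map a perfect similarity of 1 to a dissimilarity of 0 and decrease as similarity grows.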


Metric Properties

A metric is a dissimilarity (distance) measure that satisfies:

1.  d(i,j) ≥ 0, with d(i,j) = 0 if and only if i = j    Positivity
2.  d(i,j) = d(j,i)    Commutativity (symmetry)
3.  d(i,j) ≤ d(i,k) + d(k,j)    Triangle Inequality

[Figure: points i, j and k illustrating the triangle inequality]
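One way to make the three properties concrete is to test them on a small matrix of pairwise dissimilarities. This is a minimal sketch: the function name and the example matrices are hypothetical, and a tolerance is used because floating-point distances are rarely exact.

```python
from itertools import product


def is_metric(d, tol=1e-12):
    """Check positivity, symmetry and the triangle inequality for a
    square matrix d of pairwise dissimilarities (list of lists)."""
    n = len(d)
    for i, j in product(range(n), repeat=2):
        if d[i][j] < -tol:
            return False  # negative dissimilarity
        if i == j and abs(d[i][j]) > tol:
            return False  # d(i,i) must be 0
        if i != j and d[i][j] <= tol:
            return False  # distinct objects must be at positive distance
        if abs(d[i][j] - d[j][i]) > tol:
            return False  # symmetry violated
    for i, j, k in product(range(n), repeat=3):
        if d[i][j] > d[i][k] + d[k][j] + tol:
            return False  # triangle inequality violated
    return True


good = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
bad = [[0, 1, 5], [1, 0, 1], [5, 1, 0]]  # 5 > 1 + 1
```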


Examples of Metrics

•  Euclidean Distance dE
   –  Standardized (divide by standard deviation)
   –  Weighted dWE
•  Minkowski measure
   –  Manhattan Distance
•  Mahalanobis Distance dM
   –  Use of Covariance
•  Binary data Distances


Euclidean Distance between Vectors

•  Euclidean distance assumes variables are commensurate
   –  e.g., each variable is a measure of length
•  If one variable were weight and another length, there would be no obvious choice of units
•  Altering the units would change which variables are important
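The effect of units can be illustrated with a small sketch. The objects and their values are hypothetical; each is described by a length and a weight, and only the unit of length changes between the two runs.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def nearest(query, candidates):
    """Name of the candidate closest to the query under Euclidean distance."""
    return min(candidates, key=lambda name: euclidean(query, candidates[name]))


# Hypothetical objects described by (length, weight in kg).
in_metres = {"q": (2.0, 51.0), "r": (1.1, 60.0)}
in_centimetres = {"q": (200.0, 51.0), "r": (110.0, 60.0)}

nearest_m = nearest((1.0, 50.0), in_metres)           # length in metres
nearest_cm = nearest((100.0, 50.0), in_centimetres)   # length in centimetres
```

With length in metres the weight difference dominates and `q` is the nearest neighbor; with length in centimetres the length difference dominates and `r` becomes nearest, even though the objects are unchanged.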

[Figure: two vectors x and y with components (x1, x2) and (y1, y2)]


Standardizing the Data when variables are not commensurate

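•  Divide each variable by its standard deviation
   –  Standard deviation for the kth variable is

      $\sigma_k = \left( \frac{1}{n} \sum_{i=1}^{n} \left( x_k(i) - \mu_k \right)^2 \right)^{1/2}$

      where $x_k(i)$ is the value of the kth variable for the ith object and $\mu_k = \frac{1}{n} \sum_{i=1}^{n} x_k(i)$ is its sample mean

•  Updated value that removes the effect of scale:

      $x'_k = x_k / \sigma_k$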
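Standardization can be sketched in a few lines of plain Python. The function names and the sample data are hypothetical, and the standard deviation divides by n (not n - 1), which is an assumption chosen to match the 1/n convention used for covariance later in the slides.

```python
import math


def column_std(data, k):
    """Standard deviation of column k of a list-of-rows data matrix (divides by n)."""
    n = len(data)
    mu = sum(row[k] for row in data) / n
    return math.sqrt(sum((row[k] - mu) ** 2 for row in data) / n)


def standardize(data):
    """Divide every variable (column) by its standard deviation."""
    d = len(data[0])
    sigmas = [column_std(data, k) for k in range(d)]
    if any(s == 0.0 for s in sigmas):
        raise ValueError("a constant column cannot be standardized")
    return [[row[k] / sigmas[k] for k in range(d)] for row in data]


# The second column is the first one expressed in a unit 100 times smaller.
data = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
scaled = standardize(data)
```

After standardization both columns carry the same values, so the choice of unit no longer affects distances.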


Weighted Euclidean Distance

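•  If we know the relative importance of the variables, we can weight each one by $w_k$:

   $d_{WE}(i,j) = \left( \sum_{k=1}^{d} w_k \left( x_k(i) - x_k(j) \right)^2 \right)^{1/2}$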
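A weighted Euclidean distance can be sketched as below, assuming the usual form in which each squared difference is multiplied by a non-negative weight before summing. The function name and the weights are hypothetical.

```python
import math


def weighted_euclidean(x, y, w):
    """sqrt(sum_k w_k * (x_k - y_k)^2) for equal-length vectors x, y and weights w."""
    if not (len(x) == len(y) == len(w)):
        raise ValueError("x, y and w must have the same length")
    if any(wk < 0 for wk in w):
        raise ValueError("weights must be non-negative")
    return math.sqrt(sum(wk * (a - b) ** 2 for a, b, wk in zip(x, y, w)))


unweighted = weighted_euclidean((0, 0), (3, 4), (1, 1))   # plain Euclidean distance
first_only = weighted_euclidean((0, 0), (3, 4), (4, 0))   # second variable ignored
```

Setting all weights to 1 recovers the ordinary Euclidean distance; a weight of 0 removes a variable from the comparison.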


Use of Covariance in Distance

•  Similarities between cups
•  Suppose we measure cup height 100 times and diameter only once
   –  height will dominate although 99 of the height measurements are not contributing anything
•  They are very highly correlated
•  To eliminate redundancy we need a data-driven method
   –  approach is not only to standardize data in each direction but also to use the covariance between variables


Covariance between two Scalar Variables

•  A scalar value to measure how x and y vary together
•  Obtained by
   –  multiplying, for each sample, its mean-centered value of x by its mean-centered value of y
   –  and then adding over all samples
•  Large positive value
   –  if large values of x tend to be associated with large values of y and small values of x with small values of y
•  Large negative value
   –  if large values of x tend to be associated with small values of y
•  With d variables we can construct a d x d matrix of covariances
   –  Such a covariance matrix is symmetric

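$\mathrm{Cov}(x,y) = \frac{1}{n} \sum_{i=1}^{n} \left( x(i) - \bar{x} \right) \left( y(i) - \bar{y} \right)$

where $\bar{x}$ and $\bar{y}$ are the sample means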
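The covariance formula translates directly into code. This is a minimal sketch with a hypothetical function name and made-up samples; it divides by n, as in the formula above.

```python
def covariance(xs, ys):
    """Cov(x, y) = (1/n) * sum_i (x(i) - mean_x) * (y(i) - mean_y)."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


positive = covariance([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])   # y grows with x
negative = covariance([1.0, 2.0, 3.0], [6.0, 4.0, 2.0])   # y shrinks as x grows
```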


For Vectors: Covariance Matrix and Data Matrix

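•  Let X = n x d data matrix
•  Rows of X are the data vectors x(i)
•  Definition of covariance between variables k and l:
   $\mathrm{Cov}(k,l) = \frac{1}{n} \sum_{i=1}^{n} \left( x_k(i) - \bar{x}_k \right) \left( x_l(i) - \bar{x}_l \right)$
•  If values of X are mean-centered
   –  i.e., the value of each variable is relative to the sample mean of that variable
   –  then V = X^T X is, up to the factor 1/n, the d x d covariance matrix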
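A sketch of the construction in plain Python, with hypothetical function names and data. It mean-centers the columns of X and then forms X^T X; the division by n is included explicitly so that the entries agree with the 1/n covariance definition.

```python
def mean_center(X):
    """Subtract each column's sample mean from a list-of-rows data matrix."""
    n, d = len(X), len(X[0])
    means = [sum(row[k] for row in X) / n for k in range(d)]
    return [[row[k] - means[k] for k in range(d)] for row in X]


def covariance_matrix(X):
    """d x d matrix (1/n) * Xc^T Xc, where Xc is the mean-centered data matrix."""
    n, d = len(X), len(X[0])
    Xc = mean_center(X)
    return [
        [sum(Xc[i][k] * Xc[i][l] for i in range(n)) / n for l in range(d)]
        for k in range(d)
    ]


X = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]   # n = 3 objects, d = 2 variables
V = covariance_matrix(X)
```

The diagonal of V holds the variances of the individual variables and the off-diagonal entries hold the covariances, so V is symmetric.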


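Correlation Coefficient

•  Value of covariance depends on the ranges of x and y
•  Dependency is removed by dividing values of x by their standard deviation and values of y by their standard deviation:

   $\rho(x,y) = \frac{\sum_{i=1}^{n} (x(i) - \bar{x})(y(i) - \bar{y})}{\left( \sum_{i=1}^{n} (x(i) - \bar{x})^2 \, \sum_{i=1}^{n} (y(i) - \bar{y})^2 \right)^{1/2}}$

•  With d variables, we can form a d x d correlation matrix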
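The scale-independence of the correlation coefficient can be checked with a short sketch. The function name and data are hypothetical; the computation divides the sum of cross-products by the square root of the product of the two sums of squares, which is equivalent to dividing the covariance by both standard deviations.

```python
import math


def correlation(xs, ys):
    """Correlation coefficient of two equal-length samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0.0 or syy == 0.0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / math.sqrt(sxx * syy)


xs = [1.0, 2.0, 4.0, 7.0]
ys = [2.0, 3.0, 3.5, 9.0]
rho = correlation(xs, ys)
rho_rescaled = correlation([100.0 * x for x in xs], ys)   # change the unit of x
```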


Correlation Matrix

Housing-related variables across city suburbs (d = 11)

•  Shown as an 11 x 11 pixel image (white = 1, black = -1)
•  Columns 12-14 have values -1, 0, +1 as a pixel-intensity reference
•  Remaining columns represent the correlation matrix

•  Variables 3 and 4 are highly negatively correlated with Variable 2
•  Variable 5 is positively correlated with Variable 11
•  Variables 8 and 9 are highly correlated


Incorporating Covariance Matrix in Distance

Mahalanobis distance between samples x(i) and x(j) is:

$d_M(x(i), x(j)) = \left( (x(i) - x(j))^T \, \Sigma^{-1} \, (x(i) - x(j)) \right)^{1/2}$

•  T is transpose
•  Σ is the d x d covariance matrix
•  Σ^{-1} standardizes data relative to Σ
•  Matrix multiplication of the 1 x d, d x d and d x 1 factors yields a scalar value
•  dM discounts the effect of several highly correlated variables
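The distance can be sketched without any external library by solving Σz = (x - y) instead of inverting Σ explicitly. Everything below (function names, the covariance matrices, the test points) is hypothetical and intended only to illustrate the formula.

```python
import math


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(A[r]) + [b[r]] for r in range(n)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):                # back substitution
        s = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - s) / M[r][r]
    return z


def mahalanobis(x, y, cov):
    """sqrt((x - y)^T cov^{-1} (x - y)) for vectors x, y and covariance matrix cov."""
    diff = [a - b for a, b in zip(x, y)]
    z = solve(cov, diff)                          # z = cov^{-1} (x - y)
    return math.sqrt(sum(d * zi for d, zi in zip(diff, z)))


identity = [[1.0, 0.0], [0.0, 1.0]]
correlated = [[1.0, 0.9], [0.9, 1.0]]             # two highly correlated variables
d_plain = mahalanobis((0.0, 0.0), (1.0, 1.0), identity)
d_corr = mahalanobis((0.0, 0.0), (1.0, 1.0), correlated)
```

With the identity covariance the result equals the Euclidean distance; with strongly correlated variables a joint change in both counts for less, which is the discounting effect described above.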


Generalizing Euclidean Distance

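Minkowski or Lλ metric:

$d(i,j) = \left( \sum_{k=1}^{d} \left| x_k(i) - x_k(j) \right|^{\lambda} \right)^{1/\lambda}$, with λ ≥ 1

•  λ = 2 gives the Euclidean metric
•  λ = 1 gives the Manhattan or City-block metric: $\sum_{k=1}^{d} \left| x_k(i) - x_k(j) \right|$
•  λ = ∞ yields $\max_{k} \left| x_k(i) - x_k(j) \right|$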
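The three special cases can be compared with one small function. This is a sketch with a hypothetical function name; it treats λ = ∞ as the largest absolute coordinate difference.

```python
import math


def minkowski(x, y, lam):
    """Minkowski (L-lambda) distance; lam = math.inf gives the maximum difference."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if lam == math.inf:
        return max(diffs)
    if lam < 1:
        raise ValueError("lambda must be at least 1")
    return sum(d ** lam for d in diffs) ** (1.0 / lam)


manhattan = minkowski((0, 0), (3, 4), 1)          # city-block distance
euclid = minkowski((0, 0), (3, 4), 2)             # Euclidean distance
maximum = minkowski((0, 0), (3, 4), math.inf)     # largest coordinate difference
```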


Distance Measures for Binary Data

•  Most obvious measure is Hamming Distance normalized by number of bits
   –  its complement is the proportion of variables on which the objects have the same value
•  If we don’t care about irrelevant properties had by neither object we have the Jaccard Coefficient
   –  Example: two documents do not have certain terms
•  Dice Coefficient extends this argument
   –  If 00 matches are irrelevant then 10 and 01 mismatches should have half relevance
•  Generalization to discrete (non-binary) values
   –  Score 1 if two objects agree and 0 otherwise
•  Adaptation to mixed data types
   –  Use additive distance measures
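These binary measures can be sketched from the four match counts. The names S11, S10, S01 and S00 (counts of positions where the two vectors hold 1/1, 1/0, 0/1 and 0/0) and the function names are notation assumed for this example; the formulas are the commonly used definitions of each coefficient.

```python
def match_counts(x, y):
    """Return (s11, s10, s01, s00) for two equal-length 0/1 vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    s11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    s10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    s01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    s00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    return s11, s10, s01, s00


def normalized_hamming(x, y):
    """Proportion of positions on which the two vectors differ."""
    s11, s10, s01, s00 = match_counts(x, y)
    return (s10 + s01) / (s11 + s10 + s01 + s00)


def jaccard(x, y):
    """1-1 matches over all positions except 0-0 matches."""
    s11, s10, s01, _ = match_counts(x, y)
    if s11 + s10 + s01 == 0:
        raise ValueError("Jaccard is undefined when both vectors are all zeros")
    return s11 / (s11 + s10 + s01)


def dice(x, y):
    """Like Jaccard, but mismatches carry half the weight of 1-1 matches."""
    s11, s10, s01, _ = match_counts(x, y)
    if s11 + s10 + s01 == 0:
        raise ValueError("Dice is undefined when both vectors are all zeros")
    return 2 * s11 / (2 * s11 + s10 + s01)


x = [1, 1, 0, 0, 0, 0]
y = [1, 0, 1, 0, 0, 0]
```

For these two vectors the three shared zeros make them look fairly similar under Hamming distance, while Jaccard and Dice ignore them and report a lower similarity.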


Some Similarity/Dissimilarity Measures for N-dim Binary Vectors


Weighted Dissimilarity Measures for Binary Vectors

•  Unequal importance given to ‘0’ matches and ‘1’ matches
•  Multiply S00 (the number of 0-0 matches) by a weight β ∈ [0,1]
•  Examples:


Transforming the Data

Model depends on form of data

If Y is a function of X² then we could fit a quadratic function, or choose U = X² and use a linear fit
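The substitution U = X² can be sketched with an ordinary least-squares line fit written in plain Python. The function name and the data are hypothetical; the data are generated from y = 3x² + 1 so the recovered coefficients can be checked.

```python
def linear_fit(us, ys):
    """Least-squares slope and intercept for y = slope * u + intercept."""
    n = len(us)
    mean_u = sum(us) / n
    mean_y = sum(ys) / n
    suu = sum((u - mean_u) ** 2 for u in us)
    suy = sum((u - mean_u) * (y - mean_y) for u, y in zip(us, ys))
    if suu == 0.0:
        raise ValueError("all u values are identical")
    slope = suy / suu
    return slope, mean_y - slope * mean_u


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [3.0 * x ** 2 + 1.0 for x in xs]   # Y depends on X squared
us = [x ** 2 for x in xs]               # transformed variable U = X^2
slope, intercept = linear_fit(us, ys)   # Y is linear in U
```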


V1 is non-linearly related to V2

V3 = 1/V2 is linearly related to V1

[Figure: plot of V1 against V2]


In the original data the variance increases, whereas regression assumes the variance is constant

A square root transformation keeps the variance approximately constant


Forms of Data

•  Standard Data (Data Matrix)
•  Multirelational Data
•  String
•  Event Sequence
•  Hierarchical Data


Data Matrix

•  Simplest form of data
•  A set of d measurements on objects o(1)…o(n)
   –  n rows and d columns
•  Also called standard data, data matrix or table


Multirelational Data (multiple data matrices)

Payroll Database: Name | Department Name | Age | Salary

Department Table: Department Name | Budget | Manager

•  Can be combined to form a single data matrix with fields name, department-name, age, salary, budget, manager
•  Or create as many rows as department-names
•  Flattening requires needless replication (storage issues)
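The replication caused by flattening can be seen in a small sketch. All records, field names and the `flatten` helper are hypothetical; the two lists stand in for the payroll and department tables.

```python
payroll = [
    {"name": "Ann", "department_name": "Sales", "age": 34, "salary": 52000},
    {"name": "Bob", "department_name": "Sales", "age": 45, "salary": 61000},
    {"name": "Cy", "department_name": "Research", "age": 29, "salary": 58000},
]

departments = [
    {"department_name": "Sales", "budget": 900000, "manager": "Dee"},
    {"department_name": "Research", "budget": 400000, "manager": "Eve"},
]


def flatten(payroll_rows, department_rows):
    """Join the two tables on department_name into one row per employee."""
    by_name = {d["department_name"]: d for d in department_rows}
    flat = []
    for person in payroll_rows:
        dept = by_name[person["department_name"]]
        flat.append({**person, "budget": dept["budget"], "manager": dept["manager"]})
    return flat


flat = flatten(payroll, departments)
```

Every employee in the same department now carries an identical copy of that department's budget and manager, which is the needless replication the slide warns about.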


String Data

•  Sequence of symbols from a finite alphabet
   –  Standard matrix form is unsuitable
•  Sequence of values from a categorical variable
   –  Standard English text (alphanumeric characters, spaces, punctuation marks)
   –  Protein and DNA/RNA sequences (A,C,G,T)


Event Sequence Data

•  Sequence of pairs of the form {event, occurrence time}
•  A string where each sequence item is tagged with an occurrence time
   –  Telecommunication alarm log
   –  Transaction data (retail or financial records)
   –  Can occur asynchronously


Data Quality


Data Quality for Individual Measurements

•  Data mining depends on the quality of data
•  Many interesting patterns discovered may result from measurement inaccuracies
•  Sources of error
   –  Errors in measurement
   –  Carelessness
   –  Instrumentation failure
   –  Inadequate definition of what we are measuring


Precision and Accuracy

•  Precise measurement
   –  Small variability (measured by variance)
   –  Repeated measurements yield the same value
   –  Many digits of precision are not necessarily accurate (results of calculations give many digits)
•  Accurate measurement
   –  Not only small variability but also close to the true value
•  Precise measurement of height with shoes will not give an accurate measurement
•  Difference between the mean of repeated measurements and the true value is the “bias”
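The height-with-shoes example can be put into numbers. The function name, the measurements and the true height below are all hypothetical; the sketch computes the variance (precision) and the bias (mean minus true value) of repeated measurements.

```python
def bias_and_variance(measurements, true_value):
    """Return (bias, variance) of repeated measurements of one quantity."""
    n = len(measurements)
    mean = sum(measurements) / n
    variance = sum((m - mean) ** 2 for m in measurements) / n
    return mean - true_value, variance


# Hypothetical heights in cm measured with shoes on; the true height is 170.0 cm.
with_shoes = [173.0, 173.1, 172.9, 173.0]
bias, variance = bias_and_variance(with_shoes, true_value=170.0)
```

The variance is tiny, so the measurement is precise, but the bias of about 3 cm means it is not accurate.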


Data Quality for Collections of Data

•  Collections of Data
   –  Much of statistics is concerned with inference from a sample to a population
   –  How to infer things about the entire population from a fraction of it
   –  Two sources of error: sample size and bias


Sample Size

•  Confidence Intervals
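The slide only names confidence intervals, so the following is a hedged sketch of one common choice rather than the lecture's own method: an approximate 95% interval for a mean using the normal approximation, mean ± 1.96 · s / sqrt(n), with s the sample standard deviation (dividing by n - 1). The function name and data are hypothetical.

```python
import math


def mean_confidence_interval(sample, z=1.96):
    """Approximate confidence interval for the mean (normal approximation)."""
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(sample) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
    half_width = z * s / math.sqrt(n)
    return mean - half_width, mean + half_width


small = [9.0, 10.0, 11.0, 10.0]
large = small * 4                      # same spread, four times as many observations
low_s, high_s = mean_confidence_interval(small)
low_l, high_l = mean_confidence_interval(large)
```

The larger sample gives a narrower interval around the same mean, which is why sample size is listed as a source of error.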


Biased Sample

•  Inappropriate samples
   –  To calculate the average weight of people in New York it would be inappropriate to restrict samples to women, or to office workers
•  Random sample is key to making valid inferences
   –  Stratification (gender, age, education, occupation)
   –  Proportional representation


Outlier

Anomalous Observations
