© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹#› Data Mining: Data Lecture Notes for Chapter 2 Introduction to Data Mining by Tan, Steinbach, Kumar

Data Mining: Data

Lecture Notes for Chapter 2

Introduction to Data Mining

by

Tan, Steinbach, Kumar


What is Data?

Collection of data objects and their attributes

An attribute is a property or characteristic of an object

– Examples: eye color of a person, temperature, etc.

– Attribute is also known as variable, field, characteristic, or feature

A collection of attributes describes an object

– Object is also known as record, point, case, sample, entity, or instance

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

(Columns are attributes; rows are objects.)


Types of Attributes

There are different types of attributes

– Nominal

Examples: ID numbers, eye color, zip codes

– Ordinal

Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}

– Interval

Examples: calendar dates, temperatures in Celsius or Fahrenheit

– Ratio

Examples: monetary quantities (amounts of currency)


Attribute types: description, examples, and operations

Nominal

– Description: The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠)

– Examples: zip codes, employee ID numbers, eye color, sex: {male, female}

– Operations: mode, entropy, correlation, χ² test

Ordinal

– Description: The values of an ordinal attribute provide enough information to order objects. (<, >)

– Examples: hardness of minerals, {good, better, best}, grades, street numbers

– Operations: median, percentiles, rank correlation

Interval

– Description: For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, −)

– Examples: calendar dates, temperature in Celsius or Fahrenheit

– Operations: mean, standard deviation, Pearson's correlation, t and F tests

Ratio

– Description: For ratio variables, both differences and ratios are meaningful. (*, /)

– Examples: monetary quantities, electrical current

– Operations: geometric mean, harmonic mean, percent variation


Discrete and Continuous Attributes

Discrete Attribute

– Has only a finite or countably infinite set of values

– Examples: zip codes, counts, or the set of words in a collection of documents

– Often represented as integer variables.

– Note: binary attributes are a special case of discrete attributes

Continuous Attribute

– Has real numbers as attribute values

– Examples: temperature, height, or weight.

– Practically, real values can only be measured and represented using a finite number of digits.

– Continuous attributes are typically represented as floating-point variables.


Types of data sets

Record

– Data Matrix

– Document Data

– Transaction Data

Graph

– World Wide Web

– Molecular Structures

Ordered

– Spatial Data

– Temporal Data

– Sequential Data

– Genetic Sequence Data


Record Data

Data that consists of a collection of records, each of which consists of a fixed set of attributes

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes


Data Matrix

If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space.

Such a data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute.


Document Data

Each document becomes a 'term' vector,

– each term is a component (attribute) of the vector,

– the value of each component is the number of times the corresponding term occurs in the document.

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0
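
The term-vector idea can be sketched in a few lines of Python. This is only an illustration: the function name `term_vector`, the four-word vocabulary, and the sample sentence are made up for the example, and tokenization is a plain whitespace split.

```python
from collections import Counter


def term_vector(text, vocabulary):
    """Return one count per vocabulary term, in vocabulary order."""
    # Naive tokenization: lowercase and split on whitespace (an assumption).
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


# Hypothetical vocabulary and document, chosen only for illustration.
vocabulary = ["team", "coach", "play", "ball"]
document = "the coach told the team to play as a team"

vector = term_vector(document, vocabulary)
print(vector)
```

Each position in the resulting list corresponds to one attribute (term), so a collection of documents over the same vocabulary forms a data matrix.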


Transaction Data

A special type of record data, where

– each record (transaction) involves a set of items.

– For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk


Graph Data

Examples: Generic graph and HTML Links

[Figure: generic graph]

<a href="papers/papers.html#bbbb">Data Mining</a>
<li>
<a href="papers/papers.html#aaaa">Graph Partitioning</a>
<li>
<a href="papers/papers.html#aaaa">Parallel Solution of Sparse Linear System of Equations</a>
<li>
<a href="papers/papers.html#ffff">N-Body Computation and Dense Linear System Solvers</a>


Chemical Data

Benzene Molecule: C6H6


Ordered Data

Sequences of transactions

[Figure: a sequence of transactions; each element of the sequence is a set of items/events]


Ordered Data

Genomic sequence data


Ordered Data

Spatio-Temporal Data

[Figure: average monthly temperature of land and ocean]


Data Preprocessing


Why Preprocess the Data?

Measures for data quality:

– Accuracy: correct or wrong, accurate or not

– Completeness: not recorded, unavailable, …

– Consistency: some modified but some not …

– Believability: how much the data are trusted to be correct

– Interpretability: how easily the data can be understood


Major Tasks in Data Preprocessing

Data cleaning

– Fill in missing values, smooth noisy data

Data Reduction

– Sampling

– Data Compression

Data transformation and data discretization

– Normalization


Data Cleaning

Data in the real world is dirty: lots of potentially incorrect data, e.g., faulty instruments, human or computer error, transmission error

– Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data

    e.g., Occupation = “ ” (missing data)

– Noisy: containing noise, errors, or outliers

    e.g., Salary = “−10” (an error)

– Inconsistent: containing discrepancies in codes or names, e.g.,

    Age = “42”, Birthday = “03/07/2010”

    Was rating “1, 2, 3”, now rating “A, B, C”

    Discrepancy between duplicate records

– Intentional (e.g., disguised missing data)

    Jan. 1 as everyone’s birthday?


How to Handle Missing Data?

Ignore the tuple: usually done when the class label is missing (when doing classification); not effective when the % of missing values per attribute varies considerably

Fill in the missing value manually: tedious and often infeasible

Fill it in automatically with

– a global constant, e.g., “unknown” (which may act as a new class)

– the attribute mean

– the attribute mean for all samples belonging to the same class (smarter)
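
As a rough sketch of the last option, the snippet below fills missing values with the mean of the same attribute within the same class. The function name `fill_with_class_mean`, the use of `None` to mark a missing value, and the tiny record list are all assumptions made for illustration.

```python
from collections import defaultdict


def fill_with_class_mean(records, attribute, class_label):
    """Return copies of the records with missing (None) values of
    `attribute` replaced by the mean of that attribute in the same class."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for record in records:
        if record[attribute] is not None:
            totals[record[class_label]] += record[attribute]
            counts[record[class_label]] += 1

    filled = []
    for record in records:
        record = dict(record)  # copy so the input is left untouched
        cls = record[class_label]
        # Only fill when the class has at least one observed value.
        if record[attribute] is None and counts[cls] > 0:
            record[attribute] = totals[cls] / counts[cls]
        filled.append(record)
    return filled


# Hypothetical records: income in thousands, class label "cheat".
records = [
    {"income": 100.0, "cheat": "No"},
    {"income": 60.0, "cheat": "No"},
    {"income": None, "cheat": "No"},
    {"income": 90.0, "cheat": "Yes"},
    {"income": None, "cheat": "Yes"},
]
filled = fill_with_class_mean(records, "income", "cheat")
print([r["income"] for r in filled])
```

Using the overall attribute mean instead would only require dropping the grouping by class.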


Noisy Data

Noise: random error or variance in a measured variable

Incorrect attribute values may be due to

– faulty data collection instruments

– data entry problems

– data transmission problems

– technology limitation


How to Handle Noisy Data?

Binning

– first sort data and partition into (equal-frequency) bins

– then one can smooth by bin means, smooth by bin boundaries, etc.


Binning Methods for Data Smoothing

Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

* Partition into equal-frequency (equi-depth) bins:

- Bin 1: 4, 8, 9, 15

- Bin 2: 21, 21, 24, 25

- Bin 3: 26, 28, 29, 34

* Smoothing by bin means:

- Bin 1: 9, 9, 9, 9

- Bin 2: 23, 23, 23, 23

- Bin 3: 29, 29, 29, 29

* Smoothing by bin boundaries:

- Bin 1: 4, 4, 4, 15

- Bin 2: 21, 21, 25, 25

- Bin 3: 26, 26, 26, 34
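
The price example can be reproduced with a short Python sketch. The function names are illustrative, the bin means are rounded to whole dollars as in the listing above, the number of values is assumed to divide evenly into the bins, and ties between the two boundaries go to the lower one (an assumption; the slide does not say).

```python
def equal_frequency_bins(sorted_values, n_bins):
    """Split already-sorted values into n_bins bins of equal size.
    Assumes len(sorted_values) is divisible by n_bins."""
    size = len(sorted_values) // n_bins
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]


def smooth_by_means(bins):
    """Replace every value by the (rounded) mean of its bin."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer of its bin's min and max."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]
        # Ties go to the lower boundary (an assumption).
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
print(bins)
print(by_means)
print(by_boundaries)
```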


Data Reduction: Sampling

Sampling: obtaining a small sample s to represent the whole data set N

Key principle: Choose a representative subset of the data

– Simple random sampling may have very poor performance in the presence of skew

– Develop adaptive sampling methods, e.g., stratified sampling


Types of Sampling

Simple random sampling

– There is an equal probability of selecting any particular item

Sampling without replacement

– Once an object is selected, it is removed from the population

Sampling with replacement

– A selected object is not removed from the population

Stratified sampling:

– Partition the data set, and draw samples from each partition (proportionally, i.e., approximately the same percentage of the data)

– Used in conjunction with skewed data
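
A minimal sketch of proportional stratified sampling, using only Python's standard `random` module. The function name `stratified_sample`, the fixed seed, and the skewed toy population are assumptions for illustration; within each stratum the draw is a simple random sample without replacement.

```python
import random
from collections import defaultdict


def stratified_sample(items, key, fraction, seed=0):
    """Draw roughly `fraction` of the items from every stratum.
    `key` maps an item to the stratum it belongs to."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)

    sample = []
    for members in strata.values():
        # Keep at least one item so that rare strata are represented.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))  # without replacement
    return sample


# Skewed toy population: 90 "common" objects and 10 "rare" ones.
population = [("common", i) for i in range(90)] + [("rare", i) for i in range(10)]
sample = stratified_sample(population, key=lambda item: item[0], fraction=0.1)
print(len(sample), sum(1 for s in sample if s[0] == "rare"))
```

Sampling with replacement inside a stratum could be sketched by swapping `rng.sample(members, k)` for `rng.choices(members, k=k)`.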


Sampling: With or without Replacement

[Figure: raw data and samples drawn with and without replacement]


Sampling: Cluster or Stratified Sampling

[Figure: raw data and the corresponding cluster/stratified sample]


Sample Size

[Figure: the same data set shown with 8000 points, 2000 points, and 500 points]


Data Reduction : Data Compression

String compression

– There are extensive theories and well-tuned algorithms

– Typically lossless, but only limited manipulation is possible without expansion

Audio/video compression

– Typically lossy compression, with progressive refinement

– Sometimes small fragments of signal can be reconstructed without reconstructing the whole


Data Compression

[Figure: original data, compressed data (lossless), and approximated original data]


Data Transformation

A function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values

Methods

– Smoothing: remove noise from data

– Normalization: values are scaled to fall within a smaller, specified range

    min-max normalization

    z-score normalization

    normalization by decimal scaling


Normalization

Min-max normalization: to [new_min_A, new_max_A]

\[ v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A \]

– Ex. Let income range $12,000 to $98,000 normalized to [0.0, 1.0]. Then $73,600 is mapped to

\[ \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716 \]

Z-score normalization (μ: mean, σ: standard deviation):

\[ v' = \frac{v - \mu_A}{\sigma_A} \]

– Ex. Let μ = 54,000, σ = 16,000. Then

\[ \frac{73{,}600 - 54{,}000}{16{,}000} = 1.225 \]

Normalization by decimal scaling

\[ v' = \frac{v}{10^{j}} \]

where j is the smallest integer such that max(|v'|) < 1
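
The three normalizations can be checked numerically with a small Python sketch. The function names are illustrative, the income figures are the ones used in the examples above, and the values passed to `decimal_scaling` are made up. The decimal-scaling helper only searches non-negative j, which is an assumption of this sketch.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] linearly onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min


def z_score(v, mean, std):
    """Express v as a number of standard deviations from the mean."""
    return (v - mean) / std


def decimal_scaling(values):
    """Divide by 10**j, with j the smallest non-negative integer
    that brings every absolute value below 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j


print(round(min_max(73_600, 12_000, 98_000), 3))
print(round(z_score(73_600, 54_000, 16_000), 3))

# Hypothetical attribute values for the decimal-scaling case.
scaled, j = decimal_scaling([-986, 917])
print(scaled, j)
```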


Similarity and Dissimilarity


Similarity

– Numerical measure of how alike two data objects are.

– Is higher when objects are more alike.

– Often falls in the range [0,1]

Dissimilarity

– Numerical measure of how different are two data

objects

– Lower when objects are more alike

– Minimum dissimilarity is often 0

– Upper limit varies

Proximity refers to a similarity or dissimilarity


Euclidean Distance

\[ dist(p, q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2} \]

where n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.


Euclidean Distance

[Figure: points p1 to p4 plotted with x from 0 to 6 and y from 0 to 3]

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

Distance Matrix

      p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0
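
The distance matrix for the four points can be recomputed with a few lines of Python. This is a plain standard-library sketch; the function name `euclidean` and the dictionary layout are choices made for the example.

```python
import math


def euclidean(p, q):
    """Square root of the summed squared differences per attribute."""
    return math.sqrt(sum((pk - qk) ** 2 for pk, qk in zip(p, q)))


points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}
names = list(points)

# Full symmetric matrix, rounded to three decimals as in the table.
matrix = [
    [round(euclidean(points[a], points[b]), 3) for b in names]
    for a in names
]
for name, row in zip(names, matrix):
    print(name, row)
```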


Minkowski Distance

Minkowski Distance is a generalization of Euclidean Distance

\[ dist(p, q) = \left( \sum_{k=1}^{n} |p_k - q_k|^{r} \right)^{1/r} \]

where r is a parameter, n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.


Minkowski Distance: Examples

r = 1. City block (Manhattan, taxicab, L1 norm) distance.

– A common example of this is the Hamming distance, which is just the number of bits that are different between two binary vectors

r = 2. Euclidean distance


Minkowski Distance

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

Manhattan (L1) Distance Matrix

L1    p1  p2  p3  p4
p1    0   4   4   6
p2    4   0   2   4
p3    4   2   0   2
p4    6   4   2   0

Euclidean (L2) Distance Matrix

L2    p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0
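
Both matrices follow from one general function with different values of r. The sketch below is illustrative only (the function name `minkowski` is an assumption); it also shows the Hamming distance as the r = 1 case on two made-up binary vectors.

```python
def minkowski(p, q, r):
    """Minkowski distance with parameter r (r=1: city block, r=2: Euclidean)."""
    return sum(abs(pk - qk) ** r for pk, qk in zip(p, q)) ** (1 / r)


points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}
names = list(points)

l1 = [[minkowski(points[a], points[b], 1) for b in names] for a in names]
l2 = [[round(minkowski(points[a], points[b], 2), 3) for b in names] for a in names]
print(l1)
print(l2)

# Hamming distance: the L1 distance between two binary vectors
# counts the positions in which they differ (vectors are made up).
hamming = minkowski((1, 0, 1, 1), (0, 0, 1, 0), 1)
print(hamming)
```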


Common Properties of a Distance

Distances, such as the Euclidean distance, have some well known properties.

1. d(p, q) ≥ 0 for all p and q and d(p, q) = 0 only if p = q. (Positive definiteness)

2. d(p, q) = d(q, p) for all p and q. (Symmetry)

3. d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r. (Triangle Inequality)

where d(p, q) is the distance (dissimilarity) between points (data objects), p and q.

A distance that satisfies these properties is a metric
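
The three properties can be checked by brute force on a small set of points. The sketch below is an illustration, not a proof: the helper name `is_metric_on` and the tolerance are assumptions, and passing the check on a finite sample only shows that no violation was found there. Squared Euclidean distance is included as an example of a dissimilarity that fails the triangle inequality.

```python
import math
from itertools import product


def is_metric_on(points, d, tol=1e-12):
    """Check positivity, symmetry and the triangle inequality
    for the dissimilarity d on the given finite set of points."""
    for p, q in product(points, repeat=2):
        if d(p, q) < 0 or (d(p, q) == 0) != (p == q):
            return False  # positive definiteness violated
        if abs(d(p, q) - d(q, p)) > tol:
            return False  # symmetry violated
    for p, q, r in product(points, repeat=3):
        if d(p, r) > d(p, q) + d(q, r) + tol:
            return False  # triangle inequality violated
    return True


def squared_euclidean(p, q):
    return sum((pk - qk) ** 2 for pk, qk in zip(p, q))


points = [(0, 2), (2, 0), (3, 1), (5, 1)]
print(is_metric_on(points, math.dist))          # Euclidean distance
print(is_metric_on(points, squared_euclidean))  # not a metric
```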


Similarity Between Binary Vectors

A common situation is that two objects, i and j, have only binary attributes


Cosine Similarity

If d1 and d2 are two document vectors, then

\[ \cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|} \]

where · indicates the vector dot product and ||d|| is the length of vector d.

Example:

d1 = 3 2 0 5 0 0 0 2 0 0

d2 = 1 0 0 0 0 0 0 1 0 2

d1 · d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5

||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481

||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 = 2.449

cos(d1, d2) = 5 / (6.481 * 2.449) = 0.3150
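
The worked example can be verified with a short Python sketch. The function name `cosine_similarity` is illustrative, and the sketch assumes neither vector is all zeros (otherwise the division is undefined).

```python
import math


def cosine_similarity(d1, d2):
    """Dot product divided by the product of the vector lengths.
    Assumes neither vector is the zero vector."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)


d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
similarity = cosine_similarity(d1, d2)
print(round(similarity, 4))
```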


Correlation

Correlation measures the linear relationship between objects
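
The slide does not give a formula here, so the following is only a sketch that assumes the usual Pearson correlation coefficient is meant. The function name `pearson` and the sample vectors are made up, and both inputs are assumed to have non-zero variance.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences.
    Assumes both sequences have non-zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))   # perfectly linear, increasing
print(pearson([1, 2, 3], [3, 2, 1]))         # perfectly linear, decreasing
# A purely quadratic relationship has no linear component:
xs = [-2, -1, 0, 1, 2]
print(pearson(xs, [v * v for v in xs]))
```

The last call illustrates that a correlation of 0 rules out a linear relationship only; the two objects are still related by y = x².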


Visually Evaluating Correlation

[Figure: scatter plots showing the similarity from –1 to 1]

