© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹#›
Data Mining: Data
Lecture Notes for Chapter 2
Introduction to Data Mining
by
Tan, Steinbach, Kumar
What is Data?
Collection of data objects and
their attributes
An attribute is a property or
characteristic of an object
– Examples: eye color of a
person, temperature, etc.
– Attribute is also known as
variable, field, characteristic,
or feature
A collection of attributes
describes an object
– Object is also known as
record, point, case, sample,
entity, or instance
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
(Columns are attributes; rows are objects.)
Types of Attributes
There are different types of attributes
– Nominal
Examples: ID numbers, eye color, zip codes
– Ordinal
Examples: rankings (e.g., taste of potato chips on a scale
from 1-10), grades, height in {tall, medium, short}
– Interval
Examples: calendar dates, temperatures in Celsius or
Fahrenheit.
– Ratio
Examples: monetary quantities, electrical current
Attribute types: description, examples, and operations
Nominal
– Description: The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠)
– Examples: zip codes, employee ID numbers, eye color, sex: {male, female}
– Operations: mode, entropy, contingency correlation, χ² test
Ordinal
– Description: The values of an ordinal attribute provide enough information to order objects. (<, >)
– Examples: hardness of minerals, {good, better, best}, grades, street numbers
– Operations: median, percentiles, rank correlation
Interval
– Description: For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, −)
– Examples: calendar dates, temperature in Celsius or Fahrenheit
– Operations: mean, standard deviation, Pearson's correlation, t and F tests
Ratio
– Description: For ratio variables, both differences and ratios are meaningful. (*, /)
– Examples: monetary quantities, electrical current
– Operations: geometric mean, harmonic mean, percent variation
Discrete and Continuous Attributes
Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a collection of documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete attributes
Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and represented using a finite number of digits.
– Continuous attributes are typically represented as floating-point variables.
Types of data sets
Record
– Data Matrix
– Document Data
– Transaction Data
Graph
– World Wide Web
– Molecular Structures
Ordered
– Spatial Data
– Temporal Data
– Sequential Data
– Genetic Sequence Data
Record Data
Data that consists of a collection of records, each
of which consists of a fixed set of attributes
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Data Matrix
If data objects have the same fixed set of numeric
attributes, then the data objects can be thought of as
points in a multi-dimensional space.
Such a data set can be represented by an m by n matrix,
where there are m rows, one for each object, and n
columns, one for each attribute
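As a minimal sketch, an m by n data matrix can be held as a list of m rows with n numeric values each. The attribute names and values below are made up for illustration and do not come from the slides.

```python
# Hypothetical data: 3 objects (rows) described by 2 numeric attributes (columns).
attribute_names = ["height_cm", "weight_kg"]  # illustrative names only
data_matrix = [
    [170.0, 65.0],
    [182.5, 80.2],
    [158.3, 51.7],
]

m = len(data_matrix)      # number of objects (rows)
n = len(data_matrix[0])   # number of attributes (columns)

# Every object must carry the same fixed set of attributes.
assert all(len(row) == n for row in data_matrix)

# Column access: all values of one attribute across the objects.
weights = [row[1] for row in data_matrix]
print(m, n, weights)
```

Each row is then a point in n-dimensional space, which is what the distance measures later in these notes operate on.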
Document Data
Each document becomes a `term' vector,
– each term is a component (attribute) of the vector,
– the value of each component is the number of times
the corresponding term occurs in the document.
            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1  3     0      5     0     2      6     0    2     0        2
Document 2  0     7      0     2     1      0     0    3     0        0
Document 3  0     1      0     0     1      2     2    0     3        0
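The term-vector idea can be sketched in a few lines. This is only an illustration: the vocabulary is a subset of the terms in the table, the sentence is invented, and splitting on whitespace is a simplifying assumption (real text needs proper tokenization).

```python
from collections import Counter

def term_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

vocabulary = ["team", "coach", "play", "ball", "score"]        # illustrative subset
doc = "Coach said the team will play and the team will score"  # made-up text
vector = term_vector(doc, vocabulary)
print(vector)
```

Terms outside the vocabulary are ignored, and vocabulary terms that never occur get a count of 0, so every document maps to a vector of the same length.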
Transaction Data
A special type of record data, where
– each record (transaction) involves a set of items.
– For example, consider a grocery store. The set of
products purchased by a customer during one
shopping trip constitutes a transaction, while the
individual products that were purchased are the items.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
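One possible representation, shown here as a sketch rather than a prescribed format, stores each transaction as a set of items keyed by its TID. The data are the five transactions listed above; the two queries are illustrative.

```python
from collections import Counter

# The five transactions from the table, each stored as a set of items.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

# Count in how many transactions each item appears.
item_counts = Counter(item for items in transactions.values() for item in items)

# Transactions that contain both Diaper and Milk (subset test on sets).
both = [tid for tid, items in transactions.items() if {"Diaper", "Milk"} <= items]
print(item_counts["Milk"], both)
```

Sets fit because a transaction records which items were bought, not their order.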
Graph Data
Examples: Generic graph and HTML Links
<a href="papers/papers.html#bbbb">Data Mining</a>
<li>
<a href="papers/papers.html#aaaa">Graph Partitioning</a>
<li>
<a href="papers/papers.html#aaaa">Parallel Solution of Sparse Linear System of Equations</a>
<li>
<a href="papers/papers.html#ffff">N-Body Computation and Dense Linear System Solvers</a>
Chemical Data
Benzene Molecule: C6H6
Ordered Data
Sequences of transactions
(Figure: each element of the sequence is a set of items/events.)
Ordered Data
Genomic sequence data
Ordered Data
Spatio-Temporal Data
(Figure: average monthly temperature of land and ocean.)
Data Preprocessing
Why Preprocess the Data?
Measures for data quality:
– Accuracy: correct or wrong, accurate or not
– Completeness: not recorded, unavailable, …
– Consistency: some modified but some not …
– Believability: how much are the data trusted to be correct?
– Interpretability: how easily the data can be
understood?
Major Tasks in Data Preprocessing
Data cleaning
– Fill in missing values, smooth noisy data
Data Reduction
– Sampling
– Data Compression
Data transformation and data discretization
– Normalization
Data Cleaning
Data in the real world is dirty: lots of potentially incorrect data, e.g., due to
faulty instruments, human or computer error, or transmission error
– incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
e.g., Occupation = “ ” (missing data)
– noisy: containing noise, errors, or outliers
e.g., Salary = “−10” (an error)
– inconsistent: containing discrepancies in codes or names, e.g.,
Age = “42”, Birthday = “03/07/2010”
Was rating “1, 2, 3”, now rating “A, B, C”
discrepancy between duplicate records
– Intentional (e.g., disguised missing data)
Jan. 1 as everyone’s birthday?
How to Handle Missing Data?
Ignore the tuple: usually done when class label is missing
(when doing classification)—not effective when the % of
missing values per attribute varies considerably
Fill in the missing value manually: tedious + infeasible?
Fill it in automatically with
– a global constant : e.g., “unknown”, a new class?!
– the attribute mean
– the attribute mean for all samples belonging to the
same class: smarter
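The two mean-based fills can be sketched as follows. The records and the function names are hypothetical; `None` stands in for a missing value, and the class label is assumed to be present on every record.

```python
from statistics import mean

# Hypothetical records: (class label, income in thousands); None marks a missing value.
records = [("No", 125), ("No", None), ("No", 70),
           ("Yes", 95), ("Yes", None), ("Yes", 85)]

def fill_with_attribute_mean(rows):
    """Replace missing values with the mean over all observed values."""
    overall = mean(v for _, v in rows if v is not None)
    return [(c, overall if v is None else v) for c, v in rows]

def fill_with_class_mean(rows):
    """Replace missing values with the mean of observed values in the same class."""
    class_means = {}
    for label in {c for c, _ in rows}:
        class_means[label] = mean(v for c, v in rows if c == label and v is not None)
    return [(c, class_means[c] if v is None else v) for c, v in rows]

print(fill_with_attribute_mean(records))
print(fill_with_class_mean(records))
```

With these values the overall mean is 93.75, while the class means are 97.5 for "No" and 90 for "Yes", which shows why the per-class fill can track the data more closely.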
Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to
– faulty data collection instruments
– data entry problems
– data transmission problems
– technology limitation
How to Handle Noisy Data?
Binning
– first sort data and partition into (equal-frequency) bins
– then one can smooth by bin means, smooth by bin
boundaries, etc.
Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
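The price example can be reproduced with a short sketch. Two assumptions are made here that the slide does not state: bin means are rounded to the nearest integer, and a value equally close to both boundaries goes to the lower one. The function names are illustrative.

```python
def equal_frequency_bins(sorted_values, n_bins):
    """Split sorted values into n_bins bins of equal size (length must divide evenly)."""
    size = len(sorted_values) // n_bins
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value with its bin mean, rounded to the nearest integer."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are sorted, so the ends are the boundaries
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_frequency_bins(prices, 3)
print(smooth_by_means(bins))
print(smooth_by_boundaries(bins))
```

The unrounded bin means are 9, 22.75, and 29.25, which is where the 23 and 29 in the slide come from.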
Data Reduction: Sampling
Sampling: obtaining a small sample s to represent the
whole data set N
Key principle: Choose a representative subset of the data
– Simple random sampling may have very poor
performance in the presence of skew
– Develop adaptive sampling methods, e.g., stratified
sampling
Types of Sampling
Simple random sampling
– There is an equal probability of selecting any particular item
Sampling without replacement
– Once an object is selected, it is removed from the population
Sampling with replacement
– A selected object is not removed from the population
Stratified sampling:
– Partition the data set, and draw samples from each partition
(proportionally, i.e., approximately the same percentage of the
data)
– Used in conjunction with skewed data
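A sketch of the three schemes using Python's standard `random` module follows. The population, the 10% fraction, and the rule of keeping at least one object per stratum are assumptions made for the example.

```python
import random
from collections import defaultdict

def sample_without_replacement(population, k, rng):
    """Each object can be chosen at most once."""
    return rng.sample(population, k)

def sample_with_replacement(population, k, rng):
    """The same object can be chosen more than once."""
    return [rng.choice(population) for _ in range(k)]

def stratified_sample(population, key, fraction, rng):
    """Draw roughly the same fraction from each partition (stratum)."""
    strata = defaultdict(list)
    for obj in population:
        strata[key(obj)].append(obj)
    sample = []
    for members in strata.values():
        k = max(1, round(fraction * len(members)))  # keep at least one per stratum
        sample.extend(rng.sample(members, k))
    return sample

rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Hypothetical skewed data: 90 objects of class "No", 10 of class "Yes".
population = [("No", i) for i in range(90)] + [("Yes", i) for i in range(10)]
strat = stratified_sample(population, key=lambda obj: obj[0], fraction=0.1, rng=rng)
print(len(strat), sum(1 for c, _ in strat if c == "Yes"))
```

Stratifying on the class guarantees that the rare "Yes" class is represented, which a small simple random sample cannot promise.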
Sampling: With or without Replacement
Raw Data
Sampling: Cluster or Stratified Sampling
Raw Data Cluster/Stratified Sample
Sample Size
(Figure: the same data set sampled at 8000 points, 2000 points, and 500 points.)
Data Reduction : Data Compression
String compression
– There are extensive theories and well-tuned algorithms
– Typically lossless, but only limited manipulation is possible without expansion
Audio/video compression
– Typically lossy compression, with progressive refinement
– Sometimes small fragments of signal can be reconstructed without reconstructing the whole
Data Compression
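(Figure: lossless compression maps the original data to compressed data and back to the original data; otherwise only an approximation of the original data is recovered.)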
Data Transformation
A function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values
Methods
– Smoothing: remove noise from data
– Normalization: scale values to fall within a smaller, specified range (min-max normalization, z-score normalization, normalization by decimal scaling)
Normalization
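Min-max normalization: to [new_min_A, new_max_A]
\[ v' = \frac{v - \mathrm{min}_A}{\mathrm{max}_A - \mathrm{min}_A}\,(\mathrm{new\_max}_A - \mathrm{new\_min}_A) + \mathrm{new\_min}_A \]
– Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0].
Then $73,600 is mapped to
\[ \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716 \]
Z-score normalization (μ_A: mean, σ_A: standard deviation of attribute A):
\[ v' = \frac{v - \mu_A}{\sigma_A} \]
– Ex. Let μ_A = 54,000, σ_A = 16,000. Then $73,600 is mapped to
\[ \frac{73{,}600 - 54{,}000}{16{,}000} = 1.225 \]
Normalization by decimal scaling
\[ v' = \frac{v}{10^{j}} \]
where j is the smallest integer such that max(|v'|) < 1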
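The three normalizations can be checked with a short sketch. The function names are illustrative, the decimal-scaling helper assumes j is non-negative, and the two values passed to it are made up; the income figures are the ones from the examples above.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Min-max normalization of v from [min_a, max_a] to [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    """Z-score normalization: distance from the mean in standard deviations."""
    return (v - mean_a) / std_a

def decimal_scaling(values):
    """Divide by 10^j, with j the smallest non-negative integer giving max(|v'|) < 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

print(round(min_max(73_600, 12_000, 98_000), 3))
print(round(z_score(73_600, 54_000, 16_000), 3))
print(decimal_scaling([-986, 917]))  # hypothetical attribute values
```

For the hypothetical values −986 and 917 the loop stops at j = 3, since 986 / 1000 is the first scaled maximum below 1.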
Similarity and Dissimilarity
Similarity
– Numerical measure of how alike two data objects are.
– Is higher when objects are more alike.
– Often falls in the range [0,1]
Dissimilarity
– Numerical measure of how different two data
objects are
– Lower when objects are more alike
– Minimum dissimilarity is often 0
– Upper limit varies
Proximity refers to a similarity or dissimilarity
Euclidean Distance
\[ \mathit{dist} = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2} \]
where n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.
Euclidean Distance
(Figure: points p1, p2, p3, and p4 plotted in the x-y plane.)
point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1
Distance Matrix
     p1     p2     p3     p4
p1   0      2.828  3.162  5.099
p2   2.828  0      1.414  3.162
p3   3.162  1.414  0      2
p4   5.099  3.162  2      0
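The distance matrix can be recomputed from the four points with a short sketch; the function name is illustrative.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points with the same number of attributes."""
    return math.sqrt(sum((pk - qk) ** 2 for pk, qk in zip(p, q)))

points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}

# Pairwise distance matrix, rounded to three decimals as in the slide.
names = list(points)
matrix = [[round(euclidean(points[a], points[b]), 3) for b in names] for a in names]
for name, row in zip(names, matrix):
    print(name, row)
```

The matrix is symmetric with a zero diagonal, so only the entries above the diagonal carry information.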
Minkowski Distance
Minkowski Distance is a generalization of Euclidean Distance
\[ \mathit{dist} = \left( \sum_{k=1}^{n} |p_k - q_k|^{r} \right)^{1/r} \]
where r is a parameter, n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.
Minkowski Distance: Examples
r = 1. City block (Manhattan, taxicab, L1 norm) distance.
– A common example of this is the Hamming distance, which is just the number of bits that are different between two binary vectors
r = 2. Euclidean distance
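A minimal sketch of the Minkowski distance with r as a parameter follows. The points (0, 2) and (5, 1) are p1 and p4 from the Euclidean distance example; the two bit vectors are made up to illustrate the Hamming case.

```python
def minkowski(p, q, r):
    """Minkowski distance with parameter r (r = 1: city block, r = 2: Euclidean)."""
    return sum(abs(pk - qk) ** r for pk, qk in zip(p, q)) ** (1 / r)

p1, p4 = (0, 2), (5, 1)
print(minkowski(p1, p4, 1))            # city block distance
print(round(minkowski(p1, p4, 2), 3))  # Euclidean distance

# For binary vectors, the r = 1 distance counts the differing bits (Hamming distance).
a = (1, 0, 1, 1, 0)  # hypothetical bit vectors
b = (1, 1, 0, 1, 0)
print(minkowski(a, b, 1))
```

The city block distance is never smaller than the Euclidean distance for the same pair of points.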
Minkowski Distance
point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1
Manhattan Distance Matrix (L1)
     p1  p2  p3  p4
p1   0   4   4   6
p2   4   0   2   4
p3   4   2   0   2
p4   6   4   2   0
Euclidean Distance Matrix (L2)
     p1     p2     p3     p4
p1   0      2.828  3.162  5.099
p2   2.828  0      1.414  3.162
p3   3.162  1.414  0      2
p4   5.099  3.162  2      0
Common Properties of a Distance
Distances, such as the Euclidean distance, have some well known properties.
1. d(p, q) ≥ 0 for all p and q and d(p, q) = 0 only if p = q. (Positive definiteness)
2. d(p, q) = d(q, p) for all p and q. (Symmetry)
3. d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r. (Triangle Inequality)
where d(p, q) is the distance (dissimilarity) between points (data objects), p and q.
A distance that satisfies these properties is a metric
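The three properties can be spot-checked numerically on a finite set of points. The sketch below is a sanity check, not a proof; the helper name and the tolerance are assumptions, and the squared Euclidean distance is included only as a contrast that fails the triangle inequality on these points.

```python
import math
from itertools import product

def euclidean(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((pk - qk) ** 2 for pk, qk in zip(p, q)))

def looks_like_metric(d, points, tol=1e-12):
    """Check the three properties on a finite set of points (not a proof)."""
    for p, q in product(points, repeat=2):
        if d(p, q) < 0 or (d(p, q) == 0) != (p == q):
            return False                      # positive definiteness fails
        if abs(d(p, q) - d(q, p)) > tol:
            return False                      # symmetry fails
    for p, q, r in product(points, repeat=3):
        if d(p, r) > d(p, q) + d(q, r) + tol:
            return False                      # triangle inequality fails
    return True

points = [(0, 2), (2, 0), (3, 1), (5, 1)]
squared = lambda p, q: euclidean(p, q) ** 2   # squared distance, for contrast
print(looks_like_metric(euclidean, points), looks_like_metric(squared, points))
```

For the squared distance, going from (2, 0) to (5, 1) costs about 10, while the detour through (3, 1) costs about 2 + 4 = 6, so property 3 is violated.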
Similarity Between Binary Vectors
Common situation is that objects, i and j, have only binary attributes
Cosine Similarity
If d1 and d2 are two document vectors, then
cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||),
where • indicates vector dot product and ||d|| is the length of vector d.
Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
d1 • d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = (3*3+2*2+0*0+5*5+0*0+0*0+0*0+2*2+0*0+0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (1*1+0*0+0*0+0*0+0*0+0*0+0*0+1*1+0*0+2*2)^0.5 = (6)^0.5 = 2.449
cos(d1, d2) = 0.3150
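The worked example can be verified with a short sketch. The function name is illustrative, and the code assumes neither vector is all zeros (otherwise the division is undefined).

```python
import math

def cosine_similarity(d1, d2):
    """Dot product of the two vectors divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
print(round(cosine_similarity(d1, d2), 4))
```

Because the measure depends only on the angle between the vectors, scaling a document vector (for example, doubling every term count) leaves the similarity unchanged.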
Correlation
Correlation measures the linear relationship
between objects
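The slide does not give a formula, so the sketch below assumes Pearson's correlation coefficient, which is listed among the interval-attribute operations earlier in these notes. The function name and the sample values are hypothetical, and both inputs are assumed to be non-constant.

```python
import math

def pearson_correlation(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

x = [1, 2, 3, 4, 5]                               # hypothetical attribute values
print(pearson_correlation(x, [2, 4, 6, 8, 10]))   # perfectly linear, increasing
print(pearson_correlation(x, [10, 8, 6, 4, 2]))   # perfectly linear, decreasing
```

The common 1/n factors in the covariance and the variances cancel, which is why the sums are used directly.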
Visually Evaluating Correlation
(Figure: scatter plots showing the similarity from –1 to 1.)