
Data and Data Types

Date post: 18-Dec-2021
Page 1: Data and Data Types

Data and Data Types

BİL477-2021-2022-FALL- INTRODUCTION TO DATA MINING DR. GÖKHAN MEMIŞ

Page 2: Data and Data Types

GENEL- PUBLIC

What is Data

Collection of data objects and their attributes

• An attribute is a property or characteristic of an object
  – Examples: eye color of a person, temperature, etc.
  – An attribute is also known as a variable, field, characteristic, dimension, or feature

• A collection of attributes describes an object
  – An object is also known as a record, point, case, sample, entity, or instance
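The object/attribute vocabulary can be made concrete with a small Python sketch. It assumes each data object is stored as a dictionary whose keys are attribute names; the records and the helper names `attribute_names` and `column` are illustrative only and do not come from the slides.

```python
# Hypothetical data set: each dict is one data object (record),
# and each key is one attribute (feature) of that object.
people = [
    {"name": "Ada", "eye_color": "brown", "temperature_c": 36.6},
    {"name": "Ben", "eye_color": "blue", "temperature_c": 37.1},
]


def attribute_names(objects):
    """Return the sorted attribute names used across all objects."""
    return sorted({key for obj in objects for key in obj})


def column(objects, attribute):
    """Collect one attribute's values across all objects."""
    return [obj[attribute] for obj in objects]


print(attribute_names(people))
print(column(people, "eye_color"))
```

Reading the data row by row gives objects; reading it column by column gives attributes.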

Page 3: Data and Data Types

Data vs Information vs Knowledge

Page 4: Data and Data Types

Knowledge Discovery in Data: Process

Page 5: Data and Data Types

Knowledge Discovery in Data: Challenges

Page 6: Data and Data Types

Data Come from Everywhere

Page 7: Data and Data Types

Attribute (Feature) Values

In the fields of machine learning and pattern recognition, a measurable attribute of an observed phenomenon is called a feature (or attribute).

Selecting clear, distinctive and independent features is a critical step for effective pattern recognition, classification and regression algorithms.

Features are usually numeric, but some pattern analysis also uses words and graphs.

Page 8: Data and Data Types

Types of Attributes

◦ Nominal
  ◦ Examples: ID numbers, eye color, zip codes

◦ Ordinal
  ◦ Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height {tall, medium, short}

◦ Interval
  ◦ Examples: calendar dates, temperatures in Celsius or Fahrenheit

◦ Ratio
  ◦ Examples: temperature in Kelvin, length, counts, elapsed time (e.g., time to run a race)
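The practical difference between the attribute types is which operations give meaningful answers. The following Python sketch illustrates this with the slide's own examples (Celsius vs. Kelvin, and height as tall/medium/short). The function names and the integer encoding in `HEIGHT_ORDER` are assumptions made for illustration.

```python
def celsius_to_kelvin(celsius):
    """Convert an interval-scale Celsius reading to ratio-scale Kelvin."""
    return celsius + 273.15


warm_c, cool_c = 20.0, 10.0

# Interval attributes: differences are meaningful and survive a unit change.
diff_c = warm_c - cool_c
diff_k = celsius_to_kelvin(warm_c) - celsius_to_kelvin(cool_c)

# Ratio attributes have a true zero, so only the Kelvin ratio is meaningful.
ratio_c = warm_c / cool_c  # 2.0, but 20 C is not "twice as hot" as 10 C
ratio_k = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cool_c)

# Ordinal attributes: order is meaningful, arithmetic is not.
# The integer codes below are an illustrative encoding of that order.
HEIGHT_ORDER = {"short": 0, "medium": 1, "tall": 2}


def is_taller(a, b):
    """Compare two ordinal height labels by their rank."""
    return HEIGHT_ORDER[a] > HEIGHT_ORDER[b]


print(diff_c, round(diff_k, 2), ratio_c, round(ratio_k, 4))
print(is_taller("tall", "medium"))
```

Nominal attributes such as eye color or zip codes support only equality tests, so no ordering or arithmetic is defined for them here.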

Page 9: Data and Data Types

Discrete and Continuous Attributes

Page 10: Data and Data Types

Important Characteristics of Data

◦ Dimensionality (number of attributes)
  ◦ High dimensional data brings a number of challenges

◦ Resolution
  ◦ Patterns depend on the scale

◦ Size
  ◦ Type of analysis may depend on size of data

Page 11: Data and Data Types

Types of Dataset

◦ Record Data
  ◦ Transactional Data
  ◦ Data Matrix
  ◦ Document Data

◦ Temporal Data
  ◦ Time Series Data
  ◦ Sequence Data

◦ Spatial & Spatial-Temporal Data
  ◦ Spatial Data
  ◦ Spatial-Temporal Data

◦ Graph Data

◦ Transactional Data

◦ Unstructured Data
  ◦ Twitter Status Message
  ◦ Review, news article

◦ Semi-Structured Data
  ◦ Paper Publications Data
  ◦ XML format

Page 12: Data and Data Types

Record Data

Page 13: Data and Data Types

Data Matrix Example for Documents

Page 14: Data and Data Types

Data Matrix

Thickness   Load   Distance   Projection of y Load   Projection of x Load
1.1         2.2    16.22      6.25                   12.65
1.2         2.7    15.22      5.27                   10.23

Page 15: Data and Data Types

Temporal Data – Sequence Data

Page 16: Data and Data Types

Time Series Data

Page 17: Data and Data Types

Biological Sequence Data

Page 18: Data and Data Types

Interval Data

Page 19: Data and Data Types

Spatial & Spatial-Temporal Data

Page 20: Data and Data Types

Spatial & Spatial-Temporal Data

Page 21: Data and Data Types

Spatial & Spatial-Temporal Data

Page 22: Data and Data Types

Graph Data

Page 23: Data and Data Types

Structured, Semi-structured, Unstructured Data

Page 24: Data and Data Types

Structured, Semi-structured, Unstructured Data

Page 25: Data and Data Types

Can data help us solve specific problems?

Page 26: Data and Data Types

How should these pictures be placed into 3 groups?

Page 27: Data and Data Types

How many groups should there be?

Page 28: Data and Data Types

Which genes are associated with a disease?

Page 29: Data and Data Types

Where are the faces in this picture?

Page 30: Data and Data Types

Is it likely that this stock was traded based on illegal insider information?

Page 31: Data and Data Types

Data Quality

What kinds of data quality problems?

How can we detect problems with the data?

What can we do about these problems?

Examples of data quality problems:
◦ Noise and outliers
◦ Wrong data
◦ Fake data
◦ Missing values
◦ Duplicate data

Page 32: Data and Data Types

Noise

Page 33: Data and Data Types

Outliers

Outliers are data objects with characteristics that are considerably different from most of the other data objects in the data set.

Page 34: Data and Data Types

How to find Outliers?
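The slide text does not name a detection method. One common choice for a single numeric attribute is the interquartile-range (IQR) rule, sketched below in Python as an assumption rather than as the method used in the lecture. The function name `iqr_outliers`, the multiplier `k=1.5`, and the sample `readings` are all illustrative.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical sensor readings with one value far from the rest.
readings = [10, 12, 11, 13, 12, 11, 10, 95]
print(iqr_outliers(readings))
```

The rule flags values that are far from the bulk of the data; whether a flagged value is an error or an interesting observation still has to be decided by looking at it.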

Page 35: Data and Data Types

Missing Values

Reasons for missing values
◦ Information is not collected (e.g., people decline to give their age and weight)
◦ Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)

Handling missing values
◦ Eliminate data objects or variables
◦ Estimate missing values
  ◦ Example: time series of temperature
  ◦ Example: census results
◦ Ignore the missing value during analysis
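The three handling strategies can be sketched in Python on a short temperature series. This is a minimal illustration that assumes missing readings are stored as `None` and that linear interpolation is an acceptable estimate; the function names and the sample data are hypothetical.

```python
def drop_missing(values):
    """Eliminate: keep only the observed values."""
    return [v for v in values if v is not None]


def interpolate_missing(values):
    """Estimate: fill interior gaps linearly between observed neighbours."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled  # leading or trailing gaps are left as None


def mean_ignoring_missing(values):
    """Ignore: compute the statistic over the observed values only."""
    observed = drop_missing(values)
    return sum(observed) / len(observed)


# Hypothetical hourly temperature readings with gaps.
temperatures = [20.0, None, 22.0, None, None, 25.0]
print(drop_missing(temperatures))
print(interpolate_missing(temperatures))
print(round(mean_ignoring_missing(temperatures), 2))
```

Interpolation suits the time-series example because neighbouring readings are related; for unordered records such as census results a different estimate would be needed.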

Page 36: Data and Data Types

Duplicate Data

Data set may include data objects that are duplicates, or almost duplicates, of one another
◦ Major issue when merging data from heterogeneous sources

Examples:
◦ Same person with multiple email addresses

Data cleaning
◦ Process of dealing with duplicate data issues

When should duplicate data not be removed?
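A minimal Python sketch of the "same person with multiple email addresses" case follows. It assumes that matching on a normalized name is good enough to flag candidates; the records and the function name `group_possible_duplicates` are hypothetical.

```python
from collections import defaultdict


def group_possible_duplicates(records):
    """Group records whose names match after whitespace and case cleanup."""
    groups = defaultdict(list)
    for record in records:
        # Collapse repeated spaces and ignore letter case.
        key = " ".join(record["name"].split()).casefold()
        groups[key].append(record)
    # Keep only the names that occur more than once.
    return {key: recs for key, recs in groups.items() if len(recs) > 1}


# Hypothetical customer records merged from two sources.
customers = [
    {"name": "Jane Doe", "email": "jane@example.com"},
    {"name": "jane  doe ", "email": "j.doe@example.org"},
    {"name": "John Roe", "email": "john@example.com"},
]
print(group_possible_duplicates(customers))
```

The sketch only flags candidates and deletes nothing, because two different people can share a name; that is one situation in which apparent duplicates should not be removed.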

Page 37: Data and Data Types

Distance Functions

As the size of the data increases, the Manhattan Distance is preferred to the Euclidean distance metric.

The Minkowski metric is preferred if more detailed distance of the data is required.

Page 38: Data and Data Types

Euclidean Distance

[Figure: points p1, p2, p3, p4 plotted on an x axis from 0 to 6 and a y axis from 0 to 3]

Page 39: Data and Data Types

Minkowski Distance

Consider two points in a 7 dimensional space:

P1: (10, 2, 4, -1, 0, 9, 1)

P2: (14, 7, 11, 5, 2, 2, 18)

If we set p = 4 for this sample calculation, we find the following:

distance_p4 = (-4)^4 + (-5)^4 + (-7)^4 + (-6)^4 + (-2)^4 + (7)^4 + (-17)^4

distance_p4 = 4^4 + 5^4 + 7^4 + 6^4 + 2^4 + 7^4 + 17^4

distance_p4 = 256 + 625 + 2401 + 1296 + 16 + 2401 + 83521

distance_p4 = 90516

minkowski_distance = distance_p4 ^ 0.25

minkowski_distance = 90516 ^ 0.25

minkowski_distance ≈ 17.3453
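The calculation above can be checked with a short Python sketch. It assumes the usual definition of the Minkowski distance, the p-th root of the sum of absolute coordinate differences raised to the power p; the function name `minkowski_distance` is illustrative.

```python
def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two equal-length points."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of dimensions")
    if p < 1:
        raise ValueError("p must be at least 1")
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


# The two 7-dimensional points from the slide.
p1 = (10, 2, 4, -1, 0, 9, 1)
p2 = (14, 7, 11, 5, 2, 2, 18)

print(round(minkowski_distance(p1, p2, 4), 4))  # order 4, as on the slide
print(minkowski_distance(p1, p2, 1))            # order 1 is the Manhattan distance
print(round(minkowski_distance(p1, p2, 2), 4))  # order 2 is the Euclidean distance
```

Changing only `p` switches between the named distances, which is why the Minkowski distance is treated as the general case.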

Page 40: Data and Data Types

Manhattan Distance

When p = 1, the Minkowski distance is the same as the Manhattan distance.

Page 41: Data and Data Types

point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

L1      p1      p2      p3      p4
p1      0       4       4       6
p2      4       0       2       4
p3      4       2       0       2
p4      6       4       2       0

L2      p1      p2      p3      p4
p1      0       2.828   3.162   5.099
p2      2.828   0       1.414   3.162
p3      3.162   1.414   0       2
p4      5.099   3.162   2       0

L∞      p1      p2      p3      p4
p1      0       2       3       5
p2      2       0       1       3
p3      3       1       0       2
p4      5       3       2       0
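The three distance matrices can be reproduced from the point coordinates with a short Python sketch. It assumes that L1 is the Manhattan distance, L2 the Euclidean distance, and that the third matrix is the maximum-coordinate (L∞, Chebyshev) distance, which matches its values. The function names are illustrative.

```python
# Point coordinates from the table.
POINTS = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}


def manhattan(a, b):
    """L1: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    """L2: square root of the sum of squared coordinate differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def chebyshev(a, b):
    """L-infinity: largest absolute coordinate difference."""
    return max(abs(x - y) for x, y in zip(a, b))


def distance_matrix(points, metric):
    """Pairwise distances as a list of rows, in the dict's insertion order."""
    names = list(points)
    return [[metric(points[r], points[c]) for c in names] for r in names]


for metric in (manhattan, euclidean, chebyshev):
    print(metric.__name__)
    for row in distance_matrix(POINTS, metric):
        print([round(value, 3) for value in row])
```

Every matrix is symmetric with zeros on the diagonal, and for any pair of points the L1 value is at least the L2 value, which is at least the L∞ value.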

Page 42: Data and Data Types

Cosine Similarity

If A and B are two document (text) vectors, then cos(A, B) = (A · B) / (||A|| ||B||), where A · B is the dot product and ||A|| is the Euclidean length of A.

Example:

d1 = 3 2 0 5 0 0 0 2 0 0

d2 = 1 0 0 0 0 0 0 1 0 2
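A Python sketch for the example vectors d1 and d2, assuming the standard definition of cosine similarity (the dot product divided by the product of the vector lengths). The function name `cosine_similarity` is illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Term-count vectors from the slide.
d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]

# dot product = 3*1 + 2*1 = 5, ||d1|| = sqrt(42), ||d2|| = sqrt(6)
print(round(cosine_similarity(d1, d2), 4))
```

Only the terms that occur in both documents contribute to the dot product, so the measure ignores the many positions where both counts are zero.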

Page 43: Data and Data Types

Correlation measures the linear relationship between objects

Page 44: Data and Data Types

Correlation Ranges

Page 45: Data and Data Types

Correlation Calculation

x = (-3, -2, -1, 0, 1, 2, 3)

y = (9, 4, 1, 0, 1, 4, 9)

y_i = x_i^2

mean(x) = 0, mean(y) = 4

std(x) = 2.16, std(y) = 3.74
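The figures above can be reproduced, and the correlation itself computed, with a short Python sketch. It assumes Pearson correlation with sample (n - 1) standard deviations, which is what yields 2.16 and 3.74; the function name `pearson_correlation` is illustrative.

```python
import statistics


def pearson_correlation(x, y):
    """Pearson correlation using the sample covariance and sample std."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    return covariance / (statistics.stdev(x) * statistics.stdev(y))


x = [-3, -2, -1, 0, 1, 2, 3]
y = [value ** 2 for value in x]  # y_i = x_i^2

print(statistics.mean(x), statistics.mean(y))
print(round(statistics.stdev(x), 2), round(statistics.stdev(y), 2))
print(pearson_correlation(x, y))
```

The products of the deviations cancel exactly, so the correlation is 0 even though y is completely determined by x: correlation captures only linear relationships.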

