
Data Mining: Introduction

Hamid Beigy

Sharif University of Technology

Fall 1395


Table of contents

1 Introduction

2 Data mining process

3 Major problems in data mining

4 Outline of course




Introduction

Data mining is the study of collecting, cleaning, processing, analysing, and gaining useful insight from data.

Data mining is a broad term that is used to describe different aspects of data processing (depending on the problem domain, applications, formulation, and data representations).

Data mining is needed because virtually all automated systems generate some form of data, either for diagnostic or analysis purposes. Examples of different kinds of data:

World wide web (web logs, web graph, ...).

Financial interactions.

User interactions.

Sensors and internet of things.

We hope that we can extract concise and possibly actionable insight from available data for application-specific goals.

Data may be arbitrary, unstructured, and even in a format that is not immediately suitable for automated processing.

To address the above issues, data mining analysts use a pipeline of processing, where the new data is collected, cleaned, and transformed into a standardized format.


Field of data mining

[Figure: data mining and its related fields: machine learning, databases, visualization, statistics, high-performance computing, and algorithm design.]




Data mining process

The workflow of a data mining application contains the following phases.

Data collection: After the collection phase, the data is often stored in a database (more generally, a data warehouse).

Feature extraction and data cleaning: When the data is collected, it is often not in a form that is suitable for processing. For processing, it is transformed into a suitable format such as multidimensional, time-series, or semi-structured. The multidimensional format is the most common one, in which the different fields of a record correspond to the different measured properties, called features, attributes, or dimensions. The feature extraction phase is performed in parallel with data cleaning, where missing and erroneous parts of the data are either estimated or corrected. The data may be extracted from multiple sources and integrated into a unified format for processing.

Analytical processing and algorithms: The final part of the mining process is to design effective analytical methods from the processed data.


Data preprocessing phase

The data preprocessing phase is perhaps the most crucial one in the data mining process. This phase begins after the collection of the data, and it consists of the following steps:

Feature extraction: Extracting the right features is often a skill that requires an understanding of the specific application domain at hand.

Data cleaning: The extracted data may have erroneous or missing entries. Therefore, some records may need to be dropped, or missing entries may need to be estimated. Inconsistencies may need to be removed.

Feature selection and transformation: When the data is very high dimensional, many data mining algorithms do not work effectively. Furthermore, many of the high-dimensional features are noisy and may add errors to the data mining process. Therefore, a variety of methods are used to either remove irrelevant features or transform the current set of features to a new data space that is more amenable for analysis.

The data cleaning process requires statistical methods that are commonly used for missing data estimation.
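As a rough illustration of the cleaning and feature selection steps above, the following sketch fills missing numeric entries with the column mean and then drops columns that never vary. Mean imputation and the zero-variance rule are assumptions chosen for brevity, not methods prescribed by the course; the function names and the toy records are hypothetical.

```python
from statistics import mean, pvariance

def impute_mean(rows):
    """Replace None entries with the mean of the observed values in that column.

    Assumes every column has at least one observed value.
    """
    d = len(rows[0])
    col_means = []
    for j in range(d):
        observed = [r[j] for r in rows if r[j] is not None]
        col_means.append(mean(observed))
    return [[col_means[j] if r[j] is None else r[j] for j in range(d)] for r in rows]

def drop_constant_columns(rows):
    """Keep only columns whose population variance is non-zero."""
    d = len(rows[0])
    keep = [j for j in range(d) if pvariance([r[j] for r in rows]) > 0]
    return [[r[j] for j in keep] for r in rows], keep

# Invented records with two missing entries (None).
raw = [
    [1.0, 5.0, 2.0],
    [None, 5.0, 4.0],
    [3.0, 5.0, None],
]
cleaned = impute_mean(raw)
reduced, kept = drop_constant_columns(cleaned)
print(cleaned)
print(reduced, kept)
```

The middle column is constant, so it carries no information for distinguishing records and is removed; real feature selection methods use richer criteria than variance alone.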


Basic data types

One interesting aspect of the data mining process is the wide variety of data types that are available for analysis. There are two broad types of data for the data mining process:

Non-dependency oriented data: This typically refers to simple data types such as multidimensional data or text data. These data types are the simplest and most commonly encountered.

Dependency oriented data: In these cases, implicit or explicit relationships may exist between data items. For example, a social network data set contains a set of vertices (data items) that are connected together by a set of edges (relationships).

If the dependencies between data items are not explicitly specified but are known to typically exist in that domain, they are called implicit dependencies; an example is a sequence of temperatures measured by a sensor. Explicit dependencies refer to graph or network data in which edges are used to specify explicit relationships.
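The difference between explicit and implicit dependencies can be sketched in a few lines. The edge list, the readings, and the helper name `adjacency` below are all made up for illustration.

```python
from collections import defaultdict

# Explicit dependencies: edges of a small, invented social network.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]

def adjacency(edge_list):
    """Build an undirected adjacency map from a list of (u, v) edges."""
    neighbours = defaultdict(set)
    for u, v in edge_list:
        neighbours[u].add(v)
        neighbours[v].add(u)
    return dict(neighbours)

graph = adjacency(edges)

# Implicit dependencies: consecutive sensor readings are related by their
# time order, even though no edge between them is stored anywhere.
temperatures = [20.0, 20.5, 21.0, 20.5]
changes = [b - a for a, b in zip(temperatures, temperatures[1:])]

print(graph["c"])
print(changes)
```

In the graph, the relationships are part of the data itself; in the time series, they exist only through the ordering of the list.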


Non-dependency oriented data

This is the simplest form of data and typically refers to multidimensional data. This data typically contains a set of records. A record is also referred to as a data point, instance, example, transaction, entity, tuple, object, or feature vector, depending on the application at hand. Each record contains a set of fields, which are also referred to as attributes, dimensions, and features. These fields describe the different properties of that record.

Definition (Multidimensional Data)

A multidimensional data set $D$ is a set of $n$ records, $X_1, X_2, \ldots, X_n$, such that each record $X_i$ contains a set of $d$ features denoted by $(x_{i1}, x_{i2}, \ldots, x_{id})$.


Non-dependency oriented data

The following data set has two different data types.

The age field has values that are numerical in the sense that they have a natural ordering. Such attributes are referred to as continuous, numeric, or quantitative. Data in which all fields are quantitative is also referred to as quantitative data or numeric data.

Attributes such as gender, race, and ZIP code have discrete values without a natural ordering among them. Such attributes are referred to as unordered discrete-valued or categorical.

In the case of mixed attribute data, there is a combination of categorical and numeric attributes.
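A small sketch of a mixed-attribute data set may help here. The records below are invented, and the rule used to tell numeric from categorical fields (Python type checking) is only an assumption for illustration; the ZIP code is stored as a string because its digits carry no natural ordering.

```python
# Hypothetical records with fields (age, gender, zip_code).
records = [
    (45, "M", "05139"),
    (31, "F", "10598"),
    (56, "M", "10547"),
]
field_names = ["age", "gender", "zip_code"]

def attribute_kinds(rows, names):
    """Label a field 'numeric' if every value is an int or float, else 'categorical'."""
    kinds = {}
    for j, name in enumerate(names):
        column = [row[j] for row in rows]
        is_numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in column
        )
        kinds[name] = "numeric" if is_numeric else "categorical"
    return kinds

kinds = attribute_kinds(records, field_names)
n, d = len(records), len(field_names)  # n records, d features per record
print(n, d, kinds)
```

Because the result contains both kinds of attributes, this toy data set is a case of mixed attribute data.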




Major problems in data mining

Four problems in data mining are considered fundamental to the mining process. These problems correspond to clustering, classification, association pattern mining, and outlier detection, and they are encountered repeatedly in the context of many data mining applications.

To see why, one must understand the nature of the typical relationships that data scientists often try to extract from the data.

Consider a multidimensional database $D$ with $n$ records and $d$ attributes. Such a database may be represented as an $n \times d$ matrix $D$, in which each row corresponds to one record and each column corresponds to a dimension. We generally refer to this matrix as the data matrix.

Broadly speaking, data mining is all about finding summary relationships between the entries in the data matrix that are either unusually frequent or unusually infrequent. Relationships between data items are one of two kinds:

Relationships between columns: The frequent or infrequent relationships between the values in a particular row are determined. This maps into either the positive or negative association pattern mining problem. In some cases, one column is considered more important than the others, and the goal is to determine how the other columns relate to it. This problem is referred to as data classification.

Relationships between rows: The goal is to determine subsets of rows in which the values in the columns are related. In cases where the rows in these subsets are similar, the problem is referred to as clustering. When the entries in a row are very different from the entries in other rows, that row becomes interesting as an anomaly. This problem is referred to as outlier analysis.


Association pattern mining

In its most primitive form, the association pattern mining problem is defined in the context of sparse binary databases. Most customer transaction databases are of this type.

A particularly commonly studied version of this problem is the frequent pattern mining problem or, more generally, the association pattern mining problem.

Definition (Frequent pattern mining)

Given a binary $n \times d$ data matrix $D$, determine all subsets of columns such that all the values in these columns take on the value of 1 for at least a fraction $s$ of the rows in the matrix. The relative frequency of a pattern is referred to as its support.

Patterns that satisfy the minimum support requirement are often referred to as frequent patterns.

Frequent patterns represent an important class of association patterns.
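The definition above can be checked directly on a tiny binary matrix. The sketch below enumerates every non-empty subset of columns and keeps those whose support reaches the threshold; it is a brute-force illustration only (the number of subsets grows as 2^d), and the matrix, the threshold 0.5, and the function names are assumptions made for this example.

```python
from itertools import combinations

def support(matrix, columns):
    """Fraction of rows in which every listed column has the value 1."""
    hits = sum(1 for row in matrix if all(row[j] == 1 for j in columns))
    return hits / len(matrix)

def frequent_patterns(matrix, min_support):
    """Enumerate all non-empty column subsets with support >= min_support."""
    d = len(matrix[0])
    result = {}
    for size in range(1, d + 1):
        for columns in combinations(range(d), size):
            s = support(matrix, columns)
            if s >= min_support:
                result[columns] = s
    return result

# Invented 4 x 3 binary data matrix (rows are transactions, columns are items).
D = [
    [1, 1, 0],
    [1, 1, 1],
    [1, 0, 1],
    [0, 1, 0],
]
patterns = frequent_patterns(D, min_support=0.5)
for columns, s in patterns.items():
    print(columns, s)
```

With a minimum support of 0.5, the column pair (1, 2) is not frequent because both columns equal 1 in only one of the four rows.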


Data clustering

A broad and informal definition of the clustering problem is as follows:

Definition (Data Clustering)

Given a data matrix $D$, partition its rows (records) into sets $C_1, C_2, \ldots, C_k$, such that the rows (records) in each cluster are similar to one another.

An important part of the clustering process is the design of an appropriate similarity function for the computation process.

Some examples of relevant applications are as follows:

Customer segmentation

Data summarization

Application to other data mining problems
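One concrete way to partition rows is sketched below using k-means with Euclidean distance as the (dis)similarity function. Both choices are assumptions made for illustration; the definition above does not prescribe a method, and the points and starting centers are invented.

```python
import math

def nearest(point, centers):
    """Index of the center closest to point under Euclidean distance."""
    return min(range(len(centers)), key=lambda c: math.dist(point, centers[c]))

def k_means(points, centers, iterations=10):
    """Alternate between assigning points to centers and recomputing centers."""
    centers = [tuple(c) for c in centers]
    labels = [nearest(p, centers) for p in points]
    for _ in range(iterations):
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
        labels = [nearest(p, centers) for p in points]
    return labels, centers

points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
labels, centers = k_means(points, centers=[points[0], points[2]])
print(labels)
print(centers)
```

Rows with the same label form one cluster $C_j$. Changing the distance function changes which rows count as similar, which is why its design matters.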


Outlier Detection

An outlier is a data point that is significantly different from the remaining data.

Outliers are also referred to as abnormalities, discordants, deviants, or anomalies in the data mining and statistics literature.

Definition (Outlier Detection)

Given a data matrix $D$, determine the rows of the data matrix that are very different from the remaining rows in the matrix.

Some examples of relevant applications of outlier detection are as follows:

Intrusion-detection systems

Credit card fraud

Interesting sensor events

Medical diagnosis
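A minimal distance-based sketch of this definition follows. Scoring each row by its mean Euclidean distance to all other rows is just one possible assumption about what "very different" means; the data and function names are hypothetical.

```python
import math

def outlier_scores(rows):
    """Score each row by its mean Euclidean distance to all other rows."""
    n = len(rows)
    return [
        sum(math.dist(rows[i], rows[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]

def top_outlier(rows):
    """Index of the row with the largest outlier score."""
    scores = outlier_scores(rows)
    return max(range(len(rows)), key=lambda i: scores[i])

# Three nearby rows and one row far away from them (invented values).
D = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (10.0, 10.0)]
scores = outlier_scores(D)
most_unusual = top_outlier(D)
print(scores)
print(most_unusual)
```

In practice a threshold on the score, or a fixed number of top-scoring rows, decides which rows are reported as outliers.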


Data classification

Many data mining problems are directed toward a specialized goal that is sometimes represented by the value of a particular feature in the data. This particular feature is referred to as the class label.

Definition (Data Classification)

Given an $n \times d$ training data matrix $D$, and a class label value in $\{1, \ldots, k\}$ associated with each of the $n$ rows in $D$, create a training model $M$, which can be used to predict the class label of a $d$-dimensional record $X \notin D$.

Some examples of applications where the classification problem is used are as follows:

Target marketing: Features about customers are related to their buying behavior with the use of a training model.

Intrusion detection: The sequences of customer activity in a computer system may be used to predict the possibility of intrusions.

Supervised anomaly detection: The rare class may be differentiated from the normal class when previous examples of outliers are available.
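To make the roles of $D$, the class labels, and the model $M$ concrete, here is a sketch using a 1-nearest-neighbour classifier. This classifier is chosen only because it is short, not because the definition implies it; the training rows, labels, and function names are invented.

```python
import math

def train(D, labels):
    """'Training' for a nearest-neighbour model just stores the labelled rows."""
    return list(zip(D, labels))

def predict(model, x):
    """Return the label of the stored row closest to x (Euclidean distance)."""
    _, label = min(model, key=lambda pair: math.dist(pair[0], x))
    return label

# n = 4 training rows with d = 2 features and class labels in {1, 2}.
D = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
y = [1, 1, 2, 2]
M = train(D, y)

# Two unseen records, neither of which is a row of D.
prediction_a = predict(M, (2.0, 1.5))
prediction_b = predict(M, (8.5, 9.0))
print(prediction_a, prediction_b)
```

Unlike clustering, the groups here are given in advance through the labels, and the model is judged by how well it predicts labels of records outside $D$.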




Outline of course

1 Introduction

2 Data collection & preprocessing
    Data cleaning
    Data warehouse
    Feature extraction & selection

3 Frequent pattern mining
    Frequent itemset mining
    Summarizing itemsets
    Graph & sequence mining
    Evaluating patterns and association rules

4 Data clustering
    Representation-based & hierarchical clustering
    Density-based clustering
    Spectral & graph clustering
    Clustering validation

5 Data classification
    Probabilistic classifiers
    Decision tree classifiers
    Linear discriminant analysis & support vector machines
    Classification assessment

6 Outlier detection

7 Advanced topics & applications


References

Charu C. Aggarwal, Data Mining, Springer, 2015.

J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2012.

M. J. Zaki and W. Meira Jr., Data Mining and Analysis: Fundamental Concepts and Algorithms, Cambridge University Press, 2014.


Some relevant journals

1 IEEE Trans. on Pattern Analysis and Machine Intelligence

2 Journal of Machine Learning Research

3 Pattern Recognition

4 Machine Learning

5 Neural Networks

6 Neural Computation

7 Neurocomputing

8 IEEE Trans. on Neural Networks and Learning Systems

9 Annals of Statistics

10 Journal of the American Statistical Association

11 Pattern Recognition Letters

12 Artificial Intelligence

13 Data Mining and Knowledge Discovery

14 IEEE Transactions on Cybernetics (SMC-B)

15 IEEE Transactions on Knowledge and Data Engineering

16 Knowledge and Information Systems


Some relevant conferences

1 Neural Information Processing Systems (NIPS)

2 International Conference on Machine Learning (ICML)

3 European Conference on Machine Learning (ECML)

4 Asian Conference on Machine Learning (ACML)

5 Conference on Uncertainty in Artificial Intelligence (UAI)

6 Principles and Practice of Knowledge Discovery in Databases (PKDD)

7 International Joint Conference on Artificial Intelligence (IJCAI)

8 IEEE International Conference on Data Mining series (ICDM)


Relevant packages and datasets

1 Packages:

R http://www.r-project.org/

Weka http://www.cs.waikato.ac.nz/ml/weka/

RapidMiner http://rapidminer.com/

MOA http://moa.cs.waikato.ac.nz/

2 Datasets:

UCI Machine Learning Repository http://archive.ics.uci.edu/ml/

StatLib http://lib.stat.cmu.edu/datasets/

Delve http://www.cs.toronto.edu/~delve/data/datasets.html



Course evaluation

Evaluation:

Mid-term exam 30% (1395/8/15)

Final exam 30%

Quiz 15%

Homeworks 15%

Project 10%

Sum of all exams ≥ 7.2 for passing

Course page: http://ce.sharif.edu/courses/95-96/1/ce714-1/

TAs: Zohre Fallahnejad [email protected]



Papers for seminars

