
WHAT IS DATA MINING? The process of automatically extracting useful information from large amounts...

Date post: 18-Jan-2018
Upload: albert-atkins
Description:
Cross Industry Standard Process for Data Mining
WHAT IS DATA MINING? The process of automatically extracting useful information from large amounts of data. Uses traditional data analysis techniques (statistics) and sophisticated computer algorithms to discover patterns. Uses machine learning techniques to find structural patterns within the data.
Transcript
Page 1: WHAT IS DATA MINING?  The process of automatically extracting useful information from large amounts of data.  Uses traditional data analysis techniques.

WHAT IS DATA MINING?

The process of automatically extracting useful information from large amounts of data.

Uses traditional data analysis techniques (statistics) and sophisticated computer algorithms to discover patterns.

Uses machine learning techniques to find structural patterns within the data.

Page 2:

Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems.

Traditional techniques may be unsuitable due to:
Enormity of data
High dimensionality of data
Heterogeneous, distributed nature of data

Origins of Data Mining

[Figure: Data Mining shown at the overlap of Machine Learning/Pattern Recognition, Statistics/AI, and Database systems]

Page 3:

Cross Industry Standard Process for Data Mining

Page 4:

The Process -- Simplified

Pre-processing, data mining, results validation

Page 5:

Two Basic Problem Classes

Prediction Methods: Use some variables to predict unknown or future values of other variables.

Description Methods: Find human-interpretable patterns that describe the data.

Page 6:

Basic Types of Data Mining Tasks

Classification (predictive)
Clustering (descriptive)
Association rules (descriptive)
Sequential patterns (descriptive or predictive)
Regression (predictive)
Anomaly Detection (predictive)

Page 7:

Data Mining Techniques

Statistical techniques
Clustering
Decision trees
Subsampling (bootstrapping)
Nearest-neighborhoods
SOM
Bayesian methods

Page 8:

Data Mining Techniques

Artificial Neural Nets
Deep Learning (Google DeepMind)
PCA
Universal Prediction
Reinforcement Learning
“Compression”
Sequence Prediction Techniques
Time Series Analysis

Page 9:

Data Mining Techniques

Hidden Markov Models
MLN
PLN
EDA (MOSES)
Random Forests
Feature Engineering
Unsupervised and Semi-Supervised Learning

Page 10:

DATA MINING TECHNIQUES

Entropy methods
Multifractal methods (time series)
Log-linear power laws (crash prediction)
Wavelet transforms
…. …. ….

Page 11:

CLASSIFICATION: Definition

Given a collection of records (the training set):
Each record contains a set of attributes; one of the attributes is the class.
Find a model for the class attribute as a function of the values of the other attributes.

Goal: previously unseen records should be assigned a class as accurately as possible.

A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
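The train/test procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records are made up, the helper names (`split_train_test`, `predict`, `accuracy`) are hypothetical, and a 1-nearest-neighbor rule stands in for "a model" because it is short; the slides do not prescribe a particular classifier.

```python
import math
import random

def split_train_test(records, test_fraction=0.3, seed=0):
    """Shuffle the records and hold out a fraction of them as the test set."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def predict(train, attributes):
    """1-nearest-neighbor: return the class of the closest training record."""
    nearest = min(train, key=lambda rec: math.dist(rec[0], attributes))
    return nearest[1]

def accuracy(train, test):
    """Fraction of held-out records whose predicted class matches the true class."""
    correct = sum(1 for attrs, label in test if predict(train, attrs) == label)
    return correct / len(test)

# Made-up records: (attributes, class). The class is the attribute to be predicted.
records = [
    ((1.0, 1.2), "low"), ((0.8, 1.0), "low"), ((1.3, 0.9), "low"),
    ((1.1, 1.4), "low"), ((0.9, 0.7), "low"),
    ((8.0, 8.2), "high"), ((7.8, 8.5), "high"), ((8.4, 7.9), "high"),
    ((8.1, 8.8), "high"), ((7.6, 8.0), "high"),
]

train, test = split_train_test(records)
print("test accuracy:", accuracy(train, test))
```

The test records play no part in building the model, so the reported accuracy estimates how well the model handles previously unseen records.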

Page 12:

CLUSTERING: Definition

Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that:
Data points in one cluster are more similar to one another.
Data points in separate clusters are less similar to one another.

Similarity Measures:
Euclidean distance if attributes are continuous.
Other problem-specific measures.
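As one possible illustration of clustering with Euclidean distance, the sketch below uses k-means. The choice of k-means, the function name `kmeans`, the naive initialization, and the sample points are all assumptions made for this example; the slides do not name a specific clustering algorithm.

```python
import math

def kmeans(points, k, iterations=10):
    """Minimal k-means sketch: assign each point to its nearest centroid
    (Euclidean distance), then move each centroid to the mean of its members."""
    centroids = [points[i] for i in range(k)]  # naive init: the first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids

# Made-up continuous data with two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

On this data the first three points end up in one cluster and the last three in the other. For attributes that are not continuous, `math.dist` would have to be replaced by a problem-specific similarity measure.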

Page 13:

ASSOCIATION RULE: Definition

Given a set of records, each of which contains some number of items from a given collection, produce dependency rules that predict the occurrence of an item based on occurrences of other items.
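A dependency rule of this kind is commonly scored by its support and confidence. Those two measures are not mentioned in the slides and are introduced here as an assumption, as are the helper names and the made-up shopping baskets. A minimal sketch:

```python
def count(transactions, itemset):
    """Number of records that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)

def support(transactions, itemset):
    """Fraction of all records that contain the itemset."""
    return count(transactions, itemset) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Among records containing the antecedent, the fraction that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return count(transactions, both) / count(transactions, antecedent)

# Made-up records: each one is a set of items from a fixed collection.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

print(support(transactions, {"diapers", "beer"}))
print(confidence(transactions, {"diapers"}, {"beer"}))
```

Here the rule {diapers} -> {beer} holds in 3 of the 4 baskets that contain diapers, so its confidence is 0.75, and the two items appear together in 3 of 5 baskets, a support of 0.6.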

Page 14:

SEQUENTIAL PATTERN: Definition

Given a set of objects, with each object associated with its own timeline of events, find rules that predict strong sequential dependencies among different events.

Rules are formed by first discovering patterns. Event occurrences in the patterns are governed by timing constraints.
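The sketch below shows one simple way to check a candidate pattern "A is followed by B" under a timing constraint. The maximum-gap constraint, the function names, and the customer timelines are assumptions chosen for illustration; real sequential-pattern miners also search for the candidate patterns themselves, which this example does not do.

```python
def follows_within(timeline, first, second, max_gap):
    """True if `second` occurs after `first` and at most `max_gap` time units later."""
    first_times = [t for t, event in timeline if event == first]
    second_times = [t for t, event in timeline if event == second]
    return any(0 < t2 - t1 <= max_gap for t1 in first_times for t2 in second_times)

def pattern_support(timelines, first, second, max_gap):
    """Fraction of objects whose timeline contains the pattern."""
    hits = sum(
        1 for timeline in timelines.values()
        if follows_within(timeline, first, second, max_gap)
    )
    return hits / len(timelines)

# Made-up data: each object (customer) has its own timeline of (day, event) pairs.
timelines = {
    "customer_1": [(1, "laptop"), (3, "mouse")],
    "customer_2": [(2, "laptop"), (20, "mouse")],   # gap too large
    "customer_3": [(5, "mouse"), (6, "laptop")],    # wrong order
    "customer_4": [(1, "laptop"), (4, "mouse"), (9, "bag")],
}

print(pattern_support(timelines, "laptop", "mouse", max_gap=7))
```

With a maximum gap of 7 days the pattern holds for two of the four customers, giving a support of 0.5. Loosening or tightening `max_gap` changes which occurrences count, which is the role the timing constraints play.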

Page 15:

REGRESSION: Definition

Predict the value of a given continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.

Extensively studied in statistics and in the neural network field.

Examples:

Predicting sales amounts of a new product based on advertising expenditure.

Predicting wind velocities as a function of temperature, humidity, air pressure, etc.

Time series prediction of stock market indices.
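For the linear case with a single predictor, the dependency can be fitted by ordinary least squares. The sketch below applies it to the advertising example; the numbers are invented for illustration and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept with one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data: advertising expenditure vs. sales, in arbitrary units.
advertising = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.1, 4.9, 7.2, 8.8, 11.0]

slope, intercept = fit_line(advertising, sales)
predicted = slope * 6.0 + intercept  # predict sales for an unseen expenditure
print(slope, intercept, predicted)
```

The fitted slope is about 1.97 and the intercept about 1.09. A nonlinear dependency would need a different model form, which this sketch does not cover.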

Page 16:

ANOMALY DETECTION: Definition

Detect significant deviations from normal behavior.

Applications:

Credit Card Fraud Detection

Network Intrusion Detection
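One very simple way to make "significant deviation from normal behavior" concrete is a z-score rule: flag values that lie more than a chosen number of standard deviations from the mean. The rule, the threshold of 2.0, the function name, and the transaction amounts below are assumptions for illustration; practical fraud or intrusion detection systems use considerably richer models.

```python
import statistics

def find_anomalies(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds
    `threshold` population standard deviations."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mean) / spread > threshold]

# Made-up card transaction amounts with one obviously unusual charge.
amounts = [20, 25, 22, 19, 24, 21, 23, 500]
print(find_anomalies(amounts))
```

On this data only the 500 charge is flagged. Because a single extreme value also inflates the mean and the standard deviation of such a small sample, a stricter cutoff of 3 would miss it here, which is one reason median-based measures are often preferred.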

Page 17:

DATA MINING CHALLENGES

Scalability
Dimensionality
Complex and Heterogeneous Data
Data Quality
Data Ownership and Distribution
Privacy Preservation
Streaming Data

