Introduction to Data Mining

Date post: 05-Jan-2016
Upload: ayan-chakravorty
Description:
A brief introduction of the topic Data Mining. Useful for computer science college students.
Transcript
Page 1: Introduction to Data Mining

Data Mining

Page 2: Introduction to Data Mining

Data Mining References

Jiawei Han and Micheline Kamber, “Data Mining: Concepts and Techniques”, Morgan Kaufmann Publishers, Elsevier, 3rd Edition, 2012.

Margaret H. Dunham, “Data Mining: Introduction and Advanced Topics”, Pearson Education, 2006.

Pang-Ning Tan, Michael Steinbach and Vipin Kumar, “Introduction to Data Mining”, Pearson Education, 2006.

Richard O. Duda, Peter E. Hart and David G. Stork, “Pattern Classification”, Wiley Publication, 2nd Edition, 2000.

Ian H. Witten, Eibe Frank and Mark A. Hall, “Data Mining: Practical Machine Learning Tools and Techniques”, Morgan Kaufmann Publishers, Elsevier, 3rd Edition, 2011.

IEEE Transactions
  Knowledge and Data Engineering

ACM Transactions
  Information Systems
  Database Systems
  Internet Technology

Page 3: Introduction to Data Mining

Data Mining Objectives

Data Mining or Knowledge Discovery from Data

OBJECTIVES

Understanding basic data mining concepts and techniques: uncovering interesting data patterns hidden in large data sets

Development of data mining tools: scalable and efficient

Page 4: Introduction to Data Mining

Evolution of Sciences

Before 1600, empirical science

1600-1950s, theoretical science
  Each discipline has grown a theoretical component. Theoretical models often motivate experiments and generalize our understanding.

1950s-1990s, computational science
  Over the last 50 years, most disciplines have grown a third, computational branch (e.g. empirical, theoretical, and computational ecology, or physics, or linguistics).
  Computational science traditionally meant simulation. It grew out of our inability to find closed-form solutions for complex mathematical models.

1990-now, data science
  The flood of data from new scientific instruments and simulations
  The ability to economically store and manage petabytes of data online
  The Internet and computing Grid that make all these archives universally accessible
  Scientific information management, acquisition, organization, query, and visualization tasks scale almost linearly with data volumes.
  Data mining is a major new challenge!

Page 5: Introduction to Data Mining

Evolution of Database Technology

1960s: data creation and collection in electronic mode
  IMS: hierarchical database system by IBM
  network DBMS

1970s: relational data model
  relational DBMS implementation

1980s: RDBMS
  advanced data models: extended-relational, OO, deductive, etc.
  application-oriented DBMS: spatial, scientific, engineering, etc.

Page 6: Introduction to Data Mining

Evolution of Database Technology

1990s:
  Data mining
  Data warehousing
  Multimedia databases
  Web databases

2000s:
  Stream data management and mining
  Data mining and its applications
  Web technology: XML, data integration, social networks, global information systems

Page 7: Introduction to Data Mining

DM Evolution

Page 8: Introduction to Data Mining

Data Mining Importance

The Explosive Growth of Data:
  Terabytes (2^40 bytes)
  Petabytes
  Exabytes
  Zettabytes

Drowning in DATA, but STARVING for KNOWLEDGE!

Data Tombs to Golden Nuggets

“Necessity is the Mother of Invention” (Plato, Greek philosopher and mathematician)

Data Mining: automated analysis of massive data sets
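
The storage units named on this slide are successive powers of 1024 when the binary definition is used (the slide gives a terabyte as 2^40 bytes). A minimal sketch of that arithmetic follows; the `UNITS` table and `to_terabytes` helper are illustrative names, not part of the original deck.

```python
# Binary storage units as powers of two (1 terabyte = 2**40 bytes).
UNITS = {
    "terabyte": 2**40,
    "petabyte": 2**50,
    "exabyte": 2**60,
    "zettabyte": 2**70,
}

def to_terabytes(n_bytes: int) -> float:
    """Convert a byte count to terabytes (binary definition)."""
    return n_bytes / UNITS["terabyte"]

for name, size in UNITS.items():
    # bit_length() - 1 recovers the exponent of an exact power of two.
    exponent = size.bit_length() - 1
    print(f"1 {name} = 2**{exponent} bytes = {to_terabytes(size):,.0f} TB")
```

Each step up the list multiplies the volume by 1024, which is why algorithms that are merely adequate at terabyte scale may become impractical a few units later.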

Page 9: Introduction to Data Mining

Data Mining Definition

Data mining definition:
  Extraction or mining of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from large amounts of data stored in databases, data warehouses, or other information repositories

Alternative names:
  knowledge discovery (mining) in databases (KDD)
  knowledge extraction
  data/pattern analysis
  data archeology
  data dredging
  information harvesting
  business intelligence
  etc.

Page 10: Introduction to Data Mining

Knowledge Discovery (KDD) Process

Data mining: core of the knowledge discovery process

Flow of the diagram (data stages flush left, processing steps indented):

Databases
  Data Cleaning (remove noise and inconsistent data)
  Data Integration (combine multiple data sources)
Data Warehouse
  Selection (retrieve relevant data)
  Transformation (summary, aggregation, etc.)
Task-relevant Data
  Data Mining (intelligent methods applied to extract patterns)
Patterns
  Pattern Evaluation (identify truly interesting patterns representing knowledge)
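
The KDD steps on this slide can be sketched as a chain of small functions, one per step. This is only a toy illustration under stated assumptions: the two "stores", their records, and the 0.5 support threshold are hypothetical, and "mining" is reduced to computing relative item frequencies.

```python
from collections import Counter

# Two hypothetical sources with noisy, inconsistent records.
store_a = [{"customer": "ann", "item": "Milk "}, {"customer": "bob", "item": None}]
store_b = [{"customer": "ann", "item": "bread"}, {"customer": "cy", "item": "milk"},
           {"customer": "bob", "item": "MILK"}]

def clean(records):
    """Data cleaning: drop records with missing items, normalise spelling."""
    return [{**r, "item": r["item"].strip().lower()} for r in records if r["item"]]

def integrate(*sources):
    """Data integration: combine several cleaned sources into one collection."""
    return [r for source in sources for r in source]

def select(records, field):
    """Selection: keep only the task-relevant attribute."""
    return [r[field] for r in records]

def transform(values):
    """Transformation: aggregate raw values into counts."""
    return Counter(values)

def mine(counts, total):
    """Data mining: turn counts into candidate patterns (relative frequencies)."""
    return {item: n / total for item, n in counts.items()}

def evaluate(patterns, min_support):
    """Pattern evaluation: keep patterns that pass an interestingness threshold."""
    return {item: s for item, s in patterns.items() if s >= min_support}

warehouse = integrate(clean(store_a), clean(store_b))
items = select(warehouse, "item")
patterns = mine(transform(items), total=len(items))
knowledge = evaluate(patterns, min_support=0.5)
print(knowledge)
```

Keeping each step as its own function mirrors the diagram: the output of one stage (warehouse, task-relevant data, patterns) is the input of the next.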

Page 11: Introduction to Data Mining

Data Mining TOOLS

Explore!

R

Python

WEKA

SPSS

Orange

Clementine

And many more...

References: “DM Papers”

Page 12: Introduction to Data Mining

Data Mining and Business Intelligence

Layers of the pyramid, from top to bottom (potential to support business decisions increases toward the top):

Decision Making
Data Presentation: Visualization Techniques
Data Mining: Information Discovery
Data Exploration: Statistical Summary, Querying, and Reporting
Data Preprocessing/Integration, Data Warehouses
Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems

Roles, from top to bottom: End User, Business Analyst, Data Analyst, DBA

Page 13: Introduction to Data Mining

Data Mining: Confluence of Multiple Disciplines

Data Mining draws on:
  Database Technology
  Statistics
  Machine Learning
  Pattern Recognition
  Algorithm
  Visualization
  Other Disciplines

Page 14: Introduction to Data Mining

Why Not Traditional Data Analysis?

Tremendous amount of data
  Algorithms must be highly scalable to handle data volumes such as terabytes

High dimensionality of data
  Micro-array data may have tens of thousands of dimensions

High complexity of data
  Data streams and sensor data
  Time-series data, temporal data, sequence data
  Structured data, graphs, social networks and multi-linked data
  Heterogeneous databases and legacy databases
  Spatial, spatiotemporal, multimedia, text and Web data
  Software programs, scientific simulations

New and sophisticated applications

Page 15: Introduction to Data Mining

Data Mining: Classification Schemes

General functionality
  Descriptive data mining
  Predictive data mining

Different views lead to different classifications
  Data view: kinds of data to be mined
  Knowledge view: kinds of knowledge to be discovered
  Method view: kinds of techniques utilized
  Application view: kinds of applications adapted
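
The descriptive/predictive split under "General functionality" can be illustrated on one small data set: a descriptive task summarises the data already collected, while a predictive task fits a model and infers a value that was not observed. The sketch below is an assumption-laden example, not the deck's own method: the spend/sales figures are hypothetical and a least-squares line stands in for a predictive model.

```python
from statistics import mean, stdev

# Hypothetical data: monthly advertising spend and resulting sales.
spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [2.1, 3.9, 6.2, 8.0, 9.8]

# Descriptive: summarise properties of the data already collected.
summary = {"mean_sales": mean(sales), "stdev_sales": stdev(sales)}

# Predictive: fit a model on existing data and infer an unseen value.
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return slope, y_bar - slope * x_bar

slope, intercept = fit_line(spend, sales)
predicted = slope * 6.0 + intercept  # spend of 6.0 was never observed

print(summary)
print(f"predicted sales at spend 6.0: {predicted:.2f}")
```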

Page 16: Introduction to Data Mining

Multi-Dimensional View of Data Mining

Data to be mined
  Relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW

Knowledge to be mined
  Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
  Multiple/integrated functions and mining at multiple levels

Techniques utilized
  Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc.

Applications adapted
  Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.
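
Of the kinds of knowledge listed on this slide, association is perhaps the simplest to show concretely. A rule such as bread -> milk is usually judged by its support (how often both items occur together) and its confidence (how often milk appears when bread does). The sketch below computes both on a hypothetical set of market-basket transactions; the data and function names are illustrative only.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    return support(antecedent | consequent) / support(antecedent)

print("support(bread, milk):", support({"bread", "milk"}))
print("confidence(bread -> milk):", confidence({"bread"}, {"milk"}))
```

Here bread and milk appear together in two of the four baskets, and milk appears in two of the three baskets that contain bread.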

Page 17: Introduction to Data Mining

Data Warehousing

Consolidation of data from several databases, which are in turn maintained by individual business units, along with historical and summary information.


Roll-up
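
Roll-up, the one label that survives from this slide's figure, is the OLAP operation that aggregates data to a coarser level of a dimension hierarchy, which is how a warehouse produces summary information. A minimal sketch, assuming a hypothetical city -> country location hierarchy and made-up sales figures:

```python
from collections import defaultdict

# Hypothetical sales facts at the (country, city) level of a location hierarchy.
sales = [
    ("India", "Delhi", 120),
    ("India", "Mumbai", 80),
    ("Japan", "Tokyo", 150),
    ("Japan", "Osaka", 50),
]

def roll_up(facts):
    """Aggregate city-level sales up to the country level."""
    totals = defaultdict(int)
    for country, _city, amount in facts:
        totals[country] += amount
    return dict(totals)

print(roll_up(sales))
```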

Page 18: Introduction to Data Mining

Multi-Tiered Architecture

Data Sources: Operational DBs, other sources
  Extract, Transform, Load, Refresh
  Monitor & Integrator; Metadata

Data Storage: Data Warehouse, Data Marts
  Serve

OLAP Engine: OLAP Server

Front-End Tools: Analysis, Query, Reports, Data mining

Page 19: Introduction to Data Mining

Data Mining Research Publications

Tayal, D. K., Jain, A., Arora, S., Agarwal, S., Gupta, T. and Tyagi, N., “Crime Detection and Criminal Identification in India Using Data Mining Techniques”, Artificial Intelligence & Society (AIS), Springer, vol. 30, no. 1, pp. 117-127, Feb 2015. [Indexed: Scopus, Google Scholar, EBSCO, ACM Digital Library, DBLP]

Jain, A., Yadav, D., and Tayal, D. K., “NER for Hindi Language Using Association Rules”, International Conference on Data Mining and Intelligent Computing (ICDMIC 2014), IGDTUW Delhi, India, IEEE, 5th-6th Sept 2014. [Indexed: Scopus]
