CSE5230 - Data Mining, 2002 Lecture 1.1

Date post: 10-Jul-2015

Data Mining - CSE5230

David Squire

[email protected]

Room 5.23A B Block, Caulfield

Ph. 9903 1033

(thanks to Robert Redpath for initial development of course resources)

CSE5230/DMS/2002/1

Lecture Outline

◆ Unit Outline
◆ Definitions of Data Mining
◆ A Case Study
◆ The Process of Knowledge Discovery
❖ Data Selection
❖ Data Preprocessing
❖ Data Mining
◆ Data Mining Tasks
◆ Data Mining Techniques
◆ Data Mining & Data Warehousing, OLAP

Course outline

◆ Objectives
◆ Assessment
◆ Lectures, the lecturer and consultation
◆ Recommended reading
◆ Unit web site

Objectives

◆ To develop knowledge of techniques and methods for data mining in large databases, including both those currently being used and those which are presently being researched

◆ At the end of the unit the student should be able to

❖ describe the algorithms underlying the most common state-of-the-art data mining tools

❖ make an informed choice of data mining tool for a given problem.

Assessment

◆ Students will form groups of three or four. Each group will prepare a research paper on a particular data mining technique and its applications.

◆ At the end of the semester, all the group papers will be bound together to form a book, where each paper will be a chapter. The production of this book is the aim of the whole class for the semester.

❖ Group research paper: 70%
❖ Presentation of the paper: 20%
❖ Individual literature survey: 5%
❖ Attendance at student paper presentations: 5%

◆ See the Unit Outline handout for further details

Lectures

◆ The lectures will be held in lecture room S2.32 from 4:00 p.m. to 6:00 p.m. on Mondays.

◆ Notes for each week will be made available on the subject web page in PowerPoint and Postscript formats

❖ It is your responsibility to ensure that you have copies of all notes, including the assignments

Lecturer and Tutorials

◆ Lecturer:
David Squire
Room 5.23A, Building B - Caulfield campus
Email: [email protected]
Phone: 9903 1033

◆ Tutorial Times:
❖ Monday 6:00pm to 8:00pm - B3.44, B3.46
❖ Monday 8:00pm to 10:00pm - B3.48

(note: no tutorials in week 1)

Recommended Reading (1)

◆ There is no prescribed text. Many books have relevant chapters for the unit:

❖ Berry M.J.A. & Linoff G.; Data Mining Techniques: For Marketing, Sales, and Customer Support; John Wiley & Sons, Inc.; 1997

❖ Cabena P., Hadjinian P., Stadler R., Verhees J., Zanasi A.; Discovering Data Mining: From Concept to Implementation; Prentice Hall PTR; 1998

❖ Fayyad U., Piatetsky-Shapiro G., Smyth P., and Uthurusamy R. (eds); Advances in Knowledge Discovery and Data Mining; AAAI Press; 1996

❖ Kennedy R.L., Lee Y., Van Roy B., Reed C.D., Lippman R.P.; Solving Data Mining Problems Through Pattern Recognition; Prentice Hall PTR; 1997

❖ Witten I.H. and Frank E.; Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations; Morgan Kaufmann; 1999

Recommended Reading (2)

◆ You will also have to read extensively in journals and conference proceedings to prepare your research papers. Many links to these resources are provided at the unit web site: http://www.csse.monash.edu.au/courseware/cse5230/

◆ Information on the site will include:
❖ Lectures (in PowerPoint and Postscript formats)
❖ Links relevant to the subject
❖ Other relevant documents and information

◆ You should check the unit web site each week

What is Data Mining?

Group Exercise

◆ Break into groups of 4 or 5 (i.e. your neighbours, don’t move around the room)

◆ Take 5 minutes to write down a definition of data mining - this can be in point form

◆ After 5 minutes, we will collect definitions from the class

Definitions of Data Mining (1)

◆ Many Definitions

❖ “Data mining is an interdisciplinary field bringing together techniques from machine learning, pattern recognition, statistics, databases, and visualization to address the issue of information extraction from large data bases” (Evangelos Simoudis in Cabena et al.)

❖ “Data mining is the extraction of implicit, previously unknown, and potentially useful information from data” (Witten & Frank)

❖ “Data mining… is the exploration and analysis, by automatic or semiautomatic means, of large quantities of data in order to discover meaningful patterns and rules” (Berry & Linoff)

❖ “Data mining is a term usually applied to techniques that can be used to find underlying structure and relationships in large amounts of data” (Kennedy et al.)

Definitions of Data Mining (2)

◆ Use of analytical tools to discover knowledge in a collection of data
❖ The knowledge takes the form of patterns, relationships and facts which would not otherwise be immediately apparent

◆ These analytical tools may be drawn from a number of disciplines, which include:
❖ machine learning
❖ pattern recognition
❖ statistics
❖ artificial intelligence
❖ human-computer interaction
❖ information visualization
❖ and many more...

Data Mining

◆ Why has the area appeared?
❖ Large volumes of data stored by organizations in a competitive environment combined with advances in technologies which can be applied to the data

◆ Background and evolution
❖ The failure of traditional approaches

◆ The need for Data Mining
❖ Niche marketing, customer retention, the internet

◆ The means to implement Data Mining
❖ The data warehouse, the available computing power, effective modeling approaches

A Case Study - Data Preparation (Cabena et al. page 106)

◆ Health Insurance Commission Australia
❖ 550Gb online; 1300Gb in 5 year history DB
❖ Aim to prevent fraud and inappropriate practice
❖ Considered 6.8 million visits requesting up to 20 pathology tests and 17,000 doctors
❖ Descriptive variables were added to the GP records
❖ Records were pivoted to create separate records for each pathology test
❖ Records were then aggregated by provider number (GP)
❖ An association discovery operation was carried out
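
The pivot and aggregation steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the field names (`provider`, `tests`), the test codes and the sample visits are hypothetical and are not taken from the case study.

```python
from collections import Counter, defaultdict

# Hypothetical visit records: one row per visit, listing the tests requested.
visits = [
    {"provider": "GP001", "tests": ["FBC", "LFT"]},
    {"provider": "GP001", "tests": ["FBC"]},
    {"provider": "GP002", "tests": ["LFT", "TSH", "FBC"]},
]

def pivot(visits):
    """Create one (provider, test) record per requested pathology test."""
    return [(v["provider"], t) for v in visits for t in v["tests"]]

def aggregate_by_provider(records):
    """Count how often each provider requested each test."""
    totals = defaultdict(Counter)
    for provider, test in records:
        totals[provider][test] += 1
    return totals

records = pivot(visits)
profile = aggregate_by_provider(records)
```

Here `records` holds one row per requested test, and `profile` maps each provider number to its test counts, which is the shape an association discovery step could then work on.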

An Association Rule

◆ The Rule
❖ When a customer buys a shirt, in 70% of cases, he or she will also buy a tie
❖ The Confidence Factor is 70%

◆ The Support Factor
❖ This occurs in 13.5% of all purchases
❖ The Support Factor is 13.5%
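
As a minimal sketch of how the two factors can be computed, assume each purchase is represented as a set of item names. The baskets below are made up for illustration and do not reproduce the 70% and 13.5% figures above.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing `antecedent`, the fraction that also
    contain `consequent`. Assumes the antecedent occurs at least once."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)

# Hypothetical purchases.
baskets = [
    {"shirt", "tie"},
    {"shirt", "tie", "belt"},
    {"shirt"},
    {"socks"},
    {"tie"},
]
rule_support = support(baskets, {"shirt", "tie"})          # 2 of 5 baskets
rule_confidence = confidence(baskets, {"shirt"}, {"tie"})  # 2 of 3 shirt baskets
```

Support is measured against all purchases, while confidence is measured only against the purchases that contain the left-hand side of the rule.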

Case Study - Modeling and Analysis (1)

◆ Rules with a confidence factor greater than 50% were considered
◆ The software Intelligent Miner (IBM) was used
◆ The level of support was gradually reduced
❖ i.e. the number of records to which the rule applied was reduced
◆ Rules considered to be noise were excluded.
◆ Domain knowledge indicated that some tests should be excluded and more useful rules were revealed
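
The effect of gradually lowering the support threshold can be illustrated with a brute-force sketch. This is not how Intelligent Miner works internally; the function `pair_rules`, the test names A to D and the episode data are all hypothetical, and only rules with one item on each side are considered.

```python
from itertools import permutations

def pair_rules(transactions, min_support, min_confidence=0.5):
    """Return (antecedent, consequent, support, confidence) for single-item rules."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in permutations(items, 2):
        count_a = sum(a in t for t in transactions)
        count_ab = sum(a in t and b in t for t in transactions)
        sup, conf = count_ab / n, count_ab / count_a
        # Keep rules meeting the support floor with confidence above 50%.
        if sup >= min_support and conf > min_confidence:
            rules.append((a, b, sup, conf))
    return rules

# Hypothetical episodes, each the set of tests requested together.
episodes = [
    {"A", "B"}, {"A", "B"}, {"A", "B", "C"}, {"A"},
    {"C", "D"}, {"B"}, {"B"}, {"B", "C"},
]
# Lower the support threshold step by step and collect the rules found.
results = {s: pair_rules(episodes, min_support=s) for s in (0.5, 0.25, 0.1)}
```

With this toy data, no rule survives at 50% support, two rules appear at 25%, and a third appears at 10%. The third rule rests on a single record, which is the kind of rule that might be judged to be noise.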

Case Study - Modeling and Analysis (2)

◆ GP profiling was carried out
◆ The new segments were related back to existing classifications of GPs
◆ Some rules corresponded to expensive tests that could be substituted

[Figure: case study workflow. Labels: Episodes Database; GP Database; Data Preparation; Merge; Association Discovery (rules at 1% support, e.g. "If test A then test B will occur in 62% of cases"); Database Segmentation (Segment 1: 97 GPs, Score = 1.8; Segment 2: 206 GPs, Score = 2.7)]

Data Mining for Business Decision Support

(From Berry & Linoff 1997)

◆ Identify the business problem
◆ Use data mining techniques to transform the data into actionable information
◆ Act on information
◆ Measure the results

The Process of Knowledge Discovery (1)

◆ Pre-processing
❖ data selection
❖ cleaning
❖ coding

◆ Data Mining
❖ select a model
❖ apply the model

◆ Analysis of results and assimilation
❖ Take action and measure the results

The Process of Knowledge Discovery (2)

[Figure: The Knowledge Discovery in Databases (KDD) process (Adriaans/Zantinge). Stages: Data selection; Cleaning & Enrichment (domain consistency, de-duplication, disambiguation); Coding; Data mining (clustering, segmentation, prediction); Reporting. Other labels: Operational data; External data; Information; Requirement; Action; Feedback]

Data Selection

◆ Identify the relevant data, both internal and external to the organization

◆ Select the subset of the data appropriate for the particular data mining application

◆ Store the data in a database separate from the operational systems

Data Preprocessing (1)

◆ Cleaning
❖ Domain consistency: replace certain values with null
❖ De-duplication: customers are often added to the DB on each purchase transaction
❖ Disambiguation: highlighting ambiguities for a decision by the user
» e.g. if names differed slightly but addresses were the same
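
De-duplication and disambiguation might be sketched as below, assuming customer records are plain dictionaries. The records, the 0.8 similarity threshold and the use of `difflib.SequenceMatcher` for name similarity are illustrative choices, not part of the lecture material.

```python
from difflib import SequenceMatcher

# Hypothetical customer records; id 4 is an exact repeat of id 1.
customers = [
    {"id": 1, "name": "J. Smith", "address": "12 High St"},
    {"id": 2, "name": "J Smith", "address": "12 High St"},
    {"id": 3, "name": "A. Jones", "address": "7 Low Rd"},
    {"id": 4, "name": "J. Smith", "address": "12 High St"},
]

def deduplicate(records):
    """Keep the first record for each exact (name, address) pair."""
    seen, unique = set(), []
    for r in records:
        key = (r["name"], r["address"])
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

def ambiguous_pairs(records, threshold=0.8):
    """Flag pairs with the same address and similar, but not identical, names."""
    flagged = []
    for i, a in enumerate(records):
        for b in records[i + 1:]:
            if a["address"] == b["address"] and a["name"] != b["name"]:
                ratio = SequenceMatcher(None, a["name"], b["name"]).ratio()
                if ratio >= threshold:
                    flagged.append((a["id"], b["id"]))
    return flagged

unique = deduplicate(customers)
to_review = ambiguous_pairs(unique)
```

The flagged pairs are only highlighted, not merged: as the slide says, the decision is left to the user.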

Data Preprocessing (2)

◆ Enrichment
❖ Additional fields are added to records from external sources which may be vital in establishing relationships.

◆ Coding
❖ e.g. take addresses and replace them with regional codes
❖ e.g. transform birth dates into age ranges

◆ It is often necessary to convert continuous data into range data for categorization purposes.
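
The two coding examples can be sketched as follows. The postcode-prefix lookup table and its region names are invented for illustration, as is the choice of ten-year age bands.

```python
from datetime import date

# Hypothetical lookup from postcode prefix to a regional code.
REGION_BY_POSTCODE_PREFIX = {"30": "REGION-A", "31": "REGION-B"}

def age_range(birth_date, today, width=10):
    """Map a birth date to a range label such as '30-39'."""
    # Subtract one year if this year's birthday has not happened yet.
    age = today.year - birth_date.year - (
        (today.month, today.day) < (birth_date.month, birth_date.day)
    )
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def region_code(postcode):
    """Replace a postcode with a coarse regional code, or None if unknown."""
    return REGION_BY_POSTCODE_PREFIX.get(postcode[:2])

label = age_range(date(1970, 8, 15), today=date(2002, 3, 4))
region = region_code("3145")
```

Both functions replace a fine-grained value with a category, which is the conversion from continuous data to range data described above.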

Data Mining

◆ Preliminary Analysis
❖ Much interesting information can be found by querying the data set
❖ May be supported by a visualization of the data set.

◆ Choose one or more modeling approaches

◆ There are two styles of data mining
❖ Hypothesis testing
❖ Knowledge discovery

◆ The styles and approaches are not mutually exclusive

Data Mining Tasks

◆ Various taxonomies exist. Berry & Linoff define 6 tasks:
❖ Classification
❖ Estimation
❖ Prediction
❖ Affinity Grouping
❖ Clustering
❖ Description

◆ The tasks are also referred to as operations. Cabena et al. define 4 operations:
❖ Predictive Modeling
❖ Database Segmentation
❖ Link Analysis
❖ Deviation Detection

Classification

◆ Classification involves considering the features of some object then assigning it to some pre-defined class, for example:

❖ Spotting fraudulent insurance claims
❖ Which phone numbers are fax numbers
❖ Which customers are high-value
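
To make the idea concrete, here is a minimal sketch of one possible classifier, a 1-nearest-neighbour rule, applied to the high-value customer example. The features, labelled examples and class names are hypothetical, and the features are left unscaled for brevity.

```python
import math

# Hypothetical labelled customers: (annual spend, purchases per year) -> class.
training = [
    ((5200.0, 40), "high-value"),
    ((4800.0, 35), "high-value"),
    ((300.0, 3), "low-value"),
    ((450.0, 5), "low-value"),
]

def classify(features, training):
    """Assign the class of the nearest labelled example (1-nearest-neighbour)."""
    _, label = min(training, key=lambda ex: math.dist(ex[0], features))
    return label

predicted = classify((4100.0, 28), training)
```

The classes are fixed in advance by the labelled examples; the classifier only decides which of them a new object belongs to.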

Estimation

◆ Estimation deals with numerically valued outcomes rather than discrete categories as occurs in classification.

❖ Estimating the number of children in a family
❖ Estimating family income
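
One simple way to produce a numerically valued outcome is a least-squares line. The sketch below assumes, purely for illustration, that family income is estimated from years of education; the data are invented.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical data: years of education vs family income (in $1000s).
education = [10, 12, 14, 16]
income = [30.0, 40.0, 50.0, 60.0]
slope, intercept = fit_line(education, income)
estimate = slope * 13 + intercept  # estimated income for 13 years of education
```

Unlike classification, the output is a number on a continuous scale rather than one of a fixed set of classes.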

Prediction

◆ Essentially the same as classification and estimation but involves future behaviour

◆ Historical data is used to build a model explaining behaviour (outputs) for known inputs

◆ The model developed is then applied to current inputs to predict future outputs

❖ Predict which customers will respond to a promotion
❖ Classifying loan applications
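
The two-step pattern (build a model from historical data, then apply it to current inputs) might look like this in its simplest form. The segments, customer names, response history and 0.5 cutoff are all hypothetical, and a per-segment response rate stands in for a real model.

```python
from collections import defaultdict

# Hypothetical history: (customer segment, responded to a past promotion?).
history = [
    ("student", True), ("student", True), ("student", False),
    ("retiree", False), ("retiree", False), ("retiree", True),
    ("family", True), ("family", True),
]

def build_model(history):
    """Learn the historical response rate of each segment."""
    counts = defaultdict(lambda: [0, 0])  # segment -> [responses, total]
    for segment, responded in history:
        counts[segment][0] += responded
        counts[segment][1] += 1
    return {seg: r / n for seg, (r, n) in counts.items()}

def predict_responders(model, current_customers, cutoff=0.5):
    """Predict which current customers are likely to respond."""
    return [name for name, seg in current_customers if model.get(seg, 0.0) > cutoff]

model = build_model(history)
likely = predict_responders(
    model, [("Ann", "student"), ("Bob", "retiree"), ("Cy", "family")]
)
```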

Affinity Grouping

◆ Affinity grouping is also referred to as Market Basket Analysis

◆ A common example is which items are bought together at the supermarket. Once this is known, decisions can be made on, for example:

❖ how to arrange items on the shelves
❖ which items should be promoted together
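
Finding which items are bought together can start from a simple co-occurrence count, sketched here with made-up supermarket baskets.

```python
from collections import Counter
from itertools import combinations

# Hypothetical supermarket baskets.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "cereal"},
    {"bread", "butter", "cereal"},
]

def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # Sorting gives each pair a single canonical (alphabetical) order.
        counts.update(combinations(sorted(basket), 2))
    return counts

counts = pair_counts(baskets)
top_pair, top_count = counts.most_common(1)[0]
```

The most frequent pair is a candidate for placing side by side on the shelves or promoting together.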

Clustering

◆ Clustering is also sometimes referred to as segmentation (though this has other meanings in other fields)

◆ In clustering there are no pre-defined classes. Self-similarity is used to group records. The user must attach meaning to the clusters formed

◆ Clustering often precedes some other data mining task, for example:

❖ once customers are separated into clusters, a promotion might be carried out based on market basket analysis of the resulting cluster
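
A minimal sketch of grouping records by similarity alone, using plain k-means (one clustering technique among many). The points and the fixed starting centroids are assumptions chosen so that the example is reproducible.

```python
import math

def k_means(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster; keep it if the cluster is empty.
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical two-feature customer records.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

The algorithm returns two groups but no names for them: deciding what each cluster means is left to the user.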

Description

◆ A good description of data can provide understanding of behaviour

◆ The description of the behaviour can suggest an explanation for it as well

◆ Statistical measures can be useful in describing data, as can techniques that generate rules

Deviation Detection

◆ Records whose attributes deviate from the norm by significant amounts are also called outliers

◆ Application areas include:
❖ fraud detection
❖ quality control
❖ tracing defects.

◆ Visualization techniques and statistical techniques are useful in finding outliers

◆ A cluster which contains only a few records may in fact represent outliers
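
As one example of a statistical technique for finding outliers, the sketch below flags values that lie more than a chosen number of standard deviations from the mean. The threshold of 2 and the data are illustrative assumptions.

```python
from statistics import mean, pstdev

def outliers(values, z_threshold=2.0):
    """Return values lying more than `z_threshold` standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # all values identical, so nothing deviates
    return [v for v in values if abs(v - mu) / sigma > z_threshold]

# Hypothetical number of tests ordered per visit by one provider.
tests_per_visit = [3, 4, 3, 5, 4, 3, 4, 20]
flagged = outliers(tests_per_visit)
```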

Data Mining Techniques

◆ Query tools
◆ Decision Trees
◆ Memory-Based Reasoning
◆ Artificial Neural Networks
◆ Genetic Algorithms
◆ Association and sequence detection
◆ Statistical Techniques
◆ Visualization
◆ Others (Logistic regression, Generalized Additive Models (GAM), Multivariate Adaptive Regression Splines (MARS), K Means Clustering, ...)

Data Mining and the Data Warehouse

◆ Organizations realized that they had large amounts of data stored (especially of transactions) but it was not easily accessible

◆ The data warehouse provides a convenient data source for data mining. Some data cleaning has usually occurred. It exists independently of the operational systems

❖ Data is retrieved rather than updated
❖ Indexed for efficient retrieval
❖ Data will often cover 5 to 10 years

◆ A data warehouse is not a pre-requisite for data mining

Data Mining and OLAP

◆ Online Analytic Processing (OLAP)
◆ Tools that allow a powerful and efficient representation of the data
◆ Makes use of a representation known as a cube
◆ A cube can be sliced and diced
◆ OLAP provides reporting with aggregation and summary information but does not reveal patterns, which is the purpose of data mining
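
A toy sketch of the cube idea, assuming a small fact table with three dimensions. The dimension names, the sales figures and the helper functions `roll_up` and `slice_cube` are hypothetical; real OLAP tools precompute and index such aggregates.

```python
from collections import defaultdict

# Hypothetical sales facts: (region, product, year, amount).
facts = [
    ("North", "shirt", 2001, 100.0),
    ("North", "tie", 2001, 40.0),
    ("South", "shirt", 2001, 70.0),
    ("North", "shirt", 2002, 120.0),
]
DIMENSIONS = ("region", "product", "year")

def roll_up(facts, keep):
    """Sum the amount over every dimension not listed in `keep`."""
    idx = [DIMENSIONS.index(d) for d in keep]
    totals = defaultdict(float)
    for *dims, amount in facts:
        totals[tuple(dims[i] for i in idx)] += amount
    return dict(totals)

def slice_cube(facts, dimension, value):
    """Fix one dimension to a single value (a 'slice' of the cube)."""
    i = DIMENSIONS.index(dimension)
    return [f for f in facts if f[i] == value]

by_region = roll_up(facts, keep=("region",))
shirts_by_year = roll_up(slice_cube(facts, "product", "shirt"), keep=("year",))
```

Both results are summaries the analyst asked for explicitly. Nothing here discovers a pattern the analyst did not already think to query, which is the distinction from data mining drawn above.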

