CRISP-DM methodology for computational modelling projects

Date post: 27-Jan-2015
Page 1: CRISP-DM methodology for computational modelling projects

CRISP-DM: a methodology for applied computational modelling

(required for CMD cw2)

Based on Intro to Data Mining: CRISP-DM

Prof Chris Clifton, Purdue Univ

Thanks to Laura Squier, SPSS, for some of the material used

CS490D 2

A methodology for projects

• Cross-Industry Standard Process for Data Mining (CRISP-DM): computational modelling of datasets to derive “knowledge”
• European Community funded effort to develop a framework for data mining and computational modelling projects
• Goals:
  – Encourage interoperable tools across the entire data mining process
  – Take the mystery/high-priced expertise out of simple data mining and computational modelling tasks

Why Should There be a Standard Process?

• Framework for recording experience
  – Allows projects to be replicated
• Aid to project planning and management
• “Comfort factor” for new adopters
  – Demonstrates maturity of computational modelling and data mining
  – Reduces dependency on “stars”

The process must be reliable and repeatable by people with little data mining background.

Process Standardization

• CRoss Industry Standard Process for Data Mining
• Initiative launched Sept. 1996
• http://www.crisp-dm.org/
• SPSS/ISL, NCR, Daimler-Benz, OHRA
• Funding from European Commission
• Over 200 members of the CRISP-DM SIG worldwide
  – DM Vendors - SPSS, NCR, IBM, SAS, SGI, Data Distilleries, Syllogic, Magnify, ...
  – System Suppliers / consultants - Cap Gemini, ICL Retail, Deloitte & Touche, ...
  – End Users - BT, ABB, Lloyds Bank, AirTouch, Experian, ...

CRISP-DM

• Non-proprietary
• Application/Industry neutral
• Tool neutral
• Focus on business issues
  – modelling is core, but the methodology includes steps before and after involving client business
• Framework for guidance
• Experience base
  – Templates for new applications

CRISP-DM: Overview

CRISP-DM: Phases

• Business Understanding
  – Understanding application domain, project objectives and requirements
  – Data mining problem definition
• Data Understanding
  – Initial data collection and familiarization
  – Identify data quality issues
  – Initial, obvious results
• Data Preparation
  – Record and attribute selection
  – Data cleansing
• Modeling
  – Run the computational modelling tools, to derive results
• Evaluation
  – Determine if results meet business objectives
  – Identify business issues that should have been addressed earlier
• Deployment
  – Put the resulting models into practice
  – Set up for repeated/continuous mining of the data

Phases and Tasks (each task is followed by its outputs)

• Business Understanding
  – Determine Business Objectives: Background; Business Objectives; Business Success Criteria
  – Situation Assessment: Inventory of Resources; Requirements, Assumptions, and Constraints; Risks and Contingencies; Terminology; Costs and Benefits
  – Determine Data Mining Goal: Data Mining Goals; Data Mining Success Criteria
  – Produce Project Plan: Project Plan; Initial Assessment of Tools and Techniques
• Data Understanding
  – Collect Initial Data: Initial Data Collection Report
  – Describe Data: Data Description Report
  – Explore Data: Data Exploration Report
  – Verify Data Quality: Data Quality Report
• Data Preparation
  – Data Set: Data Set Description
  – Select Data: Rationale for Inclusion / Exclusion
  – Clean Data: Data Cleaning Report
  – Construct Data: Derived Attributes; Generated Records
  – Integrate Data: Merged Data
  – Format Data: Reformatted Data
• Modeling
  – Select Modeling Technique: Modeling Technique; Modeling Assumptions
  – Generate Test Design: Test Design
  – Build Model: Parameter Settings; Models; Model Description
  – Assess Model: Model Assessment; Revised Parameter Settings
• Evaluation
  – Evaluate Results: Assessment of Data Mining Results w.r.t. Business Success Criteria; Approved Models
  – Review Process: Review of Process
  – Determine Next Steps: List of Possible Actions; Decision
• Deployment
  – Plan Deployment: Deployment Plan
  – Plan Monitoring and Maintenance: Monitoring and Maintenance Plan
  – Produce Final Report: Final Report; Final Presentation
  – Review Project: Experience Documentation

Example applications

• PAST cw applying CRISP-DM:
  – Technologies for Knowledge Discovery (MSc Module, discontinued): analysis of patient data from Duke Hospital referrals
  – CMD last year: analysis of Schools League Tables to choose a “good” name for a school
• YOU will also need CRISP-DM for YOUR coursework (not the same)

Phases in the DM Process (1)

• Business Understanding:
  – Statement of Business Objective
  – Statement of Data Mining objective
  – Statement of Success Criteria

Phases in TKD’04 cw (1)

• Business Understanding:
  – Explore patient data, what can be predicted?
  – Explore evidence for/against hypotheses
  – Statement of Success Criteria: evidence found?

Phases in CMD’05 cw (1)

• Business Understanding:
  – Business Objective: school name reflecting good performance
  – Data Mining objective: find name attributes which predict performance
  – Success Criteria: evidence for new school name

Phases in the DM Process (2)

• Data Understanding
  – Explore the data and verify the quality
  – Find outliers
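
As an illustration of the "find outliers" step, here is a minimal Python sketch. It uses the common 1.5 × IQR rule, which is an assumption of this example and not a method prescribed by the slides; the function name `iqr_outliers` and the sample scores are hypothetical.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    # quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical attribute values; one record is clearly out of line
scores = [52, 55, 57, 58, 60, 61, 63, 64, 66, 5]
print(iqr_outliers(scores))
```

Whether a flagged record is an error or a genuine but unusual case is a judgement for the Data Understanding phase; the rule only points at candidates.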

Phases in TKD’04 cw (2)

• Data Understanding
  – Explore the data and verify the quality: small dataset can be explored with visualization tools
  – Find outliers: sparse dataset, many outliers!

Phases in CMD’05 cw (2)

• Data Understanding
  – Explore data, decide metric for good/bad “performance”
  – Select and download data for LEAs
  – Find outliers, e.g. Special Schools

Phases in the DM Process (3)

Data preparation:
• Usually takes over 90% of the time
  – Collection
  – Assessment
  – Consolidation and Cleaning
    • table links, aggregation level, missing values, etc.
  – Data selection
    • active role in ignoring non-contributory data?
    • outliers?
    • Use of samples
    • visualization tools
  – Transformations - create new variables
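
Two of the steps listed above, handling missing values and creating a new variable, can be sketched in a few lines of Python. The records, field names (`pupils`, `passes`, `pass_rate`) and the choice of mean imputation are all assumptions made for illustration; a real project would pick a strategy that suits its data.

```python
from statistics import mean

# Hypothetical records; None marks a missing value
records = [
    {"school": "A", "pupils": 120, "passes": 90},
    {"school": "B", "pupils": 80, "passes": None},
    {"school": "C", "pupils": 100, "passes": 60},
]

def impute_mean(rows, field):
    """Cleaning: replace missing entries of `field` with the mean of the known ones."""
    fill = mean(r[field] for r in rows if r[field] is not None)
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]

def add_pass_rate(rows):
    """Transformation: derive a new variable from two existing ones."""
    return [{**r, "pass_rate": r["passes"] / r["pupils"]} for r in rows]

prepared = add_pass_rate(impute_mean(records, "passes"))
for row in prepared:
    print(row)
```

Both functions return new lists and leave the raw records untouched, so the original data stays available if the cleaning decisions are revisited later.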

Phases in the TKD’04 cw (3)

Data preparation:
• Usually takes over 90% of the time
  – Not a lot to do: data supplied in ARFF format!
  – Transformations - create new variables – but not in this case.

Phases in CMD’05 cw (3)

Data preparation:
• Usually takes over 90% of the time
  – Download, save as text-file
  – Data selection
    • ignore non-contributory data, e.g. SEN
    • Outliers, e.g. Special Schools
    • Use of samples?
  – Transformations - create new variables: features of name e.g. Grammar, High, town-name, syllables; mapping from existing data to “useful” variables may be non-trivial!
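
To make the "features of name" transformation concrete, here is a rough Python sketch. The function name, the feature names and the example school name are hypothetical, and the syllable count is only a crude heuristic (it counts groups of vowels), which itself shows why the mapping to “useful” variables can be non-trivial.

```python
import re

def name_features(name):
    """Derive simple candidate attributes from a school name."""
    words = name.lower().split()
    return {
        "has_grammar": "grammar" in words,
        "has_high": "high" in words,
        "n_words": len(words),
        # Crude heuristic: one syllable per run of vowels in each word
        "approx_syllables": sum(len(re.findall(r"[aeiouy]+", w)) for w in words),
    }

# Hypothetical school name, not taken from the league tables
features = name_features("Oakfield High School")
print(features)
```

A town-name feature would need an external list of place names to match against, so it is left out of this sketch.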

Phases in the DM Process (4)

• Model(l)ing
• Selection of the modeling techniques is based upon the data mining objective
  – Modeling can be an iterative process - different for supervised and unsupervised learning; may model for either description or prediction

Phases in the TKD’04 cw (4)

• Model building
• Try different WEKA models and options
• Easy to “play and see what happens”
• Keep output or screenshots for models which seem to show evidence

Phases in CMD’05 cw (4)

• Model building
  – data mining objective: find name attributes which predict good performance
  – Try various types of model, note any which provide evidence linking an attribute to performance

Phases in the DM Process (5)

• Model Evaluation
  – Evaluation of model: how well it performed on training/test data
  – Methods and criteria depend on model type:
    • e.g., confusion matrix, mean error rate
  – More importantly: are results “useful” for business objective?
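
The two evaluation measures named above can be computed directly from a list of actual and predicted class labels. The following is a minimal Python sketch; the labels and predictions are made up for illustration and do not come from any coursework dataset.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))

def error_rate(actual, predicted):
    """Fraction of cases where the prediction differs from the actual label."""
    wrong = sum(a != p for a, p in zip(actual, predicted))
    return wrong / len(actual)

# Hypothetical test-set labels and model predictions
actual    = ["good", "good", "bad", "bad",  "good", "bad"]
predicted = ["good", "bad",  "bad", "good", "good", "bad"]

cm = confusion_matrix(actual, predicted)
for (a, p), count in sorted(cm.items()):
    print(f"actual={a:<4} predicted={p:<4} count={count}")
print("error rate:", error_rate(actual, predicted))
```

The off-diagonal entries of the matrix (actual and predicted differ) show which kind of mistake the model makes, which the single error rate hides. Neither number says whether the result is useful for the business objective.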

Phases in TKD’04 cw (5)

• Model Evaluation
  – Evaluation of model: how well it performed on training/test data
  – Methods and criteria depend on model type:
    • e.g., confusion matrix, mean error rate
  – More importantly: evidence for/against hypotheses about medical predictions

Phases in CMD’05 cw (5)

• Model Evaluation
  – Error rates, confusion matrix with names predicting “good” schools
  – Evaluation wrt business objectives: have you found any indicators of a “good” school name?
  – Don’t just present the model, explain how it is evidence

Phases in the DM Process (6)

• Deployment
  – Report to client
  – Determine how the results need to be utilized, who needs to use them?
  – How often do they need to be used – should the exercise be repeated?

Phases in TKD’04 cw (6)

• Deployment
  – Report to client (me!)
  – Submit cw, but no need to think of “future plans”

Phases in CMD’05 cw (6)

• Deployment
  – Write a Report, section for each CRISP-DM Phase, plus Appendices
  – Deployment section of report: review of the exercise
• Client should deploy results by:
  – Utilizing results as business rules: ?LGS/LGHS merger?

Why CRISP-DM?

• The data mining process must be reliable and repeatable by people with little data mining skill (e.g. IT consultants, students?...)
• CRISP-DM provides a uniform framework for
  – guidelines
  – experience documentation
• CRISP-DM is flexible to account for differences
  – Different business/agency problems
  – Different data types (numeric, database, text, ...)

CRISP-DM: Summary

• Business Understanding
  – Understanding project objectives and requirements
  – Data mining problem definition
• Data Understanding
  – Initial data collection and familiarization
  – Identify data quality issues
  – Initial, obvious results
• Data Preparation
  – Record and attribute selection
  – Data cleansing
• Modeling
  – Run the computational modelling tools
• Evaluation
  – Determine if results meet business objectives
  – Identify business issues that should have been addressed earlier
• Deployment
  – Put the resulting models into practice
  – Set up for repeated/continuous mining of the data
