Date post: | 30-Dec-2015 |
Upload: | christian-zurlo |
Page 1/71
Issues in Data Mining Infrastructure
Authors: Nemanja Jovanovic, [email protected] Milenkovic, [email protected] Prof. Dr. Veljko Milutinovic, [email protected]
http://galeb.etf.bg.ac.yu/~vm
Page 2/71
Data Mining in a Nutshell
Uncovering the hidden knowledge
Huge NP-complete search space
Multidimensional interface
Page 3/71
A Problem …
You are a marketing manager for a cellular phone company
Problem: Churn is too high
Bringing back a customer after quitting is both difficult and expensive
Giving a new telephone to everyone whose contract is expiring is very expensive (as well as wasteful)
You pay a sales commission of $250 per contract
Customers receive a free phone (cost $125) with the contract
Turnover (after contract expires) is 40%
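To get a feel for these numbers, the minimal sketch below estimates what the turnover costs. The base of 1,000 expiring contracts is hypothetical, and so is the assumption that every lost customer is replaced one-for-one by a newly acquired one; neither comes from the slide.

```python
# Figures taken from the slide.
SALES_COMMISSION = 250   # dollars paid per contract
PHONE_COST = 125         # dollars for the free phone
TURNOVER = 0.40          # share of customers leaving when the contract expires


def replacement_cost(expiring_contracts: int) -> float:
    """Cost of replacing every churned customer with a new contract.

    Assumption (not from the slide): each lost customer is replaced
    one-for-one by a newly acquired customer.
    """
    churned = expiring_contracts * TURNOVER
    return churned * (SALES_COMMISSION + PHONE_COST)


# Hypothetical base of 1,000 expiring contracts.
print(replacement_cost(1000))
```

Under these assumptions, 400 of the 1,000 customers leave, and each replacement costs $375 in commission and hardware.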
Page 4/71
… A Solution
Three months before a contract expires, predict which customers will leave
If you want to keep a customer that is predicted to churn, offer them a new phone
The ones that are not predicted to churn need no attention
If you don’t want to keep the customer, do nothing
How can you predict future behavior?
Tarot Cards?
Magic Ball?
Data Mining?
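The retention policy on this slide can be written as a small decision function. This is only a sketch: the function name, the two boolean inputs, and the returned strings are hypothetical, and the churn prediction itself is assumed to come from some model that is not shown here.

```python
def retention_action(predicted_to_churn: bool, worth_keeping: bool) -> str:
    """Return the action the slide suggests for one customer."""
    if not worth_keeping:
        return "do nothing"           # we do not want to keep this customer
    if predicted_to_churn:
        return "offer new phone"      # at-risk customer we want to keep
    return "no attention needed"      # not predicted to churn


print(retention_action(predicted_to_churn=True, worth_keeping=True))
```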
Page 6/71
The Definition
The automated extraction of predictive information from (large) databases
Key terms: automated, extraction, predictive, databases
Page 8/71
Repetition in Solar Activity
1613 – Galileo Galilei
1859 – Heinrich Schwabe
Page 9/71
The Return of Halley's Comet
Recorded returns: 239 BC, 1531, 1607, 1682, 1910, 1986
Next return: 2061 ???
Edmund Halley (1656 - 1742)
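The comet is an early example of predicting from recorded data. The sketch below extrapolates from the three returns Halley studied, assuming a constant period. That is a simplification, since the real interval varies by a year or two.

```python
# Returns listed on the slide for the years Halley studied.
observed = [1531, 1607, 1682]

# Gaps between consecutive returns.
intervals = [later - earlier for earlier, later in zip(observed, observed[1:])]
mean_interval = sum(intervals) / len(intervals)


def predict_next(last_return: int, period: float) -> float:
    """Extrapolate the next return, assuming a constant period."""
    return last_return + period


print(intervals, predict_next(observed[-1], mean_interval))
```

The extrapolation lands in late 1757, close to the 1758 return that Halley predicted. Applying the same period to 1986 gives a date in 2061.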
Page 10/71
Data Mining is Not
Data warehousing
Ad-hoc query/reporting
Online Analytical Processing (OLAP)
Data visualization
Page 11/71
Data Mining is
Automated extraction of predictive information from various data sources
Powerful technology with great potential to help users focus on the most important information stored in data warehouses or streamed through communication lines
Page 12/71
Data Mining can
Answer questions that were too time-consuming to resolve in the past
Predict future trends and behaviors, allowing us to make proactive, knowledge-driven decisions
Page 13/71
Focus of this Presentation
Data Mining problem types
Data Mining models and algorithms
Efficient Data Mining
Available software
Page 14/71
Data Mining Problem Types
Page 15/71
Data Mining Problem Types
6 types
Often a combination solves the problem
Page 16/71
Data Description and Summarization
Aims at a concise description of data characteristics
At the lower end of the scale of problem types
Gives the user an overview of the data structure
Typically a subgoal
Page 17/71
Segmentation
Separates the data into interesting and meaningful subgroups or classes
Manual or (semi)automatic
A problem in itself, or just a step in solving a problem
Page 18/71
Classification
Assumption: existence of objects with characteristics that belong to different classes
Builds classification models that assign correct class labels in advance
Occurs in a wide range of applications
Segmentation can provide labels or restrict data sets
Page 19/71
Concept Description
Understandable description of concepts or classes
Close connection to both segmentation and classification
Similarities to and differences from classification
Page 20/71
Prediction (Regression)
Similar to classification; the difference is that the discrete target becomes continuous
Finds the numerical value of the target attribute for unseen objects
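As a minimal illustration of predicting a continuous target, the sketch below fits a straight line by ordinary least squares and applies it to an unseen object. The slides do not prescribe any particular regression method. The single-attribute model and the data (months as a customer against monthly bill) are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Numerical value of the target attribute for an unseen x."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical training data: months as a customer -> monthly bill.
months = [1, 2, 3, 4, 5]
bill = [12.0, 14.0, 16.0, 18.0, 20.0]

model = fit_line(months, bill)
print(predict(model, 6))
```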
Page 21/71
Dependency Analysis
Finding the model that describes significant dependencies between data items or events
Prediction of value of a data item
Special case: associations
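For the association special case, dependencies are commonly scored by support and confidence. The slide names neither measure, so treat the sketch below as one conventional way to quantify "if A then B". The baskets are made-up data.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent.

    Raises ZeroDivisionError if the antecedent never occurs.
    """
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# Hypothetical purchase baskets.
baskets = [
    {"phone", "case"},
    {"phone", "case", "charger"},
    {"phone", "charger"},
    {"case"},
]

print(support(baskets, {"phone", "case"}))
print(confidence(baskets, {"phone"}, {"case"}))
```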
Page 23/71
Neural Networks
Characterizes processed data with a single numeric value
Efficient modeling of large and complex problems
Based on biological structures: neurons
Network consists of neurons grouped into layers
Page 24/71
Neuron Functionality
Diagram: inputs I1, I2, I3, …, In, weighted by W1, W2, W3, …, Wn, feed a function f that produces the output
Output = f (W1*I1, W2*I2, …, Wn*In)
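A common concrete choice for f is to sum the weighted inputs and pass the sum through an activation function. The sketch below makes that assumption and uses a sigmoid activation. The slide leaves f unspecified, so both the summation and the sigmoid are illustrative choices.

```python
import math


def sigmoid(x: float) -> float:
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def neuron_output(inputs, weights, f=sigmoid):
    """Apply f to the weighted sum W1*I1 + W2*I2 + ... + Wn*In."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    weighted_sum = sum(w * i for w, i in zip(weights, inputs))
    return f(weighted_sum)


# Hypothetical inputs and weights.
print(neuron_output([1.0, 2.0], [0.5, -0.25]))
```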
Page 25/71
Training Neural Networks
Page 26/71
Neural Networks - Conclusion
Once trained, Neural Networks can efficiently estimate the value of the output variable for a given input
Neurons and network topology are essential
Usually used for prediction or regression problem types
Difficult to understand
Data pre-processing often required
Page 27/71
Decision Trees
A way of representing a series of rules that lead to a class or value
Iterative splitting of data into discrete groups maximizing distance between them at each split
CHAID, CART, Quest, C5.0
Classification trees and regression trees
Unlimited growth and stopping rules
Univariate splits and multivariate splits
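One splitting step can be sketched as a search for the best univariate threshold. The sketch assumes Gini impurity as the measure of how well a split separates the groups. That is one common criterion, used in CART-style trees, and other algorithms on the list use different tests. The ages and labels are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted_impurity) for the best `value <= threshold` split."""
    best = (None, float("inf"))
    n = len(values)
    # Splitting at the largest value would leave the right side empty, so skip it.
    for threshold in sorted(set(values))[:-1]:
        left = [lab for v, lab in zip(values, labels) if v <= threshold]
        right = [lab for v, lab in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical customers: age and whether they churned.
ages = [22, 25, 30, 35, 41, 50]
churned = ["yes", "yes", "yes", "no", "no", "no"]

print(best_split(ages, churned))
```

A full tree would apply the same search recursively to each resulting group until a stopping rule is met.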
Page 28/71
Decision Trees
Diagram: example tree with the splits Balance>10 / Balance<=10, Age<=32 / Age>32, and Married=NO / Married=YES
Page 30/71
Rule Induction
Method of deriving a set of rules to classify cases
Creates independent rules that are unlikely to form a tree
Rules may not cover all possible situations
Rules may sometimes conflict in a prediction
Page 31/71
Rule Induction
If balance>100.000 then confidence=HIGH & weight=1.7
If balance>25.000 and status=married then confidence=HIGH & weight=2.3
If balance<40.000 then confidence=LOW & weight=1.9
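The three rules above can be applied as independent predicates. The slides do not say how the weights are combined when rules conflict, so the sketch below assumes the simplest scheme: add up the weights of all matching rules per class and pick the heaviest class. It also assumes that the dot in "100.000" is a thousands separator.

```python
# Each rule: (condition, predicted confidence class, weight), as on the slide.
RULES = [
    (lambda c: c["balance"] > 100_000, "HIGH", 1.7),
    (lambda c: c["balance"] > 25_000 and c["status"] == "married", "HIGH", 2.3),
    (lambda c: c["balance"] < 40_000, "LOW", 1.9),
]


def classify(customer):
    """Return the class with the largest total weight, or None if no rule fires.

    Assumption (not from the slide): weights of matching rules are summed
    per class to resolve conflicts.
    """
    totals = {}
    for condition, label, weight in RULES:
        if condition(customer):
            totals[label] = totals.get(label, 0.0) + weight
    if not totals:
        return None  # the rules do not cover this case
    return max(totals, key=totals.get)


# Rules 2 and 3 both fire and disagree; the heavier HIGH rule wins.
print(classify({"balance": 30_000, "status": "married"}))
# No rule covers an unmarried customer with a balance of 50,000.
print(classify({"balance": 50_000, "status": "single"}))
```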
Page 32/71
K-nearest Neighbor and Memory-Based Reasoning (MBR)
Uses knowledge of previously solved, similar problems to solve a new problem
Assigns the new case to the class to which most of its k "neighbors" belong
First step: finding a suitable measure of distance between attribute values in the data
How far is black from green?
+ Easy handling of non-standard data types
- Huge models
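A minimal k-nearest-neighbor classifier is sketched below, assuming purely numeric attributes and Euclidean distance. For attributes such as color, a custom distance function would have to replace `math.dist`. The training cases are hypothetical.

```python
import math
from collections import Counter


def knn_classify(training, query, k=3):
    """Majority vote among the k training cases closest to `query`.

    training: list of (feature_vector, label) pairs
    query:    feature vector of the new case
    """
    by_distance = sorted(training, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical customers: (age, balance in thousands) -> class label.
training = [
    ((25, 5), "LOW"),
    ((30, 8), "LOW"),
    ((28, 6), "LOW"),
    ((45, 90), "HIGH"),
    ((50, 120), "HIGH"),
    ((48, 100), "HIGH"),
]

print(knn_classify(training, (27, 7)))
```

The "model" is the stored training set itself, which is why such models grow large.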
Page 33/71
K-nearest Neighbor and Memory-Based Reasoning (MBR)
Page 34/71
Data Mining Models and Algorithms
Logistic regression
Discriminant analysis
Generalized Additive Models (GAM)
Genetic algorithms
Etc…
Many other available models and algorithms
Many application-specific variations of known models
Final implementation usually involves several techniques
Selection of the solution that gives the best results
Page 36/71
Flowchart (humorous troubleshooting diagram with YES/NO branches)
Decision nodes: Is It Working? / Did You Mess With It? / Anyone Else Knows? / Can You Blame Someone Else? / Will it Explode In Your Hands?
Outcomes: Don’t Mess With It! / You Shouldn’t Have! / You’re in TROUBLE! / Hide It / Look The Other Way / NO PROBLEM!
Page 37/71
DM Process Model
CRISP-DM – tends to become a standard
5A – used by SPSS Clementine (Assess, Access, Analyze, Act and Automate)
SEMMA – used by SAS Enterprise Miner (Sample, Explore, Modify, Model and Assess)
Page 38/71
CRISP-DM
CRoss-Industry Standard Process for DM
Conceived in 1996 by three companies:
Page 39/71
CRISP-DM methodology
Four-level breakdown of the CRISP-DM methodology:
Phases
Generic Tasks
Specialized Tasks
Process Instances
Page 40/71
Mapping generic models to specialized models
Analyze the specific context
Remove any details not applicable to the context
Add any details specific to the context
Specialize generic contents according to the concrete characteristics of the context
Possibly rename generic contents to provide more explicit meanings
Page 41/71
Generalized and Specialized Cooking
Generalized: preparing food on your own
Find out what you want to eat
Find the recipe for that meal
Gather the ingredients
Prepare the meal
Enjoy your food
Clean up everything (or leave it for later)
Specialized:
Raw steak with vegetables?
Check the cookbook or call mom
Defrost the meat (if you had it in the fridge)
Buy missing ingredients or borrow them from the neighbors
Cook the vegetables and fry the meat
Enjoy your food or even more
You were cooking, so convince someone else to do the dishes
Page 42/71
CRISP-DM model
Business understanding
Data understanding
Data preparation
Modeling
Evaluation
Deployment
Page 43/71
Customizing a Web Page
User-friendly design
Prediction of the user's interests
Reduction of server workload
Reduction of Web traffic
Page 45/71
Business Understanding
Determine business objectives
Assess situation
Determine data mining goals
Produce project plan
Page 46/71
Business Understanding - Outputs
Background
Business objectives and success criteria
Inventory of resources
Requirements, assumptions, and constraints
Risks and contingencies
Terminology
Costs and benefits
Data mining goals and success criteria
Project plan
Initial assessment of tools and techniques
Page 47/71
Customizing a Web Page – Business Understanding Example
Business objectives
Assess situation
Data mining goals
Project plan
Make the user's surfing more comfortable
Decrease overhead for users
Reduce server workload and Web traffic
Find the patterns in the user behavior
Page 48/71
Data Understanding
Collect initial data
Describe data
Explore data
Verify data quality
Page 49/71
Data Understanding - Outputs
Data collection report
  Background of data
  List of data sources
  For each data source, method of acquisition
  Problems encountered in data acquisition
Data description report
  Detailed description of each data source
  List of tables or other database objects
  Description of each field including units, codes, etc.
Data exploration report
  Expected regularities or patterns and methods of detection
  Regularities or patterns found (expected and unexpected)
  Any other surprises
  Conclusions for data transformation, data cleaning and any other pre-processing
  Conclusions related to data mining goals or business objectives
Data quality report
  Approach taken to assess data quality
  Results of data quality assessment
Page 50/71
Customizing a Web Page – Data Understanding Example
Collecting the data
Data description
Results of data exploring
Verification of the quality of the data
Update the server to monitor user behavior
Record the users' activities in storage
Analyze recorded data
Decide which data is usable for mining
Page 51/71
Data Preparation
Select data
Clean data
Construct data
Integrate data
Format data
Page 52/71
Data Preparation - Outputs
Dataset description report
  Background including broad goals and plan for pre-processing
  Description of pre-processing
  Detailed description of resultant datasets
  Rationale for inclusion/exclusion of attributes
  Discoveries made during pre-processing and implications for further work
Dataset
Page 53/71
Customizing a Web Page – Data Preparation Example
Decide which period of monitored user actions to consider
Make assumptions about unnecessary monitored data and discard it
Classify user actions into categories, group interesting links, etc…
If more information about the user is available from other sources, use it
Transform data into suitable forms so that several modeling techniques can be applied
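The first three preparation steps above can be sketched on a toy click log. The record fields, the `/img/` rule for discarding data, and the use of the first path segment as a category are all hypothetical choices made for illustration. A real server log would need its own rules.

```python
from datetime import date

# Hypothetical monitored actions; the field names are illustrative only.
log = [
    {"user": "u1", "day": date(2001, 3, 1), "url": "/news/sports"},
    {"user": "u1", "day": date(2001, 1, 5), "url": "/news/politics"},
    {"user": "u2", "day": date(2001, 3, 2), "url": "/img/logo.gif"},
    {"user": "u2", "day": date(2001, 3, 3), "url": "/shop/phones"},
]


def prepare(records, start, end):
    """Select, clean and categorize monitored actions."""
    rows = []
    for record in records:
        if not (start <= record["day"] <= end):
            continue  # select: outside the chosen period
        if record["url"].startswith("/img/"):
            continue  # clean: assumed to be irrelevant page assets
        category = record["url"].split("/")[1]  # construct: first path segment
        rows.append((record["user"], category))
    return rows


prepared = prepare(log, date(2001, 3, 1), date(2001, 3, 31))
print(prepared)
```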
Page 54/71
Modeling
Select modeling technique
Generate test design
Build model
Assess model
Page 55/71
Modeling - Outputs
Assessment of DM results with respect to business success criteria
Test design
  Broad description of the type of model and the training data to be used
  Explanation of how the model will be tested or assessed
  Description of any data required for testing
  Description of any planned examination of models by domain or data experts
Model description
  Type of model and relation to data mining goals
  Parameter settings used to produce model
  Detailed description of the model and any special features
  Conclusions regarding patterns in the data
Model assessment
  Overview of assessment process including deviations from the test plan
  Detailed assessment of the model
  Comments on models by domain or data experts
  Insights into why a certain modeling technique and certain parameter setting lead to good/bad results
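A test design often comes down to holding back part of the data and scoring the model on it. The sketch below shows one such holdout scheme. The 25% test fraction, the fixed seed, and the accuracy measure are illustrative assumptions, not something the slides prescribe.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and hold back a fraction for testing."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(model, test_rows):
    """Share of held-back (features, label) rows the model labels correctly."""
    correct = sum(1 for features, label in test_rows if model(features) == label)
    return correct / len(test_rows)


# Hypothetical labeled rows: (feature, label), where the label is the parity.
rows = [(n, n % 2) for n in range(8)]
train, test = train_test_split(rows)

# A stand-in "model" that happens to know the rule.
print(len(train), len(test), accuracy(lambda n: n % 2, test))
```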
Page 56/71
Customizing a Web Page – Modeling Example
The problem is prediction of behavior
Regression could be a good solution due to distinct nature of the data
Create the software according to the project plan
Observe the behavior of the software
Tune the model after each evaluation phase if needed
Page 57/71
Evaluation
Evaluate results
Review process
Determine next steps
results = models + findings
Page 58/71
Evaluation - Outputs
Assessment of DM results with respect to business success criteria
  Review of Business Objectives and Business Success Criteria
  Comparison between success criterion and DM results
  Conclusion about achievability of success criterion and suitability of data mining process
  Review of “Project Success”
  Are there new business objectives?
Review of process
List of possible actions
Page 59/71
Customizing a Web Page – Evaluation Example
Observe the model behavior at work
Collect response from Beta testers
Check user satisfaction
Check server and network engagement
Classify results
Determine which parameter of the model should be changed
Present new ideas and modifications
Step back into previous phases as needed
Page 60/71
Deployment
Plan deployment
Plan monitoring and maintenance
Produce final report
Review project
Page 61/71
Deployment - Outputs
Monitoring and maintenance plan
  Overview of deployment results and indication of which results may require updating
  Description of how updating will be triggered
  Description of how updating will be performed
Final report
  Summary of Business Understanding (background, objectives and success criteria)
  Summary of data mining process
  Summary of data mining results
  Summary of results evaluation
  Summary of deployment and maintenance plan
  Cost/benefit analysis
  Conclusions for the business
  Conclusions for future data mining
Page 62/71
Customizing a Web Page – Deployment Example
Make the feature available to all users
Make a plan for maintenance and user feedback
Analyze costs and benefits
Summarize the whole documentation
Summarize network and server additional activity
Collect the new ideas
Award according to results
Leave space for upgrade
Page 65/71
Available Software
Discussion of data mining vendors and software is not included in this slide set
Page 70/71
Credits
Anne Stern, SPSS, Inc.
Djuro Gluvajic, ITE, Denmark
Obrad Milivojevic, PC PRO, Yugoslavia
Page 71/71
References
Bruha, I., “Data Mining, KDD and Knowledge Integration: Methodology and a Case Study”, SSGRR 2000
Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R., “Advances in Knowledge Discovery and Data Mining”, MIT Press, 1996
Glymour, C., Madigan, D., Pregibon, D., Smyth, P., “Statistical Themes and Lessons for Data Mining”, Data Mining and Knowledge Discovery 1, 11-28, 1997
Hecht-Nielsen, R., “Neurocomputing”, Addison-Wesley, 1990
Pyle, D., “Data Preparation for Data Mining”, Morgan Kaufmann, 1999
galeb.etf.bg.ac.yu/~vm
www.thearling.com
www.crisp-dm.com
www.twocrows.com
www.sas.com/products/miner
www.spss.com/clementine