Date post: | 05-Jan-2016 |
Lecture 10
Dr. Nawaz Khan, School of Computing Science, E-mail: [email protected]
BIS4435
Lecture: Data Mining
Dr. Nawaz Khan, School of Computing Science, E-mail: [email protected]
BIS4229 – Industrial Data Management Technologies
Reading Assignment
- Core text:
  - GC DL materials on the WebCT: Unit 11
  - Connolly, T. and Begg, C., 2002, Database Systems: A Practical Approach to Design, Implementation, and Management, Addison Wesley, Harlow, England
- Additional reading:
  - Elmasri, R. and Navathe, S. B., 2004, Fundamentals of Database Systems, 4th Edition, Addison-Wesley, ISBN 0-321-12226-7: Chapter 27
  - Berson, A. and Smith, S. J., 1997, Data Warehousing, Data Mining, and OLAP, McGraw-Hill, ISBN 0-07-006272-2: Chapters 17, 18
- Other resources on the Internet
Data Mining - Outline
- DW & DM: differences
- The definition
- Application areas
- Comparison with query and Web site analysis tools
- DM process
- Applications, models and algorithms
- Summary
- Q&A
Data Mining - DW & DM: differences
[Figure: data warehouse architecture. Labelled components: Operational Data, Metadata, Data Transformation, Data Warehouse, Data Mart, Access Tools, Information Delivery System]
Data Mining - DW & DM: differences
- They have the same purpose: decision support
- A DW assembles, formats, and organises historical data to answer user queries as they are; the answers depend on the content of the DW
- A DW will not attempt to extract further information or predict trends and patterns from the data
- DM extracts previously unknown and useful information, and also predicts trends and patterns
- DM can be performed on a DW and/or on traditional databases and files
Data Mining - The Definition
DM is the process of extracting previously unknown, valid and actionable information from large sets of data
- Unknown - look for things that are not intuitive
- Valid - useful
- Actionable - translate into business advantage
Example:
- Rule 1: people don’t buy shares when the political situation is not stable
- Rule 2: the share market is less active when people don’t want to spend
- Outcome statement 1, based on rules 1 and 2: the share market is less active when the political situation is not stable
- Outcome statement 2, based on rules 1 and 2: people don’t want to spend when the political situation is not stable
Data Mining - Application areas
- Direct marketing
  - The ability to predict who is most likely to be interested in which products can save companies immense amounts in marketing expenditure
- Trend analysis
  - Understanding trends in the marketplace is a strategic advantage, because it helps reduce costs and time to market
- Security
  - Fraud detection: data mining techniques can help discover which insurance claims, cellular phone calls, or credit card purchases are likely to be fraudulent
  - IDS (intrusion detection systems)
- Forecasting in financial markets
- Mining online - WebKDD
  - Web sites today find themselves competing for customer loyalty; it costs a customer little to switch to a competitor
- Text mining - intelligent document analysis
Data Mining - Comparison with query and Web site analysis tools
Query tools vs. DM tools
- Both allow the user to ask questions of a DBMS/DW to find out facts
- Query tool - the user makes an assumption and queries based on a hypothesis
- Data mining tool - no assumption is made when posing the query (goal)
Example queries:
1. What is the number of white shirts sold in the north vs the south?
2. What are the most significant factors involved in high, medium, and low sales volumes of white shirts?
- Data mining tool - discovers relationships and hidden patterns that are not obvious
- Trend - integrate data mining into query tools
Data Mining - Comparison with query and Web site analysis tools
OLAP tools vs. DM tools
- OLAP - designed to answer top-down queries
- OLAP - provides multidimensional data analysis; data can be broken down and summarised
- OLAP - query-driven, user-driven, verification-driven
- Data mining - bottom-up, requires no assumption
- Data mining - focuses on finding patterns
- Data mining - data-driven, discovery-driven; identifies facts/conclusions based on the patterns discovered
- For example, OLAP may tell a bookseller the total number of books it sold in a region during a quarter. Statistics can provide another dimension about these sales. Data mining, on the other hand, can reveal the patterns in these sales, i.e., the factors influencing them.
DM Technologies (see Unit 20 - WebCT)
[Figure: Data Mining and its related technologies: Database Management and Warehousing, Machine Learning, Statistics, Decision Support, Parallel Processing, Visualisation]
Data Mining - DM Process - Overview
Business objectives -> data preparation -> DM -> results analysis & knowledge assimilation
- Mining data is only one step in the overall process
- Business objectives drive the entire process
- Data preparation requires the most effort
- Iterative process with many loop-backs over one or more steps
- Labour-intensive exercise, far from autonomous
[Figure: data stages in the process: Data Sources, Selected data, Pre-processed data, Transformed data, Extracted data, Assimilated knowledge]
Data Mining - DM Process - Data Preparation
Stages: Data Selection, Data Pre-processing, Data Transformation
- Data selection - identify data sources and extract data for preliminary analysis in preparation for further mining
- Process of choosing the data to analyse
  - decide the dependent variable - the data (field) to be analysed
  - decide the active variables - the data actively used in mining
  - decide the useful data dimensions
  - choose useful (descriptive) fields in each dimension
  - consider adding other useful dimensions
Data Mining - DM Process - Data Preparation
Stages: Data Selection, Data Pre-processing, Data Transformation
- Data pre-processing - ensure the quality of the selected data
- Data mining is at best as good as the data it represents
- Data quality problems
  - redundant data
  - incorrect or inconsistent data
  - noisy data - outliers: values that are significantly out of line; there are bad outliers and good outliers
  - missing values - value not present or deleted
    - eliminate observations that have missing values (loses information)
    - replace missing values
    - predict the value using a predictive model
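The first two ways of handling missing values listed above can be sketched in a few lines of Python. This is only an illustration: the `ages` data, the use of `None` to mark a missing value, and the choice of the mean as the replacement value are all assumptions, not part of the lecture material.

```python
from statistics import mean

# Hypothetical field values; None marks a missing value (an assumption).
ages = [34, None, 29, 41, None, 36]


def drop_missing(values):
    """Eliminate observations that have missing values (loses information)."""
    return [v for v in values if v is not None]


def fill_with_mean(values):
    """Replace each missing value with the mean of the observed values."""
    fill = mean(drop_missing(values))
    return [fill if v is None else v for v in values]


dropped = drop_missing(ages)   # fewer records, no invented values
filled = fill_with_mean(ages)  # same number of records, estimated values
print(dropped)
print(filled)
```

Dropping keeps only real observations but shrinks the data set; replacing keeps every record at the cost of introducing estimated values.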
Data Mining - DM Process - Data Preparation
Stages: Data Selection, Data Pre-processing, Data Transformation
- Data transformation - pre-processed data is converted into an analytical data model
- Data is refined to suit the input format required by DM algorithms
- Techniques for data conversion
  - simple calculation (SQL) to derive new data fields
  - data reduction: combine several existing variables into one new variable to reduce the total number of variables
  - scaling/normalisation: continuous values are scaled to the same order of magnitude
  - discretisation: convert quantitative variables into categorical variables
  - one-of-N: convert a categorical variable into a numeric representation
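Three of the conversion techniques above (scaling, discretisation and one-of-N) can be sketched as follows. The function names, the income figures and the band limits are hypothetical, and min-max scaling is just one possible way of bringing values to the same order of magnitude.

```python
def min_max_scale(values):
    """Scale continuous values into [0, 1]; assumes they are not all equal."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def discretise(value, low_limit, high_limit):
    """Turn a quantitative value into one of three categories."""
    if value < low_limit:
        return "low"
    if value < high_limit:
        return "medium"
    return "high"


def one_of_n(value, categories):
    """Represent a categorical value as a list with a single 1."""
    return [1 if value == c else 0 for c in categories]


incomes = [20000, 35000, 50000, 80000]  # hypothetical data
scaled = min_max_scale(incomes)
bands = [discretise(v, 30000, 60000) for v in incomes]
encoded = one_of_n("tawny", ["striped", "tawny", "brown", "grey"])
print(scaled, bands, encoded)
```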
Data Mining - DM Process - Data Mining & Results Analysis
- DM - apply the selected DM algorithm(s) to the pre-processed data
- Inseparable from results analysis - done by data & business analysts
- The two are linked in an interactive process - DM definition
- Results analysis - depends on the application developed
  - Segmentation - changing the base variable may improve the result
  - Prediction - accuracy and input sensitivity analysis, overtraining
  - Association - iteration is required to discover actionable rules
Data Mining - DM Process - Knowledge Assimilation
- Close the loop
- Objective - take action according to the new, valid and actionable information discovered
- Challenges
  - present the discovery in a convincing, business-oriented way
  - formulate ways to best exploit the discovery
Data Mining - Applications, Models and Algorithms
Predictive Modelling - Classification
- Human learning experience - observations form a model of the essential, underlying characteristics of some phenomenon - generalisation ability
- In DM, a predictive model can analyse a DB to determine some essential characteristics of the data and make predictions
Typical Applications
- Market Management: target marketing, customer relationship management, market basket analysis, cross selling, market segmentation
- Risk Management: forecasting, customer retention, quality control, competitive analysis
- Fraud Management: fraud detection
Models
- Predictive Modelling (Classification)
- Segmentation (Clustering)
- Link Analysis
- Deviation Detection
Techniques
- Decision tree, memory-based learning, neural networks
- Geometric neural networks
- Associations discovery (Market Basket Analysis)
- Visualisation, statistics
Data Mining - Applications, Models and Algorithms
Predictive Modelling - Classification
- Supervised learning - the correct answers to some already-solved cases must be given to the model before it can make predictions about new observations
- The model is developed in 2 phases
  - Training - build a model based on a large proportion (90%) of the available data
  - Testing - try out the model on previously unseen data (10%) to determine its accuracy and performance characteristics
- 2 types of predictive modelling
  - Classification - classify data into pre-defined classes
  - Value prediction - predict a continuous numeric value for a database record
- Algorithms - decision trees, neural networks, rule induction
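The 90%/10% training and testing split described above might look like this in Python. The function name, the fixed random seed and the stand-in records are assumptions made for the sake of a reproducible sketch.

```python
import random


def train_test_split(records, train_fraction=0.9, seed=0):
    """Shuffle a copy of the records and cut it into training and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


records = list(range(100))  # stand-in for 100 database records
train, test = train_test_split(records)
print(len(train), len(test))
```

The model is built from `train` only; `test` stays unseen until the accuracy check, which is what makes the measured accuracy meaningful.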
Data Mining - Applications, Models and Algorithms
Segmentation - Clustering
- Segmentation can discover homogeneous sub-populations - customer profiling/target marketing
- Segmentation (clustering) - partition the DB into segments (clusters) of similar records; the segments (clusters) are the resulting groups of data records
- Similarity is defined by a measure that depends on the distance of records from the centre of the cluster - Euclidean distance
  $A = (a_1, a_2, \ldots, a_n)$, $B = (b_1, b_2, \ldots, b_n)$
  $\mathrm{Dist}(A, B) = \left((a_1 - b_1)^2 + (a_2 - b_2)^2 + \cdots + (a_n - b_n)^2\right)^{1/2}$
- Clustering is unsupervised learning - the types of clusters and the number of clusters are not given - the true discovery nature of DM
- Algorithm - neural networks
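The Euclidean distance above, and its use for placing a record in the segment with the nearest centre, can be sketched as below. The cluster centres are simply given here as hypothetical values; a real clustering algorithm has to discover them, so this shows only the similarity measure, not a full clustering method.

```python
import math


def euclidean_distance(a, b):
    """Dist(A, B) for two records with the same number of fields."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_centre(record, centres):
    """Index of the cluster centre closest to the record."""
    return min(range(len(centres)),
               key=lambda i: euclidean_distance(record, centres[i]))


centres = [(0.0, 0.0), (10.0, 10.0)]            # hypothetical centres
records = [(1.0, 2.0), (9.0, 8.0), (3.0, 4.0)]  # hypothetical records
assignments = [nearest_centre(r, centres) for r in records]
print(assignments)
```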
Data Mining - Applications, Models and Algorithms
Link Analysis / Deviation Detection
- Link analysis seeks to establish links between individual records or sets of records in the DB
  - Association discovery - market basket analysis - within one transaction
  - Sequential pattern discovery - sequence information over time
- Deviation detection - further investigate outliers
  - Applications - fraud detection
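Association discovery over transactions is commonly quantified with two measures, support and confidence. The slides do not name them, so the sketch below is an illustration under that assumption, using made-up baskets.

```python
# Hypothetical market baskets: each set is one transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
]


def support(itemset, baskets):
    """Fraction of transactions that contain every item in the itemset."""
    items = set(itemset)
    return sum(items <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    both = support(set(antecedent) | set(consequent), baskets)
    return both / support(antecedent, baskets)


s = support({"bread", "butter"}, transactions)
c = confidence({"butter"}, {"bread"}, transactions)
print(s, c)
```

Here every basket containing butter also contains bread, so the rule "butter implies bread" has confidence 1.0 while being supported by half of the transactions.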
Data Mining - Applications, Models and Algorithms
Decision Trees
- Decision trees (IF - THEN) are a commonly used machine learning algorithm and a powerful and popular tool for classification and prediction
- Attempt to split the DB among the desired categories and identify important cluster features
- Tree construction
  - choose an attribute (field) for testing - the root node of the tree
  - number of values of the attribute - branches from the root node
    - binary - yes/no type of questions
    - multiple - complex questions with more than two answers
- Algorithms - ID3 (Iterative Dichotomiser 3), C4.5, C5.0, CART (Classification and Regression Trees), CHAID (Chi-squared Automatic Interaction Detection)
  - rank all features in terms of their effectiveness in partitioning the set of classifications - information gain
  - make the most effective feature the root node
  - recur on each branch
Data Mining - Applications, Models and Algorithms
Decision Trees
- Optimal tree produced by ID3
  - root node - “Colour”, most information gain
  - 4 branches - “striped”, “tawny”, “brown” & “grey”
  - recur on branches “striped” & “tawny”

Diet   Size   Colour   Habitat  Species
meat   large  striped  jungle   tiger
meat   large  tawny    jungle   lion
meat   small  striped  house    tabby
meat   small  brown    jungle   weasel
grass  large  striped  plains   zebra
grass  small  grey     plains   rabbit
grass  large  tawny    plains   antelope
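The claim that “Colour” gives the most information gain on the seven animal records can be checked with a short sketch. The entropy-based definition of information gain used here is the standard one for ID3; the function and variable names are illustrative assumptions.

```python
import math
from collections import Counter

ATTRIBUTES = ["Diet", "Size", "Colour", "Habitat"]
# The seven records from the slide; the last field is the class (Species).
ROWS = [
    ("meat", "large", "striped", "jungle", "tiger"),
    ("meat", "large", "tawny", "jungle", "lion"),
    ("meat", "small", "striped", "house", "tabby"),
    ("meat", "small", "brown", "jungle", "weasel"),
    ("grass", "large", "striped", "plains", "zebra"),
    ("grass", "small", "grey", "plains", "rabbit"),
    ("grass", "large", "tawny", "plains", "antelope"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, index):
    """Entropy reduction obtained by splitting the rows on one attribute."""
    groups = {}
    for row in rows:
        groups.setdefault(row[index], []).append(row[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[-1] for row in rows]) - remainder


gains = {name: information_gain(ROWS, i) for i, name in enumerate(ATTRIBUTES)}
best = max(gains, key=gains.get)
print(gains, best)
```

ID3 would make `best` the root node and then repeat the same ranking on the records that fall into each branch.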
Data Mining - Applications, Models and Algorithms
Decision tree (root node: Colour)
- striped -> Habitat
  - jungle -> tiger
  - house -> tabby
  - plains -> zebra
- tawny -> Diet
  - grass -> antelope
  - meat -> lion
- brown -> weasel
- grey -> rabbit
Data Mining - Applications, Models and Algorithms
Neural Networks
- An NN is used to simulate the operation of the brain
- An NN consists of a large number of processors (neurons/nodes) and links (connections) - representing knowledge
- An NN is trained with a large amount of data and rules about data relationships - memorise
- A well-trained NN can learn association and similarity - generalise
- Supervised learning:
  - The NN is trained with sets of inputs and desired outputs
  - If the actual output differs from the desired output, the network adjusts its internal connection strengths (weights) to reduce the difference
  - This process continues until the network gets the I/O patterns correct or until an acceptable error rate is attained
- Unsupervised learning - Self-Organising Map (SOM)
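The supervised weight-adjustment loop described above can be illustrated with the simplest possible case: a single neuron trained with the perceptron rule. This is a deliberate simplification chosen for the sketch, not the multi-layer networks used in practice, and the logical-AND training set is a hypothetical example.

```python
def step(x):
    """Threshold activation: fire (1) when the weighted sum is non-negative."""
    return 1 if x >= 0 else 0


def predict(weights, bias, inputs):
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)


def train(samples, epochs=100):
    """Adjust weights whenever the actual output differs from the desired one."""
    weights, bias = [0, 0], 0
    for _ in range(epochs):
        errors = 0
        for inputs, desired in samples:
            diff = desired - predict(weights, bias, inputs)
            if diff != 0:
                # Move each connection strength in the direction that
                # reduces the difference.
                weights = [w + diff * x for w, x in zip(weights, inputs)]
                bias += diff
                errors += 1
        if errors == 0:  # all I/O patterns correct: stop training
            break
    return weights, bias


# Inputs and desired outputs for logical AND (hypothetical training set).
AND_SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train(AND_SAMPLES)
outputs = [predict(weights, bias, x) for x, _ in AND_SAMPLES]
print(weights, bias, outputs)
```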
Data Mining - Summary
- DW & DM: differences
- The definition
- Application areas
- Comparison with query and Web site analysis tools
- DM process
  - Data preparation (60% of the whole time)
  - DM (~10% of the time)
- Applications, models and algorithms (decision trees, neural networks, etc.)
Next week: Revision