8/3/2019 Data Mining Presentation(2)
DATA MINING
Prepared by:
Srinivasan 109CE21
Manas 109CE22
Karan 109CE23
Data explosion problem
Automated data collection tools and mature database
technology lead to tremendous amounts of data stored in
databases, data warehouses and other information repositories
We are drowning in data, but starving for knowledge!
Solution: Data warehousing and data mining
Data warehousing and on-line analytical processing
Extraction of interesting knowledge (rules, regularities,
patterns, constraints) from data in large databases
Lots of data is being collected and warehoused
Web data, e-commerce
Purchases at department/grocery stores
Bank/credit card transactions
Computers have become cheaper and more powerful
Competitive pressure is strong
Provide better, customized services for an edge (e.g. in Customer Relationship Management)
Data collected and stored at enormous speeds (GB/hour)
remote sensors on a satellite
telescopes scanning the skies
microarrays generating gene expression data
scientific simulations generating terabytes of data
Traditional techniques infeasible for raw data
Data mining may help scientists
in classifying and segmenting data
in hypothesis formation
Data mining (knowledge discovery in databases):
Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases
Alternative names and their inside stories:
Data mining: a misnomer?
Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, business intelligence, etc.
Decisions in data mining
Kinds of databases to be mined
Kinds of knowledge to be discovered
Kinds of techniques utilized
Kinds of applications adapted
Data mining tasks
Descriptive data mining
Predictive data mining
Databases to be mined
Relational, transactional, object-oriented, object-relational,
active, spatial, time-series, text, multi-media, heterogeneous,
legacy, WWW, etc.
Knowledge to be mined
Characterization, discrimination, association, classification,
clustering, trend, deviation and outlier analysis, etc.
Multiple/integrated functions and mining at multiple levels
Techniques utilized
Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, neural network, etc.
Applications adapted
Retail, telecommunication, banking, fraud analysis, DNA mining, stock market analysis, Web mining, Weblog analysis, etc.
Prediction Tasks
Use some variables to predict unknown or future values of other variables.
Description Tasks
Find human-interpretable patterns that describe the data.
Common data mining tasks
Classification [Predictive]
Clustering [Descriptive]
Association Rule Discovery [Descriptive]
Sequential Pattern Discovery [Descriptive]
Regression [Predictive]
Deviation Detection [Predictive]
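As a rough illustration of a predictive task, the sketch below fits a one-variable least-squares regression line and uses it to predict an unseen value. The slides do not prescribe any particular method; the function names (`fit_line`, `predict`) and the spend/sales figures are made up for the example.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and co-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Use the fitted line to predict y for a new x."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical training data: advertising spend vs. units sold.
spend = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]

model = fit_line(spend, sales)
print(predict(model, 5.0))  # known variable (spend) predicts unknown one (sales)
```

Here one variable (spend) is used to predict the unknown value of another (sales), which is the pattern the Prediction Tasks slide describes.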
In terms of software and the marketing thereof:
Data Mining != Data Analysis
Data Mining implies that the software applies some intelligence beyond simple grouping and partitioning of data to infer new information.
Data Analysis is more in line with standard statistical software (e.g. web stats). These tools usually present information about subsets and relations within the recorded data set (e.g. browser/search engine usage, average visit time, etc.).
CLUSTERING
Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
Data points in one cluster are more similar to one another.
Data points in separate clusters are less similar to one another.
Similarity Measures:
Euclidean Distance if attributes are continuous.
Other Problem-specific Measures.
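The definition above can be sketched with Euclidean distance and a basic k-means loop. The slides do not name a clustering algorithm, so k-means is an assumption chosen here as one common option; the points and starting centroids are invented for illustration.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points with continuous attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kmeans(points, centroids, iterations=10):
    """Basic k-means: assign points to the nearest centroid, then update centroids."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid = mean of the cluster members (unchanged if cluster is empty).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return clusters, centroids


# Hypothetical 2-D data with two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
clusters, centroids = kmeans(points, centroids=[points[0], points[3]])
print(clusters)
```

With this data the three points near (1, 1) end up in one cluster and the three near (8, 8) in the other, so points within a cluster are closer to each other than to points in the other cluster.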
Market Segmentation:
Goal: subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
Approach:
Collect different attributes of customers based on their geographical and lifestyle related information.
Find clusters of similar customers.
Measure the clustering quality by observing buying patterns of customers in same cluster vs. those from different clusters.
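One possible way to carry out the last step is to compare how far apart buying patterns are within a cluster versus across clusters. The sketch below is an assumption about how that comparison could be done, not a method given in the slides; the purchase vectors, cluster labels, and the function name `cluster_quality` are hypothetical.

```python
import math
from itertools import combinations


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def cluster_quality(purchases, labels):
    """Return (average within-cluster distance, average between-cluster distance).

    Assumes at least one same-cluster pair and one cross-cluster pair exist.
    """
    within, between = [], []
    for a, b in combinations(range(len(purchases)), 2):
        d = euclidean(purchases[a], purchases[b])
        if labels[a] == labels[b]:
            within.append(d)
        else:
            between.append(d)
    return sum(within) / len(within), sum(between) / len(between)


# Hypothetical buying patterns: (units of product X, units of product Y) per customer.
purchases = [(5, 0), (6, 1), (0, 4), (1, 5)]
labels = [0, 0, 1, 1]  # cluster assigned to each customer

avg_within, avg_between = cluster_quality(purchases, labels)
print(avg_within, avg_between)
```

A clustering looks useful for segmentation when the within-cluster average is clearly smaller than the between-cluster average, as it is for this toy data.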
Document Clustering:
Goal: To find groups of documents that are similar to each other based on the important terms appearing in them.
Approach: To identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.
Gain: Information Retrieval can utilize the clusters to relate a new document or search term to clustered documents.
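The approach above can be sketched by counting term frequencies and comparing documents with cosine similarity. The slide only asks for "a similarity measure based on the frequencies of different terms"; cosine similarity is an assumption chosen here as one common option, and the three documents are invented.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Count how often each lower-cased, whitespace-separated term occurs."""
    return Counter(text.lower().split())


def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors (0.0 to 1.0)."""
    shared = tf_a.keys() & tf_b.keys()
    dot = sum(tf_a[t] * tf_b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical documents.
docs = {
    "d1": "data mining finds patterns in data",
    "d2": "mining patterns from large data sets",
    "d3": "the stock market fell sharply today",
}
tf = {name: term_frequencies(text) for name, text in docs.items()}

sim_12 = cosine_similarity(tf["d1"], tf["d2"])
sim_13 = cosine_similarity(tf["d1"], tf["d3"])
print(sim_12, sim_13)
```

Documents d1 and d2 share several terms and score above zero, while d1 and d3 share none and score 0.0, so a clustering step built on this measure would group d1 with d2.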
Given a set of records, each of which contains some number of items from a given collection:
Produce dependency rules which will predict the occurrence of an item based on occurrences of other items.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Newspaper, Milk
4 Beer, Bread, Newspaper, Milk
5 Coke, Newspaper, Milk
Rules Discovered:
{Milk} --> {Coke}
{Newspaper, Milk} --> {Beer}
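To make the two rules concrete, the sketch below scores them against the five transactions in the table using support and confidence. The slides do not define these measures; the standard definitions are assumed here (support counts the transactions containing an itemset, confidence is the fraction of antecedent transactions that also contain the consequent).

```python
# The five transactions from the table, keyed by TID.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Newspaper", "Milk"},
    4: {"Beer", "Bread", "Newspaper", "Milk"},
    5: {"Coke", "Newspaper", "Milk"},
}


def support_count(itemset):
    """Number of transactions that contain every item in the itemset."""
    return sum(1 for items in transactions.values() if itemset <= items)


def confidence(antecedent, consequent):
    """Fraction of transactions with the antecedent that also hold the consequent."""
    return support_count(antecedent | consequent) / support_count(antecedent)


rules = [
    ({"Milk"}, {"Coke"}),
    ({"Newspaper", "Milk"}, {"Beer"}),
]
for antecedent, consequent in rules:
    print(sorted(antecedent), "-->", sorted(consequent),
          "support:", support_count(antecedent | consequent),
          "confidence:", round(confidence(antecedent, consequent), 2))
```

On this table {Milk} --> {Coke} holds in 3 of the 4 milk transactions (confidence 0.75), and {Newspaper, Milk} --> {Beer} holds in 2 of 3 (confidence about 0.67).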
Marketing and Sales Promotion:
Let the rule discovered be
{Milk, ...} --> {Coke}
Coke as consequent => Can be used to determine what should be done to boost its sales.
Milk in the antecedent => Can be used to see which products would be affected if the store discontinues selling milk.
Milk in antecedent and Coke in consequent => Can be used to see what products should be sold with milk to promote sale of Coke!
Supermarket shelf management.
Goal: To identify items that are bought together by sufficiently many customers.
Approach: Process the point-of-sale data collected with barcode scanners to find dependencies among items.
A classic rule:
If a customer buys newspaper and milk, then he is very likely to buy beer.
ADVANTAGES AND DISADVANTAGES OF DATA MINING
Detect significant deviations from normal behavior
Applications:
Credit Card Fraud Detection
Network Intrusion Detection
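A very simple way to flag deviations from normal behavior is a z-score test: mark any value that lies more than a chosen number of standard deviations from the mean. This is only a sketch under that assumption; the slides do not specify a detection method, and the transaction amounts, the `find_deviations` name, and the threshold of 2.0 are illustrative.

```python
from statistics import mean, stdev


def find_deviations(values, threshold=2.0):
    """Return the values whose z-score magnitude exceeds the threshold."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    if sigma == 0:
        return []  # all values identical, nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical credit card transaction amounts for one account.
amounts = [42.0, 38.5, 51.0, 47.2, 40.3, 44.8, 39.9, 950.0]

suspicious = find_deviations(amounts)
print(suspicious)
```

With these amounts only the 950.0 transaction exceeds the threshold. In small samples a single extreme value inflates the standard deviation, so practical systems often prefer more robust statistics or learned models.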
Induction vs Deduction
Deductive reasoning is truth-preserving:
1. All horses are mammals.
2. All mammals have lungs.
3. Therefore, all horses have lungs.
Inductive reasoning adds information:
1. All horses observed so far have lungs.
2. Therefore, all horses have lungs.
Data mining: the core of the knowledge discovery process.
[Figure: stages of the knowledge discovery process, labeled Databases, Data Cleaning, Data Integration, Data Warehouse, Data Selection, Task-relevant Data, Data Preprocessing, Data Mining, Pattern Evaluation]
Learning the application domain:
relevant prior knowledge and goals of application
Creating a target data set: data selection
Data cleaning and preprocessing: (may take 60% of effort!)
Data reduction and transformation:
Find useful features, dimensionality/variable reduction, invariant representation.
Choosing functions of data mining
summarization, classification, regression, association, clustering.
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation
visualization, transformation, removing redundant patterns, etc.
Use of discovered knowledge
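A few of the steps above (data selection, cleaning, and a summarization-style mining function) can be sketched as a tiny pipeline. Everything here is illustrative: the records, field names, and helper functions are hypothetical, and dropping records with missing values is just one possible cleaning policy.

```python
# Hypothetical raw records; None marks a missing value.
raw_records = [
    {"customer": "A", "region": "north", "spend": 120.0},
    {"customer": "B", "region": "north", "spend": None},
    {"customer": "C", "region": "south", "spend": 80.0},
    {"customer": "D", "region": "south", "spend": 100.0},
    {"customer": "E", "region": None, "spend": 60.0},
]


def select(records, fields):
    """Data selection: keep only the task-relevant fields."""
    return [{f: r[f] for f in fields} for r in records]


def clean(records):
    """Data cleaning: drop records that have any missing value."""
    return [r for r in records if all(v is not None for v in r.values())]


def mine_average_spend(records):
    """Mining function (summarization): average spend per region."""
    totals = {}
    for r in records:
        total, count = totals.get(r["region"], (0.0, 0))
        totals[r["region"]] = (total + r["spend"], count + 1)
    return {region: total / count for region, (total, count) in totals.items()}


target = select(raw_records, ["region", "spend"])
cleaned = clean(target)
summary = mine_average_spend(cleaned)
print(summary)
```

Two of the five records are discarded during cleaning, which hints at why the cleaning and preprocessing step can dominate the effort on real data.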
Statistical Analysis:
Ill-suited for Nominal and Structured Data Types
Completely data driven - incorporation of domain knowledge not possible
Interpretation of results is difficult and daunting
Requires expert user guidance
Data Mining:
Large Data sets
Efficiency of Algorithms is important
Scalability of Algorithms is important
Real World Data
Lots of Missing Values
Pre-existing data - not user generated
Data not static - prone to updates
Efficient methods for data retrieval available for use
AI/Machine Learning
Combinatorial/Game Data Mining
Good for analyzing winning strategies to games, and thus developing intelligent AI opponents (e.g. Chess).
Business Strategies
Market Basket Analysis
Identify customer demographics, preferences, and purchasing patterns.
Risk Analysis
Product Defect Analysis
Analyze product defect rates for given plants and predict possible complications (read: lawsuits) down the line.
User Behavior Validation
Fraud Detection
In the realm of cell phones:
Comparing phone activity to calling records. Can help detect calls made on cloned phones.
Similarly, with credit cards, comparing purchases with historical purchases. Can detect activity with stolen cards.
THANK YOU