Data Mining Presentation(2)


1/25

DATA MINING

Prepared by:

Srinivasan 109CE21

Manas 109CE22

Karan 109CE23


2/25

Data explosion problem

Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories

We are drowning in data, but starving for knowledge!

Solution: Data warehousing and data mining

Data warehousing and on-line analytical processing

Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases


3/25

Lots of data is being collected and warehoused:

Web data, e-commerce

Purchases at department/grocery stores

Bank/credit card transactions

Computers have become cheaper and more powerful

Competitive pressure is strong:

Provide better, customized services for an edge (e.g. in Customer Relationship Management)


4/25

Data collected and stored at enormous speeds (GB/hour):

Remote sensors on a satellite

Telescopes scanning the skies

Microarrays generating gene expression data

Scientific simulations generating terabytes of data

Traditional techniques are infeasible for raw data

Data mining may help scientists:

in classifying and segmenting data

in hypothesis formation


5/25

Data mining (knowledge discovery in databases):

Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases

Alternative names and their inside stories:

Data mining: a misnomer?

Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, business intelligence, etc.


6/25

Decisions in data mining

Kinds of databases to be mined

Kinds of knowledge to be discovered

Kinds of techniques utilized

Kinds of applications adapted

Data mining tasks

Descriptive data mining

Predictive data mining


7/25

Databases to be mined

Relational, transactional, object-oriented, object-relational, active, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW, etc.

Knowledge to be mined

Characterization, discrimination, association, classification, clustering, trend, deviation and outlier analysis, etc.

Multiple/integrated functions and mining at multiple levels

Techniques utilized

Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, neural network, etc.

Applications adapted

Retail, telecommunication, banking, fraud analysis, DNA mining, stock market analysis, Web mining, Weblog analysis, etc.


8/25

Prediction Tasks

Use some variables to predict unknown or future values of other variables.

Description Tasks

Find human-interpretable patterns that describe the data.

Common data mining tasks

Classification [Predictive]

Clustering [Descriptive]

Association Rule Discovery [Descriptive]

Sequential Pattern Discovery [Descriptive]

Regression [Predictive]

Deviation Detection [Predictive]
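
As a hedged illustration of a prediction task (regression), the sketch below fits a straight line by ordinary least squares and uses it to predict an unseen value. The function name fit_line and the sample numbers are assumptions made for illustration only; they do not come from the slides.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations: the known variable x and the variable y to predict.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

slope, intercept = fit_line(xs, ys)

# Use the fitted line to predict y for a value of x that was not observed.
predicted = slope * 6 + intercept
print(slope, intercept, predicted)
```

A descriptive task, by contrast, would summarize or group these observations rather than predict a new value.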


9/25

In terms of software and the marketing thereof, Data Mining != Data Analysis

Data Mining implies that the software uses some intelligence, beyond simple grouping and partitioning of data, to infer new information.

Data Analysis is more in line with standard statistical software (e.g. web stats). These tools usually present information about subsets and relations within the recorded data set (e.g. browser/search engine usage, average visit time, etc.).


10/25

CLUSTERING


11/25

Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that:

Data points in one cluster are more similar to one another.

Data points in separate clusters are less similar to one another.

Similarity Measures:

Euclidean distance if attributes are continuous.

Other problem-specific measures.
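
The clustering definition above can be sketched with a small k-means loop that uses Euclidean distance as the similarity measure. k-means is only one possible clustering algorithm and is not named on the slides; the function names, the sample points and the choice of starting centroids are assumptions made for illustration.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points with continuous attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kmeans(points, centroids, iterations=10):
    """Assign each point to its nearest centroid, then recompute centroids."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(iterations):
        # Assignment step: index of the closest centroid for every point.
        labels = [
            min(range(len(centroids)), key=lambda i: euclidean(p, centroids[i]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for i in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                centroids[i] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids


# Hypothetical two-attribute data points forming two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]

# Starting centroids are picked by hand here to keep the run deterministic.
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print(labels)
print(centroids)
```

For non-continuous attributes, euclidean would be swapped for a problem-specific measure, as the slide notes.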


12/25

Market Segmentation:

Goal: subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.

Approach:

Collect different attributes of customers based on their geographical and lifestyle related information.

Find clusters of similar customers.

Measure the clustering quality by observing buying patterns of customers in the same cluster vs. those from different clusters.


13/25

Document Clustering:

Goal: To find groups of documents that are similar to each other based on the important terms appearing in them.

Approach: To identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.

Gain: Information Retrieval can utilize the clusters to relate a new document or search term to clustered documents.
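
One way to form a similarity measure from term frequencies, as the approach above describes, is cosine similarity between term-count vectors. The slides do not name a specific measure, so cosine similarity is an assumption here, as are the function names and the three sample documents.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Count how often each whitespace-separated, lowercased term occurs."""
    return Counter(text.lower().split())


def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors (0.0 to 1.0)."""
    shared_terms = tf_a.keys() & tf_b.keys()
    dot = sum(tf_a[t] * tf_b[t] for t in shared_terms)
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical documents: d1 and d2 share vocabulary, d3 does not.
d1 = term_frequencies("data mining finds patterns in data")
d2 = term_frequencies("mining patterns from large data")
d3 = term_frequencies("the store sells milk and bread")

sim_12 = cosine_similarity(d1, d2)
sim_13 = cosine_similarity(d1, d3)
print(sim_12, sim_13)
```

A clustering step would then group documents whose pairwise similarity is high; a real system would typically also remove stop words and weight rare terms more heavily.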


14/25

Given a set of records, each of which contains some number of items from a given collection:

Produce dependency rules which will predict the occurrence of an item based on occurrences of other items.

TID   Items
1     Bread, Coke, Milk
2     Beer, Bread
3     Beer, Coke, Newspaper, Milk
4     Beer, Bread, Newspaper, Milk
5     Coke, Newspaper, Milk

Rules Discovered:

{Milk} --> {Coke}

{Newspaper, Milk} --> {Beer}
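
The strength of the two rules can be checked against the five transactions using support and confidence. These are the standard measures for association rules, but the slides do not name them, so treating them as the evaluation criteria is an assumption; the function names are illustrative.

```python
# The five transactions from the table, keyed by position (TID 1 to 5).
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Newspaper", "Milk"},
    {"Beer", "Bread", "Newspaper", "Milk"},
    {"Coke", "Newspaper", "Milk"},
]


def count_containing(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def support(antecedent, consequent, transactions):
    """Fraction of all transactions containing antecedent and consequent."""
    both = set(antecedent) | set(consequent)
    return count_containing(both, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Among transactions with the antecedent, the fraction that also have the consequent."""
    both = set(antecedent) | set(consequent)
    return count_containing(both, transactions) / count_containing(antecedent, transactions)


rules = [({"Milk"}, {"Coke"}), ({"Newspaper", "Milk"}, {"Beer"})]
for antecedent, consequent in rules:
    s = support(antecedent, consequent, transactions)
    c = confidence(antecedent, consequent, transactions)
    print(sorted(antecedent), "-->", sorted(consequent), "support", s, "confidence", c)
```

On this table, {Milk} --> {Coke} holds in 3 of the 4 transactions that contain Milk, and {Newspaper, Milk} --> {Beer} holds in 2 of the 3 transactions that contain both Newspaper and Milk.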


15/25

Marketing and Sales Promotion:

Let the rule discovered be

{Milk, } --> {Coke}

Coke as consequent => Can be used to determine what should be done to boost its sales.

Milk in the antecedent => Can be used to see which products would be affected if the store discontinues selling milk.

Milk in antecedent and Coke in consequent => Can be used to see what products should be sold with milk to promote sale of Coke!


16/25

Supermarket shelf management.

Goal: To identify items that are bought together by sufficiently many customers.

Approach: Process the point-of-sale data collected with barcode scanners to find dependencies among items.

A classic rule:

If a customer buys newspaper and milk, then he is very likely to buy beer.


17/25

ADVANTAGES AND DISADVANTAGES OF DATA MINING


18/25

Detect significant deviations from normal behavior

Applications:

Credit Card Fraud Detection

Network Intrusion Detection
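
Detecting significant deviations from normal behavior can be sketched with a simple z-score rule: flag any value that lies more than a chosen number of standard deviations from the mean. The z-score rule, the threshold of 2.0, the function name and the sample amounts are all assumptions for illustration; production fraud and intrusion detection systems use far richer models.

```python
import statistics


def find_deviations(values, threshold=2.0):
    """Return (index, value) pairs lying more than threshold standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []  # all values identical, nothing deviates
    return [
        (i, v)
        for i, v in enumerate(values)
        if abs(v - mean) / stdev > threshold
    ]


# Hypothetical card transaction amounts; one is far outside the usual range.
amounts = [25.0, 30.0, 27.5, 22.0, 31.0, 26.0, 950.0, 28.0]

suspicious = find_deviations(amounts)
print(suspicious)
```

With so few samples a single extreme value inflates the standard deviation, which is why the threshold here is kept low; a robust statistic such as the median would be a reasonable alternative.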


19/25

Induction vs Deduction

Deductive reasoning is truth-preserving:

1. All horses are mammals.

2. All mammals have lungs.

3. Therefore, all horses have lungs.

Inductive reasoning adds information:

1. All horses observed so far have lungs.

2. Therefore, all horses have lungs.


20/25

Data mining: the core of the knowledge discovery process.

[Figure: knowledge discovery process. Labels: Databases, Data Cleaning, Data Integration, Data Warehouse, Data Selection, Task-relevant Data, Data Preprocessing, Data Mining, Pattern Evaluation]


21/25

Learning the application domain: relevant prior knowledge and goals of application

Creating a target data set: data selection

Data cleaning and preprocessing (may take 60% of effort!)

Data reduction and transformation: find useful features, dimensionality/variable reduction, invariant representation

Choosing functions of data mining: summarization, classification, regression, association, clustering

Choosing the mining algorithm(s)

Data mining: search for patterns of interest

Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.

Use of discovered knowledge


22/25

Statistical Analysis:

Ill-suited for nominal and structured data types

Completely data driven - incorporation of domain knowledge not possible

Interpretation of results is difficult and daunting

Requires expert user guidance

Data Mining:

Large data sets

Efficiency of algorithms is important

Scalability of algorithms is important

Real world data

Lots of missing values

Pre-existing data - not user generated

Data not static - prone to updates

Efficient methods for data retrieval available for use


23/25

AI/Machine Learning

Combinatorial/Game Data Mining

Good for analyzing winning strategies to games, and thus developing intelligent AI opponents (e.g. chess).

Business Strategies

Market Basket Analysis

Identify customer demographics, preferences, and purchasing patterns.

Risk Analysis

Product Defect Analysis

Analyze product defect rates for given plants and predict possible complications (read: lawsuits) down the line.


24/25

User Behavior Validation

Fraud Detection

In the realm of cell phones, comparing phone activity to calling records can help detect calls made on cloned phones.

Similarly, with credit cards, comparing purchases with historical purchases can detect activity with stolen cards.


25/25

THANK YOU

Date post:	06-Apr-2018
Upload:	manas1991