
Data Mining Presentation(2)

Date post: 06-Apr-2018
Upload: manas1991

Transcript
  • 8/3/2019 Data Mining Presentation(2)


    DATA MINING

    Prepared by:

    Srinivasan 109CE21

    Manas 109CE22

    Karan 109CE23


    Data explosion problem

    Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories.

    We are drowning in data, but starving for knowledge!

    Solution: Data warehousing and data mining

    Data warehousing and on-line analytical processing

    Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases


    Lots of data is being collected and warehoused

    Web data, e-commerce

    Purchases at department/grocery stores

    Bank/credit card transactions

    Computers have become cheaper and more powerful

    Competitive pressure is strong

    Provide better, customized services for an edge (e.g. in Customer Relationship Management)


    Data collected and stored at enormous speeds (GB/hour)

    Remote sensors on a satellite

    Telescopes scanning the skies

    Microarrays generating gene expression data

    Scientific simulations generating terabytes of data

    Traditional techniques infeasible for raw data

    Data mining may help scientists

    in classifying and segmenting data

    in hypothesis formation


    Data mining (knowledge discovery in databases):

    Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases

    Alternative names and their inside stories:

    Data mining: a misnomer?

    Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, business intelligence, etc.


    Decisions in data mining

    Kinds of databases to be mined

    Kinds of knowledge to be discovered

    Kinds of techniques utilized

    Kinds of applications adapted

    Data mining tasks

    Descriptive data mining

    Predictive data mining


    Databases to be mined

    Relational, transactional, object-oriented, object-relational, active, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW, etc.

    Knowledge to be mined

    Characterization, discrimination, association, classification, clustering, trend, deviation and outlier analysis, etc.

    Multiple/integrated functions and mining at multiple levels

    Techniques utilized

    Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, neural network, etc.

    Applications adapted

    Retail, telecommunication, banking, fraud analysis, DNA mining, stock market analysis, Web mining, Weblog analysis, etc.


    Prediction Tasks

    Use some variables to predict unknown or future values of other variables.

    Description Tasks

    Find human-interpretable patterns that describe the data.

    Common data mining tasks

    Classification [Predictive]

    Clustering [Descriptive]

    Association Rule Discovery [Descriptive]

    Sequential Pattern Discovery [Descriptive]

    Regression [Predictive]

    Deviation Detection [Predictive]


    In terms of software and the marketing thereof, Data Mining != Data Analysis

    Data Mining implies that the software uses some intelligence, beyond simple grouping and partitioning of data, to infer new information.

    Data Analysis is more in line with standard statistical software (e.g. web stats). These tools usually present information about subsets and relations within the recorded data set (e.g. browser/search engine usage, average visit time, etc.)


    CLUSTERING


    Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that:

    Data points in one cluster are more similar to one another.

    Data points in separate clusters are less similar to one another.

    Similarity Measures:

    Euclidean distance if attributes are continuous.

    Other problem-specific measures.
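
A minimal sketch of clustering with Euclidean distance follows. The slides do not name an algorithm; k-means is used here only as one common choice, and the sample points, the starting centroids and the function names (`assign`, `update`, `kmeans`) are illustrative assumptions. It relies on `math.dist`, available in Python 3.8 and later.

```python
import math

def assign(points, centroids):
    """Index of the nearest centroid (Euclidean distance) for each point."""
    return [
        min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        for p in points
    ]

def update(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for i, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == i]
        if not members:
            new_centroids.append(old)  # an empty cluster keeps its centroid
            continue
        new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return new_centroids

def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and update until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    labels = assign(points, centroids)
    for _ in range(max_iter):
        updated = update(points, labels, centroids)
        if updated == centroids:
            break
        centroids = updated
        labels = assign(points, centroids)
    return labels, centroids

# Hypothetical 2-D data: two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print(labels)
```

Starting centroids are passed in explicitly so the run is deterministic; real implementations usually pick them at random and repeat the run several times.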


    Market Segmentation:

    Goal: Subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.

    Approach:

    Collect different attributes of customers based on their geographical and lifestyle related information.

    Find clusters of similar customers.

    Measure the clustering quality by observing buying patterns of customers in the same cluster vs. those from different clusters.


    Document Clustering:

    Goal: To find groups of documents that are similar to each other based on the important terms appearing in them.

    Approach: Identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.

    Gain: Information retrieval can utilize the clusters to relate a new document or search term to clustered documents.
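
One way to build a similarity measure from term frequencies is cosine similarity between term-count vectors. The sketch below assumes that choice (the slides do not specify a measure), uses naive whitespace tokenization, and works on three made-up documents; `term_frequencies` and `cosine_similarity` are illustrative names.

```python
import math
from collections import Counter

def term_frequencies(text):
    """Count how often each lowercase, whitespace-separated term occurs."""
    return Counter(text.lower().split())

def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors (0.0 to 1.0)."""
    shared = tf_a.keys() & tf_b.keys()
    dot = sum(tf_a[t] * tf_b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical documents, for illustration only.
doc_a = term_frequencies("data mining finds patterns in data")
doc_b = term_frequencies("mining data for hidden patterns")
doc_c = term_frequencies("the stock market fell today")

sim_ab = cosine_similarity(doc_a, doc_b)  # share "data", "mining", "patterns"
sim_ac = cosine_similarity(doc_a, doc_c)  # share no terms
print(sim_ab > sim_ac)
```

A clustering routine can then group documents whose pairwise similarity is high. In practice, stop-word removal and term weighting usually precede this step.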


    Given a set of records, each of which contains some number of items from a given collection:

    Produce dependency rules which will predict the occurrence of an item based on occurrences of other items.

    TID   Items
    1     Bread, Coke, Milk
    2     Beer, Bread
    3     Beer, Coke, Newspaper, Milk
    4     Beer, Bread, Newspaper, Milk
    5     Coke, Newspaper, Milk

    Rules Discovered:

    {Milk} --> {Coke}

    {Newspaper, Milk} --> {Beer}

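
The two rules can be checked against the five transactions with the standard support and confidence measures. The slides do not define these measures, so the definitions in the comments are the usual textbook ones, and the helper names are illustrative.

```python
# The five transactions from the table above.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Newspaper", "Milk"},
    {"Beer", "Bread", "Newspaper", "Milk"},
    {"Coke", "Newspaper", "Milk"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions)

def support(itemset, transactions):
    """Fraction of all transactions that contain the itemset."""
    return count(itemset, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Among transactions containing the antecedent, the fraction that
    also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return count(both, transactions) / count(antecedent, transactions)

# {Milk} --> {Coke}: 3 of 5 transactions hold both; 3 of the 4 with Milk hold Coke.
print(support({"Milk", "Coke"}, transactions),
      confidence({"Milk"}, {"Coke"}, transactions))

# {Newspaper, Milk} --> {Beer}: 2 of 5 hold all three; 2 of the 3 with
# Newspaper and Milk also hold Beer.
print(support({"Newspaper", "Milk", "Beer"}, transactions),
      confidence({"Newspaper", "Milk"}, {"Beer"}, transactions))
```

A rule miner would normally keep only rules whose support and confidence exceed user-chosen thresholds.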


    Marketing and Sales Promotion: Let the rule discovered be

    {Milk, } --> {Coke}

    Coke as consequent => Can be used to determine what should be done to boost its sales.

    Milk in the antecedent => Can be used to see which products would be affected if the store discontinues selling milk.

    Milk in antecedent and Coke in consequent => Can be used to see what products should be sold with milk to promote sale of Coke!


    Supermarket shelf management.

    Goal: To identify items that are bought together by sufficiently many customers.

    Approach: Process the point-of-sale data collected with barcode scanners to find dependencies among items.

    A classic rule: if a customer buys newspaper and milk, then he is very likely to buy beer.
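
"Bought together by sufficiently many customers" can be read as a minimum co-occurrence count. The sketch below counts item pairs across baskets and keeps those that meet a threshold. The baskets reuse the five example transactions listed earlier in the deck, while `min_count` and the function name are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, min_count):
    """Item pairs that appear together in at least min_count baskets."""
    counts = Counter()
    for basket in baskets:
        # sorted() gives every pair one canonical order, so (a, b) and
        # (b, a) are counted under the same key.
        counts.update(combinations(sorted(set(basket)), 2))
    return {pair: n for pair, n in counts.items() if n >= min_count}

baskets = [
    ["Bread", "Coke", "Milk"],
    ["Beer", "Bread"],
    ["Beer", "Coke", "Newspaper", "Milk"],
    ["Beer", "Bread", "Newspaper", "Milk"],
    ["Coke", "Newspaper", "Milk"],
]

pairs = frequent_pairs(baskets, min_count=3)
print(pairs)
```

Counting every pair is only feasible for small item catalogues; algorithms such as Apriori prune candidates that cannot reach the threshold.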


    ADVANTAGES AND DISADVANTAGES OF DATA MINING


    Detect significant deviations from normal behavior

    Applications:

    Credit Card Fraud Detection

    Network Intrusion Detection
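
A simple way to flag deviations from normal behavior is to measure how many standard deviations a new value lies from the historical mean (a z-score). This is only one possible technique, not one the slides prescribe; the purchase amounts, the threshold of 3 and the function name are hypothetical.

```python
from statistics import mean, stdev

def flag_deviations(history, new_values, threshold=3.0):
    """Return the new values lying more than `threshold` sample standard
    deviations away from the mean of the historical values."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # All historical values are identical: anything different deviates.
        return [v for v in new_values if v != mu]
    return [v for v in new_values if abs(v - mu) / sigma > threshold]

# Hypothetical past purchase amounts for one card, then three new purchases.
history = [42.0, 38.5, 51.0, 45.2, 40.1, 47.3, 39.9, 44.0]
new_purchases = [43.0, 49.5, 950.0]

suspicious = flag_deviations(history, new_purchases)
print(suspicious)
```

Real fraud and intrusion detection systems combine many attributes (location, time, merchant, traffic pattern) and do not rely on a single amount.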


    Induction vs Deduction

    Deductive reasoning is truth-preserving:

    1. All horses are mammals.

    2. All mammals have lungs.

    3. Therefore, all horses have lungs.

    Inductive reasoning adds information:

    1. All horses observed so far have lungs.

    2. Therefore, all horses have lungs.


    Data mining: the core of the knowledge discovery process.

    [Figure: knowledge discovery process. Labels: Databases, Data Cleaning, Data Integration, Data Warehouse, Data Selection, Task-relevant Data, Data Preprocessing, Data Mining, Pattern Evaluation]


    Learning the application domain: relevant prior knowledge and goals of application

    Creating a target data set: data selection

    Data cleaning and preprocessing (may take 60% of effort!)

    Data reduction and transformation: find useful features, dimensionality/variable reduction, invariant representation

    Choosing functions of data mining: summarization, classification, regression, association, clustering

    Choosing the mining algorithm(s)

    Data mining: search for patterns of interest

    Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.

    Use of discovered knowledge


    Statistical Analysis:

    Ill-suited for nominal and structured data types

    Completely data driven: incorporation of domain knowledge not possible

    Interpretation of results is difficult and daunting

    Requires expert user guidance

    Data Mining:

    Large data sets

    Efficiency of algorithms is important

    Scalability of algorithms is important

    Real-world data

    Lots of missing values

    Pre-existing data, not user generated

    Data not static, prone to updates

    Efficient methods for data retrieval available for use


    AI/Machine Learning

    Combinatorial/Game Data Mining

    Good for analyzing winning strategies to games, and thus developing intelligent AI opponents (e.g. chess).

    Business Strategies

    Market Basket Analysis

    Identify customer demographics, preferences, and purchasing patterns.

    Risk Analysis

    Product Defect Analysis

    Analyze product defect rates for given plants and predict possible complications (read: lawsuits) down the line.


    User Behavior Validation

    Fraud Detection

    In the realm of cell phones: comparing phone activity to calling records can help detect calls made on cloned phones.

    Similarly, with credit cards: comparing purchases with historical purchases can detect activity with stolen cards.


    THANK YOU

