
Ch2_DMIntro_1

Date post: 10-Apr-2018
Upload: mamatadalei

Transcript
  • 8/8/2019 Ch2_DMIntro_1


    Chapter 2. Introduction to Data Mining

    Prof. Keith [email protected]


    The Course Book

    Data Mining: A Tutorial Based Primer

    by Richard J. Roiger, Michael Geatz.

    Amazon.com

    Paperback: 408 pages; Dimensions (in inches): 0.67 x 9.14 x 7.44

    Publisher: Addison-Wesley Publishing; Book and CD-ROM edition (September 26, 2002)

    ISBN: 0201741288

    List Price: $40.00

    Availability: Usually ships within 2 to 3 days


    1.1 Data Mining: A Definition

    The process of employing one or more computer learning techniques to automatically analyze and extract knowledge from data.

    Induction-based learning is the process of forming generally applicable models (or concept definitions) by observing specific examples.


    Concepts

    Definition: A concept is a set of objects, symbols or events grouped together because they share certain characteristics.

    A concept is, roughly, a set, class, group or cluster.

    Classical View: A concept is a set with well-defined, deterministic inclusion rules.

    E.g. A home owner is a good credit risk.

    Probabilistic View: A concept is a set with probabilistic inclusion rules.

    E.g. A home owner has an 80% chance of being a good credit risk.

    Exemplar View: A given instance is determined to be an example of a particular concept if the instance is similar enough to a set of one or more known examples of the concept.

    E.g. Mr. Smith owns his own home and is a good credit risk.
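The three views can be contrasted in a small Python sketch. The function names, the attribute name `home_owner`, the matching-based similarity measure and its threshold are illustrative assumptions; only the 80% figure and the Mr. Smith example come from the slide.

```python
# Three views of the concept "good credit risk" (illustrative sketch only).

def classical_view(person):
    # Deterministic inclusion rule: every home owner is a good credit risk.
    return person["home_owner"]

def probabilistic_view(person):
    # Probabilistic inclusion rule: a home owner has an 80% chance of being
    # a good credit risk. The slide gives no figure for non-owners, so None
    # is returned for them.
    return 0.8 if person["home_owner"] else None

def similarity(a, b):
    # Assumed measure: fraction of shared attributes with equal values.
    keys = a.keys() & b.keys()
    if not keys:
        return 0.0
    return sum(a[k] == b[k] for k in keys) / len(keys)

def exemplar_view(person, exemplars, threshold=1.0):
    # The instance belongs to the concept if it is similar enough to at
    # least one known example of the concept.
    return any(similarity(person, e) >= threshold for e in exemplars)

mr_smith = {"home_owner": True}   # known example of a good credit risk
applicant = {"home_owner": True}

print(classical_view(applicant))
print(probabilistic_view(applicant))
print(exemplar_view(applicant, [mr_smith]))
```

With richer records, the exemplar view would compare many attributes, and the threshold would control how close an instance must be to a known example.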


    An Investment Dataset

    Table 1.3 Acme Investors Incorporated

    Customer  Account     Margin   Transaction  Trades/              Favorite    Annual
    ID        Type        Account  Method       Month    Sex  Age    Recreation  Income

    1005      Joint       No       Online       12.5     F    30-39  Tennis      40-59K
    1013      Custodial   No       Broker       0.5      F    50-59  Skiing      80-99K
    1245      Joint       No       Online       3.6      M    20-29  Golf        20-39K
    2110      Individual  Yes      Broker       22.3     M    30-39  Fishing     40-59K
    1001      Individual  Yes      Online       5.0      M    40-49  Golf        60-79K

    The flat file of data is in attribute-value format.

    Each row/record is also called a case or instance.

    Each column gives values for an attribute (or variable) for each of the cases.

    Attributes are either discrete/categorical/factorial, having a fixed number of possible values (e.g. sex, and age), or real, having a continuous range of possible values (e.g. average Trades/month).
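As a rough illustration, the flat file can be held in Python as one dictionary per case, with one key per attribute. The key names are made up for this sketch, and the Age and Annual Income entries are assumed to be ranges (for example 30-39 and 40-59K), which makes them categorical.

```python
# Table 1.3 in attribute-value format: one dict per case (row).
cases = [
    {"customer_id": 1005, "account_type": "Joint", "margin_account": "No",
     "transaction_method": "Online", "trades_per_month": 12.5, "sex": "F",
     "age": "30-39", "favorite_recreation": "Tennis", "annual_income": "40-59K"},
    {"customer_id": 1013, "account_type": "Custodial", "margin_account": "No",
     "transaction_method": "Broker", "trades_per_month": 0.5, "sex": "F",
     "age": "50-59", "favorite_recreation": "Skiing", "annual_income": "80-99K"},
    {"customer_id": 1245, "account_type": "Joint", "margin_account": "No",
     "transaction_method": "Online", "trades_per_month": 3.6, "sex": "M",
     "age": "20-29", "favorite_recreation": "Golf", "annual_income": "20-39K"},
    {"customer_id": 2110, "account_type": "Individual", "margin_account": "Yes",
     "transaction_method": "Broker", "trades_per_month": 22.3, "sex": "M",
     "age": "30-39", "favorite_recreation": "Fishing", "annual_income": "40-59K"},
    {"customer_id": 1001, "account_type": "Individual", "margin_account": "Yes",
     "transaction_method": "Online", "trades_per_month": 5.0, "sex": "M",
     "age": "40-49", "favorite_recreation": "Golf", "annual_income": "60-79K"},
]

def attribute_kind(name):
    # An attribute is treated as real if every value is a float; anything
    # else (labels, ranges, the ID) is treated as categorical here.
    values = [case[name] for case in cases]
    return "real" if all(isinstance(v, float) for v in values) else "categorical"

for name in cases[0]:
    distinct = sorted({str(case[name]) for case in cases})
    print(f"{name}: {attribute_kind(name)}, values seen: {distinct}")
```

The customer ID is reported as categorical only because it is not a float; in practice it is an identifier and would not be used as an input attribute.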


    Possible Business Questions

    Table 1.3 Acme Investors Incorporated (the five customer records listed earlier)

    Can I develop a general characterisation/profile of different investor types? (CLASSIFICATION)

    What characteristics distinguish between Online and Broker investors? (DISCRIMINATION)

    Can I develop a model which will predict the average trades/month for a new investor? (PREDICTION)


    Supervised Learning

    In the last two questions, we distinguish ONE of the attributes that we would like to be able to determine from the values of the others.

    What characteristics distinguish between Online and Broker investors? (DISCRIMINATION). Transaction method (categorical) is the target variable.

    Can I develop a model which will predict the average trades/month for a new investor? (PREDICTION). Trades/month (real) is the target variable.

    The target variable is called the Output variable.

    The other variables are called Input variables.

    Clearly, which attributes are the output and input variables depends on your question.

    For these questions, and output variables, we KNOW the values of the output variables for the cases in the dataset.

    In such cases we say that we do SUPERVISED learning, since the learning is controlled by the known values of the output variable in the dataset.
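The point that the question decides which attribute is the output can be sketched in a few lines of Python. The helper `split_inputs_output` and the key names are hypothetical; the values are taken from three rows of the Acme table.

```python
def split_inputs_output(cases, target):
    """Separate each case into its input attributes and its output value."""
    inputs = [{k: v for k, v in case.items() if k != target} for case in cases]
    outputs = [case[target] for case in cases]
    return inputs, outputs

cases = [
    {"transaction_method": "Online", "trades_per_month": 12.5, "sex": "F"},
    {"transaction_method": "Broker", "trades_per_month": 0.5, "sex": "F"},
    {"transaction_method": "Online", "trades_per_month": 3.6, "sex": "M"},
]

# DISCRIMINATION question: the output variable is categorical.
inputs_d, outputs_d = split_inputs_output(cases, "transaction_method")
print(outputs_d)

# PREDICTION question: the output variable is real-valued.
inputs_p, outputs_p = split_inputs_output(cases, "trades_per_month")
print(outputs_p)
```

The same dataset serves both questions; only the choice of `target` changes, and in both cases the output values are already known for every case.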


    Unsupervised Learning

    For the question:

    Can I develop a general characterisation/profile of different investor types? (CLASSIFICATION),

    NO particular attribute is singled out as an OUTPUT variable.

    The question is open-ended.

    We do not know if there are any different investor types at all.

    If there are different investor types, we do not know how many types there are.

    If there are different investor types, then we do not know what the various investor types (or classes, or concepts) mean. We have to determine the meaning of the concepts, and appropriate names, after we have determined that they exist.

    The method of induction-based learning used is said to be UNSUPERVISED in such a situation, because there are no known output classes to control the learning process.


    Another Example Dataset

    Table 1.1 Hypothetical Training Data for Disease Diagnosis

    Patient  Sore           Swollen
    ID#      Throat  Fever  Glands   Congestion  Headache  Diagnosis

    1        Yes     Yes    Yes      Yes         Yes       Strep throat
    2        No      No     No       Yes         Yes       Allergy
    3        Yes     Yes    No       Yes         No        Cold
    4        Yes     No     Yes      No          No        Strep throat
    5        No      Yes    No       Yes         No        Cold
    6        No      No     No       Yes         No        Allergy
    7        No      No     Yes      No          No        Strep throat
    8        Yes     No     No       Yes         Yes       Allergy
    9        No      Yes    No       Yes         Yes       Cold
    10       Yes     Yes    No       Yes         Yes       Cold

    In this example dataset there are categorical attributes corresponding to Symptoms, and a categorical attribute of Diagnosis.

    The natural question is to predict the Diagnosis (class) [the Output variable] from the symptoms [the Input variables].

    This requires supervised classification learning.


    The Two Concept Learning Paradigms

    Supervised Learning

    builds a learner model, or concept definitions, using data instances of known origin,

    and uses the model to determine the outcome of new instances of unknown origin.

    Unsupervised Learning

    A data mining method that builds models from data without predefined classes.

    Usually for classification/clustering.


    Supervised Learning: A Decision Tree Example

    A Decision Tree is a tree structure where non-terminal nodes represent tests/decisions on one or more attributes and terminal nodes reflect decision outcomes.

    Let us consider the Symptoms/Diagnosis dataset for a supervised classification.


    Table 1.1 Hypothetical Training Data for Disease Diagnosis (the ten patient records listed earlier)

    Consider each of the attributes in turn, to see which would be a good one to start our Decision Tree with.

    Is there a perfect 1-1 relationship between any of the input variables and the output variable?

    Sore Throat and Fever don't seem very good. However,

    {Swollen Glands = Yes} corresponds 1-1 with {Diagnosis = Strep throat},

    i.e. If {Swollen Glands = Yes} then {Diagnosis = Strep throat}, and conversely.

    Hence we use Swollen Glands for our first Decision Node.

    Continuing in the same way for the remaining cases, we get:
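The search for a 1-1 relationship can be checked mechanically. The sketch below encodes the ten rows of Table 1.1 and, for each symptom, looks for a value that occurs with exactly one diagnosis and with every case of that diagnosis. The names `ROWS`, `diagnoses_by_value` and `one_to_one_pairs` are illustrative.

```python
from collections import defaultdict

ATTRIBUTES = ["sore_throat", "fever", "swollen_glands", "congestion", "headache"]

# (sore throat, fever, swollen glands, congestion, headache, diagnosis)
ROWS = [
    ("Yes", "Yes", "Yes", "Yes", "Yes", "Strep throat"),
    ("No",  "No",  "No",  "Yes", "Yes", "Allergy"),
    ("Yes", "Yes", "No",  "Yes", "No",  "Cold"),
    ("Yes", "No",  "Yes", "No",  "No",  "Strep throat"),
    ("No",  "Yes", "No",  "Yes", "No",  "Cold"),
    ("No",  "No",  "No",  "Yes", "No",  "Allergy"),
    ("No",  "No",  "Yes", "No",  "No",  "Strep throat"),
    ("Yes", "No",  "No",  "Yes", "Yes", "Allergy"),
    ("No",  "Yes", "No",  "Yes", "Yes", "Cold"),
    ("Yes", "Yes", "No",  "Yes", "Yes", "Cold"),
]

def diagnoses_by_value(index):
    # Map each value of the attribute to the set of diagnoses seen with it.
    groups = defaultdict(set)
    for row in ROWS:
        groups[row[index]].add(row[-1])
    return dict(groups)

def one_to_one_pairs(index):
    # (value, diagnosis) pairs where the value implies the diagnosis AND
    # every case with that diagnosis has that value.
    pairs = []
    for value, diagnoses in diagnoses_by_value(index).items():
        if len(diagnoses) != 1:
            continue
        (diagnosis,) = diagnoses
        if all(row[index] == value for row in ROWS if row[-1] == diagnosis):
            pairs.append((value, diagnosis))
    return pairs

for i, name in enumerate(ATTRIBUTES):
    print(name, one_to_one_pairs(i))
```

Only Swollen Glands yields such a pair. Congestion = No also occurs only with Strep throat, but patient 1 has Strep throat with Congestion = Yes, so that relationship is not 1-1.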


    [Figure: decision tree for the disease diagnosis data]

    Swollen Glands  (first test/decision node)
      Yes: Diagnosis = Strep Throat  (terminal decision node)
      No:  Fever
             Yes: Diagnosis = Cold
             No:  Diagnosis = Allergy


    Notes on this Decision Tree:

    The tree is upside down (the root is at the top).

    The Decision Tree fits the data perfectly.

    There are no errors. Accuracy = 100%.

    The Decision Tree discards the unnecessary attributes.

    A computer algorithm to construct Decision Trees would be fairly easy to programme, and would do the job much quicker than we humans can.
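To give a feel for how short such a program can be, here is a minimal sketch of a greedy tree builder. The slides do not say which algorithm to use; this sketch picks, at each node, the attribute with the lowest weighted entropy of the diagnosis (an ID3-style heuristic, chosen here as an assumption). All names are illustrative, and the data are the ten rows of Table 1.1.

```python
import math
from collections import Counter

ATTRIBUTES = ["sore_throat", "fever", "swollen_glands", "congestion", "headache"]
RAW = [
    ("Yes", "Yes", "Yes", "Yes", "Yes", "Strep throat"),
    ("No",  "No",  "No",  "Yes", "Yes", "Allergy"),
    ("Yes", "Yes", "No",  "Yes", "No",  "Cold"),
    ("Yes", "No",  "Yes", "No",  "No",  "Strep throat"),
    ("No",  "Yes", "No",  "Yes", "No",  "Cold"),
    ("No",  "No",  "No",  "Yes", "No",  "Allergy"),
    ("No",  "No",  "Yes", "No",  "No",  "Strep throat"),
    ("Yes", "No",  "No",  "Yes", "Yes", "Allergy"),
    ("No",  "Yes", "No",  "Yes", "Yes", "Cold"),
    ("Yes", "Yes", "No",  "Yes", "Yes", "Cold"),
]
ROWS = [dict(zip(ATTRIBUTES + ["diagnosis"], r)) for r in RAW]

def entropy(rows):
    # Entropy (in bits) of the diagnosis distribution in these rows.
    counts = Counter(r["diagnosis"] for r in rows)
    return -sum(c / len(rows) * math.log2(c / len(rows)) for c in counts.values())

def split(rows, attribute):
    # Group rows by their value of the given attribute.
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r)
    return groups

def build_tree(rows, attributes):
    classes = {r["diagnosis"] for r in rows}
    if len(classes) == 1 or not attributes:
        # Terminal node: predict the most common diagnosis.
        return Counter(r["diagnosis"] for r in rows).most_common(1)[0][0]

    def weighted_entropy(a):
        return sum(len(g) / len(rows) * entropy(g) for g in split(rows, a).values())

    best = min(attributes, key=weighted_entropy)
    rest = [a for a in attributes if a != best]
    return {best: {v: build_tree(g, rest) for v, g in split(rows, best).items()}}

def classify(tree, case):
    # Follow the tests from the root until a terminal node is reached.
    while isinstance(tree, dict):
        attribute, branches = next(iter(tree.items()))
        tree = branches[case[attribute]]
    return tree

tree = build_tree(ROWS, ATTRIBUTES)
accuracy = sum(classify(tree, r) == r["diagnosis"] for r in ROWS) / len(ROWS)
print(tree)
print("training accuracy:", accuracy)
```

On this dataset the sketch tests Swollen Glands first and Fever second, never uses the other three symptoms, and classifies all ten training cases correctly. Note that `classify` raises a `KeyError` for an attribute value that was never seen during training.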


    Use of the Decision Tree for Prediction

    We may now use the Decision Tree for future diagnoses (or prediction of diagnosis). Consider the following symptomatic data:

    Table 1.2 Data Instances with an Unknown Classification

    Patient  Sore           Swollen
    ID#      Throat  Fever  Glands   Congestion  Headache  Diagnosis

    11       No      No     Yes      Yes         Yes       ?
    12       Yes     Yes    No       No          Yes       ?
    13       No      No     No       No          Yes       ?

    What are the predicted diagnoses?

    Are these likely to be 100% accurate?
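The first question can be answered by running each new case through the tree. In this sketch the function `diagnose` is a hand-written version of the decision tree (Swollen Glands first, then Fever), and the three rows are a reading of Table 1.2 whose extracted text is partly garbled, so the symptom values should be treated as an assumption.

```python
def diagnose(case):
    # Hand-coded decision tree: test Swollen Glands, then Fever.
    if case["swollen_glands"] == "Yes":
        return "Strep throat"
    if case["fever"] == "Yes":
        return "Cold"
    return "Allergy"

new_patients = {
    11: {"sore_throat": "No",  "fever": "No",  "swollen_glands": "Yes",
         "congestion": "Yes", "headache": "Yes"},
    12: {"sore_throat": "Yes", "fever": "Yes", "swollen_glands": "No",
         "congestion": "No",  "headache": "Yes"},
    13: {"sore_throat": "No",  "fever": "No",  "swollen_glands": "No",
         "congestion": "No",  "headache": "Yes"},
}

predictions = {pid: diagnose(case) for pid, case in new_patients.items()}
for pid, diagnosis in predictions.items():
    print(pid, diagnosis)
```

As for the second question, the tree was fitted to only ten training cases, so perfect accuracy on the training data says little about accuracy on new patients.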


    Production Rules

    We may summarize the Decision Tree by listing the decisions along each path from the starting node to each terminal node.

    1. IF Swollen Glands = Yes
       THEN Diagnosis = Strep Throat

    2. IF Swollen Glands = No & Fever = Yes
       THEN Diagnosis = Cold

    3. IF Swollen Glands = No & Fever = No
       THEN Diagnosis = Allergy
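Listing the rules amounts to walking every path from the root to a terminal node. A minimal sketch, assuming the tree is stored as nested dictionaries (the representation and the name `production_rules` are illustrative):

```python
# The decision tree as nested dicts: {attribute: {value: subtree or diagnosis}}.
TREE = {
    "Swollen Glands": {
        "Yes": "Strep Throat",
        "No": {"Fever": {"Yes": "Cold", "No": "Allergy"}},
    }
}

def production_rules(tree, conditions=()):
    # Yield (conditions, diagnosis) for every root-to-leaf path.
    if not isinstance(tree, dict):
        yield conditions, tree
        return
    attribute, branches = next(iter(tree.items()))
    for value, subtree in branches.items():
        yield from production_rules(subtree, conditions + ((attribute, value),))

rules = list(production_rules(TREE))
for number, (conditions, diagnosis) in enumerate(rules, start=1):
    tests = " & ".join(f"{attribute} = {value}" for attribute, value in conditions)
    print(f"{number}. IF {tests} THEN Diagnosis = {diagnosis}")
```

Each terminal node produces exactly one rule, so a tree with three terminal nodes gives the three rules listed above.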


    Unsupervised Clustering

    A data mining method that builds models from data without predefined output classes.

    For the Acme Investors data (Table 1.3, listed earlier), we may ask:

    What attribute similarities group customers together?

    What differences in attribute values segment the customers?

    How many significant clusters are there?
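The first of these questions can be explored by counting, for each pair of customers, which attribute values they share. This is not a clustering algorithm, only the kind of pairwise similarity that one could build on. The sketch uses five of the categorical attributes from Table 1.3; the names are illustrative.

```python
from itertools import combinations

ATTRIBUTES = ("account_type", "margin_account", "transaction_method",
              "sex", "favorite_recreation")

CUSTOMERS = {
    1005: ("Joint",      "No",  "Online", "F", "Tennis"),
    1013: ("Custodial",  "No",  "Broker", "F", "Skiing"),
    1245: ("Joint",      "No",  "Online", "M", "Golf"),
    2110: ("Individual", "Yes", "Broker", "M", "Fishing"),
    1001: ("Individual", "Yes", "Online", "M", "Golf"),
}

def shared_attributes(a, b):
    # Names of the attributes on which customers a and b have equal values.
    return [name for name, x, y in zip(ATTRIBUTES, CUSTOMERS[a], CUSTOMERS[b])
            if x == y]

# All customer pairs, most similar first.
pairs = sorted(combinations(CUSTOMERS, 2),
               key=lambda pair: len(shared_attributes(*pair)),
               reverse=True)

for a, b in pairs:
    print(a, b, shared_attributes(a, b))
```

With only five customers no grouping is convincing, but the same counts on a realistic dataset show which attributes hold a group together and which ones separate groups.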


    1.3 Is Data Mining Appropriate for My Problem?

    Data Mining or Data Query (using SQL and OLAP)?

    It depends on the type of question you want to answer, and the type of knowledge you want to discover.

    Shallow Knowledge: simple summaries (e.g. averages), or aggregates (totals) of an attribute over a selected set of cases. You need to know the cases to select. SQL can do this.

    Multidimensional Knowledge: Information about the frequent occurrence of values of different attributes (known as Association Analysis). OLAP on the data cube can do this.

    Hidden Knowledge: Knowledge about patterns or relationships that cannot be guessed at prior to data mining.

    Deep Knowledge: Knowledge about hidden patterns and relationships which can only be discovered using prior scientific or meta-knowledge. This is the research frontier for Data Mining.
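As an illustration of shallow knowledge, the average trades/month per transaction method is a plain SQL aggregate. The sketch below runs the query with Python's built-in `sqlite3` module on an in-memory table; the table and column names are made up for the example, and the values come from Table 1.3.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE investors ("
    "customer_id INTEGER, transaction_method TEXT, trades_per_month REAL)"
)
conn.executemany(
    "INSERT INTO investors VALUES (?, ?, ?)",
    [
        (1005, "Online", 12.5),
        (1013, "Broker", 0.5),
        (1245, "Online", 3.6),
        (2110, "Broker", 22.3),
        (1001, "Online", 5.0),
    ],
)

# We already know which cases to select and which summary we want:
# the average of one attribute, grouped by another.
rows = conn.execute(
    "SELECT transaction_method, AVG(trades_per_month) "
    "FROM investors GROUP BY transaction_method "
    "ORDER BY transaction_method"
).fetchall()
conn.close()

for method, average in rows:
    print(method, round(average, 2))
```

The query answers a question we already knew how to ask. It does not discover which attributes distinguish Online from Broker investors, which is the kind of question left to data mining.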


    Data Mining vs. OLAP vs. Data Query

    Use data query if you already almost know what you are looking for, and you wish to work with large databases.

    Use OLAP if you wish to discover simple associations in large databases.

    Use data mining to find patterns and relationships in data that are not obvious.

    Because of the relative slowness of data mining algorithms, this often means that the database has to be small, or sampled. Devising Data Mining algorithms which scale to large databases is a current research topic in Data Mining.


    Data Mining Applications

    Data mining is a young discipline with wide and diverse applications

    There is still a nontrivial gap between general principles of data mining and domain-specific, effective data mining tools for particular applications

    Some application domains

    Biomedical and DNA data analysis

    Financial data analysis

    Retail industry

    Telecommunication industry


    Biomedical Data Mining and DNA Analysis

    DNA sequences: 4 basic building blocks (nucleotides): adenine (A), cytosine (C), guanine (G), and thymine (T).

    Gene: a sequence of hundreds of individual nucleotides arranged in a particular order

    Humans have roughly 20,000 protein-coding genes (earlier estimates were around 100,000)

    Tremendous number of ways that the nucleotides can be ordered and sequenced to form distinct genes

    Semantic integration of heterogeneous, distributed genome databases

    Current: highly distributed, uncontrolled generation and use of a wide variety of DNA data

    Data cleaning and data integration methods developed in data mining will help


    DNA Analysis: Examples

    Similarity search and comparison among DNA sequences

    Compare the frequently occurring patterns of each class (e.g., diseased and healthy)

    Identify gene sequence patterns that play roles in various diseases

    Association analysis: identification of co-occurring gene sequences

    Most diseases are not triggered by a single gene but by a combination of genes acting together

    Association analysis may help determine the kinds of genes that are likely to co-occur together in target samples

    Path analysis: linking genes to different disease development stages

    Different genes may become active at different stages of the disease

    Develop pharmaceutical interventions that target the different stages separately

    Visualization tools and genetic data analysis
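The co-occurrence idea behind association analysis can be sketched by counting how often pairs of genes appear together across samples. Everything in this example is hypothetical: the gene labels, the four samples and the support threshold are invented purely to show the counting step.

```python
from collections import Counter
from itertools import combinations

# Hypothetical data: the set of genes found active in each target sample.
samples = [
    {"G1", "G2", "G3"},
    {"G1", "G2"},
    {"G2", "G3"},
    {"G1", "G2", "G4"},
]

# Count every unordered pair of genes that occurs in the same sample.
pair_counts = Counter()
for genes in samples:
    for pair in combinations(sorted(genes), 2):
        pair_counts[pair] += 1

# Keep the pairs seen in at least min_support samples (assumed threshold).
min_support = 3
frequent_pairs = {pair: n for pair, n in pair_counts.items() if n >= min_support}
print(frequent_pairs)
```

Real association analysis extends this counting to larger gene sets and to much larger numbers of samples, where the choice of threshold and efficient search both matter.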


    Data Mining for Financial Data Analysis

    Financial data collected in banks and financial institutions are often relatively complete, reliable, and of high quality

    Design and construction of data warehouses for multidimensional data analysis and data mining

    View the debt and revenue changes by month, by region, by sector, and by other factors

    Access statistical information such as max, min, total, average, trend, etc.

    Loan payment prediction/consumer credit policy analysis

    Feature selection and attribute relevance ranking

    Loan payment performance

    Consumer credit rating


    Financial Data Mining

    Classification and clustering of customers for targeted marketing

    Multidimensional segmentation by nearest-neighbor, classification, decision trees, etc. to identify customer groups or associate a new customer to an appropriate customer group

    Detection of money laundering and other financial crimes

    Integration of data from multiple DBs (e.g., bank transactions, federal/state crime history DBs)

    Tools: data visualization, linkage analysis, classification, clustering tools, outlier analysis, and sequential pattern analysis tools (find unusual access sequences)


    Data Mining for Retail Industry

    Retail industry: huge amounts of data on sales, customer shopping history, etc.

    Applications of retail data mining

    Identify customer buying behaviors

    Discover customer shopping patterns and trends

    Improve the quality of customer service

    Achieve better customer retention and satisfaction

    Enhance goods consumption ratios

    Design more effective goods transportation and distribution policies


    Data Mining in Retail Industry: Examples

    Design and construction of data warehouses based on the benefits of data mining

    Multidimensional analysis of sales, customers, products, time, and region

    Analysis of the effectiveness of sales campaigns

    Customer retention: Analysis of customer loyalty

    Use customer loyalty card information to register sequences of purchases of particular customers

    Use sequential pattern mining to investigate changes in customer consumption or loyalty

    Suggest adjustments on the pricing and variety of goods

    Purchase recommendation and cross-reference of items


    Data Mining for Telecomm. Industry (1)

    A rapidly expanding and highly competitive industry and a great demand for data mining

    Understand the business involved

    Identify telecommunication patterns

    Catch fraudulent activities

    Make better use of resources

    Improve the quality of service

    Multidimensional analysis of telecommunication data

    Intrinsically multidimensional: calling-time, duration, location of caller, location of callee, type of call, etc.


    Data Mining for Telecomm. Industry (2)

    Fraudulent pattern analysis and the identification of unusual patterns

    Identify potentially fraudulent users and their atypical usage patterns

    Detect attempts to gain fraudulent entry to customer accounts

    Discover unusual patterns which may need special attention

    Multidimensional association and sequential pattern analysis

    Find usage patterns for a set of communication services by customer group, by month, etc.

    Promote the sales of specific services

    Improve the availability of particular services in a region

    Use of visualization tools in telecommunication data analysis