
2014 Cluster Analysis Handout


Cluster Analysis: Segmenting the Market

Cluster Analysis (classification analysis, numerical taxonomy):

a class of techniques used to classify objects or cases into relatively homogeneous groups called clusters, based on the set of variables considered. Objects in each cluster tend to be similar to each other and dissimilar to objects in the other clusters.

objects: either variables or observations;

likeness: calculated from the measurements for each object.


Applications:

1. Market segmentation: e.g., benefit segmentation, clustering consumers on the basis of benefits sought from the purchase of a product.

2. Understanding buyer behaviors: e.g., by clustering consumers to identify homogeneous groups, a firm can examine the buying behavior or information-seeking behavior of each group.

3. Identifying new product opportunities: e.g., by clustering brands and products to identify competitive sets within the market, a firm can compare its current offerings with those of its competitors to identify potential new product opportunities.

4. Selecting test markets: e.g., by grouping cities into homogeneous clusters, a firm can select comparable cities to test various marketing strategies.


Distance measures for individual observations

• To measure similarity between two observations, a distance measure is needed.

• With a single variable, similarity is straightforward. Example: income. Two individuals are similar if their income levels are similar, and the level of dissimilarity increases as the income gap increases.

• Multiple variables require an aggregate distance measure. With many characteristics (e.g., income, age, consumption habits, brand loyalty, purchase frequency, family composition, education level, ...), it becomes more difficult to define similarity with a single value.

• The best-known measure of distance is the Euclidean distance, which is the concept we use in everyday life for spatial coordinates.

Model:

Data: each object is characterized by a set of numbers (measurements);

e.g., object 1: (x_11, x_12, ..., x_1n)
      object 2: (x_21, x_22, ..., x_2n)
      :
      object p: (x_p1, x_p2, ..., x_pn)

Distance: Euclidean distance, d_ij,

d_ij = sqrt( (x_i1 - x_j1)^2 + (x_i2 - x_j2)^2 + ... + (x_in - x_jn)^2 )


Example

Household   Income   Household Size
A           50K      5
B           50K      4
C           20K      2
D           20K      1

With income measured in units of $10K, for example:

d(A, C) = sqrt(3^2 + 3^2) ≈ 4.24
d(B, C) = sqrt(3^2 + 2^2) ≈ 3.61

[Scatter plot: household size (vertical axis) against income in $ (unit: 10K, horizontal axis, 20K to 50K), showing households A, B, C, D]
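As a quick check (not part of the handout), the pairwise Euclidean distances for this example can be computed directly; income is expressed in $10K units as above.

```python
import numpy as np

# Households measured on (income in $10K, household size)
labels = ["A", "B", "C", "D"]
data = np.array([
    [5.0, 5],   # A: 50K income, size 5
    [5.0, 4],   # B: 50K income, size 4
    [2.0, 2],   # C: 20K income, size 2
    [2.0, 1],   # D: 20K income, size 1
])

# Pairwise Euclidean distances, d_ij = sqrt(sum_k (x_ik - x_jk)^2)
for i in range(len(data)):
    for j in range(i + 1, len(data)):
        d = np.sqrt(((data[i] - data[j]) ** 2).sum())
        print(f"d({labels[i]}, {labels[j]}) = {d:.2f}")
# d(A, C) = 4.24 and d(B, C) = 3.61, matching the example above
```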

Between-Cluster Variation = Maximize
Within-Cluster Variation = Minimize

[Three-cluster diagram showing between-cluster and within-cluster variation]


[Scatter diagrams for cluster observations: frequency of eating out (vertical axis, Low to High) plotted against frequency of going to fast food restaurants (horizontal axis, Low to High)]


Comparison of Score Profiles for Factor Analysis and Hierarchical Cluster Analysis

              Variables
Respondent    1    2    3
A             7    6    7
B             6    7    6
C             4    3    4
D             3    4    3

[Line chart of score profiles (scores 1 to 7) for Respondents A, B, C, and D]

Clustering procedures

• Hierarchical procedures
  • Agglomerative (start from n clusters to get to 1 cluster)
  • Divisive (start from 1 cluster to get to n clusters)

• Non-hierarchical procedures
  • K-means clustering


Hierarchical clustering

• Agglomerative:
  • Each of the n observations constitutes a separate cluster.
  • The two clusters that are most similar according to some distance rule are aggregated, so that in step 1 there are n-1 clusters.
  • In the second step another cluster is formed (n-2 clusters) by nesting the two clusters that are most similar, and so on.
  • There is a merging in each step until all observations end up in a single cluster in the final step.

• Divisive:
  • All observations are initially assumed to belong to a single cluster.
  • The most dissimilar observation(s) is extracted to form a separate cluster.
  • In step 1 there will be 2 clusters, in the second step three clusters, and so on, until the final step produces as many clusters as there are observations. This technique is used in medical research and is not in the scope of our course.

• The number of clusters determines the stopping rule for the algorithms.

Non-hierarchical clustering

• These algorithms do not follow a hierarchy and produce a single partition.
• Knowledge of the number of clusters (c) is required.
• In the first step, initial cluster centres (the seeds) are determined for each of the c clusters, either by the researcher or by the software.
• Each iteration allocates observations to each of the c clusters, based on their distance from the cluster centres.
• Cluster centres are computed again, and observations may be reallocated to the nearest cluster in the next iteration.
• When no observations can be reallocated or a stopping rule is met, the process stops.


Distance between clusters

• Algorithms vary according to the way the distance between two clusters is defined.

• The most common algorithms for hierarchical methods include:
  • centroid method
  • single linkage method
  • complete linkage method
  • average linkage method
  • Ward algorithm

Linkage methods

• Single linkage method (nearest neighbour): the distance between two clusters is the minimum distance among all possible distances between observations belonging to the two clusters.

• Complete linkage method (furthest neighbour): nests two clusters using as a basis the maximum distance between observations belonging to separate clusters.

• Average linkage method: the distance between two clusters is the average of all distances between observations in the two clusters.
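As an illustration (not part of the handout), these linkage rules are available in SciPy's hierarchical clustering; a minimal sketch using the household-style data from the earlier example:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Rows are observations, columns are the clustering variables
X = np.array([[5.0, 5], [5.0, 4], [2.0, 2], [2.0, 1]])

# Agglomerative clustering under different between-cluster distance rules
for method in ["single", "complete", "average"]:
    Z = linkage(X, method=method)                     # (n-1) x 4 agglomeration schedule
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree at 2 clusters
    print(method, labels)
```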

  • 8/9/2019 2014 Cluster Analysis Handout

    9/25

Ward algorithm

1. The sum of squared distances is computed within each cluster, considering all distances between observations within the same cluster.

2. The algorithm proceeds by choosing the aggregation of two clusters which generates the smallest increase in the total sum of squared distances.

• It is a computationally intensive method, because at each step all the sums of squared distances need to be computed, together with all potential increases in the total sum of squared distances for each possible aggregation of clusters.
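A minimal NumPy sketch (not from the handout) of the quantity Ward's method minimizes: the increase in the total within-cluster sum of squares caused by a candidate merge.

```python
import numpy as np

def within_ss(cluster):
    """Sum of squared deviations of a cluster's points from its centroid."""
    cluster = np.asarray(cluster, dtype=float)
    return ((cluster - cluster.mean(axis=0)) ** 2).sum()

def ward_merge_cost(c1, c2):
    """Increase in total within-cluster sum of squares if c1 and c2 are merged."""
    merged = np.vstack([c1, c2])
    return within_ss(merged) - within_ss(c1) - within_ss(c2)

# Example: two candidate clusters of household-style observations
c1 = [[5.0, 5], [5.0, 4]]
c2 = [[2.0, 2], [2.0, 1]]
print(ward_merge_cost(c1, c2))  # Ward picks the merge with the smallest such increase
```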

Non-hierarchical clustering: K-means method

1. The number k of clusters is fixed.
2. An initial set of k "seeds" (aggregation centres) is provided, e.g., the first k elements.
3. Given a certain fixed threshold, all units are assigned to the nearest cluster seed.
4. New seeds are computed.
5. Go back to step 3 until no reclassification is necessary.

Units can be reassigned in successive steps (optimising partitioning); a minimal sketch of this loop is given below.
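A minimal Python sketch of the K-means loop (not part of the handout), using the rule above of taking the first k elements as initial seeds; it has no handling for empty clusters.

```python
import numpy as np

def kmeans(X, k, n_iter=100):
    """Plain K-means: assign units to the nearest seed, recompute seeds, repeat."""
    X = np.asarray(X, dtype=float)
    centres = X[:k].copy()                      # first k elements as initial seeds
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Assign every unit to the nearest cluster seed
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute the seeds as the means of the current clusters
        new_centres = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centres, centres):   # no reclassification needed: stop
            break
        centres = new_centres
    return labels, centres

X = [[5.0, 5], [5.0, 4], [2.0, 2], [2.0, 1]]
print(kmeans(X, k=2))
```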

  • 8/9/2019 2014 Cluster Analysis Handout

    10/25

Hierarchical vs. non-hierarchical methods

Hierarchical methods:
• No decision about the number of clusters
• Problems when data contain a high level of error
• Can be very slow; preferable with small data sets
• At each step they require computation of the full proximity matrix

Non-hierarchical methods:
• Faster, more reliable, work with large data sets
• Need to specify the number of clusters
• Need to set the initial seeds
• Only cluster distances to seeds need to be computed in each iteration

How many clusters?

There are no hard and fast rules; consider:

a. theoretical, conceptual, or practical considerations;
b. the distances at which clusters are combined in a hierarchical clustering (illustrated in the sketch below);
c. the relative size of the clusters should be meaningful, etc.
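As an illustration of point (b), a sketch (not from the handout) that reads the merge distances off SciPy's agglomeration schedule; the data here is a random placeholder.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.random.default_rng(1).normal(size=(20, 3))  # placeholder data
Z = linkage(X, method="ward")

# Column 2 of the schedule holds the distance at which each merge happens.
merge_dist = Z[:, 2]
# A large jump between successive merges suggests the clusters being combined
# are far apart, i.e. stop before that merge.
print("merge distances:", np.round(merge_dist, 2))
print("increase at each step:", np.round(np.diff(merge_dist), 2))
```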

  • 8/9/2019 2014 Cluster Analysis Handout

    11/25

Outliers

• Outliers can distort your cluster solution if you do not remove them!

• Removing them can also affect your cluster solution (especially with a small sample size)!

• Should we standardize the clustering variables? (see the sketch below)

• What is the effect of multicollinearity in cluster analysis?
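On the standardization question: a common practice is to standardize when the variables are on very different scales. A minimal sketch (not part of the handout):

```python
import numpy as np

X = np.array([[50_000, 5], [50_000, 4], [20_000, 2], [20_000, 1]], dtype=float)

# Z-score standardization: each variable gets mean 0 and standard deviation 1,
# so income (in dollars) does not dominate household size in the Euclidean
# distance purely because of its larger scale.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_std)
```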


Cluster Analysis: Variable Selection

• Variables are typically measured metrically, but the technique can be applied to non-metric variables with caution.
• Variables are logically related to a single underlying concept or construct.

Variable  Description  Type

Work Environment Measures
X1   I am paid fairly for the work I do.  Metric
X2   I am doing the kind of work I want.  Metric
X3   My supervisor gives credit and praise for work well done.  Metric
X4   There is a lot of cooperation among the members of my work group.  Metric
X5   My job allows me to learn new skills.  Metric
X6   My supervisor recognizes my potential.  Metric
X7   My work gives me a sense of accomplishment.  Metric
X8   My immediate work group functions as a team.  Metric
X9   My pay reflects the effort I put into doing my work.  Metric
X10  My supervisor is friendly and helpful.  Metric
X11  The members of my work group have the skills and/or training to do their job well.  Metric
X12  The benefits I receive are reasonable.  Metric

Relationship Measures
X13  I have a sense of loyalty to McDonald's restaurant.  Metric
X14  I am willing to put in a great deal of effort beyond that expected to help McDonald's restaurant to be successful.  Metric
X15  I am proud to tell others that I work for McDonald's restaurant.  Metric

Classification Variables
X16  Intention to Search  Metric
X17  Length of Time an Employee  Nonmetric
X18  Work Type = Part-Time vs. Full-Time  Nonmetric
X19  Gender  Nonmetric
X20  Age  Metric
X21  Performance  Metric


Using SPSS to Identify Clusters

For this example we are looking for subgroups among all the 63 employees of McDonald's restaurant using the "organizational commitment" variables. The SPSS click-through sequence is: Analyze → Classify → Hierarchical Cluster. This takes you to a dialog box where you select and move variables X13, X14 and X15 into the "Variables" box. Next go to the Statistics box: the agglomeration schedule is selected as the default option, and cluster membership "None" is selected as default; we shall continue with the default options here. Next click on the Plots box, check Dendrogram, and in the Icicle window click on the None button; then Continue. Next click on the Method box and select Ward's under Cluster Method (it is the last option). Squared Euclidean Distance is the default under Measure and we will use it, and we do not need to standardize this data. We will not select anything on the Save option now. Now click on "OK" to run the program.
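For readers without SPSS, a rough Python equivalent of this run (a sketch, not from the handout; the file name and the column names x13, x14, x15 are assumptions). SciPy's Ward implementation works from the raw observations and reports merge heights on a different scale than SPSS's squared-Euclidean coefficients, but the merge order and dendrogram are analogous.

```python
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Hypothetical data frame holding the three organizational-commitment items
df = pd.read_csv("mcdonalds_employees.csv")   # assumed file name
X = df[["x13", "x14", "x15"]].to_numpy()      # assumed column names

Z = linkage(X, method="ward")   # agglomeration schedule (analogous to SPSS output)
dendrogram(Z)                   # inspect the tree to pick the number of clusters
plt.show()
```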


Notice the change in the coefficients in the last two stages.


Identify the number of clusters from the dendrogram.

Using SPSS to Identify Clusters

In the next step the SPSS click-through sequence is: Analyze → Classify → K-Means Cluster. This takes you to a dialog box where you select and move variables X13, X14 and X15 into the "Variables" box. In the "Number of Clusters" box put 3 in place of 2. Next go to the Save box and check Cluster Membership. Next click on Options, uncheck the Initial Cluster Centers option, and check ANOVA Table. Now click on "OK" to run the program.
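A rough Python equivalent of this K-means run (a sketch, not from the handout; the file and column names are the same assumptions as above).

```python
import pandas as pd
from sklearn.cluster import KMeans

df = pd.read_csv("mcdonalds_employees.csv")      # assumed file name
X = df[["x13", "x14", "x15"]].to_numpy()         # assumed column names

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
df["cluster_3"] = km.labels_ + 1                 # save cluster membership (SPSS Save option)
print(df["cluster_3"].value_counts())            # cluster sample sizes
print(km.cluster_centers_)                       # final cluster centres
```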



Determine if clusters exist . . .
Run ANOVA with the cluster IDs and the organizational commitment variables.
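A sketch of the same check outside SPSS (assumed file and column names as above): a one-way ANOVA of each commitment variable across the saved cluster IDs.

```python
import pandas as pd
from scipy.stats import f_oneway

df = pd.read_csv("mcdonalds_employees.csv")   # assumed file and column names

# One-way ANOVA of each commitment variable across the saved cluster IDs
for var in ["x13", "x14", "x15"]:
    groups = [g[var].to_numpy() for _, g in df.groupby("cluster_3")]
    f, p = f_oneway(*groups)
    print(f"{var}: F = {f:.2f}, p = {p:.4f}")
```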


ANOVA dialog:
1. Move the three cluster variables into the "Dependent List" window.
2. Move the cluster ID variable into the "Factor" window.
3. Click on Options, check Descriptive, next Continue, and then OK.

Step 1: Determine if clusters exist?
2-Cluster ANOVA Results. Three issues to examine: (1) statistical significance, (2) cluster sample sizes, and (3) variable means.

Conclusion:
Cluster 1: More Committed
Cluster 2: Less Committed


Step 2: Determine if clusters exist?
3-Cluster ANOVA. Must run "post-hoc" tests.

1. Take the 2-cluster ID variable out and insert the 3-cluster ID variable.
2. Click on the Post Hoc button and check Scheffe.

Conclusions:
Cluster 1: Least Committed
Cluster 2: Moderately Committed
Cluster 3: Most Committed

• Individual cluster sample sizes OK.
• Clusters significantly different, but must examine post hoc tests.
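SPSS offers the Scheffe test directly. As a sketch of the same idea in Python (not from the handout), Tukey's HSD from statsmodels is a readily available pairwise post-hoc alternative, applied here to one commitment variable across the three clusters (file and column names as assumed above).

```python
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

df = pd.read_csv("mcdonalds_employees.csv")   # assumed file and column names

# Pairwise comparisons of one commitment item across the 3-cluster solution
result = pairwise_tukeyhsd(endog=df["x13"], groups=df["cluster_3"], alpha=0.05)
print(result.summary())
```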


Step 2: Determine if clusters exist?
3-Cluster ANOVA (results)

Step 3: Determine if clusters exist?
4-Cluster ANOVA

1. Remove the 3-cluster ID variable and insert the 4-cluster ID variable.
2. Click OK to run.


Determine if clusters exist?
4-Cluster ANOVA

Conclusions:
1. Group sample sizes still OK.
2. Clusters are significantly different.
3. Means of the four clusters are more difficult to interpret; may want to examine "polar extremes". The most likely approach is to combine clusters 1 and 2 and do a three-cluster solution, or to remove groups 1 and 2 and compare the extreme groups (3 & 4).

Four-Cluster ANOVA: Post Hoc Results
1. All clusters are significantly different.
2. The largest differences are consistently between clusters 3 and 4.


Decide number of clusters . . .

1. Examine the cluster analysis Agglomeration Schedule (error coefficients).
2. Consider cluster sample sizes.
3. Consider statistical significance.
4. Evaluate differences in cluster means.
5. Evaluate interpretation & communication issues.

Error Reduction:
1 to 2 Clusters = 58.4%
2 to 3 Clusters = 25.5%
3 to 4 Clusters = 22.8%
4 to 5 Clusters = 22.2%

Conclusion: the benefit is similar or less after 3 clusters.
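A sketch (not from the handout) of how such error-reduction percentages can be derived from the agglomeration coefficients; the coefficient values below are placeholders for illustration, not the handout's actual output.

```python
# Total within-cluster error (agglomeration coefficient) at each number of
# clusters. Placeholder values only.
error = {1: 100.0, 2: 60.0, 3: 45.0, 4: 35.0, 5: 28.0}

# Error reduction achieved when moving from k to k+1 clusters
for k in (1, 2, 3, 4):
    reduction = (error[k] - error[k + 1]) / error[k] * 100
    print(f"{k} -> {k + 1} clusters: error reduced by {reduction:.1f}%")
# When the extra reduction levels off, additional clusters add little benefit.
```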

Step 4: Describe cluster characteristics . . .

1. Use ANOVA.
2. Remove the clustering variables from the "Dependent List" window.
3. Insert the demographic variables.
4. Change the "Factor" variable if necessary.


Step 4: Describe cluster characteristics . . .

1. Go to Variable View.
2. Under the Values column, click on None beside the variable for the number of cluster groups you will examine.
3. Assign value labels to each cluster.
4. Run ANOVA on the demographics.

• Describe demographic characteristics.
• Conclusions (3-cluster solution):
  • Clusters are significantly different.
  • More committed cluster (must know the coding to interpret) . . .
    - Less likely to search (lower mean)
    - Full-time employees (code = 0)
    - Females (code = 1)
    - High performers (higher mean)
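A sketch (again with the assumed file, the saved cluster_3 variable, and illustrative demographic column names) of profiling each cluster on the metric demographics with group means, alongside a one-way ANOVA per variable.

```python
import pandas as pd
from scipy.stats import f_oneway

df = pd.read_csv("mcdonalds_employees.csv")   # assumed file name
demographics = ["x16", "x20", "x21"]          # illustrative names for the metric demographics

# Cluster means give the demographic profile of each cluster
print(df.groupby("cluster_3")[demographics].mean())

# One-way ANOVA checks whether each demographic differs across clusters
for var in demographics:
    groups = [g[var].to_numpy() for _, g in df.groupby("cluster_3")]
    f, p = f_oneway(*groups)
    print(f"{var}: F = {f:.2f}, p = {p:.4f}")
```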


• Thank you

