Unilever Data Analysis Project

Paper 179

Dimitris Bertsimas
Adam Mersereau
Geetanjali Mittal

June 2003

A research and education initiative at the MIT Sloan School of Management

For more information, please visit our website at http://ebusiness.mit.edu or contact the Center directly at [email protected] or 617-253-7054

UNILEVER DATA ANALYSIS PROJECT

BY

DIMITRIS BERTSIMAS ADAM MERSEREAU

GEETANJALI MITTAL

Massachusetts Institute of Technology

TABLE OF CONTENTS

1 INTRODUCTION
1.1 Summary of Data and Unilever's Previous Data Mining Efforts
1.1.1 Panel Data
1.1.2 The Most Valuable Consumer and Existing Predictive Models
1.1.3 Unilever Database
1.2 Project Research Directions
1.3 Data Provided

2 PREDICTION EFFORTS ON ORIGINAL DATA SET
2.1 Data Cleaning
2.2 Predictive Methods Investigated
2.3 Results
2.4 Conclusions
2.5 Block Structure in the Unilever Data Extract
2.6 Analysis of Fall 2000 Survey Data

3 NON-SURVEY DATA ANALYSIS
3.1 Data Details
3.2 Data Processing and Transformation
3.2.1 Data Cleaning
3.2.2 Data Aggregation
3.3 Data Issues
3.3.1 Conflicts in Computation of Household Layout Variables
3.3.2 Unbalanced Brand Representation
3.3.3 Insufficient Information on Source of Information in Non–Survey Data
3.4 Predictive Modeling
3.4.1 Choice of Models
3.4.2 Predictive Efforts using Logistic Regression
3.5 GQM Score–Based Stratified Analysis
3.5.1 Stratified Prediction Models
3.5.2 Predicting GQM Strata for Each Consumer

4 CLUSTERING ANALYSIS
4.1 Clustering Background
4.2 Cluster Analysis Details
4.3 Inference from Cluster Analysis
4.3.1 Effects of Unbalanced Brand Representation
4.3.2 Details of Cluster 1
4.3.3 Details of Cluster 2
4.3.4 Details of Cluster 3
4.5 Cluster–Wise Predictive Modeling

5 PROJECT SUMMARY AND CONCLUSIONS

TABLE OF FIGURES

Figure 1: Graphic Representation of MVC
Figure 2: Brands with Top Ten Consumer Responses
Figure 3: Consumer Response Distribution for Overall Data
Figure 4: Representative Diagram of Unilever Data Structure
Figure 5: Lift Curves for Caress Body Wash and Classico Pasta Sauce Respectively
Figure 6: Improved Lift Curves for Caress Body Wash and Classico Pasta Sauce Respectively
Figure 7: Coefficients of a Logistic Regression Model for Caress Body Wash
Figure 8: Coefficients of a Logistic Regression Model for Classico Pasta Sauce
Figure 9: Graphical Representation of Unilever Data Structure
Figure 10: Lift Curves for Logistic Regression Models Based on Fall 2000 Survey Data
Figure 11: Graphical Representation of Non–Survey Data
Figure 12: Consumer Response Distribution for Non–Survey Data
Figure 13: Category-wise Brand Distribution
Figure 14: Household Member Age–Gender–Wise Aggregation
Figure 15: Consumer Response Distribution for Some of the Brands
Figure 16: Dove Body Wash: Logistic Regression, Tree, Neural Network
Figure 17: Lift Curves for Different Unilever Brands
Figure 18: Significant Demographic Parameters
Figure 19: Graphical Representation of Logistic Regression Model for Gorton Fillets
Figure 20: Logistic Regression Model Coefficients for Gorton Fillets
Figure 21: Comparative Lift Curves for Models with and without GQM Scores
Figure 22: Comparative Lift Curves for GQM Stratified and Overall Models
Figure 23: Cluster Pie Chart
Figure 24: Cluster Statistics
Figure 25: Importance Value of Significant Variables
Figure 26: Input Means for Cluster 1
Figure 27: Input Means for Cluster 2
Figure 28: Input Means for Cluster 3
Figure 29: Comparative Lift Curves for Cluster-based and Overall Models

1 INTRODUCTION

This report describes and concludes the data analysis project undertaken in collaboration with Unilever

through the Sloan School of Management Center for eBusiness. In this document we trace our interactions

with Unilever, describe the data made available to us, describe various analyses and results, and present

overall conclusions learned in the course of the project.

Unilever has been a pioneer in mass marketing, which focuses on widely broadcast advertising messages. This marketing approach is at odds with new trends towards targeted marketing and CRM (Consumer Relationship Management), and Unilever is interested in investigating how the new marketing philosophy applies to the packaged consumer goods industry in general and to Unilever in particular. Because Unilever sells its products not directly to consumers but through a variety of retail channels, it has only indirect contact with the end consumers. Thus the application of CRM (defined by Unilever as “Consumer Relationship Management” rather than the more standard “Customer Relationship Management” to make this distinction) is less obvious in its business. In particular, Unilever recognizes three clear obstacles to the application of CRM ideas in the packaged goods industry:

• Consumer transaction data is difficult to procure in the packaged goods industry.

• Data mining expertise and experience are new to packaged goods companies.

• The packaged goods industry marketing efforts focus on brands rather than on consumers.

Unilever employs data mining in the area of CRM. This effort is partly undertaken in the Relationship Marketing Innovation Center (RMIC), a group that transcends Unilever’s individual brands. Unilever’s project with MIT has been part of these efforts.

Especially in light of the second bullet point above, the MIT team was asked to help evaluate the potential of data mining technology for consumer targeting in Unilever’s business. Unilever was to make available to the MIT team representative samples of the data at its disposal, and the MIT team was to analyze this data and research new data mining methods for using it in a targeted marketing framework.

1.1 Summary of Data and Unilever’s Previous Data Mining Efforts

This section summarizes our understanding of the data Unilever has available for analysis, as well as Unilever’s previous data mining efforts with this data. A primary data source is a Unilever database of consumers. The database contains information on individuals and households who have interacted with Unilever in some fashion in the past. The data includes demographic and geographic information as well as self-reported usage and survey information.

1.1.1 Models

Based on information available on a subset of Unilever consumers, two models were fit, the so-called Demographic and Golden Question models, for predicting whether a consumer is an MVC (“Most Valuable Consumer”) as measured by their dollar spend on Unilever’s collection of brands. The concept of MVC and the two models are described in more detail in the next section.

1.1.2 The Most Valuable Consumer and Existing Predictive Models

Much of Unilever’s data mining effort prior to August 2001 was focused on identifying the “most valuable consumers” (MVCs) at the brand and company-wide levels. The MVC for a specific brand is a consumer determined to spend highly on the Unilever brand and on the industry category. Specifically, consumers are ranked both by their dollar spend on a Unilever brand and by their dollar spend in the corresponding industry category. Each individual is then categorized as a “heavy”, “medium”, or “low” brand or category consumer. The MVCs for a given brand are generally defined as those consumers found in the shaded regions of the following table.

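[Figure image not reproduced: a 3×3 grid with rows H, M, L and columns L, M, H; the one recoverable axis label reads “profitability to Unilever brand”.]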

Figure 1: Graphic Representation of MVC

The concept of MVC can also be extended to the level of the overall company. Unilever’s Demographic and Golden Question models are logistic regression models that estimate the probability that an individual consumer is an MVC in terms of their expenditure on Unilever products as a whole. The Demographic model uses demographic variables exclusively, while the Golden Question model uses demographic inputs together with a minimal set of survey responses about product usage. These models are used to score and rank individuals in the Unilever database.

1.1.3 Unilever Database

The Unilever Database is a large data warehouse owned by Unilever but maintained by Acxiom, a data

warehousing company. The database includes varying amounts of information on Unilever consumers,

compiled from a number of sources. For each consumer, the database potentially reports on:

• Demographic data at the individual, household, and geographic block levels

• Responses to promotional events

• Survey responses

• Predictions of Demographic and Golden Question models

The database contains no transactional purchase information, although it does provide self-reported brand

usage data, model predictions, and contact history information. The accuracy of the self-reported brand

usage data varies by brand.

1.2 Project Research Directions

At a project kickoff meeting in Greenwich, CT in August, 2001, we discussed several research directions of

interest to both MIT and Unilever:

• Investigate methods for predicting, characterizing, and clustering individual consumer usage of

products.

• Experiment with alternate definitions of MVC.

• Develop alternative models to predict MVC.

• Develop dynamic logistic regression models—that is, prediction methodologies based on logistic

regression that evolve in time.

• Develop optimization-based logistic regression subset selection methodologies. Specifically, how can

we use such a methodology to design a questionnaire of maximum value and minimum length?

Although this list includes a number of items that may be of interest to Unilever in the future, the majority

of our efforts were focused on the first of these topics. This decision was largely guided by the data made

available to us, which includes no information on consumer profitability and contains limited time stamp

information. In subsequent sections of the report, we will revisit the data limitations and the role of the data

in guiding our analysis.

1.3 Data Provided

We received several data files from Unilever on a CD dated November 30, 2001. The files and our understanding of them are as follows:

• “UNIFORM.TXT”: A large data file sampled from the Unilever Database. The extraction was

performed via uniform random sampling from the most complete and reliable records from the

database.

• “STRATIFIED.TXT”: A large data file sampled from the Unilever Database. This file is similar to

UNIFORM.TXT except in the method of sampling. Stratified sampling was used and stratification

was done with respect to the Golden Question model scores.

• “MIT LAYOUT #1 AXCIOM.XLS”: A list of variable names and layout for the data in

UNIFORM.TXT and STRATIFIED.TXT.

• “LAYOUT DESCRIPTION #1.XLS”: A list of variable names along with a brief description of

many of the variables.

• “MIT LAYOUT #2 INFOBASE.XLS”: An enumeration of InfoBase data entries, which form some

of the variables in UNIFORM.TXT and STRATIFIED.TXT. Many of the variables described in MIT

LAYOUT #2 INFOBASE.XLS do not appear in the data files.

• “DATA DICTIONARY JUNE 2001.DOC”: An enumeration of InfoBase variables, only some of

which are included in the data files.

• “BRAND ID&NAME.XLS”: A table matching brand ids to brand names.

• “MARKET.TXT”: A table matching Market Codes to county names.

• “VIC’S REPLY.TXT”: Some detail clarifications on LAYOUT DESCRIPTION #1.XLS.

We were subsequently given the following file:

• “MASTER.XLS”: A table matching brand ids to brand names, franchises, and product categories.

We were provided no other description or information regarding the data.

The UNIFORM.TXT data file has a raw size of 114 MB and includes 367 variables for 46,307 observations. The STRATIFIED.TXT data file has a raw size of 133 MB and includes the same 367 variables for 53,693 observations. Many of the observations in both data files have substantial missing data. To summarize these data sets, we provide a brief discussion of the sets of variables included in UNIFORM.TXT and STRATIFIED.TXT:

• Individual Layout: This set of variables includes individual and household ID codes, individuals' names, and basic demographic variables for age and gender. Other demographic variables such as “Marital Status,” “Employment Status,” “Occupation Type,” and “Ethnic Code” have a large number of missing values.

• Household Layout: This set of variables includes data specific to a household (note that a household may include multiple individuals), and is a result of the merging of the Unilever Database with data from third-party sources. Third-party data offers information on household vehicle ownership, the distribution of genders and ages in the household, occupations, home ownership, credit card ownership, and membership in a number of lifestyle clusters (e.g. “traditionalist,” “home and garden”). Most of these fields have at most 20% missing data.

• Demographic Model: This set of variables gives results of the Demographic logistic regression model for prediction of MVC. The most important field here is the “Model Score”, which gives the model’s prediction as a value between 0 and 1. The field “Model Score Group” denotes the decile in which the model score falls. Deciles are defined according to the Demographic model results.

• Golden Question Model: This set of variables gives results of the Golden Question logistic regression model for prediction of MVC. The most important field here is the “Model Score”, which gives the model’s prediction as a value between 0 and 1. The field “Model Score Group” groups the observations into deciles according to the Golden Question model results. The scores of the Demographic and Golden Question models are positively correlated. We have measured a correlation of 0.6 on the UNIFORM.TXT data.

• Block Group: This set of variables describes the block the household belongs to. A block is an address-based segment of the population. Thus, there are likely multiple households per block. These variables essentially provide demographic information about the geographical neighborhood of the household. Variables describe the urban/rural breakdown, the ethnic breakdown, the distribution of home valuation, employment breakdown, education level, etc.

• Brand Usage Layout #01-#20: For each individual, there are 20 sets of brand usage variables.

Thus each individual is associated with at most 20 brands. The brands are a mixture of Unilever

and non-Unilever brands.

A total of 259 brands appear in the UNIFORM.TXT dataset, with 96 chosen by at least 100 individuals, 55

chosen by at least 1000 individuals, and 17 chosen by at least 10000 individuals. The average individual

reports interaction with 9.5 distinct brands. In our analysis we concentrated on the file UNIFORM.TXT

instead of the file STRATIFIED.TXT, because it seemed appropriate to use a representative sample of the

underlying data set.

2 PREDICTION EFFORTS ON ORIGINAL DATA SET

Our initial efforts were towards developing methods for predicting usage of individual brands using

demographic variables and cross-purchase information as inputs. We were particularly interested in

methodological innovations for using these different sets of variables to make useful predictions.

At this stage of the project we chose to concentrate on the prediction of usage for individual brands due to

the following reasons:

• Prediction of brand usage is of obvious use in targeted marketing

• A lack of information on consumer profitability and limited time stamp information eliminated

several of the topics mentioned in section 1.2

• The problem is of general interest to Unilever and other packaged goods companies that have

large amounts of data over a wide range of products.

2.1 Data Cleaning

To clean the data set for analysis, we first performed a brand aggregation using the Franchise and Category information provided in the MASTER.XLS file. The purpose was to eliminate the distinction between very similar products.

For example, we judged that different flavors and versions of the same product should appear the same from the perspective of a company-level analysis. The new brand labels in most cases can be mapped one-to-one to the original notion of brands. After eliminating those brands with reported usage by fewer than 100 individuals, we were left with 76 unique brands for analysis in the UNIFORM.TXT data set. Figure 2 shows the ten brands with the most consumer responses and Figure 3 shows the distribution of reported usage among the 76 brands. Note that the patterns of reported usage in this data are not representative of the actual sales of these products.

# Consumer Responses    Product

9367                    Suave Shampoo / Conditioner
6643                    Dove Bar Soap
6146                    Ragu Pasta Sauce
4989                    Lipton Tea Bags
4938                    Dial Bar Soap
4923                    Bath & Body Works Bar Soap
4640                    Lever 2000 Bar Soap
4554                    Good Humor / Breyers Ice Cream
4399                    Other Body Wash
4302                    Dove Body Wash

Figure 2: Brands with Top Ten Consumer Responses

[Figure image not reproduced: chart with a vertical axis labeled “# Customers Using Product (13870 tot…” (label truncated in the source), running from 0 to 10000, and a horizontal axis indexed from 1 to 73.]

Figure 3: Consumer Response Distribution for Overall Data

For each of these 76 products, we constructed a binary indicator of reported usage for every consumer, with 1 representing a positive usage response and 0 representing no indication of usage.
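
The construction of these indicators can be sketched as follows. This is a minimal illustration rather than the procedure actually used: it assumes each consumer's record has already been reduced to a list of aggregated brand labels, and the function and argument names are hypothetical. The default threshold of 100 users matches the cutoff described in this section.

```python
from collections import Counter

def build_usage_indicators(brands_by_consumer, min_users=100):
    """Return (kept_brands, rows), where rows[i][j] is 1 if consumer i
    reported using kept_brands[j] and 0 otherwise."""
    # Count each brand once per consumer, even if it is listed twice.
    counts = Counter()
    for brands in brands_by_consumer:
        counts.update(set(brands))
    # Keep only brands reported by at least `min_users` consumers.
    kept_brands = sorted(b for b, n in counts.items() if n >= min_users)
    column = {b: j for j, b in enumerate(kept_brands)}
    rows = []
    for brands in brands_by_consumer:
        row = [0] * len(kept_brands)
        for b in brands:
            if b in column:
                row[column[b]] = 1
        rows.append(row)
    return kept_brands, rows
```

A 0 in this matrix means only that no usage was reported, not that the consumer is known to be a non-user.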

In addition to the 76 brands, we focused on the following four demographic variables in our prediction

efforts:

• Presence of Child (Yes/No)

• Household Size

• Income Category

• Geographic Region (North, South, Midwest, West)

These variables were chosen because they included relatively few missing values and because they

generally represent important individual and household indicators which act as proxies for underlying

factors that influence purchase behavior.

The resulting cleaned data thus contained four demographic variables and 76 binary reported usage

variables for each of the 46307 consumers. A representation of this data is as follows:

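[Figure image not reproduced: a table with one row per individual (1, 2, …, 46307). The columns Child, Household Size, Income Category, and Region are grouped as “Demographic variables”, followed by BRAND columns grouped as “Usage variables”.]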

Figure 4: Representative Diagram of Unilever Data Structure

As is common practice in data mining studies, to deal with the so–called “over–fitting” problem the

UNIFORM.TXT data was randomly partitioned into a training set of 27,731 consumers for fitting of model

parameters, a validation set of 9,291 consumers for choosing among models, and a test set of 9,285

consumers for measuring final results.
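
A random three-way partition with these set sizes can be sketched as below. The seed and the function name are assumptions made for illustration; the report does not say how the random split was generated.

```python
import random

def partition(n_consumers, n_train, n_valid, seed=0):
    """Randomly split consumer indices into training, validation, and test sets."""
    if n_train + n_valid > n_consumers:
        raise ValueError("training and validation sets exceed the data size")
    indices = list(range(n_consumers))
    random.Random(seed).shuffle(indices)  # fixed seed (assumed) for repeatability
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]    # whatever remains: 9,285 consumers here
    return train, valid, test

train, valid, test = partition(46307, 27731, 9291)
```

Model parameters are fit on `train`, competing models are compared on `valid`, and `test` is held back for the final measurement.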

2.2 Predictive Methods Investigated

Logistic regression has a long history of use in targeted marketing for linking demographic variables and

purchase behavior, while collaborative filtering is an approach that has developed with the rise of e-

commerce for predicting a consumer’s preferences based on the preferences of similar consumers. Our

interests in this study were in investigating the relative merits of these two methodologies and of combining

their results in various ways.

Logistic regression establishes a relationship between predictor variables and a response variable via the

logistic function. In particular, we model a consumer’s probability of response p in terms of a set of

predictor variables x1, …, xn as follows:

$$p = \frac{\exp\left(\sum_i \beta_i x_i\right)}{1 + \exp\left(\sum_i \beta_i x_i\right)}$$

Logistic regression has often been used in marketing contexts, and has the advantages that the model is

interpretable and can be fit using efficient methods. It can be susceptible to overfitting, however, when

there are many predictor variables.

Collaborative filtering models an individual consumer’s response as a weighted average of the responses of

other consumers in the database, where the weighting is typically according to a similarity measure among

consumers. The approach is appropriate in applications where there are sufficiently large numbers of

products to allow computation of a useful similarity measure among consumers. The collaborative filtering

approach has proved useful particularly in internet recommender systems. Examples include the

recommendation engines employed by Amazon and Netflix. Collaborative filtering is not as easily

interpretable as logistic regression, but has the advantages that it is conceptually simple and is adept at

handling many variables representing choices among a large number of possibilities.

In our implementation, we compute similarities among consumers based on reported usage information only. Since this is binary data, we require a suitable similarity measure. We have made use of the so-called Jaccard similarity. Given the usage vectors of two individual consumers, define a to be the number of products the two consumers have in common, b to be the number of products unique to consumer 1, and c to be the number of products unique to consumer 2. Then the Jaccard similarity is given by the ratio a/(a+b+c). The Jaccard measure thus takes into account products the two consumers have in common, but ignores items that neither has chosen. As our reported usage data is relatively sparse, there are likely to be many products chosen by neither. With the Jaccard similarity measure, these products will not inflate the similarity measure as they would with, say, a correlation measure. We also made use of a weighting scheme that weights rare products more heavily than common ones. Such a modification is based on the observation that selection of a rare product is more informative than selection of a common product. Such an “inverse frequency” weighting is common in collaborative filtering systems.
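
The similarity measure and the weighted-average prediction can be sketched as follows. This is an illustrative reading, not the implementation used in the study. In particular, the report does not give the exact form of the inverse frequency weights, so the logarithmic weight log(N / n_j) below is an assumption, as are all function names.

```python
import math

def jaccard(u, v, weights=None):
    """Jaccard similarity a / (a + b + c) between two 0/1 usage vectors.

    With `weights`, each product contributes its weight instead of 1
    (an assumed form of the inverse frequency weighting).
    """
    if weights is None:
        weights = [1.0] * len(u)
    common = sum(w for x, y, w in zip(u, v, weights) if x and y)  # a
    either = sum(w for x, y, w in zip(u, v, weights) if x or y)   # a + b + c
    return common / either if either > 0 else 0.0

def inverse_frequency_weights(rows):
    """Assumed weighting: log(N / n_j), where n_j of N consumers chose product j."""
    n = len(rows)
    counts = [sum(col) for col in zip(*rows)]
    return [math.log(n / c) if c > 0 else 0.0 for c in counts]

def collab_score(target, rows, product, weights=None):
    """Predict usage of `product` for consumer `target` as a
    similarity-weighted average of the other consumers' responses."""
    def without(vec):
        return vec[:product] + vec[product + 1:]  # drop the product being predicted
    w = None if weights is None else without(weights)
    num = den = 0.0
    for i, row in enumerate(rows):
        if i == target:
            continue
        s = jaccard(without(rows[target]), without(row), w)
        num += s * row[product]
        den += s
    return num / den if den > 0 else 0.0
```

Products that neither consumer chose add nothing to either sum in `jaccard`, which is the property that keeps sparse data from inflating the similarity.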

Initial tests of these methods indicated that we might be able to produce more accurate predictions by using

the predictions of a logistic regression model and a collaborative filtering model as inputs to a third model.

We considered three methods for combining these models: a weighted average of the two results, another

logistic regression taking the two results as inputs, and an optimization model that computes a linear

discriminant in the space of the logistic and collaborative model outputs. Upon further testing, we decided

that the logistic regression method for combining models exhibits the best performance.

In the end, we examined several models for predicting individual product usage from the demographic and

other product usage data:

• RAND : Predictions are random numbers between 0 and 1. This is less of a prediction method

than a baseline for comparison.

• LOGIT : A logistic regression model using demographic variables as predictors.

• COLLAB : The collaborative filtering model using reported usage data from other consumers.

• COMB_LOGIT : A logistic regression model using the LOGIT and COLLAB model predictions

as predictor variables.

• FULL_LOGIT : A logistic regression model using the demographic variables as well as the usage

variables for products other than the one being modeled. We implemented two versions of the

FULL_LOGIT model: one using all variables available, and one which uses subset selection

methods to choose an accurate model with only 5 variables.

2.3 Results

We present our results for the various methods in the form of lift curves. To compute a lift curve for a

given product, we use a method described above to assign each consumer in the validation set an estimated

probability of usage. We then rank the consumers in order of their predicted usage, and ask the question “If

we contacted the top M consumers in this list, how many actual users would we find?” If we repeat this

question for every choice of M, we obtain the lift curve. Random predictions should roughly give a straight

diagonal line, while effective prediction algorithms will result in lift curves as high as possible above this

diagonal line. Higher lifts indicate better performance of a predictive model.
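
The lift curve computation described above can be sketched as follows, assuming `scores` holds the predicted usage probabilities and `actual` the 0/1 reported usage for the same consumers (both names are illustrative).

```python
def lift_curve(scores, actual):
    """Return a list whose entry M-1 is the number of actual users
    found among the M consumers with the highest predicted scores."""
    # Rank consumers from highest to lowest predicted probability of usage.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    curve, found = [], 0
    for i in order:
        found += actual[i]
        curve.append(found)
    return curve
```

For comparison, a random ranking is expected to find M × (total users / total consumers) users among the top M, which traces the straight diagonal.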

Figure 5: Lift Curves for Caress Body Wash and Classico Pasta Sauce Respectively

Figure 5 shows lift curves for the models used to predict usage of Caress Body Wash and Classico Pasta Sauce. These are indicative of the results obtained for several products.

The first set of lift curves illustrates the results for the COLLAB, LOGIT, and COMB_LOGIT methods. We observe that while the LOGIT model using the four demographic variables does better than random prediction, the COLLAB model using usage information on other products does significantly better. Combining the two models does only slightly better than the collaborative filtering approach alone.

The following set of lift curves adds the FULL_LOGIT method:

Figure 6: Improved Lift Curves for Caress Body Wash and Classico Pasta Sauce Respectively

The FULL_LOGIT model, based on all the variables, performs even better than the COMB_LOGIT model. The version of this logistic regression model that uses only a few selected variables also exhibits surprisingly good performance. This observation motivated a closer look at the specific coefficients of the variables in these parsimonious models. The tables below give the coefficients for the two 5-variable models responsible for the lift curves above:

Target: Caress Body Wash

Variable                        FULL_LOGIT Coefficient

Intercept                       -2.47
Caress Bar Soap                  1.61
Lever Body Wash                  0.59
Oil Olay Body Wash               0.51
Herbal Essence Body Wash         0.46
Dove Body Wash                   0.45

Figure 7: Coefficients of a Logistic Regression Model for Caress Body Wash

Target: Classico Pasta Sauce

Variable                        FULL_LOGIT Coefficient

Intercept                       -2.89
Five Bros Pasta Sauce            1.92
Francisco Rinaldi Pasta Sauce    1.31
Prego Pasta Sauce                0.52
Breyer’s Ice Cream              -2.11
Lipton Tea Bags                 -2.53

Figure 8: Coefficients of a Logistic Regression Model for Classico Pasta Sauce

Thus, among the most useful variables for predicting Classico Pasta Sauce usage are other pasta sauce brands, while other brands of body wash are among the most useful predictors of Caress Body Wash usage in the overall data set.
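
To make the size of these coefficients concrete, the sketch below evaluates the Caress Body Wash model of Figure 7 for two hypothetical consumers. It assumes that each predictor is a 0/1 usage indicator and that the intercept enters the linear term additively; the function name is illustrative.

```python
import math

# Coefficients copied from Figure 7 (Caress Body Wash, 5-variable model).
INTERCEPT = -2.47
COEFFICIENTS = {
    "Caress Bar Soap": 1.61,
    "Lever Body Wash": 0.59,
    "Oil Olay Body Wash": 0.51,
    "Herbal Essence Body Wash": 0.46,
    "Dove Body Wash": 0.45,
}

def usage_probability(reported_brands):
    """Logistic model: p = exp(z) / (1 + exp(z)), where z is the intercept
    plus the coefficients of the brands the consumer reports using."""
    z = INTERCEPT + sum(COEFFICIENTS.get(b, 0.0) for b in reported_brands)
    return 1.0 / (1.0 + math.exp(-z))

p_none = usage_probability([])                       # roughly 0.08
p_bar_soap = usage_probability(["Caress Bar Soap"])  # roughly 0.30
```

Under these assumptions, reporting Caress Bar Soap usage alone raises the predicted probability of Caress Body Wash usage from about 8% to about 30%.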

2.4 Conclusions

Our analysis of the overall UNIFORM.TXT data set led us to some intriguing conclusions, and motivated a closer look at the data set and its sources.

Methodologically, while combination models are interesting, logistic regression, perhaps with subset

selection, is a sufficiently powerful method for analyzing this data. Furthermore, it has the added

advantage of interpretability and is well understood as a tool in marketing.

Our primary finding was that in this data set, reported brand usage variables are considerably more

powerful than the limited set of demographics we looked at. In particular, usage of a given product can be

predicted surprisingly accurately using usage data from a small number of closely related products.

These pronounced and powerful trends encouraged us to take a closer look at the underlying data. After

discussions with our counterparts at Unilever, it was revealed that much of the data we were working with

was aggregated from two consumer surveys. Indeed, one of the surveys asked questions regarding personal

washes, while the other did not. Also, one of the surveys included questions regarding pasta sauces, while

the other did not. Clearly, the differences between the two surveys largely explained the high correlation

we were observing among usages of similar products.

Thus, our most significant conclusion from this analysis was that our models were achieving impressive

results, but were likely modeling the data collection technique rather than the underlying phenomenon.

Unfortunately, such a model may generalize poorly to panel data or to real world situations. Such survey

data, in an aggregated format, may serve as a poor proxy for purchase behavior.

The results of this analysis motivated more work to understand the source of the data. Subsequent efforts focused on identifying portions of the data that were as uniform as possible, and on building prediction and clustering models using relatively simple techniques such as logistic regression.

2.5 Block Structure in the Unilever Data Extract

The previous analysis motivated a closer look at the data. At this point Unilever provided us copies of two questionnaires, the responses to which comprised a large portion of the data in the Unilever database extract. Using timestamps that indicated the dates of collection for the various responses, we were able to obtain a more detailed understanding of the data. This structure is indicated in the following figure:

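[Figure image not reproduced: a grid with one row per consumer (Consumer 1 through Consumer 46307). Columns cover “Demographic variables”, columns labeled DM and GQM, and “Usage variables”; the recoverable column labels are “70”, “39 brands”, and “11 Unilever brands”. Blocks of usage data are labeled “S2000 survey responses”, “F2000 survey responses”, and “Non-Survey / “Coupon” responses”.]

Figure 9: Graphical Representation of Unilever Data Structure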

In Figure 9, rows indicate data available for a single consumer, while columns indicate different variables in

the data extract. Some demographics and model scores are reported for each consumer. In the section

marked as “Usage Variables,” the shaded blocks indicate the presence of self-reported usage data. After

some investigation, we believe that roughly half the consumers had Spring 2000 survey responses and no

Fall 2000 survey responses, while the other half had Fall 2000 responses and no Spring 2000 responses. A

subset of consumers from both groups also had some additional brand usage responses, which we hereafter

refer to as “Non-Survey” data. We were advised that this “Non-Survey” data largely represented responses

to coupon redemption.

In subsequent analyses, we concentrated on sets of brands and consumers whose usage data fell uniformly in one of these blocks. The individual surveys reported on a relatively small set of brands, while the “Non-Survey” data included a much larger set of brands. For this reason, we focused our efforts on the Non-Survey data. In what follows, we will briefly discuss our limited modeling efforts on the Fall 2000 survey data, and we will provide a lengthy discussion of an extensive analysis of the Non-Survey data.

2.6 Analysis of Fall 2000 Survey Data

While our efforts were focused on the Non-Survey data, we also performed an exploratory analysis of the

Fall 2000 survey data and associated demographic variables. We extracted a small sample of Fall 2000

data that included 2500 consumers.

The Fall 2000 survey data includes response data on a limited number of brands. The Unilever brands

represented are Suave, Lipton, Breyer’s, Wishbone, Dove, Lever2000, Caress, and Snuggle. This limited

amount of response information restrained the scope of analysis we could perform, and hence we tried the

following three indicative tasks:

• Predicting reported Caress usage given all other available variables.

• Predicting reported Caress usage given demographic variables only.

• Predicting simultaneous Caress and Snuggle usage given all other available variables.

The third task was an attempt to use predictive modeling to identify cross-selling opportunities.

We used only the demographic variables described above: region of residence, income levels, household sizes, and presence of children. We tried several modeling methodologies including nearest neighbor methods, logistic regression, discriminant analysis, classification trees, and neural networks. These methodologies will be described in more detail in a subsequent section. We concluded that the choice of modeling algorithm made little difference in the quality of the results obtained.

The best-performing models predicted 100% non-usage, which gave a 32% misclassification rate for

Caress, and a 19% misclassification rate for Caress / Snuggle cross-sells. These results indicated that it is

difficult to make predictions given the limited number of variables.
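The baseline above can be made concrete: a model that predicts non-usage for every consumer misclassifies exactly the users, so its error rate equals the usage rate. The following is a minimal sketch on synthetic labels chosen only to mirror the 32% figure; the function name and the data are illustrative assumptions, not part of the original analysis.

```python
def all_negative_error(labels):
    """Misclassification rate of a model that predicts 0 (non-usage) for everyone.

    labels: list of 0/1 reported-usage indicators.
    """
    return sum(labels) / len(labels)


# Synthetic, illustrative labels: 32 users among 100 consumers.
labels = [1] * 32 + [0] * 68
baseline_error = all_negative_error(labels)
```

Any model that makes usage predictions has to beat this rate to be preferred on misclassification alone, which is why lift curves are the more informative comparison here.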


[Figure 10 plot. x-axis: # targeted (0 to 1500); y-axis: # responses (0 to 350). Series: Caress lift curve (demographic and response predictors); Caress lift curve (demographic predictors); Caress Reference; Caress & Snuggle lift curve; Caress & Snuggle Reference.]

Figure 10: Lift Curves for Logistic Regression Models based on Fall 2000 Survey Data

On constraining the models to generate a reasonable number of usage predictions, the best models achieved a 35% misclassification rate for Caress and a 21% misclassification rate for Caress / Snuggle cross-sells. Figure 10 shows lift curves from the logistic regression models. We observe that the model making use of demographic variables only gave negligible lift.

The models based on both demographic and brand usage information seemed to achieve more lift. The

most useful predictor variables seemed to be other brands of soap – namely Dove and Lever2000. While

this may reflect real consumer usage pattern, it may also be due to the design of the survey, which included

separate sections for soap and for other products. As with the other analyses in this report, the question

remains as to whether these results are transferable to real-world usage patterns.
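A lift curve of the kind discussed above plots the number of responses captured against the number of consumers targeted, with consumers ranked by model score. The sketch below shows one way such a curve and its random-targeting reference line could be computed; the function names and the toy data are assumptions for illustration only.

```python
def cumulative_lift_curve(scores, responses):
    """Cumulative responses when targeting consumers in decreasing score order.

    Entry k-1 is the number of responders among the k highest-scoring consumers.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    curve, total = [], 0
    for i in order:
        total += responses[i]
        curve.append(total)
    return curve


def reference_curve(responses):
    """Expected cumulative responses when consumers are targeted at random."""
    rate = sum(responses) / len(responses)
    return [rate * k for k in range(1, len(responses) + 1)]


# Toy data: model scores and observed 0/1 responses for five consumers.
scores = [0.9, 0.2, 0.7, 0.4, 0.1]
responses = [1, 0, 1, 0, 0]
lift = cumulative_lift_curve(scores, responses)
ref = reference_curve(responses)
```

A model gives lift where its curve lies above the reference line; both curves necessarily meet at the total number of responders once every consumer has been targeted.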

3 NON–SURVEY DATA ANALYSIS

Here we report on a subset of the original UNIFORM.TXT which we refer to as the “Non–Survey” data. Our decision to analyze this section of the data was guided by the internally homogeneous composition of this data set and the fact that it included information on a wide range of products. The discussion is organized as follows:

• Data Details: A detailed description of the Non–Survey data extraction and composition

highlighting some of the inherent features.

• Data cleaning and aggregation: A detailed description of treatment of missing values or outliers

and transformation of some of the variables that we conducted.

• Data credibility issues: A few of our observations suggesting the possibility of artificial data

structure and data bias issues.


• Predictive modeling: An in-depth analysis to predict brand usage and explore the Most Valuable

Consumer concept using various modeling techniques based on both demographics and Golden

Question Model and Demographic Model scores.

• Cluster analysis: A description of efforts to segregate consumers into distinctive clusters and to

use this potential information to enhance predictive efforts.

3.1 Data Details

The layout of data provided by Unilever was explained above. The Non–Survey data has been extracted

from the file UNIFORM.TXT. This file contains information on 46,307 consumers. Each usage entry in the

file UNIFORM.TXT bears a date stamp indicative of time of data collection. A majority of data entries in

this file bear one of the two time stamps – namely 15th May, 2000 and 15th November, 2000. Based on the

quantity of usage data associated with these two dates and the fact that the data with these time stamps

seems to correspond to the surveys provided to us, we assumed that usage entries with these time stamps

correspond to the Spring and Fall survey data. The remaining data is what we analyze in this section, and we refer to it as the Non–Survey data. Based on information provided by Unilever, we believe the Non–Survey data represents product promotion coupon responses.

The Non–Survey data consists of 14,492 consumers. For each consumer we have demographic information

and reported usage for seventy brands, some of which are Non–Unilever brands. In addition, Golden

Question Model and Demographic Model scores have been provided for each consumer. The diagram

below is a graphical representation of the data layout and structure.

[Figure 11 diagram. Rows: Consumer 1 through Consumer 46,307. Columns: Demographic variables, GQM, DM, Usage variables (70 brands). Labels on the usage columns: 39 brands; 11 Unilever brands. Usage blocks: S2000 survey responses; F2000 survey responses; Non-Survey / “Coupon”.]

Figure 11: Graphical Representation of Non–Survey Data

Each consumer reports usage of at most twenty brands. Following the data cleaning efforts, the maximum

number of brands reported by a consumer was reduced to seventeen. The following diagram shows a

distribution of consumers according to the number of responses reported by each consumer. It is observed


that a large majority of the consumers report usage of very few brands, which leads to the sparse nature of

the Non–Survey dataset.

[Figure 12 chart: Consumer Response Distribution. x-axis: # Responses (1 to 17); y-axis: # Consumers (0 to 3000).]

Figure 12: Consumer Response Distribution for Non–Survey Data

Figure 13 depicts category–wise brand distribution of Unilever and non–Unilever brands in the Non–

Survey data set. There is a dominating presence of body wash and bar soap brands in the data set, followed

by the presence of food items. This is because of the survey design and, therefore, the distribution is not

representative of true usage patterns, a potential bias that we will investigate later in the discussion.

[Figure 13 chart: Distribution of Brands. Axes: Brand Categories (Bar Soap, Body Wash, Shampoo, Detergents, Food Items, Body Items, Misc) against # Brands (0 to 40).]

Figure 13: Category-wise Brand Distribution

3.2 Data Processing and Transformation

In contrast to our initial efforts, during the analysis on Non–Survey data we sought to incorporate a wide

range of demographic and model score variables for a more comprehensive study. This necessitated

numerous decisions regarding data set preparation like choice of demographics and treatment of missing

values and outliers.


3.2.1 Data Cleaning

Following the data cleaning efforts described in Section 2.1, the Non-Survey data set was further cleaned and filtered. The data set initially contained 14,492 consumers, which was reduced to 8,608 consumers after data cleaning and filtering. As stated earlier, the Non-Survey data had over 100

demographic variables and 70 brand variables. All the brand variables have been considered in the analysis

regardless of affiliation with Unilever. Block Layout variables were not considered in this analysis due to

their complex nature and to expedite the process. We believed the information contained in Household and

Individual variables was significant enough for revealing relevant information. Among the Individual

Layout and Household Layout demographics, variables with an excess of 20% missing values were

eliminated. Imputing missing values for such a large number of data entries would have led to misleading

results. Some of the demographic variables were rejected due to the ambiguous nature of their source and

method of computation. Certain variables that seemed to be derived from other variables in unclear ways

were also rejected. Examples of such variables include lifestyle clusters like traditionalist, home garden,

etc. For important demographic variables, all consumers with missing values were eliminated. For the

remaining demographics that were measured on a continuous scale, missing values were imputed with the

mean values. Outlying consumers whose demographic values lay more than five standard deviations from the mean were also eliminated. This left a negligible number of consumers with missing values among demographics measured on a binary scale. These were also eliminated.
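The main cleaning rules described above (dropping variables with more than 20% missing values, mean imputation for continuous variables, and removal of consumers beyond five standard deviations) can be sketched as follows. This is an illustrative sketch only: the record layout, the function name, and the toy data are assumptions, and the step that drops consumers with missing values in "important" demographics is omitted because the report does not list those variables.

```python
from statistics import mean, pstdev


def clean(rows, max_missing=0.20, n_sd=5.0):
    """rows: list of dicts mapping variable name -> number, or None if missing."""
    variables = list(rows[0])
    n = len(rows)
    # 1. Drop variables whose fraction of missing values exceeds max_missing.
    kept = [v for v in variables
            if sum(r[v] is None for r in rows) / n <= max_missing]
    # 2. Impute remaining missing values with the variable mean.
    means = {v: mean([r[v] for r in rows if r[v] is not None]) for v in kept}
    filled = [{v: means[v] if r[v] is None else r[v] for v in kept} for r in rows]
    # 3. Drop consumers lying more than n_sd standard deviations from the mean.
    #    (Mean imputation leaves each variable's mean unchanged.)
    sds = {v: pstdev([r[v] for r in filled]) for v in kept}

    def is_outlier(r):
        return any(sds[v] > 0 and abs(r[v] - means[v]) > n_sd * sds[v]
                   for v in kept)

    return [r for r in filled if not is_outlier(r)]


# Toy data: "sparse" is missing for 48% of consumers; one consumer has an
# extreme age and a missing income.
rows = [{"age": 40.0, "income": 50.0, "sparse": None if i % 2 else 1.0}
        for i in range(49)]
rows.append({"age": 400.0, "income": None, "sparse": 1.0})
cleaned = clean(rows)
```

In the toy data the sparse variable is dropped, the missing income is imputed, and the consumer with the extreme age is removed as an outlier.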

3.2.2 Data Aggregation

Certain demographic variables were aggregated to obtain simpler, more interpretable models, to decrease the computational effort, and to reduce the risk of model overfitting.

State – Wise Regional Aggregation

The variable FIPS_census_state contains information on the state of residence of the consumer. The states were aggregated into the following nine regions:

• New England

• Middle Atlantic

• East North Central

• West North Central

• South Atlantic

• East South Central

• West South Central

• Mountain

• Pacific

This aggregation was needed to reduce computational effort and running time. In addition, we believed that a region-wise approach would be more insightful.
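The nine region names listed above coincide with the nine divisions defined by the US Census Bureau. Assuming that the aggregation followed those divisions and that FIPS_census_state holds the standard two-digit state FIPS code (neither is confirmed in this report), the mapping could be sketched as follows. The names CENSUS_DIVISIONS, FIPS_TO_REGION, and region_of are illustrative.

```python
# US Census Bureau divisions keyed by two-digit state FIPS code
# (50 states plus the District of Columbia).
CENSUS_DIVISIONS = {
    "New England": ["09", "23", "25", "33", "44", "50"],
    "Middle Atlantic": ["34", "36", "42"],
    "East North Central": ["17", "18", "26", "39", "55"],
    "West North Central": ["19", "20", "27", "29", "31", "38", "46"],
    "South Atlantic": ["10", "11", "12", "13", "24", "37", "45", "51", "54"],
    "East South Central": ["01", "21", "28", "47"],
    "West South Central": ["05", "22", "40", "48"],
    "Mountain": ["04", "08", "16", "30", "32", "35", "49", "56"],
    "Pacific": ["02", "06", "15", "41", "53"],
}

# Invert to a flat lookup: FIPS code -> region name.
FIPS_TO_REGION = {code: region
                  for region, codes in CENSUS_DIVISIONS.items()
                  for code in codes}


def region_of(fips_state):
    """Map a state FIPS code (int or str) to its region name."""
    return FIPS_TO_REGION[str(fips_state).zfill(2)]
```

Replacing roughly fifty state indicators with nine region indicators is what yields the reduction in model size mentioned above.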


Household Member Age – Gender – Wise Aggregation

Variables containing information on age and gender–wise presence of household members were

aggregated. These are variables of type IB_males_0_2, IB_females_3_5, etc.

Original variables                                     Aggregated variable

Presence of Male 0-2, 3-5, 6-10, 11-15, 16-17 years    Presence of Male Child (Non-Earning)

Presence of Male 18-24, 25-34, 35-44, 45-54 years      Presence of Male Adult (Earning)

Presence of Male 55-64, 65-74, 75 plus years           Presence of Male Senior

Figure 14: Household member age – gender – wise aggregation

The variables were grouped into presence of child/ adult/ senior member of male/ female/ unknown gender.

The age threshold for segregation was chosen based on certain assumptions about occupational status of

each household member according to the individual’s age. This is made clearer in Figure 14. The variables

were combined into children, adult and senior categories as indicated above. The same treatment was

extended to variables for female and unknown gender also. This led to a compression of 36 variables into 9

variables, with the preservation of age and gender–wise composition of the household and their influence

on consumer response. Prior to our meeting with Unilever representatives in July 2002, we assumed that

household members in the age group of 16 to 17 could be included in the adults category. However, we

were informed that Unilever considers consumers up to the age of 17 as children. We made necessary

modifications to the data set thereafter but no significant changes were recorded due to this minor

modification.
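The compression of 36 presence indicators into 9 can be sketched as below. Only IB_males_0_2 and IB_females_3_5 are variable names taken from the data set; the remaining names are extrapolated from that pattern and should be read as hypothetical, as should the output names. An aggregated indicator is set when any of its component age bands is present.

```python
# Age bands per aggregated group, following Figure 14 (16-17 counted as child).
AGE_BANDS = {
    "child": ["0_2", "3_5", "6_10", "11_15", "16_17"],
    "adult": ["18_24", "25_34", "35_44", "45_54"],
    "senior": ["55_64", "65_74", "75_plus"],
}
GENDERS = ["males", "females", "unknown"]


def aggregate_household(record):
    """Collapse 36 presence flags into 9 (gender x child/adult/senior).

    record: dict of 0/1 flags keyed like "IB_males_0_2"; absent keys count as 0.
    """
    out = {}
    for gender in GENDERS:
        for group, bands in AGE_BANDS.items():
            flags = (record.get(f"IB_{gender}_{band}", 0) for band in bands)
            out[f"{gender}_{group}"] = int(any(flags))
    return out


# Toy household: a male infant, a young girl, and a woman aged 35-44.
record = {"IB_males_0_2": 1, "IB_females_3_5": 1, "IB_females_35_44": 1}
aggregated = aggregate_household(record)
```

Moving the 16-17 band between the child and adult groups, as was done after the July 2002 meeting, only requires editing AGE_BANDS.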

3.3 Data Issues

Prior to extensive predictive modeling efforts, the data was observed and examined to extract information

that might be useful in subsequent study. We observed several instances that suggested data inconsistencies

and possible biases in the data. Some of these issues relate to the means and methods of data collection and

interpretation, while others relate to possible artificial data structuring induced by aggregation of dissimilar

data from multiple sources. Presented below are a few cases in point.

3.3.1 Conflicts in Computation of Household_Layout variables

Household_Layout variables consisted of certain variables that supply gender-wise and age-wise

information on presence of household members. Examples include IB_males_0_2, IB_females_3_5, etc. (henceforth referred to as household-member variables). A positive response indicates presence of a


household member in that age and sex group. This information was not obtained from the consumer directly, but rather derived from third-party data sources. The accuracy of the data varied depending on the source. The following situations led us to doubt the accuracy of some of the information contained in these variables:

• The data set also contains a variable called IB_house_size. Addition of unit responses in all the

aforementioned variables describing presence of household members should not exceed the value

indicated by variable IB_house_size. Yet it was observed that there was no correlation between the

aggregated value as obtained from the household-members variables and the house size variable. We

tried various combinations, including and excluding members of unknown gender, yet we failed to achieve a match between the two sets of variables.

• Similarly there was no reconciliation between the values represented by variables

IB_presence_of_child or IB_number_of_adults and the information aggregated using various

combinations of household-member variables.
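The first check above can be expressed as a simple rule: since each household-member variable indicates at least one person in an age and gender group, the number of positive indicators is a lower bound on household size and should not exceed IB_house_size. A minimal sketch follows; the function name and the third column name in the example are hypothetical.

```python
def violates_house_size(record, member_columns):
    """True if the count of positive household-member flags exceeds IB_house_size."""
    present = sum(1 for col in member_columns if record.get(col, 0))
    return present > record["IB_house_size"]


member_columns = ["IB_males_0_2", "IB_females_3_5", "IB_females_35_44"]

# Three distinct member groups reported, but a stated household size of two.
inconsistent = {"IB_house_size": 2, "IB_males_0_2": 1,
                "IB_females_3_5": 1, "IB_females_35_44": 1}
# The same three groups with a stated household size of four.
consistent = {"IB_house_size": 4, "IB_males_0_2": 1,
              "IB_females_3_5": 1, "IB_females_35_44": 1}

flags = [violates_house_size(r, member_columns)
         for r in (inconsistent, consistent)]
```

Counting the fraction of flagged records, with and without the unknown-gender columns, gives a direct measure of how badly the two sources disagree.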

The above data inconsistencies suggest that caution must be exercised in choosing variables to be considered in the modeling analysis. Variable definitions and calculations must also be analyzed carefully. To capture information on age and gender of household members, we chose to use the aggregated household-member variables.

3.3.2 Unbalanced Brand Representation

Consider the following distribution of consumer response for some of the brands in the Non–Survey data.

[Figure 15 chart: # Response per Brand. Brand Name axis: Dove Bar Soap, Dove Body Wash, Caress Bar Soap, Ragu Pasta Sauce, lvr2k Bodywash, Mealmk Stir Fry, Suave Bar Soap. # Response axis: 0 to 6000.]

Figure 15: Consumer Response Distribution for Some of the Brands

Brands contributed disproportionately to the Unilever database. For instance there is an overwhelming

presence of Dove body products, whereas response for some of the products is relatively infrequent. The

most frequent reported usage is of bar soap and body wash products, followed by the usage of food items,

detergents and other miscellaneous products in that order. It seems evident that this pattern is not representative of true brand usage frequencies.


3.3.3 Insufficient Information on Source of Information in Non–Survey Data

The imbalance in reported usage across various brands raised doubts about the origins of the Non–Survey data. Details on the source of the data and the methods of data collection were not clearly disclosed. We were advised to assume that the Non–Survey data represents information on redemption of coupon promotions. However, we did not have details about the coupons themselves or their methods of circulation.

We note that the applicability of our data analysis results depends largely on the data quality and the extent to which the data is understood. Though we believe that our results accurately reflect the process that generated the data, they are valuable in practice only to the extent that the data presented to us captures real brand usage.

3.4 Predictive Modeling

For reasons we have listed before, our prediction efforts have been focused on predicting reported usage of

individual brands. We briefly looked at the modeling of MVC also. The following summarizes our

sequence of predictive analysis:

• Fitting various algorithms to the data set to identify a common predictive modeling methodology leading to the best results over a wide range of brands.

• Examining the contribution of GQM and DM scores in predictive models

• Conducting “Most Valuable Consumer” (MVC) based analysis to capture information contained

in GQM or DM scores for prediction of MVC.

• Drawing inferences and conclusions from model results and suggesting means for obtaining

improved results

3.4.1 Choice of Models

We fitted a number of naïve and sophisticated models, to arrive at a common predictive model that proved

both accurate and interpretable for a wide range of brands. Some of the modeling techniques tried were: k–

Nearest Neighbors, Classification Trees, Artificial Neural Networks and Logistic Regression. Choice of

appropriate model involves a tradeoff between the accuracy of results, interpretability, ease of application

to alternate data set and computational complexity. The models rank as follows in decreasing order of

interpretability: Regression Models, Classification Trees, k-Nearest Neighbors and Neural Networks.

Following is a brief description of some of these algorithms, focusing on the advantages and disadvantages of each.

Artificial Neural Network (ANN): An artificial Neural Network is a mathematical model capable of robust

classification even when the underlying data structure is quite complex. ANNs derive their predictive


power from architecture of interconnected computational units. They also allow predictive modeling of

more than one variable in a single iteration. Although capable of high accuracy, ANNs suffer from the

drawback that their models can be hard to interpret. Thus they are more appropriate when predictive

accuracy is more important than interpretability. They also require a very large training data set, are susceptible to over-training, and are more computationally intensive than other algorithms.

Classification Trees: A classification tree makes data classifications based on a set of simple rules that can

be organized in the form of a decision tree. While these models are not as intricate as ANNs, they are

considerably more interpretable. The decision trees can be easily translated to business strategies, so they have recently become more popular in business contexts. This technique is also applicable to data sets with

missing values, thereby considerably reducing the data cleaning efforts.

k–Nearest Neighbors: This methodology is in many ways similar to the collaborative filtering methodology discussed earlier. Usage predictions for a given variable are computed as a weighted average of the usage of other users. Its main advantage lies in the fact that it is effective when there are a lot of response variables. However, the simplicity may come at the expense of predictive accuracy.
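The weighted-average idea can be sketched as below. The report does not state the distance measure or weighting scheme that was used, so the Euclidean distance over the other brand indicators and the 1 / (1 + distance) weights are assumptions made purely for illustration, as are the function name and the toy data.

```python
import math


def knn_predict(target, others, brand, k=3):
    """Predict usage of `brand` for `target` from its k most similar consumers.

    target: dict brand -> 0/1 for the brands whose usage is known.
    others: list of dicts that also contain `brand`.
    """
    known = [b for b in target if b != brand]

    def distance(other):
        return math.sqrt(sum((target[b] - other.get(b, 0)) ** 2 for b in known))

    nearest = sorted(others, key=distance)[:k]
    weights = [1.0 / (1.0 + distance(o)) for o in nearest]
    return sum(w * o[brand] for w, o in zip(weights, nearest)) / sum(weights)


# Toy data: predict Snuggle usage for a consumer who uses Dove and Caress.
target = {"dove": 1, "caress": 1}
others = [
    {"dove": 1, "caress": 1, "snuggle": 1},  # distance 0, weight 1.0
    {"dove": 1, "caress": 0, "snuggle": 0},  # distance 1, weight 0.5
    {"dove": 0, "caress": 0, "snuggle": 0},  # not among the 2 nearest
]
prediction = knn_predict(target, others, "snuggle", k=2)
```

The prediction is a score between 0 and 1 that can be thresholded or ranked in the same way as a logistic regression posterior.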

Logistic Regression: There has been a prior discussion of logistic regression. It has been used to model

preferences in the marketing context. Some of the benefits include a record of success over a wide range of

prediction problems, interpretability of the models, and the fast speed of algorithms used for model fitting.

The output of logistic regression model is a set of posterior probabilities for which we can vary the

threshold according to the desired level of predictive aggressiveness.
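The role of the threshold can be illustrated with a short sketch. The posterior of a fitted logistic regression is the logistic function of a linear score, and lowering the threshold makes the model predict usage more aggressively. The function names and the example probabilities below are illustrative assumptions and are not drawn from the fitted models in this report.

```python
import math


def posterior(intercept, coefs, x):
    """Posterior probability of usage from a fitted logistic regression."""
    z = intercept + sum(coefs[name] * x[name] for name in coefs)
    return 1.0 / (1.0 + math.exp(-z))


def classify(probabilities, threshold):
    """Predict usage (1) wherever the posterior reaches the threshold."""
    return [int(p >= threshold) for p in probabilities]


# Illustrative posteriors for four consumers.
probabilities = [0.10, 0.35, 0.60, 0.80]
conservative = classify(probabilities, threshold=0.5)
aggressive = classify(probabilities, threshold=0.3)
```

Sweeping the threshold from 1 down to 0 traces out the lift curve, since it is equivalent to targeting consumers in decreasing order of posterior probability.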

Figure 16: Dove Body Wash: Logistic Regression, Tree, Neural Network

On fitting all the aforementioned algorithms to numerous brands and comparing the results, logistic

regression was chosen as the common model for all predictive analysis henceforth. We observed that


predictive results from logistic regression were at least as good as or better than the results obtained using

other models for a wide range of products. Figure 16 illustrates the superior predictive performance of

logistic regression compared to classification trees and k-nearest neighbors (appears as “User” in the

figure) in the case of Dove body wash.

3.4.2 Predictive Efforts using Logistic Regression

An exhaustive logistic regression analysis was conducted for all the seventy brands in the data set and a

number of interesting results were observed.

Varying Success in Predictive Efforts

Logistic regression models were built to predict usage of each brand based on the demographics and model scores only. We observed varying degrees of success across brands, ranging from highly successful results, such as those for Breyer’s Ice Cream, to poor results, as seen for Lipton Tea Bags. We present lift curves depicting typical good, moderate and poor results.

[Figure 17 panels. Good: Breyers Ice Cream. Medium: Caress Body Wash. Poor: Lipton Tea Bags.]

Figure 17: Lift Curves for Different Unilever Brands

Some of the conclusions to be made are as follows:

a) For a majority of brands noticeable lift was observed. Typical lift curves over a wide range of

products are similar to the Caress body wash lift curve shown above. This indicates that the

demographic and model score variables contain considerable predictive power in this data set and

can be used for making brand usage predictions.

b) The most important predictive variables emerged to be the Golden Question Model (GQM) and

Demographic Model (DM) scores. There are two possible explanations for this. The first is that

MVC may be an important summary statistic that captures brand usage. The second is that the

Golden Question Model takes a number of brand usage variables as inputs. Thus using the GQM

scores to further predict the same usage variables can lead to artificially inflated results that may

be misleading.


c) The following demographics are seen to have significant presence in models for a wide range of

brands:

Significant Demographics

Age Length of residence

Gender Marital status

Household Members Region Code

Figure 18: Significant Demographic Parameters

d) Brands with high coefficients in the GQM computation had significantly better lift curves. As already noted, this may be deceptive. Hence the lift charts for the brands which have been used as inputs in the computation of GQM must be viewed with caution. Breyers ice cream is an example of such a brand.

e) Brands for which the response rate was below 10% of the total consumer base led to poor predictive results, as in the case of Lipton Tea. This can be attributed to an insufficient number of data entries available for training and validating the predictive models.

Example of a Predictive Model

The following chart is an example of a logistic regression predictive model for one of the Unilever brands, namely Gorton Fillets. The diagram graphically indicates t-scores for the model.

Figure 19: Graphical Representation of Logistic Regression for Gorton Fillets


The most important model coefficients represented above are as follows:

Target: Breyers Ice Cream Logistic Regression Model

Variable                     Coefficient

Model Score GQM              4.7005

Model Score DM               -2.521

Intercept                    -3.0433

Absence of Female Adult      0.2977

Gender                       -0.358

Age                          0.0183

Absence of Unknown Adult     0.400

Home Renter                  0.215

Figure 20: Logistic Regression Model Coefficients for Gorton Fillets

During the course of the study, it became evident that GQM and DM scores held significant predictive information. For a majority of products, model coefficients were the highest for these score variables. As explained earlier, GQM and DM scores were fit using Panel data. Their importance in models for the same brands on a different data set indicates a degree of similarity between the Panel data and the Non–Survey data, and also establishes the importance of GQM and DM scores in predictive efforts.

Modeling With and Without GQM Scores

Further analysis was carried out to judge the contribution of GQM and DM score variables in predictive

models. We wished to explore the comparative performance of models without the GQM and DM scores,

based on demographics only. The figure below indicates superior performance of the models based on both demographics and model scores compared to models based on demographics only. (In Figure 21, “Reg” represents the model including GQM scores and “Reg 2” represents the model based on demographics only for the prediction of Breyer’s Ice Cream usage.)

Figure 21: Comparative Lift Curves for Models With and Without GQM Scores


3.5 GQM Score – Based Stratified Analysis

As documented previously, Unilever spent considerable efforts in identifying the “Most Valuable

Consumers” (MVC) on a brand and company-wide level. Given the Golden Question Model scores and

Demographic Model scores for each consumer, we were keen to enhance our predictive efforts using this

information. The exhaustive logistic regression analysis had already convinced us that GQM and DM

scores could be highly instrumental in predicting brand usage. A two-pronged approach was followed in the MVC-based analysis:

3.5.1 Stratified Prediction Models

Previous modeling attempts were focused on fitting a single model for each brand to the entire data set and

generating posterior probabilities using these models. Since GQM scores seemed to contain information on

MVC, we stratified the data set based on model scores into three categories. First, a separate training data set was created. The cutoffs were chosen from the distribution of consumers such that each stratum contained roughly one-third of the consumers. All consumers with a GQM score greater than 0.679 were categorized as high GQM consumers. All consumers with a GQM score between 0.342 and 0.679 were categorized as medium GQM consumers, and the rest were categorized as low GQM consumers. Separate logistic regression models were

built for each of these three training data sets. Based on the strata cutoffs for the training set, validation data

set was also divided into three corresponding data sets. Models fit on the respective training data sets were

applied to the validation sets to compute posterior probabilities for each consumer in all the categories.

Thus we generated 2 sets of posterior probabilities for each consumer – from the overall model and from

the stratified analysis. Lift charts were drawn for both predictive efforts to compare performance.
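The stratification step can be sketched as follows, using the cutoffs of 0.679 and 0.342 reported above. How scores exactly equal to a cutoff were handled is not stated, so the boundary treatment here is an assumption, as are the function names and the "gqm" key used to hold each consumer's score.

```python
HIGH_CUTOFF = 0.679  # cutoffs reported in the text
LOW_CUTOFF = 0.342


def gqm_stratum(score):
    """Assign a GQM score to the high, medium, or low stratum."""
    if score > HIGH_CUTOFF:
        return "high"
    if score >= LOW_CUTOFF:  # boundary handling is an assumption
        return "medium"
    return "low"


def stratify(consumers):
    """Group consumer records (dicts with a 'gqm' score) by stratum."""
    strata = {"high": [], "medium": [], "low": []}
    for consumer in consumers:
        strata[gqm_stratum(consumer["gqm"])].append(consumer)
    return strata


# Toy records; the same cutoffs are applied to training and validation sets.
training = [{"id": 1, "gqm": 0.91}, {"id": 2, "gqm": 0.50},
            {"id": 3, "gqm": 0.10}, {"id": 4, "gqm": 0.70}]
strata = stratify(training)
```

One model is then fit per stratum on the training records, and each validation consumer is scored by the model of the stratum that its own GQM score falls into.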

[Figure 22 chart: Lift Curve comparing GQM Stratified and overall analysis. Vertical axis: 0 to 1; horizontal axis: 10 to 100. Series: Stratified Analysis; Baseline; Overall Analysis.]

Figure 22: Comparative Lift Curves for GQM Stratified and Overall Models

Figure 22 shows lift charts for GQM score–based stratified and overall models for Dove bar soap, which

are representative for other brands as well. It is clear that both the methodologies lead to similar results. We


noted that there was little difference among the three separate models created for the stratified data sets

and that these models were each close to the overall model as well. This claim was further substantiated by

details of the model coefficients in each case. It was observed that the important variables in overall model

and stratified models were the same for a single brand. There was only a slight variation in the coefficients

for each of these variables. Thus we concluded that the stratification of data set according to GQM does not

lead to improved results.

3.5.2 Predicting GQM Strata for Each Consumer

We have discussed the reasons that prevented us from thoroughly investigating the concept of MVC given the nature and type of data available to us. We found no direct indications of MVC in the Unilever data set, only GQM- and DM-based estimates of MVC. Nevertheless, we spent some effort using GQM scores as a target for our predictive models. Instead of predicting the exact GQM scores, we tried to predict from demographics whether the consumer belonged to the high, medium or low GQM stratum. Our intention was to generate a representative MVC model using Non–Survey data based on demographics only. Logistic regression was used to arrive at results. GQM strata definitions were the same as those given above. Each consumer was assigned a stratum number of one, two or three depending on whether the consumer belonged to the high, medium or low GQM stratum. Subsequently, predictive models were computed to predict the stratum for each consumer based on demographic variables only.

A very low degree of success was achieved in predicting the GQM score strata that each consumer

belonged to. For this reason, and given the unclear applicability of this model, we did not find it appropriate to pursue this line of analysis further.

4 CLUSTERING ANALYSIS

4.1 Clustering Background

Clustering places objects into groups or clusters suggested by the data. The objects in each cluster tend to

be similar to each other in some sense, and objects in different clusters tend to be dissimilar. The

observations are divided into clusters so that every observation belongs to at most one cluster. Clustering

not only reveals inherent data characteristics by identifying points of similarity or dissimilarity in the data

set, it aids in understanding data structure issues. If dissimilar data sets are aggregated to produce a bigger

data set, clustering of the aggregated set might reveal underlying data sets. One of the added advantages of

clustering analysis is that it can be applied to a data set with missing values as well.

Aside from data cleaning and data structure issues, clustering results can also be of interest by identifying

groups of consumers with similar traits, who may be targeted in a similar fashion. Additionally, we

explored the possibility of improved prediction by modeling the individual clusters separately and then

aggregating results.


Clustering can be performed according to various methods. For our analysis we chose Ward’s Method

which is somewhat more sophisticated than the popular but simple k–means method. Ward’s method is an

iterative method that seeks to minimize the statistical spread of observations within a cluster. In this

method, the distance between two clusters is the ANOVA sum of squares between the two clusters added

up over all the variables. At each iteration, the within–cluster sum of squares is minimized over all

partitions obtainable by merging two clusters from previous iteration. Clustering can be performed

according to various measures of spread or distance among the data points. During our study we used the

Least Squares method because it works fastest while dealing with large data sets.
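The merge criterion described above has a convenient closed form: merging two clusters increases the total within-cluster sum of squares by n_a n_b / (n_a + n_b) times the squared distance between their centroids. The sketch below computes this cost and checks it against a direct calculation; it illustrates the standard Ward criterion and is not a reproduction of the SAS Enterprise Miner implementation. All names and data are illustrative.

```python
def centroid(points):
    """Coordinate-wise mean of a list of equal-length numeric vectors."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def sse(points):
    """Within-cluster sum of squared deviations from the centroid."""
    c = centroid(points)
    return sum((p[d] - c[d]) ** 2 for p in points for d in range(len(c)))


def ward_cost(a, b):
    """Increase in within-cluster sum of squares caused by merging a and b."""
    ca, cb = centroid(a), centroid(b)
    squared_distance = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * squared_distance


# Toy clusters in two dimensions.
a = [[0.0, 0.0], [2.0, 0.0]]
b = [[5.0, 0.0]]
cost = ward_cost(a, b)
direct = sse(a + b) - sse(a) - sse(b)
```

At each iteration, Ward's method merges the pair of clusters with the smallest such cost.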

During the clustering analysis SAS Enterprise Miner computes an Importance Value between 0 and 1 for

each variable in the data set. This represents the measure of worth of the given variable in the formation of

clusters. As the data is split into clusters, the Importance Value of each variable indicates the extent to

which that variable was influential in the splitting process. An Importance Value of 0 indicates that the variable

was not used as a splitting criterion for clustering, and an Importance Value of 1 indicates that the variable had

the highest worth as a splitting criterion.

One of the most important tools which help us interpret individual clusters is the Input Mean Chart for each

cluster. This allows a comparison of the variable mean for selected clusters to the overall variable means.

The input means are normalized using a scale transformation function:

$$y = \frac{x - \min(x)}{\max(x) - \min(x)}$$

For example, assume 5 input variables Y1, …, Y5 and 3 clusters C1, C2, C3. Let the input mean for

variable Yi in cluster Cj be represented by Mij. Then the normalized mean, or input mean, SMij becomes:

$$SM_{ij} = \frac{M_{ij} - \min(M_{i1}, M_{i2}, M_{i3})}{\max(M_{i1}, M_{i2}, M_{i3}) - \min(M_{i1}, M_{i2}, M_{i3})}$$

The input means are normalized to fall in a range from 0 to 1. For each cluster input means are ranked

based on magnitude of difference between the input means for the selected cluster(s) and the overall input

means. The variables with the highest spreads typically best characterize the selected cluster(s). Input

means that are very close to the overall means are not very helpful in describing the unique attributes of

consumers within the selected cluster(s).
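The scaling and ranking described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the brand labels and the numeric means are hypothetical, and returning zeros when all cluster means are identical is a choice made for this sketch rather than documented SAS Enterprise Miner behavior.

```python
def normalize_cluster_means(means):
    """Min-max scale one variable's cluster means (M_i1..M_ik) into [0, 1]."""
    lo, hi = min(means), max(means)
    if hi == lo:
        # All clusters share the same mean, so the ratio is undefined; return zeros.
        return [0.0 for _ in means]
    return [(m - lo) / (hi - lo) for m in means]

def rank_variables(cluster_means, overall_means):
    """Order variable names by |cluster mean - overall mean|, largest gap first."""
    gaps = {name: abs(cluster_means[name] - overall_means[name])
            for name in cluster_means}
    return sorted(gaps, key=gaps.get, reverse=True)

# Hypothetical means of one variable in three clusters.
scaled = normalize_cluster_means([0.25, 0.5, 0.75])

# Hypothetical input means for one cluster versus the whole data set.
cluster_1 = {"soap_a": 0.32, "sauce_b": 0.40, "detergent_c": 0.10}
overall = {"soap_a": 0.55, "sauce_b": 0.45, "detergent_c": 0.12}
ranking = rank_variables(cluster_1, overall)
print(scaled, ranking)
```

Variables at the top of the ranking are the ones that would be read first when describing what sets a cluster apart.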

4.2 Cluster Analysis Details

We performed a clustering analysis to identify distinctive groups of consumers based on their brand usage

responses only. All the seventy brands were used as input variables for the clustering analysis. The

clustering analysis led to a few noteworthy insights and enhanced our understanding of the data. Analysis

was performed over the entire Non–Survey data set. We repeated the clustering several times for various

randomly sampled subsets of the Non–Survey data to ensure generality and accuracy of the results. The

repeated runs of the clustering algorithm over these various subsets gave similar results, and seemed to

indicate an obvious clustering into 3 groups. Results shown below are for the clustering of the entire Non–

Survey data set.

Figure 23: Cluster Pie Chart

Figure 23 depicts the three clusters obtained as the distinct pie sections which are indicative of the

frequency of data points in each cluster. The color of each pie section is reflective of the root mean square

standard deviation of points in each cluster, and its value is provided in Figure 24.

Cluster #   % Data Points   Cluster Std Deviation   Cluster Description

1           38              0.246                   Frequent buyers of food items; infrequent buyers of soaps/washes

2           8.7             0.269                   Frequent buyers of detergents; buyers of items with low response rate

3           53.3            0.194                   Frequent buyers of body soaps/washes; infrequent buyers of food items

Figure 24: Cluster Statistics

Figure 24 shows some statistical details of each cluster. These include:

• Percentage Data Points – Percentage of data points in each cluster (total number of data points:

8,608).

• Cluster Standard Deviation – Average standard deviation of each point in the cluster from the cluster

mean, which indicates within-cluster spread of the data.

• Cluster Description – A qualitative description of the distinguishing features of each cluster.

The cluster descriptions are based on the input means plots and importance values described previously.

Another tool used in the process was a decision tree created as part of the clustering analysis, which is

somewhat similar to the classification trees. It generates a set of simple variable-based rules that approximate

the cluster boundaries. The following section describes the features that distinguish the clusters from one

another and the traits shared within each cluster.

4.3 Inference from Cluster Analysis

4.3.1 Effects of Unbalanced Brand Representation

Figure 25 depicts importance values for the variables that were most influential in clustering the

data points.

[Bar chart: Importance Value of Brands. Horizontal axis: Importance Value, 0 to 1. Vertical axis: Brands, including Dove body wash, Caress bar soap, Oofo body wash, Pond’s face, Dial bar soap, Oofo bar soap, Caress body wash, All laundry powder, Surf heavy duty lq, Ragu pasta sauce, GHB ice cream, and Suave fabric cond.]

Figure 25: Importance Value of Significant Variables in Clustering

We find that the personal wash products are the most important variables for the clustering analysis. We

also noticed that the frequency of brand usage affects the importance value: brands with low reported usage

typically have a low importance value. Thus we note that the importance values are

related to the unbalanced brand representation in the data set.

4.3.2 Details of Cluster 1

Cluster 1 consists of 38% of the consumers (3,271). Figure 26 presents the means of particular brands

within cluster 1 as compared to the overall means of the same brands over the entire data set. Note that

the complementary brand usage response has been modeled; thus blocks to the left of the mean

actually indicate that the cluster has a higher incidence of response for that product. For instance, the chart

shows that 55% of the consumers indicate a positive response for Dove body wash in the overall data set, but

this is true for only 32% of the consumers in cluster 1. Because the complementary response is modeled, this

means that the proportion of Dove body wash buyers in cluster 1 is significantly higher than in the overall data set.

Figure 26: Input Means for Cluster 1

We conclude that consumers in cluster 1 display a significantly high propensity for purchase of body

hygiene products including bar soaps and body washes. Some of the main products purchased by these

individuals in the order of significance are: Dove body wash, Dial bar soap, Caress bar soap, Oil of Olay

body wash etc. Another noticeable attribute is that they are also infrequent buyers of food items like Ragu

pasta sauce, Wishbone salad dressing, Gorton fillets, etc.

4.3.3 Details of Cluster 2

Cluster 2 consists of only 8.7% of the consumers (748). Cluster 2 consists of individuals that are infrequent

buyers of both the personal wash and food items. They seem to be purchasers of items with very low

response rate like Ponds Face, VICL Body, VSLN Body and some of the detergents. It is possible that these

responses were collected from a data set of beauty products and laundry products purchasers, or these are

simply outliers in the data set.

Figure 27: Input Means for Cluster 2

4.3.4 Details of Cluster 3

Cluster 3 consists of 53% of the consumers (4,588). The most important attribute of individuals in this cluster is

that they are highly frequent buyers of food items compared to average consumers. Important food items

include Ragu Pasta Sauce, Wishbone salad dressing, Breyers Ice cream, and Gorton Fillets. These consumers are

also characterized as infrequent buyers of personal hygiene products. For example, they are

infrequent buyers of Caress Bar Soap, Dove Body Wash, Dial Bar Soap, Lever 2000 Bar Soap, etc.

Figure 28: Input Means for Cluster 3

Based on the consumer attributes revealed by the clustering analysis, we conclude that the clustering found

may be approximating the underlying data sets that have been aggregated. We found it intriguing to

observe that the cluster of individuals with higher propensity for purchasing body hygiene products should

have a significantly lower tendency for purchasing food products and that the cluster of consumers with a

high propensity for purchasing food items should have a lower propensity for purchasing body hygiene

products. Also, the few outlying individuals who are frequent purchasers of neither personal wash nor food

items have been clustered separately. Possibly information from body hygiene brands like Dove and food

items like Ragu Pasta Sauce were collected separately and aggregated together in one data set, which would

explain the clusters that we observe. As mentioned several times before, body soap and wash products

overwhelm the usage data. Therefore the clustering analysis also segregates frequent and infrequent buyers

of these items into separate clusters. Based on the likelihood that the data available to us has been

aggregated from various sources and the differences in the source may be guiding the clustering analysis,

extrapolation of the results to true usage behavior may be inappropriate. This means that the clusters we

have generated may accurately reflect groups in the data but not be indicative of the true consumer usage

pattern.

4.5 Cluster–Wise Predictive Modeling

Regardless of the reason for emergence of data clusters, they may potentially lead to improved prediction

results. Here we investigate whether fitting a separate model for each cluster can lead to better results. If

accuracy is improved, it suggests that separate clusters of consumers should preferably be modeled

individually.

Once it was identified that there exists a possibility of prior data aggregation leading to artificial structure,

we explored a cluster–wise predictive methodology to obtain better results. The objective was to conclude

whether agglomeration of data sets was desirable or data for separate brands and product categories should

be treated separately.

For this purpose, each data cluster was treated as a separate data set and logistic regression models were built

for each. Thus predictive probabilities were generated for each data point according to the cluster it

belonged to. Probabilities for each of these data entries were also generated through an overall model fitted

to the entire agglomerated data set. Lift curves were generated for both analyses and compared.
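The procedure in this paragraph can be sketched as follows. The sketch is illustrative only: the toy rows, the single binary predictor and the plain gradient-descent fitting routine are assumptions made for the example, not the data or the software used in the study. The toy data are built so that the predictor relates to usage in opposite directions in the two clusters, which an overall model cannot capture.

```python
import math

def fit_logistic(xs, ys, lr=0.5, epochs=2000):
    """Fit p(y=1|x) = sigmoid(w*x + b) by batch gradient descent (one feature)."""
    w = b = 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = 1.0 / (1.0 + math.exp(-(w * x + b))) - y
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

def predict(model, x):
    """Predicted probability of usage for predictor value x."""
    w, b = model
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

# Hypothetical rows: (cluster id, predictor value, uses target brand?).
rows = [
    (1, 0, 0), (1, 0, 0), (1, 0, 1), (1, 1, 1), (1, 1, 1), (1, 1, 0),
    (3, 0, 1), (3, 0, 1), (3, 0, 0), (3, 1, 0), (3, 1, 0), (3, 1, 1),
]

# Overall model fitted to the pooled data.
overall_model = fit_logistic([x for _, x, _ in rows], [y for _, _, y in rows])

# One model per cluster, fitted only on that cluster's rows.
cluster_models = {}
for cid in sorted({c for c, _, _ in rows}):
    subset = [(x, y) for c, x, y in rows if c == cid]
    cluster_models[cid] = fit_logistic([x for x, _ in subset],
                                       [y for _, y in subset])

# Each row is scored by the model of the cluster it belongs to.
cluster_scores = [predict(cluster_models[c], x) for c, x, _ in rows]
overall_scores = [predict(overall_model, x) for _, x, _ in rows]
```

In this constructed example the pooled model assigns every consumer the same probability, while the per-cluster models recover the usage rates within each cluster.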

Figure 29 shows comparative lift charts for cluster–wise and overall analysis for Dove Bar Soap, which is

typical of lift curves observed for other brands too. This analysis shows that a cluster-wise predictive

model is capable of giving significantly better results than a predictive model based on the entire agglomerated data set.

This observation is in contrast to the weak results we obtained by separately modeling consumers in

different GQM score strata.

[Line chart: Lift Curves Comparing Cluster-wise and Overall Analysis. Horizontal axis ticks from 10 to 90; vertical axis from 0 to 1. Series: Cluster Wise Analysis, Baseline, Overall Analysis.]

Figure 29: Comparative Lift Curves for Cluster – Based and Overall Models
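For readers unfamiliar with lift charts, the sketch below shows one common way such a curve can be computed from model scores. It assumes the curves plot the fraction of all actual brand users captured within the top-scoring percentage of consumers, which the report does not state explicitly. The function name, scores and labels are hypothetical.

```python
def cumulative_gains(scores, labels, percents=(10, 30, 50, 70, 90)):
    """Fraction of all actual users found in the top p% of consumers by score."""
    # Order the 0/1 labels by descending predicted probability.
    ranked = [y for _, y in sorted(zip(scores, labels),
                                   key=lambda pair: pair[0], reverse=True)]
    total_pos = sum(ranked)
    if total_pos == 0:
        raise ValueError("labels contain no positive responses")
    gains = {}
    for p in percents:
        cutoff = len(ranked) * p // 100   # number of consumers in the top p%
        gains[p] = sum(ranked[:cutoff]) / total_pos
    return gains

# Hypothetical predicted probabilities and observed usage (1 = uses the brand).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
gains = cumulative_gains(scores, labels)

# A model with no predictive power captures p% of users in the top p% of consumers.
baseline = {p: p / 100 for p in gains}
print(gains, baseline)
```

A curve that lies further above the baseline indicates that the model concentrates more of the actual users among its highest-scored consumers.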

The conclusion to be drawn from this analysis is that modeling small homogeneous groups in the data

independently is a more useful exercise than fitting an overall model to an agglomerated data set. If our

clusters are reflecting underlying data sets from different product categories that have been agglomerated to

form the Unilever Database, then it is more advantageous for Unilever to analyze its various data sets

independently.

5 PROJECT SUMMARY AND CONCLUSIONS

We conclude the report with a brief summary of what has been accomplished. Below we have enumerated

some overall conclusions and recommendations based on all our analysis.

A single extract of the Unilever database was made available to us for the purpose of developing modeling

methodologies useful for Unilever’s business, to identify actionable insights from the data, and to evaluate

the data as an asset for targeted marketing. In addition to sizeable efforts trying to understand, clean, and

prepare the data, we focused on generating predictive models of individual brand usage because such

models are arguably the most valuable tool for targeted marketing. Our efforts and interactions with

Unilever representatives gave us a better understanding of the data which in turn led to more refined

analyses on subsets of the data. Finally, we generated clustering models to identify groupings in the data,

whether due to natural brand usage patterns or induced artificially through combination of various data sets.

We list a number of overall trends and conclusions that arise out of our analyses:

• There is a need to understand the content of the Unilever database more fully. This includes

gaining greater understanding of the effect of missing data and assessment of the quality of some

of the third party information. Also, the Unilever database includes some information on dates

and usage quantities. The availability of such data could potentially lead to more interesting

analyses.

• Logistic regression appears to be a suitable method for most of the prediction tasks undertaken. It

has the benefits of being well-studied for these types of data and of being interpretable. Its

performance was comparable to more complicated methods for most of the tasks we tried.

Overall, we observed that data quality issues seemed more important than the choice of modeling

methodology.

• The Unilever data seems to be aggregated from several different data sources, many of which

seem to be incompletely understood and/or poorly documented. We note that the quality and

applicability of any data analysis depends critically on the quality of the underlying data and the

extent to which it is understood.

• Our prediction efforts seem to indicate that in general, predictions of self-reported brand usage

based on MVC estimations and on usage of other products were somewhat effective. We believe

that in this data set there is a potential problem with using MVC estimates to predict usage of

certain individual brands, as these MVC estimates are sometimes based on the same data we are

trying to predict. Demographic variables seemed to have less predictive power.

• Many of our prediction and clustering analyses seemed to be heavily influenced by data

aggregation effects. In many cases, these effects may have dominated any underlying consumer

usage effects. Due to this, our results may be representative of the data we are working with but

extrapolate poorly to actual usage situations.

• A strategy of clustering consumers first, then fitting prediction models gave improved prediction

results. This suggests that data may best be analyzed in a cluster-wise manner. If clusters are

representative of underlying data sources, then these data sources may be better analyzed

individually rather than in an aggregated fashion.

• While the Unilever database provides information on a large number of consumers, it is our

opinion that modeling and insight generation about consumer behavior is best performed on a

cleaner data set that is better understood and more uniformly gathered. Examples include the

panel data to which Unilever already has access and data gathered by retailers of Unilever brands.

