8/3/2019 Thierry Vallaud Thesis
Estimating potential customer value using customer data
Using a classification technique to determine customer value
Thierry Vallaud
A Thesis
Submitted in Partial Fulfillment of the
Requirements for the Degree of
Master of Science in Data Mining
Department of Mathematical Sciences
Central Connecticut State University
New Britain, Connecticut
April 2009
Thesis Advisor
Dr. Daniel Larose
Department of Mathematical Sciences
Key Words: Turnover potential, Classification, Kohonen Networks
Abstract:
This study outlines a method of determining individual customer potential, based solely on
data present in the customer database: descriptive information and transaction records.
We define potential as the incremental turnover that a given company could generate with
its present customers.
In order to successfully calculate this potential in a large database with multiple variables, we
propose grouping together customers who look like each other (known as clones), by means
of an appropriate clustering technique: Kohonen Networks.
This method is applied to actual data sets, and various techniques are employed to check the
stability of the clusters obtained. Real potential is then determined by means of an empirical
approach: a practical application to a major French retailer's database of 5 million customers.
Contents
The context
Our thesis subject
The precise modelling application
The research questions
The data mining process used
  Data understanding
  Data preparation
Clustering models and determination of customer potential
  Kohonen network method
Model development
  1- Objectives and methodology
  2- Robustness of the Kohonen method
  3- Calculation of the potentials
  4- Main results
  5- Results summary
The validation procedures for the models
Conclusions
  Discussion of the results of the research study
  The limits and the contribution of our research study
  Further research
Bibliography
Appendix
The context
Most companies would like to know their customers' potential in terms of turnover at the
individual level. Determining potential means identifying the incremental turnover that a
given company can generate with its existing customers.
Customer turnover potential models exist and are mainly based on the customer value
determined by the LTV (Life Time Value) approach (Bénavent and Crié; Berger and
Nasr 1998; Dwyer 1997; Venkasten, Rajkumar and Kumar 2004).
Besides this approach, other models estimate the customer's spending share (Cooil et
al. 2007; Yuxing Du et al.; Keiningham et al. 2007). Econometric models also exist, which
are based on data that are often external to the database (Plastria 2001; Huff 2003; Reilly
1931).
Customer consumption (total value) represents the lifetime consumption of a particular
product by a particular customer, referred to as Customer Total Value or CTV. For example,
a customer's total value for a retailer is the sum of all the purchases
he will make in the retailer's stores during his life.
It is possible to estimate a customer's consumption on this market for a given brand b. Over
the course of his lifetime, the customer will consume several brands. His total consumption
of one of these brands then constitutes that brand's wallet share over the customer's lifetime
(Figure 1).
[Figure 1: wallet share of the brand]
The difference, or delta, between the customer's total consumption in the market and his
total consumption of the brand corresponds to the Competitors' Consumption Total
Value, or CCTV (Figure 2).
Depending on the brand's marketing stimulus, the customer will transfer a share of that
delta from competitors to the brand and/or increase his consumption in the total market:
the retailer's customers also shop in competitors' stores, and may increase their
total consumption across retailers.
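To make these quantities concrete, here is a minimal Python sketch of the arithmetic described above. The function names and the figures are hypothetical and are not taken from the retailer's data; they only illustrate how the wallet share and the CCTV relate to the CTV.

```python
def wallet_share(brand_consumption, total_consumption):
    """Share of the customer's total market consumption captured by the brand."""
    if total_consumption <= 0:
        raise ValueError("total consumption must be positive")
    return brand_consumption / total_consumption


def competitors_total_value(brand_consumption, total_consumption):
    """CCTV: total consumption in the market minus consumption of the brand."""
    return total_consumption - brand_consumption


# Hypothetical customer: 10,000 spent in the market, 4,000 of it with the brand.
ctv, brand = 10_000.0, 4_000.0
share = wallet_share(brand, ctv)            # fraction of the wallet held by the brand
cctv = competitors_total_value(brand, ctv)  # what currently goes to competitors
```

The CCTV is the pool from which a share can be won back; any increase in the customer's total consumption comes on top of it.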
Thus, the customer's theoretical potential is his total consumption over his lifetime.
His reachable potential, which can be estimated by means of the above econometric
models, is the sum of three terms:
- the actual value for brand 1,
- the share of consumption taken from the competitors (Figure 3),
- the increase in his total consumption.
The customer's reachable potential thus corresponds to what the brand has already captured
plus what the customer could consume additionally or shift from competitors. This reachable
potential can be estimated in two ways: using an econometric model, which requires
data exogenous to the company's internal customer database; or alternatively, using solely
internal data from the company's customer database, by means of the clone method.
A given brand can only capture n% of the theoretical potential (Berend Wierenga and Gerrit,
2000). Some marketing researchers have shown that a brand can increase its actual wallet
share by a maximum of 30%; above this rate, the customer perceives a change and tries to
resist it, because his choice set [1] is modified too much (Bremer and
Joyce, 1988). This subject has already been covered in one of our previous studies (Vallaud,
2003).
[1] The choice set is the finite set of products in a given product category that a customer has in mind before making a purchase.
The most advanced approaches to the determination of potential try to determine the portion that
could be reachable for the company, relying solely on customer data from the company's
customer database. These approaches calculate a customer-by-customer potential but
must remain consistent, at the aggregated level, with macro-level market
values.
Our thesis subject
The objective is to work on clustering models [2] (Lerman 1970; Dorofeyuk 1971; Borko and
Bernick 1963; Two Steps (Tan et al. 1997); K-means (Hartigan et al. 1979; Fang et al. 1982);
SOM (Teuvo Kohonen 1988; Vesanto 1997; Kaski 1997); etc.), on large databases from
commercial companies (phone operators, ISPs, major retailers, mail order companies, etc.).
We use clustering models in order to determine the customer potential using a method we call
the clone method, whereby customers who most resemble each other are considered to be
clones and should have the same potential.
We have access to a variety of databases suited to our methodological process. In this
document we will perform an empirical test of our method on customer data from a major
French grocery retailer.
As part of our brief presentation of the context, we will look at two main subjects:
- Calculation of potential, or customer value, in marketing and its different dependences: LTV, wallet share, market share capture, etc.
- The mathematical models that allow similar individuals to be grouped into homogeneous data groups: clustering techniques
The field of investigation is multidisciplinary: a minor part concerns marketing
and the major part concerns statistics, data mining and clustering.
The precise modelling application
The main research objective is to test several techniques, separately and
possibly jointly, to ensure that the clusters formed are homogeneous groups of clones.
Besides choosing the models, part of the research involves defining the most informative
variables and a model topology that fits these data. The aim here is to obtain the most
meaningful and convergent results.
Another aspect of our research will involve confirming the clusters obtained using the models, along with other complementary statistical techniques:
- Dimension reduction to select variables, given the very large number of clusters and the wide value ranges,
- Projection of passive and active variables [3] in the clusters,
- Reallocation of clusters by supervised models,
- Validation by non-automatic classification techniques, connectivity of super classes,
- Empirical verification with external panels such as Nielsen or TNS Sofres [4], which represent the market reality of the potential.
[2] SOM is a clustering method. "Typologies" is the French term for clustering, which belongs to the unsupervised classification techniques.
[3] Active variables are used to build the groups themselves in terms of distances; passive variables are merely descriptive variables used to explain the groups.
Another large part of our research study consists of selecting the above-mentioned methods and
validating these choices. The aim is to find a clustering method that converges sufficiently to
be validated with all the approaches described above. The modelling will therefore become a process involving several models.
The final modelling will be carried out on a market-standard software platform:
the French version of Clementine from SPSS.
The scientific contribution will be:
- a methodological contribution to selecting clustering models and validating these choices
- a real-life data application, validated by the reality of an actual business case: calculating real attainable potentials
The research questions
- Can we use a clustering technique to identify customers who are similar to each other and therefore define a realistic potential in terms of turnover for these customers?
- Can we develop a method?
- How can we validate the stability of the clusters?
The data mining process used
We will use the Cross-Industry Standard Process for Data Mining (CRISP-DM) [5] as the
project process that guides our approach to analyzing the data. The CRISP-DM standard
process consists of the following stages: business understanding, data understanding,
data preparation, modelling, evaluation and deployment.
[4] Nielsen and TNS are market research companies that run panels whose members scan the purchases they make. These panels can be cross-referenced with customer databases to measure marketing-mix effects.
[5] http://www.crisp-dm.org/
Data understanding
We will work on 5,373,026 individuals derived from the database of a major French retail
company. We have the details of all cash register receipts over a period of 12 months, from January 2006 to December 2006.
For external validation purposes, we also have market research available on the French
market: Referenseigne 2006 from TNS Sofres [6]. This research gives us the wallet share of the
main French retailers [7].
Data preparation
This step consists of familiarizing ourselves with the data in the program members'
database, in order to determine the structure of the database (data layout), the fill rate
of the fields in the data file, and the origin and nature of the data. Each field is then checked to ensure it does not undermine model stability.
We carried out a data audit and an exploratory data analysis (EDA) in two steps; only the second EDA is presented in this
document.
The audit includes:
- The structure of the database
- The origin and nature of data (socio-demographic / consumption)
- The possibility of performing cross data analysis (by brand / shelf / product family, etc.)
- Data periodicity
- Data historicity
- Data completeness
The principal data management processes performed on the data in the database will
therefore include:
- Checking and validating the format of the variables
- Recoding and correcting certain variables with aberrant values
- Creating specific aggregates useful for further segmentation (total turnover, turnover by product family, annual visit frequency, average basket)
- Analysing the correlation of the target variable (turnover) with other variables (socio-demographic criteria, order frequency) in order to check whether any dependency relationships exist
- Geocoding (useful for enriching the profiles with socio-demographic data derived from the INSEE [8] (French national statistical office) via the IRIS [9] (specific French geocoding data))
[6] Referenseigne is a monographic market research study of the French retail market, carried out yearly for the past ten years by TNS Sofres, the third-largest research company worldwide.
[7] http://www.secodip.fr/worldpanel/htm/dossier_presse/tns-plusloin.asp/
[8] INSEE (Institut National de la Statistique et des Études Économiques) is the French National Institute for Statistics and Economic Studies. It collects and publishes information on the French economy and society, carrying out the periodic national census. Located in Paris, it is the
- Calculating the distances between the customer and the Point of Sales (trade zone)
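The distance between a geocoded customer and a point of sale can be computed in several ways. The sketch below uses the great-circle (haversine) distance, which is an assumption on our part: the formula actually applied in the data preparation is not stated here, and the function name is hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
```

A trade zone could then be defined, for example, as the set of customers whose distance to the store falls below a chosen threshold.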
The analysis will be performed on the 12-month sliding turnover over the whole of the
historical data, to make the modelling more reliable. The larger and more homogeneous the
historical data set, the more stable and predictive the model should be.
In this document, we have included only some examples of the data audit and data
preparation, as our demonstration is focused on the model and results. Details of the second
EDA are given in Appendix 2 (p. 39).
The input variables are shown in the following table:
French branch of Eurostat, the European Statistical System. The INSEE was created in 1946 as a successor to the National Statistics Service (SNS), created under Vichy during World War II.
[9] The IRIS is the French geographic unit to which the census data are linked.
Identification of the outliers:
We have identified and eliminated from the analysis some customers with anomalous
behaviour on two variables linked to turnover.
We used only these two variables in the outlier detection because they largely determine
the potential itself.
Discretization:
We have discretized some important variables and studied their dispersion.
A full EDA is provided in Appendix 2 (p. 39), with descriptive analysis, tables and
graphs, correlation estimates, and so on.
Clustering models and determination of customer potential
The modelling process is divided into three major phases:
(1) the clustering method itself, (2) the calculation of the evolution levels, and (3) the
calculation of the individual customer potential:
1. The clustering method: as these models are being applied to very large databases with large numbers of variables and records, the SOM (Self-Organizing Map) seems to be
particularly well adapted (Kohonen, 1988):
- Kohonen networks produce very homogeneous and stable groups with many individuals and variables,
- Kohonen networks can capture complex non-linear relationships among many variables for many individuals,
- Kohonen networks handle missing data well.
Kohonen network method
Kohonen networks represent a type of self-organising map (SOM), which itself represents a
special class of neural networks.
Kohonen analysis is a clustering method. Its main advantage is that it converts a high-dimensional
input signal into a simpler, low-dimensional discrete map. Kohonen is an unsupervised method:
no target has to be defined.
Kohonen networks exhibit three characteristic processes:
1 Competition: output nodes compete with each other to produce the best value for a
particular scoring function, most commonly the smallest Euclidean distance.
2 Cooperation: the winning node becomes the centre of a neighbourhood of excited
neurons.
3 Adaptation: nodes in the neighbourhood of the winning node participate in adaptation, that is, learning. The weights of these nodes are adjusted so as to further improve the score function.
Network architecture:
Each neuron of the Kohonen map is linked to all the other neurons of the map. Each one of
them receives a complete copy of the input vector.
[Figure: Kohonen network. Labels: inputs; learning rate; winner and its neighbourhood; weights of the winners adjusted as a function of the input data; output data which try to become winners]
Kohonen networks are self-organising maps that exhibit Kohonen learning. Suppose that the set of
$m$ field values for the $n$th record is the input vector $x_n = (x_{n1}, x_{n2}, \ldots, x_{nm})$, and that the current set
of $m$ weights for a particular output node $j$ is the weight vector $w_j = (w_{1j}, w_{2j}, \ldots, w_{mj})$. In
Kohonen learning, the nodes in the neighbourhood of the winning node adjust their weights
using a linear combination of the input vector and the current weight vector:
$$w_{ij,\text{new}} = w_{ij,\text{current}} + \eta \, (x_{ni} - w_{ij,\text{current}})$$
where $\eta$, $0 < \eta < 1$, represents the learning rate. Kohonen indicates that the learning rate should
be a decreasing function of the training epochs (runs through the data set).
Upon each iteration, it checks the accuracy of its previous grouping.
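The three processes (competition, cooperation, adaptation) and the weight update can be sketched in a few lines of Python. This is only an illustrative toy, not the Clementine implementation used in this study: the hard neighbourhood radius, the linear decay of the learning rate and radius, the random initialisation, and all function and parameter names are assumptions.

```python
import math
import random


def winner(weights, x_n):
    """Competition: node whose weight vector is closest (Euclidean) to the record."""
    return min(weights, key=lambda node: math.dist(weights[node], x_n))


def train_som(records, rows, cols, epochs=20, eta0=0.5, radius0=1.0, seed=0):
    """Train a small Kohonen map; returns the weight vectors keyed by (x, y) node."""
    rng = random.Random(seed)
    m = len(records[0])
    weights = {(x, y): [rng.random() for _ in range(m)]
               for x in range(cols) for y in range(rows)}
    for epoch in range(epochs):
        # Learning rate and neighbourhood radius decrease with the epochs.
        decay = 1.0 - epoch / epochs
        eta, radius = eta0 * decay, radius0 * decay
        for x_n in records:
            win = winner(weights, x_n)
            for node, w in weights.items():
                # Cooperation: only nodes within the radius of the winner learn.
                if math.dist(node, win) <= radius:
                    # Adaptation: w_new = w_current + eta * (x - w_current)
                    for i in range(m):
                        w[i] += eta * (x_n[i] - w[i])
    return weights


# Toy data: two well-separated groups of records with two fields each.
records = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [1.0, 1.0], [0.9, 1.0], [1.0, 0.9]]
som = train_som(records, rows=1, cols=2)
groups = [winner(som, r) for r in records]  # map node allocated to each record
```

On real data the fields would first be put on comparable scales, since the Euclidean distance is sensitive to the units of each variable.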
- A Kohonen network is particularly well suited to building homogeneous groups. It is obviously a lengthy process when performed on a large number of individuals with many variables and records.
- A Kohonen network allocates a relevant group to each customer.
By mapping the analysis, we can evaluate the similarity between groups. Two groups which
are close on the graph have similar characteristics.
The aim is to find a method:
- That represents the best trade-off between many classes, ensuring small groups with homogeneous customers within each group, but groups which differ greatly from each other.
- That enables us to obtain realistic customer potential with clusters that are internally stable.
2. Calculation of the evolution level: evolution is the small jump in turnover that a customer needs to make in order to be grouped with the customers who most resemble
him on all the variables selected for the model, but who represent a higher turnover than
him. This requires a calculation method based on dividing each class of clones, for
which we calculate the median, into deciles.
Individuals in one group should not have a huge gap to cross, so that we obtain a realistic
determination of potential [10]: the potential increase in turnover that could be achieved after application of the correct marketing actions. We will try to justify this calculation by
methodological means. This step will give us the evolution rates in the classes.
3. Calculating individual customer potential: once the rates are properly determined, we will calculate, for each customer, the individual potential to be captured. This
calculation needs specific adjustments: all customers with an evolution rate
above 100% are assigned the average potential rate of all groups other than the one to
which they belong.
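As a rough sketch of phase 2, the Python function below computes an evolution rate for every customer of one cluster. It assumes that the target of each customer is the highest turnover in his decile, in line with the worked example given in the footnote on potential; the function name, the way deciles are cut and the toy figures are hypothetical, and the adjustment for rates above 100% described in phase 3 is left out.

```python
def evolution_rates(turnovers, n_bins=10):
    """Evolution rate of each customer of one cluster.

    rate = (highest turnover in the customer's bin - own turnover) / own turnover
    Turnovers are assumed positive (inactive customers are discarded upstream).
    """
    order = sorted(range(len(turnovers)), key=lambda i: turnovers[i])
    n = len(order)
    rates = [0.0] * n
    for b in range(n_bins):
        members = order[b * n // n_bins:(b + 1) * n // n_bins]
        if not members:
            continue  # fewer customers than bins
        upper = turnovers[members[-1]]  # upper limit of this bin
        for i in members:
            rates[i] = (upper - turnovers[i]) / turnovers[i]
    return rates


# Toy cluster cut into 2 bins instead of 10 to keep the example short.
rates = evolution_rates([1000.0, 1200.0, 3000.0, 3300.0], n_bins=2)
```

With these figures the first customer has a potential of 200 on a turnover of 1,000, that is, an evolution rate of 20%.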
Model development
1 - Objectives and methodology
To complement segmentations based on customer turnover, namely SML segmentation [11]
(Brusset 2005) and RFM segmentation [12]
(McCarty and Hastak 2007; Chen et al., 2008), we calculate scores
of turnover potential for each customer in the loyalty program database.
This score is based on an iterative approach that allows us to predict customers' propensity
to consume, with the aim of determining their potential future turnover.
[10] Example: Customer A has an actual turnover of $1,000. Customer A belongs to the first decile of a cluster in which all customers closely resemble each other. The maximum turnover, that of the customer at the upper limit of this decile, is $1,200. The potential is therefore the difference between the $1,200 of that customer and the $1,000 of customer A: 20%, or $200.
[11] SML segmentation (Small, Medium, Large) divides customers according to their turnover.
[12] RFM segmentation (Recency, Frequency, Monetary value) is a classical segmentation in marketing.
The approach consists of grouping together customers who resemble each other, according to
certain socio-demographic and consumption variables.
For the computation we will use consumption data recorded over a period of 12 months (from
January 2006 to December 2006).
The variables used in the model are those we decide to keep following the data preparation
stage.
Socio-demographic and consumption data | Turnover rate per product family
Customer ID                            | Customer ID
Number of children in the household    | Rate other
Filtered turnover on 12 months         | Rate bazaar
Total turnover                         | Rate other
Yearly turnover on promo               | Rate pork butcher LS
Nb of transformed points on 12 months  | Rate pet food
Nb of CM on 12 months                  | Rate baby
Nb of reduction vouchers used          | Rate butcher
SML 12 months                          | Rate bakery
RFM 3 months                           | Rate pork butcher
Number of children in the household    | Rate dietetic/organic
                                       | Rate cheese
                                       | Rate fruits and vegetables
                                       | Rate fish
                                       | Rate frozen food
                                       | Rate wine
                                       | Rate cleaning products
                                       | Rate grocery
                                       | Rate liquids
                                       | Rate textile
                                       | Rate ultra-fresh products
                                       | Rate poultry
                                       | Rate first price
                                       | Rate Retailer Brand 1
                                       | Rate Retailer Brand 2
Discarded variables are eliminated after a correlation analysis for the quantitative variables
(turnover and number of purchases, for instance) or by means of proximity matrices for the qualitative
variables. We did not use PCA because we wanted to keep the information at the most
disaggregated level of the original variables in the database.
Inactive customers, that is, customers without any transaction during the period, are discarded.
Clementine stream:
This figure illustrates how a model is built in Clementine from SPSS. Clementine is statistical
software that uses an object language to build models.
We will use a clustering method to create "clone" groups that are highly homogeneous
internally, but different from each other.
The second stage involves creating turnover potential values for these different groups, given
that an individual with the same variables as another does not necessarily achieve the same level
of turnover. He can tend towards the turnover of his superior clone. To do this, we will use a
Kohonen neural network.
Once the clone families have been obtained and the potential values calculated, the main families are determined:
- Gold: evolution rate higher than 20%
- Silver: evolution rate between 15% and 20%
- Bronze: evolution rate below 15%
The evolution rate is the ratio of the potential to the actual turnover.
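The allocation to the three families can be sketched as follows. The treatment of the exact boundaries (a rate of exactly 15% or 20% is classed as Silver) is an assumption, as the thresholds above do not specify it; the function name is hypothetical.

```python
def potential_family(potential, actual_turnover):
    """Gold / Silver / Bronze family from the evolution rate (potential / turnover)."""
    if actual_turnover <= 0:
        raise ValueError("actual turnover must be positive")
    rate = potential / actual_turnover
    if rate > 0.20:
        return "Gold"
    if rate >= 0.15:   # assumption: both boundaries fall in Silver
        return "Silver"
    return "Bronze"
```

For example, a customer with a turnover of 1,000 and a potential of 250 has an evolution rate of 25% and belongs to the Gold family.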
It should be noted here that potential refers to absolute potential over twelve consecutive
months.
This potential is expressed in the form of a rate. For operational purposes, potential values
must be reclassified as absolute value:
P1: Large potential
P2: Medium potential
P3: Small potential
2- Robustness of the Kohonen method:
We test several methods of determining convergences between Kohonen groups.
2.1 CONVERGENCES VISUALIZATION
We obtained 40 groups, labelled from 00 to 93 (note that the cluster labels do not form a
continuous sequence).
We want a fairly large number of groups in order to minimise the
within-group standard deviation as far as possible.
Mappings: 00 is the cluster with coordinate 0 on the X axis and 0 on the Y axis, and 93 is the group with coordinate 9 on the X axis and 3 on the Y axis.
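The labelling convention can be made explicit with a small helper. The function name is hypothetical; it simply reads the first digit of a label as the X coordinate and the second as the Y coordinate, as described above.

```python
def cluster_coordinates(label):
    """Turn a two-digit cluster label such as '93' into its (X, Y) map coordinates."""
    if len(label) != 2 or not label.isdigit():
        raise ValueError("expected a two-digit label such as '00' or '93'")
    return int(label[0]), int(label[1])
```

A map with X from 0 to 9 and Y from 0 to 3 has 40 cells, which is consistent with 40 groups whose labels run from 00 to 93 without forming a continuous sequence.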
Kohonen groups x SML segmentation (12 months)
Colours are generally well grouped, with customers belonging to the same SML segments
being together.
Visually, the placement of the SML segments across clusters shows stability.
Kohonen groups x RFM segmentation (3 months)
Colours are still generally well grouped, with customers belonging to the same RFM segments
found in the same Kohonen groups.
However, there is a far greater mixture of colours inside each cluster, and the homogeneity of
clusters is less obvious than with the SML mapping.
2.2 ROBUSTNESS OF THE KOHONEN CLASSIFICATION
Is this distribution of the population stable? We can answer this question in four different
ways:
A - Do the cluster weights converge between the sample of active observations
and the sample of passive observations?
B - Can the grouping be reproduced by a Bayesian network (Pourret et al, Jensen, Stephenson
2000)?
C - Can the classification be reproduced by a decision tree such as C5.0 (Quinlan 1993, 1996, 2004)?
D - Are the super classes convex?
A/ Convergence of the method
We can check the percentage of customers allocated to each cluster on two random samples.
Clones | Total: Number | Total: % | Learning sample: Number | Learning sample: % | Test sample: Number | Test sample: %
KH01 | 271,944 | 5.06% | 10,949 | 5.13% | 260,995 | 5.06%
KH02 | 171,396 | 3.19% | 6,983 | 3.27% | 164,413 | 3.19%
KH03 | 261,136 | 4.86% | 10,498 | 4.92% | 250,638 | 4.86%
KH04 | 289,912 | 5.40% | 11,508 | 5.39% | 278,404 | 5.40%
KH05 | 80,239 | 1.49% | 3,214 | 1.50% | 77,025 | 1.49%
KH06 | 40,698 | 0.76% | 1,596 | 0.75% | 39,102 | 0.76%
KH07 | 64,515 | 1.20% | 2,550 | 1.19% | 61,965 | 1.20%
KH08 | 93,685 | 1.74% | 3,768 | 1.76% | 89,917 | 1.74%
KH09 | 95,415 | 1.78% | 3,757 | 1.76% | 91,658 | 1.78%
KH10 | 91,169 | 1.70% | 3,681 | 1.72% | 87,488 | 1.70%
KH11 | 57,384 | 1.07% | 2,235 | 1.05% | 55,149 | 1.07%
KH12 | 181,691 | 3.38% | 7,224 | 3.38% | 174,467 | 3.38%
KH13 | 142,728 | 2.66% | 5,624 | 2.63% | 137,104 | 2.66%
KH14 | 83,298 | 1.55% | 3,260 | 1.53% | 80,038 | 1.55%
KH15 | 65,365 | 1.22% | 2,597 | 1.22% | 62,768 | 1.22%
KH16 | 152,665 | 2.84% | 6,153 | 2.88% | 146,512 | 2.84%
KH17 | 119,559 | 2.23% | 4,797 | 2.25% | 114,762 | 2.22%
KH18 | 45,360 | 0.84% | 1,794 | 0.84% | 43,566 | 0.84%
KH19 | 73,151 | 1.36% | 2,783 | 1.30% | 70,368 | 1.36%
KH20 | 35,914 | 0.67% | 1,378 | 0.65% | 34,536 | 0.67%
KH21 | 120,165 | 2.24% | 4,688 | 2.20% | 115,477 | 2.24%
KH22 | 137,752 | 2.56% | 5,462 | 2.56% | 132,290 | 2.56%
KH23 | 36,215 | 0.67% | 1,417 | 0.66% | 34,798 | 0.67%
KH24 | 267,939 | 4.99% | 10,739 | 5.03% | 257,200 | 4.99%
KH25 | 193,624 | 3.60% | 7,581 | 3.55% | 186,043 | 3.61%
KH26 | 50,454 | 0.94% | 2,019 | 0.95% | 48,435 | 0.94%
KH27 | 26,271 | 0.49% | 1,036 | 0.49% | 25,235 | 0.49%
KH28 | 76,724 | 1.43% | 3,082 | 1.44% | 73,642 | 1.43%
KH29 | 199,372 | 3.71% | 7,810 | 3.66% | 191,562 | 3.71%
KH30 | 28,913 | 0.54% | 1,102 | 0.52% | 27,811 | 0.54%
KH31 | 124,878 | 2.32% | 4,922 | 2.30% | 119,956 | 2.32%
KH32 | 347,565 | 6.47% | 13,963 | 6.54% | 333,602 | 6.47%
KH33 | 75,304 | 1.40% | 2,998 | 1.40% | 72,306 | 1.40%
KH34 | 103,656 | 1.93% | 4,107 | 1.92% | 99,549 | 1.93%
KH35 | 24,658 | 0.46% | 989 | 0.46% | 23,669 | 0.46%
KH36 | 31,206 | 0.58% | 1,272 | 0.60% | 29,934 | 0.58%
KH37 | 301,456 | 5.61% | 11,863 | 5.55% | 289,593 | 5.61%
KH38 | 252,820 | 4.71% | 10,042 | 4.70% | 242,778 | 4.71%
KH39 | 130,904 | 2.44% | 5,193 | 2.43% | 125,711 | 2.44%
KH40 | 425,926 | 7.93% | 16,942 | 7.93% | 408,984 | 7.93%
Total | 5,373,026 | 100.00% | 213,576 | 100.00% | 5,159,450 | 100.00%
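The comparison made in this table can be reproduced with a few lines of Python. The sketch below uses three rows copied from the table (KH01, KH19 and KH32) and computes, for each cluster, the gap in percentage points between its share of the learning sample and its share of the test sample. The function name is hypothetical.

```python
# Cluster: (learning sample count, test sample count), copied from the table.
counts = {
    "KH01": (10_949, 260_995),
    "KH19": (2_783, 70_368),
    "KH32": (13_963, 333_602),
}
LEARNING_TOTAL, TEST_TOTAL = 213_576, 5_159_450


def share_gaps(counts, learning_total, test_total):
    """Absolute gap, in percentage points, between the shares in the two samples."""
    return {
        cluster: abs(100 * learn / learning_total - 100 * test / test_total)
        for cluster, (learn, test) in counts.items()
    }


gaps = share_gaps(counts, LEARNING_TOTAL, TEST_TOTAL)
largest_gap = max(gaps.values())
```

For these three clusters the gap stays below 0.1 percentage point, which is the kind of agreement the table is meant to show.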
B/ Reallocation using a Bayesian network
The above table confirms that the algorithm is able to reproduce the distribution on a larger
data set (learning sample vs test sample).
However, only by using another algorithm can we determine whether the clustering
can be reproduced, that is, whether it is stable.
Again, the learning sample is split into two independent sub-samples: the learning sub-sample
includes 70% of the observations, the test sub-sample 30%.
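A 70% / 30% split of this kind can be sketched as follows. The shuffling procedure, the seed and the function name are assumptions made for the example; the split was actually produced with the sampling tools of the modelling software.

```python
import random


def split_sample(ids, train_fraction=0.7, seed=42):
    """Randomly split identifiers into two independent sub-samples."""
    rng = random.Random(seed)
    shuffled = list(ids)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


learning_ids, test_ids = split_sample(range(1000))
```

Each customer ends up in exactly one sub-sample, so a model fitted on the first can be assessed on observations it has never seen.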
We use a Bayesian network because discriminant analysis is not well suited to
predicting membership of 40 groups.
A Bayesian network allows a stepwise approach, as we can set the probability threshold for the links
that we retain between variables. With a threshold of 0.9, the results are as shown in
the graph below.
The network uses 11 variables: turnover data and socio-demographic variables. It can be seen that SML and RFM are very important. This result validates the representation of the densities. The weights of the variables in the model are shown below.
The Kullback-Leibler measure (http://www.it-innovations.ae/iit005/proceedings/articles/E_6_IIT05_Khalid.pdf) comes from information theory. It is a measure of convergence between two series after they have been recoded in a bitmap format. The higher the value, the greater the probability that the two variables have a joint distribution, that is, that they are linked.
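For reference, the standard Kullback-Leibler divergence between two discrete distributions P and Q is the sum of p·log(p/q). The sketch below is a generic illustration, not the implementation used by the Bayesian network software. It assumes that the strength of a link between two variables is measured by comparing their joint distribution with the product of their marginals, a quantity known as mutual information. The two 2x2 tables are hypothetical.

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) in bits for two discrete distributions given as lists."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mutual_information(joint):
    """KL divergence between a joint table and the product of its marginals."""
    rows = [sum(r) for r in joint]
    cols = [sum(c) for c in zip(*joint)]
    flat_joint = [v for r in joint for v in r]
    flat_indep = [ri * cj for ri in rows for cj in cols]  # same row-major order
    return kl_divergence(flat_joint, flat_indep)

independent = [[0.25, 0.25], [0.25, 0.25]]  # hypothetical: no link
dependent = [[0.5, 0.0], [0.0, 0.5]]        # hypothetical: perfect link
print(mutual_information(independent))
print(mutual_information(dependent))
```

The independent table gives 0 and the perfectly dependent one gives 1 bit, which matches the reading that a higher value indicates a stronger link.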
Scoring result at the individual level: on the learning sample, 90.7% of the individuals are correctly classified.
On the test sample, the figure is 90.1%.
The rates (in %) of individuals correctly classified by the Bayesian network are given below for each of the 40 clusters.
The Kohonen clusters can be reproduced.
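The per-cluster rate of correctly classified individuals can be sketched as follows, assuming two parallel lists holding the Kohonen cluster of each customer and the cluster predicted by the supervised model. The labels and the function name `per_cluster_accuracy` are made up for illustration.

```python
from collections import Counter

def per_cluster_accuracy(true_labels, predicted_labels):
    """Share of individuals of each true cluster that the model reallocates correctly."""
    total = Counter(true_labels)
    correct = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    return {cluster: correct[cluster] / n for cluster, n in total.items()}

# Hypothetical labels for six customers
true_labels = ["KH01", "KH01", "KH02", "KH02", "KH02", "KH03"]
predicted = ["KH01", "KH02", "KH02", "KH02", "KH02", "KH03"]

rates = per_cluster_accuracy(true_labels, predicted)
overall = sum(t == p for t, p in zip(true_labels, predicted)) / len(true_labels)
print(rates, overall)
```

Small clusters are the most exposed in such a table: a single misallocated customer moves their rate a long way.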
In the above table, the poorly reallocated groups are, as expected, those containing a small number of customers.
Even for these groups, accuracy remains above 65%.
C/ Reallocation by decision tree
Under cross-validation, 94.2% of individuals are correctly assigned to their groups.
The test sample confirms this rate.
The two supervised learning methods converge strongly: both Bayesian networks and C5 are able to reallocate individuals properly to the 40 clusters.
The robustness of the classification is validated.
D/ Superclasses convexity
We use a Bayesian network analysis, which identifies the small number of variables that are the most important for clustering.
We analyse the contingency table between the 40 groups and the variables which each contribute more than 10% of the explanatory power of the network:
- Family situation
- C.S.P.
- R.F.M. at 3 months
- S.M.L at 3 months
- Home type
- Age categories
- Filtered cumulated turnover
- Customer seniority categories
On this table, the distance used is the chi-square distance (Ottos, 2007, Meunier et al, Romesburg, 2004) and the aggregation method is Ward's (Clarke and Sun, 1997, Barnier 2008).
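As an illustration of the distance only (Ward aggregation is not shown), the sketch below computes the usual chi-square distance between two row profiles of a contingency table, where each squared difference is weighted by the inverse of the column mass. The small table is hypothetical and does not come from the retailer's data.

```python
import math

def chi_square_distance(table, i, j):
    """Chi-square distance between the row profiles i and j of a contingency table."""
    grand_total = sum(sum(row) for row in table)
    col_mass = [sum(col) / grand_total for col in zip(*table)]
    row_i, row_j = table[i], table[j]
    sum_i, sum_j = sum(row_i), sum(row_j)
    return math.sqrt(sum(
        (a / sum_i - b / sum_j) ** 2 / m
        for a, b, m in zip(row_i, row_j, col_mass)
    ))

# Hypothetical counts: three groups (rows) by two modalities of one variable (columns)
table = [[10, 30], [20, 60], [40, 10]]
print(chi_square_distance(table, 0, 1))
print(chi_square_distance(table, 0, 2))
```

Rows 0 and 1 have the same profile (25% / 75%) despite different sizes, so their distance is zero. This is the property that makes the distance suitable for comparing groups of very different sizes.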
[Figure: dendrogram of the 40 Kohonen groups (KH01 to KH40); distance axis from 0 to 9.]
Breakdown of the variance for an optimal classification:
Intra-groups    89790172,495
Inter-groups    25701803,959
Total           115491976,454
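The three figures are additive, as in a within/between decomposition of variance. A quick check in Python, using the values reported above (the variable names are illustrative):

```python
intra = 89790172.495   # intra-groups
inter = 25701803.959   # inter-groups
total = 115491976.454  # total

# The decomposition holds if intra + inter equals the total
gap = abs(intra + inter - total)
share_inter = inter / total
print(f"gap = {gap:.6f}, inter-group share = {share_inter:.2%}")
```

On these figures, the inter-group part is a little over 22% of the total.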
Distances between the central objects:
Results per cluster:
Cluster                              1               2               3               4                5
Objects                              10              7               8               5                10
Sum of weights                       10              7               8               5                10
Intra class standard deviation       72493506,744    25490092,048    67169820,393    125787548,900    151548331,778
Minimal distance to barycenter       3535,855        3419,845        3465,065        3277,114         4143,086
Average distance to the barycenter   7323,184        4577,213        6560,727        8500,865         10606,386
Maximal distance to the barycenter   14057,216       5976,284        16549,137       18792,278        22881,067
Members of cluster 1: KH01, KH02, KH03, KH04, KH05, KH06, KH07, KH08, KH11, KH12
Members of cluster 2: KH09, KH10, KH13, KH14, KH15, KH17, KH18
Members of cluster 3: KH16, KH19, KH20, KH22, KH23, KH24, KH27, KH28
Members of cluster 4: KH21, KH26, KH31, KH32, KH36
Members of cluster 5: KH25, KH29, KH30, KH33, KH34, KH35, KH37, KH38, KH39, KH40
A check is performed to ensure that the bottom/top classification respects the order of the groups: clone 40 is not grouped together with clone 3. This is one of the "quality" criteria of a Kohonen map.
In conclusion, the classification obtained by the Kohonen algorithm satisfies the criteria of stability and reproducibility, which guarantee a robust and lasting potential.
3- Calculation of the potentials
We divided the annual turnover (filtered turnover over 12 months) into deciles.
For each cluster obtained with the Kohonen method, we calculated the business potential based on the turnover.
We retained the deciles method because it allows very significant variations in turnover to be taken into account.
We split the total turnover of each class of clones into deciles, then calculated the median of each decile.
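A minimal sketch of this step for one clones group, assuming the group's annual turnovers are available as a list with at least ten values. The upper limit of a decile is taken here as the largest turnover in that decile, which is an assumption; the function name is illustrative.

```python
import statistics

def decile_medians_and_limits(turnovers):
    """Split sorted turnovers into ten equal-sized groups and
    return (median, upper limit) for each decile."""
    values = sorted(turnovers)
    n = len(values)
    result = []
    for d in range(10):
        group = values[d * n // 10:(d + 1) * n // 10]
        result.append((statistics.median(group), group[-1]))
    return result

# Hypothetical group of 100 customers with turnovers 1..100
summary = decile_medians_and_limits(range(1, 101))
print(summary[0], summary[9])
```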
Then we allocate to the groups potential turnover values derived from the rates of increase between the medians and the upper limits of successive deciles.
For each clones group, the rate of increase measures the turnover growth needed to move from one level to the next, up to the upper decile.
18 rates of increase are determined per clones group:
- Between the median of the first decile and the upper limit of the first decile: Tx01
- Between the upper limit of the first decile and the median of the second decile: Tx02
- Between the median of the second decile and the upper limit of the second decile: Tx03
...
- Between the upper limit of the eighth decile and the median of the ninth decile: Tx16
- Between the median of the ninth decile and the upper limit of the ninth decile: Tx17
- Between the upper limit of the ninth decile and the median of the tenth decile: Tx18
We assign no potential to companies whose turnover is higher than the median of the tenth decile. We estimate that companies with such high turnover will have an evolution rate near 0, equal to the inflation rate, or equal to their annual evolution rate.
For each Kohonen group, a customer whose filtered turnover is between the minimum and the median of the first decile will have an evolution rate equal to rate 1 (Tx01).
A customer whose turnover is between the median and the upper limit of the first decile will have an evolution rate equal to rate 2 (Tx02), and so on.
Each customer is allocated an evolution rate. The rate multiplied by the turnover gives an estimate of the potential turnover of each customer.
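The rate calculation and the allocation rule can be sketched as follows. The boundaries are the first five values published for KH01 (median and upper limit of the first two deciles, then the median of the third), so only four rates are produced here instead of 18. Because the published boundaries are rounded, the rates differ slightly from those in the rate table. Returning 0 for a turnover beyond the last rated interval is an assumption, as are the function names.

```python
from bisect import bisect_right

def growth_rates(boundaries):
    """Rate of increase between each pair of consecutive boundaries."""
    return [(hi - lo) / lo for lo, hi in zip(boundaries, boundaries[1:])]

def rate_for_customer(turnover, boundaries, rates):
    """Below the first boundary: rate 1; between the first and second: rate 2; etc.
    Assumption: 0 once the turnover is beyond the last rated interval."""
    k = bisect_right(boundaries, turnover)  # number of boundaries at or below turnover
    return rates[k] if k < len(rates) else 0.0

# Median D1, upper limit D1, median D2, upper limit D2, median D3 (KH01, in euros)
boundaries = [16.6, 34.6, 56.4, 81.9, 111.9]
rates = growth_rates(boundaries)

customer_turnover = 20.0  # hypothetical customer
rate = rate_for_customer(customer_turnover, boundaries, rates)
potential = customer_turnover * rate
print(f"rate = {rate:.2%}, potential = {potential:.2f} euros")
```

A customer at 20 euros sits between the median and the upper limit of the first decile, so rate 2 applies.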
4.1 - Limits and medians of deciles per Kohonen group:
Turnover in euros
Columns: Cluster; Group; Number; Mean; Median; then, decile by decile, Median and Higher limit
KH01 00 271 944 683 354 16,6 34,6 56,4 81,9 111,9 146,5 187,5 234,1 289,4 354,5 431,5 523,7 634,9 769,0 935,8 1 151,5 1 409,5
KH02 01 171 396 730 396 19,0 39,9 64,5 93,1 126,3 164,1 209,3 262,8 323,4 396,1 482,8 584,4 705,7 847,6 1 023,5 1 245,5 1 521,7
KH03 02 261 136 770 414 22,2 45,8 72,5 102,8 137,3 177,3 223,6 277,3 340,4 414,1 501,1 603,4 724,0 867,4 1 041,4 1 266,4 1 570,2
KH04 03 289 912 1 353 658 31,3 65,4 104,5 148,8 201,6 262,2 335,5 423,3 529,1 658,0 812,4 1 000,2 1 230,8 1 510,2 1 864,7 2 314,6 2 898,5
KH05 10 80 239 471 303 17,8 35,4 55,7 79,8 106,7 136,4 170,6 209,7 253,7 303,4 359,7 425,9 499,4 582,3 675,7 784,9 911,0
KH06 11 40 698 452 312 19,7 39,1 61,4 86,5 113,9 146,0 180,7 217,6 262,3 311,7 367,8 430,3 502,8 584,4 677,0 775,0 885,0
KH07 12 64 515 469 360 23,9 49,5 77,3 107,4 140,9 177,0 217,5 261,5 308,6 360,1 417,2 480,0 549,0 622,9 705,6 797,8 900,8
KH08 13 93 685 697 467 27,9 57,4 91,0 127,5 168,3 215,1 268,2 326,7 391,7 466,7 550,4 643,3 748,8 862,3 995,6 1 146,6 1 381,5
KH09 20 95 415 432 313 20,5 39,6 61,7 86,1 113,2 144,8 179,9 219,5 263,6 313,0 369,2 430,3 499,9 576,5 662,0 756,4 863,1
KH10 21 91 169 519 452 32,7 65,5 101,0 140,9 183,5 229,8 280,1 333,2 391,0 451,7 516,7 581,1 651,4 722,7 797,6 877,1 957,5
KH11 22 57 384 580 454 27,6 56,8 92,0 129,3 170,7 219,8 271,1 327,2 390,2 454,2 525,6 600,9 683,3 771,7 865,6 965,8 1 071,6
KH12 23 181 691 850 670 37,7 80,4 129,2 184,3 245,6 314,3 390,1 475,0 567,6 669,9 783,8 905,1 1 037,8 1 180,3 1 330,4 1 501,9 1 694,4
KH13 30 142 728 988 779 39,4 85,9 144,8 214,2 293,1 377,6 472,0 571,3 674,2 779,0 888,3 1 000,1 1 115,8 1 257,1 1 429,2 1 627,0 1 853,7
KH14 31 83 298 2 971 2 592 1 268,3 1 393,2 1 525,2 1 658,7 1 802,7 1 946,8 2 099,5 2 257,7 2 430,3 2 591,5 2 764,1 2 946,5 3 158,4 3 396,3 3 675,6 4 009,5 4 422,9
KH15 32 65 365 2 488 2 079 1 209,3 1 295,8 1 383,0 1 472,9 1 564,9 1 660,3 1 759,2 1 863,6 1 971,1 2 078,9 2 189,0 2 304,7 2 422,7 2 594,9 2 876,1 3 239,7 3 701,3
KH16 33 152 665 4 203 3 745 1 584,7 1 976,5 2 325,5 2 579,3 2 746,6 2 917,8 3 107,6 3 304,4 3 520,0 3 745,0 3 991,8 4 256,8 4 549,6 4 879,7 5 264,2 5 707,4 6 261,8
KH17 40 119 559 2 646 2 254 1 166,1 1 270,9 1 377,5 1 488,5 1 604,0 1 722,5 1 843,4 1 971,9 2 110,0 2 253,5 2 405,7 2 574,7 2 784,9 3 027,5 3 304,2 3 625,9 4 034,0
KH18 41 45 360 3 263 2 888 1 336,7 1 512,5 1 681,1 1 859,9 2 035,6 2 215,0 2 397,4 2 568,8 2 720,1 2 888,3 3 069,4 3 267,3 3 488,7 3 738,1 4 025,8 4 393,4 4 841,6
KH19 42 73 151 4 554 4 065 2 621,0 2 755,0 2 893,3 3 034,8 3 177,2 3 334,5 3 500,5 3 673,4 3 862,7 4 064,4 4 274,6 4 499,5 4 755,2 5 056,5 5 397,1 5 793,1 6 303,2
KH20 43 35 914 4 571 4 070 2 570,7 2 705,2 2 850,9 3 004,1 3 155,6 3 312,4 3 487,7 3 669,4 3 861,8 4 069,6 4 284,9 4 527,1 4 797,5 5 103,9 5 447,5 5 861,8 6 382,0
KH21 50 120 165 2 291 1 943 1 212,4 1 282,5 1 356,5 1 432,3 1 510,3 1 590,1 1 674,2 1 760,4 1 849,6 1 942,5 2 038,5 2 141,3 2 246,1 2 355,7 2 469,4 2 762,9 3 232,9
KH22 51 137 752 4 572 4 109 2 515,0 2 664,9 2 817,2 2 978,6 3 143,5 3 319,0 3 501,2 3 691,9 3 891,3 4 109,4 4 339,2 4 594,2 4 870,6 5 186,9 5 539,8 5 956,9 6 479,9
KH23 52 36 215 4 336 3 849 2 599,6 2 709,0 2 823,5 2 941,5 3 069,1 3 201,5 3 344,2 3 504,2 3 667,1 3 848,9 4 045,0 4 260,8 4 498,7 4 770,8 5 088,1 5 474,5 5 962,4
KH24 53 267 939 4 633 4 099 2 637,0 2 772,2 2 912,9 3 057,5 3 208,4 3 365,3 3 528,8 3 707,9 3 896,9 4 099,3 4 320,3 4 562,6 4 827,9 5 128,9 5 477,8 5 896,2 6 428,4
KH25 60 193 624 552 436 24,0 50,9 83,8 121,2 163,1 210,5 260,4 314,0 373,5 436,1 502,0 573,0 646,0 722,1 802,9 888,8 977,2
KH26 61 50 454 1 692 1 669 161,2 1 182,5 1 234,0 1 285,9 1 342,0 1 400,7 1 463,2 1 528,9 1 595,8 1 668,7 1 741,8 1 821,0 1 904,3 1 990,9 2 082,9 2 178,5 2 281,0
KH27 62 26 271 2 766 2 552 1 348,1 1 538,1 1 709,0 1 852,4 1 978,5 2 103,6 2 221,7 2 339,1 2 452,7 2 551,7 2 652,2 2 772,0 2 905,3 3 059,1 3 241,7 3 468,9 3 781,4
KH28 63 76 724 3 224 2 933 1 632,4 1 913,0 2 120,9 2 299,2 2 463,6 2 559,5 2 640,9 2 728,8 2 827,7 2 933,2 3 052,2 3 184,3 3 340,5 3 517,5 3 732,8 3 997,6 4 345,7
KH29 70 199 372 365 276 22,6 43,0 64,6 87,8 112,6 140,2 169,7 202,1 237,2 276,0 320,2 368,9 424,1 486,6 557,6 641,4 738,9
KH30 71 28 913 598 565 58,5 119,2 174,6 228,6 281,5 335,2 390,3 444,8 502,3 564,5 622,8 684,5 744,3 808,2 873,5 940,2 1 009,9
KH31 72 124 878 1 292 1 442 25,8 62,7 118,6 215,8 498,0 1 194,3 1 251,5 1 313,0 1 375,6 1 441,9 1 512,9 1 589,5 1 672,5 1 763,4 1 859,4 1 965,2 2 078,4
KH32 73 347 565 1 220 1 415 14,7 32,2 59,1 102,6 184,8 445,0 1 200,4 1 267,9 1 339,4 1 414,7 1 493,6 1 577,2 1 665,5 1 759,2 1 859,6 1 967,5 2 084,2
KH33 80 75 304 527 495 72,8 124,0 168,2 212,6 257,3 301,7 348,6 395,7 444,1 494,5 547,2 602,7 659,6 719,7 783,0 849,3 917,2
KH34 81 103 656 417 334 36,0 62,9 90,2 118,1 146,9 178,4 211,3 248,3 289,0 334,2 383,6 437,9 498,8 567,4 643,6 728,1 821,9
KH35 82 24 658 458 345 16,7 51,6 82,6 115,2 149,1 184,6 219,3 257,4 297,8 345,3 395,3 449,8 507,5 575,5 647,2 729,4 822,4
KH36 83 31 206 982 1 199 0,9 3,8 9,1 15,0 22,1 32,4 47,8 76,5 155,9 1 199,2 1 264,9 1 335,1 1 417,4 1 509,6 1 622,1 1 759,1 1 930,6
KH37 90 301 456 623 619 151,9 214,8 269,7 322,0 371,7 421,1 470,2 520,2 569,5 619,2 669,1 719,9 771,5 823,8 876,4 930,6 985,9
KH38 91 252 820 322 241 27,8 47,1 66,0 86,4 107,8 130,7 155,3 181,6 209,5 241,0 275,7 314,9 358,7 409,4 467,9 537,8 624,2
KH39 92 130 904 265 171 21,4 35,3 48,6 62,5 77,3 92,6 109,4 127,5 147,5 170,5 196,9 228,3 263,3 306,0 357,8 423,1 509,1
KH40 93 425 926 143 62 5,5 9,6 13,7 18,2 23,3 29,0 35,6 43,1 52,0 62,4 74,8 89,8 108,4 131,6 163,2 206,9 271,4
TOTAL 5 373 026 1 390 687 31,2 2 772,2 90,0 3 057,5 181,3 3 365,3 316,4 3 707,9 461,9 4 109,4 632,4 4 594,2 780,9 5 186,9 1 055,1 5 956,9 1 556,4
Decile columns, in order: 1st Decile, 2nd Decile, 3rd Decile, 4th Decile, 5th Decile, 6th Decile, 7th Decile, 8th Decile (Median and Higher limit each), 9th (Median)
Cluster Group Number TX1 TX2 TX3 TX4 TX5 TX6 TX7 TX8 TX9 TX10 TX11 TX12 TX13 TX14 TX15 TX16 TX17 TX18
KH01 00 5 789 108,63 63,13 45,26 36,55 30,96 27,95 24,86 23,62 22,48 21,74 21,36 21,22 21,13 21,69 23,05 22,41 25,49 33,37
KH02 01 1 580 109,62 61,89 44,29 35,62 29,93 27,54 25,55 23,07 22,50 21,87 21,06 20,76 20,11 20,74 21,69 22,18 24,58 28,22
KH03 02 3 067 106,41 58,41 41,79 33,61 29,12 26,10 24,00 22,76 21,65 21,02 20,42 19,98 19,80 20,06 21,60 23,99 26,03 30,51
KH04 03 2 345 109,01 59,80 42,32 35,48 30,08 27,94 26,19 24,98 24,36 23,48 23,11 23,05 22,70 23,48 24,13 25,23 27,30 35,39
KH05 04 1 781 99,49 57,27 43,20 33,71 27,81 25,14 22,89 20,97 19,60 18,54 18,42 17,25 16,61 16,03 16,17 16,06 16,15 33,77
KH06 05 4 534 98,18 56,94 40,91 31,68 28,12 23,79 20,39 20,55 18,84 18,01 16,98 16,84 16,24 15,86 14,47 14,19 13,86 14,33
KH07 10 688 106,73 56,18 38,95 31,22 25,67 22,87 20,23 17,99 16,71 15,86 15,05 14,38 13,45 13,28 13,06 12,91 12,36 12,62
KH08 11 378 105,92 58,47 40,16 32,00 27,76 24,72 21,80 19,91 19,14 17,94 16,89 16,39 15,16 15,46 15,16 20,49 23,96 25,23
KH09 12 398 92,99 55,61 39,55 31,49 27,96 24,22 22,00 20,07 18,76 17,97 16,54 16,18 15,32 14,83 14,25 14,10 13,41 12,78
KH10 13 472 100,21 54,17 39,50 30,19 25,21 21,90 18,98 17,34 15,53 14,38 12,46 12,09 10,96 10,36 9,96 9,18 8,84 8,49
KH11 14 627 106,06 62,07 40,46 32,05 28,76 23,36 20,69 19,26 16,38 15,72 14,34 13,70 12,94 12,17 11,58 10,96 13,96 33,63
KH12 15 1 388 113,10 60,81 42,60 33,25 28,00 24,11 21,77 19,49 18,02 17,02 15,47 14,66 13,74 12,72 12,89 12,82 13,61 14,42
KH13 20 4 594 118,30 68,56 47,95 36,82 28,83 25,00 21,04 18,01 15,55 14,03 12,58 11,58 12,66 13,69 13,84 13,93 15,36 18,86
KH14 21 3 075 9,85 9,47 8,76 8,68 7,99 7,84 7,53 7,65 6,63 6,66 6,60 7,19 7,53 8,22 9,08 10,31 13,70 20,35
KH15 22 420 7,16 6,73 6,50 6,25 6,09 5,96 5,93 5,77 5,47 5,30 5,28 5,12 7,11 10,84 12,64 14,25 17,16 25,06
KH16 23 3 872 24,73 17,66 10,91 6,49 6,23 6,50 6,33 6,52 6,39 6,59 6,64 6,88 7,26 7,88 8,42 9,71 11,71 16,92
KH17 24 959 8,99 8,39 8,05 7,76 7,39 7,02 6,98 7,00 6,80 6,75 7,02 8,16 8,71 9,14 9,73 11,26 14,23 20,64
KH18 25 2 661 13,15 11,15 10,64 9,44 8,81 8,23 7,15 5,89 6,18 6,27 6,45 6,78 7,15 7,69 9,13 10,20 12,58 19,12
KH19 30 883 5,11 5,02 4,89 4,69 4,95 4,98 4,94 5,15 5,22 5,17 5,26 5,68 6,34 6,74 7,34 8,81 11,07 16,74
KH20 31 1 037 5,23 5,39 5,38 5,04 4,97 5,29 5,21 5,24 5,38 5,29 5,65 5,97 6,39 6,73 7,61 8,87 11,67 16,93
KH21 32 66 5,78 5,77 5,58 5,45 5,28 5,29 5,15 5,07 5,02 4,94 5,04 4,89 4,88 4,83 11,89 17,01 20,42 27,37
KH22 33 1 429 5,96 5,71 5,73 5,53 5,58 5,49 5,45 5,40 5,60 5,59 5,88 6,02 6,49 6,81 7,53 8,78 10,61 15,87
KH23 34 381 4,21 4,23 4,18 4,34 4,31 4,46 4,78 4,65 4,96 5,10 5,34 5,58 6,05 6,65 7,59 8,91 10,88 16,20
KH24 35 3 392 5,12 5,07 4,96 4,94 4,89 4,86 5,08 5,10 5,19 5,39 5,61 5,82 6,23 6,80 7,64 9,03 11,30 17,32
KH25 40 1 508 111,69 64,82 44,56 34,56 29,06 23,70 20,59 18,95 16,75 15,12 14,15 12,73 11,79 11,19 10,69 9,95 9,52 13,91
KH26 41 1 340 633,59 4,35 4,21 4,36 4,37 4,47 4,49 4,37 4,57 4,38 4,55 4,58 4,55 4,62 4,59 4,70 4,84 5,28
KH27 42 1 525 14,09 11,11 8,39 6,81 6,33 5,61 5,28 4,86 4,04 3,94 4,52 4,81 5,29 5,97 7,01 9,01 12,27 18,53
KH28 43 956 17,19 10,87 8,40 7,15 3,89 3,18 3,33 3,63 3,73 4,06 4,33 4,91 5,30 6,12 7,09 8,71 11,18 17,50
KH29 44 684 90,17 50,40 35,81 28,31 24,52 21,03 19,11 17,34 16,37 16,02 15,22 14,95 14,74 14,58 15,04 15,19 15,33 16,13
KH30 45 2 217 103,76 46,45 30,93 23,13 19,06 16,44 13,96 12,92 12,38 10,34 9,90 8,74 8,59 8,08 7,63 7,41 6,93 6,45
KH31 50 3 839 143,57 89,13 81,91 130,78 139,82 4,80 4,91 4,77 4,82 4,92 5,06 5,23 5,43 5,44 5,69 5,76 6,14 6,39
KH32 51 277 118,82 83,58 73,58 80,03 140,84 169,74 5,62 5,64 5,62 5,58 5,60 5,60 5,63 5,70 5,80 5,93 6,16 6,26
KH33 52 1 649 70,22 35,66 26,38 21,03 17,27 15,53 13,53 12,21 11,36 10,67 10,13 9,45 9,11 8,80 8,47 7,99 7,89 8,40
KH34 53 799 74,74 43,50 30,94 24,35 21,45 18,44 17,50 16,40 15,64 14,78 14,16 13,91 13,76 13,43 13,12 12,88 12,70 12,05
KH35 54 392 208,99 60,21 39,48 29,41 23,76 18,82 17,38 15,70 15,94 14,46 13,78 12,83 13,39 12,46 12,71 12,76 13,08 13,70
KH36 55 4 055 322,22 139,61 65,18 46,88 46,76 47,38 60,19 103,73 669,03 5,48 5,54 6,17 6,50 7,46 8,45 9,74 12,00 15,74
KH37 60 353 41,38 25,55 19,40 15,44 13,28 11,67 10,62 9,49 8,73 8,05 7,60 7,16 6,78 6,38 6,18 5,94 5,75 5,56
KH38 61 1 000 69,11 40,25 30,82 24,84 21,22 18,77 16,95 15,39 15,04 14,40 14,20 13,93 14,13 14,28 14,93 16,07 18,24 21,65
KH39 62 904 64,60 37,69 28,67 23,61 19,82 18,09 16,55 15,76 15,56 15,50 15,91 15,33 16,25 16,93 18,25 20,33 23,48 29,45
KH40 63 1 670 73,32 43,87 32,68 27,54 24,82 22,54 21,23 20,67 20,03 19,75 20,16 20,60 21,49 24,02 26,72 31,21 39,22 54,61
Mean 39,16 38,60 29,60 23,07 18,69 16,44 15,15 13,07 12,38 11,70 11,36 11,18 11,19 11,45 12,07 12,98 14,62 19,14
This table gives the evolution rate assigned to each customer depending on his clones group and present turnover.
Under the table, we can see the average of these rates: if we obtain an aberrant evolution rate (>100%), we replace this excessively high value with the mean rate calculated on all the groups except those containing a high value. The table below shows the corrections made.
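The correction can be checked on the TX1 column of the rate table. The sketch below replaces every rate above 100% by the mean of the remaining rates. The 40 values are those published for TX1 (KH01 to KH40); the function name is illustrative.

```python
def correct_aberrant(rates, threshold=100.0):
    """Replace rates above the threshold by the mean of the non-aberrant rates."""
    normal = [r for r in rates if r <= threshold]
    mean_normal = sum(normal) / len(normal)
    corrected = [r if r <= threshold else mean_normal for r in rates]
    return corrected, mean_normal

# TX1 (in %) for KH01 .. KH40, as published in the rate table
tx1 = [108.63, 109.62, 106.41, 109.01, 99.49, 98.18, 106.73, 105.92, 92.99, 100.21,
       106.06, 113.10, 118.30, 9.85, 7.16, 24.73, 8.99, 13.15, 5.11, 5.23,
       5.78, 5.96, 4.21, 5.12, 111.69, 633.59, 14.09, 17.19, 90.17, 103.76,
       143.57, 118.82, 70.22, 74.74, 208.99, 322.22, 41.38, 69.11, 64.60, 73.32]

corrected, mean_normal = correct_aberrant(tx1)
print(round(mean_normal, 2), sum(r > 100 for r in tx1))
```

The mean of the 23 non-aberrant rates is 39.16, the value shown in the Mean row for TX1, and 17 groups are corrected.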
For instance, a customer who belongs to the clones group 00 (KH01) and has turnover below 21.5 euros will have an evolution rate equal to rate 1, i.e. 89%.
If a customer belongs to the clones group 00 (KH01) and has turnover between 21.5 and 40.5 euros, then he will have an evolution rate equal to rate 2, i.e. 54%.
The evolution rates given for the two examples are very high, but they concern very small
customers.
Each of the customers is assigned an evolution rate. The rate multiplied by turnover will allow
us to estimate potential turnover for each of the customers.
CORRECTION OF ABERRANT RATES
Cluster Group Number TX1 TX2 TX3 TX4 TX5 TX6 TX7 TX8 TX9 TX10 TX11 TX12 TX13 TX14 TX15 TX16 TX17 TX18
KH01 0 5 789 39,16 63,13 45,26 36,55 30,96 27,95 24,86 23,62 22,48 21,74 21,36 21,22 21,13 21,69 23,05 22,41 25,49 33,37
KH02 1 1 580 39,16 61,89 44,29 35,62 29,93 27,54 25,55 23,07 22,50 21,87 21,06 20,76 20,11 20,74 21,69 22,18 24,58 28,22
KH03 2 3 067 39,16 58,41 41,79 33,61 29,12 26,10 24,00 22,76 21,65 21,02 20,42 19,98 19,80 20,06 21,60 23,99 26,03 30,51
KH04 3 2 345 39,16 59,80 42,32 35,48 30,08 27,94 26,19 24,98 24,36 23,48 23,11 23,05 22,70 23,48 24,13 25,23 27,30 35,39
KH07 10 688 39,16 56,18 38,95 31,22 25,67 22,87 20,23 17,99 16,71 15,86 15,05 14,38 13,45 13,28 13,06 12,91 12,36 12,62
KH08 11 378 39,16 58,47 40,16 32,00 27,76 24,72 21,80 19,91 19,14 17,94 16,89 16,39 15,16 15,46 15,16 20,49 23,96 25,23
KH10 13 472 39,16 54,17 39,50 30,19 25,21 21,90 18,98 17,34 15,53 14,38 12,46 12,09 10,96 10,36 9,96 9,18 8,84 8,49
KH11 14 627 39,16 62,07 40,46 32,05 28,76 23,36 20,69 19,26 16,38 15,72 14,34 13,70 12,94 12,17 11,58 10,96 13,96 33,63
KH12 15 1 388 39,16 60,81 42,60 33,25 28,00 24,11 21,77 19,49 18,02 17,02 15,47 14,66 13,74 12,72 12,89 12,82 13,61 14,42
KH13 20 4 594 39,16 68,56 47,95 36,82 28,83 25,00 21,04 18,01 15,55 14,03 12,58 11,58 12,66 13,69 13,84 13,93 15,36 18,86
KH25 40 1 508 39,16 64,82 44,56 34,56 29,06 23,70 20,59 18,95 16,75 15,12 14,15 12,73 11,79 11,19 10,69 9,95 9,52 13,91
KH26 41 1 340 39,16 4,35 4,21 4,36 4,37 4,47 4,49 4,37 4,57 4,38 4,55 4,58 4,55 4,62 4,59 4,70 4,84 5,28
KH30 45 2 217 39,16 46,45 30,93 23,13 19,06 16,44 13,96 12,92 12,38 10,34 9,90 8,74 8,59 8,08 7,63 7,41 6,93 6,45
KH31 50 3 839 39,16 89,13 81,91 23,07 18,69 4,80 4,91 4,77 4,82 4,92 5,06 5,23 5,43 5,44 5,69 5,76 6,14 6,39
KH32 51 277 39,16 83,58 73,58 80,03 18,69 16,44 5,62 5,64 5,62 5,58 5,60 5,60 5,63 5,70 5,80 5,93 6,16 6,26
KH35 54 392 39,16 60,21 39,48 29,41 23,76 18,82 17,38 15,70 15,94 14,46 13,78 12,83 13,39 12,46 12,71 12,76 13,08 13,70
KH36 55 4 055 39,16 38,60 65,18 46,88 46,76 47,38 60,19 13,07 12,38 5,48 5,54 6,17 6,50 7,46 8,45 9,74 12,00 15,74
The above table shows the replacement by the average of the aberrant rates (>100%).
4- Main results
The average incremental rate of the loyalty program customers is 12.79%.
This retailer can earn 12.79% of extra turnover on these customers.
Customer assigned to turnover
[Figure: stacked bar chart of the share of Bronze, Silver and Gold potentials in Number, Turnover and Turnover potential; vertical axis 0% to 100%. Data labels:]
          Number    Turnover    Turnover potential
Bronze    41,9%     65,9%       38,4%
Silver    14,2%     15,0%       20,0%
Gold      44,0%     19,0%       41,5%
41.9% of the retailer's customers are Bronze potentials, generating 65.9% of annual turnover and accounting for 38.4% of the potential turnover.
At the other end, 44% of customers are "Gold" potentials, generating only 19% of the turnover but accounting for 41.5% of potential turnover.
Regrouping in SML segments:
[Figure: stacked bar chart of the distribution of the SML segments (S, M, L, New) within each potential category (Bronze, Silver, Gold); vertical axis 0% to 100%.]
S customers account for a high proportion of "Gold" potentials, based on annual turnover.
There are M customers among the "Bronze" potentials, and L customers among the "Bronze" and "Silver" potentials. Most of them have an interesting margin of growth.
In annual turnover (in €):
[Figure: stacked bar chart of annual turnover by SML segment (New, S, M, L) within each potential category (Bronze, Silver, Gold); vertical axis 0% to 100%.]
In potential turnover (in k€):
[Figure: stacked bar chart of potential turnover by SML segment (New, S, M, L) within each potential category (Bronze, Silver, Gold); vertical axis 0% to 100%.]
L customers have the largest potential in absolute value, although they do not have the highest evolution rates. They make up for this with much higher turnover than the S or M segments.
S customers are over-represented among "Gold" potentials, with 29.9% of the potential turnover of the cluster.
Distribution by the retailer RFM and by potential categories
In numbers:
[Figure: stacked bar chart of the number of customers by RFM segment (1 Without status, Inactive 3 months, New, M--F--, M-F-, M-F+, M+F-, M+F+) and by potential category; vertical axis 0% to 100%.]
In annual turnover (in €):
[Figure: stacked bar chart of annual turnover by RFM segment (1 Without status, Inactive 3 months, New, M--F--, M-F-, M-F+, M+F-, M+F+) and by potential category; vertical axis 0% to 100%.]
Logically, in absolute value, the largest potentials should be found in the RFM+ segments.
Categories of potential:
Potential rates per clone cluster are grouped into four categories:
- P0: No potential turnover
- P1: Potential > 20 %
- P2: Potential between 15 and 20 %
- P3: Potential below 15 %
[Figure: stacked bar chart of the shares of categories P0, P1, P2 and P3 in Number, Turnover and Potential turnover; vertical axis 0% to 100%.]
40% of the customers (category P3) create 63% of the turnover and 47.8% of potential turnover. On average, they achieve turnover of 2,202 euros for an average potential of 162 euros. These customers, who already contribute substantially, are the most likely (for the least perceived effort) to reach their potential.
Grouping in SML segments
[Figure: stacked bar chart of the distribution of the SML segments (New, S, M, L) within each potential category (P0, P1, P2, P3); vertical axis 0% to 100%.]
S customers represent a high proportion of "P1" and "P2", based on annual turnover generated by P1 potentials. M customers are found mainly among the P3 potentials, while L customers are found under "P0", but also "P3".
In yearly turnover (in €):
[Figure: stacked bar chart of yearly turnover by SML segment (New, S, M, L) within each potential category (P0, P1, P2, P3); vertical axis 0% to 100%.]
Average amount:
         P0       P1       P2       P3       TOTAL
New      4 187    120      430      456      165
S        1 004    222      422      701      384
M        2 131    1 673    1 756    1 732    1 746
L        6 439    3 911    6 493    3 963    4 401
TOTAL    3 853    448      961      2 202    1 393
In potential turnover (in €):
[Figure: stacked bar chart of potential turnover by SML segment (New, S, M, L) within each potential category (P1, P2, P3); vertical axis 0% to 100%.]
S customers are over-represented in the "P1" category, with 38.9% of the potential turnover for this segment. In absolute terms, it is really L customers who have the highest potential. It is with good customers that we can increase turnover, as they have a better chance of succeeding than any other segment. The marketing budget can therefore be allocated on the basis of average turnover and intensity of offers by potential. The two concepts are complementary in the definition of the mechanics of loyalty/retention.
5-Results summary
The Kohonen network allows us to group customers into 40 clone clusters. The 4 by 10 matrix had no empty group, so we retained it.
Customers within the same group resemble each other in their socio-demographic and consumption characteristics.
Using the deciles method, we assigned a turnover evolution rate to each customer in the
sample.
We created the following potential turnover score:
- Gold : evolution rate higher than 20%
- Silver : evolution rate between 15% and 20%
- Bronze : evolution rate below 15%
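The score can be written as a small function. The text does not say which category receives a rate of exactly 15% or 20%; the sketch below assumes that both fall in Silver. The function name is illustrative.

```python
def potential_score(rate_percent):
    """Map an evolution rate (in %) to the potential turnover score."""
    if rate_percent > 20:
        return "Gold"
    if rate_percent >= 15:  # assumption: boundary values count as Silver
        return "Silver"
    return "Bronze"

for rate in (33.37, 17.5, 8.49):
    print(rate, potential_score(rate))
```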
We calculated potential turnover from this rate and turnover.
Our sample is composed of 5,373,026 customers generating annual turnover of 7.46 billion euros and representing potential turnover of 953.2 million euros.
The retailer can therefore earn almost 12.77% more turnover from his customers.
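As a rough cross-check, the overall rate can be recomputed from the rounded totals quoted above. Because the totals are rounded, the result only approximates the published rate. The variable names are illustrative.

```python
customers = 5_373_026
annual_turnover = 7.46e9      # euros, rounded figure from the text
potential_turnover = 953.2e6  # euros, rounded figure from the text

incremental_rate = potential_turnover / annual_turnover
average_turnover = annual_turnover / customers
print(f"incremental rate = {incremental_rate:.2%}")
print(f"average turnover per customer = {average_turnover:.0f} euros")
```

The rounded totals give about 12.8%, in line with the quoted rate, and an average turnover per customer close to the mean of about 1 390 euros reported in the decile table.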
In terms of rate, the sample is composed of 41.9% Bronze customers, 14.1% Silver customers and 44% Gold customers. In reality, it must be assumed that the best customers are those with the highest absolute values.
76.6% of Gold customers are S and generate 29.9% of potential turnover.
65% of Gold customers are 3 months Inactive, RFM-- or RFM-, and they generate 40% of annual turnover and 38.4% of potential turnover.
Logically, customers with the highest evolution rates are found among customers with low turnover.
At the opposite end of the scale, customers with the highest turnover have the strongest potential turnover in absolute value.
Average amount:
         P1       P2       P3     TOTAL
New      49       77       34     52
S        60       71       75     64
M        431      302      120    182
L        1 055    1 104    279    336
TOTAL    120      163      162    137
The validation procedures for the models
Internal validity
We carried out several tests on our model:
- Division of our population into sub-populations to check the coherence of the allocation to the clone classes
- Benchmark of several classification techniques
- Re-allocation of the classes by supervised models (C5, Bayesian network)
- Connectivity of super classes
The internal validation methods will of course need to be completed.
External validity
The customers' wallet share is consistent with the 24% measured by TNS. Given overall consumption, the achievable potential would raise the wallet share to 28%. An extra 2% of wallet share is much more realistic.
We would like to de-duplicate (note 13) our base against the Nielsen Home Scan Panel to check whether sales really do increase, but this is not yet possible in this context.
Conclusions
Discussion of the results of the research study
The results of our research study will be placed in the context of corporate customer potential determination: determining customer potential represents a major part of a company's direct and promotional marketing investment. Most large loyalty programs are based on this notion.
We will look at how our approach, compared with other methods, enables us to establish converging results that answer our research questions:
- The clustering technique (SOM) is used to identify customers who are similar and to define a realistic potential.
- We can estimate the stability of the clusters in several ways, which show an internal stability.
- We have developed a pragmatic approach to determining potential: the clones method.
The limits and the contribution of our research study
We used specific clustering techniques for the purpose of validating our method. We have shown the eventual statistical limits of our approach in terms of complexity or reliability of the models used.
For feasibility reasons we worked only with a single business area, the large grocery retail sector in France, and used only accessory data from other business sectors.
We do not have access at the moment to data from foreign retailers, for example.
The calculation of potential turnover per group is very empirical and should be more scientifically justified.
Further research
There are several ways to improve upon our research:
- Refine our choice of variables
- Determine a more empirical method than the deciles/median method for estimating the potential per group
- Make more rotations of the model in some other industrial sectors; we have done this and it works quite well, but it is important that others test it
- Validate the result over time, by observing the reality of potential values on sales
Note 13: We merge the two databases to find the duplicates.
We hope that this method will find an important use, given its strategic impact on company results and the fact that the calculation is based on internal customer data already at hand.
Bibliography
1. Aguilera, P. A., Frenich, A. G., Torres, J. A., Castro, H., Vidal, J. L. M., and Canton, M. (2001). Application of the Kohonen neural network in coastal water management: Methodological development for the assessment and prediction of water quality. Water Research, 35(17):4053-4062.
2. Anderson, B. (1999). Kohonen neural networks and language. Brain and Language, 70(1):86-94.
3. B Meunier, E Dumas, I Piec, D Bechet, M Hebraud (2007). Assessment of hierarchical clustering methodologies for proteomic data mining. J Proteome Res. (aseanbiotechnology.info)
4. Baran, Stanley J. Theories of Mass Communication.
5. Benavent and Crie. http://christophe.benavent.free.fr/publications/ltv1.pdf
6. Beran, R. (1986). Discussion of Wu, C.F.J.: Jackknife, bootstrap, and other resampling methods in regression analysis (with discussion). Ann. Statist., 14:1295-1298.
7. Berend Wierenga and Gerrit Harm van Bruggen (2000), Marketing Management Support Systems: Principles, Tools, and Implementation, Springer.
8. Berger, Paul D. and Nada I. Nasr (1998), "Customer lifetime value: Marketing models and applications," Journal of Interactive Marketing, 12 (1), p. 17-30.
9. Bertrand Clarke and Dongchu Sun, Reference priors under the Chi-Squared distance. The Indian Journal of Statistics, 1997, Volume 59, Series A, Pt. 2, 215-231.
10. Boos, D.D. (2003). Introduction to the bootstrap world. Statist. Science, 18:168-174.
11. Borko, H. and Bernick, M., 'Automatic document classification', Journal of the ACM, 10, 151-162 (1963).
12. Bremer and Joyce (1988), Human Judgment: The SJT View, North-Holland.
13. Bruce Cooil, Timothy L Keiningham, Lerzan Aksoy, Michael Hsu (2007). A Longitudinal Analysis of Customer Satisfaction and Wallet share: Investigating the Moderating Effect of Customer Characteristics. Journal of Marketing 71:1, 67-83.
14. Charles Romesburg, Cluster Analysis for Researchers (2004), Lulu Press, p. 135.
15. Ching-Hsue Cheng and You-Shyang Chen, Classifying the segmentation of customer value via RFM model and RS theory, Expert Systems with Applications, In Press, Corrected Proof, Available online 16 April 2008.
Collectif, Recherche sur la Distribution moderne, p. 64, éd. l'Univers du Livre.
16. Ciampi, A. and Lechevallier, Y. (2000). Clustering large, multi-level data sets: an approach based on Kohonen self-organizing maps. In Principles of Data Mining and Knowledge Discovery. 4th European Conference, PKDD 2000. Proceedings (Lecture Notes in Artificial Intelligence Vol. 1910). Springer-Verlag, Berlin, Germany, pages 353-8.
17. Ciampi, A. and Lechevallier, Y. (2000). Clustering large, multi-level data sets: an approach based on Kohonen self-organizing maps. In Principles of Data Mining and Knowledge Discovery. 4th European Conference, PKDD 2000. Proceedings (Lecture Notes in Artificial Intelligence Vol. 1910). Springer-Verlag, Berlin, Germany, pages 353-8.
http://christophe.benavent.free.fr/publications/ltv1.pdfhttp://christophe.benavent.free.fr/publications/ltv1.pdf8/3/2019 Thierry Vallaud Thesis
38/47
TVallaud 38
18.Dahbur, K. and Muscarello, T. (2001). Hybrid Kohonen neural network in datamining. In Proceedings of the IASTED International Conference. Artificial
Intelligence and Applications. ACTA Press, Anaheim, CA, USA, pages 303.
19.David Huff, 18-Jun 2003 - University of Texas Austin, "A Retrospective View of theHuff Model and its Application to Spatial Interaction Analysis" University of
Redlands/ESRI Colloquium Series20.Dorofeyuk, A.A., 'Automatic Classification Algorithms (Review)', Automation and
Remote Control, 32, 1928-1958 (1971).
21.Dwyer, R.F. (1997), "Customer lifetime valuation to support marketing decisionmaking", Journal of Direct Marketing, Vol. 11 No.4, p.6-13.
22.Efron B. (1981) Non parametric estimates of standard error: the jackknife, thebootstrap and other methods. Biometrika 68. pp 589--599.
23.Eric Chen-Kuo Tsao, James C. Bezdek and Nikhil R. Pal "Fuzzy Kohonen clusteringnetworks 1994 Published by Elsevier Science B.V.
24.F. V. Jensen Introduction to Bayesian Networks, 1st edition 1996 Springer-VerlagNew York, Inc.
25.Fang, K.; He, S. The problem of selecting a given number of representative points in anormal population and a generalized mills ratio. Technical report, Department of
Statistics; Stanford University: 1982. MacQueen J. Some methods for classification
and analysis of multivariate observations. Proceedings 5th Berkeley Symposium on
Mathematics, Statistics and Probability. 1967;3:281297.
26.Frank Plastria Static competitive facility location: An overview of optimisationapproaches European Journal of Operational Research, Volume 129, Issue 3, 16
March 2001, Pages 461-470.
27.Gehrlein W. V. General mathematical programming formulations for the statisticalclassification problem Operations research letters ISSN 0167-
6377 CODEN ORLED5
28.Harris, M.J. and N. Blisard. 1995. Characteristics of the Nielsen Homescan Data.Working paper. Washington, DC: U.S. Department of Agriculture, Economic
Research Service.
29.Hartigan JA, Wong MA. A k-means clustering algorithm. Applied Statistics.1979;28:100108.
30.http://en.wikipedia.org/wiki/Lifetime_value31.J. R. Quinlan. Improved use of continuous attributes in c4.5. Journal of Artificial
Intelligence Research, 4:77-90, 1996.
32.Jajuga K.Classification, Clustering and Data Analysis : Recent Advances andApplications2002 lavoisier
33.John A. McCartya,
and Manoj Hastak Segmentation approaches in data-mining: Acomparison of RFM, CHAID, and logistic regression Journal of Business Research,
Volume 60, Issue 6, June 2007, Pages 656-662
34.Juha Vesanto 1997 The SOM in data mining: analysis of world pulp and papertechnology
35.Julien Barnier Tout ce que vous navez jamais voulu savoir sur le Chi2 san s jamaisavoir eu envie de le demander Groupe de Recherche sur la Socialisation CNRS
UMR 5040 15 avril 2008
36.Kaski, S., "Data exploration using self-organizing maps. Acta PolytechnicaScandinavica, Mathematics, Computing and Management in Engineering Series No.
82, Espoo 1997.
37.Kohonen, T., Self-Organization and Associative Memory , New York : Springer-Verlag, 1988
http://en.wikipedia.org/wiki/Lifetime_valuehttp://en.wikipedia.org/wiki/Lifetime_valuehttp://www.sciencedirect.com/science?_ob=ArticleURL&_udi=B6V7S-4MV1P09-3&_user=10&_coverDate=06%2F30%2F2007&_rdoc=1&_fmt=full&_orig=search&_cdi=5850&_sort=d&_docanchor=&view=c&_acct=C000050221&_version=1&_urlVersion=0&_userid=10&md5=3a8f86bf0680b39935ae32f446a1364d#aff1http://www.sciencedirect.com/science?_ob=ArticleURL&_udi=B6V7S-4MV1P09-3&_user=10&_coverDate=06%2F30%2F2007&_rdoc=1&_fmt=full&_orig=search&_cdi=5850&_sort=d&_docanchor=&view=c&_acct=C000050221&_version=1&_urlVersion=0&_userid=10&md5=3a8f86bf0680b39935ae32f446a1364d#aff1http://www.sciencedirect.com/science?_ob=ArticleURL&_udi=B6V7S-4MV1P09-3&_user=10&_coverDate=06%2F30%2F2007&_rdoc=1&_fmt=full&_orig=search&_cdi=5850&_sort=d&_docanchor=&view=c&_acct=C000050221&_version=1&_urlVersion=0&_userid=10&md5=3a8f86bf0680b39935ae32f446a1364d#aff1http://en.wikipedia.org/wiki/Lifetime_value8/3/2019 Thierry Vallaud Thesis
39/47
TVallaud 39
38.Lerman, I.C., Les Bases de la Classification Automatique, Gauthier-Villars, Paris(1970).
39.M Roux -, 1985 Algorithmes de classification Editions Masson, Paris40.Mattias Otto ChemometricsStatistics and Computer Application in Analytical
Chemistry Publi 2007 Wiley-VCH
41.Nielsen, Inc. May 2006. Understanding the Homescan Advantage. Presentation byLiz Crews and Ed Groves, Nielsen at RTI International, Research Triangle Park, NC.42.O. Pourret, P. Naim and B. Marcot (2008). Bayesian Networks: A Practical Guide to
Applications. Chichester, UK: Wiley. ISBN 978-0-470-06030-8.
43.Olivier Brusset Segmentation Cibler, scorer, analyser, une seule limite, lesrendements Marketing Direct N92 - 01/04/2005 p.2
44.Pena M. Vanegas A. Valencia Digital Hardware Architectures of Kohonen's SelfOrganizing Feature Maps with Exponential Neighboring Function 2006 IEEE
International Conference on Reconfigurable Computing and FPGA's J. (ReConFig
2006) pp. 1-8
45.Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers,1993.
46.Quinlan, R. (2004). Data mining tools see5 and c5.0.47.Rajanee Ranjan Encyclopaedia of Marketing Research Publi 2002, Anmol
Publications PVT. LTD., p.585.
48.Reilly, W.J. (1931) The law of retail gravitation, New York.49.S. Kaski, J. Nikkila, and T. Kohonen Methods for Exploratory Cluster Analysis
Intelligent Exploration of the Web De Piotr S. SzczepaniakPubli 2003 Springer
50.Size and Share of Customer Wallet. Rex Yuxing Du, Wagner A. Kamakura, Carl F.Mela. Journal of Marketing | Volume: 71 | Issue: 2 | Pps: 94-113
51.Tan, Peter J.,Dowe David L., Dix Tevor I, Building classification model in two steps1997
52.Teuvo Kohonen. Self-Organization and Associative Memory. Springer-Verlag,Berlin, 3rd edition, 1989.
53.Teuvo Kohonen. Self-Organizing Maps, 3rd edition. Springer, 2054.The Useful Words from a Decisional Corpus. Contribution of Correspondence
Analysis Springer Berlin / Heidelberg Volume 185/2005. p.159-179
55.Timothy L. Keiningham, Bruce Cooil, Lerzan Aksoy, Tor W. Andreassen, Jay Weiner.(2007) The value of different customer satisfaction and loyalty metrics in predicting
customer retention, recommendation, and share-of-wallet. Managing Service Quality
17:4, 361-384
56.Todd A. Stephenson An Introduction to Bayesian Network Theory and Usage
IDIAP-RR 00-03, 200057.Vallaud Thierry (2003), La fidlisation rentable : la proposition du modle composite,www. numlog.com
58.Venkatesan, Rajkumar and V. Kumar (2004), "A Customer Lifetime ValueFramework for Customer Selection and Resource Allocation Strategy," Journal of
Marketing, 68 (October), p.106-125.
8/3/2019 Thierry Vallaud Thesis
40/47
TVallaud 40
4.
Appendix
Appendix 1 : Translation of the filenames
Appendix 2 : Detail of the first data audit
The data set was audited in two stages: a first stage to identify all the data in the
original database that are useful for the analysis, and a second stage to determine the
data available for calculating potential. Only the second stage is shown in this appendix.
Analysis of the Potentiel_Ratio and Potentiel_Socio tables
Potentiel_Ratio contains 5 373 048 observations (Customer accounts)
It is composed of 26 fields
Potentiel_Socio contains 5 373 056 observations (Customer accounts)
It is composed of 18 fields
This audit is based on the combination of the two tables, i.e. 5 373 048 observations
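The combined row count equals that of the smaller table, which is consistent with keeping only the accounts found in both tables. The text does not state how the tables were combined, so the following is only a minimal sketch under that assumption; the account identifiers, field names, and values are hypothetical.

```python
# Hypothetical miniature versions of the two tables, keyed by customer account.
potentiel_ratio = {
    "C001": {"rate_grocery": 16.8},
    "C002": {"rate_grocery": 12.1},
}
potentiel_socio = {
    "C001": {"age_band": "30 to 39 years"},
    "C002": {"age_band": "Empty"},
    "C003": {"age_band": "40 to 49 years"},  # no matching account in potentiel_ratio
}


def inner_join(left, right):
    """Keep only the accounts present in both tables and merge their fields."""
    shared_accounts = left.keys() & right.keys()
    return {account: {**left[account], **right[account]} for account in shared_accounts}


combined = inner_join(potentiel_ratio, potentiel_socio)
print(len(combined))  # number of accounts present in both tables
```

Under this assumption, accounts that appear only in Potentiel_Socio drop out of the audit.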
Data format
This is the original data format. We may have to change some formats to better achieve our
model objectives.
The RFM 3 months variable is empty and is therefore discarded.
Variable by variable analysis
Dichotomous variables
RFM 3 months Number % First audit comparison
New 247 326 4.60% 3.74%
Ex-customers 400 236 7.45%
Inactive 465 107 8.66% 19.15%
M--F-- 1 315 873 24.49% 21.00%
M-F- 1 302 089 24.23% 26.00%
M-F+ 248 619 4.63% 6.06%
M+F- 710 311 13.22% 11.65%
M+F+ 683 487 12.72% 12.40%
Total 5 373 048 100.00% 100.00%
Family status Number % First audit comparison
Couple 1 557 871 28.99% 26.19%
Single 642 374 11.96% 10.81%
Empty 3 172 803 59.05% 62.99%
Total 5 373 048 100.00% 100.00%
SML on 12 months Number %
NA 7 816 0.15%
NV 245 278 4.56%
I 2 573 0.05%
S 3 086 955 57.45%
M 1 014 777 18.89%
L 1 015 649 18.90%
Total 5 373 048 100.00%
Home type Number % First audit comparison
Flat 880 722 16.39% 18.76%
House and flat 1 300 0.02% 0.00%
House 1 576 829 29.35% 34.98%
Empty 2 914 197 54.24% 65.02%
Total 5 373 048 100.00% 100.00%
Number of children in the household Number % First audit comparison
0 4 088 282 76.09% 75.72%
1 510 448 9.50% 9.57%
2 508 967 9.47% 9.54%
3 196 520 3.66% 3.79%
4 48 644 0.91% 1.01%
5 11 482 0.21% 0.22%
> 5 8 705 0.16% 0.14%
Total 5 373 048 100.00% 100.00%
Social categories Number % First audit comparison
Farmer 49 892 0.93% 0.96%
Artisan 86 807 1.62% 1.79%
Other 84 696 1.58% 1.39%
Manager 188 017 3.50% 3.50%
Employee 737 469 13.73% 14.42%
Student 85 928 1.60% 1.28%
Housewife 211 003 3.93% 4.38%
Civil servant 233 386 4.34% 3.70%
Independent worker 42 913 0.80% 0.72%
Worker 138 099 2.57% 2.73%
Retired 664 304 12.36% 14.05%
Unemployed 147 700 2.75% 3.28%
Technician 91 956 1.71% 2.08%
Empty 2 610 877 48.59% 45.71%
24 1 0.00% 0.00%
Total 5 373 048 100.00% 100.00%
The value 24 is an error; we eliminate it.
Age Number % First audit comparison
0 to 18 years 8 604 0.16% 0.21%
19 to 29 years 317 371 5.91% 6.00%
30 to 39 years 537 010 9.99% 10.76%
40 to 49 years 652 497 12.14% 12.59%
50 to 59 years 649 038 12.08% 12.30%
60 to 69 years 458 669 8.54% 8.07%
70 years and more 592 227 11.02% 10.92%
Empty 2 157 632 40.16% 39.15%
Total 5 373 048 100.00% 100.00%
Customer history Number % First audit comparison
0 to 2 months 194 269 3.62% 2.76%
3 to 5 months 238 267 4.43% 2.54%
6 to 8 months 182 733 3.40% 3.20%
9 to 11 months 221 313 4.12% 3.26%
12 to 17 months 354 680 6.60% 6.39%
18 to 23 months 231 513 4.31% 6.17%
24 to 35 months 517 972 9.64% 9.58%
36 to 47 months 396 706 7.38% 7.96%
48 to 59 months 403 389 7.51% 10.01%
60 months and more 2 631 717 48.98% 44.60%
Empty 489 0.01% 3.53%
Total 5 373 048 100.00% 100.00%
Time since last purchase Number %
0 to 2 months 4 213 358 78.42%
3 to 5 months 562 908 10.48%
6 to 8 months 312 721 5.82%
9 to 11 months 241 093 4.49%
12 to 17 months 42 968 0.80%
Total 5 373 048 100.00%
Numerical variables
Variable | Amount | Mean | Min | Max | SD
Rate Other | 5 373 055 | 77.1 | -34 035.0 | 294.9 | 22.2
Rate Bazar | 5 373 055 | 7.0 | -27 666.7 | 5 328.9 | 16.3
Rate BOF / APF | 5 373 055 | 9.0 | -195.2 | 9 685.6 | 8.4
Rate Pork butcher LS | 5 373 055 | 6.5 | -1 580.0 | 3 186.7 | 6.7
Rate Pet | 5 373 055 | 1.2 | -15.3 | 1 206.5 | 3.7
Rate Beauty/Make up | 5 373 055 | 0.7 | -61.2 | 574.5 | 2.4
Rate Baby | 5 373 055 | 1.4 | -23.5 | 3 614.4 | 5.8
Rate Butcher | 5 373 055 | 6.5 | -125.2 | 4 193.0 | 8.6
Variable | Amount | Mean | Min | Max | SD
Rate Baker | 5 373 055 | 2.5 | -138.5 | 1 247.6 | 4.9
Rate Pork butcher | 5 373 055 | 2.6 | -74.5 | 1 373.1 | 4.7
Rate Dietetic food | 5 373 055 | 0.4 | -11.3 | 2 428.6 | 2.3
Rate cheese | 5 373 055 | 1.5 | -29.4 | 262.7 | 2.9
Rate fruits and vegetables | 5 373 055 | 8.4 | -105.5 | 17 364.3 | 12.0
Rate fisher | 5 373 055 | 1.9 | -200.0 | 1 542.9 | 4.5
Rate frozen food | 5 373 055 | 3.1 | -308.3 | 11 528.6 | 7.6
Rate wine | 5 373 055 | 2.1 | -530.3 | 610.2 | 5.4
Variable | Amount | Mean | Min | Max | SD
Ratio cleaning products | 5 373 055 | 8.9 | -1 472.8 | 14 007.1 | 12.4
Ratio grocery | 5 373 055 | 16.8 | -1 463.6 | 16 122.2 | 14.9
Ratio liquid | 5 373 055 | 10.8 | -2 843.2 | 6 680.4 | 13.8
Ratio textile | 5 373 055 | 2.0 | -1 250.0 | 2 713.2 | 5.4
Ratio ultra fresh food | 5 373 055 | 4.7 | -372.3 | 5 814.3 | 6.4
Rate of poultry | 5 373 055 | 1.7 | -37.5 | 2 822.8 | 3.8
Rate first price | 5 373 055 | 6.2 | -393.8 | 26 242.0 | 15.0
Rate retailer brand 1 | 5 373 055 | 14.6 | -194.9 | 7 501.1 | 11.5
Rate retailer brand 2 | 5 373 055 | 2.0 | -30.9 | 2 031.8 | 3.4
Monetary fields are stored without a decimal separator, so we divided turnover by 100.
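As an illustration of this rescaling, the sketch below converts raw integer amounts to currency units. The function name and the sample values are hypothetical; only the division by 100 comes from the text.

```python
def to_currency_units(raw_amount: int) -> float:
    """Convert an amount stored without a decimal separator (hundredths) to currency units."""
    return raw_amount / 100


# Hypothetical raw values as they might appear in a monetary field.
raw_turnover = [123456, 0, -1550]
turnover = [to_currency_units(value) for value in raw_turnover]
print(turnover)
```

Negative raw values are rescaled in the same way as positive ones.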
SML 12 months | Amount | % of customers | Filtered turnover 12 months | % Filtered turnover 12 months | Filtered turnover 12 months: Mean per customer
NA | 7 816 | 0.15% | 0 | 0.00% | 0.0
New | 245 278 | 4.56% | 40 365 377 | 0.54% | 164.6
I | 2 573 | 0.05% | 0 | 0.00% | 0.0
S | 3 086 955 | 57.45% | 1 185 154 005 | 15.87% | 383.9
M | 1 014 777 | 18.89% | 1 772 044 857 | 23.72% | 1 746.2
L | 1 015 649 | 18.90% | 4 471 932 029 | 59.87% | 4 403.0
Total | 5 373 048 | 100.00% | 7 469 496 268 | 100.00% | 1 390.2
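The derived columns of the table above can be checked with simple arithmetic: the turnover share is a segment's turnover divided by the total, and the mean per customer is the segment's turnover divided by its number of customers. The sketch below reuses the counts and turnover figures from the table; the variable and function names are our own.

```python
# (customers, filtered turnover 12 months) per SML segment, copied from the table.
segments = {
    "NA": (7_816, 0),
    "New": (245_278, 40_365_377),
    "I": (2_573, 0),
    "S": (3_086_955, 1_185_154_005),
    "M": (1_014_777, 1_772_044_857),
    "L": (1_015_649, 4_471_932_029),
}

total_customers = sum(customers for customers, _ in segments.values())
total_turnover = sum(turnover for _, turnover in segments.values())


def summarize(customers: int, turnover: int) -> tuple:
    """Return (% of total turnover, mean turnover per customer) for one segment."""
    share = round(100 * turnover / total_turnover, 2)
    mean_per_customer = round(turnover / customers, 1)
    return share, mean_per_customer


for name, (customers, turnover) in segments.items():
    print(name, *summarize(customers, turnover))
```

For segment S, for example, this gives a share of 15.87% and a mean of 383.9 per customer, matching the table.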
SML 12 months | Total turnover | % Total turnover | Total turnover: Mean per customer | Cumulated filtered turnover | % of cumulated filtered turnover | Cumulated filtered turnover: Mean per customer
NA | 29 189 799 | 0.06% | 3 734.6 | 16 060 558 | 0.05% | 2 054.8
New | 118 781 921 | 0.25% | 484.3 | 91 122 432 | 0.31% | 371.5
I | 19 120 403 | 0.04% | 7 431.2 | 9 701 975 | 0.03% | 3 770.7
S | 11 506 545 207 | 23.89% | 3 727.5 | 6 400 520 698 | 21.51% | 2 073.4
M | 10 946 321 611 | 22.72% | 10 786.9 | 6 841 635 615 | 22.99% | 6 742.0
L | 25 549 213 252 | 53.04% | 25 155.6 | 16 398 620 825 | 55.11% | 16 146.0
Total | 48 169 172 193 | 100.00% | 8 965.0 | 29 757 662 104 | 100.00% | 5 538.3
SML 12
months
Turnover annual on
promo
% Turnover
annual on
promo
Turnover annual on
promo: Mean per
customer
Total nb taken
reduction vouchers
(BA)
% Total nbtaken
reducti