DHL Data Mining Project - Singapore Management …
DHL Data Mining Project: Customer Segmentation with Clustering
Timothy TAN Chee Yong | Aditya Hridaya MISRA | Jeffery JI Jun Yao
3/30/2010
DHL Data Mining Project

Table of Contents

- Introduction to DHL and the data
- Problem
- Clustering Methods & Motivation
- Initial Manual Logical Segmentation (A & B) using Basic Statistics
- KDD Process, Methodology & Software used
- Data Preparation – Data Preprocessing, Integration and Selection
- Selection and Separating Customer Transaction Data into Regular and Irregular Customers for Mining
- Regular Customers: Data Mining using Clustering
- Regular Customers: Cluster Analysis with Interactive Visualization
  - Using Parallel Plot to decide Key Cluster Groups Focus for Regular Customers
  - Summary Statistics of Cluster Groups
  - Key Cluster Analysis for REGULAR Customers using Histogram to present distribution of different Cluster Groups
  - Validating Hierarchical Clustering (Key Clusters) with another Clustering technique, K-Means
- Further Analysis to aid interpretation with Tree Map & industry attributes
  - Tree Map Analysis using cluster groupings
- Findings & Recommendations for Decision Making for the Regular Customers Segment
- Irregular Customers: Cluster Analysis with Interactive Visualization
- Final Conclusion

Introduction to DHL and the data

Product: Real-world DHL 2008 Malaysia data set

Objective: To perform customer segmentation using data mining and to analyse useful patterns in DHL transaction data, with the aim of coming up with recommendations that can help in marketing, pricing and future Business Intelligence decisions.

About DHL

DHL is the global market leader of the international express and logistics industry, specializing in providing innovative and customized solutions from a single source. DHL offers expertise in express, air and ocean freight, overland transport, contract logistics solutions as well as international mail services, combined with worldwide coverage and an in-depth understanding of local markets. DHL's international network links more than 220 countries and territories worldwide. Some 300,000 employees are dedicated to providing fast and reliable services that exceed customers' expectations.

Problem

DHL lacks information about its customer segmentation and customer profiles to help it make marketing, pricing and other business decisions. While the company has strong domain knowledge and rich past transaction data, it lacks the expertise to mine out interesting patterns.

Clustering Methods & Motivation

The customer transaction data of a logistics company was analyzed. Data mining, clustering in particular, and visualization techniques were used to find meaningful relationships and patterns within the existing data. These techniques enable the company to target its customers more efficiently and improve its marketing processes. They also equip the organization with much-needed market intelligence, giving it better insight into and visibility of its customers. SAS JMP was used to interactively visualize the data and cluster the customers based on their properties and characteristics. JMP also enabled us to evaluate transactional patterns and to present and interpret the data through intuitive graphs. Parallel plots were used to study patterns across clusters, and bubble plots enabled us to identify temporal patterns within customer accounts. For accurate decision making, we used the knowledge discovery process (KDD), transforming the raw data into useful knowledge. The raw data was preprocessed using SAS tools and explored using SAS JMP. The statistical summarizing features of JMP, in synergy with the visualization techniques, increase the potential for business decision making. The customers were broadly segmented into regular and irregular customers. Cluster and temporal analysis enabled decision making for all the customer segments.

Initial Manual Logical Segmentation (A & B) using Basic Statistics

For completeness, we used simple tools like Excel to explore the data and logically segment it, without using any data exploration or mining software.

Logical Segment A

Alternate Logical Segment B

For simplicity, we have broken the data into 3 logical buckets to best represent the highly skewed and uneven data.

Rough Sketch to show the distribution of the data.

KDD Process, Methodology & Software used

Slides taken from Data Mining class slides.

For Cleaning/Integration, we used SAS Integration Studio to clean and import the structured data into the SAS Suite.

For Selection & Transformation, SAS Enterprise Guide was used to select and transform individual transaction records into derived, aggregated customer monthly and total data, for attributes like total_revenue and total_shipments. The transaction data was also segmented into regular and irregular data in preparation for mining.

For Mining, SAS Enterprise Miner was used initially; however, due to limited hardware resources, the large dataset, and Miner's less friendly user interface, we decided to switch to SAS JMP, a lightweight data mining tool in the SAS suite, for our mining tasks.

Benefits of SAS JMP over SAS Enterprise Miner:

- lightweight, and thus faster
- supports dendrograms for clustering
- supports interactive visual analytics

For Pattern Evaluation & Data Presentation, SAS JMP was used to interactively generate intuitive graphs to analyse and interpret the data.

For Knowledge Representation to support DHL decision making, we compiled our findings in a Microsoft Word document.

Data Preparation – Data Preprocessing, Integration and Selection

Selection of Tables

As the objective of the data mining project is customer segmentation, only customer transaction information and other tables that may be useful are selected. The following ETL process is then carried out, as shown in the diagrams below:

- Phase 1 (Selection): useful attributes are selected and combined into a single table via inner joins
- Phase 2 (Aggregation): all transactions that belong to the same customer are aggregated into a single bill_account record
- Phase 2 ("Bucketing"): the aggregated temporal numerical values are split into 12 separate monthly attributes, with a 13th attribute holding the total

The final table, ready for data mining, has a total of 70 columns and 20,386 unique customer accounts/records.
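The aggregation and "bucketing" steps were done in SAS Enterprise Guide; purely as an illustrative sketch, the same reshaping can be expressed in pandas (column names and figures here are hypothetical, not DHL's actual schema):

```python
import pandas as pd

# Hypothetical raw transaction rows: one row per shipment.
tx = pd.DataFrame({
    "bill_account": [20001, 20001, 20001, 20002],
    "month":        [1, 1, 2, 3],
    "revenue":      [120.0, 80.0, 60.0, 300.0],
})

# Aggregation: sum all transactions per customer per month.
monthly = tx.groupby(["bill_account", "month"])["revenue"].sum()

# "Bucketing": pivot into 12 monthly columns (missing months become 0),
# then append a 13th column holding the yearly total.
wide = (monthly.unstack("month")
               .reindex(columns=range(1, 13), fill_value=0)
               .fillna(0))
wide.columns = [f"revenue_{m}" for m in wide.columns]
wide["revenue_total"] = wide.sum(axis=1)

print(wide.loc[20001, "revenue_total"])  # 260.0
```

The same pivot would be repeated for each measure (transactions, weight, pieces, shipments, revenue) to arrive at the 70-column table.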

Reduction of dimensions by the selection of key attributes for clustering input

Based on discussions with the client, who has the domain knowledge, 65 attribute columns were used as inputs, to mitigate the curse of dimensionality.

Selection and Separating Customer Transaction Data into Regular and Irregular Customers for Mining

For the purpose of future analysis, reference and targeted marketing for DHL, we have saved the records (data points) into separate fact tables for each cluster.

| | No. of Transactions | Percentage | No. of Customers (N) |
|---|---|---|---|
| Regular Customer Transactions | 1,975,980 | 78.80% | 4,478 |
| Irregular Customer Transactions | 531,515 | 21.20% | 15,908 |
| Total | 2,507,495 | 100% | 20,386 |

Definition and Rationale for splitting into Regular and Irregular Customers

We have defined regular customers as customers whose bill accounts show consistent engagement (no zero values) with DHL for all 12 months. The rest of the customers are classified as irregular customers. Some of them have zero transaction values in between months, and some are new customers with zeros at the beginning of the year. However, for simplicity and for the purpose of clustering, we have separated them this way.
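As a sketch of this splitting rule (the project did this in SAS; accounts and counts below are hypothetical), a customer is regular only if every one of the 12 monthly columns is non-zero:

```python
import pandas as pd

# Hypothetical monthly transaction counts (trans_1..trans_12) per bill account.
df = pd.DataFrame(
    [[20045202, 0, 0, 85, 111, 101, 99, 110, 112, 93, 92, 59, 83],
     [20003300, 10, 9, 12, 11, 10, 13, 12, 11, 9, 10, 12, 11]],
    columns=["bill_account"] + [f"trans_{m}" for m in range(1, 13)],
)

month_cols = [f"trans_{m}" for m in range(1, 13)]
# Regular: consistent engagement, i.e. no zero-transaction month all year.
is_regular = (df[month_cols] > 0).all(axis=1)

regular, irregular = df[is_regular], df[~is_regular]
print(regular["bill_account"].tolist())    # [20003300]
print(irregular["bill_account"].tolist())  # [20045202]
```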

Regular Customers: Data Mining using Clustering

65 attribute columns were used as inputs, with the key attributes being the number of (1) transactions, (2) weight, (3) pieces, (4) shipments and (5) revenue for each month, plus their aggregated totals. An example is shown here: Revenue_1 represents the aggregated revenue in dollars for the month of January 2008.

Hierarchical Clustering Results

Ward Hierarchical (Cluster Frequencies)

Key Clusters (based on Counts):

1) Cluster 1 2) Cluster 11 3) Cluster 19

By analyzing the dendrogram, and in particular the distinctive jump in distance from the merge of clusters 21 and 22 onwards, we decided to prune the "tree" and choose 21 clusters.
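As a rough illustration of cutting a Ward dendrogram at a chosen number of clusters (the project used SAS JMP; this sketch uses SciPy on random stand-in data, not the DHL attributes):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Stand-in for the standardized customer attribute table (here: 200 x 5).
X = rng.normal(size=(200, 5))

# Ward hierarchical clustering, as used on the regular-customer table.
Z = linkage(X, method="ward")

# "Pruning the tree": cut the dendrogram into a fixed number of clusters.
labels = fcluster(Z, t=21, criterion="maxclust")
print(len(set(labels)))  # 21
```

In JMP the equivalent step is dragging the cut line on the dendrogram; `fcluster` with `criterion="maxclust"` plays the same role programmatically.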

Regular Customers: Cluster Analysis with Interactive Visualization

In this section, after generating the clusters, we use visual analytics such as parallel plots, histograms, tree maps and other graphs to understand, group and interpret the clusters, so as to find useful patterns that may support decision making.

After clustering, we moved on to a more visual and analytical layout for data exploration. Interactive data visualization through parallel plots, histograms, tree maps and bubble plots covers almost all the different properties of the data, allowing us to interpret it effectively. Every point in the multidimensional data was mapped as a line on a two-dimensional plane using parallel plots.

Using Parallel Plot to decide Key Cluster Groups Focus for Regular Customers

In order not to use data mining blindly (simply relying on machine learning) with the default K=20 size, we use data visualization with parallel plots plus human interpretation. As shown in the parallel plots, we can clearly see a general pattern that helps differentiate the cluster groupings (colours) for further analysis.

Grouping 1 (Total N= 3713) Cluster 1 (N=3713)

Analysis: isolated cluster with very high values for all attributes; largest no. of customers.

Grouping 2 (Total N= 415)

Cluster 11 (N=415)

Very low no. of customer transactions; moderately high values for other attributes; 2nd largest no. of customers.

Grouping 3 (Total N= 123)

Cluster 19 (N=123)

Very low no. of customer transactions; high total billed weight; low pieces and shipments; 4th largest no. of customers.

Grouping 4 (Total N= 171) Cluster 6 (N=97) Cluster 16 (N=38) Cluster 2 (N=23) Cluster 5 (N=4) Cluster 21 (N=4) Cluster 4 (N=1) Cluster 18 (N=5)

Very low no. of customer transactions; very low billed weight; very low to low pieces and shipments; 3rd largest no. of customers.

Grouping 5 (Total N= 55) Cluster 7 (N=1) Cluster 8 (N=17) Cluster 14 (N=7) Cluster 10 (N=21) Cluster 13 (N=3) Cluster 9 (N=1) Cluster 3 (N=1) Cluster 20 (N=1) Cluster 12 (N=1) Cluster 17 (N=1) Cluster 15 (N=1)

Very low no. of customer transactions; very low to low total billed weight; very low pieces and shipments; smallest market size (55 customers).

See Zoomed in view

Zoomed in view – Cluster Group Parallel Plot

Analysis

While clusters 1 and 11 show similar behavior across the different attributes, cluster 19 follows a pattern of relatively sharp dips and rises, which can be attributed to the drop in sum of pieces. Lines of different colors, representing different clusters, follow similar distributions, i.e. the clusters can in turn be grouped into groups with similar characteristics.

By zooming in and observing the patterns, we are able to see similarities and differences between the clusters. Clusters with similar patterns have been grouped into a cluster grouping and given a particular colour to help us distinguish them. For example, we can see that cluster 2 and cluster 21 are similar: both have relatively high sum_of_pieces and sum_of_shipments values compared to their other values, creating a mountain-like shape that is easily distinguished by the human eye. Here we can see how the parallel plot helps us group the clusters effectively and intuitively.

For further analysis, by creating a new column "ClusterGrp" and assigning each record to a cluster group based on its pattern, we can generate summary statistics to analyse the cluster groups further.

Summary Statistics of Cluster Groups

Analysis

Cluster Grps 5 and 3 have similarly high revenue per transaction ($400+/transaction) and revenue per shipment. We could further group them as one customer segment, which makes up a sizeable 38% (25.5 + 12.88) of the total revenue market share. Cluster Grp 1 has an extremely high revenue per unit weight.

By generating another parallel plot with useful ratios, we can make more useful analyses. Note that since shipment and transaction counts are almost the same, they are interchangeable.

- Cluster Grp 1: has the highest revenue per unit weight and the highest share of total revenue (30% of the market), with relatively low weight per transaction/shipment. We perceive that most transactions in Cluster Grp 1 are likely to be transported by air. However, as the number of customers (N) is extremely high (3,713), we cannot conclude that they are air-based; these shipments could also be light-weight land/sea shipments that are time-critical, which would explain the high revenue per transaction.
- Cluster Grps 3 and 5: have similar patterns and can be merged to give a sizeable market revenue share of 39%. Given their low revenue per unit weight and high weight per shipment, they are likely served by a high-volume, slow-delivery, container-based service transported by land or ship.
- Cluster Grp 2: seems to be a middle-tier customer segment with moderate revenue per transaction and moderate revenue per unit weight, contributing 15% of total revenue.
- Cluster Grp 4: has very low revenue per transaction and low weight per shipment/transaction, comprising 16% of the total revenue market. However, they have the highest revenue per weight.
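The ratio attributes behind this second parallel plot are straightforward to derive; a minimal sketch with hypothetical per-group aggregates (the figures are illustrative, not the project's numbers):

```python
import pandas as pd

# Hypothetical per-cluster-group aggregates.
grp = pd.DataFrame({
    "cluster_grp": [1, 2, 3],
    "revenue":     [300_000.0, 150_000.0, 255_000.0],
    "weight":      [10_000.0, 30_000.0, 200_000.0],
    "shipments":   [4_000, 1_500, 600],
})

# Ratio attributes used to compare cluster groups in the parallel plot.
grp["rev_per_shipment"] = grp["revenue"] / grp["shipments"]
grp["rev_per_weight"] = grp["revenue"] / grp["weight"]
grp["weight_per_shipment"] = grp["weight"] / grp["shipments"]

print(grp.loc[0, "rev_per_weight"])  # 30.0
```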


Based on this graph, we can see that Cluster Grps 1 (linear) and 5 are important to focus on, as together they make up more than 50% of total regular-customer revenue. Cluster Grp 5 is interesting because it generates high revenue with just half the number of customer transactions; it would be interesting to find out why this is so.

Cluster Grp 5 has the largest billed weight %.

Key Cluster Analysis for REGULAR Customers using Histogram to present distribution of different Cluster Groups

For the purpose of future analysis, reference and targeted marketing for DHL, we have saved the records (data points) into separate fact tables for each cluster.

Group 1, containing Cluster 1 (N=3713): Cluster 1 has too many outliers and needs further classification (use a decision tree?). Relatively small Total_Rev but many N.

Group 2, containing Cluster 11 (N=415): "Medium" Total_Rev

Group 3, containing Cluster 19 (N=123): "Large" Total_Rev

(Histogram medians: M=136,000; M=28,000; M=375,000; M=17,100; M=4,116; M=734)

*For the purpose of analysis, we used sampling, selecting the largest representative cluster from each group (marked *) to generate its distribution.

Grouping 4 (Total N= 150) Cluster 6 (N=97)* Cluster 16 (N=38) Cluster 5 (N=4) Cluster 21 (N=4) Cluster 4 (N=1) Cluster 18 (N=5) Cluster 17 (N=1)

Cluster 6

Grouping 5 (Total N= 54) Cluster 8 (N=17) Cluster 14 (N=7) Cluster 10 (N=21)* Cluster 13 (N=3) Cluster 9 (N=1) Cluster 3 (N=1) Cluster 20 (N=1) Cluster 12 (N=1) Cluster 3 (N=1) Cluster 17 (N=1)

Cluster 10

(Histogram medians: M=171,727; M=700,000)

Validating Hierarchical Clustering (Key Clusters) with another Clustering technique, K-Means

By looking at the graph comparison, it seems at first glance that the results of the different techniques are similar. To validate whether they actually are, we will look at the actual mapping of each cluster based on its distribution and key attributes. Based on the graph generated using Graph Builder, it seems the Ward clusters can be mapped to the K-means clusters by count:

- Ward Cluster 1 -> K-Means Cluster 17
- Ward Cluster 11 -> K-Means Cluster 6
- Ward Cluster 19 -> K-Means Cluster 9

However, as it might be presumptuous to conclude from this alone that the results of the two clustering techniques are the same, we investigate further by generating the distribution for each mapping to see if they point to the same data records.

Hierarchical & K-means Mapping Similarity (Hierarchical Cluster 1 VS K-means Cluster 17)

Hierarchical Cluster 1 (N=3713)

K-means

Cluster 17 (N=3892)

Ward Cluster 1 -> K-Means Cluster 17: as their distributions and means are very similar, we conclude that the two clusters capture essentially the same customers.

M=33100

Hierarchical & K-means Mapping Similarity (Hierarchical Cluster 11 VS K-means Cluster 6)

Hierarchical Cluster 11 (N=415)

K-means

Cluster 6 (N=421)

Ward Cluster 11 -> K-Means Cluster 6: as their distributions and means are very similar, we conclude that the two clusters capture essentially the same customers.

(Histogram medians: M=136,000; M=177,000)

Hierarchical & K-means Mapping Similarity (Hierarchical Cluster 19 VS K-means Cluster 9)

Hierarchical Cluster 19 (N=123) Analysis:

K-means

Cluster 9 (N=90) Analysis:

Ward Cluster 19 -> K-Means Cluster 9: as their distributions and means are very similar, we conclude that the two clusters capture essentially the same customers.

Conclusion: the two different clustering techniques give similar results. Thus we can use either clustering technique to segment and interpret the data.

M=590,000
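The cluster-to-cluster validation described above can also be sketched numerically: cross-tabulate the two label sets, map each Ward cluster to its dominant K-means cluster, and measure how many customers agree with that mapping. The labels below are hypothetical, not the project's actual assignments:

```python
import numpy as np
import pandas as pd

# Hypothetical labels from the two techniques on the same 10 customers.
ward = np.array([1, 1, 1, 11, 11, 19, 19, 19, 1, 11])
kmeans = np.array([17, 17, 17, 6, 6, 9, 9, 9, 17, 6])

# Contingency table: each Ward cluster should have one dominant K-means cluster.
ct = pd.crosstab(pd.Series(ward, name="ward"), pd.Series(kmeans, name="kmeans"))

# Map each Ward cluster to its majority K-means cluster, then score agreement
# (1.0 means the partitions match perfectly under this mapping).
mapping = ct.idxmax(axis=1)
agree = float((kmeans == mapping.loc[ward].to_numpy()).mean())
print(mapping.to_dict())  # {1: 17, 11: 6, 19: 9}
print(agree)              # 1.0
```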

Further Analysis to aid interpretation with Tree Map & industry attributes

In order to drill further down into the data set and view the industrial backgrounds of the different cluster groupings, we decided to generate tree maps for our analysis. The tree maps represent both the Division Description and the Major Group Description in the form of nested rectangles. By joining the industry code directory table with the customer and transaction table, we can further analyse the key clusters based on their industry.
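The join described here can be sketched in pandas (the table layouts, codes and figures below are hypothetical, not DHL's actual directory):

```python
import pandas as pd

# Hypothetical industry directory and clustered-customer tables.
industry = pd.DataFrame({
    "industry_code": [3571, 5045],
    "division_desc": ["Manufacturing", "Wholesale Trade"],
    "major_group_desc": ["Electronics", "Computers Wholesale"],
})
customers = pd.DataFrame({
    "bill_account": [20001, 20002, 20003],
    "industry_code": [3571, 5045, 3571],
    "cluster_grp": [1, 1, 3],
    "local_revenue": [1200.0, 900.0, 4400.0],
})

# Join on industry code, then aggregate revenue per (cluster group, division),
# the quantity a tree map would size its rectangles by.
joined = customers.merge(industry, on="industry_code", how="left")
by_industry = (joined
               .groupby(["cluster_grp", "division_desc"])["local_revenue"]
               .sum())
print(by_industry.loc[(1, "Manufacturing")])  # 1200.0
```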

Reasons for using tree maps:

- They make very efficient use of space, so we used them to display many industries simultaneously.
- They reveal different patterns within the data through the interplay of size and color.

Tree Map Analysis using cluster groupings Cluster Grouping 1 (size by local_revenue)

Cluster Grouping 1 (size by billed_weight)

Cluster Grouping 2 (size by local_revenue)

Cluster Grouping 2 (size by billed_weight)

Cluster Grouping 3 (size by local_revenue)

Cluster Grouping 3 (size by billed_weight)

Cluster Grouping 4 (size by local_revenue)

Cluster Grouping 4 (size by billed_weight)

Cluster Grouping 5 (size by local_revenue)

Cluster Grouping 5 (size by billed_weight)

Cluster Grouping Tree Map Analysis:

Tree maps are not directly linked to clustering; rather, they are used to understand the grouped clusters by studying their industrial details. Thus, although not directly part of clustering, tree maps helped us to view the industrial backgrounds.

Clusters and cluster groupings have low variability when sized by different attributes, i.e. local revenue and billed weight. For almost all the cluster groupings, local revenue bears a positive correlation with billed weight; thus the contribution by different clusters and cluster groupings to local revenue and to billed weight is roughly proportional. An exception is cluster 1, where the Electronics industry appears to contribute a major proportion of the billed weight, unlike its contribution to local revenue.

As a data visualization technique, the tree map is rather static in nature. But using JMP as our tool for both mining and statistical analysis, we were able to dynamically filter the data based on cluster groupings and generate tree maps on the fly; handling tree maps for interactive analytics was something we learnt through this project. Electronics and General Merchandise span all the segments/cluster groupings. Cluster Grouping 1 focuses more on commodity shipping with relatively low weight.

Linking it back to our cluster findings

| Cluster Grouping | Cluster Properties | Industrial details (key contributors) | Revenue % | Conclusion |
|---|---|---|---|---|
| Grp 1 (N=3713) | Highest revenue per weight; very high values for all attributes; largest no. of customers | Electronics; Wholesale; General Merchandise | 30% | Grp 1 and Grps 3 & 5 have almost the same industrial background. The major industry differentiator is Industrial Machinery, which, together with a relatively low number of customers, accounts for the high weight per shipment of Grps 3 & 5. Cluster Grp 1 has relatively low weight per shipment due to its very high number of customers. |
| Grps 3 and 5, Total=178 (N=123, N=55) | High weight per shipment; low revenue per unit weight | Electronics; Industrial Machinery; General Merchandise | 39% | |
| Grp 2 (N=415) | Moderate revenue per transaction; moderate revenue per unit weight; 2nd largest no. of customers | Electronics; Wholesale; General Merchandise | 15% | Grp 2 and Grp 4 contribute almost the same revenue, but in the parallel plots Group 2's revenue/weight is TWICE Group 4's. |
| Grp 4 (N=171) | Very low revenue/transaction; low weight per shipment/transaction; 3rd largest no. of customers | Electronics; Wholesale; General Merchandise; Industrial Machinery | 16% | |

Bubble Plot Analysis

(Bubble plots shown for Cluster Groupings 1-5 and for Clusters 1-21.)

In order to represent the data using the 3 key attributes (local revenue, billed weight and shipments) that contributed to the cluster formation, and to embark on a temporal analysis, we decided to use bubble plots. For the trend analysis, we take the example of cluster grouping 5. Revenue increases during the first 8 months, then falls during the later part of the year, with the maximum dip observed in December. The size of the bubble (number of shipments) remains roughly the same throughout the year, except for December, when the number falls below the average. Plotting with local revenue varying the bubble size, we get a similar trend for cluster grouping 5.

Findings & Recommendations for Decision Making for the Regular Customers Segment

Due to our limited domain knowledge of the logistics industry and the business, we found it hard to come up with detailed recommendations.

Irregular Customers: Cluster Analysis with Interactive Visualization

Based on statistical analysis, the irregular customers (those with at least one zero-revenue month) make up about 1/3 of the total revenue.

Hierarchical Clustering (Ward)

Deciding on the number of clusters: we chose 21. The most important clusters are 1 and 21.


Cluster 1

The mean is 12 while the median is 6, which means this data set is skewed; for a measure of position, the median is better. For the majority of the customers in this cluster, the total number of transactions is around 6 and the total revenue per year is only about 1,400. From these two numbers, we can roughly estimate the average revenue per transaction (1436/6 = 239.33).
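A small numeric sketch of why the median is preferred here (the counts below are hypothetical, constructed to mirror the reported mean of 12 and median of 6):

```python
import numpy as np

# Hypothetical yearly transaction counts for a right-skewed cluster: most
# customers sit near 6 transactions, a few heavy users pull the mean up.
counts = np.array([4, 5, 6, 6, 6, 6, 8, 9, 30, 40])

print(np.mean(counts))    # 12.0
print(np.median(counts))  # 6.0

# Rough revenue per transaction from the cluster figures in the report:
print(round(1436 / 6, 2))  # 239.33
```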

Cluster 21

For this cluster, even though the customers did not have consistent transactions every month, the total number of transactions and total revenue are still considerable, and we should not ignore them. Since the difference between the mean and the median is not very big for this cluster, we can take the mean as the measure of position. On average, customers in this cluster have 80 transactions per year, i.e. roughly 6-7 transactions per month. We should study the characteristics of this group and find ways to turn these customers into loyal, regular customers.

Data Exploration

To find out the unique characteristics of customers whose total number of transactions is above 500, we examine the following subset of records.

Observation of the data:

For irregular customers, some may be new customers to DHL in 2008 and should not be categorized as irregular. For example, a customer who started transacting with DHL in May and had a consistent number of transactions from May until the end of the year is not disloyal or inconsistent just because there are no transactions before May. We should try to separate such customers from the other irregular customers.

| bill_account | trans_1 | trans_2 | trans_3 | trans_4 | trans_5 | trans_6 | trans_7 | trans_8 | trans_9 | trans_10 | trans_11 | trans_12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20024616 | 0 | 0 | 0 | 0 | 0 | 19 | 76 | 63 | 89 | 78 | 103 | 100 |
| 20007379 | 65 | 46 | 40 | 57 | 44 | 41 | 47 | 47 | 0 | 50 | 50 | 51 |
| 20024105 | 0 | 0 | 0 | 0 | 0 | 0 | 12 | 195 | 305 | 27 | 9 | 0 |
| 20021263 | 130 | 129 | 72 | 50 | 43 | 34 | 60 | 34 | 0 | 0 | 0 | 0 |
| 20013839 | 0 | 0 | 0 | 0 | 35 | 80 | 71 | 108 | 78 | 53 | 47 | 95 |
| 20045202 | 0 | 0 | 85 | 111 | 101 | 99 | 110 | 112 | 93 | 92 | 59 | 83 |
| 20045201 | 494 | 425 | 22 | 1 | 1 | 21 | 2 | 0 | 0 | 0 | 0 | 0 |
| 20023251 | 0 | 2 | 41 | 25 | 71 | 170 | 202 | 135 | 98 | 80 | 107 | 86 |
| 20040189 | 0 | 0 | 0 | 0 | 0 | 54 | 131 | 177 | 162 | 162 | 196 | 168 |
| 20024242 | 0 | 0 | 0 | 0 | 0 | 103 | 212 | 180 | 207 | 175 | 98 | 78 |
| 20019323 | 234 | 295 | 280 | 241 | 22 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 20023723 | 0 | 0 | 0 | 0 | 111 | 157 | 140 | 133 | 147 | 132 | 126 | 140 |
| 20014670 | 156 | 125 | 159 | 347 | 112 | 114 | 146 | 13 | 1 | 4 | 1 | 0 |
| 20003306 | 106 | 89 | 95 | 107 | 108 | 113 | 127 | 123 | 0 | 95 | 102 | 114 |
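Such "new but consistent" customers could be separated programmatically; a sketch using two rows from the table above (the rule and the helper name are ours, not part of the project):

```python
import pandas as pd

# Monthly counts from the table: 20024616 starts only in June (a likely new
# customer), 20019323 trails off to zero (possible churn).
df = pd.DataFrame(
    [[20024616, 0, 0, 0, 0, 0, 19, 76, 63, 89, 78, 103, 100],
     [20019323, 234, 295, 280, 241, 22, 0, 0, 0, 0, 0, 0, 0]],
    columns=["bill_account"] + [f"trans_{m}" for m in range(1, 13)],
).set_index("bill_account")

def is_new_regular(row):
    """True if a run of leading zeros is followed by activity in every later month."""
    months = row.tolist()
    first = next((i for i, v in enumerate(months) if v > 0), None)
    return first is not None and all(v > 0 for v in months[first:])

new_mask = df.apply(is_new_regular, axis=1)
print(new_mask.to_dict())  # {20024616: True, 20019323: False}
```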

Another point I personally felt is very important: many customers have a lot of transactions with DHL for the first few months, after which the number gradually decreases to 0. We should try to find out the reason behind this. Is it the nature of the business (for example, fishing activities are determined by seasons), or is DHL losing these customers to competitors?

Limitations of Data Mining (Clustering)

Based on our analysis, there are new regular (consistent) customers who began their transactions with DHL later than January. For simplicity's sake, we decided to exclude these "regular" irregular customers from the regular analysis. We analyzed about 4,000 customers that have no zero transactions/revenues in any month. Since our focus is customer segmentation by means of clustering, we find this analysis sufficient to derive unique clusters for groups of customers (N>100). However, for the purpose of targeted marketing (advertisement), these excluded customers should be included as well.

Further analysis could be done on the "irregular" customers.

Further studies on new customers (with leading zero values) and on customers with decreasing transactions could be mined to come up with recommendations for targeted marketing and pricing discounts to keep and win back customers.

Challenges, Limitations and Lessons Learnt

1) Use of Interactive Visual Analytics & Data Mining Techniques

To validate our final (human-analyst) segmentation of the customers into 5 cluster "groupings", we went back to the Ward clustering dendrogram and set the number of clusters to 5; the results are as shown on the right. By comparing the system-generated results with ours, we realized we had overlooked 2 outliers, Cluster 3 and Cluster 9, the 2 largest customers, which we had identified at the start in our Initial Manual Logical Segmentation (A & B) using Basic Statistics. To rectify this, we can break our initial 5 cluster groupings into 6 by creating a separate grouping for the two largest customers. The system-generated 5-cluster result correctly clustered clusters 3 and 9 into a single cluster. However, the rest of its groupings (merging clusters 1, 11 and 19 into one cluster) do not make sense, showing that a human data mining analyst is still important for coming up with useful and logical findings.

2) Data complexity and understanding, and lack of domain knowledge

3) Doing Clustering with software (SAS Enterprise Miner to JMP)

4) To come up with meaningful and useful patterns for decision making

Final Conclusion

Knowledge discovery with DHL's large data set has enabled us to adopt the roles of a database analyst (DBA), a data analyst, and some aspects of a business analyst. We have identified 6 unique customer segments for DHL by using various clustering techniques and validating them. Interactive visual analytics and data mining techniques can empower everyday data analysts to gain insights and formulate informed decisions. We find that the best combination is an "intelligent" data mining analyst who has deep industry knowledge in the field (e.g. global logistics) and also a deep understanding of data mining algorithms and techniques. That way, useful and relevant findings and recommendations can be communicated to the decision makers of the business, realizing the full potential of data mining for business intelligence.

