
The Application of AdaBoost in Customer Churn Prediction

Shao Jinbo, Li Xiu, Liu Wenhuang
Automation Department, Tsinghua University, Beijing, 100084, China

[email protected]

ABSTRACT

Since attracting new customers is known to be more expensive, the enhancement of existing relationships is of pivotal importance to companies. Therefore, as part of the customer relationship management (CRM) strategy, predicting customer churn and improving customer retention have attracted more and more attention. Being aware of the defection-prone customers beforehand, companies can react in time to prevent the churn by offering the right set of products, modifying the sales strategy and providing customized services. Therefore, high predictive performance can ultimately lead to increased profits for companies. In this paper, we use AdaBoost, a main branch of boosting algorithms, to predict customer churn. We have implemented three different boosting schemes: Real AdaBoost, Gentle AdaBoost and Modest AdaBoost. Applied to a credit-debt customer database of an anonymous commercial bank in China, they are shown to significantly improve prediction accuracy compared with other algorithms, such as SVM. The algorithms are assessed and compared to analyze their characteristics. The data processing and sampling scheme are also detailed in this paper.

Keywords: Customer Churn, Prediction, Boosting, AdaBoost

1. INTRODUCTION

This paper studies customer churn, which is a hot topic in CRM and one of the most important issues in enterprises. Customer churn - the propensity of customers to cease doing business with a company in a given time period - has become an important problem for many firms, including publishing, investment services, insurance, electric utilities, health care providers, credit card providers, banking, Internet service providers, telephone service providers, online services, and cable service operators[1]. Obviously, customer churn figures directly in how long a customer stays with a company, and in turn in the customer's lifetime value to that company. By analyzing a customer's lifetime profit to a company[2], it is easy to find that most of a company's profits are contributed by frequent customers and that attracting new customers is more expensive than retaining the existing ones. Therefore, the enhancement of relationships with existing customers is of pivotal importance to companies. Being aware of the defection-prone customers beforehand, companies can react in time to prevent the churn. So, customer churn prediction is the first and also a very important step in preventing customer churn. What we try to do is to identify in advance those customers who are likely to churn at some later date. The company can then target these customers with special programs or incentives to forestall them from churning.

The most widely used model for predicting customer churn is the binary classification model: customers are classified into two categories, going to churn or not. Many methods and algorithms have been used to solve this problem, such as classification trees[3], neural networks[4] and genetic algorithms[5]. Decision-tree-based algorithms can uncover classification rules for classifying records with unknown class membership. Nevertheless, when decision-tree-based algorithms are extended to determine the probabilities associated with such classifications[6], it is possible that some leaves in a decision tree have similar class probabilities. Neural networks can determine a probability for a prediction along with its likelihood. However, compared with decision-tree-based algorithms, they do not explicitly express the uncovered patterns in a symbolic, easily understandable way. Genetic algorithms can produce accurate predictive models, but they cannot determine the likelihood associated with their predictions. This prevents these techniques from being applicable to the task of predicting churn, which requires ranking customers according to their likelihood to churn[7].

Besides the algorithms above, some scholars have put forward other methods to predict churn. Luo[8] applied a Bayesian multi-net classifier in customer modeling for telecommunications CRM and got effective results. Zhao[9] introduced an improved one-class SVM and tested it on a wireless-industry customer churn data set. Ding[10] studied the application of sequential pattern association analysis to the prediction of customer churn in banking. Lu[11] used survival analysis to model customer lifetime value, which is a powerful and straightforward measure that synthesizes customer profitability and churn (attrition) risk at the individual customer level. Some other scholars also use combination methods to predict churn[7][12]. All of these have made good attempts at predicting churn and ultimately increasing the customers' value for the companies.

Lemmens and Croux[13] were the first to apply ensemble learning algorithms to the prediction of customer churn. They tested bagging and stochastic gradient boosting[14], one of the most recent boosting variants, on a customer database of an anonymous U.S. wireless telecom company and reported a significant improvement in prediction accuracy. Our work puts this research one step forward. We focus on boosting and apply three different boosting schemes to a credit-debt customer database of an anonymous commercial bank in China. The data processing and sampling scheme are detailed in the section after next. The algorithms are assessed and compared to analyze their characteristics. Ultimately, we draw a conclusion.

2. METHODOLOGY

Boosting is one of the most important recent developments in classification methodology. It is a technique of combining a set of weak classifiers to form one high-performance prediction rule (a powerful "strong" classifier or "committee"). It works by sequentially applying a classification algorithm to re-weighted versions of the training data and then taking a weighted majority vote of the sequence of classifiers thus produced.

The first practical boosting algorithm, called AdaBoost, was proposed by Freund and Schapire[15] in 1996. AdaBoost is adaptive in that it adapts to the error rates of the individual weak hypotheses. This is the basis of its name: "Ada" is short for "adaptive."[16]

AdaBoost has many advantages. It is fast, simple and easy to program. It has no parameters to tune (except for the number of rounds T). It requires no prior knowledge about the weak learner and so can be flexibly combined with any method for finding weak hypotheses. Finally, it comes with a set of theoretical guarantees given sufficient data and a weak learner that can reliably provide only moderately accurate weak hypotheses. This is a shift in mind set for the learning-system designer: instead of trying to design a learning algorithm that is accurate over the entire space, we can focus on finding weak learning algorithms that only need to be better than random[16].

In 1999, Schapire and Singer[17] studied boosting in an extended framework in which each weak hypothesis generates not only predicted classifications, but also self-rated confidence scores which estimate the reliability of each of its predictions. They also discussed some essential questions in boosting. Then they gave an improved, generalized version of AdaBoost. The algorithm takes as input a training set $(x_1, y_1), \ldots, (x_m, y_m)$ where each $x_i$ belongs to some domain or instance space $X$, and each label $y_i$ is in the label set $Y = \{-1, +1\}$. AdaBoost calls a given weak or base learning algorithm repeatedly in a series of rounds $t = 1, \ldots, T$. One of the main ideas of the algorithm is to maintain a distribution or set of weights over the training set. The weight of this distribution on training example $i$ on round $t$ is denoted $D_t(i)$. Initially, all weights could be set equally, but on each round, the weights of incorrectly classified examples are increased so that the weak learner is forced to focus on the hard examples in the training set. The weak learner's job is to find a weak hypothesis $h_t: X \to \mathbb{R}$ appropriate for the distribution $D_t$. The goodness of a weak hypothesis is measured by its error:

$$\epsilon_t = \Pr_{i \sim D_t}[h_t(x_i) \neq y_i] = \sum_{i: h_t(x_i) \neq y_i} D_t(i) \qquad (1)$$

So the steps of the generalized AdaBoost algorithm are:

For $t = 1, \ldots, T$:

- Train the weak learner using distribution $D_t$.
- Get a weak hypothesis $h_t: X \to \mathbb{R}$ with error
$$\epsilon_t = \sum_{i: h_t(x_i) \neq y_i} D_t(i) \qquad (2)$$
- Choose $\alpha_t \in \mathbb{R}$.
- Update:
$$D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \begin{cases} e^{-\alpha_t} & \text{if } h_t(x_i) = y_i \\ e^{\alpha_t} & \text{if } h_t(x_i) \neq y_i \end{cases} \qquad (3)$$
$$\phantom{D_{t+1}(i)} = \frac{D_t(i)\exp(-\alpha_t y_i h_t(x_i))}{Z_t} \qquad (4)$$
where $Z_t$ is a normalization factor (chosen so that $D_{t+1}$ will be a distribution).

Output the final hypothesis:
$$H(x) = \mathrm{sign}\Big(\sum_{t=1}^{T} \alpha_t h_t(x)\Big) \qquad (5)$$
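To make the recipe above concrete, here is a minimal sketch in Python/NumPy of the generalized loop with decision stumps as weak learners. All names are illustrative (nothing here is taken from the paper or from the GML toolbox), and the "Choose $\alpha_t$" step is instantiated with the classic discrete-AdaBoost choice $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$, which is one common option; the Real, Gentle and Modest variants discussed next differ essentially in how the contribution of each weak hypothesis is computed.

```python
import numpy as np

def train_stump(X, y, D):
    """Pick the single-feature threshold rule with the lowest weighted error (eq. 2)."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1.0, -1.0):
                pred = np.where(X[:, j] <= thr, sign, -sign)
                err = D[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    err, j, thr, sign = best
    h = lambda Z, j=j, thr=thr, s=sign: np.where(Z[:, j] <= thr, s, -s)
    return err, h

def adaboost(X, y, T=100, D=None):
    """Generalized AdaBoost loop of eqs. (2)-(5); y must be coded +1/-1, D is the initial D_1."""
    n = X.shape[0]
    D = np.full(n, 1.0 / n) if D is None else np.asarray(D, dtype=float) / np.sum(D)
    alphas, hypotheses = [], []
    for t in range(T):
        eps, h = train_stump(X, y, D)              # weak hypothesis and its error (eq. 2)
        eps = min(max(eps, 1e-10), 1 - 1e-10)      # guard against log(0) / division by zero
        alpha = 0.5 * np.log((1 - eps) / eps)      # classic discrete choice of alpha_t
        D = D * np.exp(-alpha * y * h(X))          # re-weighting step (eq. 4)
        D = D / D.sum()                            # divide by Z_t so D_{t+1} is a distribution
        alphas.append(alpha)
        hypotheses.append(h)
    score = lambda Z: sum(a * h(Z) for a, h in zip(alphas, hypotheses))  # real-valued score
    predict = lambda Z: np.sign(score(Z))                                # final hypothesis, eq. (5)
    return predict, score, alphas, hypotheses
```

A production implementation would store the stump parameters rather than closures, but closures keep the sketch close to the $h_t$ notation used above.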

They then proved that, in order to minimize the training error, a reasonable approach might be to greedily minimize the bound given in their theorem by minimizing $Z_t$ on each round of boosting. It can be verified that $Z_t$ is minimized when

$$\alpha_t = \frac{1}{2}\ln\frac{W_{+1}}{W_{-1}} \qquad (6)$$
$$W_b = \sum_{i: y_i h_t(x_i) = b} D_t(i) \qquad (7)$$

So they replaced the $\alpha_t$ in the generalized AdaBoost steps with the new $\alpha_t = \frac{1}{2}\ln\frac{W_{+1}}{W_{-1}}$ to form a new AdaBoost algorithm, the Real AdaBoost. The Real AdaBoost algorithm uses class probability estimates to construct real-valued contributions $\alpha_t h_t(x)$, and it is usually treated as the basic "hardcore" boosting algorithm.
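As a quick sanity check on equation (6): for a weak hypothesis whose outputs are restricted to $\{-1, +1\}$, $y_i h_t(x_i) = +1$ exactly when example $i$ is classified correctly, so $W_{+1} = 1 - \epsilon_t$ and $W_{-1} = \epsilon_t$, and (6) reduces to the familiar discrete-AdaBoost weight $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$ (the choice used in the code sketch above); the confidence-rated case generalizes this to real-valued $h_t$.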

In 2000, Friedman, Hastie and Tibshirani[18] put forward another improved AdaBoost algorithm, called Gentle AdaBoost. Here the update is

$$\alpha_t = \frac{W_{+1} - W_{-1}}{W_{+1} + W_{-1}} \qquad (8)$$

rather than $\alpha_t = \frac{1}{2}\ln\frac{W_{+1}}{W_{-1}}$. This gives Gentle AdaBoost better generalization ability, so as to relieve the overfitting and noise-sensitivity problems that the earlier AdaBoost algorithms face. Some empirical evidence suggests that this more conservative algorithm has similar performance to Real AdaBoost, and often outperforms it, especially when stability is an issue.

In 2005, A. Vezhnevets and V. Vezhnevets[19] introduced the Modest AdaBoost algorithm. They used the inverted distribution

$$\bar{D}_t(i) = \frac{1 - D_t(i)}{\bar{Z}_t} \qquad (9)$$

to construct the new update

$$\alpha_t = W_{+1}(1 - \bar{W}_{+1}) - W_{-1}(1 - \bar{W}_{-1}) \qquad (10)$$
$$W_b = \sum_{i: y_i h_t(x_i) = b} D_t(i) \qquad (11)$$

where $\bar{W}_b$ is computed in the same way as $W_b$ but from the inverted distribution $\bar{D}_t$. They applied the new algorithm to databases from the UCI Machine Learning Repository and compared the results with Gentle AdaBoost. On some UCI datasets, Modest AdaBoost outperforms Gentle AdaBoost in error rate, and it seems to be more stable and more resistant to overfitting. The drawback of Modest AdaBoost is that the training error decreases much more slowly than in the Gentle AdaBoost scheme and often does not reach zero, because the contribution of a weak classifier is decreased if it works "too well" on data that has already been correctly classified with a high margin. This gives the algorithm better generalization ability but a lower learning speed.

It is known that no algorithm fits all datasets. So, in our experiment, we apply these three representative AdaBoost algorithms to our dataset to see which one is the most suitable for our problem. We use stumps as the weak classifier for all methods. This choice was made because stumps are considered to be the "weakest of all" among commonly used weak learners, so using stumps lets us investigate the difference in performance resulting from the different boosting schemes.

3. DATA PROCESSING AND SAMPLING SCHEME

Our study is performed on a database provided by an anonymous bank in China. The database has nearly 20,000 observations in total. We select 1,524 observations from the database to form our experiment dataset. Observations that lack important attributes, or lack too many attributes (more than 30% of the total), are excluded.

We select the attributes (variables) of the customers (observations) after fixing the observations. The variable selection is done by first excluding the attributes used for the bank's internal management, such as Customer ID. Then we exclude all variables containing more than 30% missing values. We retain 19 variables, including customer demographics (e.g. the number of children in the household, or the education level of the customer), behavioral variables (e.g. the type of the customer's debt, the type of hypothecation, or the term of the debt), and company-interaction variables (e.g. the number of times the customer exceeded a time limit).

We also need to translate the character attributes into numbers. For an attribute whose available values have a trend (e.g. the education level of the customer), we can translate the values into numbers that preserve the trend (e.g. education level low equals 1, middle equals 2 and high equals 3). For an attribute whose available values have no trend (e.g. the type of the customer's debt), we should extend this attribute (variable) into several variables (shown in Figure 1).

Figure 1
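As an illustration of this translation step, the sketch below (Python with pandas; the attribute names are hypothetical stand-ins, since the paper's actual variable list is not given) maps a "trend" attribute to ordered integer codes and extends a "no-trend" attribute into several 0/1 indicator variables, as described around Figure 1.

```python
import pandas as pd

# Hypothetical example columns standing in for the paper's attributes.
df = pd.DataFrame({
    "education_level": ["low", "high", "middle", "low"],   # ordered: keep the trend
    "debt_type":       ["car", "house", "card", "house"],  # unordered: extend to indicators
})

# Attribute with a natural order -> integer codes that preserve the trend.
order = {"low": 1, "middle": 2, "high": 3}
df["education_level"] = df["education_level"].map(order)

# Attribute without a natural order -> one indicator (0/1) variable per value,
# i.e. the single attribute is extended into several variables (cf. Figure 1).
df = pd.get_dummies(df, columns=["debt_type"], prefix="debt_type")

print(df)
```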

Missing values are handled differently for the continuous and the categorical predictors. For the continuous variables, the missing values are imputed by the mean of the non-missing ones. For categorical predictors, an extra level is created for each of them, indicating whether the value was missing or not.
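A minimal sketch of this imputation rule, using the same hypothetical column names as above: continuous predictors receive the mean of the observed values, categorical predictors receive an explicit extra level.

```python
import pandas as pd

def impute(df, continuous_cols, categorical_cols):
    """Mean-impute continuous predictors; give categorical predictors an explicit 'missing' level."""
    df = df.copy()
    for c in continuous_cols:
        df[c] = df[c].fillna(df[c].mean())                 # mean of the non-missing values
    for c in categorical_cols:
        df[c] = df[c].astype(object).fillna("missing")     # extra level flags the value as missing
    return df
```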

Finally, we define the meaning of churn. The staff of the bank have already classified and rated these customers according to the customers' credit, based on their banking experience. We define the customers whose credit rates are "low" as churners. The churners make up around 5% of the total customers. The churn response (customer label) is coded as a variable with y = +1 if the customer churns, and y = -1 otherwise.

Now we have the full experiment dataset, which has 1,524 customers, 27 predictor variables and the label variable. We then divide the full experiment dataset evenly into two datasets: the first one, containing half of the total observations, is used for training the classifiers, and the other observations are used for testing and evaluating the classifiers.

As we can see, customer defection is still, statistically speaking, a rare event (around 5% of the total customers, and even less in some other industries, e.g. 1.5% in the wireless industry). Consequently, when the churn predictive model is estimated on a random sample of the customer population, the vast majority of non-churners in this proportional training dataset (i.e. the number of churners in the randomly drawn sample is proportional to the real-life churn proportion) will dominate the statistical analysis, which may hinder the detection of churn drivers and eventually decrease the predictive accuracy. So what we do is re-sample the churners in the training set to obtain a relatively balanced training set. This increases the size of the training set and also the computational workload. But we can achieve the same effect by setting the initial distribution of the training examples, provided the weak learner is an algorithm that can use a weight distribution over the training examples. In this way, we keep the computational workload the same as with the proportional training set[15]. Both sampling schemes are assessed in the next section, and we will give an empirical conclusion.
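The weight-based alternative described above can be sketched as follows (illustrative code, not from the paper): instead of physically duplicating churners, the initial distribution $D_1$ gives the churner class and the non-churner class equal total mass, so the weak learner sees a balanced problem while the number of training examples, and hence the computational cost, stays unchanged.

```python
import numpy as np

def balanced_initial_distribution(y):
    """Initial weights D_1(i) giving churners (y=+1) and non-churners (y=-1) equal total mass."""
    y = np.asarray(y)
    n_pos = max(int((y == 1).sum()), 1)
    n_neg = max(int((y == -1).sum()), 1)
    D = np.where(y == 1, 0.5 / n_pos, 0.5 / n_neg)
    return D / D.sum()

# Plugged into the earlier sketch:
#   predict, score, alphas, hs = adaboost(X_train, y_train, T=100,
#                                         D=balanced_initial_distribution(y_train))
```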

4. RESULTS AND DISCUSSION

Our experiments are done in Matlab 2006b with the help of the GML AdaBoost Matlab Toolbox. First, let us look at the error rates of the three AdaBoost algorithms under the two sampling schemes. The error rate here means the percentage of incorrectly classified observations in the validation dataset.

Figure 2: Error rate with the proportional training set (Real, Gentle and Modest AdaBoost; error rate vs. iterations)

Figure 3: Error rate with the balanced sampled training set (Real, Gentle and Modest AdaBoost; error rate vs. iterations)

Fig. 2 and Fig. 3 show the error rates of the three AdaBoost algorithms under the two different sampling schemes. It can be seen that the lowest error rate is about 4.5%. This seems to be an excellent performance. But if we take the badly unbalanced dataset (about 5% are churners) into account, we find that such a rule may not isolate any of the potentially riskiest customers. So, for rare events, the error rate is often an inappropriate criterion.

We should therefore choose another assessment criterion. The lift is a commonly used criterion in prediction, and what we really care about is who the riskiest customers are. So, we use the top-decile lift as the assessment criterion. The top-decile lift focuses on the most critical group of customers regarding their churn risk. It equals the proportion of churners in this risky segment, $\hat{\pi}_{10\%}$, divided by the proportion of churners in the whole validation set, $\hat{\pi}$:

$$\text{Top-decile lift} = \frac{\hat{\pi}_{10\%}}{\hat{\pi}} \qquad (12)$$

The higher the top-decile lift, the better the classifier. The score values of the churn risk can be obtained in these algorithms by taking as the final hypothesis

$$H(x) = \sum_{t=1}^{T} \alpha_t h_t(x) \qquad (13)$$

The top 10% riskiest customers are also potentially an ideal segment for targeting a retention marketing campaign.
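For concreteness, a small sketch of how the top-decile lift of equation (12) can be computed from the real-valued scores of equation (13); the function and variable names are illustrative.

```python
import numpy as np

def top_decile_lift(scores, y, decile=0.10):
    """Churner rate among the top-scored decile divided by the overall churner rate (eq. 12)."""
    scores, y = np.asarray(scores), np.asarray(y)
    k = max(1, int(np.ceil(decile * len(y))))
    riskiest = np.argsort(-scores)[:k]        # indices of the predicted 10% riskiest customers
    pi_top = (y[riskiest] == 1).mean()        # proportion of churners in the risky segment
    pi_all = (y == 1).mean()                  # proportion of churners in the whole validation set
    return pi_top / pi_all

# e.g. lift = top_decile_lift(score(X_valid), y_valid), with score() from the AdaBoost sketch above.
```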


Figure 4: Top-decile lift with the proportional training set (Real, Gentle and Modest AdaBoost, C-SVM and C4.5; lift vs. iterations)

Figure 5: Top-decile lift with the balanced sampled training set (Real, Gentle and Modest AdaBoost and C-SVM; lift vs. iterations)

Fig. 4 and Fig. 5 show the top-decile lift of the three AdaBoost algorithms under the two different sampling schemes. Compared with C-SVM (the C-SVM in the balanced-sampled case is CWC-SVM[9]) and C4.5, the AdaBoost algorithms perform relatively better, improving the lift by a factor of two or more. That means nearly 80% of the total potential churners are among our predicted 10% riskiest customers.

Now let us compare the three AdaBoost algorithms with each other. We can see that Real AdaBoost and Gentle AdaBoost have close performance on our dataset under both sampling schemes and by both assessment criteria; the one with the higher learning speed, lower error rate or higher lift value is usually the more overfitting one. Only in Fig. 4 does Gentle AdaBoost perform very slightly better than Real AdaBoost. Modest AdaBoost seems to have some trouble on our dataset. It is too "modest" to fit our heavily unbalanced dataset: it cannot boost at all unless we resample the data to form a balanced training set. With the balanced training set, Modest AdaBoost still learns more slowly than the other two algorithms, but it resists the overfitting problem very well, as can be seen in Fig. 5.

The two different sampling schemes give us different results as well as some clues. The balanced sampling scheme causes a clear drawback in error rate for all three algorithms. But it improves the top-decile lift of Real and Gentle AdaBoost slightly and the top-decile lift of Modest AdaBoost greatly. So, if we take the top-decile lift as the primary assessment criterion, the balanced sampling scheme is a good choice for dealing with the heavily unbalanced training set. In fact, the balanced sampling scheme "helps" Modest AdaBoost a lot on our dataset, even in error rate: it allows Modest AdaBoost to start boosting at all, as shown in Fig. 3.

Figure 6: Weights of each attribute in the three algorithms (summed weak-learner weight vs. attribute index)

There is another advantage of AdaBoost: it can indicate the potential rules of the classification process. Each weak learner uses one attribute (variable) to classify the dataset, and the attribute number can be obtained easily. The absolute values of the weights of these weak learners represent the "confidence" of the weak learners, i.e. of the attributes. So, we can find which attribute is the most powerful influencing factor for the classification. Fig. 6 shows the weights of each attribute in the three AdaBoost algorithms. It can be seen that the top three influencing factors are the amount of the debt, the customer's duty level and the type of repayment. We can also see that the different AdaBoost algorithms give nearly the same result in choosing attributes.
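This reading of the ensemble can be reproduced with a few lines, assuming the training loop is extended to record, for each round, the index of the attribute that the stump splits on (an assumption for illustration; the earlier sketch would need that small addition): summing $|\alpha_t|$ per attribute yields the per-attribute "confidence" profile plotted in Figure 6.

```python
import numpy as np

def attribute_weights(alphas, split_attributes, n_attributes):
    """Sum |alpha_t| over the rounds in which each attribute was chosen by the stump."""
    w = np.zeros(n_attributes)
    for alpha, j in zip(alphas, split_attributes):
        w[int(j)] += abs(alpha)
    return w / w.sum()   # normalized per-attribute "confidence", as plotted in Figure 6

# split_attributes[t] is the attribute index used by the stump of round t (recorded during training).
```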

5. CONCLUSION

To sum up, the AdaBoost algorithms perform really well in predicting customer churn. They can not only determine a probability for a prediction with its likelihood, but also explicitly indicate the potential rules of the classification process.

Real and Gentle AdaBoost adapt to the heavily unbalanced churner dataset even with the "weakest" learner, the stump. Modest AdaBoost is too conservative to adapt to the dataset. But the balanced sampling scheme can improve the performance of each algorithm in top-decile lift, at the cost of a higher error rate. Modest AdaBoost shows its resistance to overfitting under the balanced sampling scheme.

There is still a lot of further work to do. New improved AdaBoost algorithms are put forward continually, so we could keep searching for more suitable AdaBoost algorithms for customer churn prediction, and we could also try other weak learners in further experiments. In any case, we hope our work is helpful for companies to better identify the riskiest customer segment in terms of churn risk and to improve their retention strategy. Ultimately, they could reduce the losses caused by churn.

ACKNOWLEDGEMENT

The authors are very grateful to the anonymous bank that supplied the data for this analysis and to the managers who helped us a lot by sharing their insights and expertise.

REFERENCES

[1] S. Neslin, S. Gupta, W. Kamakura, J. Lu, C. Mason, "Defection Detection: Improving Predictive Accuracy of Customer Churn Models", Working Paper, Teradata Center at Duke University, 2004.

[2] Wells, Melanie, "Brand ads should target existing customers", Advertising Age, pp26-47, 1993.

[3] C. P. Wei, I. T. Chiu, "Turning telecommunications call details to churn prediction: a data mining approach", Expert Systems With Applications, Vol. 23, Issue 2, pp103-112, 2002.

[4] Mozer, M. C., Wolniewicz, R., Grimes, D. B., et al., "Churn reduction in the wireless industry", Advances in Neural Information Processing Systems, pp935-941, 2000(12).

[5] Eiben, A. E., Koudijs, A. E., Slisser, F., "Genetic modeling of customer retention", Lecture Notes in Computer Science, Vol. 1391, pp178, 1998.

[6] J. R. Quinlan, "Decision trees as probabilistic classifiers", Proc. 4th Int. Workshop Machine Learning, pp31-37, Irvine, CA, 1987.

[7] Wai-Ho Au, Keith C. C. Chan, and Xin Yao, "A Novel Evolutionary Data Mining Algorithm with Applications to Churn Prediction", IEEE Transactions on Evolutionary Computation, Vol. 7, Issue 6, pp532-545, 2003.

[8] Luo Ning, Mu Zhichun, "Bayesian Network Classifier and Its Application in CRM", Computer Application, Vol. 24, No. 3, pp79-81, 2004.

[9] Y. Zhao, B. Li, X. Li, W. Liu, S. Ren, "Customer Churn Prediction Using Improved One-Class Support Vector Machine", Lecture Notes in Computer Science, Vol. 3584, pp300-307, 2005.

[10] Ding-An Chiang, Yi-Fan Wang, Shao-Lun Lee, Cheng-Jung Lin, "Goal-oriented sequential pattern for network banking churn analysis", Expert Systems with Applications, Vol. 25, Issue 3, pp293-302, 2003.

[11] Lu Junxiang, "Modeling Customer Lifetime Value using Survival Analysis - an Application in the Telecommunications Industry", SAS Institute Paper 120-28, 2001.

[12] Louis Anthony Cox, Jr., "Data Mining and Causal Modeling of Customer Behaviors", Telecommunication Systems, Vol. 21, pp349-381, 2002.

[13] A. Lemmens, C. Croux, "Bagging and Boosting Classification Trees to Predict Churn", Journal of Marketing Research, Vol. 43, No. 2, pp276-286, 2006.

[14] Friedman, Jerome H., "Stochastic Gradient Boosting", Computational Statistics and Data Analysis, Vol. 38, Issue 4, pp367-378, 2002.

[15] Y. Freund and R. E. Schapire, "Game theory, on-line prediction and boosting", In Proceedings of the Ninth Annual Conference on Computational Learning Theory, pp325-332, Desenzano del Garda, Italy, 1996.

[16] Yoav Freund, Robert E. Schapire, "A Short Introduction to Boosting", Journal of Japanese Society for Artificial Intelligence, Vol. 14, No. 5, pp771-780, September 1999.

[17] R. E. Schapire and Y. Singer, "Improved boosting algorithms using confidence-rated predictions", Machine Learning, Vol. 37, No. 3, pp297-336, December 1999.

[18] Jerome Friedman, Trevor Hastie, and Robert Tibshirani, "Additive logistic regression: A statistical view of boosting", The Annals of Statistics, Vol. 28, No. 2, pp337-374, 2000.

[19] A. Vezhnevets and V. Vezhnevets, "Modest AdaBoost - teaching AdaBoost to generalize better", Graphicon, 2005.


