Predicting Online Community Churners using Gaussian Sequences

PREDICTING ONLINE COMMUNITY CHURNERS USING GAUSSIAN SEQUENCES MATTHEW ROWE SCHOOL OF COMPUTING AND COMMUNICATIONS [email protected] | @MROWEBOT International Conference on Social Informatics 2014 Barcelona, Spain


The Issue of Churn

Churner: a user, or subscriber, who stops using a service!

Churners = Loss

Social

Social capital

Expertise

Vibrancy

Financial


How do churners and non-churners develop?

How can we exploit development information to detect churners?

Engineered Static Features

Predicting Churners

Outline

• Defining churn: in the context of online communities
• User Lifecycle Model
  - Mining development signals
  - Churners vs. Non-churners
• Prediction Models
  - Gaussian Sequences
  - Single/Dual-Gaussian Sequence Model
• Experiments
• Conclusions + Future Work

Defining Churners

[Figure: post frequency over time (2008 to 2012), marking the analysis point and the window in which churners posted for the final time. A second plot shows the distribution p(x) of Δ on log-log axes with the mean and median marked, where Δ = maximum #days between posts. Platforms: Facebook, SAP, ServerFault, Boards.ie.]
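As a rough illustration of the quantity on this slide, Δ for one user could be computed as below. This is a minimal sketch: the function name `max_gap_days` and the example dates are hypothetical, and the slide does not state how Δ is turned into a churn threshold, so no threshold is applied here.

```python
from datetime import date


def max_gap_days(post_dates):
    """Return Δ: the largest number of days between two consecutive posts."""
    ordered = sorted(post_dates)
    if len(ordered) < 2:
        # A single post has no gap to measure.
        return 0
    return max((later - earlier).days for earlier, later in zip(ordered, ordered[1:]))


# Hypothetical posting history for one user.
posts = [date(2011, 1, 1), date(2011, 1, 4), date(2011, 2, 3), date(2011, 2, 5)]
delta = max_gap_days(posts)
```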

(i.e. in-degree, out-degree, clustering coefficient, closeness centrality, etc.), inducing a J48 decision tree to differentiate between churners and non-churners when using social network properties formed from the reply-to graph of the online communities. In this paper, we implement this model as our baseline by engineering the same features using the same experimental setup. Our approach differs from existing work by assessing churners' and non-churners' development signals, and inducing a joint-probability function from such information.

Table 1. Splits of users within the datasets and the churn window duration

Platform       #Churners   #Non-churners   Churn Window
Facebook           1,033           1,199   [04-11-2011, 28-08-2012]
SAP               10,421           7,255   [29-11-2009, 07-09-2010]
Server Fault      12,314          11,144   [13-06-2010, 24-12-2010]
Boards.ie         65,528       6,120,008   [01-01-2005, 13-02-2008]

3 Datasets

To provide a broad examination of user lifecycles across different online community platforms we used data collected from four independent platforms:

1. Facebook: Data was obtained from Facebook groups related to Open University degree course discussions. Although Facebook provides the ability to collect social network data for users, we did not collect such data in this instance and instead used the reply-to graph within the groups to build social networks for individual users.

2. SAP Community Network (SAP): The SAP Community Network is a community question answering system related to SAP technology products and information technologies. Users sign up to the platform and post questions related to technical issues; other users then provide answers to those questions and, should any answer satisfy the original query and therefore solve the issue, the answerer is awarded points.

3. Server Fault: Similar to SAP, Server Fault is a platform that is part of the Stack Overflow question answering site collection (footnote 1). The platform functions in a similar vein to SAP by providing users with the means to post questions pertaining to a variety of server-related issues, and allowing other community members to reply with potential answers.

4. Boards.ie: This platform is a community message board that provides a range of dedicated forums, where each forum is used to discuss a given topic (e.g. Rugby Union, Xbox360 games, etc.). We were provided with data covering the period 1998-2008 and, like SAP and ServerFault, we also had access to the reply-to graph in each forum.

1 http://stackoverflow.com/

Split each dataset’s users into training (80%) and testing (20%)


User Lifecycles

We divide each user's lifetime into k equal activity periods (lifecycle stages) s = 1, 2, 3, …, k.

For each stage s we model three properties:
• In-degree: the actions to user u by other users
• Out-degree: the actions by user u to other users
• Lexical: the terms used by user u, as term counts (e.g. Semantic: 17, Web: 5)

Experiment with different settings of k = {5, 10, 20}
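The division into equal activity periods could be sketched as follows. This is an illustrative sketch only: `split_into_stages` is a hypothetical helper, "equal activity" is read here as an equal number of posts per stage, and the minimum of 2k posts follows the requirement quoted later in the experimental setup.

```python
def split_into_stages(posts, k):
    """Split a chronologically ordered list of posts into k lifecycle stages
    that each hold a (near-)equal number of posts."""
    if k <= 0:
        raise ValueError("k must be positive")
    if len(posts) < 2 * k:
        # Assumed minimum: each user must have posted at least 2k times.
        raise ValueError("need at least 2k posts to build k stages")
    base, extra = divmod(len(posts), k)
    stages, start = [], 0
    for i in range(k):
        # The first `extra` stages take one additional post each.
        size = base + (1 if i < extra else 0)
        stages.append(posts[start:start + size])
        start += size
    return stages


# Example: 11 posts (represented by their indices) split at fidelity k = 5.
stages = split_into_stages(list(range(11)), 5)
```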

Mining User Evolution Signals

Stages: 1, 2, 3, …, n

Applied information-theory measures (e.g. cross-entropy) between consecutive stages: H(P1, P2), H(P2, P3), …, H(Pn-1, Pn)

Produces: {x1, x2, …, xM} = X ∈ ℝ^(M×S), i.e. M users' measures derived from S stages

Measures: Period Entropy, Historical Cross-Entropy, Community Cross-Entropy
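For readers who want to see the measures concretely, the standard definitions of entropy, H(P) = -Σ p(x) log p(x), and cross-entropy, H(P, Q) = -Σ p(x) log q(x), can be sketched as below. The helper names, the base-2 logarithm and the small `eps` floor for terms missing from Q are assumptions of this sketch; the slides do not specify them.

```python
import math
from collections import Counter


def distribution(items):
    """Turn raw observations (e.g. terms, or users replied to) into a probability distribution."""
    counts = Counter(items)
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()}


def entropy(p):
    """Shannon entropy H(P) in bits."""
    return -sum(px * math.log2(px) for px in p.values() if px > 0)


def cross_entropy(p, q, eps=1e-12):
    """Cross-entropy H(P, Q) in bits; events unseen in Q are floored at eps (assumption)."""
    return -sum(px * math.log2(q.get(x, 0.0) or eps) for x, px in p.items() if px > 0)


# Term counts from the lexical example (Semantic: 17, Web: 5) and a hypothetical later stage.
p_stage1 = distribution(["semantic"] * 17 + ["web"] * 5)
p_stage2 = distribution(["semantic"] * 5 + ["web"] * 5)
period_entropy = entropy(p_stage1)
historical_cross_entropy = cross_entropy(p_stage2, p_stage1)
```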


User Evolution Signals

[Figure: nine panels of period entropy (H, y-axis) against lifecycle stage (x-axis): indegree, outdegree and lexical, each at k = 5, 10 and 20.]

Fig. 2. Period entropy distribution on ServerFault for different fidelity settings (k) for users' lifecycles and different measures of social (indegree and outdegree) and lexical dynamics. The green dashed line shows the non-churners, while the red solid line shows the churners.

the probability distribution, we instead used all posts to return Q (footnote 4). We then calculated the cross-entropy as above between the distributions (H(Pu, Q)) over the different lifecycle stages. Again, as with period cross-entropies, we find churners' signals to have a lower magnitude than non-churners', suggesting that non-churners' properties tend to diverge from the community as they progress throughout their lifetime within the online community platforms.

5 Churn Prediction from Gaussian Sequences

Above we plotted the 95% confidence intervals of a given measurement m (e.g. the period entropy of users' in-degree at lifecycle stage 1) for both churners and non-churners. If we assume that the distribution of a given measurement (m) at a particular lifecycle stage (s) is normally distributed, then for each measurement we have two signals (one for churners and one for non-churners)

4 For instance, for the global in-degree distribution we used the frequencies of received messages for all users.

For each stage:
1. Derive values for each user
2. Derive mean and 95% CI of churners & non-churners
3. Plot curves of each class
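The three steps above could be sketched roughly as follows, stopping short of the plotting. The names `mean_ci95` and `stage_curves` are hypothetical, and the interval uses the normal approximation (mean ± 1.96 standard errors), which is an assumption; the slides do not say how the 95% CI was derived.

```python
import math
from statistics import mean, stdev


def mean_ci95(values):
    """Return (mean, lower, upper) using a normal-approximation 95% interval."""
    m = mean(values)
    half_width = 1.96 * stdev(values) / math.sqrt(len(values))
    return m, m - half_width, m + half_width


def stage_curves(X, labels):
    """X[i][s] is user i's measure at stage s; labels[i] is 1 (churner) or 0 (non-churner).
    Returns, per class, one (mean, lower, upper) triple per lifecycle stage."""
    n_stages = len(X[0])
    curves = {}
    for cls in (0, 1):
        rows = [x for x, y in zip(X, labels) if y == cls]
        curves[cls] = [mean_ci95([row[s] for row in rows]) for s in range(n_stages)]
    return curves


# Hypothetical measures for four users over two stages.
X = [[0.5, 0.6], [0.7, 0.8], [0.2, 0.1], [0.4, 0.3]]
labels = [0, 0, 1, 1]
curves = stage_curves(X, labels)
```

Each class needs at least two users per stage for the sample standard deviation to be defined.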


Period Variation: Entropy Signals

[Figure: period entropy (H) against lifecycle stage in a 3 × 3 grid of panels. Columns: increasing lifecycle fidelities (k = 5, 10, 20). Rows: user properties (indegree, outdegree, lexical).]

Non-Churners: •  Share more connections (greater entropy)

•  Invest more and get more out of the community


Community-Comparisons: Cross-Entropy Signals

[Figure: nine panels of community cross-entropy (H, y-axis) against lifecycle stage (x-axis): indegree, outdegree and lexical, each at k = 5, 10 and 20.]

Fig. 4. Community cross-entropy distribution for different fidelity settings (k) for users' lifecycles and different measures of social (indegree and outdegree) and lexical dynamics.


Churners: •  Diverge from community norms, but slower

How can we exploit development information to detect churners?

Prediction Models



Gaussian Sequences

Assume that a measure, m (e.g. in-degree), is normally distributed for a given stage (s).

[Figure: nine panels of period cross-entropy (H, y-axis) against lifecycle stage (x-axis): indegree, outdegree and lexical, each at k = 5, 10 and 20.]

Fig. 3. Period cross-entropy distribution on ServerFault for different fidelity settings (k) for users' lifecycles and different measures of social (indegree and outdegree) and lexical dynamics.

that each correspond to a sequence of Gaussians measured over the k lifecycle stages:

Definition 1 (Gaussian Sequence). Let m be a given measurement and s be a given lifecycle stage drawn from the set of lifecycle stages, s ∈ S; then m is said to be normally distributed on s and defined by $\mathcal{N}\left(\hat{\mu}_{m,s}, (\hat{\sigma}_{m,s})^2\right)$, where $\hat{\mu}_{m,s}$ and $\hat{\sigma}_{m,s}$ denote the maximum likelihood estimates of the mean and standard deviation respectively. Then the Gaussian Sequence of m is defined as follows:

$$G_m = \left( \mathcal{N}\left(\hat{\mu}_{m,1}, (\hat{\sigma}_{m,1})^2\right), \mathcal{N}\left(\hat{\mu}_{m,2}, (\hat{\sigma}_{m,2})^2\right), \ldots, \mathcal{N}\left(\hat{\mu}_{m,|S|}, (\hat{\sigma}_{m,|S|})^2\right) \right).$$
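A Gaussian Sequence in the sense of Definition 1 could be estimated from data roughly as below. The function name `fit_gaussian_sequence` and the example values are hypothetical; the sketch uses the population standard deviation because that is the maximum likelihood estimate the definition asks for.

```python
from statistics import fmean, pstdev


def fit_gaussian_sequence(X):
    """X[i][s] is user i's value of one measure m at stage s.
    Returns [(mu_1, sigma_1), ..., (mu_|S|, sigma_|S|)] using maximum likelihood estimates."""
    n_stages = len(X[0])
    sequence = []
    for s in range(n_stages):
        column = [row[s] for row in X]
        # pstdev divides by N (not N - 1), matching the MLE of the standard deviation.
        sequence.append((fmean(column), pstdev(column)))
    return sequence


# Hypothetical values of one measure for two churners over two stages.
churn_sequence = fit_gaussian_sequence([[0.0, 1.0], [2.0, 3.0]])
```

Fitting one sequence on known churners and another on known non-churners yields the two competing signals per measure.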

5.1 Single-Gaussian Sequence Model

Under the assumption that a given measurement has a Gaussian distribution at s, then for an arbitrary user (u) we may measure the likelihood that the user belongs within a given distribution given his measurement at that stage. Using the convenience function f(u, m, s) we can compute the probability that the user belongs to the churn Gaussian, at that time step (s), using:

$$P(u \mid \beta_{m,s}) \propto \beta_{m,s}\, \mathcal{N}\left(f(u,m,s) \mid \hat{\mu}^{c}_{m,s}, (\hat{\sigma}^{c}_{m,s})^2\right)$$

Gaussian Sequence is a chain of Gaussians for measure m over the k=|S| stages:

Rationale: uncommon overlap between 95% CI bounds

[Figure: the period entropy panels of Fig. 2 (indegree, outdegree and lexical at k = 5, 10 and 20), with an inset plot of the density f(m) of a measure m.]

Gaussian Sequence Prediction Models


Our aim therefore is to induce some function f that has as its domain a given user's development signal modelled as a feature vector (x) and the churn probability as its co-domain, hence: $f : \mathbb{R}^n \to [0, 1]$. To induce this function we used three methods: (i) logistic regression; (ii) a dual-gaussian sequence model; and (iii) a linear model with elastic-net regularisation. We now explain each in turn.

7.1.1 Detection Model 1: Logistic Regression Model. We used the logistic regression model to predict the conditional probability of user $u_i$ churning as follows:

$$\Pr(Y = 1 \mid \mathbf{x}_i) = \frac{1}{1 + e^{-\mathbf{b}^{\top}\mathbf{x}_i}} \tag{10}$$

The model's coefficients (b) define the weight attached to each feature within the linear model ($f(\mathbf{x}_i \mid \mathbf{b}) = \mathbf{b}^{\top}\mathbf{x}_i$). In order to derive the model's coefficients we used the maximum likelihood estimation $\hat{\beta}$ of the model's coefficients. Following fitting, the derived model is used to predict the churn probability of each user within the test portion of the data.
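Equation 10 can be evaluated directly once the coefficients are known. The sketch below assumes already-fitted coefficients and uses a hypothetical function name, `churn_probability`; it does not show the maximum likelihood fitting itself.

```python
import math


def churn_probability(b, x):
    """Pr(Y = 1 | x) = 1 / (1 + exp(-b^T x)) for coefficient vector b and feature vector x."""
    z = sum(bj * xj for bj, xj in zip(b, x))
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # Algebraically identical form that avoids overflow for very negative z.
    ez = math.exp(z)
    return ez / (1.0 + ez)


# Hypothetical coefficients and development-signal features for one user.
p = churn_probability([0.8, -1.2], [1.0, 0.5])
```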

7.1.2 Detection Model 2: Dual-Gaussian Sequence Model. When inspecting each different measurement (e.g. the period entropy of users' in-degree at lifecycle stage 1) for both churners and non-churners, we plotted the development signals for both sets of users along with their 95% confidence intervals. Our second model, presented initially in our earlier work [Rowe 2014], is based upon the premise that a given measurement (m) at a particular lifecycle stage (s) is normally distributed. Thus, for each measurement we have two signals (one for churners and one for non-churners) that each correspond to a sequence of Gaussians measured over the k lifecycle stages. We define this more concretely as follows: given a measurement m (footnote 5) and a lifecycle stage s drawn from a set of lifecycle stages S, we assume that m is normally distributed at s and thus characterised by $\mathcal{N}\left(\hat{\mu}_{m,s}, (\hat{\sigma}_{m,s})^2\right)$, where $\hat{\mu}_{m,s}$ and $\hat{\sigma}_{m,s}$ denote the maximum likelihood estimates of the mean and standard deviation respectively. Then the Gaussian Sequence of m is defined as follows:

$$G_m = \left( \mathcal{N}\left(\hat{\mu}_{m,1}, (\hat{\sigma}_{m,1})^2\right), \mathcal{N}\left(\hat{\mu}_{m,2}, (\hat{\sigma}_{m,2})^2\right), \ldots, \mathcal{N}\left(\hat{\mu}_{m,|S|}, (\hat{\sigma}_{m,|S|})^2\right) \right) \tag{11}$$

In essence we have two competing Gaussian distributions at a particular lifecycle stage: the churn Gaussian, formed from measurements of the known churners, and the non-churn Gaussian, formed from measurements of known non-churners. We can therefore specify the probability of the user u belonging to the churner class based on measurement m and lifecycle stage s as follows:

$$P(u \mid \beta_{m,s}) \propto \left[ \beta_{m,s}\, \mathcal{N}\left(f(u,m,s) \mid \hat{\mu}^{c}_{m,s}, (\hat{\sigma}^{c}_{m,s})^2\right) - (1 - \beta_{m,s})\, \mathcal{N}\left(f(u,m,s) \mid \hat{\mu}^{n}_{m,s}, (\hat{\sigma}^{n}_{m,s})^2\right) \right]_{+}$$

Above we have modified the maximum likelihood estimates for the mean and standard deviation to correspond to the churner (c) and non-churner (n) classes. We also incorporated the slack variable $\beta_{m,s}$, which is indexed by the measurement and lifecycle stage and controls for over-penalising class membership; we learn this parameter as $\beta_{m,s} \in \mathbf{b}$. The subtraction of the $(1 - \beta_{m,s})$-scaled non-churn-distribution membership probability from the $\beta_{m,s}$-scaled churn-distribution membership probability is wrapped within the positive-part operator $[\cdot]_{+}$ in order to return a non-negative value. We can then calculate the joint churn

5 We defined a measurement, or measure, earlier as the combination of a given dynamics (e.g. in-degree) and a given development indicator (e.g. period entropy); hence M is the set of all 9 possible measurements.

ACM Transactions on Knowledge Discovery from Data, Vol. 1, No. 1, Article 1, Publication date: October 2014.

Single-Gaussian Sequence Model


In the above equation, $\mathcal{N}\left(f(\cdot) \mid \hat{\mu}, \hat{\sigma}^2\right)$ defines the conditional probability of the observed measurement f(.) being drawn from the given Gaussian of measure m in lifecycle stage s. We have also included a slack variable $\beta_{m,s}$ to control for influence on the churn probability; its inclusion is necessary because we may have an outlier measure for u and should limit overfitting as a consequence. Note that this variable is indexed by both m and s as it is specific to both the lifecycle stage and the measure under inspection. Given our formulation of the churn probability in a particular lifecycle stage s and based on measure m, we can therefore derive the joint probability of u churning over the observed sequence of measures (m ∈ M) and his lifecycle stages (s ∈ S) as follows; we term this the Single-Gaussian Sequence Model:

$$Q(u \mid \mathbf{b}) = \prod_{m \in M} \prod_{s \in S} \rho \left( \beta_{m,s}\, \mathcal{N}\left(f(u,m,s) \mid \hat{\mu}^{c}_{m,s}, (\hat{\sigma}^{c}_{m,s})^2\right) \right)$$

The parameter ρ smooths zero probability values given our joint calculation. Assuming we have |S| lifecycle stages and |M| measures, the slack variables are stored within a parameter vector b, where $\beta_{m,s} \in \mathbf{b}$.

Probability of u belonging to the churn Gaussian, in each measure and stage

β∈b learnt using (single|dual) stochastic gradient descent

ρ smoothes zero probabilities (set to 0.1)
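The Single-Gaussian Sequence Model could be scored along the following lines. This is a sketch under a stated assumption: the source writes ρ directly in front of the bracketed term without spelling out how it is applied, so the code below reads it as additive smoothing (ρ + term), which is only one possible interpretation. The function names and the nested-list layout are hypothetical.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def single_gaussian_score(f_u, churn_params, beta, rho=0.1):
    """f_u[m][s]: user's value of measure m at stage s.
    churn_params[m][s]: (mu, sigma) of the churn Gaussian.
    beta[m][s]: slack variable for that measure and stage."""
    q = 1.0
    for m, stages in enumerate(f_u):
        for s, value in enumerate(stages):
            mu, sigma = churn_params[m][s]
            # Assumed additive smoothing so a near-zero density cannot zero the product.
            q *= rho + beta[m][s] * normal_pdf(value, mu, sigma)
    return q


# One measure over two stages, with hypothetical churn Gaussians and slack values.
params = [[(0.5, 0.1), (0.4, 0.1)]]
beta = [[0.5, 0.5]]
near = single_gaussian_score([[0.5, 0.4]], params, beta)
far = single_gaussian_score([[0.9, 0.9]], params, beta)
```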

Dual-Gaussian Sequence Model

Probability of u belonging to the churn Gaussian minus the probability of belonging to the non-churn Gaussian, in each measure and stage

Mining User Development Signals for Online Community Churner Detection

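probability over observed measures and lifecycle stages as follows; we term this the Dual-Gaussian Sequence Model:

$$Q(u \mid \mathbf{b}) = \prod_{m \in M} \prod_{s \in S} \rho \left[ \beta_{m,s}\, \mathcal{N}\left(f(u,m,s) \mid \hat{\mu}^{c}_{m,s}, (\hat{\sigma}^{c}_{m,s})^2\right) - (1 - \beta_{m,s})\, \mathcal{N}\left(f(u,m,s) \mid \hat{\mu}^{n}_{m,s}, (\hat{\sigma}^{n}_{m,s})^2\right) \right]_{+}$$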

Here ρ acts as a smoother to chain together zero-probability values. Now, for this detection model, our objective is to minimise the squared loss between a user's forecasted churn probability and the observed churn label, given that the former is in the closed interval [0, 1] and the latter is from the set {0, 1}; our parameters are L2-regularised to control for overfitting:

$$\operatorname*{arg\,min}_{\mathbf{b}^{*}} \sum_{(\mathbf{x}_i, y_i) \in D} \left( y_i - Q(u \mid \mathbf{b}) \right)^2 + \lambda \lVert \mathbf{b} \rVert^2 \tag{12}$$

Using this objective, we then used stochastic gradient descent to calculate the setting of each β ∈ b by minimising the loss between a single user's forecasted churn probability and his actual churn label (i.e. either 0, did not churn, or 1, did churn). We experimented with two learning procedures: stochastic gradient descent (SGD) and dual-stochastic gradient descent (D-SGD), the latter being a novel contribution in our prior work [Rowe 2014]. However, we found the difference in performance to be insignificant and thus favoured the former given its reduced computational complexity (i.e. O(m) per learning epoch rather than O(m × m)). We refer the reader to our prior work [Rowe 2014] for a more thorough presentation of the models and learning procedures used.
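The dual model's score and the regularised squared loss of Equation 12 could be sketched as follows. Two assumptions are made and should be read as such: ρ is applied as additive smoothing (the source does not spell out how it enters the product), and the penalty uses the squared L2 norm (the extracted equation shows a single "2" on the norm, which could be a subscript or an exponent). All function names are hypothetical, and the gradient-descent loop itself is not shown.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def dual_gaussian_score(f_u, churn_params, non_churn_params, beta, rho=0.1):
    """Product over measures m and stages s of the smoothed, clipped difference between
    the beta-scaled churn density and the (1 - beta)-scaled non-churn density."""
    q = 1.0
    for m, stages in enumerate(f_u):
        for s, value in enumerate(stages):
            b = beta[m][s]
            mu_c, sd_c = churn_params[m][s]
            mu_n, sd_n = non_churn_params[m][s]
            diff = b * normal_pdf(value, mu_c, sd_c) - (1 - b) * normal_pdf(value, mu_n, sd_n)
            # [.]_+ keeps only the positive part; rho is added as assumed smoothing.
            q *= rho + max(0.0, diff)
    return q


def regularised_squared_loss(scores, labels, beta, lam):
    """Sum of squared errors plus lam times the squared L2 norm of all slack variables."""
    squared_error = sum((y - q) ** 2 for q, y in zip(scores, labels))
    penalty = lam * sum(b * b for row in beta for b in row)
    return squared_error + penalty


# One measure, two stages; identical Gaussians for both classes cancel at beta = 0.5.
same = [[(0.5, 0.1), (0.4, 0.1)]]
cancelled = dual_gaussian_score([[0.5, 0.4]], same, same, [[0.5, 0.5]])
loss = regularised_squared_loss([0.5], [1], [[1.0]], 0.1)
```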

7.1.3 Detection Model 3: Linear Model with Elastic-Net Regularisation. Our third detection model combines a linear model with elastic-net regularisation to predict the probability of a given user churning using the linear combination of the user's feature vector and the learnt weight vector (b). We combine both L1 and L2 regularisation within the predictive function to control for overfitting in the training segment by using α-weighting between the L1 and L2 penalties (i.e. Lasso and Ridge). To learn the parameters of the model we used stochastic coordinate descent, which is a modification of the average coordinate descent algorithm proposed by Friedman et al. [Friedman et al. 2010] (footnote 6). Our objective is as follows:

$$\operatorname*{arg\,min}_{\mathbf{b}^{*}} \frac{1}{N} \sum_{(\mathbf{x}_i, y_i) \in D} \left( \frac{1}{2} \left( y_i - \mathbf{b}^{\top}\mathbf{x}_i \right)^2 + \lambda (1 - \alpha) \frac{1}{2} \lVert \mathbf{b} \rVert_2^2 + \lambda \alpha \lVert \mathbf{b} \rVert_1 \right) \tag{13}$$

Therefore we calculate the derivative of $\beta_j \in \mathbf{b}$ based on a single instance (i.e. $(\mathbf{x}_i, y_i) \in D$) at a time as follows, based on applying the chain rule from Equation 13:

$$\nabla_i \beta_j = -x_{ij} \left( y_i - \mathbf{b}^{\top}\mathbf{x}_i \right) + \lambda (1 - \alpha) \beta_j + \lambda \alpha \tag{14}$$

Unlike with the average coordinate descent model, in this instance we use the stochastic learning routine of shuffling the order of D each training epoch and then iterating through the set of training

6 N.b. we found the use of a stochastic learning routine to achieve the same levels of accuracy as the average coordinate descent approach, but to be computationally more efficient given that the derivative is calculated per learning instance for a single feature.


• Aim: to maximise detection of churners
  - I.e. Area Under ROC Curve

1.  Do we perform better than the state of the art?

2.  How do we fare against existing classifiers?


Experiments


Experimental Setup


Table IV. Number of instances within the training and testing datasets used for the experiments across the different lifecycle fidelities. The number of instances decreases as the lifecycle fidelity increases as we require each user to have posted double the fidelity number of posts.

           Facebook           SAP                ServerFault        Boards.ie
Fidelity   |Train|   |Test|   |Train|   |Test|   |Train|   |Test|   |Train|   |Test|
5              306       72     1,099      302     1,229      338     6,338    1,700
10             204       48       716      205       688      177     4,979    1,334
20             123       27       448      129       375       84     3,635      995

instances, deriving the error in prediction and the derivative of each parameter and thus updating accordingly. As a result our update rule is the following, for a given training instance with index i:

$$\beta_j = \beta_j - \eta \nabla_i \beta_j \tag{15}$$

In this model we have two hyperparameters that must be tuned: the learning rate η and the regularisation weight λ. The use of elastic-net regularisation means that we can examine the spectrum between using solely a lasso penalty (α = 1), or solely a ridge penalty (α = 0), or somewhere in the middle (α = 0.5). Rather than tuning α as a hyperparameter, we adopted a different approach and indexed linear models using the following settings: α = {0, 0.5, 1}, thereby tuning a hyperparameter vector θ = {η, λ} for each setting. We explain in the following section the model tuning approach that was applied.
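One training epoch of the stochastic routine (shuffle D, then apply the update rule of Equation 15 per instance) could look roughly like this. The function name `sgd_epoch` is hypothetical. One deliberate deviation is labelled in the code: the L1 term uses sign(β_j) as its subgradient, whereas Equation 14 prints the term as λα, which coincides with this when β_j > 0.

```python
import random


def sgd_epoch(data, b, eta, lam, alpha, rng):
    """Run one shuffled pass over data = [(x, y), ...], updating weights b in place."""
    order = list(range(len(data)))
    rng.shuffle(order)
    for i in order:
        x, y = data[i]
        residual = y - sum(bj * xj for bj, xj in zip(b, x))
        for j in range(len(b)):
            # Assumption: sign(beta_j) as the L1 subgradient (Equation 14 shows lam * alpha).
            sign = (b[j] > 0) - (b[j] < 0)
            grad = -x[j] * residual + lam * (1 - alpha) * b[j] + lam * alpha * sign
            b[j] -= eta * grad  # Equation 15: beta_j = beta_j - eta * gradient
    return b


# Toy data following y = 2x; with lam = 0 the weight should approach 2.
rng = random.Random(0)
weights = [0.0]
toy = [([1.0], 2.0), ([2.0], 4.0)]
for _ in range(200):
    sgd_epoch(toy, weights, eta=0.1, lam=0.0, alpha=0.5, rng=rng)
```

Setting `alpha` to 0, 0.5 or 1 reproduces the ridge, mixed and lasso variants described above.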

7.2 Experiments

In order to compare the above models and judge how well they fare against existing work, we conducted a series of experiments: firstly, to tune the different models' hyperparameters; and secondly, to apply them to a held-out test portion of users. We begin by first defining our experimental setup.

7.2.1 Experimental Setup. As mentioned above, for the four platforms' datasets we divided users into training and testing sets using an 80%:20% split respectively. Given that we experimented with different lifecycle fidelities (k = {5, 10, 20}), this reduced the number of users, and thus instances, in our dataset and hence the training and testing splits; Table IV shows the number of instances per split and lifecycle fidelity. The reason for this reduction is that we require each user to have posted 2k posts prior to the churn cutoff point at which we perform our analysis; this provides sufficient information from which to mine users' development signals. For setting up our experiments we first performed model tuning (using the training set for each platform), and then applied the tuned models to the held-out test split; in this latter setting we repeatedly applied each tuned model 25 times and took the average area under the Receiver Operating Characteristic (ROC) curve.

7.2.1.1 Model Tuning. For our experiments we had two models to tune: the dual-gaussian sequence model and the linear model using elastic-net regularisation, both of which require their hyperparameters to be selected. To tune the hyperparameters (λ and η) we ran 10-fold cross validation over the training splits and recorded the average ROC; we then selected the best performing hyperparameter combinations. Both λ and η were varied through {10^-8, 10^-7, . . . , 10^-1}. For the dual-gaussian sequence model we set ρ = 0.1; varying this smoothing parameter will be investigated in future work. For the linear model with elastic-net regularisation we tuned three variants of the model with different settings for α, where α = {0, 0.5, 1}; this allowed us to examine the performance of solely L1 (i.e. lasso) regularisation (α = 1), solely L2 (i.e. ridge) regularisation (α = 0), or combining both equally (α = 0.5). The logistic regression model did not require the tuning of hyperparameters. Once model tuning was

Stage 1: Model Tuning 10-Fold Cross-Validation to tune model hyperparameters

Stage 2: Model Testing Apply tuned models to held out test splits
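The two stages could be wired together roughly as below: a pairwise area-under-ROC computation, a 10-fold split, and a grid over λ and η in {10^-8, …, 10^-1}. Everything named here (`roc_auc`, `k_fold_indices`, `tune`, `GRID`, and the `train_fn`/`predict_fn` callbacks) is a hypothetical scaffold rather than the author's code; folds that contain only one class are skipped, which is a choice made for this sketch.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve as the fraction of (churner, non-churner) pairs
    ranked correctly, counting ties as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def k_fold_indices(n, k=10):
    """Assign indices 0..n-1 to k folds in round-robin order."""
    return [list(range(i, n, k)) for i in range(k)]


# Hyperparameter grid: lam and eta each range over 10^-1 ... 10^-8.
GRID = [{"lam": 10.0 ** -a, "eta": 10.0 ** -b} for a in range(1, 9) for b in range(1, 9)]


def tune(train_fn, predict_fn, X, y, grid, k=10):
    """Return (best average AUC, best params) over k-fold cross-validation."""
    best = None
    for params in grid:
        aucs = []
        for fold in k_fold_indices(len(X), k):
            held_out = set(fold)
            train = [i for i in range(len(X)) if i not in held_out]
            fold_labels = [y[i] for i in fold]
            if len(set(fold_labels)) < 2:
                continue  # AUC is undefined when a fold holds a single class
            model = train_fn([X[i] for i in train], [y[i] for i in train], **params)
            scores = [predict_fn(model, X[i]) for i in fold]
            aucs.append(roc_auc(scores, fold_labels))
        if aucs:
            average = sum(aucs) / len(aucs)
            if best is None or average > best[0]:
                best = (average, params)
    return best
```

The tuned parameters would then be used to retrain on the whole training split before scoring the held-out test split.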


Baseline 1 (B1-J48): Decision tree with measures as features
Baseline 2 (B2-NB): Social network features (e.g. centrality)

Web Science Conference 2011

Results (ROC)



we denote by B2-NB, we implemented the approach from [2] using features derived from the social network of users: in-degree, out-degree, closeness-centrality, betweenness-centrality, reciprocity, average number of posts in initiated threads, average number of posts within participated threads, popularity (% of user-authored posts that receive replies), initialisation (% of threads authored by the user), and polarity. We first tested the J48 classifier, as used in [2], but found it to perform poorly (footnote 5); therefore we used the Naive Bayes classifier instead.

Table 2. Area under the Receiver Operating Characteristic (ROC) Curve results for the different Gaussian Sequence Models and Learning Procedures

                         Baselines          SGD                 D-SGD
Platform      Fidelity   B1-J48   B2-NB     Single-N   Dual-N   Single-N   Dual-N
Facebook      5          0.559    0.461     0.570      0.472    0.548      0.478
              10         0.531    0.491     0.569      0.554    0.593      0.545
              20         0.478    0.444     0.664      0.500    0.528      0.583
SAP           5          0.594    0.497     0.573      0.527    0.545      0.533
              10         0.533    0.494     0.553      0.503    0.584      0.590
              20         0.478    0.582     0.500      0.500    0.540      0.525
ServerFault   5          0.583    0.530     0.522      0.556    0.583      0.577
              10         0.534    0.546     0.500      0.557    0.569      0.589
              20         0.463    0.530     0.500      0.634    0.486      0.484
Boards.ie     5          0.504    0.611     0.524      0.547    0.526      0.518
              10         0.512    0.593     0.500      0.539    0.501      0.496
              20         0.560    0.553     0.500      0.501    0.500      0.502

6.3 Results: Churn Prediction Performance

For the model testing phase of the experiments, we took the best performing hyperparameters for each model and learning procedure, trained the model with this setting using the entire training split, and then applied it to the test split; we did this twenty times for each model (as each induction of the parameter vector is affected by the stochastic nature of the learning procedure) and took the average ROC value. These ROC values for the different models and baselines are shown in Table 2. The results show that for certain proposed models we significantly outperformed the baselines for two of the datasets (footnote 6). Surpassing B1-J48 indicates that our proposed Gaussian models beat a widely-used classification model when detecting churners, given that this baseline makes use of the same features as our proposed model.

The results indicate variance across the prediction models as to which model performs best and under what conditions. For instance, the single-gaussian model

5 We also tested support vector machines and the perceptron classifier.
6 Testing for significance using the Student T-test for independent samples.

No clear winner among Gaussian Sequence Models

Significantly outperform baselines for 2/4 datasets

Conclusions + Future Work

• Gaussian Sequences allow user measures to be chained together in a joint probability model
  - Mined from users' development trajectories
  - Examined across various lifecycle fidelities

• Exceed performance of baselines for 2/4 datasets

• FW1: Churn point prediction + user rankings
• FW2: Towards a theory of churner development

@mrowebot | [email protected] http://www.lancaster.ac.uk/staff/rowem/

Questions?

