User Behavior Modeling on Financial Message Boards

Post on 15-Apr-2017


Pritha D.N

Sahaj Biyani

December 9, 2015

Introduction

Investors Hub

Objective

• To identify the roles users assume in these message board forums.

• To validate the “90-9-1 Rule for Participation Inequality” in the message board community.

Dataset

• Free US listed stocks message boards

• Time Period: January, 2001 - June, 2015

• Total Message Boards: 6,278

• Total Users: 52,558

• Total Posts: 5,624,024

Dataset Analysis

• Percentage of initiated posts: 30%

• 19% of users did not initiate any post.

• 80% of users initiated fewer than 20 posts.

Dataset Analysis

• Number of boards a user participated in:

• 56% of users are active on only 1 board.

• 90% of users are active on fewer than 20 boards.

Dataset Analysis

• Average response time of replies a user makes:

Dataset Analysis

• Number of posts across boards:

• 80% of posts were made on fewer than 200 boards.

• 1,000 out of 6,278 boards account for 90% of posts made.

Features

1. Number of threads a user initiated over time
2. Number of replies a user made over time
3. Number of users a user replies to
4. Number of users who reply to a user
5. Number of boards a user is active on
6. Number of followers
7. Replier share: AVG[proportion of replies a user gets on a board]
8. Reply share: AVG[proportion of replies a user makes on a board]
9. Average response time
10. Volume of content the user posted
11. Number of links the user has posted

Feature groups: Activity of User, User Network Structure, Content Related
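As an illustration of how features 1 to 5 might be derived, here is a minimal sketch. The record layout `(author, board, replied_to_author)` and the sample users are hypothetical and not the Investors Hub schema. The slides track some counts "over time"; this sketch simply aggregates over the whole log.

```python
from collections import defaultdict

# Hypothetical post records: (author, board, replied_to_author or None).
# None marks a post that initiates a thread.
posts = [
    ("alice", "AAPL", None),
    ("bob",   "AAPL", "alice"),
    ("carol", "AAPL", "alice"),
    ("alice", "AAPL", "bob"),
    ("alice", "TSLA", None),
    ("bob",   "TSLA", "alice"),
]

def user_features(posts):
    """Compute features 1-5 per user, aggregated over the whole log."""
    feats = defaultdict(lambda: {
        "threads_initiated": 0,
        "replies_made": 0,
        "replies_to": set(),   # users this user replied to
        "replied_by": set(),   # users who replied to this user
        "boards": set(),       # boards this user posted on
    })
    for author, board, parent in posts:
        f = feats[author]
        f["boards"].add(board)
        if parent is None:
            f["threads_initiated"] += 1
        else:
            f["replies_made"] += 1
            f["replies_to"].add(parent)
            feats[parent]["replied_by"].add(author)
    # Collapse the sets into counts so every feature is numeric.
    return {
        user: {
            "threads_initiated": f["threads_initiated"],
            "replies_made": f["replies_made"],
            "users_replied_to": len(f["replies_to"]),
            "users_replying": len(f["replied_by"]),
            "boards_active": len(f["boards"]),
        }
        for user, f in feats.items()
    }

features = user_features(posts)
print(features["alice"])
```

Each user ends up with one numeric row, which is the shape the later normalization and clustering steps expect.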

Methodology

• Data Preprocessing

• Feature Selection/Extraction

• Clustering

• Role Inference

Data Preprocessing

• We use Min-Max Normalization

• Normalize data to the range [0, 1]
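Min-max normalization rescales each feature column as (x − min) / (max − min). A minimal sketch follows; the sample values are made up, and mapping a constant column to 0 is an assumption of this sketch, not something the slides specify.

```python
def min_max_normalize(column):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Assumption: a constant column carries no information, map it to 0.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]

# Hypothetical "replies made" counts for four users.
replies_made = [0, 5, 20, 100]
normalized = min_max_normalize(replies_made)
print(normalized)
```

Without this step, features with large raw ranges (such as post counts) would dominate the Euclidean distances used by K-means.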

Feature Selection

• Step 1 – Feature Extraction

• Do Principal Component Analysis

• Run K-means on the projected data and extract the cluster labels

• Step 2 – Feature importance using Random Forest classifier

Feature Extraction using PCA

Principal Component    % Variance    Cumulative % Variance
1                      62.16         62.16
2                      15.07         77.23
3                      7.95          85.18
4                      5.74          90.92
5                      3.57          94.49
6                      1.67          96.16
7                      1.48          97.64
8                      0.68          98.32
9                      0.59          98.91
10                     0.55          99.46
11                     0.54          100

Scree Plot
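The cumulative column of the PCA table is a running sum of the per-component variance percentages. The sketch below reproduces it and finds how many components are needed to reach a given threshold. The 90% threshold is an illustrative assumption; the slides do not state how many components were kept.

```python
from itertools import accumulate

# Per-component variance percentages, copied from the PCA table.
variance_pct = [62.16, 15.07, 7.95, 5.74, 3.57, 1.67, 1.48, 0.68, 0.59, 0.55, 0.54]

# Running sum, rounded to two decimals like the table.
cumulative_pct = [round(c, 2) for c in accumulate(variance_pct)]

def components_needed(cumulative, threshold_pct):
    """Smallest number of components whose cumulative variance reaches the threshold."""
    for k, c in enumerate(cumulative, start=1):
        if c >= threshold_pct:
            return k
    return len(cumulative)

k90 = components_needed(cumulative_pct, 90.0)  # threshold is an assumption
print(cumulative_pct)
print(k90)
```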

Choosing the number of clusters

Elbow Plot

• Plot the Within Group Sum of Squares versus K, and look for the “elbow point” in the plot.

• The first clusters will add much information (explain a lot of variance), but at some point the marginal gain will drop, giving an angle in the graph.

• Choose the number after the last big drop.

• This "elbow" cannot always be unambiguously identified.
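The elbow procedure can be sketched in plain Python: run K-means for each candidate K, keep the best Within Group Sum of Squares (WCSS) over several seeds, and inspect how the values drop. This is a simplified Lloyd's algorithm written for illustration, not the implementation used in the study. The toy data is synthetic, and the defaults of 10 restarts and 300 iterations mirror the run settings reported for the final clustering.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_wcss(points, k, max_iter=300, seed=0):
    """One run of Lloyd's algorithm; returns the within-group sum of squares."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize from k distinct data points
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Move each centroid to its cluster mean; keep it if the cluster is empty.
        new_centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)

def elbow_curve(points, k_values, n_restarts=10):
    """Best WCSS over several seeds for each candidate K."""
    return {
        k: min(kmeans_wcss(points, k, seed=s) for s in range(n_restarts))
        for k in k_values
    }

# Synthetic toy data: three tight groups in the unit square.
rng = random.Random(42)
centers = [(0.1, 0.1), (0.5, 0.9), (0.9, 0.2)]
points = [
    (cx + rng.uniform(-0.05, 0.05), cy + rng.uniform(-0.05, 0.05))
    for cx, cy in centers
    for _ in range(30)
]

wcss = elbow_curve(points, range(1, 7))
for k, value in wcss.items():
    print(k, round(value, 3))
```

On data like this the printed values should fall steeply up to the true number of groups and flatten afterwards, which is the "elbow" the slide describes.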

Silhouette Coefficient

a(i) is the average dissimilarity of point i to all other data within the same cluster.

b(i) is the lowest average dissimilarity of point i to any other cluster, of which i is not a member.
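The two quantities combine into the standard silhouette coefficient s(i) = (b(i) − a(i)) / max(a(i), b(i)), which lies in [−1, 1]. The sketch below computes it directly from the definitions. Using Euclidean distance as the dissimilarity and scoring singleton clusters as 0 are assumptions of this sketch. It expects at least two clusters.

```python
import math

def silhouette_scores(points, labels):
    """Per-point silhouette coefficient s(i) = (b - a) / max(a, b)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)  # assumption: singleton clusters score 0
            continue
        # a(i): mean distance to the other members of i's own cluster.
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b(i): lowest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        scores.append((b - a) / max(a, b))
    return scores

# Toy example: two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
good = silhouette_scores(points, [0, 0, 1, 1])   # sensible labeling
bad = silhouette_scores(points, [0, 1, 0, 1])    # labels cut across the pairs
print(sum(good) / len(good), sum(bad) / len(bad))
```

Averaging s(i) over all points gives one score per candidate K, which offers a second opinion when the elbow is ambiguous.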

Feature Selection

• Train a Random Forest classifier using all the features and labels assigned by K-means.

• Feature importance is defined as the total decrease in node impurity (weighted by the probability of reaching that node, which is approximated by the proportion of samples reaching that node) averaged over all trees of the ensemble.
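To make the definition concrete, the sketch below computes the weighted impurity decrease for a single split. Gini impurity is an assumption here, since the slide only says "node impurity", and the labels are toy values. A feature's importance would then be the sum of these terms over every node that splits on that feature, averaged over the trees.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_impurity_decrease(parent, left, right, n_total):
    """Impurity decrease of one split, weighted by the share of samples reaching the node."""
    n_node = len(parent)
    decrease = (
        gini(parent)
        - len(left) / n_node * gini(left)
        - len(right) / n_node * gini(right)
    )
    return (n_node / n_total) * decrease

# Toy node holding two K-means cluster labels, with 4 samples in the whole tree.
parent = [0, 0, 1, 1]
perfect = weighted_impurity_decrease(parent, [0, 0], [1, 1], n_total=4)
useless = weighted_impurity_decrease(parent, [0, 1], [0, 1], n_total=4)
print(perfect, useless)
```

A split that separates the cluster labels cleanly earns the full parent impurity, while a split that leaves both children as mixed as the parent earns nothing.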

Clustering Users

• Applied K-Means clustering with K=4.

• Run 10 times with different seeds.

• 300 iterations in a single run.

Clusters     User Count    % of Users
Cluster 1    47295         91.7
Cluster 2    360           0.73
Cluster 3    3322          6.44
Cluster 4    581           1.13

Cluster Analysis

Initiation of posts by users of each cluster

Cluster      Post Initiation Share    Initiation Per User
Cluster 1    30%                      10.9
Cluster 2    22%                      1066.6
Cluster 3    44%                      228.2
Cluster 4    3%                       98.3

Cluster Analysis

Replies by users of each cluster

Cluster      Reply Share    Reply Per User
Cluster 1    22%            17.5
Cluster 2    28%            2946.3
Cluster 3    47%            534.9
Cluster 4    4%             255.9
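The share and per-user figures are linked through the cluster sizes: a cluster's total is its user count times its per-user average, and its share is that total divided by the sum over all clusters. The sketch below checks this with the numbers from the slides. It reads the fused chart label "228.298.3" as 228.2 and 98.3, which is an assumption.

```python
# User counts per cluster, from the clustering table.
user_counts = {"Cluster 1": 47295, "Cluster 2": 360, "Cluster 3": 3322, "Cluster 4": 581}

# Per-user averages, from the bar-chart data labels.
initiations_per_user = {"Cluster 1": 10.9, "Cluster 2": 1066.6, "Cluster 3": 228.2, "Cluster 4": 98.3}
replies_per_user = {"Cluster 1": 17.5, "Cluster 2": 2946.3, "Cluster 3": 534.9, "Cluster 4": 255.9}

def share_pct(per_user, counts):
    """Share of the overall total contributed by each cluster, in whole percent."""
    totals = {c: per_user[c] * counts[c] for c in counts}
    grand_total = sum(totals.values())
    return {c: round(100 * totals[c] / grand_total) for c in counts}

initiation_share = share_pct(initiations_per_user, user_counts)
reply_share = share_pct(replies_per_user, user_counts)
print(initiation_share)
print(reply_share)
```

The recomputed shares agree with the pie-chart percentages, so the two views of each chart are consistent with each other.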

Cluster Analysis

[Figure: Inter-Cluster reply %, a stacked bar chart over Cluster 1 to Cluster 4 with one series per cluster. Extracted data labels, in source order: 34, 22, 24, 18, 20, 27, 11, 25, 41, 49, 22, 44, 4, 2, 42, 2.]

Cluster Analysis

Feature 3: Number of users a user replies to

Cluster Analysis

Feature 4: Number of users who reply to a user

Role Inference

• Cluster 1: Lurkers
  • Posts initiated per user and replies made per user are both very low.

• Cluster 2: Super Users
  • Very active. Contribute the most per user to the boards. Engage with a lot of users.

• Cluster 3: Contributors
  • Account for 44% of total post initiations and 47% of total replies made. Have a low average response time, meaning they respond very quickly. Backbone of the forum.

• Cluster 4: Taciturns
  • Keep to themselves. Initiate very few posts but reply often, mostly to users in their own cluster.

Participation Inequality

Role            % of Users    Content Contributed (%)
Lurkers         91.73         24
Super-Users     0.73          26
Contributors    6.44          46
Taciturns       1.13          4

Conclusion

• Users take up different roles in online communities, and clusters of users can be identified by their behavioral patterns.

• Participation Inequality exists on financial message boards.

Thank You!