Presentation at Socialcom2014: Gauging Heterogeneity in Online Consumer Behaviour Data: A Proximity...


Gauging Heterogeneity in Online Consumer Behaviour Data:

A Proximity Graph Approach

Natalie de Vries, Ahmed Shamsul Arefin, Pablo Moscato

The Priority Research Centre for Bioinformatics, Biomarker Discovery and Information-Based Medicine (CIBM), School of Electrical Engineering and Computer Science, Faculty of Engineering and Built Environment, The University of Newcastle, Australia

Agenda

• Introduction and objectives
• Dataset characteristics
• Outline of the study
• Methodology
• Results
• Significance of the work and future research directions
• Questions

Introduction

• Increase in online behaviours towards brands
• Increasing importance of social media in marketing strategies
• High levels of heterogeneity amongst consumers
• Need for clustering consumers or objects into similar groups

Middle-aged females

Middle-aged males

Retirees

Teenagers

Housewives

Introduction: Importance of Clustering in Marketing

“Brand lovers”

“Brand haters”

“Excited sharers”

“Online lurkers”

“Quiet supporters”

• Gaining insights into consumer behaviour
• Market segmentation
• Targeted marketing strategies
• Personalised marketing messages
• Online technologies available to personalise brand messages at a very small or individual level

“Old-fashioned way”

Modern “data-driven way”

Objectives of this Study

• Create an understanding of the natural groupings in a consumer cohort based on their online consumer behaviours towards a particular brand
• Find a suitable distance measure for analysing a specific dataset in a specific context
• Explore the use of meta-features for finding a more accurate partitioning of respondents
• Uncover the best way to cluster consumers, e.g. using raw data or a form of meta-features, and using either intra- or inter-construct relationships

Methodology: Dataset collection and preparation

Construct | Source | Code | Number of Items
Usage Intensity | (Jahn and Kunz 2012) | |
Functional Value | | FUV | 4
Hedonic Value | | HED | 4
Social Interaction Value | | SOC | 4
Customer Engagement | | CE | 5
Customer Loyalty | | LO | 6
Brand Involvement | (Carlson and O'Cass 2012) | INV | 6
Co-Creation Value | (O'Cass and Ngo 2011) | CCV | 6
SNS-Specific Loyalty Behaviours | (O'Cass and Carlson 2012) | |
Self-Brand-Congruency | (Hohenstein, Sirgy et al. 2007) | SBC | 5

Survey Constructs

Category No. | Explanation | Percentage of sample
1 | Fashion Brands | 31.54%
2 | Community, Charities, Personality and Sports Fan Pages | 23.99%
3 | Other Services | 19.68%
4 | Other Consumer Goods | 8.09%
5 | Hospitality (Restaurants, Cafes, Bars) | 7.28%
6 | Consumer Electronics | 7.01%
7 | Automotive | 2.43%

Respondents’ chosen brands’ categories

Methodology: Outline of the study

Methodology: Difference Meta-features

The difference between the values of two measured features may be able to distinguish between two given categories, even when neither feature can do so alone (De Paula et al., 2011).

Difference meta-features have previously been applied successfully to Alzheimer’s disease biomarker detection (De Paula et al. 2011; Arefin et al. 2012), both published in PLoS ONE.

[Study outline diagram: Data collection and pre-processing; Meta-features: pair-wise differences; Meta-features: pair-wise products; Intra- and inter-construct relationships; Distance computation; Data preparation]

[Figure: two panels plotting numbered samples in Class A and Class B; one shows f1, f2 and Meta-f, the other shows Meta-f only]
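The pair-wise difference idea on this slide can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the item names (`FUV1`, `HED2`, ...) combine the construct codes from the survey table with a hypothetical item index, the scores are invented, and the `scope` switch is one assumed way of separating intra-construct from inter-construct pairs.

```python
from itertools import combinations

# Hypothetical item columns: construct code plus item index (e.g. "FUV1").
# The construct codes come from the survey table; the scores are made up.
respondents = [
    {"FUV1": 5, "FUV2": 4, "HED1": 2, "HED2": 3},
    {"FUV1": 3, "FUV2": 3, "HED1": 6, "HED2": 7},
]


def construct_of(item):
    """Strip the trailing item index: 'FUV1' -> 'FUV'."""
    return item.rstrip("0123456789")


def difference_meta_features(row, scope="all"):
    """Return {'a-b': row[a] - row[b]} for item pairs in the chosen scope.

    scope is 'all', 'intra' (both items from the same construct)
    or 'inter' (items from different constructs).
    """
    features = {}
    for a, b in combinations(sorted(row), 2):
        same_construct = construct_of(a) == construct_of(b)
        if scope == "intra" and not same_construct:
            continue
        if scope == "inter" and same_construct:
            continue
        features[f"{a}-{b}"] = row[a] - row[b]
    return features


intra = difference_meta_features(respondents[0], scope="intra")
inter = difference_meta_features(respondents[0], scope="inter")
all_pairs = difference_meta_features(respondents[0])
print(intra)
print(len(inter), len(all_pairs))
```

With four items there are six pairs in total, two of them intra-construct and four inter-construct, so the number of meta-features grows quadratically with the number of survey items.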

Methodology: Product Meta-features

The product of the values of two measured features may be able to distinguish between two given categories, even when neither feature can do so alone.

This study is the first to trial the application of this idea.

Left, the values of f1 (blue) and f2 (red) do not distinguish the classes well but their product (meta-feature in green) does.


[Figure: two panels plotting samples 1-12 in Class A and Class B; one shows f1, f2 and Meta-f, the other shows Meta-f only]
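A small synthetic example may make the product meta-feature claim concrete. The values below are invented for illustration and are not taken from the study's data; the helper names are hypothetical. Each feature overlaps between the two classes, while the product does not.

```python
# Synthetic (f1, f2) pairs, invented for illustration only.
class_a = [(1, 4), (2, 2), (4, 1)]
class_b = [(3, 4), (4, 3), (2, 6)]


def value_range(values):
    """Smallest and largest value as a (low, high) tuple."""
    return min(values), max(values)


def ranges_overlap(r1, r2):
    """True when two closed intervals share at least one point."""
    return r1[0] <= r2[1] and r2[0] <= r1[1]


def product_meta_feature(pairs):
    """Pair-wise product f1 * f2 for every sample."""
    return [f1 * f2 for f1, f2 in pairs]


f1_overlap = ranges_overlap(
    value_range([f1 for f1, _ in class_a]),
    value_range([f1 for f1, _ in class_b]),
)
f2_overlap = ranges_overlap(
    value_range([f2 for _, f2 in class_a]),
    value_range([f2 for _, f2 in class_b]),
)
prod_a = product_meta_feature(class_a)
prod_b = product_meta_feature(class_b)
prod_overlap = ranges_overlap(value_range(prod_a), value_range(prod_b))

print(f1_overlap, f2_overlap, prod_overlap)
```

Here a single threshold on f1 or on f2 cannot split the classes, but any threshold between the two product values can.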

Methodology: Distance Computation and Dataset Variations

• Distance matrices computed for all 7 datasets
• Various distance/correlation metrics used on each of the dataset variations

Distance metrics:
• Pearson
• Spearman
• Robust
• Euclidean
• Cosine

Various datasets:
• Original
• Difference meta-features
• Product meta-features

Interactions:
• Intra-construct item relationships
• Inter-construct item relationships

Values of k for kNN Cliques: k=3, k=4, k=5, k=6

= 7 datasets and 140 graphs
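The counts on this slide can be reproduced with a short sketch. Several points are assumptions rather than details from the slides: correlations are turned into distances as 1 - r, the "Robust" metric is left out because the slides do not define it, and the seven datasets are read as the original data plus the "all", "intra" and "inter" variants of each meta-feature type (an inference from the results table). Vectors with zero variance would cause a division by zero here.

```python
import math
from itertools import product


def pearson(x, y):
    """Pearson correlation of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """Average ranks (1-based), so tied survey scores share a rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for pos in range(i, j + 1):
            result[order[pos]] = (i + j) / 2 + 1
        i = j + 1
    return result


def pearson_distance(x, y):
    return 1 - pearson(x, y)  # assumed conversion from correlation


def spearman_distance(x, y):
    return 1 - pearson(ranks(x), ranks(y))


def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cosine_distance(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1 - dot / (norm_x * norm_y)


# Assumed reading of the "7 datasets"; the labels are hypothetical.
datasets = ["original",
            "difference/all", "difference/intra", "difference/inter",
            "product/all", "product/intra", "product/inter"]
metrics = ["pearson", "spearman", "robust", "euclidean", "cosine"]
k_values = [3, 4, 5, 6]

graphs = list(product(datasets, metrics, k_values))
print(len(graphs))  # 7 datasets x 5 metrics x 4 values of k
```

Under this reading, 7 x 5 x 4 gives the 140 graphs stated on the slide.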

Methodology: MST-kNN and kNN Cliques

Complete graph → Minimum Spanning Tree → Select and remove edges that are not k-Nearest Neighbours

Final forest (a forest is a set of trees) = clusters
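The three steps pictured above can be sketched in plain Python as follows. This is a simplified single pass under stated assumptions, not the published MST-kNN implementation: an MST edge (u, v) is kept when either endpoint is among the other's k nearest neighbours (an assumed symmetric rule), k is passed in by hand, and neither the recursive application to each component nor the merge with kNN cliques is shown. All function names are hypothetical.

```python
def knn_sets(dist, k):
    """For each node, the set of its k nearest other nodes."""
    n = len(dist)
    return [
        set(sorted((j for j in range(n) if j != i),
                   key=lambda j: dist[i][j])[:k])
        for i in range(n)
    ]


def minimum_spanning_tree(dist):
    """Prim's algorithm on the complete graph; returns (u, v) edges."""
    n = len(dist)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        u, v = min(
            ((a, b) for a in in_tree for b in range(n) if b not in in_tree),
            key=lambda e: dist[e[0]][e[1]],
        )
        edges.append((u, v))
        in_tree.add(v)
    return edges


def mst_knn_clusters(dist, k):
    """Drop MST edges that are not kNN edges; return the forest's trees."""
    neighbours = knn_sets(dist, k)
    kept = [(u, v) for u, v in minimum_spanning_tree(dist)
            if v in neighbours[u] or u in neighbours[v]]

    # Connected components of the remaining forest via union-find.
    parent = list(range(len(dist)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in kept:
        parent[find(u)] = find(v)

    clusters = {}
    for node in range(len(dist)):
        clusters.setdefault(find(node), []).append(node)
    return sorted(clusters.values())


# Toy data: two well-separated groups of points on a line.
points = [0, 1, 2, 10, 11, 12]
dist = [[abs(p - q) for q in points] for p in points]
print(mst_knn_clusters(dist, k=2))
```

In the toy data the only long MST edge joins the two groups; neither endpoint lists the other among its two nearest neighbours, so the edge is removed and the forest splits into two clusters.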

Previous applications of the MST-kNN method

• U.S. Stock market time series data (Inostroza-Ponta, Berretta, & Moscato, 2011)

• Yeast gene expression data (Inostroza-Ponta, Mendes, Berretta, & Moscato, 2007)

• Alzheimer’s disease data - in the order of 1 million data elements (Arefin, Mathieson, Johnstone, Berretta, & Moscato, 2012)

• Prostate cancer data (Capp et al., 2009)

These examples show that the methodology proposed here has proven scalability to larger datasets

MST-kNN + kNN Cliques Results

Results: Clustering Highlights

Heterogeneous cluster?

More homogeneous cluster?

And what about the statistical difference of the clustering result that these highlights came from?

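Results: Clustering and Significance Values

Data | Rows selected | Distance metric | MST-kNN merged with the kNN cliques of | p-value (Wilcoxon’s test) | p-value (Kruskal-Wallis)
Original | All | Robust | 5NN | 0.021187 | 0.042364
Original | All | Spearman | 6NN | 0.025987 | 0.051962
Original | All | Robust | 6NN | 0.028565 | 0.057117
Original | All | Pearson | 3NN | 0.030232 | 0.060451
Original | All | Spearman | 3NN | 0.040661 | 0.081306
Original | All | Euclidean | 6NN | 0.041232 | 0.082448
Difference meta-features | ‘Intra’ constructs | Robust | 3NN | 0.016551 | 0.033095
Difference meta-features | ‘Intra’ constructs | Robust | 6NN | 0.017177 | 0.03434
Difference meta-features | ‘Intra’ constructs | Pearson | 3NN | 0.018628 | 0.0372481
Difference meta-features | ‘Intra’ constructs | Pearson | 6NN | 0.019066 | 0.038124
Difference meta-features | ‘Intra’ constructs | Pearson | 5NN | 0.019656 | 0.039303
Difference meta-features | All | Pearson | 3NN | 0.020594 | 0.041180
Product meta-features | ‘Inter’ constructs | Spearman | 3NN | 0.016949 | 0.033891
Product meta-features | ‘Inter’ constructs | Pearson | 4NN | 0.01757 | 0.035132
Product meta-features | All | Pearson | 4NN | 0.017721 | 0.035433
Product meta-features | ‘Inter’ constructs | Pearson | 6NN | 0.01781 | 0.035611
Product meta-features | ‘Inter’ constructs | Pearson | 3NN | 0.017816 | 0.035624
Product meta-features | ‘Inter’ constructs | Robust | 4NN | 0.017998 | 0.035988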

Results: Analysis of clusters

Cluster | No. of respondents | Avg. age | Age range | % Males / Females
1 | 103 | 20.5 | 17-32 | 39.8 / 60.2
2 | 92 | 21.3 | 18-36 | 39.1 / 60.9
3 | 31 | 23.4 | 19-49 | 51.6 / 48.4
4 | 71 | 21.0 | 18-44 | 40.8 / 59.2
5 | 4 | 22.3 | 20-24 | 75 / 25
6 | 18 | 21.1 | 18-26 | 33.3 / 66.7
7 | 10 | 22.5 | 18-29 | 20 / 80
8 | 5 | 21 | 20-24 | 80 / 20
9 | 20 | 23 | 19-44 | 45 / 55
10 | 12 | 22 | 18-45 | 41.7 / 58.3
11 | 5 | 26.4 | 20-46 | 0 / 100

Clusters’ demographic information

This figure presents the frequencies of the respondents’ chosen brand categories for two of the largest clusters

The difference in degrees of heterogeneity between different clusters can be seen in these figures.

Furthermore, these two clusters highlight the differences in brand preferences amongst respondents that do exist within each cluster of similar consumers

Heterogeneous spread of respondents’ chosen brand categories

Contribution and Significance

• Methodological guide for investigating several distance measures, meta-features and relationships between theoretical construct items to find the ‘best’ clustering results

• Expanded on the MST-kNN clustering method for increased potential to find statistically significant clusters of categories of consumers and their chosen brands

• The clustering methodology used in this study highlights the high levels of heterogeneity found in consumers’ online behaviours towards brands

Future Research Directions

• Various domains and contexts to apply the novel process outlined in this study

• Combine a study using survey data as well as ‘live’ behaviour data from social networking sites (real-time interactions)

• Further exploration of meta-features in both survey data and ‘real’ online behaviour clustering studies; ‘differences’ meta-features in this study yielded better results

• This study guides the development of future feature selection models to identify groups of consumers according to higher-order characteristics

Thank you

Questions?

We would like to thank Dr. Jamie Carlson and Mr. Benjamin Lucas for their advice and proofreading.

Dr. Jamie Carlson supervised Ms. de Vries’ thesis project and the initial collection and analysis of this data.

Thanks to Mario Inostroza-Ponta for the use of his MST-kNN images.

References (from paper)

• [1] I. P. Cvijikj and F. Michahelles, "Online engagement factors on Facebook brand pages," Social Network Analysis and Mining, vol. 3, pp. 843-861, 2013.
• [2] B. Jahn and W. Kunz, "How to transform consumers into fans of your brand," Journal of Service Management, vol. 23, pp. 344-361, 2012.
• [3] T. S. Chung and M. Wedel, "Adaptive personalization of mobile information services," in Handbook of Service Marketing Research, R. T. Rust and M.-H. Huang, Eds. Cheltenham: Edward Elgar Publishing Limited, 2014.
• [4] N. J. de Vries, J. Carlson, and P. Moscato, "A Data-Driven Approach to Reverse Engineering Customer Engagement Models: Towards Functional Constructs," PLoS ONE, vol. 9, p. e102768, 2014.
• [5] B. Jahn and W. Kunz, "How to Transform Consumers into Fans of your Brand," Journal of Service Management, vol. 23, pp. 344-361, 2012.
• [6] J. Carlson and A. O'Cass, "Optimizing the Online Channel in Professional Sport to Create Trusting and Loyal Consumers: The Role of the Professional Sports Team Brand and Service Quality," Journal of Sport Management, vol. 26, p. 463, 2012.
• [7] N. Hohenstein, M. J. Sirgy, A. Herrmann, and M. Heitmann, "Self-Congruity: Antecedents and Consequences," in 34th La Londe International Research Conference in Marketing Communications and Consumer Behaviour, Aix-en-Provence, France: University Paul Cezanne, 2007, pp. 118-130.
• [8] A. O'Cass and L. Ngo, "Examining the Firm’s Value Creation Process: A Managerial Perspective of the Firm’s Value Offering Strategy and Performance," British Journal of Management, vol. 22, pp. 646-671, 2011.
• [9] A. O'Cass and J. Carlson, "An Empirical Assessment of Consumers' Evaluations of Web Site Service Quality: Conceptualizing and Testing a Formative Model," Journal of Services Marketing, vol. 26, pp. 419-434, 2012.
• [10] L. D. Peters, "Theory Testing in Social Research," The Marketing Review, vol. 3, pp. 65-82, 2002.
• [11] M. R. de Paula, M. G. Ravetti, R. Berretta, and P. Moscato, "Differences in Abundances of Cell-Signalling Proteins in Blood Reveal Novel Biomarkers for Early Detection Of Clinical Alzheimer’s Disease," PLoS ONE, vol. 6, pp. 1-14, 2011.
• [12] A. S. Arefin, L. Mathieson, D. Johnstone, R. Berretta, and P. Moscato, "Unveiling Clusters of RNA Transcript Pairs Associated with Markers of Alzheimer's Disease Progression," PLoS ONE, vol. 7, Sep 21 2012.
• [13] M. Inostroza-Ponta, R. Berretta, A. Mendes, and P. Moscato, "An automatic graph layout procedure to visualize correlated data," in Artificial Intelligence in Theory and Practice, Springer, 2006, pp. 179-188.
• [14] A. S. Arefin, L. Mathieson, D. Johnstone, R. Berretta, and P. Moscato, "Unveiling clusters of RNA transcript pairs associated with markers of Alzheimer’s disease progression," PLoS ONE, vol. 7, p. e45535, 2012.
• [15] A. Capp, M. Inostroza-Ponta, D. Bill, P. Moscato, C. Lai, D. Christie, et al., "Is there more than one proctitis syndrome? A revisitation using data from the TROG 96.01 trial," Radiotherapy and Oncology, vol. 90, pp. 400-407, 2009.
• [16] M. Inostroza-Ponta, A. Mendes, R. Berretta, and P. Moscato, "An integrated QAP-based approach to visualize patterns of gene expression similarity," in Progress in Artificial Life, Springer, 2007, pp. 156-167.
• [17] M. Inostroza-Ponta, R. Berretta, and P. Moscato, "QAPgrid: A two level QAP-based approach for large-scale data analysis and visualization," PLoS ONE, vol. 6, p. e14468, 2011.
• [18] A. S. Arefin, M. Inostroza-Ponta, L. Mathieson, R. Berretta, and P. Moscato, "Clustering nodes in large-scale biological networks using external memory algorithms," in Algorithms and Architectures for Parallel Processing, Springer, 2011, pp. 375-386.
• [19] A. S. Arefin, C. Riveros, R. Berretta, and P. Moscato, "Gpu-fs-knn: A software tool for fast and scalable knn computation using GPUs," PLoS ONE, vol. 7, p. e44000, 2012.
• [20] A. S. Arefin, C. Riveros, R. Berretta, and P. Moscato, "kNN-Borůvka-GPU: A Fast and Scalable MST Construction from kNN Graphs on GPU," in Computational Science and Its Applications–ICCSA 2012, Springer, 2012, pp. 71-86.
• [21] A. S. Arefin, C. Riveros, R. Berretta, and P. Moscato, "kNN-MST-Agglomerative: A fast and scalable graph-based data clustering approach on GPU," in Computer Science & Education (ICCSE), 2012 7th International Conference on, 2012, pp. 585-590.
• [22] E. J. Chesler and M. A. Langston, Combinatorial genetic regulatory network analysis tools for high throughput transcriptomic data: Springer, 2006.
• [23] M. Hollander, D. A. Wolfe, and E. Chicken, Nonparametric statistical methods, vol. 751: John Wiley & Sons, 2013.
• [24] C. E. Shannon, "The mathematical theory of communication," The Bell System Technical Journal, vol. 27, pp. 379-423 & 623-656, 1948.
• [25] A. W. Kruglanski, "The Human Subject in the Psychology Experiment: Fact and Artifact," in Advances in Experimental Social Psychology, vol. 8, L. Berkowitz, Ed. New York: Academic Press, 1975, pp. 101-147.
• [26] H. Krasnova, S. Spiekermann, K. Koroleva, and T. Hildebrand, "Online Social Networks: Why we Disclose," Journal of Information Technology, vol. 25, pp. 109-125, 2010.
• [27] S. C. Chu and Y. Kim, "Determinants of Consumer Engagement in electronic Word of Mouth (eWoM) in Social Networking Sites," International Journal of Advertising, vol. 30, pp. 47-75, 2011.
• [28] J. M. Pinho and A. M. Soares, "Examining the Technology Acceptance Model in the Adoption of Social Networks," Journal of Research in Interactive Marketing, vol. 5, pp. 116-129, 2011.

Additional references from presentation:
• Arefin AS, Mathieson L, Johnstone D, Berretta R, Moscato P (2012) Unveiling Clusters of RNA Transcript Pairs Associated with Markers of Alzheimer’s Disease Progression. PLoS ONE 7(9): e45535. doi: 10.1371/journal.pone.0045535
• Rocha de Paula M, Gómez Ravetti M, Berretta R, Moscato P (2011) Differences in Abundances of Cell-Signalling Proteins in Blood Reveal Novel Biomarkers for Early Detection Of Clinical Alzheimer's Disease. PLoS ONE 6(3): e17481. doi: 10.1371/journal.pone.0017481

