Post on 31-Jul-2015
Gauging Heterogeneity in Online Consumer Behaviour Data:
A Proximity Graph Approach
Natalie de Vries, Ahmed Shamsul Arefin, Pablo Moscato
The Priority Research Centre for Bioinformatics, Biomarker Discovery and Information-Based Medicine (CIBM)
School of Electrical Engineering and Computer Science
Faculty of Engineering and Built Environment
The University of Newcastle, Australia
Agenda
• Introduction and objectives
• Dataset characteristics
• Outline of the study
• Methodology
• Results
• Significance of the work and future research directions
• Questions
Introduction
• Increase in online behaviours towards brands
• Increasing importance of social media in marketing strategies
• High levels of heterogeneity amongst consumers
• Need for clustering consumers or objects into similar groups
Introduction: Importance of Clustering in Marketing
• Gaining insights into consumer behaviour
• Market segmentation
• Targeted marketing strategies
• Personalised marketing messages
• Online technologies available to personalise brand messages at a very small or individual level
“Old-fashioned way”: Middle-aged females, Middle-aged males, Retirees, Teenagers, Housewives
Modern “data-driven way”: “Brand lovers”, “Brand haters”, “Excited sharers”, “Online lurkers”, “Quiet supporters”
Objectives of this Study
• Create an understanding of the natural groupings in a consumer cohort based on their online consumer behaviours towards a particular brand
• Find a suitable distance measure for analysing a specific dataset in a specific context
• Explore the use of meta-features for finding a more accurate partitioning of respondents
• Uncover the best way to cluster consumers, e.g. using raw data or a form of meta-features, and using either intra- or inter-construct relationships
Methodology: Dataset collection and preparation
Construct | Source | Code | Number of Items
Usage Intensity | (Jahn and Kunz 2012) | UI | 3
Functional Value | | FUV | 4
Hedonic Value | | HED | 4
Social Interaction Value | | SOC | 4
Customer Engagement | | CE | 5
Customer Loyalty | | LO | 6
Brand Involvement | (Carlson and O'Cass 2012) | INV | 6
Co-Creation Value | (O'Cass and Ngo 2011) | CCV | 6
SNS-Specific Loyalty Behaviours | (O'Cass and Carlson 2012) | ON | 3
Self-Brand-Congruency | (Hohenstein, Sirgy et al. 2007) | SBC | 5
Survey Constructs
Category No. | Explanation | Percentage of sample
1 | Fashion Brands | 31.54%
2 | Community, Charities, Personality and Sports Fan Pages | 23.99%
3 | Other Services | 19.68%
4 | Other Consumer Goods | 8.09%
5 | Hospitality (Restaurants, Cafes, Bars) | 7.28%
6 | Consumer Electronics | 7.01%
7 | Automotive | 2.43%
Respondents’ chosen brands’ categories
Methodology: Outline of the study
Methodology: Difference Meta-features
The difference in value between two measured features may distinguish between two given categories, even when neither feature can do so alone (De Paula et al., 2011).
Difference meta-features have previously been applied successfully in Alzheimer’s disease biomarker detection (De Paula et al., 2011; Arefin et al., 2012), both in PLoS ONE.
[Figure: study outline. Stages shown: Data collection and pre-processing; Meta-features: pair-wise differences; Meta-features: pair-wise products; Intra- and inter-construct relationships; Distance computation; Data preparation]
[Figure: two plots of f1, f2 and the difference meta-feature (Meta-f) across samples, with the samples grouped into Class A and Class B]
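The idea of a difference meta-feature can be illustrated with a minimal sketch. The toy values and the helper name `difference_meta_features` are hypothetical and are not taken from the study; the sketch only assumes that a meta-feature is the pair-wise difference of two item scores.

```python
from itertools import combinations


def difference_meta_features(row, names):
    """Return {"a-b": value_a - value_b} for every unordered pair of features."""
    values = dict(zip(names, row))
    return {f"{a}-{b}": values[a] - values[b] for a, b in combinations(names, 2)}


# Hypothetical toy data (not from the study): the ranges of f1 and of f2
# overlap between the two classes, so neither feature separates them alone.
class_a = [(2, 5), (6, 9), (4, 7)]  # f1 - f2 is -3 for every sample
class_b = [(5, 2), (9, 6), (7, 4)]  # f1 - f2 is +3 for every sample

names = ["f1", "f2"]
diff_a = [difference_meta_features(r, names)["f1-f2"] for r in class_a]
diff_b = [difference_meta_features(r, names)["f1-f2"] for r in class_b]

# The difference meta-feature separates the classes cleanly.
print(diff_a, diff_b)
```

With real survey data, `names` would hold the item codes and each respondent would contribute one row.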
Methodology: Product Meta-features
The product of the values of two measured features may distinguish between two given categories, even when neither feature can do so alone.
This study is the first to trial the application of this idea.
Left, the values of f1 (blue) and f2 (red) do not distinguish the classes well but their product (meta-feature in green) does.
[Figure: two plots of f1, f2 and the product meta-feature (Meta-f) across samples, with the samples grouped into Class A and Class B]
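A minimal sketch of how pair-wise product meta-features might be generated, with an optional restriction to intra- or inter-construct pairs. The function name, the `scope` parameter and the item names (`UI1`, `UI2`, `HED1`) are hypothetical. Reading "intra" as "both items belong to the same construct" and "inter" as "the items belong to different constructs" is an assumption, not something the slides spell out.

```python
from itertools import combinations


def product_meta_features(row, items, scope="all"):
    """Pair-wise products of item scores.

    row   : item scores for one respondent
    items : list of (item_name, construct_code), aligned with row
    scope : "all", "intra" (same construct only) or "inter" (different constructs only)
    """
    out = {}
    for (i, (name_i, c_i)), (j, (name_j, c_j)) in combinations(enumerate(items), 2):
        same_construct = c_i == c_j
        if scope == "intra" and not same_construct:
            continue
        if scope == "inter" and same_construct:
            continue
        out[f"{name_i}*{name_j}"] = row[i] * row[j]
    return out


# Hypothetical respondent: two Usage Intensity items and one Hedonic Value item.
items = [("UI1", "UI"), ("UI2", "UI"), ("HED1", "HED")]
row = [5, 4, 2]

print(product_meta_features(row, items, "all"))
print(product_meta_features(row, items, "intra"))
print(product_meta_features(row, items, "inter"))
```

Replacing the multiplication with a subtraction gives the corresponding difference meta-features under the same three scopes.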
Methodology: Distance Computation and Dataset Variations
• Distance matrices computed for all 7 datasets
• Various distance/correlation metrics used on each of the dataset variations
Distance metrics:
• Pearson
• Spearman
• Robust
• Euclidean
• Cosine
Various datasets:
• Original
• Difference meta-features
• Product meta-features
Interactions:
• Intra-construct item relationships
• Inter-construct item relationships
Values of k for kNN cliques: k=3, k=4, k=5, k=6
= 7 datasets and 140 graphs
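The following sketch shows how three of the listed metrics could be turned into distance matrices, and how 7 datasets, 5 metrics and 4 values of k give 140 graphs. Several points are assumptions: using 1 minus the correlation as the Pearson distance, the breakdown of the 7 datasets into original plus all/intra/inter variants of each meta-feature type, and all function and label names. The Spearman and "Robust" measures are omitted because the slides do not define them.

```python
import math
from itertools import product


def pearson_distance(x, y):
    """1 - Pearson correlation (undefined if either profile is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)


def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cosine_distance(x, y):
    """1 - cosine similarity (undefined for an all-zero profile)."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (nx * ny)


def distance_matrix(rows, dist):
    """Full symmetric matrix of pair-wise distances between respondents."""
    return [[dist(r, s) for s in rows] for r in rows]


# Hypothetical respondents (rows of item scores), not from the study.
rows = [[1, 2, 3], [2, 4, 6], [3, 2, 1]]
d_pearson = distance_matrix(rows, pearson_distance)

# Assumed breakdown of the 7 dataset variations.
datasets = ["original",
            "difference-all", "difference-intra", "difference-inter",
            "product-all", "product-intra", "product-inter"]
metrics = ["pearson", "spearman", "robust", "euclidean", "cosine"]
ks = [3, 4, 5, 6]

configurations = list(product(datasets, metrics, ks))
print(len(configurations))  # 7 * 5 * 4 = 140
```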
Methodology: MST-kNN and kNN Cliques
Steps: complete graph; minimum spanning tree; select and remove edges that are not k-nearest neighbours; final forest (a forest is a set of trees) = clusters
Previous applications of the MST-kNN method
• U.S. stock market time series data (Inostroza-Ponta, Berretta, & Moscato, 2011)
• Yeast gene expression data (Inostroza-Ponta, Mendes, Berretta, & Moscato, 2007)
• Alzheimer’s disease data, in the order of 1 million data elements (Arefin, Mathieson, Johnstone, Berretta, & Moscato, 2012)
• Prostate cancer data (Capp et al., 2009)
These examples show that the methodology proposed here scales to larger datasets.
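The steps of the MST-kNN idea can be sketched in plain Python as follows. This is an illustrative single pass under stated assumptions, not the authors' implementation: an MST edge is kept here when either endpoint lists the other among its k nearest neighbours, whereas the published method may define the rule differently and may apply the step recursively. The merge with kNN cliques is not shown, and all function names are hypothetical.

```python
def mst_edges(dist):
    """Minimum spanning tree of the complete graph (Kruskal with union-find)."""
    n = len(dist)
    parent = list(range(n))

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    edges = sorted((dist[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    tree = []
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append((i, j))
    return tree


def knn_sets(dist, k):
    """For each node, the set of its k nearest other nodes."""
    n = len(dist)
    return [set(sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:k])
            for i in range(n)]


def mst_knn_clusters(dist, k):
    """Drop MST edges that are not kNN edges; the remaining trees are the clusters."""
    n = len(dist)
    nn = knn_sets(dist, k)
    kept = [(i, j) for i, j in mst_edges(dist) if j in nn[i] or i in nn[j]]
    adjacency = {i: [] for i in range(n)}
    for i, j in kept:
        adjacency[i].append(j)
        adjacency[j].append(i)
    seen, clusters = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            u = stack.pop()
            component.append(u)
            for v in adjacency[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        clusters.append(sorted(component))
    return clusters


# Hypothetical one-dimensional data: two well-separated groups of three points.
points = [0, 1, 2, 10, 11, 12]
dist = [[abs(p - q) for q in points] for p in points]
clusters = mst_knn_clusters(dist, k=2)
print(clusters)
```

In this toy case the only MST edge removed is the long one joining the two groups, because neither of its endpoints is among the other's two nearest neighbours.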
MST-kNN + kNN Cliques Results
Results: Clustering Highlights
Heterogeneous cluster?
More homogeneous cluster?
And what about the statistical difference of the clustering result that these highlights came from?
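Results: Clustering and Significance Values
Data | Rows selected | Distance Metric | MST-kNN merged with the kNN cliques of size | p-value (Wilcoxon’s Test) | p-value (Kruskal-Wallis)
Original | All | Robust | 5NN | 0.021187 | 0.042364
Original | All | Spearman | 6NN | 0.025987 | 0.051962
Original | All | Robust | 6NN | 0.028565 | 0.057117
Original | All | Pearson | 3NN | 0.030232 | 0.060451
Original | All | Spearman | 3NN | 0.040661 | 0.081306
Original | All | Euclidean | 6NN | 0.041232 | 0.082448
Difference Metafeatures | ‘Intra’ constructs | Robust | 3NN | 0.016551 | 0.033095
Difference Metafeatures | ‘Intra’ constructs | Robust | 6NN | 0.017177 | 0.03434
Difference Metafeatures | ‘Intra’ constructs | Pearson | 3NN | 0.018628 | 0.0372481
Difference Metafeatures | ‘Intra’ constructs | Pearson | 6NN | 0.019066 | 0.038124
Difference Metafeatures | ‘Intra’ constructs | Pearson | 5NN | 0.019656 | 0.039303
Difference Metafeatures | All | Pearson | 3NN | 0.020594 | 0.041180
Product Metafeatures | ‘Inter’ constructs | Spearman | 3NN | 0.016949 | 0.033891
Product Metafeatures | ‘Inter’ constructs | Pearson | 4NN | 0.01757 | 0.035132
Product Metafeatures | All | Pearson | 4NN | 0.017721 | 0.035433
Product Metafeatures | ‘Inter’ constructs | Pearson | 6NN | 0.01781 | 0.035611
Product Metafeatures | ‘Inter’ constructs | Pearson | 3NN | 0.017816 | 0.035624
Product Metafeatures | ‘Inter’ constructs | Robust | 4NN | 0.017998 | 0.035988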
Results: Analysis of clusters
Cluster | No. of respondents | Avg. Age | Age range | % Males / Females
1 | 103 | 20.5 | 17-32 | 39.8 / 60.2
2 | 92 | 21.3 | 18-36 | 39.1 / 60.9
3 | 31 | 23.4 | 19-49 | 51.6 / 48.4
4 | 71 | 21.0 | 18-44 | 40.8 / 59.2
5 | 4 | 22.3 | 20-24 | 75 / 25
6 | 18 | 21.1 | 18-26 | 33.3 / 66.7
7 | 10 | 22.5 | 18-29 | 20 / 80
8 | 5 | 21 | 20-24 | 80 / 20
9 | 20 | 23 | 19-44 | 45 / 55
10 | 12 | 22 | 18-45 | 41.7 / 58.3
11 | 5 | 26.4 | 20-46 | 0 / 100
Clusters’ demographic information
This figure presents the frequencies of the respondents’ chosen brand categories for two of the largest clusters.
The difference in degrees of heterogeneity between different clusters can be seen in these figures.
Furthermore, these two clusters highlight the differences in brand preferences that exist among respondents within each cluster of similar consumers.
Heterogeneous spread of respondents’ chosen brand categories
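One possible way to put a number on the spread of brand categories within a cluster is normalised Shannon entropy. The slides do not say which measure, if any, was used, so the choice of entropy, the function names and the two example clusters below are all assumptions for illustration. Only `sample_shares` comes from the document: it holds the whole-sample percentages from the table of respondents' chosen brands' categories.

```python
import math


def shannon_entropy(counts):
    """Shannon entropy (in bits) of a distribution given as counts or shares."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)


def normalised_entropy(counts):
    """Entropy divided by its maximum, log2(number of categories).

    0 means every respondent chose the same category; 1 means an even spread.
    Requires at least two categories.
    """
    return shannon_entropy(counts) / math.log2(len(counts))


# Whole-sample shares (percent) of the 7 brand categories, from the slides.
sample_shares = [31.54, 23.99, 19.68, 8.09, 7.28, 7.01, 2.43]

# Hypothetical per-category counts for two clusters (not from the study).
mixed_cluster = [10, 9, 11, 10, 10, 9, 11]
focused_cluster = [64, 2, 1, 1, 1, 1, 0]

for label, counts in [("whole sample", sample_shares),
                      ("mixed cluster", mixed_cluster),
                      ("focused cluster", focused_cluster)]:
    print(label, round(normalised_entropy(counts), 3))
```

A cluster whose score is close to 1 is heterogeneous in its brand categories, while a score near 0 indicates a cluster dominated by one category.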
Contribution and Significance
• Methodological guide for the investigation of several distance measures, meta-features, relationships of theoretical construct items to find ‘best’ clustering results
• Expanded on the MST-kNN clustering method for increased potential to find statistically significant clusters of categories of consumers and their chosen brands
• The clustering methodology used in this study highlights the high levels of heterogeneity found in consumers’ online behaviours towards brands
Future Research Directions
• Various domains and contexts to apply the novel process outlined in this study
• Combine a study using survey data as well as ‘live’ behaviour data from social networking sites (real-time interactions)
• Further exploration of meta-features in both survey data and ‘real’ online behaviour clustering studies; ‘differences’ meta-features in this study yielded better results
• This study guides the development of future feature selection models to identify groups of consumers according to higher-order characteristics.
Thank you
Questions?
We would like to thank Dr. Jamie Carlson and Mr. Benjamin Lucas for their advice and proofreading. Dr. Jamie Carlson supervised Ms. de Vries’ thesis project and the initial collection and analysis of this data.
Thanks to Mario Inostroza-Ponta for the use of his MST-kNN images.
References (from paper)
• [1] I. P. Cvijikj and F. Michahelles, "Online engagement factors on Facebook brand pages," Social Network Analysis and Mining, vol. 3, pp. 843-861, 2013.
• [2] B. Jahn and W. Kunz, "How to transform consumers into fans of your brand," Journal of Service Management, vol. 23, pp. 344-361, 2012.
• [3] T. S. Chung and M. Wedel, "Adaptive personalization of mobile information services," in Handbook of Service Marketing Research, R. T. Rust and M.-H. Huang, Eds. Cheltenham: Edward Elgar Publishing Limited, 2014.
• [4] N. J. de Vries, J. Carlson, and P. Moscato, "A Data-Driven Approach to Reverse Engineering Customer Engagement Models: Towards Functional Constructs," PLoS ONE, vol. 9, p. e102768, 2014.
• [5] B. Jahn and W. Kunz, "How to Transform Consumers into Fans of your Brand," Journal of Service Management, vol. 23, pp. 344-361, 2012.
• [6] J. Carlson and A. O'Cass, "Optimizing the Online Channel in Professional Sport to Create Trusting and Loyal Consumers: The Role of the Professional Sports Team Brand and Service Quality," Journal of Sport Management, vol. 26, p. 463, 2012.
• [7] N. Hohenstein, M. J. Sirgy, A. Herrmann, and M. Heitmann, "Self-Congruity: Antecedents and Consequences," in 34th La Londe International Research Conference in Marketing Communications and Consumer Behaviour, Aix-en-Provence, France: University Paul Cezanne, 2007, pp. 118-130.
• [8] A. O'Cass and L. Ngo, "Examining the Firm’s Value Creation Process: A Managerial Perspective of the Firm’s Value Offering Strategy and Performance," British Journal of Management, vol. 22, pp. 646-671, 2011.
• [9] A. O'Cass and J. Carlson, "An Empirical Assessment of Consumers' Evaluations of Web Site Service Quality: Conceptualizing and Testing a Formative Model," Journal of Services Marketing, vol. 26, pp. 419-434, 2012.
• [10] L. D. Peters, "Theory Testing in Social Research," The Marketing Review, vol. 3, pp. 65-82, 2002.
• [11] M. R. de Paula, M. G. Ravetti, R. Berretta, and P. Moscato, "Differences in Abundances of Cell-Signalling Proteins in Blood Reveal Novel Biomarkers for Early Detection Of Clinical Alzheimer’s Disease," PLoS ONE, vol. 6, pp. 1-14, 2011.
• [12] A. S. Arefin, L. Mathieson, D. Johnstone, R. Berretta, and P. Moscato, "Unveiling Clusters of RNA Transcript Pairs Associated with Markers of Alzheimer's Disease Progression," PLoS ONE, vol. 7, Sep 21 2012.
• [13] M. Inostroza-Ponta, R. Berretta, A. Mendes, and P. Moscato, "An automatic graph layout procedure to visualize correlated data," in Artificial Intelligence in Theory and Practice, Springer, 2006, pp. 179-188.
• [14] A. S. Arefin, L. Mathieson, D. Johnstone, R. Berretta, and P. Moscato, "Unveiling clusters of RNA transcript pairs associated with markers of Alzheimer’s disease progression," PLoS ONE, vol. 7, p. e45535, 2012.
• [15] A. Capp, M. Inostroza-Ponta, D. Bill, P. Moscato, C. Lai, D. Christie, et al., "Is there more than one proctitis syndrome? A revisitation using data from the TROG 96.01 trial," Radiotherapy and Oncology, vol. 90, pp. 400-407, 2009.
• [16] M. Inostroza-Ponta, A. Mendes, R. Berretta, and P. Moscato, "An integrated QAP-based approach to visualize patterns of gene expression similarity," in Progress in Artificial Life, Springer, 2007, pp. 156-167.
• [17] M. Inostroza-Ponta, R. Berretta, and P. Moscato, "QAPgrid: A two level QAP-based approach for large-scale data analysis and visualization," PLoS ONE, vol. 6, p. e14468, 2011.
• [18] A. S. Arefin, M. Inostroza-Ponta, L. Mathieson, R. Berretta, and P. Moscato, "Clustering nodes in large-scale biological networks using external memory algorithms," in Algorithms and Architectures for Parallel Processing, Springer, 2011, pp. 375-386.
• [19] A. S. Arefin, C. Riveros, R. Berretta, and P. Moscato, "GPU-FS-kNN: A software tool for fast and scalable kNN computation using GPUs," PLoS ONE, vol. 7, p. e44000, 2012.
• [20] A. S. Arefin, C. Riveros, R. Berretta, and P. Moscato, "kNN-Borůvka-GPU: A Fast and Scalable MST Construction from kNN Graphs on GPU," in Computational Science and Its Applications–ICCSA 2012, Springer, 2012, pp. 71-86.
• [21] A. S. Arefin, C. Riveros, R. Berretta, and P. Moscato, "kNN-MST-Agglomerative: A fast and scalable graph-based data clustering approach on GPU," in Computer Science & Education (ICCSE), 2012 7th International Conference on, 2012, pp. 585-590.
• [22] E. J. Chesler and M. A. Langston, Combinatorial genetic regulatory network analysis tools for high throughput transcriptomic data: Springer, 2006.
• [23] M. Hollander, D. A. Wolfe, and E. Chicken, Nonparametric statistical methods vol. 751: John Wiley & Sons, 2013.
• [24] C. E. Shannon, "The mathematical theory of communication," The Bell System Technical Journal, vol. 27, pp. 379-423 & 623-656, 1948.
• [25] A. W. Kruglanski, "The Human Subject in the Psychology Experiment: Fact and Artifact," in Advances in Experimental Social Psychology vol. 8, L. Berkowitz, Ed. New York: Academic Press, 1975, pp. 101-147.
• [26] H. Krasnova, S. Spiekermann, K. Koroleva, and T. Hildebrand, "Online Social Networks: Why we Disclose," Journal of Information Technology, vol. 25, pp. 109-125, 2010.
• [27] S. C. Chu and Y. Kim, "Determinants of Consumer Engagement in electronic Word of Mouth (eWoM) in Social Networking Sites," International Journal of Advertising, vol. 30, pp. 47-75, 2011.
• [28] J. M. Pinho and A. M. Soares, "Examining the Technology Acceptance Model in the Adoption of Social Networks," Journal of Research in Interactive Marketing, vol. 5, pp. 116-129, 2011.
Additional references from presentation:
• Arefin AS, Mathieson L, Johnstone D, Berretta R, Moscato P (2012) Unveiling Clusters of RNA Transcript Pairs Associated with Markers of Alzheimer’s Disease Progression. PLoS ONE 7(9): e45535. doi: 10.1371/journal.pone.0045535
• Rocha de Paula M, Gómez Ravetti M, Berretta R, Moscato P (2011) Differences in Abundances of Cell-Signalling Proteins in Blood Reveal Novel Biomarkers for Early Detection Of Clinical Alzheimer's Disease. PLoS ONE 6(3): e17481. doi: 10.1371/journal.pone.0017481