Group Formation in Large Social Networks: Membership, Growth, and Evolution
Lars Backstrom, Dan Huttenlocher, Jon Kleinberg, Xiangyang Lan
Presented by Natalia Dragan
Data Mining Techniques (CS6/73015-001) Fall06 Kent State University
Outline
Introduction
Motivation
Present work contributions
Related work
Membership, Growth, Evolution
Conclusions
Introduction
Structure of society: people tend to come together and form groups
Why it is important to study
  To better understand decision-making behavior
  To track the early stages of an epidemic
  To follow the popularity of new ideas and technologies
Significant growth in the scale and richness of on-line communities and social media (MySpace, Facebook, LiveJournal, Flickr)
Problems with studying social processes
  Easy to build theoretical models (a subgraph branching out rapidly over links in the network, or a collection of small disconnected components growing in a ‘speckled’ fashion)
  Hard to make concrete empirical statements about these types of processes
  Lack of reasonable vocabulary for studying group evolution
Present work
Principles by which groups develop and evolve in large-scale social networks
Crucial point: focusing on networks where the members have explicitly identified themselves as a group or community (vs. an unsupervised graph clustering problem of inferring “community structures” in a network)
3 main types of questions: Membership, Growth, Change
Membership, Growth, Change
Membership
  Structural features that influence whether a given individual will join a particular group
Growth
  Structural features that influence whether a given group will grow significantly over time
Change
  How the focus of interest changes over time
  How these changes are correlated with changes in the set of group members
Related work
Identifying tightly-connected clusters within a given graph
  Dill et al. consider implicitly identified communities
  For a set of features (ZIP code, particular keyword, etc.) they consider the subgraph of the Web consisting of all pages containing that feature
Use of online social networks for data mining
  The structure of the communities is not exploited
Diffusion of innovations studies
  Property which is “diffusing” in the present work: membership in a given group
Diffusion of Innovations
Diffusion of Innovations is a theory that analyzes, as well as helps explain, the adoption of an innovation. In other words, it helps to explain the process of social change.
An innovation is an idea, practice, or object that is perceived as new by an individual or other unit of adoption. The perceived newness of the idea for the individual determines his/her reaction to it (Rogers, 1995).
In addition, diffusion is the process by which an innovation is communicated through certain channels over time among the members of a social system. Thus, the four main elements of the theory are the innovation, communication channels, time, and the social system.
http://hsc.usf.edu/~kmbrown/Diffusion_of_Innovations_Overview.htm
Related work on Diffusion of Innovation
How a social network evolves as its members' attributes change (Sarkar and Moore, Holme and Newman)
Social network evolution in a university setting (Kossinets and Watts)
Evolution of topics over time (Wang and McCallum)
Property which is “diffusing” in the present work: membership in a given group
Sources of data
LiveJournal
  Free on-line community with about 10 million members
  300,000 update their content in any 24-hour period
  Members maintain journals and individual and group blogs
  Members declare who their friends are and to which communities they belong
DBLP
  On-line database of computer science publications (about 400,000 papers)
  Friendship network: co-authorship of papers
  Community: conference
Community Membership
Study of processes by which individuals join communities in a social network
Fundamental question about the evolution of communities: who will join in the future?
Membership in a community is a “behavior” that spreads through the network
  Diffusion of innovations studies provide a perspective on this question
Dependence on number of friends: a start towards membership prediction
  Underlying premise in diffusion studies: an individual's probability of adopting a new behavior increases with the number of friends (K) already engaging in the behavior
Theoretical models concentrate on the effect of K, while the structural properties are more influential in determining membership
Approach hypothesis
For moderate values of K, an individual with K friends in a group is significantly more likely to join if these K friends are themselves mutual friends than if they are not
Dependence on number of friends
[Figure: community C at the 1st and 2nd snapshots; example with K = 3 and P(k) = 2/3]
Probability P(k) of joining a community = fraction of triples (u, C, k) in which u has joined C by the 2nd snapshot
u - user, C - community, k - number of friends of u in C at the 1st snapshot
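The estimate of P(k) described above could be computed along the following lines. This is a minimal sketch on toy data, not the authors' code: the function name `join_probability` and the dictionary layout of the two snapshots are assumptions made for illustration. The toy data mirrors the slide's example, with three users who each have K = 3 friends in C, two of whom join.

```python
from collections import defaultdict

def join_probability(friends, members_t1, members_t2):
    """Estimate P(k) from two snapshots (hypothetical data layout).

    friends:    dict user -> set of friends at the 1st snapshot
    members_t1: dict community -> set of members at the 1st snapshot
    members_t2: dict community -> set of members at the 2nd snapshot
    """
    total = defaultdict(int)   # number of triples (u, C, k) for each k
    joined = defaultdict(int)  # triples in which u is in C at the 2nd snapshot
    for community, members in members_t1.items():
        for user, user_friends in friends.items():
            if user in members:
                continue  # only non-members can join
            k = len(user_friends & members)
            if k == 0:
                continue  # users with no friends in C form no triple
            total[k] += 1
            if user in members_t2.get(community, set()):
                joined[k] += 1
    return {k: joined[k] / total[k] for k in total}

# Toy data: u1, u2, u3 each have 3 friends in C; u1 and u2 join.
friends = {
    "u1": {"a", "b", "c"},
    "u2": {"a", "b", "c"},
    "u3": {"a", "b", "c"},
    "a": {"u1", "u2", "u3"},
    "b": {"u1", "u2", "u3"},
    "c": {"u1", "u2", "u3"},
}
members_t1 = {"C": {"a", "b", "c"}}
members_t2 = {"C": {"a", "b", "c", "u1", "u2"}}
p = join_probability(friends, members_t1, members_t2)
print(p)  # P(3) = 2/3 for the toy data
```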
Law of Diminishing returns
In economics, diminishing returns is the short form of diminishing marginal returns. In a production system, having fixed and variable inputs, keeping the fixed inputs constant, as more of a variable input is applied, each additional unit of input yields less and less additional output. This concept is also known as the law of increasing opportunity cost or the law of diminishing returns.
http://en.wikipedia.org/wiki/Diminishing_returns
Dependence on number of friends: LiveJournal
Dependence on number of friends: DBLP
Discussion of results
The plots for LJ and DBLP have similar shapes, dominated by the “diminishing returns” property (the curve keeps increasing, but more and more slowly, even for large K)
P(2) > 2P(1): the benefit of having a second friend is particularly strong (S-shaped behavior)
The curve for LJ is quite smooth (about half a billion triples vs. 7.8 million for DBLP)
A broader range of features
Features related to the community C (11)
  Number of members (|C|)
  Number of individuals with a friend in C (the fringe of C)
  Number of edges with both ends in the community (|Ec|)
  etc.
Features related to an individual u and her set S of friends in community C (8)
  Number of friends in the community (|S|)
  Number of adjacent pairs in S
  Number of pairs in S connected via a path in Ec
  etc.
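A few of the listed features can be computed directly from an undirected friendship graph. The sketch below is illustrative only: the helper names (`build_adjacency`, `community_features`, `individual_features`) and the toy graph are assumptions, and only a subset of the features named on the slide is covered.

```python
from itertools import combinations

def build_adjacency(edges):
    """Turn a list of undirected edges into a dict node -> set of neighbours."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def community_features(adj, community):
    # Fringe: non-members with at least one friend in the community.
    fringe = {u for u in adj if u not in community and adj[u] & community}
    # Each internal edge is seen from both endpoints, hence the division by 2.
    internal_edges = sum(len(adj.get(v, set()) & community) for v in community) // 2
    return {"size": len(community), "fringe": len(fringe), "internal_edges": internal_edges}

def individual_features(adj, community, u):
    s = adj[u] & community  # S: friends of u inside the community
    adjacent_pairs = sum(1 for a, b in combinations(sorted(s), 2) if b in adj[a])
    return {"friends_in_c": len(s), "adjacent_pairs": adjacent_pairs}

# Toy graph: community {a, b, c}; u has two friends in it, v has one.
edges = [("a", "b"), ("b", "c"), ("u", "a"), ("u", "b"), ("v", "c"), ("w", "x")]
adj = build_adjacency(edges)
community = {"a", "b", "c"}
c_feats = community_features(adj, community)
u_feats = individual_features(adj, community, "u")
print(c_feats, u_feats)
```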
Estimating probability on a broader range of features
Decision-tree techniques were applied to these features to make advances in estimating the probability of an individual joining a community
The technique incorporates
  Individual's position in the network (structural features)
  Level of activity among members (group features)
Predictions for LJ and DBLP
[Figure: community C and its fringe at the 1st and 2nd snapshots, with a fringe member u]
Data point (u, C): the quantity estimated is the probability that u joins C
LJ: 17,076,344 data points, 875 communities; 14,448 joined a community
DBLP: 7,651,013 data points; 71,618 joined a community
20 decision trees were built to estimate the probability of joining
Splitting process for LJ
Each of the 875 communities has half of its fringe members included in the training set (each member independently with probability 0.5)
At each node in the decision tree, every possible feature and every binary split threshold for that feature were examined
Of all such pairs, the split producing the largest decrease in entropy was chosen
New splits were installed until there were fewer than 100 positive cases at the node
Leaf nodes predict the ratio of positive to total cases for that node
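The split-selection step (try every feature and every binary threshold, keep the pair with the largest entropy decrease) can be sketched as below. This is a generic illustration on toy data, not the implementation used in the study; the names `entropy` and `best_split`, and the choice of midpoints between observed values as candidate thresholds, are assumptions.

```python
import math

def entropy(pos, total):
    """Binary entropy (in bits) of a node with `pos` positives out of `total`."""
    if total == 0 or pos == 0 or pos == total:
        return 0.0
    p = pos / total
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def best_split(rows, labels):
    """Return (feature_index, threshold, entropy_decrease) of the best binary split.

    rows: list of feature vectors; labels: list of 0/1 outcomes.
    """
    n = len(rows)
    base = entropy(sum(labels), n)
    best = (None, None, 0.0)
    for f in range(len(rows[0])):
        values = sorted({r[f] for r in rows})
        # Candidate thresholds: midpoints between consecutive observed values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            after = (len(left) * entropy(sum(left), len(left))
                     + len(right) * entropy(sum(right), len(right))) / n
            if base - after > best[2]:
                best = (f, t, base - after)
    return best

# Toy data: feature 0 separates the classes perfectly at 2.5.
rows = [[1, 5], [2, 3], [3, 8], [4, 1]]
labels = [0, 0, 1, 1]
split = best_split(rows, labels)
print(split)
```

In a full tree this search would be repeated at each new node until the stopping rule from the slide is met, and each leaf would predict its ratio of positive to total cases.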
Averaging technique
For every case, predictions were taken from the decision trees whose training sets did not include that case
The average of these predictions is the prediction for the case
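The averaging step can be sketched as follows. The representation of a tree as a pair (set of training case ids, prediction function) is an assumption made for illustration, and the constant-valued "trees" are stand-ins for real fitted trees.

```python
def averaged_prediction(case_id, features, trees):
    """Average the predictions of all trees that were not trained on this case.

    trees: list of (training_ids, predict) pairs, where training_ids is a set
    of case ids and predict maps a feature vector to a probability.
    """
    preds = [predict(features) for training_ids, predict in trees
             if case_id not in training_ids]
    if not preds:
        return None  # every tree saw this case during training
    return sum(preds) / len(preds)

# Toy "trees": constant predictors with different training sets.
trees = [
    ({1, 2}, lambda x: 0.2),
    ({3}, lambda x: 0.4),
    ({1}, lambda x: 0.9),
]
pred_case_1 = averaged_prediction(1, [0.0], trees)  # only the second tree qualifies
pred_case_3 = averaged_prediction(3, [0.0], trees)  # first and third trees qualify
print(pred_case_1, pred_case_3)
```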
Averaging model (Simple description)
When selecting a model to explain the data from all the possible models, the one that best fits the data is usually selected.
But sometimes several models explain the data comparably well, creating model selection uncertainty, which is usually ignored.
BMA (Bayesian Model Averaging) provides a coherent mechanism for accounting for this model uncertainty, combining predictions from the different models according to their probability.
Hoeting, J. A., Madigan, D., Raftery, A. E., and Volinsky, C. T. Bayesian model averaging: A tutorial (with discussion). Statistical Science, 14(4):382-417, 1999
Averaging model (Simple description)
Example: we have evidence D and 3 possible hypotheses h1, h2 and h3.
The posterior probabilities of these hypotheses are P( h1 | D ) = 0.4, P( h2 | D ) = 0.3 and P( h3 | D ) = 0.3
Given a new observation, h1 classifies it as true while h2 and h3 classify it as false. The global classifier (BMA) weights each vote by its posterior: P(true | D) = 0.4 and P(false | D) = 0.3 + 0.3 = 0.6, so the observation is classified as false.
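The same posterior-weighted vote generalizes to any number of hypotheses and labels. The helper `bma_vote` below is a hypothetical name used only for this sketch; it simply sums the posterior mass behind each predicted label.

```python
def bma_vote(posteriors, votes):
    """Combine classifier votes weighted by posterior probability.

    posteriors: dict hypothesis -> P(h | D)
    votes:      dict hypothesis -> label predicted by that hypothesis
    Returns the winning label and the total weight behind each label.
    """
    weight = {}
    for h, p in posteriors.items():
        weight[votes[h]] = weight.get(votes[h], 0.0) + p
    label = max(weight, key=weight.get)
    return label, weight

# The example from the slide: h1 says true, h2 and h3 say false.
label, weight = bma_vote(
    {"h1": 0.4, "h2": 0.3, "h3": 0.3},
    {"h1": True, "h2": False, "h3": False},
)
print(label, weight)
```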
Top two level splits for predicting single individuals joining communities in LJ
Performance achieved with the decision trees
Prediction performance for single individuals joining communities in LJ
Prediction performance for single individuals joining communities in DBLP
Internal connectedness of friends
Individuals whose friends in community are linked to one another are significantly more likely to join the community
Community Growth
Prediction task: identifying which communities will grow significantly over a given period of time
Binary classification problem
Training set
Community growth
Class 0 (50.6%), Class 1 (49.4%); growth thresholds marked on the slide: 9% and 18%
Data set: 13,570 communities
To make predictions, 100 decision trees were built on 100 independent samples using the community features
Binary splits are installed until a node has fewer than 50 data points
Top two levels of decision tree splits for predicting community growth in LJ
The features and splits varied depending on the sample, but the top 2 splits were stable
Solution to the problem
The averaging tree technique was used
Three baselines with a single feature each were considered
  Size of the community
  Number of people in the fringe of the community
  Ratio of these two features
and a combination of all three features
Results
Predicting community growth: baselines based on three different features, and performance using all features
Including the full set of features gives predictions with reasonably good performance
Movement between communities
How people and topics move between communities
Fundamental question: given a set of overlapping communities, do topics tend to follow people or do people tend to follow topics?
Experiment set-up
  87 conferences for which there is DBLP data over at least a 15-year period
  The cumulative set of words in paper titles is a proxy for top-level topics
Time Series and Detected Bursts: Term Bursts
[Figure: example for OOPSLA’03 with the titles “Micro-Pattern Evolution” and “Micro-Pattern in Java”, giving Tw,C(y) = 2/6; plot of Tw,C over y = 2000-2005 with term bursts marked]
“Micro-Pattern” is hot at OOPSLA in 2003
Time Series and Detected Bursts: Term Bursts
Tw,C(y): fraction of paper titles at conference C in year y that contain the word w
Bursts in the usage of w
  For each time series Tw,C, a burst is an interval in which Tw,C(y) is twice the average rate (the “burst rate”)
  A burst detection technique exploiting a stochastic model for term generation is used
  Burst intervals serve to identify the “hot topics” (focus of interest at a conference)
Word w is hot at conference C in year y if the year y is contained in a burst interval of the time series Tw,C
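The time series Tw,C and a simplified notion of a burst can be sketched as below. Note the assumptions: the burst detection named on the slide uses a stochastic model, whereas this sketch swaps in a plain threshold at twice the average rate; the function names are hypothetical; consecutive years are assumed; and all titles other than the two quoted on the slides are made up for the toy example.

```python
def term_series(titles_by_year, word):
    """Tw,C(y): fraction of titles in each year that contain `word`.

    titles_by_year: dict year -> list of paper titles at one conference.
    """
    series = {}
    for year, titles in sorted(titles_by_year.items()):
        hits = sum(1 for t in titles if word.lower() in t.lower().split())
        series[year] = hits / len(titles) if titles else 0.0
    return series

def threshold_bursts(series, factor=2.0):
    """Maximal runs of years with value >= factor * mean (a simplified stand-in
    for model-based burst detection; assumes the years are consecutive)."""
    mean = sum(series.values()) / len(series)
    intervals, start, prev = [], None, None
    for year in sorted(series):
        hot = mean > 0 and series[year] >= factor * mean
        if hot and start is None:
            start = year
        if not hot and start is not None:
            intervals.append((start, prev))
            start = None
        prev = year
    if start is not None:
        intervals.append((start, prev))
    return intervals

def is_hot(intervals, year):
    return any(start <= year <= end for start, end in intervals)

filler = ["Garbage Collection", "Type Inference"]  # made-up titles
titles_by_year = {y: list(filler) for y in range(2000, 2006)}
titles_by_year[2003] = ["Micro-Pattern Evolution", "Micro-Pattern in Java",
                        "Garbage Collection", "Type Inference",
                        "Refactoring Tools", "Aspect Weaving"]
series = term_series(titles_by_year, "micro-pattern")
bursts = threshold_bursts(series)
print(series[2003], bursts, is_hot(bursts, 2003))
```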
Time Series and Detected Bursts: Movement Bursts
Author movement
  Authors do not publish every year
  Movement is asymmetric
Member of a conference C in year y
  Has published there in the 5 years leading up to y
Author a moves into C from B in year y (B -> C)
  a has a paper in conference C in year y, and a is a member of B in year y-1
  Property of two conferences and a year
[Figure: example with author Smith, conferences B (2002) and C (2003), and the paper “Micro-Pattern Evolution”]
Time Series and Detected Bursts: Movement Bursts
[Figure: example with authors Dill and Brown and conferences B and C (years 2001, 2002), where MB,C(y) = 2/5; plot of MB,C over y = 2000-2005 with movement bursts marked]
Time Series and Detected Bursts: Movement Bursts
MB,C(y): fraction of authors at conference C in year y who are moving into C from B (B -> C)
MB,C: time series representing author movement
B -> C movement burst: an interval of years y in which the value MB,C(y) exceeds the overall average by an absolute difference of 0.10
Burst detection is used to find the burst intervals
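The membership and movement definitions translate into a short computation. The sketch below rests on stated assumptions: publications are given as (author, conference, year) tuples, "the 5 years leading up to y" is read as the five years ending at y inclusive, the function names are hypothetical, and burst years are found by the plain 0.10 margin rather than by a full burst detection procedure. Authors Lee and Kim are invented to reach the 2/5 ratio from the slide.

```python
def is_member(pubs, author, conf, year, window=5):
    """True if `author` published at `conf` in the `window` years ending at `year`
    (inclusive window: an assumption about the slide's wording)."""
    return any((author, conf, y) in pubs for y in range(year - window + 1, year + 1))

def movement_fraction(pubs, b, c, year):
    """MB,C(y): fraction of authors at C in `year` who were members of B in year - 1."""
    authors_at_c = {a for (a, conf, y) in pubs if conf == c and y == year}
    if not authors_at_c:
        return 0.0
    movers = {a for a in authors_at_c if is_member(pubs, a, b, year - 1)}
    return len(movers) / len(authors_at_c)

def movement_burst_years(series, margin=0.10):
    """Years whose value exceeds the overall average by more than `margin`."""
    mean = sum(series.values()) / len(series)
    return [y for y in sorted(series) if series[y] > mean + margin]

# Toy data: five authors at C in 2003, two of them members of B in 2002.
pubs = {
    ("Smith", "B", 2002), ("Dill", "B", 2001),
    ("Smith", "C", 2003), ("Dill", "C", 2003), ("Brown", "C", 2003),
    ("Lee", "C", 2003), ("Kim", "C", 2003),
}
m_series = {y: movement_fraction(pubs, "B", "C", y) for y in range(2000, 2006)}
burst_years = movement_burst_years(m_series)
print(m_series[2003], burst_years)
```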
Goal of Experiment 1
Identify how word burst and movement burst intervals are aligned in time
  Word burst intervals identify hot terms
  Movement burst intervals identify conference pairs B, C and the periods during which there was significant movement between them
Experiment 1: Papers contributing to Movement Bursts
Characteristics of papers associated with some movement burst into a conference C
  They exhibit different properties from arbitrary papers at C
    Use of terms currently hot at C
    Use of terms that will be hot at C in the future
A paper at C in year y contributes to some movement burst at C if
  one of its authors is moving B -> C in y, and
  y is part of a B -> C movement burst
[Figure: example with author Smith, ICPC’02 and OOPSLA’03, the paper “Micro-pattern Evolution”, and a movement burst over 2002-2004]
Papers contributing to Movement Bursts
A paper uses a hot term if one of the words in its title is hot for the conference and year in which it appears
Question: do papers contributing to movement bursts differ from arbitrary papers in the way they use hot terms?
Papers contributing to a movement burst contain elevated frequencies of current and expired hot terms, but lower frequencies of future hot terms
Authors in a burst of movement into C from B are drawn to topics currently hot at C
Experiment 2: Alignment between different conferences
Conferences B and C are topically aligned in a year y if some word is hot at both B and C in year y
  Property of two conferences and a specific year
Hypothesis: two conferences are more likely to be topically aligned in a given year if there is also a movement burst going between them
[Figure: example in which “Micro-pattern” is hot at both OOPSLA’03 and ICSM’03]
Results
56.34% of all triples (B,C,y) such that there is a B -> C movement burst containing year y have the property that B and C are topically aligned in year y
16.2% of all triples (B,C,y) have the property that B and C are topically aligned in year y
The presence of a movement burst between 2 conferences raises the chance that they share a hot term by a factor of about 3.5 (56.34% vs. 16.2%)
Movement bursts or term bursts: which come first?
Setting: there is a B -> C movement burst, and a hot term w such that B and C are topically aligned via w in some year y inside the movement burst
3 events of interest
  The start of the burst for w at conference B
  The start of the burst for w at conference C
  The start of the B -> C movement burst
Four patterns of author movement and topical alignment
[Figure: the four possible orderings of the B -> C movement burst relative to the two term burst intervals, with counts 194, 32, 35 and 61]
Shared interest is 50% more frequent than the others combined (194 vs. 32 + 35 + 61 = 128)
It is much more frequent for B and C to have a shared burst term that is already underway before the increase in author movement takes place
Conclusions
The ways in which communities in social networks grow over time were considered
  At the level of individuals and their decisions to join communities
  At a more global level, in which a community can evolve in membership and content
Thank you!