Group Formation in Large Social Networks: Membership, Growth, and Evolution

Lars Backstrom, Dan Huttenlocher, Jon Kleinberg, Xiangyang Lan

Presented by Natalia Dragan

Data Mining Techniques (CS6/73015-001), Fall 2006, Kent State University

Outline

Introduction

Motivation

Present work contributions

Related work

Membership, Growth, Evolution

Conclusions

Introduction

Structure of society: people tend to come together and form groups

Why it is important to study group formation: to better understand decision-making behavior; to track the early stages of an epidemic; to follow the popularity of new ideas and technologies

Significant growth in the scale and richness of on-line communities and social media (MySpace, Facebook, LiveJournal, Flickr)

Problems with studying social processes

Easy to build theoretical models (a subgraph branching out rapidly over links in the network, or a collection of small disconnected components growing in a ‘speckled’ fashion)

Hard to make concrete empirical statements about these types of processes

Lack of reasonable vocabulary for studying group evolution

Present work

Principles by which groups develop and evolve in large-scale social networks

Crucial point: the focus is on networks whose members have explicitly identified themselves as a group or community (vs. the unsupervised graph clustering problem of inferring “community structures” in a network)

3 main types of questions: Membership, Growth, Change

Membership, Growth, Change

Membership: structural features that influence whether a given individual will join a particular group

Growth: structural features that influence whether a given group will grow significantly over time

Change: how the focus of interest changes over time, and how these changes are correlated with changes in the set of group members

Related work

Identifying tightly-connected clusters within a given graph: Dill et al. consider implicitly identified communities. For a set of features (ZIP code, particular keyword, etc.), they consider the subgraph of the Web consisting of all pages containing that feature

Use of online social networks for data mining: the structure of the communities is not exploited

Diffusion of innovation studies: the property that is “diffusing” in the present work is membership in a given group

Diffusion of Innovations

Diffusion of Innovations is a theory that analyzes, as well as helps explain, the adoption of a new innovation. In other words, it helps to explain the process of social change.

An innovation is an idea, practice, or object that is perceived as new by an individual or other unit of adoption. The perceived newness of the idea for the individual determines his/her reaction to it (Rogers, 1995).

In addition, diffusion is the process by which an innovation is communicated through certain channels over time among the members of a social system. Thus, the four main elements of the theory are the innovation, communication channels, time, and the social system.

http://hsc.usf.edu/~kmbrown/Diffusion_of_Innovations_Overview.htm

Related work on Diffusion of Innovation

How a social network evolves as its members' attributes change (Sarkar and Moore, Holme and Newman)

Social network evolution in a university setting (Kossinets and Watts)

Evolution of topics over time (Wang and McCallum)

The property that is “diffusing” in the present work is membership in a given group

Road map


Sources of data

LiveJournal: free on-line community with about 10 million members; 300,000 update their content in a 24-hour period. Members maintain journals and individual and group blogs, and declare who their friends are and to which communities they belong

DBLP: on-line database of computer science publications (about 400,000 papers). Friendship network: co-authors of a paper. Community: a conference

Community Membership

Study of processes by which individuals join communities in a social network

Fundamental question about the evolution of communities: who will join in the future?

Membership in a community is a “behavior” that spreads through the network, so this question can be approached from the perspective of diffusion of innovation studies

Dependence on number of friends: a start towards membership prediction

Underlying premise in diffusion studies: an individual's probability of adopting a new behavior increases with the number of friends (K) already engaging in the behavior

Theoretical models concentrate on the effect of K, while the structural properties are more influential in determining membership

Approach hypothesis

For moderate values of K, an individual with K friends in a group is significantly more likely to join if these K friends are themselves mutual friends than if they are not

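Dependence on number of friends

[Figure: a community C at the 1st and 2nd snapshots; in the example shown, K = 3 and P(k) = 2/3. Legend: user (u), community (C), friend.]

Probability P(k) of joining a community = fraction of triples (u, C, k), where user u is outside community C and has k friends in C at the 1st snapshot, in which u belongs to C at the 2nd snapshot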
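As an illustration of the definition of P(k), here is a minimal Python sketch. The function name, the dictionary-based snapshot format, and the toy data are assumptions made for this example, not the authors' code. It also assumes that only users with at least one friend in C are counted.

```python
from collections import defaultdict


def join_probability(friends, members_t1, members_t2):
    """Estimate P(k) from two snapshots.

    friends:    {user: set of friends} at the first snapshot
    members_t1: {community: set of members} at the first snapshot
    members_t2: {community: set of members} at the second snapshot
    """
    total = defaultdict(int)   # number of triples (u, C, k) for each k
    joined = defaultdict(int)  # triples where u belongs to C at snapshot 2
    for community, first in members_t1.items():
        second = members_t2.get(community, set())
        for user, user_friends in friends.items():
            if user in first:
                continue  # already a member, so cannot join
            k = len(user_friends & first)
            if k == 0:
                continue  # assumption: only users with a friend in C count
            total[k] += 1
            if user in second:
                joined[k] += 1
    return {k: joined[k] / total[k] for k in sorted(total)}


# Hypothetical toy data: three outsiders each have K = 3 friends in C,
# and two of them have joined by the second snapshot.
friends = {
    "u1": {"a", "b", "c"},
    "u2": {"a", "b", "c"},
    "u3": {"a", "b", "c"},
}
members_t1 = {"C": {"a", "b", "c"}}
members_t2 = {"C": {"a", "b", "c", "u1", "u2"}}
p = join_probability(friends, members_t1, members_t2)
print(p)
```

With this toy data the only value of k that occurs is 3, and two of the three triples end in a join, which mirrors the K = 3, P(k) = 2/3 example on the slide.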

Law of Diminishing returns

In economics, diminishing returns is the short form of diminishing marginal returns. In a production system, having fixed and variable inputs, keeping the fixed inputs constant, as more of a variable input is applied, each additional unit of input yields less and less additional output. This concept is also known as the law of increasing opportunity cost or the law of diminishing returns.

http://en.wikipedia.org/wiki/Diminishing_returns

Dependence on number of friends: LiveJournal

Dependence on number of friends: DBLP

Discussion of results

The plots for LJ and DBLP have similar shapes, dominated by the “diminishing returns” property (the curve continues increasing, but more and more slowly, even for large K)

P(2) > 2P(1): the benefit of having a second friend is particularly strong (S-shaped behavior)

The curve for LJ is quite smooth (about half a billion triples vs. 7.8 million for DBLP)

A broader range of features

Features related to the community C (11): number of members (|C|); number of individuals with a friend in C (the fringe of C); number of edges with both ends in the community (|E_C|); etc.

Features related to an individual u and her set S of friends in community C (8): number of friends in the community (|S|); number of adjacent pairs in S; number of pairs in S connected via a path in E_C; etc.
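A few of the listed features can be computed directly from an adjacency structure. The sketch below is only an illustration: the function names, the dictionary keys, and the toy graph are hypothetical, and it covers just the six features named above rather than the full sets of 11 and 8.

```python
from itertools import combinations


def community_features(adj, community):
    """Features of community C: |C|, fringe size, |E_C|."""
    members = set(community)
    # Fringe: non-members with at least one friend in C.
    fringe = {v for m in members for v in adj[m]} - members
    # Every internal edge is seen once from each endpoint.
    internal_edges = sum(len(adj[m] & members) for m in members) // 2
    return {"size": len(members), "fringe": len(fringe),
            "internal_edges": internal_edges}


def individual_features(adj, community, user):
    """Features of user u and the set S of u's friends in C."""
    members = set(community)
    s = adj[user] & members
    pairs = list(combinations(sorted(s), 2))
    adjacent_pairs = sum(1 for a, b in pairs if b in adj[a])
    # Label connected components of the subgraph with edge set E_C.
    component = {}
    for start in members:
        if start in component:
            continue
        component[start] = start
        stack = [start]
        while stack:
            v = stack.pop()
            for w in adj[v] & members:
                if w not in component:
                    component[w] = start
                    stack.append(w)
    connected_pairs = sum(1 for a, b in pairs
                          if component[a] == component[b])
    return {"friends_in_c": len(s), "adjacent_pairs": adjacent_pairs,
            "connected_pairs": connected_pairs}


# Hypothetical toy graph: C = {a, b, c, d} with internal path a-b-c-d;
# u has friends a, b, c in C, and v has friend d in C.
adj = {
    "a": {"b", "u"},
    "b": {"a", "c", "u"},
    "c": {"b", "d", "u"},
    "d": {"c", "v"},
    "u": {"a", "b", "c"},
    "v": {"d"},
}
c_feats = community_features(adj, {"a", "b", "c", "d"})
u_feats = individual_features(adj, {"a", "b", "c", "d"}, "u")
print(c_feats, u_feats)
```

In the toy graph, a and c are friends of u that are not adjacent but are connected through b, so the "adjacent pairs" and "connected pairs" features differ.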

Estimating probability on a broader range of features

Decision-tree techniques were applied to these features to improve the estimate of the probability of an individual joining a community

The technique incorporates the individual's position in the network (structural features) and the level of activity among members (group features)

Predictions for LJ and DBLP

[Figure: community C and its fringe at the 1st and 2nd snapshots, with a fringe user u.]

Data point (u, C): user u is in the fringe of community C at the 1st snapshot; the task is to estimate the probability that u joins C

LJ: 17,076,344 data points, 875 communities; 14,448 joined a community

DBLP: 7,651,013 data points; 71,618 joined a community

20 decision trees were built to estimate the probability of joining

Splitting process for LJ

For each of the 875 communities, each fringe member is included in the training set independently with probability 0.5, so about half of the fringe is used for training

At each node in the decision tree, every possible feature and every binary split threshold for that feature were examined

Of all such pairs, the split that produces the largest decrease in entropy was chosen
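The split selection described here can be sketched as an exhaustive search over (feature, threshold) pairs. This is a generic entropy-based search written for illustration, not the authors' implementation. The feature columns and labels are hypothetical.

```python
import math


def entropy(positives, total):
    """Binary entropy (in bits) of a node with the given counts."""
    if total == 0 or positives == 0 or positives == total:
        return 0.0
    p = positives / total
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def best_split(rows, labels):
    """Return (entropy decrease, feature index, threshold) of the best
    binary split of the form x[feature] <= threshold."""
    n = len(rows)
    before = entropy(sum(labels), n)
    best = None
    for feature in range(len(rows[0])):
        values = sorted({row[feature] for row in rows})
        for threshold in values[:-1]:  # largest value would split nothing
            left = [y for row, y in zip(rows, labels)
                    if row[feature] <= threshold]
            right = [y for row, y in zip(rows, labels)
                     if row[feature] > threshold]
            after = (len(left) * entropy(sum(left), len(left))
                     + len(right) * entropy(sum(right), len(right))) / n
            decrease = before - after
            if best is None or decrease > best[0]:
                best = (decrease, feature, threshold)
    return best


# Hypothetical data points: (friends in community, adjacent pairs in S),
# with label 1 if the individual joined.
rows = [(1, 0), (2, 0), (2, 1), (3, 3), (4, 0), (3, 1)]
labels = [0, 0, 1, 1, 0, 1]
decrease, feature, threshold = best_split(rows, labels)
print(decrease, feature, threshold)
```

In this toy set the second column (adjacent pairs) separates the classes completely at threshold 0, so it is chosen over the first column (number of friends).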

Splitting process for LJ

New splits were installed until there were fewer than 100 positive cases at the node

Leaf nodes predict the ratio of positive to total cases for that node

Averaging technique: for every case, predictions are taken from the decision trees whose training set does not include this case; the average of these predictions is the prediction for the case
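The averaging step can be sketched as follows. The representation of a tree as a pair (set of training case ids, prediction function) is an assumption made for this example, and the constant predictors stand in for real decision trees.

```python
def averaged_prediction(case_id, features, trees):
    """Average the predictions of the trees that did not train on the case.

    trees: list of (training_ids, predict) pairs, where training_ids is the
    set of case ids used to build the tree and predict maps features to an
    estimated probability (hypothetical representation).
    """
    predictions = [predict(features)
                   for training_ids, predict in trees
                   if case_id not in training_ids]
    if not predictions:
        return None  # every tree saw this case during training
    return sum(predictions) / len(predictions)


# Hypothetical stand-ins for three trained trees.
trees = [
    ({1, 2}, lambda x: 0.2),
    ({2, 3}, lambda x: 0.4),
    ({1, 3}, lambda x: 0.9),
]
held_out = averaged_prediction(1, None, trees)   # only the second tree applies
unseen = averaged_prediction(4, None, trees)     # all three trees apply
print(held_out, unseen)
```

Because each case is scored only by trees that never saw it, the averaged value is an estimate on held-out data rather than on the training data.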

Averaging model (Simple description)

When selecting a model to explain the data from all the possible models, the one that best fits the data is usually selected.

But sometimes more than one model explains the data really well, creating model selection uncertainty, which is usually ignored.

BMA (Bayesian Model Averaging) provides a coherent mechanism for accounting for this model uncertainty, combining predictions from the different models according to their probability.

J. A. Hoeting, D. Madigan, A. E. Raftery, and C. T. Volinsky. Bayesian model averaging: A tutorial (with discussion). Statistical Science, 14(4):382-417, 1999

Averaging model (Simple description)

Example: we have evidence D and 3 possible hypotheses h1, h2 and h3. The posterior probabilities for those hypotheses are P( h1 | D ) = 0.4, P( h2 | D ) = 0.3 and P( h3 | D ) = 0.3

Given a new observation, h1 classifies it as true and h2 and h3 classify it as false. The global classifier (BMA) weights each vote by its posterior probability: P( true | D ) = 0.4 and P( false | D ) = 0.3 + 0.3 = 0.6, so the observation is classified as false
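The posterior-weighted vote in this example can be written as a short Python sketch. The function name is hypothetical, and the sketch assumes the posteriors sum to 1 and that each hypothesis gives a hard true/false vote.

```python
def bma_vote(posteriors, votes):
    """Posterior-weighted vote over hypotheses.

    posteriors: P(h_i | D) for each hypothesis (assumed to sum to 1)
    votes:      True/False classification given by each hypothesis
    Returns (probability of "true", final decision).
    """
    p_true = sum(p for p, vote in zip(posteriors, votes) if vote)
    return p_true, p_true > 0.5


# The example from the slide: h1 says true, h2 and h3 say false.
p_true, decision = bma_vote([0.4, 0.3, 0.3], [True, False, False])
print(p_true, decision)
```

Although h1 is the single most probable hypothesis, the combined weight of h2 and h3 is larger, so the averaged classifier disagrees with the best single model.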

Top two level splits for predicting single individuals joining communities in LJ

Performance achieved with the decision trees

Prediction performance for single individuals joining communities in LJ

Prediction performance for single individuals joining communities in DBLP

Internal connectedness of friends

Individuals whose friends in a community are linked to one another are significantly more likely to join the community

Road map


Community Growth

Prediction task: identifying which communities will grow significantly over a given period of time

Binary classification problem

Training set

Community growth

Class 0 (50.6%): growth below 9%

Class 1 (49.4%): growth above 18%

Data set: 13,570 communities

To make predictions, 100 decision trees were built on 100 independent samples using the community features

Binary splits are installed until a node has fewer than 50 data points

Top two levels of decision tree splits for predicting community growth in LJ

The features and splits varied depending on the sample, but the top 2 splits were stable

Solution to the problem

The tree-averaging technique was used

Three baselines with a single feature each were considered: size of the community; number of people in the fringe of the community; ratio of these two features. A combination of all three features was also considered

Results

Predicting community growth: baselines based on three different features, and performance using all features

By including the full set of features, predictions with reasonably good performance were obtained

Road map


Movement between communities

How people and topics move between communities

Fundamental question: given a set of overlapping communities, do topics tend to follow people, or do people tend to follow topics?

Experiment setup: 87 conferences for which there is DBLP data over at least a 15-year period; the cumulative set of words in titles is a proxy for top-level topics

Time Series and Detected Bursts: Term Bursts

[Figure: time series T_{w,C}(y) over the years 2000-2005 with detected term bursts. Example: at OOPSLA'03 the titles “Micro-Pattern Evolution” and “Micro-Pattern in Java” give T_{w,C}(y) = 2/6, and “Micro-Pattern” is hot at OOPSLA in 2003.]

Time Series and Detected Bursts: Term Bursts

T_{w,C}(y): fraction of paper titles at conference C in year y that contain the word w

Bursts in the usage of w: for each time series T_{w,C}, a burst is an interval in which T_{w,C}(y) is twice the average rate (the “burst rate”)

A burst detection technique exploiting a stochastic model for term generation is used

Burst intervals serve to identify the “hot topics” (focus of interest at a conference)

Word w is hot at conference C in year y if the year y is contained in a burst interval of the time series T_{w,C}
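The term time series can be illustrated with a short Python sketch. Note the swap: instead of the stochastic-model burst detection used in the work, the sketch flags individual years whose rate is at least twice the average, which is a deliberate simplification. The function names are hypothetical, and all titles other than the two quoted on the slides are placeholders.

```python
def term_series(titles_by_year, word):
    """T_{w,C}(y): fraction of titles in each year containing the word."""
    series = {}
    for year, titles in sorted(titles_by_year.items()):
        hits = sum(1 for title in titles
                   if word.lower() in title.lower().split())
        series[year] = hits / len(titles) if titles else 0.0
    return series


def burst_years(series, factor=2.0):
    """Simplified burst test: years whose rate is at least `factor` times
    the average rate (not the stochastic model mentioned on the slide)."""
    average = sum(series.values()) / len(series)
    return [year for year, rate in series.items()
            if rate > 0 and rate >= factor * average]


# Titles for one conference; only the two "Micro-Pattern" titles come from
# the slides, the rest are placeholders.
titles_by_year = {
    2001: ["paper a", "paper b", "paper c", "paper d"],
    2002: ["paper e", "paper f", "paper g", "paper h"],
    2003: ["Micro-Pattern Evolution", "Micro-Pattern in Java",
           "paper i", "paper j", "paper k", "paper l"],
    2004: ["paper m", "paper n", "paper o", "paper p"],
}
series = term_series(titles_by_year, "micro-pattern")
hot_years = burst_years(series)
print(series, hot_years)
```

Here two of the six 2003 titles contain the word, matching the T_{w,C}(y) = 2/6 example, and 2003 is the only year flagged.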

Time Series and Detected Bursts: Movement Bursts

Author movement: authors do not publish every year; movement is asymmetric

Member of a conference C in year y: has published there in the 5 years leading up to y

Author a moves into C from B in year y (B -> C): a has a paper in conference C in year y, and a is a member of B in year y-1. This is a property of two conferences and a year

[Figure: example of movement B -> C: Smith, a member of B in 2002, publishes “Micro-Pattern Evolution” at C in 2003.]

Time Series and Detected Bursts: Movement Bursts

[Figure: time series M_{B,C}(y) over the years 2000-2005 with detected movement bursts; the example, with authors Dill and Brown, shows M_{B,C}(y) = 2/5.]

Time Series and Detected Bursts: Movement Bursts

M_{B,C}(y): fraction of authors at conference C in year y with the property that they are moving into C from B (B -> C)

M_{B,C}: time series representing author movement

B -> C movement burst: an interval of years y in which the value M_{B,C}(y) exceeds the overall average by an absolute difference of 0.10

Burst detection is used to find burst intervals
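The membership and movement definitions can be sketched in Python as below. This is an illustration under stated assumptions: the 5-year membership window is taken to be y-4 through y inclusive, years are tested individually against the 0.10 margin rather than grouped into intervals by a burst detector, and the function names and publication records are hypothetical.

```python
def is_member(pubs, author, conference, year, window=5):
    """Member of `conference` in `year`: published there in the `window`
    years leading up to `year` (assumed here to be year-4 .. year)."""
    return any((author, conference, y) in pubs
               for y in range(year - window + 1, year + 1))


def movement_series(pubs, b, c, years, window=5):
    """M_{B,C}(y): fraction of authors at C in year y who were members of
    B in year y-1."""
    series = {}
    for year in years:
        authors = {a for (a, conf, y) in pubs if conf == c and y == year}
        movers = {a for a in authors
                  if is_member(pubs, a, b, year - 1, window)}
        series[year] = len(movers) / len(authors) if authors else 0.0
    return series


def movement_burst_years(series, margin=0.10):
    """Years whose value exceeds the overall average by more than `margin`
    (a per-year simplification of the burst intervals)."""
    average = sum(series.values()) / len(series)
    return [year for year, value in series.items()
            if value > average + margin]


# Hypothetical publication records: (author, conference, year).
pubs = {
    ("smith", "B", 2002), ("dill", "B", 2001),
    ("smith", "C", 2003), ("dill", "C", 2003),
    ("x1", "C", 2003), ("x2", "C", 2003), ("x3", "C", 2003),
    ("x1", "C", 2002), ("x2", "C", 2002),
    ("x1", "C", 2004), ("x3", "C", 2004),
}
m_series = movement_series(pubs, "B", "C", [2002, 2003, 2004])
m_bursts = movement_burst_years(m_series)
print(m_series, m_bursts)
```

In the toy records two of the five authors at C in 2003 were members of B the year before, which gives 2/5 for that year and zero for the others.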

Goal of Experiment 1

Identify how word burst and movement burst intervals are aligned in time

Word burst intervals identify hot terms

Movement burst intervals identify conference pairs B, C and the periods during which there was significant movement between them

Experiment 1: Papers contributing to Movement Bursts

Characteristics of papers associated with some movement burst into a conference C: they exhibit different properties from arbitrary papers at C in their use of terms currently hot at C and their use of terms that will be hot at C in the future

A paper at C in year y contributes to some movement burst at C if one of its authors is moving B -> C in y and y is part of a B -> C movement burst

[Figure: Smith publishes at ICPC'02 and then “Micro-pattern Evolution” at OOPSLA'03, inside a movement burst spanning 2002-2004.]

Papers contributing to Movement Bursts

A paper uses a hot term if one of the words in its title is hot for the conference and year in which it appears

Question: do papers contributing to movement bursts differ from arbitrary papers in the way they use hot terms?

Papers contributing to a movement burst contain elevated frequencies of currently hot and expired hot terms, but lower frequencies of future hot terms

Authors in a burst of movement into C from B are drawn to topics currently hot at C

Experiment 2: Alignment between different conferences

Conferences B and C are topically aligned in a year y if some word is hot at both B and C in year y. This is a property of two conferences and a specific year

Hypothesis: two conferences are more likely to be topically aligned in a given year if there is also a movement burst going between them

[Figure: “Micro-pattern” is hot at both OOPSLA'03 and ICSM'03.]

Results

56.34% of all triples (B, C, y) for which there is a B -> C movement burst containing year y have the property that B and C are topically aligned in year y

16.2% of all triples (B, C, y) have the property that B and C are topically aligned in year y

The presence of a movement burst between 2 conferences raises the chance that they share a hot term about 3.5-fold (56.34% vs. 16.2%)
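The comparison between the two percentages amounts to a base rate versus a conditional rate over triples (B, C, y). The sketch below shows that computation on made-up inputs. The function names, the hot-term table, and the burst set are all hypothetical and are not meant to reproduce the reported figures.

```python
def topically_aligned(hot, b, c, year):
    """B and C are aligned in `year` if some word is hot at both."""
    return bool(hot.get((b, year), set()) & hot.get((c, year), set()))


def alignment_rates(hot, movement_bursts, conferences, years):
    """Return (fraction of all triples that are aligned,
               fraction of triples inside a movement burst that are aligned)."""
    all_n = all_hits = burst_n = burst_hits = 0
    for b in conferences:
        for c in conferences:
            if b == c:
                continue
            for year in years:
                aligned = topically_aligned(hot, b, c, year)
                all_n += 1
                all_hits += aligned
                if (b, c, year) in movement_bursts:
                    burst_n += 1
                    burst_hits += aligned
    overall = all_hits / all_n
    in_burst = burst_hits / burst_n if burst_n else None
    return overall, in_burst


# Hypothetical hot terms per (conference, year) and movement-burst triples.
hot = {
    ("OOPSLA", 2003): {"micro-pattern"},
    ("ICSM", 2003): {"micro-pattern"},
    ("ICPC", 2003): {"placeholder-term"},
}
movement_bursts = {("ICSM", "OOPSLA", 2003), ("ICPC", "OOPSLA", 2003)}
overall, in_burst = alignment_rates(
    hot, movement_bursts, ["OOPSLA", "ICSM", "ICPC"], [2002, 2003])
print(overall, in_burst)
```

The triples are ordered, so (B, C, y) and (C, B, y) are counted separately, in line with movement being asymmetric.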

Movement bursts or term bursts come first?

Setting: there is a B -> C movement burst and a hot term w such that B and C are topically aligned via w in some year y inside the movement burst

3 events of interest: the start of the burst for w at conference B; the start of the burst for w at conference C; the start of the B -> C movement burst

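Four patterns of author movement and topical alignment

[Figure: the four patterns, each showing the B -> C movement burst and the term burst intervals; the pattern counts are 194, 32, 35, and 61.]

Shared interest is about 50% more frequent than the other patterns combined (194 cases vs. 32 + 35 + 61 = 128)

It is much more frequent for B and C to have a shared burst term that is already underway before the increase in author movement takes place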

Conclusions

The ways in which communities in social networks grow over time were considered at two levels:

At the level of individuals and their decisions to join communities

At a more global level, at which a community can evolve in membership and content

Thank you!