The Slashdot Zoo: Mining a Social Network with Negative Edges

Jérôme Kunegis
DAI-Labor, Technische Universität Berlin
Ernst-Reuter-Platz 7, 10587 Berlin, Germany
[email protected]

Andreas Lommatzsch
DAI-Labor, Technische Universität Berlin
Ernst-Reuter-Platz 7, 10587 Berlin, Germany
[email protected]

Christian Bauckhage
Deutsche Telekom Laboratories
Ernst-Reuter-Platz 7, 10587 Berlin, Germany
christian.bauckhage@telekom.de

ABSTRACT

We analyse the corpus of user relationships of the Slashdot technology news site. The data was collected from the Slashdot Zoo feature, where users of the website can tag other users as friends and foes, providing positive and negative endorsements. We adapt social network analysis techniques to the problem of negative edge weights. In particular, we consider signed variants of global network characteristics such as the clustering coefficient, node-level characteristics such as centrality and popularity measures, and link-level characteristics such as distances and similarity measures. We evaluate these measures on the task of identifying unpopular users, as well as on the task of predicting the sign of links, and show that the network exhibits multiplicative transitivity, which allows algebraic methods based on matrix multiplication to be used. We compare our methods to traditional methods which are only suitable for positively weighted edges.

Categories and Subject Descriptors

I.2.6 [Computing Methodologies]: Artificial Intelligence—Learning ; H.4.0 [Information Systems Applications]:General

General Terms

Experimentation, Theory

Keywords

Social network, Slashdot Zoo, negative edge, link prediction

1. INTRODUCTION

Social network analysis studies social networks by means of analysing structural relationships between people. Accordingly, social networks are usually modeled using directed graphs, where an edge between two nodes represents a relationship between two individuals. While most social network modeling approaches allow for weighted edges, the weights are usually restricted to positive values. However, some relationships such as distrust and dislike are inherently negative. In such cases, the social network contains negative edge weights.

Copyright is held by the International World Wide Web Conference Committee (IW3C2). Distribution of these papers is limited to classroom use, and personal use by others. WWW 2009, April 20–24, 2009, Madrid, Spain. ACM 978-1-60558-487-4/09/04.

Figure 1: The Slashdot Zoo corpus represented as a graph, where nodes represent users and edges indicate relationships. The network contains 77,985 nodes and 510,157 edges. “Friend” relationships are shown as green edges and “foe” relationships as red edges; the orientation of edges is not shown. The graph is centered at user CmdrTaco, founder and editor of the site.

Slashdot¹ is a technology news website founded in 1997. It publishes stories written by editors or submitted by users and allows users to comment on them. In 2002, the site added the Zoo feature, which lets users tag other users as friends and foes. It is therefore an early example of an online service with a social networking component. Unlike most popular social networking services, Slashdot also allows users to rate other users negatively.

¹ http://slashdot.org/

WWW 2009 MADRID! Track: Social Networks and Web 2.0 / Session: Interactions in Social Communities

In this paper, we systematically study signed versions of methods for network analysis. We study the Slashdot Zoo corpus on a global level, on the level of individual nodes, and on the level of individual edges. On the global level, we analyse measures that characterize the network as a whole. We consider the clustering coefficient of the Slashdot Zoo corpus and propose the signed clustering coefficient and the relative signed clustering coefficient. On the node level, we study measures applying to individual nodes of the network. We review signed popularity and centrality measures and evaluate them on the task of identifying troll users. On the edge level, we analyse similarity measures that apply to pairs of nodes and evaluate their use for the task of predicting signed links.

The network analysis methods we will present are all based on the concept of transitivity, which stipulates that relations between any two nodes in the network can be described by paths between the two nodes. In networks with negative edge weights, the concept of transitivity has to take into account negative values. In the simplest case we can ask: If there is a path of signed edges between two nodes in the network, what relation can we infer between the two nodes? We will show that the answer to this question is a multiplication rule exemplified by the phrase the enemy of my enemy is my friend. By analysing the signed graph on various levels, we will show that this multiplicative transitivity rule is indeed valid for the Slashdot Zoo.

The rest of the paper is structured as follows. In Section 2, we discuss related work. Section 3 presents the Slashdot Zoo corpus. For network analysis on the global level, Section 4 presents the signed clustering coefficient. Section 5 discusses various popularity measures on the node level and evaluates them for the task of identifying unpopular users. Section 6 reviews distance and similarity measures on the link level and evaluates them on the task of link sign prediction. We conclude in Section 7.

2. RELATED WORK

Social network analysis (SNA) has a background in sociology [23]. The proliferation of Web 2.0 sites, which focus on user participation for content creation, has resulted in very large datasets that call for advanced data mining techniques. A general discussion of Slashdot can be found in [18]. The Slashdot discussion threads are studied in [6].

Most web-based social network analysis considers the case of unsigned networks, where edges are either unweighted or weighted only with positive values [23]. Recent studies [9] describe the social network extracted from Essembly, an ideological discussion site that allows users to mark other users as friends, allies and nemeses, discussing the semantics of the three relation types. These works model the different types of edges by means of three different graphs. In this paper, we avoid such overhead and analyse all edges in a single graph with weighted edges.

Other recent work [27] considers the task of discovering communities from social networks with negative edges. However, the negative edges are only used to separate communities (clusters), and do not serve as a measure of popularity or similarity. Signed graphs have been used in anthropology to model friendship and enmity [7].

Work on trust networks is by definition concerned with negative edges. Work in that field has mostly focused on defining global trust measures using path lengths or adapting PageRank [3, 5, 13]. In this paper we compare these approaches to our techniques and find them to yield inferior results. Collaborative filtering aims at predicting or recommending links in a bipartite user-item graph [8]. The edge weights in such a graph often admit negative values, indicating a dislike of the item in question. However, the methods of collaborative filtering cannot immediately be applied to social network analysis, because links in the bipartite graph are not directed and relations between users extracted from a bipartite graph are necessarily symmetric.

The clustering coefficient was first described in [25] and extended to positively weighted edges in [12]. The task of link prediction in social networks is described in [19] for the case of positive edges. Distance and similarity in unsigned social networks are described in [26]. These measures are based on shortest-path distances and spectral measures such as PageRank and HITS. Graph kernels are described in [2, 14, 16]. Their application to link prediction and recommendation is covered in [11]. In all these works however, only positively weighted edges are considered.

Balance in a signed graph is defined in [?].

Variants of Laplacian graph kernels that apply to networks with negative edges are described in [17, 10], where they are used in an undirected setting for collaborative filtering. Centrality and trust measures based on the graph Laplacians are described in [21] for graphs with only positive edge weights.

3. THE SLASHDOT ZOO

The Slashdot Zoo corpus we consider in this paper contains 77,985 users and 510,157 links. Each link consists of a user who gave an endorsement and a user who received the endorsement. Endorsements can be either positive (“friend”) or negative (“foe”). Apart from this distinction, no other information is available; in particular, the creation date of endorsements is not known.

In addition to the terms “friend” and “foe”, Slashdot also uses the terms “fan” and “freak”: A user is always the fan of his friends and the freak of his foes. Figure 2 summarizes these relationships.

Figure 1 is a graphical representation of the Slashdot Zoo corpus. The sign of an edge is represented by its color, with green representing the “friend” relationship and red representing the “foe” relationship. The graph is centered at user CmdrTaco, founder of Slashdot and active editor.

Since this data was retrieved between May and October 2008, our corpus does not represent a true “snapshot” of the network, and may exhibit anomalies. For instance, since it was not possible to determine when a user added a tag, some users in our corpus may have more than 400 friends or foes, although Slashdot generally limits the number of friends and foes to 200 users, or to 400 users for subscribers.

Slashdot is known for having both very popular and prominent users on the one hand, and rather unpopular users on the other hand. Prominent and popular users of Slashdot include CmdrTaco (Rob Malda, the founder of Slashdot and a popular editor), John Carmack (prominent computer game programmer), Bruce Perens (prominent computer programmer and open source advocate) and CleverNickName (Wil Wheaton, Star Trek actor). In addition, Slashdot is well known for having a rich tradition of trolling, i.e. the posting of disruptive, false or offensive information to fool and provoke readers. The high number of such trolls may explain why the “foe” feature is useful on Slashdot: it allows for tagging known trolls and reducing their visibility.

Figure 2: The two types of links allowed in the Slashdot Zoo (friend and foe) give rise to four kinds of relationships: friends, fans, foes and freaks. A user is the fan of his friends and the freak of his foes.

Table 1: Statistics about the Slashdot Zoo corpus. The mean friend count and mean fan count are necessarily equal, as are the mean foe and freak counts.

| Statistic | Value |
|---|---|
| Users | 77,985 |
| Links | 510,157 |
| Friend links | 388,190 |
| Foe links | 121,967 |
| Sparsity | 0.000083884 |
| Mean link count | 6.542 |
| Mean friend/fan count | 4.978 |
| Mean foe/freak count | 1.564 |
| Median links | 3 |
| Median friend count | 1 |
| Median foe count | 0 |
| Median fans count | 1 |
| Median freaks count | 1 |

3.1 Definitions

For our graph-based representation of this corpus, we use the following definitions:

• $n$, the number of users

• $u$, $v$ are specific users

• $A \in \{-1, 0, +1\}^{n \times n}$ is the adjacency matrix with values $A_{uv} = +1$ when user $u$ marked user $v$ as a friend and $A_{uv} = -1$ when user $u$ marked user $v$ as a foe. $A$ is sparse, square and asymmetric.

• $\bar{A}$, the absolute adjacency matrix defined by $\bar{A}_{ij} = |A_{ij}|$

• $B = A + A^T$, the symmetric adjacency matrix

• $\bar{B} = \bar{A} + \bar{A}^T$, the absolute symmetric adjacency matrix

• $D$, the absolute diagonal degree matrix defined by $D_{ii} = \sum_j |A_{ij}|$

• $E$, the absolute symmetric diagonal degree matrix defined by $E_{ii} = \sum_j |B_{ij}|$
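As a minimal sketch of the definitions above, the matrices can be built from a list of signed links. The toy edge list, the dense list-of-lists representation and all function names below are assumptions made for illustration; a corpus of the size described here would require a sparse representation.

```python
# Illustrative sketch only: dense lists of lists on a toy network.
# The edge list and every function name here are hypothetical.

def build_adjacency(n, edges):
    """Signed adjacency matrix A: A[u][v] = +1 (friend) or -1 (foe)."""
    A = [[0] * n for _ in range(n)]
    for u, v, sign in edges:
        A[u][v] = sign
    return A

def absolute(M):
    """Entrywise absolute value, e.g. A_bar from A."""
    return [[abs(x) for x in row] for row in M]

def plus_transpose(M):
    """M + M^T, e.g. B from A, or B_bar from A_bar."""
    n = len(M)
    return [[M[i][j] + M[j][i] for j in range(n)] for i in range(n)]

def abs_row_sums(M):
    """Diagonal entries sum_j |M_ij| of a diagonal degree matrix."""
    return [sum(abs(x) for x in row) for row in M]

# (u, v, sign): user u tagged user v as a friend (+1) or a foe (-1).
edges = [(0, 1, +1), (1, 0, +1), (0, 2, -1), (2, 1, -1)]

A = build_adjacency(3, edges)
A_bar = absolute(A)
B = plus_transpose(A)
B_bar = plus_transpose(A_bar)
D = abs_row_sums(A)  # diagonal of D
E = abs_row_sums(B)  # diagonal of E
```

In this toy network, users 0 and 1 are mutual friends, so the corresponding entries of B and of the absolute symmetric matrix are both 2.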

3.2 Statistics

Table 1 displays basic statistics of the corpus. Figure 3 shows the degree distributions in the Slashdot Zoo. As expected, the degree distribution in the Slashdot Zoo follows a power law. Figure 4 shows the in-degree plotted against the out-degree.

Table 2 shows graph statistics based on shortest path distances. All distances were calculated without taking into account the edge directions and signs. In parentheses, we show the average distance in a random graph, computed as in [25]. As observed in that article, the measured average distance is less than the average distance in a random graph, confirming that the Slashdot Zoo is a small-world network.

Figure 3: Logarithmic plot of the degree distribution, showing that the degree distribution in the Slashdot Zoo follows a power law. The limit of 200 friends and foes is visible.

Figure 4: The in-degree plotted against the out-degree. The limits of 200 and 400 friends and foes are visible.
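The distance statistics discussed above (diameter, radius and mean shortest-path distance, ignoring edge direction and sign) could be computed along the following lines. This is only a sketch: it uses a breadth-first search from every node, assumes a connected graph, and the function names and toy edge list are hypothetical.

```python
from collections import deque

def bfs_distances(neighbors, source):
    """Hop distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in neighbors[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def distance_statistics(n, edges):
    """Return (diameter, radius, mean distance) of a connected graph.

    Edge direction and sign are deliberately ignored.
    """
    neighbors = [set() for _ in range(n)]
    for u, v, _sign in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    eccentricities = []
    total = 0
    pairs = 0
    for source in range(n):
        dist = bfs_distances(neighbors, source)
        eccentricities.append(max(dist.values()))
        total += sum(dist.values())
        pairs += len(dist) - 1  # ordered pairs (source, other node)
    return max(eccentricities), min(eccentricities), total / pairs

# Toy example: a path 0 - 1 - 2 - 3 with mixed signs and directions.
edges = [(0, 1, +1), (2, 1, -1), (2, 3, +1)]
diameter, radius, mean_distance = distance_statistics(4, edges)
```

On the toy path, the diameter is 3, the radius is 2, and the mean distance is 5/3. Running one search per node is quadratic in the worst case, so a large corpus would likely call for sampling or a dedicated graph library.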

Figure 5 shows three variants of principal component analysis, visualizing the dataset. The first two plots show the users plotted by the ratings they have given and the ratings they have received, respectively. The third plot shows the principal component analysis as given by the Laplacian kernel of Equation 11 in Section 6. As we will see, the projection using the Laplacian kernel gives a plot preserving the resistance distance between nodes in the network.

In the next three sections, we study the Slashdot Zoo corpus on the global, node and link levels.

Table 2: The Slashdot Zoo’s graph diameter, radius and mean shortest-path distance. The sign and direction of edges were ignored in the calculation of these values. In parentheses, we show the average distance in a random graph, as defined by Watts and Strogatz.

| Statistic | Value |
|---|---|
| Diameter | 6 |
| Radius | 3 |
| Average distance | 3.86 (5.82) |

4. CLUSTERING COEFFICIENT

To test our hypothesis of multiplicative transitivity, we begin by defining and studying a global network statistic that denotes to what extent the multiplication rule is valid in a signed network. To that end, we extend the clustering coefficient to networks with signed, directed edges to give the signed clustering coefficient, introduce the relative signed clustering coefficient, and give the values observed for the Slashdot Zoo corpus. As we will see, the signed clustering coefficient denotes the transitivity of edge signs, indicating whether our multiplicative transitivity assumption is justified. The next two sections will then study popularity and link prediction algorithms that directly make use of multiplicative transitivity. We begin by defining multiplicative transitivity.

Definition. A signed network exhibits multiplicative transitivity when any two incident edges tend to be completed by a third edge having as a weight the product of the two edges’ weights.

Multiplicative transitivity is motivated by the fact that triangles of users connected by an even number of negatively weighted edges can be considered balanced [7], which can be summarized by the phrase the enemy of my enemy is my friend and its permutations. As we will see in later sections, the assumption of multiplicative transitivity lends itself to using algebraic methods based on the adjacency matrix of the network. To see why this is true, consider that the square $A^2$ of the signed adjacency matrix contains at its entry $(i, j)$ a sum over paths of length two between $i$ and $j$, each weighted positively or negatively depending on whether a third positive edge between $i$ and $j$ would lead to a balanced or unbalanced triangle. As a measure of multiplicative transitivity, this section begins by proposing the signed clustering coefficient.

Figure 5: Three variations on principal component analysis (PCA). In these graphs, every user is represented by a point. (a) PCA by ratings given: users are represented by the ratings they gave to other users. (b) PCA by ratings received: users are represented by the ratings they received from other users. (c) Laplacian PCA, approximating the resistance distance in the unsigned underlying graph.
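The role of the squared adjacency matrix can be illustrated on a toy network. The following sketch (plain Python, with a hypothetical five-user network and a hand-written `matmul` helper) shows that each entry of the square sums the sign products of all two-step paths.

```python
def matmul(X, Y):
    """Plain matrix product of two square lists of lists."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# Hypothetical toy network with five users.
n = 5
A = [[0] * n for _ in range(n)]
A[0][1] = -1  # user 0 marks user 1 as a foe
A[1][2] = -1  # user 1 marks user 2 as a foe
A[0][3] = +1  # user 0 marks user 3 as a friend
A[3][2] = +1  # user 3 marks user 2 as a friend
A[3][4] = -1  # user 3 marks user 4 as a foe

A2 = matmul(A, A)

# Two paths lead from 0 to 2: foe-foe (product +1) and friend-friend
# (product +1), so the entry accumulates to +2.
paths_0_to_2 = A2[0][2]
# One path leads from 0 to 4: friend-foe (product -1).
paths_0_to_4 = A2[0][4]
```

Here the entry for the pair (0, 2) is +2, suggesting a friend link, while the entry for (0, 4) is -1, suggesting a foe link.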

The clustering coefficient was introduced in [25]. An extension was proposed in [12] that works with positively weighted edges. The clustering coefficient is a characteristic number of a graph taking values between zero and one, denoting the tendency of the graph nodes to form small clusters. The signed clustering coefficient we define denotes the tendency of small clusters to be coherent. Therefore, the signed clustering coefficient will take on values between −1 and +1. The relative signed clustering coefficient will be defined as the quotient between the two.

The clustering coefficient is defined as the proportion of all incident edge pairs that are completed by a third edge to form a triangle. Figure 6 gives an illustration. Alternatively, the clustering coefficient can be defined algebraically using the adjacency matrix of the network. Given an undirected, unweighted loopless graph $G$ with unsigned adjacency matrix $\bar{A}$, its clustering coefficient is given by

$$C(G) = \frac{\| \bar{A} \circ \bar{A}^2 \|_+}{\| \bar{A}^2 \|_+} \qquad (1)$$

where $A \circ B$ represents the Hadamard product (entrywise product) of two matrices, and $\| A \|_+$ denotes the sum of all matrix elements. Since $G$ is undirected and unweighted, $\bar{A}$ is a nonnegative symmetric matrix with a diagonal of zero.

If we now suppose that $G$ is a directed graph, its adjacency matrix will no longer be symmetric, but Expression 1 is still defined. In this case, only edges completing two directed edges in the correct orientation will be counted.
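As a sketch of how Expression 1 might be evaluated, the following plain-Python code computes the ratio for a few toy graphs. The helper names and the dense representation are assumptions made for illustration only.

```python
def matmul(X, Y):
    """Plain matrix product of two square lists of lists."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def hadamard(X, Y):
    """Entrywise product of two matrices of equal shape."""
    return [[x * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def total(M):
    """Sum of all matrix elements."""
    return sum(sum(row) for row in M)

def clustering_coefficient(A_bar):
    """Expression 1 applied literally to an unsigned adjacency matrix."""
    A2 = matmul(A_bar, A_bar)
    return total(hadamard(A_bar, A2)) / total(A2)

triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]    # undirected triangle
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]        # undirected path 0 - 1 - 2
transitive = [[0, 1, 1], [0, 0, 1], [0, 0, 0]]  # 0->1, 1->2, 0->2
cycle = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]       # 0->1->2->0

c_triangle = clustering_coefficient(triangle)
c_path = clustering_coefficient(path)
c_transitive = clustering_coefficient(transitive)
c_cycle = clustering_coefficient(cycle)
```

The directed cycle scores zero because its closing edge runs against the orientation of the two-step path, whereas the transitive triple scores one. Note that, taken literally, the denominator sums all entries of the squared matrix, including two-step paths that return to their starting node, so the undirected triangle scores 0.5 rather than 1 in this sketch.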

To extend the clustering coefficient to negative edges, we assume a multiplication rule for two incident signed edges. The multiplication rule captures the intuition that the enemy of my enemy is my friend. As shown in Figure 6, an edge $c$ completing two incident edges $a$ and $b$ to form a triangle must fulfill the equation $c = ab$. Because $A$ already contains the value −1 for negative edges, the square $A^2$ will contain the sum of such products, leading to the following expression for calculating the signed clustering coefficient:

$$C_s(G) = \frac{\| A \circ A^2 \|_+}{\| \bar{A}^2 \|_+} \qquad (2)$$

Figure 6: The four kinds of clustering coefficients. a) Regular clustering coefficient. b) Directed clustering coefficient. c) Signed clustering coefficient. d) Signed directed clustering coefficient. Edge c is counted when edges a and b are present, and for the signed variants, weighted by sgn(abc). In panels c) and d), the completing edge is labeled c = ab.



Table 3: The values for all variants of the clustering coefficient for the Slashdot Zoo corpus. In parentheses, we give the clustering coefficient of a random graph of the same size, computed using the method of Watts and Strogatz, which applies only to unsigned networks. Together with the results from Table 2, these results show that the Slashdot Zoo exhibits the small-world phenomenon.

| Measure | Value |
|---|---|
| Clustering coefficient | 3.19% (0.0095%) |
| Directed clustering coefficient | 5.62% (0.0191%) |
| Signed clustering coefficient | 2.44% |
| Directed signed clustering coefficient | 4.44% |
| Relative signed clustering coefficient | 76.4% |
| Directed relative signed clustering coefficient | 79.0% |

Therefore, the signed clustering coefficient denotes to what extent the graph exhibits multiplicative transitivity. In actual social networks, we expect it to be positive.

Additionally, we define the relative signed clustering coefficient as the quotient of the signed and unsigned clustering coefficients. Graphs with a high relative signed clustering coefficient are thus graphs for which the multiplication rule applies.

$$S(G) = \frac{C_s(G)}{C(G)} = \frac{\| A \circ A^2 \|_+}{\| \bar{A} \circ \bar{A}^2 \|_+} \qquad (3)$$

The relative signed clustering coefficient takes on values between −1 and +1. It is +1 when all triangles are oriented coherently. In networks with negative relative signed clustering coefficients, the sign multiplication rule does not hold. The directed signed clustering coefficient and directed relative signed clustering coefficient can be defined analogously with Expressions (2) and (3).
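The signed and relative signed clustering coefficients can be sketched in the same way. The code below assumes that the denominators use the entrywise absolute value of the adjacency matrix, which is what makes the relative coefficient the quotient of the signed and unsigned coefficients; the function names and toy triangles are hypothetical.

```python
def matmul(X, Y):
    """Plain matrix product of two square lists of lists."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def hadamard(X, Y):
    """Entrywise product of two matrices of equal shape."""
    return [[x * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def total(M):
    """Sum of all matrix elements."""
    return sum(sum(row) for row in M)

def absolute(M):
    """Entrywise absolute value."""
    return [[abs(x) for x in row] for row in M]

def signed_clustering_coefficient(A):
    """Signed triangle sum divided by the unsigned two-path count."""
    A_bar = absolute(A)
    return total(hadamard(A, matmul(A, A))) / total(matmul(A_bar, A_bar))

def relative_signed_clustering_coefficient(A):
    """Signed triangle sum divided by the unsigned triangle sum."""
    A_bar = absolute(A)
    signed = total(hadamard(A, matmul(A, A)))
    unsigned = total(hadamard(A_bar, matmul(A_bar, A_bar)))
    return signed / unsigned

# Balanced triangle: two foe links and one friend link (undirected).
balanced = [[0, -1, 1], [-1, 0, -1], [1, -1, 0]]
# Unbalanced triangle: three mutual foes.
unbalanced = [[0, -1, -1], [-1, 0, -1], [-1, -1, 0]]

s_balanced = relative_signed_clustering_coefficient(balanced)
s_unbalanced = relative_signed_clustering_coefficient(unbalanced)
cs_balanced = signed_clustering_coefficient(balanced)
cs_unbalanced = signed_clustering_coefficient(unbalanced)
```

The balanced triangle reaches the maximum relative value of +1 and the triangle of three mutual foes the minimum of −1. The sketch does not guard against graphs without any triangles, for which the relative coefficient is undefined.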

The signed clustering coefficient and relative signed clustering coefficient are zero in random networks where the sign of edges is distributed equally. The signed clustering coefficients are by definition no larger than their unsigned counterparts. The unsigned clustering coefficient overestimates the number of “correctly” placed edges by ignoring their weight.

Table 3 gives all four variants of the clustering coefficient measured in the Slashdot Zoo corpus, along with the relative signed clustering coefficients. We also give the clustering coefficient of a random graph of the same size, as described in [25]. The clustering coefficient of the Slashdot Zoo is significantly larger than that of a random graph of equal size. Together with the observation in Table 2 that the average distance between nodes is less than that of a random graph, we follow Watts and Strogatz [25] and conclude that the Slashdot Zoo exhibits the small-world phenomenon. The high values for the relative signed clustering coefficients show that our multiplication rule is valid in the Slashdot Zoo, and justify the use of algebraic methods in the next two sections for popularity and link prediction methods.

5. POPULARITY AND CENTRALITY

In this section, we review measures that apply to single nodes in the network. We show how standard centrality and popularity measures can be extended to the case of negative edges, and how these perform on the task of identifying unpopular users.

Centrality in the broadest sense is a measure computed for each node in a graph, denoting to what extent the node is central to the graph. In social networks, such measures are also called importances. Central nodes are usually well connected to other nodes, and at a short distance to most other nodes. Peripheral nodes, on the other hand, are poorly connected and at a greater distance to other nodes. Centrality measures can be defined by taking a given distance measure on the graph, and measuring the average distance from all other nodes. Other centrality measures are defined by considering a flow or diffusion process in the network, and computing the amount of flow going through each node. The notion of trust is usually personalized, i.e. it applies to pairs of users. If however trust is taken as a global measure, then it corresponds largely to the concept of popularity we study here.

While centrality can be defined independently of edge signs, giving measures of who is central regardless of any sign, we will focus on centrality measures that are signed, and can be used to identify both top and bottom users, corresponding to well-liked and strongly disliked users. We will first describe the various centrality measures we used, then give the top six users for each, and then evaluate the measures on the task of identifying trolls.

5.1 Popularity and Centrality Measures

We now describe the centrality and popularity measures we evaluated.

5.1.1 Fans Minus Freaks (FMF)

As a baseline popularity measure, we use the number of fans and freaks a user has for calculating his reputation. We subtract the number of freaks from the number of fans, giving a signed number. We expect popular users to have a high number of fans and unpopular users a high number of freaks, making this measure a valid indicator of popularity. While this measure is simple, it can also be exploited easily using multiple accounts. Malicious users may create accounts with the sole purpose of marking their own main account as a friend, just as malicious websites may try to create link farms in order to boost their ranking in search engines.
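The Fans Minus Freaks baseline amounts to summing the signs of a user's incoming links, i.e. a column sum of the signed adjacency matrix. A minimal sketch, with a hypothetical edge list and function name:

```python
def fans_minus_freaks(n, edges):
    """Number of fans minus number of freaks for each of n users.

    Each edge is (u, v, sign): u tagged v as friend (+1) or foe (-1),
    which makes u a fan (+1) or a freak (-1) of v.
    """
    score = [0] * n
    for _u, v, sign in edges:
        score[v] += sign
    return score

# Hypothetical toy data: user 2 is mostly liked, user 3 is disliked.
edges = [
    (0, 2, +1), (1, 2, +1), (3, 2, -1),
    (0, 3, -1), (1, 3, -1),
    (2, 0, +1),
]
scores = fans_minus_freaks(4, edges)
```

Here user 2 has two fans and one freak, giving a score of 1, while user 3 has two freaks and no fans, giving a score of −2.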

5.1.2 PageRank (PR)

PageRank is a spectral popularity measure defined on directed graphs with nonnegative edge weights [22]. It models the path of a random “surfer” following the directed edges of the graph randomly, and “teleporting” to a random node at randomly chosen intervals. Equivalently, it can be defined as the dominant left eigenvector of the Google matrix $G$, given by

$$G = (1 - \alpha) D^{-1} \bar{A} + (\alpha / n) J_{n \times n} \qquad (4)$$

where $J_{n \times n}$ is a matrix full of ones of the specified size, and $0 < \alpha < 1$ is the teleportation parameter. The matrix $G$ is row-stochastic (each row sums to one). Since PageRank only applies to nonnegative edge weights, we have to use the unsigned adjacency matrix $\bar{A}$ instead of $A$. The resulting rank is thus not an indication of popularity, but rather an indication of centrality, denoting the tendency of users to be central, without distinguishing friend and foe links.
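One common way to approximate the dominant left eigenvector of the Google matrix is power iteration. The sketch below is an illustration under several assumptions that are not taken from the text: the value 0.15 for the teleportation parameter, the fixed number of iterations, the requirement that every user has at least one outgoing link, and all function names.

```python
import math

def google_matrix(A_bar, alpha):
    """Row-normalized unsigned links mixed with uniform teleportation."""
    n = len(A_bar)
    G = []
    for row in A_bar:
        degree = sum(row)
        if degree == 0:
            # Assumption of this sketch: every user has an outgoing link.
            raise ValueError("row with zero out-degree")
        G.append([(1 - alpha) * x / degree + alpha / n for x in row])
    return G

def pagerank(A_bar, alpha=0.15, iterations=300):
    """Dominant left eigenvector of G, scaled to unit Euclidean length."""
    G = google_matrix(A_bar, alpha)
    n = len(G)
    p = [1.0 / n] * n
    for _ in range(iterations):
        # Left multiplication: p_j <- sum_i p_i * G_ij
        p = [sum(p[i] * G[i][j] for i in range(n)) for j in range(n)]
        norm = math.sqrt(sum(x * x for x in p))
        p = [x / norm for x in p]
    return p

# Toy unsigned network: 0 -> 1, 1 -> 0, 2 -> 0 (nobody links to user 2).
A_bar = [[0, 1, 0], [1, 0, 0], [1, 0, 0]]
pr = pagerank(A_bar)
```

In this toy network, user 0 receives links from both other users and obtains the highest rank, while user 2 receives no links and is reached only through teleportation.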

5.1.3 Signed Spectral Ranking (SR)

PageRank has been extended to the case of negative edge weights several times [3, 13], modeling the resulting ranking as a measure of popularity. We compute the signed spectral ranking as the dominant left eigenvector of the signed matrix $G_s$:

$$G_s = (1 - \alpha) D^{-1} A + (\alpha / n) J_{n \times n} \qquad (5)$$

The resulting popularity measure admits both positive and negative values, and represents a measure of popularity in the network, with positive edges corresponding to a positive endorsement and negative edges to negative endorsements. This interpretation is consistent with the semantics of the “friend” and “foe” relationships.

Figure 7: A scatter plot of all users showing PageRank against Signed Spectral Rank, with known trolls marked in red. While the dominant trend is a positive correlation, we observe that the users with lowest Signed Spectral Rank have a (relatively) high PageRank. For most users, both ranks are almost equal while for trolls, Signed Spectral Rank is smaller.

5.1.4 Signed Symmetric Spectral Ranking (SSR)

We apply spectral ranking to the symmetric matrix $B = A + A^T$. Thus, outlinks by users are counted towards the popularity of a user. While this measure could be exploited by users to gain a high rank, it models the idea that “popular” users will refrain from having many foes, and that negative edges are more prevalent between unpopular users, regardless of the edge direction.

5.1.5 Negative Rank (NR)

Comparing PageRank with Signed Spectral Ranking in Figure 7 shows that these two rankings correlate highly. We observe that for most users, both measures are almost equal. For known trolls however, the Signed Spectral Rank seems to be much lower than PageRank. Therefore, we propose to subtract the two to define the Negative Rank:

$$\mathrm{NR} = \mathrm{SR} - \beta \cdot \mathrm{PR} \qquad (6)$$

The parameter $\beta$ determines the influence of PageRank on the ranking, and $\mathrm{SR}$ and $\mathrm{PR}$ are both of unit length. For $\beta = 0$, Negative Rank corresponds to the Signed Spectral Rank. We assume a value of $\beta = 1$ for now, and study the influence of $\beta$ later.
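Given two score vectors, the Negative Rank is a simple entrywise combination. The sketch below takes the Signed Spectral Rank and PageRank vectors as given inputs and scales both to unit Euclidean length before subtracting; the score values are made up for illustration and the function names are hypothetical.

```python
import math

def unit_length(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def negative_rank(sr, pr, beta=1.0):
    """NR = SR - beta * PR, with both inputs scaled to unit length."""
    sr = unit_length(sr)
    pr = unit_length(pr)
    return [s - beta * p for s, p in zip(sr, pr)]

# Made-up scores for four users: the two measures agree for users
# 0 to 2, while user 3 is central but has a negative signed rank.
pr = [0.5, 0.5, 0.5, 0.5]
sr = [0.5, 0.5, 0.5, -0.5]

nr = negative_rank(sr, pr)
most_unpopular = min(range(len(nr)), key=lambda u: nr[u])
```

Users for whom both measures agree obtain a Negative Rank of zero, whereas user 3 obtains −1 and is ranked as the most unpopular. With `beta=0`, the function returns the unit-length Signed Spectral Rank unchanged.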

Figure 8: The five popularity and centrality measures plotted against each other. From top/left to bottom/right the figure shows: Fans Minus Freaks, PageRank, Signed Spectral Rank, Signed Symmetric Spectral Rank and Negative Rank. A positive correlation can be observed between the spectral methods (excluding Negative Rank), while Fans Minus Freaks and Negative Rank do not correlate significantly with any other measure.

5.2 Top and Central Users

The top six users for each popularity and centrality measure are shown in Table 4. We do not show the bottom users, to discourage users from abusing the system. However, we note that in the list of users with the most freaks, fifth place is taken by user JonKatz, an editor who left Slashdot in 2002. The bottom users in the Fans Minus Freaks and Signed Spectral Ranking measures are known Slashdot trolls. The bottom users for Signed Symmetric Spectral Rank and PageRank are not known trolls.

The top users for Negative Rank are actually neither prominent nor very central in the Slashdot community. We explain this by the fact that many users have a high Negative Rank of approximately zero. The situation is comparable to using a low number of freaks as a measure for finding popular users: While users without freaks may be popular, there are so many of them that the metric is not useful to find the very popular users. However, as we will see, Negative Rank is very good at identifying unpopular users.

Table 4 shows that the top five users for Fans Minus Freaks are prominent. For the other measures however, the top users are frequent users, but not prominent. We suspect that these users have a greater involvement in the Slashdot community, leading to a tightly clustered network of friends. Prominent users such as CmdrTaco on the other hand may have many fans, but are not central to the Slashdot community.

Figure 8 plots the five centrality and popularity measures against each other. We observe a positive correlation between the three spectral methods, and no significant correlation between Fans Minus Freaks, Negative Rank and the other methods.



Table 4: The top six users for the five centrality and popularity measures. Five of the six top users by the Fans Minus Freaks measure are prominent, while the top users by the spectral measures are frequent but not prominent users. We conjecture that these frequent users are more “in the community” than the prominent users, making the spectral centrality and popularity measures more suited for identifying central users on web communities. While Negative Rank is able to identify unpopular users in the Slashdot community, it is not useful for finding central or prominent users.

| Rank | Fans Minus Freaks | PageRank | Signed Spectral Ranking | Signed Symmetric Spectral Ranking | Negative Rank |
|---|---|---|---|---|---|
| 1 | CleverNickName (2460) | FortKnox (0.1510) | FortKnox (0.1527) | Ethelred Unraed (0.1395) | johndiii (0.006279) |
| 2 | Bruce Perens (2143) | SamTheButcher (0.1342) | SamTheButcher (0.1388) | johndiii (0.1376) | SolemnDragon (0.006181) |
| 3 | CmdrTaco (2005) | Ethelred Unraed (0.1280) | turg (0.1319) | FortKnox (0.1358) | Some Woman (0.005515) |
| 4 | John Carmack (1663) | turg (0.1266) | Some Woman (0.1310) | turg (0.1351) | KshGoddess (0.005487) |
| 5 | NewYorkCountryLawyer (1179) | Some Woman (0.1254) | Ethelred Unraed (0.1297) | Some Woman (0.1309) | turg (0.005299) |
| 6 | $$$$$exyGal (1170) | gmhowell (0.1247) | gmhowell (0.1263) | SamTheButcher (0.1276) | btlzu2 (0.005179) |

5.3 Predicting Trolls

To evaluate whether our popularity measures are able to make good predictions for unpopular users, we evaluate them on the task of identifying trolls. Such an automatic identification of trolls could for instance be used to maintain a list of trolls more accurately than maintaining it by hand.

As a benchmark, we use the foes of the user No More Trolls. This user account was created with the purpose of collecting names of trolls. Any Slashdot user can tag this user as a friend and then, in the settings, assign a malus to foes-of-a-friend, which reduces the visibility of No More Trolls’ foes. We use only known trolls having a minimum of 20 incident edges, as trolls with fewer edges would be hard to predict. There are 162 known trolls found this way.

Figure 9: The mean average precision of the popularity and centrality measures at the task of identifying trolls. For comparison, we show the precision of predicting trolls at random. Freaks: number of freaks. FMF: Fans Minus Freaks. PR: PageRank. SR: Signed Spectral Ranking. SSR: Signed Symmetric Spectral Ranking. NR: Negative Rank. Negative Rank performs significantly better than all other popularity measures at the task of identifying Slashdot trolls.

[Plot: MAP (mean average precision), ranging from 0 to 0.25, against β (PageRank importance parameter), ranging from 0 to 2.]

Figure 10: The performance of Negative Rank as a function of the parameter β. We observe that empirically, the best performance is attained at β = 1.

The task consists of predicting who these trolls are by using a popularity measure computed on the corpus excluding edges incident to No More Trolls. We use each measure for predicting the unpopular users, and give the mean average precision (MAP) for each [20].
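To make the evaluation concrete, the following sketch computes the average precision of a single ranking, assuming that users are ranked from the lowest to the highest popularity score and that the known trolls are the relevant items. The exact evaluation protocol used for the figures is not spelled out here, so this setup, the function name and the toy data are assumptions.

```python
def average_precision(scores, trolls):
    """Average precision of ranking users by ascending popularity score.

    scores: list of popularity scores, indexed by user.
    trolls: set of user indices that are known trolls.
    """
    ranking = sorted(range(len(scores)), key=lambda u: scores[u])
    hits = 0
    precision_sum = 0.0
    for rank, user in enumerate(ranking, start=1):
        if user in trolls:
            hits += 1
            precision_sum += hits / rank  # precision at this cut-off
    return precision_sum / len(trolls) if trolls else 0.0

# Made-up scores for five users; users 1 and 2 are the known trolls.
scores = [0.3, -0.9, 0.1, -0.2, 0.5]
trolls = {1, 2}
ap = average_precision(scores, trolls)
```

In this toy ranking the trolls appear at positions 1 and 3, giving precisions of 1 and 2/3 and an average precision of 5/6. A measure that places all trolls at the very bottom would reach a value of 1.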

In addition to the five popularity measures above, we use the negated number of freaks as a further popularity measure. For Negative Rank, we use β = 1.

The results are shown in Figure 9. We observe Negative Rank to perform best. We also observe that the negated number of freaks performs better than any spectral measure except Negative Rank. Signed spectral ranking alone performs very badly, indicating that taken alone, it is not a good indicator of popularity, but has to be combined with PageRank to give Negative Rank.

We study the influence of the parameter β on the performance of Negative Rank in Figure 10. We observe that the optimal performance is attained at β = 1, confirming our previous assessment.



6. LINK PREDICTION

In this section, we analyse measures defined on node pairs. This type of measure covers similarities, distances, mutual trust, and other related concepts. These measures are usually the most important in social network studies, first because they describe the most structure in the network, and second because they can be applied to link prediction and recommendation. Since most link prediction methods can be formulated algebraically using matrix multiplication, we will be able to verify our assumption of multiplicative transitivity.

We will thus focus on the task of predicting links in signed social networks. In particular, we cover the prediction of the sign of edges. Because link prediction algorithms defined for networks with nonnegative edge weights are only able to predict the presence or absence of an edge, we will have to define new link prediction algorithms that take into account the edge sign, and are able to predict missing edges’ sign.

In social networking websites, link prediction allows the implementation of recommender systems that can recommend new edges. In the case of Slashdot, a possible application would be the recommendation of new users for one’s friend and foe lists. For these reasons, link prediction is one of the most important machine learning tasks that apply to large networks.

6.1 Problem Formulation

A common way to model the link prediction task is via the notion of similarity: If two nodes are similar, then they are likely to be connected by an edge already, and if not, to be connected soon. Thus, a measure of similarity defined on a given edge set can be used to implement a link prediction algorithm. We will therefore focus on the definition of similarity measures between nodes: functions that, given two nodes in the network, return a similarity value.

Similarity functions are usually symmetric. Also, they are often positive semi-definite. If both conditions are met, a similarity function is a kernel. However, similarity functions need not necessarily be symmetric and positive semi-definite in our setting: We may predict a link between u and v without predicting a link between v and u. In this case, we may call such a function a proximity measure.
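The two kernel conditions can be checked numerically on a similarity matrix. The following is a minimal sketch using NumPy; the function name `is_kernel` and the tolerance are illustrative assumptions, not part of the method described here.

```python
import numpy as np


def is_kernel(K, tol=1e-9):
    """Check whether a similarity matrix is symmetric and positive semi-definite."""
    K = np.asarray(K, dtype=float)
    # Condition 1: symmetry, up to a numerical tolerance.
    if not np.allclose(K, K.T, atol=tol):
        return False
    # Condition 2: no eigenvalue is (numerically) negative.
    return bool(np.linalg.eigvalsh(K).min() >= -tol)
```

A matrix that fails the symmetry check can still serve as a proximity measure in the sense used above.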

6.2 Baseline Algorithms

As baseline prediction algorithms, we use the three following strategies:

• (1) Always predict a positive edge. This strategy has an accuracy greater than zero because there are more positive than negative edges.

• ($A^T$) If there is an edge in the opposite direction, predict the sign of that edge. Otherwise, predict a positive edge.

• ($A^2$) Use the squared adjacency matrix for prediction. This simple strategy makes use of multiplicative transitivity.
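The three strategies can be sketched on a small dense matrix as follows. This is an illustration only: the toy network is hypothetical, a dense NumPy array stands in for the sparse adjacency matrix, and falling back to a positive edge when the relevant entry is zero is an assumption for the squared-matrix case.

```python
import numpy as np


def sign_or_positive(value):
    """Return the sign of value, falling back to +1 when it is zero."""
    return -1 if value < 0 else 1


def predict_always_positive(A, u, v):
    # Baseline (1): ignore the network entirely.
    return 1


def predict_transpose(A, u, v):
    # Baseline (A^T): copy the sign of the reverse edge v -> u if it exists.
    return sign_or_positive(A[v, u])


def predict_squared(A2, u, v):
    # Baseline (A^2): sign of the signed count of two-step paths u -> w -> v.
    return sign_or_positive(A2[u, v])


# Hypothetical 4-user network: A[u, v] = +1 (friend), -1 (foe), 0 (no edge).
A = np.zeros((4, 4), dtype=int)
A[0, 1] = -1   # user 0 marks user 1 as foe
A[1, 2] = -1   # user 1 marks user 2 as foe
A[2, 0] = +1   # user 2 marks user 0 as friend
A2 = A @ A     # computed once and reused for all queries

print(predict_squared(A2, 0, 2))    # foe of a foe: two negative edges multiply
print(predict_transpose(A, 1, 0))   # copies the sign of the edge 0 -> 1
```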

We found similarity measures based on the shortest-path distance to have no better accuracy than these three. The accuracy of the baseline algorithms is shown in Table 5.

Table 5: The three baseline algorithms for link sign prediction. The accuracy is measured on a scale from −1 to +1. Greater values denote higher prediction accuracy.

Algorithm    Accuracy
1            0.517
$A^T$        0.536
$A^2$        0.552

6.3 Algebraic Similarity Measures

In the previous section, we used the square of the adjacency matrix A to make predictions about the sign of new edges. While calculating the square of A is possible given the corpus’ size, computing the cube of A is already too expensive: While A itself is sparse, $A^3$ contains non-zero values for all node pairs that are separated by at most three edges. Given that the graph’s radius is three and its diameter is six, computing powers of A beyond $A^2$ results in non-sparse matrices, which are too big for practical calculations. Therefore, we resort to dimensionality reduction.

6.3.1 Dimensionality Reduction (A)

The matrix A can be reduced dimensionally by performing a sparse singular value decomposition, resulting in a low-rank approximation of the original matrix:

$$A \approx A_k = U_k D_k V_k^T \qquad (7)$$

$U_k$, $D_k$ and $V_k$ are of dimensions n × k, k × k and n × k respectively, and $D_k$ is diagonal. The integer k is the reduced dimension.

While the matrix A contains zero entries for edges not present in the original graph, the approximation $A_k$ is nonzero at these entries, and the sign of these entries can be used as a prediction for the sign of missing edges. In contrast to calculating sparse matrix powers as we do for our baseline algorithms, dimensional reduction is efficient and can be applied to large, sparse matrices.
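A minimal sketch of this prediction step is given below. It assumes a small dense matrix and uses NumPy's full SVD truncated to k components. For a matrix of the size discussed here, a sparse truncated solver would be needed instead. The toy matrix and the function name are hypothetical.

```python
import numpy as np


def low_rank_approximation(A, k):
    """Rank-k approximation A_k = U_k D_k V_k^T from a truncated SVD."""
    # Singular values are returned in descending order.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]


# Hypothetical signed adjacency matrix (+1 friend, -1 foe, 0 unknown).
A = np.array([
    [ 0,  1, -1,  0],
    [ 1,  0, -1,  0],
    [-1, -1,  0,  1],
    [ 0,  0,  1,  0],
], dtype=float)

A_k = low_rank_approximation(A, k=2)

# Predicted sign for every node pair that has no edge in A (diagonal excluded).
missing = (A == 0) & ~np.eye(4, dtype=bool)
predictions = {
    (int(u), int(v)): (1 if A_k[u, v] > 0 else -1)
    for u, v in zip(*np.nonzero(missing))
}
```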

6.3.2 Symmetric Dimensionality Reduction (A sym)

As shown in the previous section, using edges going in the opposite direction gives more accurate prediction than only predicting positive edges. Therefore, we apply dimensionality reduction to the symmetric matrix $A + A^T$. In the case of symmetric matrices, we use the eigenvalue decomposition:

$$A + A^T \approx U_k D_k U_k^T \qquad (8)$$

Just as for (asymmetric) dimensionality reduction, the integer k represents a parameter of the prediction algorithm.

6.3.3 Matrix Exponential (A exp, A sym exp)

The exponential kernel is a similarity measure based on the matrix exponential function [11]. It is based on the observation that, analogously to the real exponential, the matrix exponential can be expressed as an infinite sum of matrix powers, with weights that decay with the inverse factorial:

$$\exp(A) = \sum_{i=0}^{\infty} \frac{1}{i!} A^i \qquad (9)$$

Because the n-th power of the adjacency matrix of a graph contains, for each node pair, the number of paths of length n between the two nodes, the matrix exponential represents a weighted mean of path counts between any two nodes, with weights decaying as the inverse factorial. With the graph containing negative edges, the powers of A represent signed path counts, where paths with an odd number of negative edges are counted negatively, thus implementing our enemy-of-an-enemy multiplication rule generalized to arbitrarily long paths. Although in the case of an asymmetric A the exponential does not represent a kernel, we can still use it for edge sign prediction.
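As a small worked illustration of signed path counts (the node names $u$, $v$, $w_1$ and $w_2$ are hypothetical), suppose $u$ and $v$ are connected by exactly two paths of length two: one through $w_1$ with both edges negative, and one through $w_2$ with one positive and one negative edge. The corresponding entry of $A^2$ is then:

```latex
(A^2)_{uv} = \sum_{w} A_{uw} A_{wv}
           = \underbrace{(-1)(-1)}_{\text{via } w_1}
           + \underbrace{(+1)(-1)}_{\text{via } w_2}
           = 1 - 1 = 0
```

The path through $w_1$ counts positively (two negative edges), the path through $w_2$ counts negatively (one negative edge), and in this example the two cancel, so $A^2$ alone carries no sign information for this pair.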

Therefore, the matrix exponential will in many cases give the same predictions as $A^2$ for node pairs that have a distance of 2, but unlike simply using $A^2$, it will also give sensible predictions for nodes further apart, instead of resorting to simply predicting a positive edge.

To compute the matrix exponential, we use the fact that the matrix exponential of a dimensionally reduced matrix can be computed by applying the exponential to the diagonal matrix of eigenvalues. For the symmetric case, we have:

$$\exp(U_k D_k U_k^T) = U_k \exp(D_k) U_k^T \qquad (10)$$

The exponential of the diagonal matrix $D_k$ can be computed by applying the real exponential to every diagonal element.
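The symmetric case can be sketched as follows, assuming a small dense matrix and NumPy's dense eigensolver. Keeping the k eigenvalues of largest absolute value is an assumption of this sketch, as are the function names and the toy matrix. The helper `exp_by_series` evaluates the power series directly and is only there as a reference for comparison.

```python
import numpy as np


def reduced_eigen(B, k):
    """The k eigenpairs of the symmetric matrix B with largest absolute eigenvalue."""
    eigvals, eigvecs = np.linalg.eigh(B)
    keep = np.argsort(-np.abs(eigvals))[:k]
    return eigvecs[:, keep], eigvals[keep]


def exponential_kernel(A, k):
    """U_k exp(D_k) U_k^T for the symmetrized matrix A + A^T."""
    U, d = reduced_eigen(A + A.T, k)
    # exp(D_k) is the real exponential applied to each eigenvalue.
    return U @ np.diag(np.exp(d)) @ U.T


def exp_by_series(B, terms=30):
    """Reference value of exp(B) from the truncated power series."""
    result = np.zeros_like(B)
    term = np.eye(B.shape[0])          # B^0 / 0!
    for i in range(terms):
        result = result + term
        term = term @ B / (i + 1)      # B^(i+1) / (i+1)!
    return result


# Hypothetical directed signed network.
A = np.array([
    [ 0, 1, -1, 0],
    [ 0, 0, -1, 0],
    [-1, 0,  0, 1],
    [ 0, 0,  0, 0],
], dtype=float)

K_full = exponential_kernel(A, k=4)   # no reduction: matches the series
K_2 = exponential_kernel(A, k=2)      # reduced to two eigenpairs
```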

6.3.4 Inverted Laplacian (Ls sym)

Laplacian kernels are defined mathematically by the pseudoinversion of the graph’s Laplacian matrix L. Depending on the precise definition, Laplacian kernels are known as resistance distance kernels [15], random forest kernels [2], random walk or mean passage time kernels [4] and von Neumann kernels [14].

Laplacian kernels have been applied to the social network analysis problems of co-authorship graph mining [1], collaborative recommender systems and community detection [4]. In these cases, however, the underlying graphs have only positive edges. In order to apply Laplacian kernels to graphs with negative edges, we use the measure described as the signed resistance distance in [17], defined as:

$$K = (E - B)^{+} \qquad (11)$$

where $B = A + A^T$ is the symmetric adjacency matrix and E is the diagonal degree matrix.

As with the matrix exponential, the matrix pseudoinverse can be computed easily in conjunction with dimensionality reduction, by pseudoinverting the diagonal matrix, which amounts to inverting those diagonal elements that are (numerically) non-zero.
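The following sketch illustrates the pseudoinversion under two assumptions that are not stated above: the degree matrix E is built from the absolute values of the entries of B, and the reduction keeps the k smallest eigenvalues of E − B, since these dominate the pseudoinverse. The function name, tolerance and toy network are hypothetical.

```python
import numpy as np


def signed_laplacian_kernel(A, k, tol=1e-9):
    """Pseudoinverse of the signed Laplacian E - B, reduced to k eigenpairs."""
    B = A + A.T
    # Assumption: degrees are sums of absolute edge weights.
    E = np.diag(np.abs(B).sum(axis=1))
    L = E - B
    eigvals, eigvecs = np.linalg.eigh(L)      # eigenvalues in ascending order
    U, d = eigvecs[:, :k], eigvals[:k]
    # Invert only the eigenvalues that are numerically non-zero.
    inv_d = np.zeros_like(d)
    nonzero = np.abs(d) > tol
    inv_d[nonzero] = 1.0 / d[nonzero]
    return U @ np.diag(inv_d) @ U.T


# Hypothetical triangle with one negative edge.
A = np.zeros((3, 3))
A[0, 1] = 1.0
A[1, 2] = 1.0
A[2, 0] = -1.0

K = signed_laplacian_kernel(A, k=3)
```

With k equal to the number of nodes, the result coincides with the ordinary pseudoinverse of E − B; smaller values of k give the reduced kernel.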

Although asymmetric (“directed”) Laplacians can be defined [24], we found through extensive experimentation that they perform poorly for our task.

6.4 Experimental Evaluation

To evaluate the link prediction algorithms, we split the set of edges into a training set and a test set, train a similarity measure on the training set, and use that similarity measure to predict the sign of the edges in the test set. We simply use the sign of the similarity function as the prediction: If the similarity function returns a positive value, we predict a positive edge, else we predict a negative edge. In our evaluation, we hold out 30% of the edges as the test set.

We measure the prediction accuracy on a scale from −1 to +1, where +1 denotes that all sign predictions are correct and −1 denotes that all sign predictions are wrong. Thus, an algorithm that predicts positive and negative edges randomly with equal probability would have an accuracy of zero.
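The evaluation protocol can be sketched as follows. The accuracy formula, correct minus wrong predictions divided by the total, is an assumption: it is one reading that matches the stated end points and gives zero for random guessing. The function names and the fixed random seed are likewise illustrative.

```python
import random


def split_edges(edges, test_fraction=0.3, seed=0):
    """Shuffle the edges and hold out a fraction of them as the test set."""
    shuffled = list(edges)
    random.Random(seed).shuffle(shuffled)
    n_test = round(test_fraction * len(shuffled))
    return shuffled[n_test:], shuffled[:n_test]   # (training set, test set)


def predict_sign(similarity):
    """Positive similarity predicts a positive edge, anything else a negative one."""
    return 1 if similarity > 0 else -1


def signed_accuracy(predicted, actual):
    """Accuracy on a scale from -1 (all wrong) to +1 (all correct)."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    wrong = len(actual) - correct
    return (correct - wrong) / len(actual)
```

Under this reading, an accuracy of a corresponds to a fraction (1 + a) / 2 of correctly predicted signs.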

The evaluation results are shown in Figure 11. Each similarity measure is evaluated using a varying dimensional reduction parameter k.

[Plot for Figure 11: accuracy, from 0.54 to 0.68, against k (dimensional reduction), from 0 to 20, with one curve each for A, A exp, A sym, A sym exp and Ls sym.]

Figure 11: The accuracy of the various algebraic similarity measures on the link sign prediction task. Each similarity measure is tested using a varying dimensional reduction parameter k. Greater values denote higher prediction accuracy.

Our empirical study shows that the exponential kernels perform better than their non-exponential counterparts. Also, they do not suffer from overfitting the data as k grows: The exponential kernels attain their best performance asymptotically for large k, while A and A sym attain their greatest accuracy for specific, small values of k. Numerically, this is best explained by the fact that the exponential function lets the eigenvalues of the exponential kernel become small very fast for growing k. The overall best prediction accuracy is achieved by the asymmetric matrix exponential. The good performance of the exponential kernels indicates that the multiplicative transitivity in the Slashdot Zoo can be extended to paths of length greater than two if corresponding damping factors are used, as done implicitly in the exponential kernel. The fact that the asymmetric exponential kernel outperforms the symmetric exponential kernel is a hint that the assumption of symmetry is wrong in the study of multiplicative transitivity in this signed network.

The signed Laplacian similarity matrix provides accuracy comparable to simple dimensionality reduction. However, it does not suffer from overfitting for large k, and it already attains its good performance for very small k.

7. CONCLUSION

This paper considered social network analysis on graphs with negative edge weights. We studied the Slashdot Zoo, a social network that is well-known for containing negative links. Our analysis of that social network was carried out on three levels. On the global level, we defined the signed clustering coefficient and relative signed clustering coefficient. On the node level, we defined Negative Rank, a new popularity measure, and showed how it can be used to identify troll users in the Slashdot community. On the link level, we studied the task of link sign prediction using various signed spectral similarity measures. The study of the Slashdot Zoo on these three levels showed that the network exhibits multiplicative transitivity, a property of signed social networks that can be summarized by the phrase “the enemy of my enemy is my friend”. We showed that these methods for analysing a network with negative edge weights apply to large social networks and reveal facts that cannot be uncovered using common, unsigned techniques.

We are currently analysing further social networks with positive and negative endorsements. The software running Slashdot is called Slash (footnote 2) and is nowadays also used on other websites such as Barrapunto, Slashdot Japan, and Use Perl. These sites also use the Zoo feature, and could be analysed just as the Slashdot Zoo. Other sources of negative links in social networks are sites such as Digg, where users can rate content by other users, giving rise to indirect negative edges. The sites Essembly [9] and Epinions [5] also represent social networks with negative edges.

Some social network analysis methods were not covered in this paper, but could conceivably be extended to the signed case: betweenness centrality measures, spectral clustering, random graph modeling of signed social networks, stochastic approaches for prediction (which are similar to the resistance distance approach), and others.

8. REFERENCES

[1] A. T. Balaban and D. J. Klein. Co-authorship, rational Erdos numbers, and resistance distances in graphs. Scientometrics, 55(1):59–70, 2002.

[2] P. Chebotarev and E. V. Shamis. On proximity measures for graph vertices. Automation and Remote Control, 59(10):1443–1459, 1998.

[3] C. de Kerchove and P. V. Dooren. The PageTrust algorithm: how to rank web pages when negative links are allowed? In Proc. SIAM Int. Conf. on Data Mining, pages 346–352, 2008.

[4] F. Fouss, A. Pirotte, J.-M. Renders, and M. Saerens. Random-walk computation of similarities between nodes of a graph with application to collaborative recommendation. IEEE Trans. on Knowledge and Data Engineering, 19(3):355–369, 2007.

[5] R. Guha, R. Kumar, P. Raghavan, and A. Tomkins. Propagation of trust and distrust. In Proc. Int. Conf. on World Wide Web, pages 403–412, 2004.

[6] V. Gomez, A. Kaltenbrunner, and V. Lopez. Statistical analysis of the social network and discussion threads in Slashdot. In Proc. Int. Conf. on World Wide Web, pages 645–654, 2008.

[7] P. Hage and F. Harary. Structural Models in Anthropology. Cambridge University Press, 1983.

[8] J. L. Herlocker, J. A. Konstan, A. Borchers, and J. Riedl. An algorithmic framework for performing collaborative filtering. In Proc. Int. Conf. on Research and Development in Information Retrieval, pages 230–237, 1999.

[9] T. Hogg, D. Wilkinson, G. Szabo, and M. Brzozowski. Multiple relationship types in online communities and social networks. In Proc. AAAI Spring Symposium on Social Information Processing, 2008.

[10] Y. Hou. Bounds for the least Laplacian eigenvalue of a signed graph. Acta Mathematica Sinica, 21(4):955–960, 2005.

[11] T. Ito, M. Shimbo, T. Kudo, and Y. Matsumoto. Application of kernels to link analysis. In Proc. Int. Conf. on Knowledge Discovery in Data Mining, pages 586–592, 2005.

[12] G. Kalna and D. J. Higham. A clustering coefficient for weighted networks, with application to gene expression data. AI Commun., 20(4):263–271, 2007.

[13] S. D. Kamvar, M. T. Schlosser, and H. Garcia-Molina. The EigenTrust algorithm for reputation management in P2P networks. In Proc. Int. Conf. on World Wide Web, pages 640–651, 2003.

[14] J. Kandola, J. Shawe-Taylor, and N. Cristianini. Learning semantic similarity. In Advances in Neural Information Processing Systems, pages 657–664, 2002.

[15] D. J. Klein and M. Randic. Resistance distance. Mathematical Chemistry, 12(1):81–95, 1993.

[16] R. Kondor and J. Lafferty. Diffusion kernels on graphs and other discrete structures. In Proc. Int. Conf. on Machine Learning, pages 315–322, 2002.

[17] J. Kunegis, S. Schmidt, C. Bauckhage, M. Mehlitz, and S. Albayrak. Modeling collaborative similarity with the signed resistance distance kernel. In Proc. Eur. Conf. on Artificial Intelligence, pages 261–265, 2008.

[18] C. Lampe and E. Johnston. Follow the (Slash) dot: Effects of feedback on new members in an online community. In Proc. Int. Conf. on Supporting Group Work, pages 11–20, 2005.

[19] D. Liben-Nowell and J. Kleinberg. The link prediction problem for social networks. In Proc. Int. Conf. on Information and Knowledge Management, pages 556–559, 2003.

[20] M. A. Najork, H. Zaragoza, and M. J. Taylor. Hits on the web: how does it compare? In Proc. Int. Conf. on Research and Development in Information Retrieval, pages 471–478, 2007.

[21] M. E. J. Newman. A measure of betweenness centrality based on random walks. Social Networks, 27(1):39–54, 2005.

[22] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford Digital Library Technologies Project, 1998.

[23] J. Scott. Social Network Analysis: a Handbook. Sage Publications, 2000.

[24] H. Tong, C. Faloutsos, and Y. Koren. Fast direction-aware proximity for graph mining. In Proc. Int. Conf. on Knowledge Discovery and Data Mining, pages 747–756, 2007.

[25] D. J. Watts and S. H. Strogatz. Collective dynamics of ‘small-world’ networks. Nature, 393(6684):440–442, June 1998.

[26] S. White and P. Smyth. Algorithms for estimating relative importance in networks. In Proc. Int. Conf. on Knowledge Discovery and Data Mining, pages 266–275, 2003.

[27] B. Yang, W. Cheung, and J. Liu. Community mining from signed social networks. IEEE Trans. on Knowl. and Data Eng., 19(10):1333–1348, 2007.

Footnote 2: http://www.slashcode.com/

