Privacy and Spectral Analysis on Social Network Randomization

Xiaowei Ying, Leting Wu, Xintao Wu

University of North Carolina at Charlotte

Framework

Background & Motivation
Privacy in Randomized Graph
-- Link privacy (3 methods to quantify link privacy)
-- Node privacy
Feature Preserving Randomization
-- Spectrum preserving randomization
-- General feature preserving randomization (Markov chain based)
-- Attacks to feature preserving randomization
Reconstruction from Randomized Graphs
Spectrum Based Fraud Detection
-- A spectral framework to quantify non-randomness of social networks
-- Spectrum based fraud detection
Future Work

Background & Motivation

Social Network

Friendship in Karate club [Zachary, 77]

Biological association network of dolphins [Lusseau et al., 03]

Collaboration network of scientists [Newman, 06]

Network of US political books (105 nodes, 441 edges)
Books about US politics sold by Amazon.com. Edges represent frequent co-purchasing of books by the same buyers. Nodes have been given colors of blue, white, or red to indicate whether they are "liberal", "neutral", or "conservative".

Publish/outsource data for mining/analysis
[Diagram: Data Owner releases the original graph data to Public / Third party / Research Inst.]

Data miner: discover patterns/features of the data (utility)
-- find central nodes, community partition, link prediction

Attacker: breach sensitive information in the data (privacy)
-- identity of nodes (and sensitive attributes), sensitive relations between two individuals

Privacy issues in publishing social network data:
Anonymization is not enough to protect privacy: active/passive attacks [1], subgraph attacks [2].

[1] L. Backstrom et al., Wherefore art thou r3579x?: anonymized social networks, hidden patterns, and structural steganography. WWW07
[2] M. Hay et al., Resisting Structural Reidentification in Anonymized Social Networks. VLDB08

Privacy Preserving Social Network Publishing

Node-anonymization cannot guarantee identity/link privacy due to subgraph queries.
K-anonymity generalization
-- The released graph has at least k nodes with the same degree/subgraph/neighborhood [Liu&Terzi SIGMOD08, Zhou&Pei ICDE08, Chen VLDB09]
Graph (edge) randomization
-- Random Add/Del & Random Switch
-- Utility preserving randomization
Super graph generalization
-- Generalize nodes into super nodes, and edges into super edges

Graph Randomization/Perturbation:
Random Add/Del edges (no. of edges unchanged)

Random Switch edges (node degrees unchanged)
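The two perturbation schemes above can be sketched in a few lines of Python. This is only an illustration: the function names, the edge representation (tuples with the smaller node first), and the 6-cycle toy graph are assumptions that do not come from the slides.

```python
import itertools
import random


def random_add_del(edges, nodes, k, rng):
    """Delete k existing edges and add k edges drawn from the original non-edges."""
    edges = set(edges)
    non_edges = [e for e in itertools.combinations(sorted(nodes), 2) if e not in edges]
    removed = rng.sample(sorted(edges), k)
    added = rng.sample(non_edges, k)
    return (edges - set(removed)) | set(added)


def random_switch(edges, k, rng, max_tries=10000):
    """Apply up to k successful switches (a,b),(c,d) -> (a,d),(c,b)."""
    edges = set(edges)
    done = 0
    for _ in range(max_tries):
        if done == k:
            break
        e1, e2 = rng.sample(sorted(edges), 2)
        (a, b), (c, d) = e1, e2
        if rng.random() < 0.5:  # pick one of the two possible rewirings
            c, d = d, c
        new1, new2 = tuple(sorted((a, d))), tuple(sorted((c, b)))
        # Skip switches that would create a self-loop or a duplicate edge.
        if a == d or c == b or new1 in edges or new2 in edges:
            continue
        edges = (edges - {e1, e2}) | {new1, new2}
        done += 1
    return edges


def degrees(edges):
    """Return a dict mapping each node to its degree."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg


rng = random.Random(0)
original = {(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (1, 6)}  # toy 6-cycle
perturbed_add_del = random_add_del(original, range(1, 7), 2, rng)
perturbed_switch = random_switch(original, 2, rng)
```

Add/Del keeps the number of edges fixed but changes node degrees, while Switch keeps every node degree fixed.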

Graph Randomization/Perturbation:

Data privacy:
-- How does graph randomization prevent privacy disclosure?

Data utility:
-- How will the graph structure change due to randomization?
-- How can graph structural features be better preserved?

In our work, we try to answer the following questions. First, how does the graph structure change under perturbation? This is an important issue if we want to publish the data for analysis. Second, is perturbation effective in protecting privacy, and to what extent should we perturb?

Numerous topological measures of networks

Harmonic mean of shortest distance

Transitivity (clustering coefficient)

Subgraph centrality

Modularity (community structure);
And many others

Spectral measures

Adjacency Matrix A (symmetric)

Adjacency Spectrum

Laplacian Matrix and Spectrum:

Normal Matrix and Spectrum

Many topological features are related to spectral measures:

No. of triangles:

Subgraph centrality:

Graph diameter:

k disconnected parts in the graph <=> k 0s in the Laplacian spectrum.
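Two of these relations can be checked numerically. The sketch below uses two standard identities: the number of triangles equals one sixth of the sum of cubed adjacency eigenvalues, and the number of connected components equals the number of zero Laplacian eigenvalues. The toy graph (two disjoint triangles) and the tolerance are illustrative assumptions.

```python
import numpy as np

# Toy graph: two disjoint triangles on nodes {0,1,2} and {3,4,5}.
A = np.zeros((6, 6))
for u, v in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    A[u, v] = A[v, u] = 1

adj_eigs = np.linalg.eigvalsh(A)       # adjacency spectrum
L = np.diag(A.sum(axis=1)) - A         # Laplacian L = D - A
lap_eigs = np.linalg.eigvalsh(L)       # Laplacian spectrum

# trace(A^3) counts each triangle 6 times, and trace(A^3) = sum of lambda_i^3.
triangles = int(round(np.sum(adj_eigs ** 3) / 6))
# Each connected component contributes one zero Laplacian eigenvalue.
components = int(np.sum(np.abs(lap_eigs) < 1e-8))
```

For this toy graph both quantities come out as 2.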

Two important eigenvalues: $\lambda_1$ (the largest adjacency eigenvalue) and $\mu_2$ (the second smallest Laplacian eigenvalue)
-- The maximum degree, chromatic number, clique number, etc. are related to $\lambda_1$;
-- The epidemic threshold for virus propagation in the network is related to $\lambda_1$ [Wang et al., KDD03];
-- $\mu_2$ indicates the community structure of the graph: clear community structure <=> $\mu_2 \approx 0$.

Basic Facts of Graph Spectrum

The Laplacian eigenvalues
0.00 0.00 0.00 1.27 2.59 3.00 3.00 3.00 4.00 4.00 4.00 4.73 5.00 5.41 6.00 6.00 6.00 6.00 6.00
0.00 0.11 0.34 1.31 2.60 3.00 3.10 3.36 4.00 4.13 4.59 4.79 5.31 5.58 6.00 6.00 6.00 6.66 7.12
Graph from: A. Capocci et al., Detecting communities in large networks

The Laplacian eigenvectors

Privacy in Randomized Graph

Link Privacy: Prior & Posterior Beliefs
Quantify the attacker's belief (assume that node identities are known)
Prior probabilities:

Posterior probability for node pair (i, j):

Privacy is seriously jeopardized when the posterior probability is much larger than the prior.

Method I [Ying, Wu, SDM08]
Add & Del k links

Switch k times
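For the Add/Del case, prior and posterior beliefs follow from a simple counting argument, sketched below. The slide's own formulas did not survive extraction, so the symbols (n nodes, m true links, k deleted and k added links, N node pairs) and the function name are assumptions. The Polbooks size (105 nodes, 441 edges) is from the slides, and k = 220 is an illustrative choice of roughly 50%.

```python
from fractions import Fraction


def add_del_beliefs(n, m, k):
    """Attacker beliefs about a node pair after deleting k true links and adding k false ones."""
    N = n * (n - 1) // 2                     # number of node pairs
    prior = Fraction(m, N)                   # P(true link) before seeing the released graph
    post_given_link = Fraction(m - k, m)     # m released links, m - k of them are true
    post_given_no_link = Fraction(k, N - m)  # N - m released non-links hide k true links
    return prior, post_given_link, post_given_no_link


n, m, k = 105, 441, 220
prior, post_link, post_no_link = add_del_beliefs(n, m, k)
N = n * (n - 1) // 2
```

Summed over all node pairs, both the prior and the posterior beliefs add up to m, since m * post_link + (N - m) * post_no_link = (m - k) + k.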

Method II [Ying, Wu, PAKDD09]

A common phenomenon: in real-world graphs, similar nodes tend to connect to each other.
Even after moderate randomization, the phenomenon still exists.

Add/Del:
-- True links are deleted w.p. $k/m$, where $m$ is the number of true links
-- False links are added w.p. $k/(N-m)$, where $N$ is the number of node pairs

With Bayes' theorem

Evaluation (add/del 50% true links)

The total sum of prior and posterior probabilities is the same:
[Figure: prior prob., posterior prob. I, posterior prob. II]

Method III [Ying, Wu, SDM09]
Intuition: the degree sequence specifies a graph space, and the true graph is just one member of the space.

Example: switch graph with degree sequence {3,2,2,2,3}
Are nodes 1 and 5 connected?
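For a graph this small, the graph space can be enumerated by brute force. The sketch below assumes that nodes 1 to 5 carry the degrees 3, 2, 2, 2, 3 in that order and that the attacker puts a uniform belief on every simple labeled graph in the space. The function name is illustrative.

```python
import itertools


def graphs_with_degree_sequence(degree):
    """Yield every simple labeled graph (as an edge set) with the given node degrees."""
    nodes = sorted(degree)
    pairs = list(itertools.combinations(nodes, 2))
    m = sum(degree.values()) // 2            # number of edges fixed by the degrees
    for edges in itertools.combinations(pairs, m):
        deg = dict.fromkeys(nodes, 0)
        for u, v in edges:
            deg[u] += 1
            deg[v] += 1
        if deg == degree:
            yield set(edges)


degree = {1: 3, 2: 2, 3: 2, 4: 2, 5: 3}
space = list(graphs_with_degree_sequence(degree))
# Fraction of graphs in the space that contain the link (1, 5).
p_15 = sum((1, 5) in g for g in space) / len(space)
```

Under these assumptions the space has 7 members, 6 of which contain the link (1, 5), so the belief that nodes 1 and 5 are connected is 6/7.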

Method III [Ying, Wu, SDM09]
Graph space = {G: with a given degree sequence}
-- Impractical to enumerate all members of the space
-- Sample the graph space through a Markov chain:
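A switch-based sampler for this graph space might look like the sketch below. It is not the authors' implementation. Rejected switches leave the graph unchanged, which keeps the chain symmetric. The function names, the thinning gap, and the sample count are illustrative assumptions, and the starting graph is a toy graph with degree sequence {3,2,2,2,3}.

```python
import random


def switch_step(edges, rng):
    """One Markov chain step: try to switch a random pair of edges, else stay put."""
    e1, e2 = rng.sample(sorted(edges), 2)
    (a, b), (c, d) = e1, e2
    if rng.random() < 0.5:                   # choose one of the two rewirings
        c, d = d, c
    new1, new2 = tuple(sorted((a, d))), tuple(sorted((c, b)))
    if a == d or c == b or new1 in edges or new2 in edges:
        return edges                         # invalid switch: keep the current graph
    return (edges - {e1, e2}) | {new1, new2}


def estimate_link_probability(start_edges, pair, n_samples, gap, rng):
    """Fraction of sampled graphs containing `pair`; one sample every `gap` switch steps."""
    edges, hits = set(start_edges), 0
    for _ in range(n_samples):
        for _ in range(gap):
            edges = switch_step(edges, rng)
        hits += pair in edges
    return hits / n_samples


# Toy released graph with degree sequence {3,2,2,2,3}; nodes 1 and 5 are not linked here.
released = {(1, 2), (1, 3), (1, 4), (2, 5), (3, 5), (4, 5)}
estimate = estimate_link_probability(released, (1, 5), 2000, 20, random.Random(1))
```

The estimate can be compared with the value 6/7 obtained by counting all seven labeled graphs with this degree sequence.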

Evaluation
[Figures: Polbooks (r=8%), Enron (r=8%)]

Node Identity Privacy

Identity privacy:
-- Re-identify nodes in the anonymous graphs based on some background information (e.g. degree)
-- Randomization reduces the attacker's beliefs

Polbooks: degree distribution

After randomization

Nodes' prior and posterior risks
Given an individual with degree d and a randomized graph

Prior risk:

Posterior risks

Ongoing work:
Compare the randomization and k-anonymity approaches:
-- to achieve the same privacy protection level, which approach achieves better utility?
Combine identity privacy and link privacy.
Node identity privacy under different background information (e.g., sub-graph, neighborhood).

K-degree generalization [Liu et al.]

Feature Preserving Randomization

Topological and spectral features change a lot along the perturbation.

(Network of US political books, 105 nodes and 441 edges)
Can we better preserve the network structure?

Features in Social Network Data
Two important eigenvalues: $\lambda_1$ (the largest adjacency eigenvalue) and $\mu_2$ (the second smallest Laplacian eigenvalue)
-- The maximum degree, chromatic number, clique number, etc. are related to $\lambda_1$;
-- The epidemic threshold for virus propagation in the network is related to $\lambda_1$ [Wang et al., KDD03];
-- $\mu_2$ indicates the community structure of the graph: clear community structure <=> $\mu_2 \approx 0$.

Spectrum Preserving Randomization

Spectrum preserving approach [Ying, Wu, SDM08]

Intuition: since spectrum is related to many graph topological features, can we preserve more structural features by controlling the movement of eigenvalues?

Spectral Switch (applied to the adjacency matrix):

To increase the eigenvalue:

To decrease the eigenvalue:

Spectral Switch (applied to the Laplacian matrix):

To decrease the eigenvalue:

To increase the eigenvalue:

Evaluation:

(Network of US political books, 105 nodes and 441 edges)

Markov Chain Based Feature Preserving Randomization

Markov chain generation [Ying, Wu, SDM09]
-- The data owner puts feature range constraints on switching
-- Feature range constraints:
-- The data owner publishes the feature range constraint.

Markov chain with feature range constraint (uniformity for accessible graphs)
-- Problem: accessibility is not guaranteed
-- We propose the relaxed algorithm with feature range constraint (accessibility, approximate uniformity)
-- The relaxed algorithm also has applications in testing the significance of data mining results

Attacks to Feature Preserving Randomization

The data owner puts feature range constraints on switching
Feature range constraints:

Can attackers utilize the feature constraints to breach link privacy?

Markov chain approach [Ying, Wu, SDM09]
-- Markov chain with feature range constraint
-- Graph space = {G: with a given deg. seq. & S(G) in R}
-- Starting with the randomized data, repeat the switch procedure many times and get one sample graph
-- Generate N graphs

Evaluation
[Figures: Polbooks (r=8%), Enron (r=8%)]

Future work: what causes the difference? What features will (not) release privacy?

Reconstruction from Randomized Graphs
-- Motivation
-- Low Rank Approximation on Graph Data
-- Reconstruction from Randomized Graph
-- Privacy Issue
(SDM10 paper)

Motivation
We focus on whether we can reconstruct a graph from the randomized graph.

Our Focus

Revisit of LRA in Numerical Data

Spectral filter: derive an estimate of U from the perturbed data
-- Calculate the covariance matrix of the perturbed data, which is symmetric and positive definite
-- Apply spectral decomposition to the covariance matrix
-- Derive the eigenvalue information from the covariance matrix of the noise V and choose a proper number of dimensions, r
-- Project the perturbed data onto the r leading eigenvectors to obtain the estimated data set

Why it works
-- Original data are correlated
-- Noise is not correlated

[Figure: original signal + noise = perturbed data; 1st and 2nd principal vectors; 2-d estimation vs. 1-d estimation]

Determining r

Strategy 1: (Huang and Du, SIGMOD05)

Strategy 2: (Guo, Wu and Li, PKDD 2006)
The resulting estimate is approximately optimal
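The spectral filtering steps and the eigenvalue-threshold rule for choosing r can be sketched as follows. This assumes i.i.d. noise with a known variance, so every noise eigenvalue equals `noise_var`. The function name and the `factor` parameter are illustrative: `factor=1.0` keeps eigenvectors whose eigenvalue exceeds the noise eigenvalue, and `factor=2.0` requires twice the noise eigenvalue.

```python
import numpy as np


def spectral_filter(U_tilde, noise_var, factor=1.0):
    """Project perturbed data onto eigenvectors whose eigenvalue exceeds factor * noise_var."""
    mean = U_tilde.mean(axis=0)
    centered = U_tilde - mean
    C = np.cov(centered, rowvar=False)       # covariance of the perturbed data
    vals, vecs = np.linalg.eigh(C)           # spectral decomposition (ascending eigenvalues)
    keep = vals > factor * noise_var         # choose the retained dimensions
    Q = vecs[:, keep]
    U_hat = centered @ Q @ Q.T + mean        # projection onto the retained subspace
    return U_hat, int(keep.sum())


# Toy data: a perfectly correlated 2-d signal plus uncorrelated unit-variance noise.
rng = np.random.default_rng(0)
t = rng.normal(0, 10, size=2000)
U = np.column_stack([t, t])
U_tilde = U + rng.normal(0, 1, size=U.shape)
U_hat, r = spectral_filter(U_tilde, noise_var=1.0, factor=2.0)
```

On this toy data only the dominant direction passes the `factor=2.0` threshold, so the estimate is a 1-d projection that discards the noise orthogonal to the signal.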

The first strategy, from Huang and Du's paper, simply compares each perturbed eigenvalue with the noise eigenvalue. If it is larger, the corresponding eigenvector is included in the projection space.

The second strategy includes an eigenvector only if its perturbed eigenvalue is larger than twice the noise eigenvalue.

Graph Data

Matrix Representation of Network
Adjacency Matrix A (symmetric)

Adjacency Spectrum

Low Rank Approximation

Low rank approximation by eigen-decomposition:

This provides a best rank-r approximation to A.
To keep the structure of an adjacency matrix, discretize
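A rough sketch of low rank reconstruction with discretization follows. Two choices here are assumptions rather than the slides' method: the r eigenpairs are selected by largest absolute eigenvalue (which gives the best rank-r approximation in Frobenius norm), and the discretization keeps the m largest off-diagonal entries so that the edge count is preserved. The function name and the toy graph are illustrative.

```python
import numpy as np


def low_rank_reconstruct(A_tilde, r, m):
    """Rank-r approximation of a symmetric matrix, discretized to a 0-1 graph with m edges."""
    vals, vecs = np.linalg.eigh(A_tilde)
    idx = np.argsort(-np.abs(vals))[:r]              # r eigenvalues of largest magnitude
    A_r = (vecs[:, idx] * vals[idx]) @ vecs[:, idx].T
    n = A_tilde.shape[0]
    rows, cols = np.triu_indices(n, k=1)             # node pairs i < j
    top = np.argsort(-A_r[rows, cols])[:m]           # assumed rule: keep the m largest scores
    A_hat = np.zeros((n, n))
    A_hat[rows[top], cols[top]] = 1
    return A_hat + A_hat.T


# Toy graph: two 4-cliques, with one true edge removed and one false edge added.
A = np.zeros((8, 8))
for block in (range(0, 4), range(4, 8)):
    for u in block:
        for v in block:
            if u != v:
                A[u, v] = 1
A_tilde = A.copy()
A_tilde[0, 1] = A_tilde[1, 0] = 0                    # deleted true edge
A_tilde[0, 4] = A_tilde[4, 0] = 1                    # added false edge
A_hat = low_rank_reconstruct(A_tilde, r=2, m=12)
```

The reconstruction can then be compared with the original entry by entry, for example with `np.abs(A_hat - A).sum() / 2` as the number of node pairs that differ.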

New Challenges
-- A is a 0-1 adjacency matrix, whereas U is a numerical matrix.
-- The covariance matrix has only non-negative eigenvalues, whereas A has both positive and negative eigenvalues.
-- A covariance matrix cannot be defined for graph data.
-- The strategy for determining the number of eigen-components used for numerical data does not work for graph data, since the first eigenvalue of the noise matrix could be very large.

Leading Eigenpairs vs. Graph Topology

Here we examine the role of positive and negative eigenvalues in graph topology.
Without loss of generality, we partition the node set into two groups, so the adjacency matrix can be partitioned into 2 x 2 blocks,

where the two diagonal blocks represent the edges within the two groups and the off-diagonal block represents the edges between the groups.

[Figures: original graph vs. rank-r reconstructions for r = 1, 2, and 4]

Algorithm

Reconstructed Features (Political Blogs 40% Noise)

Determine Number of Eigenpairs
-- It is essential to find the best value of r given the randomized graph and the perturbation magnitude.
-- Choose one feature as the indicator, since it is closely related to the other features and an explicit moment estimator exists for it.

Data Sets
Political Blogs
-- Based on incoming and outgoing links and posts around the time of the 2004 presidential election
-- 16714 links among 1222 US political blogs
Political Books
-- Based on the political books sold by Amazon.com, where nodes represent the books and edges represent the co-purchasing of books
-- 105 nodes and 441 edges
Enron
-- Based on the email corpus of a real organization covering a 3-year period, where an edge means at least 5 emails were sent between two people
-- 151 nodes and 869 edges

Effect of Noise (Political Blogs)
-- The method works well up to a certain level of noise
-- Even with a high level of noise, the reconstructed features are still closer to the original than the randomized ones

Reconstructed Features on 3 real network data sets

Reconstruction Quality
-- When the quality measure is positive, the reconstructed features are closer to the original ones than the randomized ones
-- All positive for the three data sets

Privacy Issue
Question 1: Can this reconstruction be used by attackers?
Define the normalized Frobenius distance between A and its reconstruction
[Figures: normalized F norm for Political Books, Enron, and Political Blogs]

Privacy Issue
Question 2: Which type of graphs would have privacy breached?

For low rank graphs, the distance between the reconstructed graph and the original graph can be very small

Randomizing Social Network: a Spectrum Preserving Approach, SDM08

Synthetic Low Rank Graphs
Here is a set of synthetic low rank graphs generated from Political Blogs; the reconstruction works well in terms of both distance and features.

Conclusion
-- We have shown the close relationship between graph topological structure and the spectral spaces determined by eigen-pairs of the adjacency matrix.
-- We have presented a low rank approximation based reconstruction algorithm and a novel solution to determine the optimal rank in reconstruction.
-- We find that, for most social networks, the reconstructed networks do not incur further disclosure risks of individual privacy than the released randomized graphs. Only networks with low ranks or a small number of dominant eigenvalues may incur further privacy disclosure due to reconstruction.

Spectrum Based Fraud Detection

A Spectral Framework to Quantify Graph Non-randomness

Adjacency Matrix A (symmetric)

Adjacency Spectrum

Graph non-randomness [Ying, Wu, SDM09]
Spectral coordinates:
Link non-randomness:

Node non-randomness:

Graph non-randomness:
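The formulas for these measures did not survive extraction. The sketch below therefore works from assumed definitions that should be checked against [Ying, Wu, SDM09]: the spectral coordinate of node i is its row in the matrix of the k leading adjacency eigenvectors, link non-randomness is the dot product of two spectral coordinates, node non-randomness sums the link values over a node's neighbors, and graph non-randomness sums the node values. All names and the toy graph are illustrative.

```python
import numpy as np


def nonrandomness(A, k):
    """Link, node, and graph non-randomness from the k leading adjacency eigenpairs."""
    vals, vecs = np.linalg.eigh(A)
    order = np.argsort(-vals)[:k]            # k largest eigenvalues
    lam, X = vals[order], vecs[:, order]     # row i of X = spectral coordinate of node i
    link = X @ X.T                           # assumed: R(i, j) = alpha_i . alpha_j
    node = (X ** 2) @ lam                    # assumed: R(i) = sum_j lambda_j * x_ji^2
    graph = float(lam.sum())                 # assumed: R_G = sum of the k leading eigenvalues
    return link, node, graph


# Toy graph: two triangles {0,1,2} and {3,4,5} joined by the edge (2, 3).
A = np.zeros((6, 6))
for u, v in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    A[u, v] = A[v, u] = 1
link, node, graph = nonrandomness(A, k=2)
# Node non-randomness also equals the sum of link values over each node's neighbors.
node_from_links = (A * link).sum(axis=1)
```

Under these definitions the three levels are consistent: summing link values over a node's neighbors gives the node value, and summing node values gives the graph value.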

Graph non-randomness [Ying, Wu, SDM09]
Spectral coordinates:
[Figures: Laplacian spectral space, normal spectral space]

Link non-randomness:

Node non-randomness:

Graph non-randomness:

Properties
-- Normally distributed, with mean equal to that of the ER-graph;
-- The complete graph and the regular graph reach the positive and negative extreme values;
-- Randomization reduces the non-randomness value.

Normalized by the mean and standard deviation for ER-graphs

Application: spectral switch (applied to the adjacency matrix):

To preserve the non-randomness of the whole graph (eigenvalues), the deleted edges and the added fake edges should have comparable edge non-randomness values.

Collaborative Attacks
-- Some attackers join the social network
-- Attackers create links to regular users (victims)
-- Attackers form some inner structure among themselves

Graph Perturbation

Collaborative Attacks

Approximate the entries in the eigenvector (first-order and second-order terms)
-- Regular nodes are approximately unchanged
-- The entry is expressed approximately by the victims
-- Inner structure among attackers affects the eigenvector in the second-order term

Problem
We do not know the attackers/victims in advance, hence their specific spectral coordinates are unknown.

For Random Link Attacks, we can derive the distribution of attacking nodes' spectral coordinates.

Random Link Attacks
-- The attacker creates some fake nodes and controls them to connect to randomly selected regular nodes;
-- Fake nodes can mimic the real graph structure among themselves to evade detection.

Topology approach -- Shrivastava et al. ICDE08
Idea: count triangles around nodes. Regular connections produce many triangles; random connections do not create many triangles.
Algorithm
-- Detecting suspects: clustering test and neighborhood independence test
-- Detecting RLAs: GREEDY and TRWALK
Limitations
-- Difficult to detect when attackers create a dense subgraph among themselves
-- Too many parameters

Spectrum based RLA detection

For Random Link Attacks (RLA): the spectral coordinate of an attacking node has the normal distribution with mean and variance bounded by:

We can get the region in the spectral space where RLA attackers appear with high probability.

Inner structure of attackers does not affect the region!

[Figure: 20 attackers, each attacking 30 victims on average]

Combine k dimensions together:

We can get the upper bounds of the mean and variance of R and get the decision line:

Using node non-randomness

Nodes below the decision line are suspects

Example I
Spectral properties of normal nodes and attackers

20 attackers join the Polblogs network. Each attacker connects to 50 randomly selected victims. Attackers form a random graph among themselves.

Example II
Spectral properties of normal nodes and attackers
40 attackers join the Polblogs network. In total, they attack 1000 randomly selected victims. Attackers mimic real network structure among themselves.

Comparison
-- Topology based RLA detection approach (Shrivastava et al., ICDE08): clustering test and neighborhood independence test; GREEDY and TRWALK
Experimental Setting
-- Web Spam Challenge data (114K nodes and 1.8M links)
-- Add 8 RLAs with varied sizes and connection patterns

Accuracy

Execution time

Distributed Denial of Service Attacks
Spectral properties of victim nodes

Attacker controls 200 normal nodes to attack one victim node.

Fraud Detection: Bipartite Core Attacks
Attacker creates two types of nodes:
-- Accomplices: connect to normal nodes and pretend to be normal. Accomplices also connect to fraudsters (and enhance the fraudsters' ratings).
-- Fraudsters: nodes that actually commit fraud; they mostly connect to accomplices

Figure from: Duen Horng Chau et al., Detecting Fraudulent Personalities in Networks of Online Auctioneers

Bipartite core

Future work
-- Compare randomization and k-anonymity
-- Combine link privacy and node privacy
-- Link and node privacy issues for feature preserving randomization
-- Spectrum based fraud detection for various random attacks

Thank you! Questions?

X. Wu, X. Ying, K. Liu and L. Chen. "A Survey of Algorithms for Privacy-Preservation of Graphs and Social Networks". Invited book chapter. Managing and Mining Graph Data. August 2009.
X. Ying, X. Wu, K. Pan, and L. Guo. "On the Quantification of Identity and Link Disclosures in Randomizing Social Networks". Invited book chapter. Advances in Information & Intelligent Systems. Springer, 2009.
X. Wu, X. Ying and L. Wu. "Analyzing Socio-technical Networks: a Spectrum Perspective". Invited book chapter. Socio-technical Networks: Science and Engineering Design, 2009.
X. Ying, K. Pan, X. Wu and L. Guo. "Comparisons of Randomization and K-degree Anonymization Schemes for Privacy Preserving Social Network Publishing". (SNA-KDD09).
X. Ying and X. Wu. "Graph Generation with Prescribed Feature Constraints". (SDM09).
X. Ying and X. Wu. "On Randomness Measures for Social Networks". (SDM09).
X. Ying and X. Wu. "On Link Privacy in Randomizing Social Networks". (PAKDD09, Best Student Paper Runner-up Award).
X. Ying and X. Wu. "Randomizing Social Networks: a Spectrum Preserving Approach". (SDM08).

Evaluation

Future Work: Random Attack Detection
Node randomness:

Fraud Detection: Bipartite Attacks
Algorithm outline:
-- Find the suspects according to the node non-randomness measure;
-- Compute the common neighbor (CN) matrix of suspects: Susp_CN(i,j) = # CN of i and j. Susp_CN is a weighted undirected graph!
-- Find dense subgraphs in the Susp_CN graph.
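The common neighbor step of this outline can be sketched with one matrix product, since the number of common neighbors of i and j is entry (i, j) of the squared adjacency matrix. The function name and the star-shaped toy graph are illustrative assumptions. Suspect selection and dense subgraph search are not shown.

```python
import numpy as np


def suspect_common_neighbors(A, suspects):
    """Susp_CN(i, j) = number of common neighbors of suspects i and j in graph A."""
    S = A[suspects, :]                       # adjacency rows of the suspect nodes
    cn = S @ S.T                             # (A^2) restricted to suspects, A symmetric
    np.fill_diagonal(cn, 0)                  # ignore each node's own degree on the diagonal
    return cn


# Toy graph: a star whose center (node 0) links to nodes 1, 2, and 3.
A = np.zeros((4, 4), dtype=int)
for v in (1, 2, 3):
    A[0, v] = A[v, 0] = 1
susp_cn = suspect_common_neighbors(A, [1, 2, 3])
```

Here every pair of suspects shares exactly one neighbor (the center), so all off-diagonal entries of `susp_cn` are 1.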

Spectral space of the Susp_CN graph
Polblogs network, 20 accomplices, and 15 fraudsters

Future Work: Node Identity Privacy
-- Re-identification risk decreases as k increases;
-- The Add/Del strategy can efficiently reduce the risk.

Link Privacy: Prior & Posterior Beliefs
Method III [Ying, Wu, SDM09]
-- Uniform switch procedure [Taylor, 1981]
-- Starting with the randomized data, repeat the uniform switch procedure many times and get one sample graph
-- Generate N graphs

[Figures: example graphs on nodes 1 to 5; switch diagrams on nodes t, u, v, w]
Method II: If similarity is large, link (i, j) is more likely to be a true link.
