Xiaowei Ying, Leting Wu, Xintao Wu
University of North Carolina at Charlotte

Privacy and Spectral Analysis on Social Network Randomization

Framework
Background & Motivation
Privacy in Randomized Graph
-- Link privacy (3 methods to quantify link privacy)
-- Node privacy
Feature Preserving Randomization
-- Spectrum preserving randomization
-- General feature preserving randomization (Markov chain based)
-- Attacks to feature preserving randomization
Reconstruction from Randomized Graphs
Spectrum Based Fraud Detection
-- A spectral framework to quantify non-randomness of social networks
-- Spectrum based fraud detection
Future Work

Background & Motivation
Social Network
Friendship in Karate club [Zachary, 77]
Biological association network of dolphins [Lusseau et al., 03]
Collaboration network of scientists [Newman, 06]
Network of US political books (105 nodes, 441 edges)
Books about US politics sold by Amazon.com. Edges represent frequent co-purchasing of books by the same buyers. Nodes are colored blue, white, or red to indicate whether they are "liberal", "neutral", or "conservative".
Background & Motivation
Publish/outsource data for mining/analysis
The data owner releases the original graph data to the public / third party / research institution.
Data miner: discover patterns/features of the data (utility)
-- find central nodes, community partition, link prediction
Attacker: breach sensitive information in the data (privacy)
-- identity of nodes (and sensitive attributes), sensitive relation between two individuals

Privacy issues in publishing social network data:
Anonymization is not enough to protect privacy: active/passive attacks [1], subgraph attacks [2].
[1] L. Backstrom et al., Wherefore art thou r3579x?: anonymized social networks, hidden patterns, and structural steganography. WWW07
[2] M. Hay et al., Resisting Structural Reidentification in Anonymized Social Networks. VLDB08

Privacy Preserving Social Network Publishing
Node anonymization cannot guarantee identity/link privacy due to subgraph queries.
K-anonymity generalization
-- The released graph has at least k nodes with the same degree/subgraph/neighborhood [Liu & Terzi SIGMOD08, Zhou & Pei ICDE08, Chen VLDB09]
Graph (edge) randomization
-- Random Add/Del & Random Switch
-- Utility preserving randomization
Super graph generalization
-- Generalize nodes into super nodes, and edges into super edges
Graph Randomization/Perturbation:
Random Add/Del edges (no. of edges unchanged)
Random Switch edges (node degrees unchanged)
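The two perturbation strategies can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the edge-list representation, and the `max_tries` guard are assumptions made for the example.

```python
import random

def random_add_del(edges, n, k, rng):
    """Delete k true edges and add k false edges, so the edge count is unchanged."""
    original = {tuple(sorted(e)) for e in edges}
    non_edges = [(u, v) for u in range(n) for v in range(u + 1, n)
                 if (u, v) not in original]
    deleted = set(rng.sample(sorted(original), k))
    added = set(rng.sample(non_edges, k))
    return (original - deleted) | added

def random_switch(edges, k, rng, max_tries=10000):
    """Apply up to k switches (t,w),(u,v) -> (t,v),(u,w); node degrees are unchanged."""
    edge_set = {tuple(sorted(e)) for e in edges}
    done = tries = 0
    while done < k and tries < max_tries:
        tries += 1
        (t, w), (u, v) = rng.sample(sorted(edge_set), 2)
        if len({t, w, u, v}) < 4:
            continue  # the two edges share a node
        new1, new2 = tuple(sorted((t, v))), tuple(sorted((u, w)))
        if new1 in edge_set or new2 in edge_set:
            continue  # the switch would create a duplicate edge
        edge_set -= {(t, w), (u, v)}
        edge_set |= {new1, new2}
        done += 1
    return edge_set

def degrees(edge_set, n):
    """Degree of every node 0..n-1."""
    deg = [0] * n
    for u, v in edge_set:
        deg[u] += 1
        deg[v] += 1
    return deg

# Hypothetical toy input: a ring of 10 nodes.
ring = [(i, (i + 1) % 10) for i in range(10)]
rng = random.Random(0)
perturbed = random_add_del(ring, 10, 3, rng)
switched = random_switch(ring, 3, rng)
```

Add/Del keeps the number of edges but changes degrees; Switch keeps every node's degree, which is why the two strategies lead to different prior and posterior beliefs for an attacker.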
Data privacy: How does graph randomization prevent privacy disclosure?
Data utility: How will the graph structure change due to randomization? How can graph structural features be better preserved?
In our work, we try to answer the following questions. First, how does the graph structure change under perturbation? This matters if we want to publish the data for analysis. Second, how effective is perturbation at protecting privacy, and how much perturbation should be applied?

Numerous topological measures of networks
Harmonic mean of shortest distance
Transitivity(cluster coefficient)
Subgraph centrality
Modularity (community structure)
And many others

Spectral measures
Adjacency Matrix A (symmetric)
Adjacency Spectrum
Laplacian Matrix and Spectrum:
Normal Matrix and Spectrum

Many topological features are related to spectral measures:
No. of triangles:
Subgraph centrality:
Graph diameter:
k disconnected parts in the graph ⇔ k zeros in the Laplacian spectrum.
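Two of these relations are easy to check numerically: the number of triangles equals one sixth of the sum of the cubed adjacency eigenvalues, and the number of disconnected parts equals the number of zero Laplacian eigenvalues. The sketch below uses NumPy; the function names and the toy graph (two disjoint triangles) are assumptions made for illustration.

```python
import numpy as np

def triangle_count_from_spectrum(A):
    """Number of triangles = (1/6) * sum of cubed adjacency eigenvalues."""
    lam = np.linalg.eigvalsh(A)
    return float(np.sum(lam ** 3) / 6.0)

def zero_laplacian_eigenvalues(A, tol=1e-8):
    """Count (numerically) zero eigenvalues of the Laplacian L = D - A."""
    L = np.diag(A.sum(axis=1)) - A
    mu = np.linalg.eigvalsh(L)
    return int(np.sum(np.abs(mu) < tol))

# Hypothetical toy graph: two disjoint triangles on 6 nodes.
tri = np.ones((3, 3)) - np.eye(3)
A = np.block([[tri, np.zeros((3, 3))], [np.zeros((3, 3)), tri]])

n_triangles = round(triangle_count_from_spectrum(A))
n_parts = zero_laplacian_eigenvalues(A)
```

For this toy graph both quantities are 2: two triangles and two disconnected parts.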
Two important eigenvalues: λ_1 (the largest eigenvalue of the adjacency matrix) and μ_2 (the second smallest eigenvalue of the Laplacian matrix)
The maximum degree, chromatic number, clique number etc. are related to λ_1;
The epidemic threshold for virus propagation in the network is related to λ_1 [Wang et al., KDD03];
μ_2 indicates the community structure of the graph: clear community structure ⇔ μ_2 ≈ 0.
Basic Facts of Graph Spectrum
The Laplacian eigenvalues
0.00 0.00 0.00 1.27 2.59 3.00 3.00 3.00 4.00 4.00 4.00 4.73 5.00 5.41 6.00 6.00 6.00 6.00 6.00
0.00 0.11 0.34 1.31 2.60 3.00 3.10 3.36 4.00 4.13 4.59 4.79 5.31 5.58 6.00 6.00 6.00 6.66 7.12
Graph from: A. Capocci et al., Detecting communities in large networks
The Laplacian eigenvectors

Privacy in Randomized Graph

Link Privacy: Prior & Posterior Beliefs
Quantify attackers' beliefs (assume that node identities are known)
Prior probabilities:
Posterior probability for node pair (i, j):
Privacy is seriously jeopardized when the posterior probability is much larger than the prior probability.

Method I [Ying, Wu, SDM08]
Add & Del k links
Switch k times
Method II [Ying, Wu, PAKDD09]
A common phenomenon: in real-world graphs, similar nodes tend to connect to each other.
Even after moderate randomization, the phenomenon still exists.
Add/Del: true links are deleted with some probability and false links are added with some probability; the posterior belief then follows from Bayes' theorem.
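The Bayes step for Add/Del randomization can be sketched as follows. The symbols here are assumptions made for the example, since the slide's own formulas did not survive: `prior` is the attacker's prior belief that a link exists (it could depend on node similarity), `p_del` is the probability that a true link is deleted, and `p_add` is the probability that a false link is added.

```python
def posterior_link_probability(prior, p_del, p_add, observed):
    """P(true link | observation in the randomized graph), by Bayes' theorem.

    prior  : P(link exists in the original graph)      (assumed symbol)
    p_del  : P(a true link is deleted by randomization) (assumed symbol)
    p_add  : P(a false link is added by randomization)  (assumed symbol)
    observed: True if the link appears in the randomized graph
    """
    if observed:
        num = prior * (1.0 - p_del)
        den = num + (1.0 - prior) * p_add
    else:
        num = prior * p_del
        den = num + (1.0 - prior) * (1.0 - p_add)
    return num / den if den > 0 else 0.0

# Hypothetical numbers: 10% density, half of the true links deleted,
# and p_add chosen so that the expected number of edges is unchanged.
prior = 0.1
p_del = 0.5
p_add = prior * p_del / (1.0 - prior)

post_seen = posterior_link_probability(prior, p_del, p_add, observed=True)
post_unseen = posterior_link_probability(prior, p_del, p_add, observed=False)
```

With these numbers an observed link is a true link with probability 0.5, while an unobserved pair is a true link with probability of about 0.056, so observing a link still raises the attacker's belief well above the prior.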
Evaluation (add/del 50% true links)
The total sum of prior and posterior probabilities is the same:
(Figure: prior prob., posterior prob. I, posterior prob. II)

Method III [Ying, Wu, SDM09]
Intuition: the degree sequence specifies a graph space, and the true graph is just one member of the space.
Example: switch graph with degree sequence {3,2,2,2,3}. Are nodes 1 and 5 connected?
Graph space = {G: with a given degree sequence}
Impractical to enumerate all members in the space
Sample the graph space through a Markov chain:
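The sampling idea can be sketched with a simple switch chain: start from the released graph, apply many degree-preserving switches to obtain one sample, repeat N times, and use the fraction of samples that contain the link (i, j) as the estimated belief. This is an illustrative sketch under assumptions: the function names, the number of steps per sample, and the toy graph with degree sequence {3,2,2,2,3} are chosen for the example and are not taken from the paper.

```python
import random

def switch_step(edge_set, rng):
    """One step of the switch chain; an invalid proposal leaves the graph unchanged."""
    (a, b), (c, d) = rng.sample(sorted(edge_set), 2)
    if rng.random() < 0.5:
        c, d = d, c  # pick one of the two possible rewirings at random
    if len({a, b, c, d}) < 4:
        return
    new1, new2 = tuple(sorted((a, d))), tuple(sorted((c, b)))
    if new1 in edge_set or new2 in edge_set:
        return
    edge_set.difference_update({tuple(sorted((a, b))), tuple(sorted((c, d)))})
    edge_set.update({new1, new2})

def estimate_link_probability(released_edges, i, j,
                              n_samples=200, steps=50, seed=0):
    """Fraction of sampled graphs (same degree sequence) that contain link (i, j)."""
    rng = random.Random(seed)
    current = {tuple(sorted(e)) for e in released_edges}
    target = tuple(sorted((i, j)))
    hits = 0
    for _ in range(n_samples):
        for _ in range(steps):
            switch_step(current, rng)
        hits += target in current
    return hits / n_samples

# Hypothetical released graph with degree sequence {3,2,2,2,3} on nodes 1..5.
released = [(1, 2), (1, 3), (1, 5), (2, 5), (3, 4), (4, 5)]
p_15 = estimate_link_probability(released, 1, 5)
```

The value `p_15` plays the role of the attacker's posterior belief that nodes 1 and 5 are connected, given only the degree sequence of the released graph.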
Evaluation: Polbooks (r=8%), Enron (r=8%)

Node Identity Privacy
Identity privacy: re-identify nodes in the anonymized graphs based on some background information (e.g. degree)
Randomization reduces attackers' beliefs
(Figure: Polbooks degree distribution, before and after randomization)

Nodes' prior and posterior risks
Given an individual with degree d and a randomized graph
Prior risk:
Posterior risks:

Ongoing work:
Compare randomization and the k-anonymity approach:
-- to achieve the same privacy protection level, which approach can achieve better utility?
Combine identity privacy and link privacy.
Node identity privacy under different background information (e.g., sub-graph, neighborhood).
K-degree generalization [Liu et al.]
Feature Preserving Randomization

Topological and spectral features change a lot along the perturbation.
(Network of US political books, 105 nodes and 441 edges)
Can we better preserve the network structure?

Features in Social Network Data
Two important eigenvalues: λ_1 (the largest eigenvalue of the adjacency matrix) and μ_2 (the second smallest eigenvalue of the Laplacian matrix)
The maximum degree, chromatic number, clique number etc. are related to λ_1;
The epidemic threshold for virus propagation in the network is related to λ_1 [Wang et al., KDD03];
μ_2 indicates the community structure of the graph: clear community structure ⇔ μ_2 ≈ 0.
Spectrum Preserving Randomization
Spectrum preserving approach [Ying, Wu, SDM08]
Intuition: since the spectrum is related to many graph topological features, can we preserve more structural features by controlling the movement of eigenvalues?

Spectral Switch (apply to adjacency matrix):
To increase the eigenvalue:
To decrease the eigenvalue:

Spectral Switch (apply to Laplacian matrix):
To decrease the eigenvalue:
To increase the eigenvalue:

Evaluation:
(Network of US political books, 105 nodes and 441 edges)

Markov Chain Based Feature Preserving Randomization
Markov chain generation [Ying, Wu, SDM09]
The data owner puts feature range constraints on switching
Feature range constraints:
The data owner publishes the feature range constraints.
Markov chain with feature range constraint (uniformity for accessible graphs)
Problem: accessibility is not guaranteed
We propose a relaxed algorithm with feature range constraint (accessibility, approximate uniformity)
The relaxed algorithm also has applications in testing the significance of data mining results

Attacks to Feature Preserving Randomization
The data owner puts feature range constraints on switching. Can attackers utilize the feature constraints to breach link privacy?
Markov chain approach [Ying, Wu, SDM09]
Markov chain with feature range constraint
Graph space = {G: with a given deg. seq. & S(G) in R}
Starting with the randomized data, repeat the switch procedure many times to get one sample graph
Generate N graphs
Evaluation: Polbooks (r=8%), Enron (r=8%)
Future work: what causes the difference? What features will (not) release privacy?

Reconstruction from Randomized Graphs
Motivation
Low Rank Approximation on Graph Data
Reconstruction from Randomized Graph
Privacy Issue
(SDM10 paper)

Motivation
Our focus: whether we can reconstruct a graph from the randomized graph such that its features are closer to the original ones than those of the randomized graph.

Revisit of LRA in Numerical Data
Spectral filter: derive an estimate of the original data U from the perturbed data
Calculate the covariance matrix of the perturbed data, which is symmetric and positive definite
Apply spectral decomposition to the covariance matrix
Derive the eigenvalue information from the covariance matrix of the noise V and choose a proper number of dimensions, r
Project the perturbed data onto the first r eigenvectors to obtain the estimated data set
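The spectral filter for numerical data can be sketched with NumPy. This is a minimal illustration under assumptions: the noise is taken to be independent with a known variance `noise_var`, and an eigenvector is kept when its eigenvalue exceeds `factor * noise_var`; the function name, the `factor` parameter, and the synthetic data are all made up for the example.

```python
import numpy as np

def spectral_filter(Y, noise_var, factor=2.0):
    """Estimate the original data from perturbed data Y = U + V.

    Keeps the eigenvectors of the covariance of Y whose eigenvalue exceeds
    factor * noise_var (assumed selection rule) and projects Y onto them.
    """
    mean = Y.mean(axis=0)
    Yc = Y - mean
    C = np.cov(Yc, rowvar=False)      # symmetric covariance matrix
    lam, Q = np.linalg.eigh(C)        # spectral decomposition
    keep = lam > factor * noise_var   # choose the number of dimensions r
    Qr = Q[:, keep]
    return Yc @ Qr @ Qr.T + mean, int(keep.sum())

# Hypothetical data: 5 perfectly correlated attributes plus independent noise.
rng = np.random.default_rng(0)
U = rng.normal(0.0, 3.0, size=(400, 1)) * np.ones((1, 5))
V = rng.normal(0.0, 0.5, size=(400, 5))
Y = U + V

U_hat, r = spectral_filter(Y, noise_var=0.25)
err_perturbed = np.linalg.norm(Y - U)
err_filtered = np.linalg.norm(U_hat - U)
```

Because the original attributes are correlated and the noise is not, the signal concentrates in one principal direction, and the projection removes most of the noise spread over the remaining directions.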
Why it works
Original data are correlated
Noise is not correlated
(Figure: original signal + noise = perturbed data; 1st and 2nd principal vectors; 2-d estimation and 1-d estimation)

Determining r
Strategy 1: (Huang and Du, SIGMOD05)
Strategy 2: (Guo, Wu and Li, PKDD 2006) The estimated data is approximately optimal
The first strategy, from Huang and Du's paper, simply compares each perturbed eigenvalue with the noise eigenvalue. If it is larger, then the corresponding eigenvector is included in the projection space.
The second strategy includes an eigenvector only if its perturbed eigenvalue is larger than twice the noise eigenvalue.

Graph Data
Matrix Representation of Network
Adjacency Matrix A (symmetric)
Adjacency Spectrum
Low Rank Approximation
Low rank approximation by eigen-decomposition:
This provides the best rank-r approximation to A.
To keep the structure of the adjacency matrix, discretize the low rank approximation.
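A rank-r approximation followed by discretization can be sketched as below. Two choices here are assumptions rather than the authors' method: the r eigenpairs with the largest eigenvalue magnitude are kept (which gives the best rank-r approximation in the Frobenius norm for a symmetric matrix), and entries are rounded to 0/1 with a fixed threshold of 0.5.

```python
import numpy as np

def low_rank_reconstruct(A_tilde, r, threshold=0.5):
    """Rank-r approximation of a symmetric matrix and a 0-1 discretized version."""
    lam, X = np.linalg.eigh(A_tilde.astype(float))
    idx = np.argsort(-np.abs(lam))[:r]            # r eigenpairs of largest magnitude
    A_r = (X[:, idx] * lam[idx]) @ X[:, idx].T    # sum of lambda_i * x_i x_i^T
    A_hat = (A_r >= threshold).astype(int)        # assumed thresholding rule
    A_hat = np.triu(A_hat, 1)                     # drop the diagonal, keep it symmetric
    return A_r, A_hat + A_hat.T

# Hypothetical toy graph: two disjoint cliques of 5 nodes each.
block = np.ones((5, 5)) - np.eye(5)
A = np.block([[block, np.zeros((5, 5))],
              [np.zeros((5, 5)), block]]).astype(int)

A_r, A_hat = low_rank_reconstruct(A, r=2)
```

For this toy graph two eigenpairs are enough: the rank-2 approximation has the value 0.8 inside each clique and 0 between them, so thresholding recovers the original adjacency matrix.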
New Challenges
A is a 0-1 adjacency matrix, whereas U is a numerical matrix.
The covariance matrix is positive definite and has only non-negative eigenvalues, whereas A has both positive and negative eigenvalues.
The covariance matrix cannot be defined for graph data.
The strategy for determining the number of eigen components in numerical data does not work for graph data, since the first eigenvalue of the noise matrix could be very large.

Leading Eigenpairs vs. Graph Topology
Here we examine the role of positive and negative eigenvalues in graph topology.
Without loss of generality, we partition the node set into two groups, and the adjacency matrix can be partitioned as
A = [ A_1  B ; B^T  A_2 ]
where A_1 and A_2 represent the edges within the two groups and B represents the edges between the groups.
(Figures: original graph vs. reconstructions with r = 1, r = 2, and r = 4)

Algorithm

Reconstructed Features (Political Blogs, 40% Noise)

Determine Number of Eigenpairs
It is essential to find the best number r given the randomized graph and the perturbation magnitude.
The indicator is chosen because it is closely related to the other features and has an explicit moment estimator.

Data Sets
Political Blogs
-- Based on incoming and outgoing links and posts around the time of the 2004 presidential election
-- 16714 links among 1222 US political blogs
Political Books
-- Based on the political books sold by Amazon.com, where nodes represent books and edges represent co-purchasing of books
-- 105 nodes and 441 edges
Enron
-- Based on the email corpus of a real organization covering a 3-year period, where an edge means at least 5 emails were sent between two people
-- 151 nodes and 869 edges

Effect of Noise (Political Blogs)
The method works well up to a certain level of noise
Even with a high level of noise, the reconstructed features are still closer to the original than the randomized ones
Reconstructed Features on 3 real network data

Reconstruction Quality
When the reconstruction quality is positive, the reconstructed features are closer to the original ones than the randomized ones.
It is positive for all three data sets.
Privacy Issue
Question 1: Can this reconstruction be used by attackers?
The normalized Frobenius distance between A and the reconstructed graph is used to measure this.
(Figures: normalized F norm for Political Books, Enron, and Political Blogs)

Question 2: Which types of graphs would have their privacy breached?
For low rank graphs, the distance between the reconstructed graph and the original graph can be very small.

Synthetic Low Rank Graphs
For a set of synthetic low rank graphs generated from Political Blogs, the reconstruction works well in terms of both distance and features.

Conclusion
We have shown the close relationship between graph topological structure and the spectral spaces determined by eigen-pairs of the adjacency matrix.
We have presented a low rank approximation based reconstruction algorithm and a novel solution to determine the optimal rank in reconstruction.
We find that for most social networks, the reconstructed networks do not incur further disclosure risks of individual privacy than the released randomized graphs. Only networks with low ranks or a small number of dominant eigenvalues may incur further privacy disclosure due to reconstruction.
Spectrum Based Fraud Detection

A Spectral Framework to Quantify Graph Non-randomness
Adjacency Matrix A (symmetric)
Adjacency Spectrum
Graph non-randomness [Ying, Wu, SDM09]
Spectral coordinates:
Link non-randomness:
Node non-randomness:
Graph non-randomness:
(Figure: Laplacian spectral space and normal spectral space)

Property
Normally distributed, with mean equal to that of the ER-graph;
The complete graph and the regular graph reach the positive and negative extreme values;
Randomization reduces the non-randomness value.
Normalized by the mean and standard deviation for ER-graphs

Application: spectral switch (apply to adjacency matrix):
To preserve the non-randomness of the whole graph (eigenvalues), deleted edges and added fake edges have comparable edge non-randomness values.
Collaborative Attacks
Some attackers join the social network
Attackers create links to regular users (victims)
Attackers form some inner structure among themselves

Graph Perturbation
Approximate the entries in the eigenvector (first order and second order terms)
Regular nodes are approximately unchanged
The entry is expressed by the victims approximately
Inner structure among attackers affects the eigenvector in the second order term

Problem
We do not know attackers/victims in advance, hence their specific spectral coordinates are unknown.
For Random Link Attacks, we can derive the distribution of attacking nodes' spectral coordinates.

Random Link Attacks
The attacker creates some fake nodes and controls the fake nodes to connect to randomly selected regular nodes;
Fake nodes can mimic the real graph structure among themselves to evade detection.

Topology approach -- Shrivastava et al., ICDE08
Idea: count triangles around nodes --- regular connections produce many triangles, random connections do not create many triangles
Algorithm
-- Detecting suspects: clustering test and neighborhood independence test
-- Detecting RLAs: GREEDY and TRWALK
Limitation
-- difficult to detect when attackers create a dense subgraph among themselves
-- too many parameters

Spectrum based RLA detection
For Random Link Attacks (RLA): the attacking node's spectral coordinate has the normal distribution with mean and variance bounded by:
We can get the region in the spectral space where RLA attackers appear with high probability
Inner structure of attackers does not affect the region!
(Figure: 20 attackers, each attacking 30 victims on average)

Combine k dimensions together using node non-randomness:
We can get the upper bounds of the mean and variance of R and get the decision line:
Nodes below the decision line are suspects

Example I
Spectral properties of normal nodes and attackers
20 attackers join the Polblogs network. Each attacker connects to 50 randomly selected victims. Attackers form a random graph among themselves.

Example II
Spectral properties of normal nodes and attackers
40 attackers join the Polblogs network. In total, they attack 1000 randomly selected victims. Attackers mimic real network structure among themselves.

Comparison
Topology based RLA detection approach (Shrivastava et al., ICDE08)
-- clustering test and neighborhood independence test
-- GREEDY and TRWALK
Experimental Setting
-- Web Spam Challenge data (114K nodes and 1.8M links)
-- Add 8 RLAs with varied sizes and connection patterns
Accuracy
Execution time
Distributed Denial of Service Attacks
Spectral properties of victim nodes
Attacker controls 200 normal nodes to attack one victim node.

Fraud Detection: Bipartite Core Attacks
Attacker creates two types of nodes:
Accomplices: connect to normal nodes and pretend to be normal. Accomplices also connect to fraudsters (and enhance fraudsters' ratings).
Fraudsters: nodes that actually commit fraud, mostly connected to accomplices
(Figure: bipartite core. From: Duen Horng Chau et al., Detecting Fraudulent Personalities in Networks of Online Auctioneers)

Future work
Compare randomization and k-anonymity
Combine link privacy and node privacy
Link and node privacy issues for feature preserving randomization
Spectrum based fraud detection for various random attacks

Thank you! Questions?
X. Wu, X. Ying, K. Liu and L. Chen. "A Survey of Algorithms for Privacy-Preservation of Graphs and Social Networks". Invited book chapter. Managing and Mining Graph Data. August 2009.
X. Ying, X. Wu, K. Pan and L. Guo. "On the Quantification of Identity and Link Disclosures in Randomizing Social Networks". Invited book chapter. Advances in Information & Intelligent Systems. Springer, 2009.
X. Wu, X. Ying and L. Wu. "Analyzing Socio-technical Networks: a Spectrum Perspective". Invited book chapter. Socio-technical Networks: Science and Engineering Design, 2009.
X. Ying, K. Pan, X. Wu and L. Guo. "Comparisons of Randomization and K-degree Anonymization Schemes for Privacy Preserving Social Network Publishing". (SNA-KDD09).
X. Ying and X. Wu. "Graph Generation with Prescribed Feature Constraints". (SDM09).
X. Ying and X. Wu. "On Randomness Measures for Social Networks". (SDM09).
X. Ying and X. Wu. "On Link Privacy in Randomizing Social Networks". (PAKDD09, Best Student Paper Runner-up Award).
X. Ying and X. Wu. "Randomizing Social Networks: a Spectrum Preserving Approach". (SDM08).

Evaluation
Future Work: Random Attack Detection
Node randomness:

Fraud Detection: Bipartite Attacks
Algorithm outline:
Find the suspects according to the node non-randomness measure;
Compute the common neighbor (CN) matrix of suspects: Susp_CN(i,j) = # CN of i and j. Susp_CN is a weighted undirected graph!
Find dense subgraphs in the Susp_CN graph.
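The common-neighbor step of the outline can be sketched with NumPy. The suspect list is taken as given here (the node non-randomness test is not reproduced), and the final "dense subgraph" step is replaced by a simple threshold on the CN count; both simplifications, the function names, and the toy graph are assumptions made for illustration.

```python
import numpy as np

def suspect_common_neighbors(A, suspects):
    """Susp_CN(i, j) = number of common neighbors of suspects i and j."""
    S = A[np.asarray(suspects), :]   # adjacency rows of the suspects
    cn = S @ S.T                     # row dot products count shared neighbors
    np.fill_diagonal(cn, 0)          # ignore a node paired with itself
    return cn

def strongly_linked_pairs(cn, suspects, min_cn):
    """Suspect pairs sharing at least min_cn neighbors (stand-in for dense subgraphs)."""
    pairs = []
    for a in range(len(suspects)):
        for b in range(a + 1, len(suspects)):
            if cn[a, b] >= min_cn:
                pairs.append((suspects[a], suspects[b]))
    return pairs

# Hypothetical toy graph: fraudsters 0 and 1 both connect to accomplices 2, 3, 4;
# node 5 is a normal node connected to node 2 only.
A = np.zeros((6, 6), dtype=int)
for f in (0, 1):
    for acc in (2, 3, 4):
        A[f, acc] = A[acc, f] = 1
A[5, 2] = A[2, 5] = 1

suspects = [0, 1, 5]
cn = suspect_common_neighbors(A, suspects)
pairs = strongly_linked_pairs(cn, suspects, min_cn=2)
```

In the toy graph the two fraudsters share all three accomplices, so they form the only heavily weighted pair in the Susp_CN graph, while the normal node shares a single neighbor with each of them.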
Spectral space of Susp_CN graph
Polblogs network, 20 accomplices, and 15 fraudsters

Future Work: Node Identity Privacy
Re-identification risk decreases as k increases;
The Add/Del strategy can efficiently reduce the risk.

Link Privacy: Prior & Posterior Beliefs
Method III [Ying, Wu, SDM09]
Uniform switch procedure [Taylor, 1981]
Starting with the randomized data, repeat the uniform switch procedure many times to get one sample graph
Generate N graphs
(Figures: switch diagrams on nodes t, u, v, w in the randomized graph)
Method II: If similarity is large, link (i, j) is more likely to be a true link