Exploring Linkability of User Reviews
Mishari Almishari and Gene Tsudik
Computer Science Department, University of California, Irvine
malmisha,[email protected]
Increasing Popularity of Reviewing Sites
Yelp: more than 39M visitors and 15M reviews in 2010
Category, Rating
Rising Awareness of Privacy
How Does Privacy Apply to Reviews?
Traceability
Linkability of Ad Hoc Reviews
Linkability of Several Accounts
Contribution
Extensive Study to Measure Privacy/Linkability in User Reviews
Propose models that adequately identify authors
Settings & Problem Formulation
IR: Identified Record
AR: Anonymous Record
Anonymous Record Size (AR)
Identified Record Size (IR)
Matching Model
Top-X Linkability, X: 1 and 10
Sizes: 1, 5, 10, 20, … 60
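The Top-X linkability measure can be illustrated with a small sketch. It assumes each anonymous record is keyed by its true author and that the matching model has already returned a ranked list of candidate IRs; the function name `top_x_linkability` and the toy user names are hypothetical, not taken from the study.

```python
from typing import Dict, List


def top_x_linkability(rankings: Dict[str, List[str]], x: int) -> float:
    """Fraction of ARs whose true author appears among the first x ranked IRs.

    `rankings` maps the true author of each AR to the list of candidate
    authors returned by a matching model, best match first.
    """
    if not rankings:
        return 0.0
    hits = sum(1 for author, ranked in rankings.items() if author in ranked[:x])
    return hits / len(rankings)


# Toy example with three anonymous records (hypothetical users).
rankings = {
    "alice": ["alice", "bob", "carol"],
    "bob": ["carol", "bob", "alice"],
    "carol": ["alice", "bob", "carol"],
}
top1 = top_x_linkability(rankings, 1)
top2 = top_x_linkability(rankings, 2)
```

Top-1 counts only exact first-place matches, while Top-10 in the slides credits any match within the first ten candidates.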
Dataset
1 Million Reviews
2,000 Users
More than 300 reviews per user
Methodology
Naïve Bayesian Model
Kullback-Leibler Model: Symmetric Version
Naïve Bayesian (NB)
Identified Record (IR)
Anonymous Record (AR)
Decreasing Sorted List of IRs
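A minimal sketch of the NB ranking step, assuming letter unigrams as tokens and add-one (Laplace) smoothing so that unseen letters do not produce a zero probability. The smoothing, the helper names, and the toy records are assumptions made for illustration; the slides do not specify them.

```python
import math
from collections import Counter
from typing import Dict, List

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def unigram_counts(text: str) -> Counter:
    """Count the letters 'a'..'z' in a text, ignoring everything else."""
    return Counter(ch for ch in text.lower() if ch in ALPHABET)


def nb_log_likelihood(ar_counts: Counter, ir_counts: Counter, alpha: float = 1.0) -> float:
    """log P(AR | IR) under a naive Bayes (token independence) assumption.

    alpha is an assumed smoothing constant, not a value from the slides.
    """
    total = sum(ir_counts.values()) + alpha * len(ALPHABET)
    return sum(
        n * math.log((ir_counts[tok] + alpha) / total)
        for tok, n in ar_counts.items()
    )


def rank_by_nb(ar_text: str, irs: Dict[str, str]) -> List[str]:
    """Return IR owners sorted by decreasing likelihood of having produced the AR."""
    ar = unigram_counts(ar_text)
    scores = {user: nb_log_likelihood(ar, unigram_counts(text)) for user, text in irs.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical identified records for two users.
irs = {
    "u1": "zzzz pizza pizzazz",
    "u2": "great tea at a neat cafe",
}
nb_ranking = rank_by_nb("jazzy pizza", irs)
```

The first entry of `nb_ranking` is the Top-1 guess; checking the first ten entries gives the Top-10 guess.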
Kullback-Leibler Divergence (KLD)
Identified Record (IR)
Anonymous Record (AR)
Increasing Sorted List of IRs
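A minimal sketch of the KLD ranking step. It assumes the symmetric version is the sum of the divergences in both directions and that token distributions are smoothed (add-one) so every probability is non-zero; both choices, along with the helper names and toy records, are assumptions for illustration.

```python
import math
from collections import Counter
from typing import Dict, List

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def distribution(text: str, alpha: float = 1.0) -> Dict[str, float]:
    """Smoothed letter-unigram distribution of a text (alpha is an assumed constant)."""
    counts = Counter(ch for ch in text.lower() if ch in ALPHABET)
    total = sum(counts.values()) + alpha * len(ALPHABET)
    return {tok: (counts[tok] + alpha) / total for tok in ALPHABET}


def kld(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Kullback-Leibler divergence D(p || q) over a shared token set."""
    return sum(p[t] * math.log(p[t] / q[t]) for t in p)


def symmetric_kld(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Symmetric divergence, taken here as D(p || q) + D(q || p)."""
    return kld(p, q) + kld(q, p)


def rank_by_kld(ar_text: str, irs: Dict[str, str]) -> List[str]:
    """Return IR owners sorted by increasing distance from the AR."""
    ar = distribution(ar_text)
    dist = {user: symmetric_kld(ar, distribution(text)) for user, text in irs.items()}
    return sorted(dist, key=dist.get)


# Hypothetical identified records for two users.
irs = {
    "u1": "zzzz pizza pizzazz",
    "u2": "great tea at a neat cafe",
}
kld_ranking = rank_by_kld("jazzy pizza", irs)
```

Unlike the NB score, a smaller value means a better match here, which is why the list is sorted in increasing order.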
Maximum Likelihood Estimation
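As a sketch of maximum likelihood estimation for token probabilities: the estimate of P(token | record) is the relative frequency of the token in that record. The optional `alpha` smoothing parameter and the function name are assumptions added for illustration; with `alpha=0` the result is the plain MLE.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, Sequence


def mle_distribution(
    tokens: Iterable[Hashable],
    vocabulary: Sequence[Hashable],
    alpha: float = 0.0,
) -> Dict[Hashable, float]:
    """Relative-frequency estimate of P(token) over a fixed vocabulary.

    alpha = 0 gives the plain maximum likelihood estimate; alpha > 0 is an
    assumed smoothing option that avoids zero probabilities.
    """
    counts = Counter(tokens)
    vocab = list(vocabulary)
    total = sum(counts[t] for t in vocab) + alpha * len(vocab)
    if total == 0:
        raise ValueError("no observations and no smoothing")
    return {t: (counts[t] + alpha) / total for t in vocab}


# Ratings given in one hypothetical user's reviews.
ratings = [5, 4, 5, 3, 5]
p_rating = mle_distribution(ratings, vocabulary=[1, 2, 3, 4, 5])
```

Ratings never observed in the record receive probability zero under plain MLE, which is why a matching model that takes logarithms usually needs some form of smoothing.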
Tokens
Unigram: ‘a’, …, ‘z’
Digram: ‘aa’, ‘ab’, …, ‘zz’
Rating: 1, 2, 3, 4, 5
Category: Restaurant, Beauty and Spa, Education
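The four token types can be extracted roughly as follows. This is a sketch under assumptions: text is lowercased, non-letters are dropped, and digrams are taken from adjacent letters within a word. The `Review` class and function names are hypothetical.

```python
import re
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Review:
    text: str
    rating: int      # 1 to 5
    category: str    # e.g. "Restaurant"


def unigrams(text: str) -> List[str]:
    """Single letters 'a'..'z' in order of appearance."""
    return [ch for ch in text.lower() if "a" <= ch <= "z"]


def digrams(text: str) -> List[str]:
    """Adjacent letter pairs within each word (an assumed convention)."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w[i:i + 2] for w in words for i in range(len(w) - 1)]


def tokenize(review: Review) -> Dict[str, Counter]:
    """Token counts of one review, grouped by token type."""
    return {
        "unigram": Counter(unigrams(review.text)),
        "digram": Counter(digrams(review.text)),
        "rating": Counter([review.rating]),
        "category": Counter([review.category]),
    }


tokens = tokenize(Review("Good tea!", 5, "Restaurant"))
```

Summing these counters over all reviews in a record gives the counts from which the per-record token distributions are estimated.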
Lexical Token Results
NB - Unigram
Size 60: LR 83% (Top-1), LR 96% (Top-10)
KLD - Unigram
Size 60: LR 83% (Top-1), LR 96% (Top-10)
NB - Digram
Size 20: LR 97% (Top-1)
Size 10: LR 88% (Top-1)
KLD - Digram
Size 60: LR 99% (Top-1)
Size 30: LR 75% (Top-1)
Improvement (1): Combining Lexical and Non-Lexical Tokens
Combining in the NB model: straightforward, using P(Rating|IR) and P(Category|IR)
But for KLD? Weighted average
First, combine Rating and Category: weight 0.5
Second, combine non-lexical and lexical: weight 0.997/0.97 for Unigram/Digram
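The two combination rules can be sketched as below. For NB, independence lets the rating and category log-probabilities simply be added to the lexical log-likelihood. For KLD, the sketch uses a two-stage weighted average; it assumes that the 0.997 (unigram) or 0.97 (digram) weight is applied to the lexical distance and the remainder to the non-lexical one, which the slides do not state explicitly. Function and parameter names are hypothetical.

```python
def combine_nb(lexical_ll: float, rating_ll: float, category_ll: float) -> float:
    """Combined NB score: log-likelihoods add under the independence assumption."""
    return lexical_ll + rating_ll + category_ll


def combine_kld(
    d_lexical: float,
    d_rating: float,
    d_category: float,
    w_rating: float = 0.5,
    w_lexical: float = 0.997,
) -> float:
    """Two-stage weighted average of KLD distances.

    Stage 1: rating and category are averaged with weight w_rating.
    Stage 2: lexical and non-lexical distances are averaged with weight
    w_lexical (assumed here to be the weight on the lexical side).
    """
    d_nonlexical = w_rating * d_rating + (1.0 - w_rating) * d_category
    return w_lexical * d_lexical + (1.0 - w_lexical) * d_nonlexical


# Illustrative distances only, not measured values.
combined_unigram = combine_kld(1.0, 2.0, 4.0, w_lexical=0.997)
combined_digram = combine_kld(1.0, 2.0, 4.0, w_lexical=0.97)
```

Keeping the weights as parameters makes it easy to swap the assignment if the lexical and non-lexical sides are weighted the other way around.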
Token Combining Results
Rating, Category, and Unigram - NB
Gain: up to 20%
Size 30: 60% to 80%
Size 60: 83% to 96%
Rating, Category, and Unigram - KLD
Gain: up to 12%
Size 40: 68% to 80%
Size 60: 83% to 92%
Rating, Category, and Digram - NB
Rating, Category, and Digram - KLD
What about Restricting Identified Record (IR) Size?
Anonymous Record Size (AR)
Identified Record Size (IR)
Matching Model
Top-X Linkability, X: 1 and 10
Restricted IR - NB
Affected by IR size
Restricted IR - KLD
Performed better for smaller IR
Size 20 or less: improved
The rest: comparable
What about Matching All AR’s at once?
Anonymous Record Size (AR)
Identified Record Size (IR)
Matching Model
Top-X Linkability, X: 1 and 10
Anonymous Records (AR’s)
Identified Records (IR’s)
Matching Model
Improvement (2): Matching All IR’s At Once
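The slides do not spell out the joint matching procedure, so the following is only one plausible sketch: a greedy one-to-one assignment in which the closest (AR, IR) pair is fixed first and each IR can be claimed by at most one AR. It is an assumption for illustration, not necessarily the authors' algorithm; `match_all` and the toy distances are hypothetical.

```python
from typing import Dict


def match_all(distances: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """Greedy one-to-one matching of anonymous records to identified records.

    `distances[ar][ir]` is the distance (e.g. a KLD value) between an AR and
    an IR; smaller means more similar.
    """
    pairs = sorted(
        (d, ar, ir) for ar, row in distances.items() for ir, d in row.items()
    )
    assigned: Dict[str, str] = {}
    used_irs = set()
    for _, ar, ir in pairs:
        # Skip pairs whose AR is already matched or whose IR is already taken.
        if ar not in assigned and ir not in used_irs:
            assigned[ar] = ir
            used_irs.add(ir)
    return assigned


# Matched independently, both ARs would pick u1; matched jointly, ar2 falls
# back to u2 because u1 is already claimed by the closer ar1.
distances = {
    "ar1": {"u1": 0.10, "u2": 0.30},
    "ar2": {"u1": 0.20, "u2": 0.25},
}
matching = match_all(distances)
```

The intuition is that knowing every AR belongs to a different user lets a confident match rule out that user for the remaining ARs.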
MatchAll - Restricted
Gain, up to 16%
Size 30, From 74% To 90%
MatchAll - Full
Gain, up to 23%
Size 20, From 35% To 55%
Improvement (3): For Small IR Size
Changing it to: 0.5 + Review Length
Results – Improvement (3)
Size 10: 89% to 92%
Size 7: 79% to 84%
Gain up to 5%
Discussion: Implications
Cross-Referencing
Review Spam
Non-Prolific Users
Gradually become prolific
IR of 20: linkability around 70%
Anonymous Record Size
Linkability is high even for small ARs (92% for an AR of 10)
60 is only 20% of the minimum user contribution
Discussion (cont.)
Unigram Token
Very comparable for larger ARs
Entails fewer resources in the attack: 26 unigram tokens vs. 676 digram tokens
Future Directions
• Improving more for Small AR’s
• Other Probabilistic Models
• Using Stylometry
• Exploring Linkability in other Preference Databases
• More than one AR for different Users: Exploring it more
Conclusion
Extensive Study to Assess Linkability of User Reviews
For a large set of users
Using very simple features
Users are very exposed, even with simple features and a large number of authors
Thank you all!