Exploring Linkability of User Reviews

Mishari Almishari and Gene Tsudik

Computer Science Department, University of California, Irvine

malmisha,[email protected]

Increasing Popularity of Reviewing Sites

Yelp: more than 39M visitors and 15M reviews in 2010

Category, Rating

Rising Awareness of Privacy

How Does Privacy Apply to Reviews?

Traceability

Linkability of Ad Hoc Reviews

Linkability of Several Accounts

Contribution

Extensive study to measure privacy/linkability in user reviews

Propose models that adequately identify authors

Settings & Problem Formulation

IR: Identified Record

AR: Anonymous Record

Anonymous Record Size (AR): 1, 5, 10, 20, … 60

Identified Record Size (IR)

Matching Model

Top-X Linkability, X: 1 and 10
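Top-X linkability can be read as the fraction of anonymous records whose true author appears among the first X candidates returned by the matching model. The sketch below illustrates that reading; the function name and the toy rankings are hypothetical and not taken from the study.

```python
def top_x_linkability(ranked_lists, true_authors, x):
    """Fraction of ARs whose true author is among the first x ranked IRs."""
    hits = sum(
        1
        for ranking, author in zip(ranked_lists, true_authors)
        if author in ranking[:x]
    )
    return hits / len(true_authors)


# Hypothetical toy data: one ranked candidate list per AR.
rankings = [["u1", "u2", "u3"], ["u3", "u2", "u1"], ["u2", "u3", "u1"]]
truth = ["u1", "u2", "u3"]

top1 = top_x_linkability(rankings, truth, 1)  # only the first AR is a Top-1 hit
top2 = top_x_linkability(rankings, truth, 2)  # all three are Top-2 hits
```

A larger X can only keep or raise the ratio, since every Top-1 hit is also a Top-10 hit.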

Dataset

1 million reviews

2,000 users, each with more than 300 reviews
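One way to build an experiment from such a dataset is to split each user's reviews into an anonymous record of a chosen size and an identified record made of the rest. The sketch below assumes a random split and an optional cap on the IR size; both choices, and all names, are illustrative assumptions rather than the procedure used in the study.

```python
import random


def split_user_reviews(reviews, ar_size, ir_size=None, seed=0):
    """Randomly split one user's reviews into (AR, IR).

    Assumption: the AR is a random sample and the IR is drawn from the
    remaining reviews, optionally capped at ir_size.
    """
    rng = random.Random(seed)
    shuffled = list(reviews)
    rng.shuffle(shuffled)
    ar = shuffled[:ar_size]
    rest = shuffled[ar_size:]
    ir = rest if ir_size is None else rest[:ir_size]
    return ar, ir


# Hypothetical user with 300 reviews.
reviews = [f"review {i}" for i in range(300)]
ar, ir = split_user_reviews(reviews, ar_size=60)
ar_small, ir_small = split_user_reviews(reviews, ar_size=10, ir_size=20)
```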

Methodology

Naïve Bayesian Model

Kullback-Leibler Model

Symmetric Version

Naïve Bayesian (NB)

Identified Record(IR)

Anonymous Record(AR)

Decreasing Sorted List of IRs
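A minimal sketch of the NB ranking step, assuming letter unigrams as tokens: each IR yields a token distribution, the AR is scored by the log-probability of its tokens under that distribution, and the IRs are sorted in decreasing order of score. The add-one smoothing, function names, and toy texts are assumptions made for illustration.

```python
import math
import string
from collections import Counter

ALPHABET = string.ascii_lowercase


def unigram_counts(text):
    """Count the letters a-z in a text, ignoring case and other characters."""
    return Counter(ch for ch in text.lower() if ch in ALPHABET)


def token_probs(counts, smoothing=1.0):
    """Smoothed token probabilities (smoothing is an assumption of this sketch)."""
    total = sum(counts.values()) + smoothing * len(ALPHABET)
    return {t: (counts[t] + smoothing) / total for t in ALPHABET}


def nb_rank(ar_text, irs):
    """Return authors sorted by decreasing log P(AR tokens | IR)."""
    ar_counts = unigram_counts(ar_text)
    scores = {}
    for author, ir_text in irs.items():
        probs = token_probs(unigram_counts(ir_text))
        scores[author] = sum(n * math.log(probs[t]) for t, n in ar_counts.items())
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical identified records for two authors.
irs = {
    "alice": "zzzz pizza pizzazz jazz",
    "bob": "the tea there three tree",
}
ranking = nb_rank("jazzy pizza", irs)
```

Working in log space avoids numeric underflow when the AR contains many tokens.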

Kullback-Leibler Divergence (KLD)

Identified Record(IR)

Anonymous Record(AR)

Increasing Sorted List of IRs
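A minimal sketch of the KLD ranking step, again assuming letter unigrams: the AR and each IR are turned into token distributions, the symmetric divergence between them is computed, and the IRs are sorted in increasing order of divergence. The smoothing (needed here to avoid division by zero) and all names are assumptions of this sketch.

```python
import math
import string
from collections import Counter

ALPHABET = string.ascii_lowercase


def distribution(text, smoothing=1.0):
    """Smoothed letter distribution of a text (smoothing is an assumption)."""
    counts = Counter(ch for ch in text.lower() if ch in ALPHABET)
    total = sum(counts.values()) + smoothing * len(ALPHABET)
    return {t: (counts[t] + smoothing) / total for t in ALPHABET}


def symmetric_kld(p, q):
    """KL(p || q) + KL(q || p) over a shared token set."""
    return sum(
        p[t] * math.log(p[t] / q[t]) + q[t] * math.log(q[t] / p[t]) for t in p
    )


def kld_rank(ar_text, irs):
    """Return authors sorted by increasing symmetric divergence from the AR."""
    p = distribution(ar_text)
    distances = {a: symmetric_kld(p, distribution(t)) for a, t in irs.items()}
    return sorted(distances, key=distances.get)


# Hypothetical identified records for two authors.
irs = {
    "alice": "zzzz pizza pizzazz jazz",
    "bob": "the tea there three tree",
}
ranking = kld_rank("jazzy pizza", irs)
```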

Maximum Likelihood Estimation

Tokens

Unigram: ‘a’, …, ‘z’

Digram: ‘aa’, ‘ab’, …, ‘zz’

Rating: 1, 2, 3, 4, 5

Category: Restaurant, Beauty and Spa, Education
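The lexical tokens and their maximum likelihood estimates can be sketched as follows. The sketch assumes that digrams are adjacent letter pairs inside a word and do not span spaces; that choice, like the function names, is an assumption and is not stated on the slides.

```python
import re
from collections import Counter


def unigrams(text):
    """All letters a-z in the text, case-folded."""
    return re.findall(r"[a-z]", text.lower())


def digrams(text):
    """Adjacent letter pairs within each word (assumption: no pairs across spaces)."""
    pairs = []
    for word in re.findall(r"[a-z]+", text.lower()):
        pairs.extend(word[i:i + 2] for i in range(len(word) - 1))
    return pairs


def mle(tokens):
    """Maximum likelihood estimate: relative frequency of each observed token."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}


review = "Good food"
uni = mle(unigrams(review))  # 'o' accounts for 4 of the 8 letters
di = mle(digrams(review))    # 'oo' accounts for 2 of the 6 digrams
```

Ratings and categories can be passed to the same `mle` function, one token per review.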

Lexical Token Results

NB - Unigram

Size 60: LR 83% (Top-1), LR 96% (Top-10)

KLD - Unigram

LR 96% (Top-10)

NB - Digram

Size 10: LR 88% (Top-1)

KLD - Digram

Size 30: LR 75% (Top-1)

Improvement (1): Combining Lexical and Non-Lexical Tokens

Combining in the NB model is straightforward: P(Rating|IR), P(Category|IR)
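Under the naïve independence assumption, adding rating and category tokens to the NB model amounts to adding their log-probabilities to the lexical score. The sketch below illustrates this; the add-one smoothing, the record layout, and the three-category domain are illustrative assumptions.

```python
import math
import string
from collections import Counter

LETTERS = list(string.ascii_lowercase)
RATINGS = [1, 2, 3, 4, 5]
# Illustrative category domain; the real one is assumed to be larger.
CATEGORIES = ["restaurant", "beauty and spa", "education"]


def smoothed_probs(values, domain):
    """Add-one smoothed probabilities over a fixed domain (an assumption)."""
    counts = Counter(values)
    total = len(values) + len(domain)
    return {v: (counts[v] + 1) / total for v in domain}


def nb_log_score(ar_tokens, ir_tokens, domain):
    """Sum of log P(token | IR) over the AR tokens of one token type."""
    probs = smoothed_probs(ir_tokens, domain)
    return sum(math.log(probs[t]) for t in ar_tokens)


def combined_score(ar, ir):
    """Lexical, rating, and category evidence simply add up in log space."""
    return (
        nb_log_score(ar["letters"], ir["letters"], LETTERS)
        + nb_log_score(ar["ratings"], ir["ratings"], RATINGS)
        + nb_log_score(ar["categories"], ir["categories"], CATEGORIES)
    )


# Hypothetical records: identical letters, different ratings and categories.
ir_a = {"letters": list("abab"), "ratings": [5, 5, 4],
        "categories": ["restaurant", "restaurant"]}
ir_b = {"letters": list("abab"), "ratings": [1, 2, 1],
        "categories": ["education", "education"]}
ar = {"letters": list("ab"), "ratings": [5], "categories": ["restaurant"]}

score_a = combined_score(ar, ir_a)
score_b = combined_score(ar, ir_b)
```

In this toy case the lexical scores tie, so the non-lexical tokens alone decide the ranking.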

But for KLD? Weighted Average

First, combine Rating and Category (weight: 0.5)

Second, combine non-lexical and lexical (weight: 0.997/0.97 for Unigram/Digram)
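The two-step weighted average for KLD can be sketched as below. Which term each weight multiplies is not recoverable from the slide; the sketch assumes that 0.5 balances rating against category and that 0.997 (unigram) or 0.97 (digram) is the weight of the lexical divergence. The divergence values in the example are made up.

```python
def weighted_average(first, second, weight):
    """weight * first + (1 - weight) * second."""
    return weight * first + (1 - weight) * second


def combined_kld(d_lexical, d_rating, d_category,
                 lexical_weight, rating_weight=0.5):
    """Combine three divergences in two steps.

    Assumption: lexical_weight applies to the lexical term and
    rating_weight applies to the rating term.
    """
    d_non_lexical = weighted_average(d_rating, d_category, rating_weight)
    return weighted_average(d_lexical, d_non_lexical, lexical_weight)


# Hypothetical divergences between one AR and one IR.
combined = combined_kld(
    d_lexical=0.02, d_rating=0.4, d_category=0.6, lexical_weight=0.997
)
```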

Token Combining Results

Rating, Category, and Unigram - NB

Gain: up to 20%

Size 30: 60% to 80%

Size 60: 83% to 96%

Rating, Category, and Unigram - KLD

Gain: up to 12%

Size 40: 68% to 80%

Size 60: 83% to 92%

Rating, Category, and Digram - NB

Rating, Category, and Digram - KLD

What about Restricting Identified Record (IR) Size?



Matching Model

Top-X Linkability, X: 1 and 10


Restricted IR - NB

Affected by IR size

Restricted IR - KLD

Performed better for smaller IRs

Size 20 or less: improved

The rest: comparable

What about Matching All AR’s at once?



Anonymous Records (AR’s)

Matching Model

Identified Records (IR’s)

Improvement (2): Matching All AR’s At Once
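Matching records independently can assign several anonymous records to the same author, while matching them together can enforce a one-to-one assignment. The sketch below uses a simple greedy strategy for this, which is an assumption and not necessarily the procedure used in the study; it also assumes exactly one AR per candidate author. Scores and names are hypothetical.

```python
def match_all_greedy(scores):
    """Greedy one-to-one matching.

    scores[ar][author] is a similarity where higher is better. Pairs are
    visited from best to worst, and a pair is kept only if both its AR and
    its author are still unassigned.
    """
    pairs = sorted(
        ((s, ar, author) for ar, row in scores.items() for author, s in row.items()),
        key=lambda p: p[0],
        reverse=True,
    )
    assignment, used_authors = {}, set()
    for _, ar, author in pairs:
        if ar not in assignment and author not in used_authors:
            assignment[ar] = author
            used_authors.add(author)
    return assignment


# Hypothetical similarity scores for two ARs and two authors.
scores = {
    "ar1": {"u1": 0.9, "u2": 0.8},
    "ar2": {"u1": 0.7, "u2": 0.3},
}

# Independent matching: each AR takes its own best author (u1 is picked twice).
independent = {ar: max(row, key=row.get) for ar, row in scores.items()}

# Joint matching: u1 goes to ar1, so ar2 falls back to u2.
joint = match_all_greedy(scores)
```

An optimal assignment solver could replace the greedy loop; the greedy version is shown only because it is short.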

MatchAll - Restricted

Gain, up to 16%

Size 30, From 74% To 90%

MatchAll - Full

Gain, up to 23%

Size 20, From 35% To 55%

Improvement (3): For Small IR Size

Changing it to: 0.5 + Review Length

Results – Improvement (3)

Size 10: 89% to 92%

Size 7: 79% to 84%

Gain up to 5%

Discussion

Implications: cross-referencing, review spam

Non-prolific users gradually become prolific; with an IR of 20, linkability is around 70%

Anonymous record size: linkability is high even for small ARs (92% for an AR of 10); an AR of 60 is only 20% of the minimum user contribution

Discussion (cont.)

Unigram tokens: very comparable for larger ARs; entail fewer resources in the attack (26 vs. 676 tokens)
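The two figures are the sizes of the unigram and digram token spaces over the 26-letter alphabet, as this short check shows:

```python
import string
from itertools import product

# Every single letter, and every ordered pair of letters.
unigram_space = list(string.ascii_lowercase)
digram_space = ["".join(pair) for pair in product(string.ascii_lowercase, repeat=2)]

unigram_count = len(unigram_space)  # 26
digram_count = len(digram_space)    # 26 * 26 = 676
```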

Future Directions

• Improving more for small AR’s

• Other probabilistic models

• Using stylometry

• Exploring linkability in other preference databases

• More than one AR for different users: exploring it more

Conclusion

Extensive study to assess linkability of user reviews

For a large set of users

Using very simple features

Users are very exposed, even with simple features and a large number of authors

Thank you all!

Date post:	24-Feb-2016