2009/2/2
Ieng-Fat Lam, Kuan-Ta Chen, and Ling-Jyh Chen
Institute of Information Science, Academia Sinica
Presenter: Ieng-Fat Lam
Involuntary Information Leakage in Social Network Services

Outline
Introduction
Motivation
Research Method
Results
Discussion
Conclusion

Social Networking Services (SNSs)
For example
• Myspace, Facebook, Orkut, Yahoo! 360
• Mixi, GREE (Japan)
• Wretch (Taiwan)
Have become very popular
Host millions of profiles
Introduction ::

Users in SNSs
Social activities
• Meet new friends, contact existing friends
• Share resources over the Internet
Personal information is usually published
• Photos
• Identity information
• Contact information

Disclosing personal information
A double-edged sword
• Lets other people find you and get to know you
• But some people may not respond nicely
• Risk of personal information being used by malicious people
"I am Lee-Da Nu! I love movies. I am 23 years old, single!!"

Not revealing personal information?
"I never disclose my info on the Internet!"

Information revealed by friends

Information revealed
[I got it!]
Real name: Andrew Richman
Gender: Male
Age: 20 ~ 22
Education record:
• Sunrise elementary school
• St. John secondary school
• St. Paul University

Involuntary information leakage
A user may want to protect his/her identity
• But it may be unintentionally revealed by friends
• Such leakage is hard to detect, due to the distributed nature of the Internet
• It is becoming a serious threat to privacy
Motivation ::

In this study
We would like to
• Investigate the extent of involuntary information leakage
• Gather data from Wretch (http://www.wretch.cc)
  The most popular SNS in Taiwan
  About 4 million user profiles
• Quantify the degree of such leakage
  Real name, age, and education record
• Discuss potential means to mitigate the problem

Data Collection
User ID list (crawled): john123, Aron, roserose, iamboy, …
1. Pick an ID randomly from the ID list
2. Obtain the user's profile and friend list (HTML)
3. Parse and save the crawled user data
4. Add the user IDs from the friend list (e.g. Andy, Orange, …) to the ID list
5. Update the ID list
Research Method ::
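The five crawl steps above can be sketched roughly as follows. This is only an illustration of the loop, not the crawler used in the study: `crawl`, `fetch`, and `max_users` are hypothetical names, and `fetch` stands in for whatever code downloads and parses a profile page.

```python
import random
from typing import Callable, Dict, Iterable, List, Optional, Set, Tuple

# fetch(user_id) -> (profile fields, friend IDs). Hypothetical: supplied by the caller.
Fetch = Callable[[str], Tuple[Dict[str, str], List[str]]]


def crawl(seed_ids: Iterable[str], fetch: Fetch, max_users: int,
          rng: Optional[random.Random] = None
          ) -> Dict[str, Tuple[Dict[str, str], List[str]]]:
    rng = rng or random.Random()
    pending: List[str] = list(dict.fromkeys(seed_ids))  # the ID list, de-duplicated
    seen: Set[str] = set(pending)
    crawled: Dict[str, Tuple[Dict[str, str], List[str]]] = {}
    while pending and len(crawled) < max_users:
        # 1. Pick an ID randomly (swap it to the end so pop() is cheap).
        i = rng.randrange(len(pending))
        pending[i], pending[-1] = pending[-1], pending[i]
        user_id = pending.pop()
        # 2. Obtain the profile and friend list.
        profile, friends = fetch(user_id)
        # 3. Save the parsed data.
        crawled[user_id] = (profile, friends)
        # 4-5. Add newly discovered friend IDs to the ID list.
        for friend_id in friends:
            if friend_id not in seen:
                seen.add(friend_id)
                pending.append(friend_id)
    return crawled
```

Passing `fetch` in as a parameter keeps the sketch independent of any particular site or HTML layout.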

An example
User profile
Friend list

Overview of Crawled Data
Wretch data
Number of users                766,972 (20%)
Number of effective users      592,548 (15%)
Number of connections          7,619,212
Avg. connections per user      11.5
* An effective user has at least one "outgoing" friend connection

Analysis of Name Leakage
Friend annotations in Wretch
• Free-form text that describes a friend
• Used for
  Classification
  The real name or nickname of a friend
  A feature of a friend
For example
• *Beauty Cathy Brown – The hottest girl of Nightingale High School
• [[ School Mate ]] Tony MY BUDDY

Name Inference Process
1. Obtain friend annotations (for each profile)
2. Generate name candidates
Infer first name

Generate name candidates
To infer the real name of a profile
• Collect all of its incoming annotations
• Extract name candidates from the annotations
Example annotations written by friends (Andy, Aron, Sammy)
• "Andrew!!"
• "Yo~ Bros. Andrew!!"
• "Old Mr. Richman!!"
• "Cool~~ Andrew Richman!!"

Generate name candidates (cont.)
Extraction method
• Break the text into tokens at
  Symbols: <space>, <tab>, '#', '@', etc.
  Punctuation marks: ' " , . ( ) [ ]
  Connective words (in Chinese)
• Apply Chinese-specific naming rules, e.g. 陳寬達 (Chen Kuan-Ta)
  Two-character tokens are first name candidates
  Three-character tokens are full name candidates
• A duplication count is associated with each candidate

An example
Annotations written by friends (e.g. Andy)
• 德榮!! (Andrew!!)
• 喔~德榮~德榮兄!! (Yo~ Andrew~ Bros Andrew!!)
• 老劉~!! (Old Mr. Richman~!!)
• 超帥~~ 劉德榮!! (Cool~~ Andrew Richman!!)
Name candidates [duplication count]
• 德榮 (Andrew) [1]
• 超帥 (Cool) [0]
• 劉德榮 (Andrew Richman) [0]
• 德榮兄 (Bros Andrew) [0]
• 喔 (Yo) [0]
• 老劉 (Old Mr. Richman) [0]
Legend: three-character tokens are full name candidates, two-character tokens are first name candidates, and the number in brackets is the duplication count
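The extraction step and the example above can be approximated with the short sketch below. Several details are assumptions rather than facts from the slides: the exact delimiter set (the slides end the list with "etc."), the `CONNECTIVES` word list, and the reading of the duplication count as "occurrences minus one", which is inferred from 德榮 appearing twice and being shown with [1].

```python
import re
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# Assumed delimiter set: whitespace, '#', '@', common punctuation, plus '~' and '!'.
DELIMITERS = re.compile(r"""[\s#@~!'",.()\[\]！～，。（）]+""")
# Illustrative connective words only; the list used in the study is not given.
CONNECTIVES = ("和", "與", "跟")


def tokenize(annotation: str, connectives: Iterable[str] = CONNECTIVES) -> List[str]:
    text = annotation
    for word in connectives:
        text = text.replace(word, " ")
    return [tok for tok in DELIMITERS.split(text) if tok]


def name_candidates(annotations: Iterable[str]) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Return (first name candidates, full name candidates) with duplication counts."""
    counts = Counter(tok for ann in annotations for tok in tokenize(ann))
    first: Dict[str, int] = {}
    full: Dict[str, int] = {}
    for tok, n in counts.items():
        if len(tok) == 2:      # two characters: first name candidate
            first[tok] = n - 1
        elif len(tok) == 3:    # three characters: full name candidate
            full[tok] = n - 1
    return first, full
```

Tokens of other lengths, such as 喔, are simply dropped in this sketch.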

Inference of full name (1 / 5)
Common family name
• The family name part is a common family name [1]
• The duplication count is greater than 1
• For example, for the full name candidate "Andrew Richman"
  If "Andrew Richman" appears in more than one annotation
  If "Richman" is a common family name
[1] Chih-Hao Tsai, "Common Chinese Names", http://technology.chtsai.org/namefreq/
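A minimal sketch of this first rule, under stated assumptions: the family name is taken to be the first character of a three-character candidate (compound family names are ignored), `common_family_names` is a set the caller loads from a list such as [1], and the function name is hypothetical. The threshold is a parameter because the slide states both "duplication count greater than 1" and "more than one annotation".

```python
from typing import Dict, List, Set


def full_names_by_family_name(full_candidates: Dict[str, int],
                              common_family_names: Set[str],
                              min_duplication: int = 1) -> List[str]:
    """Full name candidates whose first character is a common family name."""
    matches = [name for name, dup in full_candidates.items()
               if dup > min_duplication and name[:1] in common_family_names]
    # Most frequently repeated candidates first.
    return sorted(matches, key=lambda name: -full_candidates[name])
```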

Inference of full name (2 / 5)
First name as a substring of the full name
• A first name candidate appears as a substring, in the right position
• The duplication count is greater than 1
• For example, for the full name candidate "Andrew Richman"
  If "Andrew Richman" appears in more than one annotation
  If "Andrew" is also a first name candidate
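The second rule might look like the sketch below. "In the right position" is interpreted here as "the characters after the family name", which is an assumption based on the FN + given-name layout of Chinese names; the function name and the `min_duplication` parameter are hypothetical.

```python
from typing import Collection, Dict, List


def full_names_by_first_name(full_candidates: Dict[str, int],
                             first_candidates: Collection[str],
                             min_duplication: int = 1) -> List[str]:
    """Full name candidates whose given-name part is itself a first name candidate."""
    return [name for name, dup in full_candidates.items()
            if dup > min_duplication and name[1:] in first_candidates]
```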

Inference of full name (3 / 5)
Common full name
• Compare with an existing full name list [2]
• National college exam enrollment list
  List maintained from 1994 to 2007
  574,010 distinct full names
[2] Chih-Hao Tsai, "A list of Chinese Names", http://technology.chtsai.org/namelist/

Inference of full name (4 / 5)
Nickname decomposition
• A Chinese name has the form FN GN1-GN2 (陳寬達): family name, then a two-character given name
• Possible nicknames
  Prefix + X
  Prefix + X + X
  X + postfix
  where X can be FN, GN1, or GN2
• For example, for "Andrew Richman"
  We also have "Bros Andrew"
  "Bros" is a predefined prefix
  Removing "Bros" leaves "Andrew"
  "Andrew" is in "Andrew Richman"
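The decomposition can be sketched as below. The affix lists are illustrative only (the predefined prefixes and postfixes used in the study are not given), and `nickname_core` and `supports` are hypothetical helper names.

```python
from typing import Optional

PREFIXES = ("老", "小", "阿")    # illustrative nickname prefixes
POSTFIXES = ("兄", "哥", "姐")   # illustrative nickname postfixes


def nickname_core(token: str) -> Optional[str]:
    """Strip one nickname affix and return what is left, or None if no affix matches."""
    for prefix in PREFIXES:
        if token.startswith(prefix) and len(token) > len(prefix):
            core = token[len(prefix):]
            # Prefix + X + X: collapse the doubled character back to X.
            if len(core) == 2 and core[0] == core[1]:
                core = core[0]
            return core
    for postfix in POSTFIXES:
        if token.endswith(postfix) and len(token) > len(postfix):
            return token[:-len(postfix)]
    return None


def supports(full_name: str, token: str) -> bool:
    """True if the token looks like a nickname built from part of full_name."""
    core = nickname_core(token)
    return core is not None and core in full_name
```

With these lists, 德榮兄 (Bros Andrew) and 老劉 (Old Mr. Richman) both support the candidate 劉德榮, while 超帥 (Cool) does not.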

Inference of full name (5 / 5)
Common word removal
• Applies if no matching candidate is found by the rules above
• The duplication count is greater than 1
• The full name candidate is not a nickname
  It does not contain any nickname prefix or postfix
• It is not (and is not based on) a common word
  Compared against 100,511 common words
• Select the candidate with the highest duplication count
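As a rough sketch of this fallback rule: the function names are hypothetical, `is_nickname` is a caller-supplied check for nickname prefixes or postfixes, and "based on a common word" is interpreted as "contains a common word of at least two characters", which is an assumption.

```python
from typing import Callable, Dict, Optional, Set


def contains_common_word(candidate: str, common_words: Set[str]) -> bool:
    """True if any substring of length >= 2 is a known common word."""
    n = len(candidate)
    return any(candidate[i:j] in common_words
               for i in range(n) for j in range(i + 2, n + 1))


def fallback_full_name(full_candidates: Dict[str, int],
                       common_words: Set[str],
                       is_nickname: Callable[[str], bool],
                       min_duplication: int = 1) -> Optional[str]:
    eligible = {name: dup for name, dup in full_candidates.items()
                if dup > min_duplication
                and not is_nickname(name)
                and not contains_common_word(name, common_words)}
    # Select the remaining candidate with the highest duplication count.
    return max(eligible, key=eligible.get) if eligible else None
```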

Inference of First Name
Uses the same method as the inference of full name
• Common first name
  Compare with 208,581 first names
  Requires a duplication count greater than 1
• Nickname decomposition
• Common word removal

Name Inference Results
Ratio of inferred names
Type of name               Ratio of name inference
Nickname                   60%
Real name (full name)      30%
First name                 72%
Real name or first name    78%
Results ::

Validation
Manual examination of real names
• Randomly selected 1,000 profiles
• 738 of them are unique and correct
  Further examinations gave similar results
• Wrong cases: the user's nickname
• Sufficient to support the conjecture
  Involuntary real name leakage occurs in real-life social network systems, and the degree of leakage is significant

Ratio of Name Leakage
Figure 2: Ratio of name leakage based on users' gender
Figure 3: Relation of users' age and ratio of name leakage

Risk Analysis
To confirm that the identity leakage is involuntary
• We check the inferred name against the user's profile
  Fewer than 0.1% of users reveal their real names
To quantify the tendency to use real names
• Degree of Using Real name (DUR)
  Ratio of a user's outgoing annotations that contain the real name of the annotation target
• Degree of being Called by Real name (DCR)
  Ratio of a user's incoming annotations that contain the user's real name

Example of DUR and DCR
User: "Andrew"
Friends: Raymond Aron, John Lennon, Jay Leno, David Jones, Sammy Hagar
Annotations shown: "Our King!", "Yo~What'sup man", "Cool~Andrew Richman", "Bros Andrew"
Criteria      DUR    DCR
First name    4/5    1/5
Full name     1/5    1/5
Either        5/5    2/5
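The two ratios can be computed along these lines. This sketch covers only the "either" criterion (an annotation counts if it contains any known form of the real name); the function names, the data layout, and the simple substring test are assumptions, not the study's implementation.

```python
from fractions import Fraction
from typing import Dict, Iterable, List, Optional, Sequence, Tuple


def contains_any(text: str, names: Iterable[str]) -> bool:
    return any(name in text for name in names)


def dur(outgoing: List[Tuple[str, str]],
        real_names: Dict[str, Sequence[str]]) -> Optional[Fraction]:
    """Degree of Using Real name.

    outgoing: (target user ID, annotation text) pairs written by one user.
    real_names: known real-name forms (full and first name) for each target.
    """
    if not outgoing:
        return None
    hits = sum(1 for target, text in outgoing
               if contains_any(text, real_names.get(target, ())))
    return Fraction(hits, len(outgoing))


def dcr(incoming: List[str], own_names: Sequence[str]) -> Optional[Fraction]:
    """Degree of being Called by Real name, over one user's incoming annotations."""
    if not incoming:
        return None
    hits = sum(1 for text in incoming if contains_any(text, own_names))
    return Fraction(hits, len(incoming))
```

Separating the "first name" and "full name" criteria would need an extra check so that an annotation containing the full name is not also counted under the first name.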

Positive relation between DUR and DCR
Figure 4: Relation of DUR and DCR

Involuntary leakage of age and education records
Inferring age
• Proceeds in rounds
• If X has disclosed his/her age and has a friend Y
• And X and Y have a relation such as "classmate", "same class", …
• Then assign the age of X to Y
• Then check Y's "classmates" in the next round
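The round-based age inference can be sketched as a breadth-first propagation over "classmate" relations. The assumptions here are that the relation is treated as symmetric, that a self-disclosed age is never overwritten, and that the first age to reach a user wins; `infer_ages` is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def infer_ages(disclosed: Dict[str, int],
               classmate_pairs: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Return inferred ages for users who did not disclose one themselves."""
    neighbours: Dict[str, Set[str]] = defaultdict(set)
    for x, y in classmate_pairs:
        neighbours[x].add(y)
        neighbours[y].add(x)

    ages = dict(disclosed)
    frontier: List[str] = sorted(disclosed)  # round 0: users with a disclosed age
    while frontier:
        next_frontier: List[str] = []
        for x in frontier:
            for y in sorted(neighbours[x]):
                if y not in ages:            # keep the first age assigned to y
                    ages[y] = ages[x]
                    next_frontier.append(y)
        frontier = next_frontier             # next round: check the classmates of Y
    return {user: age for user, age in ages.items() if user not in disclosed}
```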

Involuntary leakage of age and education records
Inferring education records
• Same as inferring age
• Divided into four education levels, inferred separately
  Elementary school
  Junior high school
  Senior high school
  College
• Relations are defined by keywords such as "same school", "same college", etc.

Inference results
Figure 5: Inference results of users' ages
Figure 6: Inference results of users' education records

Validation
Cross-validation
• Verify inferred ages
  Based on self-disclosed education records
• Verify inferred education records
  Based on self-disclosed ages
• The age differences should be small
  This indicates that our inference results are accurate
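The cross-validation check amounts to measuring age gaps between pairs of schoolmates. A minimal sketch follows; the function names are hypothetical, and the one-year tolerance is an illustrative choice rather than a figure from the study.

```python
from typing import Dict, Iterable, List, Tuple


def age_differences(ages: Dict[str, int],
                    schoolmate_pairs: Iterable[Tuple[str, str]]) -> List[int]:
    """Absolute age gap for every schoolmate pair where both ages are known."""
    return [abs(ages[a] - ages[b])
            for a, b in schoolmate_pairs if a in ages and b in ages]


def share_within(differences: List[int], max_gap: int = 1) -> float:
    """Fraction of pairs whose age gap is at most max_gap years."""
    if not differences:
        return 0.0
    return sum(d <= max_gap for d in differences) / len(differences)
```

Feeding inferred ages together with self-disclosed schoolmate pairs checks the age inference; feeding self-disclosed ages together with inferred schoolmate pairs checks the education inference.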

Validation Results
Figure 7: The inferred age differences between pairs of self-disclosed schoolmates in the four education levels
Figure 8: The self-disclosed age differences between pairs of inferred schoolmates in the four education levels

Threats caused by identity leakage
Stalking
Spamming
• In our data set
  46% of users disclosed a valid email address
  Spam with friends' (spoofed) email addresses
Phishing
• Spear phishing / social phishing
  Includes personal information in the phishing email
  Spoofs a friend's email address
Discussion ::

Spear Phishing or Spam
"Dear Mr. Richman, we are eBay customer service. We are concerned about your security; please update your personal information."
"Dear Mr. Andrew Richman, you have won 100,000,000 USD!! It comes from the lottery of the St. Paul University fund."

Social Phishing or Spam
"Hey, Andrew, I am Sammy. I recommend you a cool site!! http://spam.com"
"Bros, I am David! The St. Paul University student association is having a party next month. You need to transfer the registration fee ASAP. See you there."

Potential Solutions
Possible ways to mitigate the problem
A. Personal privacy settings
B. Browsing scope settings
C. Owner's confirmation
D. Applying Disclosure Control of Natural Language information (DCNL)
   Proposed by Haruno Kataoka et al.

Personal Privacy Settings
1. Hide personal information
2. Hide social connections (by level)
3. Deny annotations that use certain words
4. Limit access to friend relations or annotations to specific users
"Don't call me by my real name, call me 007!"

Browsing Scope Settings
Prevent large-scale download of user profiles
• Including through third-party APIs
Limit browsing scope
• Group partitioning / "invitation letter" mechanism

Owner's Confirmation
For every operation related to friend relations
At least prevents unintentional leakage of personal information
Example
• Friend: "I want to use 'Cool Andrew Richman', may I?"
• Owner: "Sure!!!"
• Malicious man: "Hey Mr. Richman, you are the lucky winner!"
• Owner: "My name is public, everyone knows me!!"

Applying DCNL (Haruno Kataoka et al.)
An ideal way to preserve
• Searchability
• Availability
• Connectedness
• While no sensitive information is disclosed
• Rather than being "insecure" or "un-enjoyable"
An implementation is expected
• Support for different languages would be best

Conclusion
We quantify the extent of name leakage
• Using the Wretch data set
• 78% of users are at risk of involuntary name leakage
• Users' ages and education records are also at risk
  They can be inferred from friends' disclosed information
Beware of Internet scams and phishing
Conclusion ::

Questions?
Thank you!

Ratio of self-disclosure
Figure 1: Ratio of Self-disclosure