Don’t Follow Me: Spam Detection in Twitter
January 12, 2011
In-seok An
SNU Internet Database Lab.
Alex Hai Wang, The Pennsylvania State University
International Conference on Security and Cryptography, 2010

Outline
– Introduction
– Social Graph model
– Features
– Data Set
– Spam Detection
– Experiments
– Evaluation
– Conclusion

Introduction
Social Network Service ( SNS )
– An online service, platform, or site that focuses on building and reflecting social networks or social relations among people
– Among the most popular applications of Web 2.0
Twitter
– Founded in 2006
– One of the fastest growing SNSs
    Surged by more than 2,800% in 2009
– Social networking site and microblogging service

Introduction
Twitter
– You can post your latest updates
– Messages ( tweets ) from the Twitter users you are following ( subscribed to )

Introduction
Spammers on Twitter
– The goal of Twitter
    Allow friends to communicate and stay connected through the exchange of short messages
– Spammers also use Twitter as a tool to post malicious links
– More than 3% of messages on Twitter are spam ( Analytics, 2009 )
– The offensive trending topic attack on February 20 ( CNET, 2009 )

Introduction
Methods to report spam
– By clicking “report as spam”
– By posting a tweet in the format “@spam @username”
This reporting service is itself abused by both hoaxes and spam
Legitimate users can be mistakenly suspended by Twitter’s anti-spam action

Social Graph model
Twitter can be modeled as a directed graph
– G = ( V , A )
– V : a set of nodes ( vertices )
– A : a set of arcs ( edges )
Four types of relationships on Twitter can be defined
– Follower
    Node v_j is a follower of node v_i if the arc a = ( j , i ) is contained in A
– Friend
    Node v_j is a friend of node v_i if the arc a = ( i , j ) is contained in A
– Mutual friend
    Nodes v_i and v_j are mutual friends if both arcs a = ( i , j ) and a = ( j , i ) are contained in A
– Stranger
    Nodes v_i and v_j are strangers if neither arc a = ( i , j ) nor a = ( j , i ) is contained in A

Social Graph model
A simple Twitter graph
– A follows B
    A is a follower of B
    B is a friend of A
– B follows C, and C follows B
    B and C are mutual friends
– A doesn’t follow C, and C doesn’t follow A
    A and C are strangers

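The four relationship definitions can be checked against the simple A, B, C graph with a short sketch. It assumes the arc set is stored as a Python set of (source, target) pairs in which an arc ( x , y ) means that x follows y; the function name `relationship` is illustrative and not taken from the paper.

```python
def relationship(arcs, i, j):
    """Classify how node j relates to node i.

    An arc (x, y) means that x follows y, as in the definitions above.
    """
    j_follows_i = (j, i) in arcs
    i_follows_j = (i, j) in arcs
    if j_follows_i and i_follows_j:
        return "mutual friend"
    if j_follows_i:
        return "follower"  # j is a follower of i
    if i_follows_j:
        return "friend"    # j is a friend of i
    return "stranger"


# The simple graph: A follows B; B and C follow each other.
arcs = {("A", "B"), ("B", "C"), ("C", "B")}

for i, j in [("B", "A"), ("A", "B"), ("B", "C"), ("A", "C")]:
    print(f"{j} relative to {i}: {relationship(arcs, i, j)}")
```

A set of pairs keeps each membership test constant-time, which is enough for an illustration; a crawl with tens of millions of relationships would more likely use adjacency lists.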
Social Graph model
[Figure: Twitter social graph]

Features
– Graph-based features
– Content-based features

Features: Graph-based features
Twitter’s spam and abuse policy
– “If you have a small number of followers compared to the amount of people you are following, it may be considered as a spam account”
Three features
– The number of friends
    The outdegree d_O( v_i ) of a node v_i ( arcs leaving v_i )
– The number of followers
    The indegree d_I( v_i ) of a node v_i ( arcs entering v_i )
– The reputation of a user
    R( v_i ) = d_I( v_i ) / ( d_I( v_i ) + d_O( v_i ) )

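The three graph-based features can be sketched as follows, assuming arcs are (source, target) pairs in which ( x , y ) means that x follows y, so followers are incoming arcs and friends are outgoing arcs. The function name `graph_features` and the value returned for an isolated node are illustrative choices, not part of the paper.

```python
def graph_features(arcs, node):
    """Return (friends, followers, reputation) for one node.

    An arc (x, y) means that x follows y.
    """
    followers = sum(1 for (_, dst) in arcs if dst == node)  # indegree
    friends = sum(1 for (src, _) in arcs if src == node)    # outdegree
    total = followers + friends
    # Reputation is undefined for an isolated node; 0.0 is an arbitrary choice.
    reputation = followers / total if total else 0.0
    return friends, followers, reputation


arcs = {("A", "B"), ("B", "C"), ("C", "B")}
for node in "ABC":
    print(node, graph_features(arcs, node))
```

An account that follows many users but has few followers gets a reputation close to 0, which matches the policy quoted above; an account that follows nobody but has at least one follower gets a reputation of exactly 1.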
Features: Content-based features
Duplicate tweets
– An account may be considered as spam if it posts duplicate content
– Detected by measuring the Levenshtein distance ( edit distance )
    Minimum cost of transforming one string into another through a sequence of edit operations ( deletion, insertion and substitution of individual symbols )
    Clean the data first by removing the words containing “@”, “#”, “http://” and “www.”
– The number of duplicate tweets is used as a measurement
    Counted over the user’s 20 most recent tweets
    Two tweets are considered duplicates only when they are exactly the same

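A minimal sketch of the duplicate-tweet feature, under stated assumptions: unit cost for every edit operation, whitespace tokenization for the cleaning step, and a tweet list ordered most recent first. The helper names are illustrative.

```python
from itertools import combinations

MARKERS = ("@", "#", "http://", "www.")


def clean(tweet):
    """Drop every word that contains one of the markers listed above."""
    words = [w for w in tweet.split() if not any(m in w for m in MARKERS)]
    return " ".join(words)


def levenshtein(s, t):
    """Edit distance with unit cost for deletion, insertion and substitution."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            current.append(min(
                previous[j] + 1,               # delete cs
                current[j - 1] + 1,            # insert ct
                previous[j - 1] + (cs != ct),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]


def duplicate_pairs(tweets, window=20):
    """Count pairs among the first `window` tweets whose cleaned text is identical."""
    cleaned = [clean(t) for t in tweets[:window]]
    return sum(1 for a, b in combinations(cleaned, 2) if levenshtein(a, b) == 0)
```

Because only a distance of 0 counts as a duplicate, a plain string comparison would give the same count; the distance function is kept so that a looser threshold could be tried.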
Features: Content-based features
Need for cleaning

Features: Content-based features
HTTP links
– An account is considered as spam if its updates consist mainly of links and not personal updates
– Twitter filters out the URLs linked to known malicious sites
    URL shortening services like bit.ly give attackers an opportunity to spam
– The number of tweets containing HTTP links is used as a measurement
[Figure: a tweet with an HTTP link pointing directly to a malicious site ( http://porno.-com ), and a tweet with an HTTP link ( http://bit.ly/ab3cd ) that a URL shortening service redirects to the same malicious site]

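Counting the tweets that contain a link can be sketched with a regular expression. The pattern below is an assumption: it treats anything starting with "http://", "https://" or "www." as a link, which is broader than the slide's wording.

```python
import re

# Assumed link pattern; the slides do not specify how links are recognized.
LINK = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)


def count_link_tweets(tweets):
    """Number of tweets that contain at least one link."""
    return sum(1 for t in tweets if LINK.search(t))


sample = [
    "see http://bit.ly/ab3cd",
    "no links here",
    "visit www.example.com today",
]
print(count_link_tweets(sample))
```

A shortened URL counts like any other link here; the feature measures how link-heavy an account is, not where the links lead.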
Features: Content-based features
Replies and mentions
– You can send a reply message to another user
    @username + message
– You can also mention another @username anywhere in the tweet
    message + @username + message
– Twitter automatically collects all tweets containing your username
– You can reply to anyone, no matter whether they are your friends/followers or not
– Spammers abuse this feature
– The number of tweets containing a mention or reply is used as a measurement

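The reply and mention formats above can be told apart by where the @username appears. This sketch assumes usernames consist of word characters only; the function names are illustrative.

```python
import re

USERNAME = re.compile(r"@\w+")


def classify(tweet):
    """Return 'reply', 'mention' or 'plain' for one tweet."""
    if USERNAME.match(tweet):   # match() only succeeds at the start of the text
        return "reply"
    if USERNAME.search(tweet):  # search() finds a username anywhere
        return "mention"
    return "plain"


def count_reply_or_mention(tweets):
    """Number of tweets that are a reply or contain a mention."""
    return sum(1 for t in tweets if classify(t) != "plain")
```

The pattern also fires on text such as e-mail addresses, so a real pipeline would need a stricter rule.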
Features: Content-based features
Spam tweets using mention or reply

Features: Content-based features
Trending topics
– The most-mentioned terms on Twitter at that moment, week, or month
– Users can add a hashtag to a tweet
    #tagname
– If there are many tweets containing the same term, it may become a trending topic
– Twitter considers an account as spam
    If it posts multiple unrelated updates to a topic using the # symbol

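Hashtag usage can be tallied in the same way. The sketch below counts how often each tag occurs, as a rough stand-in for "most-mentioned terms", and how many tweets carry at least one hashtag. Lower-casing the tags is an assumption, and the function names are illustrative.

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#(\w+)")


def hashtag_counts(tweets):
    """Occurrences of each hashtag, case-insensitive."""
    counts = Counter()
    for t in tweets:
        counts.update(tag.lower() for tag in HASHTAG.findall(t))
    return counts


def count_hashtag_tweets(tweets):
    """Number of tweets that contain at least one hashtag."""
    return sum(1 for t in tweets if HASHTAG.search(t))


sample = ["#news today", "more #News and #sports", "nothing"]
print(hashtag_counts(sample).most_common(1))
print(count_hashtag_tweets(sample))
```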
Data Set
– Collected over 3 weeks, from January 3 to January 24, 2010
– 25,847 users
– 500k tweets
– 49M follower/friend relationships

Spam Detection
Several classification algorithms
– Decision tree
– Neural network
– Support vector machines
– K-nearest neighbors
– Naïve Bayesian
Naïve Bayesian outperforms all other methods
– The Bayesian classifier is robust to noise
    It uses posterior probability
– A spam probability is calculated for each individual user based on its behaviors, instead of giving a general rule

Spam Detection
Naïve Bayesian classifier
– X : each Twitter account is considered as a vector X of feature values
– Y : one of two classes, spam and non-spam
– The features are assumed to be conditionally independent given the class
    P( Y | X ) = P( X | Y ) P( Y ) / P( X )

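The posterior above can be computed as in the following sketch. The slides do not say how P( X | Y ) is modeled, so a Gaussian likelihood per feature is assumed here purely for illustration; the function names and the toy feature values are also made up.

```python
import math


def fit(rows, labels):
    """Estimate the class prior and a per-feature mean/variance for each class."""
    model = {}
    for y in set(labels):
        members = [r for r, l in zip(rows, labels) if l == y]
        stats = []
        for column in zip(*members):
            mean = sum(column) / len(column)
            var = sum((v - mean) ** 2 for v in column) / len(column)
            stats.append((mean, var + 1e-9))  # small floor avoids division by zero
        model[y] = (len(members) / len(rows), stats)
    return model


def posterior(model, x):
    """Return P(Y | X = x) for every class Y."""
    log_joint = {}
    for y, (prior, stats) in model.items():
        total = math.log(prior)  # log P(Y)
        for value, (mean, var) in zip(x, stats):
            # Conditional independence: log P(X | Y) is a sum over features.
            total += -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
        log_joint[y] = total
    # P(X) is the sum of the joint over classes; subtract the max for stability.
    peak = max(log_joint.values())
    weights = {y: math.exp(v - peak) for y, v in log_joint.items()}
    evidence = sum(weights.values())
    return {y: w / evidence for y, w in weights.items()}


# Toy accounts described by (reputation, number of link tweets); values are invented.
rows = [(0.05, 18), (0.10, 20), (0.02, 19), (0.5, 2), (0.7, 5), (0.6, 1)]
labels = ["spam", "spam", "spam", "non-spam", "non-spam", "non-spam"]
model = fit(rows, labels)
print(posterior(model, (0.08, 19)))
```

The output is a probability per account rather than a hard rule, which is the property highlighted on the previous slide.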
Experiments
To evaluate the detection method
– 500 Twitter user accounts are labeled manually into two classes ( spam or not )
    By reading the 20 most recent tweets
    By checking the friends and followers of the user
– Results show that around 1% of the accounts in the data set are spam
Additional spam data are added to the data set
    To simulate reality and avoid bias in the crawling and labeling methods
– The study in Analytics, 2009, shows that 3% of messages on Twitter are spam
Search “@spam” on Twitter and collect additional spam data
– Only a small number of the results report real spam
– The data set is mixed to contain around 3% spam data

Experiments
Graph-based features
– The number of friends for each Twitter account
– Only 30% of spam accounts follow a large number of users
    Spammers don’t need to follow other users

Experiments
Graph-based features
– The number of followers for each Twitter account
– Spam accounts usually do not have a large number of followers
    Some spam accounts have a relatively large number of followers

Experiments
Graph-based features
– The reputation for each Twitter account
– The reputation of most legitimate users is between 30% and 90%
    Some spam accounts have a 100% reputation

Experiments
Content-based features
– The number of pairwise duplications
– Not all spam accounts post multiple duplicate tweets
    We cannot depend on this feature alone

Experiments
Content-based features
– The number of mentions and replies
– Most spam accounts have the maximum of 20 “@” symbols
    This lures legitimate users into reading their spam messages or clicking their links

Experiments
Content-based features
– The number of links
– Some legitimate users also include links in all tweets; for example, some companies join Twitter to promote their own web sites

Experiments
Content-based features
– The number of hashtag signs

Evaluation
The evaluation of the overall process
– Confusion matrix
– Precision : P = a / ( a + c )
– Recall : R = a / ( a + b )
– F-measure : F = 2PR / ( P + R )
Each classifier is trained 10 times
– Each time using 9 of the 10 partitions as training data
– Computing the confusion matrix using the tenth partition as test data

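The three measures and the 10-partition scheme can be sketched as below. The confusion matrix itself is not reproduced in this text, so the reading of the cells is inferred from the formulas: a is spam classified as spam, b is spam classified as non-spam, and c is non-spam classified as spam. The round-robin assignment of items to partitions is an assumption; the slides do not say how the partitions are formed.

```python
def scores(a, b, c):
    """Precision, recall and F-measure from confusion-matrix cells.

    a: spam predicted as spam, b: spam predicted as non-spam,
    c: non-spam predicted as spam. Assumes a > 0 so no denominator is zero.
    """
    precision = a / (a + c)
    recall = a / (a + b)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


def ten_fold(n_items, k=10):
    """Yield (train_indices, test_indices), holding out each partition once."""
    folds = [list(range(start, n_items, k)) for start in range(k)]
    for held_out in range(k):
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, folds[held_out]


for train, test in ten_fold(500):
    # A classifier would be trained on `train` and scored on `test` here.
    assert len(train) == 450 and len(test) == 50
```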
Evaluation
The evaluation results
– The naïve Bayesian classifier has the best overall performance
Finally, the Bayesian classifier learned from the labeled data is applied to the entire data set
– Information about a total of 25,817 users
– Precision of the spam detection system
    392 users are classified as spam
    348 users are real spam accounts and 44 users are false alarms
    89% precision

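The reported precision follows from the counts above, taking a as the 348 real spam accounts and c as the 44 false alarms in P = a / ( a + c ):

```latex
P = \frac{a}{a + c} = \frac{348}{348 + 44} = \frac{348}{392} \approx 0.888 \approx 89\%
```

Recall cannot be derived in the same way, because the number of spam accounts that the classifier missed in the full data set is not given.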
Conclusion
The spam behavior in a popular online SNS, Twitter
– To formalize the problem, a social graph model is proposed
Novel content-based and graph-based features are proposed
– Graph-based features
    The number of friends
    The number of followers
    The reputation of the user
– Content-based features
    The number of pairwise duplications
    The number of mentions and replies
    The number of links
    The number of hashtags
The data set is analyzed and the performance of the detection system is evaluated

Conclusion
Among the graph-based features
– The proposed reputation feature has the best performance
– Not many spam accounts follow a large number of users
– Some spammers have many followers
For the content-based features
– Most spam accounts have multiple duplicate tweets
– But not all spam accounts post multiple duplicate tweets
    We cannot rely on this feature alone
Several popular classification algorithms are studied and evaluated
The naïve Bayesian classifier achieves 89% precision