
Don’t Follow me : Spam Detection in Twitter

January 12, 2011
In-seok An

SNU Internet Database Lab.

Alex Hai Wang
The Pennsylvania State University
International Conference on Security and Cryptography, 2010

Outline
Introduction
Social Graph model
Features
Data Set
Spam Detection
Experiments
Evaluation
Conclusion

2 / 37

Introduction
Social Network Service ( SNS )
– An online service, platform, or site that focuses on building and reflecting social networks or social relations among people
– One of the most popular applications of Web 2.0
Twitter
– Founded in 2006
– One of the fastest growing SNSs
  Grew by more than 2,800% in 2009
– Social networking site and microblogging service

3 / 37

Introduction
Twitter
– You can post your latest updates
– Messages ( tweets ) from the Twitter users that you are following ( subscribing to )

4 / 37

Introduction
Spammers in Twitter
– The goal of Twitter
  Allow friends to communicate and stay connected through the exchange of short messages
– Spammers also use Twitter as a tool to post malicious links
– More than 3% of messages on Twitter are spam ( Analytics, 2009 )
– The offensive trending topic attack on February 20 ( CNET, 2009 )

5 / 37

Introduction
Methods to report spam
– By clicking on “report as spam”
– By posting a tweet in the format “@spam @username”
This report service is also abused by both hoaxes and spam
Legitimate users can be mistakenly suspended by Twitter’s anti-spam action

6 / 37


7 / 37

Social Graph model
Twitter can be modeled as a directed graph
– G = ( V , A )
– V : a set of nodes ( vertices )
– A : a set of arcs ( edges )
Four types of relationships on Twitter can be defined
– Follower
  Node $v_j$ is a follower of node $v_i$ if the arc a = ( j , i ) is contained in A
– Friend
  Node $v_j$ is a friend of node $v_i$ if the arc a = ( i , j ) is contained in A
– Mutual Friend
  Node $v_i$ and node $v_j$ are mutual friends if both arcs a = ( i , j ) and a = ( j , i ) are contained in A
– Stranger
  Node $v_i$ and node $v_j$ are strangers if neither arc a = ( i , j ) nor a = ( j , i ) is contained in A

8 / 37

Social Graph model
A simple Twitter graph
– A follows B
  A is a follower of B
  B is a friend of A
– B follows C, C follows B
  B and C are mutual friends
– A doesn’t follow C, C doesn’t follow A
  A and C are strangers

9 / 37
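The four relationship types can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper: it assumes an arc ( i , j ) means "i follows j", and the function name `relationship` is hypothetical.

```python
from typing import Set, Tuple

# Assumption: an arc (i, j) means "node i follows node j".
Arc = Tuple[str, str]


def relationship(arcs: Set[Arc], i: str, j: str) -> str:
    """Describe what node j is to node i (hypothetical helper)."""
    i_follows_j = (i, j) in arcs
    j_follows_i = (j, i) in arcs
    if i_follows_j and j_follows_i:
        return "mutual friend"
    if j_follows_i:
        return "follower"  # j is a follower of i
    if i_follows_j:
        return "friend"    # j is a friend of i
    return "stranger"


# The simple graph from the slide: A follows B, B and C follow each other.
arcs = {("A", "B"), ("B", "C"), ("C", "B")}

print(relationship(arcs, "B", "A"))  # what A is to B
print(relationship(arcs, "A", "B"))  # what B is to A
print(relationship(arcs, "B", "C"))  # what C is to B
print(relationship(arcs, "A", "C"))  # what C is to A
```

Storing the arcs in a set keeps each membership test constant-time, which matters little for three nodes but would for the 49M relationships in the crawled data.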

Social Graph model Twitter Social Graph

10 / 37

Outline
Introduction
Social Graph model
Features
– Graph-based features
– Content-based features
Data Set
Spam Detection
Experiments
Evaluation
Conclusion

11 / 37

Features
Graph-based features
Twitter’s spam and abuse policy
– “if you have a small number of followers compared to the amount of people you are following, it may be considered as a spam account”
Three features
– The number of friends
  The outdegree $d_O(v_i)$ of a node $v_i$
– The number of followers
  The indegree $d_I(v_i)$ of a node $v_i$ ( an arc ( j , i ) points into $v_i$ when $v_j$ follows $v_i$ )
– The reputation of a user
  $R(v_i) = \dfrac{d_I(v_i)}{d_I(v_i) + d_O(v_i)}$
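As a rough sketch of how the three graph-based features could be computed, the snippet below counts degrees from a list of arcs. It assumes an arc ( i , j ) means "i follows j", so the indegree of a node counts its followers and the outdegree counts its friends. The function name `graph_features` and the toy arcs are hypothetical.

```python
from collections import Counter


def graph_features(arcs):
    """Return followers, friends and reputation per node (illustrative only).

    Assumption: an arc (i, j) means "i follows j".
    """
    followers = Counter(j for _, j in arcs)  # indegree: arcs pointing into a node
    friends = Counter(i for i, _ in arcs)    # outdegree: arcs leaving a node
    features = {}
    for node in set(followers) | set(friends):
        d_in, d_out = followers[node], friends[node]
        total = d_in + d_out
        features[node] = {
            "followers": d_in,
            "friends": d_out,
            # Reputation: followers / (followers + friends).
            "reputation": d_in / total if total else 0.0,
        }
    return features


# Toy graph: A follows B, B and C follow each other.
toy_arcs = [("A", "B"), ("B", "C"), ("C", "B")]
for node, feats in sorted(graph_features(toy_arcs).items()):
    print(node, feats)
```

An account that follows many users but is followed by few gets a reputation close to 0, which matches the policy quoted on the slide.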

12 / 37

Features
Content-based features
Duplicate Tweets
– An account may be considered as spam if it posts duplicate content
– Detected by measuring the Levenshtein distance ( edit distance )
  Minimum cost of transforming one string into another through a sequence of edit operations ( deletion, insertion and substitution of individual symbols )
  Clean the data by filtering out, as stop words, the words containing “@”, “#”, “http://” and “www.”
– The number of duplicate tweets is used as a measurement
  In the user’s 20 most recent tweets
  Two tweets are considered duplicates only when they are exactly the same
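The duplicate-tweet measurement can be sketched as follows. This is an illustrative reading of the slide rather than the author's implementation: tweets are cleaned of words containing the listed markers, and a pair counts as a duplicate when its edit distance is zero. The helper names and the assumption that the list holds the most recent tweet first are hypothetical.

```python
from itertools import combinations

# Words containing any of these markers are dropped before comparison.
STOP_MARKERS = ("@", "#", "http://", "www.")


def levenshtein(s: str, t: str) -> int:
    """Edit distance with unit-cost deletion, insertion and substitution."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def clean(tweet: str) -> str:
    """Remove words that contain a mention, hashtag or link marker."""
    kept = [w for w in tweet.split() if not any(m in w for m in STOP_MARKERS)]
    return " ".join(kept)


def count_duplicate_pairs(tweets, limit=20):
    """Count pairs of identical cleaned tweets among the first `limit` tweets.

    Assumption: `tweets` is ordered with the most recent tweet first.
    """
    cleaned = [clean(t) for t in tweets[:limit]]
    return sum(1 for a, b in combinations(cleaned, 2) if levenshtein(a, b) == 0)


sample = [
    "Win a free phone @alice http://bit.ly/ab3cd",
    "Win a free phone @bob http://bit.ly/xy9zz",
    "Good morning everyone",
]
print(count_duplicate_pairs(sample))  # the first two tweets match after cleaning
```

Without the cleaning step the first two sample tweets would differ in the mention and the link, which is why cleaning precedes the comparison.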

13 / 37

FeaturesContent-based features Need for cleaning

14 / 37

Features
Content-based features
HTTP Links
– An account is considered as spam if its updates consist mainly of links and not personal updates
– Twitter filters out the URLs linked to known malicious sites
  URL shortening services like bit.ly give attackers an opportunity to spam
– The number of tweets containing HTTP links is used as a measurement

[Figure: a tweet whose HTTP link points directly to a malicious site, and a tweet whose link http://bit.ly/ab3cd reaches the malicious site through a URL shortening service]
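Counting the tweets that contain a link could look like the sketch below. The regular expression is an assumption: the slide only mentions HTTP links, and matching `https://` and `www.` prefixes as well is an illustrative choice. The function name is hypothetical.

```python
import re

# Assumption: a "link" is any token starting with http://, https:// or www.
LINK_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)


def count_tweets_with_links(tweets, limit=20):
    """Number of tweets, among the first `limit`, that contain a link."""
    return sum(1 for t in tweets[:limit] if LINK_RE.search(t))


sample = [
    "check this out http://bit.ly/ab3cd",
    "hello world",
    "visit www.example.com now",
]
print(count_tweets_with_links(sample))
```

A shortened link is counted like any other link here. The sketch does not resolve where the link leads.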

15 / 37

Features
Content-based features
Replies and Mentions
– You can send a reply message to another user
  @username + message
– You can also mention another @username anywhere in the tweet
  Message + @username + message
– Twitter automatically collects all tweets containing your username
– You can reply to anyone, no matter whether they are your friends/followers or not
– Spammers abuse this feature
– The number of tweets containing a mention or reply is used as a measurement

16 / 37

FeaturesContent-based features Spam tweets using mention or reply

17 / 37

Features
Content-based features
Trending topic
– The most-mentioned terms on Twitter at that moment, week, or month
– Users can add a hashtag to a tweet
  #tagname
– If there are many tweets containing the same term, it may become a trending topic
– Twitter considers an account as spam
  If you post multiple unrelated updates to a topic using the # symbol
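The mention/reply and hashtag counts can be sketched with two regular expressions. The patterns are assumptions made for illustration: a mention or hashtag must not be preceded by a word character, so that an e-mail address is not counted as a mention. The function name is hypothetical.

```python
import re

# Assumption: "@name" / "#tag" count only when not glued to a preceding word.
MENTION_RE = re.compile(r"(?<!\w)@\w+")
HASHTAG_RE = re.compile(r"(?<!\w)#\w+")


def count_mentions_and_hashtags(tweets, limit=20):
    """Return (tweets with a mention or reply, tweets with a hashtag)."""
    recent = tweets[:limit]
    with_mention = sum(1 for t in recent if MENTION_RE.search(t))
    with_hashtag = sum(1 for t in recent if HASHTAG_RE.search(t))
    return with_mention, with_hashtag


sample = [
    "@bob hello",                  # reply
    "thanks to @alice #ff",        # mention and hashtag
    "mail me at me@example.com",   # e-mail address, not a mention
    "#news today",                 # hashtag only
]
print(count_mentions_and_hashtags(sample))
```

Because only the 20 most recent tweets are inspected, each count is bounded by 20.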

18 / 37


19 / 37

Data Set
Data Set
– 3 weeks from January 3 to January 24, 2010
– 25,847 users
– 500k tweets
– 49M follower/friend relationships

20 / 37


21 / 37

Spam Detection
Several classification algorithms
– Decision tree
– Neural network
– Support vector machines
– K-nearest neighbors
– Naïve Bayesian
Naïve Bayesian outperforms all other methods
– Bayesian classifier is noise robust
  It uses posterior probability
– A spam probability is calculated for each individual user based on its behaviors, instead of giving a general rule

22 / 37

Spam Detection
Naïve Bayesian classifier
  $P(Y \mid X) = \dfrac{P(X \mid Y)\, P(Y)}{P(X)}$
– X : each Twitter account is considered as a vector X with feature values
– Y : one of two classes, spam and non-spam
– The features are conditionally independent
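To make the posterior computation concrete, here is a minimal naïve Bayes sketch. It assumes each feature is Gaussian within a class, which the slides do not state. The class name, the two features (reputation, tweets with links) and the training rows are all hypothetical.

```python
import math
from collections import defaultdict


class GaussianNaiveBayes:
    """Tiny naive Bayes with per-class Gaussian features (an assumption)."""

    def fit(self, X, y):
        by_class = defaultdict(list)
        for row, label in zip(X, y):
            by_class[label].append(row)
        self.priors, self.stats = {}, {}
        for label, rows in by_class.items():
            self.priors[label] = len(rows) / len(X)  # P(Y)
            self.stats[label] = []
            for col in zip(*rows):
                mean = sum(col) / len(col)
                # Small constant avoids a zero variance for constant features.
                var = sum((v - mean) ** 2 for v in col) / len(col) + 1e-9
                self.stats[label].append((mean, var))
        return self

    def _log_joint(self, x, label):
        # log P(Y) + sum of log P(x_k | Y): conditional independence.
        total = math.log(self.priors[label])
        for value, (mean, var) in zip(x, self.stats[label]):
            total += -0.5 * math.log(2 * math.pi * var)
            total -= (value - mean) ** 2 / (2 * var)
        return total

    def predict_proba(self, x):
        logs = {c: self._log_joint(x, c) for c in self.priors}
        top = max(logs.values())
        unnorm = {c: math.exp(v - top) for c, v in logs.items()}
        z = sum(unnorm.values())  # plays the role of P(X)
        return {c: u / z for c, u in unnorm.items()}

    def predict(self, x):
        probs = self.predict_proba(x)
        return max(probs, key=probs.get)


# Hypothetical rows: (reputation, tweets with links among the last 20).
X = [(0.05, 19), (0.10, 20), (0.02, 18), (0.50, 3), (0.60, 5), (0.45, 1)]
y = ["spam", "spam", "spam", "non-spam", "non-spam", "non-spam"]
model = GaussianNaiveBayes().fit(X, y)
print(model.predict((0.08, 17)), model.predict_proba((0.08, 17)))
```

Working with log probabilities and normalizing at the end avoids numerical underflow when many features are multiplied together.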

23 / 37


24 / 37

Experiments
To evaluate the detection method
– 500 Twitter user accounts are labeled manually into two classes ( spam or not )
  By reading the 20 most recent tweets
  Checking the friends and followers of the user
– Results show that around 1% of the accounts in the data set are spam
Additional spam data are added to the data set
– To simulate reality and avoid bias in the crawling and labeling methods
  The study in Analytics, 2009, shows that 3% of messages on Twitter are spam
– Search “@spam” on Twitter and collect additional spam data
  Only a small number of the results report real spam
– The data set is mixed to contain around 3% spam data

25 / 37

Experiments
Graph-based features
– The number of friends for each Twitter account
– Only 30% of spam accounts follow a large number of users
  Spammers don’t need to follow other users

26 / 37


– The number of followers for each Twitter account
– Usually the spam accounts do not have a large number of followers
  Some spam accounts have a relatively large number of followers

27 / 37


– The reputation for each Twitter account
– The reputation of most legitimate users is between 30% and 90%
  Some spam accounts have a 100% reputation

28 / 37

Experiments
Content-based features
– The number of pairwise duplications
– Not all spam accounts post multiple duplicate tweets
  We cannot depend on this feature alone

29 / 37


– The number of mentions and replies
– Most spam accounts have the maximum of 20 “@” symbols
  This will lure legitimate users to read their spam messages or click their links

30 / 37


– The number of links
– Some legitimate users also include links in all tweets; for example, some companies join Twitter to promote their own web sites

31 / 37


– The number of hashtag signs

32 / 37


33 / 37

Evaluation
The evaluation of the overall process
– Confusion matrix
  a : spam classified as spam, b : spam classified as non-spam, c : non-spam classified as spam
– Precision : P = a / ( a + c )
– Recall : R = a / ( a + b )
– F-measure : F = 2PR / ( P + R )
Each classifier is trained 10 times
– Each time using 9 out of the 10 partitions as training data
– Computing the confusion matrix using the tenth partition as test data
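The metrics and the 10-partition scheme can be sketched as below. The snippet reads the confusion-matrix counts as a = spam classified as spam, b = spam classified as non-spam, and c = non-spam classified as spam, which is what the precision and recall formulas imply. The round-robin way of forming partitions is an assumption, since the slides do not say how the partitions were built, and both function names are hypothetical.

```python
def precision_recall_f(a: int, b: int, c: int):
    """P = a/(a+c), R = a/(a+b), F = 2PR/(P+R), guarding against zero division."""
    p = a / (a + c) if (a + c) else 0.0
    r = a / (a + b) if (a + b) else 0.0
    f = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f


def k_fold_indices(n: int, k: int = 10):
    """Yield (train, test) index lists; each partition is the test set once.

    Assumption: samples are dealt round-robin into the k partitions.
    """
    folds = [list(range(start, n, k)) for start in range(k)]
    for held_out in range(k):
        train = [idx for f, fold in enumerate(folds) if f != held_out
                 for idx in fold]
        yield train, folds[held_out]


# Hypothetical counts, only to exercise the formulas.
print(precision_recall_f(a=8, b=2, c=2))

# 500 labeled accounts split into 10 partitions of 50.
for train, test in k_fold_indices(500, k=10):
    assert len(train) == 450 and len(test) == 50
```

With the counts reported for the final run of the deck (348 real spam accounts among 392 flagged, hence c = 44), the precision formula gives 348 / 392 ≈ 0.888, consistent with the stated 89%.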

34 / 37

Evaluation
The evaluation results
– Naïve Bayesian classifier has the best overall performance
Finally, the Bayesian classifier learned from the labeled data is applied to the entire data set
– Information about 25,817 users in total
– Precision of the spam detection system
  392 users are classified as spam
  348 users are real spam accounts and 44 users are false alarms
  89% precision

35 / 37

Conclusion
The spam behavior in a popular online SNS, Twitter
– To formalize the problem, a social graph model is proposed
Novel content-based and graph-based features are proposed
– Graph-based features
  The number of friends
  The number of followers
  The reputation of the user
– Content-based features
  The number of pairwise duplications
  The number of mentions and replies
  The number of links
  The number of hashtags
Analyze the data set and evaluate the performance of the detection system

36 / 37

Conclusion
Among the graph-based features
– The proposed reputation feature has the best performance
– Not many spam accounts follow a large number of users
– Some spammers have many followers
For the content-based features
– Most spam accounts have multiple duplicate tweets
– But not all spam accounts post multiple duplicate tweets
  We cannot rely on this feature alone
Several popular classification algorithms are studied and evaluated
The naïve Bayesian classifier achieves 89% precision

37 / 37
