WEB SPAM A By-Product Of The Search Engine Era

Web Enhanced Information Management

Aniruddha Dutta

Department of Computer Science

Columbia University

Contents

• Model of the Web
• Definition of Web Spam
• History of Web Spam
• Types of Web Spam
• Countermeasures
• Conclusion

The World Wide Web

The Web

• Huge

• Distributed content creation and linking (no coordination)

• Structured databases, unstructured text, semi-structured data

• Content includes truth, lies, obsolete information, contradictions, …

Search Engines As Gateways

• Search has become the default gateway to the web

• Very high premium to appear on the first page of search results

– e.g., e-commerce sites

– advertising-driven sites

• This has important economic consequences

Definition Of Web Spam

• Web Spam can be defined as any intentional human activity that gives a web page an unreasonably favorable ranking or importance, i.e. a weight or significance that the page would not naturally deserve. [1]

• This is also called spamming or spamdexing.

History Of Web Spam

• It first appeared with the 1st Generation Search Engine Companies in the 1990s
– The technique came to be known as ‘Glittering Generalities’

• 2nd Generation Search Engine Companies
– Neutralized Glittering Generalities
– Ranked pages according to their popularity
– Popularity was determined by the links pointing to the web page
– Spammers built link farms to circumvent this

• 3rd Generation Search Engine Companies
– Use PageRank and the HITS algorithm to rank pages
– Spammers have found new ways as well!

Boosting Techniques

• These are the spamming techniques that influence the ranking algorithm of the search engine.

• They can be classified into two main categories:
– Term Spamming: manipulating the text of web pages in order to appear relevant to queries
– Link Spamming: creating link structures that boost PageRank or hubs-and-authorities scores

Taxonomy For Boosting Techniques

Types Of Term Spamming

• Body Spam: The spam terms are present in the body of the page. This is the simplest and most common term spamming technique.

• Title Spam: The spam terms are present in the title tag of the web page.

• Meta Tag Spam: The spam terms appear in the meta tags of the web page.
e.g. <meta name="keywords" content="buy, cheap, roses, lily, daffodils, flower vase, pink rose">

Types Of Term Spamming

• Anchor Text Spam: The spam terms appear in the anchor text of links on the web pages.
– Terms in anchor text are given more importance
– The words are indexed for both the target page and the source page
e.g. <a href="buycheapflower.html">Flowers, cheap deals, rose, daffodils, flower vase</a>

Types Of Term Spamming

• URL Spam: Spam terms appear in the URL of web pages.
– Search engines sometimes parse the URL and use its terms to judge whether the page is relevant.

• DNS Spam: Spammers set up a DNS server that resolves any hostname to a single domain.

• Repetition: A term is repeated n times in a field of the web page so that the page appears relevant to a specific query.
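
As a rough illustration of why URL spam and repetition can work, the sketch below splits a URL into candidate terms and computes a naive term-frequency score. The helper names `url_terms` and `term_frequency`, the example URL, and the scoring itself are assumptions made for illustration; real search engines use far more elaborate relevance signals.

```python
import re
from collections import Counter

def url_terms(url: str) -> list[str]:
    """Split a URL into lowercase alphabetic tokens (a deliberately naive tokenizer)."""
    return [t.lower() for t in re.findall(r"[A-Za-z]+", url)]

def term_frequency(text: str, term: str) -> float:
    """Return the fraction of the words in `text` that equal `term`."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    return Counter(words)[term.lower()] / len(words)

# Hypothetical spam URL: the query terms are packed into host and path.
spam_url = "http://buy-cheap-flowers.example.com/cheap/roses.html"
print(url_terms(spam_url))

# Repetition: appending the same term many times inflates its frequency.
honest = "We sell roses and lilies from our local garden"
stuffed = honest + " roses" * 20
print(term_frequency(honest, "roses"), term_frequency(stuffed, "roses"))
```

A ranking function that rewards raw term frequency in a field would score the stuffed text far higher for the query "roses", which is the effect the spammer is after.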

Types Of Term Spamming

• Dumping: A large number of unrelated terms are put together in the fields of the web page.
– Helps the page match a wide variety of queries

• Weaving: Content found on the Web is duplicated and spam terms are inserted in between the copied content.

• Phrase Stitching: Sentences from different sources are concatenated and placed in the fields of the web page.
e.g. His article is about forests as communities of trees. Naco is the world leader in rain forest protection.

Types Of Link Spamming

• Outdegree: Spammers create web pages that have a large number of links pointing to well-known pages.
– Can be done easily by directory cloning

• Indegree: Spammers create pages that have useful content but hidden links to spam pages.
– These pages are called honey pots
– Can also be achieved by adding links in directory structures
– Link farms
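
To see how a link farm can raise the score of a target page, here is a minimal sketch of power-iteration PageRank on a toy graph. The graph, the page names, the damping factor of 0.85, and the handling of dangling pages are all assumptions chosen for illustration, not a description of any particular search engine.

```python
def pagerank(links: dict[str, list[str]], damping: float = 0.85, iters: int = 50) -> dict[str, float]:
    """Power-iteration PageRank over a small graph given as adjacency lists."""
    nodes = set(links) | {v for outs in links.values() for v in outs}
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        new = {u: (1.0 - damping) / n for u in nodes}
        for u in nodes:
            outs = links.get(u, [])
            if outs:
                share = damping * rank[u] / len(outs)
                for v in outs:
                    new[v] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[u] / n
                for v in nodes:
                    new[v] += share
        rank = new
    return rank

# A tiny honest web: three pages in a cycle, plus a target nobody links to.
base = {"a": ["b"], "b": ["c"], "c": ["a"], "target": []}

# The same web with a link farm: five farm pages all link to the target,
# and the target links back so that rank keeps circulating inside the farm.
farm = dict(base)
farm["target"] = [f"farm{i}" for i in range(5)]
for i in range(5):
    farm[f"farm{i}"] = ["target"]

print(pagerank(base)["target"], pagerank(farm)["target"])
```

In this toy setting the target's score rises sharply once the farm is added, even though no honest page links to it.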

Hiding Techniques

• Hiding Techniques: Techniques to hide spam content on a web page.
– Content Hiding
– Cloaking
– Redirection

Types Of Hiding Techniques

• Content Hiding: Spam content on the page is hidden by using
– Color schemes
– Images in place of anchor text

e.g. Using color for content hiding:
<body bgcolor="red"> <font color="red">spam text</font> </body>

Using an image in place of anchor text:
<a href="spam.html"><img src="tg.jpg"></a>
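
A detector for the color trick can be sketched with Python's standard `html.parser` module. The sketch assumes the page sets its background with the `bgcolor` attribute on `<body>` and colors text with `<font color=...>`; the class name `HiddenTextFinder` and the sample page are hypothetical, and CSS-based hiding is not covered.

```python
from html.parser import HTMLParser

class HiddenTextFinder(HTMLParser):
    """Collects text inside <font> tags whose color equals the body's bgcolor."""

    def __init__(self) -> None:
        super().__init__()
        self.background = None  # value of <body bgcolor=...>, if any
        self.color_stack = []   # colors of the currently open <font> tags
        self.hidden = []        # text fragments judged invisible

    def handle_starttag(self, tag, attrs):
        # Attribute names arrive lowercased; normalize the values as well.
        attrs = {k: (v or "").strip().lower() for k, v in attrs}
        if tag == "body":
            self.background = attrs.get("bgcolor")
        elif tag == "font":
            self.color_stack.append(attrs.get("color"))

    def handle_endtag(self, tag):
        if tag == "font" and self.color_stack:
            self.color_stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if (text and self.color_stack and self.background is not None
                and self.color_stack[-1] == self.background):
            self.hidden.append(text)

def find_hidden_text(html: str) -> list[str]:
    """Return the text fragments drawn in the same color as the background."""
    finder = HiddenTextFinder()
    finder.feed(html)
    finder.close()
    return finder.hidden

page = '<body bgcolor="RED"><font color="red">spam text</font><p>visible</p></body>'
print(find_hidden_text(page))
```

The comparison is a plain string match on color names, so it would miss equivalent colors written differently (for example `red` versus `#ff0000`).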

Types Of Hiding Techniques

• Cloaking: Serve one version of a page to crawlers and a different version to users.
– Pages check the IP address of the requester to recognize crawlers
– Pages check the User-Agent field in the HTTP request

Cloaking Example

HTTP request to the page:

GET / HTTP/1.1
Host: yahoo.com
Connection: close
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.9) Gecko/20061206 Firefox/1.5.0.9 Web-Sniffer/1.0.24
Referer: http://web-sniffer.net/
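
The User-Agent check in a request like the one above can be sketched in a few lines, together with a simple way to probe for it. The crawler tokens, the function names, and the page strings are illustrative assumptions; `fetch` stands in for whatever code actually retrieves a page with the given headers.

```python
# Illustrative substrings only; a real cloaker would use a longer list
# and would usually also check the requester's IP address.
CRAWLER_TOKENS = ("googlebot", "bingbot", "slurp")

def is_crawler(headers: dict[str, str]) -> bool:
    """Guess from the User-Agent header whether the request comes from a crawler."""
    agent = headers.get("User-Agent", "").lower()
    return any(token in agent for token in CRAWLER_TOKENS)

def cloaked_page(headers: dict[str, str]) -> str:
    """What a cloaking site does: pick the response based on who is asking."""
    if is_crawler(headers):
        return "keyword-rich page meant for the index"
    return "page shown to human visitors"

def looks_cloaked(fetch) -> bool:
    """Fetch the same page as a browser and as a crawler and compare the results."""
    as_browser = fetch({"User-Agent": "Mozilla/5.0 Firefox/1.5.0.9"})
    as_crawler = fetch({"User-Agent": "Googlebot/2.1"})
    return as_browser != as_crawler

print(looks_cloaked(cloaked_page))
print(looks_cloaked(lambda headers: "same page for everyone"))
```

Comparing two fetches only exposes User-Agent based cloaking; a site that cloaks by IP address would serve both requests the same content.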

Types Of Hiding Techniques

• Redirection: Web pages redirect to spam pages as soon as they are opened.
– Search engines index the normal page
– The user is redirected to the spam page on opening it
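
The slides do not name a redirect mechanism. One common way to redirect a browser on page load is an HTML meta refresh tag, and the sketch below assumes that mechanism and lists the refresh targets found in a page. The class name `MetaRefreshFinder` and the sample markup are hypothetical, and script-based redirects are not detected.

```python
from html.parser import HTMLParser

class MetaRefreshFinder(HTMLParser):
    """Collects (delay_seconds, url) pairs from <meta http-equiv="refresh"> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.targets = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = {k: (v or "") for k, v in attrs}
        if attrs.get("http-equiv", "").lower() != "refresh":
            return
        # The content attribute has the form "<delay>; url=<target>".
        delay, _, rest = attrs.get("content", "").partition(";")
        rest = rest.strip()
        if not rest.lower().startswith("url="):
            return
        try:
            seconds = float(delay.strip() or 0)
        except ValueError:
            return
        self.targets.append((seconds, rest[4:].strip()))

def meta_redirects(html: str) -> list[tuple[float, str]]:
    """Return every meta refresh redirect declared in the page."""
    finder = MetaRefreshFinder()
    finder.feed(html)
    finder.close()
    return finder.targets

page = '<html><head><meta http-equiv="refresh" content="0; url=spam.html"></head></html>'
print(meta_redirects(page))
```

A zero or near-zero delay is the suspicious case: the crawler indexes the page it fetched, while a visitor's browser leaves it almost immediately.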

Trust Rank

• Basic principle: approximate isolation
– It is rare for a “good” page to point to a “bad” (spam) page

• Sample a set of “seed pages” from the web.
• Set the trust of each trusted seed page to 1.
• Propagate trust through links.
• Each page gets a trust value between 0 and 1.
• Use a threshold value and mark all pages below the trust threshold as spam.
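
The steps above can be sketched as a biased PageRank-style propagation. This is a simplified illustration under stated assumptions, not the published algorithm verbatim: trust is split evenly over a page's outgoing links, damped by a factor of 0.85, and rescaled at the end so that the most trusted page has value 1. The toy graph, the seed set, and the threshold of 0.1 are made up.

```python
def trust_rank(links, seeds, damping=0.85, iters=50):
    """Propagate trust from seed pages along outgoing links.

    `links` maps each page to the list of pages it links to.
    Returns scores rescaled so that the most trusted page has value 1.
    """
    if not seeds:
        raise ValueError("at least one seed page is required")
    nodes = set(links) | {v for outs in links.values() for v in outs} | set(seeds)
    seed_score = {u: (1.0 / len(seeds) if u in seeds else 0.0) for u in nodes}
    trust = dict(seed_score)
    for _ in range(iters):
        # Every round, seeds are topped up and trust flows one link further.
        new = {u: (1.0 - damping) * seed_score[u] for u in nodes}
        for u in nodes:
            outs = links.get(u, [])
            for v in outs:
                new[v] += damping * trust[u] / len(outs)
        trust = new
    top = max(trust.values())
    return {u: t / top for u, t in trust.items()}

# Good pages g1..g3 link among themselves; spam pages s1 and s2 link to each
# other and to g1, but no good page links to them (approximate isolation).
web = {
    "g1": ["g2", "g3"],
    "g2": ["g3"],
    "g3": ["g1"],
    "s1": ["s2", "g1"],
    "s2": ["s1"],
}
trust = trust_rank(web, seeds={"g1"})
spam = {page for page, score in trust.items() if score < 0.1}
print(sorted(spam))
```

Because no trusted page links into the spam cluster, its pages never receive any trust and fall below the threshold.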

Anti-Trust Approach

• Broadly based on the same “approximate isolation” principle.

• This principle also implies that pages pointing to spam pages are very likely to be spam pages themselves.

• Anti-Trust is propagated in the reverse direction along incoming links, starting from a seed set of spam pages.

• A page can be classified as a spam page if it has Anti-Trust Rank value more than a chosen threshold value.
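
The reverse propagation can be sketched by running the same kind of damped iteration on the graph with every link reversed. As before, this is a simplified illustration: the even split over incoming links, the damping factor of 0.85, the rescaling to a maximum of 1, the toy graph, and the threshold of 0.5 are all assumptions.

```python
def anti_trust_rank(links, spam_seeds, damping=0.85, iters=50):
    """Propagate distrust from spam seeds backwards along incoming links.

    `links` maps each page to the list of pages it links to.
    Returns scores rescaled so that the most distrusted page has value 1.
    """
    if not spam_seeds:
        raise ValueError("at least one spam seed is required")
    # Build the reversed graph: incoming[v] lists the pages that link to v.
    incoming = {}
    nodes = set(spam_seeds)
    for u, outs in links.items():
        nodes.add(u)
        for v in outs:
            nodes.add(v)
            incoming.setdefault(v, []).append(u)
    seed_score = {u: (1.0 / len(spam_seeds) if u in spam_seeds else 0.0) for u in nodes}
    distrust = dict(seed_score)
    for _ in range(iters):
        new = {u: (1.0 - damping) * seed_score[u] for u in nodes}
        for v in nodes:
            sources = incoming.get(v, [])
            # Distrust of v is shared among the pages that point to v.
            for u in sources:
                new[u] += damping * distrust[v] / len(sources)
        distrust = new
    top = max(distrust.values())
    return {u: d / top for u, d in distrust.items()}

# Same toy web as a TrustRank-style example: s1 and s2 form a spam cluster.
web = {
    "g1": ["g2", "g3"],
    "g2": ["g3"],
    "g3": ["g1"],
    "s1": ["s2", "g1"],
    "s2": ["s1"],
}
distrust = anti_trust_rank(web, spam_seeds={"s2"})
flagged = {page for page, score in distrust.items() if score > 0.5}
print(sorted(flagged))
```

Only s2 is given as a seed, yet s1 is flagged as well because it links to s2, while the good pages, which never point into the spam cluster, keep a score of zero.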

Conclusion

• Web Spam is a by-product of the search engine era

• Identifying the structure of web spam is the first step to fighting it.

• Due to the inherent characteristics of the Web, it is difficult to eliminate web spam altogether.

• Different web spam detection techniques can be combined to detect spam more effectively.

Thank you

References

[1] Z. Gyongyi and H. Garcia-Molina. Web spam taxonomy. In First International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 2005. http://citeseer.ist.psu.edu/article/gyongyi05web.html

