Learning to remove Internet advertisements

Post on 15-Jan-2016

description

Learning to remove Internet advertisements. Nicholas Kushmerick, Department of Computer Science, University College Dublin, Ireland. Presented by Bo Zhang, Department of Computer Science, Michigan Technological University.

transcript

Dec 6, 2004, Michigan Technological University

Nicholas Kushmerick

Department of Computer Science,

University College Dublin, Ireland

Learning to remove Internet advertisements

Presented by Bo Zhang, Department of Computer Science, Michigan Technological University

Overview

Background

Introduction of ADEATER

Design of ADEATER

Evaluation

Related Work

Conclusion and Future Work

Background

Negative impact of advertisement images on the Internet

- Slow down browsing
- Consume computer resources
- Impose extra costs on users

[Figure: three callouts labeled “Advertisement Image”]

Introduction of ADEATER

Definition:

- A browsing assistant that automatically removes advertisement images from Internet pages.

Property:

- Rules are generated by a learning algorithm

Introduction of ADEATER

Examples

Design of ADEATER

System Architecture

Design of ADEATER

Encoding instance

Fixed-width feature vector

An image enclosed in an anchor tag <A> is a candidate advertisement

Geometric features of an image:
- Height (<IMG height=90>)
- Width (<IMG width=90>)
- Aspect ratio (ratio of width to height)

Local feature:
- Whether the destination URL and the image URL are in the same Internet domain

  Destination URL              Image URL                  Same domain?
  www.ee.mtu.edu/page.html     www.cs.mtu.edu/image.jpg   Yes
  www.dell.com/notebook.html   www.cs.mtu.edu/image.jpg   No
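
The geometric and local features above can be sketched in Python. This is a minimal illustration, not ADEATER's actual code: the function names are hypothetical, and "same domain" is approximated by comparing the last two labels of the host name (so www.ee.mtu.edu and www.cs.mtu.edu both reduce to mtu.edu), an assumption chosen only because it reproduces the two example rows.

```python
from typing import Optional
from urllib.parse import urlparse


def aspect_ratio(width: Optional[int], height: Optional[int]) -> Optional[float]:
    """Ratio of width to height, or None when either attribute is missing."""
    if not width or not height:
        return None
    return width / height


def site_domain(url: str) -> str:
    """Last two labels of the host name (assumption: a crude notion of 'domain')."""
    # urlparse only recognises the host when the URL has a scheme or starts with "//".
    host = urlparse(url if "//" in url else "//" + url).hostname or ""
    return ".".join(host.lower().split(".")[-2:])


def same_domain(dest_url: str, img_url: str) -> bool:
    """Local feature: do the destination and image URLs share a domain?"""
    return site_domain(dest_url) == site_domain(img_url)


print(same_domain("www.ee.mtu.edu/page.html", "www.cs.mtu.edu/image.jpg"))
print(same_domain("www.dell.com/notebook.html", "www.cs.mtu.edu/image.jpg"))
print(aspect_ratio(90, 90))
```

Returning None for a missing height or width keeps "unknown" distinct from a real measurement, which matters because many IMG tags omit these attributes.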

Design of ADEATER

Encoding instance

Fixed-width feature vector

Caption feature:
- Words and phrases occurring in the enclosing <A> tag, keeping phrases of at most K words that occur at least M times
- K is the maximum phrase length
- M is the minimum phrase count

Alt feature:
- Set of “alternate” words and phrases in the <IMG> tag (<IMG alt=“ad”>), keeping phrases of at most K words that occur at least M times
- K is the maximum phrase length
- M is the minimum phrase count

Design of ADEATER

Encoding instance

Fixed-width feature vector

Ubase, Udest, Uimg
- Words and phrases occurring in the base URL, destination URL, and image URL, keeping phrases of at most K words that occur at least M times
- K is the maximum phrase length
- M is the minimum phrase count

Stop list
- Low-information terms (“http”, “www”, “jpg”, etc.)
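
The phrase-based features (caption, alt, Ubase, Udest, Uimg) can be illustrated with a short Python sketch. It is not the original implementation. The helper names are hypothetical, K and M are treated as inclusive limits, counts are taken per occurrence across all texts, and stop words are dropped before phrases are formed; all of these are assumptions. Joining words with "+" follows the notation used in the rule examples (“click+here”).

```python
import re
from collections import Counter
from typing import Iterable, List

# Only the three terms named on the slide; the full stop list is not given.
STOP_WORDS = {"http", "www", "jpg"}


def phrases(text: str, k: int) -> List[str]:
    """All runs of 1..k consecutive words, lowercased and joined with '+'."""
    words = [w for w in re.findall(r"[a-z0-9]+", text.lower())
             if w not in STOP_WORDS]
    return ["+".join(words[i:i + n])
            for n in range(1, k + 1)
            for i in range(len(words) - n + 1)]


def build_vocabulary(texts: Iterable[str], k: int, m: int) -> List[str]:
    """Phrases occurring at least m times across all texts, sorted for stable order."""
    counts = Counter(p for t in texts for p in phrases(t, k))
    return sorted(p for p, c in counts.items() if c >= m)


def encode(text: str, vocabulary: List[str], k: int) -> List[int]:
    """Fixed-width binary vector: one slot per vocabulary phrase."""
    present = set(phrases(text, k))
    return [1 if p in present else 0 for p in vocabulary]


alt_texts = ["Click here to win", "click here now", "Company logo"]
vocab = build_vocabulary(alt_texts, k=2, m=2)
print(vocab)
print(encode("Please click here", vocab, k=2))
```

Because the vocabulary is fixed once from the training texts, every instance maps to a vector of the same width regardless of how many words its own tag contains.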

Design of ADEATER

Encoding instance

Samples of HTML page

Design of ADEATER

Encoding of samples

Design of ADEATER

Encoding of samples (cont)

Design of ADEATER

Gathering examples

AD samples are generated by the ADGRABBER browsing assistant
- Identify candidate advertisements
- Generate the vector encoding

NON-AD samples are generated by a custom-built Internet spider
- Extract images from randomly generated URLs

Design of ADEATER

Learning rules

Algorithm
- C4.5 decision tree learning algorithm

Properties
- Quick on-line execution of the classifier
- Not overly sensitive to missing features or noise
- Scales well and is insensitive to irrelevant features

Examples of rules

- If the aspect ratio > 4.5833, alt doesn’t contain “to” but does contain “click+here”, and Udest doesn’t contain “http+www”, then the instance is an AD

- If Ubase does not contain “messier”, and Udest contains “redirect+cgi”, then the instance is an AD
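
The two example rules can be written as plain predicates over an encoded instance. The sketch below is illustrative only: the `Instance` class and its field names are hypothetical, phrase features are modeled as sets of strings rather than fixed-width vectors, and the slides show just these two rules, not the full learned rule set.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Instance:
    """Hypothetical container for one candidate image's features."""
    aspect_ratio: Optional[float] = None   # None when height or width is missing
    alt: Set[str] = field(default_factory=set)
    ubase: Set[str] = field(default_factory=set)
    udest: Set[str] = field(default_factory=set)


def rule_1(x: Instance) -> bool:
    """Wide image whose alt text says 'click here' but not 'to'."""
    return (x.aspect_ratio is not None and x.aspect_ratio > 4.5833
            and "to" not in x.alt
            and "click+here" in x.alt
            and "http+www" not in x.udest)


def rule_2(x: Instance) -> bool:
    """Destination goes through a redirect script; base URL lacks 'messier'."""
    return "messier" not in x.ubase and "redirect+cgi" in x.udest


def is_ad(x: Instance) -> bool:
    """An instance is an AD if any of the (two example) rules fires."""
    return rule_1(x) or rule_2(x)


banner = Instance(aspect_ratio=7.8, alt={"click", "here", "click+here"})
photo = Instance(aspect_ratio=1.0, alt={"photo"})
print(is_ad(banner), is_ad(photo))
```

Treating a missing aspect ratio as "rule does not fire" is one simple way to handle absent features; how C4.5 treats missing values in the actual system is not described in the slides.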

Design of ADEATER

Removing advertisements

Process

- Fetch HTML pages from the Internet
- Identify candidate advertisements
- Classify instances with the learned rules
- Replace the image’s URL with the URL of an inconspicuous low-bandwidth image

Implementation

- Removal module as a proxy server
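
The rewrite step of this process (find images inside anchors, classify them, swap the image URL) can be sketched with Python's standard `html.parser`. This is an assumption-laden illustration rather than the proxy itself: `PLACEHOLDER_URL`, `AdRewriter`, `remove_ads`, and the `is_ad(href, img_attrs)` callback signature are all hypothetical, and fetching pages or running as a proxy server is left out.

```python
from html import escape
from html.parser import HTMLParser
from typing import Callable, Dict, List, Optional

PLACEHOLDER_URL = "blank.gif"  # hypothetical small replacement image


class AdRewriter(HTMLParser):
    """Re-emits a page, swapping the src of <img> tags classified as ads."""

    def __init__(self, is_ad: Callable[[str, Dict[str, Optional[str]]], bool]):
        super().__init__(convert_charrefs=False)
        self.is_ad = is_ad
        self.out: List[str] = []
        self.anchor_href: Optional[str] = None  # href of the enclosing <a>, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_href = dict(attrs).get("href")
        if (tag == "img" and self.anchor_href is not None
                and self.is_ad(self.anchor_href, dict(attrs))):
            parts = [k if v is None else
                     f'{k}="{escape(PLACEHOLDER_URL if k == "src" else v)}"'
                     for k, v in attrs]
            self.out.append("<" + " ".join(["img"] + parts) + ">")
        else:
            self.out.append(self.get_starttag_text())  # keep original markup

    def handle_startendtag(self, tag, attrs):
        self.handle_starttag(tag, attrs)  # original text already carries "/>"

    def handle_endtag(self, tag):
        if tag == "a":
            self.anchor_href = None
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def remove_ads(html: str, is_ad) -> str:
    rewriter = AdRewriter(is_ad)
    rewriter.feed(html)
    rewriter.close()
    return "".join(rewriter.out)


page = ('<p><a href="http://ads.example.com/redirect.cgi">'
        '<img src="banner.gif" width="468" height="60"></a>'
        '<img src="logo.gif"></p>')
print(remove_ads(page, lambda href, img: "redirect" in href))
```

Only images inside an anchor reach the classifier, matching the definition of a candidate advertisement; the lambda in the usage line stands in for the learned rules.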

Evaluation

Speed and accuracy

Experiment setting

Total samples
- AD: 459 examples
- NON-AD: 2820 examples

10-fold cross-validation
- Training set: 90% of examples
- Test set: 10% of examples

Off-line training phase: 5.8 minutes

On-line classification phase: 70 msec/image

Average accuracy: 97.1%
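
As a rough illustration of the 10-fold setup, the sketch below partitions the 3279 samples (459 AD plus 2820 NON-AD) into ten train/test splits. The splitting scheme, the seed, and the function name are assumptions; the slides do not say how folds were drawn or whether they were stratified by class.

```python
import random
from typing import Iterator, List, Tuple

N_AD, N_NON_AD = 459, 2820      # counts from the slide
N_TOTAL = N_AD + N_NON_AD       # 3279 samples in total


def k_fold_indices(n: int, k: int = 10,
                   seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists; each sample is tested exactly once."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    for fold in range(k):
        test = order[fold::k]                 # every k-th shuffled index
        held_out = set(test)
        train = [i for i in order if i not in held_out]
        yield train, test


sizes = [(len(train), len(test)) for train, test in k_fold_indices(N_TOTAL)]
print(sizes[0], sizes[-1])

# Share of NON-AD samples, i.e. the accuracy of always answering "NON-AD".
majority_share = N_NON_AD / N_TOTAL
print(f"{majority_share:.1%}")
```

Since 3279 is not divisible by 10, nine folds hold out 328 samples and one holds out 327. The NON-AD share (about 86%) is added here only as context for reading the reported 97.1% accuracy.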

Evaluation

Learning curves

Simple methodology
- Does not recalculate the feature set

Realistic methodology
- Recalculates the feature set

Evaluation

Alternative encodings

Related Work

Muffin: Filtering web pages

ImageKill Filter: Hand-crafted rules

ImageKill.minheight

- Only remove images which are at least n pixels high

ImageKill.minwidth

- Only remove images which are at least n pixels wide

ImageKill.ratio

- Remove images which are more than n times as wide as they are high

ImageKill.exclude

- Don't remove images that match the given string/regexp
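
For contrast with learned rules, a hand-crafted filter in the spirit of these four options might look like the Python sketch below. It is not Muffin's code (Muffin is a separate Java program): the function name is hypothetical, and combining the options with a logical AND, with the exclude pattern taking priority, is an assumption about how such thresholds could interact.

```python
import re
from typing import Optional


def should_remove(width: int, height: int, url: str, *,
                  minwidth: int = 0, minheight: int = 0,
                  ratio: Optional[float] = None,
                  exclude: Optional[str] = None) -> bool:
    """Decide removal from fixed, user-chosen thresholds (no learning)."""
    if exclude and re.search(exclude, url):
        return False                      # explicitly protected images
    if width < minwidth or height < minheight:
        return False                      # too small to be worth removing
    if ratio is not None and width <= ratio * height:
        return False                      # not banner-shaped enough
    return True


settings = dict(minwidth=100, minheight=30, ratio=4, exclude=r"logo")
print(should_remove(468, 60, "http://example.com/banner.gif", **settings))
print(should_remove(100, 100, "http://example.com/photo.gif", **settings))
print(should_remove(468, 60, "http://example.com/logo.gif", **settings))
```

Every threshold here has to be chosen and maintained by the user, which is the burden that learning rules from examples is meant to remove.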

Related Work

WebFilter: Filtering web pages

Solution

- User provides a list of URL templates and corresponding filter scripts

Related Work

Junkbuster: Filtering web pages

Solution

- User provides a block file

Related Work

Smokey: Detect abusive messages

Solution

- Learn rules from training samples
- Parse messages and generate feature vectors
- Classify each feature vector with the learned rules

Conclusion and Future Work

Conclusion

High accuracy

Modest resource cost (processing time, training samples)

Future Work

Incremental learning algorithm

More efficient feature selection mechanism

Thank you!