Robust Web Page Segmentation for Mobile Terminal Using Content-Distances and Page Layout Information

Gen Hattori, Keiichiro Hoashi, Kazunori Matsumoto, Fumiaki Sugaya
KDDI R&D Laboratories

WWW 2007

INTRODUCTION

How to browse general Web pages designed for PCs on the mobile phone

Web pages are designed to be browsed on a PC with a rich user interface, so it is difficult for such pages to be displayed on the small display of mobile phones

Web page segmentation method

RELATED WORKS

Web page layout modification

One way to change the layout of a Web page is to adjust the width of the page to fit the small display of the mobile phone

Web page restructuring

Creates a set of “objects,” i.e., a set of highly correlated content elements

Web page zooming


SYSTEM REQUIREMENTS

An effective system should be able to automatically limit the volume of information to be displayed

An effective system should be able to segment all Web pages, including pages with HTML grammar errors

The ratio of Web pages with irregular HTML was 8.5%-27.1%


WEB PAGE SEGMENTATION USING CONTENT-DISTANCE

The Web page segmentation method based on content-distance consists of the following three steps:

Content element extraction
Content-distance calculation
Web page segmentation

Content element extraction

In the proposed method, we define the "content element" of a Web page as one of the following four types of data:

(a) Anchor specified with <A> tag
(b) Image specified with <IMG> tag
(c) JavaScript with <SCRIPT> tag
(d) All text data other than (a)-(c)

Deletes tags: font tag (<FONT>) and layout tag (<TABLE>)

Content-distance calculation

First, the initial value of all tags is set to 0

Next, if the tag is an opening tag, we add 1 to the depth of the tag; if the tag is a closing tag, we subtract 1

x = sequence order of the tags
y = depth of the tags
f_ab(i) indicates the value of y where x = i
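The depth rule above (start at 0, add 1 on an opening tag, subtract 1 on a closing tag) can be illustrated with a minimal sketch. It only produces the (x, y) sequence of tag order and depth; it does not compute the content-distance itself, whose formula is not given on these slides. The class and function names are hypothetical, and the sketch assumes every opening tag has a matching closing tag, so unclosed tags such as a bare <IMG> would need extra handling.

```python
from html.parser import HTMLParser


class TagDepthRecorder(HTMLParser):
    """Records the tag depth after every opening or closing tag."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # initial value is 0
        self.depths = []  # depths[i] is y for sequence order x = i

    def handle_starttag(self, tag, attrs):
        self.depth += 1   # opening tag: add 1
        self.depths.append(self.depth)

    def handle_endtag(self, tag):
        self.depth -= 1   # closing tag: subtract 1
        self.depths.append(self.depth)


def tag_depths(html):
    """Return the list of depths, indexed by tag sequence order."""
    recorder = TagDepthRecorder()
    recorder.feed(html)
    recorder.close()
    return recorder.depths


sample = "<div><p><a href='#'>x</a></p><p>y</p></div>"
print(tag_depths(sample))
```

Indexing the returned list at position i gives the depth y at sequence order x = i.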

Web page segmentation

(Step 1) In the initial state, the entire Web page is considered as one object

(Step 2) If Smax > N1 * Saverage, the system divides this object at the position of Smax

Smax and Saverage = maximum value and average of content-distance in an object

(Step 3) Else, if the minimum number of tags in a group when it is segmented exceeds M, and Smax > N2 * Saverage, the system divides this object at the position of Smax

(Step 4) If the object is divided in (Step 2) or (Step 3), ObjectID moves to the left object of its division result and the process returns to (Step 2). Otherwise, the process advances to (Step 5).

(Step 5) When the object processed in (Step 2) or (Step 3) is a left object, ObjectID moves to the right object and the process returns to (Step 2). Otherwise, the process advances to (Step 6).

(Step 6) When the object processed in (Step 2) or (Step 3) is a right object and ObjectID != root, ObjectID moves to the parent object
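The splitting procedure can be sketched as a recursive function, which visits the left result first, then the right, and then returns to the parent, in the same order as the ObjectID traversal. This is only an illustration under stated assumptions: the function name, the default values of N1, N2 and M, and the use of content elements in place of tags when measuring group size are all hypothetical, and the content-distances are taken as given inputs.

```python
from statistics import mean


def segment(distances, n1=3.0, n2=2.0, m=2, lo=0, hi=None):
    """Split content elements lo..hi into objects.

    distances[i] is the content-distance between element i and i + 1.
    Returns a list of (first, last) element index pairs, both inclusive.
    The defaults for n1, n2 and m are illustrative, not from the source.
    """
    if hi is None:
        hi = len(distances)
    window = distances[lo:hi]       # distances inside the current object
    if not window:
        return [(lo, hi)]           # a single element cannot be divided
    s_max = max(window)
    s_avg = mean(window)
    cut = lo + window.index(s_max)  # divide between element cut and cut + 1
    smaller = min(cut - lo + 1, hi - cut)  # size of the smaller group
    split = s_max > n1 * s_avg or (smaller > m and s_max > n2 * s_avg)
    if not split:
        return [(lo, hi)]
    # Process the left object first, then the right object.
    return (segment(distances, n1, n2, m, lo, cut)
            + segment(distances, n1, n2, m, cut + 1, hi))


print(segment([1, 1, 10, 1, 1]))
```

Because a split is only accepted when Smax strictly exceeds a multiple of Saverage, an object whose distances are all equal is never divided, so the recursion terminates.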

Segmentation-Threshold Estimation with Standard Deviation of Content Distances

Nt1 and Nt2 denote the optimal values of N1 and N2 for the targeted Web page

Nb1 and Nb2 denote empirically selected values of N1 and N2 for the standard page

Problems

The content-distance calculated from the order of the HTML and the intuitive distance between content elements of the Web page do not necessarily correspond

Most news sites and portal sites adopt layouts which consist of separate components. Naturally, the tag structures of each component are different from each other. However, the previous method cannot adapt to those local features.

HYBRID SEGMENTATION

EXPERIMENTS

100 Web sites in the United States and Japan were selected for our evaluation, based on information from the Alexa Web site


Usability Evaluation

We calculated the estimated time necessary to display the required information on the user's terminal

First, we selected five Web sites which contain abundant information

Next, we set imaginary “target” content elements for each Web site

Time necessary to click the down button = 0.2 sec; hyperlink tracing operation = 5 sec
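As a rough illustration of how the two per-operation costs could combine, the sketch below assumes the estimated time is simply the sum of the operation costs. The slides do not state the exact model, so the function name, the linear sum, and the operation counts in the example are all hypothetical; only the 0.2 sec and 5 sec figures come from the text.

```python
DOWN_CLICK_SEC = 0.2  # time to click the down button once (from the slides)
LINK_TRACE_SEC = 5.0  # time for one hyperlink tracing operation (from the slides)


def estimated_time(down_clicks, link_traces):
    """Estimated seconds to reach a target, assuming the costs simply add up."""
    return down_clicks * DOWN_CLICK_SEC + link_traces * LINK_TRACE_SEC


# Hypothetical example: 10 down-button clicks and 2 hyperlink traces.
print(estimated_time(down_clicks=10, link_traces=2))
```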

Results

Google Wireless Transcoder

Performance Evaluation

(a) analyzing request message
(b) parsing HTML
(c) rescaling images
(d) extracting tag depth
(e) segmentation
(f) rebuilding XHTML

We prepared a server with a 3.4 GHz CPU and 2 GB of memory

