Robust Web Page Segmentation for Mobile Terminal Using Content-Distances and Page Layout Information
Gen Hattori, Keiichiro Hoashi, Kazunori Matsumoto, Fumiaki Sugaya
KDDI R&D Laboratories
WWW 2007
INTRODUCTION
Problem: how to browse general Web pages designed for PCs on a mobile phone.
Web pages are designed to be browsed on a PC with a rich user interface, so it is difficult to display such pages on the small display of a mobile phone.
Approach: a Web page segmentation method.
RELATED WORKS
Web page layout modification: changes the layout of a Web page by adjusting the width of the page to fit the small display of the mobile phone.
Web page restructuring: creates a set of “objects,” i.e., sets of highly correlated content elements.
Web page zooming.
SYSTEM REQUIREMENTS
An effective system should be able to automatically limit the volume of information to be displayed.
An effective system should be able to segment all Web pages, including pages with grammatical HTML errors. The ratio of Web pages with irregular HTML was 8.5%-27.1%.
WEB PAGE SEGMENTATION USING CONTENT-DISTANCE
The Web page segmentation method based on content-distance consists of the following three steps:
1. Content element extraction
2. Content-distance calculation
3. Web page segmentation
Content element extraction
In the proposed method, we define the "content element" of a Web page as one of the following four types of data:
(a) Anchor specified with <A> tag
(b) Image specified with <IMG> tag
(c) JavaScript specified with <SCRIPT> tag
(d) All text data other than (a)-(c)
The method deletes font tags (<FONT>) and layout tags (<TABLE>).
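As an illustration of the four content element types, the following is a minimal sketch built on Python's standard `html.parser`. The class name `ContentElementExtractor`, the helper `extract_elements`, and the type labels are hypothetical names chosen for this example. Treating the text inside an anchor as part of that anchor is an assumption, not something the slides state.

```python
from html.parser import HTMLParser


class ContentElementExtractor(HTMLParser):
    """Collects (type, value) pairs for anchors, images, scripts and text."""

    def __init__(self):
        super().__init__()
        self.elements = []
        self._skip_text = 0  # > 0 while inside <a> or <script>

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a":
            self.elements.append(("anchor", attributes.get("href", "")))
            self._skip_text += 1
        elif tag == "img":
            self.elements.append(("image", attributes.get("src", "")))
        elif tag == "script":
            self.elements.append(("script", attributes.get("src", "")))
            self._skip_text += 1

    def handle_endtag(self, tag):
        if tag in ("a", "script") and self._skip_text > 0:
            self._skip_text -= 1

    def handle_data(self, data):
        # Assumption: anchor text and script bodies belong to types (a) and (c),
        # so only the remaining text becomes a type (d) element.
        text = data.strip()
        if text and self._skip_text == 0:
            self.elements.append(("text", text))


def extract_elements(html):
    parser = ContentElementExtractor()
    parser.feed(html)
    parser.close()
    return parser.elements


page = '<p>Hello</p><a href="/x">link</a><img src="a.png"><script>var x=1;</script>'
print(extract_elements(page))
```

Other tags, including <FONT> and <TABLE>, produce no element here, which loosely mirrors the tag deletion step.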
Content-distance calculation
First, the initial value of the depth is set to 0.
Next, if the tag is an opening tag, we add 1 to the depth of the tag; if the tag is a closing tag, we subtract 1.
x = sequence order of the tags
y = depth of the tags
fab(i) indicates the value of y where x = i.
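The depth bookkeeping described above can be sketched as follows. This covers only the (x, y) sequence of tag order and tag depth, not the content-distance formula itself, which is not reproduced in these notes. The names `DepthTracker`, `tag_depths`, and `VOID_TAGS` are hypothetical. Recording void tags such as <IMG> at the current depth without changing it is an assumption made so that tags with no closing counterpart do not inflate the depth.

```python
from html.parser import HTMLParser

# Assumption: tags without a closing counterpart leave the depth unchanged.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class DepthTracker(HTMLParser):
    """Records (x, y) points: x = sequence order of a tag, y = depth after it."""

    def __init__(self):
        super().__init__()
        self.depth = 0       # initial value is 0
        self.points = []

    def _record(self):
        self.points.append((len(self.points), self.depth))

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.depth += 1  # opening tag: add 1
        self._record()

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return           # already recorded by handle_starttag
        self.depth -= 1      # closing tag: subtract 1
        self._record()


def tag_depths(html):
    tracker = DepthTracker()
    tracker.feed(html)
    tracker.close()
    return tracker.points


print(tag_depths("<div><p>a</p><img src='x'><p>b</p></div>"))
```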
Web page segmentation
(Step 1) In the initial state, the entire Web page is considered as one object.
(Step 2) If Smax > N1 * Saverage, the system divides this object at the position of Smax. Smax and Saverage are the maximum value and the average of the content-distances in an object.
(Step 3) Else, if the minimum number of tags in a group when it is segmented exceeds M, and Smax > N2 * Saverage, the system divides this object at the position of Smax.
(Step 4) If the object is divided in (Step 2) or (Step 3), ObjectID moves to the left object of its division result and the process returns to (Step 2). Otherwise, the process advances to (Step 5).
(Step 5) When the object processed in (Step 2) or (Step 3) is a left object, ObjectID moves to the right object and the process returns to (Step 2). Otherwise, the process advances to (Step 6).
(Step 6) When the object processed in (Step 2) or (Step 3) is a right object and ObjectID != root, ObjectID moves to the parent object.
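The splitting rule can be sketched as a recursive function. This is a simplified illustration, not the authors' implementation. It visits the left part before the right part, which corresponds to the left/right/parent movement of ObjectID. The function name `segment` and the default values of `n1`, `n2`, and `m` are hypothetical. The input is assumed to be a list of content-distances between consecutive content elements, and the number of elements in a group stands in for the number of tags.

```python
def segment(distances, n1=3.0, n2=2.0, m=2):
    """Split elements 0..len(distances) into objects, returned as (start, end) ranges.

    distances[i] is the content-distance between element i and element i + 1.
    n1, n2 and m are placeholder thresholds chosen for illustration only.
    """
    objects = []

    def split(lo, hi):
        gaps = distances[lo:hi - 1]
        if not gaps:
            objects.append((lo, hi))
            return
        s_max = max(gaps)
        s_avg = sum(gaps) / len(gaps)
        cut = lo + gaps.index(s_max) + 1          # position of Smax
        smallest = min(cut - lo, hi - cut)        # size of the smaller group
        strong = s_max > n1 * s_avg               # condition of Step 2
        weak = smallest > m and s_max > n2 * s_avg  # condition of Step 3
        if strong or weak:
            split(lo, cut)   # left object first
            split(cut, hi)   # then right object
        else:
            objects.append((lo, hi))

    split(0, len(distances) + 1)
    return objects


print(segment([1, 1, 1, 10, 1, 1, 1]))
```

With these placeholder thresholds, the single large distance in the middle separates the eight elements into two objects of four elements each.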
Segmentation-Threshold Estimation with Standard Deviation of Content Distances
Nt1 and Nt2 denote the optimal values of N1 and N2 for the targeted Web page.
Nb1 and Nb2 denote the empirically selected values of N1 and N2 for the standard page.
Problems
The content-distance calculated from the order of the HTML does not necessarily correspond to the intuitive distance between content elements of the Web page.
Most news sites and portal sites adopt layouts that consist of separate components. Naturally, the tag structures of the components differ from each other. However, the previous method cannot adapt to these partial features.
EXPERIMENTS
100 Web sites in the United States and Japan were selected for our evaluation, based on information from the Alexa Web site.
Usability Evaluation
We calculated the estimated time necessary to display the required information on users' terminals.
First, we selected five Web sites which contain abundant information.
Next, we set imaginary “target” content elements for each Web site.
Time necessary to click the down button = 0.2 sec; time for the hyperlink tracing operation = 5 sec.
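A small worked example of the time estimate, assuming the estimated time is simply the number of operations multiplied by their unit costs and summed. The slides give only the two unit costs, so this additive model and the function name `estimated_time` are assumptions.

```python
DOWN_BUTTON_SEC = 0.2   # time to click the down button (from the slides)
LINK_TRACE_SEC = 5.0    # time for one hyperlink tracing operation (from the slides)


def estimated_time(down_clicks, link_traces):
    """Estimated seconds to reach a target, under an assumed additive cost model."""
    return down_clicks * DOWN_BUTTON_SEC + link_traces * LINK_TRACE_SEC


# Example: 10 presses of the down button and 2 hyperlink traces.
print(estimated_time(10, 2))
```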