7/25/2019 VisHue: Web Page Segmentation for Improved Query Interface for MedlinePlus Medical Encyclopaedia
1/40
to Advance Knowledge for Humanity
VisHue: Web Page Segmentation for an Improved Query Interface for MedlinePlus Medical Encyclopedia
Aastha Madaan, Wanming Chu, Subhash Bhalla
University of Aizu
12/10/2011
Outline
1. Introduction
2. Background
a) Hierarchical structure
b) Page-Level Segmentation
3. Web Page Segmentation Algorithms
a) Features
b) Main focus
c) Comparison
4. The Proposal: The VisHue Algorithm
5. Query by Segment
6. Performance Analysis
7. Discussions
8. Summary and Conclusions
1. Introduction
WWW: a common and the largest source of information
Deep querying: gaining importance
Understanding web page semantics: improves the user's search experience
Within a web page: identify semantic groups
Important: discovering these semantic blocks
1(i). The Statement [1]
A. Large variety of HTML pages: suitable for query and search?
B. Basic requirements: searching and querying
Simple querying and searching versus semantic querying and searching
C. Significant: recognize the semantic and coherent segments
Page level to segment level
D. Case example: medical encyclopedia
MedlinePlus: among the various choices of medical encyclopedias
1(i). The Statement [2]
[Figure: UML class diagram]
2. Background: MedlinePlus Web Page
Page content: i. Relevant content ii. Irrelevant content
a. Relevant content:
i. Topic headings ii. Topic-wise contents
b. Irrelevant content: navigation bars, header, footer, advertisements
Headings: identify the hierarchical structure
Distinct blocks: what a user's perception identifies
Main focus: skilled and semi-skilled users
Assumption: headings serve as query attributes
2(a). Hierarchical Structure
1. Hierarchical structure: the logical structure within the page (document)
2. Indicates the binary relationships (belongingness) between a pair of segments
3. Accurate hierarchical representation: user-level query attributes (in segments)
4. Proposed hierarchical structure: based on domain knowledge (skilled and semi-skilled users)
Captures the user's perception
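A minimal sketch of the "binary relationship (belongingness)" idea above: each segment is a node, and every (child, parent) pair is one relationship. The `Node` class, the `belongs_to` helper, and the page labels are hypothetical names chosen for illustration; they are not taken from the VisHue implementation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """One segment heading in the page hierarchy (hypothetical type)."""
    label: str
    children: List["Node"] = field(default_factory=list)

    def add(self, label: str) -> "Node":
        """Attach a child segment and return it so it can be extended."""
        child = Node(label)
        self.children.append(child)
        return child


def belongs_to(root: Node) -> List[Tuple[str, str]]:
    """List every (child, parent) pair in depth-first order."""
    pairs: List[Tuple[str, str]] = []
    for child in root.children:
        pairs.append((child.label, root.label))
        pairs.extend(belongs_to(child))
    return pairs


# Illustrative page; the labels are assumptions, not a real MedlinePlus page.
page = Node("Hypertension")
page.add("Causes")
symptoms = page.add("Symptoms")
symptoms.add("Emergency symptoms")
relations = belongs_to(page)
```

Each pair in `relations` states that the first segment belongs to the second, which is the information a query interface needs in order to present headings as nested attributes.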
2(a).(i). Segmentation Semantic Query
[Figure labels: User; Semantic query and search (in future); Common Web User]
2(b). Page-Level Segmentation
Definition
A segment is a self-contained logical region within a Web page that is:
(i) not nested within any other segment;
(ii) represented by a pair (l, c),
where l is the label of the segment and
c is the portion of text of the segment [1].
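To make the (l, c) pair concrete, here is a small sketch that cuts a page into flat, non-nested segments at each `<h2>` heading. This is a heading-based stand-in chosen for brevity, not the VisHue algorithm (which relies on heuristics and visual features); the `Segment` and `HeadingSegmenter` names and the sample HTML are assumptions.

```python
from html.parser import HTMLParser
from typing import List, NamedTuple, Optional


class Segment(NamedTuple):
    label: str    # l: the heading text
    content: str  # c: the text that follows the heading


class HeadingSegmenter(HTMLParser):
    """Flat (non-nested) segmentation: each <h2> starts a new segment."""

    def __init__(self) -> None:
        super().__init__()
        self.segments: List[Segment] = []
        self._in_heading = False
        self._label: Optional[str] = None
        self._text: List[str] = []

    def _flush(self) -> None:
        # Emit the segment collected so far, if a heading has been seen.
        if self._label is not None:
            self.segments.append(Segment(self._label, " ".join(self._text)))
        self._label, self._text = None, []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._flush()
            self._in_heading = True
            self._label = ""

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_heading = False

    def handle_data(self, data):
        data = data.strip()
        if not data:
            return
        if self._in_heading:
            self._label += data
        elif self._label is not None:
            # Text before the first heading (e.g. navigation) is ignored.
            self._text.append(data)

    def close(self):
        super().close()
        self._flush()


html = (
    "<div>Home | About</div>"
    "<h2>Causes</h2><p>High salt intake.</p>"
    "<h2>Symptoms</h2><p>Headache.</p><p>Dizziness.</p>"
)
parser = HeadingSegmenter()
parser.feed(html)
parser.close()
segments = parser.segments
```

Text that appears before the first heading is dropped, which loosely mirrors treating navigation bars and headers as irrelevant content.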
3. Segmentation Algorithms
i. History: segmentation traces back to the year 2001 (continues till 2011)
ii. Various application domains
iii. Various techniques for segmenting
iv. Various terminologies used
v. Proposed: MedlinePlus items of the user's focus serve as query attributes
3(a). Features of a Segmentation Algorithm
A. Match and identify a user's points of focus
B. Discover informative segments
i. Better search and query
ii. Segments become query-able attributes
iii. Skilled users aim to query the informative areas (only)
C. Generate a true hierarchical structure
D. Segmentation process: low space and time complexity
3(b). Main Focus
Find an algorithm best suited to:
1. Generate the hierarchical structure
2. Convert segments to attributes in a database
3. Facilitate in-depth querying
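One way to picture "segments as attributes in a database" is sketched below, assuming each segmented page is available as a mapping from heading label to text. The table name `encyclopedia`, the sample pages, and the one-column-per-label layout are illustrative assumptions, not the schema used by the authors.

```python
import sqlite3

# Hypothetical segmented pages: one dict per page, heading label -> text.
pages = [
    {"Title": "Hypertension", "Causes": "High salt intake", "Symptoms": "Headache"},
    {"Title": "Food poisoning", "Causes": "Contaminated fish", "Symptoms": "Nausea"},
]

# Collect labels in first-seen order; each label becomes one column.
columns = []
for page in pages:
    for label in page:
        if label not in columns:
            columns.append(label)

conn = sqlite3.connect(":memory:")
col_defs = ", ".join(f'"{c}" TEXT' for c in columns)
conn.execute(f"CREATE TABLE encyclopedia ({col_defs})")

# Insert one row per page; a missing segment is stored as NULL.
placeholders = ", ".join("?" for _ in columns)
for page in pages:
    conn.execute(
        f"INSERT INTO encyclopedia VALUES ({placeholders})",
        [page.get(c) for c in columns],
    )

# In-depth query: restrict the search to a single attribute.
rows = conn.execute(
    'SELECT "Title" FROM encyclopedia WHERE "Causes" LIKE ?', ("%fish%",)
).fetchall()
```

Once labels are columns, a query can target one attribute (here `Causes`) instead of matching keywords against the whole page.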
3(b).(i). Segmentation Methods: Web Technologies
3(b).(ii). Classification of Algorithms
3(b).(iii). Timeline: Techniques
Technique                        Algorithm          Year
Template Detection               [9], [6]           2002, 2007
DOM-Node Recognition             [8], [11], [10]    2001, 2002, 2006
Visual-DOM based Rendering       [2]                2003
Visual-Heuristics based Method   Proposed           -
Graph-theoretic Method           [3]                2008
Linguistics based Method         [7]                2008
Image of the Web Page            [4], [5]           2010, 2009
Site-Oriented Method             [1]                2011
3(c). Comparison
3(c).(i). Main Focus
3(c).(ii). Comparison: Vision-based Methods
3(c).(iii). Content Structure by VisHue
4. The Proposal: VisHue Algorithm
4.(i). Query Interfaces
Querying vs. Searching
Searching: Recent Trends
1. Object based search
2. Block based search
3. Entity based search
Querying: Recent Trends
Very few efforts have been made
5. Query by Segment
Query by Segment as Query by Tag (Heading): QBT
Based on Content Structure (VisHue algorithm): Query by Attributes
MedlinePlus medical encyclopedia: 3886 web pages
Target: focused and explicit querying
i. Beneficial to skilled and semi-skilled users
ii. Medical encyclopedia: the result of years of effort by experts
5.(i). The QBT Interface
[Figure: Traditional search on MedlinePlus medical encyclopedia vs. QBT interface. Labels: Title, Causes, Symptoms, Post-Care, DB]
5.(ii). QBT Interface: Hierarchical Structure
Labels: query attributes
QBT interface: search and query
Child nodes: search attributes
Left siblings limit the scope of search of right siblings in the interface
Segments: attributes for deep query over all pages of MedlinePlus
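The rule that left siblings limit the scope of right siblings can be read as cascading selection: once a value is fixed for an attribute on the left, the choices offered for attributes to its right shrink accordingly. The sketch below follows that reading, which is an assumption about the interface; `options_for`, the attribute order, and the sample rows are hypothetical.

```python
from typing import Dict, List


def options_for(
    pages: List[Dict[str, str]],
    order: List[str],
    chosen: Dict[str, str],
    attribute: str,
) -> List[str]:
    """Values available for `attribute` once attributes to its left are fixed."""
    left = order[: order.index(attribute)]
    scoped = [
        p for p in pages
        if all(p.get(a) == chosen[a] for a in left if a in chosen)
    ]
    return sorted({p[attribute] for p in scoped if attribute in p})


# Illustrative rows; not taken from the MedlinePlus data.
pages = [
    {"Title": "Hypertension", "Section": "Symptoms"},
    {"Title": "Hypertension", "Section": "Causes"},
    {"Title": "Fever", "Section": "Home Care"},
]
order = ["Title", "Section"]

all_sections = options_for(pages, order, {}, "Section")
sections = options_for(pages, order, {"Title": "Hypertension"}, "Section")
```

With no left selection every section is offered; after choosing a title, only the sections of that page remain.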
6. Performance Analysis
i. Qualitative comparison with traditional keyword search
ii. Query formulation and interpretation
iii. Quantitative performance analysis of the interface
6.(i). QBT vs. Keyword Search
6.(ii). Query Formulation: A Comparison
6.(iii). Query Example
Query 1: Cases where the patient has hypertension but not high blood pressure
QBT query:
Symptoms: Hypertension
Symptoms: NOT High Blood Pressure
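A rough sketch of how such a query could be evaluated over segmented pages, assuming all conditions are combined with AND and matched as case-insensitive substrings within the named attribute. The `matches` function, the `Condition` tuple layout, and the sample pages are illustrative assumptions rather than the QBT implementation.

```python
from typing import Dict, List, Tuple

# A condition is (attribute, term, negated); all conditions are ANDed.
Condition = Tuple[str, str, bool]


def matches(page: Dict[str, str], conditions: List[Condition]) -> bool:
    """Return True when the page satisfies every condition."""
    for attribute, term, negated in conditions:
        found = term.lower() in page.get(attribute, "").lower()
        # A plain condition needs the term present; a NOT condition needs it absent.
        if found == negated:
            return False
    return True


# Hypothetical segmented pages, keyed by heading label.
pages = [
    {"Title": "Page A", "Symptoms": "Hypertension, headache"},
    {"Title": "Page B", "Symptoms": "Hypertension (high blood pressure), dizziness"},
    {"Title": "Page C", "Symptoms": "Fever"},
]

query = [
    ("Symptoms", "Hypertension", False),
    ("Symptoms", "High Blood Pressure", True),
]
hits = [p["Title"] for p in pages if matches(p, query)]
```

Because the match is confined to the `Symptoms` segment, a page that mentions the term only under another heading would not be returned.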
6.(iv). Query Attributes
6.(v). Query Results
6.(vi). Quantitative Performance Analysis
QBT Query
Symptom: Hypertension
Symptom: NOT High Blood Pressure
Before Procedure: Stop
After Procedure: Normal
Cause: High Blood Pressure
Symptom: Heart Attack
Food Source: Fish
Side Effect: Poisoning
7. Discussions
Content fragments, as perceived by skilled and semi-skilled domain users, are determined by the web page segmentation process
Proposed effort: formulating a generic algorithm based on heuristic design rules and visual features
The QBT interface: query over user-identified segments (attributes)
Aim: convert MedlinePlus pages into a DB
Contention: a web page is a good source for an easy-to-use new query language interface for segments
8. Summary and Conclusions
A. Heuristics + visual features based segmentation: a turning point
i. Provides an independent solution
ii. Improves query interfaces for the chosen domain
B. The medical domain: need to make the information accessible to the end users
C. Query by Segment or Tag (QBT): an attempt
i. Aim: return query-able attributes to the users
References
1. A Site Oriented Method for Segmenting Web Pages, David Fernandes, Edleno S. de Moura, Altigran S. da Silva, Berthier Ribeiro-Neto, Edisson Braga, SIGIR'11, July 24-28, 2011.
2. Extracting Content Structure for Web Pages based on Visual Representation, Deng Cai, Shipeng Yu, Ji-Rong Wen and Wei-Ying Ma, Web Technologies and Applications: 5th Asia-Pacific Web Conference, APWeb 2003, Xi'an, China, April 23-25, 2003. Proceedings (2003), pp. 596-596.
3. A Graph-Theoretic Approach to Webpage Segmentation, Deepayan Chakrabarti, Ravi Kumar, Kunal Punera, WWW 2008 / Refereed Track: Search - Corpus Characterization & Search Performance, Beijing, China.
4. A segmentation method for web page analysis using shrinking and dividing, Jiuxin Cao, Bo Mao and Junzhou Luo, International Journal of Parallel, Emergent and Distributed Systems, 25:2, 93-104, 2010.
5. Web Page Layout via Visual Segmentation, Ayelet Pnueli, Ruth Bergman, Sagi Schein, Omer Barkol, HP Laboratories, 2009.
6. Page-level template detection via isotonic smoothing, D. Chakrabarti, R. Kumar, and K. Punera, In 16th WWW, pages 61-70, 2007.
7. A Densitometric Approach to Web Page Segmentation, Christian Kohlschütter, Wolfgang Nejdl, CIKM'08, October 26-30, 2008.
8. HTML Page Analysis Based on Visual Cues, Yudong Yang and HongJiang Zhang, IEEE, 2001.
9. Template Detection via Data Mining and its Applications, Ziv Bar-Yossef, Sridhar Rajagopalan, In Proceedings of WWW'02, May 7-11, 2002, Honolulu, Hawaii, USA.
10. DeSeA: A Page Segmentation based Algorithm for Information Extraction, He Juan, Gao Zhiqiang, Xu Hui, Qu Yuzhong, Proceedings of the First International Conference on Semantics, Knowledge, and Grid (SKG 2005).
11. Reverse Engineering for Web Data: From Visual to Semantic Structures, Christina Yip Chung, Michael Gertz, Neel Sundaresan, In Proceedings of the 18th International Conference on Data Engineering (ICDE'02).
Thank you
Questions