to Advance Knowledge for Humanity
Aastha Madaan, Wanming Chu, Subhash Bhalla
University of Aizu
VisHue: Web Page Segmentation for an Improved Query Interface for MedlinePlus Medical Encyclopedia
11/12/16
Outline
1. Introduction
2. Background
a) Hierarchical structure
b) Page-Level Segmentation
3. Web Page segmentation Algorithms
a) Features
b) Main focus
c) Comparison
4. The Proposal: The VisHue Algorithm
5. Query by Segment
6. Performance Analysis
7. Discussions
8. Summary and Conclusions
1. Introduction
The WWW is the largest and most commonly used source of information
Deep querying → Gaining importance
Understanding web page semantics → Improves the user’s search experience
Within a web page → Identify semantic groups
Important → Discovering these semantic blocks
1(i). The Statement [1]
A. Large variety of HTML pages → suitable query and search?
B. Basic requirements → searching and querying
   Simple querying and searching → semantic querying and searching
C. Significant → Recognize the semantic and coherent segments
   Page level → Segment level
D. Case example → Medical encyclopedia
   MedlinePlus → various choices of medical encyclopedias
1(i). The Statement [2]
[Figure: UML class diagram]
2. Background: MedlinePlus
Web page:
  i. Relevant content
  ii. Irrelevant content
a. Relevant content:
  i. Topic headings
  ii. Topic-wise contents
b. Irrelevant content: Navigation bars, header, footer, advertisements
Headings → Identify hierarchical structure
Distinct blocks → What a user’s perception identifies
Main focus → Skilled and semi-skilled users
  i. Assumption → Headings → Query attributes
2(a). Hierarchical Structure
1. Hierarchical structure → logical structure within the page (document)
2. Indicates the binary relationships (belongingness) → between a pair of segments
3. Accurate hierarchical representation → User-level query attributes (in segments)
4. Proposed hierarchical structure → based on domain knowledge (skilled and semi-skilled users); captures the users’ perception
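The belongingness relation above can be illustrated with a small sketch. It is not the VisHue algorithm: it assumes that headings have already been extracted as (level, label) pairs, and the sample labels are hypothetical. A stack of open ancestors nests each heading under the nearest shallower one, and the resulting tree is read back as (child, parent) pairs.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    label: str
    level: int
    children: List["Node"] = field(default_factory=list)


def build_hierarchy(headings):
    """Nest (level, label) headings into a tree using a stack of open ancestors."""
    root = Node(label="page", level=0)  # artificial root, assumed for the sketch
    stack = [root]
    for level, label in headings:
        # Close every open heading at the same or a deeper level.
        while stack[-1].level >= level:
            stack.pop()
        node = Node(label=label, level=level)
        stack[-1].children.append(node)
        stack.append(node)
    return root


def belongs_to(node):
    """Yield (child, parent) label pairs, i.e. the binary belongingness relation."""
    for child in node.children:
        yield (child.label, node.label)
        yield from belongs_to(child)


# Hypothetical headings of one encyclopedia page (levels start at 1).
headings = [
    (1, "Hypertension"),
    (2, "Causes"),
    (2, "Symptoms"),
    (3, "Emergency symptoms"),
    (2, "Treatment"),
]
tree = build_hierarchy(headings)
pairs = list(belongs_to(tree))
```

Each pair in `pairs` links one segment to the segment it belongs to, which is the information a hierarchical query interface would need.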
2(a).(i). Segmentation ↔ Semantic Query
[Figure: Common Web User; User → Semantic query and search (in future)]
2(b). Page-Level Segmentation
Definition (segment)
A self-contained logical region within a Web page that is:
(i) not nested within any other segment;
(ii) represented by a pair (l, c), where
l → label of the segment
c → portion of text of the segment [1].
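As a minimal sketch of this definition, a segment can be held as a named pair of label and text, and a page as a flat list of such pairs, so no segment is nested in another. The names `Segment` and `page_to_segments` and the sample text are assumptions made for illustration; they do not come from [1].

```python
from typing import NamedTuple


class Segment(NamedTuple):
    label: str    # l: label of the segment
    content: str  # c: portion of text of the segment


def page_to_segments(blocks):
    """Turn a flat list of (label, text) blocks into page-level segments.

    The list is flat on purpose: a page-level segment is never nested in another.
    Whitespace is normalized so that segments can be compared and stored.
    """
    return [Segment(label.strip(), " ".join(text.split())) for label, text in blocks]


# Hypothetical blocks taken from one page.
blocks = [
    (" Causes ", "High salt   intake\nand stress."),
    ("Symptoms", "Headache."),
]
segments = page_to_segments(blocks)
```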
3. Segmentation Algorithms
i. History → work on “segmentation” dates back to 2001 and continues through 2011
ii. Various application domains
iii. Various techniques for segmenting
iv. Various terminologies used
v. Proposed → MedlinePlus → items of user’s focus → Query attributes
3(a). Features of Segmentation Algorithm
A. Match and identify → a user’s points of focus
B. Discover informative segments →
  i. Better search and query
  ii. Segments become query-able attributes
  iii. Skilled users aim to query the informative areas (only)
C. Generate → True hierarchical structure
D. Segmentation process → Low space and time complexity
3(b). Main Focus
Find an algorithm best suited to:
1. Generate a hierarchical structure
2. Convert segments to attributes in a database
3. Facilitate in-depth querying
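The second goal, converting segments to database attributes, might look like the sketch below. It assumes each page has already been segmented into a mapping from segment label to text; the table name `encyclopedia`, the sample pages and the use of SQLite are illustrative choices, not details taken from the slides.

```python
import sqlite3

# Hypothetical segmented pages: segment label -> segment text.
pages = [
    {"Title": "Page A", "Causes": "high blood pressure", "Symptoms": "headache"},
    {"Title": "Page B", "Causes": "infection", "Symptoms": "fever"},
]

# Every distinct segment label becomes one column (attribute) of the table.
columns = sorted({label for page in pages for label in page})

conn = sqlite3.connect(":memory:")
column_defs = ", ".join(f'"{name}" TEXT' for name in columns)
conn.execute(f"CREATE TABLE encyclopedia ({column_defs})")

placeholders = ", ".join("?" for _ in columns)
for page in pages:
    # A page without a given segment stores NULL for that attribute.
    values = [page.get(name) for name in columns]
    conn.execute(f"INSERT INTO encyclopedia VALUES ({placeholders})", values)

# In-depth querying: restrict the search to a single attribute.
rows = conn.execute(
    'SELECT "Title" FROM encyclopedia WHERE "Symptoms" LIKE ?', ("%fever%",)
).fetchall()
```

Here the query matches only text inside the Symptoms segment, which a page-level keyword search cannot express.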
3(b).(i). Segmentation Methods ↔ Web Technologies
3(b).(ii). Classification of Algorithms
3(b).(iii). Timeline ↔ Techniques
Technique | Algorithm | Year
Template Detection | [9], [6] | 2002, 2007
DOM-Node Recognition | [8], [11], [10] | 2001, 2002, 2006
Visual-DOM based Rendering | [2] | 2003
Visual-Heuristics based Method | Proposed | -
Graph-theoretic Method | [3] | 2008
Linguistics based Method | [7] | 2008
Image of the Web Page | [4], [5] | 2010, 2009
Site-Oriented Method | [1] | 2011
3(c). Comparison
3(c).(i). Main Focus
3(c).(ii). Comparison: Vision-based Methods
3(c).(iii). Content Structure by VisHue
4. The Proposal: VisHue Algorithm
4(i). Query Interfaces
Querying vs. Searching
Searching: Recent Trends
1. Object-based search
2. Block-based search
3. Entity-based search
Querying: Recent Trends
Very few efforts have been made
5. Query by Segment
Query by Segment as Query by Tag (Heading) → QBT
Based on → Content structure (VisHue algorithm): Query by Attributes
MedlinePlus medical encyclopedia → 3886 web pages
Target → Focused and explicit querying
i. Beneficial → skilled and semi-skilled users
ii. Medical encyclopedia → result of → years of effort by experts
5(i). The QBT Interface
[Figure: Traditional search on MedlinePlus medical encyclopedia; QBT interface with attributes Title, Causes, Symptoms, Post-Care, … and a DB]
5(ii). QBT Interface ↔ Hierarchical Structure
Labels → Query Attributes
QBT interface: Search and Query
Child nodes → search attributes
Left siblings → limit the scope of search of right siblings in the interface
Segments → Attributes for Deep Query over all pages of MedlinePlus
6. Performance Analysis
i. Qualitative comparison with traditional keyword search
ii. Query formulation and interpretation
iii. Quantitative performance analysis of the interface
6(i). QBT vs. Keyword Search
6(ii). Query Formulation: A Comparison
6(iii). Query Example
Query 1: Cases where patient has hypertension but not high blood pressure
QBT query:
Symptoms: “Hypertension”
Symptoms: NOT “High Blood Pressure”
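One way to read the QBT query above is as a conjunction of per-attribute conditions, each optionally negated. The sketch below rests on assumptions the slides do not state: conditions are combined with AND, matching is case-insensitive substring search, and the sample records are hypothetical.

```python
def matches(record, conditions):
    """Return True when every (attribute, phrase, negated) condition holds."""
    for attribute, phrase, negated in conditions:
        text = record.get(attribute, "").lower()
        found = phrase.lower() in text
        # A plain condition needs the phrase present; a NOT condition needs it absent.
        if found == negated:
            return False
    return True


# Query 1 expressed as attribute conditions.
query = [
    ("Symptoms", "Hypertension", False),
    ("Symptoms", "High Blood Pressure", True),
]

# Hypothetical records, one per segmented page.
records = [
    {"Title": "Page A", "Symptoms": "Hypertension and headache"},
    {"Title": "Page B", "Symptoms": "Hypertension, also called high blood pressure"},
    {"Title": "Page C", "Symptoms": "Fever"},
]
hits = [record["Title"] for record in records if matches(record, query)]
```

Only Page A satisfies both conditions: Page B contains the excluded phrase, and Page C lacks the required one.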
6(iv). Query Attributes
6(v). Query Results
6(vi). Quantitative Performance Analysis
QBT Query
Symptom: “Hypertension”
Symptom: NOT “High Blood Pressure”
Before Procedure: “Stop”
After Procedure: “Normal”
Cause: “High Blood Pressure”
Symptom: “Heart Attack”
Food Source: “Fish”
Side Effect: “Poisoning”
7. Discussions
Content fragments as perceived by skilled and semi-skilled domain users → determined by the web page segmentation process
Proposed effort → Formulating a generic algorithm based on heuristic design rules and visual features
The QBT interface → Query over user-identified segments (attributes)
Aim → Convert MedlinePlus pages → DB
Contention → web page → good source → easy-to-use new query language interface for segments
8. Summary and Conclusions
A. Heuristics + visual features based segmentation → turning point:
  a. Provides → independent solution
  b. Improves → Query interfaces for chosen domain
B. The medical domain → need to make the information accessible to the end-users
C. Query by Segment or Tag (QBT) → An attempt
  a. Aim → return query-able attributes to the users
References
1. “A Site Oriented Method for Segmenting Web Pages”, David Fernandes, Edleno S. de Moura, Altigran S. da Silva, Berthier Ribeiro-Neto, Edisson Braga, SIGIR’11, July 24-28, 2011.
2. “Extracting Content Structure for Web Pages based on Visual Representation”, Deng Cai, Shipeng Yu, Ji-Rong Wen and Wei-Ying Ma, Web Technologies and Applications: 5th Asia-Pacific Web Conference, APWeb 2003, Xian, China, April 23-25, 2003. Proceedings (2003), pp. 596-596.
3. “Graph-Theoretic Approach to Webpage Segmentation”, Deepayan Chakrabarti, Ravi Kumar, Kunal Punera, WWW 2008 / Refereed Track: Search - Corpus Characterization & Search Performance, Beijing, China.
4. “A segmentation method for web page analysis using shrinking and dividing”, Jiuxin Cao, Bo Mao and Junzhou Luo, International Journal of Parallel, Emergent and Distributed Systems, 25:2, 93-104, 2010.
5. “Web Page Layout via Visual Segmentation”, Ayelet Pnueli, Ruth Bergman, Sagi Schein, Omer Barkol, HP Laboratories, 2009.
6. “Page-level template detection via isotonic smoothing”, D. Chakrabarti, R. Kumar, and K. Punera, In 16th WWW, pages 61–70, 2007.
7. “A Densitometric Approach to Web Page Segmentation”, Christian Kohlschütter, Wolfgang Nejdl, CIKM’08, October 26–30, 2008.
8. “HTML Page Analysis Based on Visual Cues”, Yudong Yang and HongJiang Zhang, IEEE 2001.
9. “Template Detection via Data Mining and its Applications”, Ziv Bar-Yossef, Sridhar Rajagopalan, In Proceedings of WWW’02, May 7–11, 2002, Honolulu, Hawaii, USA.
10. “DeSeA: A Page Segmentation based Algorithm for Information Extraction”, He Juan, Gao Zhiqiang, Xu Hui, Qu Yuzhong, Proceedings of the First International Conference on Semantics, Knowledge, and Grid (SKG 2005).
11. “Reverse Engineering for Web Data: From Visual to Semantic Structures”, Christina Yip Chung, Michael Gertz, Neel Sundaresan, In Proceedings of the 18th International Conference on Data Engineering (ICDE’02).
Thank you
Questions