
Web Page Segmentation for Querying Healthcare Repository

Date post: 22-Jan-2018
Upload: aastha-madaan
to Advance Knowledge for Humanity. Aastha Madaan, Wanming Chu, Subhash Bhalla, University of Aizu. VisHue: Web Page Segmentation for an Improved Query Interface for MedlinePlus Medical Encyclopedia. 11/12/16
Transcript
Page 1: Web Page Segmentation for Querying Healthcare Repository

to Advance Knowledge for Humanity

Aastha Madaan, Wanming Chu, Subhash Bhalla

University of Aizu

VisHue: Web Page Segmentation for an Improved Query Interface for MedlinePlus Medical Encyclopedia

11/12/16

Page 2: Web Page Segmentation for Querying Healthcare Repository

Outline

1. Introduction

2. Background

a) Hierarchical structure

b) Page-Level Segmentation

3. Web Page segmentation Algorithms

a) Features

b) Main focus

c) Comparison

4. The Proposal: The VisHue Algorithm

5. Query by Segment

6. Performance Analysis

7. Discussions

8. Summary and Conclusions

Page 3: Web Page Segmentation for Querying Healthcare Repository

1. Introduction

The WWW is a common, and the largest, source of information

Deep Querying → Gaining importance

Understanding web page semantics → Improves User’s search experience

Within a web page → Identify semantic groups

Important → Discovering these semantic blocks

Page 4: Web Page Segmentation for Querying Healthcare Repository

1(i). The Statement [1]

A. Large variety of HTML pages → suitable query and search?

B. Basic requirements → searching and querying
   Simple querying and searching → semantic querying and searching

C. Significant → Recognize the semantic and coherent segments
   Page-level → Segment-level

D. Case example → Medical encyclopedia
   MedlinePlus → various choices of medical encyclopedias

Page 5: Web Page Segmentation for Querying Healthcare Repository

1(i). The Statement [2]

[Figure: UML class diagram]


Page 7: Web Page Segmentation for Querying Healthcare Repository

2. Background: MedlinePlus

Web page:

i. Relevant content

ii. Irrelevant content

a. Relevant content:

i. Topic headings

ii. Topic-wise contents

b. Irrelevant content: navigation bars, header, footer, advertisements

Headings → Identify hierarchical structure

Distinct blocks → What a user’s perception identifies

Main focus → Skilled and semi-skilled users

i. Assumption → Headings → Query attributes

Page 8: Web Page Segmentation for Querying Healthcare Repository

2(a). Hierarchical Structure

1. Hierarchical structure → logical structure within the page (document)

2. Indicates the binary relationships (belongingness) → between a pair of segments

3. Accurate hierarchical representation → User-level query attributes (in segments)

4. Proposed hierarchical structure → based on domain knowledge (skilled and semi-skilled users); captures users’ perception
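
The belongingness relation between pairs of segments can be sketched in Python. This is an illustration only, not the VisHue algorithm: it assumes every segment heading already carries a numeric depth (for example, taken from HTML h1 to h6 tags), and the function name `build_hierarchy` and the sample labels are hypothetical.

```python
from typing import List, Optional, Tuple


def build_hierarchy(headings: List[Tuple[int, str]]) -> List[Tuple[Optional[str], str]]:
    """Return (parent, child) pairs from (depth, label) headings in document order.

    A heading belongs to the nearest preceding heading with a smaller depth;
    top-level headings get None as their parent.
    """
    pairs: List[Tuple[Optional[str], str]] = []
    stack: List[Tuple[int, str]] = []  # open ancestors, shallowest first
    for depth, label in headings:
        # Close every heading that cannot be an ancestor of the current one.
        while stack and stack[-1][0] >= depth:
            stack.pop()
        parent = stack[-1][1] if stack else None
        pairs.append((parent, label))
        stack.append((depth, label))
    return pairs


# Hypothetical headings for one encyclopedia page (assumed, not from the slides).
sample = [
    (1, "Asthma"),
    (2, "Causes"),
    (2, "Symptoms"),
    (3, "Emergency symptoms"),
    (2, "Treatment"),
]
relations = build_hierarchy(sample)
```

Each returned pair is one binary relationship: the second label belongs to the first. Under this assumption, the child labels are the candidates for user-level query attributes.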

Page 9: Web Page Segmentation for Querying Healthcare Repository

2(a).(i). Segmentation ↔ Semantic Query

[Figure labels: User → Semantic query and search (in future); Common Web User]

Page 10: Web Page Segmentation for Querying Healthcare Repository

2(b). Page-Level Segmentation

Definition

A self-contained logical region within a Web page that is:

(i) not nested within any other segment;

(ii) represented by a pair (l, c), where

l → label of the segment

c → portion of text of the segment [1].
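
A minimal sketch of the (l, c) pair, assuming that each `<h2>` heading opens a new segment and that the text up to the next `<h2>` is its content. The class names `Segment` and `HeadingSplitter`, the choice of `<h2>`, and the sample HTML are assumptions for illustration; the slides do not state how MedlinePlus pages are marked up.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import List


@dataclass
class Segment:
    label: str    # l: label of the segment
    content: str  # c: portion of text of the segment


class HeadingSplitter(HTMLParser):
    """Start a new flat segment at every <h2>; the text that follows is its content."""

    def __init__(self) -> None:
        super().__init__()
        self.segments: List[Segment] = []
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_heading = True
            self.segments.append(Segment(label="", content=""))

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_heading = False

    def handle_data(self, data):
        if not self.segments:
            return  # text before the first <h2> belongs to no segment
        text = " ".join(data.split())
        if not text:
            return
        current = self.segments[-1]
        if self._in_heading:
            current.label = (current.label + " " + text).strip()
        else:
            current.content = (current.content + " " + text).strip()


# Hypothetical page fragment (assumed markup).
page = (
    "<h1>Asthma</h1>"
    "<h2>Causes</h2><p>Airway swelling.</p>"
    "<h2>Symptoms</h2><p>Cough.</p><p>Wheezing.</p>"
)
splitter = HeadingSplitter()
splitter.feed(page)
splitter.close()
```

Because the segments are collected in a flat list, no segment is nested within another, which matches condition (i) by construction.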


Page 12: Web Page Segmentation for Querying Healthcare Repository

3. Segmentation Algorithms

i. History → “segmentation” traces back to the year 2001 (continues till 2011)

ii. Various application domains

iii. Various techniques for segmenting

iv. Various terminologies used

v. Proposed → MedlinePlus → items of user’s focus → Query Attributes

Page 13: Web Page Segmentation for Querying Healthcare Repository

3(a). Features of Segmentation Algorithm

A. Match and Identify → a user’s points of focus

B. Discover informative segments → i. Better search and query

ii. Segments become query-able attributes

iii. Skilled users aim to query the informative areas (only)

C. Generate → True hierarchical structure

D. Segmentation Process → Low space and time complexity

Page 14: Web Page Segmentation for Querying Healthcare Repository

3(b). Main Focus

Find an algorithm best suited to:

1. Generating the hierarchical structure

2. Converting segments to attributes in a database

3. Facilitating in-depth querying
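
One way to picture the conversion of segments to database attributes is a table with one column per segment label. The sketch below uses Python's built-in `sqlite3` module; the table name `encyclopedia`, the column names, and the sample rows are hypothetical and are not taken from the authors' system.

```python
import sqlite3

# Hypothetical segmented pages: one dict per page, keyed by segment label.
pages = [
    {"title": "Asthma", "causes": "Airway swelling", "symptoms": "Cough, wheezing"},
    {"title": "Anemia", "causes": "Low iron", "symptoms": "Fatigue"},
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE encyclopedia (title TEXT PRIMARY KEY, causes TEXT, symptoms TEXT)"
)
# Each segment label becomes a column (attribute); each page becomes a row.
conn.executemany(
    "INSERT INTO encyclopedia (title, causes, symptoms) "
    "VALUES (:title, :causes, :symptoms)",
    pages,
)

# In-depth query: restrict the match to one segment instead of the whole page.
rows = conn.execute(
    "SELECT title FROM encyclopedia WHERE symptoms LIKE ?", ("%wheezing%",)
).fetchall()
```

The point of the design is that a keyword is matched against a single attribute, so a term appearing only in another segment of the page does not produce a hit.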

Page 15: Web Page Segmentation for Querying Healthcare Repository

3(b).(i). Segmentation Methods ↔ Web Technologies

Page 16: Web Page Segmentation for Querying Healthcare Repository

3(b).(ii). Classification of Algorithms

Page 17: Web Page Segmentation for Querying Healthcare Repository

3(b).(iii). Timeline ↔ Techniques

Technique                        Algorithm          Year
Template Detection               [9], [6]           2002, 2007
DOM-Node Recognition             [8], [11], [10]    2001, 2002, 2006
Visual-DOM based Rendering       [2]                2003
Visual-Heuristics based Method   Proposed           -
Graph-theoretic Method           [3]                2008
Linguistics based Method         [7]                2008
Image of the Web Page            [4], [5]           2010, 2009
Site-Oriented Method             [1]                2011

Page 18: Web Page Segmentation for Querying Healthcare Repository

3(c). Comparison

Page 19: Web Page Segmentation for Querying Healthcare Repository

3(c).(i). Main Focus

Page 20: Web Page Segmentation for Querying Healthcare Repository

3(c).(ii). Comparison: Vision-based Methods

Page 21: Web Page Segmentation for Querying Healthcare Repository

3(c).(iii). Content Structure by VisHue


Page 23: Web Page Segmentation for Querying Healthcare Repository

4. The Proposal: VisHue Algorithm

Page 24: Web Page Segmentation for Querying Healthcare Repository

4(i). Query Interfaces

Querying vs. Searching

Searching: Recent Trends

1. Object-based search

2. Block-based search

3. Entity-based search

Querying: Recent Trends

Very few efforts have been made


Page 26: Web Page Segmentation for Querying Healthcare Repository

5. Query by Segment

Query by Segment as Query by Tag (Heading) → QBT

Based on → Content Structure (VisHue algorithm) : Query by Attributes

MedlinePlus medical encyclopedia → 3886 web pages

Target → Focused and explicit querying

i. Beneficial → skilled and semi-skilled users

ii. Medical encyclopedia → result of → years of efforts by experts

Page 27: Web Page Segmentation for Querying Healthcare Repository

5(i). The QBT Interface

[Figure: traditional search on the MedlinePlus medical encyclopedia alongside the QBT interface; labels shown: Title, Causes, Symptoms, Post-Care, DB]

Page 28: Web Page Segmentation for Querying Healthcare Repository

5(ii). QBT Interface ↔ Hierarchical Structure

Labels → Query Attributes

QBT interface: Search and Query

Child nodes → search attributes

Left siblings → limit the scope of search of right siblings in the interface

Segments → Attributes for Deep Query over all pages of MedlinePlus


Page 30: Web Page Segmentation for Querying Healthcare Repository

6. Performance Analysis

i. Qualitative comparison with traditional keyword search

ii. Query formulation and interpretation

iii. Quantitative performance analysis of the interface

Page 31: Web Page Segmentation for Querying Healthcare Repository

6(i). QBT vs. Keyword Search

Page 32: Web Page Segmentation for Querying Healthcare Repository

6(ii). Query Formulation: A Comparison

Page 33: Web Page Segmentation for Querying Healthcare Repository

6(iii). Query Example

Query 1: Cases where patient has hypertension but not high blood pressure

QBT query :

Symptoms: “Hypertension”

Symptoms: NOT “High Blood Pressure”
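
How such a query might be evaluated can be sketched as a conjunction of per-attribute conditions, each optionally negated. This is an assumed reading of the QBT query above, not the authors' implementation: phrases are matched literally and case-insensitively within one segment, so "Hypertension" and "High Blood Pressure" are treated as different strings. The `matches` function and the sample pages are hypothetical.

```python
from typing import Dict, List, Tuple

# A condition is (attribute, phrase, negated); all conditions are ANDed.
Condition = Tuple[str, str, bool]


def matches(page: Dict[str, str], conditions: List[Condition]) -> bool:
    """Return True when the page satisfies every condition."""
    for attribute, phrase, negated in conditions:
        found = phrase.lower() in page.get(attribute, "").lower()
        if found == negated:  # required phrase missing, or excluded phrase present
            return False
    return True


# Hypothetical segment texts, not taken from MedlinePlus.
pages = [
    {"Title": "Page A", "Symptoms": "Hypertension, headache"},
    {"Title": "Page B", "Symptoms": "Hypertension (high blood pressure), dizziness"},
    {"Title": "Page C", "Symptoms": "Fever"},
]

query = [
    ("Symptoms", "Hypertension", False),
    ("Symptoms", "High Blood Pressure", True),
]
hits = [p["Title"] for p in pages if matches(p, query)]
```

With this sample data only "Page A" is returned: "Page B" contains the excluded phrase and "Page C" lacks the required one.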

Page 34: Web Page Segmentation for Querying Healthcare Repository

6(iv). Query Attributes

Page 35: Web Page Segmentation for Querying Healthcare Repository

6(v). Query Results

Page 36: Web Page Segmentation for Querying Healthcare Repository

6(vi). Quantitative Performance Analysis

QBT Query

Symptom: “Hypertension”

Symptom: NOT “High Blood Pressure”

Before Procedure: “Stop”

After Procedure: “Normal”

Cause: “High Blood Pressure”

Symptom: “Heart Attack”

Food Source: “Fish”

Side Effect: “Poisoning”

Page 37: Web Page Segmentation for Querying Healthcare Repository

7. Discussions

Content fragments as perceived by skilled and semi-skilled domain users → determined by the web page segmentation process

Proposed effort → Formulating a generic heuristic design-rule and visual features based algorithm

The QBT interface → Query over user identified segments (attributes)

Aim → Convert MedlinePlus pages → DB

Contention → web page → good source → easy to use new query language interface for segments

Page 38: Web Page Segmentation for Querying Healthcare Repository

8. Summary and Conclusions

A. Heuristics + visual features based segmentation → turning point:
   i. Provides → independent solution
   ii. Improves → Query interfaces for chosen domain

B. The medical domain → need to make the information accessible to the end-users

C. Query by Segment or Tag (QBT) → An attempt
   i. Aim → return the users query-able attributes

Page 39: Web Page Segmentation for Querying Healthcare Repository

References

1. “A Site Oriented Method for Segmenting Web Pages”, David Fernandes, Edleno S. de Moura, Altigran S. da Silva, Berthier Ribeiro-Neto, Edisson Braga, SIGIR’11, July 24-28, 2011.

2. “Extracting Content Structure for Web Pages based on Visual Representation”, Deng Cai, Shipeng Yu, Ji-Rong Wen and Wei-Ying Ma, Web Technologies and Applications: 5th Asia-Pacific Web Conference, APWeb 2003, Xian, China, April 23-25, 2003. Proceedings (2003), pp. 596-596.

3. “Graph-Theoretic Approach to Webpage Segmentation”, Deepayan Chakrabarti, Ravi Kumar, Kunal Punera, WWW 2008 / Refereed Track: Search - Corpus Characterization & Search Performance, Beijing, China.

4. “A segmentation method for web page analysis using shrinking and dividing”, Jiuxin Cao, Bo Mao and Junzhou Luo, International Journal of Parallel, Emergent and Distributed Systems, 25:2, 93-104, 2010.

5. “Web Page Layout via Visual Segmentation”, Ayelet Pnueli, Ruth Bergman, Sagi Schein, Omer Barkol, HP Laboratories, 2009.

6. “Page-level template detection via isotonic smoothing”, D. Chakrabarti, R. Kumar, and K. Punera, in 16th WWW, pages 61–70, 2007.

7. “A Densitometric Approach to Web Page Segmentation”, Christian Kohlschütter, Wolfgang Nejdl, CIKM’08, October 26–30, 2008.

8. “HTML Page Analysis Based on Visual Cues”, Yudong Yang and HongJiang Zhang, IEEE 2001.

9. “Template Detection via Data Mining and its Applications”, Ziv Bar-Yossef, Sridhar Rajagopalan, in Proceedings of WWW’02, May 7–11, 2002, Honolulu, Hawaii, USA.

10. “DeSeA: A Page Segmentation based Algorithm for Information Extraction”, He Juan, Gao Zhiqiang, Xu Hui, Qu Yuzhong, Proceedings of the First International Conference on Semantics, Knowledge, and Grid (SKG 2005).

11. “Reverse Engineering for Web Data: From Visual to Semantic Structures”, Christina Yip Chung, Michael Gertz, Neel Sundaresan, in Proceedings of the 18th International Conference on Data Engineering (ICDE’02).

Page 40: Web Page Segmentation for Querying Healthcare Repository

Thank you

Questions