
Automatically Extracting Data Records from Web Pages

Presenter: Dheerendranath Mundluru
http://www.ucs.louisiana.edu/~dnm8925

Dheerendranath Mundluru, Dr. Vijay Raghavan, Dr. Zonghuan Wu, Jayasimha R. Katukuri, Saygin Celebi

Laboratory for Internet Computing
Center for Advanced Computer Studies
University of Louisiana at Lafayette, Lafayette, LA


Agenda

Introduction
Proposed Solution: Path-based Information Extractor
Experiments
Conclusions and Future Work


Introduction

World Wide Web: the largest known repository of documents, containing diverse content used by people from diverse backgrounds.

A few characteristics of the Web:
Huge size
Easily accessible
Hyperlinked
Dynamic
Diverse coverage (science, politics, education, etc.)
Growing at a tremendous rate
Noisy (advertisements, mirror sites, etc.)


Web Mining: Leverage the Value of Web

Web mining aims to discover useful knowledge from the Web.

Characteristics of the Web such as heterogeneity, increasing size, and noise make Web mining a challenging task.

Web mining can be classified into [Kosala 00, Liu 04]:
Web content mining: extracting and discovering useful information or knowledge from Web page contents
Web structure mining: discovering useful knowledge from the structure of hyperlinks (e.g., used by Google)
Web usage mining: discovering useful knowledge from user access log files (e.g., used by Amazon.com)

Web mining is a multidisciplinary field that draws on data mining, information retrieval, databases, machine learning, information extraction, natural language processing, etc.


Web Mining & Web Content Mining Classification


Structured Data Extraction

Structured data extraction deals with extracting information that is displayed in a regular structure, since such information is perceived to represent the essential content of a Web page, e.g., the list of products on an e-commerce Web page. [Liu 04]

A few example applications:
Online comparative shopping engines (e.g., nextag.com)
Metasearch engines (e.g., dogpile.com)
Modern business intelligence systems (e.g., intelliseek.com)


Sample response page from Google


Sample response page from drugstore.com


Path-based Information Extractor (PIE)

PIE is an automatic data extraction system whose goal is to automatically extract data records present in Web search response pages. [Mundluru 05a, Mundluru 05b]

PIE also eliminates any “noisy” content such as advertisements, navigation links, etc.


A Few Observations

Observation 1: Data records displayed in a particular region of a Web page are contiguous and are formatted using similar HTML tags. [Liu 03]

Observation 2: A group of similar data records belonging to a particular region are always present under the same parent node in the tag tree. [Liu 03]

Observation 3: In most search response pages, every record has at least one hyperlink. Usually, the title of the retrieved document is displayed as a hyperlink that points to the retrieved document. In this work, we refer to such a hyperlink as a record link.
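To make the three observations concrete, here is a minimal Python sketch that groups the hyperlinks of a page by their root-to-node tag path and treats the largest group as the candidate record links. This is only an illustration of the observations, not PIE's record extraction algorithm, which these slides do not spell out. The names LinkPathCollector, candidate_record_links, and SAMPLE_PAGE are hypothetical, the sample HTML is made up, and grouping by identical tag path is a simplifying stand-in for "same parent node, similar HTML tags".

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tags that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class LinkPathCollector(HTMLParser):
    """Record the root-to-node tag path of every hyperlink in a page."""

    def __init__(self):
        super().__init__()
        self.stack = []                         # open tags, root first
        self.links_by_path = defaultdict(list)  # tag path -> [(href, text)]
        self._current = None                    # (path, href, text parts) of the open <a>

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append(tag)
        href = dict(attrs).get("href")
        if tag == "a" and href:
            self._current = ("/".join(self.stack), href, [])

    def handle_data(self, data):
        if self._current is not None:
            self._current[2].append(data)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if tag == "a" and self._current is not None:
            path, href, parts = self._current
            self.links_by_path[path].append((href, "".join(parts).strip()))
            self._current = None
        # Pop back to the matching open tag so unclosed tags are tolerated.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def candidate_record_links(html):
    """Return (tag path, links) for the largest group of same-path hyperlinks."""
    collector = LinkPathCollector()
    collector.feed(html)
    collector.close()
    groups = collector.links_by_path
    if not groups:
        return "", []
    best_path = max(groups, key=lambda p: len(groups[p]))
    return best_path, groups[best_path]


# Made-up response page: one advertisement, three records, one navigation link.
SAMPLE_PAGE = """
<html><body>
  <div><a href="/ads/1">Sponsored offer</a></div>
  <ul>
    <li><a href="/doc/1">First result</a><br>Snippet one</li>
    <li><a href="/doc/2">Second result</a><br>Snippet two</li>
    <li><a href="/doc/3">Third result</a><br>Snippet three</li>
  </ul>
  <p><a href="/next">Next page</a></p>
</body></html>
"""

if __name__ == "__main__":
    path, links = candidate_record_links(SAMPLE_PAGE)
    print(path)
    for href, title in links:
        print(href, title)
```

In this toy page the advertisement and the navigation link each sit on their own tag path, so they fall outside the largest group. A real response page would need a more tolerant notion of path similarity, and picking the largest group is an assumption made here for brevity.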


Record Extraction Algorithm


Experiments

Experiment Setup:

Evaluated the proposed system by comparing it with two state-of-the-art record extraction systems: MDR [Liu 03] and ViNTs [Zhao 05]

All three systems were tested on a total of 60 Web pages (having 873 data records) taken from 60 Web sources

The 60 Web sources include:
General-purpose search engines, e.g., Google, Yahoo
E-commerce sites, e.g., drugstore.com, clevershoppers.com
Other special-purpose search engines, e.g., mit.edu, breastcancer.org

PIE was developed in Java


Experiments

Evaluation Measures Used:

\[
\text{Recall} = \frac{\text{Total number of target data records correctly extracted}}{\text{Total number of target data records}}
\]

\[
\text{Precision} = \frac{\text{Total number of target data records correctly extracted}}{\text{Total number of data records extracted}}
\]

Results:

            PIE      MDR      ViNTs
Recall      90.4%    69.9%    83.8%
Precision   95.5%    81.4%    93%
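As a small worked example of the two measures, the Python sketch below computes recall and precision from raw counts. The function names and the counts are hypothetical and chosen only to show the arithmetic; they are not the per-system counts behind the reported results, which the slides do not give.

```python
def recall(correctly_extracted, total_targets):
    """Fraction of the target data records that were correctly extracted."""
    if total_targets == 0:
        return 0.0
    return correctly_extracted / total_targets


def precision(correctly_extracted, total_extracted):
    """Fraction of the extracted data records that are correct target records."""
    if total_extracted == 0:
        return 0.0
    return correctly_extracted / total_extracted


if __name__ == "__main__":
    # Made-up counts: 50 target records on the page, 48 records extracted,
    # 45 of which are correct target records.
    print(f"Recall:    {recall(45, 50):.1%}")
    print(f"Precision: {precision(45, 48):.1%}")
```

The two measures share a numerator and differ only in the denominator: recall penalizes missed target records, while precision penalizes extracted items that are not target records, such as advertisements.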


Conclusions & Future Work

Conclusions:
Automatic data extraction is extremely important for systems such as online comparative shopping engines, metasearch engines, and business intelligence solutions.
A system called PIE has been proposed for automatically extracting data records from Web pages; it achieved 90.4% recall and 95.5% precision in the experiments.
Experiments showed that PIE outperformed MDR and ViNTs, two state-of-the-art record extraction systems that are being used in two software companies.

Future Work:
Improving the effectiveness of record extraction
Extracting the attributes in each data record, e.g., product name, price, etc.
Performing large-scale experiments
Building applications such as online comparative shopping engines, metasearch engines, etc.


References

[Mundluru 05a] D. Mundluru, J. Katukuri, and S. Celebi. Automatically Mining Result Records from Search Engine Response Pages. Proceedings of the 5th IEEE International Conference on Data Mining (ICDM), 749-753, Houston, November 2005.

[Mundluru 05b] D. Mundluru, Z. Wu, V. Raghavan, J. Katukuri, and S. Celebi. Automatically Mining Search Result Records. Technical Report CACS-TR-2005-3-1, Center for Advanced Computer Studies, University of Louisiana at Lafayette, 2005.

[Kosala 00] R. Kosala and H. Blockeel. Web Mining Research: A Survey. ACM Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD), 2(1), 1-15, 2000.

[Liu 04] B. Liu and K. Chang. Editorial: Special Issue on Web Content Mining. SIGKDD Explorations, 6(2), 1-4, December 2004.

[Liu 03] B. Liu, R. Grossman, and Y. Zhai. Mining Data Records in Web Pages. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 601-606, Washington, D.C., August 2003.

[Zhao 05] H. Zhao, W. Meng, Z. Wu, V. Raghavan, and C. Yu. Fully Automatic Wrapper Generation for Search Engines. Proceedings of the 14th International World Wide Web Conference, 66-75, Chiba, Japan, May 2005.

