
Automatically Extracting Data Records from Web Pages

Presenter: Dheerendranath Mundluru [email protected]

http://www.ucs.louisiana.edu/~dnm8925

Dheerendranath Mundluru, Dr. Vijay Raghavan, Dr. Zonghuan Wu, Jayasimha R. Katukuri, Saygin Celebi

Laboratory for Internet Computing, Center for Advanced Computer Studies, University of Louisiana at Lafayette, Lafayette, LA


Agenda

- Introduction
- Proposed Solution: Path-based Information Extractor
- Experiments
- Conclusions and Future Work


Introduction

World Wide Web: Largest known repository of documents containing diverse content used by people from diverse backgrounds.

A few characteristics of the Web:

- Huge size
- Easily accessible
- Hyperlinked
- Dynamic
- Diverse coverage – science, politics, education, etc.
- Growing at a tremendous rate
- Noisy - advertisements, mirror sites, etc.


Web Mining: Leverage the Value of Web

Web mining aims to discover useful knowledge from the Web.

Characteristics of the Web such as heterogeneity, increasing size, and noise make Web mining a challenging task.

Web mining can be classified into [Kosala 00, Liu 04]:

Web content mining: Extracting and discovering useful information or knowledge from Web page contents

Web structure mining: Discovering useful knowledge from the structure of hyperlinks e.g., used by Google

Web usage mining: Discovering useful knowledge from user access log files e.g., used by Amazon.com

Web mining is a multidisciplinary field: data mining, information retrieval, databases, machine learning, information extraction, natural language processing, etc.


Web Mining & Web Content Mining Classification


Structured Data Extraction

Structured data extraction deals with extracting information that is displayed in a regular structure, since such information is perceived to represent the essential content of a Web page, e.g., the list of products on an e-commerce Web page. [Liu 04]

A few example applications:

- Online comparative shopping engines (e.g., nextag.com)
- Metasearch engines (e.g., dogpile.com)
- Modern business intelligence systems (e.g., intelliseek.com)


Sample response page from Google


Sample response page from drugstore.com


Path-based Information Extractor (PIE)

PIE is an automatic data extraction system whose goal is to automatically extract data records present in Web search response pages. [Mundluru 05a, Mundluru 05b]

PIE also eliminates any “noisy” content such as advertisements, navigation links, etc.


Few Observations

Observation 1: Data records displayed in a particular region of a Web page are contiguous and are formatted using similar HTML tags. [Liu 03]

Observation 2: A group of similar data records belonging to a particular region are always present under the same parent node in the tag tree. [Liu 03]

Observation 3: Every record present in most search response pages has at least one hyperlink. Usually, the title of the retrieved document is displayed as a hyperlink that points to the retrieved document. In this work, we refer to such a hyperlink as a record link.
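Observations 2 and 3 can be illustrated with a small sketch. This is not PIE's extraction algorithm, which these slides do not reproduce; it is only a minimal illustration, assuming Python's standard `html.parser` module, of how hyperlinks that sit under the same kind of parent in the tag tree end up sharing a tag path. The names `LinkPathCollector` and `group_links_by_path`, the list of void tags, and the sample HTML are all hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Void elements have no closing tag, so they are never pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class LinkPathCollector(HTMLParser):
    """Hypothetical helper: records the tag path leading to every <a href>."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, root first
        self.links = []   # (tag path, href) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append(tag)
        href = dict(attrs).get("href")
        if tag == "a" and href:
            self.links.append(("/".join(self.stack), href))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def group_links_by_path(html):
    """Group hyperlink targets by the tag path that leads to them."""
    parser = LinkPathCollector()
    parser.feed(html)
    groups = defaultdict(list)
    for path, href in parser.links:
        groups[path].append(href)
    return dict(groups)


# Made-up response page: one advertisement link and three result records.
SAMPLE = """
<html><body>
  <div><a href="/ads">Sponsored</a></div>
  <ul>
    <li><a href="/doc1">Title 1</a><p>snippet 1</p></li>
    <li><a href="/doc2">Title 2</a><p>snippet 2</p></li>
    <li><a href="/doc3">Title 3</a><p>snippet 3</p></li>
  </ul>
</body></html>
"""

groups = group_links_by_path(SAMPLE)
# The path shared by the most links is a candidate for the record links.
record_path = max(groups, key=lambda p: len(groups[p]))
print(record_path, groups[record_path])
```

In this made-up page the three result links share the path `html/body/ul/li/a`, while the advertisement link has a different path, which is the kind of regularity the observations describe. A real response page would need far more care (nested regions, several links per record, malformed markup).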


Record Extraction Algorithm


Experiments

Experiment Setup:

Evaluated the proposed system by comparing it with two state-of-the-art record extraction systems: MDR [Liu 03] and ViNTs [Zhao 05]

All three systems were tested on a total of 60 Web pages (having 873 data records) taken from 60 Web sources

The 60 Web sources include:

- General-purpose search engines, e.g., Google, Yahoo
- E-commerce sites, e.g., drugstore.com, clevershoppers.com
- Other special-purpose search engines, e.g., mit.edu, breastcancer.org

PIE was developed in Java


Experiments

Evaluation Measures Used:

Recall = (Total number of target data records correctly extracted) / (Total number of target data records)

Precision = (Total number of target data records correctly extracted) / (Total number of data records extracted)

Results:

Measure     PIE      MDR      ViNTs
Recall      90.4%    69.9%    83.8%
Precision   95.5%    81.4%    93%
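The two measures can be made concrete with a short sketch. The counts below are hypothetical values for a single imaginary page, chosen only to show the arithmetic; they are not taken from the experiments, and the function names are illustrative.

```python
def recall(correct, target_total):
    """Fraction of the target records that were correctly extracted."""
    return correct / target_total


def precision(correct, extracted_total):
    """Fraction of the extracted records that are correct target records."""
    return correct / extracted_total


# Hypothetical counts (assumptions, not experimental data):
target_total = 10     # data records actually present on the page
extracted_total = 8   # data records the extractor returned
correct = 6           # returned records that match a target record

r = recall(correct, target_total)
p = precision(correct, extracted_total)
print(f"recall={r:.1%} precision={p:.1%}")  # recall=60.0% precision=75.0%
```

Recall penalizes target records that were missed, while precision penalizes extracted records that are not target records (for example, advertisements mistaken for results), which is why both are reported.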


Conclusions & Future Work

Conclusions:

- Automatic data extraction is extremely important for systems such as online comparative shopping engines, metasearch engines, business intelligence solutions, etc.
- A system called PIE has been proposed for automatically extracting data records from Web pages.
- On the 60 test pages, PIE achieved higher recall (90.4%) and precision (95.5%) than MDR and ViNTs, two state-of-the-art record extraction systems that are in use at two software companies.

Future Work:

- Improving the effectiveness in extracting records
- Extracting attributes in each data record, e.g., product name, price, etc.
- Performing large-scale experiments
- Building applications such as online comparative shopping engines, metasearch engines, etc.


References

[Mundluru 05a] D. Mundluru, J. Katukuri, and S. Celebi. Automatically Mining Result Records from Search Engine Response Pages. Proceedings of the 5th IEEE International Conference on Data Mining (ICDM), 749-753, Houston, November 2005.

[Mundluru 05b] D. Mundluru, Z. Wu, V. Raghavan, J. Katukuri, and S. Celebi. Automatically Mining Search Result Records. Technical Report CACS-TR-2005-3-1, Center for Advanced Computer Studies, University of Louisiana at Lafayette, 2005.

[Kosala 00] R. Kosala and H. Blockeel. Web Mining Research: A Survey. SIGKDD Explorations, 2(1), 1-15, 2000.

[Liu 04] B. Liu and K. Chang. Editorial: Special Issue on Web Content Mining. SIGKDD Explorations, 6(2), 1-4, December 2004.

[Liu 03] B. Liu, R. Grossman, and Y. Zhai. Mining Data Records in Web Pages. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 601-606, Washington, D.C., August 2003.

[Zhao 05] H. Zhao, W. Meng, Z. Wu, V. Raghavan, and C. Yu. Fully Automatic Wrapper Generation for Search Engines. Proceedings of the 14th International World Wide Web Conference, 66-75, Chiba, Japan, May 2005.