Automatically Extracting Data Records from Web Pages
Presenter: Dheerendranath Mundluru
http://www.ucs.louisiana.edu/~dnm8925
Dheerendranath Mundluru, Dr. Vijay Raghavan, Dr. Zonghuan Wu, Jayasimha R. Katukuri, Saygin Celebi
Laboratory for Internet Computing, Center for Advanced Computer Studies, University of Louisiana at Lafayette, Lafayette, LA
Agenda
  Introduction
  Proposed Solution: Path-based Information Extractor
  Experiments
  Conclusions and Future Work
Introduction
World Wide Web: Largest known repository of documents containing diverse content used by people from diverse backgrounds.
A few characteristics of the Web:
  Huge size
  Easily accessible
  Hyperlinked
  Dynamic
  Diverse coverage: science, politics, education, etc.
  Increasing at a tremendous rate
  Noisy: advertisements, mirror sites, etc.
Web Mining: Leverage the Value of Web
Web mining aims to discover useful knowledge from the Web.
Characteristics of the Web such as heterogeneity, increasing size, noise, etc. make Web mining a challenging task.
Web mining can be classified into [Kosala 00, Liu 04]:
  Web content mining: Extracting and discovering useful information or knowledge from Web page contents
  Web structure mining: Discovering useful knowledge from the structure of hyperlinks, e.g., used by Google
  Web usage mining: Discovering useful knowledge from user access log files, e.g., used by Amazon.com
Web mining is a multidisciplinary field: data mining, information retrieval, databases, machine learning, information extraction, natural language processing, etc.
Structured Data Extraction
Structured data extraction deals with extracting information that is displayed in a regular structure, since such information is perceived to represent the essential content of a Web page, e.g., the list of products in an e-commerce Web page. [Liu 04]
A few example applications:
  Online comparative shopping engines (e.g., nextag.com)
  Metasearch engines (e.g., dogpile.com)
  Modern Business Intelligence systems (e.g., intelliseek.com)
Path-based Information Extractor (PIE)
PIE is an automatic data extraction system whose goal is to automatically extract data records present in Web search response pages. [Mundluru 05a, Mundluru 05b]
PIE also eliminates any “noisy” content such as advertisements, navigation links, etc.
A Few Observations
Observation 1: Data records displayed in a particular region of a Web page are contiguous and are formatted using similar HTML tags. [Liu 03]
Observation 2: A group of similar data records belonging to a particular region are always present under the same parent node in the tag tree. [Liu 03]
Observation 3: Every record present in most search response pages has at least one hyperlink. Usually, the title of the retrieved document is displayed as a hyperlink that points to the retrieved document. In this work, we refer to such a hyperlink as a record link.
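The three observations can be turned into a simple heuristic, sketched below in Python. This is not PIE's actual algorithm, which these slides do not spell out; it is only an illustration under stated assumptions. It builds a tag tree, then looks for the parent node with the longest run of contiguous children that share a tag name and each contain at least one hyperlink. The names Node, TreeBuilder, longest_record_run and find_record_region, the minimum of two records per region, and the sample page are all hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (a partial list, enough for a sketch).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """One element in the tag tree."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.has_link = False  # True if this subtree contains <a href=...>


class TreeBuilder(HTMLParser):
    """Builds a simplified tag tree, ignoring text and attributes other than href."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag == "a" and any(name == "href" for name, _ in attrs):
            # Mark the link on the node and on every ancestor (Observation 3).
            n = node
            while n is not None:
                n.has_link = True
                n = n.parent
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        n = self.current
        while n is not self.root and n.tag != tag:
            n = n.parent
        if n is not self.root:
            self.current = n.parent


def longest_record_run(node):
    """Longest run of contiguous, same-tag children that each contain a link
    (Observations 1 and 3, with "similar tags" simplified to "same tag name")."""
    best, run = [], []
    for child in node.children:
        if child.has_link and (not run or run[-1].tag == child.tag):
            run.append(child)
        else:
            run = [child] if child.has_link else []
        if len(run) > len(best):
            best = list(run)
    return best


def find_record_region(html, min_records=2):
    """Return (parent, records) for the parent with the longest qualifying run
    (Observation 2: similar records share one parent in the tag tree)."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    best_parent, best_run = None, []
    stack = [builder.root]
    while stack:
        node = stack.pop()
        run = longest_record_run(node)
        if len(run) >= min_records and len(run) > len(best_run):
            best_parent, best_run = node, run
        stack.extend(node.children)
    return best_parent, best_run


# Hypothetical search response page: three records plus navigation and footer noise.
page = """
<html><body>
  <div id="nav"><a href="/home">Home</a></div>
  <ul>
    <li><a href="/doc1">First result</a><p>snippet one</p></li>
    <li><a href="/doc2">Second result</a><p>snippet two</p></li>
    <li><a href="/doc3">Third result</a><p>snippet three</p></li>
  </ul>
  <p>footer without links</p>
</body></html>
"""

parent, records = find_record_region(page)
print(parent.tag, len(records))  # prints: ul 3
```

The navigation link is not reported as a record region because it forms a run of length one. A real extractor would need a richer notion of tag similarity than matching tag names, and a way to tell record regions from link-heavy noise such as navigation bars.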
Experiments
Experiment Setup:
Evaluated the proposed system by comparing it with two state-of-the-art record extraction systems: MDR [Liu 03] and ViNTs [Zhao 05]
All three systems were tested on a total of 60 Web pages (having 873 data records) taken from 60 Web sources
The 60 Web sources include:
  general-purpose search engines, e.g., Google, Yahoo
  e-commerce sites, e.g., drugstore.com, clevershoppers.com
  other special-purpose search engines, e.g., mit.edu, breastcancer.org
PIE was developed in Java
Experiments
Evaluation Measures Used:
  Recall = (Total number of target data records correctly extracted) / (Total number of target data records)
  Precision = (Total number of target data records correctly extracted) / (Total number of data records extracted)
Results:
              PIE      MDR      ViNTs
  Recall      90.4%    69.9%    83.8%
  Precision   95.5%    81.4%    93%
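As a small worked illustration of the two measures, the Python sketch below computes recall and precision from raw counts. The counts are hypothetical and chosen only to show the arithmetic; they are not taken from the experiments reported here, and the function names are illustrative.

```python
def recall(correct, total_targets):
    """Fraction of the target data records that were correctly extracted."""
    if total_targets <= 0:
        raise ValueError("total_targets must be positive")
    return correct / total_targets


def precision(correct, total_extracted):
    """Fraction of the extracted data records that are correct target records."""
    if total_extracted <= 0:
        raise ValueError("total_extracted must be positive")
    return correct / total_extracted


# Hypothetical counts for one test collection (not from the slides):
total_targets = 20     # target records actually present in the pages
total_extracted = 18   # records the system returned
correct = 17           # returned records that are true target records

print(f"Recall:    {recall(correct, total_targets):.1%}")       # Recall:    85.0%
print(f"Precision: {precision(correct, total_extracted):.1%}")  # Precision: 94.4%
```

The two measures share a numerator and differ only in the denominator, so a system that extracts few records can score high on precision and low on recall, which is why both are reported.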
Conclusions & Future Work
Conclusions:
  Automatic data extraction is extremely important for systems such as online comparative shopping engines, metasearch engines, business intelligence solutions, etc.
  A system called PIE has been proposed for automatically extracting data records from Web pages; it reached 90.4% recall and 95.5% precision on the 60 test pages.
  Experiments showed that PIE outperformed MDR and ViNTs in both recall and precision. Both are state-of-the-art record extraction systems that are in use at two software companies.
Future Work:
  Improving the effectiveness in extracting records
  Extracting attributes in each data record, e.g., product name, price, etc.
  Performing large-scale experiments
  Building applications such as online comparative shopping engines, metasearch engines, etc.
References
[Mundluru 05a] D. Mundluru, J. Katukuri, and S. Celebi. Automatically Mining Result Records from Search Engine Response Pages. Proceedings of the 5th IEEE International Conference on Data Mining (ICDM), 749-753, Houston, November 2005.
[Mundluru 05b] D. Mundluru, Z. Wu, V. Raghavan, J. Katukuri, and S. Celebi. Automatically Mining Search Result Records. Technical Report CACS-TR-2005-3-1, Center for Advanced Computer Studies, University of Louisiana at Lafayette, 2005.
[Kosala 00] R. Kosala and H. Blockeel. Web Mining Research: A Survey. SIGKDD Explorations, 2(1), 1-15, 2000.
[Liu 04] B. Liu and K. Chang. Editorial: Special Issue on Web Content Mining. SIGKDD Explorations, 6(2), 1-4, December 2004.
[Liu 03] B. Liu, R. Grossman, and Y. Zhai. Mining Data Records in Web Pages. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 601-606, Washington, D.C., August 2003.
[Zhao 05] H. Zhao, W. Meng, Z. Wu, V. Raghavan, and C. Yu. Fully Automatic Wrapper Generation for Search Engines. Proceedings of the 14th International World Wide Web Conference, 66-75, Chiba, Japan, May 2005.