
Record-Boundary Discovery in Web Documents by Yuan Jiang December 1, 1998.


Introduction

The amount of Web data has increased dramatically.

Two commonly used methods to retrieve data from the Web:
  Browsing
  Keyword searching

Limitations:
  Inefficient
  Easy to get lost while tracing links on the Web
  Unwanted data

Database Retrieval Techniques
  Problem: Most Web data is unstructured or semistructured and cannot be queried using traditional query languages.

A Sample Web Document

<html><head><title>Classifieds</title></head><body bgcolor="#FFFFFF"><table><tr><td><h1 align="left">Funeral Notices - </h1> October 1, 1998<hr><b>Lemar K. Adamson</b><br> age 84, of Tucson, died September 30, 1998. He is survived by wife, Cindy; daughters, Elvia, Gloria, Irene, Isabel, Jewel, and Jessica; sons, Paul, John, Jeffery, and Louis; brothers, Kirk, Justin, Ivan, Hubert and Grover. Funeral service at 10:00 a.m. Monday, October 5, 1998 at Silverbell Ward, 1540 E. Linden. Burial in City Cemetery. Friends may call from 9:00 a.m. to 10:00 a.m. Monday, at the church. Arrangements by <b>BRING'S MEMORIAL CHAPEL</b>, 236 S. Scott<br><hr>Our beloved <b>Brian Fielding Frost</b>, age 41, passed away Tuesday morning, September 30, 1998, due to injuries sustained in an automobile accident. He was born January 12, 1957 in Salt Lake City, to Donald Fielding and Helen Glade Frost. He married Susan Fox on June 1, 1981. He is survived by Susan; sons Alfred, Joseph; parents, and two sisters, Anne (Dale) Elkins and Sally (Kent) Britton. Funeral services will be held at 12 noon Tuesday, October 6, 1998 in the <b>Howard Stake Center</b>, 350 South 1600 East. Friends may call 5-7pm. Monday at <b> Carrillo's Tucson Mortuary</b>, 3401 S. Highland Drive. Interment at Holy Hope Cemetery.<br><hr><b>Leonard Kenneth Gunther</b><br> age 82. A resident of Tucson, passed away peacefully on September 30, 1998. He was born June 6, 1916 in Iowa. He joined the U.S. Navy serving during World War II. He remained a member of the U.S. Naval Reserve (USNR) for several years. He is survived by his wife, Gwendolyn; sons, Eric D. of San Francisco, CA, Vincent J. of Tucson; a daughter, Janet H. of Provo, UT; and one granddaughter, Sarah R. of Phoenix, AZ. Friends may call from 5:00 p.m. until 7:00 p.m. on Monday, October 5, 1998 at <b>HEATHER MORTUARY</b>, 1040 N. Columbus Blvd. Funeral services will be at 11:00 a.m. at <b>HEATHER MORTUARY</b>, on Tuesday, October 6, 1998. Burial will be private at South Lawn Cemetery.<br><hr></td></tr></table>All material is copyrighted.</body></html>

Building Wrappers

Part of the problem is record separation.

The record-identification task in wrapper construction is nontrivial.

Previous Work
  Manual [AM97, GHR97, HGMC+97]
  Semi-automatic [Ade98, AK97a, AK97b, DEW97, KWD97, Sod97]

Our Work
  Automatic, with the following assumptions: the Web document
    has multiple records
    is in HTML
    contains at least one record-separator tag

Heuristic for Locating Groups of Records

<html><head><title>Classifieds</title></head>

<body bgcolor="#FFFFFF">

<table><tr><td>

<h1 align="left">Funeral Notices - </h1> October 1, 1998

<hr><b>Lemar K. Adamson</b><br> …

<b>BRING'S MEMORIAL CHAPEL</b>, … <br>

<hr>

Our beloved <b>Brian Fielding Frost</b>, …

<b>Howard Stake Center</b>, …

<b>Carrillo's Tucson Mortuary</b> ...<br>

<hr>

<b>Leonard Kenneth Gunther</b><br> …

<b>HEATHER MORTUARY</b>, …

<b>HEATHER MORTUARY</b>, ...<br>

<hr>

</td></tr></table>

All material is copyrighted.

</body></html>

Highest-count Tags (HT) Individual Heuristic

Observation: the candidate tag with the most appearances is a likely separator.

Rank the candidate tags based on number of appearances.
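
The ranking step can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the class `StartTagCounter`, the function `rank_by_count`, and the abbreviated `sample` string are hypothetical names, and ties are broken arbitrarily by input order. The sample keeps the same tag counts as the funeral-notice document (4 hr, 8 b, 5 br).

```python
from collections import Counter
from html.parser import HTMLParser


class StartTagCounter(HTMLParser):
    """Counts how often each start tag appears in a document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def rank_by_count(html, candidates):
    """Return [(tag, rank)], rank 1 going to the most frequent candidate."""
    counter = StartTagCounter()
    counter.feed(html)
    # sorted() is stable, so tied tags keep their input order (a simplification).
    ordered = sorted(candidates, key=lambda tag: -counter.counts[tag])
    return [(tag, rank) for rank, tag in enumerate(ordered, start=1)]


# Abbreviated stand-in for the sample document (hypothetical text).
sample = (
    "<td><hr><b>A</b><br> text <b>X</b><br>"
    "<hr><b>B</b> text <b>Y</b> <b>Z</b><br>"
    "<hr><b>C</b><br> text <b>W</b> <b>W</b><br><hr></td>"
)
ht_ranking = rank_by_count(sample, ["hr", "b", "br"])
print(ht_ranking)
```

On this sample HT ranks b first and hr last, which shows why the heuristic is not reliable on its own.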

Identifiable “Separator” Tags (IT) Individual Heuristic

Observation: Both hand-created and tool-generated HTML documents tend to use a small set of common separator tags consistently.

Use a pre-determined list of likely HTML separator tags:

hr tr td a table p br h4 h1 strong b i
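
A minimal sketch of IT, assuming the list above is ordered from most to least likely separator (the slide does not state this explicitly). The names `SEPARATOR_PRIORITY` and `rank_by_identifiable_tag` are hypothetical, and leaving unlisted tags unranked is a choice made for this sketch.

```python
# Pre-determined separator list from the slide, assumed to be in priority order.
SEPARATOR_PRIORITY = ["hr", "tr", "td", "a", "table", "p",
                      "br", "h4", "h1", "strong", "b", "i"]


def rank_by_identifiable_tag(candidates):
    """Rank candidate tags by their position in the pre-determined list."""
    wanted = set(candidates)
    # Candidates that are not in the list are left unranked in this sketch.
    known = [tag for tag in SEPARATOR_PRIORITY if tag in wanted]
    return [(tag, rank) for rank, tag in enumerate(known, start=1)]


it_ranking = rank_by_identifiable_tag(["b", "br", "hr"])
print(it_ranking)
```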

Standard Deviation (SD) Individual Heuristic

Observation: when multiple records about an entity appear in a document, the records are typically about the same size.

The candidate tag with the minimum standard deviation based on the size of the plain text between identical tags tends to be the separator.
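
One way to compute this, sketched under assumptions: text size is measured in characters, the population standard deviation is used, and tags with fewer than two measured segments are skipped. None of these choices is specified on the slide. `SegmentSizer` and `rank_by_std_dev` are hypothetical names.

```python
from html.parser import HTMLParser
from statistics import pstdev


class SegmentSizer(HTMLParser):
    """Records the plain-text length between consecutive occurrences of a tag."""

    def __init__(self, candidates):
        super().__init__()
        self.running = {tag: None for tag in candidates}  # None until first seen
        self.sizes = {tag: [] for tag in candidates}

    def handle_starttag(self, tag, attrs):
        if tag in self.running:
            if self.running[tag] is not None:
                self.sizes[tag].append(self.running[tag])
            self.running[tag] = 0

    def handle_data(self, data):
        for tag in list(self.running):
            if self.running[tag] is not None:
                self.running[tag] += len(data)


def rank_by_std_dev(html, candidates):
    """Return [(tag, rank)], rank 1 going to the smallest standard deviation."""
    sizer = SegmentSizer(candidates)
    sizer.feed(html)
    sizer.close()
    spread = {tag: pstdev(sizes)
              for tag, sizes in sizer.sizes.items() if len(sizes) >= 2}
    ordered = sorted(spread, key=spread.get)
    return [(tag, rank) for rank, tag in enumerate(ordered, start=1)]


# Hypothetical input: records of similar length, <b> at varying positions.
sample = ("<hr><b>Al</b> 1234567890<hr>12345 <b>Bo</b> 67890"
          "<hr><b>Cy</b> 1234567890<hr>")
sd_ranking = rank_by_std_dev(sample, ["hr", "b"])
print(sd_ranking)
```

Here the text between hr tags has sizes 13, 14, 13, while the text between b tags has sizes 19 and 8, so hr is ranked first.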

Repeating-Tag Pattern (RP) Individual Heuristic

Observation: divisions between records often include several tags that consistently appear in the same order.

If an adjacent pair <a><b> occurs at a record boundary and <a> is the record separator, the count for this pair should be about the same as the number of occurrences of <a> alone.
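
The comparison between pair counts and single-tag counts can be sketched as follows. This is a simplified reading, not the authors' procedure: only start tags are paired, and any intervening text or end tag breaks adjacency. `PairCounter` and `pair_gaps` are hypothetical names. A gap of 0 means the pair occurs exactly as often as its first tag.

```python
from collections import Counter
from html.parser import HTMLParser


class PairCounter(HTMLParser):
    """Counts start tags and pairs of start tags with nothing between them."""

    def __init__(self):
        super().__init__()
        self.tag_counts = Counter()
        self.pair_counts = Counter()
        self.previous = None  # last start tag, if nothing has followed it yet

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] += 1
        if self.previous is not None:
            self.pair_counts[(self.previous, tag)] += 1
        self.previous = tag

    def handle_endtag(self, tag):
        self.previous = None  # assumption: an end tag breaks adjacency

    def handle_data(self, data):
        if data.strip():
            self.previous = None  # non-blank text breaks adjacency


def pair_gaps(html):
    """Map each adjacent pair to |count(pair) - count(first tag of the pair)|."""
    parser = PairCounter()
    parser.feed(html)
    parser.close()
    return {pair: abs(count - parser.tag_counts[pair[0]])
            for pair, count in parser.pair_counts.items()}


# Hypothetical input: every record starts with <hr><b>.
sample = ("<hr><b>A</b> one<br><i>x</i><hr><b>B</b> two"
          "<hr><b>C</b> three<br><i>y</i> <br>end")
gaps = pair_gaps(sample)
print(gaps)
```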

Ontology-Matching (OM) Individual Heuristic

Observation: One or more fields of a record (called record-identifying fields) appear once and only once in a record.

If we can locate a value for the field or even just an indication that the value exists, we can count the number of such occurrences.

To choose record-identifying fields, we
  limit the number of fields to be at least 3 and no more than 20% of the number of sets of objects in the ontology;
  choose fields with a 1-1 correspondence to the entity of interest over fields functionally dependent on the entity of interest;
  choose keyword indicators over identifiable values.
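
A rough sketch of the counting step, with several assumptions stated up front: the keyword patterns in `INDICATORS` are hypothetical stand-ins for ontology-derived indicators, the estimated record count is taken as the mean of the indicator counts, and candidate tags are ranked by how close their count is to that estimate. The slide itself only says that the occurrences can be counted.

```python
import re
from statistics import mean

# Hypothetical keyword indicators for three record-identifying fields.
INDICATORS = {
    "DeathDate": r"\bdied\b|\bpassed away\b",
    "FuneralDate": r"\bfuneral services?\b",
    "Interment": r"\bburial\b|\binterment\b",
}


def estimate_record_count(text, indicators=INDICATORS):
    """Average the number of matches found for each record-identifying field."""
    counts = [len(re.findall(pattern, text, flags=re.IGNORECASE))
              for pattern in indicators.values()]
    return mean(counts)


def rank_by_ontology(tag_counts, estimate):
    """Rank tags by how close their count is to the estimated record count."""
    ordered = sorted(tag_counts, key=lambda tag: abs(tag_counts[tag] - estimate))
    return [(tag, rank) for rank, tag in enumerate(ordered, start=1)]


# Abbreviated plain text of the sample document.
text = (
    "Funeral Notices - October 1, 1998 "
    "Lemar K. Adamson age 84, of Tucson, died September 30, 1998. "
    "Funeral service at 10:00 a.m. Monday. Burial in City Cemetery. "
    "Our beloved Brian Fielding Frost, age 41, passed away Tuesday morning. "
    "Funeral services will be held at 12 noon Tuesday. "
    "Interment at Holy Hope Cemetery. "
    "Leonard Kenneth Gunther age 82 passed away peacefully. "
    "Funeral services will be at 11:00 a.m. "
    "Burial will be private at South Lawn Cemetery."
)
estimate = estimate_record_count(text)
om_ranking = rank_by_ontology({"hr": 4, "br": 5, "b": 8}, estimate)
print(estimate, om_ranking)
```

Each indicator matches three times in this text, so the estimate is three records, and hr (4 occurrences) is the closest candidate.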

Example: Ontology-Matching (OM) (Continued)

<td>

<h1 align="left">Funeral Notices - </h1> October 1, 1998

<hr>

<b>Lemar K. Adamson</b><br> ... died September 30, 1998.

... Funeral service at 10:00 a.m. Monday, October 5, 1998 ... Burial in City

Cemetery. Arrangements by <b>BRING'S MEMORIAL CHAPEL</b>, … <br>

<hr>

Our beloved <b>Brian Fielding Frost</b>, age 41, passed away Tuesday morning,

September 30, 1998, ... He was born January 12, 1957 … Funeral services will be

held at 12 noon Tuesday, October 6, 1998 in the <b>Howard Stake Center</b>, …

at <b> Carrillo's Tucson Mortuary</b>, ... Interment at Holy Hope Cemetery.<br>

<hr>

<b>Leonard Kenneth Gunther</b><br> ... passed away peacefully on September

30, 1998. He was born June 6, 1916 … at <b>HEATHER MORTUARY</b>, ...

Funeral services will be at 11:00 a.m. at <b>HEATHER MORTUARY</b>, on

Tuesday, October 6, 1998. Burial will be private at South Lawn Cemetery.<br>

<hr>

</td>

Combined Heuristic

Each individual heuristic is independent of the others but works well only for some particular Web documents.

Certainty Measure (Stanford certainty theory)

Suppose CF(E1) and CF(E2) are two certainty factors for the same observation B. Then the compound certainty factor of B is CF(E1) + CF(E2) - CF(E1) * CF(E2).

By using this rule repeatedly, it is possible to combine the results of evidence from any number of independent events.
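
The combination rule and its repeated application can be written directly. The function names `combine` and `combine_all` are hypothetical. Starting the fold from 0 is safe because combining 0 with x gives x.

```python
from functools import reduce


def combine(cf1, cf2):
    """Compound certainty factor of two pieces of evidence for one observation."""
    return cf1 + cf2 - cf1 * cf2


def combine_all(factors):
    """Apply the rule repeatedly over any number of certainty factors."""
    return reduce(combine, factors, 0.0)


print(combine(0.5, 0.5))
print(combine_all([0.5, 0.5, 0.5]))
```

Applied repeatedly, the rule is equivalent to 1 minus the product of (1 - CF) over all factors, so the order of combination does not matter.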

Initial Experiments Combined Heuristic

On-line Newspaper               URL
The Salt Lake Tribune           http://www.sltrib.com
The Arizona Daily Star          http://www.azstarnet.com
The Houston Chronicle           http://www.chron.com
The San Francisco Chronicle     http://www.sfgate.com
The Seattle Times               http://www.seatimes.com
GoCincinnati.com                http://classifinder.gocinci.net/
The Standard Times              http://www.s-t.com/
The Detroit Newspapers          http://www.dnps.com
The Connecticut Post            http://www.connpost.com
Access Atlanta                  http://www.accessatlanta.com

Results for Obituaries and Car Ads Initial Experiments

Heuristic    Ranking
Approach     1      2      3      4
OM           83%    17%    0%     0%
RP           83%    7%     10%    0%
SD           59%    27%    14%    0%
IT           92%    8%     0%     0%
HT           58%    23%    17%    2%

Heuristic    Ranking
Approach     1      2      3      4
OM           86%    8%     4%     2%
RP           72%    18%    8%     2%
SD           72%    18%    10%    0%
IT           100%   0%     0%     0%
HT           40%    42%    16%    2%

Certainty Factors Initial Experiments

Heuristic    Ranking
Approach     1         2         3         4
OM           84.50%    12.50%    2.00%     1.00%
RP           77.50%    12.50%    9.00%     1.00%
SD           65.50%    22.50%    12.00%    0.00%
IT           96.00%    4.00%     0.00%     0.00%
HT           49.00%    32.50%    16.50%    2.00%

Experimental results for all the combined heuristics

Compound Heuristic    Success Rate
OR                    85.83%
OS                    88.00%
OI                    95.00%
…                     …
ORS                   81.50%
ORI                   93.33%
ORH                   84.83%
…                     …
ORSI                  100.00%
ORSH                  82.50%
ORIH                  100.00%
OSIH                  95.00%
RSIH                  100.00%
ORSIH                 100.00%

Record-Boundary Discovery Algorithm

Input: A Web document D.

Output: The record separator of D.

Step 1: Create the tag tree T of D.

Step 2: Locate the highest-fan-out subtree HF in T.

Step 3: Extract the set of candidate tags CT from HF.

Step 4: Apply the five individual heuristics OM, SD, IT, HT, and RP to CT.

Step 5: For each candidate tag in CT, apply Stanford certainty theory to the results of all five heuristics (ORSIH).

Step 6: Choose the candidate tag with the highest compound certainty factor as the record separator for D.
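
The first three steps can be sketched with Python's standard `html.parser`. This is an illustration under assumptions, not the authors' code: `Node`, `TagTreeBuilder`, `highest_fan_out`, and `candidate_tags` are hypothetical names, tags such as hr and br are treated as childless, and the `min_count` threshold that drops rarely occurring child tags is an assumption introduced for this sketch.

```python
from collections import Counter
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TagTreeBuilder(HTMLParser):
    """Step 1: build a tag tree from the start and end tags of a document."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this name, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent


def highest_fan_out(root):
    """Step 2: return the node with the largest number of children."""
    best, stack = root, [root]
    while stack:
        node = stack.pop()
        if len(node.children) > len(best.children):
            best = node
        stack.extend(node.children)
    return best


def candidate_tags(html, min_count=2):
    """Step 3: child tags of the highest-fan-out node seen at least min_count times."""
    builder = TagTreeBuilder()
    builder.feed(html)
    subtree = highest_fan_out(builder.root)
    counts = Counter(child.tag for child in subtree.children)
    return subtree.tag, sorted(t for t, n in counts.items() if n >= min_count)


# Abbreviated stand-in for the sample document (hypothetical text).
sample = (
    "<html><head><title>Classifieds</title></head><body><table><tr><td>"
    "<h1>Funeral Notices</h1><hr><b>A</b><br> text <b>X</b><br>"
    "<hr>Our beloved <b>B</b> text<br><hr></td></tr></table>"
    "All material is copyrighted.</body></html>"
)
result = candidate_tags(sample)
print(result)
```

Each tag is visited once while building and once while searching the tree, which is consistent with a linear-time process.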

Example

The results of applying the five individual heuristics to the sample document presented earlier are as follows:

OML = [(hr, 1), (br, 2), (b, 3)]

RPL = [(hr, 1), (br, 2), (b, 3)]

SDL = [(hr, 1), (b, 2), (br, 3)]

ITL = [(hr, 1), (br, 2), (b, 3)]

HTL = [(b, 1), (br, 2), (hr, 3)]

Combining these five individual heuristics together yields:

ORSIH: [(hr, 99.96%), (b, 64.75%), (br, 56.34%)]

Hence, ‘hr’ is chosen as the record separator.
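
The compound figures above can be reproduced from the certainty-factor table and the five rankings. The sketch below assumes that a tag ranked r by a heuristic receives that heuristic's certainty factor for rank r; the names `CERTAINTY`, `RANKINGS`, and `compound_certainty` are hypothetical.

```python
# Certainty factors per heuristic for ranks 1 to 4, from the table shown earlier.
CERTAINTY = {
    "OM": [0.845, 0.125, 0.020, 0.010],
    "RP": [0.775, 0.125, 0.090, 0.010],
    "SD": [0.655, 0.225, 0.120, 0.000],
    "IT": [0.960, 0.040, 0.000, 0.000],
    "HT": [0.490, 0.325, 0.165, 0.020],
}

# Rankings produced by the five heuristics for the sample document.
RANKINGS = {
    "OM": [("hr", 1), ("br", 2), ("b", 3)],
    "RP": [("hr", 1), ("br", 2), ("b", 3)],
    "SD": [("hr", 1), ("b", 2), ("br", 3)],
    "IT": [("hr", 1), ("br", 2), ("b", 3)],
    "HT": [("b", 1), ("br", 2), ("hr", 3)],
}


def compound_certainty(rankings, certainty):
    """Combine each tag's certainty factors with CF1 + CF2 - CF1 * CF2."""
    combined = {}
    for heuristic, ranked in rankings.items():
        for tag, rank in ranked:
            cf = certainty[heuristic][rank - 1]
            previous = combined.get(tag, 0.0)
            combined[tag] = previous + cf - previous * cf
    return sorted(combined.items(), key=lambda item: -item[1])


orsih = [(tag, round(100 * cf, 2))
         for tag, cf in compound_certainty(RANKINGS, CERTAINTY)]
print(orsih)  # [('hr', 99.96), ('b', 64.75), ('br', 56.34)]
```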

Experimental Results: Obituaries and Car Ads

On-line Newspaper              URL                                   OM  RP  SD  IT  HT  ORSIH
Alameda Newspaper              http://www.adone.com/alameda          1   1   1   1   1   1
Idaho State Journal            http://www.journalnet.com             1   1   2   1   2   1
Sacramento Bee                 http://www.sacbee.com                 1   1   1   1   1   1
Tampa Tribune                  http://www.tampatrib.com              1   1   1   1   1   1
Shoals Timesdaily              http://www.timesdaily.com             1   1   1   1   2   1

On-line Newspaper              URL                                   OM  RP  SD  IT  HT  ORSIH
Arkansas Democrat-Gazette      http://www.ardemgaz.com               1   1   1   1   2   1
Sioux City Journal             http://www.siouxcityjournal.com       1   2   2   1   4   1
Knoxville News                 http://www.knoxnews.com               1   1   1   1   1   1
Lincoln Journal Star           http://www.nebweb.com                 1   1   1   1   1   1
Reno Gazette-Journal           http://www.nevadanet.com/renogazette  3   3   1   1   3   1

Experimental Results: Job Ads and Course Descriptions

On-line Newspaper              URL                                   OM  RP  SD  IT  HT  ORSIH
Arkansas Democrat-Gazette      http://www.ardemgaz.com               1   1   1   1   2   1
Sioux City Journal             http://www.siouxcityjournal.com       1   2   2   1   4   1
Knoxville News                 http://www.knoxnews.com               1   1   1   1   1   1
Lincoln Journal Star           http://www.nebweb.com                 1   1   1   1   1   1
Reno Gazette-Journal           http://www.nevadanet.com/renogazette  3   3   1   1   3   1

University                     URL                                   OM  RP  SD  IT  HT  ORSIH
Brigham Young University       http://www.byu.edu                    2   2   1   1   1   1
MIT                            http://registrar.mit.edu              1   1   1   1   2   1
Kansas State University        http://www.ksu.edu                    1   1   2   2   2   1
USC                            http://www.usc.edu                    1   1   2   1   1   1
University of Texas at Austin  http://www.utexas.edu                 1   2   2   1   1   1

Experimental Results: Success Rates

Heuristic Approach    Success Rate
OM                    80%
RP                    75%
SD                    65%
IT                    95%
HT                    45%
ORSIH                 100%

Conclusions

We described a heuristic approach to discover record boundaries in unstructured Web documents.

Main contribution: we provided a set of individual heuristics and a way to combine these heuristics into a method for discovering record boundaries.

Under normal assumptions, the process is O(n), where n is the size of a document.

In the experiments we conducted, the combined heuristic (ORSIH) ranked the correct record separator first for every test document, a success rate of 100%.

