Introduction
The amount of Web data has increased dramatically.
Two commonly used methods to retrieve data from the Web: browsing and keyword searching.
Limitations: inefficient; easy to get lost while tracing links on the Web; unwanted data.
Database Retrieval Techniques
Problem: most Web data is unstructured or semistructured and cannot be queried using traditional query languages.
A Sample Web Document
<html><head><title>Classifieds</title></head><body bgcolor="#FFFFFF"><table><tr><td><h1 align="left">Funeral Notices - </h1> October 1, 1998<hr><b>Lemar K. Adamson</b><br> age 84, of Tucson, died September 30, 1998. He is survived by wife, Cindy; daughters, Elvia, Gloria, Irene, Isabel, Jewel, and Jessica; sons, Paul, John, Jeffery, and Louis; brothers, Kirk, Justin, Ivan, Hubert and Grover. Funeral service at 10:00 a.m. Monday, October 5, 1998 at Silverbell Ward, 1540 E. Linden. Burial in City Cemetery. Friends may call from 9:00 a.m. to 10:00 a.m. Monday, at the church. Arrangements by <b>BRING'S MEMORIAL CHAPEL</b>, 236 S. Scott<br><hr>Our beloved <b>Brian Fielding Frost</b>, age 41, passed away Tuesday morning, September 30, 1998, due to injuries sustained in an automobile accident. He was born January 12, 1957 in Salt Lake City, to Donald Fielding and Helen Glade Frost. He married Susan Fox on June 1, 1981. He is survived by Susan; sons Alfred, Joseph; parents, and two sisters, Anne (Dale) Elkins and Sally (Kent) Britton. Funeral services will be held at 12 noon Tuesday, October 6, 1998 in the <b>Howard Stake Center</b>, 350 South 1600 East. Friends may call 5-7pm. Monday at <b> Carrillo's Tucson Mortuary</b>, 3401 S. Highland Drive. Interment at Holy Hope Cemetery.<br><hr><b>Leonard Kenneth Gunther</b><br> age 82. A resident of Tucson, passed away peacefully on September 30, 1998. He was born June 6, 1916 in Iowa. He joined the U.S. Navy serving during World War II. He remained a member of the U.S. Naval Reserve (USNR) for several years. He is survived by his wife, Gwendolyn; sons, Eric D. of San Francisco, CA, Vincent J. of Tucson; a daughter, Janet H. of Provo, UT; and one granddaughter, Sarah R. of Phoenix, AZ. Friends may call from 5:00 p.m. until 7:00 p.m. on Monday, October 5, 1998 at <b>HEATHER MORTUARY</b>, 1040 N. Columbus Blvd. Funeral services will be at 11:00 a.m. at <b>HEATHER MORTUARY</b>, on Tuesday, October 6, 1998.
Burial will be private at South Lawn Cemetery.<br><hr></td></tr></table>All material is copyrighted.</body></html>
Building Wrappers
Part of the problem is record separation.
The record-identification task in wrapper construction is nontrivial.
Previous Work
Manual: [AM97, GHR97, HGMC+97]
Semi-automatic: [Ade98, AK97a, AK97b, DEW97, KWD97, Sod97]
Our Work
Automatic, under the following assumptions: the Web document has multiple records, is in HTML, and contains at least one record-separator tag.
Heuristic for Locating Groups of Records
<html><head><title>Classifieds</title></head>
<body bgcolor="#FFFFFF">
<table><tr><td>
<h1 align="left">Funeral Notices - </h1> October 1, 1998
<hr><b>Lemar K. Adamson</b><br> …
<b>BRING'S MEMORIAL CHAPEL</b>, … <br>
<hr>
Our beloved <b>Brian Fielding Frost</b>, …
<b>Howard Stake Center</b>, …
<b>Carrillo's Tucson Mortuary</b> ...<br>
<hr>
<b>Leonard Kenneth Gunther</b><br> …
<b>HEATHER MORTUARY</b>, …
<b>HEATHER MORTUARY</b>, ...<br>
<hr>
</td></tr></table>
All material is copyrighted.
</body></html>
Highest-count Tags (HT) Individual Heuristic
Observation: the candidate tag with the most appearances is a likely separator.
Rank the candidate tags based on number of appearances.
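The ranking step can be sketched in a few lines of Python. This is a minimal illustration, assuming start tags are found with a simple regular expression; the function name `rank_by_count` and the abbreviated `snippet` (record text elided, tag layout taken from the sample document) are illustrative and not part of the original work.

```python
import re
from collections import Counter

def rank_by_count(html, candidates):
    """Rank candidate tags by number of start-tag appearances, most frequent first."""
    # Only start tags are matched; end tags such as </b> begin with "</" and are skipped.
    names = [m.lower() for m in re.findall(r"<([a-zA-Z][a-zA-Z0-9]*)", html)]
    counts = Counter(n for n in names if n in candidates)
    # Ties are broken alphabetically, which is an arbitrary choice made for this sketch.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

# Tag layout of the record area in the sample document, with the text elided.
snippet = (
    "<hr><b>A</b><br> ... <b>B</b>, ... <br>"
    "<hr> ... <b>C</b>, ... <b>D</b>, ... <b>E</b> ...<br>"
    "<hr><b>F</b><br> ... <b>G</b>, ... <b>H</b>, ...<br>"
    "<hr>"
)
print(rank_by_count(snippet, {"hr", "b", "br"}))
# [('b', 8), ('br', 5), ('hr', 4)]
```

On this layout the heuristic ranks `b` first even though `hr` is the separator, which is one reason HT is not used on its own.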
Identifiable “Separator” Tags (IT) Individual Heuristic
Observation: both hand-created and tool-generated HTML documents tend to use a few common separator tags consistently.
Use a pre-determined list of likely HTML separator tags:
hr tr td a table p br h4 h1 strong b i
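A minimal sketch of this lookup follows. It assumes the list above is ordered from most to least likely separator, an assumption that agrees with the ITL ranking shown in the later example. The handling of tags that are not on the list is also an assumption, since the slide does not cover that case.

```python
# Separator tags copied from the slide, assumed to be in priority order.
SEPARATOR_PRIORITY = ["hr", "tr", "td", "a", "table", "p", "br", "h4", "h1", "strong", "b", "i"]

def rank_by_identifiable_tag(candidates):
    """Order candidate tags by their position in the pre-determined list."""
    known = [t for t in SEPARATOR_PRIORITY if t in candidates]
    # Assumption: candidates missing from the list are ranked last, alphabetically.
    unknown = sorted(t for t in candidates if t not in SEPARATOR_PRIORITY)
    return known + unknown

print(rank_by_identifiable_tag({"b", "br", "hr"}))  # ['hr', 'br', 'b']
```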
Standard Deviation (SD) Individual Heuristic
Observation: when multiple records about an entity appear in a document, the records are typically about the same size.
The candidate tag with the minimum standard deviation based on the size of the plain text between identical tags tends to be the separator.
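The idea can be sketched as follows. This is an illustration only: it assumes the population standard deviation (`statistics.pstdev`), measures size as the character count of the text left after stripping markup, and uses a small hypothetical document whose three `hr`-delimited records all contain 23 characters of plain text.

```python
import re
from statistics import pstdev

def rank_by_stddev(html, candidates):
    """Rank candidate tags by the standard deviation of text sizes between them."""
    results = []
    for tag in sorted(candidates):
        # Split the document at every start tag of this kind.
        parts = re.split(rf"<{tag}\b[^>]*>", html, flags=re.IGNORECASE)
        # Keep only the segments that lie between two occurrences of the tag.
        segments = parts[1:-1]
        if len(segments) < 2:
            continue  # too few intervals for a meaningful deviation
        # Measure plain-text size by removing any remaining markup.
        sizes = [len(re.sub(r"<[^>]+>", "", s)) for s in segments]
        results.append((tag, pstdev(sizes)))
    return sorted(results, key=lambda kv: kv[1])

# Hypothetical records; each holds 23 characters of plain text.
r1 = "<b>Ann</b><br>aaaaaaaaaaaa<b>Hall</b>aaaa<br>"
r2 = "<b>Bob</b>bbbbbbbbbbbbbbbbbbbb<br>"
r3 = "<b>Cat</b><br>cccc<b>Home</b>cccccccccccc<br>"
doc = "<hr>" + r1 + "<hr>" + r2 + "<hr>" + r3 + "<hr>"

ranking = rank_by_stddev(doc, {"hr", "b", "br"})
print(ranking[0][0])  # hr
```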
Repeating-Tag Pattern (RP) Individual Heuristic
Observation: divisions between records often include several tags that consistently appear in the same order.
If an adjacent pair <a><b> occurs at a record boundary and <a> is the record separator, the count for this pair should be about the same as the number of occurrences of <a> alone.
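One way to turn this observation into a ranking is sketched below. The scoring rule is an assumption made for illustration: each candidate tag is scored by the gap between its own count and the count of its most frequent immediately-following start tag, and a smaller gap ranks higher. The slides do not give the exact formula, and the document used here is hypothetical.

```python
import re
from collections import Counter

def rank_by_repeating_pattern(html, candidates):
    """Rank tags by how closely their best adjacent-pair count matches their own count."""
    # Tokenize into tags and text runs; whitespace-only text is ignored.
    tokens = [t for t in re.findall(r"<[^>]+>|[^<]+", html) if t.strip()]

    def start_tag_name(token):
        m = re.match(r"<([a-zA-Z][a-zA-Z0-9]*)", token)
        return m.group(1).lower() if m else None  # None for text and end tags

    names = [start_tag_name(t) for t in tokens]
    tag_counts = Counter(n for n in names if n in candidates)
    # A pair counts only when two start tags are adjacent with nothing between them.
    pair_counts = Counter(
        (a, b) for a, b in zip(names, names[1:]) if a in candidates and b is not None
    )
    scores = []
    for tag, count in tag_counts.items():
        best_pair = max((c for (a, _), c in pair_counts.items() if a == tag), default=0)
        scores.append((tag, count - best_pair))
    return sorted(scores, key=lambda kv: (kv[1], kv[0]))

doc = (
    "<hr><b>Ann</b><br> aaa "
    "<hr><b>Bob</b> bbb <b>Hall</b><br> bb "
    "<hr><b>Cat</b><br> ccc"
)
print(rank_by_repeating_pattern(doc, {"hr", "b", "br"}))
# [('hr', 0), ('br', 3), ('b', 4)]
```

Here every `hr` is immediately followed by `b`, so the pair count equals the `hr` count and `hr` ranks first.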
Ontology-Matching (OM) Individual Heuristic
Observation: one or more fields of a record (called record-identifying fields) appear once and only once in a record.
If we can locate a value for the field, or even just an indication that the value exists, we can count the number of such occurrences.
To choose record-identifying fields, we limit the number of fields to at least 3 and no more than 20% of the number of sets of objects in the ontology; prefer fields with a 1-1 correspondence to the entity of interest over fields functionally dependent on the entity of interest; and prefer keyword indicators over identifiable values.
Example: Ontology-Matching (OM) (Continued)
<td>
<h1 align="left">Funeral Notices - </h1> October 1, 1998
<hr>
<b>Lemar K. Adamson</b><br> ... died September 30, 1998.
... Funeral service at 10:00 a.m. Monday, October 5, 1998 ... Burial in City
Cemetery. Arrangements by <b>BRING'S MEMORIAL CHAPEL</b>, … <br>
<hr>
Our beloved <b>Brian Fielding Frost</b>, age 41, passed away Tuesday morning,
September 30, 1998, ... He was born January 12, 1957 … Funeral services will be
held at 12 noon Tuesday, October 6, 1998 in the <b>Howard Stake Center</b>, …
at <b> Carrillo's Tucson Mortuary</b>, ... Interment at Holy Hope Cemetery.<br>
<hr>
<b>Leonard Kenneth Gunther</b><br> ... passed away peacefully on September
30, 1998. He was born June 6, 1916 … at <b>HEATHER MORTUARY</b>, ...
Funeral services will be at 11:00 a.m. at <b>HEATHER MORTUARY</b>, on
Tuesday, October 6, 1998. Burial will be private at South Lawn Cemetery.<br>
<hr>
</td>
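The counting idea can be sketched as follows. The keyword patterns in `INDICATORS` are hypothetical stand-ins for record-identifying fields of an obituary ontology. Averaging the indicator counts into a record estimate and ranking tags by how close their counts come to that estimate are assumptions made for this illustration; the slides only state that the occurrences are counted. The tag counts are those of the sample document's record area.

```python
import re

# Hypothetical keyword indicators for three record-identifying fields.
INDICATORS = {
    "death": r"\b(died|passed away)\b",
    "funeral": r"\bfuneral services?\b",
    "burial": r"\b(burial|interment)\b",
}

def estimate_record_count(text):
    """Average the number of matches over all indicators."""
    counts = [len(re.findall(p, text, flags=re.IGNORECASE)) for p in INDICATORS.values()]
    return sum(counts) / len(counts)

def rank_by_ontology_match(tag_counts, estimate):
    """Rank tags by how close their count is to the estimated number of records."""
    return sorted(tag_counts, key=lambda tag: (abs(tag_counts[tag] - estimate), tag))

# Abbreviated text of the three sample records.
text = (
    "... died September 30, 1998. ... Funeral service at 10:00 a.m. ... Burial in City Cemetery. "
    "... passed away Tuesday morning, ... Funeral services will be held ... Interment at Holy Hope Cemetery. "
    "... passed away peacefully ... Funeral services will be at 11:00 a.m. ... Burial will be private ..."
)
estimate = estimate_record_count(text)
tag_counts = {"hr": 4, "br": 5, "b": 8}
print(estimate, rank_by_ontology_match(tag_counts, estimate))
# 3.0 ['hr', 'br', 'b']
```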
Combined Heuristic
Each individual heuristic is independent of the others but works well only for some particular Web documents.
Certainty Measure (Stanford certainty theory)
If CF(E1) and CF(E2) are the certainty factors of two independent pieces of evidence for the same observation B, then the compound certainty factor of B is CF(E1) + CF(E2) - CF(E1) * CF(E2).
By applying this rule repeatedly, it is possible to combine the evidence from any number of independent events.
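The combination rule translates directly into code. The sketch below uses illustrative names (`combine`, `combine_all`) and arbitrary input values. Repeated application is a fold over the list of factors, and the result equals 1 minus the product of (1 - CF) over all factors, so the order of combination does not matter.

```python
from functools import reduce

def combine(cf1, cf2):
    """Compound certainty factor of two independent pieces of evidence."""
    return cf1 + cf2 - cf1 * cf2

def combine_all(factors):
    """Apply the rule repeatedly; starting from 0.0 works because combine(0, x) == x."""
    return reduce(combine, factors, 0.0)

print(combine(0.5, 0.5))              # 0.75
print(combine_all([0.5, 0.5, 0.5]))   # 0.875
```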
Initial Experiments Combined Heuristic
On-line Newspapers URL
The Salt Lake Tribune http://www.sltrib.com
The Arizona Daily Star http://www.azstarnet.com
The Houston Chronicle http://www.chron.com
The San Francisco Chronicle http://www.sfgate.com
The Seattle Times http://www.seatimes.com
GoCincinnati.com http://classifinder.gocinci.net/
The Standard Times http://www.s-t.com/
The Detroit Newspapers http://www.dnps.com
The Connecticut Post http://www.connpost.com
Access Atlanta http://www.accessatlanta.com
Results for Obituaries and Car Ads Initial Experiments
Heuristic Approach  Rank 1  Rank 2  Rank 3  Rank 4
OM                  83%     17%     0%      0%
RP                  83%     7%      10%     0%
SD                  59%     27%     14%     0%
IT                  92%     8%      0%      0%
HT                  58%     23%     17%     2%
Heuristic Approach  Rank 1  Rank 2  Rank 3  Rank 4
OM                  86%     8%      4%      2%
RP                  72%     18%     8%      2%
SD                  72%     18%     10%     0%
IT                  100%    0%      0%      0%
HT                  40%     42%     16%     2%
Certainty Factors Initial Experiments
Heuristic Approach  Rank 1   Rank 2   Rank 3   Rank 4
OM                  84.50%   12.50%   2.00%    1.00%
RP                  77.50%   12.50%   9.00%    1.00%
SD                  65.50%   22.50%   12.00%   0.00%
IT                  96.00%   4.00%    0.00%    0.00%
HT                  49.00%   32.50%   16.50%   2.00%
Experimental results for all the combined heuristics
Compound Heuristic Success Rate
OR 85.83%
OS 88.00%
OI 95.00%
… …
ORS 81.50%
ORI 93.33%
ORH 84.83%
… …
ORSI 100.00%
ORSH 82.50%
ORIH 100.00%
OSIH 95.00%
RSIH 100.00%
ORSIH 100.00%
Record-Boundary Discovery Algorithm
Input: A Web document D.
Output: The record separator of D.
Step 1: Create the tag tree T of D.
Step 2: Locate the highest-fan-out subtree HF in T.
Step 3: Extract the set of candidate tags CT from HF.
Step 4: Apply the five individual heuristics OM, SD, IT, HT, and RP to CT.
Step 5: For each candidate tag in CT, apply Stanford certainty theory to the results of all five heuristics (ORSIH).
Step 6: Choose the candidate tag with the highest compound certainty factor as the record separator for D.
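Steps 1 to 3 can be sketched with Python's standard `html.parser` module. The class and function names are illustrative, and the set of void tags is an assumption needed so that tags such as `hr` and `br`, which have no end tag, become leaves. The slides do not say how CT is filtered, so this sketch simply returns every child tag of HF with its count.

```python
from collections import Counter
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}  # elements without end tags

class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.children = tag, parent, []

class TagTreeBuilder(HTMLParser):
    """Step 1: build a tag tree from the start and end tags of the document."""
    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # descend into non-void elements

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this name; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

def highest_fan_out(root):
    """Step 2: return the node with the most children."""
    best, stack = root, [root]
    while stack:
        node = stack.pop()
        if len(node.children) > len(best.children):
            best = node
        stack.extend(node.children)
    return best

def candidate_tags(html):
    """Step 3: count the child tags of the highest-fan-out node."""
    builder = TagTreeBuilder()
    builder.feed(html)
    hf = highest_fan_out(builder.root)
    return hf.tag, Counter(child.tag for child in hf.children)

# The sample document with the record text elided.
sample = (
    '<html><head><title>Classifieds</title></head><body bgcolor="#FFFFFF"><table><tr><td>'
    '<h1 align="left">Funeral Notices - </h1> October 1, 1998'
    "<hr><b>Lemar K. Adamson</b><br> ... <b>BRING'S MEMORIAL CHAPEL</b>, ... <br>"
    "<hr>Our beloved <b>Brian Fielding Frost</b>, ... <b>Howard Stake Center</b>, ... "
    "<b>Carrillo's Tucson Mortuary</b> ...<br>"
    "<hr><b>Leonard Kenneth Gunther</b><br> ... <b>HEATHER MORTUARY</b>, ... "
    "<b>HEATHER MORTUARY</b>, ...<br>"
    "<hr></td></tr></table>All material is copyrighted.</body></html>"
)
tag, counts = candidate_tags(sample)
print(tag, dict(counts))
```

On the sample document the highest-fan-out node is `td`, whose children are 8 `b`, 5 `br`, 4 `hr`, and 1 `h1` tags.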
Example
The results of applying the five individual heuristics to the sample document presented earlier are as follows:
OML = [(hr, 1), (br, 2), (b, 3)]
RPL = [(hr, 1), (br, 2), (b, 3)]
SDL = [(hr, 1), (b, 2), (br, 3)]
ITL = [(hr, 1), (br, 2), (b, 3)]
HTL = [(b, 1), (br, 2), (hr, 3)]
Combining these five individual heuristics together yields:
ORSIH: [(hr, 99.96%), (b, 64.75%), (br, 56.34%)]
Hence, ‘hr’ is chosen as the record separator.
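The compound values can be checked by hand or with a short script. The sketch below takes the per-rank certainty factors from the Certainty Factors table and the rankings from the five lists above; the variable and function names are illustrative.

```python
from functools import reduce

# Certainty factors for ranks 1 to 4, copied from the Certainty Factors table.
CF = {
    "OM": [0.845, 0.125, 0.020, 0.010],
    "RP": [0.775, 0.125, 0.090, 0.010],
    "SD": [0.655, 0.225, 0.120, 0.000],
    "IT": [0.960, 0.040, 0.000, 0.000],
    "HT": [0.490, 0.325, 0.165, 0.020],
}

# Rank assigned to each tag by each heuristic in the example.
RANKS = {
    "OM": {"hr": 1, "br": 2, "b": 3},
    "RP": {"hr": 1, "br": 2, "b": 3},
    "SD": {"hr": 1, "b": 2, "br": 3},
    "IT": {"hr": 1, "br": 2, "b": 3},
    "HT": {"b": 1, "br": 2, "hr": 3},
}

def compound(tag):
    """Combine the five certainty factors for one tag with CF1 + CF2 - CF1 * CF2."""
    factors = [CF[h][RANKS[h][tag] - 1] for h in CF]
    return reduce(lambda a, b: a + b - a * b, factors, 0.0)

scores = sorted(((t, compound(t)) for t in ("hr", "br", "b")), key=lambda kv: -kv[1])
for tag, cf in scores:
    print(f"{tag}: {cf:.2%}")
# hr: 99.96%
# b: 64.75%
# br: 56.34%
```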
Experimental Results: Obituaries and Car Ads
On-line Newspaper URL OM RP SD IT HT ORSIH
Alameda Newspaper http://www.adone.com/alameda 1 1 1 1 1 1
Idaho State Journal http://www.journalnet.com 1 1 2 1 2 1
Sacramento Bee http://www.sacbee.com 1 1 1 1 1 1
Tampa Tribune http://www.tampatrib.com 1 1 1 1 1 1
Shoals Timesdaily http://www.timesdaily.com 1 1 1 1 2 1
On-line Newspaper URL OM RP SD IT HT ORSIH
Arkansas Democrat-Gazette http://www.ardemgaz.com 1 1 1 1 2 1
Sioux City Journal http://www.siouxcityjournal.com 1 2 2 1 4 1
Knoxville News http://www.knoxnews.com 1 1 1 1 1 1
Lincoln Journal Star http://www.nebweb.com 1 1 1 1 1 1
Reno Gazette-Journal http://www.nevadanet.com/renogazette 3 3 1 1 3 1
Experimental Results: Job Ads and Course Descriptions
On-line Newspaper URL OM RP SD IT HT ORSIH
Arkansas Democrat-Gazette http://www.ardemgaz.com 1 1 1 1 2 1
Sioux City Journal http://www.siouxcityjournal.com 1 2 2 1 4 1
Knoxville News http://www.knoxnews.com 1 1 1 1 1 1
Lincoln Journal Star http://www.nebweb.com 1 1 1 1 1 1
Reno Gazette-Journal http://www.nevadanet.com/renogazette 3 3 1 1 3 1
University URL OM RP SD IT HT ORSIH
Brigham Young University http://www.byu.edu 2 2 1 1 1 1
MIT http://registrar.mit.edu 1 1 1 1 2 1
Kansas State University http://www.ksu.edu 1 1 2 2 2 1
USC http://www.usc.edu 1 1 2 1 1 1
University of Texas at Austin http://www.utexas.edu 1 2 2 1 1 1
Experimental Results: Success Rates
Heuristic Approach Success Rate
OM 80%
RP 75%
SD 65%
IT 95%
HT 45%
ORSIH 100%
Conclusions
We described a heuristic approach to discover record boundaries in unstructured Web documents.
Main contribution: we provided a set of individual heuristics and a way to combine these heuristics into a method for discovering record boundaries.
Under normal assumptions, the process is O(n), where n is the size of a document.
In the experiments we conducted, the combined approach (ORSIH) attained an accuracy of 100% on every test set.