Date post: | 19-Dec-2015 |
Crawling the Hidden Web
Sriram Raghavan
Hector Garcia-Molina
@ Stanford University
Introduction
What’s the problem? Current-day crawlers retrieve only the Publicly Indexable Web (PIW)
Why is it a problem?
– Large amounts of high-quality information are ‘hidden’ behind search forms
– The hidden Web is estimated to be 500 times as large as the PIW
Introduction (cont’d)
What’s the solution?
– Design a crawler capable of extracting content from the hidden Web
– A generic operational model of a hidden Web crawler, Hidden Web Exposer (HiWE)
Why is HiWE a solution?
User Form Interaction
Challenges and Simplifications
Challenges
– Parse, process, and interact with search forms
– Fill out forms for submission
Simplifications
– Application dependent
– With user assistance
– Address only content retrieval; the resource discovery step is assumed to be done
Crawler Form Interaction
$F = (\{E_1, E_2, \dots, E_n\}, S, M)$
$\mathrm{Match}((\{E_1, \dots, E_n\}, S, M), D) = [E_1 \leftarrow v_1, \dots, E_n \leftarrow v_n]$
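The form model and matching function above can be sketched in code. This is a minimal illustration, not the HiWE implementation: it reads the E_i as form elements, S as submission information, M as meta-information, and D as the task-specific database. The class names `Element` and `Form`, the dictionary-shaped database, and the exhaustive enumeration of assignments are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from itertools import product
from typing import Dict, List, Optional


@dataclass
class Element:
    label: str
    # None stands for an infinite domain (free-text box); a list is a finite domain.
    domain: Optional[List[str]] = None


@dataclass
class Form:
    elements: List[Element]                               # {E_1, ..., E_n}
    submission: Dict[str, str]                            # S (hypothetical shape)
    meta: Dict[str, str] = field(default_factory=dict)    # M


def match(form: Form, database: Dict[str, List[str]]) -> List[Dict[str, str]]:
    """Return candidate value assignments [E_1 <- v_1, ..., E_n <- v_n]."""
    choices = []
    for element in form.elements:
        if element.domain is not None:
            choices.append(element.domain)                 # finite domain: enumerate it
        else:
            choices.append(database.get(element.label, []))  # look values up in D
    labels = [e.label for e in form.elements]
    return [dict(zip(labels, combo)) for combo in product(*choices)]


form = Form(
    elements=[Element("Company"), Element("Type", ["news", "report"])],
    submission={"method": "GET", "action": "/search"},
)
assignments = match(form, {"Company": ["IBM", "Intel"]})
```

If the database has no values for some free-text element, the sketch produces no assignments for the form at all.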
Performance Metrics
Coverage Metric: $\frac{N_{\text{PagesRetrieved}}}{N_{\text{TotalHiddenPages}}}$
Submission Efficiency: $\frac{N_{\text{SubmissionSuccessful}}}{N_{\text{SubmissionTotal}}}$
Lenient Submission Efficiency: $\frac{N_{\text{SubmissionValid}}}{N_{\text{SubmissionTotal}}}$
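The three metrics are simple ratios, so a short sketch may help make them concrete. The function and parameter names below are illustrative assumptions, and the counts in the usage lines are made-up numbers, not results from the experiments.

```python
def ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or 0.0 when nothing has been counted yet."""
    return numerator / denominator if denominator else 0.0


def coverage(pages_retrieved: int, total_hidden_pages: int) -> float:
    # Fraction of the relevant hidden pages that the crawler retrieved.
    return ratio(pages_retrieved, total_hidden_pages)


def submission_efficiency(successful: int, total: int) -> float:
    # Fraction of form submissions that were successful.
    return ratio(successful, total)


def lenient_submission_efficiency(valid: int, total: int) -> float:
    # Fraction of form submissions that were valid.
    return ratio(valid, total)


# Hypothetical counts, for illustration only.
strict = submission_efficiency(successful=30, total=100)
lenient = lenient_submission_efficiency(valid=45, total=100)
```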
Design Issues
Internal Form Representation
Task-specific Database
Matching Function
Response Analysis
HiWE Architecture
HiWE – Form Representation
$F = (\{E_1, E_2, \dots, E_n\}, S, \emptyset)$
$\mathrm{Label}(E_2)$, $\mathrm{Dom}(E_2)$
HiWE – Sample Forms
HiWE – Task-Specific Database
Label Value-Set (LVS) Tables
Value Set
$V = \{v_1, \dots, v_n\}$ is a fuzzy set of element values
$M_V(v_i)$ is a membership function that assigns a weight in [0, 1] to each member of the set
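One way to picture an LVS table is as a mapping from labels to fuzzy value sets. The sketch below is an assumption about a convenient in-memory shape, not the HiWE data structure; the class name `LVSTable`, its methods, and the policy of keeping the larger weight when a value is added twice are all hypothetical.

```python
from typing import Dict


class LVSTable:
    """Label Value-Set table: label L -> fuzzy set V with membership weights M_V."""

    def __init__(self) -> None:
        self.entries: Dict[str, Dict[str, float]] = {}

    def add(self, label: str, value: str, weight: float) -> None:
        if not 0.0 <= weight <= 1.0:
            raise ValueError("membership weights must lie in [0, 1]")
        value_set = self.entries.setdefault(label, {})
        # Assumed merge policy: keep the larger of the old and new weight.
        value_set[value] = max(weight, value_set.get(value, 0.0))

    def membership(self, label: str, value: str) -> float:
        """M_V(v): weight of value in the set for label, 0.0 if absent."""
        return self.entries.get(label, {}).get(value, 0.0)


table = LVSTable()
table.add("Company", "IBM", 1.0)    # e.g. an explicitly initialized value
table.add("Company", "Acme", 0.4)   # e.g. a less certain value
```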
HiWE – Populating the LVS Table
Explicit Initialization
Built-in Entries
Wrapped Data Sources
Crawling Experience
HiWE – Computing Weights
Values from explicit initialization and built-in categories have weight 1
Values from external data sources are assigned weights in [0, 1] by wrappers
Values gathered by the crawler
– Label extracted and matched: add the new values to the matching entry
– Label extracted but not matched: add a new entry (L, V)
– Label cannot be extracted: find the closest entry and add the new values
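The three cases for values gathered while crawling can be sketched as a single update routine. Everything specific here is an assumption: the table is a plain dictionary, label similarity uses `difflib.SequenceMatcher` from the standard library, "closest entry" is taken to mean the entry sharing the most values, and the fixed weight of 0.5 and the threshold of 0.8 are placeholders rather than the weights HiWE computes.

```python
from difflib import SequenceMatcher
from typing import Dict, Iterable, Optional

Table = Dict[str, Dict[str, float]]  # label -> {value: weight}


def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def absorb_values(table: Table, label: Optional[str], values: Iterable[str],
                  weight: float = 0.5, threshold: float = 0.8) -> Optional[str]:
    """Add crawled values to the table and return the label of the entry used."""
    values = list(values)
    if label is not None:
        best = max(table, key=lambda l: similarity(l, label), default=None)
        if best is not None and similarity(best, label) >= threshold:
            target = best                      # case 1: label matched an entry
        else:
            target = label                     # case 2: create a new entry (L, V)
            table.setdefault(target, {})
    else:
        # case 3: no label extracted; pick the entry with the largest value overlap
        target = max(table, key=lambda l: len(set(table[l]) & set(values)),
                     default=None)
        if target is None:
            return None
    for v in values:
        table[target][v] = max(weight, table[target].get(v, 0.0))
    return target


table: Table = {"Company": {"IBM": 1.0, "Intel": 1.0}}
first = absorb_values(table, "company", ["AMD"])
second = absorb_values(table, "State", ["CA"])
third = absorb_values(table, None, ["IBM", "Dell"])
```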
HiWE – Matching Function
Enumerate values for finite-domain elements
Label matching
– Step 1: string normalization
– Step 2: string matching
Evaluate value assignments
– Fuzzy conjunction
– Average
– Probabilistic
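The normalization step and the three ways of evaluating a value assignment can be sketched as follows. The normalization shown (lowercasing, dropping punctuation, collapsing whitespace) is a simplified assumption, and the three ranking functions use the usual definitions that their names suggest: the minimum of the weights, their mean, and one minus the product of their complements. The function names are hypothetical.

```python
import math
import re
from typing import Sequence


def normalize(label: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9\s]", " ", label.lower())
    return " ".join(cleaned.split())


def rank_fuzzy(weights: Sequence[float]) -> float:
    # Fuzzy conjunction: an assignment is only as good as its weakest value.
    return min(weights)


def rank_average(weights: Sequence[float]) -> float:
    # Average: every value contributes equally to the rank.
    return sum(weights) / len(weights)


def rank_probabilistic(weights: Sequence[float]) -> float:
    # Probabilistic: treat weights as independent probabilities of being useful.
    return 1.0 - math.prod(1.0 - w for w in weights)


# Weights M_V(v_i) of the values chosen for each element of one assignment.
weights = [0.5, 1.0]
ranks = (rank_fuzzy(weights), rank_average(weights), rank_probabilistic(weights))
```

A crawler would typically submit only assignments whose rank exceeds some configured minimum; that threshold is left out of the sketch.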
Configuring HiWE
HiWE – extraction from pages
Prune the form page, keeping only the forms
Approximately lay out the pruned page using a layout engine
Use the layout engine to identify candidate labels for the form elements
Rank each candidate and choose the best one
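The final ranking step can be illustrated with a toy scoring rule. This is only a sketch under strong assumptions: it takes the layout engine's output to be a list of text boxes with pixel coordinates (y growing downward), scores candidates by distance to the form element, and penalizes text that lies to the right of or below the element. The `Box` class, the penalty value, and the scoring rule itself are hypothetical and are not the ranking used by HiWE.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    text: str
    x: float  # horizontal position in the approximate layout
    y: float  # vertical position, growing downward


def rank_candidates(element: Box, candidates: List[Box],
                    penalty: float = 100.0) -> List[Box]:
    """Return candidates sorted from most to least likely label for element."""
    def score(candidate: Box) -> float:
        distance = math.hypot(candidate.x - element.x, candidate.y - element.y)
        # Assumed heuristic: labels usually sit to the left of or above a field.
        misplaced = candidate.x > element.x or candidate.y > element.y
        return distance + (penalty if misplaced else 0.0)
    return sorted(candidates, key=score)


text_box = Box("", x=100, y=50)
candidates = [Box("Search", x=130, y=50), Box("Company", x=20, y=50)]
best = rank_candidates(text_box, candidates)[0]
```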
HiWE – Experiments
93% accuracy
Future Work
Recognize and respond to the dependencies between form elements
Support partially filled-out forms
Conclusion
Propose an application-specific approach to hidden Web crawling
Implement a prototype crawler – HiWE
Set the stage for designing a variety of hidden Web crawlers