Date posted: 18-Jul-2015
Category: Engineering
Uploaded by: neha-singh
[Figure: system workflow diagram. Labels shown: Input URL, Wrapper Generation, Data Extraction, Search Engine, Extractor, Search Result Record, Content Line Extraction, Data Alignment, Annotators, Line Separator, Block Extraction, Annotation Wrapper, Annotated Groups, Combining Annotators, New Result Page]
GOOGLE SEARCH CONTENT LINE
• LINK
• TEXT
• LINK-TEXT
• LINK-HEAD
• TEXT-HEAD
• LINK-TEXT-HEAD
• HR-LINE
• BLANK LINE
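The content line types above can be modelled as an enum. The following is a minimal sketch, not the method used in the deck: the class `ContentLines`, the method `classify`, and its boolean inputs are hypothetical, and it assumes that "HEAD" marks a line carrying heading-like markup and that a line with neither link nor text counts as blank.

```java
/** The eight content line types listed above. */
enum LineType {
    LINK, TEXT, LINK_TEXT, LINK_HEAD, TEXT_HEAD, LINK_TEXT_HEAD, HR_LINE, BLANK_LINE
}

/** Hypothetical helper: maps simple observations about a rendered line to a type. */
final class ContentLines {
    private ContentLines() {}

    /**
     * @param isHr    the line is a visual rule (for example an HR tag)
     * @param hasLink the line contains hyperlink text
     * @param hasText the line contains plain, non-link text
     * @param hasHead the line carries heading-like markup (assumed meaning of HEAD)
     */
    static LineType classify(boolean isHr, boolean hasLink, boolean hasText, boolean hasHead) {
        if (isHr) {
            return LineType.HR_LINE;
        }
        if (!hasLink && !hasText) {
            return LineType.BLANK_LINE; // nothing visible on the line
        }
        if (hasLink && hasText) {
            return hasHead ? LineType.LINK_TEXT_HEAD : LineType.LINK_TEXT;
        }
        if (hasLink) {
            return hasHead ? LineType.LINK_HEAD : LineType.LINK;
        }
        return hasHead ? LineType.TEXT_HEAD : LineType.TEXT;
    }
}
```

Passing pre-computed flags keeps the sketch independent of any particular HTML parser; in practice the flags would come from whatever parser renders the result page into lines.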
GOOGLE SEARCH BLOCKS
To identify similar blocks, we check block similarity on the basis of:
• TYPE distance
• SHAPE distance
• POSITION distance
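One way to combine the three distances is sketched below. The deck does not define how each distance is computed, so everything here is an assumption for illustration: the class `BlockDistance` is hypothetical, the type distance is taken to be a normalized edit distance over line-type codes, the shape and position distances are assumed to be supplied already scaled to [0, 1], and the three are simply averaged.

```java
/** Hypothetical sketch of comparing two blocks by type, shape and position distance. */
final class BlockDistance {
    private BlockDistance() {}

    /** Assumed definition: edit distance between line-type code sequences, scaled to [0, 1]. */
    static double typeDistance(int[] a, int[] b) {
        int[][] d = new int[a.length + 1][b.length + 1];
        for (int i = 0; i <= a.length; i++) d[i][0] = i;
        for (int j = 0; j <= b.length; j++) d[0][j] = j;
        for (int i = 1; i <= a.length; i++) {
            for (int j = 1; j <= b.length; j++) {
                int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        int longest = Math.max(a.length, b.length);
        return longest == 0 ? 0.0 : (double) d[a.length][b.length] / longest;
    }

    /** Assumed combination: unweighted mean of the three distances. */
    static double combined(double type, double shape, double position) {
        return (type + shape + position) / 3.0;
    }

    /** Two blocks count as similar when the combined distance is within the threshold. */
    static boolean similar(double type, double shape, double position, double threshold) {
        return combined(type, shape, position) <= threshold;
    }
}
```

The threshold and the equal weighting are tuning choices; a real system would likely calibrate them on sample result pages.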
Candidate Content Line Separators
• blank line (e.g., the <p> tag)
• visual line (e.g., the <HR> tag)
First line of block
A line is taken to be the first line of a block if it satisfies one of these rules:
(1) it is the line following an HR-LINE;
(2) it is the only line in the block that starts with a number;
(3) it is the only line in the block with the smallest position code;
(4) it is the line following the BLANK line, when the block contains only one BLANK line.
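The four first-line rules can be sketched as a single lookup that tries each rule in order. This is an illustrative sketch under assumptions: `BlockLine` and `FirstLineFinder` are hypothetical names, the rules are applied in the order listed, and HR and BLANK lines are ignored when looking for the smallest position code.

```java
import java.util.List;

/** Hypothetical model of one line in a block. */
final class BlockLine {
    final String text;
    final int position;   // position code (for example, left offset of the line)
    final boolean hr;     // true for an HR-LINE
    final boolean blank;  // true for a BLANK line

    BlockLine(String text, int position, boolean hr, boolean blank) {
        this.text = text;
        this.position = position;
        this.hr = hr;
        this.blank = blank;
    }
}

final class FirstLineFinder {
    private FirstLineFinder() {}

    /** Returns the index of the first line of the block, or -1 if no rule applies. */
    static int find(List<BlockLine> block) {
        int n = block.size();
        // Rule 1: the line following an HR-LINE.
        for (int i = 0; i + 1 < n; i++) {
            if (block.get(i).hr) return i + 1;
        }
        // Rule 2: exactly one line starts with a number.
        int numbered = -1, numberedCount = 0;
        for (int i = 0; i < n; i++) {
            String t = block.get(i).text.trim();
            if (!t.isEmpty() && Character.isDigit(t.charAt(0))) {
                numbered = i;
                numberedCount++;
            }
        }
        if (numberedCount == 1) return numbered;
        // Rule 3: exactly one content line has the smallest position code.
        int min = Integer.MAX_VALUE, minIndex = -1, minCount = 0;
        for (int i = 0; i < n; i++) {
            BlockLine line = block.get(i);
            if (line.hr || line.blank) continue;
            if (line.position < min) {
                min = line.position;
                minIndex = i;
                minCount = 1;
            } else if (line.position == min) {
                minCount++;
            }
        }
        if (minCount == 1) return minIndex;
        // Rule 4: exactly one BLANK line, and a line follows it.
        int blankIndex = -1, blankCount = 0;
        for (int i = 0; i < n; i++) {
            if (block.get(i).blank) {
                blankIndex = i;
                blankCount++;
            }
        }
        if (blankCount == 1 && blankIndex + 1 < n) return blankIndex + 1;
        return -1;
    }
}
```

Returning -1 when no rule fires leaves the caller free to fall back to another heuristic, such as treating the top line of the block as its first line.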
Relationships between data unit (U) and text node (T):
• One-to-One Relationship: T = U
• One-to-Many Relationship: T ⊃ U
• Many-to-One Relationship: T ⊂ U
• One-to-Nothing Relationship: T ≠ U
Five common features shared by the data units
• Tag Path (TP)
• Data Content (DC)
• Data Type (DT)
• Adjacency (AD)
• Presentation Style (PS)
Alignment Algorithm
Here we apply our data alignment algorithm to place data units that have the same semantics into the same group.
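To make the idea of grouping by shared features concrete, here is a simplified sketch. It is not the alignment algorithm referred to above: it swaps in a plain greedy threshold grouping, scores similarity as the mean of five per-feature scores, and uses hypothetical names throughout (`DataUnit`, `Aligner`, the `previous` field standing in for adjacency).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical data unit carrying the five features (TP, DC, DT, AD, PS). */
final class DataUnit {
    final String tagPath, content, dataType, previous, style;

    DataUnit(String tagPath, String content, String dataType, String previous, String style) {
        this.tagPath = tagPath;
        this.content = content;
        this.dataType = dataType;
        this.previous = previous; // assumed stand-in for adjacency: what precedes the unit
        this.style = style;
    }
}

final class Aligner {
    private Aligner() {}

    /** Mean of five feature scores: four exact matches plus token overlap of the content. */
    static double similarity(DataUnit a, DataUnit b) {
        double score = 0.0;
        if (a.tagPath.equals(b.tagPath)) score += 1.0;
        if (a.dataType.equals(b.dataType)) score += 1.0;
        if (a.previous.equals(b.previous)) score += 1.0;
        if (a.style.equals(b.style)) score += 1.0;
        score += jaccard(a.content, b.content);
        return score / 5.0;
    }

    /** Jaccard overlap of lower-cased whitespace-separated tokens. */
    static double jaccard(String x, String y) {
        Set<String> a = new HashSet<>(Arrays.asList(x.toLowerCase(Locale.ROOT).trim().split("\\s+")));
        Set<String> b = new HashSet<>(Arrays.asList(y.toLowerCase(Locale.ROOT).trim().split("\\s+")));
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
    }

    /** Greedy grouping: a unit joins the first group whose first member is similar enough. */
    static List<List<DataUnit>> group(List<DataUnit> units, double threshold) {
        List<List<DataUnit>> groups = new ArrayList<>();
        for (DataUnit unit : units) {
            List<DataUnit> home = null;
            for (List<DataUnit> candidate : groups) {
                if (similarity(candidate.get(0), unit) >= threshold) {
                    home = candidate;
                    break;
                }
            }
            if (home == null) {
                home = new ArrayList<>();
                groups.add(home);
            }
            home.add(unit);
        }
        return groups;
    }
}
```

With two records that each hold a title and a price, a threshold around 0.6 would put the two titles in one group and the two prices in another, since units of the same kind share tag path, data type, adjacency and style.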
Annotators
• Table annotator
• Query-based annotator
• Schema value annotator
• Frequency-based annotator
• Same-prefix annotator
• Common knowledge based annotator
Probability
Let P(L) be the probability that annotator L identifies the correct label for a group of data units when L is applicable. P(L) is essentially the success rate of L. Specifically, suppose L is applicable to N cases and M of these cases are annotated correctly; then P(L) = M/N.
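The success rate P(L) = M/N translates directly into code. The sketch below uses a hypothetical class `AnnotatorStats`; the argument checks are an added assumption, since the definition only makes sense when the annotator was applicable at least once.

```java
/** Hypothetical helper computing an annotator's success rate P(L) = M / N. */
final class AnnotatorStats {
    private AnnotatorStats() {}

    /**
     * @param correct    M, the number of cases annotated correctly
     * @param applicable N, the number of cases in which the annotator was applicable
     */
    static double successRate(int correct, int applicable) {
        if (applicable <= 0) {
            throw new IllegalArgumentException("annotator must be applicable to at least one case");
        }
        if (correct < 0 || correct > applicable) {
            throw new IllegalArgumentException("correct cases must be between 0 and applicable cases");
        }
        return (double) correct / applicable;
    }
}
```

For example, an annotator that was applicable to 50 groups and labelled 45 of them correctly has a success rate of 45/50.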
Annotation Wrapper
attribute = <label; prefix; suffix; separators; unit index>
• The prefix is obtained by comparing the prefixes of all the data units.
• The suffix is obtained by comparing the suffixes of all the data units.
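An annotation rule of the form <label; prefix; suffix; separators; unit index> can be sketched as a small class that extracts one value from a text node. The names `AnnotationRule`, `extract`, and `commonPrefix` are hypothetical, and the interpretation is an assumption: strip the prefix and suffix, split the remainder on any separator character, and take the unit at the given index.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical annotation rule: label, prefix, suffix, separators, unit index. */
final class AnnotationRule {
    final String label, prefix, suffix, separators;
    final int unitIndex;

    AnnotationRule(String label, String prefix, String suffix, String separators, int unitIndex) {
        this.label = label;
        this.prefix = prefix;
        this.suffix = suffix;
        this.separators = separators;
        this.unitIndex = unitIndex;
    }

    /** Returns the value for this rule's label, or null if the text node does not match. */
    String extract(String textNode) {
        String s = textNode.trim();
        if (s.length() < prefix.length() + suffix.length()
                || !s.startsWith(prefix) || !s.endsWith(suffix)) {
            return null;
        }
        s = s.substring(prefix.length(), s.length() - suffix.length());
        // Split on any single separator character.
        List<String> units = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (separators.indexOf(c) >= 0) {
                units.add(current.toString().trim());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        units.add(current.toString().trim());
        return unitIndex < units.size() ? units.get(unitIndex) : null;
    }

    /** Longest prefix shared by all values, found by comparing them character by character. */
    static String commonPrefix(List<String> values) {
        if (values.isEmpty()) return "";
        String shared = values.get(0);
        for (String value : values) {
            int i = 0;
            while (i < shared.length() && i < value.length()
                    && shared.charAt(i) == value.charAt(i)) {
                i++;
            }
            shared = shared.substring(0, i);
        }
        return shared;
    }
}
```

A rule such as `new AnnotationRule("Publisher", "Published by ", ".", ",", 0)` would then pull the first comma-separated unit out of a text node that starts with "Published by " and ends with a full stop; the same comparison idea, run from the end of each string, would give the shared suffix.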
New Result Page
The new result page has fewer result records, but all of its result data is annotated, which makes it more efficient to use.
Tools
Jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods.
Other tools:
• Web-Harvest
• HtmlUnit