Date post: | 12-Jan-2016 |
Upload: | andrew-jones |
Paul Alexandru Chirita, Stefania Costache, Siegfried Handschuh, Wolfgang Nejdl
1* L3S Research Center
2* National University of Ireland
Proceedings of the 16th International Conference on World Wide Web, 2007
Session: Semantic Web and Web 2.0
P-TAG: Large Scale Automatic Generation of Personalized Annotation TAGs for the Web
Outline
- Abstract
- Introduction
- Previous Work
- Automatic Personalized Web Annotations
- Experimental Results
- Conclusions
- Future Work
- Comments
Abstract
The success of the Semantic Web depends on the availability of Web pages annotated with metadata
In this paper the authors propose P-TAG, a method that automatically generates personalized tags for Web pages. It produces keywords relevant both to the page's textual content and to the data residing on the surfer’s Desktop
Empirical evaluations with several algorithms pursuing this approach showed very promising results
Introduction (1/3)
- The Semantic Web: a vision of a future Web of machine-understandable documents and data
- Annotations are the main instrument: they enrich content with metadata in order to ease its automatic processing
- The problem of traditional manual or semi-automatic annotation
- Alternative method: tagging
Introduction (2/3)
- Why automatic tagging?
  - The number of web pages is growing very fast
  - Recommendation
- Why personalization?
  - Automatically generated tags have the drawback of presenting only a generic view
Introduction (3/3)
- Problems of user profiles
  - These profiles are laborious to create and need constant maintenance in order to reflect the changing interests of the user
- The personal Desktop usually contains a very rich document corpus of personal information
  - It can and should be exploited for user personalization
Previous work (1/2)
- Generating annotations for the web
  - Brooks and Montanez [4] analyzed the effectiveness of tags for classifying blog entries and found that manual tags are less effective content descriptors than automated ones
  - Cimiano et al. [10, 11] proposed PANKOW (Pattern-based Annotation through Knowledge on the Web)
    - Employs an unsupervised, pattern-oriented approach to categorize an instance with respect to a given ontology
    - C-PANKOW: an enhanced version of PANKOW
    - It requires an input ontology and outputs instances of the ontological concepts
    - Annotation is always directly rooted in the text of the web page
Previous work (2/2)
- Generating annotations for the web (cont’d)
  - Dill et al. [14] present a platform for large-scale text analytics and automatic semantic tagging
  - The system spots known terms in a web page and relates them to existing instances of a given ontology
- Text Mining for Keywords Extraction
- Text Mining for Keywords Association
Automatic personalized web annotations (1/4)
Three approaches to generate personalized web page annotations
- Document Oriented Extraction
- Keyword Oriented Extraction
- Hybrid Extraction
Automatic personalized web annotations (2/4)
Document Oriented Extraction
Automatic personalized web annotations (3/4)
Keyword Oriented Extraction
Automatic personalized web annotations (4/4)
Hybrid Extraction
Experimental
- Experimental Setup
  - Document set of the personal Desktop: e-mails, Web cache documents, all files (user-selected paths)
  - For the annotation, the input web pages were categorized as
    - Small (below 4KB)
    - Medium (between 4KB and 32KB)
    - Large (more than 32KB)
  - A total of 96 web pages were used as input to be annotated
  - Over 2000 resulting annotations
  - Each proposed keyword was rated 0 (not relevant) or 1 (relevant)
  - The quality of the produced annotations was measured using precision: the precision at level K (P@K)
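The P@K measure can be illustrated with a minimal sketch. It assumes the ratings for one web page are given as a list of 0/1 values in the order the keywords were proposed; the function name `precision_at_k` and the sample ratings are hypothetical, not taken from the paper.

```python
def precision_at_k(ratings, k):
    """Fraction of the first k proposed keywords that were rated relevant (1).

    ratings: list of 0/1 relevance ratings, in the order the keywords were output.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    # Keywords beyond the end of the list count as not relevant.
    return sum(ratings[:k]) / k


# Hypothetical ratings for four proposed keywords.
ratings = [1, 0, 1, 1]
for k in (1, 3, 4):
    print(k, precision_at_k(ratings, k))
```

Dividing by K rather than by the number of keywords actually returned is a design choice: a method that proposes fewer than K keywords is not rewarded for it.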
Experimental Results (1/5)
Document Oriented Extraction
Small web pages
Medium web pages
Large web pages
Experimental Results (2/5)
Keyword Oriented Extraction
Small web pages
Medium web pages
Large web pages
Experimental Results (3/5)
Hybrid Extraction
Small web pages
Medium web pages
Large web pages
Experimental Results (4/5)
Precision at the first three output annotations for the best methods of each category
Experimental Results (5/5)
Examples of annotations
Applications
- Personalized Web Search
- Web Recommendations for Desktop Tasks
- Ontology Learning
Conclusions
Our technique overcomes the burden of manual tagging
The system does not require any manual definition of interest profiles
The system proposes a more diverse range of tags which are closer to the personal viewpoint of the user
The produced results yield high user satisfaction
Future Work
A shared server approach that supports social tagging
- Diversity
- Keywords are generated from millions of sources
- Scalability
- High utility for web search, analytics and advertising
- Instant update
Comments
For automatic tag generation, the existing tools are good enough to implement the system
Tag recommendation is a good incentive for users to assign tags
Automatic tags are aids; on the social networks of the web, a user’s own tags represent an understanding of “who that person is”
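Finding Similar Documents
Cosine Similarity Based on TFxIDF
The weight of term t in document i is calculated as $w_{i,t} = tf_{i,t} \cdot idf_t$
The similarity of two documents is the cosine of the angle between their weight vectors:
$$\mathrm{sim}(\vec{d_i}, \vec{d_j}) = \frac{\sum_{t} w_{i,t} \, w_{j,t}}{\sqrt{\sum_{t} w_{i,t}^2} \, \sqrt{\sum_{t} w_{j,t}^2}}$$
- $\vec{d_i}$, $\vec{d_j}$: vectors of the two documents
- $\sum_{t}$: sum over all terms of the two documents
- $w_{i,t}$, $w_{j,t}$: weights of term t for the two documents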
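The TFxIDF weighting and cosine similarity above can be sketched in a few lines. This is only an illustration under stated assumptions: documents are already tokenized, and idf is taken as log(N / df), a common variant that the slides do not specify. The function names and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document.

    Assumes idf_t = log(N / df_t); the slides do not state which idf variant is used.
    """
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity of two sparse weight vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [
    ["semantic", "web", "tags"],
    ["semantic", "web", "search"],
    ["desktop", "email"],
]
vectors = tfidf_vectors(docs)
print(cosine(vectors[0], vectors[1]))  # shares two terms with the first document
print(cosine(vectors[0], vectors[2]))  # shares no terms
```

Storing vectors as dictionaries keeps the sum restricted to terms that actually occur, which matters when the Desktop corpus has a large vocabulary.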
Extracting Keywords from Documents
Keyword extraction algorithms usually take a text document as input and then return a list of keywords
Each keyword has an associated value representing the confidence
For keyword extraction, they use the following methods
- Term Frequency
- Lexical Compounds
- Sentence Selection
- Document Frequency
Term Frequency
The term frequency score also takes the position of the term's first appearance into account
This is necessary especially for longer documents, because more informative terms tend to appear towards the beginning
The score uses:
- Number of terms in the document
- Position of the first appearance of the term
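The scoring formula itself did not survive on this slide, so the following is a sketch under an explicit assumption: that the score has the form (1/2 + 1/2 · (nrWords − pos) / nrWords) · log(1 + TF), where nrWords is the number of terms in the document, pos is the position of the term's first appearance, and TF is its frequency. The names `term_scores`, `nrWords` and `pos` are illustrative only.

```python
import math


def term_scores(tokens):
    """Position-weighted term frequency score for each term in a token list.

    ASSUMED form (not confirmed by the slide):
        score = (0.5 + 0.5 * (nrWords - pos) / nrWords) * log(1 + TF)
    """
    nr_words = len(tokens)
    first_pos = {}
    tf = {}
    for pos, term in enumerate(tokens):
        first_pos.setdefault(term, pos)  # remember only the first appearance
        tf[term] = tf.get(term, 0) + 1
    return {
        term: (0.5 + 0.5 * (nr_words - first_pos[term]) / nr_words)
        * math.log(1 + tf[term])
        for term in tf
    }


# Both terms occur twice; the one that appears first gets the higher score.
scores = term_scores(["alpha", "beta", "alpha", "beta"])
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```

Under this assumed form the position factor stays between 0.5 and 1, so a late first appearance dampens a term's score but never removes it.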
Lexical Compounds
Noun analysis is the simplest approach for lexical compounds
Step 1: part-of-speech tagging of the document
Step 2: finding the pattern { adjective? , noun+ }, where "?" means zero or one and "+" means one or more
Step 3: ordering the patterns by frequency
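Steps 2 and 3 can be sketched as follows, assuming Step 1 has already produced (word, tag) pairs. The tag names "ADJ" and "NOUN", the function name `lexical_compounds`, and the hand-tagged sample are assumptions for illustration; a real part-of-speech tagger would supply its own tag set.

```python
from collections import Counter


def lexical_compounds(tagged):
    """Find { adjective? , noun+ } sequences and order them by frequency.

    tagged: list of (word, tag) pairs; tags "ADJ" and "NOUN" are assumed names.
    """
    counts = Counter()
    i = 0
    while i < len(tagged):
        start = i
        if tagged[i][1] == "ADJ":  # adjective?  (zero or one)
            i += 1
        end = i
        while end < len(tagged) and tagged[end][1] == "NOUN":  # noun+  (one or more)
            end += 1
        if end > i:  # at least one noun followed the optional adjective
            counts[" ".join(word.lower() for word, _ in tagged[start:end])] += 1
            i = end
        else:  # no match starting here; move one token forward
            i = start + 1
    return counts.most_common()


tagged = [
    ("semantic", "ADJ"), ("web", "NOUN"), ("uses", "VERB"),
    ("personal", "ADJ"), ("desktop", "NOUN"), ("data", "NOUN"),
    ("and", "CONJ"), ("semantic", "ADJ"), ("web", "NOUN"),
]
print(lexical_compounds(tagged))
```

Because the adjective is optional, a lone noun also matches the pattern; filtering out single-word matches would be a separate decision.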
Sentence Selection
This technique builds upon sentence oriented document summarization
Ranking the document sentences according to their salience score [26]
The salience score uses:
- Number of significant words* in the sentence
- Total number of words in the sentence
- Position score
- Optional parameter
  - Number of query terms present in a sentence
  - Number of terms in a query
* Significant word: defined using the number of sentences in the document
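The salience formula is missing from the extracted slide, so this sketch assumes one plausible combination of the listed quantities: SW² / TW + PS, plus the optional query term TQ² / NQ (SW: significant words in the sentence, TW: total words, PS: position score, TQ: query terms present, NQ: terms in the query). The set of significant words and the position score are taken as inputs rather than computed, since their definitions are not recoverable here. All names are hypothetical.

```python
def sentence_score(sentence, significant, position_score=0.0, query=None):
    """Salience score of one tokenized sentence.

    ASSUMED form (not confirmed by the slide): SW^2 / TW + PS [+ TQ^2 / NQ].
    sentence:       list of tokens
    significant:    set of words considered significant for this document
    position_score: PS, supplied by the caller
    query:          optional list of query terms
    """
    total_words = len(sentence)
    if total_words == 0:
        return 0.0
    sig_words = sum(1 for word in sentence if word in significant)
    score = sig_words ** 2 / total_words + position_score
    if query:
        query_terms = set(query)
        present = len(query_terms & set(sentence))
        score += present ** 2 / len(query_terms)
    return score


sentence = ["personalized", "tags", "for", "web", "pages"]
significant = {"tags", "web"}
print(sentence_score(sentence, significant))
print(sentence_score(sentence, significant, query=["web", "search"]))
```

Sentences would then be ranked by this score and keywords drawn from the top-ranked ones.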
Finding Similar Keywords
To find related keywords, they use the following methods
- Term Co-occurrence Statistics
- Thesaurus Based Extraction
Term Co-occurrence Statistics
Extracted keywords from web page
Similarity Coefficients
Cosine similarity
Mutual Information
Likelihood Ratio
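Two of these coefficients can be sketched from co-occurrence counts. The sketch assumes document-level counts: df_x and df_y are the numbers of documents containing each term, df_xy the number containing both, and n_docs the collection size. These are common textbook forms and are not confirmed by the slides; the likelihood ratio is left out because its exact form is not recoverable here. All names are hypothetical.

```python
import math


def cosine_coefficient(df_x, df_y, df_xy):
    """Cosine similarity of two terms from document frequencies (assumed form)."""
    if df_x == 0 or df_y == 0:
        return 0.0
    return df_xy / math.sqrt(df_x * df_y)


def mutual_information(df_x, df_y, df_xy, n_docs):
    """Pointwise mutual information of two terms (assumed form).

    Returns -inf when the terms never co-occur.
    """
    if df_xy == 0:
        return float("-inf")
    return math.log(n_docs * df_xy / (df_x * df_y))


# Hypothetical counts: 12 documents, x in 4, y in 9, both in 3.
print(cosine_coefficient(4, 9, 3))
print(mutual_information(4, 9, 3, 12))
```

A mutual information of zero means the two terms co-occur exactly as often as independence would predict; positive values indicate association.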
Thesaurus Based Extraction